Although the previous version of the Yukatan data model was already quite usable, it still lacks many details that are needed for a truly great email client. To benefit fully from the power of the relational model, all useful structured information should be broken into relations.
The message identification fields, as specified in section "3.6.4. Identification fields" of RFC 2822, are quite simple to handle:
message-id = "Message-ID:" msg-id CRLF in-reply-to = "In-Reply-To:" 1*msg-id CRLF references = "References:" 1*msg-id CRLF msg-id = [CFWS] "<" id-left "@" id-right ">" [CFWS]
For our purposes it is enough to represent message identifiers just as the opaque string including the "<" and ">" characters. The "Message-ID" field contains just one message identifier (that should always be present) so we can store the opaque identifier string directly as the msgmessageid field of the message record. The "In-Reply-To" and "References" headers contain zero or more identifiers and are thus stored as separate relations (inreplytofield and referencesfield) keyed by the message serial number. The order of message identifiers is stored in both relations. If (against the specification) there are multiple "In-Reply-To" or "References" header fields, then the message identifiers from separate header fields are combined into a single list of identifiers to be stored in the respective relation.
A bit more complex structured fields are the originator and recipient fields described in sections "3.6.2. Originator fields" and "3.6.3. Destination address fields" of the RFC:
from = "From:" mailbox-list CRLF sender = "Sender:" mailbox CRLF reply-to = "Reply-To:" address-list CRLF to = "To:" address-list CRLF cc = "Cc:" address-list CRLF bcc = "Bcc:" (address-list / [CFWS]) CRLF
The difference between a "mailbox" and an "address" is that the former onlys specifies a single mailbox while the latter can specify also a group of mailboxes. More specifically:
address = mailbox / group mailbox = name-addr / addr-spec name-addr = [display-name] angle-addr angle-addr = [CFWS] "<" addr-spec ">" [CFWS] / obs-angle-addr group = display-name ":" [mailbox-list / CFWS] ";" [CFWS] display-name = phrase mailbox-list = (mailbox *("," mailbox)) / obs-mbox-list address-list = (address *("," address)) / obs-addr-list
By treating each address group as just another list of mailbox addresses, we can break each of these header fields into separate relations just like we did with the message identifier fields. As a singular field, the "Sender" field is stored directly on the message record.
For our purposes we can express addresses as combinations of two parts, the display name and the actual email address. The display name may be NULL, but the email address (opaque string matching the "addr-spec" production) must always be included.
CREATE TABLE message ( msgno SERIAL PRIMARY KEY, msgdate TIMESTAMP WITH TIME ZONE NOT NULL, msgsendername CHARACTER VARYING, msgsenderaddress CHARACTER VARYING, msgmessageid CHARACTER VARYING, msgsubject CHARACTER VARYING, msgbody TEXT, msgsource BYTEA ); CREATE TABLE headerfield ( msgno INTEGER NOT NULL REFERENCES message ON UPDATE CASCADE ON DELETE CASCADE, fieldno INTEGER NOT NULL, fieldname CHARACTER VARYING NOT NULL, fieldbody CHARACTER VARYING NOT NULL, PRIMARY KEY (msgno, fieldno) ); CREATE TABLE inreplytofield ( msgno INTEGER NOT NULL REFERENCES message ON UPDATE CASCADE ON DELETE CASCADE, inreplytono INTEGER NOT NULL, inreplytoid CHARACTER VARYING NOT NULL, PRIMARY KEY (msgno, inreplytono) ); CREATE TABLE referencesfield ( msgno INTEGER NOT NULL REFERENCES message ON UPDATE CASCADE ON DELETE CASCADE, referencesno INTEGER NOT NULL, referencesid CHARACTER VARYING NOT NULL, PRIMARY KEY (msgno, referencesno) ); CREATE TABLE fromfield ( msgno INTEGER NOT NULL REFERENCES message ON UPDATE CASCADE ON DELETE CASCADE, fromno INTEGER NOT NULL, fromname CHARACTER VARYING, fromaddress CHARACTER VARYING NOT NULL, PRIMARY KEY (msgno, fromno) ); CREATE TABLE replytofield ( msgno INTEGER NOT NULL REFERENCES message ON UPDATE CASCADE ON DELETE CASCADE, replytono INTEGER NOT NULL, replytoname CHARACTER VARYING, replytoaddress CHARACTER VARYING NOT NULL, PRIMARY KEY (msgno, replytono) ); CREATE TABLE tofield ( msgno INTEGER NOT NULL REFERENCES message ON UPDATE CASCADE ON DELETE CASCADE, tono INTEGER NOT NULL, toname CHARACTER VARYING, toaddress CHARACTER VARYING NOT NULL, PRIMARY KEY (msgno, tono) ); CREATE TABLE ccfield ( msgno INTEGER NOT NULL REFERENCES message ON UPDATE CASCADE ON DELETE CASCADE, ccno INTEGER NOT NULL, ccname CHARACTER VARYING, ccaddress CHARACTER VARYING NOT NULL, PRIMARY KEY (msgno, ccno) ); CREATE TABLE bccfield ( msgno INTEGER NOT NULL REFERENCES message ON UPDATE CASCADE ON DELETE CASCADE, bccno INTEGER NOT NULL, bccname CHARACTER VARYING, bccaddress CHARACTER VARYING NOT NULL, PRIMARY KEY (msgno, bccno) );
The usage of NOT NULL and UNIQUE and FOREIGN KEY constraints in the above schema is carefully decided. Although for example the unique "Message-ID" header field should be included in all email messages, the schema allows the field to be NULL and doesn't even set a UNIQUE constraint on the field. This is because in practise the field might not be present or unique, but we'd still want to be able to store the message in our database. Note also how no email addresses or message identifiers are used in FOREIGN KEY constraints.
Most of the different header field relations above have a very similar structure, and could therefore very easily be combined into a single relation with an added type column. Or alternatively the inheritance features of PostgreSQL could be used. However at this point I think it is best to keep these relations separate to keep the semantic differences clearly visible. However at some later point it might turn out to be wise to restructure these relations.
An important issue that has yet not been discussed is the representation of SMTP envelope and other message delivery data in the data model. This data is strictly speaking not a part of the message itself, but for practical reasons (and to prevent losing information) should be represented somehow. This shortcoming will be addressed in Yukatan data model 0.5.
The other major shortcoming of the data model is that it still has no support for the widely used Multipurpose Internet Mail Extensions (MIME). Support for the MIME standards will be added incrementally starting from Yukatan data model 0.6.