Yukatan data model 0.4

Although the previous version of the Yukatan data model was already quite usable, it still lacks many details that are needed for a truly great email client. To benefit fully from the power of the relational model, all useful structured information should be broken into relations.

Message identifiers

The message identification fields, as specified in section "3.6.4. Identification fields" of RFC 2822, are quite simple to handle:

message-id   =  "Message-ID:" msg-id CRLF
in-reply-to  =  "In-Reply-To:" 1*msg-id CRLF
references   =  "References:" 1*msg-id CRLF
msg-id       =  [CFWS] "<" id-left "@" id-right ">" [CFWS]

For our purposes it is enough to represent message identifiers just as the opaque string including the "<" and ">" characters. The "Message-ID" field contains just one message identifier (that should always be present) so we can store the opaque identifier string directly as the msgmessageid field of the message record. The "In-Reply-To" and "References" headers contain zero or more identifiers and are thus stored as separate relations (inreplytofield and referencesfield) keyed by the message serial number. The order of message identifiers is stored in both relations. If (against the specification) there are multiple "In-Reply-To" or "References" header fields, then the message identifiers from separate header fields are combined into a single list of identifiers to be stored in the respective relation.

Address fields

A bit more complex structured fields are the originator and recipient fields described in sections "3.6.2. Originator fields" and "3.6.3. Destination address fields" of the RFC:

from         =  "From:" mailbox-list CRLF
sender       =  "Sender:" mailbox CRLF
reply-to     =  "Reply-To:" address-list CRLF
to           =  "To:" address-list CRLF
cc           =  "Cc:" address-list CRLF
bcc          =  "Bcc:" (address-list / [CFWS]) CRLF

The difference between a "mailbox" and an "address" is that the former onlys specifies a single mailbox while the latter can specify also a group of mailboxes. More specifically:

address      =  mailbox / group
mailbox      =  name-addr / addr-spec
name-addr    =  [display-name] angle-addr
angle-addr   =  [CFWS] "<" addr-spec ">" [CFWS] / obs-angle-addr
group        =  display-name ":" [mailbox-list / CFWS] ";" [CFWS]
display-name =  phrase
mailbox-list =  (mailbox *("," mailbox)) / obs-mbox-list
address-list =  (address *("," address)) / obs-addr-list

By treating each address group as just another list of mailbox addresses, we can break each of these header fields into separate relations just like we did with the message identifier fields. As a singular field, the "Sender" field is stored directly on the message record.

For our purposes we can express addresses as combinations of two parts, the display name and the actual email address. The display name may be NULL, but the email address (opaque string matching the "addr-spec" production) must always be included.

SQL schema

CREATE TABLE message (
        msgno           SERIAL PRIMARY KEY,
        msgdate         TIMESTAMP WITH TIME ZONE NOT NULL,
        msgsendername   CHARACTER VARYING,
        msgsenderaddress CHARACTER VARYING,
        msgmessageid    CHARACTER VARYING,
        msgsubject      CHARACTER VARYING,
        msgbody         TEXT,
        msgsource       BYTEA
);
CREATE TABLE headerfield (
        msgno           INTEGER NOT NULL
                        REFERENCES message
                        ON UPDATE CASCADE ON DELETE CASCADE,
        fieldno         INTEGER NOT NULL,
        fieldname       CHARACTER VARYING NOT NULL,
        fieldbody       CHARACTER VARYING NOT NULL,
        PRIMARY KEY (msgno, fieldno)
);
CREATE TABLE inreplytofield (
        msgno           INTEGER NOT NULL
                        REFERENCES message
                        ON UPDATE CASCADE ON DELETE CASCADE,
        inreplytono     INTEGER NOT NULL,
        inreplytoid     CHARACTER VARYING NOT NULL,
        PRIMARY KEY (msgno, inreplytono)
);
CREATE TABLE referencesfield (
        msgno           INTEGER NOT NULL
                        REFERENCES message
                        ON UPDATE CASCADE ON DELETE CASCADE,
        referencesno    INTEGER NOT NULL,
        referencesid    CHARACTER VARYING NOT NULL,
        PRIMARY KEY (msgno, referencesno)
);
CREATE TABLE fromfield (
        msgno           INTEGER NOT NULL
                        REFERENCES message
                        ON UPDATE CASCADE ON DELETE CASCADE,
        fromno          INTEGER NOT NULL,
        fromname        CHARACTER VARYING,
        fromaddress     CHARACTER VARYING NOT NULL,
        PRIMARY KEY (msgno, fromno)
);
CREATE TABLE replytofield (
        msgno           INTEGER NOT NULL
                        REFERENCES message
                        ON UPDATE CASCADE ON DELETE CASCADE,
        replytono       INTEGER NOT NULL,
        replytoname     CHARACTER VARYING,
        replytoaddress  CHARACTER VARYING NOT NULL,
        PRIMARY KEY (msgno, replytono)
);
CREATE TABLE tofield (
        msgno           INTEGER NOT NULL
                        REFERENCES message
                        ON UPDATE CASCADE ON DELETE CASCADE,
        tono            INTEGER NOT NULL,
        toname          CHARACTER VARYING,
        toaddress       CHARACTER VARYING NOT NULL,
        PRIMARY KEY (msgno, tono)
);
CREATE TABLE ccfield (
        msgno           INTEGER NOT NULL
                        REFERENCES message
                        ON UPDATE CASCADE ON DELETE CASCADE,
        ccno            INTEGER NOT NULL,
        ccname          CHARACTER VARYING,
        ccaddress       CHARACTER VARYING NOT NULL,
        PRIMARY KEY (msgno, ccno)
);
CREATE TABLE bccfield (
        msgno           INTEGER NOT NULL
                        REFERENCES message
                        ON UPDATE CASCADE ON DELETE CASCADE,
        bccno           INTEGER NOT NULL,
        bccname         CHARACTER VARYING,
        bccaddress      CHARACTER VARYING NOT NULL,
        PRIMARY KEY (msgno, bccno)
);

Notes

The usage of NOT NULL and UNIQUE and FOREIGN KEY constraints in the above schema is carefully decided. Although for example the unique "Message-ID" header field should be included in all email messages, the schema allows the field to be NULL and doesn't even set a UNIQUE constraint on the field. This is because in practise the field might not be present or unique, but we'd still want to be able to store the message in our database. Note also how no email addresses or message identifiers are used in FOREIGN KEY constraints.

Most of the different header field relations above have a very similar structure, and could therefore very easily be combined into a single relation with an added type column. Or alternatively the inheritance features of PostgreSQL could be used. However at this point I think it is best to keep these relations separate to keep the semantic differences clearly visible. However at some later point it might turn out to be wise to restructure these relations.

An important issue that has yet not been discussed is the representation of SMTP envelope and other message delivery data in the data model. This data is strictly speaking not a part of the message itself, but for practical reasons (and to prevent losing information) should be represented somehow. This shortcoming will be addressed in Yukatan data model 0.5.

The other major shortcoming of the data model is that it still has no support for the widely used Multipurpose Internet Mail Extensions (MIME). Support for the MIME standards will be added incrementally starting from Yukatan data model 0.6.