Yukatan data model 0.2

Since the naive version of the Yukatan data model is clearly not good enough for serious email management, we need a more accurate version of the data model.

The definitive standard on which to base this second version of the data model is RFC 2822, "Internet Message Format", an updated version of the famous RFC 822, "Standard for the Format of ARPA Internet Text Messages".

First, sections "2.1. General Description" and "3.5. Overall message syntax" of RFC 2822 describe an email message as a combination of header fields and an optional message body, which is simply text without any special semantics. Thus an email message is an object with the following properties:

any number of header fields
optional text body

Section "2.2. Header Fields" describes a header field as a combination of a field name and a field body. Section "3.6. Field definitions" notes that header fields are not guaranteed to appear in any particular order, but says that they should not be reordered when a message is transported or transformed. Thus the data model should preserve information about the header field order.
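The structure described so far can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name split_message is hypothetical, the input is assumed to use CRLF line breaks, and lines that are neither a field nor a continuation are silently skipped. It shows how a raw message separates into an ordered list of header fields and an optional body.

```python
from typing import List, Optional, Tuple

Field = Tuple[str, str]  # (field name, field body), both still unnormalized


def split_message(raw: str) -> Tuple[List[Field], Optional[str]]:
    """Split a raw message into ordered header fields and an optional body."""
    # The header section ends at the first empty line; the rest is the body.
    head, sep, body = raw.partition("\r\n\r\n")
    fields: List[Field] = []
    for line in head.split("\r\n"):
        if line[:1] in (" ", "\t") and fields:
            # A line starting with whitespace continues the previous field.
            name, value = fields[-1]
            fields[-1] = (name, value + "\r\n" + line)
        elif ":" in line:
            # The first colon separates the field name from the field body.
            name, _, value = line.partition(":")
            fields.append((name, value))
    # No empty line means the message has no body at all.
    return fields, (body if sep else None)
```

The list keeps the fields in the order in which they appeared, which is the information the data model needs to preserve.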

Expressed in SQL, with a serial number as the unique message identifier, the model is quite simply:

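CREATE TABLE message (
        -- serial number used as the unique message identifier
        msgid           SERIAL PRIMARY KEY,
        -- optional text body
        body            TEXT
);
CREATE TABLE headerfield (
        msgid           INTEGER NOT NULL
                        REFERENCES message
                        ON UPDATE CASCADE ON DELETE CASCADE,
        -- position of the field within the message, preserving header order
        fieldno         INTEGER NOT NULL
                        CHECK (fieldno >= 0),
        -- field name without the colon, always in lower case
        fieldname       CHARACTER VARYING NOT NULL
                        CHECK (LOWER(fieldname) = fieldname),
        fieldbody       CHARACTER VARYING NOT NULL,
        PRIMARY KEY (msgid, fieldno)
);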
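
To see how the two tables work together, the following sketch loads the schema into an in-memory SQLite database and reads the header fields back in their original order. SQLite is used here only as a convenient stand-in, so the DDL is adapted (INTEGER PRIMARY KEY instead of SERIAL, TEXT instead of CHARACTER VARYING); the helper names store_message and load_fields are hypothetical and not part of Yukatan.

```python
import sqlite3

# SQLite adaptation of the schema above (an assumption made for this sketch).
SCHEMA = """
CREATE TABLE message (
    msgid     INTEGER PRIMARY KEY,
    body      TEXT
);
CREATE TABLE headerfield (
    msgid     INTEGER NOT NULL
              REFERENCES message (msgid)
              ON UPDATE CASCADE ON DELETE CASCADE,
    fieldno   INTEGER NOT NULL CHECK (fieldno >= 0),
    fieldname TEXT NOT NULL CHECK (LOWER(fieldname) = fieldname),
    fieldbody TEXT NOT NULL,
    PRIMARY KEY (msgid, fieldno)
);
"""


def store_message(conn, fields, body):
    """Insert one message; fieldno records the original header order."""
    cur = conn.execute("INSERT INTO message (body) VALUES (?)", (body,))
    msgid = cur.lastrowid
    conn.executemany(
        "INSERT INTO headerfield (msgid, fieldno, fieldname, fieldbody) "
        "VALUES (?, ?, ?, ?)",
        [(msgid, no, name, value) for no, (name, value) in enumerate(fields)],
    )
    return msgid


def load_fields(conn, msgid):
    """Return the header fields of a message in their stored order."""
    rows = conn.execute(
        "SELECT fieldname, fieldbody FROM headerfield "
        "WHERE msgid = ? ORDER BY fieldno",
        (msgid,),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript(SCHEMA)
msgid = store_message(
    conn,
    [("received", "from a"), ("received", "from b"), ("subject", "hello")],
    "Hi\n",
)
```

Because the primary key is (msgid, fieldno) rather than (msgid, fieldname), a message can carry the same field name several times, as with the two "received" fields here, and ORDER BY fieldno restores the original sequence.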

To simplify processing, the body has all CRLF line breaks converted to Unix-style LF line breaks. According to the original RFC 822, case should be ignored when interpreting header field names, so the names are always stored in lower case. The colon that separates the field name from the field body is not stored. Also to simplify processing, extra whitespace and line folding are removed from field bodies before they are stored.
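
These normalization rules might look roughly like the following in Python. This is a sketch, not the Yukatan implementation: the function names are hypothetical, and "extra whitespace" is assumed to mean runs of spaces and tabs, which are collapsed to a single space and trimmed at both ends. Unfolding follows section 2.2.3 of RFC 2822, which removes any CRLF that is immediately followed by whitespace.

```python
import re
from typing import Tuple


def normalize_body(body: str) -> str:
    """Convert CRLF line breaks to Unix-style LF line breaks."""
    return body.replace("\r\n", "\n")


def normalize_field(raw_field: str) -> Tuple[str, str]:
    """Turn one raw, possibly folded header field into (fieldname, fieldbody)."""
    # The first colon separates name and body; the colon itself is not stored.
    name, _, value = raw_field.partition(":")
    # Unfold: remove every CRLF that is immediately followed by a space or tab.
    value = re.sub(r"\r\n(?=[ \t])", "", value)
    # Assumed meaning of "extra whitespace": collapse runs and trim the ends.
    value = re.sub(r"[ \t]+", " ", value).strip()
    # Field names are stored in lower case.
    return name.strip().lower(), value
```

For example, normalize_field("Subject:  Hello\r\n   world \r\n") yields the pair ("subject", "Hello world"), which satisfies the lower-case CHECK constraint on fieldname.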

This model captures many more details than the naive version 0.1, losing in principle only some relatively low-level formatting information. However, the data model is now much more difficult to use programmatically and still doesn't take into account the various specific rules mentioned by the RFC. Thus there is a clear need for version 0.3 of the Yukatan data model.