Yukatan data model 0.3

Version 0.2 of the Yukatan data model has some shortcomings that should be addressed:

it fails to recognize and parse structured header fields
it is cumbersome to use
it still loses some message format data

Fortunately, all of these shortcomings can be addressed by further enhancing the data model, although doing so means accepting some duplicate data. Because email messages should never change once they are stored in the database, keeping the duplicates consistent is not a problem as long as the data is entered correctly in the first place.

First, we can avoid losing data by storing the entire original message source, byte for byte, in the database. This nearly doubles the storage required per message, but since storage space is cheap nowadays, I believe the benefit of keeping the exact original message clearly outweighs the cost of the extra disk space.

From section "3.6. Field definitions" of RFC 2822 we get the following special header fields that essentially allow us to combine the benefits of the two previous versions of this data model.

Date, always present, date-time, section 3.6.1.
From, always present, section 3.6.2.
Sender, optional, section 3.6.2.
Reply-To, optional, section 3.6.2.
To, optional, section 3.6.3.
Cc, optional, section 3.6.3.
Bcc, optional, section 3.6.3.
Message-ID, optional, section 3.6.4.
In-Reply-To, optional, section 3.6.4.
References, optional, section 3.6.4.
Subject, optional, section 3.6.5.

All of these fields except Subject are actually structured, but for now it is enough to parse only the Date field into a more processing-friendly format.
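
As an illustration of parsing the Date field, here is a minimal Python sketch that uses the standard library's email.utils.parsedate_to_datetime. The function name parse_date_field is hypothetical and not part of Yukatan, and treating a date without usable zone information as UTC is an assumption made only for this example.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def parse_date_field(field_body):
    """Parse an RFC 2822 date-time string into a timezone-aware datetime.

    Returns None when the field body cannot be parsed.
    """
    try:
        parsed = parsedate_to_datetime(field_body.strip())
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
    if parsed.tzinfo is None:
        # A "-0000" zone gives a naive datetime; assume UTC (illustrative choice).
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed
```

The resulting value carries its UTC offset, so it could be stored in a TIMESTAMP WITH TIME ZONE column, while the unparsed field body remains available elsewhere in the model.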

With these special fields added, the data model becomes:

CREATE TABLE message (
        msgid           SERIAL PRIMARY KEY,
        msgdate         TIMESTAMP WITH TIME ZONE,
        msgfrom         CHARACTER VARYING NOT NULL,
        msgsender       CHARACTER VARYING,
        msgreplyto      CHARACTER VARYING,
        msgto           CHARACTER VARYING,
        msgcc           CHARACTER VARYING,
        msgbcc          CHARACTER VARYING,
        msgmessageid    CHARACTER VARYING,
        msginreplyto    CHARACTER VARYING,
        msgreferences   CHARACTER VARYING,
        msgsubject      CHARACTER VARYING,
        msgbody         TEXT,
        msgsource       BYTEA
);
CREATE TABLE headerfield (
        msgid           INTEGER NOT NULL
                        REFERENCES message
                        ON UPDATE CASCADE ON DELETE CASCADE,
        fieldno         INTEGER NOT NULL
                        CHECK (fieldno >= 0),
        fieldname       CHARACTER VARYING NOT NULL
                        CHECK (LOWER(fieldname) = fieldname),
        fieldbody       CHARACTER VARYING NOT NULL,
        PRIMARY KEY (msgid, fieldno)
);

Note that even though the identified header fields are stored in the "message" relation, the same information is also still available in the "headerfield" relation.
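
To make the duplication concrete, the following Python sketch splits a raw message into one set of values for the "message" relation and one row per header field for the "headerfield" relation. It is only an illustration: split_message and SPECIAL_FIELDS are hypothetical names, the standard library email parser stands in for whatever parser Yukatan actually uses, and the msgdate and msgbody columns are left out for brevity.

```python
import email

# Maps lowercase header field names to columns of the "message" relation.
SPECIAL_FIELDS = {
    "from": "msgfrom", "sender": "msgsender", "reply-to": "msgreplyto",
    "to": "msgto", "cc": "msgcc", "bcc": "msgbcc",
    "message-id": "msgmessageid", "in-reply-to": "msginreplyto",
    "references": "msgreferences", "subject": "msgsubject",
}


def split_message(source):
    """Return (message_row, header_rows) for raw message bytes."""
    parsed = email.message_from_bytes(source)
    # Every header field, in order, as (fieldno, fieldname, fieldbody).
    header_rows = [
        (fieldno, name.lower(), str(body))
        for fieldno, (name, body) in enumerate(parsed.items())
    ]
    # The original bytes are kept untouched in msgsource.
    message_row = {"msgsource": source}
    for _, name, body in header_rows:
        column = SPECIAL_FIELDS.get(name)
        # Keep the first occurrence if a special field is repeated.
        if column is not None and column not in message_row:
            message_row[column] = body
    return message_row, header_rows
```

Each special field ends up in both results, which is the duplication described above; fields such as X-Mailer appear only in the header rows.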

This version already seems quite usable for simple problems, but it still fails to take advantage of the structure of most header fields, and there is still no support for MIME messages.