The version 0.2 of the Yukatan data model has some shortcomings that should be addressed:
Luckily all these shortcomings can be addressed by further enhancing the data model. But doing so we cannot escape the perils of duplicate data. However as the email messages (once stored in the database) should never change, there will be no problems keeping the duplicate data up to date as long as it is entered correctly in the first place.
First of all we can solve the problem of losing precious data by storing the entire original message source byte by byte in the database. This almost duplicates the message storage requirements, but as storage space is cheap nowadays I believe the benefit of having the exact original message around clearly outweights the loss of some extra disk space.
From section "3.6. Field definitions" of RFC 2822 we get the following special header fields that essentially allow us to combine the benefits of the two previous versions of this data model.
All of these fields, except the Subject field, are actually structured but for now it is enough to only parse the Date field into a more processing-friendly format.
With these special fields added, the data model becomes:
CREATE TABLE message ( msgid SERIAL PRIMARY KEY, msgdate TIMESTAMP WITH TIME ZONE, msgfrom CHARACTER VARYING NOT NULL, msgsender CHARACTER VARYING, msgreplyto CHARACTER VARYING, msgto CHARACTER VARYING, msgcc CHARACTER VARYING, msgbcc CHARACTER VARYING, msgmessageid CHARACTER VARYING, msginreplyto CHARACTER VARYING, msgreferences CHARACTER VARYING, msgsubject CHARACTER VARYING, msgbody TEXT, msgsource BYTEA ); CREATE TABLE headerfield ( msgid INTEGER NOT NULL REFERENCES message ON UPDATE CASCADE ON DELETE CASCADE, fieldno INTEGER NOT NULL CHECK (fieldno >= 0), fieldname CHARACTER VARYING NOT NULL CHECK (LOWER(fieldname) = fieldname), fieldbody CHARACTER VARYING NOT NULL, PRIMARY KEY (msgid, fieldno) );
Note that even though the identified header fields are stored in the "message" relation, the same information can still be found also in the "headerfield" relation.
This version seems already quite usable for simple problems, but it still fails to take advantage of the structure of most of the header fields and there still is no support for MIME messages.