Yukatan data model 0.6

As discussed earlier, the one major shortcoming remaining in the previous versions of the Yukatan data model is the missing support for MIME (Multipurpose Internet Mail Extensions) formatted email messages. This version of the data model and the next few will address this shortcoming.

First, before considering the complexities of message body formats, we shall cover the easier issues of character encodings generally and in message header fields specifically.

Character sets and encoding

The first specification of the Internet email message format, RFC 822, considered only messages using the US-ASCII character set. The MIME specifications were created to overcome this limitation and to address a number of other limitations of the email message model. Please refer to the various MIME RFCs for details.

The first step in adding support for different message encodings in our data model is to specify the character set used by the database schema. Luckily our database server of choice, PostgreSQL, has great support for various character sets. Thus we can just choose the character set to use and inform the database server of our choice.

A decent email system should be able to cope with messages in just about any character encoding. Therefore our choices are either to store each message in its own character encoding, or to use a generic character set to which just about any incoming message can be converted. The latter option is much easier to handle conceptually, and luckily the Unicode standard specifies a generic character set and a couple of widely available character encodings that can quite easily be used for this purpose.
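The conversion to a common character set can be sketched in a few lines of Python. This is only an illustration, not part of the Yukatan code: the function name to_utf8 and the fallback policy (decode as US-ASCII and replace anything unrecognized) are assumptions made for the example.

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """Convert text bytes in the declared charset to UTF-8 bytes.

    Hypothetical helper for illustration only.
    """
    try:
        text = raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Assumed fallback policy: the charset label is unknown or the
        # bytes are invalid for it, so decode as US-ASCII and substitute
        # U+FFFD for every byte that cannot be interpreted.
        text = raw.decode("ascii", errors="replace")
    return text.encode("utf-8")


if __name__ == "__main__":
    latin1_bytes = "Grüße".encode("iso-8859-1")
    print(to_utf8(latin1_bytes, "iso-8859-1"))
```

A real system would probably want a more careful fallback, for example logging the failure and keeping a reference to the untouched original bytes.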

PostgreSQL supports the Unicode character set using the UTF-8 encoding. The name 'UNICODE' is used to select this character set and encoding for a database:

CREATE DATABASE yukatan WITH ENCODING 'UNICODE';

All text data (whatever the original encoding) within a Yukatan database is stored using the Unicode character set. Note however that the msgsource BYTEA field containing the original message source is an exact binary byte-by-byte representation of the original message.

Encoding of header fields

Header fields in an email message can be given in any character encoding by using the encoded-word syntax specified by RFC 2047. Whenever such header field contents are stored in a Yukatan database, the field content is first converted into the Unicode character set. All header field structure reflected in the Yukatan data model is parsed before the conversion, so special characters in the decoded text cannot cause problems in parsing.
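As a rough sketch of the decoding step, the following Python example converts the value of an unstructured header field (such as Subject) into Unicode text using the standard library function email.header.decode_header. The wrapper name decode_header_value and its error handling are assumptions made for this illustration. For structured fields, the structure would be parsed first, as described above, and only the individual text parts decoded this way.

```python
from email.header import decode_header


def decode_header_value(value: str) -> str:
    """Decode RFC 2047 encoded words in a header value to Unicode text.

    Hypothetical helper for illustration only.
    """
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            try:
                parts.append(chunk.decode(charset or "ascii", errors="replace"))
            except LookupError:
                # Assumed policy: an unknown charset label falls back to
                # US-ASCII, with replacement characters for other bytes.
                parts.append(chunk.decode("ascii", errors="replace"))
        else:
            # A value without encoded words is returned as str unchanged.
            parts.append(chunk)
    return "".join(parts)


if __name__ == "__main__":
    print(decode_header_value("=?iso-8859-1?Q?Gr=FC=DFe?="))
```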

SQL schema

As the SQL schema of the Yukatan data model grew quite large in the previous version, from now on only changes to the schema are discussed, and the complete version is only linked to.

Yukatan SQL schema 0.6

The only visible change in this version is the database creation command. In addition, the semantics of all text columns, especially those of the header field relations, were clarified now that the Unicode character set has been selected.