Yukatan data model 0.8

After solving the main problem of representing MIME entity trees in version 0.7 of the Yukatan data model, we can now start adding more detail to our bare-bones MIME model. This version adds support for MIME content types, content identifiers, and content descriptions.

Media types

One of the most important parts of the MIME standard is the concept of media types. Media types and the Content-Type header were first introduced in RFC 1049 and then generalized in RFC 2045, section 5, which says:

The purpose of the Content-Type field is to describe the data contained in the body fully enough that the receiving user agent can pick an appropriate agent or mechanism to present the data to the user, or otherwise deal with the data in an appropriate manner. The value in this field is called a media type.

Media types consist of a top-level type identifier, a subtype identifier, and an optional set of named parameter values. The top-level type identifier ("text", "image", "multipart", etc.) determines the general type of the entity content, and the subtype identifier ("plain", "jpeg", "digest", etc.) specifies the actual type of the entity content. The parameters mostly concern low-level issues such as character sets and multipart boundaries, which are handled before the message is stored in the Yukatan database. For now we are only interested in the top-level and subtype identifiers.

The type identifiers are stored as two attributes of the entity relation:

CREATE TABLE entity (
        ...
        enttypemajor    CHARACTER VARYING DEFAULT 'text' NOT NULL
                        CHECK (LOWER(enttypemajor) = enttypemajor),
        enttypeminor    CHARACTER VARYING DEFAULT 'plain' NOT NULL
                        CHECK (LOWER(enttypeminor) = enttypeminor),
        ...
);

The type identifiers are constrained to be NOT NULL because each message body should always have a content type. Additionally, the case-insensitive type identifiers must always be normalized to lower case to make them easier to handle. The default type "text/plain" should be used if the Content-Type header is not present. Note also that the actual values of the type attributes are not constrained to a predefined selection of type identifiers. It is the task of the database clients to assign meaning to the type identifiers stored in the database.

The conventional "type/subtype" notation is not used because the top-level type identifier is useful as a separate value.
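As a rough illustration of why the split is convenient, the following queries use only the two attributes defined above. They are a sketch against a populated entity table, not part of the schema itself. Because the CHECK constraints reject mixed-case values rather than rewriting them, a client is expected to apply LOWER() to the identifiers before storing them.

```sql
-- Count the stored image entities by subtype. Filtering on the
-- top-level type needs no string parsing of a combined value.
SELECT enttypeminor, COUNT(*) AS entities
  FROM entity
 WHERE enttypemajor = 'image'
 GROUP BY enttypeminor
 ORDER BY entities DESC;

-- Rebuild the conventional "type/subtype" notation when a client needs it.
SELECT DISTINCT enttypemajor || '/' || enttypeminor AS mediatype
  FROM entity;
```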

Binary media types

So far the Yukatan data model has only been able to store textual entity bodies. Now that we have added support for storing the media type we should also make it possible to store the data of the binary media types. To achieve this we will add a new entity attribute entdata for storing the binary contents of the non-text entities. The previous entbody attribute will also be renamed to enttext to better match the semantics of the text field.

CREATE TABLE entity (
        ...      
        enttext         TEXT,
        entdata         BYTEA
);

The contents and semantics of the attributes are determined by which attributes are NULL:

enttext IS NOT NULL AND entdata IS NULL: A normal text/* entity body. The decoded text body of the entity is stored in the enttext field.
enttext IS NULL AND entdata IS NOT NULL: A binary entity body, like application/octet-stream. The decoded byte body of the entity is stored in the entdata field.
enttext IS NOT NULL AND entdata IS NOT NULL: A binary entity body that has a text representation, like application/msword. The entdata field contains the decoded entity body, and the enttext field contains a plain text rendering of the body. The main reason for allowing such a plain text rendering is to make it possible for a full text search engine to index the message contents. Client programs should use their own text renderers to display the entity contents and not rely on the contents of the enttext field.
enttext IS NULL AND entdata IS NULL: The entity does not have a body. For example, all multipart and message/external-body entities have a NULL body. The body parts of multipart entities are stored as separate child entities, and message/external-body entities use the Content-Type header field parameters to identify the location of the entity body.
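The four combinations above can be expressed as a single CASE expression. This is only a sketch of how a client might classify entities; the bodykind labels are made up for illustration. The second query assumes PostgreSQL's built-in text search functions and a hypothetical search term, since the text does not name a particular full text search engine.

```sql
-- Classify each entity by which of the two body attributes are NULL.
SELECT enttypemajor, enttypeminor,
       CASE
           WHEN enttext IS NOT NULL AND entdata IS NULL     THEN 'text body'
           WHEN enttext IS NULL     AND entdata IS NOT NULL THEN 'binary body'
           WHEN enttext IS NOT NULL AND entdata IS NOT NULL THEN 'binary body with text rendering'
           ELSE 'no body'
       END AS bodykind
  FROM entity;

-- Search text bodies and text renderings alike (assumed search setup).
SELECT enttypemajor, enttypeminor
  FROM entity
 WHERE enttext IS NOT NULL
   AND to_tsvector('simple', enttext) @@ plainto_tsquery('simple', 'shuttle');
```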

While the different combinations have quite standard relationships with the various media types, we still won't set explicit table constraints to govern these relationships. The reason for this is that the set of media types is not complete, and future standards might define new media types that would contradict these constraints. In this case it is better to leave the interpretation of the data to the client programs.
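Since the schema does not enforce these relationships, a client or administrator who wants to spot unusual rows can still do so with an ordinary query. The sketch below only encodes two of the conventions listed above, and the choice of which combinations count as suspect is an assumption made for illustration.

```sql
-- Report entities whose body attributes deviate from the usual
-- conventions: text entities carrying binary data, and multipart
-- entities carrying any body at all.
SELECT enttypemajor, enttypeminor, COUNT(*) AS suspects
  FROM entity
 WHERE (enttypemajor = 'text' AND entdata IS NOT NULL)
    OR (enttypemajor = 'multipart'
        AND (enttext IS NOT NULL OR entdata IS NOT NULL))
 GROUP BY enttypemajor, enttypeminor;
```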

Content identifiers

The MIME standard defines the content identifier as an entity-level identifier to be used like the message identifier defined in RFC 822. The content identifier and the Content-ID header field are defined in RFC 2045, section 7.

The content identifiers are stored in the Yukatan database just like the previously defined message identifiers:

CREATE TABLE entity (
        ...
        entmessageid    CHARACTER VARYING,
        entcontentid    CHARACTER VARYING,
        ...
);
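A typical use of the content identifier is resolving a reference from one body part to another. The lookup below is a sketch: the identifier value is hypothetical, and it assumes that identifiers are stored with their surrounding angle brackets, which the text above does not specify.

```sql
-- Fetch the entity that a given Content-ID refers to.
SELECT enttypemajor, enttypeminor, enttext, entdata
  FROM entity
 WHERE entcontentid = '<part1.12345@example.org>';
```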

Content description

As a final step to fully support RFC 2045 we will add support for the Content-Description header field defined in section 8 of the RFC:

The ability to associate some descriptive information with a given body is often desirable. For example, it may be useful to mark an "image" body as "a picture of the Space Shuttle Endeavor." Such text may be placed in the Content-Description header field. This header field is always optional.

The optional description is stored as an RFC 2047-decoded Unicode string in the entdescription attribute:

CREATE TABLE entity (
        ...
        entdescription  CHARACTER VARYING,
        ...
);
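Because the description is stored already decoded, clients can display it directly. A minimal sketch of such a query, using only attributes introduced in this version:

```sql
-- List the described image entities together with their descriptions.
SELECT enttypeminor, entdescription
  FROM entity
 WHERE enttypemajor = 'image'
   AND entdescription IS NOT NULL;
```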

SQL schema

The full SQL schema of the Yukatan data model 0.8 is included as the attached SQL schema file.

Yukatan SQL schema 0.8

The only changes since version 0.7 are the added attributes of the entity relation. The next version of the Yukatan data model will add detailed information related to the handling of file attachments.