Time for a geek entry, even though few if any current readers care about this stuff. My apologies.
Email must be stored somewhere. First, while it's waiting on a server to be downloaded with a POP or IMAP client; second, after it's been downloaded to a user's computer; third, when a user saves it for posterity. (Yes, I'm deliberately ignoring lots of earlier steps that aren't relevant to my thoughts here, as well as some other ways of accessing email that generally just skip one or more of these steps.) It's fairly safe to assume that this storage is on a disk, but in what format?
Email is normally organized into mailboxes holding multiple messages. Sometimes each user has a single mailbox, and sometimes a single user can have multiple mailboxes for different purposes. Multiple mailboxes are especially useful when the end user saves mail, since that person often wants to organize what they're saving. Most people don't think about the format used for saving the mail, but that format becomes important if more than one program will be used to access the mail.
There are three major types of email storage formats. One is for each mailbox to be a single file containing all its messages, with some internal organization to determine which message is which. Another is for each mailbox to be a directory (or directory structure), with each message being a separate file within that directory. The third is within a database outside the usual filesystem, with some method of organization enabled by the flexibilities of general-purpose databases.
There is not yet a commonly-accepted schema for storing email in databases; each program that does it has its own method, making interoperability impractical in the general case. This may change in the future, but for now, database storage is not useful if more than one program needs to access the mail.
That leaves mailbox files and mailbox directories, which each have their advantages and disadvantages....
Mailbox files
While there have been many single-file mailbox formats using various message delimiters, the most common type of mailbox file is the classic Unix mbox format. This format separates messages by prepending each one with a text line like:
From user@example.com Fri May 21 13:37:59 2004
The date is normally the date the message was delivered to this mailbox, and the email address normally comes from the sender address specified by the protocol used to get this message (such as SMTP). That address and date do not necessarily match those in the message headers, though the information in the headers is often used when no better information is available. This From line is NOT the same as the header From: line (note the colon), and is not properly a part of the message headers. To avoid confusion, this line is sometimes referred to as the "From_" line.
To make sure that messages are not split when someone happens to type the word "From" at the beginning of a line, the mbox format inserts a ">" character at the beginning of any line of the message that starts with "From ". (In an attempt to avoid this issue, there is a variant of the mbox format that uses a Content-Length: header instead of relying on the "From " string, but it tends to cause as many problems as it solves and it isn't entirely forward- or backward-compatible with the basic version.)
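For the curious, here's a minimal sketch of that separator-and-quoting scheme in Python. The function names are my own invention for illustration, and this only covers the basic ">From " quoting described above, not the Content-Length: variant.

```python
from datetime import datetime


def mbox_quote_body(body: str) -> str:
    """Prefix '>' to every body line that starts with 'From '."""
    quoted = []
    for line in body.split("\n"):
        if line.startswith("From "):
            line = ">" + line
        quoted.append(line)
    return "\n".join(quoted)


def mbox_entry(envelope_sender: str, delivered: datetime,
               headers: str, body: str) -> str:
    """Build one mbox entry: From_ line, headers, blank line, quoted body."""
    # The From_ line carries the envelope sender and the delivery date.
    from_line = "From {} {}".format(
        envelope_sender, delivered.strftime("%a %b %d %H:%M:%S %Y"))
    return (from_line + "\n"
            + headers.rstrip("\n") + "\n\n"
            + mbox_quote_body(body).rstrip("\n") + "\n\n")
```

Calling mbox_entry("user@example.com", datetime(2004, 5, 21, 13, 37, 59), "Subject: hi", "Hello\nFrom here on\n") produces an entry whose first line is the From_ line shown earlier and whose body line becomes ">From here on".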
Messages in mbox files commonly have a Status: header that records per-message state. Conventionally, the O flag ("old") means the message was already in the mailbox the last time the mailbox was read, so a message without it is new, and the R flag means the user has actually read the message. A message marked O but not R has therefore been seen in the mailbox but not yet read. Other flags are generally defined by mail readers themselves.
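As a rough illustration, here is how a reader might interpret that header. I'm assuming the R (read) and O (old) letters that Python's standard mailbox module documents for Status:, and the function name is hypothetical; individual mail readers may use additional letters.

```python
def describe_status(status_value: str) -> str:
    """Interpret a Status: header value, assuming the R (read) and O (old) letters."""
    is_read = "R" in status_value
    is_old = "O" in status_value
    if is_read:
        return "read"
    if is_old:
        # Present the last time the mailbox was opened, but never read.
        return "old, unread"
    return "new"
```

So a message with "Status: RO" comes back as "read", one with "Status: O" as "old, unread", and one with no Status: value at all as "new".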
The biggest advantage of mailbox files is that they are self-contained. Moving or archiving an entire mailbox involves dealing with a single file, so moving or archiving a collection of mailboxes involves dealing with only as many files as there are mailboxes, with no excess structure to worry about.
A minor advantage of mailbox files is that reading all messages in a mailbox is generally quite fast. Excessively large messages, however, will slow things down.
The most serious disadvantage of mailbox files is simultaneous access by multiple processes. If everybody is only reading the file there is no problem, but if somebody is writing then others need to be aware of the way the file may change (including possible message deletion). If multiple processes attempt to write simultaneously, the mailbox can be corrupted. To avoid this problem, programs writing mailbox files need to lock the file and check for locks; however, there are multiple locking methods, none of which works properly in all cases, especially when network filesystems are involved. Because of the difficulties involved with locking, locks are often used only when it is known that contention is likely.
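To make the locking problem concrete, here's a sketch of just one of those methods, the "dotlock" convention of exclusively creating a file named after the mailbox with ".lock" appended. The function names, retry count, and delay are my own assumptions. This is a sketch, not a robust implementation: it doesn't handle stale locks, exclusive creation has historically been unreliable on some network filesystems, and it only helps if every other program touching the mailbox uses the same convention.

```python
import os
import time


def acquire_dotlock(mailbox_path, retries=5, delay=0.1):
    """Try to create '<mailbox>.lock' exclusively; return its path, or None."""
    lock_path = mailbox_path + ".lock"
    for _ in range(retries):
        try:
            # O_EXCL makes the create fail if the lock file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            time.sleep(delay)
            continue
        os.write(fd, str(os.getpid()).encode())
        os.close(fd)
        return lock_path
    return None


def release_dotlock(lock_path):
    """Remove the lock file so other writers can proceed."""
    os.remove(lock_path)


def append_locked(mailbox_path, entry):
    """Append one already-formatted mbox entry while holding the dotlock."""
    lock = acquire_dotlock(mailbox_path)
    if lock is None:
        raise TimeoutError("mailbox is locked by another process")
    try:
        with open(mailbox_path, "a", encoding="utf-8", newline="") as f:
            f.write(entry)
    finally:
        release_dotlock(lock)
```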
Another disadvantage of mailbox files is that random access is impractical without a separate index; finding the last message means scanning past all the previous ones. Similarly, deleting a message requires rewriting the whole mailbox. This can have severe memory implications; some programs have been known to corrupt mailboxes that become excessively large. Also, rewriting the whole mailbox makes any changes slow. However, simply adding a new message to the mailbox is fast.
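Here's a simplified sketch of why that is. The only way to find message boundaries is to scan every line for the "From " separator, and deleting one message means producing a whole new copy of everything else. The function names are hypothetical, and real parsers are pickier than this (for example, about the blank line before a separator).

```python
def split_mbox(text):
    """Split mbox text into entries by scanning for lines starting with 'From '."""
    messages = []
    current = None
    for line in text.splitlines(keepends=True):
        if line.startswith("From "):
            # A separator line ends the previous entry and starts a new one.
            if current is not None:
                messages.append("".join(current))
            current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:
        messages.append("".join(current))
    return messages


def delete_message(text, index):
    """Return new mbox text without entry `index`; everything else is rewritten."""
    messages = split_mbox(text)
    del messages[index]
    return "".join(messages)
```

Note that both functions touch the entire mailbox even when only one message is of interest.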
Mailbox directories
There are two major mailbox directory formats, but one has largely superseded the other. The older one is the mh format, which simply uses a directory full of message files, with each message file named with a number. New messages get the name/number after the highest one already there. This resolves some of the problems with mailbox files, but still has some locking issues due to the selection of message numbers as well as the intermediate state when a message is being written.
The newer, better mailbox directory format is maildir. It is designed to avoid all locking problems, and has become quite common far beyond its originator's own programs.
Instead of a single directory, a maildir is a directory containing three subdirectories: tmp, new, and cur. (There may also be other files or subdirectories used by specific programs.) When a message is first written to the maildir, it goes into the tmp subdirectory with a name that is guaranteed to be unique (I'll skip how it's made unique). Once the message is completely written to tmp, it gets moved (atomically) to new. When a program comes along to read the maildir and sees the new message, it moves the message file to the cur directory and appends flags to its name to indicate status (similar to the Status: header described above). The maildir web page defines these flags.
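The delivery steps above can be sketched in a few lines of Python. The unique-name scheme here is a simplified stand-in of my own (time, process ID, a counter, and the hostname), not the exact recipe from the maildir documentation, and the function names are hypothetical. The ":2," suffix followed by flag letters (S for "seen") is the maildir naming convention for files in cur.

```python
import os
import socket
import time

_delivery_count = 0


def unique_name():
    """Hypothetical unique-name scheme: time, pid, per-process counter, host."""
    global _delivery_count
    _delivery_count += 1
    host = socket.gethostname().replace("/", "_").replace(":", "_")
    return "{}.P{}Q{}.{}".format(int(time.time()), os.getpid(),
                                 _delivery_count, host)


def make_maildir(root):
    """Create the three subdirectories that make up a maildir."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(root, sub), exist_ok=True)


def deliver(root, message_bytes):
    """Write the whole message into tmp, then rename it into new."""
    name = unique_name()
    tmp_path = os.path.join(root, "tmp", name)
    with open(tmp_path, "xb") as f:  # 'x' fails if the name already exists
        f.write(message_bytes)
        f.flush()
        os.fsync(f.fileno())
    # The rename is atomic within one filesystem, so readers never see
    # a half-written message in new.
    os.rename(tmp_path, os.path.join(root, "new", name))
    return name


def move_to_cur(root, name, flags=""):
    """Move a message from new to cur, appending ':2,' plus flag letters."""
    cur_name = name + ":2," + flags
    os.rename(os.path.join(root, "new", name),
              os.path.join(root, "cur", cur_name))
    return cur_name
```

No lock is taken anywhere: the writer only ever creates a file nobody else knows about yet, and every later state change is a single rename.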
Other than the locking issue, which maildir resolves and mh does not, the biggest advantage to mailbox directories is random access; messages can be read and removed without affecting other messages, eliminating common types of mailbox corruption.
There are two intrinsic disadvantages to mailbox directories, however. The first is convenience; a structure of files and directories isn't as portable or manipulatable as a single file.
The second disadvantage is read speed. When reading an entire mailbox, it takes longer to read it as multiple files than as one big file. However, if only the headers are being read, large messages aren't any slower to read than small messages.
Conclusions
I find that maildirs are best for active mailboxes, while mboxes are best for archival mailboxes. Active mailboxes (such as your inbox) are sensitive to locking issues and shouldn't grow large enough for the slower sequential reads of mailbox directories to be a problem. Archival mailboxes (such as where you save your love letters or evidence of your employer's evil) tend to be read-mostly, with the occasional append by a single program under the user's control, and they often get moved around (and read without benefit of mail programs) quite a bit later.
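If you want to follow that split, moving mail from an active maildir into an archival mbox is easy to script. Here's a minimal sketch using Python's standard mailbox module; the function name and paths are placeholders of my own, and it copies messages without removing them from the maildir.

```python
import mailbox


def archive_maildir_to_mbox(maildir_path, mbox_path):
    """Copy every message in a maildir into an mbox file; return the count."""
    source = mailbox.Maildir(maildir_path, create=False)
    dest = mailbox.mbox(mbox_path)  # created if it does not exist yet
    copied = 0
    dest.lock()
    try:
        for message in source:
            # Converting to mboxMessage carries over flags such as "read".
            dest.add(mailbox.mboxMessage(message))
            copied += 1
        dest.flush()
    finally:
        dest.unlock()
        dest.close()
        source.close()
    return copied
```

A call like archive_maildir_to_mbox("Maildir", "saved-2004.mbox") (both names hypothetical) leaves you with a single file that can be moved around or read with a text editor.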
If only one of these two mailbox types is available, I'd choose maildir. However, when my mail program (MUA) is limited in its supported mailbox types, I find it useful to run a local IMAP server that can access the type of mailbox that the MUA has trouble with. This is possibly the only remaining good use for UW-IMAP, which supports only the mbox format (and has a poor security record). Other IMAP servers useful for this purpose (and other purposes, for that matter) include:
- Dovecot - supports both maildir and mbox, but still fairly new and not necessarily entirely reliable yet
- Courier IMAP - supports maildir
- Binc IMAP - supports maildir