Email Information Concentrator

Name
Peeter Jürviste
Abstract
Email is one of the most successful and widely used computer applications yet devised. It has been around for decades and is used by individuals and organizations all over the globe. Today, however, we face a growing message overload problem. This work is part of research in the area of knowledge management that aims to find ways to handle this problem. As the main outcome of my Bachelor thesis, an application called Email Information Concentrator and a corresponding library were designed and developed. The application consists of two abstract modules that operate separately: a mail delivery agent and a mailbox parser. Each abstract module has several implementations. For instance, two parsers were implemented, one yielding a graph representation of the mailbox and the other yielding corresponding Prolog statements. The program is designed to support new implementations of these modules if the need arises. Two external packages are also used: Beautiful Soup for parsing HTML content in messages and NetworkX for creating email graphs. My library includes tools that extract the message header fields defined by RFC 5322. It can also extract relevant word, line and paragraph information from the message body. A simple heuristic was developed for extracting paragraphs from text/plain and text/html messages. I also developed a model for representing an Email Graph. My idea is to keep only unique parts of the mail message data and to create relations for the duplicates, which reduces data duplication. The output of the Email Information Concentrator can be used as input for further work in related research projects, and the application will serve as the base for further work on information management for email. In particular, the Synchronous Delivery Agent is one of the next steps. It could improve the overall user experience and reduce network load, because it would download only the messages that have changed since the last synchronization instead of fetching the whole mailbox. In addition, reconstructing discussion threads remains a challenging but necessary task. Finally, many optimizations are still pending to speed up the handling of large-scale mailbox traffic. The Email Information Concentrator has already established itself as an integral part of the knowledge management research. Other researchers are using it in their branches for importing and parsing mailboxes.
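The idea of keeping only unique message parts and relating duplicates to them can be illustrated with a minimal sketch. It is not the thesis implementation: it uses only the Python standard library instead of NetworkX, the function names (body_parts, paragraphs, build_graph) are hypothetical, and splitting paragraphs on blank lines is an assumed stand-in for the heuristic mentioned above. Each distinct paragraph is stored once under its SHA-256 digest, and every message that contains it gets a "contains" relation to that single node.

```python
import hashlib
import re
from email import message_from_string


def body_parts(msg):
    """Return the decoded text/plain payloads of a parsed message."""
    parts = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)  # bytes
            charset = part.get_content_charset() or "utf-8"
            parts.append(payload.decode(charset, errors="replace"))
    return parts


def paragraphs(text):
    """Assumed heuristic: paragraphs are separated by blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def build_graph(raw_messages):
    """Build (nodes, edges) where each unique paragraph is stored once.

    nodes maps a content digest to the paragraph text.
    edges lists (message id, "contains", digest) relations.
    """
    nodes = {}
    edges = []
    for raw in raw_messages:
        msg = message_from_string(raw)
        msg_id = msg["Message-ID"]  # header field defined by RFC 5322
        for text in body_parts(msg):
            for para in paragraphs(text):
                digest = hashlib.sha256(para.encode("utf-8")).hexdigest()
                nodes.setdefault(digest, para)  # keep the first copy only
                edges.append((msg_id, "contains", digest))
    return nodes, edges
```

With two messages that share one paragraph, the shared text appears as a single node that both message identifiers point to, so the duplicate costs one extra relation rather than a second copy of the content.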
Graduation Thesis language
English
Graduation Thesis type
Bachelor - Computer Science
Supervisor(s)
Ulrich Norbisrath, PhD
Defence year
2010
 