The main feature that attracted me to Taglocity was its ability to automatically classify and tag messages. My dream was that email would arrive, be pre-sorted for me, and that only the important messages would be presented. Since each message would already be tagged, I would only have to dismiss the window after reading it. No filing would be required.

The technology isn’t quite there yet, but it’s going in that direction. To automatically tag messages, Taglocity uses the open source CRM114 engine. For those of you who are fans of the movie Dr. Strangelove, the name “CRM114” was lifted from that movie as described here. CRM114 was created as a spam filtering engine. A message has to meet a certain set of criteria to pass through; otherwise it is blocked. See the diagram below.

The characteristics of “spam” and “not spam” messages are stored in a CSS file. Even though the .css file extension is commonly used for “Cascading Style Sheets”, it has nothing to do with style sheets or web page layouts in this context. (You will certainly, however, feel more “stylish” when using AI technology to parse your email.) Anyway, in the CRM114 world, CSS stands for “CRM114 Sparse Spectra”.

So what is in a CSS file? Well, the file contains hashed information describing a particular type or class of email. The contents are counts of various words and phrases that have been observed in emails known to belong to that class. If you open the CSS file in Notepad, it will just be a bunch of gobbledygook, since the words are hashed and the file is in a binary format. Hashing basically means that words and word groups are boiled down into fixed-length numerical codes (32 bits in the case of CRM114). This makes the classification process easier and quicker, and helps keep a consistent structure in the CSS file. It also has a side benefit related to privacy, since the words and phrases from your original emails are no longer visible. Even though the process cannot (easily) be reversed to recover the original words, the hash codes and related counts can still be compared.

So how does CRM114 hash an incoming message? First of all, it uses a 5-word sliding window to create word groups. Each of these 5-word groups is then further broken down into various combinations of its words (sub-phrases) to create a number of “features” to describe the message. The benefit is that there will be many more features than there are words in the message. Hopefully a good percentage of these features are stable and unique for that classification of email (Yerazunis, W.S., “Sparse Binary Polynomial Hashing and the CRM114 Discriminator”). Each classification of email (spam, not spam, etc.) will get its own CSS file and therefore its own set of characteristics/features. An example of how this works is shown below.

In this example, the sentence “The quick brown fox jumps over the lazy dog” is parsed into five groups each of five words in length. For each group, sub-phrases are extracted and converted into 32-bit hash codes. The example shows this conversion for the last word group “jumps over the lazy dog”. As the CSS files are trained over numerous emails, the feature counts accumulate and patterns begin to emerge.
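For readers who prefer code to diagrams, here is a rough Python sketch of the windowing and hashing idea described above. It is only an illustration, not CRM114's actual implementation: the function names are made up for this post, `zlib.crc32` stands in for whatever hash CRM114 really uses, and the choice to always keep the last word of each window while skipping some of the earlier ones is my assumption about how the sub-phrases are formed.

```python
import zlib

WINDOW = 5  # window size, as described in the post


def sliding_windows(words, size=WINDOW):
    """Return every run of `size` consecutive words."""
    return [words[i:i + size] for i in range(len(words) - size + 1)]


def sub_phrases(window):
    """Build sub-phrases from one window.

    Assumption for this sketch: the last word is always kept, and each
    earlier word is either kept or replaced by a <skip> placeholder.
    """
    *earlier, newest = window
    phrases = []
    for mask in range(2 ** len(earlier)):
        parts = [w if mask & (1 << i) else "<skip>"
                 for i, w in enumerate(earlier)]
        phrases.append(" ".join(parts + [newest]))
    return phrases


def hash32(phrase):
    """Boil a phrase down to a 32-bit code (CRC32 is a stand-in hash)."""
    return zlib.crc32(phrase.encode("utf-8")) & 0xFFFFFFFF


def features(text):
    """Hash every sub-phrase of every window in the text."""
    words = text.lower().split()
    return [hash32(p) for win in sliding_windows(words)
            for p in sub_phrases(win)]


sentence = "The quick brown fox jumps over the lazy dog"
windows = sliding_windows(sentence.lower().split())
print(len(windows), "windows; last one:", " ".join(windows[-1]))
print(len(features(sentence)), "hashed features from 9 words")
```

With a 5-word window and this keep-or-skip scheme, each window yields 16 sub-phrases, so the 9-word sentence produces far more features than it has words, which is the point made above.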

So that’s what CRM114 does. From the basic spam classification diagram above, you can imagine how such a tool could be expanded to serve a more general classification function. Instead of picking between two classifications (spam vs. not spam), why not give it 10 or 20 possible outcomes? This is how the AutoTag function in Taglocity works. The figure below describes the basic flow.
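To make the jump from two classes to many a bit more concrete, here is a small Python sketch. It is a toy, not how Taglocity or CRM114 actually scores messages: the `TagModel` class, the `auto_tag` function, the 0.3 threshold, and the simple "fraction of matching features" score are all invented for illustration, and the features are just hashed words and word pairs rather than the full 5-word scheme.

```python
import zlib
from collections import Counter


def features(text):
    """Simplified stand-in features: hashed words and adjacent word pairs."""
    words = text.lower().split()
    grams = words + [" ".join(pair) for pair in zip(words, words[1:])]
    return [zlib.crc32(g.encode("utf-8")) for g in grams]


class TagModel:
    """Plays the role of one CSS file: feature counts for a single tag."""

    def __init__(self):
        self.counts = Counter()

    def train(self, text):
        self.counts.update(features(text))

    def score(self, text):
        """Fraction of the message's features already seen for this tag."""
        feats = features(text)
        if not feats:
            return 0.0
        return sum(1 for f in feats if f in self.counts) / len(feats)


def auto_tag(models, text, threshold=0.3):
    """Pick the best-scoring tag, or None if nothing scores well enough."""
    best_tag, best_score = None, 0.0
    for tag, model in models.items():
        s = model.score(text)
        if s > best_score:
            best_tag, best_score = tag, s
    return best_tag if best_score >= threshold else None


models = {"invoices": TagModel(), "travel": TagModel()}
models["invoices"].train("please pay the attached invoice by friday")
models["travel"].train("your flight booking is confirmed")

print(auto_tag(models, "the attached invoice is due friday"))
print(auto_tag(models, "lunch tomorrow at noon"))
```

The shape is the same whether there are two models or twenty: score the message against each one and keep the winner, leaving the message untagged if no class is a convincing match.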

If AutoTag is enabled, this process fires as each message hits your inbox. There is some additional help info on the Taglocity web site.

A few tips: If you plan to have lots of AutoTags defined, consider reducing the size of your CSS files or running AutoTag manually several times each day. Under the current scheme, all of your CSS files must be loaded into memory at the start of classification. This can cause your system to grind to a halt for a few seconds, which can be pretty annoying if you get a lot of email. I will provide information in future posts on how to manage and optimize the size and content of your CSS files. I will also provide info about some alternative algorithms that can be used within Taglocity for email classification. Until then, happy tagging!