Why Filters Don’t Work (for Stopping Spam)

In software, the definition of a filter is "a computer program to process a data stream[1]." This processing generally means sorting the elements of the data stream into different groups. When applied to email, a more specific definition is "the processing of email to organize it according to specified criteria[2]."

These definitions make the idea of a "spam filter" seem so simple. Just sort the email message stream into two groups: the "good" messages and the spam. Then, dump the spam and deliver the rest.

The problem is that the label "spam filter" has become associated with a specific technique for performing the processing, or sorting, of the email stream. This technique involves:

  • scanning the content of each message
  • analyzing the words, phrases and patterns of the text, and
  • comparing the results against a keyword list, a rules table, or some sort of probability database

What this technique is really doing is "guessing" whether an email message is "good" or "bad" based on its content. Everyone who uses a content-oriented spam filter finds that, a significant amount of the time, the processing guesses wrong. This is why some spam still gets through and some legitimate messages get lost in the junk folder. It is an inherent flaw in every content-oriented filtering process.
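
To make the "guessing" concrete, here is a minimal sketch of a content-oriented filter, written in Python. It is only an illustration, not how any particular product works: the keyword list, the threshold, the function name looks_like_spam and both sample messages are made up for this example.

```python
# Hypothetical keyword list. Real filters use much larger keyword lists,
# rules tables, or probability databases, but the principle is the same.
SPAM_KEYWORDS = {"free", "winner", "prize", "viagra"}


def looks_like_spam(body: str, threshold: int = 2) -> bool:
    """Guess whether a message is spam by counting keyword hits in its text."""
    # Scan the content: split into words, drop punctuation, ignore case.
    words = [w.strip(".,!?:;\"'()").lower() for w in body.split()]
    # Compare the words against the keyword list.
    hits = sum(1 for w in words if w in SPAM_KEYWORDS)
    return hits >= threshold


# A legitimate message that happens to use flagged words.
legit = ("Congratulations, you are the winner of the school raffle! "
         "Pick up your prize at the office.")

# A spam message that avoids the list with altered spelling.
spam = "You are a w1nner! Claim your fr3e pr1ze now."

print(looks_like_spam(legit))  # True: a good message lands in the junk folder
print(looks_like_spam(spam))   # False: the spam gets through
```

Both guesses are wrong, and for the same reason: the code only ever looks at the words in the message.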

The definition of a spam filter needs to be reset so that it no longer presumes this inadequate processing technique. "Filter" the messages, not the content of the messages.

[1] Wikipedia, Filter (software)
[2] Wikipedia, Email Filtering