Why Mail Filters are Doomed, and You Have to Have One
Originally published as a six degrees weblog on February 25th, 2003
Spam. Everybody gets it. Everybody hates it. When will politicians discover this defenseless whipping boy?
So, in the spirit of doing something about the weather, let’s look at the various countermeasures that have been tried, and see which ones work.
Legislation
In America, land of “There ought to be a law,” it’s not surprising that legislation was one of the first spam controls tried. Here is a summary of Colorado’s Junk Email Law:
The Colorado Junk Email Law, enacted in June 2000, prohibits the sending of unsolicited commercial e-mail that uses a third party’s Internet address or domain name without permission, or contains false or missing routing information. Unsolicited commercial e-mail messages must contain a label (“ADV:”) at the beginning of the subject line, and must include the sender’s e-mail address and opt-out instructions; opt-out requests must be honored.
(from: http://www.spamlaws.com/state/summary.html##co )
(You can read the actual legalese here: http://www.spamlaws.com/state/co.html )
I have no doubt that if we could just get every spammer to comply with these laws, we’d have no trouble putting them out of business. (And if we could just outlaw gravity, those floating cars in Blade Runner would be on the market faster, too.) So I think we can rule out legislation pretty quickly.
Blacklisting
Maybe the most natural defense against spam is to start a blacklist. You identify the problem kids and give them a time out.
A blacklist is what tail-gunner Joe McCarthy was working on: a big list of communists that we won’t let work in Hollywood anymore. Only this time it’s the spammers we are going to drag before the Senate and force to answer a bunch of embarrassing questions.
The most used blacklist is the Realtime Blackhole List.
The MAPS SM (Mail Abuse Prevention System) RBL SM (Realtime Blackhole List) is a list of networks which are known to be friendly, or at least neutral, to spammers who use these networks either to originate or relay spam. As we discover such networks, we deny them access to the part of the Internet that we are paying for. (from: http://mail-abuse.org/rbl/rationale.html )
The RBL is a list of IP addresses of “known” spammers. People get on the list because they sent spam, they aided and abetted spammers in some way, or they had an open-relay SMTP server. (The folks at MAPS are pretty aggressive about putting networks on the list; see http://mail-abuse.org/rbl/candidacy.html for details. They can and do put pretty much anyone they deem a spammer on the list, often without notification.)
They call it a blackhole list because the ISPs who use the list put any communication from the blacklisted addresses into a virtual blackhole, so they cannot communicate with anyone else on the Internet.
To give you just a little bit of flavor of the folks at MAPS, I submit the following two quotes. On a page discussing why they got into the blackhole business they write, “We’re mad as hell and we’re not going to take it any more,” (http://mail-abuse.org/rbl/rationale.html##LegalSpam). When talking about the rather drastic step of cutting off parts of the network to prevent spam they write, “Desperate times call for desperate measures,” (http://mail-abuse.org/rbl/candidacy.html).
In any blacklisting scheme, the problem is with false positives. Some people are going to be labeled as communists and have their lives and careers ruined, when they might just have enemies in the wrong places.
Another problem with blacklisting spammers is that this alone doesn’t prevent spam. It’s reactive. Spammers are identified after they send spam, and then they are blacklisted. So not only are blacklists draconian, they are ineffective too.
The biggest problem with blacklisting is that it operates at the server level. If you send me a message, and your ISP’s mail server is blacklisted, your message won’t be delivered to me, and I’ll have no idea you tried to send me anything. You may receive a message from your mail server that the message was blocked, or you may not, depending on what blacklist is used and what mail server you are using.
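To make the server-level check concrete, here is a minimal sketch of a blacklist lookup. The network ranges are reserved documentation addresses that I made up for illustration, not real listings from MAPS or anyone else, and the function names are hypothetical. A real mail server would query a service like the RBL rather than a hard-coded list.

```python
import ipaddress

# Hypothetical blacklist entries. These are reserved documentation ranges,
# not real listings from MAPS or any other blacklist.
BLACKLISTED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blacklisted(sender_ip, networks=BLACKLISTED_NETWORKS):
    """Return True if the connecting server's IP falls in any listed network."""
    address = ipaddress.ip_address(sender_ip)
    return any(address in network for network in networks)


def accept_connection(sender_ip):
    """Server-level decision, made before the recipient ever sees the message."""
    return not is_blacklisted(sender_ip)
```

Notice that the decision is made on the sending server's address alone. Nothing about the message, its author, or its recipient is consulted, which is why a blocked message vanishes without the recipient knowing.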
Content Filtering
Content filtering works by examining the contents of messages to determine which are spam. The basic theory is that all spam messages are pretty much alike, so a computer should be smart enough to figure out which ones are spam.
In their simplest forms, content filters look for a given word or phrase, and if it is found, the message is considered spam. (You know which words. Don’t make me list them like George Carlin.) More advanced filters score each word or phrase. Score high enough, and the message is considered spam. SpamAssassin, an open-source, rules-based filter, is an example of a scoring content filter.
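A scoring filter of the kind described above can be sketched in a few lines. The phrases, scores, and threshold here are all invented for illustration; this is not how SpamAssassin's actual rules are written, only the general shape of the idea.

```python
# Hypothetical rules and threshold, chosen only for illustration.
RULES = {
    "free money": 3.0,
    "act now": 2.0,
    "click here": 1.5,
    "unsubscribe": 0.5,
}
SPAM_THRESHOLD = 4.0


def spam_score(text, rules=RULES):
    """Add up the score of every rule phrase that appears in the message."""
    lowered = text.lower()
    return sum(score for phrase, score in rules.items() if phrase in lowered)


def is_spam(text, threshold=SPAM_THRESHOLD):
    """A message is considered spam once its total score reaches the threshold."""
    return spam_score(text) >= threshold
```

With these made-up numbers, a message has to trip more than one rule before it is labeled spam, which is the advantage of scoring over a single forbidden word.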
The problem with content filtering, again, is false positives; messages that aren’t spam get labeled as spam.
Making false positives worse is the way many of these content filters are deployed and operated. Many are run at the server level, and are set up to delete spam messages. When a false positive is deleted, mail you wish had been delivered isn’t. Worse, you aren’t even aware that the message was deleted. Far more sane are the ISPs that mark spam, by adding “SPAM” or something, to the subject line. Then you can run a rule in your email program that moves or deletes these messages. (There are also content filters that run client-side, as a plug-in to your email program.)
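The mark-then-filter arrangement might look something like this sketch. The "[SPAM]" tag and both function names are assumptions of mine; every ISP picks its own marker, and every email program has its own way of defining rules.

```python
from email.message import EmailMessage

SPAM_TAG = "[SPAM]"  # hypothetical marker; ISPs choose their own


def mark_as_spam(message):
    """Server side: tag the subject line instead of deleting the message."""
    subject = message.get("Subject", "")
    if not subject.startswith(SPAM_TAG):
        # Remove the old header first so the message keeps a single Subject.
        del message["Subject"]
        message["Subject"] = f"{SPAM_TAG} {subject}"
    return message


def choose_folder(message):
    """Client side: a rule that files tagged mail away from the Inbox."""
    tagged = message.get("Subject", "").startswith(SPAM_TAG)
    return "Junk" if tagged else "Inbox"
```

The point of the split is that the server only labels and the client decides, so a false positive ends up in a folder you can still look through.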
Some truly sophisticated content filters are beginning to emerge. By truly sophisticated, I mean that they rely on statistics, specifically Bayesian statistical theory, and are thus pretty much beyond my grasp. These are pretty new, and sound like they may have solved the false positives problem. Paul Graham has written a Bayesian spam filter which he describes here: http://www.paulgraham.com/spam.html.
Really, you should go read what Paul has to say. He is smarter than me, and has thought more deeply about this issue.
A Bayesian spam filter works by building a list of words and, for each one, the probability that a message containing it is spam. The filter “trains itself” by scanning a collection of messages already sorted into spam and non-spam, and counting how often each word turns up in each pile. Those counts give each word’s spam probability (how strongly the presence of that word suggests the whole message is spam). When a new message arrives, Bayes’ theorem is the means of combining the probabilities of its words into a single verdict. Whenever the filter is wrong about a message, the corrected message goes back into the training collection, so the probabilities shift and the filter can be even more effective in the future.
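Here is a rough sketch of the idea, using a textbook naive Bayes classifier with add-one smoothing. To be clear, this is not Paul Graham's exact formula (his article has the details); the class name, the tokenizer, and the smoothing are my own simplifying assumptions. It also only looks at the words present in a message and ignores the ones that are absent.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a message into a set of lowercase word tokens."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


class NaiveBayesFilter:
    """Textbook naive Bayes sketch; not Graham's exact algorithm."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one message that a person has labeled 'spam' or 'ham'."""
        self.message_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def _likelihood(self, word, label):
        # Fraction of messages with this label that contain the word,
        # with add-one smoothing so unseen words never give a zero.
        seen = self.word_counts[label][word]
        return (seen + 1) / (self.message_counts[label] + 2)

    def spam_probability(self, text):
        """Combine per-word evidence with Bayes' theorem, in log-odds form."""
        log_odds = math.log(
            (self.message_counts["spam"] + 1) / (self.message_counts["ham"] + 1)
        )
        for word in tokenize(text):
            log_odds += math.log(
                self._likelihood(word, "spam") / self._likelihood(word, "ham")
            )
        # Convert log-odds to a probability without overflowing exp().
        if log_odds >= 0:
            return 1 / (1 + math.exp(-log_odds))
        odds = math.exp(log_odds)
        return odds / (1 + odds)
```

Correcting a mistake is just another call to train() with the right label, which is how the word probabilities adapt to your own mail over time.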
Bottom line: a well-written Bayesian filter, trained with your data, can catch 99.5% of spam with less than 0.03% false positives. The only downside is that you need about 600 spam messages to train a filter well, according to Graham.
These sophisticated content filters should be able to catch spam that contains a sales pitch of any kind, whether the sales pitch is for a porn site, a multilevel marketing scheme, an online casino or a religion.
Whitelisting
Whitelisting turns the notion of a blacklist inside out. Instead of keeping a list of people prohibited from sending messages, you keep a list of the people that are allowed to send messages. Unlike the previous centralized approaches, where mail or senders are checked at the network level, whitelists work best as part of your email program.
It’s pretty easy to develop a whitelist for an individual. It is simply a list of every address you have sent mail to. (The assumption is that anyone you have sent mail to isn’t a spammer.) A whitelist filter delivers messages from people on your whitelist and moves or deletes messages sent from other addresses.
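A minimal sketch of that kind of whitelist follows. It assumes you can get at the To and Cc headers of your sent mail as plain strings; the function names and folder names are hypothetical.

```python
from email.utils import getaddresses


def build_whitelist(sent_recipient_headers):
    """Collect every address found in the To/Cc headers of mail you have sent."""
    pairs = getaddresses(sent_recipient_headers)
    return {address.lower() for _name, address in pairs if address}


def file_message(from_header, whitelist):
    """Deliver mail from whitelisted senders; everything else goes to Junk."""
    pairs = getaddresses([from_header])
    senders = [address.lower() for _name, address in pairs if address]
    if senders and all(sender in whitelist for sender in senders):
        return "Inbox"
    return "Junk"
```

For example, after building the whitelist from headers like "Alice <alice@example.com>, bob@example.org", a message from Alice lands in the Inbox and a message from any address you have never written to lands in Junk.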
Of course this approach alone has the false positive problem in spades. Any messages not from an address on your whitelist are treated as spam. So when your old college roommate sends you an email out of the blue, it gets rejected. Ditto when someone you know sends you email from a new address.
So whitelisting is usually used in combination with other approaches. As an example, Mail Frontier’s Matador combines a whitelist with blacklisting and content filtering.
In addition to other methods of filtering, many whitelist filters use the notion of challenge messages. When a message is rejected by the whitelist, a reply is sent automatically to the sender. This reply has a question that should be easy for a human to answer, but hard for a computer. For example, Matador might send a picture of a group of puppies, and ask you to count them. If the reply comes back with the right answer, the original message is delivered.
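The challenge-and-response flow can be sketched as a small holding queue. This is only an illustration of the general mechanism, not how Matador or any other product is built; the class, its method names, and the random token used to match a reply to its held message are all assumptions.

```python
import secrets


class ChallengeQueue:
    """Hold mail from unknown senders until they answer a challenge."""

    def __init__(self, whitelist):
        self.whitelist = set(whitelist)
        self.pending = {}  # token -> (held message, expected answer)
        self.inbox = []

    def receive(self, sender, message, question, expected_answer):
        """Deliver whitelisted mail; otherwise hold it and return a challenge."""
        if sender in self.whitelist:
            self.inbox.append(message)
            return None
        token = secrets.token_hex(8)  # ties the eventual reply to this message
        self.pending[token] = (message, expected_answer)
        return {"to": sender, "token": token, "question": question}

    def answer(self, token, reply):
        """Release the held message only if the reply matches the answer."""
        entry = self.pending.get(token)
        if entry is None:
            return False
        message, expected_answer = entry
        if reply.strip() != expected_answer:
            return False
        del self.pending[token]
        self.inbox.append(message)
        return True
```

Every message from a stranger generates one outgoing challenge, which is exactly the extra traffic complained about in the next paragraph.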
Challenge-and-response whitelists are kind of crude, and tend to make you look like a loon. They make other people deal with what is essentially your problem, and they increase the amount of email traffic. Plus a challenge email is unsolicited, and usually unwanted, confusing, or both.
Most whitelists work at the email client level, but there is at least one network level whitelist, the IronPort Bonded Sender™ Program.
The IronPort Bonded Sender™ Program turns the spam problem upside down by identifying legitimate email traffic. Originators of legitimate email can now post a financial bond to ensure the integrity of their email campaign. Receivers who feel they have received an unsolicited email from a Bonded Sender can complain to their ISP, enterprise, or IronPort and a financial charge is debited from the bond. (from: http://www.bondedsender.com/)
The basic idea here is that companies that engage in mass mailings post a bond before sending the email. Depending on the number of complaints, the company has all or part of the bond returned.
Joel Spolsky had a similar idea. He proposes a system that delivers an email for one cent. The sender pays, and the system can’t be abused, “because no spammer can afford the penny times the 19 million messages they send.”
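The arithmetic behind that quote is easy to check. This little sketch just multiplies the two figures from the quote; the function name is mine.

```python
COST_PER_MESSAGE_CENTS = 1   # one cent per delivered email, per the proposal
MESSAGES_SENT = 19_000_000   # the mailing size quoted above


def mailing_cost_dollars(messages, cents_per_message=COST_PER_MESSAGE_CENTS):
    """Total postage in dollars for a mailing of the given size."""
    return messages * cents_per_message / 100


print(mailing_cost_dollars(MESSAGES_SENT))  # 190000.0
```

So a single mailing of that size would cost $190,000, while someone sending a few dozen messages a day would pay pennies.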
Lately, I have been running a brain-dead-simple whitelist. My email program (Apple’s Mail) allows me to run a rule on all incoming mail that checks whether the sender is in my address book. If the sender isn’t in my address book, the message is moved out of my Inbox and into a Junk folder.
After doing this for a while, I can’t imagine not filtering my mail this way. This system fits with my expectations. I expect that all of the mail in my Inbox needs to be read, and most of it needs a response. I expect that most of the mail in my Junk folder is just that. When I look in there, I’m usually just scanning for the names of real people that I know.
In effect what I have are two inboxes, my regular one with high priority mail that I check a couple of times a day, and the Junk one that I look at less frequently, usually just to empty it.
So what is spam anyway? Typically spam is defined as unsolicited bulk email. I read this as impersonal messages that I didn’t ask for, and don’t want.
I get a lot of these, but most of them aren’t about lengthening part of my anatomy, placing a bet at an online casino, or recruiting five friends to pay me 50 dollars. Most of these messages are sent by real people who work for my company. They are stuff like:
A mass mailing from someone I don’t know, and will never speak with, giving me their new cell phone number.
A message from our company-wide broadcast address, CreoMail, announcing that the voicemail system will be down next Saturday, or that there are bagels in the kitchen of the Vancouver office. (I work in Denver.) (Note: I no longer work for Creo, so I don’t get this particular kind of spam anymore.)
Basically my Junk folder fills up with four kinds of stuff. Messages that didn’t need to be sent to me but were because it was easier to use a wide but not very targeted mailing list. Messages sent company wide that I would read if I had more hours in the day or were a better person, but there aren’t and I’m not. A small number of typical spam messages, advertising the sorts of things they advertise. A very small minority of messages, sent by people not in my address book, that I actually read.
As far as stopping the blatant advertising kind of spam, it looks like the best hope is a well-trained Bayesian content filter.
For stopping the other kinds of unsolicited bulk mailings, stuff that I used to think of as just part of email overload, a whitelist can help. Beware though, you’ll get an awful lot of false positives at first, as you slowly add all the people you forgot to your whitelist.
I’ll talk more about mail filters and the progress they have made in a future essay.