Spam Filters
There are many techniques for classifying a message. It can be examined for “spam-markers” such as common spam subjects, known spammer addresses, known mail forwarding machines, or simply common spam phrases. The header and/or the body can be examined for these markers. Another method is to classify all messages not from known addresses as spam. Another is to compare with messages that others have received, and find common spam messages. And another technique, probably the most popular at the moment, is to apply machine learning techniques in an email classifier.
Bayesian Filtering
Paul Graham kicked off a flood of mail filters implementing Bayesian filtering with his “A Plan for Spam” article in August 2002, though it was far from a new concept. In fact, ifile has used a Naive Bayes classification algorithm since August 1996 to automatically file mail into folders. In academic circles, Bayesian methods have been used in text classification for many years, and for spam detection prior to Graham, as evidenced by the 1998 workshop paper A Bayesian Approach to Filtering Junk E-Mail by Sahami, et al.
In a nutshell, the approach is to tokenize a large corpus of spam and a large corpus of non-spam. Certain tokens will be common in spam messages and uncommon in non-spam messages, and certain other tokens will be common in non-spam messages and uncommon in spam messages. When a message is to be classified, we tokenize it and see whether the tokens are more like those of a spam message or those of a non-spam message. How we determine this similarity is what the math is all about. It isn’t complicated, but it has a number of variations.
There’s a lot more to it than that (Bayesian methods are used a lot in the AI field, for example, in machine learning and user modelling), but that’s all we need to know.
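As a rough illustration of the idea (not the exact algorithm any of the filters below uses), here is a minimal Naive Bayes sketch in Python. The tokenizer, the add-one smoothing, and the way tokens are counted are all assumptions made for illustration, and the sketch presumes at least one message of each class has already been trained.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case the message and split it into simple word tokens."""
    return re.findall(r"[a-z0-9$']+", text.lower())

class NaiveBayesFilter:
    """Minimal two-class (spam / non-spam) Naive Bayes text classifier."""

    def __init__(self):
        self.doc_counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record which tokens appear in a hand-classified message."""
        self.messages[label] += 1
        self.doc_counts[label].update(set(tokenize(text)))

    def _log_score(self, tokens, label):
        """log P(label) plus the sum of log P(token | label), add-one smoothed."""
        total = self.messages["spam"] + self.messages["ham"]
        vocab = len(set(self.doc_counts["spam"]) | set(self.doc_counts["ham"])) + 1
        score = math.log(self.messages[label] / total)
        for token in tokens:
            seen = self.doc_counts[label][token]
            score += math.log((seen + 1) / (self.messages[label] + vocab))
        return score

    def is_spam(self, text):
        """Classify by comparing the two class scores."""
        tokens = set(tokenize(text))
        return self._log_score(tokens, "spam") > self._log_score(tokens, "ham")
```

The filters compared below differ mainly in how they tokenize (dbacl, for instance, works on digrams) and in how they combine the per-token probabilities into a final score.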
Some Spam Filters
In order to compare some spam filters, a number of filters were selected from the long list in the Freshmeat Topic :: Communications :: Email :: Filters category.
The selection was restricted by only considering free software and only filters that didn’t use network resources in their classification. The filters were further restricted to those that could be executed as standalone programs, read a message from standard input, and indicate via their output or their exit value whether it was spam or not.
Several filters satisfying the restrictions were downloaded, and a few of those were dropped due to problems with installation or execution. In the end, seven filters were used, five of which were Bayesian.
The version of each filter that was available for download on 3 July 2003 was used. Although the email was filtered in bulk during August, it was actually received during July, so it was tested against July’s versions of the programs. The filters are:
- Bayesian Mail Filter: a Bayesian filter that aims to be smaller, faster, and more versatile than other Bayesian filters. Version 0.9.4 was used.
- Bogofilter: a Bayesian filter designed for speed, for use at sites which process a large amount of mail. Version 0.13.7.2 was used.
- dbacl: a digramic Bayesian filter, not restricted to just spam and non-spam; it will classify a message into one of many categories. Version 1.3.1 was used.
- Quick Spam Filter: another small, fast Bayesian filter. Version 0.5.9 was used.
- SpamAssassin: a filter which uses a wide range of heuristic tests to determine whether a message is spam, each test adding to or subtracting from a score. Messages over a threshold score are declared spam. Version 2.55 was used.
- SpamProbe: the final Bayesian filter. Version 0.8b was used.
- SPASTIC: a collection of procmail recipes which tag a message as spam if it matches any of a number of heuristic tests. It is not really a standalone program, but since a previous comparison with SpamAssassin was criticized, I provided a procmail wrapper so it could be included in the comparison. Version 3.0 was used.
The Email Data
The email used in the testing consisted of my email from the month of July 2003. The mail consisted of 1,273 messages, of which 1,073 were spam. For the Bayesian filters, a training set of 68 spam messages and 68 non-spam messages was used (my email from the second half of June, with a random sample of spam messages from the same period).
The messages used were all hand-classified as spam or non-spam.
Methodology
Each program was installed according to its documentation. For the filters that required training, the training set data was supplied. Each filter was then taken in turn and executed once for each email in the spam and non-spam sets, and the classification it gave was recorded.
Default options were used for the filters in all cases.
The aim was to examine the filtering abilities of the packages. Hence, whitelists were not used, even though, in practice, they probably would be. Some analysis was done to see how much performance would be improved by whitelists.
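In concrete terms, the per-message runs can be scripted along these lines. The command lines, exit-status tests, and directory layout below are assumptions for illustration, since each filter is invoked differently and signals its verdict through its output or exit status in its own way.

```python
import subprocess
from pathlib import Path

# Hypothetical invocations; every filter here reads one message on stdin.
# Each program signals its verdict differently, so the spam test is per-filter.
FILTERS = {
    "bogofilter":   {"cmd": ["bogofilter"],         "is_spam": lambda rc: rc == 0},
    "spamassassin": {"cmd": ["spamassassin", "-e"], "is_spam": lambda rc: rc != 0},
}

def classify(spec, message_bytes):
    """Run one filter on one message; True means it called the message spam."""
    result = subprocess.run(spec["cmd"], input=message_bytes, capture_output=True)
    return spec["is_spam"](result.returncode)

def tally_errors(directory, messages_are_spam):
    """Count false positives/negatives for every filter over one mail directory."""
    tallies = {name: {"fp": 0, "fn": 0} for name in FILTERS}
    for path in sorted(Path(directory).glob("*")):
        body = path.read_bytes()
        for name, spec in FILTERS.items():
            said_spam = classify(spec, body)
            if said_spam and not messages_are_spam:
                tallies[name]["fp"] += 1
            elif messages_are_spam and not said_spam:
                tallies[name]["fn"] += 1
    return tallies

# Example: tally_errors("july/spam", True) and tally_errors("july/ham", False)
```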
Results
The standard metrics for text classification are recall and precision. For spam filtering, we are trying to correctly classify spam messages as spam without incorrectly classifying non-spam messages as spam. Spam classified as non-spam is known as a false negative. Non-spam classified as spam is known as a false positive.
Precision is the percentage of messages classified as spam that actually are spam. High precision is essential to prevent the messages we want to read from being classified as spam. Low precision indicates that there are many false positives.
Recall is the percentage of actual spam messages that were classified as spam. High recall is necessary to prevent our inboxes from filling with spam. Low recall indicates that there are many false negatives.
False positives are generally considered far worse than false negatives. Viewing a spam is better than not getting an important message. Hence, precision is a more important measure than recall, though, of course, a low recall makes a filter useless.
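For reference, the two metrics reduce to simple ratios over the error counts. The small sketch below checks them against the SpamAssassin row of Table 1 (1,073 spam messages, 215 missed, 1 false positive).

```python
def precision(true_positives, false_positives):
    """Fraction of the messages flagged as spam that really are spam."""
    return true_positives / (true_positives + false_positives)

def recall(true_positives, false_negatives):
    """Fraction of the actual spam that was flagged as spam."""
    return true_positives / (true_positives + false_negatives)

# SpamAssassin in Table 1: 1,073 spam messages, 215 missed, 1 false positive.
caught = 1073 - 215
print(f"precision = {precision(caught, 1):.1%}")    # -> 99.9%
print(f"recall    = {recall(caught, 215):.1%}")     # -> 80.0%
```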
Experiment One
For the first test of the filters, the 68 spam and 68 non-spam training messages were used to train the filters that required training. Then, the set of 1,273 messages was classified by each of the filters, the results of which are shown in Table 1:
Filter | Precision | Recall | False Positives | False Negatives | Correct Classifications |
---|---|---|---|---|---|
SpamProbe | 100.0% | 47.9% | 0 | 559 | 714 |
Bogofilter | 100.0% | 34.4% | 0 | 704 | 569 |
Bayesian Mail Filter | 100.0% | 11.0% | 0 | 955 | 318 |
SpamAssassin | 99.9% | 80.0% | 1 | 215 | 1057 |
dbacl | 99.0% | 47.0% | 5 | 569 | 699 |
Quick Spam Filter | 97.0% | 56.5% | 19 | 467 | 787 |
SPASTIC | 89.0% | 46.0% | 61 | 579 | 633 |
SpamAssassin is the only filter that has a recall rate worth using.
I think it’s reasonably clear that the Bayesian filters did not have large enough training sets, and hence are only achieving low recall rates.
Experiment Two
For the second test, the training data consisted of the original 68 spam and 68 non-spam training messages, plus the first 100 non-spam messages and the first 500 spam messages of the email data.
All the filters were run on the remaining email data, 100 non-spam messages and 573 spam messages, producing the results shown in Table 2:
Filter | Precision | Recall | False Positives | False Negatives | Correct Classifications |
---|---|---|---|---|---|
Bogofilter | 100.0% | 81.7% | 0 | 98 | 538 |
SpamProbe | 99.8% | 97.2% | 1 | 15 | 620 |
SpamAssassin | 99.8% | 78.7% | 1 | 114 | 521 |
Bayesian Mail Filter | 99.6% | 93.8% | 2 | 33 | 601 |
dbacl | 99.2% | 89.0% | 4 | 59 | 573 |
Quick Spam Filter | 94.9% | 79.1% | 23 | 112 | 501 |
SPASTIC | 88.5% | 43.3% | 30 | 304 | 302 |
Those results are more along the lines of how Bayesian filters are expected to perform. Quick Spam Filter and Bogofilter have noticeably lower recall than the other Bayesian filters, and Quick Spam Filter’s precision is too low to be useful.
SpamAssassin is now showing a significantly lower recall rate than most of the Bayesian filters. It should be noted that, in practice, SpamAssassin will likely use a few more metrics (using network resources), and hence should do a little better than these results indicate. Also, SpamAssassin has a Bayesian classifier built in, but it wasn’t used in these tests, since having five was enough.
That SpamAssassin is not better than the bulk of the other filters is a good sign for email filtering. Bayesian filters are reasonably easy to implement and require no knowledge of what differentiates spam from other email. SpamAssassin’s rules, on the other hand, need to be developed by people and probably account for most of the work in creating the software.
SPASTIC has both significantly lower precision and recall than the other filters. Since people actually do use it to filter mail, it must be suitable for some email profiles, but for my email, it isn’t usable.
Examining the False Positives
SpamProbe and SpamAssassin both generated one false positive, and it was caused by the same message. That message was essentially an advertisement for a conference, and many people would classify it as spam. However, I attended the previous conference, and I don’t mind this showing up in my inbox. It has a number of spam-like properties. “HTML only” is a big one. It is also generically addressed (“Dear Friends”). The From: address looks like it might be auto-generated due to some digits (icce2003@…). Basically, it’s spam that I didn’t mind receiving. The address it’s from could easily be entered into a whitelist to solve the problem, but it could also be argued that it should be classified as spam. I actually didn’t read it when it turned up in my inbox in real life (I don’t bother with HTML-only email), though it did remind me of the conference.
Bayesian Mail Filter also misclassified the message discussed above, as well as a message from my Web hosting provider announcing a server move and a little resulting downtime. Clearly, that is a message I want to receive. However, it was sent from the email address of my hosting provider, an address from which I expect to receive mail I want and which could easily be entered into a whitelist. In fact, it’s the type of address that should be put on a whitelist, since valid commercial messages look a lot like unsolicited commercial messages.
dbacl gave four false positives, one of which was the conference advertisement mentioned above. Another was a message detailing administrative responsibilities of staff. It was from someone who doesn’t send spam, and that address could easily be added to a whitelist.
It also flagged a forwarded IBM PhD Program nomination advertisement. This is another message that is essentially spam, but it was intentionally sent to a list I am on by a staff member. Again, a whitelist would catch this. The final false positive was a second copy of the IBM PhD Program email, this time forwarded by someone else to another list I am on.
Quick Spam Filter produced 23 false positives. These included the conference announcement and the hosting provider announcement mentioned above. A dozen or so newsletters were flagged as spam, as were a few commercial messages that were not unsolicited and a couple of messages from my wife. Whitelists can solve these problems quickly and easily. The false positives that are not easily fixed are the real problem, so I’ll focus on those.
An email bounce notice was flagged as spam. A whitelist can’t solve this without a fair amount of effort, since the address is determined by the machine on which I happen to run the “netfile” command.
A message requesting I contact a person about something which “needs urgent attention” was flagged as spam. This is what spam filtering nightmares are made of, especially when the email originates from an Associate Dean. Whitelists don’t help, since Associate Deans change and I had never heard of this person before I received this message. The reply to my reply to this message was also flagged as spam.
Four seminar announcements were flagged as spam. Since the sender is often different, a whitelist won’t fix this.
SPASTIC produced 30 false positives. The vast majority of these were newsletters, solicited commercial messages, and “calendar” reminder messages (which have no subjects), all of which cause problems easily solved by a whitelist.
SPASTIC also flagged an important message as spam, this time from my supervisor with the subject “URGENT”. Putting my supervisor in a whitelist is reasonable, I guess, but this highlights the problem with SPASTIC’s method of tagging a message as spam if any single test for spam succeeds. This particular message was not spam-like in any way, except for the subject.
Two more messages were tagged as spam which were not spam, but not from people I would put on a whitelist, since I wouldn’t expect email from them.
So, allowing for whitelists, we generate the false positives shown in Table 3:
Filter | False Positives |
---|---|
Bogofilter | 0 |
SpamProbe | 0 |
SpamAssassin | 0 |
Bayesian Mail Filter | 0 |
dbacl | 0 |
SPASTIC | 3 |
Quick Spam Filter | 6 |
Experiment Three
For the third test, the 1,273 pieces of July’s mail were used as the training set. The testing set was the first week of August’s mail: 252 mails, 210 of which were spam. The results are shown in Table 4. The low SpamAssassin and SPASTIC recalls indicate that my spam was quite different from what they expect spam to look like.
Filter | Precision | Recall | False Positives | False Negatives | Correct Classifications |
---|---|---|---|---|---|
Bayesian Mail Filter | 100.0% | 99.0% | 0 | 2 | 250 |
SpamProbe | 100.0% | 98.1% | 0 | 4 | 248 |
Bogofilter | 100.0% | 86.2% | 0 | 29 | 223 |
SpamAssassin | 100.0% | 59.0% | 0 | 86 | 166 |
dbacl | 99.3% | 64.8% | 1 | 74 | 177 |
Quick Spam Filter | 98.3% | 85.8% | 7 | 67 | 439 |
SPASTIC | 84.8% | 31.9% | 12 | 143 | 97 |
Experiment Four
For the fourth experiment, the 200 non-spam messages from July’s mail were combined with 200 spam messages randomly selected from July’s mail to make the training set. The testing set was the same as in the previous experiment. Therefore, SpamAssassin and SPASTIC were not tested; since they don’t use the training data, they would have the same results as in Table 4.
Filter | Precision | Recall | False Positives | False Negatives | Correct Classifications |
---|---|---|---|---|---|
Bayesian Mail Filter | 100.0% | 91.0% | 0 | 19 | 233 |
SpamProbe | 100.0% | 87.1% | 0 | 27 | 225 |
Bogofilter | 100.0% | 67.1% | 0 | 69 | 183 |
Quick Spam Filter | 95.7% | 63.8% | 6 | 76 | 170 |
dbacl | 99.2% | 55.7% | 1 | 93 | 158 |
The results in Table 5 show that all the Bayesian filters do worse than they did in Experiment Three, so a training set with a large amount of spam is better than a smaller, balanced training set. This conflicts with the documentation for sa-learn, SpamAssassin’s Bayesian classifier (not used in these tests), which says, “You should aim to train with at least the same amount (or more if possible!) of ham data [as] spam.”
Experiment Five
The previous experiments weren’t very scientific; they merely indicated how the various filters performed on particular data sets. To produce numbers with which the filters can be compared more objectively, we will follow the methodology used in a technical report by Androutsopoulos, et al.: Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach.
The data set used was all my email from the month of July. It was partitioned randomly into ten equally-sized sets, each containing 107 spam messages and 20 non-spam messages. The three spam messages left over were discarded. Each set in turn was used for testing, with the other nine combined to form the training set; hence, each filter was run ten times. The average precision and recall of the filters over those ten runs is shown in Table 6 (a rough sketch of the fold construction follows the table):
Filter | Precision (Mean) | Precision (Std. Dev.) | Recall (Mean) | Recall (Std. Dev.) |
---|---|---|---|---|
Bogofilter | 99.9% | 0.291% | 94.4% | 1.564% |
SpamAssassin | 99.9% | 0.370% | 80.0% | 2.871% |
dbacl | 99.4% | 0.844% | 79.2% | 4.541% |
SpamProbe | 99.1% | 0.720% | 99.0% | 0.882% |
Bayesian Mail Filter | 98.9% | 1.221% | 98.9% | 0.815% |
Quick Spam Filter | 96.5% | 1.935% | 89.5% | 3.894% |
SPASTIC | 89.1% | 3.640% | 46.1% | 4.053% |
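As a rough sketch of how the ten folds above can be built, assuming the July corpus is already split into hand-classified spam and non-spam lists; the retraining and classification steps for a given filter are left out, since they differ per program.

```python
import random

def ten_fold_runs(spam, ham, folds=10, seed=0):
    """Yield (train_spam, train_ham, test_spam, test_ham) for each of the folds.

    spam and ham are lists of raw messages; messages that don't divide evenly
    into the folds are discarded, as the three leftover spam messages were above.
    """
    rng = random.Random(seed)
    rng.shuffle(spam)
    rng.shuffle(ham)
    s_size, h_size = len(spam) // folds, len(ham) // folds
    s_folds = [spam[i * s_size:(i + 1) * s_size] for i in range(folds)]
    h_folds = [ham[i * h_size:(i + 1) * h_size] for i in range(folds)]
    for i in range(folds):
        train_spam = [m for j in range(folds) if j != i for m in s_folds[j]]
        train_ham = [m for j in range(folds) if j != i for m in h_folds[j]]
        yield train_spam, train_ham, s_folds[i], h_folds[i]
```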
For our objective analysis, we will use the metrics defined in the technical report linked above. Some measure of the relative cost of false positives to false negatives is needed in order to do this. Androutsopoulos, et al. suggest using a measure in which each non-spam is treated as equivalent to a number of spam messages. That number can be tweaked to represent just how bad false positives are to the user. We’ll call this weight FPW (false positive weight).
The variables we will define are:
FPW: | False Positive Weight. |
---|---|
CCNS: | Correctly Classified Non-Spam Messages. |
CCS: | Correctly Classified Spam Messages. |
ICNS: | Incorrectly Classified Non-Spam Messages. |
ICS: | Incorrectly Classified Spam Messages. |
NS: | Non-spam messages. |
S: | Spam Messages. |
The Weighted Accuracy of the filter is then defined as (FPW*CCNS + CCS)/(FPW*NS + S). It will be expressed as a percentage.
The Total Cost Ratio (please see the technical report for the justification of this metric) is then defined as S/(FPW*ICNS + ICS). If the Total Cost Ratio is less than one, the filter is doing worse than no filter at all. The higher the Total Cost Ratio, the better.
Tables 7, 8, 9, and 10 show the results for four values of FPW. The Weighted Accuracy and Total Cost Ratio were calculated by summing all the variables across all ten runs, rather than by calculating them ten times and averaging. Doing this prevents infinite Total Cost Ratio scores (which would occur whenever a filter made no mistakes on one run).
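In code, the pooled calculation looks roughly like this; the per-run dictionaries are a hypothetical representation of the ten runs, and the two functions simply implement the formulas defined above.

```python
def weighted_accuracy(fpw, ccns, ccs, ns, s):
    """(FPW*CCNS + CCS) / (FPW*NS + S), expressed as a fraction."""
    return (fpw * ccns + ccs) / (fpw * ns + s)

def total_cost_ratio(fpw, icns, ics, s):
    """S / (FPW*ICNS + ICS); below 1.0 the filter is worse than no filter."""
    return s / (fpw * icns + ics)

def pooled_metrics(runs, fpw):
    """Sum the counts over all runs first, then compute each metric once.

    Each run is a dict with keys ccns, ccs, icns, ics, ns, s. Pooling before
    dividing avoids the infinite Total Cost Ratio that a single error-free
    run would otherwise produce.
    """
    keys = ("ccns", "ccs", "icns", "ics", "ns", "s")
    t = {k: sum(run[k] for run in runs) for k in keys}
    return (weighted_accuracy(fpw, t["ccns"], t["ccs"], t["ns"], t["s"]),
            total_cost_ratio(fpw, t["icns"], t["ics"], t["s"]))
```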
[Tables 7-10: Weighted Accuracy and Total Cost Ratio for each filter at False Positive Weights of 1, 9, 99, and 999]
If the Total Cost Ratio is greater than 1, the filter is worth using if the False Positive Weight is an accurate representation of the relative costs of errors. A False Positive Weight of 1 is only realistic for the case in which email is being marked by the filter, but still placed in your inbox for manual removal. If that is how you plan to use a filter, SpamProbe or Bayesian Mail Filter are the best options, according to Table 7.
A False Positive Weight of 9 might be appropriate if you are filtering spam messages to a folder which you check every day. In that case, Bogofilter, SpamProbe, and Bayesian Mail Filter all look reasonable, according to Table 8.
A False Positive Weight of 99 might be an accurate representation for someone who checks the spam folder each week for false positives. In this case, Bogofilter and SpamAssassin are the most worthwhile filters.
A False Positive Weight of 999 would represent a set-and-forget spam filter which sends spam to the bit bucket. In this case, Bogofilter is the only option, and it isn’t any better than no filter.
Personally, I check my spam folder a few times each day. It only takes a second to glance at the new subjects and check the sender for the subjects that look like they might not be spam. So, for me, a False Positive Weight of 9 is appropriate.
The graph below gives an indication of how the filters compare at a range of False Positive Weights:
It’s important to note that the Total Cost Ratio isn’t a perfect metric. It penalizes classifying a forwarded joke from an annoying coworker as spam exactly as much as it penalizes classifying an urgent message from your boss or partner as spam.
Conclusion
The Bayesian filters, after training, offer better recall than the two heuristic filters. Catching a higher proportion of spam is clearly good, since that is the reason people use them. With insufficient training, however, the Bayesian filters perform poorly in comparison with SpamAssassin in terms of recall.
Based upon the results for my email, SpamProbe and Bayesian Mail Filter have usable recall percentages and acceptable precision. Four spam messages a week is much more bearable than 210, and well worth the minor effort involved in setting up one of these filters. If false positives are especially costly for you, Bogofilter is the best choice, according to my email.
SPASTIC is useless for my email, since it lets through far too much spam and marks some legitimate messages as spam messages. SpamAssassin is better; it lets through more spam than the Bayesian filters, but has enough precision to at least not hide wanted email.
Quick Spam Filter performs poorly when compared with the other Bayesian filters. I suspect it will improve in future versions, since clearly the underlying mechanism (Bayesian filtering) isn’t the problem.
dbacl is similar to SpamAssassin in performance. However, it should be noted that dbacl can classify into multiple folders, not just spam and non-spam. This extra functionality may cause its performance to be less than that of the other Bayesian filters, but if you use that functionality, the tradeoff might be worthwhile.
Recommendations
If you want to filter spam out of your email, I strongly suggest not automatically deleting messages. File the spam away, just in case you get false positives. Any spam which isn’t picked up by your filters should be manually moved to the spam folder, not deleted. The same is true for your real email; instead of deleting it, move it to another folder. That way, you’ll build a collection of spam and non-spam messages, which will come in handy for training filters.
Start by filtering with SpamAssassin. The Bayesian filters don’t work well if you don’t train them, and you can’t train them without having a collection of your past email (both spam messages and non-spam messages). A non-learning filter makes it easy to build this collection.
Watch for false positives. You really do need to scan the spam folder every so often to check for items that shouldn’t have been flagged as spam, especially if you ever move to a learning filter. Otherwise, it will learn that some valid messages are spam messages. If your filter supports whitelists (if not, you can always add a whitelist to a chain of filters), use them. If friends’ email gets flagged as spam, add them to the whitelist. It will save you time and lost messages in the end. If you can find the inclination, adding people to your whitelist preemptively should help avoid false positives.
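If your filter of choice has no whitelist support, one way to add it to a chain of filters is a small wrapper that checks the sender before the filter ever runs. Here is a minimal sketch, assuming the message arrives on standard input; the downstream command, its exit-code convention, and the addresses shown are placeholders to adjust for your own setup.

```python
import subprocess
import sys
from email import message_from_bytes
from email.utils import parseaddr

# Hypothetical whitelist; in practice this would be read from a file you maintain.
WHITELIST = {"supervisor@example.edu", "hosting@example.com"}

def is_spam(raw_message: bytes) -> bool:
    """Accept whitelisted senders outright; otherwise defer to the real filter."""
    sender = parseaddr(message_from_bytes(raw_message).get("From", ""))[1].lower()
    if sender in WHITELIST:
        return False
    # Hypothetical downstream filter; exit code 0 is treated as "spam" here,
    # so adjust this test for whatever filter you actually chain.
    result = subprocess.run(["bogofilter"], input=raw_message, capture_output=True)
    return result.returncode == 0

if __name__ == "__main__":
    # Exit 0 for spam so a procmail recipe can branch on the exit status.
    sys.exit(0 if is_spam(sys.stdin.buffer.read()) else 1)
```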
Once you have enough spam messages and non-spam messages correctly classified, you can think about using a Bayesian filter. You really want a few hundred of each type, preferably more. You also want to make sure there isn’t an unintended identifying feature of the spam messages or non-spam messages. For example, don’t use non-spam messages from the past 6 months and only the last month of spam messages; the learning algorithm might decide that messages with old dates are non-spam messages and messages with new dates are spam messages. Don’t try to pad the numbers with duplicates; it will overtrain the filter on the features in those messages.
Moving to a learning filter is a good thing, since keeping up-to-date with the latest rules isn’t necessary. The learning algorithm won’t get worse with time, since it will learn the ever-changing look of spam. (At least until spammers make their spam look very much like non-spam messages.)
Once you are using a learning filter, you must remember to train it every so often. If you don’t, the performance will deteriorate as your email usage changes. Of course, deteriorating performance is a great reminder to do some training. Training will be easy, since you will have a nice collection of classified spam messages and non-spam messages, and you will have corrected by hand any misclassifications the filter makes. Don’t just blindly feed the filter’s own classifications back in as training data; it will reinforce any mistakes. Another option is to simply train it on the messages it classified as false positives or false negatives, to correct the mistakes.
Try spam filtering. It puts the joy back into email.