<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Real-time AntiSpam protection, automated and self-managed content filtering &#187; Spam filtering techniques</title>
	<atom:link href="http://veriat.com/category/spam-filtering-techniques/feed" rel="self" type="application/rss+xml" />
	<link>http://veriat.com</link>
	<description></description>
	<lastBuildDate>Thu, 27 May 2010 23:10:07 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Six approaches to eliminating unwanted e-mail</title>
		<link>http://veriat.com/six-approaches-to-eliminating-unwanted-e-mail.html</link>
		<comments>http://veriat.com/six-approaches-to-eliminating-unwanted-e-mail.html#comments</comments>
		<pubDate>Thu, 08 Oct 2009 10:18:43 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Spam filtering techniques]]></category>
		<category><![CDATA[eliminating]]></category>
		<category><![CDATA[Six approaches]]></category>
		<category><![CDATA[unwanted e-mail]]></category>

		<guid isPermaLink="false">http://veriat.com/?p=371</guid>
		<description><![CDATA[The problem of unsolicited e-mail has been increasing for years, but help has arrived. In this article, David discusses and compares several broad approaches to the automatic elimination of unwanted e-mail while introducing and testing some popular tools that follow these approaches.
[...]]]></description>
			<content:encoded><![CDATA[<p>The problem of unsolicited e-mail has been increasing for years, but help has arrived. In this article, David discusses and compares several broad approaches to the automatic elimination of unwanted e-mail while introducing and testing some popular tools that follow these approaches.</p>
<p>Unethical e-mail senders bear little or no cost for mass distribution of messages, yet normal e-mail users are forced to spend time and effort purging fraudulent and otherwise unwanted mail from their mailboxes. In this article, I describe ways that computer code can help eliminate unsolicited commercial e-mail, viruses,<br />
trojans, and worms, as well as frauds perpetrated electronically and other undesired and troublesome e-mail. In some sense, the final and best solution for eliminating spam will probably take place on a legal level. In the meantime, however, you can do some things from a code perspective that can serve as an interim solution<br />
to the problem, until (if ever) the laws begin to evolve at the same rate as public frustration.</p>
<p>Considering matters technically, but also with common sense, what is generally called “spam” is somewhat broader than the category “unsolicited commercial e-mail”; spam encompasses all the e-mail that we do not want and that is only very loosely directed at us. Such messages are not always commercial per se, and some push the limits of what it means to be solicited. For example, we do<br />
not want to get viruses (even from our unwary friends); nor do we generally want chain letters, even if they don’t ask for money; nor proselytizing messages from strangers; nor outright attempts to defraud us. In any case, it is usually unambiguous whether a message is spam, and many, many people get the same such e-mails.<span id="more-371"></span></p>
<p>The problem with spam is that it tends to swamp desirable e-mail: I now receive far more spams than legitimate correspondences. On average, I probably get 10 spams for every appropriate e-mail. In some ways I am unusual: as a public writer, I maintain a widely published e-mail address; moreover, I both welcome and receive frequent correspondence<br />
from strangers related to my published writing and to my software libraries. Unfortunately, a letter from a stranger (with who-knows-which e-mail application, OS, native natural language, and so on) is not immediately obvious in its purpose, and spammers try to slip their messages underneath such ambiguities. My seconds are valuable to me, especially when they are claimed many times during every<br />
hour of a day.</p>
<p><strong>Hiding contact information:</strong></p>
<p>For some e-mail users, a reasonable, sufficient, and very simple approach to avoiding spam is simply to guard e-mail addresses closely. For these people, an e-mail address is something to be revealed only to selected, trusted parties. As extra precautions, an e-mail address can be chosen to avoid easily guessed<br />
names and dictionary words, and addresses can be disguised when posting to public areas. We have all seen e-mail addresses cutely encoded in forms like “echo zregm@tabfvf.pk | tr A-Za-z N-ZA-Mn-za-m”.</p>
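<p>The shell one-liner quoted above is a ROT13 transformation: each letter is rotated 13 places, while “@” and “.” are left alone. As a small sketch, the same disguise could be produced in Python with the standard <code>codecs</code> module; the function names and the example address are illustrative only.</p>

```python
import codecs


def disguise(address: str) -> str:
    """Return the ROT13 form of an address; non-letters such as '@' and '.' are unchanged."""
    return codecs.encode(address, "rot_13")


def reveal(disguised: str) -> str:
    """ROT13 is its own inverse, so revealing is the same rotation applied again."""
    return codecs.encode(disguised, "rot_13")


if __name__ == "__main__":
    hidden = disguise("user@example.org")  # hypothetical address
    print(hidden)
    print(reveal(hidden))
```

<p>A disguise like this only deters the simplest address harvesters, since anyone who recognizes the pattern can reverse it.</p>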
<p>In addition to hiding addresses, a secretive e-mailer often uses one or more of the free e-mail services for “throwaway” addresses. If you need to transact e-mail with a semi-trusted party, a temporary address can be used for a few days, then abandoned along with any spam it might thereafter accumulate. The real “confidantes only” address is kept protected.</p>
<p>In my informal survey of discussions of spam on Web-boards, mailing lists, the Usenet, and so on, I’ve found that a category of e-mail users gains sufficient protection from these basic precautions.</p>
<p><strong>Looking at filtering software:</strong></p>
<p>This article looks at filtering software from a particular perspective. I want to know how well different approaches work in correctly identifying spam as spam and desirable messages as legitimate. For purposes of answering this question, I am not particularly interested in the details of configuring filter applications<br />
to work with various Mail Transfer Agents (MTAs). There is certainly a great<br />
deal of arcana surrounding the best configuration of MTAs such as Sendmail,<br />
QMail, Procmail, Fetchmail, and others. Further, many e-mail clients have their<br />
own filtering options and plug-in APIs. Fortunately, most of the filters I look<br />
at come with pretty good documentation covering how to configure them with various<br />
MTAs.</p>
<p>For purposes of my testing, I developed two collections of messages: spam and<br />
legitimate. Both collections were taken from mail I actually received in the<br />
last couple of months, but I added a significant subset of messages up to several<br />
years old to broaden the test. I cannot know exactly what will be contained<br />
in next month’s e-mails, but the past provides the best clue to what the future<br />
holds. That sounds cryptic, but all I mean is that I do not want to limit the<br />
patterns to a few words, phrases, regular expressions, etc. that might characterize<br />
the very latest e-mails but fail to generalize to the two types.</p>
<p>A general comment on testing is worth emphasizing. False negatives in spam filters<br />
just mean that some unwanted messages make it to your inbox. Not a good thing,<br />
but not horrible in itself. False positives are cases where legitimate messages<br />
are misidentified as spam. This can potentially be very bad, as some legitimate<br />
messages are important, even urgent, in nature, and even those that are merely<br />
conversational are ones we do not want to lose. Most filtering software allows<br />
you to save rejected messages in temporary folders pending review, but if<br />
you need to review a folder full of spam, the usefulness of the software is<br />
thereby reduced.</p>
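<p>To make the two kinds of error concrete, the following sketch computes both rates from a filter's verdicts on a spam corpus and a legitimate corpus. The function name and the sample verdicts are made up for illustration; they are not results from the tests described in this article.</p>

```python
def error_rates(spam_verdicts, legit_verdicts):
    """Each argument is a list of booleans: True where the filter said 'spam'.

    spam_verdicts covers messages known to be spam,
    legit_verdicts covers messages known to be legitimate.
    """
    false_negatives = sum(1 for flagged in spam_verdicts if not flagged)
    false_positives = sum(1 for flagged in legit_verdicts if flagged)
    return {
        "false_negative_rate": false_negatives / len(spam_verdicts),
        "false_positive_rate": false_positives / len(legit_verdicts),
    }


if __name__ == "__main__":
    # Hypothetical verdicts: one spam slipped through, one good message was rejected.
    print(error_rates([True, True, False, True], [False, False, False, True]))
```

<p>Because the two errors have very different costs, it is usually more informative to report them separately than to collapse them into a single accuracy figure.</p>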
<p><strong>1. Basic structured text filters:</strong></p>
<p>The e-mail client I use has the capability to sort incoming e-mail based on<br />
simple strings found in specific header fields, the header in general, and/or<br />
in the body. Its capability is very simple and does not even include regular<br />
expression matching. Almost all e-mail clients have this much filtering capability.</p>
<p>Over the last few months, I have developed a fairly small number of text filters.<br />
These few simple filters correctly catch about 80% of the spam I receive. Unfortunately,<br />
they also have a relatively high false positive rate, enough that I need to<br />
manually examine some of the spam folders from time to time. (I sort probable<br />
spam into several different folders, and I save them all to develop message<br />
corpora.) Although exact details will differ among users, a general pattern<br />
will be useful to most readers:</p>
<p>&#8226; Set 1: A few people or mailing lists do funny things<br />
with their headers that get them flagged on other rules. I catch something in<br />
the header (usually the From:) and whitelist it (either to INBOX or somewhere<br />
else).</p>
<p>&#8226; Set 2: In no particular order, I run the following<br />
spam filters:</p>
<p>o Identify a specific bad sender.</p>
<p>o Look for “&lt;&gt;” as the From: header.</p>
<p>o Look for “@&lt;” in the header (a lot of spam has this for some reason).</p>
<p>o Look for “Content-Type: audio”. Nothing I want has this, only viruses (your<br />
mileage may vary).</p>
<p>o Look for “euc-kr” and “ks_c_5601-1987” in the headers. I can’t read that language,<br />
but for some reason I get a huge volume of Korean spam (of course, for an actual<br />
Korean reader, this isn’t a good rule).</p>
<p>&#8226; Set 3: Store messages to known legitimate addresses.<br />
I have several such rules, but they all just match a literal To: field.</p>
<p>&#8226; Set 4: Look for messages that have a legit address<br />
in the header, but that weren’t caught by the previous To: filters. I find that<br />
when I am only in the Bcc: field, it’s almost always an unsolicited mailing<br />
to a list of alphabetically sequential addresses (mertz1@…, mertz37@…, etc).</p>
<p>&#8226; Set 5: Anything left at this point is probably spam<br />
(it probably has forged headers to avoid identification of the sender).</p>
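<p>The five rule sets above run in a fixed order, and the first match decides where a message goes. A minimal Python sketch of that ordering, using the standard <code>email</code> module, might look like this. All addresses and folder names are hypothetical placeholders, and the real rules live in an e-mail client's filter settings rather than in code.</p>

```python
from email import message_from_string

# Hypothetical placeholders; substitute your own senders and addresses.
WHITELIST_FROM = ["announce@lists.example.org"]
BAD_SENDERS = ["bulk@spam.example.com"]
MY_ADDRESSES = ["me@example.org"]
SPAM_MARKERS = ["@<", "content-type: audio", "euc-kr", "ks_c_5601-1987"]


def classify(raw_message: str) -> str:
    """Return the folder name chosen by the first matching rule set."""
    msg = message_from_string(raw_message)
    sender = (msg.get("From") or "").lower()
    to = (msg.get("To") or "").lower()
    headers = "\n".join(f"{name}: {value}" for name, value in msg.items()).lower()

    # Set 1: whitelist senders whose odd headers would trip later rules.
    if any(good in sender for good in WHITELIST_FROM):
        return "INBOX"
    # Set 2: simple spam indicators (their relative order does not matter).
    if any(bad in sender for bad in BAD_SENDERS):
        return "spam-sender"
    if sender.strip() == "<>":
        return "spam-empty-from"
    if any(marker in headers for marker in SPAM_MARKERS):
        return "spam-header"
    # Set 3: mail addressed literally to a known legitimate address.
    if any(address in to for address in MY_ADDRESSES):
        return "INBOX"
    # Set 4: my address is somewhere in the headers, but not in To:.
    if any(address in headers for address in MY_ADDRESSES):
        return "spam-not-in-to"
    # Set 5: anything left is probably spam.
    return "spam-leftover"
```

<p>Keeping the whitelist first is what protects the odd-but-legitimate senders from the cruder string matches that follow.</p>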
<p><strong>2. Whitelist/verification filters:</strong></p>
<p>A fairly aggressive technique for spam filtering is what I would call the “whitelist<br />
plus automated verification” approach. There are several tools that implement<br />
a whitelist with verification: TMDA is a popular multi-platform open source<br />
tool; ChoiceMail is a commercial tool for Windows; most others seem more preliminary.<br />
(See Resources later in this article for links.)</p>
<p>A whitelist filter connects to an MTA and passes mail only from explicitly approved<br />
senders on to the inbox. Other messages generate a special challenge response<br />
to the sender. The whitelist filter’s response contains some kind of unique<br />
code that identifies the original message, such as a hash or sequential ID.<br />
This challenge message contains instructions for the sender to reply in order<br />
to be added to the whitelist (the response message must contain the code generated<br />
by the whitelist filter). When a legitimate sender answers a challenge, her/his<br />
address is added to the whitelist so that any future messages from the same<br />
address are passed through automatically.</p>
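<p>The challenge cycle just described can be sketched in a few lines of Python. This is an illustration only, not how TMDA or ChoiceMail is implemented: the class name, the use of a truncated SHA-256 hash as the challenge code, and the in-memory storage are all assumptions made for the example.</p>

```python
import hashlib


class WhitelistFilter:
    """Toy whitelist-plus-verification filter (hypothetical, in-memory only)."""

    def __init__(self, approved):
        self.approved = set(approved)
        self.pending = {}  # challenge code -> (sender, held message)

    def receive(self, sender, message):
        """Deliver mail from approved senders; hold anything else and issue a challenge."""
        if sender in self.approved:
            return ("deliver", message)
        code = hashlib.sha256(f"{sender}\n{message}".encode("utf-8")).hexdigest()[:12]
        self.pending[code] = (sender, message)
        return ("challenge", code)

    def confirm(self, sender, code):
        """Handle a challenge reply: release the held message and whitelist the sender."""
        held = self.pending.get(code)
        if held is None or held[0] != sender:
            return None  # unknown code, or a reply from a different address
        del self.pending[code]
        self.approved.add(sender)
        return held[1]
```

<p>A first message from an unknown sender yields <code>("challenge", code)</code>; once that sender replies with the code, <code>confirm</code> releases the held message and later mail from the same address is delivered directly.</p>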
<p>Although I have not used any of these tools more than experimentally myself,<br />
I would expect whitelist/verification filters to be very nearly 100% effective<br />
in blocking spam messages. It is conceivable that spammers will start adding<br />
challenge responses to their systems, but this could be countered by making<br />
challenges slightly more sophisticated (for example, by requiring small human<br />
modification to a code). Spammers who respond, moreover, make themselves more<br />
easily traceable for people seeking legal remedies against them.</p>
<p>The problem with whitelist/verification filters is the extra burden they place<br />
on legitimate senders. Inasmuch as some correspondents may fail to respond to<br />
challenges, for any reason, this makes for a type of false positive. In<br />
the best case, a slight extra effort is required for legitimate senders. But<br />
senders who have unreliable ISPs, picky firewalls, multiple e-mail addresses,<br />
non-native understanding of English (or whatever language the challenge is written<br />
in), or who simply overlook or cannot be bothered with challenges, may not have<br />
their legitimate messages delivered. Moreover, sometimes legitimate “correspondents”<br />
are not people at all, but automated response systems with no capability of<br />
challenge response. Whitelist/verification filters are likely to require extra<br />
efforts to deal with mailing-list signups, online purchases, Web site registrations,<br />
and other “robot correspondences”.</p>
<p><strong>3. Distributed adaptive blacklists:</strong></p>
<p>Spam is almost by definition delivered to a large number of recipients. And<br />
as a matter of practice, there is little if any customization of spam messages<br />
to individual recipients. Each recipient of a spam, however, in the absence<br />
of prior filtering, must press his own “Delete” button to get rid of the message.</p>
<p>Tools such as Razor and Pyzor (see Resources) operate around servers that store<br />
digests of known spams. When a message is received by an MTA, a distributed<br />
blacklist filter is called to determine whether the message is a known spam.<br />
These tools use clever statistical techniques for creating digests, so that<br />
spams with minor or automated mutations (or just different headers resulting<br />
from transport routes) do not prevent recognition of message identity. In addition,<br />
maintainers of distributed blacklist servers frequently create “honey-pot” addresses<br />
specifically for the purpose of attracting spam (but never for any legitimate<br />
correspondences). In my testing, I found zero false positive spam categorizations<br />
by Pyzor. I would not expect any to occur using other similar tools, such as<br />
Razor.</p>
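<p>The idea of a digest that survives minor mutations can be illustrated with a deliberately crude sketch. It is not the algorithm Razor or Pyzor uses: here the body is reduced to its lowercase letters before hashing, and a local Python set stands in for the shared server. Both choices are assumptions made for the example.</p>

```python
import hashlib
import re

# Local stand-in for the digest store that a real tool keeps on shared servers.
known_spam_digests = set()


def body_digest(body: str) -> str:
    """Hash only the lowercase ASCII letters, so changes to case, digits,
    punctuation, and whitespace do not alter the digest."""
    normalized = re.sub(r"[^a-z]+", "", body.lower())
    return hashlib.sha1(normalized.encode("ascii")).hexdigest()


def report_spam(body: str) -> None:
    """Record the digest of a message that a recipient identified as spam."""
    known_spam_digests.add(body_digest(body))


def is_known_spam(body: str) -> bool:
    """Check a new message against the digests reported so far."""
    return body_digest(body) in known_spam_digests
```

<p>With this sketch, reporting “Buy NOW!!! Offer #1234” would also flag “buy now offer #9876”, while an unrelated message is left alone. Real tools use far more careful normalization so that legitimate messages do not collide.</p>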
<p>There is some common sense to this. Even those ill-intentioned enough to taint<br />
legitimate messages would not have samples of my good messages to report to<br />
the servers; it is generally only the spam messages that are widely distributed.<br />
It is conceivable that a widely sent, but legitimate message such as the developerWorks<br />
newsletter could be misreported, but the maintainers of distributed blacklist<br />
servers would almost certainly detect this and quickly correct such problems.</p>
<p>As the summary table below shows, however, false negatives are far more common<br />
using distributed blacklists than with any of the other techniques I tested.<br />
The authors of Pyzor recommend using the tool in conjunction with other techniques<br />
rather than as a single line of defense. While this seems reasonable, it is<br />
not clear that such combined filtering will actually produce many more spam<br />
identifications than the other techniques by themselves.</p>
<p>In addition, since distributed blacklists require talking to a server to perform<br />
verification, Pyzor performed far more slowly against my test corpora than did<br />
any other techniques.</p>
<p><strong>4. Rule-based rankings:</strong></p>
<p>The most popular tool for rule-based spam filtering, by a good margin, is SpamAssassin.<br />
There are other tools, but they are not as widely used or actively maintained.<br />
SpamAssassin (and similar tools) evaluates a large number of patterns, mostly<br />
regular expressions, against a candidate message. Some matched patterns add<br />
to a message score, while others subtract from it. If a message’s score exceeds<br />
a certain threshold, it is filtered as spam; otherwise it is considered legitimate.</p>
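<p>The scoring scheme can be sketched as a list of weighted regular expressions. The patterns, weights, and threshold below are invented for illustration and are not taken from SpamAssassin's actual rule set.</p>

```python
import re

# (pattern, weight): positive weights suggest spam, negative weights suggest legitimate mail.
# All rules here are hypothetical examples.
RULES = [
    (re.compile(r"herbal viagra", re.IGNORECASE), 3.0),
    (re.compile(r"100% free", re.IGNORECASE), 2.0),
    (re.compile(r"click here", re.IGNORECASE), 1.0),
    (re.compile(r"^In-Reply-To:", re.IGNORECASE | re.MULTILINE), -2.0),
]
THRESHOLD = 5.0


def score(message: str) -> float:
    """Sum the weights of every rule whose pattern occurs in the message."""
    return sum(weight for pattern, weight in RULES if pattern.search(message))


def is_spam(message: str, threshold: float = THRESHOLD) -> bool:
    """A message is filtered as spam only when its score exceeds the threshold."""
    return score(message) > threshold
```

<p>The negative weight shows how evidence of legitimate correspondence, such as a reply header, can offset a suspicious phrase in the body.</p>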
<p>Some ranking rules are fairly constant over time: forged headers and auto-executing<br />
JavaScript, for example, almost timelessly mark spam. Other rules need to be<br />
updated as the products and scams advanced by spammers evolve. Herbal Viagra<br />
and heirs of African dictators might be the rage today, but tomorrow they might<br />
be edged out by some brand new snake-oil drug or pornographic theme. As spam<br />
evolves, SpamAssassin must evolve to keep up with it.</p>
<p>The README for SpamAssassin makes some very strong claims:</p>
<p>In its most recent test, SpamAssassin differentiated between spam and non-spam<br />
mail correctly in 99.94% of cases. Since then, it’s just been getting better<br />
and better!</p>
<p>My testing showed nowhere near this level of success. Against my corpora, SpamAssassin<br />
had about 0.3% false positives and a whopping 19% false negatives. In fairness,<br />
this only evaluated the rule-based filters, not the optional checks against<br />
distributed blacklists. Additionally, my spam corpus is not purely spam; it<br />
also includes a large collection of what are probably virus attachments (I do<br />
not open them to check for sure, but I know they are not messages I authorized).</p>
<p>SpamAssassin runs much quicker than distributed blacklists, which need to query<br />
network servers. But it also runs much slower than even non-optimized versions<br />
of the statistical models below (written in interpreted Python using naive data<br />
structures).</p>
<p><strong>5. Bayesian word distribution filters:</strong></p>
<p>Paul Graham wrote a provocative essay in August 2002. In “A Plan for Spam” (see<br />
Resources later in this article), Graham suggested building Bayesian probability<br />
models of spam and non-spam words. Graham’s essay, or any general text on statistics<br />
and probability, can provide more mathematical background than I will here.</p>
<p>The general idea is that some words occur more frequently in known spam, and<br />
other words occur more frequently in legitimate messages. Using well-known mathematics,<br />
it is possible to generate a “spam-indicative probability” for each word. Another<br />
simple mathematical formula can be used to determine the overall “spam probability”<br />
of a novel message based on the collection of words it contains.</p>
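<p>As a rough illustration of those two formulas, the sketch below estimates a per-word probability from two tiny corpora and then combines the word probabilities for a new message. It is a simplified variant, not Graham's exact recipe: it omits his weighting of legitimate counts, and the clamping to the range 0.01 to 0.99, the default of 0.4 for unseen words, and the tokenizer are assumptions of this example.</p>

```python
import re
from collections import Counter
from math import prod


def tokens(text):
    """Distinct lowercase word-like tokens in a message."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


def train(spam_messages, legit_messages):
    """Map each word to its spam-indicative probability."""
    spam_counts = Counter(t for m in spam_messages for t in tokens(m))
    legit_counts = Counter(t for m in legit_messages for t in tokens(m))
    probabilities = {}
    for word in set(spam_counts) | set(legit_counts):
        spam_freq = spam_counts[word] / len(spam_messages)
        legit_freq = legit_counts[word] / len(legit_messages)
        raw = spam_freq / (spam_freq + legit_freq)
        probabilities[word] = min(0.99, max(0.01, raw))  # avoid certainties of 0 or 1
    return probabilities


def spam_probability(text, probabilities, unknown=0.4):
    """Combine the word probabilities, treating the words as independent evidence."""
    ps = [probabilities.get(t, unknown) for t in tokens(text)]
    spam_like = prod(ps)
    legit_like = prod(1 - p for p in ps)
    return spam_like / (spam_like + legit_like)
```

<p>Training on a user's own mail is what lets a filter of this kind adapt to that user's characteristic spam and legitimate messages.</p>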
<p>Graham’s idea has several noteworthy benefits:</p>
<p>1. It can generate a filter automatically from corpora of categorized messages<br />
rather than requiring human effort in rule development.</p>
<p>2. It can be customized to individual users’ characteristic spam and legitimate<br />
messages.</p>
<p>3. It can be implemented in a very small number of lines of code.</p>
<p>4. It works surprisingly well.</p>
<p>At first blush, it would be reasonable to suppose that a set of hand-tuned and<br />
laboriously developed rules like those in SpamAssassin would predict spam more<br />
accurately than a scattershot automated approach. It turns out that this supposition<br />
is dead wrong. A statistical model basically just works better than a rule-based<br />
approach. As a side benefit, a Graham-style Bayesian filter is also simpler<br />
and faster than SpamAssassin.</p>
<p>There are some issues of data structures and storage techniques that will affect<br />
operating speed of different tools. But the actual predictive accuracy depends<br />
on very few factors; the main significant factor is probably the word-lexing<br />
technique used, and this matters mostly for eliminating spurious random strings.<br />
Barham’s implementation simply looks for relatively short, disjoint sequences<br />
of characters in a small set (alphanumeric plus a few others).</p>
<p><strong>6. Bayesian trigram filters:</strong></p>
<p>Bayesian techniques built on a word model work rather well. One disadvantage<br />
of the word model is that the number of “words” in e-mail is virtually unbounded.<br />
This fact may be counterintuitive: it seems reasonable to suppose that you<br />
would reach an asymptote once almost all the English words had been included.<br />
From my prior research into full text indexing, I know that this is simply not<br />
true; the number of “word-like” character sequences possible is nearly unlimited,<br />
and new text keeps producing new sequences. This fact is particularly true of<br />
e-mails, which contain random strings in Message-IDs, content separators, UU<br />
and base64 encodings, and so on. There are various ways to throw out words from<br />
the model (the easiest is just to discard the sufficiently infrequent ones).</p>
<p>I decided to look into how well a much more starkly limited model space would<br />
work for a Bayesian spam filter. Specifically, I decided to use trigrams for<br />
my probability model rather than “words”.</p>
<p>There were several decisions I made along the way. The biggest choice was deciding<br />
what a trigram is. While this is somewhat simpler than identifying a “word”,<br />
the completely naive approach of looking at every (overlapping) sequence of<br />
three bytes is non-optimal. In particular, considering high-bit characters<br />
(although they occur relatively frequently in multi-byte character sets, in other<br />
words, CJK) forces a much bigger trigram space on us than does looking only<br />
at the ASCII range. Limiting the trigram space even further than to low-bit<br />
characters produces a smaller space, but not better overall results.</p>
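<p>One way to read that definition is sketched below: take every overlapping three-byte window and keep only those made entirely of ASCII bytes. The function name is illustrative, and this is one interpretation of the description rather than the code used for the tests reported here.</p>

```python
def trigrams(data: bytes):
    """Yield each overlapping three-byte sequence that contains only ASCII bytes."""
    for start in range(len(data) - 2):
        chunk = data[start:start + 3]
        if chunk.isascii():  # skip windows that touch a high-bit byte
            yield chunk


if __name__ == "__main__":
    print(list(trigrams(b"spam")))
```

<p>Restricting the windows to 128 possible byte values caps the model at roughly two million distinct trigrams, whereas the number of distinct “words” keeps growing with the corpus.</p>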
<p>For my trigram analysis, I utilized only the most highly differentiating trigrams<br />
as message categorizers. But I arrived at the chosen numbers of “spam” and “good”<br />
trigrams only by trial and error. I also picked the cutoff probability for spam<br />
rather arbitrarily: I made an interesting discovery that no message in the “good”<br />
corpus was assigned a spam probability above .0071 other than two false positives<br />
in the .99 range. Lowering my cutoff from an initial 0.9 to 0.1, however, allowed<br />
me to catch a few more messages in the “spam” corpus. For purposes of speed,<br />
I select no more than 100 “interesting” trigrams from each candidate message;<br />
changing that 100 to something else can produce slight variations in the<br />
results (but not in an obvious direction).</p>
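<p>The selection and cutoff steps might be sketched as follows. The sketch assumes that “interesting” means furthest from the neutral probability 0.5, as in Graham's essay, and that each trigram probability has already been clamped strictly between 0 and 1. The function names are illustrative.</p>

```python
from math import prod


def most_interesting(probabilities, limit=100):
    """Keep the `limit` probabilities that lie furthest from the neutral value 0.5."""
    return sorted(probabilities, key=lambda p: abs(p - 0.5), reverse=True)[:limit]


def is_spam(probabilities, cutoff=0.1, limit=100):
    """Combine the most interesting probabilities and compare against the cutoff.

    Assumes every probability is strictly between 0 and 1 (for example, clamped
    to the range 0.01 to 0.99), so the denominator below cannot be zero.
    """
    chosen = most_interesting(probabilities, limit)
    spam_like = prod(chosen)
    legit_like = prod(1 - p for p in chosen)
    return spam_like / (spam_like + legit_like) > cutoff
```

<p>A cutoff as low as 0.1 is workable only because the combined probabilities cluster near the two extremes, which matches the observation above about the “good” corpus.</p>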
]]></content:encoded>
			<wfw:commentRss>http://veriat.com/six-approaches-to-eliminating-unwanted-e-mail.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Words from Six Apart</title>
		<link>http://veriat.com/words-from-six-apart.html</link>
		<comments>http://veriat.com/words-from-six-apart.html#comments</comments>
		<pubDate>Thu, 08 Oct 2009 10:17:02 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Anti-Comment Spam Tactics]]></category>
		<category><![CDATA[Spam filtering techniques]]></category>
		<category><![CDATA[filtering techniques]]></category>
		<category><![CDATA[Six Apart]]></category>
		<category><![CDATA[spam]]></category>
		<category><![CDATA[Words]]></category>

		<guid isPermaLink="false">http://veriat.com/?p=369</guid>
		<description><![CDATA[Comment Spam
We’ve all seen that comment spam is becoming a serious problem. Particularly on Movable Type weblogs, where the generated pages are all very similar in structure and semantics, spammers are abusing comment systems to increase their rank on Google.
Even more frustrating than the spamming problem is the fact that there isn’t a simple solution [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Comment Spam</strong></p>
<p>We’ve all seen that comment spam is becoming a serious problem. Particularly on Movable Type weblogs, where the generated pages are all very similar in structure and semantics, spammers are abusing comment systems to increase their rank on Google.</p>
<p>Even more frustrating than the spamming problem is the fact that there isn’t a simple solution that will work for everyone and that all options have their own sets of pros and cons. During the past couple of months, we’ve been throwing around ideas at Six Apart about the best ways to combat spammers.</p>
<p><strong>Readers of your weblog must register before posting to your weblog</strong><br />
Before someone can post a comment to your weblog, they must register with your site.</p>
<p>For many webloggers, this solution is not ideal. Informal polling of webloggers has revealed that many do not want to require someone to register before posting. It usually discourages conversations from forming and is a barrier for open discussion. Additionally, without federation, logins on multiple weblogs become unmanageable.<span id="more-369"></span></p>
<p>While we do plan on integrating comment registration into Movable Type Pro (which we’ll be talking about in more detail very soon), it’s an option that serves a different purpose than just blocking spam. If you want to prevent links to explicit pornography from appearing on your site, you shouldn’t be required to turn on comment registration.</p>
<p><strong>Comments require approval before being posted</strong><br />
When a comment is posted, you can receive an email that provides a clickable link you must visit before the comment can be posted on your site.</p>
<p>For webloggers with a small number of readers, this solution may be ideal. However, if you receive a large number of comments, it’s a solution that doesn’t scale. Additionally, it may ruin the spontaneity of discussion.</p>
<p><strong>Image comprehension technology</strong><br />
Before a comment can be posted on a weblog, a human must read and enter a code that, ideally, is not readable by a computer.</p>
<p>This solution is not feasible because of accessibility issues. Additionally, spammers seem to be searching with bots and entering spam manually.</p>
<p><strong>A possible solution for everyone?</strong><br />
The problem has intensified in the past couple of weeks, but the good news is that as more people have been hit by comment spam, actual solutions are beginning to emerge.</p>
<p>Specifically, Jay Allen’s MT-BlackList is a blacklist-based solution to comment spam for Movable Type weblogs. It checks the comment fields (body, URL, author, etc) for URLs commonly found in spam comments, and rejects the comment if it looks like spam. The core plugin is set to be released today (Monday), but one of its neatest features-in-development is the ability for weblog systems to share blacklist data using XML-RPC. This provides the basis of a collaborative system similar to Razor, with the option for more management over the items in your own system’s blacklist.</p>
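<p>The kind of check described here, matching comment fields against URL patterns known from spam, can be sketched in Python as follows. This is not MT-BlackList's code: the patterns, field names, and function name are hypothetical.</p>

```python
import re

# Hypothetical blacklist entries; a shared list would be far longer.
BLACKLIST = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"cheap-pills\.example", r"casino[\w-]*\.example")
]


def is_spam_comment(comment: dict) -> bool:
    """Reject a comment if any checked field matches a blacklisted URL pattern."""
    fields = [comment.get(key, "") for key in ("author", "url", "body")]
    return any(pattern.search(field) for field in fields for pattern in BLACKLIST)
```

<p>Because the blacklist is plain data, it is the sort of thing that weblog systems could exchange with one another, which is what makes the shared approach attractive.</p>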
<p>We’re deeply committed to finding a way to combat spammers and we’re determined to do it on a core system level so that everyone can take advantage of spam prevention. We’re working on integrating comment spam blocking for MT and TypePad, and the great thing about Jay’s solution is that it could be the start of a distributed spam blocking network for comments, an implementation of which could be included in multiple tools. But, like email, there isn’t one simple solution that can be switched on and end spam completely. Hopefully we’re moving a step closer.</p>
]]></content:encoded>
			<wfw:commentRss>http://veriat.com/words-from-six-apart.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Spam filtering techniques</title>
		<link>http://veriat.com/spam-filtering-techniques.html</link>
		<comments>http://veriat.com/spam-filtering-techniques.html#comments</comments>
		<pubDate>Tue, 06 Oct 2009 08:16:11 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Spam filtering techniques]]></category>
		<category><![CDATA[filtering]]></category>
		<category><![CDATA[spam]]></category>
		<category><![CDATA[techniques]]></category>

		<guid isPermaLink="false">http://veriat.com/?p=358</guid>
		<description><![CDATA[The problem of unsolicited e-mail has been increasing for years, but help has arrived. In this article, David discusses and compares several broad approaches to the automatic elimination of unwanted e-mail while introducing and testing some popular tools that follow these approaches.
[...]]]></description>
			<content:encoded><![CDATA[<p>The problem of unsolicited e-mail has been increasing for years, but help has arrived. In this article, David discusses and compares several broad approaches to the automatic elimination of unwanted e-mail while introducing and testing some popular tools that follow these approaches.</p>
<p>Unethical e-mail senders bear little or no cost for mass distribution of messages, yet normal e-mail users are forced to spend time and effort purging fraudulent and otherwise unwanted mail from their mailboxes. In this article, I describe ways that computer code can help eliminate unsolicited commercial e-mail, viruses, trojans, and worms, as well as frauds perpetrated electronically and other undesired and troublesome e-mail. In some sense, the final and best solution for eliminating spam will probably take place on a legal level. In the meantime, however, you can do some things from a code perspective that can serve as an interim solution to the problem, until (if ever) the laws begin to evolve at the same rate as public frustration.<span id="more-358"></span></p>
<p>Considering matters technically, but also with common sense, what is generally called “spam” is somewhat broader than the category “unsolicited commercial e-mail”; spam encompasses all the e-mail that we do not want and that is only very loosely directed at us. Such messages are not always commercial per se, and some push the limits of what it means to be solicited. For example, we do not want to get viruses (even from our unwary friends); nor do we generally want chain letters, even if they don’t ask for money; nor proselytizing messages from strangers; nor outright attempts to defraud us. In any case, it is usually unambiguous whether a message is spam, and many, many people get the same such e-mails.</p>
<p>The problem with spam is that it tends to swamp desirable e-mail; I now receive many more spams than I do legitimate correspondences. On average, I probably get 10 spams for every appropriate e-mail. In some ways I am unusual: as a public writer, I maintain a widely published e-mail address; moreover, I both welcome and receive frequent correspondence from strangers related to my published writing and to my software libraries. Unfortunately, a letter from a stranger (with who-knows-which e-mail application, OS, native natural language, and so on) is not immediately obvious in its purpose, and spammers try to slip their messages underneath such ambiguities. My seconds are valuable to me, especially when they are claimed many times during every hour of a day.</p>
<p><strong>Hiding contact information:</strong></p>
<p>For some e-mail users, a reasonable, sufficient, and very simple approach to avoiding spam is simply to guard e-mail addresses closely. For these people, an e-mail address is something to be revealed only to selected, trusted parties. As extra precautions, an e-mail address can be chosen to avoid easily guessed names and dictionary words, and addresses can be disguised when posting to public areas. We have all seen e-mail addresses cutely encoded in forms like “echo zregm@tabfvf.pk | tr A-Za-z N-ZA-Mn-za-m”.</p>
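<p>The shell pipeline quoted above is a ROT13 rotation. A minimal Python sketch of the same disguise follows; the helper name <code>rot13</code> and the sample address are assumptions made for illustration. Because ROT13 is its own inverse, one function both disguises and recovers an address.</p>

```python
import codecs


def rot13(text: str) -> str:
    """Rotate ASCII letters by 13 places; digits, '@' and '.' are unchanged."""
    return codecs.decode(text, "rot13")


# Hypothetical address, used only to show the round trip.
plain = "user@example.org"
disguised = rot13(plain)
print(disguised)         # what would be posted publicly
print(rot13(disguised))  # what a human reader recovers
```

<p>This only deters the simplest harvesters; any program that knows the trick can undo it as easily as a person can.</p>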
<p>In addition to hiding addresses, a secretive e-mailer often uses one or more of the free e-mail services for “throwaway” addresses. If you need to transact e-mail with a semi-trusted party, a temporary address can be used for a few days, then abandoned along with any spam it might thereafter accumulate. The<br />
real “confidantes only” address is kept protected.</p>
<p>In my informal survey of discussions of spam on Web boards, mailing lists, Usenet, and so on, I’ve found that a category of e-mail users gains sufficient protection from these basic precautions.</p>
<p><strong>Looking at filtering software:</strong></p>
<p>This article looks at filtering software from a particular perspective. I want to know how well different approaches work in correctly identifying spam as spam and desirable messages as legitimate. For purposes of answering this question, I am not particularly interested in the details of configuring filter applications<br />
to work with various Mail Transfer Agents (MTAs). There is certainly a great deal of arcana surrounding the best configuration of MTAs such as Sendmail, QMail, Procmail, Fetchmail, and others. Further, many e-mail clients have their own filtering options and plug-in APIs. Fortunately, most of the filters I look at come with pretty good documentation covering how to configure them with various<br />
MTAs.</p>
<p>For purposes of my testing, I developed two collections of messages: spam and legitimate. Both collections were taken from mail I actually received in the last couple of months, but I added a significant subset of messages up to several years old to broaden the test. I cannot know exactly what will be contained<br />
in next month’s e-mails, but the past provides the best clue to what the future holds. That sounds cryptic, but all I mean is that I do not want to limit the patterns to a few words, phrases, regular expressions, etc. that might characterize the very latest e-mails but fail to generalize to the two types.</p>
<p>A general comment on testing is worth emphasizing. False negatives in spam filters just mean that some unwanted messages make it to your inbox. Not a good thing, but not horrible in itself. False positives are cases where legitimate messages are misidentified as spam. This can potentially be very bad, as some legitimate messages are important, even urgent, in nature, and even those that are merely conversational are ones we do not want to lose. Most filtering software allows you to save rejected messages in temporary folders pending review, but if you need to review a folder full of spam, the usefulness of the software is thereby reduced.</p>
<p><strong>1. Basic structured text filters</strong></p>
<p>The e-mail client I use has the capability to sort incoming e-mail based on simple strings found in specific header fields, the header in general, and/or in the body. Its capability is very simple and does not even include regular expression matching. Almost all e-mail clients have this much filtering capability.</p>
<p>Over the last few months, I have developed a fairly small number of text filters. These few simple filters correctly catch about 80% of the spam I receive. Unfortunately, they also have a relatively high false positive rate, enough that I need to manually examine some of the spam folders from time to time. (I sort probable spam into several different folders, and I save them all to develop message corpora.) Although exact details will differ among users, a general pattern will be useful to most readers:</p>
<p>• Set 1: A few people or mailing lists do funny things<br />
with their headers that get them flagged on other rules. I catch something in<br />
the header (usually the From:) and whitelist it (either to INBOX or somewhere<br />
else).</p>
<p>• Set 2: In no particular order, I run the following<br />
spam filters:</p>
<p>o Identify a specific bad sender.</p>
<p>o Look for “&lt;&gt;” as the From: header.</p>
<p>o Look for “@&lt;” in the header (a lot of spam has this for some reason).</p>
<p>o Look for “Content-Type: audio”. Nothing I want has this, only viruses (your<br />
mileage may vary).</p>
<p>o Look for “euc-kr” and “ks_c_5601-1987” in the headers. I can’t read that language,<br />
but for some reason I get a huge volume of Korean spam (of course, for an actual<br />
Korean reader, this isn’t a good rule).</p>
<p>• Set 3: Store messages to known legitimate addresses.<br />
I have several such rules, but they all just match a literal To: field.</p>
<p>• Set 4: Look for messages that have a legit address<br />
in the header, but that weren’t caught by the previous To: filters. I find that<br />
when I am only in the Bcc: field, it’s almost always an unsolicited mailing<br />
to a list of alphabetically sequential addresses (mertz1@…, mertz37@…, etc).</p>
<p>• Set 5: Anything left at this point is probably spam<br />
(it probably has forged headers to avoid identification of the sender).</p>
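<p>The five rule sets above can be sketched as an ordered cascade of plain substring tests. This is only an illustration under stated assumptions: the addresses, the folder names (“inbox”, “spam”, “review”) and the function name <code>classify</code> are hypothetical, and routing Set 4 matches to a review folder is a guess rather than something the list specifies.</p>

```python
from email import message_from_string

# Hypothetical configuration values, for illustration only.
WHITELIST_FROM = {"friend@example.org"}
BAD_SENDERS = {"spammer@example.com"}
MY_ADDRESSES = {"me@example.net"}


def classify(raw: str) -> str:
    """Return a folder name for a raw message, applying the rule sets in order."""
    msg = message_from_string(raw)
    sender = (msg.get("From") or "").strip().lower()
    header_text = raw.split("\n\n", 1)[0].lower()

    # Set 1: whitelist known senders before any spam rule can fire.
    if any(addr in sender for addr in WHITELIST_FROM):
        return "inbox"

    # Set 2: simple substring rules, in no particular order.
    if any(addr in sender for addr in BAD_SENDERS):
        return "spam"
    if sender == "<>" or "@<" in header_text:
        return "spam"
    if "content-type: audio" in raw.lower():
        return "spam"
    if "euc-kr" in header_text or "ks_c_5601-1987" in header_text:
        return "spam"

    # Set 3: mail addressed to a known legitimate address in To:.
    to_field = (msg.get("To") or "").lower()
    if any(addr in to_field for addr in MY_ADDRESSES):
        return "inbox"

    # Set 4: my address appears in the header, but not in To:.
    # Sending these to a review folder is an assumption of this sketch.
    if any(addr in header_text for addr in MY_ADDRESSES):
        return "review"

    # Set 5: anything left is treated as probable spam.
    return "spam"
```

<p>The ordering carries most of the logic: the whitelist runs first so that odd but wanted senders are never touched by the later rules, and the catch-all comes last.</p>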
<p><strong>2. Whitelist/verification filters:</strong></p>
<p>A fairly aggressive technique for spam filtering is what I would call the “whitelist<br />
plus automated verification” approach. There are several tools that implement<br />
a whitelist with verification: TMDA is a popular multi-platform open source<br />
tool; ChoiceMail is a commercial tool for Windows; most others seem more preliminary.<br />
(See Resources later in this article for links.)</p>
<p>A whitelist filter connects to an MTA and passes mail only from explicitly approved<br />
senders on to the inbox. Other messages generate a special challenge response<br />
to the sender. The whitelist filter’s response contains some kind of unique<br />
code that identifies the original message, such as a hash or sequential ID.<br />
This challenge message contains instructions for the sender to reply in order<br />
to be added to the whitelist (the response message must contain the code generated<br />
by the whitelist filter). When a legitimate sender answers a challenge, her/his<br />
address is added to the whitelist so that any future messages from the same<br />
address are passed through automatically.</p>
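<p>The challenge and response cycle described above can be sketched in a few lines. This is a toy model, not how any of the named tools is implemented: the class name <code>WhitelistFilter</code>, its methods, and the choice of a truncated SHA-256 hash as the challenge code are all assumptions made for illustration, and real mail transport is left out entirely.</p>

```python
import hashlib


class WhitelistFilter:
    """Toy whitelist filter with challenge codes (illustrative only)."""

    def __init__(self, approved):
        self.approved = {addr.lower() for addr in approved}
        self.pending = {}  # challenge code -> (sender, held message)

    def receive(self, sender, message):
        """Deliver mail from approved senders; hold and challenge the rest."""
        sender = sender.lower()
        if sender in self.approved:
            return ("deliver", message)
        # The code identifies the original message; here it is a hash of
        # the sender and the message text.
        digest = hashlib.sha256((sender + "\n" + message).encode("utf-8"))
        code = digest.hexdigest()[:16]
        self.pending[code] = (sender, message)
        return ("challenge", code)

    def confirm(self, sender, code):
        """Handle a challenge reply: whitelist the sender, release the message."""
        entry = self.pending.get(code)
        if entry is None or entry[0] != sender.lower():
            return None
        del self.pending[code]
        self.approved.add(entry[0])
        return entry[1]
```

<p>A sender who answers once is added to <code>approved</code>, so later calls to <code>receive</code> from the same address return “deliver” without a new challenge.</p>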
<p>Although I have not used any of these tools more than experimentally myself,<br />
I would expect whitelist/verification filters to be very nearly 100% effective<br />
in blocking spam messages. It is conceivable that spammers will start adding<br />
challenge responses to their systems, but this could be countered by making<br />
challenges slightly more sophisticated (for example, by requiring small human<br />
modification to a code). Spammers who respond, moreover, make themselves more<br />
easily traceable for people seeking legal remedies against them.</p>
<p>The problem with whitelist/verification filters is the extra burden they place<br />
on legitimate senders. Inasmuch as some correspondents may fail to respond to<br />
challenges (for any reason), this makes for a type of false positive. In<br />
the best case, a slight extra effort is required for legitimate senders. But<br />
senders who have unreliable ISPs, picky firewalls, multiple e-mail addresses,<br />
non-native understanding of English (or whatever language the challenge is written<br />
in), or who simply overlook or cannot be bothered with challenges, may not have<br />
their legitimate messages delivered. Moreover, sometimes legitimate “correspondents”<br />
are not people at all, but automated response systems with no capability of<br />
challenge response. Whitelist/verification filters are likely to require extra<br />
efforts to deal with mailing-list signups, online purchases, Web site registrations,<br />
and other “robot correspondences”.</p>
<p><strong>3. Distributed adaptive blacklists:</strong></p>
<p>Spam is almost by definition delivered to a large number of recipients. And<br />
as a matter of practice, there is little if any customization of spam messages<br />
to individual recipients. Each recipient of a spam, however, in the absence<br />
of prior filtering, must press his own “Delete” button to get rid of the message.</p>
<p>Tools such as Razor and Pyzor (see Resources) operate around servers that store<br />
digests of known spams. When a message is received by an MTA, a distributed<br />
blacklist filter is called to determine whether the message is a known spam.<br />
These tools use clever statistical techniques for creating digests, so that<br />
spams with minor or automated mutations (or just different headers resulting<br />
from transport routes) do not prevent recognition of message identity. In addition,<br />
maintainers of distributed blacklist servers frequently create “honey-pot” addresses<br />
specifically for the purpose of attracting spam (but never for any legitimate<br />
correspondences). In my testing, I found zero false positive spam categorizations<br />
by Pyzor. I would not expect any to occur using other similar tools, such as<br />
Razor.</p>
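<p>To make the digest idea concrete, here is a deliberately crude sketch. It is not the algorithm used by Razor or Pyzor, whose digests are more sophisticated; the normalization rule, the use of SHA-1, and the names <code>fuzzy_digest</code> and <code>DigestServer</code> are assumptions chosen only to show how small mutations can map to the same digest.</p>

```python
import hashlib
import re


def fuzzy_digest(body: str) -> str:
    """Digest of a message body that ignores case, digits, and punctuation."""
    text = re.sub(r"[^a-z]+", " ", body.lower()).strip()
    return hashlib.sha1(text.encode("ascii")).hexdigest()


class DigestServer:
    """Stands in for a remote server; a real tool would query it over the network."""

    def __init__(self):
        self.known = set()

    def report(self, body: str) -> None:
        """Record the digest of a message that someone reported as spam."""
        self.known.add(fuzzy_digest(body))

    def check(self, body: str) -> bool:
        """Return True if a message with the same digest was reported before."""
        return fuzzy_digest(body) in self.known
```

<p>Two copies of a spam that differ only in a tracking number or in spacing produce the same digest here, while an unrelated message does not. A single changed word defeats this sketch, which is one reason real tools use stronger techniques.</p>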
<p>There is some common sense to this. Even those ill-intentioned enough to taint<br />
legitimate messages would not have samples of my good messages to report to<br />
the servers; it is generally only the spam messages that are widely distributed.<br />
It is conceivable that a widely sent, but legitimate message such as the developerWorks<br />
newsletter could be misreported, but the maintainers of distributed blacklist<br />
servers would almost certainly detect this and quickly correct such problems.</p>
<p>As the summary table below shows, however, false negatives are far more common<br />
using distributed blacklists than with any of the other techniques I tested.<br />
The authors of Pyzor recommend using the tool in conjunction with other techniques<br />
rather than as a single line of defense. While this seems reasonable, it is<br />
not clear that such combined filtering will actually produce many more spam<br />
identifications than the other techniques by themselves.</p>
<p>In addition, since distributed blacklists require talking to a server to perform<br />
verification, Pyzor performed far more slowly against my test corpora than did<br />
any other techniques.</p>
<p><strong>4. Rule-based rankings:</strong></p>
<p>The most popular tool for rule-based spam filtering, by a good margin, is SpamAssassin.<br />
There are other tools, but they are not as widely used or actively maintained.<br />
SpamAssassin (and similar tools) evaluate a large number of patterns (mostly<br />
regular expressions) against a candidate message. Some matched patterns add<br />
to a message score, while others subtract from it. If a message’s score exceeds<br />
a certain threshold, it is filtered as spam; otherwise it is considered legitimate.</p>
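<p>The scoring scheme just described can be sketched as follows. The patterns, weights, and threshold are invented for illustration and are not taken from SpamAssassin’s rule set; only the mechanism (sum the weights of matching patterns, then compare with a threshold) follows the description above.</p>

```python
import re

# Hypothetical rules: (pattern, score). Positive scores suggest spam,
# negative scores suggest legitimate mail.
RULES = [
    (re.compile(r"herbal viagra", re.I), 2.5),
    (re.compile(r"\bclick here\b", re.I), 1.0),
    (re.compile(r"^Subject:.*!!!", re.I | re.M), 1.5),
    (re.compile(r"^In-Reply-To:", re.I | re.M), -1.0),
]
THRESHOLD = 3.0  # assumed cutoff


def score(message: str) -> float:
    """Sum the scores of every rule whose pattern matches the message."""
    return sum(weight for pattern, weight in RULES if pattern.search(message))


def is_spam(message: str) -> bool:
    """A message is filtered as spam when its score exceeds the threshold."""
    return score(message) > THRESHOLD
```

<p>Because negative rules subtract from the total, a reply in an existing thread can contain a suspicious phrase and still stay below the cutoff.</p>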
<p>Some ranking rules are fairly constant over time: forged headers and auto-executing<br />
JavaScript, for example, almost timelessly mark spam. Other rules need to be<br />
updated as the products and scams advanced by spammers evolve. Herbal Viagra<br />
and heirs of African dictators might be the rage today, but tomorrow they might<br />
be edged out by some brand new snake-oil drug or pornographic theme. As spam<br />
evolves, SpamAssassin must evolve to keep up with it.</p>
<p>The README for SpamAssassin makes some very strong claims:</p>
<p>In its most recent test, SpamAssassin differentiated between spam and non-spam<br />
mail correctly in 99.94% of cases. Since then, it’s just been getting better<br />
and better!</p>
<p>My testing showed nowhere near this level of success. Against my corpora, SpamAssassin<br />
had about 0.3% false positives and a whopping 19% false negatives. In fairness,<br />
this only evaluated the rule-based filters, not the optional checks against<br />
distributed blacklists. Additionally, my spam corpus is not purely spam; it<br />
also includes a large collection of what are probably virus attachments (I do<br />
not open them to check for sure, but I know they are not messages I authorized).</p>
<p>SpamAssassin runs much quicker than distributed blacklists, which need to query<br />
network servers. But it also runs much slower than even non-optimized versions<br />
of the below statistical models (written in interpreted Python using naive data<br />
structures).</p>
<p><strong>5. Bayesian word distribution filters:</strong></p>
<p>Paul Graham wrote a provocative essay in August 2002. In “A Plan for Spam” (see<br />
Resources later in this article), Graham suggested building Bayesian probability<br />
models of spam and non-spam words. Graham’s essay, or any general text on statistics<br />
and probability, can provide more mathematical background than I will here.</p>
<p>The general idea is that some words occur more frequently in known spam, and<br />
other words occur more frequently in legitimate messages. Using well-known mathematics,<br />
it is possible to generate a “spam-indicative probability” for each word. Another<br />
simple mathematical formula can be used to determine the overall “spam probability”<br />
of a novel message based on the collection of words it contains.</p>
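<p>A simplified sketch of the two formulas follows. It is in the spirit of Graham’s essay but omits his refinements (such as weighting legitimate mail more heavily and using only the most telling words). The tokenizer, the clamping of probabilities to the range 0.01 to 0.99, and the default of 0.4 for unseen words are assumptions of this sketch.</p>

```python
import re
from collections import Counter


def tokens(text: str) -> set:
    """Crude word lexer: distinct runs of letters, digits, and a few symbols."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


def train(spam_msgs, good_msgs) -> dict:
    """Estimate a spam-indicative probability for every word in the corpora."""
    spam_counts, good_counts = Counter(), Counter()
    for msg in spam_msgs:
        spam_counts.update(tokens(msg))
    for msg in good_msgs:
        good_counts.update(tokens(msg))
    probs = {}
    for word in set(spam_counts) | set(good_counts):
        s = spam_counts[word] / len(spam_msgs)  # share of spams containing word
        g = good_counts[word] / len(good_msgs)  # share of good mail containing word
        probs[word] = min(0.99, max(0.01, s / (s + g)))
    return probs


def spam_probability(text: str, probs: dict, unknown: float = 0.4) -> float:
    """Combine per-word probabilities as prod(p) / (prod(p) + prod(1 - p))."""
    p_spam = p_good = 1.0
    for word in tokens(text):
        p = probs.get(word, unknown)
        p_spam *= p
        p_good *= 1.0 - p
    return p_spam / (p_spam + p_good)
```

<p>For long messages the running products can underflow; a practical version would combine only a limited number of the most extreme words or work with logarithms.</p>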
<p>Graham’s idea has several noteworthy benefits:</p>
<p>1. It can generate a filter automatically from corpora of categorized messages<br />
rather than requiring human effort in rule development.</p>
<p>2. It can be customized to individual users’ characteristic spam and legitimate<br />
messages.</p>
<p>3. It can be implemented in a very small number of lines of code.</p>
<p>4. It works surprisingly well.</p>
<p>At first blush, it would be reasonable to suppose that a set of hand-tuned and<br />
laboriously developed rules like those in SpamAssassin would predict spam more<br />
accurately than a scattershot automated approach. It turns out that this supposition<br />
is dead wrong. A statistical model basically just works better than a rule-based<br />
approach. As a side benefit, a Graham-style Bayesian filter is also simpler<br />
and faster than SpamAssassin.</p>
<p>There are some issues of data structures and storage techniques that will affect<br />
the operating speed of different tools. But the actual predictive accuracy depends<br />
on very few factors; the main significant factor is probably the word-lexing<br />
technique used, and this matters mostly for eliminating spurious random strings.<br />
Barham’s implementation simply looks for relatively short, disjoint sequences<br />
of characters in a small set (alphanumeric plus a few others).</p>
<p><strong>6. Bayesian trigram filters:</strong></p>
<p>Bayesian techniques built on a word model work rather well. One disadvantage<br />
of the word model is that the number of “words” in e-mail is virtually unbounded.<br />
This fact may be counterintuitive; it seems reasonable to suppose that you<br />
would reach an asymptote once almost all the English words had been included.<br />
From my prior research into full text indexing, I know that this is simply not<br />
true; the number of “word-like” character sequences possible is nearly unlimited,<br />
and new text keeps producing new sequences. This fact is particularly true of<br />
e-mails, which contain random strings in Message-IDs, content separators, UU<br />
and base64 encodings, and so on. There are various ways to throw out words from<br />
the model (the easiest is just to discard the sufficiently infrequent ones).</p>
<p>I decided to look into how well a much more starkly limited model space would<br />
work for a Bayesian spam filter. Specifically, I decided to use trigrams for<br />
my probability model rather than “words”.</p>
<p>There were several decisions I made along the way. The biggest choice was deciding<br />
what a trigram is. While this is somewhat simpler than identifying a “word”,<br />
the completely naive approach of looking at every (overlapping) sequence of<br />
three bytes is non-optimal. In particular, considering high-bit characters<br />
(although they occur relatively frequently in multi-byte character sets, in other<br />
words, CJK) forces a much bigger trigram space on us than does looking only<br />
at the ASCII range. Limiting the trigram space even further than to low-bit<br />
characters produces a smaller space, but not better overall results.</p>
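<p>One way to read the trigram definition above is sketched below: take every overlapping three-byte window and keep it only if all three bytes are in the 7-bit ASCII range. The function names are hypothetical, and the filter actually used for the tests may differ in detail.</p>

```python
from collections import Counter


def trigrams(data: bytes):
    """Yield overlapping 3-byte sequences made only of 7-bit (ASCII) bytes."""
    for i in range(len(data) - 2):
        tri = data[i:i + 3]
        if all(b < 0x80 for b in tri):
            yield tri


def trigram_counts(messages) -> Counter:
    """Count, for each trigram, the number of messages that contain it."""
    counts = Counter()
    for raw in messages:
        counts.update(set(trigrams(raw)))
    return counts
```

<p>Counts taken separately over a spam corpus and a good corpus can then be turned into per-trigram probabilities in the same way as for words.</p>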
<p>For my trigram analysis, I utilized only the most highly differentiating trigrams<br />
as message categorizers. But I arrived at the chosen numbers of “spam” and “good”<br />
trigrams only by trial and error. I also picked the cutoff probability for spam<br />
rather arbitrarily: I made an interesting discovery that no message in the “good”<br />
corpus was assigned a spam probability above .0071 other than two false positives<br />
in the .99 range. Lowering my cutoff from an initial 0.9 to 0.1, however, allowed<br />
me to catch a few more messages in the “spam” corpus. For purposes of speed,<br />
I select no more than 100 “interesting” trigrams from each candidate message;<br />
changing that 100 to something else can produce slight variations in the<br />
results (but not in an obvious direction).</p>
<p>Posted by: David at 09:08 AM</p>
<p><strong>Spam: The Plague of the Internet</strong></p>
<p>“Spam” is the term for unsolicited commercial bulk email. What started out as a very small nuisance on the internet has grown into something that plagues people every time they check their inbox. Spam fills up inboxes with unsolicited mail for services that no one could ever want. It costs time and money to those who receive it, and the cost to email servers and internet service providers (ISPs) is then passed on to consumers in the form of higher bills. Measures have been taken to put an end to this scourge and to prevent future spammers from arising.</p>
<p>Spam is an ineffective way to advertise, as it provides people with information about products and services that are likely of use to no one. Many spammers are unaware of what they are doing and have probably been duped into some kind of “Get-Rich-Quick” scheme. An article about the commercial uses of spam (from Alchemy Mindworks) explains why people spam and how they might be unaware of the damage they are causing:</p>
<p>The most prevalent sort of junk e-mail is commercial advertising. Judging by the content of most of these messages, their perpetrators have all just signed up with an Internet access provider, and were given complimentary copies of one of the many “How to Make Lots of Money on the Internet” books. Some of them are genuinely inconsiderate of the rights of other users of the net; the bulk of them, however, are merely confused, deluded and ignorant.</p>
<p>These unsolicited emails take time out of someone’s busy day to either read or get rid of; they become a constant hassle in your everyday routine. One way that people get on spam lists in the first place is when spammers get hold of the mailing list for an organization or forum:</p>
<p>One particularly nasty variant of email spam is sending spam to mailing lists (public or private email discussion forums.) Because many mailing lists limit activity to their subscribers, spammers will use automated tools to subscribe to as many mailing lists as possible, so that they can grab the lists of addresses, or use the mailing list as a direct target for their attacks.</p>
<p>Spam can come from anywhere, and internet users should be very careful about who they give their email address to.</p>
<p>Ways to get rid of spam keep multiplying and are slowing down the amount of junk mail that email users receive. For those who use programs to read their email, there are solutions such as SpamCop, a highly rated program, and the slightly overzealous Spam Hater; both programs not only stop spam but also report complaints to the senders of the spam. Online email services such as Hotmail have also taken measures to prevent spam. As its help file explains:<br />
In MSN Hotmail, you have several ways to protect yourself from junk mail: On the page, set up the Junk Mail Filter to redirect spam to your Junk Mail folder.<br />
You can set the Junk Mail Filter to High or Exclusive, and then create a Safe List of addresses that should always send messages to your Inbox… If an offending message still gets into your Inbox, click the check box to the left of the message, and then click Block to stop future messages from that sender from entering your Inbox.</p>
<p>Other measures should also be taken, such as checking a website or forum for a privacy policy before giving away your email address. When signing up for a service, be sure not to check anything that says you would like to receive “special offers” or “important information”; these can be a red flag that the service is linked to potential spammers.</p>
<p>Spam is a nuisance and will hopefully be eliminated, or at least slowed down, over time. More people learn how to stop or prevent it every day, and hopefully one day it will no longer exist. Until then, everyone must take measures to protect their inbox from being crammed full of emails about something no one could ever want.</p>
]]></content:encoded>
			<wfw:commentRss>http://veriat.com/spam-filtering-techniques.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Introduction to spam fighting</title>
		<link>http://veriat.com/introduction-to-spam-fighting.html</link>
		<comments>http://veriat.com/introduction-to-spam-fighting.html#comments</comments>
		<pubDate>Sat, 03 Oct 2009 07:55:02 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Spam Facts]]></category>
		<category><![CDATA[Spam filtering techniques]]></category>
		<category><![CDATA[spam]]></category>
		<category><![CDATA[spam fighting]]></category>

		<guid isPermaLink="false">http://veriat.com/?p=348</guid>
		<description><![CDATA[As you may know there is a lot of discussion going on out there regarding blog comment spam. In my opinion comment spam can be defined as a comment posted to a blog wich is not related with the content of your post. It will include a link in the comments field or in the [...]]]></description>
			<content:encoded><![CDATA[<p>As you may know, there is a lot of discussion going on regarding blog comment spam. In my opinion, comment spam can be defined as a comment posted to a blog which is not related to the content of your post. It will include a link to a commercial website in the comment field or in the author’s name. The problem is becoming serious as spammers are developing bots that can make dozens of posts in an hour. We must stop this now or we’ll be the third generation of spam victims, after mailboxes and guestbooks.</p>
<p>Blog comment spam can’t yet be compared with its big brother, e-mail spam, but there is enough of it to be considered a danger. MT doesn’t help much with cleaning spam posts from your articles, so if you don’t want to spend an hour each day cleaning your blog, I recommend that you take action right now. Spammers use bots to kill your blog, but you don’t have a cleaning bot; remember this! We must hit back as soon as possible, before this becomes a major problem. Some blacklists are ready to use, and other methods are a good starting point. I have collected here some methods and solutions that blog owners are developing. I’ll add a brief description of each method and a link to the author’s website, where you can find more info. I don’t want to infringe on copyrighted material, so you must get the original content from the author’s website.<span id="more-348"></span></p>
<p>Spam prevention is easy; some quick solutions can save your Movable Type blog from the spam plague. All the solutions I have posted here are specific to MT; I’ll try to add material for other blog types, sorry! With one of these methods you can avoid most blog comment spam, and it is very simple in the case of bots. Let’s go!</p>
<p>Email spamming is much easier than blog spamming; this is our first advantage. If a spammer wants to get an email into your inbox, he only needs your email address, but if a spammer wants to get inside your blog, he needs some extra effort: visit your blog, find the comment script page, and then submit a post. As spammers use bots specifically designed for base MT installations, we must change this structure as much as possible in order to increase the difficulty of posting.</p>
<p>I recommend that you start with the easiest ones, and if they don’t keep the spammers away, then try adding the advanced solutions. The reward is worth the effort, so take action.</p>
]]></content:encoded>
			<wfw:commentRss>http://veriat.com/introduction-to-spam-fighting.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The fight against spam</title>
		<link>http://veriat.com/the-fight-against-spam.html</link>
		<comments>http://veriat.com/the-fight-against-spam.html#comments</comments>
		<pubDate>Thu, 01 Oct 2009 07:33:23 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Spam filtering techniques]]></category>
		<category><![CDATA[Spam wars]]></category>
		<category><![CDATA[against spam]]></category>
		<category><![CDATA[fightm]]></category>

		<guid isPermaLink="false">http://veriat.com/?p=339</guid>
		<description><![CDATA[Ideology 
It is clear that spam brings economic benefits to its customers. This means that users, despite the dislike of spam, does enjoy the services advertised through spam. Until the impact of spam exceeds the cost of overcoming protection, spam will not disappear. Thus, the surest way to fight a denial of service advertised through [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Ideology </strong></p>
<p>It is clear that spam brings economic benefits to those who commission it. This means that users, despite their dislike of spam, do use the services advertised through it. As long as the return from spam exceeds the cost of overcoming protection, spam will not disappear. Thus, the surest way to fight spam is to refuse the services advertised through it. There are proposals to apply public condemnation, up to and including breaking off communication, to those who buy goods and services advertised through spam.</p>
<p>Other methods are aimed at inhibiting the spammers access to users.</p>
<p><strong>Preventive protective measures </strong></p>
<p>The surest way to fight spam ? do not let spammers get e-mail address. This is a difficult task, but some precautions can be taken.<br />
Do not publish your email address on public websites.<br />
If for some reason the email account to publish, it can be coded like ?u_s_e_r_ (a) _d_o_m_a_i_n_._n_e_t?. Spammers use special programs to scan websites and collect email addresses, so even a masking addresses can help. It should be remembered, however, that in the simplest cases ?encoded? will be able to recognize and address of the program. In addition, it is an inconvenience not only for spammers, but also for ordinary users.<span id="more-339"></span></p>
<p>Many services do not reveal the addresses of registered users and instead let you send a message to a user’s nickname. The service substitutes the real address from the user’s profile, and it is not visible to other users.<br />
The address can be presented as a picture. There are online services that do this automatically; you can also do it in a graphics editor, or simply write the address by hand and photograph it.<br />
On web pages, e-mail addresses can be protected with JavaScript, which address-harvesting software does not recognize.</p>
<p>Do not register on websites unnecessarily or without firm guarantees. You can set up a special mailbox for such cases and not use it for regular work. There are even services that issue disposable addresses specifically for use in doubtful cases. The best known of them is mailinator.com.<br />
Never reply to spam or follow the links in it. Doing so confirms that the e-mail address is actively used and will increase the amount of spam.</p>
<p>Images that are downloaded when a message is read can be used to check whether an address is active. It is therefore recommended to set the mail client to ask for permission before loading images when you are unsure of the sender.<br />
When choosing an e-mail address, pick, if possible, a name that is long and hard to guess. There are only roughly 12 million names consisting of no more than 5 Latin letters. Even if you add digits and the underscore, the number of names is roughly 70 million. A spammer can send mail to all such names and weed out those that come back with “recipient does not exist”. It is therefore desirable that the name be no shorter than 6 characters and, if it contains no digits, no shorter than 7 characters. It is also desirable that the name not be a word in any language, including common personal names and Russian words written in Latin letters. Otherwise the address can be guessed by trying words and combinations from a dictionary.</p>
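<p>The size of this search space is easy to check. The following is only an illustrative sketch in Python; the function name <code>count_names</code> and the constants are made up for this example. It counts every name of length 1 up to a given maximum, first over the 26 Latin letters and then over 37 symbols (letters, ten digits and the underscore).</p>

```python
def count_names(alphabet_size, max_length):
    """Count all names of length 1..max_length over the given alphabet."""
    return sum(alphabet_size ** length for length in range(1, max_length + 1))


LETTERS = 26                 # Latin letters only
WITH_DIGITS = 26 + 10 + 1    # letters, digits and the underscore

for max_length in (5, 6, 7):
    print(max_length,
          count_names(LETTERS, max_length),
          count_names(WITH_DIGITS, max_length))
```

<p>For names of up to five characters this gives 12,356,630 letter-only names and 71,270,177 names with digits and the underscore. Each extra character multiplies the work of a guessing spammer by about 26 or 37.</p>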
<p>You can change your address from time to time, but this brings an obvious difficulty: you need to give the new address to the people whose mail you want to receive.<br />
Companies often do not publish their address and use a CGI form to communicate with users instead.</p>
<p>All methods of hiding the address share a fundamental flaw: they inconvenience not only would-be spammers but also legitimate correspondents. Besides, it is often simply necessary to publish the address, for example if it is a contact address.</p>
]]></content:encoded>
			<wfw:commentRss>http://veriat.com/the-fight-against-spam.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Make a list on spam risk</title>
		<link>http://veriat.com/make-a-list-on-spam-risk.html</link>
		<comments>http://veriat.com/make-a-list-on-spam-risk.html#comments</comments>
		<pubDate>Thu, 01 Oct 2009 07:28:57 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Spam Facts]]></category>
		<category><![CDATA[Spam filtering techniques]]></category>

		<guid isPermaLink="false">http://veriat.com/?p=337</guid>
		<description><![CDATA[Black lists 
Ownership, use, efficiency
These include:
lists of IP addresses of computers known to be sending spam;
(widely used) lists of computers that can be used for sending, namely “open relays” and “open proxies”, as well as “dial-up” lists of client addresses that cannot host mail servers;
(may be used) a local list or the [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Black lists </strong></p>
<p>Ownership, use, efficiency<br />
These include:<br />
lists of IP addresses of computers known to be sending spam;<br />
(widely used) lists of computers that can be used for sending, namely “open relays” and “open proxies”, as well as “dial-up” lists of client addresses that cannot host mail servers;<br />
(may be used) a local list or a list maintained by someone else;<br />
(widespread because they are simple to implement) black lists that are queried via DNS. They are called DNSBLs (DNS Black Lists). At present this method is not very effective: spammers find new computers for their purposes faster than those computers can be entered in the black lists. In addition, a few spam-sending computers can compromise an entire mail domain or subnet, and thousands of law-abiding users may be unable, for an indefinite period, to send mail to servers that use such a black list;<br />
(occasionally found) lists that promote rather radical theories (for example, equating virus messages with malicious spam).<span id="more-337"></span></p>
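<p>To make the DNS-based lookup concrete, here is a minimal Python sketch. It assumes the common DNSBL convention of reversing the octets of the IPv4 address and appending the zone of the list. The zone <code>dnsbl.example.org</code> is a placeholder, not a real list, and the function names are invented for this example.</p>

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up: reversed IPv4 octets followed by the list's zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone):
    """Return True if the list answers with an address record for this IP."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False  # no record or lookup failure: treated here as not listed


# 192.0.2.99 is a documentation address; the zone is hypothetical.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

<p>Only the name construction runs here; calling <code>is_listed</code> requires network access and the zone of a list you have actually chosen to trust.</p>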
<p><strong>Misuse</strong><br />
Resource administrators often use black lists irresponsibly and incorrectly, which leads to the blocking of large numbers of innocent users.</p>
<p>Example: using lists without an accurate idea of which addresses they include and how those addresses are added, using e-mail black lists for web resources, and so on.</p>
<p><strong>Irresponsible use</strong></p>
<p>Example: failing to tell the user (or administrator) of a blocked address which list it is on (there are a great many of them), or acting on a presumption of guilt.</p>
<p>Example (the most striking recent case of an irresponsible attitude): the domain registrar GoDaddy blocked thousands of domain names registered by the hosting company Majordomo [17] on the basis of single, unverified complaints from the Spamhaus group [18] [19].</p>
<p><strong>Racketeering by blacklist administrators</strong></p>
<p>Recently, more and more complaints have appeared online about blacklist administrators who blackmail Internet providers and hosting providers by refusing to remove IP addresses from which spam was perhaps once sent (addresses are put on the black lists on the basis of anonymous complaints that are often impossible to verify). In addition, many demand “donations” from the owners of IP addresses for removing records from the blacklists.</p>
<p><strong>Mail server authorization</strong></p>
<p>Various methods have been proposed to confirm that the computer sending a message actually has the right to do so (Sender ID, SPF, Caller ID, Yahoo DomainKeys, MessageLevel [1]), but they are not yet widespread. In addition, these technologies limit some common mail server functionality: it becomes impossible to forward mail automatically from one mail server to another (SMTP forwarding).</p>
<p>A policy is widespread among providers under which customers may open SMTP connections only to the provider’s own servers. In that case, some of the authorization mechanisms become impossible to use.</p>
<p><strong>Gray lists </strong></p>
<p>The gray list method relies on the fact that the “behavior” of software designed to send spam differs from that of ordinary mail servers: spam programs do not try to resend a message after a temporary error, as the SMTP protocol requires. More precisely, in trying to get around defenses, they use a different relay, a different return address, and so on in subsequent attempts, so to the receiving host these look like attempts to send different messages.</p>
<p>The simplest version of gray listing works as follows. All previously unknown SMTP servers are placed on a “gray” list. Mail from such servers is neither accepted nor rejected outright; the receiving server returns a temporary error code (“come back later”). If the sending server retries after no less than some time tg (this time is called the delay), it is added to the whitelist and the mail is accepted. Ordinary (non-spam) messages are therefore not lost; their delivery is only delayed (they stay in the queue on the sender’s server and are delivered after one or more unsuccessful attempts). Spam programs either cannot resend messages, or the servers they use manage to get onto DNSBL blacklists during the delay.</p>
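<p>The simplest scheme described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production filter: the class and method names are invented, servers are identified by IP address only (real implementations often also look at the sender and recipient), and the caller is expected to answer “tempfail” with an SMTP 4xx reply.</p>

```python
class Greylist:
    """Toy gray list keyed by the sending server's IP address."""

    def __init__(self, delay_seconds):
        self.delay = delay_seconds   # tg in the text
        self.first_seen = {}         # gray list: server -> time of first attempt
        self.whitelist = set()

    def check(self, server_ip, now):
        """Return 'accept' or 'tempfail' for a delivery attempt at time `now`."""
        if server_ip in self.whitelist:
            return "accept"
        if server_ip not in self.first_seen:
            self.first_seen[server_ip] = now   # unknown server: gray-list it
            return "tempfail"
        if now - self.first_seen[server_ip] >= self.delay:
            self.whitelist.add(server_ip)      # retried after the delay: trust it
            del self.first_seen[server_ip]
            return "accept"
        return "tempfail"                      # retried too early


greylist = Greylist(delay_seconds=300)
print(greylist.check("192.0.2.10", now=0))     # first attempt
print(greylist.check("192.0.2.10", now=60))    # retry before the delay has passed
print(greylist.check("192.0.2.10", now=600))   # retry after the delay
print(greylist.check("192.0.2.10", now=601))   # already whitelisted
```

<p>The four calls return a temporary failure twice and then accept twice, which is exactly the delay a legitimate server experiences on its first message.</p>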
<p>At present this method filters out up to 90% of spam with virtually no risk of losing important messages. However, it is not perfect either.</p>
<p>Messages from servers that do not follow the recommendations of the SMTP protocol, for example news mailing sites, may be filtered out by mistake. Servers that behave this way should, where possible, be added to the whitelist.<br />
Delivery of a message can be delayed by half an hour (or even more), which may be unacceptable for urgent correspondence. This drawback is offset by the fact that the delay applies only to the first message from a previously unknown sender. Also, many gray list implementations automatically add an SMTP server to the whitelist after a “getting acquainted” period. There are also ways of exchanging such whitelists between servers. As a result, after an initial “learning” period, fewer than 20% of messages are actually delayed.<br />
Large mail services use multiple servers with different IP addresses, and several servers may take turns trying to send the same message. This can lead to very long delivery delays. Server pools that behave this way should also be added to the whitelist where possible.<br />
Spam programs can be improved. Support for resending a message is fairly easy to implement and largely neutralizes this kind of protection. The key figure in this struggle is the ratio between the characteristic time tb that it takes a spammer to land on blacklists and the typical gray list delay tg. When tg is much shorter than tb, gray lists are useless in the long run; when tg is longer than tb, gray lists are hard for spammers to overcome.</p>
]]></content:encoded>
			<wfw:commentRss>http://veriat.com/make-a-list-on-spam-risk.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Some other methods of spam filtering</title>
		<link>http://veriat.com/some-other-methods-of-spam-filtering.html</link>
		<comments>http://veriat.com/some-other-methods-of-spam-filtering.html#comments</comments>
		<pubDate>Wed, 30 Sep 2009 07:28:56 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Spam filtering techniques]]></category>
		<category><![CDATA[methods]]></category>
		<category><![CDATA[spam filtering]]></category>

		<guid isPermaLink="false">http://veriat.com/?p=335</guid>
		<description><![CDATA[Other methods
A general tightening of the requirements for messages and senders, for example refusing mail with an invalid return address (mail from non-existent domains), checking the domain name against the IP address the message comes from, and so on. These measures weed out only the most primitive spam, a small number of messages. However, it is [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Other methods</strong><br />
A general tightening of the requirements for messages and senders, for example refusing mail with an invalid return address (mail from non-existent domains), checking the domain name against the IP address the message comes from, and so on. These measures weed out only the most primitive spam, a small number of messages. That number is not zero, however, so the measures are still worth using.</p>
<p>Sorting messages by the contents of their header fields gets rid of a certain amount of spam. Some clients (e.g., Mozilla Thunderbird or The Bat!) can examine the headers without downloading the whole message from the server, which saves bandwidth.<br />
“Challenge-response” systems verify that the sender is a person and not a robot program. This method requires the sender to take certain additional steps, which is often undesirable. Many such systems put extra load on mail systems, and in many cases they send challenges to forged addresses, so such solutions are not respected in professional circles. In addition, such a system cannot tell a spam robot from any other robot, for example one that sends news.</p>
<p>Systems that detect signs of bulk mailing, such as Razor and the Distributed Checksum Clearinghouse (DCC). Modules built into mail server software compute a checksum of each message passing through them and check it against the Razor or DCC servers, which report how many times the message has appeared on the Internet. If a message has appeared, say, tens of thousands of times, it is probably spam. On the other hand, a bulk mailing may be a legitimate mailing list. In addition, spammers can vary the message text, for example by adding random characters at the end.<span id="more-335"></span></p>
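<p>The counting idea, and the way a random suffix defeats it, can be shown with a small Python sketch. This is not the algorithm or protocol of Razor or DCC. It assumes a plain SHA-256 checksum over lightly normalized text and an in-memory counter standing in for the shared server, and all names are invented for the example.</p>

```python
import hashlib
from collections import Counter


def checksum(body):
    """Exact checksum of the message text, ignoring case and whitespace differences."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class BulkCounter:
    """Stand-in for a shared clearinghouse that counts sightings of each checksum."""

    def __init__(self):
        self.sightings = Counter()

    def report(self, body):
        """Record one sighting and return how often this checksum has been seen."""
        digest = checksum(body)
        self.sightings[digest] += 1
        return self.sightings[digest]


counter = BulkCounter()
for _ in range(3):
    count = counter.report("Buy cheap pills   NOW")
print(count)                                         # the same text, seen three times
print(counter.report("Buy cheap pills NOW xk3f9"))   # random suffix: a new checksum
```

<p>With an exact checksum, the message with the random suffix is counted as if it had never been seen before. Catching such variations would take a checksum that tolerates small changes.</p>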
<p><strong>Legal aspects of the problem </strong><br />
Some countries are taking legislative action against spammers. Attempts to outlaw or restrict the activities of spammers run into a number of difficulties. It is not easy to define in law which mailings are legitimate and which are not. Worst of all, the spamming company (or person) may be located in another country. For such laws to be effective, coordinated legislation would have to be developed and applied in most countries, which does not seem achievable in the foreseeable future.</p>
<p>In Russia, spam is prohibited by the “Law on Advertising” (Article 18, Clause 1):<br />
The distribution of advertising over telecommunication networks, including by telephone, facsimile and mobile radiotelephone communication, is permitted only with the prior consent of the subscriber or addressee to receive the advertising. Advertising is deemed to have been distributed without the prior consent of the subscriber or addressee unless the advertising distributor proves that such consent was obtained.</p>
<p>In official comments, the Federal Antimonopoly Service (FAS), which is responsible for monitoring compliance with this law, has stated that the rule applies to Internet mailings. For violating Article 18, the advertising distributor is liable under the legislation on administrative offenses. However, the FAS has no authority to carry out investigative measures to identify the person responsible for the spam, and the bodies that do have such authority cannot act, because Russian administrative and criminal law does not establish liability for sending spam. Therefore, despite periodic reports of offenders being held liable [21], this legislative rule is currently ineffective.</p>
<p>Since 1 January 2004, a federal law known as the CAN-SPAM Act has been in force in the U.S. Attempts are made to bring spammers to court, and sometimes these attempts succeed.</p>
<p>American Robert Soloway lost a case in federal court against a small Oklahoma-based Internet service provider whose operator had accused him of sending spam. The judgment included damages of $10,075,000.</p>
<p>The first case in which a user won against a company involved in sending spam came in December 2005, when businessman Nigel Roberts from the island of Alderney (Channel Islands) won in court against Media Logistics UK and received £270 in compensation.</p>
]]></content:encoded>
			<wfw:commentRss>http://veriat.com/some-other-methods-of-spam-filtering.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Eliminate Spam at source</title>
		<link>http://veriat.com/eliminate-spam-at-source.html</link>
		<comments>http://veriat.com/eliminate-spam-at-source.html#comments</comments>
		<pubDate>Wed, 30 Sep 2009 07:25:50 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Spam filtering techniques]]></category>
		<category><![CDATA[Eliminate Spam]]></category>

		<guid isPermaLink="false">http://veriat.com/?p=331</guid>
		<description><![CDATA[The first thing to do is to block spam before it enters your inbox. The only risk of this practice is that any email rejected by your spam software is lost for good, and anti-spam software is sometimes wrong and treats a legitimate email as spam.
The first solution for drying up spam at the [...]]]></description>
			<content:encoded><![CDATA[<p>The first thing to do is to block spam before it enters your inbox. The only risk of this practice is that any email rejected by your spam software is lost for good, and anti-spam software is sometimes wrong and treats a legitimate email as spam.</p>
<p>The first solution for drying up spam at the source is to set up SpamPal. This software prevents your mail client (Outlook, Outlook Express or Thunderbird) from downloading the spam, which saves you the time of sorting through the mail that is left.</p>
<p>This type of spam filter is particularly effective for a heavily spammed mailbox, where the time spent sorting and deleting spam becomes a problem. In the end, the user will not mind one or two good messages deleted by accident among a few thousand.</p>
<p>Alternatively, there are methods based on logical analysis of spam, using software like SpamBayes, which works on the source of the e-mail to detect whether or not it is spam. The advantage of this technique is that the filter is more intelligent: if you tell it several times that casino-type emails are spam, it learns that any content of this type should be moved to your Deleted Items. If you delete all content in foreign languages, it will treat such messages the same way …</p>
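<p>How such a filter learns from your decisions can be illustrated with a very small naive Bayes classifier in Python. This is a generic textbook sketch and not the scoring that SpamBayes itself uses. The class name, the whitespace tokenizer and the add-one smoothing are assumptions made for the example.</p>

```python
import math
from collections import Counter


class TinyBayesFilter:
    """Minimal naive Bayes text filter trained on messages labeled 'spam' or 'ham'."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Learn from one message that the user marked as 'spam' or 'ham'."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spam_probability(self, text):
        """Estimate the probability that `text` is spam."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Prior: smoothed share of training messages with this label.
            score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                # Add-one smoothing keeps unseen words from zeroing the score.
                likelihood = (self.word_counts[label][word] + 1) / (
                    total_words + len(vocabulary) + 1)
                score += math.log(likelihood)
            log_score[label] = score
        # Turn the two log scores into a probability; clamp to avoid overflow.
        difference = log_score["ham"] - log_score["spam"]
        difference = max(min(difference, 700.0), -700.0)
        return 1 / (1 + math.exp(difference))


spam_filter = TinyBayesFilter()
spam_filter.train("win money at our casino tonight", "spam")
spam_filter.train("casino bonus win big", "spam")
spam_filter.train("meeting notes for the project tonight", "ham")
spam_filter.train("lunch with the project team", "ham")
print(spam_filter.spam_probability("casino bonus"))
print(spam_filter.spam_probability("project meeting"))
```

<p>After only four training messages, the casino text scores well above one half and the project text well below it. Every further message you mark as spam or as good mail shifts the word counts and with them the scores.</p>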
<p>Moreover, it is important to ensure that your anti-spam solution has advanced cross-checking functions for detecting spam: the message contains links to relatively unsafe external sites, it contains formatted text, the sender’s address belongs to a nonexistent domain name, the message makes heavy use of certain HTML tags … All of these checks let you automate much of the processing of spam.</p>
]]></content:encoded>
			<wfw:commentRss>http://veriat.com/eliminate-spam-at-source.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

