In 2002, Paul Graham published "A Plan for Spam", in which he described the use of a Bayesian filter to detect spam in email. Today, most modern mail clients use this technique, among others, to detect spam once they have had some initial training.

Could the same technique be used to filter your Twitter timeline? The aim could be either to detect spam or, as I wanted, to filter out things that I’m just not interested in, e.g. the current discussion of American politics. (Not that it isn’t worthy of discussion; I just have very little interest in it, and live in England.)

The idea is that upon reading a tweet, you decide whether it is interesting or not, and mark it accordingly. The filter learns that the words in that tweet indicate spam (uninteresting) or ham (interesting), and uses this to adjust its assessment of the spamminess of other tweets. Tweets are then displayed in different colours, depending on whether that probability is high, medium or low. The record of your choices is stored on disk and reloaded on startup, so that over time the program learns your interests.
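To make the train-then-score loop concrete, here is a minimal sketch in Python of a word-based naive Bayes classifier. It is not the code of the extension (which is Perl and relies on Algorithm::Bayesian), nor Paul Graham's exact formula; the class name TweetFilter, the tokenizer, and the add-one smoothing are all my own assumptions for illustration.

```python
import math
import re
from collections import Counter


class TweetFilter:
    """Toy naive Bayes classifier over tweet words (illustrative only)."""

    LABELS = ("spam", "ham")

    def __init__(self):
        # Per-label word counts and per-label number of marked tweets.
        self.words = {label: Counter() for label in self.LABELS}
        self.tweets = {label: 0 for label in self.LABELS}

    @staticmethod
    def tokenize(text):
        # Assumed tokenizer: lower-cased words, keeping #hashtags and @names.
        return re.findall(r"[a-z0-9#@']+", text.lower())

    def train(self, text, label):
        """Record one tweet that the reader marked as 'spam' or 'ham'."""
        self.tweets[label] += 1
        self.words[label].update(self.tokenize(text))

    def spam_probability(self, text):
        """Return an estimate of P(spam | words in text), in [0, 1]."""
        total = sum(self.tweets.values())
        if total == 0:
            return 0.5  # nothing learned yet, so stay undecided
        vocab = len(set(self.words["spam"]) | set(self.words["ham"]))
        tokens = self.tokenize(text)
        score = {}
        for label in self.LABELS:
            # Log prior, smoothed so an unseen label never gets probability 0.
            s = math.log((self.tweets[label] + 1) / (total + 2))
            n = sum(self.words[label].values())
            for word in tokens:
                # Add-one smoothing; the extra +1 leaves room for unseen words.
                s += math.log((self.words[label][word] + 1) / (n + vocab + 1))
            score[label] = s
        # Turn the two log scores into a probability; clamp to avoid overflow.
        diff = max(min(score["ham"] - score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))
```

Marking a tweet corresponds to calling train() with the tweet text and a label; scoring an incoming tweet corresponds to spam_probability(). With so few words per tweet, the smoothing constants have a noticeable effect on the result, which is one reason a filter like this needs tuning.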

One problem is the 140-character limit of tweets, which does not provide much with which to train a filter! However, I have had some initial success, although not yet to the extent that I could simply hide low-priority tweets; the filter still needs tuning.

The filter is an extension to Cameron Kaiser’s excellent TTYtter command-line Twitter client, which is written in Perl 5. As TTYtter does not currently provide a mechanism for an extension to indicate the ‘priority’ of tweets, and hence colour them differently, I have patched TTYtter 1.2.5. In addition to TTYtter’s requirements, you will also need to install Gea-Suan Lin’s Algorithm::Bayesian module from CPAN. I also recommend installing Term-ReadLine-TTYtter. The filter’s memory is persisted to disk using Data::Dumper.

To use, invoke TTYtter as:

./ttytter.pl -readline -exts=bayes.pl -ansi

To mark tweets as spam, use the command "/spam xx yy .. zz" where xx, yy, zz are one or more tweet menu IDs, as shown in the incoming timeline. To mark as ham, use "/ham xx yy .. zz". Alternatively, use "/- xx yy .. zz" for spam and "/+ xx yy .. zz" for ham.

Tweets are then coloured according to the filter. Ham or interesting tweets are bold white, spam is yellow, undecided is plain white.

The weighting of the filter’s probability could possibly be better: I’m splitting the probability range [0..1] into three equal parts, [0..1/3), [1/3..2/3) and [2/3..1]. This was a guess 🙂
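The three-way split can be sketched as follows, in Python rather than the extension's Perl. This assumes the probability in question is the probability that a tweet is spam; the function name priority, the labels, and the ANSI escape codes are illustrative choices of mine, not taken from the extension or from TTYtter.

```python
# Assumed ANSI escape sequences for the three colours described above.
ANSI = {
    "ham": "\033[1m",        # bold white
    "undecided": "\033[0m",  # plain white (terminal default)
    "spam": "\033[33m",      # yellow
}
RESET = "\033[0m"


def priority(p_spam):
    """Map a spam probability onto [0, 1/3), [1/3, 2/3) or [2/3, 1]."""
    if not 0.0 <= p_spam <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p_spam < 1 / 3:
        return "ham"
    if p_spam < 2 / 3:
        return "undecided"
    return "spam"


def colourise(text, p_spam):
    """Wrap a tweet in the colour that matches its bucket."""
    return ANSI[priority(p_spam)] + text + RESET
```

Moving the two thresholds (for example, treating only probabilities above 0.9 as spam) is the obvious knob to turn when tuning; equal thirds are just the starting guess.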

The patched TTYtter, and the module, can be found on the DevZendo Miscellaneous Mercurial repository. Changes to TTYtter are under the same licence as TTYtter itself; the Bayesian filter extension is under the Apache License v2.0.
