January 2012

In 2002, Paul Graham published A Plan for Spam, in which he described using a Bayesian filter to detect spam in email. Today, most modern mail clients use this as one of their spam-detection techniques, after some initial training.

Could the same technique be used to filter your Twitter timeline? This could be either to detect spam or, as I wanted to use it, to filter out things that I’m just not interested in, e.g. the current discussion of American politics. (Not that it isn’t worthy of discussion; I just have very little interest in it, and live in England.)

The idea is that upon reading a tweet, you decide whether it is interesting or not, and mark it accordingly. The filter learns that this particular combination of words is spam or ham, and uses this to adjust its assessment of the spamminess of other tweets. Tweets are then displayed in different colours, depending on whether that probability is high, medium or low. The record of your choices is stored on disk and reloaded on startup, so that over time the program learns your interests.
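To give a flavour of the mark-then-score loop, here is a minimal sketch in Python. It is not the code of the Perl extension, nor of Algorithm::Bayesian; it is a simplified naive Bayes that only scores the words present in a tweet, and the class name `TweetFilter`, its methods and the tokenising rule are all made up for illustration.

```python
import math
import re
from collections import Counter


class TweetFilter:
    """Toy two-class naive Bayes filter over tweet words (illustrative only)."""

    def __init__(self):
        # Number of marked tweets containing each word, per label.
        self.words = {"spam": Counter(), "ham": Counter()}
        # Number of tweets marked with each label.
        self.tweets = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Assumed tokeniser: lower-cased words, keeping @mentions and #hashtags.
        return set(re.findall(r"[#@]?\w+", text.lower()))

    def train(self, text, label):
        # label is "spam" or "ham", as chosen by the reader.
        self.tweets[label] += 1
        self.words[label].update(self.tokenize(text))

    def spam_probability(self, text):
        # Without examples of both kinds there is no evidence either way.
        if not self.tweets["spam"] or not self.tweets["ham"]:
            return 0.5
        log_odds = math.log(self.tweets["spam"] / self.tweets["ham"])
        for word in self.tokenize(text):
            # Add-one smoothing so unseen words never give a zero probability.
            p_spam = (self.words["spam"][word] + 1) / (self.tweets["spam"] + 2)
            p_ham = (self.words["ham"][word] + 1) / (self.tweets["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp before exponentiating to stay clear of float overflow.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1 / (1 + math.exp(-log_odds))
```

Marking a couple of politics tweets as spam and a couple of programming tweets as ham is enough for `spam_probability` to lean the right way on similar wording, although with so few words per tweet the estimates stay rough until many more tweets have been marked.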

One problem is the 140-character limit on tweets, which does not provide much with which to train a filter! However, I have had some initial success, though not yet enough that I could simply hide low-priority tweets; the filter still needs tuning.

The filter is an extension to Cameron Kaiser’s excellent TTYtter command-line Twitter client, which is written in Perl 5. As TTYtter does not currently provide a mechanism for an extension to indicate the ‘priority’ of tweets, and hence to colour them differently, I have patched TTYtter 1.2.5. In addition to TTYtter’s requirements, you will also need to install Gea-Suan Lin’s Algorithm::Bayesian module from the CPAN. I also recommend installing Term-ReadLine-TTYtter. The filter memory is persisted using Data::Dumper.
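The store-and-reload step can be sketched as follows. The extension itself uses Perl’s Data::Dumper; this Python sketch swaps in JSON purely for illustration, and the function names, the file layout and the shape of the `memory` dictionary are assumptions rather than anything taken from the extension.

```python
import json
import os
import tempfile


def save_memory(path, memory):
    """Write the filter memory to disk, replacing any previous file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first so a crash cannot leave a half-written file.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(memory, handle)
    os.replace(tmp, path)


def load_memory(path):
    """Reload the filter memory on startup; start empty if nothing was saved."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        # Hypothetical empty layout: per-label word counts.
        return {"spam": {}, "ham": {}}
```

Calling `load_memory` at startup and `save_memory` after each marking would give the behaviour described above, where the record of choices survives between sessions.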

To use, invoke TTYtter as:

./ttytter.pl -readline -exts=bayes.pl -ansi

To mark tweets as spam, use the command “/spam xx yy .. zz” where xx, yy, zz are one or more tweet menu IDs, as shown in the incoming timeline. To mark as ham, use “/ham xx yy .. zz”. Alternatively, use “/- xx yy .. zz” for spam and “/+ xx yy .. zz” for ham.
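The four commands boil down to a label plus a list of menu IDs. As a rough illustration in Python (not how the Perl extension hooks into TTYtter), a parser might look like this; `parse_mark_command` is a hypothetical name, and menu IDs are treated as opaque whitespace-separated tokens.

```python
# Both spellings of each command map to the same label.
COMMANDS = {"/spam": "spam", "/-": "spam", "/ham": "ham", "/+": "ham"}


def parse_mark_command(line):
    """Return (label, menu_ids) for a marking command, or None for anything else."""
    parts = line.split()
    if not parts or parts[0] not in COMMANDS:
        return None
    if len(parts) < 2:
        # A marking command needs at least one tweet menu ID.
        return None
    return COMMANDS[parts[0]], parts[1:]
```

Each returned ID would then be looked up in the timeline and its tweet text handed to the filter with the given label.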

Tweets are then coloured according to the filter. Ham (interesting) tweets are bold white, spam is yellow, and undecided tweets are plain white.

The weighting of the filter’s probability could possibly be better: I’m splitting the probability range [0..1] into three equal parts, [0..1/3), [1/3..2/3) and [2/3..1]. This was a guess 🙂
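The three-way split can be sketched in Python as below. This assumes the probability is the spam probability, so the lowest band is ham and the highest is spam; the function names and the specific ANSI escape codes (standard bold white, plain white and yellow) are illustrative choices, not lifted from the patched TTYtter.

```python
def classify(p_spam):
    """Map a spam probability in [0, 1] to one of three equal-width bands."""
    if not 0.0 <= p_spam <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if p_spam < 1 / 3:
        return "ham"        # [0, 1/3)
    if p_spam < 2 / 3:
        return "undecided"  # [1/3, 2/3)
    return "spam"           # [2/3, 1]


# Assumed ANSI styles: bold white, plain white, yellow.
STYLE = {"ham": "\033[1;37m", "undecided": "\033[0;37m", "spam": "\033[0;33m"}
RESET = "\033[0m"


def colourise(text, p_spam):
    """Wrap a tweet in the escape codes for its band."""
    return f"{STYLE[classify(p_spam)]}{text}{RESET}"
```

Keeping the thresholds in one small function makes it easy to try unequal bands later, for example a narrower spam band if too many interesting tweets end up yellow.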

The patched TTYtter, and the module, can be found on the DevZendo Miscellaneous GitHub repository. Changes to TTYtter are under the same licence as TTYtter itself; the Bayesian filter extension is under the Apache License v2.0.

Article updated Feb 26 2020 to change URL of source to reflect change from Google Code, to Mercurial, to GitHub.

I’ve started using the Qumana offline blog editor to prepare blog posts. I’m using version 3.2.4 on Mac OS X Snow Leopard. I’m impressed so far. Here are some initial thoughts on it.

I’d considered ScribeFire, which I used previously but was not impressed by. I’d hoped that it might have improved recently, but a recent rewrite received something of a panning in reviews; hence the search for something better. Qumana seems to do what I need.

It syncs up well with WordPress; I now have some of my earlier posts on the laptop.

It doesn’t seem to like the apostrophe in my blog title, showing it as an HTML entity – bloggers do use apostrophes, you know; some of us, correctly. It does the same with categories: one has an ampersand, which it shows as &amp;. Hmmmm.

It gets the funky foreign letters in BoøkWürm correct; and even inserted that as a hypertext link to that category. As I mouse over the link in the editor, its address doesn’t show anywhere. I’d have to switch to HTML Source View.

I can’t seem to add categories from it. That’s a pain. It does allow me to add tags though, which link through to Technorati.

It’s written in Java. This has pros and cons. Portability is good. There are small resize and toolbar layout problems; it has that certain Javaesque je ne sais quoi. I know from my own Open Source Mac OS X endeavours that making Java programs look beautiful is hard, and on the Mac, even harder. (I use Quaqua; Qumana uses JGoodies Looks, which I use on Windows and Linux.)

The editor is very nice – offering WYSIWYG and HTML Source View. It has basic formatting options: fonts, underline/bold/italic/strikethrough, alignment, lists, blockquotes, indent/outdent, images and links. Just enough; I don’t want anything fancy. Word count would be useful. The spell checker is useful, but I can’t change the language from American English to British English.

I won’t be making use of its Ad feature. The Blog Manager allows multiple blogs. Could be useful 🙂

In all, a nice application – and it seems reliable so far.

I learned to touch-type in the 1980s, as I got into computing, with help from a lunchtime class that our commerce teacher ran. Not so easy to touch-type on a ZX81 – we used typewriters. I’m fairly fast, but I’ve drifted from doing it properly, and tend to keep looking down at what I’m typing.

As I expect to be programming and using computers until I can’t physically do it any more, I need to look after my ability to type effectively. Qwerty isn’t optimal.

A couple of times in the last few years, I’ve tried to remedy this by learning the Dvorak layout. I’ve gone through lessons with a tutor program twice, but as soon as I started using it for the day job, the difficulty of adjusting my muscle memory’s knowledge of Vim and Eclipse reared up, and I stopped using it. Having to enter the usual panoply of symbols whilst programming is also more difficult.

If I were only writing straight English, this wouldn’t be so much of a problem. So, write more English… like this blog, which has become woefully neglected. There’s also the possibility of resurrecting writing short stories. (Thomas Pynchon need not start getting worried.) I also like the idea of 750words.com, where you put down 750 words of whatever comes into your head, first thing, every day.

The Colemak layout is supposed to be better for programming than Dvorak, so for my third and final attempt at speeding up, I’m using aTypeTrainer4Mac with this layout. It’s going well so far; I’m just moving off the home row.

So the plan is to train up, then, when I’ve got all the keys in muscle memory, write English daily with it. Then, and only then, will I start looking at adapting it for programming.