The bioinformatics research group at IBM’s Thomas J Watson Research Center in New York has taken a technique originally designed to analyse DNA sequences and adapted it to catch spam.
The system, called Chung-Kwei, is based on the Teiresias algorithm, which was designed to search different DNA and amino acid sequences for recurring patterns.
The algorithm was fed 65,000 examples of known spam. Each email was treated as a DNA-like chain of characters, and Chung-Kwei identified six million recurring patterns in this collection, such as “Viagra”.
Each pattern represented a common sequence of letters and numbers that had appeared in unsolicited messages. The researchers then ran a collection of known non-spam through the same process and removed the patterns that occurred in both groups.
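The training step can be illustrated with a rough sketch. This is not IBM's code: it swaps the real pattern-discovery algorithm for plain fixed-length character n-grams, and the function names, the pattern length of six and the minimum support of two messages are all assumptions made for the example.

```python
from collections import Counter


def recurring_patterns(messages, length=6, min_support=2):
    """Return character n-grams that occur in at least `min_support` messages.

    Fixed-length n-grams are a simplified stand-in for true pattern discovery.
    """
    counts = Counter()
    for text in messages:
        text = text.lower()
        # Collect each n-gram once per message, so a count means
        # "number of messages containing this n-gram".
        grams = {text[i:i + length] for i in range(len(text) - length + 1)}
        counts.update(grams)
    return {gram for gram, n in counts.items() if n >= min_support}


def spam_only_patterns(spam, ham, length=6, min_support=2):
    """Keep patterns that recur in spam but do not recur in non-spam."""
    spam_patterns = recurring_patterns(spam, length, min_support)
    ham_patterns = recurring_patterns(ham, length, min_support)
    return spam_patterns - ham_patterns


# Toy data, invented for illustration only.
spam = ["buy viagra now, reply today", "cheap viagra, reply today"]
ham = ["please reply today about lunch", "can you reply today?"]
patterns = spam_only_patterns(spam, ham)
```

In this toy data, “viagra” survives because it recurs only in the spam, while a fragment of “reply today” is dropped because it recurs in both collections.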
Incoming email was given a score based on how many spam patterns it contained. A long email with only a few spammy sentences would get a relatively low score, while one with many patterns would score much higher. Chung-Kwei correctly identified nearly 97% of the test messages as spam.
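The scoring idea can be sketched in a few lines. The article does not give the actual scoring formula, so this example makes an assumption: the score is simply the number of known spam patterns found in the message, and a message is flagged once the score reaches a hypothetical threshold. The names `spam_score`, `is_spam` and `threshold` are invented for illustration.

```python
def spam_score(message, patterns):
    """Count how many of the known spam patterns occur in the message."""
    text = message.lower()
    return sum(1 for pattern in patterns if pattern in text)


def is_spam(message, patterns, threshold=2):
    """Flag a message whose score reaches the (assumed) threshold."""
    return spam_score(message, patterns) >= threshold


# Toy patterns and messages, invented for illustration only.
known_patterns = {"viagra", "free!!", "click "}

short_spam = "Free!! Viagra, click here"
long_ham = (
    "Thanks for the notes from Tuesday. One of the news stories we "
    "discussed was about filters that look for words like viagra."
)

spam_result = is_spam(short_spam, known_patterns)
ham_result = is_spam(long_ham, known_patterns)
```

Here the short message matches all three patterns and is flagged, while the long message matches only one and is let through.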
From New Scientist.