Wednesday, February 02, 2005

Porter Stemming

English does not lend itself easily to algorithmic treatment, which makes Porter stemming all the more useful as an algorithm, and all the more impressive in its accuracy. From the author's official web page, which also hosts useful starter implementations in various modern languages:

"The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems."

Essentially, this means that English-language words are reduced to their stem forms, for example:

  • tags => tag
  • nicely => nice
  • drawers => drawer
  • usefulness => use
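
As a rough illustration of the simplest case, the sketch below implements only the plural-handling rules from the first step of Porter's published algorithm (SSES -> SS, IES -> I, SS -> SS, S -> nothing). The function name `strip_plural` is made up for this post, and this is nowhere near the full stemmer: it copes with `tags` and `drawers`, but `nicely` and `usefulness` need the later steps.

```python
# Sketch of Porter's step 1a only (plural endings); not the full algorithm.
# `strip_plural` is a hypothetical name chosen for this example.

def strip_plural(word):
    """Apply the first matching plural rule, longest suffix first."""
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # ponies -> poni
    if word.endswith("ss"):
        return word           # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]      # tags -> tag
    return word


for w in ["tags", "drawers", "caresses", "ponies", "caress"]:
    print(w, "=>", strip_plural(w))
```

Checking the longest suffix first matters: testing for a bare `s` before `sses` would turn `caresses` into `caresse`.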

It gets far more complicated from there, because of the bizarre contortions of English spelling, whose vocabulary draws so widely on other languages. As the Internet increasingly becomes an information retrieval application, and as English-language content spreads across it as the dataset of choice, semantically intelligent processing will often need this type of stemming to normalize text into a canonical form. Porter stemming is therefore likely to become more and more common.

An excellent online example of how useful this can be is hackdiary's utility for stem-checking the category tags used for a del.icio.us account: del.icio.us tag stemmer
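
The idea behind a tag stemmer can be sketched in a few lines: stem every tag and report the ones that collide. This is not hackdiary's code. Both `group_tags_by_stem` and the stand-in `naive_stem` (which only drops a trailing plural `s`) are hypothetical names for this post, and a real tool would plug in one of the full Porter implementations instead.

```python
from collections import defaultdict


def naive_stem(tag):
    """Stand-in stemmer (assumption): lowercases and drops one trailing 's'.

    A real tool would call a full Porter implementation here instead.
    """
    tag = tag.lower()
    if tag.endswith("s") and not tag.endswith("ss"):
        return tag[:-1]
    return tag


def group_tags_by_stem(tags, stem=naive_stem):
    """Return {stem: sorted distinct tags} for stems shared by two or more tags."""
    groups = defaultdict(set)
    for tag in tags:
        groups[stem(tag)].add(tag)
    return {s: sorted(ts) for s, ts in groups.items() if len(ts) > 1}


# Made-up sample tags; "blog"/"blogs" and "tool"/"tools" should collide.
tags = ["blog", "blogs", "python", "tool", "tools", "css"]
print(group_tags_by_stem(tags))
```

Passing the stemmer in as a parameter keeps the grouping logic independent of which stemming implementation is used.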

For the very serious student of stemming, Porter's latest work is Snowball, essentially a domain-specific language for writing stemmers, which is distributed with English, French, Spanish, Portuguese, Italian, German, Dutch, Swedish, Norwegian, Danish, Russian, and Finnish stemming code. This is the site he directs readers toward for future enhancements and research in stemming algorithms.

Quick access to the code: Common Lisp version, Perl version, Python version, Ruby version, and JavaScript version.
