Friday, February 11, 2005

The Journal of Machine Learning Research

Microtome Publishing provides excellent open content online, including every issue of JMLR (in cooperation with MIT CSAIL).

The Journal of Machine Learning Research (JMLR) provides an international forum for the electronic and paper publication of high-quality scholarly articles in all areas of machine learning. JMLR has a commitment to rigorous yet rapid reviewing. Final versions are published electronically (ISSN 1533-7928) immediately upon receipt. Until the end of 2004, paper volumes (ISSN 1532-4435) were published 8 times annually and sold to libraries and individuals by the MIT Press. Paper volumes (ISSN 1532-4435) are now published and sold by Microtome Publishing.

Some recent publications include:

Thursday, February 03, 2005

ROR Metadata - “Divide and Describe”

The major complaint that I have about currently implementable semantic web technologies is the seemingly endless pit of complexity combined with the meager resulting functionality. You can spend literally days on end learning XML vocabularies, starting with a general study of RDF, moving through the Ontology Swamps (OWL, DAML, DAML+OIL, OWL-S, SHOE, ...), and then on to the specific vocabulary applications which are only now (slowly) gaining recognition. Some of the more useful ones showing up these days include FOAF, DOAP, and RDDL.

ROR (Resources of a Resource), however, is a new vocabulary that is immediately implementable. It strikes me as nicely balanced between the general abstraction required to be useful across a range of use cases and the grounded specificity that allows it to be easily understood, implemented, and (hopefully) put to use in the real world. Refer to the ROR Specification for details; in many cases the ROR metadata can (and probably should) be automatically generated, and it provides a clear first cut at semantically mapping the available components of an online resource. In addition to an implementation template, the author provides a nifty browser-based ROR Explorer application, which highlights some basic examples of ROR metadata in action.

Of course the chicken and egg conundrum is still out there -- in order to be useful, a large number of sites must implement this technology, which won't be compelling until they do. However, this vocabulary seems both easy to use and easy to make substantial use of, and that, at least, is a big step in the right direction.

Wednesday, February 02, 2005

Porter Stemming

The English language is a fairly inflexible domain for algorithms to work in, which makes the Porter stemming algorithm all the more useful, and all the more impressive in its accuracy. From the author's official web page, which also contains useful starter implementations in various modern languages:

The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.

Essentially, this means that English-language words are reduced to their stem forms, for example:

  • tags => tag
  • nicely => nice
  • drawers => drawer
  • usefulness => use
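
The plural examples above (tags, drawers) are handled by the very first rule group in Porter's published algorithm, step 1a. Here is a minimal Python sketch of that one step only. The function name `step_1a` is made up for illustration, and the later steps, which deal with endings like -ly and -fulness and depend on a measure of the stem's length, are left out entirely.

```python
def step_1a(word):
    """Apply only step 1a (plural endings) of the Porter algorithm."""
    # Rules are tried longest suffix first; only the first match applies.
    rules = [
        ("sses", "ss"),  # caresses -> caress
        ("ies", "i"),    # ponies   -> poni
        ("ss", "ss"),    # caress   -> caress (left unchanged)
        ("s", ""),       # cats     -> cat
    ]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["tags", "drawers", "caresses", "ponies", "caress"]:
    print(w, "=>", step_1a(w))
```

Note that the result is a stem, not necessarily a dictionary word: "ponies" comes out as "poni", which is fine as long as every related form is reduced to the same string.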

And it gets far more complicated from there, due to the bizarre contortions required for spelling in English, whose vocabulary draws so widely on other languages. As the Internet becomes increasingly an information retrieval application, and as English-language content spreads throughout the Internet as the new dataset of choice, semantically intelligent processing will nearly always require this type of stemming in order to normalize text into a canonical form. And so Porter Stemming will become more and more ubiquitous.

An excellent online example of how useful this can be is hackdiary's utility for stem-checking the category tags used on an account: tag stemmer
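
The general idea behind a tag stem-check can be sketched in a few lines of Python: group tags by their stem and report any stem that more than one distinct tag maps to. This is not hackdiary's code, and I am only assuming it works along these lines. The names `toy_stem` and `find_tag_collisions` are hypothetical, and `toy_stem` is a crude stand-in that only strips a plural "s"; a real Porter implementation would be passed in its place.

```python
from collections import defaultdict


def toy_stem(tag):
    # Stand-in for a real Porter stemmer (assumption for this sketch):
    # lowercases the tag and strips a single plural "s".
    tag = tag.lower()
    if tag.endswith("s") and not tag.endswith("ss"):
        return tag[:-1]
    return tag


def find_tag_collisions(tags, stem=toy_stem):
    """Return {stem: sorted tags} for stems shared by more than one distinct tag."""
    groups = defaultdict(set)
    for tag in tags:
        groups[stem(tag)].add(tag)
    return {s: sorted(g) for s, g in groups.items() if len(g) > 1}


tags = ["blog", "blogs", "python", "tools", "tool", "css"]
print(find_tag_collisions(tags))
```

Each group in the result is a set of tags that probably ought to be merged into one.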

For the very serious student of stemming, Porter's latest work is Snowball, essentially a domain-specific language for writing stemmers, which is distributed with English, French, Spanish, Portuguese, Italian, German, Dutch, Swedish, Norwegian, Danish, Russian, and Finnish stemming code. This is the site he directs readers toward for future enhancements and research in stemming algorithms.

Quick access to the code: Common Lisp version, Perl version, Python version, Ruby version, and Javascript version.