Friday, February 11, 2005

The Journal of Machine Learning Research

Microtome Publishing provides excellent open content online, including every issue of JMLR (in cooperation with MIT CSAIL).

The Journal of Machine Learning Research

The Journal of Machine Learning Research (JMLR) provides an international forum for the electronic and paper publication of high-quality scholarly articles in all areas of machine learning. JMLR has a commitment to rigorous yet rapid reviewing. Final versions are published electronically (ISSN 1533-7928) immediately upon receipt. Until the end of 2004, paper volumes (ISSN 1532-4435) were published 8 times annually and sold to libraries and individuals by the MIT Press. Paper volumes (ISSN 1532-4435) are now published and sold by Microtome Publishing.

Some recent publications include:

Thursday, February 03, 2005

ROR Metadata - “Divide and Describe”

The major complaint that I have about currently implementable semantic web technologies is the seemingly endless pit of complexity combined with the meager resulting functionality. You can spend literally days on end learning XML vocabularies, starting with a general study of RDF, wading through the Ontology Swamps (OWL, DAML, DAML+OIL, OWL-S, SHOE, ...), and then moving on to the specific vocabulary applications that are only now (slowly) gaining recognition. Some of the more useful ones showing up these days include FOAF, DOAP, and RDDL.

ROR (Resources of a Resource), however, is a new vocabulary that is immediately implementable. It strikes me as nicely balanced between the general abstraction required to be useful across a range of use cases and the grounded specificity that allows it to be easily understood, implemented, and (hopefully) put to use in the real world. Refer to the ROR Specification for details, but in many cases the ROR metadata can (and probably should) be generated automatically, and it provides a clear first cut at semantically mapping the available components of an online resource. In addition to an implementation template, the author provides a nifty browser-based ROR Explorer application, which highlights some basic examples of ROR metadata in action.

Of course the chicken and egg conundrum is still out there -- in order to be useful, a large number of sites must implement this technology, which won't be compelling until they do. However, this vocabulary seems both easy to use and easy to make substantial use of, and that, at least, is a big step in the right direction.

Wednesday, February 02, 2005

Porter Stemming

The English language does not yield easily to algorithmic treatment, which makes the Porter stemming algorithm all the more useful, and its accuracy all the more impressive. From the author's official web page, which also contains useful starter implementations in various modern languages:

The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.

Essentially, this means that English-language words are reduced to their stem forms, for example:

  • tags => tag
  • nicely => nice
  • drawers => drawer
  • usefulness => use

And it gets far more complicated from there, due to the bizarre contortions required for spelling in English, whose vocabulary draws so widely on other languages. As the Internet becomes increasingly an information retrieval application, and as English-language content spreads throughout the Internet as the new dataset of choice, semantically intelligent processing will nearly always require this type of stemming in order to normalize text into a canonical form. And so Porter Stemming will become more and more ubiquitous.
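
To make the normalization idea concrete, here is a minimal Python sketch of plain suffix stripping that happens to reproduce the four reductions listed above. It is emphatically not the Porter algorithm, which applies several ordered rule steps conditioned on the shape of the remaining stem. The rule table, the `MIN_STEM_LENGTH` cutoff, and the `toy_stem` name are all hypothetical, chosen only for illustration.

```python
# Toy suffix stripper: an illustration of stem normalization only.
# ASSUMPTION: these rules and the length cutoff are invented for this
# example; they are NOT the rules of the real Porter stemmer.
SUFFIX_RULES = [
    ("fulness", ""),  # usefulness -> use
    ("ness", ""),     # kindness -> kind
    ("ly", ""),       # nicely -> nice
    ("s", ""),        # tags -> tag
]

# Refuse to strip a suffix if the remaining stem would be shorter than this.
MIN_STEM_LENGTH = 3


def toy_stem(word):
    """Return a crude stem by removing the first matching suffix."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Leave words such as "glass" alone: a double "s" is not a plural.
        if suffix == "s" and word.endswith("ss"):
            continue
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= MIN_STEM_LENGTH:
                return stem
    return word


if __name__ == "__main__":
    for example in ["tags", "nicely", "drawers", "usefulness"]:
        print(example, "=>", toy_stem(example))
```

The point is that different surface forms collapse to one key: `toy_stem("tags")` and `toy_stem("tag")` both give `tag`, so an index keyed on stems treats them as the same term. For real work, one of the implementations linked from Porter's page is the better starting point.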

An excellent online example of how useful this can be is hackdiary's utility for stem-checking the category tags used for a del.icio.us account: del.icio.us tag stemmer

For the very serious student of stemming, Porter's latest work is Snowball, essentially a domain-specific language for writing stemmers, which is distributed with English, French, Spanish, Portuguese, Italian, German, Dutch, Swedish, Norwegian, Danish, Russian, and Finnish stemming code. This is the site he directs readers toward for future enhancements and research in stemming algorithms.

Quick access to the code: Common Lisp version, Perl version, Python version, Ruby version, and JavaScript version.

Wednesday, January 05, 2005

Why flickr is such a definitive step forward

2002_0820_200212AA,
originally uploaded by danlentz.
Digital photography is compelling for any number of reasons, but just about all of the sundry "photo album" sites have been kind of lacking (I don't even remember the passwords to get back into half of the albums I've left hanging around in various dark corners of the internet). Flickr combines all of the expected niceties of a modern, internet-enabled photo album with the dynamic character and unpredictability that come from social software architecture, and that seems to be the conceptual "twist" that makes digital albums as compelling as digital photography has turned out to be.

Of course, it also provides an ideal platform for integration with blogs and other syndicated content: a publishing medium that can now draw on one's unlimited supply of personal digital images. Feeds for my Flickr PhotoStreams: RSS 2.0, Atom

Monday, January 03, 2005

"HitMaps" are one-up on the old-fashioned Hit-Counters

Where are visitors to this page?
(Auto-update daily since 03-JAN-05)
The HitMap project, essentially a combination of geo-encoded URL semantics with a nice clustering algorithm, is one of the more mundane projects going on at The Open University in the UK. There is a wide variety of Semantic Web and Social Software projects under way there, many of which seem to be producing tangible results. The research is a component of the BuddySpace projects, hosted by the Knowledge Media Institute group. Really fun stuff.