Lesk Algorithm

>> Sunday, March 30, 2008

I am taking an NLP (Natural Language Processing) course this time there seems to be quite a number of interesting algorithms in that. I will try to blog on some of the interesting algorithms here once in a while, for the interest of my blog readers.
My interest in NLP doesn't mean I am going away from my major interest in Distributed computing :)

This time it is about Lesk Algorithm. Ambiguity is one of the challenges in NLP. Words like suit, chair, etc., have multiple meanings depending on the context. There are various ways of disambiguating them and Lesk algorithm is one of them.
Let's take the word "suit" as an example. It has meanings like clothes, law suit, fit for something, etc.,
First we need to find sample sentences for each and every meaning of the word suit. For every meaning, we create "bag of words" from those sample sentences. Basically we get all the words from the samples sentences, and assign that to the particular meaning.
When we get a new sentences with the word to disambiguate, we get all the words in that sentence, and try to find the best intersection with the bag of words we created earlier. The meaning associated with the maximum intersection, is considered the meaning of the word in the test sentence.

Let's take an example from the word "suit"

'suit of clothes': "set of garments (usually including a jacket and trousers or skirt) for outerwear all of the same fabric, design and color) "they buried him in his best suit""
'lawsuit': "comprehensive term for any proceeding in a court of law whereby an individual seeks a legal remedy) "the family brought suit against the landlord"

The bag of words of the meaning 'suit of clothes' will consists of {"set", "of", "garments", "usually", "including", ...}

Then we get a test sentence, with the word suit.
{'Mayor', 'William', 'B.', 'Hartsfield', 'filed', 'suit', 'for', 'divorce', 'from', 'his', 'wife', ',', 'Pearl', 'Williams', 'Hartsfield', ',', 'in', 'Fulton', 'Superior', 'Court', 'Friday'}

In the first approach, we remove all the stop words like 'the', 'in' both from test and training sentences.
If we do that, you can see the closest meaning will be 'law suit' for the given test sentence.

There is a problem of identifying the stop words. Perhaps some of them might not be stop words in some contexts. For this, the second approach is to come up with an IDF measure. We take all the sentences as it is.
We get set of training sentences and "augment" our bag of words with them. We then weight the overlapping set by their their inverse document frequency (IDF). As a crude way to calculate the IDF for a word, we can use the n documents within a given corpus as documents. For example, the word
brain appears in 12 of the 15 documents, so its IDF is log 15.0/12.0 = 0.3219.
This method, even though might be slow, gave me better results for disambguation, over the first approach.

I wrote a simple python code for this (this was one of our assignments). You need to have NLTK installed (installation notes) to run this.

It was so amazing to see how this really works, even though it seems to be dump sometimes to me. This is the more trivial of capturing contexts in to programs.

0 comments: