Nowhere Near Ithaca: bigrams

I have been playing with a simple online tool that will take a text document and calculate the log-likehood scores for all of the bigrams in the document, and generate the sorted list.

One of the things I wanted to be able to easily do is explore the sensitivity of the calculated log-likelihood scores on the entries of the contingency matrix for each bigram. This feature has been added and the tool updated.

For exploring the impact of changes to a contingency matrix, you can either manually enter specific values in one of the contingency tables, or drag your mouse left or right on a specific entry in order to decrease/increase the values.

Explore the Impact of Changes in the
Contingency Matrix Values on the LLR Scores
(available for arbitrary text documents at
https://googledrive.com/host/0B2GQktu-wcTidC01Ym1lR2h1TTA/)

Using a slightly modified version of the matrix from Dunning's ( http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html), we write the contingency matrix as

	Starts with Word 1	Not Word 1
Ends with Word 2	k_11: Count of the bigram Word 1 Word 2	k_12: Count of bigrams that end with Word 2, but do not start with Word 1
Does Not End with Word 2	k_21: Count of bigrams starting with Word 1 but not ending with Word 2	k_22: Count of bigrams that do not start with Word 1 and do not end with Word 2

and the log-likelihood score is calculated from the terms k_11, k_12, k_21, k_22.

The "exploring" involves watching the impact of changes in the values k_ij. At the moment, you can only increase/decrease a particular entry. However, in reality there are additional changes that are of interest; namely, keeping the total number of bigrams constant when changing a particular value k_ij, so that there has to be a corresponding change in the other value(s) when k_ij is changed. In fact, because of the lack of this restriction, it may be the case that you end up with a contingency matrix that could not occur for bigrams in a document.

You can also get other on-first-glance weirdness: very high LLR scores without having the bigram itself appear at all - at least, based on how the entries of the matrix are interpreted. As an extreme, for example, the LLR for the contingency matrix (k11,k12,k21,k22)=(0,100,100,0) has value 277.3, even though since k11=0 it would mean that the bigram itself did not occur at all. However, in this case the RootLLR is negative (at -16.7), indicating that the bigram appeared fewer times than expected (see http://s.apache.org/CGL). The RootLLR is defined as

RootLLR = signum[k11/(k11+k12) - k21/(k21+k22)] * sqrt(LLR)
= signum[0 - 1] * sqrt(277.3)
= - 16.7
There is much more to explore here.

The methods described in Ted Dunning's Surprise and Coincidence blog post regarding the log-likelihood ratio score can be used for a variety of interesting applications. This score is simple to calculate, and yet apparently can capture "anomalous" rare events for filtering purposes.

To help me better understand this, I have started a small online project that lets you calculate these scores for bigrams of a given text document. You can use a text file from your machine, one of the preselected ones from Project Gutenberg, or the MED dataset from the Classic3 dataset. The contingency table upon which the score is based is also shown for each bigram. Note that currently a list of stopwords (based on Ken Church's ngrams tutorial) is used so that bigrams that include these words are not included in the analysis (e.g., "of", "they", etc.). I am not sure whether this list should included in an upfront fashion or not, and am still researching a better way to address this kind of thing.

This is still very much a work-in-progress.

Performance-wise, it seems to run fine for files up to about 1MB or so (javascript web workers are used for the raw processing on a file).

You can selectively add/remove words from the "top" bigrams by clicking on the words. "Noise" seems to be an issue that has not been addressed here in any serious way yet - there are lots of "nuisance" words that show up (especially since the Project Gutenberg files were not modified in any way), and this is in spite of the use of a common set of stopwords - in fact, that's why I added the easy ability to remove additional words dynamically from the list by just clicking on them (either as a start word or end word in a bigram).

Top Bigrams from Moby Dick (full list of 75 not included here)
(from https://googledrive.com/host/0B2GQktu-wcTidC01Ym1lR2h1TTA/)

The (relatively simple) calculations of the log-likelihood scores themselves are done with a straightforward translation of the LogLikelihood.java class from the Apache Mahout project. Also, the handful of log-likelihood tests from that project were used to find an issue with the calculations in the javascript version (that was fixed June 23). Also included are the "root log-likelihood ratios" (see this mailing list post by Ted Dunning for some background on this).

The contingency tables are included to assist the ongoing debugging, and the plan is to make visible more of the intermediate calculations and statistics for each "run".

Nowhere Near Ithaca

Exploring Sensitivity of Log-Likelihood Scores for Bigrams

A Simple Online tool for Exploring Bigrams in Text Documents

Popular Posts