C# implementation of LSA test
Latent Semantic Analysis (LSA) is a patented technology (US Patent 4,839,853); the patent has presumably expired. Latent classes were introduced in the 1950s, when statisticians noticed that some answers on poll forms can be successfully predicted from others. For example, when someone answers positively to
Do you like movies?
Do you like Coca-Cola?
the answer to
Do you like popcorn?
can already be predicted. So statisticians introduced latent classes of people; once someone is identified as belonging to a particular latent class, many of their answers become statistically predictable.
The same approach may be applied to the categorization of documents. Recognition of latent classes is claimed as the biggest advantage of LSA. The claim can be explained with the following simple example. Presume we have 5 documents drawn from 2 latent classes, COMPUTERS and ROOM-DECORATION (blue and red). The document-term matrix is shown below.
A simple keyword search on the word PROCESSOR returns only documents 1 and 2. The LSA algorithm is capable of returning documents 1, 2, and 3, because the word INTERFACE occurs together with the word SOFTWARE in the same document, and SOFTWARE, in turn, occurs in the same document as the word PROCESSOR. This result is achieved by singular value decomposition (SVD) and by processing the query, as a vector, with the matrices involved in the decomposition.

The other advantage claimed for LSA is reduced dimensionality. Realistically, a document-term matrix is near 1,000,000 * 50,000 (one million documents and 50 thousand words). After SVD the rank can be reduced to about 300. Replacing the original data by the reduced form is not an advantage in itself: the matrices 300 * 1,000,000 and 50,000 * 300 can together be even larger than the original, because the original matrix is sparse and contains only about 1 percent non-zero elements. The real advantage of reduced dimensionality is quicker search: queries can be processed faster after the decomposition. However, computing the decomposition of a 1,000,000 * 50,000 matrix is very challenging. I did not try anything larger than 10,000; matrices with rank near 10,000 take almost a day to process.
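The PROCESSOR/INTERFACE effect described above can be sketched in a few lines of NumPy (an illustration only, not the project's C# code). The original article shows the 5-document matrix as an image, so the matrix below is a plausible stand-in with the same structure: documents 1 and 2 contain PROCESSOR, document 3 shares SOFTWARE with them but lacks PROCESSOR, and documents 4 and 5 form the ROOM-DECORATION class.

```python
import numpy as np

# Stand-in document-term matrix (5 documents x 5 terms).
# Columns: PROCESSOR, SOFTWARE, INTERFACE, CURTAIN, WALLPAPER
A = np.array([
    [1, 1, 0, 0, 0],  # doc 1: COMPUTERS
    [1, 1, 0, 0, 0],  # doc 2: COMPUTERS
    [0, 1, 1, 0, 0],  # doc 3: COMPUTERS (no PROCESSOR, but shares SOFTWARE)
    [0, 0, 0, 1, 1],  # doc 4: ROOM-DECORATION
    [0, 0, 0, 1, 1],  # doc 5: ROOM-DECORATION
], dtype=float)

q = np.array([1, 0, 0, 0, 0], dtype=float)  # query: PROCESSOR

# Plain keyword search: dot product with each row misses doc 3 entirely.
keyword_scores = A @ q

# LSA: truncate the SVD to k = 2 latent dimensions, fold the query into
# the latent space, and compare it with the document coordinates there.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_k = U[:, :k] * s[:k]   # document coordinates in latent space
q_k = Vt[:k, :] @ q         # query folded into the same space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

lsa_scores = [cosine(q_k, d) for d in docs_k]
print("keyword:", keyword_scores)       # doc 3 scores 0
print("LSA:", np.round(lsa_scores, 3))  # doc 3 now scores near 1
```

With k = 2 the two retained singular directions correspond to the two latent classes, so document 3 lands on the same latent axis as the query even though it never mentions PROCESSOR.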
In this research I test the main claim of LSA: does it really recognize latent classes better than a simple comparison of row vectors in the document-term matrix? The answer is no, it does not. My experiment can be reproduced by downloading the project from the top link. The program uses my PATENT CORPUS as data: technical descriptions of 128 patents that belong to 8 classes (16 documents in each class). Sorting these 128 documents correctly into 8 classes is very hard even for a human expert, so a 50 percent correct recognition rate is considered good. After the document-term matrix is ready, it is subjected to SVD
A = U * L * V^T
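The decomposition, and the property it gives us, can be checked directly in NumPy (a sketch, not the project's C# code). Because V^T has orthonormal rows, A * A^T = (U * L)(U * L)^T, so every pairwise row cosine of U * L equals the corresponding row cosine of A at full rank; only truncation changes them.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((6, 8))   # small random stand-in for a document-term matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
UL = U * s               # rows of U * L

def row_cosines(M):
    normed = M / np.linalg.norm(M, axis=1, keepdims=True)
    return normed @ normed.T

# Full rank: row cosines are preserved exactly.
assert np.allclose(row_cosines(A), row_cosines(UL))

# Truncated to k dimensions the cosines change -- this is where LSA's
# claimed latent-semantic effect (or, in this experiment, the
# degradation) enters.
k = 3
UL_k = U[:, :k] * s[:k]
print("max cosine shift after truncation:",
      np.max(np.abs(row_cosines(A) - row_cosines(UL_k))))
```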
For precise (full-rank) data, the uncentered correlation coefficient computed for any two rows of the matrix U*L is equivalent to the same coefficient computed for the corresponding two rows of the original document-term matrix. The uncentered correlation coefficient is simply the cosine of the angle between two vectors in the n-dimensional space formed by the words: a dot product divided by the norms of the vectors involved. This cosine can be computed for any chosen pair of vectors, i.e. rows of the matrix. The results can be sorted, and pairs with higher values are documents that match better in the vocabulary used.

Since each document has another 15 documents of the same class and 112 documents of different classes in the collection, we can simply introduce the ratio of correct recognition. For precise SVD the ratio is near 62 percent. According to the main claim of LSA, when we reduce dimensionality (truncate the L matrix) we should observe some improvement in recognition due to noise filtering and the use of latent semantics. That simply did not occur. With reduction in dimensionality the rate of correct recognition only declines. The individual results for a particular document may vary, but statistically, over all documents, the ratio declines as dimensionality is reduced.

In the experiment I constructed the document-term matrix with stop-word filtering and stemming. After SVD is completed, the uncentered correlation coefficient is computed for each possible pair of documents. The coefficients are sorted, and the top 15 are taken as the best estimated matches in the collection. The ratio is computed as the number of correctly recognized classes divided by the maximum possible (which is 15). The experiment can be repeated for any chosen dimensionality, which can be set as a class property. The comment lines in the project contain instructions on how to modify the program.
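The recognition-ratio computation can be sketched as follows (NumPy for brevity; the real corpus and C# implementation ship with the project download, so the data here is a synthetic stand-in with the same shape: 128 documents, 8 classes, 16 documents per class).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the PATENT CORPUS: 8 classes x 16 documents.
n_classes, per_class, n_terms = 8, 16, 200
labels = np.repeat(np.arange(n_classes), per_class)

# Background noise everywhere, plus a block of class-preferred vocabulary.
docs = rng.poisson(0.05, size=(n_classes * per_class, n_terms)).astype(float)
for c in range(n_classes):
    block = slice(c * per_class, (c + 1) * per_class)
    docs[block, c * 20:(c + 1) * 20] += rng.poisson(1.0, size=(per_class, 20))

def recognition_ratio(X, labels, top=15):
    """For each row, take its `top` nearest rows by uncentered correlation
    (cosine) and count how many share the row's class."""
    normed = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)   # a document never matches itself
    hits = 0
    for i in range(len(X)):
        nearest = np.argsort(sim[i])[::-1][:top]
        hits += np.sum(labels[nearest] == labels[i])
    return hits / (len(X) * top)

print("recognition ratio:", recognition_ratio(docs, labels))
```

To reproduce the article's truncation experiment, replace `X` with the first k columns of U scaled by the first k singular values and recompute the ratio for decreasing k.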