Elementary pLSA demo
Visual Studio project for a full-scale demo of pLSA using PATENTCORPUS128
Probabilistic Latent Semantic Analysis (pLSA) was introduced by Hofmann in 1999. The given data are a document-term matrix and the number of presumed categories (2 in this example).
The result is two computed matrices, whose meaning is explained below:
The first matrix has as many columns as there are categories. Each row contains the degree to which the corresponding document belongs to the category associated with each column. In our data the upper two documents are strongly similar to each other and very different from the bottom two, and that can be seen in the computed first matrix. The elements are treated as probabilities that the specific document belongs to a certain category, so the elements of each row sum to 1.0.
The second matrix has the same meaning for words. Each row corresponds to a category, and the elements are the probabilities that a particular word belongs to that category; the elements of each row again sum to 1.0. Very important: for both matrices, the sum of all elements in every row must equal 1.0.
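To make the shapes concrete, here is a minimal C# illustration with hypothetical values (illustrative only, not computed from any corpus): the upper two documents lean toward the first category and the bottom two toward the second, and a small check confirms that every row sums to 1.0.

    using System;

    class OutputShapes
    {
        static void Main()
        {
            // First matrix: rows are documents, columns are categories.
            double[,] docCategory =
            {
                { 0.95, 0.05 },
                { 0.90, 0.10 },
                { 0.08, 0.92 },
                { 0.03, 0.97 }
            };
            // Second matrix: rows are categories, columns are words.
            double[,] categoryWord =
            {
                { 0.70, 0.20, 0.10 },
                { 0.10, 0.30, 0.60 }
            };
            CheckRowSums(docCategory);
            CheckRowSums(categoryWord);
        }

        static void CheckRowSums(double[,] m)
        {
            for (int i = 0; i < m.GetLength(0); i++)
            {
                double s = 0;
                for (int j = 0; j < m.GetLength(1); j++) s += m[i, j];
                Console.WriteLine("row {0} sum = {1}", i, s); // each prints 1 (up to rounding)
            }
        }
    }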
With the given data and the computed result defined, we can consider the solution steps. This requires some background in probability theory, in particular the concept of a likelihood function. Likelihood functions serve different purposes, but the most frequent application is to estimate probabilities from experimental data. For example, let us isolate only the first line of our data table and find the probabilities of occurrence of each word (p1, p2, p3) in the first document. We can say right away that they are 8/12, 2/12 and 2/12. The same values can also be found by maximizing the likelihood function L
L = 8 * log(p1) + 2 * log(p2) + 2 * log(p3),
with constraint p1 + p2 + p3 = 1.0.
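As a quick sanity check (a small sketch, not part of the downloadable project), the following C# fragment evaluates L at the claimed maximum p = (8/12, 2/12, 2/12) and at two other points of the probability simplex; the claimed maximum indeed gives the largest value.

    using System;

    class LikelihoodCheck
    {
        // L from the formula above for the counts 8, 2, 2.
        static double L(double p1, double p2, double p3)
        {
            return 8 * Math.Log(p1) + 2 * Math.Log(p2) + 2 * Math.Log(p3);
        }

        static void Main()
        {
            Console.WriteLine(L(8.0 / 12, 2.0 / 12, 2.0 / 12)); // about -10.41, the maximum
            Console.WriteLine(L(0.5, 0.25, 0.25));              // about -11.09
            Console.WriteLine(L(1.0 / 3, 1.0 / 3, 1.0 / 3));    // about -13.18
        }
    }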
Here the experimental data are constants, and the probabilities are the variables to be estimated. In very many cases Lagrange multipliers work well for obtaining the solution. We form the functional F
F = 8 * log(p1) + 2 * log(p2) + 2 * log(p3) + Q(p1 + p2 + p3 - 1.0)
take partial derivatives with respect to the sought probabilities, equate them to 0, exclude the unknown multiplier Q and solve the resulting equations for the probabilities. For example, dF/dp1 = 8/p1 + Q = 0 gives p1 = -8/Q; substituting all three such expressions into the constraint gives Q = -12, so the result is 8/12, 2/12, 2/12. Obviously, for such an elementary example a likelihood function is overkill; it is shown here only to illustrate the concept. The question of how to derive a likelihood function has no general answer: it is a kind of mathematical model, devised by someone with experience in the art. pLSA is an expansion of the above elementary likelihood function to the whole experimental data matrix (let us call it N), plus its maximization using Lagrange multipliers. There are many articles showing the details; I followed one online lecture to derive the numerical technique, which can be explained in detail once we introduce some elementary operations with matrices that are not typically used in linear algebra, some of which do not even have standard notations.
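For readers who want the numerical technique itself, below is a minimal, self-contained C# sketch of the standard EM iteration for pLSA. It is a paraphrase of the usual textbook updates, not the code of the downloadable project; the matrix names (Pdz, Pzw) and the toy counts are hypothetical. Each pass computes the posterior probability of a category given a document-word pair from the current matrices (E-step), distributes the count N[d, w] over the categories accordingly, and re-normalizes the rows (M-step).

    using System;

    class PlsaSketch
    {
        static void Main()
        {
            // Hypothetical document-term counts: 4 documents, 3 words, 2 categories.
            double[,] N = { { 8, 2, 2 }, { 7, 3, 2 }, { 1, 6, 5 }, { 0, 7, 5 } };
            int D = N.GetLength(0), W = N.GetLength(1), Z = 2;
            Random rnd = new Random(1);

            // Random row-stochastic initial approximations.
            double[,] Pdz = RandomStochastic(D, Z, rnd); // first matrix: P(category | document)
            double[,] Pzw = RandomStochastic(Z, W, rnd); // second matrix: P(word | category)

            for (int iter = 0; iter < 100; iter++)
            {
                double[,] newPdz = new double[D, Z];
                double[,] newPzw = new double[Z, W];
                for (int d = 0; d < D; d++)
                {
                    for (int w = 0; w < W; w++)
                    {
                        if (N[d, w] == 0) continue;
                        // E-step: unnormalized posterior of each category for this cell.
                        double[] post = new double[Z];
                        double norm = 0;
                        for (int z = 0; z < Z; z++) { post[z] = Pdz[d, z] * Pzw[z, w]; norm += post[z]; }
                        // M-step accumulation: split the count N[d, w] among categories.
                        for (int z = 0; z < Z; z++)
                        {
                            double share = N[d, w] * post[z] / norm;
                            newPdz[d, z] += share;
                            newPzw[z, w] += share;
                        }
                    }
                }
                Pdz = NormalizeRows(newPdz);
                Pzw = NormalizeRows(newPzw);
            }
            // Rows of Pdz now give the degree of membership of each document in each category.
        }

        static double[,] RandomStochastic(int rows, int cols, Random rnd)
        {
            double[,] m = new double[rows, cols];
            for (int i = 0; i < rows; i++)
                for (int j = 0; j < cols; j++) m[i, j] = rnd.NextDouble() + 0.1;
            return NormalizeRows(m);
        }

        static double[,] NormalizeRows(double[,] m)
        {
            for (int i = 0; i < m.GetLength(0); i++)
            {
                double s = 0;
                for (int j = 0; j < m.GetLength(1); j++) s += m[i, j];
                if (s > 0) for (int j = 0; j < m.GetLength(1); j++) m[i, j] /= s;
            }
            return m;
        }
    }

Note that the re-normalization after each iteration is exactly the constraint that every row sums to 1.0, which is what the Lagrange multipliers enforce in the derivation above.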
Problems of pLSA:
It has been noted that pLSA can improve solutions obtained by other methods, because the initial approximation is then already close to the sought solution. This is confirmed by my experiment with the 128 files of PATENTCORPUS128. The full C# project for Visual Studio 2010 can be downloaded from the link at the top. In the experiment pLSA improved the correct recognition rate from 62 percent, obtained by LSA, to 64 percent. To me the result looks disappointing, because I expected something better than that. The running time is about half a minute for 128 files on a regular laptop; one million files would require about 65 hours.