~ Semantic Search Art ~

Elementary pLSA demo
Visual Studio project for full scale demo of PLSA using PATENTCORPUS128

Probabilistic Latent Semantic Analysis

pLSA was introduced by Hofmann in 1999. Given data is document-term matrix and number of presumed categories (2 in this example).

doc/wordw1w2w3w4w5
d182200
d291100
d301217
d401218


The result is computed two matrices with meaning explained below:



The first matrix has number of columns equal to the number of categories. Each row contains the degree on which we may consider that document belong to the category associated with the column. Upper two documents have strong similarities and very different from bottom two. That can be seen in computed first matrix. The elements are treated as probabilities that this specific document belong to a certain category. Sums in rows equal to 1.0.

Second matrix has same meaning for words. Row is considered as category and elements are probabilities that a particular word belong to a specific category. Sums in rows equal to 1.0. Very important: for both matrices the sums of all elements in each row must be equal to 1.0.

When given data and computed result is defined we can consider the solution steps. It requires some background in probability theory, particularly, knowledge of concept of likelihood functions. Likelihood functions may be used for different purposes but most frequent application is to estimate probabilities having experimental data. For example, let us isolate only first line of our data table and find the probabilities of occurrence of each word (p1, p2, p3) in the first document. We can say right away that they are 8/12, 2/12 and 2/12. It can be found also by maximization of likelihood function L

L = 8 * log(p1) + 2 * log(p2) + 2 * log(p3),
with constraint p1 + p2 + p3 = 1.0.

Here the experimental data are constants and probabilities are variables that have to be estimated. In very many cases Lagrange multipliers work very well for obtaining the solution. We have to make functional F

F = 8 * log(p1) + 2 * log(p2) + 2 * log(p3) + Q(p1 + p2 + p3 - 1.0)

take partial derivatives with respect to sought probabilities, equate them to 0, exclude unknown multiplier Q and solve obtained equations relatively to probabilities. The result will be 8/12, 2/12, 2/12. Obviously, for such elementary example using a likelihood function is overkill, it is shown here for illustration of the concept. The question how to derive likelihood function has no answer. It is sort of mathematical model, derived by someone with experience in the art. The PLSA is expansion of above elementary likelihood function on the whole experimental data matrix (let us call it N) plus its maximization using Lagrange multipliers. There are many articles showing details I followed one on-line lecture to derive numerical technique, which can be explained in details if we introduce some elementary operations with matrices that are not typically used in linear algebra and some of them do not even have standard notations.
  • Operation 1: R = SpecialProduct(A, B)
  • Operation 2: R = SpecialQuotient(A, B)
  • Operation 3: R = SpecialNorm(A)
The result of operation 1 is matrix where each element equal to the product of two elements of A and B located on the same position. This operation is known as Hadamard product and has notation but the other two are special. The result of operation 2 is matrix where each element is result of division of pair of elements A and B locating on the same position (presuming that division make sense). Operation 3 is normalization of row elements by making its sum equal to 1. Provided above notations the algorithm can be expressed as follows:
  • 1. Make random matrices D and W
  • 2. D = SpecialNorm(D), W = SpecialNorm(W)
  • 3. N = SpecialQuotient(N, D * W)
  • 4. D = SpecialProduct(D, N * WT)
  • 5. W = SpecialProduct(W, DT * N ). Matrix D here is the one known before step 4.
  • 6. Go to step 2 and loop until D and W converge.
For intuitive comprehension of PLSA it can be pointed out that product of computed matrices D and W approximately match matrix of source data with normalized rows. Normalization means making sums of all elements in rows equal to 1.0 by division by constant. That means if we simply normalize document-term matrix and factor it into two constrained matrices we obtain approximate solution, which is not exactly PLSA but close. The details regarding such factorization and comparison to classical PLSA can be found in my other article .

Problems of pLSA:
  • Numerical iterations may converge to a local extremum, which is not the best solution.
  • The numerical solution is very sensitive to initial approximation.
  • It is slow. For couple million documents and a hundred categories it can be running for a week.
Regarding last disadvantage I can say that, if program is going to return something important people will wait. That is rephrase of John Travolta character remark from a movie when someone said that his car is slow. He said: When you are important people will wait.

It has been noted that pLSA can improve solutions obtained by other methods because initial approximation is already close to sought solution. It is confirmed by my experiment with 128 files of PATENT CORPUS. The full C# project for Visual Studio 2010 can be downloaded from the link on the top. In experiment PLSA improved correct recognition rate from 62 percent obtained by LSA to 64 percent. To me result looks disappointing because, I expected something better than that. The time is about half minute for 128 files on regular lap-top. One million files will require about 65 hours.