DEMO of Polar Orthogonalization and comparison to Naive Bayes and HAC.
Polar Orthogonalization is the answer to question if there is anything more effective than Naive Bayes. It is also a good replacement for k-means, knn and improvement for Random Forest. The method is based on the algorithm that is called Polar Decomposition. Theoretical part of Polar Decomposition can be found in the article of A.Bjorck and C.Bowie, An Iterative Algorithm for Computing the Best Estimate of an Orthogonal Matrix. SIAM J.Numer.Anal. 8-2 (1971), pp.358-364. There is a difference between Polar Orthogonalization, used in text clustering and Polar Decomposition, described in the mentioned article. The concept of Polar Orthogonalization is only a part of Polar Decomposition. The matter of Polar Orthogonalization is iterational process applied to rectangular matrix
N i+1 = 1.5 * N i - 0.5 * N i * N i T N i
If matrix is non-degenerate and has norm not larger than 1.0, it converges to matrix with orthogonal columns very quick and the important part, that makes it effective for document clustering, is that the result is product of orthogonal matrices P and QT included into singular value decomposition PLQT. The transformations simply changes singular values by converging them to 1.0. The theoretical proof is mentioned in referenced articles.
At this point I presume that mathematical part is over and move on to practical application of orthogonalization. This method is applicable for a quick clustering of documents with provided training sample. The data represents document-word matrix, and the training sample is a small group of vectors from this matrix with known category. The training sample is subjected to orthogonalization (Gram-Schmidt does not work in this case). Prior to orthogonalization the vectors (or documents) from the training sample that belong to the same category are added together, so the size of matrix is number of categories times number of words, which requires near one second to process. Since number of categories rarely exceed 20, there is no computational challenge opposite to LSA, where matrices have rank over hundred thousand.
After training matrix is orthogonalized all other vectors (documents) from document-word matrix are simply decomposed in this basis. The category is chosen according to the largest linear coefficient or inner product. The DEMO and comparison to Naive Bayes and Hierarchical Agglomerative Clustering can be found on the top link. Users can try different training samples and see the accuracy compared to Naive Bayes and HAC right away. Most of the time accuracy for PO is significantly better than NB and slightly less compared to HAC, however, it is possible to find rare combination of training files where result of NB is more accurate. NB shows less stability. It varies from 30% to 90%, while PO from 60% to 88% with less dispersion. Although HAC shows slightly better precision it can not replace PO, because HAC is computationally expensive. It is not applicable to collections with more than 10,000 files, because it requires building triangular matrix with size equal to size of collection and computing each element of this matrix as inner product. PO works with one run over data similar to knn and k-means, while exposing significantly higher accuracy.
Obviously it is applicable also as classifier for Random Forest. RF conducts direct flow down a binary tree, choosing direction according to classifier. These classifiers can be chosen as two orthogonal vectors using PO algorithm.
I can add remarks on a clustering strategy. Both knn and k-means have same problem. After vector is clustered, it is added to training sample. Sometimes the category is detected by single computed criteria, which is very close to several categories. The difference is frequently a fraction of percent. When the vector is close to two or more categories, it should be discarded, because adding it to either of them can be a mistake. The discarded vectors or, at least, part of them should be investigated, manually categorized and added to correct category in training sample. Then categorization should be restarted. Making several steps with manual categorization of very small number of documents can significantly improve accuracy and any wrong data in the training sample significantly deproves the result.
Using Polar Orthogonalization in document management system
Having all files in large collection get clustered is sort of theoretical task, although it may look like practical application. Real life situations are different. The large document collections may contain hundreds of clusters, from which user needs to identify only a few, so clustering needs to handle scenario of separation of few clusters from all others, number and size of which is and will stay unknown.
To handle this scenario a regular query can be executed on the first step. On the second step user examines the list of returned documents, finds small group of files representing an interest and request to find similar documents in entire collection. I demonstrate this solution on elementary example. Presume we have five documents shown in the table below. Upper three documents and lower two obviously belong to convensionally different clusters.
Presume query returned upper two documents. We split entire collection on two separate groups: returned documents and all others.
Then we apply Polar Orthogonalization.
Then we compute inner product (L1 and L2) for each vector with each row of above orthogonal matrix.
We can sort vectors according to their inner products (L1 and L2) and look at that sorted list is d1, d2, d3, d5, d4. Having this sorted list user may further examine returned files (d3, d5, d4) and decide which of them match the selected category and where to put the border between my category and all others. Computing inner products for the original matrix, skipping orthogonalization, significantly reduces the accuracy. It can not be seen on this tiny example, but it is observed on large collections.
When user finished with isolation of one category from all other documents, he(she) may continue with a new category, so the size of orthogonalized matrix is not always two but always not very large.
Interesting that positive values in PO matrix may be used for so-called feature extraction. Please note that the first category has positive coefficients against the words next, from, wind, sky and second category expose next, from, snow, tree, grass , where next and from are generic words, that can be easy filtered because they have same sign but belong to different categories.