~ Semantic Search Art ~

Demo for comparison of several best clustering technologies

Comparison of the best clustering technologies

This article is a review of my one year research in semantic search and document clustering technologies. I found several algorithms significantly more effective and accurate than the others:

  • Naive Bayes
  • Hierarchical Agglomerative Clustering
  • Polar Orthogonalization
  • k-means with Polar Orthogonalization
  • Random Forest with Polar Orthogonalization of the classifiers

Besides HAC they all need the training sample. In case training sample is not available, it can be easily obtained by HAC. After HAC has created several clusters, they can be examined and used as training sample. In the demo at the top HAC also uses training sample. When merging the clusters, it checks if clusters to be merged contain files from training sample that were categorized differently. If yes, HAC skips this merge and use another best couple. Polar Orthogonalization is based on orthoginalization of matrix discovered by myself twenty years ago. I recently decided to test how it works in application to document clustering and found significant advantage compared to other known technologies. It appeared that PO can be used along with other methods and improve them in this way. Random Forest and k-means are two good examples of improvement by PO. In Random Forest the critical part for the algorithm is construction of the classifier. Articles about Random Forest don't tell how to do that. In my code I make classifier as two orthogonal vectors. For the group of vectors in each node I seek two the most orthogonal vectors and investigate the others. I classify each other vector as being closer to either of them, add them to accordingly selected vector, making two groups, orthogonalize these two vectors and use them as classifier in navigating further down the tree.

My remarks should look clear for those who are familiar with algorithms that I mention in my review. The detailed description and code samples can be found on this site along with references to other articles.

The project at the top can be used for any data corpus, but different data needs different training samples and different expected result for estimation of accuracy. It is recommended to execute it first with PATENTCORPUS128. For this data corpus no changes in code are required besides passing the correct data location to a program. Code is not optimized for a quick processing of several million documents but rather designed to show the concepts and compare algorithms. Interesting that disabling stemming deproves HAC, PO and k-means but improves NB and RF. Disabling PO in k-means and RF allow them to run but reduce accuracy on 25%. Enabling or disabling stemming is done by the flag passed to function WordProcessing.processFiles.

The program also shows how different technologies may be used for simple voting. All five technologies cluster data with slight difference but with accuracy 65 to 90 percent. Applying simple voting improves result only by one or two percent, which is sort of disappointing and clearly shows that adding new but inaccurate technique may not improve the end result.

Having multiple technologies may, however, be used for improving accuracy. When different techniques disagree in clustering a particular document, this document can be examined manually and added to training set. Following this strategy will quickly improve clustering by few human intervention steps. Everyone can do try it yourself. Just spot document classified differently by different technique, add it to training sample and rerun program. After few such steps the accuracy will increase dramatically. This is sort of known phenomenon: machine-human combination leading to significant improvement of result and used by others but in different way (Recommind predictive coding, for example).