Demo for feature extraction
Feature Extraction
Feature extraction is a form of dimensionality reduction. Applied to text documents, it extracts the keywords or phrases that constitute the topic of a document. These can then be used as a query to find similar documents or to distinguish them within a large population.
The demo program at the top shows an elementary but effective feature extraction technique for a group of document vectors. The selected group consists of 3 documents from PATENTCORPUS128. The program builds a document-term matrix and adds together the selected vectors and, separately, all other vectors. Polar Orthogonalization is then applied, and the components containing words that are frequent in the selected group are examined.
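The first steps of this pipeline can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the demo program itself: the toy documents, the function names (`document_term_matrix`, `sum_rows`, `residual`) and the reading that the selected vectors and the remaining vectors are summed into two separate totals are all assumptions. The Polar Orthogonalization step is not reproduced here. A single ordinary Gram-Schmidt projection is used as a plainly named stand-in, simply to show how removing the part of the group vector that is shared with the rest of the collection pushes group-specific words to the top.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, rows); rows[i][j] counts term j in document i."""
    vocab = sorted({term for doc in docs for term in doc.split()})
    index = {term: j for j, term in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for term, count in Counter(doc.split()).items():
            row[index[term]] = count
        rows.append(row)
    return vocab, rows


def sum_rows(rows, indices):
    """Add together the document vectors with the given row indices."""
    width = len(rows[0])
    return [sum(rows[i][j] for i in indices) for j in range(width)]


def residual(group, rest):
    """Stand-in step (plain Gram-Schmidt, NOT Polar Orthogonalization):
    subtract from `group` its projection onto `rest`."""
    dot = sum(g * r for g, r in zip(group, rest))
    norm2 = sum(r * r for r in rest)
    if norm2 == 0:
        return [float(g) for g in group]
    scale = dot / norm2
    return [g - scale * r for g, r in zip(group, rest)]


# Toy corpus of already-stemmed documents (invented for illustration).
docs = [
    "probe scan sampl probe system",
    "substrat probe measur system",
    "engin fuel valv engin system",
    "valv fuel pump system system",
]
selected = [0, 1]  # hypothetical selected group
others = [i for i in range(len(docs)) if i not in selected]

vocab, rows = document_term_matrix(docs)
group_vec = sum_rows(rows, selected)
rest_vec = sum_rows(rows, others)
res = residual(group_vec, rest_vec)

# Terms ordered by how strongly they remain after the shared part is removed.
ranked = sorted(zip(vocab, res), key=lambda pair: -pair[1])
for term, weight in ranked[:5]:
    print(f"{term:10s} {weight:.3f}")
```

In this toy corpus the word "system" occurs in every document, so its weight drops after the projection, while words that occur only in the selected group keep their full counts.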
The coefficient passed to the function getFeature defines the threshold frequency relative to the maximum frequency. The accuracy can be checked with Qscreener by passing the extracted feature as a query. For example, for the files:
C850P7793356.txt C850P7810166.txt C850P7818816.txt
and coefficient = 2.5, the extracted feature is
measur substrat probe sampl scan
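The text does not spell out how the coefficient turns into a threshold, so the following is only one possible reading, written as a sketch: a term is kept when its frequency is at least the maximum frequency divided by the coefficient. The function name `select_terms` is hypothetical and is not the demo's getFeature, and the frequency values are invented purely to make the example runnable; they are not measurements from PATENTCORPUS128.

```python
def select_terms(frequencies, coefficient):
    """Keep terms with frequency >= max_frequency / coefficient.

    This threshold rule is an assumption; the original getFeature may
    apply the coefficient differently.
    """
    if coefficient <= 0:
        raise ValueError("coefficient must be positive")
    if not frequencies:
        return []
    threshold = max(frequencies.values()) / coefficient
    kept = [term for term, freq in frequencies.items() if freq >= threshold]
    # Most frequent terms first.
    return sorted(kept, key=lambda term: -frequencies[term])


# Invented counts, for illustration only.
freqs = {"measur": 10, "substrat": 8, "probe": 6, "sampl": 5, "scan": 4, "devic": 3}
feature = select_terms(freqs, 2.5)
print(" ".join(feature))
```

Under this reading a larger coefficient lowers the threshold and admits more terms into the feature, while a coefficient of 1 keeps only the most frequent term or terms.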
The list of documents returned by Qscreener for this query is shown below.
The result is very accurate. Files whose names start with C850 belong to the same class, and the collection contains 16 files of this class. All 16 are returned within the first 18 files of the list. In other words, 16 of the first 18 results are correct for a query that is a feature extracted automatically from a sample of only 3 files.
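The counts above can be restated in the usual retrieval terms of precision and recall. Those terms are not used in the original text, and the helper name `precision_recall` is hypothetical. The three numbers passed in are the ones stated above: 16 relevant files returned, 18 files examined, and 16 relevant files in the collection.

```python
def precision_recall(relevant_returned, total_returned, total_relevant):
    """Precision: share of returned files that are relevant.
    Recall: share of relevant files that were returned."""
    precision = relevant_returned / total_returned
    recall = relevant_returned / total_relevant
    return precision, recall


# 16 files of the C850 class found within the first 18 results,
# out of 16 such files in the whole collection.
precision, recall = precision_recall(16, 18, 16)
print(f"precision at 18: {precision:.3f}")
print(f"recall at 18:    {recall:.3f}")
```

So about 89% of the first 18 results belong to the target class, and every file of that class has already been retrieved by that point.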