Monday, 6 July 2015

Document Clustering Process

Introduction

Text Mining is the discovery of previously unknown information by automatically extracting it from different written resources. It is a young interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics and computational linguistics. Enormous volumes of data are generated every day, and people must be able to retrieve this data efficiently. This creates the need for document clustering, i.e. the systematic organization of data. Document Clustering is the application of cluster analysis to textual documents and is one part of Text Mining. It has applications in automatic document organization, topic extraction and fast information retrieval or filtering. The aim of this report is to explain the step-wise procedure of document clustering; its scope is the study of document clustering and all of its prerequisites.

Field of Text Mining

The technologies used in text mining processes are:
1. Information Extraction identifies key phrases using pattern matching and rule-based mining algorithms (e.g. Sequential Covering).
2. Topic Tracking lets users choose keywords of interest and notifies them when news related to those topics becomes available. It uses Term Frequency-Inverse Document Frequency (TF-IDF) weighting.
3. Summarization reduces the length and detail of a document while retaining its main points and overall meaning. It can use the fuzzy c-means algorithm.
4. Categorization identifies the main themes of a document by placing it into a pre-defined set of topics. It can use Support Vector Machines.
5. Clustering is a technique used to group similar documents. It includes K-Means clustering and word-relativity-based clustering, supported by stemming and filtering.
6. Concept Linkage can identify links between related concepts, for example between diseases and treatments.
7. Information Visualization is conducted in three steps: data preparation, data analysis and extraction, and visualization mapping.
8. Question Answering deals with how to find the best answer to a given question.

Analysis & Literature Study

Document Clustering consists of multiple step-wise operations:
1. Website Crawling & Indexing
2. Pre-processing
3. Document Clustering
4. Post-Processing
Crawling is the process where the links of a given set of websites are traversed to gather all relevant pages, which are then indexed. Pre-processing is a technique for converting the indexed pages into a suitable data representation; it includes filtering, term weighting, stemming, etc. Document Clustering is the actual clustering of the documents on the basis of different distance measures. Post-processing is the application of the document clusters in the information retrieval process.

Categorization of Clustering Methods

1. Partitioning-Based Clustering, which includes the Single-Pass algorithm and relocation-based partitioning (K-Means).
2. Hierarchical Clustering, which includes Agglomerative methods (subdivided into Single Link, e.g. the CURE algorithm, Complete Link and Group Average) and Divisive methods (Principal Direction Divisive Partitioning).
3. Keyword-Based Clustering, which includes Frequent Itemset-Based Clustering (Apriori) and Lattice-Based Clustering (GALOIS).
4. Model-Based Clustering, which includes statistical methods and neural network methods.

Improving Clustering Results

1. Relevance Feedback takes the results initially returned for a given query and uses information about whether or not those results are relevant to construct a new, improved query.
2. Latent Semantic Indexing (LSI) is an indexing and retrieval method that uses Singular Value Decomposition (SVD) to identify patterns in the relationships between terms and concepts, as in the sketch below.
3. Non-Negative Matrix Factorization factorizes the document-term matrix, which is constructed from the weights of the various terms in the given set of documents.
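A minimal LSI sketch in Java, assuming the Apache Commons Math library (commons-math3) for the SVD; the tiny term-document matrix and the choice of k = 2 latent dimensions are illustrative only.

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

public class LsiSketch {
    public static void main(String[] args) {
        // Rows = terms, columns = documents (made-up term frequencies).
        double[][] termDoc = {
            {2, 0, 1, 0},
            {1, 1, 0, 0},
            {0, 2, 0, 1},
            {0, 0, 1, 2}
        };
        RealMatrix a = new Array2DRowRealMatrix(termDoc);

        // Full SVD: A = U * S * V^T.
        SingularValueDecomposition svd = new SingularValueDecomposition(a);

        // Keep only the k largest singular values (rank-k approximation).
        // The rows of the truncated V give each document's coordinates in
        // the k-dimensional latent "concept" space.
        int k = 2;
        RealMatrix vk = svd.getV().getSubMatrix(0, termDoc[0].length - 1, 0, k - 1);
        for (int d = 0; d < vk.getRowDimension(); d++) {
            System.out.printf("doc %d -> (%.3f, %.3f)%n",
                    d, vk.getEntry(d, 0), vk.getEntry(d, 1));
        }
    }
}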

Crawling & Indexing

To create the database of documents on the web, we need to crawl the web. We start with a seed set of sites that are known to be of very high quality, then visit the links on each page of those sites to discover other pages, and index those pages in our database.
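A minimal sketch of such a crawler, assuming the jsoup HTML-parsing library (an external dependency); the seed URL and page limit are placeholders, and a real crawler would also need politeness delays, robots.txt handling and URL normalization.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.*;

public class CrawlerSketch {
    public static void main(String[] args) throws Exception {
        Deque<String> frontier = new ArrayDeque<>(List.of("https://example.com/"));
        Set<String> visited = new HashSet<>();
        int maxPages = 50;  // keep the sketch bounded

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;  // skip already-seen pages

            Document page = Jsoup.connect(url).get();
            // "Index" the page: in the real program this record would be
            // stored in the database instead of printed.
            System.out.println(url + " | " + page.title());

            // Follow outgoing links to discover new pages.
            for (Element link : page.select("a[href]")) {
                String next = link.absUrl("href");
                if (!next.isEmpty()) frontier.add(next);
            }
        }
    }
}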

Pre-processing Techniques

Filtering removes irrelevant words, which reduces the size of the database as well as serving the critical purpose of removing noise from it.
Term Weighting is done to prioritize the relevant terms. A term's weight is typically scaled down by how widely the term occurs across all the documents, so that rare, discriminative terms score higher (as in TF-IDF, sketched below).
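A minimal term-weighting sketch using the common TF-IDF variant tf * log(N / df), where N is the number of documents and df is the number of documents containing the term; the toy corpus is made up for illustration.

import java.util.*;

public class TfIdfSketch {
    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("cluster", "document", "cluster"),
            List.of("document", "retrieval"),
            List.of("cluster", "retrieval", "retrieval"));

        // Document frequency: number of documents containing each term.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs)
            for (String term : new HashSet<>(doc))
                df.merge(term, 1, Integer::sum);

        // Weight of each term in document 0: tf * log(N / df).
        List<String> doc0 = docs.get(0);
        for (String term : new HashSet<>(doc0)) {
            long tf = doc0.stream().filter(term::equals).count();
            double idf = Math.log((double) docs.size() / df.get(term));
            System.out.printf("%s: %.4f%n", term, tf * idf);
        }
    }
}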
Stemming is the conflation of the variant forms of a word into a single representation, the stem. For example, the terms presentation, presenting, and presented could all be stemmed to present. The stem does not need to be a valid word, but it must capture the meaning of the word. There are several criteria for judging stemmers: correctness, retrieval effectiveness, and compression performance. Stemming can go wrong in two ways: over-stemming, where too much of a term is removed, and under-stemming, where too little is removed. In English, parsing compound words is difficult, but it may be easier in languages like Danish.

Types of Stemming Algorithms    

Table Lookup Method

There are several approaches to stemming. One way to do stemming is to store a table of all index terms and their stems. For example:

Term         Stem
-----------  --------
engineering  engineer
engineered   engineer
engineer     engineer
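A minimal sketch of this approach: a plain HashMap mirroring the table above, falling back to the term itself when no entry exists. The obvious drawback is that every index term and its stem must be stored.

import java.util.HashMap;
import java.util.Map;

public class LookupStemmer {
    private static final Map<String, String> TABLE = new HashMap<>();
    static {
        TABLE.put("engineering", "engineer");
        TABLE.put("engineered", "engineer");
        TABLE.put("engineer", "engineer");
    }

    // Fall back to the term itself when it is not in the table.
    static String stem(String term) {
        return TABLE.getOrDefault(term, term);
    }

    public static void main(String[] args) {
        System.out.println(stem("engineered"));  // engineer
        System.out.println(stem("cluster"));     // cluster (no entry)
    }
}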


N-gram stemmers

In this approach, association measures are calculated between pairs of terms based on their shared unique digrams (pairs of consecutive letters); terms that are strongly associated are then grouped together, as in the sketch below.
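A minimal sketch of such an association measure, using Dice's coefficient, 2 * shared / (uniqueA + uniqueB), over the sets of unique digrams of two terms.

import java.util.HashSet;
import java.util.Set;

public class DigramSimilarity {
    // Collect the unique digrams (adjacent letter pairs) of a term.
    static Set<String> digrams(String term) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i < term.length() - 1; i++)
            grams.add(term.substring(i, i + 2));
        return grams;
    }

    // Dice's coefficient: 1.0 means identical digram sets, 0.0 disjoint.
    static double dice(String a, String b) {
        Set<String> da = digrams(a), db = digrams(b);
        Set<String> shared = new HashSet<>(da);
        shared.retainAll(db);
        return 2.0 * shared.size() / (da.size() + db.size());
    }

    public static void main(String[] args) {
        // Highly similar variants score close to 1.0.
        System.out.println(dice("statistics", "statistical"));
    }
}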

Affix Removal Stemmers

Affix removal algorithms remove suffixes and/or prefixes from terms, leaving a stem. These algorithms sometimes also transform the resultant stem. A simple example of an affix removal stemmer is one that removes plurals from terms using a small set of rules.
The Porter algorithm is widely used to transform terms into stems. It consists of a set of condition/action rules. The conditions fall into three classes: conditions on the stem, conditions on the suffix, and conditions on the rules.
Stem Conditions
1. The measure, denoted m, of a stem is based on its alternating vowel-consonant sequences. Vowels are a, e, i, o, u, and y if preceded by a consonant. Let C stand for a sequence of consonants and V for a sequence of vowels. The measure m is then defined by

[C](VC)^m[V]

where the bracketed parts are optional.

Measure   Examples
-------   -------------------------
m=0       TR, EE, TREE, Y, BY
m=1       TROUBLE, OATS, TREES, IVY
m=2       TROUBLES, PRIVATE, OATEN
2. *X - the stem ends with the given letter X
3. *v* - the stem contains a vowel
4. *d - the stem ends in a double consonant
5. *o - the stem ends with a consonant-vowel-consonant sequence, where the final consonant is not w, x, or y
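A minimal sketch that computes the measure m directly from this definition, treating y as a vowel only when it follows a consonant; the example stems reproduce the table above.

public class PorterMeasure {
    static boolean isVowel(String s, int i) {
        char c = s.charAt(i);
        if ("aeiou".indexOf(c) >= 0) return true;
        // y counts as a vowel only when preceded by a consonant.
        return c == 'y' && i > 0 && !isVowel(s, i - 1);
    }

    // m = number of VC sequences in the [C](VC)^m[V] decomposition.
    static int measure(String stem) {
        int m = 0;
        boolean prevVowel = false;
        for (int i = 0; i < stem.length(); i++) {
            boolean vowel = isVowel(stem, i);
            if (prevVowel && !vowel) m++;  // a V-to-C boundary closes one VC
            prevVowel = vowel;
        }
        return m;
    }

    public static void main(String[] args) {
        System.out.println(measure("tree"));     // 0
        System.out.println(measure("trouble"));  // 1
        System.out.println(measure("oaten"));    // 2
    }
}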


Document Clustering Algorithms

It covers the various algorithms listed below.

1. K-Means Algorithm partitions documents into k clusters by repeatedly assigning each document to its nearest centroid and recomputing the centroids (see the sketch after this list).
2. Bisecting K-Means Algorithm is a combination of K-Means and hierarchical clustering.
3. Bradley-Fayyad-Reina (BFR) Algorithm is a variant of K-Means for very large (disk-resident) data sets. It assumes that points are normally distributed around a centroid in Euclidean space.
4. CURE Algorithm is used for detecting non-convex cluster shapes. It uses Euclidean distance and a collection of representative points to represent clusters.
5. Principal Direction Divisive Partitioning (PDDP) is a hierarchical algorithm. PDDP uses a one-dimensional singular value decomposition to determine the principal direction of a document set, and splits the set into two subsets according to their projection onto the principal direction.
6. GALOIS generates concepts (clusters of objects) by automatically building a Galois lattice from a set of objects and their corresponding attributes. The clusters resulting from GALOIS are thus ordered in a lattice and characterized by both an extensional description and an intensional (conceptual) description.
7. Apriori Algorithm is used to find frequent itemsets in a given transaction set. It also uses minimum support and minimum confidence thresholds to derive association rules from the itemsets.
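A minimal K-Means sketch over document vectors (for example, TF-IDF weights). The two-dimensional vectors are toy data, the iteration count is fixed rather than convergence-tested, and the initial centroids are simply two of the input vectors (real implementations usually use random seeding or K-Means++).

import java.util.Arrays;

public class KMeansSketch {
    public static void main(String[] args) {
        double[][] docs = { {1, 0}, {0.9, 0.1}, {0, 1}, {0.1, 0.9} };
        int k = 2, iterations = 10;
        double[][] centroids = { docs[0].clone(), docs[2].clone() };
        int[] assign = new int[docs.length];

        for (int it = 0; it < iterations; it++) {
            // Assignment step: nearest centroid by squared Euclidean distance.
            for (int d = 0; d < docs.length; d++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = 0;
                    for (int j = 0; j < docs[d].length; j++)
                        dist += Math.pow(docs[d][j] - centroids[c][j], 2);
                    if (dist < best) { best = dist; assign[d] = c; }
                }
            }
            // Update step: each centroid becomes the mean of its members.
            for (int c = 0; c < k; c++) {
                double[] sum = new double[docs[0].length];
                int count = 0;
                for (int d = 0; d < docs.length; d++)
                    if (assign[d] == c) {
                        count++;
                        for (int j = 0; j < sum.length; j++) sum[j] += docs[d][j];
                    }
                if (count > 0)
                    for (int j = 0; j < sum.length; j++) centroids[c][j] = sum[j] / count;
            }
        }
        System.out.println(Arrays.toString(assign));  // e.g. [0, 0, 1, 1]
    }
}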

Results & Conclusion


We have implemented a program in Java to crawl the web and store the results in a MySQL database. We have also implemented the Porter algorithm, N-grams, and the basic K-Means algorithm. Having gone through the document clustering literature thoroughly, we found that the K-Means algorithm dominates the clustering process. Combining two algorithms offers the chance of better performance; for example, mixing PDDP and K-Means is expected to give better results.
