Introduction
Text Mining is the discovery of previously unknown information by automatically extracting it from different written resources. It is a young interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. In the world of computer technology, billions of gigabytes of data are generated every ten minutes, and people must be able to retrieve this data efficiently. This creates the need for document clustering, i.e., the systematic organization of data. Document clustering is the application of cluster analysis to textual documents, and it is a small part of text mining. It has applications in automatic document organization, topic extraction, and fast information retrieval or filtering. The aim of this report is to explain the step-wise procedure of document clustering; its scope is the study of document clustering, and it explains all the prerequisites for it.
Field of Text Mining
The technologies used in text mining processes are:
1. Information Extraction identifies key phrases using a pattern-matching process and rule-based mining algorithms (Sequential Covering).
2. Topic Tracking allows users to choose keywords of interest and notifies them when news related to those topics becomes available. It uses the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm.
3. Summarization reduces the length and detail of a document while retaining its main points and overall meaning. It uses the fuzzy c-means algorithm.
4. Categorization involves identifying the main themes of a document by placing the document into a pre-defined set of topics. It uses Support Vector Machines.
5. Clustering is a technique used to group similar documents. It includes K-Means clustering, word-relativity-based clustering, stemming, and filtering.
6. Concept Linkage can identify links between related concepts, for example between diseases and treatments.
7. Information Visualization is conducted in three steps: data preparation, data analysis and extraction, and visualization mapping.
8. Question Answering deals with how to find the best answer to a given question.
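The TF-IDF weighting mentioned under Topic Tracking can be sketched in a few lines. This is a minimal illustration, not a full retrieval system; the tiny corpus and the class name are ours:

```java
import java.util.*;

// Minimal TF-IDF sketch: the weight of a term in a document is
// tf(t, d) * log(N / df(t)), where N is the number of documents
// and df(t) is how many documents contain t.
public class TfIdf {
    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        long tf = doc.stream().filter(w -> w.equals(term)).count();
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        if (tf == 0 || df == 0) return 0.0;
        return tf * Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
            List.of("cats", "chase", "mice"),
            List.of("dogs", "chase", "cats"),
            List.of("mice", "eat", "cheese"));
        // "chase" appears in 2 of 3 documents; "cheese" in only 1,
        // so "cheese" gets the higher weight in its document.
        System.out.println(tfIdf("chase", corpus.get(0), corpus));
        System.out.println(tfIdf("cheese", corpus.get(2), corpus));
    }
}
```

Terms that occur in many documents are thus pushed toward zero weight, which is exactly why Topic Tracking can use TF-IDF to pick out distinctive keywords.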
Analysis & Literature Study
Document clustering consists of multiple step-wise operations:
1. Website Crawling & Indexing
2. Pre-processing
3. Document Clustering
4. Post-processing
Crawling is the process in which the links of a given set of websites are traversed to gather all relevant pages, which are then indexed. Pre-processing is a technique of converting the indexed documents into a suitable data representation; it includes filtering, term weighting, stemming, etc. Document clustering is the actual clustering of documents on the basis of different distance measures. Post-processing is the application of document clustering to the information retrieval process.
Categorization of Clustering Methods
1. Partitioning-Based Clustering, which includes the Single-Pass Algorithm and relocation-based partitioning (K-Means).
2. Hierarchical Clustering, which includes Agglomerative methods (sub-divided into Single-Link, e.g. the CURE algorithm, Complete-Link, and Group-Average) and Divisive methods (Principal Direction Divisive Partitioning).
3. Keyword-Based Clustering, which includes Frequent-Itemset-based Clustering (Apriori) and Lattice-Based Clustering (GALOIS).
4. Model-Based Clustering, which includes statistical methods and neural network methods.
Improving Clustering Results
1. Relevance Feedback takes the results initially returned for a given query and uses information about whether or not those results are relevant to perform a new query.
2. Latent Semantic Indexing (LSI) is an indexing and retrieval method that uses Singular Value Decomposition (SVD) to identify patterns between terms and concepts.
3. Non-Negative Matrix Factorization works on the document-term matrix, which is constructed using the weights of the various terms in the given set of documents.
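The document-term matrix that Non-Negative Matrix Factorization starts from can be built as below. This is a minimal sketch using raw term counts (a real system would substitute TF-IDF or similar weights); the vocabulary and documents are invented:

```java
import java.util.*;

// Building a document-term matrix: rows are documents, columns are
// vocabulary terms, and each entry is the raw count of that term
// in that document.
public class DocTermMatrix {
    static int[][] build(List<List<String>> docs, List<String> vocab) {
        int[][] m = new int[docs.size()][vocab.size()];
        for (int i = 0; i < docs.size(); i++)
            for (String w : docs.get(i)) {
                int j = vocab.indexOf(w);   // column for this term
                if (j >= 0) m[i][j]++;      // ignore out-of-vocabulary words
            }
        return m;
    }

    public static void main(String[] args) {
        List<String> vocab = List.of("cats", "dogs", "chase");
        List<List<String>> docs = List.of(
            List.of("cats", "chase", "cats"),
            List.of("dogs", "chase"));
        System.out.println(Arrays.deepToString(build(docs, vocab)));
        // prints [[2, 0, 1], [0, 1, 1]]
    }
}
```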
Crawling & Indexing
To create the database of documents on the web, we need to crawl the web. We start with a seed set of sites that are known to be of very high quality, and then visit the links on each page of those sites to discover other pages, which we index in our database.
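The crawl loop just described is essentially a breadth-first traversal from the seed set. The sketch below runs over an in-memory link graph instead of real HTTP fetches; the site names and the class are invented for illustration:

```java
import java.util.*;

// Crawl sketch: starting from a seed set, traverse outgoing links
// breadth-first and "index" (collect) every page discovered.
public class CrawlerSketch {
    static Set<String> crawl(Map<String, List<String>> links, List<String> seeds) {
        Set<String> indexed = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>(seeds);
        while (!frontier.isEmpty()) {
            String page = frontier.poll();
            if (!indexed.add(page)) continue;            // already indexed
            for (String out : links.getOrDefault(page, List.of()))
                frontier.add(out);                       // follow outgoing links
        }
        return indexed;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "seed.example", List.of("a.example", "b.example"),
            "a.example", List.of("c.example"),
            "b.example", List.of("a.example"));
        // discovers all four pages reachable from the seed
        System.out.println(crawl(links, List.of("seed.example")));
    }
}
```

A production crawler would add politeness delays, robots.txt handling, and persistent storage (the report's implementation stores pages in MySQL), but the frontier/visited-set structure is the same.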
Pre-processing Techniques
Filtering out irrelevant words reduces the size of the database as well as serving the critical purpose of removing noise from it.
Term weighting is done to prioritize the relevant terms. A simple scheme takes a term's weight as the reciprocal of its total count across all the documents, so that ubiquitous terms receive low weights.
Stemming is the conflation of the variant forms of a word into a single representation, i.e., the stem. For example, the terms presentation, presenting, and presented could all be stemmed to present. The stem does not need to be a valid word, but it must capture the meaning of the word. There are several criteria for judging stemmers: correctness, retrieval effectiveness, and compression performance. There are two ways stemming can be incorrect: over-stemming and under-stemming. When a term is over-stemmed, too much of it is removed; under-stemming is the removal of too little of a term. In English, parsing is difficult for compound words, but it may be easier in languages like Danish.
Types of Stemming Algorithms
Table Lookup method
There are several approaches to stemming. One way to
do stemming is to store a table of all index terms and their stems. For
example:
Term         Stem
engineering  engineer
engineered   engineer
engineer     engineer
N-gram Stemmers
In this approach, association measures are calculated between pairs of terms based on their shared unique digrams (pairs of consecutive letters).
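One common association measure over shared digrams is Dice's coefficient, 2|A∩B| / (|A| + |B|), where A and B are the two terms' sets of unique digrams. A minimal sketch (class name ours):

```java
import java.util.*;

// Digram-based association between two terms, using Dice's
// coefficient over the sets of unique digrams.
public class DigramSimilarity {
    static Set<String> digrams(String term) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= term.length(); i++)
            grams.add(term.substring(i, i + 2));  // each adjacent letter pair
        return grams;
    }

    static double dice(String a, String b) {
        Set<String> ga = digrams(a), gb = digrams(b);
        Set<String> shared = new HashSet<>(ga);
        shared.retainAll(gb);                     // intersection
        return 2.0 * shared.size() / (ga.size() + gb.size());
    }

    public static void main(String[] args) {
        // "statistics" and "statistical" share most digrams (high score);
        // an unrelated term shares none (score 0).
        System.out.println(dice("statistics", "statistical"));
        System.out.println(dice("statistics", "banana"));
    }
}
```

Terms whose pairwise similarity exceeds a threshold can then be grouped into the same stem class.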
Affix Removal Stemmers
Affix removal algorithms remove suffixes and/or prefixes from terms, leaving a stem. These algorithms sometimes also transform the resultant stem. A simple example of an affix removal stemmer is one that removes plurals from terms, using a small set of rules.
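Such a rule set can be sketched as follows. The rules below follow a common textbook formulation of plural removal and deliberately ignore irregular plurals:

```java
// Plural-removal stemmer sketch:
//   ...ies (but not ...eies/...aies) -> ...y
//   ...es  (but not ...aes/...ees/...oes) -> ...e
//   ...s   (but not ...us/...ss) -> drop the s
public class PluralStemmer {
    static String stem(String word) {
        if (word.endsWith("ies") && !word.endsWith("eies") && !word.endsWith("aies"))
            return word.substring(0, word.length() - 3) + "y";   // ponies -> pony
        if (word.endsWith("es") && !word.endsWith("aes")
                && !word.endsWith("ees") && !word.endsWith("oes"))
            return word.substring(0, word.length() - 2) + "e";   // horses -> horse
        if (word.endsWith("s") && !word.endsWith("us") && !word.endsWith("ss"))
            return word.substring(0, word.length() - 1);         // cats -> cat
        return word;                                             // class -> class
    }

    public static void main(String[] args) {
        System.out.println(stem("ponies")); // pony
        System.out.println(stem("horses")); // horse
        System.out.println(stem("class"));  // class
    }
}
```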
The Porter algorithm is used to transform terms into stems. It consists of a set of condition/action rules. The conditions fall into three classes: conditions on the stem, conditions on the suffix, and conditions on the rules.
Stem Conditions
1. The measure, denoted m, of a stem is based on its alternating vowel-consonant sequences. Vowels are a, e, i, o, u, and y if preceded by a consonant. Let C stand for a sequence of consonants and V for a sequence of vowels. The measure m, then, is defined by the form

[C](VC)^m[V]

Measure  Examples
---------------------------------------
m=0      TR, EE, TREE, Y, BY
m=1      TROUBLE, OATS, TREES, IVY
m=2      TROUBLES, PRIVATE, OATEN

2. *X -- the stem ends with the given letter X
3. *v* -- the stem contains a vowel
4. *d -- the stem ends in a double consonant
5. *o -- the stem ends with a consonant-vowel-consonant sequence, where the final consonant is not w, x, or y
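The measure m can be computed by counting vowel-to-consonant transitions while scanning the stem, since each (VC) pair closes with exactly one such transition. A minimal sketch (class name ours):

```java
// Computing Porter's measure m from a stem's consonant/vowel
// pattern; y counts as a vowel when preceded by a consonant.
public class PorterMeasure {
    static boolean isVowel(String s, int i) {
        char c = s.charAt(i);
        if ("aeiou".indexOf(c) >= 0) return true;
        return c == 'y' && i > 0 && !isVowel(s, i - 1);
    }

    static int measure(String stem) {
        String s = stem.toLowerCase();
        int m = 0;
        boolean prevVowel = false;
        for (int i = 0; i < s.length(); i++) {
            boolean v = isVowel(s, i);
            if (prevVowel && !v) m++;   // a V->C transition closes one (VC) pair
            prevVowel = v;
        }
        return m;
    }

    public static void main(String[] args) {
        System.out.println(measure("tree"));     // 0
        System.out.println(measure("trouble"));  // 1
        System.out.println(measure("troubles")); // 2
    }
}
```

The values match the table above, e.g. TREE has no vowel followed by a consonant (m=0), while TROUBLES has two such transitions (m=2).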
Document Clustering Algorithms
The main algorithms are listed below:
1. K-Means Algorithm.
2. Bisecting K-Means Algorithm, a combination of K-Means and hierarchical clustering.
3. The Bradley-Fayyad-Reina (BFR) Algorithm, a variant of K-Means for very large (disk-resident) data sets. It assumes that points are normally distributed around a centroid in Euclidean space.
4. The CURE Algorithm, used for detecting non-convex cluster shapes. It uses Euclidean distance and a collection of representative points to represent clusters.
5. Principal Direction Divisive Partitioning (PDDP), a hierarchical algorithm. PDDP uses a one-dimensional singular value decomposition to determine the principal direction of a document set, and splits the set into two subsets according to their projection on the principal direction.
6. GALOIS generates concepts (clusters of objects) by automatically building a Galois lattice from a set of objects and their corresponding attributes. The clusters resulting from GALOIS are thus ordered in a lattice and characterized by both an extensional description and an intensional (conceptual) description.
7. The Apriori Algorithm is used to find frequent itemsets in a given transaction set. It also uses minimum support and minimum confidence thresholds to find association rules from the itemsets.
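As a concrete illustration of the K-Means loop that several of the algorithms above build on, here is a minimal one-dimensional sketch. Real document clustering would operate on high-dimensional term vectors (often with cosine distance), and all numbers below are invented:

```java
import java.util.*;

// 1-D K-Means sketch: assign each point to its nearest centroid,
// recompute each centroid as its cluster's mean, and repeat.
public class KMeansSketch {
    static double[] kmeans(double[] points, double[] centroids, int iters) {
        double[] c = centroids.clone();
        for (int it = 0; it < iters; it++) {
            double[] sum = new double[c.length];
            int[] count = new int[c.length];
            for (double p : points) {              // assignment step
                int best = 0;
                for (int j = 1; j < c.length; j++)
                    if (Math.abs(p - c[j]) < Math.abs(p - c[best])) best = j;
                sum[best] += p;
                count[best]++;
            }
            for (int j = 0; j < c.length; j++)     // update step
                if (count[j] > 0) c[j] = sum[j] / count[j];
        }
        return c;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.2, 0.8, 9.0, 9.5, 8.5};
        double[] result = kmeans(points, new double[]{0.0, 10.0}, 10);
        System.out.println(Arrays.toString(result)); // centroids near 1.0 and 9.0
    }
}
```

Bisecting K-Means repeatedly applies this same loop with k=2 to split the largest cluster, which is how it combines K-Means with hierarchical clustering.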
Results & Conclusion
We have implemented a program in Java to crawl the web and store the results in a MySQL database. We have also implemented the Porter algorithm and N-grams, as well as the basic K-Means algorithm. We have gone through the literature on document clustering thoroughly, and we found that the K-Means algorithm dominates the clustering process. Combining two algorithms offers chances of better performance; for example, mixing PDDP and K-Means is expected to give better results.