Big Data Clustering
Monday, 6 July 2015
Document Clustering Process
Introduction
Text Mining is the discovery of previously unknown information by automatically extracting it from different written resources. It is a young interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics and computational linguistics. In the world of computer technology, every 10 minutes billions of gigabytes of data are generated, and people must be able to retrieve this data efficiently. This creates the need for document clustering, i.e. the systematic organization of data. Document Clustering is the application of cluster analysis to textual documents and is a small part of text mining. It has applications in automatic document organization, topic extraction and fast information retrieval or filtering. The aim of this report is to explain the step-wise procedure of document clustering. The scope of this report is the study of document clustering, and it explains all the prerequisites for document clustering.
Field of Text Mining
The technologies used in the text mining process are:
1. Information Extraction identifies key phrases using a pattern matching process and rule-based mining algorithms (e.g. Sequential Covering).
2. Topic Tracking allows users to choose keywords and notifies them when news related to those topics becomes available. It uses the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm.
3. Summarization reduces the length and detail of a document while retaining its main points and overall meaning. It uses the fuzzy c-means algorithm.
4. Categorization involves identifying the main themes of a document by placing the document into a pre-defined set of topics. It uses Support Vector Machines.
5. Clustering is a technique used to group similar documents. It includes K-Means clustering, the word relativity based clustering method, stemming and filtering.
6. Concept Linkage can identify links between diseases and treatments.
7. Information Visualization is conducted in three steps: data preparation, data analysis and extraction, and visualization mapping.
8. Question Answering deals with how to find the best answer to a given question.
Analysis & Literature Study
Document Clustering consists of multiple step-wise operations:
1. Website Crawling & Indexing
2. Pre-processing
3. Document Clustering
4. Post-Processing
Website Crawling & Indexing
Crawling is the process where the links of a given set of websites are traversed to gather all relevant pages. Pre-processing is a technique for converting indexes into a suitable data representation. It includes filtering, term weighting, stemming etc.
Document Clustering is the actual clustering of documents on the basis of different distance measures. Post-processing is the application of document clustering to the information retrieval process.
Categorization of Clustering Methods
1. Partitioning Based Clustering, which includes the Single-Pass Algorithm and relocation based partitioning (K-Means).
2. Hierarchical Clustering includes Agglomerative methods (subdivided into Single Link - CURE Algorithm, Complete Link and Group Average) and Divisive methods (Principal Direction Divisive Partitioning).
3. Keyword Based Clustering includes Frequent Itemset based Clustering (Apriori) and Lattice Based Clustering (GALOIS).
4. Model Based Clustering includes statistical methods and neural network methods.
Improving Clustering Results
1. Relevance Feedback takes the results that are initially returned from a given query and uses information about whether or not those results are relevant to perform a new query.
2. Latent Semantic Indexing (LSI) is an indexing and retrieval method that uses Singular Value Decomposition (SVD) to identify patterns between terms and concepts.
3. Non-Negative Matrix Factorization forms a document-term matrix which is constructed using the weights of various terms in the given set of documents.
Crawling & Indexing
To create the database of documents on the web, we need to crawl the web. We start with a seed set of sites that are known to be of very high quality, then visit the links on each page of those sites to discover other web pages, and index them in our database.
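A minimal sketch of this crawl-and-index loop in Python is shown below. The requests and BeautifulSoup libraries and the in-memory dictionary index are illustrative assumptions; the actual implementation described later in this post uses Java and a MySQL database.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl starting from a seed set of high-quality sites."""
    to_visit = list(seed_urls)
    index = {}                          # url -> page text (stand-in for a real database)
    seen = set(to_visit)
    while to_visit and len(index) < max_pages:
        url = to_visit.pop(0)
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue                    # skip unreachable pages
        soup = BeautifulSoup(html, "html.parser")
        index[url] = soup.get_text()    # "index" the page content
        for link in soup.find_all("a", href=True):
            absolute = urljoin(url, link["href"])
            if absolute not in seen:    # follow newly discovered links
                seen.add(absolute)
                to_visit.append(absolute)
    return index

# Example usage: index = crawl(["https://example.com"])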
Pre-processing Techniques
Filtering irrelevant words can reduce the size of the database as well as serve the critical purpose of removing noise from the database.
Term Weighting is done to prioritize the relevant terms. A term's weight may, for example, be taken as the reciprocal of the total number of terms available in all the documents.
Stemming is the conflation of the variant forms of a word into a single representation, i.e. the stem. For example, the terms presentation, presenting, and presented could all be stemmed to present. The stem does not need to be a valid word, but it must capture the meaning of the word. There are several criteria for judging stemmers: correctness, retrieval effectiveness, and compression performance. There are two ways stemming can be incorrect: over-stemming and under-stemming. When a term is over-stemmed, too much of it is removed; under-stemming is the removal of too little of a term. In English, parsing is difficult for conjoined (compound) words, but it may be easier in languages like Danish.
Types of Stemming Algorithms
Table Lookup method
There are several approaches to stemming. One way to do stemming is to store a table of all index terms and their stems. For example:

Term          Stem
---------------------------------------
engineering   engineer
engineered    engineer
engineer      engineer
N-gram stemmers
In this approach, association measures are calculated between pairs of terms based on shared unique digrams (pairs of consecutive letters).
Affix Removal Stemmers
Affix removal algorithms remove suffixes and/or prefixes from terms, leaving a stem. These algorithms sometimes also transform the resultant stem. A simple example of an affix removal stemmer is one that removes the plurals from terms, using a small set of rules.
The Porter algorithm is used to transform terms into stems. It consists of a set of condition/action rules. The conditions fall into three classes: conditions on the stem, conditions on the suffix, and conditions on the rules. A short usage sketch of the Porter stemmer is given after the stem conditions below.
Stem Conditions
1. The measure, denoted m, of a stem is based on its alternating vowel-consonant sequences. Vowels are a, e, i, o, u, and y if preceded by a consonant. Let C stand for a sequence of consonants and V for a sequence of vowels. The measure m is then defined by the pattern
[C](VC)^m[V]

Measure   Examples
---------------------------------------
m=0       TR, EE, TREE, Y, BY
m=1       TROUBLE, OATS, TREES, IVY
m=2       TROUBLES, PRIVATE, OATEN

2. *<X> - the stem ends with the given letter X
3. *v* - the stem contains a vowel
4. *d - the stem ends in a double consonant
5. *o - the stem ends with a consonant-vowel-consonant sequence, where the final consonant is not w, x, or y
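The Porter stemmer is available in common NLP libraries; a minimal usage sketch in Python follows, assuming the NLTK package (an illustrative choice, not something the post itself uses).

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["presentation", "presenting", "presented", "trees", "troubles"]:
    # Each variant form is conflated to a single stem
    print(word, "->", stemmer.stem(word))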
Document Clustering Algorithms
It covers the various algorithms listed below:
1. K-Means Algorithm
2. Bisecting K-Means Algorithm is a combination of K-Means and hierarchical clustering.
3. Bradley-Fayyad-Reina (BFR) Algorithm is a variant of K-Means for very large (disk-resident) data sets. It assumes that points are normally distributed around a centroid in Euclidean space.
4. CURE Algorithm is used for detecting non-convex shapes. It uses Euclidean distance and a collection of representative points to represent clusters.
5. Principal Direction Divisive Partitioning (PDDP) is a hierarchical algorithm. PDDP uses a 1-dimensional singular value decomposition to determine the principal direction of a document set, and splits the set into two subsets according to their projection on the principal direction.
6. GALOIS generates concepts (clusters of objects) by automatically generating a Galois lattice from a set of objects and their corresponding attributes. The clusters resulting from GALOIS are thus ordered in a lattice and characterized by both an extensional description and an intensional (conceptual) description.
7. The Apriori Algorithm is used to find frequent itemsets in a given transaction set. It also uses minimum support and minimum confidence levels to find association rules from the itemsets.
Results & Conclusion
We have implemented a program in Java to crawl the web and store the results in a MySQL database. We have also implemented the Porter algorithm, N-grams and the basic K-Means algorithm. We have gone through the literature on document clustering thoroughly and found that the K-Means algorithm dominates the clustering process. If we combine two algorithms there is a chance of better performance; for example, mixing PDDP with the K-Means algorithm is expected to give better results.
Sunday, 28 June 2015
Document Clustering Algorithms
Introduction
Document Clustering is the application of cluster analysis to textual documents. It has applications in automatic document organization, topic extraction and fast information retrieval or filtering. In the world of computer technology, every 10 minutes billions of gigabytes of data are generated, and people must be able to retrieve this data efficiently. This creates the need for document clustering, i.e. the systematic organization of data. Many search engines like Google, Yahoo, Baidu, Bing and AltaVista, and even many commercial search engines, are available to organize the data into a useful format. Even on intranets, clustering of data is becoming highly important. Despite the different nature of clustering on an intranet and on the Internet, the basic requirement is almost the same.
The aim of this report is to survey the existing algorithms for document clustering. Document clustering algorithms have gradually evolved over time into more subtle forms. The quality of document clustering is defined by the fulfillment of user needs. Algorithms like K-Means, Bisecting K-Means, Apriori, CURE, PDDP and GALOIS have evolved over time and provide achievable time complexity and improved clustering of documents.
The scope of this report is the field of document clustering. A brief introduction to all the algorithms used is given in this report. Further, the combined result of two algorithms is expected to give better results.
Various Standard Clustering Methods
Partitioning Based Clustering
Partition-based algorithms partition a document collection into a number of hard clusters using a given distance measure.
1. Single-Pass Algorithm
2. Relocation based partitioning, e.g. K-Means
Hierarchical Clustering
Hierarchical
clustering approaches attempt to create a hierarchical decomposition of the
given document collection thus achieving a hierarchical structure.
1. Agglomerative Approach:
Agglomerative methods start with an initial clustering of the term space, where all documents are considered to represent separate clusters. The closest clusters, according to a given inter-cluster similarity measure, are then merged continuously until only one cluster or a predefined number of clusters remain. Some of these methods are more suitable for discovering clusters of non-convex forms than partition-based algorithms. Agglomerative methods normally produce hard (hierarchical) clusters. Examples: Single Link - CURE algorithm, MST; Complete Link - Voorhees algorithm; Group Average - BIRCH algorithm.
2. Divisive Methods:
Divisive clustering algorithms start with a single cluster containing all documents. They then continuously divide clusters until all documents are contained in their own cluster or a predefined number of clusters is found. Example: Principal Direction Divisive Partitioning.
Keyword Based Clustering
Keyword-based clustering works by
assigning a set of core-terms (a form of keywords) to each document and uses
these terms as features for the subsequent clustering. This is based on the
idea that the main topic/theme of a document can be captured in a relatively
small set of core-terms or keywords. This idea is related to the Topic-binder
hypothesis.
1. Frequent Itemset based Clustering:
Frequent itemset clustering is based on finding frequent sets of keywords that occur together in the document collection, an approach often used in data mining. Example: Apriori.
2. Lattice based Clustering:
Lattice-based
clustering is closely related to the frequent itemset approach above, but
constructs a lattice structure of concepts forming a kind of hierarchy with
multiple inheritance. This approach is based on Formal Concept Analysis and is
often described as a form of conceptual clustering. Since a document can be
related to several concepts on the same level (in the lattice), the lattice
represents a hierarchical soft clustering of the document collection. Ex. GALOIS
Model Based Clustering
Model-based clustering, as we define it, is based on hypothesizing models for clusters in the document collection and then, for each document, finding the cluster whose model the document best fits.
1. Statistical Methods
2. Neural Network Methods
Algorithm and its brief Explanation
K-means Algorithm: K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. This results in a partitioning of the data space into Voronoi cells. It uses Euclidean space/distance. Start by picking k, the number of clusters, and initialize the clusters by picking one point per cluster.
Populating Clusters:
1. For each point, place it in the cluster whose current centroid it is nearest to.
2. After all points are assigned, update the locations of the centroids of the clusters.
3. Reassign all points to their closest centroid.
4. Repeat steps 2 and 3 until convergence: points don't move between clusters and the centroids stabilize.
Selecting a value for K: the average distance to the centroid falls rapidly as K increases, until the best value of K is reached, after which it changes little.
Picking the initial K points: pick the first point at random, then pick each next point to be the one whose minimum distance from the already selected points is as large as possible. Repeat until we have K points.
Disadvantage: the K-means algorithm may converge to a (local) optimum; there is no guarantee that the global optimum is found.
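A minimal from-scratch sketch of this procedure in Python with NumPy is given below; the toy 2-D point set is purely illustrative.

import numpy as np

def k_means(points, k, iters=100, seed=0):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]  # pick one point per cluster
    for _ in range(iters):
        # Assign each point to the cluster whose centroid is nearest (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update centroid locations; keep the old centroid if a cluster lost all its points
        new_centroids = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # convergence: centroids stabilize
            break
        centroids = new_centroids
    return labels, centroids

# Example usage with toy data
pts = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.1], [4.8, 5.3], [9.0, 0.9], [8.7, 1.2]])
labels, centroids = k_means(pts, k=3)
print(labels, centroids, sep="\n")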
Bisecting K-Means:
Bisecting
k-Means is like a combination of k-Means and hierarchical clustering. It starts
with all objects in a single cluster. The pseudo code of the algorithm is
displayed below:
Basic Bisecting K-means Algorithm for finding K clusters:
1. Pick a cluster to split.
2. Find 2 sub-clusters using the basic K-Means algorithm (bisecting step).
3. Repeat step 2, the bisecting step, ITER times and take the split that produces the clustering with the highest overall similarity.
4. Repeat steps 1, 2 and 3 until the desired number of clusters is reached.
There are a number of different ways
to choose which cluster to split. For example, we can choose the largest
cluster at each step, the one with the least overall similarity, or use a
criterion based on both size and overall similarity.
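A compact sketch of this scheme in Python follows, under two illustrative assumptions: scikit-learn's KMeans is used for the bisecting step, and the largest cluster is always the one chosen to split (one of the splitting criteria mentioned above).

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, iter_per_split=5, seed=0):
    """Repeatedly split the largest cluster in two until k clusters exist."""
    clusters = [np.arange(len(X))]               # start with all points in one cluster
    while len(clusters) < k:
        # 1. Pick a cluster to split (here: the largest one)
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        best_split, best_inertia = None, np.inf
        # 2-3. Bisect several times and keep the best split
        #      (lowest inertia is used here as a stand-in for "highest overall similarity")
        for trial in range(iter_per_split):
            km = KMeans(n_clusters=2, n_init=1, random_state=seed + trial).fit(X[members])
            if km.inertia_ < best_inertia:
                best_inertia, best_split = km.inertia_, km.labels_
        clusters.append(members[best_split == 0])
        clusters.append(members[best_split == 1])
    return clusters

# Example usage: clusters = bisecting_kmeans(np.random.rand(50, 4), k=5)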
Bradley-Fayyad-Reina (BFR)
Algorithm:
BFR is a variant of K-Means for very large (disk-resident) data sets. It assumes each cluster is normally distributed around a centroid in Euclidean space. The normal distribution assumption implies that clusters look like axis-aligned ellipses. Points are read from disk into main memory one block at a time. It uses a K-Means technique for selecting the initial K centroids.
BFR categorizes points into three classes:
1. Discard Set (DS): points close enough to a centroid to be summarized.
2. Compression Set (CS): groups of points that are close together but not close to any existing centroid. They are summarized, but not assigned to a cluster.
3. Retained Set (RS): points that are very far from each other (isolated points).
Actual clustering: find those points that are sufficiently close to a cluster centroid and add those points to that cluster. The statistics of each DS cluster are adjusted to account for the newly added points. The remaining points, which are not close to any cluster, go to the CS. To decide how close is "close enough", the BFR approach uses the Mahalanobis distance, requiring it to be less than a threshold (a small illustration follows below).
Limitations of the BFR algorithm: it uses strong assumptions, namely that clusters are normally distributed in each dimension and that the axes are fixed; ellipses at an angle do not give reliable results.
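A small illustration of the "close enough" test in Python, assuming a DS cluster summarized by per-dimension means and standard deviations (derivable from the count/sum/sum-of-squares statistics BFR keeps); the threshold value is an arbitrary example.

import numpy as np

def mahalanobis(point, mean, std):
    """Mahalanobis distance to a cluster summarized by axis-aligned mean/std."""
    return np.sqrt(np.sum(((point - mean) / std) ** 2))

mean = np.array([2.0, 3.0])
std = np.array([0.5, 1.0])
point = np.array([2.4, 3.8])

# Assign the point to the discard set only if it is within the threshold
threshold = 3.0                      # e.g. within roughly 3 standard deviations per dimension
if mahalanobis(point, mean, std) < threshold:
    print("close enough: add point to this cluster's discard set")
else:
    print("not close: keep in the retained/compression sets")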
CURE (Clustering Using
Representatives):
CURE is able to detect non-convex shapes, as opposed to the more popular K-Means. Unlike K-Means, it does not assume uniform, convex clusters. It uses Euclidean distance and a collection of representative points to represent clusters.
Algorithm:
1. First, cluster a sample of points hierarchically to create the initial clusters.
2. For each cluster, pick K (e.g. 4) representative points, as dispersed as possible, then move each representative point a fixed fraction (e.g. 20%) toward the centroid of the cluster (a small sketch of this step follows the algorithm).
3. Now rescan the whole dataset and visit each point p in the data set, placing it in the "closest cluster", i.e. the cluster with the representative point closest to p among all the representative points of all the clusters. The CURE algorithm has a worst-case complexity of O(n^2 log n) for high-dimensional spaces.
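A small sketch of step 2 in Python: pick dispersed representative points and shrink them toward the centroid. The number of representatives and the 20% shrink factor follow the example values above and are illustrative.

import numpy as np

def representatives(cluster_points, num_rep=4, shrink=0.2):
    """Pick well-scattered representative points and move them toward the centroid."""
    centroid = cluster_points.mean(axis=0)
    # Greedily pick points that are as far as possible from those already chosen
    reps = [cluster_points[np.argmax(np.linalg.norm(cluster_points - centroid, axis=1))]]
    while len(reps) < min(num_rep, len(cluster_points)):
        dists = np.min([np.linalg.norm(cluster_points - r, axis=1) for r in reps], axis=0)
        reps.append(cluster_points[np.argmax(dists)])
    # Shrink each representative a fixed fraction toward the centroid
    return np.array([r + shrink * (centroid - r) for r in reps])

# Example usage: reps = representatives(np.random.rand(30, 2))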
Various modifications of k-means, such as spherical k-means and k-medoids, have been proposed to allow the use of other distance measures. In statistics and data mining, k-medians clustering is a cluster analysis algorithm. It is a variation of k-means clustering where, instead of calculating the mean for each cluster to determine its centroid, one instead calculates the median.
PDDP (Principal Direction Divisive Partitioning): PDDP is a hierarchical algorithm. But where K-Means was made hierarchical by bisection, PDDP is hierarchical by design. PDDP uses a 1-dimensional singular value decomposition to determine the principal direction of a document set, and splits the set into two subsets according to their projection on the principal direction. The sets are placed in a simple binary tree and a "scatter value" is calculated for each node in the tree as it is created, in order to determine which leaf the algorithm will split next. The algorithm terminates when enough clusters have been formed in the tree.
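A minimal sketch of one PDDP split in Python: compute the principal direction of the mean-centered document vectors with an SVD and split by the sign of the projection. This is illustrative of a single split, not the full tree-building algorithm with scatter values.

import numpy as np

def pddp_split(doc_vectors):
    """Split a set of document vectors in two along their principal direction."""
    centered = doc_vectors - doc_vectors.mean(axis=0)
    # The first right singular vector is the principal direction of the set
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projections = centered @ vt[0]
    left = np.where(projections <= 0)[0]          # indices of one subset
    right = np.where(projections > 0)[0]          # indices of the other subset
    return left, right

# Example usage: left, right = pddp_split(np.random.rand(10, 50))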
GALOIS: GALOIS generates concepts (clusters of objects) by
automatically generating a Galois lattice from a set of objects and their
corresponding attributes. In the context of document clustering, the objects
are documents and the attributes are keywords (index) assigned to these
documents. The clusters resulting from GALOIS are thus ordered in a lattice and characterized by both an extensional description (i.e. the documents in the cluster) and an intensional (conceptual) description (i.e. the keywords that these documents share).
The distance measure
applied for document clustering using GALOIS is as a result simply the inverse
of the number of shared keywords (found using intersection). The more keywords
two documents share, the closer they are conceptually, since they will co-occur
in the extents of more specific concepts/clusters.
Apriori: Apriori is an algorithm for frequent itemset mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger itemsets. The frequent itemsets determined by Apriori can be used to derive association rules which highlight general trends in the database. This has applications in domains such as market basket analysis.
I = {i1, i2, …, im} is a set of items. A transaction t is a set of items with t ⊆ I. A transaction database T is a set of transactions, T = {t1, t2, …, tn}.
Algorithm:
Step 1: Find all itemsets that have minimum support (the frequent itemsets).
Step 2: Use the frequent itemsets to generate association rules. A toy sketch of step 1 follows.
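The sketch below implements the level-wise frequent-itemset search in Python; the transaction database and minimum support count are illustrative.

from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support=2):
    """Level-wise search: extend frequent itemsets one item at a time."""
    items = sorted({item for t in transactions for item in t})
    frequent, k, current = {}, 1, [frozenset([i]) for i in items]
    while current:
        # Count the support of each candidate itemset
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Generate (k+1)-item candidates from the frequent k-itemsets
        keys = list(level)
        current = list({a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1})
        k += 1
    return frequent

# Toy transaction database (market-basket style)
T = [frozenset(t) for t in (["bread", "milk"], ["bread", "beer", "eggs"],
                            ["milk", "beer"], ["bread", "milk", "beer"])]
print(apriori_frequent_itemsets(T, min_support=2))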
Document Clustering Literature Survey
Introduction
Text Mining is the discovery of previously unknown information by automatically extracting it from different written resources. Text mining is different from what we are familiar with in web search. Text mining is a variation on a field called data mining that tries to find interesting patterns in large databases. Text mining is also known as Intelligent Text Analysis, Text Data Mining or Knowledge Discovery in Text (KDT). Text mining is a young interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics and computational linguistics.
Document Clustering is the application of cluster analysis to textual documents. It has applications in automatic document organization, topic extraction and fast information retrieval or filtering. In the world of computer technology, every 10 minutes billions of gigabytes of data are generated, and people must be able to retrieve this data efficiently. This creates the need for document clustering, i.e. the systematic organization of data. Many search engines like Google, Yahoo, Baidu, Bing and AltaVista, and even many commercial search engines, are available to organize the data into a useful format. Even on intranets, clustering of data is becoming highly important. Despite the different nature of clustering on an intranet and on the Internet, the basic requirement is almost the same.
The aim of this report is to survey the existing algorithms for document clustering. Document clustering algorithms have gradually evolved over time into more subtle forms. The quality of document clustering is defined by the fulfillment of user needs. Initially, queries were matched in Boolean form. After that, Relevance Feedback came into existence. Search strategies were further improved using Latent Semantic Indexing and Non-negative Matrix Factorization. Things changed completely with the advent of machine learning algorithms, and the field of Natural Language Processing (NLP) came into existence for the organized study of text mining. Algorithms like K-means and fuzzy c-means are now used to find the nearest neighbours for document clustering. All the algorithms used, and the categorization of text mining, are specified in this document.
The scope of this project is in searching and querying, ranking of search results, navigating and browsing information, optimizing information representation and storage, document classification (into predefined groups) and document clustering (automatically discovered results).
I. Literature Survey
Boolean logic
For the document clustering technique, Boolean logic was initially used for comparison purposes. It is still used now, although not in exactly the same form. For the clustering of documents, Boolean comparison alone was not sufficient, since data is not always available in a structured format. Also, the result of document clustering is not meant only for the query given; it should be in the form of what the user actually needs.
Relevance Feedback
Relevance feedback is a feature of some information retrieval systems. The idea behind relevance feedback is to
take the results that are initially returned from a given query and to use
information about whether or not those results are relevant to perform a new
query.
Latent Semantic Indexing
Latent semantic
indexing (LSI) is an indexing and retrieval method that uses a mathematical
technique called singular value decomposition (SVD) to identify patterns in the
relationships between the terms and concepts contained in an unstructured
collection of text. A key feature of LSI is its ability to extract the conceptual
content of a body of text by establishing associations between those terms that
occur in similar contexts. LSI overcomes two of the most problematic constraints
of Boolean keyword queries: multiple words that have similar meanings (synonymy) and words that have
more than one meaning (polysemy). Synonymy is often the cause of mismatches in the
vocabulary used by the authors of documents and the users of information
retrieval systems. As a result, Boolean or keyword queries often return
irrelevant results and miss information that is relevant.
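A minimal sketch of LSI in Python follows, assuming scikit-learn: documents are turned into a weighted term-document matrix and reduced with truncated SVD. The example documents and the number of components are illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "a document about clustering algorithms",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)              # weighted document-term matrix

# SVD projects the documents into a low-dimensional "concept" space
lsi = TruncatedSVD(n_components=2, random_state=0)
doc_concepts = lsi.fit_transform(X)
print(doc_concepts)                        # each row: a document's coordinates per concept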
Non-Negative Matrix Factorization
While using Latent Semantic Indexing, not every word has equal importance for the user's queries, so words must be given different weights on the basis of term frequency, which must be taken from the document. This is done using Non-Negative Matrix Factorization.
In this process a document-term matrix is constructed with the weights of various terms from a set of documents. This matrix is factored into a term-feature matrix and a feature-document matrix. The features are derived from the contents of the documents, and the feature-document matrix describes data clusters of related documents.
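A short sketch of NMF-based clustering in Python, again assuming scikit-learn; each document is assigned to the feature (topic) on which it loads most heavily. The documents and number of components are illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "football match and players",
    "basketball players and scores",
    "stock market prices fall",
    "market shares and stock trading",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)   # weighted document-term matrix

nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(X)          # document-feature matrix
H = nmf.components_               # feature-term matrix

# Cluster assignment: the dominant feature of each document
print(W.argmax(axis=1))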
II. Technological Foundation of Machine Learning
The field of natural
language processing has produced technologies that teach computers natural
language so that they may analyze, understand, and even generate text. Some of
the technologies that have been developed and can be used in the text mining
process are information extraction, topic tracking, summarization,
categorization, clustering, concept linkage, information visualization and
question answering.
Information Extraction
Information extraction identifies key phrases and relationships within text. It is done by looking for predefined sequences in text, a process called pattern matching. For information extraction, rule-based mining algorithms, for example Sequential Covering, are used.
Topic Tracking
A topic tracking system works by keeping user profiles and, based on the documents the user views, predicting other documents of interest to the user. Google offers a free topic tracking tool that allows users to choose keywords and notifies them when news relating to those topics becomes available.
The Term Frequency-Inverse Document Frequency (TF-IDF) algorithm is used for topic tracking. TF-IDF is a numerical statistic that reflects how important a word is to a document. A weighting factor is computed for each term on the basis of domain knowledge and the term's frequency in the document.
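A tiny worked sketch of the TF-IDF weighting in Python, using the common tf * log(N/df) form; the three sample documents are illustrative.

import math

docs = [
    ["topic", "tracking", "news", "topic"],
    ["news", "about", "sports"],
    ["clustering", "of", "documents"],
]

def tf_idf(term, doc, docs):
    """Term frequency in the document times inverse document frequency over the corpus."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)          # number of documents containing the term
    idf = math.log(len(docs) / df)
    return tf * idf

print(tf_idf("topic", docs[0], docs))   # frequent here, rare elsewhere -> higher weight
print(tf_idf("news", docs[0], docs))    # appears in more documents -> lower weight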
Summarization
The key to summarization
is to reduce the length and detail of a document while retaining its main
points and overall meaning. The challenge is that, although computers are able
to identify people, places, and time, it is still difficult to teach software
to analyze semantics and to interpret meaning. It uses the fuzzy c-means algorithm.
Categorization
Categorization
involves identifying the main themes of a document by placing the document into
a pre-defined set of topics. Categorization often relies on a thesaurus for
which topics are predefined, and relationships are identified by looking for
broad terms, narrower terms, synonyms, and related terms. Categorization tools
normally have a method for ranking the documents in order of which documents
have the most content on a particular topic. It uses Support Vector Machines.
Clustering
Clustering is a
technique used to group similar documents, but it differs from categorization
in that documents are clustered on the fly instead of through the use of
predefined topics. A basic clustering algorithm creates a vector of topics for
each document and measures the weights of how well the document fits into each
cluster.
(1) K-means clustering algorithm.
(2) Word relativity-based clustering (WRBC) method: its text clustering process contains four main parts: text preprocessing (removing stop-words, stemming, filtering), word relativity computation, word clustering and text classification.
Concept Linkage
Concept linkage tools connect related documents by
identifying their commonly shared concepts and help users find information that
they perhaps wouldn’t have found using traditional searching methods. Concept
linkage is a valuable concept in text mining, especially in the biomedical
fields where so much research has been done that it is impossible for
researchers to read all the material and make associations to other research.
Ideally, concept linking software can identify links between diseases and
treatments when humans cannot.
Information Visualization
Visual text mining, or information visualization,
puts large textual sources in a visual hierarchy or map and provides browsing
capabilities, in addition to simple searching. The information visualization may be conducted in three steps: (1) Data preparation: determine and acquire the original data of the visualization and form the original data space. (2) Data analysis and extraction: analyze and extract the visualization data needed from the original data and form the visualization data space. (3) Visualization mapping: employ a certain mapping algorithm to map the visualization data space to the visualization target.
Question Answering
Another
application area of natural language processing is natural language queries, or
question answering (Q&A), which deals with how to find the best answer to a
given question. Many websites that are equipped with question answering technology allow end users to “ask” the computer a question and be given an answer. Q&A can utilize multiple text mining techniques.
III. The Evolution of Document Clustering
Document
Clustering is the application of cluster analysis to textual documents.
Document Clustering involves the use of Descriptors and Descriptor Extraction.
Descriptors are set of words that describe the contents within the clusters.
Application of document clustering is done in the following fields.
1. Automatic Document Organization
2. Topic Extraction
3. Fast Information Retrieval
4. Filtering
Algorithms used for document clustering are:
1. Hierarchical Based Algorithms
2. K-Means Algorithm and its Variants
3. Other Algorithms:
a) Graph Based Algorithms
b) Ontology Supported Algorithms
c) Order Sensitive Clustering
Hierarchical Clustering Algorithms (HCA)
In data mining, Hierarchical Clustering Algorithms are a method of cluster analysis which seeks to build a hierarchy of clusters. The two ways of hierarchical clustering are:
1. Agglomerative: This is a "bottom up" approach. Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
2. Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
K-Means and its Variants
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
1. Fuzzy C-Means clustering is a soft version of K-means, where each data point has a fuzzy degree of belonging to each cluster.
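A compact sketch of fuzzy c-means in Python with NumPy, showing the alternating membership and centroid updates; the fuzzifier m=2 and the toy data are illustrative assumptions.

import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, iters=100, seed=0):
    """Fuzzy c-means: every point gets a degree of membership in every cluster."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)          # memberships of each point sum to 1
    for _ in range(iters):
        Um = U ** m
        centroids = (Um.T @ X) / Um.sum(axis=0)[:, None]   # membership-weighted cluster centers
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        # Membership update: closer centroids get higher (soft) membership
        U = 1.0 / (dist ** (2 / (m - 1)))
        U /= U.sum(axis=1, keepdims=True)
    return U, centroids

X = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.2, 4.8]])
memberships, centers = fuzzy_c_means(X)
print(memberships.round(2))    # soft assignments instead of hard labels
print(centers)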