Introduction
Document Clustering is
the application of cluster analysis to textual documents. It has applications in
automatic document organization, topic extraction and fast information
retrieval or filtering. In the world of computer technology, vast amounts of
data are generated every few minutes, and people must be able to retrieve
this data efficiently. This creates the need for document clustering, i.e.
the systematic organization of data. Many search engines such as Google,
Yahoo, Baidu, Bing and AltaVista, as well as many commercial search engines,
are available to organize data into a useful format. Clustering of data is also
becoming highly important on intranets. Despite the different nature of
clustering on an intranet and on the Internet, the basic requirement is almost the same.
The aim of this report
is to describe the existing algorithms for document clustering. Document clustering
algorithms have gradually evolved into more refined forms over time. The quality
of document clustering is defined by how well it fulfills user needs. Algorithms such as K-Means, Bisecting K-Means,
Apriori, CURE, PDDP and GALOIS have evolved over time,
providing achievable time complexity and improved clustering of documents.
The scope of this
report is the field of document clustering. A brief introduction to all
the algorithms used is given in this report. Further, combining the results of two
algorithms is expected to give better results.
Various Standard Clustering Methods
Partitioning Based Clustering
Partition-based
algorithms partition a document collection into a number of hard clusters using
a given distance measure.
1. Single-Pass Algorithm
2. Relocation-based partitioning, e.g. K-Means
Hierarchical Clustering
Hierarchical
clustering approaches attempt to create a hierarchical decomposition of the
given document collection, thus achieving a hierarchy of clusters.
1. Agglomerative Approach:
Agglomerative
methods start with an initial clustering of the term space, where each document
is considered to represent a separate cluster. The closest clusters, according to a
given inter-cluster similarity measure, are then merged repeatedly until only
one cluster or a predefined number of clusters remains. Some of these methods are
more suitable for discovering clusters of non-convex forms than partition-based
algorithms. Agglomerative methods normally produce hard (hierarchical) clusters;
a small sketch using SciPy is given after the examples below.
Ex. Single Link – CURE algorithm, MST; Complete Link – Voorhees algorithm; Group Average – BIRCH algorithm
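As an illustration, the following is a minimal sketch of agglomerative clustering using SciPy; the random feature matrix X, the choice of group-average linkage and the cut-off of 3 clusters are assumptions made for the example, not part of any specific algorithm named above.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Toy document vectors (e.g. TF-IDF features), assumed for the example.
    X = np.random.rand(10, 5)

    # 'single', 'complete' and 'average' correspond to the single link,
    # complete link and group average inter-cluster similarity measures.
    Z = linkage(X, method="average", metric="euclidean")

    # Merging would continue until one cluster remains; here the tree is cut
    # at a predefined number of clusters (3).
    labels = fcluster(Z, t=3, criterion="maxclust")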
2. Divisive Methods:
Divisive
clustering algorithms start with a single cluster containing all documents. They
then repeatedly divide clusters until each document is contained in its
own cluster or a predefined number of clusters is found. Ex. Principal Direction Divisive Partitioning
Keyword Based Clustering
Keyword-based clustering works by
assigning a set of core-terms (a form of keywords) to each document and using
these terms as features for the subsequent clustering. This is based on the
idea that the main topic/theme of a document can be captured in a relatively
small set of core-terms or keywords. This idea is related to the Topic-binder
hypothesis.
1. Frequent Itemset based Clustering:
Frequent itemset clustering is based on
finding frequent sets of keywords that occur together in the document collection –
an approach often used in data mining. Ex. Apriori
2. Lattice based Clustering:
Lattice-based
clustering is closely related to the frequent itemset approach above, but
constructs a lattice structure of concepts forming a kind of hierarchy with
multiple inheritance. This approach is based on Formal Concept Analysis and is
often described as a form of conceptual clustering. Since a document can be
related to several concepts on the same level (in the lattice), the lattice
represents a hierarchical soft clustering of the document collection. Ex. GALOIS
Model Based Clustering
Model-based clustering, as we define it,
is based on hypothesizing models for the clusters in the document collection, and then,
for each document, finding the cluster whose model the document best fits.
1. Statistical Methods
2. Neural Network Methods
Algorithms and their brief Explanation
K-means Algorithm: K-means clustering aims
to partition n observations into k clusters in which
each observation belongs to the cluster with the nearest mean. This
results in a partitioning of the data space into Voronoi cells. It uses
Euclidean distance. Start by picking k, the number of clusters, and
initialize the clusters by picking one point per cluster.
Populating Clusters:
1. For each point, place it in the cluster whose current centroid is nearest to it.
2. After all points are assigned, update the location of the centroid of each cluster.
3. Reassign all points to their closest centroid.
4. Repeat steps 2 and 3 until convergence: points no longer move between clusters and the centroids stabilize.
Selecting a value for K: the average distance to the centroid falls
rapidly as K increases, until the best value of K is reached; after that it changes little.
Picking the initial K points: pick the first point at random, then pick each next
point to be the one whose minimum distance from the already selected points is as large as
possible. Repeat until we have k points.
Disadvantage:
the K-means algorithm may converge to a local optimum; there is no guarantee that
the global optimum is found.
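As an illustration, below is a minimal sketch of the clustering loop and the initialization described above, using NumPy; the function name, the input matrix X and the iteration limit are assumptions made for the example.

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # Pick the first point at random, then each next point as far as
        # possible from those already chosen (the initialization above).
        centroids = [X[rng.integers(len(X))]]
        while len(centroids) < k:
            dists = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
            centroids.append(X[np.argmax(dists)])
        centroids = np.array(centroids)
        for _ in range(iters):
            # Steps 1 and 3: assign each point to its nearest centroid.
            labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
            # Step 2: update each centroid to the mean of its assigned points.
            new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                      else centroids[j] for j in range(k)])
            # Step 4: stop when the centroids stabilize.
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids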
Bisecting K-Means:
Bisecting
k-Means is like a combination of k-Means and hierarchical clustering. It starts
with all objects in a single cluster. The pseudo code of the algorithm is
displayed below:
Basic Bisecting K-means Algorithm for finding K clusters:
1. Pick a cluster to split.
2. Find 2 sub-clusters using the basic k-Means algorithm (bisecting step).
3. Repeat step 2, the bisecting step, ITER times and take the split that produces the clustering with the highest overall similarity.
4. Repeat steps 1, 2 and 3 until the desired number of clusters is reached.
There are a number of different ways
to choose which cluster to split. For example, we can choose the largest
cluster at each step, the one with the least overall similarity, or use a
criterion based on both size and overall similarity.
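A minimal sketch of the bisecting procedure, reusing the kmeans sketch above; choosing the largest cluster to split and measuring split quality by the sum of squared errors are assumptions made for the example (the text lists several possible criteria).

    def bisecting_kmeans(X, K, ITER=5):
        # Start with all objects in a single cluster (stored as index arrays).
        clusters = [np.arange(len(X))]
        while len(clusters) < K:
            # Step 1: pick a cluster to split (here: the largest one).
            idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
            members = clusters.pop(idx)
            best = None
            # Steps 2 and 3: bisect ITER times and keep the best split
            # (lowest sum of squared distances to the sub-cluster centroids).
            for trial in range(ITER):
                labels, cents = kmeans(X[members], 2, seed=trial)
                sse = sum(np.sum((X[members][labels == j] - cents[j]) ** 2) for j in range(2))
                if best is None or sse < best[0]:
                    best = (sse, labels)
            labels = best[1]
            clusters.append(members[labels == 0])
            clusters.append(members[labels == 1])
        return clusters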
Bradley-Fayyad-Reina (BFR) Algorithm:
BFR is a variant of K-means for very large (disk-resident) data sets. It assumes
each cluster is normally distributed around a centroid in Euclidean space. The
normal distribution assumption implies that clusters look like axis-aligned
ellipses. Points are read from disk one memory-sized block at a time and loaded
into main memory. It uses a K-means technique for selecting the initial K centroids.
BFR categorizes points into three classes:
1. Discard Set (DS): points close enough to a centroid to be summarized and discarded.
2. Compression Set (CS): groups of points that are close together but not close to any existing centroid; they are summarized, but not assigned to a cluster.
3. Retained Set (RS): points that are very far from each other (isolated points).
Actual
Clustering: find the points that are sufficiently close to a cluster centroid
and add them to that cluster. The DS statistics of each cluster are adjusted to
account for the newly added points. The remaining points, which are not close to any cluster,
go to the CS.
To decide how close is “close enough”, the BFR approach uses the Mahalanobis
distance: a point is close enough if its Mahalanobis distance to the centroid is below a threshold.
Limitation
of the BFR Algorithm: it relies on strong assumptions, namely that clusters are
normally distributed in each dimension and that the axes are fixed; clusters shaped
like ellipses at an angle do not give reliable results.
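A minimal sketch of the Mahalanobis “close enough” test described above; summarizing a DS cluster by its count, per-dimension sum and sum of squares, and the threshold of 3 standard deviations, are assumptions made for the example.

    import numpy as np

    def close_enough(point, N, SUM, SUMSQ, threshold=3.0):
        # A DS cluster is summarized by N (count), SUM and SUMSQ (per-dimension
        # sum and sum of squares), from which centroid and variance follow.
        centroid = SUM / N
        std = np.sqrt(SUMSQ / N - centroid ** 2)
        # Mahalanobis distance for an axis-aligned (normally distributed) cluster.
        d = np.sqrt(np.sum(((point - centroid) / std) ** 2))
        # The point joins the cluster's discard set if d is below the threshold.
        return d < threshold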
CURE (Clustering Using
Representatives):
CURE has the ability to detect non-convex shapes, as
opposed to the more popular K-Means, which tends to produce
uniform (spherical) clusters. It uses Euclidean distance and a collection of representative
points to represent each cluster.
Algorithm:
1. First cluster a sample of points hierarchically to create the initial clusters.
2. For each cluster, pick K (e.g. 4) representative points, as dispersed as possible. Move each representative point a fixed fraction (e.g. 20%) toward the centroid of the cluster.
3. Now rescan the whole dataset and visit each point p in the data set, placing it in the “closest cluster”, i.e. the cluster with the closest representative point (to p) among all the representative points of all the clusters.
The CURE algorithm has a worst-case complexity of O(n² log n) for high-dimensional spaces.
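A minimal sketch of steps 2 and 3 above; the functions assume the initial hierarchical clusters are already available, and the representative count of 4 and shrink fraction of 20% are the example values mentioned in the text.

    import numpy as np

    def cure_representatives(points, k_rep=4, shrink=0.2):
        # Step 2: pick k_rep points as dispersed as possible, then move each
        # one a fixed fraction toward the cluster centroid.
        centroid = points.mean(axis=0)
        reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
        while len(reps) < min(k_rep, len(points)):
            dists = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
            reps.append(points[np.argmax(dists)])
        return [r + shrink * (centroid - r) for r in reps]

    def assign_point(p, reps_per_cluster):
        # Step 3: place p in the cluster owning the representative point
        # closest to p, over all representative points of all clusters.
        return min(reps_per_cluster,
                   key=lambda c: min(np.linalg.norm(p - r) for r in reps_per_cluster[c]))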
Various modifications
of k-means, such as spherical k-means and k-medoids, have been
proposed to allow the use of other distance measures. In statistics and data mining, k-medians
clustering is a cluster analysis algorithm.
It is a variation of k-means clustering where, instead of calculating the mean for each cluster to determine its centroid, one
instead calculates the median.
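A minimal sketch of that single difference, assuming the points of one cluster are given as a NumPy array; the function name is made up for the example.

    import numpy as np

    def update_center(points, variant="k-means"):
        # k-means uses the per-dimension mean as the centroid;
        # k-medians uses the per-dimension median instead.
        return points.mean(axis=0) if variant == "k-means" else np.median(points, axis=0)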
PDDP (Principal
Direction Divisive Partitioning): PDDP is a hierarchical algorithm. But where
K-Means was made hierarchical by bisection, PDDP
is hierarchical by design. PDDP uses a one-dimensional singular value
decomposition to determine the principal direction of a document set, and
splits the set into two subsets according to their projection on the principal
direction. The sets are placed in a simple binary tree, and a “scatter value”
is calculated for each node in the tree as it is created in order to
determine which leaf the algorithm will split next. The algorithm terminates
when enough clusters have been formed in the tree.
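A minimal sketch of a single PDDP split, assuming a document-by-term matrix X as input; the scatter-value bookkeeping and the binary tree are omitted.

    import numpy as np

    def pddp_split(X):
        # Centre the document vectors around their mean.
        centered = X - X.mean(axis=0)
        # The leading right singular vector is the principal direction.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        projections = centered @ vt[0]
        # Split the set according to the sign of the projection.
        return np.where(projections <= 0)[0], np.where(projections > 0)[0]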
GALOIS: GALOIS generates concepts (clusters of objects) by
automatically generating a Galois lattice from a set of objects and their
corresponding attributes. In the context of document clustering, the objects
are documents and the attributes are keywords (index) assigned to these
documents. The clusters resulting from GALOIS are thus ordered in a lattice and
characterized by both an extensional description (i.e. the documents in the
cluster) as well as an intensional (conceptual) description (i.e. the keywords
that these documents share).
The distance measure
applied for document clustering using GALOIS is therefore simply the inverse
of the number of shared keywords (found using set intersection). The more keywords
two documents share, the closer they are conceptually, since they will co-occur
in the extents of more specific concepts/clusters.
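A minimal sketch of this distance, assuming each document is represented by its set of assigned keywords; the convention of an infinite distance for documents sharing no keywords is an assumption made for the example.

    def galois_distance(keywords_a, keywords_b):
        # Distance is the inverse of the number of shared keywords,
        # found using set intersection.
        shared = len(set(keywords_a) & set(keywords_b))
        return float("inf") if shared == 0 else 1.0 / shared

    # Example: two documents sharing two keywords are at distance 0.5.
    galois_distance({"lattice", "concept", "clustering"}, {"lattice", "concept", "galois"})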
Apriori: Apriori is
an algorithm for frequent item set
mining and association rule learning over transactional databases.
It proceeds by identifying the frequent individual items in the database and
extending them to larger and larger item sets, as long as those item sets appear
sufficiently often in the database. The frequent item sets determined by
Apriori can be used to determine association rules which highlight
general trends in the database. This has applications in domains such
as market basket analysis.
I = {i1, i2, …, im}: a set of items. Transaction t: a set of items, with t ⊆ I. Transaction database T: a set of transactions, T = {t1, t2, …, tn}.
Algorithm:
Step 1: Find all itemsets that have minimum support (the frequent itemsets).
Step 2: Use the frequent itemsets to generate association rules.
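A minimal sketch of Step 1 (finding the frequent itemsets); the minimum support count of 2 and the toy transactions in the example comment are assumptions, and rule generation (Step 2) is omitted.

    from itertools import combinations

    def frequent_itemsets(transactions, min_support=2):
        # Step 1: find all itemsets whose support (number of transactions
        # containing them) is at least min_support.
        transactions = [set(t) for t in transactions]
        frequent = {}
        candidates = list({frozenset([i]) for t in transactions for i in t})
        k = 1
        while candidates:
            counts = {s: sum(s <= t for t in transactions) for s in candidates}
            level = {s: c for s, c in counts.items() if c >= min_support}
            frequent.update(level)
            # Join frequent k-itemsets into (k+1)-itemset candidates and prune
            # any candidate with an infrequent k-subset (the Apriori property).
            joined = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
            candidates = [s for s in joined
                          if all(frozenset(c) in level for c in combinations(s, k))]
            k += 1
        return frequent

    # Example: frequent_itemsets([["milk", "bread"], ["milk", "eggs"],
    #                             ["bread", "eggs"], ["milk", "bread", "eggs"]])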