Introduction
Text Mining is the discovery of previously unknown information by automatically extracting it from different written resources. It is different from what we are familiar with in web search. Text mining is a variation on a field called data mining, which tries to find interesting patterns in large databases. Text mining is also known as Intelligent Text Analysis, Text Data Mining or Knowledge Discovery in Text (KDT). It is a young interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics and computational linguistics.
Document Clustering is the application of cluster analysis to textual documents. It has applications in automatic document organization, topic extraction and fast information retrieval or filtering. In the world of computer technology, enormous volumes of data are generated every few minutes, and people must be able to retrieve this data efficiently; this creates the need for document clustering, i.e. the systematic organization of data. Many search engines such as Google, Yahoo, Baidu, Bing and AltaVista, as well as many commercial search engines, are available to organize data into a useful format. Even on intranets, clustering of data is becoming highly important. Despite the different nature of clustering on intranets and on the Internet, the basic requirement is almost the same.
The aim of this report is to survey the existing algorithms for document clustering. Document clustering algorithms have gradually evolved into more subtle forms over time. The quality of document clustering is defined by how well it fulfills user needs. Initially, queries were matched in Boolean form. After that, relevance feedback came into existence. Search strategies were further improved using Latent Semantic Indexing and Non-negative Matrix Factorization. Things changed fundamentally with the advent of machine learning algorithms, and the field of Natural Language Processing (NLP) emerged for the organized study of text mining. Algorithms such as K-means and fuzzy c-means are now used to find nearest neighbors for document clustering. The algorithms used and the categorization of text mining techniques are specified in this document.
The scope of this project covers searching and querying, ranking of search results, navigating and browsing information, optimizing information representation and storage, document classification (into predefined groups) and document clustering (automatically discovered groups).
I.
Literature Survey
Boolean logic
For document clustering, Boolean logic was initially used for comparison. It is still used today, though not in exactly the same form. Boolean comparison alone was not sufficient for clustering documents, since data is not always available in a structured format. Moreover, a document clustering result should reflect not only the query given, but what the user actually needs.
Relevance Feedback
Relevance feedback is a feature of some information retrieval systems. The idea behind relevance feedback is to
take the results that are initially returned from a given query and to use
information about whether or not those results are relevant to perform a new
query.
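A classic way to realize this idea is the Rocchio algorithm (an assumption here — the text does not name a specific method): the query vector is moved toward the centroid of results marked relevant and away from the centroid of non-relevant ones. A minimal NumPy sketch with toy term-frequency vectors:

```python
import numpy as np

# Rocchio relevance feedback over a toy 4-term vocabulary.
# alpha/beta/gamma are the usual tuning weights (values are illustrative).
alpha, beta, gamma = 1.0, 0.75, 0.15

query = np.array([1.0, 0.0, 1.0, 0.0])
relevant = np.array([[1.0, 1.0, 1.0, 0.0],
                     [0.0, 1.0, 1.0, 0.0]])
non_relevant = np.array([[0.0, 0.0, 0.0, 1.0]])

# Move the query toward relevant documents, away from non-relevant ones.
new_query = (alpha * query
             + beta * relevant.mean(axis=0)
             - gamma * non_relevant.mean(axis=0))
new_query = np.clip(new_query, 0.0, None)  # negative weights are dropped
print(new_query)
```

The reformulated query now carries weight on terms it never contained but that relevant documents share, which is exactly how feedback improves the next retrieval round.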
Latent Semantic Indexing
Latent semantic
indexing (LSI) is an indexing and retrieval method that uses a mathematical
technique called singular value decomposition (SVD) to identify patterns in the
relationships between the terms and concepts contained in an unstructured
collection of text. A key feature of LSI is its ability to extract the conceptual
content of a body of text by establishing associations between those terms that
occur in similar contexts. LSI overcomes two of the most problematic constraints of Boolean keyword queries: multiple words that have similar meanings (synonymy) and words that have
more than one meaning (polysemy). Synonymy is often the cause of mismatches in the
vocabulary used by the authors of documents and the users of information
retrieval systems. As a result, Boolean or keyword queries often return
irrelevant results and miss information that is relevant.
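LSI can be sketched with scikit-learn's `TruncatedSVD` applied to a tf-idf term-document matrix (the toy corpus and the choice of two components are assumptions for illustration):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "a cat sat on the mat",
    "the dog sat on the log",
]

# Weight terms, then take a rank-2 SVD: each document is mapped from
# thousands of possible term dimensions down to 2 "concept" dimensions,
# where documents about similar topics land near each other even when
# they share few exact words.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
lsi = TruncatedSVD(n_components=2, random_state=0)
concepts = lsi.fit_transform(X)
print(concepts.shape)  # one 2-dimensional concept vector per document
```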
Non-Negative Matrix Factorization
With Latent Semantic Indexing, not every word has equal importance with respect to user queries, so terms must be given different weights on the basis of term frequency taken from the documents. This is done using Non-Negative Matrix Factorization.
In this process, a document-term matrix is constructed with the weights of various terms from a set of documents. This matrix is factored into a term-feature matrix and a feature-document matrix. The features are derived from the contents of the documents, and the feature-document matrix describes data clusters of related documents.
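The factorization above can be sketched with scikit-learn's `NMF` (the toy corpus and component count are assumptions; note that scikit-learn returns the factors as a document-feature matrix `W` and a feature-term matrix `H`):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "a cat sat on the mat",
    "the dog sat on the log",
]

# Build the weighted document-term matrix, then factor it into two
# non-negative matrices.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(X)     # how strongly each document expresses each feature
H = nmf.components_          # how strongly each term defines each feature
clusters = W.argmax(axis=1)  # assign each document to its dominant feature
print(clusters)
```

Because all factors are non-negative, each feature reads as an additive "topic", and the strongest feature per document gives a natural cluster assignment.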
II.
Technological Foundation of Machine Learning
The field of natural
language processing has produced technologies that teach computers natural
language so that they may analyze, understand, and even generate text. Some of
the technologies that have been developed and can be used in the text mining
process are information extraction, topic tracking, summarization,
categorization, clustering, concept linkage, information visualization and
question answering.
Information Extraction
Information extraction identifies key phrases and relationships within text. It does this by looking for predefined sequences in text, a process called pattern matching. For information extraction, rule-based mining algorithms such as sequential covering are used.
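A predefined sequence can be expressed as a regular expression; this toy sketch (the pattern and the sample text are assumptions) pulls person/organization/year relationships out of free text:

```python
import re

text = ("Dr. Alice Smith joined Acme Corp in 2019. "
        "Bob Jones left Acme Corp in 2021.")

# Pattern matching: a predefined sequence "<Name> joined/left <Org> in <Year>"
# encoded as a regex with one capture group per extracted field.
pattern = re.compile(
    r"([A-Z][a-z]+ [A-Z][a-z]+) (joined|left) ([A-Z][a-z]+ Corp) in (\d{4})"
)
for person, verb, org, year in pattern.findall(text):
    print(person, verb, org, year)
```

Real extraction systems generalize this idea with learned rules (e.g. sequential covering) rather than hand-written patterns.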
Topic Tracking
A topic tracking system works by keeping user profiles and, based on the documents the user views, predicting other documents of interest to the user. Google offers a free topic tracking tool that allows users to choose keywords and notifies them when news relating to those topics becomes available.
Term Frequency-Inverse Document Frequency (TF-IDF)
The TF-IDF algorithm is used for topic tracking. It is a numerical statistic that reflects how important a word is to a document. A weighting factor is computed for each term on the basis of domain knowledge and the term's frequency in the document.
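The TF-IDF weight can be computed directly; a small sketch over pre-tokenized toy documents (the standard tf × idf with a natural log — exact smoothing details vary across implementations):

```python
import math

docs = [["cat", "sat", "mat"],
        ["dog", "sat", "log"],
        ["cat", "cat", "ran"]]

def tf_idf(term, doc, docs):
    """Term frequency in one document, scaled down by how many documents
    contain the term: common terms get low weight, distinctive ones high."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf

# "mat" occurs in only one document, so it outweighs the more common "sat".
print(tf_idf("sat", docs[0], docs), tf_idf("mat", docs[0], docs))
```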
Summarization
The key to summarization
is to reduce the length and detail of a document while retaining its main
points and overall meaning. The challenge is that, although computers can identify people, places, and times, it is still difficult to teach software to analyze semantics and interpret meaning. Summarization commonly uses the fuzzy c-means algorithm.
Categorization
Categorization
involves identifying the main themes of a document by placing the document into
a pre-defined set of topics. Categorization often relies on a thesaurus for
which topics are predefined, and relationships are identified by looking for
broad terms, narrower terms, synonyms, and related terms. Categorization tools normally have a method for ranking documents in order of which have the most content on a particular topic. Categorization typically uses Support Vector Machines.
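A categorization pipeline of this kind can be sketched with scikit-learn's `LinearSVC` over tf-idf features (the tiny labeled corpus and the two topics are toy assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Documents with topics drawn from a predefined set of categories.
train_docs = ["stock markets fell sharply",
              "the striker scored a goal",
              "bond yields rose today",
              "the team won the match"]
train_labels = ["finance", "sports", "finance", "sports"]

# Vectorize, then fit a linear SVM that separates the two topics.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_docs, train_labels)
print(model.predict(["bond yields fell"]))
```

Unlike clustering, the categories here are fixed in advance; the classifier only decides which existing topic a new document belongs to.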
Clustering
Clustering is a
technique used to group similar documents, but it differs from categorization
in that documents are clustered on the fly instead of through the use of
predefined topics. A basic clustering algorithm creates a vector of topics for
each document and measures the weights of how well the document fits into each
cluster.
(1) K-means clustering algorithm.
(2) Word relativity-based clustering (WRBC) method, whose text clustering process contains four main parts: text preprocessing (removing stop-words, stemming and filtering), word relativity computation, word clustering and text classification.
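The basic algorithm described above — vectorize each document, then measure how well it fits each cluster — can be sketched end to end with scikit-learn (the toy corpus and k=2 are assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "a cat sat on the mat",
    "the dog sat on the log",
]

# Preprocess (stop-word removal) and build per-document weight vectors,
# then group them on the fly with no predefined topics.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(labels)
```

The two vehicle documents end up in one cluster and the two animal documents in the other, purely from shared vocabulary.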
Concept Linkage
Concept linkage tools connect related documents by
identifying their commonly shared concepts and help users find information that
they perhaps wouldn’t have found using traditional searching methods. Concept
linkage is a valuable concept in text mining, especially in the biomedical
fields where so much research has been done that it is impossible for
researchers to read all the material and make associations to other research.
Ideally, concept linking software can identify links between diseases and
treatments when humans cannot.
Information Visualization
Visual text mining, or information visualization,
puts large textual sources in a visual hierarchy or map and provides browsing
capabilities, in addition to simple searching. Information visualization may be conducted in three steps: (1) Data preparation: determine and acquire the original data for visualization, forming the original data space. (2) Data analysis and extraction: analyze and extract the visualization data needed from the original data, forming the visualization data space. (3) Visualization mapping: employ a mapping algorithm to map the visualization data space to the visualization target.
Question Answering
Another
application area of natural language processing is natural language queries, or
question answering (Q&A), which deals with how to find the best answer to a
given question. Many websites equipped with question answering technology allow end users to "ask" the computer a question and be given an answer. Q&A can utilize multiple text mining techniques.
III.
The Evolution of Document Clustering
Document
Clustering is the application of cluster analysis to textual documents.
Document Clustering involves the use of descriptors and descriptor extraction. Descriptors are sets of words that describe the contents within the clusters. Document clustering is applied in the following fields.
1. Automatic Document Organization
2. Topic Extraction
3. Fast Information Retrieval
4. Filtering
Algorithms Used for Document Clustering are:
1. Hierarchical Based Algorithms
2. K-Means Algorithms and its Variants
3. Other Algorithms:
a) Graph Based Algorithms
b) Ontology Supported Algorithms
c) Order Sensitive Clustering
Hierarchical Clustering Algorithms (HCA)
In data mining, hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. There are two approaches to hierarchical clustering:
1. Agglomerative: a "bottom up" approach in which each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
2. Divisive: a "top down" approach in which all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
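The agglomerative ("bottom up") variant can be sketched with SciPy on toy 2-D vectors standing in for reduced document representations (the data and the average-linkage choice are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Four toy points forming two obvious groups.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])

# Each point starts in its own cluster; the closest pair of clusters is
# merged at every step. Z records the full merge tree (the dendrogram).
Z = linkage(X, method="average", metric="euclidean")

# Cut the tree so that at most two clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Cutting the same tree at a different level yields a different number of clusters without re-running the algorithm, which is the main appeal of the hierarchical approach.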
K-Means and its Variants
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
Fuzzy C-Means clustering is a soft version of K-means, where each data point has a fuzzy degree of belonging to each cluster.
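A minimal NumPy sketch of fuzzy c-means (the fuzzifier m, iteration count and toy data are assumptions; unlike hard k-means, each point gets a membership weight for every cluster):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100, seed=0):
    """Return the membership matrix U (n points x c clusters)."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)  # memberships sum to 1 per point
    for _ in range(iters):
        W = U ** m
        # Centers are membership-weighted means of the points.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-9
        # Membership update: inverse distance raised to 2/(m-1), normalized.
        U = 1.0 / d ** (2 / (m - 1))
        U /= U.sum(axis=1, keepdims=True)
    return U

X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
U = fuzzy_c_means(X, c=2)
print(U.round(2))
```

Points near a center get a membership close to 1 for that cluster, while points between clusters split their membership, which is the "soft" behavior that makes the algorithm useful for overlapping topics.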