| Source: | http://repository.seasr.org/Meandre/Repositories/Demo-Flows/TextClustering2/repository.rdf |
| URI: | http://seasr.org/flows/textclustering2/ |
| Name: | TextClustering2 |
| Creator: | admin |
| Date: | 2008-07-24 (23:14:04) |
| Rights: | UofI/NCSA |
| Tags: | cluster, dendrogram, opennlp, discovery, text, visualization |
| Description: | This flow shows a complete pre-processing and clustering of text ending in a dendrogram visualization. This is the same as Text_Clustering_Demo_1 except that we have replaced the tokenizer and the pos tagger with OpenNLP components to demonstrate component interoperability. |
Overview
This flow performs the same function as Clustering for Discovery in Text Collections I, except that the SEASR tokenizer and POS tagger components have been replaced with those from OpenNLP. This is intended to demonstrate component interoperability across NLP project codebases. The OpenNLP components run faster than the SEASR components, so we are able to process the entire Gertrude Stein text. Note, however, that running faster does not necessarily imply equivalent performance on an NLP task.
The type of cluster analysis that is employed in SEASR for defining text concepts is a hierarchical agglomerative (bottom-up) clustering technique (HAC) that models individual text items as points in a vector space. This vector space is sometimes referred to as term space because each unique term that appears in any text document defines a dimension in this space. Not all words in a document are considered terms. Terms are primarily nouns and/or noun phrases. Common words that have little semantic content, such as prepositions and conjunctions, are routinely discarded, as are most verbs. For each term, a scalar weight value is computed as the normalized frequency of occurrence of that term. Terms are further weighted based on their uniqueness for that document using the following formula:
term weight = normalized term frequency * log(inverse document frequency)
The vector of term weights for an individual document defines that document's point in term space. The similarity between two documents is therefore measured by the Euclidean distance between their representative points: the smaller the distance, the more similar the documents. This measure of “similarity” rests on the hypothesis that like documents share many of the same terms.
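The weighting and distance computation described above can be sketched in a few lines of Python. This is only an illustration, not the code of the SEASR components: it assumes that "normalized term frequency" means the term count divided by the document length and that "inverse document frequency" means the number of documents divided by the number of documents containing the term. The function names `term_weights` and `euclidean` are hypothetical.

```python
import math
from collections import Counter


def term_weights(docs):
    """Build one weight vector per document.

    docs is a list of documents, each given as a list of terms.
    Assumed definitions (not confirmed by the source):
      normalized term frequency  = count of term / number of terms in document
      inverse document frequency = number of documents / documents containing term
    """
    n_docs = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([
            (counts[term] / length) * math.log(n_docs / doc_freq[term])
            for term in vocab
        ])
    return vocab, vectors


def euclidean(u, v):
    """Euclidean distance between two equal-length weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Under these assumptions, a term that occurs in every document receives a weight of zero, so it contributes nothing to the distance between documents.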
The agglomerative process begins by placing each individual text item in its own cluster and computing the similarity values for all possible cluster pairings. The two most similar clusters are then combined, and the process is repeated until either a specified number of clusters remains or the similarity between any two clusters no longer rises above some threshold value.
The similarity between two items is not the same as the similarity between two clusters of items. In fact, there are several ways of computing the similarity between two clusters. One can measure the distance between the members of each cluster that are closest together (single link) or farthest apart (complete link). We calculate the group-average similarity over all pairs of items in the two groups, which recent research suggests is a more accurate approach.
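The agglomerative procedure with group-average linkage can be sketched as follows. This is a minimal, unoptimized illustration rather than the SEASR implementation. It works with distances instead of similarities, so "most similar" becomes "smallest distance" and the similarity threshold becomes a maximum distance. It also assumes that the group average is taken over cross-cluster pairs only. The names `hac`, `group_average`, `n_clusters` and `max_distance` are hypothetical.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def group_average(c1, c2, points):
    """Mean distance over all pairs with one item in c1 and one in c2."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hac(points, n_clusters=1, max_distance=math.inf):
    """Merge clusters bottom-up until a stopping condition is met.

    Stops when n_clusters clusters remain, or when the closest pair of
    clusters is farther apart than max_distance.
    Returns the final clusters (lists of point indices) and the merge
    history, which is the information a dendrogram displays.
    """
    clusters = [[i] for i in range(len(points))]  # one cluster per item
    merges = []
    while len(clusters) > max(n_clusters, 1):
        # Find the pair of clusters with the smallest group-average distance.
        (a, b), dist = min(
            (((a, b), group_average(clusters[a], clusters[b], points))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if dist > max_distance:
            break
        merges.append((clusters[a], clusters[b], dist))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, merges
```

Recomputing every pairwise average on each pass keeps the sketch short but is slow for large collections. A practical implementation would cache or incrementally update the cluster distances.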
This is how clusters are formed using HAC components. The cluster hierarchy is often represented as a binary tree that is called a dendrogram. SEASR provides a dendrogram visualization component for cluster analysis.
Application
In this particular flow, a single document is read from a WebDAV server and segmented into chunks of approximately 250 words (at sentence boundaries).
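The segmentation step can be illustrated with a small sketch. It assumes a naive regular-expression sentence splitter as a stand-in for a real sentence detector, and it closes a chunk as soon as the running word count reaches the target, so chunks end on sentence boundaries and may run slightly over the target. The function name `chunk_text` and the parameter `target_words` are hypothetical and are not taken from the flow's components.

```python
import re


def chunk_text(text, target_words=250):
    """Split text into chunks of roughly target_words words.

    Chunks always end at a sentence boundary. Sentences are detected
    naively: a boundary is whitespace following '.', '!' or '?'.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        count += len(sentence.split())
        if count >= target_words:
            # Target reached: close the chunk at this sentence boundary.
            chunks.append(" ".join(current))
            current, count = [], 0
    if current:
        # Remaining sentences form a final, possibly shorter, chunk.
        chunks.append(" ".join(current))
    return chunks
```

Each resulting chunk would then be treated as one text item in the clustering described in the overview.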
In general, unsupervised techniques like clustering are often used as tools for exploration and discovery in large text collections. Such techniques can greatly reduce the time required for human inspection or search.



