Purpose
This demonstration uses UIMA (see the previous description, UIMA and SEASR) to take unstructured data (an unmarked book from Project Gutenberg) and apply part-of-speech (POS) tagging. Once this is done, we import the data into SEASR and apply frequent pattern analysis to the nouns in the text only.
Although the end result of this demonstration attempts to find common noun groupings, the underlying goal is to show how two different data analysis frameworks can be used together to give researchers flexibility for implementing a solution.
It is our hope that the result of this demonstration will show not only a list of frequently occurring characters but also which nouns frequently occur together (within a small window).
Relevance
The frequent pattern analysis described in Frequent Pattern Mining for Discovery is also applicable here. However, the relevance of this demonstration has more to do with integrating other third-party technologies with the SEASR framework.
Overview
The main contribution of this demonstration is to show the functionality of a new SEASR component that handles sparse itemsets. Itemsets are sets of items that occur together and are the main data structure for association rule mining (e.g. fp-growth). A prototypical itemset table looks like the following, where each row is a transaction and each column indicates whether or not a specific item is part of that transaction:
| Transaction ID | Item A | Item B | Item C | Item D | Item E |
|---|---|---|---|---|---|
| 1 | 0 | 1 | 1 | 0 | 1 |
| 2 | 1 | 1 | 1 | 1 | 0 |
| 3 | 0 | 1 | 1 | 1 | 1 |
Another table layout is used when the attributes are not binary: each column (attribute) takes one of several values, and the number of possible values can differ from column to column.
| Transaction ID | Attribute A | Attribute B | Attribute C |
|---|---|---|---|
| 1 | a | h | x |
| 2 | b | i | y |
| 3 | a | j | z |
However, for sparse datasets where there are thousands of attributes and each transaction contains only a small subset of items (as is the case for text), the table format does not scale well. We therefore built a new component for handling sparse itemsets. The items in a particular transaction (i.e. itemset) are now listed in any order within a set of brackets:
{A,B,C}
{F,E}
{A,F,C}
{X,Y,Z,A}
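To make the contrast between the two layouts concrete, here is a minimal Python sketch that converts the binary table shown earlier into the bracketed sparse form. The names `to_sparse` and `format_itemset` are hypothetical helpers chosen for illustration; they are not part of the SEASR component.

```python
# Illustrative only: convert a dense 0/1 transaction table into sparse itemsets.
# ITEMS and DENSE_ROWS mirror the binary example table (Items A to E).
ITEMS = ["A", "B", "C", "D", "E"]

DENSE_ROWS = [
    [0, 1, 1, 0, 1],
    [1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1],
]


def to_sparse(row, items=ITEMS):
    """Return the set of item names whose column is 1 in this row."""
    return {name for name, flag in zip(items, row) if flag}


def format_itemset(itemset):
    """Render an itemset in bracketed form (sorted only for readability)."""
    return "{" + ",".join(sorted(itemset)) + "}"


sparse = [to_sparse(row) for row in DENSE_ROWS]
for itemset in sparse:
    print(format_itemset(itemset))
```

With thousands of possible items, each sparse row stores only the handful of items actually present instead of one cell per item.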
In this demonstration each itemset is a bag of nouns that were part of a window of sentences. Both UIMA and SEASR can be used to control the size of the window.
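The windowing step might be sketched as follows. This is a sketch under stated assumptions, not the actual UIMA or SEASR implementation: it assumes the tagger emits Penn Treebank tags (noun tags begin with `NN`), that windows are consecutive and non-overlapping, and that repeated nouns within a window are collapsed, since frequent pattern mining treats each transaction as a set. The function name `noun_itemsets` is hypothetical.

```python
def noun_itemsets(tagged_sentences, window_size):
    """Group sentences into consecutive, non-overlapping windows and
    collect the distinct nouns of each window as one itemset.

    tagged_sentences: list of sentences, each a list of (token, tag) pairs.
    Assumes Penn Treebank tags, where every noun tag starts with "NN".
    """
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    itemsets = []
    for start in range(0, len(tagged_sentences), window_size):
        window = tagged_sentences[start:start + window_size]
        nouns = {
            token
            for sentence in window
            for token, tag in sentence
            if tag.startswith("NN")
        }
        if nouns:  # skip windows that contain no nouns at all
            itemsets.append(nouns)
    return itemsets


# Made-up example sentences, for illustration only.
tagged = [
    [("Tom", "NNP"), ("painted", "VBD"), ("the", "DT"), ("fence", "NN")],
    [("Huck", "NNP"), ("watched", "VBD"), ("Tom", "NNP")],
    [("The", "DT"), ("cave", "NN"), ("was", "VBD"), ("dark", "JJ")],
]
print(noun_itemsets(tagged, window_size=2))
```

A larger `window_size` produces fewer but larger itemsets, which is why the window size interacts with the support threshold discussed later.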
Process
The first half of the process, generating the POS data, is described in UIMA and SEASR. The flow loads the dataset generated from the UIMA CAS. It then parses the itemsets, and the result is used as the data model for the standard SEASR flow for frequent pattern analysis.
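The parsing step can be illustrated with a short Python sketch. The exact file format produced from the UIMA CAS is not specified here, so this assumes one bracketed, comma-separated itemset per line, as in the sparse examples in the Overview; `parse_itemset` and `parse_itemsets` are hypothetical names, not SEASR component APIs.

```python
def parse_itemset(line):
    """Parse one line such as "{A,B,C}" into a frozenset of item names."""
    text = line.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError(f"not a bracketed itemset: {line!r}")
    inner = text[1:-1]
    # Ignore stray whitespace and empty fields, e.g. from "{A, B,}".
    return frozenset(item.strip() for item in inner.split(",") if item.strip())


def parse_itemsets(lines):
    """Parse every non-blank line into an itemset, preserving line order."""
    return [parse_itemset(line) for line in lines if line.strip()]


raw = ["{A,B,C}", "{F,E}", "", "{A,F,C}", "{X,Y,Z,A}"]
transactions = parse_itemsets(raw)
print(transactions)
```

Using `frozenset` reflects that item order inside the brackets carries no meaning.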
Data Input and Manipulation
(see Frequent Pattern Mining for Discovery)
Execution of Analysis
(see Frequent Pattern Mining for Discovery)
Visualization of Results
The output is the same visualization described in the fpgrowth demo. In this case, Figure 1 shows the result of analyzing Tom Sawyer with a window size of eight paragraphs.

Figure 1. Result of fp-growth on the nouns in Tom Sawyer.
To get a better understanding of when new relationships are formed within the text, it would be possible to run the fp-growth algorithm over smaller chunks (e.g. chapters) rather than over the entire document. In this case, you could decrease the window size to a smaller number of paragraphs.
Scale Limitations
Because the datasets are sparse, both the window size and the support threshold affect the results and the running time. For small windows (e.g. single sentences), you will need to lower the support threshold to near 0, since few noun groupings recur often at the sentence level. Lowering the threshold, however, increases the running time of the fp-growth algorithm.
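The effect of the support threshold can be seen in a small Python sketch. Note that this uses brute-force pair counting, not fp-growth, and is meant only to show how the number of surviving patterns grows as the threshold drops; `frequent_pairs` is a hypothetical helper and the transactions are the sparse examples from the Overview.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(itemsets, min_support):
    """Return {pair: support} for item pairs whose support, i.e. the
    fraction of transactions containing both items, is >= min_support.

    Brute-force counting for illustration only; this is not fp-growth.
    """
    total = len(itemsets)
    if total == 0:
        return {}
    counts = Counter()
    for itemset in itemsets:
        # Sort so each pair has one canonical (alphabetical) form.
        for pair in combinations(sorted(itemset), 2):
            counts[pair] += 1
    return {
        pair: n / total
        for pair, n in counts.items()
        if n / total >= min_support
    }


transactions = [
    {"A", "B", "C"},
    {"F", "E"},
    {"A", "F", "C"},
    {"X", "Y", "Z", "A"},
]

print(len(frequent_pairs(transactions, 0.5)))   # pairs surviving a high threshold
print(len(frequent_pairs(transactions, 0.25)))  # pairs surviving a low threshold
```

Every extra pattern that survives a lower threshold is one more candidate the mining algorithm has to track, which is where the added running time comes from.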
References
- UIMA, http://incubator.apache.org/uima/
- Eclipse, http://www.eclipse.org/
- Project Gutenberg, http://www.gutenberg.org/wiki/Main_Page
- Association Rule Learning, http://en.wikipedia.org/wiki/Association_rule_learning



December 15th, 2009 at 5:25 am
I have some problems to executed frequent pattern mining. I want to find a demo of fpgrowth algorithm. How to run its demo. Can you help me.
Please reply via mail for me.
Thanks alot.
Hien Bui Van