
Purpose

The purpose of this demonstration is to show how other software frameworks can be easily used in conjunction with the SEASR framework, giving researchers the flexibility to build a solution that involves multiple technologies.

In this demonstration, we use UIMA (Unstructured Information Management Architecture) to take unstructured text and apply part-of-speech (POS) tagging. UIMA is a component framework for analyzing unstructured content such as text, audio, and video. UIMA started at IBM and is now an open source project at the Apache Software Foundation.

Relevance

Being able to leverage the strengths of different tools and software frameworks allows researchers to quickly prototype and test different ideas. The best solution for a problem will sometimes involve multiple software products.

It is important to the SEASR project that researchers can still use tools they are familiar with and leverage SEASR’s flexibility for integrating different software products. Not only can other technologies be easily integrated into SEASR components, they can also be used in a pipeline fashion where the output from one technology can be used as input into another. There are several SEASR components that ingest data from different types of sources.

In this demonstration we show that a UIMA user can take advantage of the structured data analysis engines found in SEASR, allowing the researcher to orchestrate a solution that incorporates two software frameworks.

Overview

UIMA’s basic building blocks, called Analysis Engines (AEs), are composed to analyze a document and to infer and record descriptive attributes about it. We use four Analysis Engines to analyze a document and record its POS information in the CAS (Common Analysis Structure). The CAS is UIMA’s object-based data structure for representing objects, properties, and values. It gives cooperating UIMA components a common representation and a mechanism for shared access to the artifact being analyzed (the text document in this case) and to the current analysis results.
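To make the idea of a shared analysis structure concrete, here is a minimal toy model in Python. It is not the UIMA API (which is Java): the names `ToyCas`, `Annotation`, `toy_tokenizer`, and `toy_tagger`, and the tiny lexicon, are all hypothetical and exist only for illustration. The sketch shows two "engines" that never call each other and cooperate only through annotations stored next to the unchanged document text.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A typed span over the document text, with free-form features."""
    type_name: str
    begin: int
    end: int
    features: dict = field(default_factory=dict)


@dataclass
class ToyCas:
    """Toy stand-in for a CAS: the artifact plus the analysis results so far."""
    text: str
    annotations: list = field(default_factory=list)

    def add(self, type_name, begin, end, **features):
        ann = Annotation(type_name, begin, end, dict(features))
        self.annotations.append(ann)
        return ann

    def select(self, type_name):
        return [a for a in self.annotations if a.type_name == type_name]

    def covered_text(self, ann):
        return self.text[ann.begin:ann.end]


def toy_tokenizer(cas):
    """First engine: records one Token annotation per word or punctuation mark."""
    for match in re.finditer(r"\w+|[^\w\s]", cas.text):
        cas.add("Token", match.start(), match.end())


# Hypothetical lookup table standing in for a real statistical tagger.
TOY_LEXICON = {"tom": "NNP", "painted": "VBD", "the": "DT", "fence": "NN", ".": "."}


def toy_tagger(cas):
    """Second engine: reads the Tokens left by the tokenizer and adds a POS feature."""
    for token in cas.select("Token"):
        word = cas.covered_text(token).lower()
        token.features["pos"] = TOY_LEXICON.get(word, "UNK")


cas = ToyCas("Tom painted the fence.")
toy_tokenizer(cas)
toy_tagger(cas)
tagged = [(cas.covered_text(t), t.features["pos"]) for t in cas.select("Token")]
```

Because every annotation stores character offsets rather than a copy of the text, later engines can always recover the covered text from the original document.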

We use UIMA’s implementation for POS tagging and then integrate the result into two different SEASR flows:

  1. A flow for performing frequent pattern analysis on the nouns within a body of text.
  2. A flow for mapping a window of text to emotions.

Each subsequent demonstration provides more detail for the specific analysis. POS tagging is the starting point for most research that has a natural language processing component.

Process

The UIMA process consists of building an aggregate engine that contains four primitive engines: OpenNLPTokenizer, OpenNLPPosTagger, OpenNLPSentenceDetector, and POSWriter. The only custom component we wrote was the last component, POSWriter, which serializes the UIMA CAS data structure into something that can be easily consumed by existing SEASR components.
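The serialization step can be sketched roughly as follows. This is a Python illustration, not the actual POSWriter source: the function name `write_pos_csv` and the column layout (token, POS tag, begin offset, end offset) are assumptions made for this example, and the real component may write a different layout.

```python
import csv
import io


def write_pos_csv(tagged_tokens, out):
    """Write (token, pos, begin, end) tuples as CSV rows to a file-like object.

    The column layout is an assumption for illustration only.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["token", "pos", "begin", "end"])
    for token, pos, begin, end in tagged_tokens:
        writer.writerow([token, pos, begin, end])


# Hand-made sample data; a comma token and a quoted token exercise CSV escaping.
sample = [
    ("Tom", "NNP", 0, 3),
    ("said", "VBD", 4, 8),
    (",", ",", 8, 9),
    ('"Hello"', "UH", 10, 17),
]

buffer = io.StringIO()
write_pos_csv(sample, buffer)
csv_text = buffer.getvalue()

# Reading the text back shows that the tokens survive a round trip.
rows = list(csv.reader(io.StringIO(csv_text)))
```

Using a CSV library rather than joining strings by hand matters here, because literary text routinely contains commas and quotation marks as tokens in their own right.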

Data Setup, Manipulation and Execution of Analysis

The UIMA flow components can be authored in Eclipse using the UIMA plugin.

To execute the UIMA engine:
1. Select Run Configuration (within Eclipse) on the Analysis Engine you want to run (in this case the OpenNLPAggreagate). See figure 1.


Figure 1. Step one of the UIMA process.

2. Select UIMA Document Analyzer, then select Run (see figure 2).


Figure 2. Step two of the UIMA process.

3. Once the Document Analyzer opens, set the Input Directory to the directory that contains the document to be analyzed, and set the Output Directory to the directory where the output should be saved. Click Run (see figure 3).


Figure 3. Step three of the UIMA process.

Visualization of Results

For each document analyzed, there are two outputs. The file ending in .xmi is what the UIMA framework uses to display the annotation results. The other file is the serialized CAS in CSV format, and it is this file that both flows described in the subsequent demos consume. Figure 4 shows the output of POS tagging Tom Sawyer by Mark Twain. The text was acquired from Project Gutenberg.

Figure 4. Visualization of POS tagging within UIMA.
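As a rough illustration of how a downstream step might consume the CSV output, the Python sketch below counts the nouns in a small hand-made sample. Several things here are assumptions rather than facts from this demonstration: the column names `token` and `pos`, the helper name `count_nouns`, and the use of Penn Treebank-style tags in which noun tags begin with "NN". The sample rows are invented and are not taken from the Tom Sawyer output.

```python
import csv
import io
from collections import Counter


def count_nouns(csv_file):
    """Count lower-cased tokens whose POS tag starts with 'NN'.

    Assumes a header row with 'token' and 'pos' columns (hypothetical layout).
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        if row["pos"].startswith("NN"):
            counts[row["token"].lower()] += 1
    return counts


# Invented sample standing in for a serialized CAS file.
sample_csv = io.StringIO(
    "token,pos\n"
    "Tom,NNP\n"
    "painted,VBD\n"
    "the,DT\n"
    "fence,NN\n"
    "and,CC\n"
    "Tom,NNP\n"
    "rested,VBD\n"
)

noun_counts = count_nouns(sample_csv)
top_noun = noun_counts.most_common(1)
```

A filter of this kind is the sort of preprocessing that a noun-based frequent pattern analysis would need before any patterns are mined.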

Data Type Restrictions

UIMA is currently best suited to unstructured text documents, although its documentation also describes working with audio and video.

Scale Limitations

UIMA can be slow to process very large documents, and the user must navigate both Eclipse and UIMA through their graphical interfaces.

References

  1. UIMA, http://incubator.apache.org/uima/
  2. Eclipse, http://www.eclipse.org/
  3. Project Gutenberg, http://www.gutenberg.org/wiki/Main_Page
