Overview
This flow shows the complete pre-processing of a text collection, ending with the conversion of the collection to a sparse table. A portion of the table is displayed in the table viewer.
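To make the final conversion step concrete, here is a minimal sketch of turning tokenized documents into a sparse document-term table. It is not the SEASR implementation: the function name `to_sparse_table`, the dictionary-of-cells layout, and the sample documents are all assumptions chosen for illustration.

```python
from collections import Counter

def to_sparse_table(docs):
    """Build a sparse document-term table from tokenized documents.

    Returns (vocabulary, cells). `vocabulary` maps each term to a column
    index; `cells` maps (doc_index, term_index) to a count. Zero counts
    are never stored, which is what makes the table sparse.
    """
    vocabulary = {}
    cells = {}
    for doc_index, tokens in enumerate(docs):
        for term, count in Counter(tokens).items():
            # Assign the next free column index the first time a term is seen.
            term_index = vocabulary.setdefault(term, len(vocabulary))
            cells[(doc_index, term_index)] = count
    return vocabulary, cells

# Hypothetical, already pre-processed documents.
docs = [["text", "mining", "text"], ["mining", "flow"]]
vocabulary, cells = to_sparse_table(docs)
```

Only four of the six possible cells are stored for this sample, because each document lacks one of the three terms.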
Transformation Steps
Sentence Splitting
Identifying sentence boundaries in a document is not as trivial as it may seem. SEASR has components that split sentences using rules, statistical models, or both. Once sentences are identified, they are recorded as annotations in their own annotation set.
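A rule-based splitter can be sketched as follows. This is an illustration only, not the SEASR component: the abbreviation list, the function name `split_sentences`, and the choice to record each sentence as a `(start, end)` character offset pair are assumptions.

```python
import re

# Illustrative abbreviation list; a real rule set would be much larger.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """Return sentence annotations as (start, end) character offsets."""
    annotations = []
    start = 0
    # Candidate boundary: one or more of . ! ? followed by whitespace.
    for match in re.finditer(r"[.!?]+\s+", text):
        last_word = text[start:match.end()].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a boundary
        end = match.start() + len(match.group().rstrip())
        annotations.append((start, end))
        start = match.end()
    if start < len(text):
        annotations.append((start, len(text)))  # trailing sentence
    return annotations

text = "Dr. Smith arrived. He was late! Why?"
sentences = [text[s:e] for s, e in split_sentences(text)]
```

The period after "Dr" shows why the task is not trivial: without the abbreviation rule, the first sentence would be cut after two letters.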
Tokenization
Tokenization, simply put, is the labeling of individual words or, sometimes, word parts. This is important because many downstream components need the tokens to be clearly identified for analysis. Tokens are recorded as annotations in their own annotation set.
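As a rough illustration, a regular-expression tokenizer that records each token as an annotation with character offsets might look like this. The pattern, the function name `tokenize`, and the dictionary layout of an annotation are assumptions, not the SEASR data model.

```python
import re

# A word (optionally with an internal apostrophe part, as in "don't"),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    """Return token annotations with their character offsets."""
    return [
        {"start": m.start(), "end": m.end(), "text": m.group()}
        for m in TOKEN_PATTERN.finditer(text)
    ]

tokens = tokenize("Don't panic, it's fine.")
token_texts = [t["text"] for t in tokens]
```

Keeping the offsets alongside the text lets downstream components refer back to the original document rather than to a copy of the words.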
Part-of-Speech (pos) Tagging
Part-of-speech tagging components typically assign a pos tag to each token (the Penn Treebank project has provided a widely used set of codes for this purpose). Other data such as lemmas, lexemes, and synonyms (to name a very few) may also be identified at this stage. Pos information is stored as features of the token annotation.
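The sketch below shows only how a tag can be stored as a feature of a token annotation. The lookup table stands in for a real tagger, which would use rules or a trained statistical model; `LEXICON`, `tag_tokens`, the fallback tag, and the `features` key are hypothetical names chosen for illustration.

```python
# Hypothetical mini-lexicon using Penn Treebank tags.
LEXICON = {"the": "DT", "dog": "NN", "barks": "VBZ", "loudly": "RB"}

def tag_tokens(tokens, default_tag="NN"):
    """Attach a "pos" feature to each token annotation (a dict)."""
    for token in tokens:
        tag = LEXICON.get(token["text"].lower(), default_tag)
        # Create the feature map on first use, then store the tag in it.
        token.setdefault("features", {})["pos"] = tag
    return tokens

tokens = [{"text": "The"}, {"text": "dog"}, {"text": "barks"}]
tagged = tag_tokens(tokens)
pos_tags = [t["features"]["pos"] for t in tagged]
```

Other per-token data mentioned above, such as a lemma, could be stored in the same feature map under its own key.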
Stop Word Filtering
Very common words like “and” and “the” are often filtered out to improve performance. This process is called stop word removal. SEASR has components to perform this process. One basic approach is to remove all words that appear on a list of common words. Another approach is to remove words that occur in large numbers across most documents, since these terms create “noise” that makes text records less distinguishable. The stop word filters either remove token annotations from a document's token annotation set or mark such annotations as “stopped.”
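Both approaches, along with the option of marking tokens instead of removing them, can be sketched as follows. The stop list, the 0.8 document-ratio threshold, and all function names are assumptions for illustration rather than SEASR defaults.

```python
from collections import Counter

STOP_WORDS = {"and", "the", "a", "of"}  # illustrative list, not SEASR's

def remove_listed(tokens, stop_words=STOP_WORDS):
    """List-based approach: drop tokens that appear on the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

def frequent_terms(docs, max_doc_ratio=0.8):
    """Frequency-based approach: find terms that occur in more than
    `max_doc_ratio` of the documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update({t.lower() for t in tokens})  # count each doc once
    return {t for t, n in doc_freq.items() if n / len(docs) > max_doc_ratio}

def mark_stopped(tokens, stop_words):
    """Alternative to removal: keep every token but flag stop words."""
    return [{"text": t, "stopped": t.lower() in stop_words} for t in tokens]

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
noisy = frequent_terms(docs)
kept = remove_listed(["The", "cat", "and", "dog"])
marked = mark_stopped(["the", "cat"], noisy)
```

Marking rather than removing keeps the token set intact, so a later component can still decide to use the stopped tokens.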
PoS Filtering
This component takes a document object as input and filters the tokens for that document based on part-of-speech tag information. The token list is retrieved from the document, and only those tokens with part-of-speech tags that match at least one value in the selected tag list are retained. The filtered list of tokens is placed into the document (replacing the old list) and the document is output. In the SEASR component, PoS Tags is a comma-delimited list of the part-of-speech tags that we want to retain. Tokens that do not possess one of these values are removed from the annotation list or flagged as “stopped”.
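The filtering logic described above can be sketched in a few lines. The function name `filter_by_pos`, the token dictionaries, and the sample tag list are assumptions; only the comma-delimited format of the tag list comes from the component description.

```python
def filter_by_pos(tokens, pos_tags):
    """Retain tokens whose "pos" value appears in the comma-delimited
    `pos_tags` string, for example "NN,NNS,JJ"."""
    keep = {tag.strip() for tag in pos_tags.split(",") if tag.strip()}
    return [token for token in tokens if token["pos"] in keep]

# Hypothetical tagged tokens using Penn Treebank tags.
tokens = [
    {"text": "the", "pos": "DT"},
    {"text": "quick", "pos": "JJ"},
    {"text": "fox", "pos": "NN"},
    {"text": "jumps", "pos": "VBZ"},
]
retained = filter_by_pos(tokens, "NN, NNS, JJ")
retained_texts = [t["text"] for t in retained]
```

Stripping whitespace around each tag makes the list tolerant of entries written as "NN, NNS" as well as "NN,NNS".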
Some components use regular expressions over pos tags to identify “chunks” of tokens (all noun phrases, for example). This process is sometimes called chunking.
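One way to apply a regular expression to pos tags is to encode the tag sequence as a string and map each match back to the tokens it covers. The sketch below assumes Penn Treebank tags and a deliberately simple noun phrase pattern (optional determiner, any adjectives, one or more nouns); the pattern and the function name `chunk_noun_phrases` are illustrative, not taken from a SEASR component.

```python
import re

# Optional determiner, any number of adjectives, one or more nouns (NN or NNS).
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NNS?>)+")

def chunk_noun_phrases(tagged):
    """`tagged` is a list of (word, tag) pairs. Returns a list of chunks,
    each chunk being the list of words it covers."""
    offset_to_index = {}
    pieces = []
    position = 0
    for index, (_, tag) in enumerate(tagged):
        offset_to_index[position] = index  # where this tag starts in the string
        piece = f"<{tag}>"
        pieces.append(piece)
        position += len(piece)
    tag_string = "".join(pieces)

    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        first = offset_to_index[match.start()]
        count = match.group().count("<")  # one "<" per matched token
        chunks.append([word for word, _ in tagged[first:first + count]])
    return chunks

tagged = [("the", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
          ("over", "IN"), ("lazy", "JJ"), ("dogs", "NNS")]
chunks = chunk_noun_phrases(tagged)
```

Because every alternative in the pattern begins with "<", a match can only start at a tag boundary, which is what makes the offset lookup safe.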



