Technology
by SEASR Team

Over the past twenty years, the interdisciplinary field of humanities computing has developed tools to support research: archiving electronic texts, images, audio, and email conversations; sharing web-based research sites and exhibitions; establishing electronic discussion lists and research forums; publishing electronic journals; and creating ever more intelligent and refined search tools. The result is a riot of research information and tools, developed in and across a variety of incompatible technical formats and platforms.
The informatics specialists behind SEASR saw the digital humanities’ need for software to bridge these technical gaps: a need for technical and informational exchange. At SEASR, our mission is to leverage existing technology and invent new technology to enable humanities computing resources to communicate.
How will we do it?
SEASR employs leading technology to transform raw data into the semi-structured and structured information that can be processed by machine learning and data analysis applications.
Transforming Data: SEASR will construct software bridges to move information from the unstructured and semi-structured data world to the structured data world by leveraging two well-known research and development frameworks: NCSA’s Data-To-Knowledge (D2K) and IBM’s Unstructured Information Management Architecture (UIMA).
Specifically, SEASR uses IBM’s open-source UIMA to construct data services that access and normalize unstructured information. UIMA, according to IBM, advances data synthesis by providing “a technology designed to support a new breed of software applications that can process text within documents and other content sources to understand the latent meaning, relationship and relevant facts buried within…” The SEASR team has chosen to work with UIMA, since it has become a new formal standard with wide use in many fields. Members of the SEASR development team actively serve on the OASIS technical committee to establish semantic search and content analytics specifications for UIMA.
Technically, UIMA’s appeal is two-fold: it offers a rich metadata standard that allows structure to be expressed in complex ways, and it provides a run-time environment in which developers can build, deploy, plug in, and run UIMA component implementations alongside other independently developed components. UIMA’s component-based framework enables reuse, so that developers can leverage third-party code across platforms and development environments.
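As a rough illustration of the component idea described above, the following Python sketch shows independently written components that share one interface and attach structured annotations to raw text. It does not use the UIMA API. The names `Document`, `Annotation`, `token_annotator`, `year_annotator`, and `run_pipeline` are hypothetical and exist only for this example.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    """A typed span over the source text (hypothetical structure, not UIMA's)."""
    kind: str
    begin: int
    end: int
    text: str


@dataclass
class Document:
    """Raw text plus the structured metadata that components add to it."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)


# Every component has the same shape: it receives a Document and annotates it.
Component = Callable[[Document], None]


def token_annotator(doc: Document) -> None:
    """Mark each run of word characters as a token."""
    for m in re.finditer(r"\w+", doc.text):
        doc.annotations.append(Annotation("token", m.start(), m.end(), m.group()))


def year_annotator(doc: Document) -> None:
    """Mark four-digit numbers from 1500 to 2099 as years (illustrative rule)."""
    for m in re.finditer(r"\b(?:1[5-9]|20)\d{2}\b", doc.text):
        doc.annotations.append(Annotation("year", m.start(), m.end(), m.group()))


def run_pipeline(text: str, components: List[Component]) -> Document:
    """Run each plugged-in component over the same document, in order."""
    doc = Document(text)
    for component in components:
        component(doc)
    return doc


if __name__ == "__main__":
    result = run_pipeline(
        "Moby-Dick was published in 1851.",
        [token_annotator, year_annotator],
    )
    for a in result.annotations:
        print(a.kind, a.begin, a.end, a.text)
```

Because each component depends only on the shared `Document` shape, a third party could contribute a new annotator without changing the others, which is the kind of reuse the component-based approach is meant to allow.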
Building a Virtual Research and Development Environment: SEASR will also focus on developing, integrating, deploying, and sustaining a set of reusable and expandable software components and a supporting framework, benefiting a broad set of data-mining applications for scholars in the humanities.
The unstructured information SEASR transforms through component-based data services is processed a step further by our service-oriented architecture, which is also component-based and includes ontology libraries and analytics services. To mine semi-structured and structured data and enable previously incompatible formats and platforms to communicate, SEASR draws upon the best practices developed in NCSA’s D2K project over the last decade. D2K is a rapid, flexible data mining and machine learning system that integrates analytical data mining methods for prediction, discovery, and deviation detection with data and information visualization tools. D2K’s visual programming environment allows developers to connect programming modules together to build data mining applications, and it supplies a core set of modules, application templates, and a standard API for software component development.
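The idea of connecting modules into a data mining application can be sketched in a few lines of Python. This is only an assumption-laden illustration of module wiring in general, not D2K's API: the `Module` class and the four example modules are hypothetical, and the word-count task stands in for a real analysis.

```python
from collections import Counter
from typing import Any, Callable, Dict, Optional, Sequence


class Module:
    """A named processing step wired to zero or more upstream modules.

    Hypothetical class for illustration; it is not part of D2K.
    """

    def __init__(self, name: str, func: Callable[..., Any],
                 inputs: Sequence["Module"] = ()) -> None:
        self.name = name
        self.func = func
        self.inputs = list(inputs)

    def run(self, cache: Optional[Dict[str, Any]] = None) -> Any:
        """Evaluate upstream modules first, then this one, caching by name."""
        cache = {} if cache is None else cache
        if self.name not in cache:
            args = [upstream.run(cache) for upstream in self.inputs]
            cache[self.name] = self.func(*args)
        return cache[self.name]


# Wire four small modules into one application: source -> tokenize -> count -> top.
source = Module("source", lambda: ["the whale", "the sea", "a whale"])
tokenize = Module("tokenize",
                  lambda docs: [word for d in docs for word in d.split()],
                  [source])
count = Module("count", lambda words: Counter(words), [tokenize])
top = Module("top", lambda counts: counts.most_common(2), [count])

if __name__ == "__main__":
    print(top.run())
```

Swapping the `source` module for one that reads a different collection, or the `top` module for a visualization step, leaves the rest of the wiring unchanged.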
With SEASR, the technical legacy of D2K’s data and visualization advances for developers is passed on to humanities researchers. Users will work with SEASR through developer toolkits (rich client applications), through user toolkits (rich internet applications), or through custom user interfaces developed by anyone in the digital humanities development community.
For more information on SEASR’s technology legacy and source, see:
D2K (Data to Knowledge): a visual programming environment and generalized infrastructure for developing and deploying data-mining applications, from the Automated Learning Group at NCSA, being used in nora
UIMA (Unstructured Information Management Architecture): developed by IBM, with specifications under an OASIS technical committee

