OmniPort: Integrating Legacy Data into the WWW


Shelley G. Ford,
Chief, Information Analysis Branch
Defense Technical Information Center

Robert C. Stern,
Advanced Decision Systems
Booz Allen & Hamilton, Inc.

Abstract:

Today's information seekers are faced with two seemingly contradictory problems. First, the amount of data they can retrieve is limited by the number of different access protocols, search languages and indexing vocabularies they can master. Second, the sheer volume of available data threatens to overwhelm the user's ability to extract meaningful information from this ever-growing mass of data. This paper describes OmniPort, a system under development for the Defense Technical Information Center (DTIC), designed to address both of these problems by offering consistent single-query access to heterogeneous, distributed data sources and by providing a seamless environment for analyzing retrieved data. The former capability is supplied by the Minerva 'middleware' environment developed by Booz Allen & Hamilton, the latter by Mosaic.

1.0 The Problem

OmniPort came into existence to solve two seemingly contradictory problems. First, because of the need to master a bewildering variety of access protocols, search languages and indexing vocabularies, information users are able to find only a small percentage of the relevant data that should be available given current network technology. Second, information users are being overwhelmed by the sheer volume of data that can be retrieved on almost any subject. The better the first problem is solved, the worse the second problem tends to become.

This paper describes the OmniPort system, under development for the Defense Technical Information Center (DTIC) by the Advanced Decision Systems group of Booz Allen & Hamilton. This first section discusses the linked problems of data access in a networked environment and information retrieval from the masses of data available. The second section describes the Minerva middleware. The third section describes the Mosaic environment being developed to provide consistent user access to the data that Minerva retrieves. Finally, a fourth section discusses future growth plans and opportunities.

1.1 Data Access

The emergence of the Internet as a common means of information exchange has led to the creation of increasingly powerful tools for locating and retrieving relevant data. These tools have included archie, veronica, gopher and the Wide Area Information Server (WAIS). The most powerful of these, WAIS, provides a text search engine that is used to index primary words in each document. A user can then submit a search pattern to the WAIS engine, which identifies and ranks documents based on degree of match and allows the retrieval of the listed documents. [1] Each of these tools, however, requires the data to be organized (or re-organized, in the case of legacy data) into the format used by the access tool. This process can be expensive and time-consuming, and it always carries the risk of data loss or corruption during conversion. Using WAIS makes sense when the data has not previously been made available through a searchable online system. However, when the data is already organized and accessible to local users via a modern database or textbase manager, it is wasteful to rehost the data in order to make it available on a wide area network (WAN).
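To illustrate the kind of ranked keyword matching a WAIS engine performs, the short Python sketch below builds a tiny inverted index and ranks documents by the number of query words they contain. The documents, query and scoring are invented examples; a real WAIS server uses its own index format and ranking method.

    # A minimal sketch of keyword indexing and ranked retrieval in the
    # spirit of WAIS. Documents, query and scoring are illustrative only.
    from collections import defaultdict

    documents = {
        "doc1": "engine damage observed after compressor stall",
        "doc2": "wing damage sustained during landing",
        "doc3": "composite armor test results",
    }

    # Inverted index: word -> set of document ids containing that word.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)

    def search(query):
        """Rank documents by how many query words they contain."""
        scores = defaultdict(int)
        for term in query.lower().split():
            for doc_id in index.get(term, set()):
                scores[doc_id] += 1
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)

    print(search("engine damage"))   # [('doc1', 2), ('doc2', 1)]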

While the amount of data accessible using gopher, WAIS or the World Wide Web (WWW) is increasing rapidly, the percentage of data accessible via a standard Internet access tool is small when compared to the amount of relevant data that exists in organized, searchable form on network accessible computers. These existing legacy systems are the product of decades of information gathering and data analysis. The problem of accessing this vast store of legacy data has until recently resisted practical resolution. Either users were forced to learn multiple search interfaces or a costly time-consuming data conversion effort was required.

1.2 Information Retrieval

Although users currently can retrieve only a small percentage of the data that might be available, they are already being overwhelmed by the volume and kinds of data that can be obtained now on the Internet. So much information is available today in so many formats (full-text, graphics, audio, video, etc.) that the usable portion lies buried in masses of data where it cannot be located or retrieved easily. The expansion of the National Information Infrastructure (NII) to create an "information superhighway" will only add thousands of new information sources.

Effective tools must be developed to help the end user extract meaningful information from the masses of data now or soon to be available on the networks. One such tool is Mosaic, a toolset that uses the WWW hypertext paradigm to browse the Internet. More importantly, it introduces the concept of linked analysis tools that can be used to view and, ultimately, to manipulate any data that is retrieved. The set of helper applications that can be defined for any Mosaic client forms the core of this set of linked tools. Each retrieved document is displayed in the manner that makes it most understandable and usable.

The hypertext browsing approach, however, does not solve the problem of selecting relevant documents from the many thousands of pages that can be accessed. This problem can only be addressed by a query capability designed to help the user at both ends of the query process: first by assisting the user in formulating effective queries, and then by providing relevance ranking of the returned documents, so that inexpert or infrequent users can still retrieve useful results. Individual query systems sometimes provide these capabilities, but only on a limited scale and only for homogeneous information sources. The Internet environment requires that these capabilities operate across multiple dissimilar, geographically distributed data sources.

1.3 The OmniPort Project

DTIC began developing OmniPort as a tool for integrating the vast stores of legacy data gathered over the years by the Department of Defense (DoD) Information Analysis Centers (IACs). The IAC program was created by DoD to "improve the productivity of scientists, engineers, managers, and technicians in the Defense community through the timely dissemination of evaluated information". [2] IACs act as central data repositories in such specialized subject areas as ceramic materials, cold regions, soil mechanics and platform survivability/vulnerability, among others. Each IAC has acted independently to collect technical materials and data in its particular area of expertise. As a result, the 26 IACs now hold 26 different data collections, with little commonality in data storage and retrieval methods. Rather than attempt to rehost and centralize these data sources, DTIC opted to employ Minerva technology. This approach had several benefits, including leaving ownership and maintenance with the responsible IACs. The intended OmniPort user community includes members of the DoD acquisition and technology community who need access to IAC information in order to perform their jobs. To make OmniPort easily and inexpensively available to this large and diverse user population, it needed a front-end user interface that would work on all common desktop platforms. Mosaic was identified as the sole environment that met these requirements. In addition, Mosaic's increasing popularity within the DoD community would mean greater acceptance of OmniPort and reduced training costs.


2.0 Minerva

The Advanced Decision Systems group of Booz Allen & Hamilton has developed a middleware environment that directly addresses the data access and information retrieval problems outlined in Section 1. Minerva provides users with simple, consistent access to multiple information sources, regardless of physical location or the method used to access documents. Minerva operates with existing information sources, each with its own native search capability, without requiring any redevelopment or modification of the information sources.

2.1 Communications Architecture

Underlying the Minerva environment is a layered communications architecture that provides the linguistic power necessary to communicate among OmniPort's own processes and between OmniPort and the information sources it provides to its users. This architecture is represented graphically in Figure 1. The graphic representation of this architecture as three concentric circles is deliberate, in that the three languages involved are related to each other in a subset/superset relationship. These are, from the least inclusive to the most:

Figure 1 - OmniPort Communications Architecture

• Text Reference Language (TRL) is the language that encodes user queries in a uniform manner. The name is something of a misnomer, because the queries possible within the definition of TRL will retrieve any data, not just text documents. TRL comprises a superset of the search operators offered by existing text search engines. It allows users to generate queries of the maximum possible richness and power.

• Metalanguage adds a layer of support for the full range of commands and responses possible within OmniPort, of which the queries formed in TRL are just one example. Besides TRL queries, the metalanguage syntax supports the definition of requests for documents, for highlighting within a document and for drone initialization. Essentially, any command necessary for the various OmniPort processes to keep each other informed is definable in the metalanguage.

• Transport Language defines the low-level communications layer necessary for distributed processing. The transport language syntax permits the encapsulation of metalanguage commands with the necessary packeting, status and routing information so that each OmniPort process can identify the messages on which it must act. (The sketch following this list illustrates the resulting nesting.)
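To make the subset/superset relationship concrete, the following Python sketch wraps a TRL query inside a metalanguage command and then inside a transport envelope. The field names and formats shown here are invented for illustration only; the actual TRL, metalanguage and transport language syntaxes are defined by Minerva and are not reproduced in this paper.

    # A purely illustrative nesting of the three language layers described
    # above. All field names and formats here are hypothetical.
    import json

    # Innermost layer: a TRL query expressed in a uniform (assumed) form.
    trl_query = '(AND "engine" "damage")'

    # Metalanguage layer: the TRL query is just one kind of command.
    meta_command = {"command": "QUERY", "body": trl_query}

    # Transport layer: adds packeting, status and routing information so
    # that each OmniPort process can recognize the messages meant for it.
    transport_message = {
        "route": ["desktop", "dispatcher", "drone"],
        "status": "NEW",
        "payload": meta_command,
    }

    print(json.dumps(transport_message, indent=2))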

2.2 Process Architecture

The OmniPort software operates in a distributed, multiprocess environment. Figure 2 graphically presents the four main components of this architecture. These are, in the order they appear to interact with the user:

• Desktops that offer the user a GUI supporting the formation of queries, the display of results and the analysis of documents in the results set. (It should be noted that the desktop is the only part of this architecture the user actually 'sees'. The rest is effectively invisible.) The OmniPort desktop is the Mosaic interface described in Section 3.

Figure 2 - OmniPort Process Architecture from a User's Perspective

• Distributed Information Operating Environment (DIOE) that provides the transport backbone for communication between the architectural components. The primary component of this structure is a network of dispatcher processes that manage the passing of messages (related to concepts, queries, document lists, documents, etc.) around the network of distributed resources.

• Query Augmentation Services that assist users by broadening queries. Query augmentation offers several methodologies for expanding a user's query to increase the number of relevant documents retrieved. Once the documents are retrieved, the query augmentation services provide relevance ranking to assist the user in identifying the most relevant documents.

• Drones that integrate native search capabilities by translating between OmniPort's TRL and the native search engine's query language, as well as providing other services, such as the highlighting of retrieval terms in the document. (A sketch of this translation step appears after this list.)
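The following Python sketch shows the kind of translation a drone performs. Both the TRL expression and the target 'native' syntax are simplified assumptions made for illustration; a real drone is written against the full query language of a specific search engine.

    # A minimal sketch of a drone's translation step. The TRL form and the
    # target syntax are hypothetical simplifications.
    import re

    def trl_to_native(trl):
        """Translate a simple (AND "a" "b") TRL expression into the
        infix 'a AND b' form that many text search engines accept."""
        match = re.match(r'\(AND\s+(.*)\)$', trl)
        if not match:
            raise ValueError("unsupported TRL expression: " + trl)
        terms = re.findall(r'"([^"]+)"', match.group(1))
        return " AND ".join(terms)

    print(trl_to_native('(AND "engine" "damage")'))   # engine AND damage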

2.3 The Benefits of Minerva to OmniPort

Minerva provides OmniPort with a means of connecting users to a potentially limitless set of data sources. Minerva also provides the transport mechanism that carries queries formed by users at their Mosaic desktop to any connected data sources regardless of each source's native query language. The user forms a single query and selects any or all available sources; Minerva transports and translates the query so that each source receives it in a form it knows how to process. Minerva then collects the resulting response set and presents it to the user in a single Mosaic page.
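The fan-out and merge behavior described above can be pictured with a short Python sketch. The two source functions below are stand-ins invented for illustration; in OmniPort each would be a drone speaking its engine's native query language.

    # A sketch of the single-query fan-out/collect pattern. The sources
    # and their results are invented examples.
    def wais_source(query):
        return [("Composite Armor Ballistic Tests", 0.9)]

    def materials_db_source(query):
        return [("Armor Materials Summary", 0.7)]

    def query_all_sources(query, sources):
        merged = []
        for name, search in sources.items():
            for title, score in search(query):
                merged.append((score, title, name))
        # One relevance-ordered result set is returned for display.
        return sorted(merged, reverse=True)

    sources = {"SURVIAC WAIS": wais_source, "Materials DB": materials_db_source}
    print(query_all_sources('(AND "composite" "armor")', sources))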

This architecture provides a large number of benefits. Among the most important is fault tolerance. The Internet (or any distributed environment) is in a constant state of flux. Sources come on-line and go away at irregular intervals. Minerva's DIOE is designed to respond to both the disappearance and reappearance of sources in the appropriate manner. When a source goes off-line, the associated drone informs its associated dispatcher which broadcasts that information to all other dispatchers. Similarly, the reappearance of a source is noted, broadcast and that source is then included in all future transactions.

Equally important is the ease with which new sources can be included in the architecture. Drones are specific to a search engine, not to a source. If a new source with a search engine for which a drone already exists is added, it usually requires only the installation of the appropriate drone code. (This is true for text sources; incorporating a new structured database also requires the creation of a table that maps field names.)

Another important benefit of Minerva is its query augmentation services. To understand the importance of query augmentation, it is necessary to define two measures of retrieval quality: recall and precision. The recall measure compares the number of relevant documents retrieved by a search to the total number of relevant documents available, expressed as a percentage. For example, if 100 available documents relate to a user's topic of interest and a search returns 25 of them, the recall is 25%. The more of the relevant documents a search actually retrieves, the higher the recall. The precision measure compares the number of relevant documents in a retrieved set to the total number of documents in that set, again expressed as a percentage. For example, if a search retrieves 100 documents and only 30 are actually relevant to the user's needs, the precision is 30%. The higher the precision, the fewer irrelevant documents are in the response set.
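The two figures above can be checked with a pair of one-line Python functions; the numbers are those used in the paragraph.

    # Recall and precision as defined above, reproducing the worked figures.
    def recall(relevant_retrieved, total_relevant):
        return 100.0 * relevant_retrieved / total_relevant

    def precision(relevant_retrieved, total_retrieved):
        return 100.0 * relevant_retrieved / total_retrieved

    print(recall(25, 100))      # 25.0 -> the 25% recall example
    print(precision(30, 100))   # 30.0 -> the 30% precision example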

How do these measures apply to an actual search? If a researcher interested in engine damage submits a query of the form '(AND "engine" "damage")', the retrieval set will not include documents that discuss "compressor stage damage" without mentioning the word "engine", even though such documents are relevant. The search is therefore lower in recall to the extent that these relevant documents are missed. The search may also include documents that contain the words "engine" and "damage" but do not actually discuss engine damage; a document containing a phrase such as "the engine performed well but the left wing sustained damage" would be retrieved but is not relevant. Thus, the precision is lower as well.

To get around this problem, OmniPort is designed to work with multiple query augmentation services. A query augmentation service expands a user's query into a broader set of related search patterns, thereby increasing the likelihood that the query will retrieve a broader set of relevant documents. The disadvantage of a broadened query is that, while it increases recall, it runs the risk of decreasing precision. There are two primary ways in which a query augmentation service can combat this potential loss of precision. First, it can broaden the query in a smart fashion, using domain knowledge to ensure that the added search criteria are truly relevant to the user's goals. Second, it can rank the retrieved documents by relevance, so that users can easily identify the documents most likely to contain relevant data. Ideally, a query augmentation service should do both. Any time penalty associated with the added processing of a query augmentation service is more than made up for by the greater likelihood that the user will obtain the desired response set using fewer queries and with less time wasted browsing irrelevant documents.

A number of possible approaches can be taken to query augmentation and the subsequent relevance ranking. Most are based on a thesaurus: each word in the user's text pattern is expanded by adding its synonyms to the search set. Relevance ranking, in these cases, is done mainly by calculating the percentage of search terms that were found in any particular document. In developing OmniPort, Booz Allen incorporated a sophisticated, knowledge-based query augmentation capability known as RUBRIC. [3]
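A Python sketch of the thesaurus-based approach just described is shown below. The thesaurus entries are invented for illustration; the ranking simply reports the percentage of expanded terms found in a document.

    # A sketch of thesaurus-based query expansion and percentage-of-terms
    # ranking, as described above. The thesaurus content is invented.
    thesaurus = {
        "engine": ["engine", "turbine", "powerplant"],
        "damage": ["damage", "failure", "degradation"],
    }

    def expand(terms):
        """Expand each query word into itself plus its synonyms."""
        expanded = []
        for term in terms:
            expanded.extend(thesaurus.get(term, [term]))
        return expanded

    def rank(document_text, expanded_terms):
        """Score a document by the percentage of expanded terms it contains."""
        words = set(document_text.lower().split())
        found = sum(1 for term in expanded_terms if term in words)
        return 100.0 * found / len(expanded_terms)

    terms = expand(["engine", "damage"])
    print(rank("turbine failure traced to compressor stage damage", terms))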

2.3.1 RUBRIC

To ameliorate the performance problems of keyword and Boolean searching, a novel query augmentation method was developed called concept search. In this approach, users formulate queries by selecting concepts that have meaning in the subject domain. These concepts are developed by users knowledgeable in the domain and are stored in a concept knowledge base. Queries are made using single concepts or simple Boolean combinations of concepts (or combinations of concepts and specific text patterns).

A domain knowledge base contains concepts organized in a semantic network. Each concept is defined by a set of attributes and their values. It may also include a set of subconcepts and optionally a set of evidence (text patterns that indicate the presence of the concept in a document). Each concept includes weightings that specify how it relates to adjacent concepts and how a set of evidence contributes to relevance assessment. The set of these concepts constitutes the knowledge base that is made available to users for the formulation of queries.

A query composed of high-level domain concepts is successively decomposed by RUBRIC into lower and lower level domain concepts using the linkages in the knowledge base. At each level, any evidence patterns that may be present are formed into individual queries, along with the relevance weighting determined by the domain experts. These queries are then broadcast to all selected information sources. Like a thesaurus, the evidence set in the knowledge base contains all the various ways that a concept is likely to be referenced in the media, such as all the different synonyms or alternate spellings of a word, or the values a particular field may contain. Unlike a thesaurus, the evidence set is highly focused by the domain experts to assure a precise retrieval.
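The decomposition described above can be illustrated with a small Python sketch. The knowledge-base structure, concepts, evidence patterns and weights below are invented for illustration; RUBRIC's actual representation is a richer semantic network with attributes and ACCRUE rules.

    # A simplified illustration of concept decomposition. The concepts,
    # evidence patterns and weights are invented examples.
    knowledge_base = {
        "engine damage": {
            "subconcepts": {"compressor damage": 0.9, "turbine damage": 0.9},
            "evidence": {'"engine damage"': 1.0},
        },
        "compressor damage": {
            "subconcepts": {},
            "evidence": {'"compressor stage damage"': 0.8, '"compressor stall"': 0.6},
        },
        "turbine damage": {
            "subconcepts": {},
            "evidence": {'"turbine blade failure"': 0.8},
        },
    }

    def decompose(concept, weight=1.0):
        """Expand a concept into (evidence pattern, combined weight) pairs
        by following the subconcept links in the knowledge base."""
        node = knowledge_base[concept]
        queries = [(p, weight * w) for p, w in node["evidence"].items()]
        for sub, link_weight in node["subconcepts"].items():
            queries.extend(decompose(sub, weight * link_weight))
        return queries

    for pattern, w in decompose("engine damage"):
        print(pattern, round(w, 2))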

Figure 3 - OmniPort Test Home Page

When each of the individual queries gets a response from the information sources, RUBRIC gathers each response, performing an ACCRUE function on the set of evidence weightings for each pattern that 'hits' on a particular document. An ACCRUE function calculates a relevance ranking by giving a higher score to documents with more or better quality pattern matches. RUBRIC then reports the response set to the user with the appropriate relevance ranking associated with each document.
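The exact ACCRUE formula is not given here, but its behavior (more matches, and better-weighted matches, yield a higher score) can be sketched with a probabilistic sum, which is the assumption made in the Python fragment below.

    # A sketch of an ACCRUE-style combination. A probabilistic sum is
    # assumed here; the formula RUBRIC actually uses is not given above.
    def accrue(evidence_weights):
        score = 0.0
        for w in evidence_weights:
            score = score + w - score * w   # combine two scores, capped below 1.0
        return score

    print(accrue([0.8]))        # 0.8  - one strong pattern hit
    print(accrue([0.8, 0.6]))   # 0.92 - a second hit raises the score
    print(accrue([0.3, 0.3]))   # 0.51 - weaker hits accrue more slowly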


3.0 The Mosaic Interface

Mosaic provides the client display process and the communications protocol for any client site. Each client workstation is, in fact, invisible to Minerva. As the user selects from the OmniPort home page (Figure 3) any of the OmniPort-specific operations (specifically the "Open" or "Get" buttons), a script is launched on the Mosaic server which in turn starts any necessary Minerva processes. As Minerva passes information back to the Mosaic server, the script generates the HTML required to display the response on the client's screen.
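As a rough illustration of this server-side pattern, the Python fragment below turns a hypothetical result list handed back by Minerva into HTML for a results page. The result format, URL scheme and tag usage are assumptions for illustration, not OmniPort's actual script.

    # A sketch of a server-side script turning Minerva results into HTML
    # for the Mosaic client. Result format and URLs are hypothetical.
    def results_to_html(results):
        lines = ["<TITLE>OmniPort Query Results</TITLE>", "<UL>"]
        for doc_id, title, score in results:
            lines.append('<LI><A HREF="/omniport/doc?id=%s">%s</A> (%.2f)'
                         % (doc_id, title, score))
        lines.append("</UL>")
        return "\n".join(lines)

    sample = [("AD-001", "Composite Armor Ballistic Tests", 0.92)]
    print(results_to_html(sample))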

Figure 4 - OmniPort Concept Query Form

Selecting the "Open" button with the "Concept" radio button chosen displays a form that allows the user to formulate a concept search. The user can select from a list of concepts generated in real time from the knowledge base. Once a concept has been selected, the user can choose from a list of information sources, similarly generated in real time. (All OmniPort displays except the home page and information pages, such as the help screens, are generated by code running on the server.) Figure 4 shows the OmniPort Concept Query Form with the concept "Composite Armor" and a WAIS database of SURVIAC documents selected.

The results obtained by that query are shown in Figure 5. Twenty-two documents were found in the WAIS source that matched some or all the search patterns associated with the concept, "Composite Armor". The ACCRUE algorithm calculated a relevance ranking based on the number of patterns that matched each document and the relevance score associated with each pattern. The resulting score is displayed next to the title of each document. To retrieve a particular document, the user needs only to click on the title, which is a hyperlink to the document display page.

Figure 5 - OmniPort Query Results Page


4.0 Future Growth

In its current state, OmniPort is a proof-of-concept system, not yet ready for wide distribution, but a clear growth path exists to bring it into a fully operational state within a year. This will involve the development and integration of a number of new features into OmniPort and will also require some further refinement of Mosaic itself.

4.1 OmniPort Growth

OmniPort will grow in capability, in part through the incorporation of Minerva features developed or under development for other customers. Among the features that will be included before OmniPort enters operational testing in Spring 1995 is a Query By Example (QBE) capability, which will allow users to submit all or part of a document as a model. OmniPort will perform a word frequency analysis on the model and retrieve documents with similar word frequencies. This serves the user who has, accidentally or intentionally, found a document that provides exactly the information being sought and now wants to ask: "Are there any more out there like this one?" Additional features that may be added are: an electronic mail gateway that would allow users without a Mosaic client (for example, while on the road) to connect to OmniPort; and a news feed monitor that would search for patterns in real time against any ASCII data stream, such as a news feed or message traffic, and notify a user when matches are found.
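One plausible way to compare word-frequency profiles, used here purely as an assumption since the measure OmniPort will adopt is not specified, is cosine similarity between the model document and each candidate, as the Python sketch below shows.

    # A sketch of the Query By Example idea: compare the word-frequency
    # profile of a model document with a candidate. Cosine similarity is
    # an assumed measure; OmniPort's actual method is not specified here.
    import math
    from collections import Counter

    def word_frequencies(text):
        return Counter(text.lower().split())

    def cosine_similarity(freq_a, freq_b):
        shared = set(freq_a) & set(freq_b)
        dot = sum(freq_a[w] * freq_b[w] for w in shared)
        norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
        norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    model = word_frequencies("composite armor ballistic test results")
    candidate = word_frequencies("ballistic test of composite armor panels")
    print(round(cosine_similarity(model, candidate), 2))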

4.2 Mosaic Growth

Clearly, the most important change to Mosaic that can occur from the OmniPort point of view is the creation of stable, fully forms-capable clients for MS Windows and Macintosh. This is absolutely critical to the success of OmniPort and to the success of Mosaic as a tool. Currently, both clients are in alpha release and both have problems that prevent their being used for anything other than testing for consistency of operation with the 'baseline' X-Windows client. A feature that would allow page updating in real time, as opposed to simply regenerating a page image with new data, would be extremely useful. Also needed are improved integration with helper applications, which would support simple cutting and pasting between applications, and improved security options, such as built-in digital signature authentication.


5.0 Notes

[1] Michael Robbins, "WAIS: A New Vision for Publishing," MicroTimes, 21 March 1994.

[2] Directory of Department of Defense Information Analysis Centers, DTIC, Alexandria, Virginia, Aug. 1993.

[3] Richard A. Tong and Lee Appelbaum, "Conceptual Information Retrieval from Full-Text," in Proceedings of RIAO-88: User-Oriented Context-Based Text and Image Handling, MIT, Cambridge, Massachusetts, March 1988.


6.0 Authors' Biographies

Shelley G. Ford
sford@dgis.dtic.dla.mil
Ms. Ford has over 15 years of experience in both government and private industry developing information products and services for professionals. She is currently the Chief of the Information Analysis Branch, within the Research, Development, and Acquisition Support Directorate of the Defense Technical Information Center, where she serves as the project manager for OmniPort development. Ms. Ford has a degree in English and a master's degree in Library and Information Science from the University of Maryland. In addition to OmniPort, the Information Analysis Branch is creating Mosaic home pages to assist Department of Defense officials in locating information and to disseminate selected information to the DoD community.

Robert C. Stern
rstern@ads.com
Mr. Stern has 12 years of experience in the operation, design, implementation, and management of software system development activities in support of data processing and analysis, primarily in the medical and intelligence fields. Mr. Stern has been responsible for the development and management of systems that have demonstrated advanced technologies such as knowledge-based processing, network architectures, and natural language processing. His areas of research are the automatic or semi-automatic expansion of knowledge bases and the automatic extraction of database federations based on domain knowledge. Mr. Stern received an MSCS degree from the University of Texas at Arlington in 1983.