OmniPort: Integrating Legacy Data into the WWW


Shelley G. Ford,
Chief, Information Analysis Branch
Defense Technical Information Center

Robert C. Stern,
Advanced Decision Systems
Booz Allen & Hamilton, Inc.

Abstract:

Today's information seekers are faced with two seemingly contradictory problems. First, the amount of data they can retrieve is limited by the number of different access protocols, search languages and indexing vocabularies they can master. Second, the sheer volume of available data threatens to overwhelm the user's ability to extract meaningful information from this ever-growing mass of data. This paper describes OmniPort, a system under development for the Defense Technical Information Center (DTIC), designed to address both of these problems by offering consistent single-query access to heterogeneous, distributed data sources and by providing a seamless environment for analyzing retrieved data. The former capability is supplied by the Minerva 'middleware' environment developed by Booz Allen & Hamilton, the latter by Mosaic.

1.0 The Problem

OmniPort came into existence to solve two seemingly contradictory problems. First, because of the need to master a bewildering variety of access protocols, search languages and indexing vocabularies, information users are able to find only a small percentage of the relevant data that should be available given current network technology. Second, information users are being overwhelmed by the sheer volume of data that can be retrieved on almost any subject. The better the first problem is solved, the worse the second problem tends to become.

This paper describes the OmniPort system, under development for the Defense Technical Information Center (DTIC) by the Advanced Decision Systems group of Booz Allen & Hamilton. This first section discusses the linked problems of data access in a networked environment and information retrieval from the masses of data available. The second section describes the Minerva middleware. The third section describes the Mosaic environment being developed to provide consistent user access to the data that Minerva retrieves. Finally, a fourth section discusses future growth plans and opportunities.

1.1 Data Access

The emergence of the Internet as a common means of information exchange has led to the creation of increasingly powerful tools for locating and retrieving relevant data. These tools have included archie, veronica, gopher and the Wide Area Information Server (WAIS). The most powerful of these, WAIS, provides a text search engine that is used to index primary words in each document. A user can then submit a search pattern to the WAIS engine, which identifies and ranks documents based on degree of match and allows the retrieval of the listed documents. [1] Each of these tools, however, requires the data to be organized (or re-organized, in the case of legacy data) into the format used by the access tool. This process can be expensive and time-consuming, and it always carries the risk of data loss or corruption during conversion. Using WAIS makes sense when the data has not previously been made available through a searchable online system. However, when the data is already organized and accessible to local users via a modern database or textbase manager, it is wasteful to rehost the data in order to make it available on a wide area network (WAN).
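To illustrate the kind of ranked keyword matching a WAIS engine performs, the short Python sketch below builds a tiny inverted index and ranks documents by the number of query words they contain. The documents, query and scoring are invented examples; a real WAIS server uses its own index format and ranking method.

    # A minimal sketch of keyword indexing and ranked retrieval in the
    # spirit of WAIS. Documents, query and scoring are illustrative only.
    from collections import defaultdict

    documents = {
        "doc1": "engine damage observed after compressor stall",
        "doc2": "wing damage sustained during landing",
        "doc3": "composite armor test results",
    }

    # Inverted index: word -> set of document ids containing that word.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)

    def search(query):
        """Rank documents by how many query words they contain."""
        scores = defaultdict(int)
        for term in query.lower().split():
            for doc_id in index.get(term, set()):
                scores[doc_id] += 1
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)

    print(search("engine damage"))   # [('doc1', 2), ('doc2', 1)]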

While the amount of data accessible using gopher, WAIS or the World Wide Web (WWW) is increasing rapidly, the percentage of data accessible via a standard Internet access tool is small when compared to the amount of relevant data that exists in organized, searchable form on network accessible computers. These existing legacy systems are the product of decades of information gathering and data analysis. The problem of accessing this vast store of legacy data has until recently resisted practical resolution. Either users were forced to learn multiple search interfaces or a costly time-consuming data conversion effort was required.

1.2 Information Retrieval

Although users currently can retrieve only a small percentage of the data that might be available, they are already being overwhelmed by the volume and kinds of data that can be obtained now on the Internet. So much information is available today in so many formats (full-text, graphics, audio, video, etc.) that the usable portion lies buried in masses of data where it cannot be located or retrieved easily. The expansion of the National Information Infrastructure (NII) to create an "information superhighway" will only add thousands of new information sources.

Effective tools must be developed to help the end user extract meaningful information from the masses of data now or soon to be available on the networks. One such tool is Mosaic, a toolset that uses the WWW hypertext paradigm to browse the Internet. More importantly, it introduces the concept of linked analysis tools that can be used to view and, ultimately, to manipulate any data that is retrieved. The set of helper applications that can be defined for any Mosaic client forms the core of this set of linked tools. Each retrieved document is displayed in the manner that makes it most understandable and usable.

The hypertext browsing approach, however, does not solve the problem of selecting relevant documents from the many thousands of pages that can be accessed. This problem can only be addressed by a query capability designed to help the user at both ends of the query process: first by assisting the user in formulating effective queries, and then by providing relevance ranking of the returned documents, so that inexpert or infrequent users can still retrieve useful results. Individual query systems sometimes provide these capabilities, but only on a limited scale and only for homogeneous information sources. The Internet environment requires that these capabilities operate across multiple dissimilar, geographically distributed data sources.

1.3 The OmniPort Project

DTIC began developing OmniPort as a tool for integrating the vast stores of legacy data gathered over the years by the Department of Defense (DoD) Information Analysis Centers (IACs). The IAC program was created by DoD to "improve the productivity of scientists, engineers, managers, and technicians in the Defense community through the timely dissemination of evaluated information". [2] IACs act as central data repositories in such specialized subject areas as ceramic materials, cold regions, soil mechanics and platform survivability/vulnerability, among others. Each IAC has acted independently to collect technical materials and data in its particular area of expertise. As a result, the 26 IACs now hold 26 different data collections, with little commonality in data storage and retrieval methods. Rather than attempt to rehost and centralize these data sources, DTIC opted to employ Minerva technology. This approach had several benefits, including leaving ownership and maintenance with the responsible IACs. The intended OmniPort user community includes members of the DoD acquisition and technology community who need access to IAC information in order to perform their jobs. To make OmniPort easily and inexpensively available to this large and diverse user population, it needed a front-end user interface that would work on all common desktop platforms. Mosaic was identified as the sole environment that met these requirements. In addition, Mosaic's increasing popularity within the DoD community would mean greater acceptance of OmniPort and reduced training costs.


2.0 Minerva

The Advanced Decision Systems group of Booz Allen & Hamilton has developed a middleware environment that directly addresses the data access and information retrieval problems outlined in Section 1. Minerva provides users with simple, consistent access to multiple information sources, regardless of physical location or the method used to access documents. Minerva operates with existing information sources, each with its own native search capability, without requiring any redevelopment or modification of the information sources.

2.1 Communications Architecture

Underlying the Minerva environment is a layered communications architecture that provides the linguistic power necessary to communicate among OmniPort's own processes and between OmniPort and the information sources it provides to its users. This architecture is represented graphically in Figure 1. The graphic representation of this architecture as three concentric circles is deliberate, in that the three languages involved are related to each other in a subset/superset relationship. These are, from the least inclusive to the most:

Figure 1 - OmniPort Communications Architecture

• Text Reference Language (TRL) is the language that encodes user queries in a uniform manner. The name is something of a misnomer, because the queries possible within the definition of TRL will retrieve any data, not just text documents. TRL comprises a superset of the search operators offered by existing text search engines. It allows users to generate queries of the maximum possible richness and power.

• Metalanguage adds a layer of support for the full range of commands and responses possible within OmniPort, of which the queries formed in TRL are just one example. Besides TRL queries, the metalanguage syntax supports the definition of requests for documents, for highlighting within a document and for drone initialization. Essentially, any command necessary for the various OmniPort processes to keep each other informed is definable in the metalanguage.

• Transport Language defines the low-level communications layer necessary for distributed processing. The transport language syntax permits the encapsulation of metalanguage commands with the necessary packeting, status and routing information so that each OmniPort process can identify the messages on which it must act. (The sketch following this list illustrates the resulting nesting.)
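To make the subset/superset relationship concrete, the following Python sketch wraps a TRL query inside a metalanguage command and then inside a transport envelope. The field names and formats shown here are invented for illustration only; the actual TRL, metalanguage and transport language syntaxes are defined by Minerva and are not reproduced in this paper.

    # A purely illustrative nesting of the three language layers described
    # above. All field names and formats here are hypothetical.
    import json

    # Innermost layer: a TRL query expressed in a uniform (assumed) form.
    trl_query = '(AND "engine" "damage")'

    # Metalanguage layer: the TRL query is just one kind of command.
    meta_command = {"command": "QUERY", "body": trl_query}

    # Transport layer: adds packeting, status and routing information so
    # that each OmniPort process can recognize the messages meant for it.
    transport_message = {
        "route": ["desktop", "dispatcher", "drone"],
        "status": "NEW",
        "payload": meta_command,
    }

    print(json.dumps(transport_message, indent=2))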

2.2 Process Architecture

The OmniPort software operates in a distributed, multiprocess environment. Figure 2 graphically presents the four main components of this architecture. These are, in the order they appear to interact with the user:

• Desktops that offer the user a GUI supporting the formation of queries, the display of results and the analysis of documents in the results set. (It should be noted that the desktop is the only part of this architecture the user actually 'sees'. The rest is effectively invisible.) The OmniPort desktop is the Mosaic interface described in Section 3.

Figure 2 - OmniPort Process Architecture from a User's Perspective

• Distributed Information Operating Environment (DIOE) that provides the transport backbone for communication between the architectural components. The primary component of this structure is a network of dispatcher processes that manage the passing of messages (related to concepts, queries, document lists, documents, etc.) around the network of distributed resources.

• Query Augmentation Services that assist users by broadening queries. Query augmentation offers several methodologies for expanding a user's query to increase the number of relevant documents retrieved. Once the documents are retrieved, the query augmentation services provide relevance ranking to assist the user in identifying the most relevant documents.

• Drones that integrate native search capabilities by translating between OmniPort's TRL and the native search engine's query language, as well as providing other services, such as the highlighting of retrieval terms in the document. (A sketch of this translation step appears after this list.)
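The following Python sketch shows the kind of translation a drone performs. Both the TRL expression and the target 'native' syntax are simplified assumptions made for illustration; a real drone is written against the full query language of a specific search engine.

    # A minimal sketch of a drone's translation step. The TRL form and the
    # target syntax are hypothetical simplifications.
    import re

    def trl_to_native(trl):
        """Translate a simple (AND "a" "b") TRL expression into the
        infix 'a AND b' form that many text search engines accept."""
        match = re.match(r'\(AND\s+(.*)\)$', trl)
        if not match:
            raise ValueError("unsupported TRL expression: " + trl)
        terms = re.findall(r'"([^"]+)"', match.group(1))
        return " AND ".join(terms)

    print(trl_to_native('(AND "engine" "damage")'))   # engine AND damage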

2.3 The Benefits of Minerva to OmniPort

Minerva provides OmniPort with a means of connecting users to a potentially limitless set of data sources. Minerva also provides the transport mechanism that carries queries formed by users at their Mosaic desktop to any connected data sources regardless of each source's native query language. The user forms a single query and selects any or all available sources; Minerva transports and translates the query so that each source receives it in a form it knows how to process. Minerva then collects the resulting response set and presents it to the user in a single Mosaic page.
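The fan-out and merge behavior described above can be pictured with a short Python sketch. The two source functions below are stand-ins invented for illustration; in OmniPort each would be a drone speaking its engine's native query language.

    # A sketch of the single-query fan-out/collect pattern. The sources
    # and their results are invented examples.
    def wais_source(query):
        return [("Composite Armor Ballistic Tests", 0.9)]

    def materials_db_source(query):
        return [("Armor Materials Summary", 0.7)]

    def query_all_sources(query, sources):
        merged = []
        for name, search in sources.items():
            for title, score in search(query):
                merged.append((score, title, name))
        # One relevance-ordered result set is returned for display.
        return sorted(merged, reverse=True)

    sources = {"SURVIAC WAIS": wais_source, "Materials DB": materials_db_source}
    print(query_all_sources('(AND "composite" "armor")', sources))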

This architecture provides a large number of benefits. Among the most important is fault tolerance. The Internet (or any distributed environment) is in a constant state of flux. Sources come on-line and go away at irregular intervals. Minerva's DIOE is designed to respond to both the disappearance and reappearance of sources in the appropriate manner. When a source goes off-line, the associated drone informs its associated dispatcher which broadcasts that information to all other dispatchers. Similarly, the reappearance of a source is noted, broadcast and that source is then included in all future transactions.

Equally important is the ease with which new sources can be included in the architecture. Drones are specific to a search engine, not to a source. If a new source with a search engine for which a drone already exists is added, it usually requires only the installation of the appropriate drone code. (This is true for text sources; incorporating a new structured database also requires the creation of a table that maps field names.)

Another important benefit of Minerva is its query augmentation services. To understand the importance of query augmentation, it is necessary to define two measures of retrieval quality: recall and precision. The recall measure compares the number of relevant documents retrieved by a search to the total number of relevant documents available, expressed as a percentage. For example, if 100 available documents relate to a user's topic of interest and a search returns 25 of them, the recall is 25%. The more of the relevant documents a search actually retrieves, the higher the recall. The precision measure compares the number of relevant documents in a retrieved set to the total number of documents in that set, again expressed as a percentage. For example, if a search retrieves 100 documents and only 30 are actually relevant to the user's needs, the precision is 30%. The higher the precision, the fewer irrelevant documents are in the response set.
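The two figures above can be checked with a pair of one-line Python functions; the numbers are those used in the paragraph.

    # Recall and precision as defined above, reproducing the worked figures.
    def recall(relevant_retrieved, total_relevant):
        return 100.0 * relevant_retrieved / total_relevant

    def precision(relevant_retrieved, total_retrieved):
        return 100.0 * relevant_retrieved / total_retrieved

    print(recall(25, 100))      # 25.0 -> the 25% recall example
    print(precision(30, 100))   # 30.0 -> the 30% precision example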

How do these measures apply to an actual search? If a researcher interested in engine damage submits a query of the form '(AND "engine" "damage")', the retrieval set will not include documents that discuss "compressor stage damage" without mentioning the word "engine", even though such documents are relevant. The search is therefore lower in recall to the extent that these relevant documents are missed. The search may also include documents that contain the words "engine" and "damage" but do not actually discuss engine damage; a document containing a phrase such as "the engine performed well but the left wing sustained damage" would be retrieved but is not relevant. Thus, the precision is lower as well.

To get around this problem, OmniPort is designed to work with multiple query augmentation services. A query augmentation service expands a user's query into a broader set of related search patterns, thereby increasing the likelihood that the query will retrieve a broader set of relevant documents. The disadvantage of a broadened query is that, while it increases recall, it runs the risk of decreasing precision. There are two primary ways in which a query augmentation service can combat this potential loss of precision. First, it can broaden the query in a smart fashion, using domain knowledge to ensure that the added search criteria are truly relevant to the user's goals. Second, it can rank the retrieved documents by relevance, so that users can easily identify the documents most likely to contain relevant data. Ideally, a query augmentation service should do both. Any time penalty associated with the added processing of a query augmentation service is more than made up for by the greater likelihood that the user will obtain the desired response set using fewer queries and with less time wasted browsing irrelevant documents.

A number of possible approaches can be taken to query augmentation and the subsequent relevance ranking. Most are based on a thesaurus: each word in the user's text pattern is expanded by adding its synonyms to the search set. Relevance ranking, in these cases, is done mainly by calculating the percentage of search terms that were found in any particular document. In developing OmniPort, Booz Allen incorporated a sophisticated, knowledge-based query augmentation capability known as RUBRIC. [3]
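A Python sketch of the thesaurus-based approach just described is shown below. The thesaurus entries are invented for illustration; the ranking simply reports the percentage of expanded terms found in a document.

    # A sketch of thesaurus-based query expansion and percentage-of-terms
    # ranking, as described above. The thesaurus content is invented.
    thesaurus = {
        "engine": ["engine", "turbine", "powerplant"],
        "damage": ["damage", "failure", "degradation"],
    }

    def expand(terms):
        """Expand each query word into itself plus its synonyms."""
        expanded = []
        for term in terms:
            expanded.extend(thesaurus.get(term, [term]))
        return expanded

    def rank(document_text, expanded_terms):
        """Score a document by the percentage of expanded terms it contains."""
        words = set(document_text.lower().split())
        found = sum(1 for term in expanded_terms if term in words)
        return 100.0 * found / len(expanded_terms)

    terms = expand(["engine", "damage"])
    print(rank("turbine failure traced to compressor stage damage", terms))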

2.3.1 RUBRIC

To ameliorate the performance problems of keyword and Boolean searching, a novel query augmentation method was developed called concept search. In this approach, users formulate queries by selecting concepts that have meaning in the subject domain. These concepts are developed by users knowledgeable in the domain and are stored in a concept knowledge base. Queries are made using single concepts or simple Boolean combinations of concepts (or combinations of concepts and specific text patterns).

A domain knowledge base contains concepts organized in a semantic network. Each concept is defined by a set of attributes and their values. It may also include a set of subconcepts and optionally a set of evidence (text patterns that indicate the presence of the concept in a document). Each concept includes weightings that specify how it relates to adjacent concepts and how a set of evidence contributes to relevance assessment. The set of these concepts constitutes the knowledge base that is made available to users for the formulation of queries.

A query composed of high-level domain concepts is successively decomposed by RUBRIC into lower and lower level domain concepts using the linkages in the knowledge base. At each level, any evidence patterns that may be present are formed into individual queries, along with the relevance weighting determined by the domain experts. These queries are then broadcast to all selected information sources. Like a thesaurus, the evidence set in the knowledge base contains all the various ways that a concept is likely to be referenced in the media, such as all the different synonyms or alternate spellings of a word, or the values a particular field may contain. Unlike a thesaurus, the evidence set is highly focused by the domain experts to assure a precise retrieval.
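The decomposition described above can be illustrated with a small Python sketch. The knowledge-base structure, concepts, evidence patterns and weights below are invented for illustration; RUBRIC's actual representation is a richer semantic network with attributes and ACCRUE rules.

    # A simplified illustration of concept decomposition. The concepts,
    # evidence patterns and weights are invented examples.
    knowledge_base = {
        "engine damage": {
            "subconcepts": {"compressor damage": 0.9, "turbine damage": 0.9},
            "evidence": {'"engine damage"': 1.0},
        },
        "compressor damage": {
            "subconcepts": {},
            "evidence": {'"compressor stage damage"': 0.8, '"compressor stall"': 0.6},
        },
        "turbine damage": {
            "subconcepts": {},
            "evidence": {'"turbine blade failure"': 0.8},
        },
    }

    def decompose(concept, weight=1.0):
        """Expand a concept into (evidence pattern, combined weight) pairs
        by following the subconcept links in the knowledge base."""
        node = knowledge_base[concept]
        queries = [(p, weight * w) for p, w in node["evidence"].items()]
        for sub, link_weight in node["subconcepts"].items():
            queries.extend(decompose(sub, weight * link_weight))
        return queries

    for pattern, w in decompose("engine damage"):
        print(pattern, round(w, 2))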

Figure 3 - OmniPort Test Home Page

When each of the individual queries gets a response from the information sources, RUBRIC gathers each response, performing an ACCRUE function on the set of evidence weightings for each pattern that 'hits' on a particular document. An ACCRUE function calculates a relevance ranking by giving a higher score to documents with more or better quality pattern matches. RUBRIC then reports the response set to the user with the appropriate relevance ranking associated with each document.
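The exact ACCRUE formula is not given here, but its behavior (more matches, and better-weighted matches, yield a higher score) can be sketched with a probabilistic sum, which is the assumption made in the Python fragment below.

    # A sketch of an ACCRUE-style combination. A probabilistic sum is
    # assumed here; the formula RUBRIC actually uses is not given above.
    def accrue(evidence_weights):
        score = 0.0
        for w in evidence_weights:
            score = score + w - score * w   # combine two scores, capped below 1.0
        return score

    print(accrue([0.8]))        # 0.8  - one strong pattern hit
    print(accrue([0.8, 0.6]))   # 0.92 - a second hit raises the score
    print(accrue([0.3, 0.3]))   # 0.51 - weaker hits accrue more slowly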


3.0 The Mosaic Interface

Mosaic provides the client display process and the communications protocol for any client site. Each client workstation is, in fact, invisible to Minerva. As the user selects from the OmniPort home page (Figure 3) any of the OmniPort-specific operations (specifically the "Open" or "Get" buttons), a script is launched on the Mosaic server which in turn starts any necessary Minerva processes. As Minerva passes information back to the Mosaic server, the script generates the HTML required to display the response on the client's screen.
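As a rough illustration of this server-side pattern, the Python fragment below turns a hypothetical result list handed back by Minerva into HTML for a results page. The result format, URL scheme and tag usage are assumptions for illustration, not OmniPort's actual script.

    # A sketch of a server-side script turning Minerva results into HTML
    # for the Mosaic client. Result format and URLs are hypothetical.
    def results_to_html(results):
        lines = ["<TITLE>OmniPort Query Results</TITLE>", "<UL>"]
        for doc_id, title, score in results:
            lines.append('<LI><A HREF="/omniport/doc?id=%s">%s</A> (%.2f)'
                         % (doc_id, title, score))
        lines.append("</UL>")
        return "\n".join(lines)

    sample = [("AD-001", "Composite Armor Ballistic Tests", 0.92)]
    print(results_to_html(sample))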

Figure 4 - OmniPort Concept Query Form

Selecting the "Open" button with the "Concept" radio button chosen displays a form that allows the user to formulate a concept search. The user can select from a list of concepts generated in real time from the knowledge base. Once a concept has been selected, the user can choose from a list of information sources, similarly generated in real time. (All OmniPort displays except the home page and information pages, such as the help screens, are generated by code running on the server.) Figure 4 shows the OmniPort Concept Query Form with the concept "Composite Armor" and a WAIS database of SURVIAC documents selected.

The results obtained by that query are shown in Figure 5. Twenty-two documents were found in the WAIS source that matched some or all the search patterns associated with the concept, "Composite Armor". The ACCRUE algorithm calculated a relevance ranking based on the number of patterns that matched each document and the relevance score associated with each pattern. The resulting score is displayed next to the title of each document. To retrieve a particular document, the user needs only to click on the title, which is a hyperlink to the document display page.

Figure 5 - OmniPort Query Results Page


4.0 Future Growth

In its current state, OmniPort is a proof-of-concept system, not yet ready for wide distribution, but a clear growth path exists to bring it into a fully operational state within a year. This will involve the development and integration of a number of new features into OmniPort and will also require some further refinement of Mosaic itself.

4.1 OmniPort Growth

OmniPort will grow in capability, in part through the incorporation of Minerva features developed or under development for other customers. Among the features that will be included before OmniPort enters operational testing in Spring 1995 is a Query By Example (QBE) capability, which will allow users to submit all or part of a document as a model. OmniPort will perform a word frequency analysis on the model and retrieve documents with similar word frequencies. This serves the user who has, accidentally or intentionally, found a document that provides exactly the information being sought and now wants to ask: "Are there any more out there like this one?" Additional features that may be added are: an electronic mail gateway that would allow users without a Mosaic client (for example, while on the road) to connect to OmniPort; and a news feed monitor that would search for patterns in real time against any ASCII data stream, such as a news feed or message traffic, and notify a user when matches are found.
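One plausible way to compare word-frequency profiles, used here purely as an assumption since the measure OmniPort will adopt is not specified, is cosine similarity between the model document and each candidate, as the Python sketch below shows.

    # A sketch of the Query By Example idea: compare the word-frequency
    # profile of a model document with a candidate. Cosine similarity is
    # an assumed measure; OmniPort's actual method is not specified here.
    import math
    from collections import Counter

    def word_frequencies(text):
        return Counter(text.lower().split())

    def cosine_similarity(freq_a, freq_b):
        shared = set(freq_a) & set(freq_b)
        dot = sum(freq_a[w] * freq_b[w] for w in shared)
        norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
        norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    model = word_frequencies("composite armor ballistic test results")
    candidate = word_frequencies("ballistic test of composite armor panels")
    print(round(cosine_similarity(model, candidate), 2))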

4.2 Mosaic Growth

Clearly, the most important change to Mosaic that can occur from the OmniPort point of view is the creation of stable, fully forms-capable clients for MS Windows and Macintosh. This is absolutely critical to the success of OmniPort and to the success of Mosaic as a tool. Currently, both clients are in alpha release and both have problems that prevent their being used for anything other than testing for consistency of operation with the 'baseline' X-Windows client. A feature that would allow page updating in real time, as opposed to simply regenerating a page image with new data, would be extremely useful. Also needed are improved integration with helper applications, which would support simple cutting and pasting between applications, and improved security options, such as built-in digital signature authentication.


5.0 Notes

[1] Michael Robbins, "WAIS: A New Vision for Publishing," MicroTimes, 21 March 1994.

[2] Directory of Department of Defense Information Analysis Centers, DTIC, Alexandria, Virginia, Aug. 1993.

[3] Richard A. Tong and Lee Appelbaum, "Conceptual Information Retrieval from Full-Text," in Proceedings of RIAO-88: User-Oriented Context-Based Text and Image Handling, MIT, Cambridge, Massachusetts, March 1988.


6.0 Authors' Biographies

Shelley G. Ford
sford@dgis.dtic.dla.mil
Ms. Ford has over 15 years of experience in both government and private industry developing information products and services for professionals. She is currently the Chief of the Information Analysis Branch, within the Research, Development, and Acquisition Support Directorate of the Defense Technical Information Center, where she serves as the project manager for OmniPort development. Ms. Ford has a degree in English and a master's degree in Library and Information Science from the University of Maryland. In addition to OmniPort, the Information Analysis Branch is creating Mosaic home pages to assist Department of Defense officials in locating information and to disseminate selected information to the DoD community.

Robert C. Stern
rstern@ads.com
Mr. Stern has 12 years of experience in the operation, design, implementation, and management of software system development activities in support of data processing and analysis, primarily in the medical and intelligence fields. Mr. Stern has been responsible for the development and management of systems that have demonstrated advanced technologies such as knowledge-based processing, network architectures, and natural language processing. His areas of research are the automatic or semi-automatic expansion of knowledge bases and the automatic extraction of database federations based on domain knowledge. Mr. Stern received an MSCS degree from the University of Texas at Arlington in 1983.