OmniPort: Integrating Legacy Data into the WWW
Shelley G. Ford,
Chief, Information Analysis Branch
Defense Technical Information Center
Robert C. Stern,
Advanced Decision Systems
Booz·Allen & Hamilton, Inc.
Abstract:
Today's information seekers are faced with two
seemingly contradictory problems. First, the amount of data they can
retrieve is limited by the number of different access protocols, search
languages and indexing vocabularies they can master. Second, the sheer
volume of available data threatens to overwhelm the user's ability to
extract meaningful information from this ever-growing mass of data.
This paper describes OmniPort, a system under development for the
Defense Technical Information Center (DTIC), designed to address both
of these problems by offering consistent single-query access to
heterogeneous, distributed data sources and by providing a seamless
environment for analyzing retrieved data. The former capability is
supplied by the Minerva 'middleware' environment developed by
Booz·Allen & Hamilton, the latter by Mosaic.
1.0 The Problem
OmniPort came into existence to solve two
seemingly contradictory problems. First, because of the need to master
a bewildering variety of access protocols, search languages and
indexing vocabularies, information users are able to find only a small
percentage of the relevant data that should be available given current
network technology. Second, information users are being overwhelmed by
the sheer volume of data that can be retrieved on almost any subject.
The better the first problem is solved, the worse the second problem
tends to become.
This paper describes the OmniPort system, under development for the Defense Technical Information Center (DTIC) by the Advanced Decision Systems group of Booz·Allen & Hamilton. This first section
discusses the linked problems of data access in a networked environment
and information retrieval from the masses of data available. The second section describes the Minerva middleware. The third section
describes the Mosaic environment being developed to provide consistent
user access to the data that Minerva retrieves. Finally, a fourth section discusses future growth plans and opportunities.
The emergence of the Internet as a common means of
information exchange has led to the creation of increasingly powerful
tools for locating and retrieving relevant data. These tools have
included archie, veronica, gopher and the Wide Area Information Service
(WAIS). The most powerful of
these, WAIS, provides a text search engine that is used to index
primary words in each document. A user can then submit a search
pattern to the WAIS engine, which identifies and ranks documents based
on degree of match and allows the retrieval of listed documents. [1]
Each of these tools, however, requires the data to be organized (or
re-organized, in the case of legacy data) into the format used by the
access tool. This process can be expensive and time-consuming, and it
always carries the risk of data loss or corruption during conversion.
Using WAIS makes sense when the data has not previously been
made available using a searchable online system. However, when the
data is already organized and accessible to local users via a modern
database or textbase manager, it is wasteful to rehost the data in
order to make it available on a wide area network (WAN). While the
amount of data accessible using gopher, WAIS or the World Wide Web
(WWW) is increasing rapidly, the percentage of data accessible via a
standard Internet access tool is small when compared to the amount of
relevant data that exists in organized, searchable form on network
accessible computers. These existing legacy systems are the product of
decades of information gathering and data analysis. The problem of
accessing this vast store of legacy data has until recently resisted
practical resolution. Either users were forced to learn multiple
search interfaces or a costly time-consuming data conversion effort was
required.
1.2 Information Retrieval
Although users currently can
retrieve only a small percentage of the data that might be available,
they are already being overwhelmed by the volume and kinds of data that
can be obtained now on the Internet. So much information is available
today in so many formats (full-text, graphics, audio, video, etc.) that
the usable portion lies buried in masses of data where it cannot be
located or retrieved easily. The expansion of the National Information
Infrastructure (NII) to create an "information superhighway" will only
add thousands of new information sources.
Effective tools must be developed to help the end user extract
meaningful information from the masses of data now or soon to be
available on the networks. One such tool is Mosaic, a toolset that
uses the WWW hypertext paradigm to browse the Internet. More
importantly, it introduces the concept of linked analysis tools that
can be used to view and, ultimately, to manipulate any data that is
retrieved. The set of helper applications that can be defined for any
Mosaic client forms the core of this set of linked tools. Each
retrieved document is displayed in the manner that makes it most
understandable and usable.
The hypertext browsing approach, however, does not solve the problem of
selecting relevant documents from the many thousands of pages that can
be accessed. This problem can only be addressed by a query capability
designed to help inexpert or infrequent users at both ends of the query
process: first by assisting them in formulating effective queries, and
then by ranking the returned documents by relevance. Individual query systems sometimes
provide these capabilities, but only on a limited scale and only to
homogeneous information sources. The Internet environment requires
that these capabilities operate across multiple dissimilar,
geographically-distributed data sources.
1.3 The OmniPort Project
DTIC began developing OmniPort as
a tool for integrating the vast stores of legacy data gathered over the
years by the Department of Defense (DoD) Information Analysis Centers (IACs).
The IAC program was created by DoD to "improve the productivity of
scientists, engineers, managers, and technicians in the Defense
community through the timely dissemination of evaluated information". [2]
IACs act as central data repositories in such specialized subject areas
as ceramic materials, cold regions, soil mechanics and platform
survivability/vulnerability, among others. Each IAC has acted
independently to collect technical materials and data in its particular
area of expertise. Therefore, at the 26 IACs, there are now 26
different data collections, with little commonality in data storage and
retrieval methods. Rather than attempt to rehost and centralize these
data sources, DTIC opted to employ Minerva technology. This approach
had several benefits, including leaving ownership and maintenance with
the responsible IACs. The intended OmniPort user community includes
members of the DoD acquisition and technology community who need
access to IAC information in order to perform their jobs. To make
OmniPort easily and inexpensively available to this large and diverse
user population, it needed a front-end user interface that would work on
all common desktop platforms. Mosaic was identified as the sole
environment that met these requirements. In addition, Mosaic's
increasing popularity within the DoD community would mean greater
acceptance of OmniPort and reduced training costs.
The Advanced Decision Systems group of Booz·Allen
& Hamilton has developed a middleware environment that directly
addresses the data access and information retrieval problems outlined
in Section 1. Minerva provides users with simple, consistent access to
multiple information sources, regardless of physical location or the
method used to access documents. Minerva operates with existing
information sources, each with its own native search capability,
without requiring any redevelopment or modification of the information
sources.
Underlying the Minerva environment is a layered
communications architecture that provides the linguistic power
necessary to communicate among OmniPort's own processes and between
OmniPort and the information sources it provides to its users. This
architecture is represented graphically in Figure 1. The graphic
representation of this architecture as three concentric circles is
deliberate, in that the three languages involved are related to each
other in a subset/superset relationship. These are, from the least
inclusive to the most:
Figure 1 - OmniPort Communications Architecture
• Text Reference Language (TRL) is the language that
encodes user queries in a uniform manner. The name is something of a
misnomer, because the queries possible within the definition of TRL
will retrieve any data, not just text documents. TRL comprises a
superset of the search operators offered by existing text search
engines. It allows users to generate queries of the maximum possible
richness and power.
• Metalanguage adds a layer of support for the full
range of commands and responses possible within OmniPort, of which the
queries formed in TRL are just one example. Besides TRL queries, the
metalanguage syntax supports the definition of requests for documents,
for highlighting within a document and for drone initialization.
Essentially, any command necessary for the various OmniPort processes
to keep each other informed is definable in the metalanguage.
• Transport Language defines the low-level
communications layer necessary for distributed processing. The
transport language syntax permits the encapsulation of metalanguage
commands with necessary packeting, status and routing information so
that each OmniPort process can identify the messages on which it must
act.
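The subset/superset relationship among the three languages can be pictured as successive wrapping: a TRL query is carried inside a metalanguage command, which is in turn carried inside a transport message. The following Python sketch illustrates only that nesting. All class names, field names and command tags are hypothetical and do not reflect Minerva's actual message formats.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class TRLQuery:
    """Innermost layer: a uniform query such as (AND "engine" "damage")."""
    expression: str


@dataclass(frozen=True)
class DocumentRequest:
    """A non-query command that the metalanguage must also express."""
    document_id: str


@dataclass(frozen=True)
class MetaCommand:
    """Middle layer: any command, of which a TRL query is one kind."""
    kind: str  # hypothetical tags: "query", "get-document"
    body: Union[TRLQuery, DocumentRequest]


@dataclass(frozen=True)
class TransportMessage:
    """Outer layer: a command plus routing, sequencing and status data."""
    sender: str
    recipient: str
    sequence: int
    status: str
    command: MetaCommand


def wrap_query(expression, sender, recipient, sequence):
    """Encapsulate a TRL expression for delivery to one process."""
    command = MetaCommand("query", TRLQuery(expression))
    return TransportMessage(sender, recipient, sequence, "new", command)


def is_addressed_to(message, process_name):
    """Let a process decide whether it must act on a message."""
    return message.recipient == process_name


msg = wrap_query('(AND "engine" "damage")', "desktop-1", "drone-wais", 1)
```

In this sketch a process inspects only the outer layer to decide whether a message concerns it, and unwraps the inner layers only when it does.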
The OmniPort software operates in a distributed,
multiprocess environment. Figure 2 graphically presents the four main
components of this architecture. These are, in the order they appear
to interact with the user:
• Desktops that offer the user a GUI supporting the
formation of queries, the display of results and the analysis of
documents in the results set. (It should be noted that the desktop is
the only part of this architecture the user actually 'sees'. The rest
is effectively invisible.) The OmniPort desktop is the Mosaic
interface described in Section 3.
Figure 2 - OmniPort Process Architecture from a User's Perspective
• Distributed Information Operating Environment (DIOE)
that provides the transport backbone for the communication between the
architectural components. The primary components of this structure are
a network of dispatcher processes that manage the passing of messages
(related to concepts, queries, document lists, documents, etc.) around
the network of distributed resources.
• Query Augmentation Services that assist users by
broadening queries. Query augmentation offers several methodologies
for expanding a user's query to increase the number of relevant
documents retrieved. Once the documents are retrieved, the query
augmentation services provide relevance ranking to assist the user in
identifying the most relevant documents.
• Drones that integrate native search capabilities by
translating between OmniPort's TRL and the native search engine's query
language, as well as providing other services, such as the highlighting
of retrieval terms in the document.
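The translation role of a drone can be illustrated with a small Python sketch. It assumes that a TRL query has already been parsed into nested tuples of the form (operator, operand, ...), and it renders that structure for two invented target engines. Neither target syntax corresponds to a real product, and the function names are hypothetical.

```python
def to_infix(query):
    """Render a parsed query for a hypothetical engine with infix operators."""
    if isinstance(query, str):
        return '"' + query + '"'
    operator, *operands = query
    joined = (" " + operator + " ").join(to_infix(q) for q in operands)
    return "(" + joined + ")"


def to_keyword_list(query):
    """Flatten a parsed query for a hypothetical keyword-only engine."""
    if isinstance(query, str):
        return [query]
    _, *operands = query
    keywords = []
    for operand in operands:
        keywords.extend(to_keyword_list(operand))
    return keywords


# Parsed form of (AND "engine" (OR "damage" "failure"))
trl = ("AND", "engine", ("OR", "damage", "failure"))
infix_query = to_infix(trl)
keyword_query = to_keyword_list(trl)
```

The second translator discards the Boolean structure, which shows why a drone for a weaker engine may need to filter results after retrieval; whether Minerva's drones do so is not stated here.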
Minerva provides OmniPort with a means of connecting
users to a potentially limitless set of data sources. Minerva also
provides the transport mechanism that carries queries formed by users
at their Mosaic desktop to any connected data sources regardless of
each source's native query language. The user forms a single query and
selects any or all available sources; Minerva transports and translates
the query so that each source receives it in a form it knows how to
process. Minerva then collects the resulting response set and presents
it to the user in a single Mosaic page.
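The single-query behavior described above amounts to a fan-out followed by a merge. The Python sketch below shows one way this could be arranged. The source callables stand in for drones, the names are hypothetical, and the assumption that scores from different sources are directly comparable is made only to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor


def search_all(query, sources):
    """Send one query to every selected source and merge the replies.

    `sources` maps a source name to a callable that accepts the query and
    returns a list of (title, score) pairs.
    """
    names = list(sources)
    with ThreadPoolExecutor(max_workers=max(1, len(names))) as pool:
        replies = list(pool.map(lambda name: sources[name](query), names))
    merged = [
        (score, title, name)
        for name, hits in zip(names, replies)
        for title, score in hits
    ]
    # Highest score first; ties are broken alphabetically by title.
    merged.sort(key=lambda hit: (-hit[0], hit[1]))
    return [(title, name, score) for score, title, name in merged]


sources = {
    "wais": lambda q: [("Armor test report", 0.9)],
    "sql": lambda q: [("Ceramic plate data", 0.7), ("Hull survey", 0.95)],
}
results = search_all('(AND "composite" "armor")', sources)
```

Each entry in `results` records the source that supplied it, so a single result page can still show the user where each document resides.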
This architecture provides a large number of benefits.
Among the most important is fault tolerance. The Internet (or any
distributed environment) is in a constant state of flux. Sources come
on-line and go away at irregular intervals. Minerva's DIOE is designed
to respond to both the disappearance and reappearance of sources in
the appropriate manner. When a source goes off-line, the associated
drone informs its associated dispatcher which broadcasts that
information to all other dispatchers. Similarly, when a source
reappears, the event is noted and broadcast, and the source is then
included in all future transactions.
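The availability broadcast can be sketched in Python as follows. The sketch assumes that every dispatcher is directly connected to every other dispatcher, which the description above does not state, and the class and method names are hypothetical.

```python
class Dispatcher:
    """Tracks which sources are currently reachable."""

    def __init__(self, name):
        self.name = name
        self.peers = []
        self.available = set()

    def connect(self, other):
        """Create a two-way link between this dispatcher and another."""
        self.peers.append(other)
        other.peers.append(self)

    def report(self, source, online):
        """Called by a local drone when its source changes state."""
        self._apply(source, online)
        for peer in self.peers:
            peer._apply(source, online)

    def _apply(self, source, online):
        if online:
            self.available.add(source)
        else:
            self.available.discard(source)


east, west = Dispatcher("east"), Dispatcher("west")
east.connect(west)
east.report("wais-surviac", True)    # source comes on-line
east.report("wais-surviac", False)   # source goes off-line
east.report("wais-surviac", True)    # source reappears
```

After the final call both dispatchers again list the source, so it is offered for subsequent queries without any action by the user.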
Equally important is the ease with which new sources
can be included in the architecture. Drones are specific to a search
engine, not to a source. If a new source with a search engine for
which a drone already exists is added, it usually requires only the
installation of the appropriate drone code. (This is true for text
sources; incorporating a new structured database also requires the
creation of a table that maps field names.)
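For a structured source, the field-name table mentioned above might look like the following Python sketch. The generic field names, the column names and the function name are all invented for illustration and do not describe any actual IAC database.

```python
# Hypothetical mapping from generic field names to one source's columns.
FIELD_MAP = {
    "author": "AUTH_NM",
    "title": "DOC_TITLE",
    "year": "PUB_YR",
}


def map_fields(criteria, field_map):
    """Rename generic field names to the source's own column names."""
    unknown = sorted(set(criteria) - set(field_map))
    if unknown:
        raise KeyError("no mapping for fields: " + ", ".join(unknown))
    return {field_map[name]: value for name, value in criteria.items()}


native = map_fields({"author": "Tong", "year": 1988}, FIELD_MAP)
```

Because the table is data rather than code, the same drone can serve a second database that uses the same engine simply by loading a different table.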
Another important benefit of Minerva is its query
augmentation services. To understand the importance of a query
augmentation service, it is necessary to define two measures of the
quality of a retrieval: recall and precision. The recall measure
compares the number of relevant documents retrieved by a search to the
total number of available relevant documents. The result is expressed
as a percentage. For example, if the total number of available
documents that relate to a user's topic of interest is 100 and an
attempted search by a user returns 25 of those documents, the recall is
25%. The more of the available relevant documents a search actually
returns, the higher the recall. The
precision measure compares the number of relevant documents in a
retrieved set to the total number of documents in that set.
Precision is also given as a percentage. For example, if a search
retrieves 100 documents and only 30 are actually relevant to the user's
needs, the precision is 30%. The higher the precision, the fewer
irrelevant documents are in the response set.
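Both measures can be computed directly from sets of document identifiers. The short Python sketch below reproduces the two numerical examples given above; the function names are chosen here for illustration.

```python
def recall(retrieved, relevant):
    """Percentage of all relevant documents that the search returned."""
    if not relevant:
        return 0.0
    return 100.0 * len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Percentage of the returned documents that are relevant."""
    if not retrieved:
        return 0.0
    return 100.0 * len(retrieved & relevant) / len(retrieved)


# 100 relevant documents exist; the search returns 25 of them.
recall_example = recall(set(range(25)), set(range(100)))

# The search returns 100 documents; only 30 of them are relevant.
precision_example = precision(set(range(100)), set(range(30)))
```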
How do these measures apply to an actual search? If a
researcher is interested in engine damage, and submits a query in the
form '(AND "engine" "damage")', the retrieval set will not include
documents that mention "compressor stage damage" despite the fact that
such documents are relevant. Thus, the search would be lower in recall
to the extent that these relevant documents were missed. The search may also
include documents that contain the search words "engine" and "damage"
but do not actually discuss engine damage. For example, a document
containing a phrase such as "the engine performed well but the left
wing sustained damage" would be included but not relevant. Thus, the
precision would be lower.
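The engine damage example can be reproduced with a simple word-level AND match in Python. The three sample documents are invented to mirror the cases discussed above, and the matching rule is a deliberately naive stand-in for a real search engine.

```python
import re


def words(text):
    """Return the set of lowercase words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def matches_and(text, terms):
    """True when every query term occurs somewhere in the text."""
    found = words(text)
    return all(term in found for term in terms)


documents = {
    "A": "Inspection found engine damage after the bird strike.",
    "B": "Compressor stage damage was traced to ingested debris.",
    "C": "The engine performed well but the left wing sustained damage.",
}
hits = [
    name
    for name, text in documents.items()
    if matches_and(text, ["engine", "damage"])
]
```

Document B is relevant but missed, which lowers recall, while document C is retrieved but not relevant, which lowers precision.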
To get around this problem, OmniPort is designed to
work with multiple query augmentation services. The query augmentation
services expand a user's query into a broader set of related search
patterns, thereby increasing the likelihood that the query will
retrieve a broader set of relevant documents. The disadvantage of a
broadened query is that, while it increases recall, it runs the risk of
decreasing precision. There are two primary ways in which a query
augmentation service can combat this potential loss of precision.
First, it can broaden the query in a smart fashion, preserving
precision by using domain knowledge to ensure that added search
criteria are truly relevant to the user's goals. Second, it
can rank the retrieved documents by relevance, so that users are easily
able to identify those documents that are most likely to contain
relevant data. Ideally, a query augmentation service should do both.
Any time penalty associated with the added processing of a query
augmentation service is more than made up for by the greater likelihood
that the user will obtain the desired response set using fewer queries
and with less time wasted browsing irrelevant documents.
A number of possible approaches can be taken to query
augmentation and the subsequent relevance ranking. Most are based on a
thesaurus, so that the individual words in the user's text pattern are
each expanded by including synonyms for the words in the search set.
Relevance ranking, in these cases, is done mainly by calculating the
percentage of search terms that were found in any particular document.
In developing OmniPort, Booz·Allen incorporated a sophisticated,
knowledge-based query augmentation capability known as RUBRIC. [3]
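The thesaurus-based approach described above can be sketched in a few lines of Python. The thesaurus entries are invented, and the scoring rule shown (the percentage of original query terms found either directly or through a synonym) is one plausible reading of the ranking described, not a description of any particular product.

```python
import re

# Hypothetical thesaurus entries.
THESAURUS = {
    "damage": ["failure", "degradation"],
    "engine": ["powerplant"],
}


def expand(terms, thesaurus):
    """Map each query term to itself plus its synonyms."""
    return {term: [term] + thesaurus.get(term, []) for term in terms}


def score(text, expanded):
    """Percentage of query terms matched directly or through a synonym."""
    if not expanded:
        return 0.0
    found_words = set(re.findall(r"[a-z]+", text.lower()))
    found = sum(
        1 for variants in expanded.values() if found_words & set(variants)
    )
    return 100.0 * found / len(expanded)


expanded = expand(["engine", "damage"], THESAURUS)
full_match = score("Powerplant failure report", expanded)
half_match = score("Wing damage assessment", expanded)
```

The first sample text contains neither original query word yet scores 100, which illustrates the gain in recall; nothing in the scheme checks whether a synonym is appropriate to the user's domain, which is the weakness a knowledge-based method is meant to address.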
2.3.1 RUBRIC
To ameliorate the performance problems of
keyword and Boolean searching, a novel query augmentation method called
concept search was developed. In this approach, users formulate
queries by selecting concepts that have meaning in the subject domain.
These concepts are developed by users knowledgeable in the domain and
are stored in a concept knowledge base. Queries are made using single
concepts or simple Boolean combinations of concepts (or combinations
of concepts and specific text patterns).
A domain knowledge base contains concepts organized in a semantic
network. Each concept is defined by a set of attributes and their
values. It may also include a set of subconcepts and optionally a set
of evidence (text patterns that indicate the presence of the concept in
a document). Each concept includes weightings that specify how it
relates to adjacent concepts and how a set of evidence contributes to
relevance assessment. The set of these concepts constitutes the
knowledge base that is made available to users for the formulation of
queries.
A query composed of high-level domain concepts is successively
decomposed by RUBRIC into lower and lower level domain concepts using
the linkages in the knowledge base. At each level, any evidence
patterns that may be present are formed into individual queries, along
with the relevance weighting determined by the domain experts. These
queries are then broadcast to all selected information sources. Like a
thesaurus, the evidence set in the knowledge base contains all the
various ways that a concept is likely to be referenced in the media,
such as all the different synonyms or alternate spellings of a word, or
the values a particular field may contain. Unlike a thesaurus, the
evidence set is highly focused by the domain experts to assure a
precise retrieval.
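The decomposition of a concept into weighted evidence patterns can be illustrated with the Python sketch below. The knowledge base contents, the weights, and the rule of multiplying weights along each link are assumptions made for this example; the actual RUBRIC representation and weighting scheme are described in [3].

```python
# Hypothetical knowledge base; only the concept name "composite armor"
# comes from the example query discussed in this paper.
KNOWLEDGE_BASE = {
    "composite armor": {
        "evidence": {"composite armor": 1.0, "laminate armor": 0.8},
        "subconcepts": {"ceramic armor": 0.7},
    },
    "ceramic armor": {
        "evidence": {"ceramic tile": 0.6, "alumina plate": 0.5},
        "subconcepts": {},
    },
}


def decompose(concept, kb, weight=1.0, seen=None):
    """Collect evidence patterns for a concept and all of its subconcepts.

    Weights are multiplied along each link (an assumption of this sketch).
    The `seen` set guards against cycles in the semantic network.
    """
    seen = set() if seen is None else seen
    if concept in seen:
        return {}
    seen.add(concept)
    node = kb[concept]
    patterns = {}
    for pattern, w in node["evidence"].items():
        patterns[pattern] = max(patterns.get(pattern, 0.0), weight * w)
    for child, link in node["subconcepts"].items():
        for pattern, w in decompose(child, kb, weight * link, seen).items():
            patterns[pattern] = max(patterns.get(pattern, 0.0), w)
    return patterns


queries = decompose("composite armor", KNOWLEDGE_BASE)
```

Each key in `queries` would be issued as an individual search pattern, and its weight would accompany any hit on that pattern back to the ranking step.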
Figure 3 - OmniPort Test Home Page
As the individual queries receive responses from the information sources,
RUBRIC gathers them, applying an ACCRUE function to the set
of evidence weightings for each pattern that 'hits' on a particular
document. An ACCRUE function calculates a relevance ranking by giving
a higher score to documents with more or better quality pattern
matches. RUBRIC then reports the response set to the user with the
appropriate relevance ranking associated with each document.
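The paper does not give the ACCRUE formula. The Python sketch below substitutes a probabilistic sum, which has the stated property that more matches or better quality matches produce a higher score, and should be read as an illustration of that property rather than as RUBRIC's actual function. The function names are hypothetical.

```python
def accrue(weights):
    """Combine evidence weights in [0, 1] using a probabilistic sum.

    This is an assumed stand-in for ACCRUE: each additional match, and
    each heavier match, raises the combined score toward 1.0.
    """
    remaining = 1.0
    for weight in weights:
        if not 0.0 <= weight <= 1.0:
            raise ValueError("weights must lie between 0 and 1")
        remaining *= 1.0 - weight
    return 1.0 - remaining


def rank(hits):
    """Order documents by accrued score, highest first.

    `hits` maps a document name to the weights of the patterns it matched.
    """
    scored = [(accrue(weights), name) for name, weights in hits.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, value) for value, name in scored]


ranking = rank({"doc-a": [0.5], "doc-b": [0.5, 0.5], "doc-c": [0.8]})
```

Here `doc-c` ranks first on the strength of a single high-quality match, and `doc-b` outranks `doc-a` because it matched two patterns rather than one.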
Mosaic provides the client display process and the
communications protocol for any client site. Each client workstation
is, in fact, invisible to Minerva. As the user selects from the
OmniPort home page (Figure 3) any of the OmniPort-specific operations
(specifically the "Open" or "Get" buttons), a script is launched on the
Mosaic server which in turn starts any necessary Minerva processes. As
Minerva passes information back to the Mosaic server, the script
generates the HTML required to display the response on the client's
screen.
Figure 4 - OmniPort Concept Query Form
Pressing the "Open" button while the "Concept" radio
button is selected displays a form that allows the user to
formulate a concept search. The user can select from a list of
concepts generated in real-time from the knowledge base. Once a
concept has been selected, the user can choose from a list of
information sources, similarly generated in real-time. (All OmniPort
displays except for the home page and information pages, such as the
help screens, are generated from code running on the server.) Figure 4
shows the OmniPort Concept Query Form with the concept "Composite
Armor" and a WAIS database of SURVIAC documents selected.
The results obtained by that query are shown in Figure
5. Twenty-two documents were found in the WAIS source that matched some
or all the search patterns associated with the concept, "Composite
Armor". The ACCRUE algorithm calculated a relevance ranking based on
the number of patterns that matched each document and the relevance
score associated with each pattern. The resulting score is displayed
next to the title of each document. To retrieve a particular document,
the user needs only to click on the title, which is a hyperlink to the
document display page.
Figure 5 - OmniPort Query Results Page
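A server-side script that turns a ranked response set into a results page of this kind might resemble the following Python sketch. The URL path, the function name and the page layout are hypothetical, and the actual OmniPort scripts are not shown in this paper.

```python
from html import escape
from urllib.parse import quote


def render_results(concept, hits, base_url="/omniport/document"):
    """Build an HTML page listing scored documents as hyperlinks.

    `hits` is a list of (document_id, title, score) tuples. `base_url` is
    a hypothetical address for the document display page.
    """
    items = []
    for doc_id, title, score in sorted(hits, key=lambda hit: -hit[2]):
        link = escape(base_url + "?id=" + quote(doc_id, safe=""), quote=True)
        items.append(
            f'<li>{score:.2f} <a href="{link}">{escape(title)}</a></li>'
        )
    heading = escape("Results for " + concept)
    return (
        f"<html><head><title>{heading}</title></head>\n"
        f"<body><h1>{heading}</h1>\n<ul>\n"
        + "\n".join(items)
        + "\n</ul></body></html>"
    )


page = render_results(
    "Composite Armor",
    [("d1", "Armor & Ceramics", 0.42), ("d2", "Laminate study", 0.90)],
)
```

Titles are escaped before being placed in the page so that characters such as the ampersand in a document title cannot corrupt the generated HTML.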
In its current state, OmniPort is a proof-of-concept
system, not yet ready for wide distribution, but a clear growth path
exists to bring it into a fully operational state within a year. This
will involve the development and integration of a number of new
features into OmniPort and will also require some further refinement of
Mosaic itself.
OmniPort will grow in capability, in part through the
incorporation of Minerva features developed or under development for
other customers. Among the features that will be included before
entering operational testing in Spring 1995 is a Query By Example
(QBE) capability, which will allow users to submit all or part of a
document as a model. OmniPort will perform a word frequency analysis
on the model and retrieve documents with similar word frequencies.
This solves the problem for a user who has, accidentally or
intentionally, found a document that provides exactly the information
being sought, and now wants to ask: "Are there any more out there like
this one?" Additional features that may be added are: an electronic
mail gateway that would allow users without a Mosaic client (for
example, while on the road) to connect to OmniPort; and a news feed
monitor that would search for patterns in real-time against any ASCII
data stream, such as a news feed or message traffic, and notify a user
when matches are found.
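The planned QBE capability is described only as a word frequency comparison. One common way to compare word frequencies is the cosine similarity between term-count vectors, and the Python sketch below uses that measure as an assumption; the method that OmniPort will actually adopt is not specified here, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase words in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def similarity(a, b):
    """Cosine similarity between two word-frequency counters."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_by_example(model_text, documents, limit=10):
    """Return names of the documents most similar to the model text."""
    model = word_frequencies(model_text)
    scored = [
        (similarity(model, word_frequencies(text)), name)
        for name, text in documents.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for value, name in scored[:limit] if value > 0.0]


library = {
    "a": "engine damage report engine",
    "b": "soil mechanics survey",
}
similar = query_by_example("engine damage", library)
```

A practical version would also discount very common words, since raw counts let frequent function words dominate the comparison.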
Clearly, the most important change to Mosaic that can
occur from the OmniPort point of view is the creation of stable, fully
forms-capable clients for MS Windows and Macintosh. This is absolutely
critical to the success of OmniPort and to the success of Mosaic as a
tool. Currently, both clients are in alpha release and both have
problems which prevent their being used for anything other than testing
for consistency of operation with the 'baseline' X-Windows client. A
feature that would allow page updating in real-time, as opposed to
simply regenerating a page image with new data, would be extremely
useful. Also needed is improved integration with helper applications,
which would support simple cutting and pasting between applications,
and improved security options, such as built-in digital signature
authentication.
[1] Micheal Robbins, "WAIS: A New Vision for Publishing," MicroTimes, 21 March 1994.
[2] Directory of Department of Defense Information Analysis Centers, DTIC, Alexandria, Virginia, August 1993.
[3] Richard A. Tong and Lee Appelbaum, "Conceptual Information Retrieval from Full-Text," in Proceedings RIAO-88 - User-Oriented Context-Based Text and Image Handling, MIT, Cambridge, Massachusetts, March 1988.
Shelley G. Ford
sford@dgis.dtic.dla.mil
Ms. Ford has over 15 years of experience in both
government and private industry developing information products and
services for professionals. She is currently the Chief of the
Information Analysis Branch, within the Research, Development, and
Acquisition Support Directorate of the Defense Technical Information
Center, where she serves as the project manager for OmniPort
development. Ms. Ford has a degree in English and a Master's degree in
Library and Information Science from the University of Maryland. In
addition to OmniPort, the Information Analysis Branch is creating
Mosaic Home Pages to assist Department of Defense officials in locating
information and to disseminate selected information to the DoD
community.
Robert C. Stern
rstern@ads.com
Mr. Stern has 12 years of experience in the operation,
design, implementation, and management of software system development
activities in support of data processing and analysis primarily in the
medical and intelligence fields. Mr. Stern has been responsible for
the development and management of systems that have demonstrated
advanced technologies, such as knowledge-based processing, network
architectures, and natural language processing. His areas of research
are the automatic or semi-automatic expansion of knowledge bases and
the automatic extraction of database federations based on domain
knowledge. Mr. Stern received an MSCS degree from the University of
Texas-Arlington in 1983.