Overview of Practical Reasoning's Core of Discovery


Practical Reasoning's Core of Discovery is a content mining system. It assists users in

collections of electronic content in all forms, including

The Core of Discovery provides

By document we mean any information-bearing message in recorded (especially, but not necessarily, electronic) form (cf. Elaine Svenonius, The Intellectual Foundation of Information Organization, MIT Press, 2000, p. 34). Reports, press releases, memoranda, e-mail messages, correspondence, and digitized images, movies, or sounds, among much else, are typically documents of interest for content mining. Further, while elemental documents are our basic unit of focus, documents may be composed of other documents, as a book is composed of chapters or a report is composed of text and images.

Content mining is directed at collections of documents. Users need content mining tools when collections contain valuable information and are too extensive to examine otherwise. A dozen memoranda can easily be inspected manually for needed information. One simply reads them. A collection of several thousand or more documents cannot be read by one person in a timely fashion. Collections of hundreds of thousands of documents cannot be read through by a team in a timely fashion. (And much larger collections are very common.) In these kinds of situations, we need content mining tools.

Elemental, or primitive, terms are typically words (dictionary entries). They may also be data elements (such as dates, numbers, or even BLOBs) or other information entities (such as markup of various sorts). We might think of documents as either atomic (not composed of any other documents) or molecular (composed of other documents). Terms, then, would be the subatomic particles from which atomic documents are built. Terms are the smallest units of meaning in a content collection and we use them in indexing the documents. Terms may also be composite. A phrase is a term that is a composite of primitive terms. A Boolean search term is a term that is a Boolean composite (And, Or, Not) of primitive terms and Boolean expressions.

A document may have metadata associated with it, that is, structured information about the document. Metadata appear in many forms, including records in relational database management systems (RDBMS), the header sections of HTML or XML documents, and RDF documents. Collection owners often develop catalogs that describe the contents of a collection. These catalogs are bodies of metadata. Content mining can be greatly facilitated if appropriate metadata are at hand.

The Core of Discovery's set of retrieval services subsumes and greatly extends the standard concepts in content searching and retrieval. So innovative is the Core of Discovery that a new framework—the Retrieval Services framework—is needed for thinking about information retrieval and content management.

The Retrieval Services Framework

A retrieval service in a content mining system is the support for a particular kind of query that the system can pose. A number of retrieval services are standardly deployed in content mining systems. The Core of Discovery provides these and introduces new ones.

Retrieval Service: documents–containing–term queries

With the documents–containing–term retrieval service, users provide a term to the content mining system, and receive back a list of documents containing the term. This is the most basic retrieval service and it is standardly found in information retrieval and other kinds of content mining systems.

When the given term is primitive (a single word), this is known as keyword search. If the given term is a Boolean combination (And, Or, Not) of primitive terms, this is known as Boolean operator search. There are a number of additional variations, all counting as documents–containing–term searches in the Retrieval Services framework. They include:

Retrieval services of the documents–containing–term type are the oldest and best established. They are the basis for full–text retrieval systems, and in any serious content mining system they are indispensable. They are also known to be insufficiently powerful for anything but small collections (fewer than one or two thousand documents). In particular, since documents–containing–term queries have no notion of ranking by relevance, they are of little use on large collections, where thousands of "matches" may occur. Thus, Web search engines do not rely heavily on this retrieval service.

Practical Reasoning's Core of Discovery product will, nonetheless, support all the standard types of documents–containing–term queries.
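The keyword and Boolean forms of this retrieval service can be illustrated with a small inverted index. What follows is a minimal sketch, not the Core of Discovery implementation: the function names, the tuple-based query notation, and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def evaluate(query, index, all_ids):
    """Evaluate a query: a string is a keyword; a tuple is (operator, operands...)."""
    if isinstance(query, str):
        return set(index.get(query.lower(), set()))
    op, *args = query
    results = [evaluate(arg, index, all_ids) for arg in args]
    if op == "and":
        return set.intersection(*results)
    if op == "or":
        return set.union(*results)
    if op == "not":
        return all_ids - results[0]
    raise ValueError(f"unknown operator: {op}")

# Hypothetical three-document collection.
docs = {
    1: "organizational memory in large firms",
    2: "corporate memory and its loss",
    3: "quarterly sales report",
}
index = build_inverted_index(docs)
all_ids = set(docs)

keyword_hits = evaluate("memory", index, all_ids)
boolean_hits = evaluate(("and", "memory", ("not", "corporate")), index, all_ids)
print(sorted(keyword_hits), sorted(boolean_hits))
```

Note that the result is an unordered set of matching documents. Nothing in it says which match is the better one, which is the limitation on large collections discussed above.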

Retrieval Service: documents–matching–metadata queries

Documents in a collection will often have metadata (records about the documents) associated with them. Library catalogs, museum finding aids, HTML and XML META tags, and manually coded keywords for scholarly abstracts are all examples of commonly occurring metadata. These metadata are often easy to exploit for retrieval purposes, because they can be organized fairly easily with relational database management systems.

As with the documents–containing–term retrieval service, documents–matching–metadata queries have long been available and are well established. Again, they are indispensable in any serious content mining system. And like documents–containing–term queries, they are known to be insufficiently powerful for anything but small collections (fewer than one or two thousand documents). Further, the cost of acquiring metadata often militates against their use.

Practical Reasoning's Core of Discovery product will, nonetheless, support documents–matching–metadata queries.
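Because metadata fit naturally into relational tables, a documents–matching–metadata query amounts to an ordinary SQL selection. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The table name, column names, and sample records are hypothetical and are not taken from the product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE doc_metadata ("
    "doc_id INTEGER PRIMARY KEY, author TEXT, doc_type TEXT, year INTEGER)"
)
# Hypothetical catalog records describing three documents.
conn.executemany(
    "INSERT INTO doc_metadata VALUES (?, ?, ?, ?)",
    [
        (1, "Smith", "memorandum", 1998),
        (2, "Jones", "press release", 1999),
        (3, "Smith", "report", 1999),
    ],
)

def documents_matching_metadata(conn, author, year):
    """Return ids of documents whose metadata match the given author and year."""
    rows = conn.execute(
        "SELECT doc_id FROM doc_metadata "
        "WHERE author = ? AND year = ? ORDER BY doc_id",
        (author, year),
    )
    return [row[0] for row in rows]

matches = documents_matching_metadata(conn, "Smith", 1999)
print(matches)
```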

Retrieval Service: documents–associatedWith–term queries

Relevance is the type of association (of a document with a search term) having the greatest commercial import. If, given a term, a set of relevant documents can be identified, each with a relevance score, then the return set can be sorted by relevance and the user directed to the (putatively) most relevant documents. For large collections, such as are found on the Internet and in organizational repositories, this is an essential feature. All the familiar Internet search engines support documents–associated(byRelevance)With–term queries.

There is a problem, however: there is no algorithm that by general assent accurately calculates actual relevance. Moreover, it is well known that the relevance rankings produced by different search engines agree very little among themselves.

The Core of Discovery, uniquely, supports documents–associated(byRelevance)With–term queries using the DCB representation. A remarkable fact about DCB, and a great strength, is that, partly in consequence of being an associational indexing scheme, it can find and rank as highly relevant documents that do not contain the given search term (or term complex). Experimental testing with human subjects at The Wharton School has provided good prima facie evidence that DCB produces accurate (and superior) rankings on difficult collections.
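To make concrete how an associational scheme can rank a document that lacks the search term, here is a deliberately simple stand-in. It is not DCB, whose details are not given here. It scores each document by how often its terms co-occur with the query term elsewhere in the collection. The scoring rule, function names, and sample collection are all assumptions.

```python
from collections import defaultdict
from itertools import product

def cooccurrence(docs):
    """For each ordered pair of terms, count the documents in which both appear."""
    assoc = defaultdict(lambda: defaultdict(int))
    for text in docs.values():
        terms = set(text.lower().split())
        for t, u in product(terms, repeat=2):
            assoc[t][u] += 1
    return assoc

def rank_by_association(term, docs, assoc):
    """Rank documents by the average co-occurrence of their terms with `term`."""
    related = assoc.get(term, {})
    scored = []
    for doc_id, text in docs.items():
        terms = set(text.lower().split())
        if not terms:
            continue
        score = sum(related.get(u, 0) for u in terms) / len(terms)
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties broken by document id.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical collection: document 2 never mentions "corporate".
docs = {
    1: "corporate memory loss",
    2: "organizational memory loss",
    3: "organizational learning",
    4: "quarterly sales figures",
}
assoc = cooccurrence(docs)
ranking = rank_by_association("corporate", docs, assoc)
print(ranking)
```

In this toy collection, document 2 receives a positive score for 'corporate' because it shares 'memory' and 'loss' with a document that does contain the term. A production scheme would need a far more careful weighting than this raw count.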

A fundamental barrier to the commercial use of DCB (as well as a similar representation, known as Latent Semantic Indexing) has always been its computational cost. Professor Kimbrough has invented, implemented, and convincingly tested the Variable Approximation algorithm for matrix multiplication, which is used by the Core of Discovery to create DCB-based indexes. In addition, Professor Kimbrough has invented an algorithm for Incremental Update (adding and deleting indexed documents) for these indexes, thereby also greatly increasing the applicable scope of the DCB representation. These inventions are currently in the patent process at the University of Pennsylvania, and would be licensed to the venture here envisioned. They would give the venture a unique competitive advantage.

Retrieval Service: documents–associatedWith–document queries

The query form is now: Given a document (or complex of documents), find the set of associated documents. Here, the kind of association of most interest is similarity (or dissimilarity) of content or subject matter. Sometimes these are called "more like this/these" or "not like this/these" queries. More grandiosely, the given document is simply a multiword query and the retrieval service is marketed as a "natural language" or "content agent" feature. When the given document is from the collection under investigation, this kind of retrieval service is useful for supporting relevance feedback searches.

The Core of Discovery provides a uniquely powerful version of this retrieval service. Using a variant of the DCB representation, and the Variable Approximation algorithm, we are able to rank by similarity of content to a given document (including a multiword query) every document in a collection. (The most dissimilar documents are those furthest down in the ranking.) This ranking, moreover, is associational so that, e.g., documents with different word patterns might well be closely associated, if they differ in many words that are similar in meaning.
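A conventional baseline for "more like this" queries is cosine similarity between bag-of-words vectors, sketched below. This is an assumption-laden illustration and not the DCB variant described above: plain cosine similarity only rewards shared words, so it lacks the associational property. The function names and sample texts are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counter objects)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def more_like_this(query_text, docs):
    """Rank every document in the collection by similarity to the query text."""
    query_vec = Counter(query_text.lower().split())
    scored = [
        (doc_id, cosine(query_vec, Counter(text.lower().split())))
        for doc_id, text in docs.items()
    ]
    # Most similar first; the most dissimilar documents fall to the bottom.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

docs = {
    1: "organizational memory in large firms",
    2: "corporate memory and its loss",
    3: "quarterly sales report",
}
ranking = more_like_this("loss of corporate memory", docs)
print([doc_id for doc_id, _ in ranking])
```

Every document receives a position in the ranking, including those with a similarity of zero, which mirrors the point that the most dissimilar documents are simply those furthest down the list.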

Retrieval Service: terms–associatedWith–term queries

Given a particular term appearing in a collection, how are the other terms in the collection associated with it?

This kind of retrieval service is useful when the search term a user has in mind is present in the collection but is not the primary term for the concept. For example, the user may be interested in documents on problems with how firms remember things. The user may have in mind the search term 'corporate memory', but the document collection may contain this expression only rarely, compared to the frequent use of 'organizational memory'. A terms–associatedWith–term query on 'corporate' or 'memory' will likely reveal a strong association with 'organizational', cluing the user to try a documents–associatedWith–term query on 'organizational memory'.

The Core of Discovery supports this retrieval service. We are not aware of any other product that does so. (Some Internet search engines do use thesauruses to expand users' queries and allow users to "polish" their queries by directing the engine to include or exclude certain expansion terms.)
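The 'corporate memory' example can be mimicked with simple document-level co-occurrence counts. The sketch below is only an approximation of the idea under stated assumptions: it is not the product's association measure, and the function name and sample collection are invented for illustration.

```python
from collections import Counter

def terms_associated_with_term(term, docs):
    """Rank other terms by the number of documents they share with `term`."""
    counts = Counter()
    for text in docs:
        terms = set(text.lower().split())
        if term in terms:
            counts.update(terms - {term})
    # Strongest association first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))

# Hypothetical collection in which 'organizational memory' is the common phrase.
docs = [
    "organizational memory systems",
    "organizational memory loss",
    "organizational memory research",
    "corporate memory loss",
    "quarterly sales figures",
]
associated = terms_associated_with_term("memory", docs)
print(associated[:3])
```

Here a query on 'memory' surfaces 'organizational' as its strongest associate, which is the cue the user needs to reformulate the search.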

Retrieval Service: terms–associatedWith–document queries

Given a particular document appearing in a collection, what (given the other documents in the collection) are the terms associated with it?

Of course, the terms in a document are associated with it. So are other terms, however. In our previous example, 'organizational' is associated with 'corporate' because both are associated with (co-occur with) 'memory'. A terms-associatedWith-document retrieval service can reveal such important information. Given a document, it can return a ranked list of terms associated with the document and not appearing in it. Also, it can return a ranked list of all terms associated with the document. Quite possibly, a term appearing in a document will be more weakly associated with the document than a term not appearing in the document. This is useful information for serious searchers.

The Core of Discovery supports this retrieval service. We are not aware of any other product that does so.
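As a rough illustration, the association of a term with a document can be approximated by summing the term's co-occurrences with each of the document's own terms across the collection. This scoring rule is an assumption chosen for simplicity, not the method used by the Core of Discovery, and the names and sample texts are hypothetical.

```python
from collections import Counter

def terms_associated_with_document(doc_index, docs):
    """Return (all associated terms, associated terms absent from the document)."""
    doc_sets = [set(text.lower().split()) for text in docs]
    target = doc_sets[doc_index]
    scores = Counter()
    for terms in doc_sets:
        shared = terms & target
        for u in terms:
            # Each target term (other than u itself) found in this document
            # counts as one co-occurrence with u.
            scores[u] += len(shared - {u})
    ranked = sorted(
        ((t, s) for t, s in scores.items() if s > 0),
        key=lambda pair: (-pair[1], pair[0]),
    )
    absent = [(t, s) for t, s in ranked if t not in target]
    return ranked, absent

docs = [
    "corporate memory loss",
    "organizational memory systems",
    "organizational memory loss",
    "quarterly sales figures",
]
ranked, absent = terms_associated_with_document(0, docs)
print(ranked)
print(absent)
```

In this toy collection 'organizational', which does not appear in the first document, scores higher than 'corporate', which does. That is the situation described above, in which a term absent from a document is more strongly associated with it than a term present in it.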

Combinations of Retrieval Services

The Core of Discovery supports certain combinations of retrieval services. These are particularly useful in filtering retrieval sets. For example, a user might perform a relevance-ranked search (documents–associated(byRelevance)With–term query) and then filter the retrieval set to see which of the returned documents do not have the query term (documents–containing–term query).

Many other combinations of services are possible and useful to the serious investigator.
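The filtering example given above (a relevance-ranked search followed by a containment check) can be sketched as a simple composition. The ranked list below is a hypothetical result with made-up scores, standing in for the output of a documents–associated(byRelevance)With–term query. The function names are likewise assumptions.

```python
def contains_term(text, term):
    """documents-containing-term check for a single document."""
    return term.lower() in text.lower().split()

def filter_without_term(ranked, docs, term):
    """Keep only the ranked documents that do NOT contain the query term."""
    return [
        (doc_id, score)
        for doc_id, score in ranked
        if not contains_term(docs[doc_id], term)
    ]

docs = {
    1: "corporate memory loss",
    2: "organizational memory loss",
}
# Hypothetical relevance-ranked result for the term 'corporate'.
ranked = [(1, 1.0), (2, 0.67)]
surprising = filter_without_term(ranked, docs, "corporate")
print(surprising)
```

The surviving documents are those judged relevant even though the query term is absent, which are often the most interesting finds for an investigator.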

Multidimensional, or Pattern-Oriented, Retrieval Services

Professor Kimbrough, in conjunction with his students, has developed and/or supervised two highly original retrieval concepts. The two concepts—originally known as Homer and MOTC—have in common that they are tools for exploring patterned associations between two or more variables of interest. In the case of Homer, the variables of interest stand for categories of documents.

The Homer concept was originated by Mr. Garett O. Dworman and is the subject of his Ph.D. thesis, which was supervised by Professor Kimbrough (Pattern-Oriented Access to Document Collections, University of Pennsylvania, 1999). As part of his thesis work, Dr. Dworman implemented prototypes of the Homer concept and conducted experimental investigations into its efficacy. He found, very convincingly, that naïve subjects using Homer could quickly and accurately discover interesting and high-quality generalizations in collections of semistructured documents (specifically, emergency room medical records in SGML).

As a product, Homer would consist of two parts: a user interface and a back-end indexing and retrieval service. The back-end indexing and retrieval relies on a variant of the DCB representation, so the Variable Approximation algorithm and code will be valuable in practice for doing the indexing. The user interface has been redeveloped from scratch by Dr. Dworman since his graduation in December 1999.

The MOTC concept originated in a seminar conducted by Professor Kimbrough and led to a journal publication authored by the seminar's participants. A prototype implementation was created by Mr. Tate Shafer and revised subsequently. Practical Reasoning, Inc. has since purchased that prototype (and attendant rights) from Mr. Shafer.

MOTC combines data visualization with user interaction and rigorous statistical methods. Using MOTC, content explorers are able to conjecture relationships among several variables at once (the prototype is limited to 8). MOTC presents a visualization indicating the strength and quality of the conjectured relationships and MOTC uses an established statistical technique to measure rigorously the strength and quality of the relationships. Working with MOTC, users do "hypothesis hunting."

Both MOTC and Homer can be implemented in an "industrial strength" fashion within 18-24 months. For the serious content investigator, neither MOTC nor Homer has a close substitute.

Product Forms

The Core of Discovery has four main software modules:
  1. Collection acquisition
    This module is used for setting up a collection for subsequent retrieval and indexing. Its functions include document tokenizing, application of stop lists and thesauri, stemming, and system file and database setup.
  2. Collection indexing
    This module accepts files and databases produced by the pre-indexing processing done in the collection acquisition module, and creates index files. The Variable Approximation and the Incremental Update algorithms are used by the collection indexing module.
  3. Retrieval services
    This module accepts retrieval requests and fulfills them by accessing the index files created by the collection indexing module, as well as various files created by the collection acquisition module.
  4. User interface
    This module provides the (Web-based) interface between the user and the retrieval services module.
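The pre-indexing steps named for the collection acquisition module (tokenizing, stop lists, thesauri, stemming) might be chained roughly as follows. This is a minimal sketch under stated assumptions: the stop list, the thesaurus mapping, and the crude suffix-stripping stemmer are all hypothetical stand-ins, and a real system would likely use an established stemmer.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "are", "to"}
THESAURUS = {"corporate": "organizational"}  # hypothetical preferred-term mapping
SUFFIXES = ("ing", "ed", "es", "s")  # crude stemmer, for illustration only

def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def acquire(text):
    """Tokenize, drop stop words, map through the thesaurus, then stem."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    tokens = [THESAURUS.get(t, t) for t in tokens]
    return [stem(t) for t in tokens]

terms = acquire("Searching the indexed reports of corporate memory")
print(terms)
```

The resulting term list is the kind of input the collection indexing module would consume.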
The Core of Discovery can be delivered to customers in any of several ways:

Product Life Cycle

There is much flexibility evident in the product form of the Core of Discovery (see above). This naturally suggests a number of customer migration options. A "fly before you buy" migration schedule should be particularly attractive to customers. The modular characteristics of the Core of Discovery facilitate this sort of incremental commitment.

Certain core elements of the Core of Discovery software are unlikely to provide opportunities for substantial enhancement. We foresee at best only incremental improvement to the Variable Approximation code, or to the Incremental Update code, or for such basic tasks as document tokenization. There remain, however, ample opportunities for continued improvement of the product.

Finally, it should be emphasized that content mining is, and will remain, an empirical discipline, for which theory is at best indirectly relevant. Thus, continued experimentation and conceptual innovation will be required to keep any content mining product competitive at the world-class level.






| Steven O. Kimbrough | kimbrough@practicalreasoning.com | 2000-6-20 | cod-overview.htm | Revision 1.4 |

/* $Header: cod-overview.html,v 1.2 2000/10/16 14:47:50 sok Exp $ */