Practical Reasoning's Core of Discovery is a content mining system. It assists users in
By document we mean any information-bearing message in recorded (especially, but not necessarily, electronic) form (cf., page 34, Elaine Svenonius, The Intellectual Foundation of Information Organization, MIT Press, 2000). Reports, press releases, memoranda, e-mail messages, correspondence, digitized images, movies or sounds, among much else, are typically documents of interest for content mining. Further, while elemental documents are our basic unit of focus, documents may be composed from other documents, as a book is composed of chapters, or a report is composed of text and images.
Content mining is directed at collections of documents. Users need content mining tools when collections contain valuable information and are too extensive to examine otherwise. A dozen memoranda can easily be inspected manually for needed information. One simply reads them. A collection of several thousand or more documents cannot be read by one person in a timely fashion. Collections of hundreds of thousands of documents cannot be read through by a team in a timely fashion. (And much larger collections are very common.) In these kinds of situations, we need content mining tools.
Elemental, or primitive, terms are typically words (dictionary entries). They may also be data elements (such as dates, numbers, or even BLOBs) or other information entities (such as markup of various sorts). We might think of documents as either atomic (not composed of any other documents) or molecular (composed of other documents). Terms, then, would be the subatomic particles from which atomic documents are built. Terms are the smallest units of meaning in a content collection and we use them in indexing the documents. Terms may also be composite. A phrase is a term that is a composite of primitive terms. A Boolean search term is a term that is a Boolean composite (And, Or, Not) of primitive terms and Boolean expressions.
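The distinction between primitive and composite terms can be made concrete with a small sketch. The class names (Word, Phrase, And, Or, Not), the whitespace tokenizer, and the matches function below are hypothetical names chosen for illustration; they are assumptions, not the Core of Discovery's actual data model.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Hypothetical term types, named here for illustration only.
@dataclass(frozen=True)
class Word:
    text: str  # a primitive term, stored in lowercase


@dataclass(frozen=True)
class Phrase:
    words: Tuple[str, ...]  # a composite of consecutive primitive terms


@dataclass(frozen=True)
class And:
    left: "Term"
    right: "Term"


@dataclass(frozen=True)
class Or:
    left: "Term"
    right: "Term"


@dataclass(frozen=True)
class Not:
    operand: "Term"


Term = Union[Word, Phrase, And, Or, Not]


def tokenize(text):
    """Assumed tokenizer: lowercase the text and split on whitespace."""
    return text.lower().split()


def matches(term, tokens):
    """Return True if the tokenized document satisfies the term."""
    if isinstance(term, Word):
        return term.text in tokens
    if isinstance(term, Phrase):
        n = len(term.words)
        return any(tuple(tokens[i:i + n]) == term.words
                   for i in range(len(tokens) - n + 1))
    if isinstance(term, And):
        return matches(term.left, tokens) and matches(term.right, tokens)
    if isinstance(term, Or):
        return matches(term.left, tokens) or matches(term.right, tokens)
    if isinstance(term, Not):
        return not matches(term.operand, tokens)
    raise TypeError(f"unknown term type: {term!r}")
```

Because Boolean terms may contain other Boolean expressions, the evaluation is recursive: And(Word("memory"), Not(Word("corporate"))) is itself a term and can be nested inside a larger one.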
A document may have metadata associated with it, that is, structured information about the document. Metadata appear in many forms, including: in relational database management systems (RDBMS), in the header sections of HTML or XML documents, and in RDF documents. Oftentimes, document collection owners will develop catalogs that describe the contents of the collection. These catalogs are bodies of metadata. Content mining can be greatly facilitated if appropriate metadata are to hand.
The Core of Discovery's set of retrieval services subsumes and greatly extends the standard concepts in content searching and retrieval. So innovative is the Core of Discovery that a new framework, the Retrieval Services framework, is needed for thinking about information retrieval and content management.
A retrieval service in a content mining system is the support for a particular kind of query that the system can pose. A number of retrieval services are standardly deployed in content mining systems. The Core of Discovery provides these and introduces new ones.
When the given term is primitive (a single word), this is known as keyword search. If the given term is a Boolean combination (And, Or, Not) of primitive terms, this is known as Boolean operator search. There are a number of additional variations, all counting as documents-containing-term searches in the Retrieval Services framework. They include:
A wildcard term such as Johns*on would lead to documents being returned if they contained "Johnson" or "Johnston".
With stemming, stor is a stem for "store", "story", "Storey", "stories" and so on.
A search tolerant of misspelling, given whimpleton, might return documents containing reference to a famous tennis championship.
Practical Reasoning's Core of Discovery product will, nonetheless, support all the standard types of documents-containing-term queries.
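The standard variations on term matching can be sketched with Python's standard library. This is a minimal illustration under stated assumptions: shell-style wildcards (where * matches any run of characters, including none), stemming approximated by simple prefix matching, and misspelling tolerance approximated by difflib's similarity ratio. The function names and the sample vocabulary are hypothetical, and none of this is claimed to be how the Core of Discovery implements these searches.

```python
from difflib import get_close_matches
from fnmatch import fnmatchcase


def wildcard_hits(pattern, vocabulary):
    """Terms matching a shell-style wildcard pattern (case-insensitive)."""
    pattern = pattern.lower()
    return sorted(t for t in vocabulary if fnmatchcase(t.lower(), pattern))


def stem_hits(stem, vocabulary):
    """Terms beginning with the given stem (a crude stand-in for stemming)."""
    stem = stem.lower()
    return sorted(t for t in vocabulary if t.lower().startswith(stem))


def fuzzy_hits(query, vocabulary, cutoff=0.6):
    """Terms whose spelling is close to the query, best match first."""
    candidates = [t.lower() for t in vocabulary]
    return get_close_matches(query.lower(), candidates, n=3, cutoff=cutoff)


# Hypothetical index vocabulary, for illustration only.
vocabulary = ["johnson", "johnston", "jonson", "store", "story",
              "storey", "stories", "history", "wimbledon"]

print(wildcard_hits("johns*on", vocabulary))
print(stem_hits("stor", vocabulary))
print(fuzzy_hits("whimpleton", vocabulary))
```

In each case the matched terms would then be looked up in the index to retrieve the documents containing them.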
Documents in a collection will often have metadata (records about the documents) associated with them. Library catalogs, museum finding aids, HTML and XML META tags, and manually coded keywords for scholarly abstracts are all examples of commonly occurring metadata. These metadata are often easy to exploit for retrieval purposes, because they can be organized fairly easily with relational database management systems.
As in the case of the documents-containing-term retrieval services, documents-matching-metadata queries have long been available and are well established. Again, they are indispensable in any serious content mining system. And like documents-containing-term queries, they are also known to be insufficiently powerful for other than small collections (fewer than one or two thousand documents). Further, the cost of acquiring metadata often militates against its use.
Practical Reasoning's Core of Discovery product will, nonetheless, support documents-matching-metadata queries.
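Since metadata are typically organized in a relational database, a metadata query amounts to a filtered SELECT over a catalog table. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The catalog schema (doc_id, author, year, doc_type) and the function names are assumptions made for illustration, not a description of any actual catalog.

```python
import sqlite3


def build_catalog(records):
    """Create an in-memory catalog table and load (doc_id, author, year, doc_type) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE catalog ("
        "doc_id TEXT PRIMARY KEY, author TEXT, year INTEGER, doc_type TEXT)"
    )
    conn.executemany("INSERT INTO catalog VALUES (?, ?, ?, ?)", records)
    return conn


def documents_matching_metadata(conn, author=None, year_from=None, doc_type=None):
    """Return the ids of documents whose metadata satisfy every supplied filter."""
    clauses, params = [], []
    if author is not None:
        clauses.append("author = ?")
        params.append(author)
    if year_from is not None:
        clauses.append("year >= ?")
        params.append(year_from)
    if doc_type is not None:
        clauses.append("doc_type = ?")
        params.append(doc_type)
    where = " AND ".join(clauses) if clauses else "1 = 1"
    rows = conn.execute(
        f"SELECT doc_id FROM catalog WHERE {where} ORDER BY doc_id", params
    )
    return [row[0] for row in rows]


# Hypothetical catalog entries, for illustration only.
catalog = build_catalog([
    ("memo-001", "Smith", 1998, "memorandum"),
    ("rpt-014", "Jones", 1999, "report"),
    ("memo-017", "Smith", 2000, "memorandum"),
])
print(documents_matching_metadata(catalog, author="Smith", year_from=1999))
```

Filter values are passed as query parameters rather than interpolated into the SQL text, so user-supplied values cannot alter the query.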
Relevance is the type of association (of a document with a search term) having the greatest commercial import. If, given a term, a set of relevant documents can be identified, each with a score on relevance, then the return set can be sorted by relevance and the user directed to the (putatively) most relevant documents. For large collections, such as are found on the Internet and in organizational repositories, this is an essential feature. All the familiar Internet search engines support documents-associated(byRelevance)With-term queries.
There is a problem, however: there is no algorithm that by general assent accurately calculates actual relevance. Moreover, it is well known that the relevance rankings produced by different search engines agree very little among themselves.
The Core of Discovery, uniquely, supports documents-associated(byRelevance)With-term queries using the DCB representation. A remarkable fact about DCB, and a great strength, is that, partly in consequence of being an associational indexing scheme, it can find and rank as highly relevant documents that do not contain the given search term (or term complex). Experimental testing with human subjects at The Wharton School has provided good prima facie evidence that DCB produces accurate (and superior) rankings on difficult collections.
A fundamental barrier to the commercial use of DCB (as well as a similar representation, known as Latent Semantic Indexing) has always been its computational cost. Professor Kimbrough has invented, implemented, and convincingly tested the Variable Approximation algorithm for matrix multiplication, which is used by the Core of Discovery to create DCB-based indexes. In addition, Professor Kimbrough has invented an algorithm for Incremental Update (adding and deleting indexed documents) for these indexes, thereby also greatly increasing the applicable scope of the DCB representation. These inventions are currently in the patent process by the University of Pennsylvania, and would be licensed to the venture here envisioned. They would give the venture a unique competitive advantage.
The query form is now: Given a document (or complex of documents), find the set of associated documents. Here, the kind of association of most interest is similarity (or dissimilarity) of content or subject matter. Sometimes these are called "more like this/these" or "not like this/these" queries. More grandiosely, the given document is simply a multiword query and the retrieval service is marketed as a "natural language" or "content agent" feature. When the given document is from the collection under investigation, this kind of retrieval service is useful for supporting relevance feedback searches.
The Core of Discovery provides a uniquely powerful version of this retrieval service. Using a variant of the DCB representation, and the Variable Approximation algorithm, we are able to rank by similarity of content to a given document (including a multiword query) every document in a collection. (The most dissimilar documents are those furthest down in the ranking.) This ranking, moreover, is associational so that, e.g., documents with different word patterns might well be closely associated, if they differ in many words that are similar in meaning.
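The basic shape of a "more like this" query, ranking every document in a collection against a given document, can be sketched as follows. Note the substitution: this sketch uses plain bag-of-words cosine similarity, not the DCB representation or the Variable Approximation algorithm, so it scores documents only on shared words and does not capture the associational ranking described above. The function names and sample collection are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Assumed representation: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_similarity(query_text, collection):
    """Rank every document id, most similar first; ties break on document id."""
    query = vectorize(query_text)
    scored = [(cosine(query, vectorize(text)), doc_id)
              for doc_id, text in collection.items()]
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical collection, for illustration only.
collection = {
    "a": "organizational memory in firms",
    "b": "tennis championship results",
    "c": "memory loss in firms and organizations",
}
print(rank_by_similarity("memory in firms", collection))
```

Every document receives a position in the ranking, so the most dissimilar documents can be read from the bottom of the list.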
This kind of retrieval service is useful when, for example, the search term a user has in mind is present in the collection, but is not the primary term associated with the concept the user has in mind. For example, the user may be interested in documents on problems with how firms remember things. The user may have in mind the search term 'corporate memory', but the document collection may contain this expression only rarely, compared to the frequent use of 'organizational memory'. A terms-associatedWith-term query on 'corporate' or 'memory' will likely reveal a strong association with 'organizational', cluing the user to try a documents-associatedWith-term query on 'organizational memory'.
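One simple way to illustrate a term-to-term association query is to compare the co-occurrence profiles of terms: two terms are associated if they tend to appear in documents alongside the same other terms. The sketch below is an assumption-laden stand-in, not the DCB representation; the function names and the four sample documents are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations


def cooccurrence_profiles(documents):
    """For each term, count the documents it shares with every other term."""
    profiles = defaultdict(Counter)
    for text in documents:
        terms = set(text.lower().split())
        for term, other in permutations(terms, 2):
            profiles[term][other] += 1
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def terms_associated_with_term(term, documents):
    """Rank other terms by the similarity of their co-occurrence profiles."""
    term = term.lower()
    profiles = cooccurrence_profiles(documents)
    target = profiles.get(term, Counter())
    scored = [(cosine(target, profile), other)
              for other, profile in profiles.items() if other != term]
    ranked = sorted(scored, key=lambda p: (-p[0], p[1]))
    return [(other, round(score, 3)) for score, other in ranked if score > 0]


# Hypothetical collection, for illustration only.
documents = [
    "corporate memory loss",
    "organizational memory loss",
    "organizational memory systems",
    "tennis championship results",
]
# 'organizational' ranks first although it never appears alongside 'corporate'.
print(terms_associated_with_term("corporate", documents))
```

In this toy collection 'organizational' and 'corporate' share no document, yet both keep company with 'memory' and 'loss', which is enough for the association to surface.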
The Core of Discovery supports this retrieval service. We are not aware of any other product that does so. (Some Internet search engines do use thesauruses to expand users' queries and allow users to "polish" their queries by directing the engine to include or exclude certain expansion terms.)
Of course, the terms in a document are associated with it. So are other terms, however. In our previous example, 'organizational' is associated with 'corporate' because both are associated with (co-occur with) 'memory'. A terms-associatedWith-document retrieval service can reveal such important information. Given a document, it can return a ranked list of terms associated with the document and not appearing in it. Also, it can return a ranked list of all terms associated with the document. Quite possibly, a term appearing in a document will be more weakly associated with the document than a term not appearing in the document. This is useful information for serious searchers.
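A terms-associatedWith-document query can be illustrated by scoring each vocabulary term by how often it co-occurs, across the collection, with the terms of the given document. This is a deliberately simple co-occurrence sketch, not the DCB representation; the scoring rule, function names, and sample collection are all assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_counts(collection):
    """Count, for each ordered pair of distinct terms, the documents containing both."""
    counts = Counter()
    for text in collection:
        terms = set(text.lower().split())
        for pair in permutations(terms, 2):
            counts[pair] += 1
    return counts


def terms_associated_with_document(document, collection, include_present=True):
    """Rank terms by total co-occurrence with the document's terms."""
    counts = cooccurrence_counts(collection)
    doc_terms = set(document.lower().split())
    vocabulary = {t for text in collection for t in text.lower().split()}
    scores = {}
    for candidate in vocabulary:
        if not include_present and candidate in doc_terms:
            continue
        score = sum(counts[(t, candidate)] for t in doc_terms if t != candidate)
        if score > 0:
            scores[candidate] = score
    return sorted(scores.items(), key=lambda p: (-p[1], p[0]))


# Hypothetical collection, for illustration only.
collection = [
    "corporate memory loss",
    "organizational memory loss",
    "organizational memory systems",
    "tennis championship results",
]
print(terms_associated_with_document("corporate memory loss", collection,
                                     include_present=False))
print(terms_associated_with_document("corporate memory loss", collection))
```

In this toy collection, 'organizational' does not appear in the given document yet scores higher than 'corporate', which does, mirroring the observation that a term present in a document may be more weakly associated with it than an absent one.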
The Core of Discovery supports this retrieval service. We are not aware of any other product that does so.
Many other combinations of services are possible and useful to the serious investigator.
Professor Kimbrough, in conjunction with his students, has developed and/or supervised two highly original retrieval concepts. The two concepts, originally known as Homer and MOTC, have in common that they are tools for exploring patterned associations between two or more variables of interest. In the case of Homer, the variables of interest stand for categories of documents.
The Homer concept was originated by Mr. Garett O. Dworman and is the subject of his Ph.D. thesis, which was supervised by Professor Kimbrough (Pattern-Oriented Access to Document Collections, University of Pennsylvania, 1999). As part of his thesis work, Dr. Dworman implemented prototypes of the Homer concept and conducted experimental investigations into its efficacy. He found, very convincingly, that naïve subjects using Homer could quickly and accurately discover interesting and high-quality generalizations in collections of semistructured documents (specifically, emergency room medical records in SGML).
As a product, Homer would consist of two parts: a user interface and a back-end indexing and retrieval service. The back-end indexing and retrieval service relies on a variant of the DCB representation, so the Variable Approximation algorithm and code will be valuable in practice for doing the indexing. The user interface has been redeveloped from scratch by Dr. Dworman since his graduation in December 1999.
The MOTC concept originated in a seminar conducted by Professor Kimbrough and led to a journal publication authored by the seminar's participants. A prototype implementation was created by Mr. Tate Shafer and revised subsequently. Practical Reasoning, Inc. has since purchased that prototype (and attendant rights) from Mr. Shafer.
MOTC combines data visualization with user interaction and rigorous statistical methods. Using MOTC, content explorers are able to conjecture relationships among several variables at once (the prototype is limited to 8). MOTC presents a visualization indicating the strength and quality of the conjectured relationships, and uses an established statistical technique to measure rigorously the strength and quality of the relationships. Working with MOTC, users do "hypothesis hunting."
Both MOTC and Homer can be implemented in an "industrial strength" fashion within 18-24 months. For the serious content investigator, neither MOTC nor Homer has a close substitute.
Certain core elements of the Core of Discovery software are unlikely to provide opportunities for substantial enhancement. We foresee at best only incremental improvement to the Variable Approximation code, or to the Incremental Update code, or for such basic tasks as document tokenization. There remain, however, ample opportunities for continued improvement of the product.
In the nearer term, there are excellent opportunities to apply machine learning to the indexing process as well as to the provision of retrieval services. (MOTC was originally designed with this in mind.) Also, our main indexing algorithm (the Variable Approximation algorithm) is inherently parallelizable. Pursuing this, perhaps with Linux clusters, should allow us to scale up massively and at the same time economically.
Finally, it should be emphasized that content mining is, and will remain, an empirical discipline, for which theory is at best indirectly relevant. Thus, continued experimentation and conceptual innovation will be required to keep any content mining product competitive at the world-class level.