Core of Discovery Executive Briefing:

Capabilities of Practical Reasoning's Core of Discovery


Information & Knowledge Management: Example Problems

The Plato Problem

Imagine that you have access to a large document collection and are interested in obtaining information about Plato, the ancient Greek philosopher. You successfully retrieve all documents containing the word 'Plato'. When you examine these documents you find what you expected: many of them are indeed about the Greek philosopher; others are about various commercial products trading on the philosopher's name, but having nothing to do with him.

You are then surprised when a friend presents you with a completely different set of documents, also retrieved from the same large document collection. Nearly all of these new documents are very much about Plato the philosopher, yet none of them mention 'Plato' (or any close variant, such as the foreign spelling 'Platon').

How can this be? Your friend explains. 'Plato' was not the philosopher's original name. It was a nickname given to Aristocles by his wrestling coach ('personal trainer' in today's language) and it means 'chubby' or 'chubs'. Somehow it stuck. Many authors, however, have written about the philosopher using only his original name. These writings would not be retrieved by a keyword search using 'Plato'.

The Plato Problem is the problem of designing a computer system that helps in finding relevant and useful documents that do not contain the search term(s) you have in mind. The scope of this problem extends far beyond the world of classical scholarship. In fact, except in relatively trivial cases, it appears much more often than not. Here is a representative account of a search in a small (40,000 documents) collection by a leading expert in information retrieval. The area of application is litigation support.

Sometimes we followed a trail of linguistic creativity through the database. In searching for documents discussing "trap correction" (one of the key phrases), we discovered that relevant, unretrieved documents had discussed the same issue but referred to it as the "wire warp." Continuing our search, we found that in still other documents trap correction was referred to in a third and novel way: the "shunt correction system." Finally, we discovered the inventor of this system was a man named "Coxwell" which directed us to some documents he had authored, only he referred to the system as the "Roman circle method." Using the Roman circle method in a query directed us to still more relevant but unretrieved documents, but this was not the end either. Further searching revealed that the system had been tested in another city, and all documents germane to those tests referred to the system as the "air truck." At this point the search ended, having consumed over an entire 40-hour week of on-line searching, but there is no reason to believe that we had reached the end of the trail; we simply ran out of time.

The Raynaud Problem

You have a daughter who has been diagnosed with an unpleasant disease called Raynaud's Syndrome. You are told that there is no known effective treatment of the disease, and you quickly verify this by a thorough search of all the on-line medical research papers.

Concerned about your daughter, you wonder if, somehow, a solution has been found, but no one has "made the connection" with Raynaud's Syndrome. The Raynaud Problem is the problem of designing a computer system that will help you in following up on your hunches, specifically here and in general. Can you find a plausible hypothesis for a treatment in the existing literature?

The Raynaud Problem appears very often, although it is rarely attacked for lack of effective means. Investors seek to find trends or developments that will affect markets in surprising ways. This is a form of the Raynaud Problem. Litigators, and other investigators, look for evidence of links in an unknown (and surprising) causal chain. This, too, is a form of the Raynaud Problem. Many other forms naturally occur.

The Practice Problem

You are a hospital administrator and you would like to have information about how the practices of your emergency room doctors differ. Do they treat similar patients differently? If so, how are the treatments different? Do they see different kinds of patients? and so on. Unfortunately, the only electronic records you have of the patient visits are in document form. There is no database to mine, just a series of emergency room documents.

The Practice Problem is the problem of designing a computer system that will help you find and see meaningful patterns of behavior, based on information extracted from weakly-structured documents. These patterns of behavior can only emerge from the document collection as a whole. The information is not present in any small number of documents. It is implicit in the collection, not explicit in any document.

Like the Raynaud Problem, the Practice Problem appears quite often, although it too is rarely attacked for lack of effective means. You have a Practice Problem if (a) you are interested in observing change (or lack of change) across departments, industries, organizations, countries, regions, time periods, etc., and (b) you have to rely on a document collection (rather than a database) for your information.

The Anonymous Problem

You are presented with a text (a sample of speech, a document, or a portion of a document) whose authorship is unknown or in doubt. Can you identify the author, or must the text remain anonymous? Can the date or influencing circumstances of the text be identified?

There are many applications for which even a significant narrowing of possibilities is of value. Automated text classification of this kind will be useful to museums and archives, legal services, and intelligence gathering, both in government and commerce.

The Solutions

Practical Reasoning's Core of Discovery has highly original and powerful capabilities (feature sets) that can be used to attack the Plato, Raynaud, Practice, and Anonymous Problems.

Concept Searching

The Core of Discovery's Clark module is a concept retrieval tool. It is the right tool for anyone facing the Plato Problem. In traditional search engines, words are taken literally, without regard to their larger meanings and associations. We search for documents containing a "key word". Instead, we need to search for documents pertaining to the concept. We don't just want documents with "Plato" (the word) in them; we want documents about Plato (the ancient Greek philosopher, who was taught by Socrates and who taught Aristotle, etc.).

Users of course often do not have strong knowledge of the concepts corresponding to the search terms they wish to use. The Core of Discovery automatically finds concepts when it indexes collections. Search terms given to the Clark module are treated as concepts, not as key words. Consequently, a Clark search often finds and ranks highly documents that are very relevant to the given concept but that do not contain the actual word used to initiate the search. In our Plato example, Clark would likely give a high ranking to a document about the author of The Meno and a low ranking to a document about Plato's Retreat (a sex club in New York), even though the first document did not contain the word 'Plato'. (The reader might try such a query on the Web today.)

The Core of Discovery's concept retrieval algorithms can be used for concept searching on either words (search terms) or entire documents. Creative, iterative employment of this feature will be valuable for attacking Anonymous problems.

Empirical Thesaurus

The Core of Discovery's Empirical Thesaurus module is useful both in concept searching and in traditional key word searching. A user can specify any word appearing in a collection and the Empirical Thesaurus will return a report that shows the pattern of association between the user's word and the other words appearing in the collection. The returned pattern will usually be surprising because it will show associations (or lack of associations!) between words that were not anticipated by the user. In this way, for example, users might discover that the word they had in mind is not much used, but that a near-synonym is. They can then direct their searches using the newly found synonym.
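As a rough illustration of the kind of report described above, the following Python sketch ranks the other words in a small collection by how strongly they co-occur with a chosen word. It is not the Empirical Thesaurus algorithm, which is not described in this briefing; the Jaccard overlap measure, the function name association_report, and the toy documents are all assumptions made for illustration.

```python
def association_report(docs, word, top_n=5):
    """Rank other words by Jaccard overlap between the set of documents
    containing them and the set of documents containing `word`.

    Illustrative measure only; the real module's method is not public.
    """
    doc_sets = [set(d.lower().split()) for d in docs]

    # Map each word to the set of document indices in which it appears.
    postings = {}
    for i, words in enumerate(doc_sets):
        for w in words:
            postings.setdefault(w, set()).add(i)

    with_word = postings.get(word, set())
    if not with_word:
        return []

    scores = []
    for other, ids in postings.items():
        if other == word:
            continue
        jaccard = len(ids & with_word) / len(ids | with_word)
        scores.append((other, round(jaccard, 3)))

    # Strongest association first; ties broken alphabetically.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:top_n]


# Made-up toy collection, for illustration only.
docs = [
    "plato wrote the meno",
    "aristocles wrote the meno",
    "aristocles taught aristotle",
    "plato taught aristotle",
    "plato retreat club",
]
report = association_report(docs, "plato", top_n=10)
for other, score in report:
    print(f"{other:12s} {score}")
```

In this toy collection 'aristocles' scores zero against 'plato': the two words never share a document, which is exactly the kind of missing association a user would want to notice.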

Surprising Associations

The Core of Discovery's Swanson module is named in honor of Don Swanson, who did in fact have a daughter diagnosed with Raynaud's Syndrome. Through extremely diligent reading and information searches, he arrived at the hypothesis that fish oil could be used to treat Raynaud's Syndrome because it reduces blood viscosity.

Swanson found that many of the documents about Raynaud's mentioned blood viscosity. He then searched for a topic such that:

  1. the Raynaud's documents did not mention the topic,
  2. the documents about the topic did not mention Raynaud's, and
  3. the documents about the topic often did mention blood viscosity.

The topic in question turned out to be fish oil. Swanson had found a plausible hypothesis (after reading the documents located by this process). Subsequent clinical studies confirmed it. His hunch was right: the medical literature contained the information essential to providing a treatment for Raynaud's, but no one had "made the connection" until Swanson found it.
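The three conditions above can be sketched in a few lines of Python. This is a minimal illustration, not the Swanson module itself: it assumes each document has already been reduced to a set of terms, and the function name swanson_candidates, the ranking by share of linking documents, and the toy data are all hypothetical.

```python
def swanson_candidates(docs, start_topic, linking_term, candidate_topics):
    """Return (topic, share) pairs for candidate topics that never appear
    together with `start_topic` but do appear with `linking_term`.

    `docs` is a list of sets of terms (an assumed, simplified model).
    `share` is the fraction of the topic's documents that mention the
    linking term.
    """
    start_docs = [d for d in docs if start_topic in d]
    results = []
    for topic in candidate_topics:
        topic_docs = [d for d in docs if topic in d]
        if not topic_docs:
            continue
        # Conditions 1 and 2: no document mentions both the starting
        # topic and the candidate. In this set-of-terms model the two
        # conditions collapse into a single check.
        if any(topic in d for d in start_docs):
            continue
        # Condition 3: the candidate's documents mention the linking term.
        linked = sum(1 for d in topic_docs if linking_term in d)
        if linked > 0:
            results.append((topic, linked / len(topic_docs)))
    # Most strongly linked candidates first.
    return sorted(results, key=lambda pair: (-pair[1], pair[0]))


# Made-up toy collection, for illustration only.
docs = [
    {"raynaud", "blood viscosity"},
    {"raynaud", "blood viscosity", "vasoconstriction"},
    {"fish oil", "blood viscosity"},
    {"fish oil", "blood viscosity", "platelets"},
    {"aspirin", "raynaud"},
    {"vitamin c", "scurvy"},
]
candidates = ["fish oil", "aspirin", "vitamin c"]
hypotheses = swanson_candidates(docs, "raynaud", "blood viscosity", candidates)
print(hypotheses)
```

On this toy data only 'fish oil' survives: 'aspirin' already appears in a Raynaud's document, and 'vitamin c' never mentions blood viscosity. The surviving candidates are only contenders; a person still has to read the documents and judge whether the hypothesis is plausible.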

The Core of Discovery's Swanson module duplicates the algorithm Swanson implicitly followed in finding the fish oil hypothesis. Human judgment and follow-on investigation will always be needed to validate these found hypotheses, but finding "the contenders" in the first place is a major achievement. In addition, the Swanson module is an important tool for approaching Anonymous problems.

Patterns Across Categories

The Core of Discovery's Lewis module finds and displays variations in word patterns across categories of documents. As such, it is an ideal tool for anyone facing the Practice Problem or the Anonymous Problem.

The Core of Discovery provides the basic indexing used by the Lewis module. Documents in collections are categorized (by time period, by gender of subject, by organization, by subject matter, or in any other way desired by the user) and indexed. Once this is done, the Lewis module can be used to present and explore word pattern differences among the categories. The display of this information is striking even to novices and has proved to be highly revealing of information in the collection. Again, this information is implicit in the collection, not explicit in any small number of documents. Lewis finds implicit information.
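To make the idea of word pattern differences among categories concrete, here is a small Python sketch. It is an assumption-laden illustration, not the Lewis module: it measures, for each category, the fraction of documents containing each word, then lists the words whose rates differ most between two categories. The function names, the rate-difference measure, and the toy emergency room notes are all hypothetical.

```python
from collections import Counter


def word_rates_by_category(docs):
    """docs: iterable of (category, text) pairs.

    Returns {category: {word: fraction of that category's documents
    containing the word}}.
    """
    doc_counts = Counter()
    word_counts = {}
    for category, text in docs:
        doc_counts[category] += 1
        counts = word_counts.setdefault(category, Counter())
        # Count each word at most once per document.
        counts.update(set(text.lower().split()))
    return {
        category: {w: n / doc_counts[category] for w, n in counts.items()}
        for category, counts in word_counts.items()
    }


def contrast(rates, cat_a, cat_b, top_n=3):
    """Words whose document rates differ most between two categories.

    A positive difference means the word is more common in cat_a.
    """
    words = set(rates[cat_a]) | set(rates[cat_b])
    diffs = [
        (w, rates[cat_a].get(w, 0.0) - rates[cat_b].get(w, 0.0))
        for w in words
    ]
    # Largest absolute difference first; ties broken alphabetically.
    diffs.sort(key=lambda pair: (-abs(pair[1]), pair[0]))
    return diffs[:top_n]


# Made-up toy records, categorized by treating doctor.
records = [
    ("dr_a", "chest pain ecg aspirin"),
    ("dr_a", "chest pain ecg admitted"),
    ("dr_b", "chest pain aspirin discharged"),
    ("dr_b", "chest pain discharged"),
]
rates = word_rates_by_category(records)
for word, diff in contrast(rates, "dr_a", "dr_b"):
    print(f"{word:12s} {diff:+.2f}")
```

No single record in the toy data says that one doctor orders ECGs while the other discharges patients; the difference shows up only when the records are compared by category, which is the sense in which the information is implicit in the collection.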

Do the Solutions Work?

Practical Reasoning's Core of Discovery is unusual among search engines and knowledge management systems, in that there is strong and growing evidence that it indeed is effective. Briefly:

  1. The Swanson algorithm, which is automated in the Core of Discovery, was applied by Swanson to make other discoveries, several of which have been confirmed by clinical studies. His original finding, regarding fish oil and Raynaud's Syndrome, was duplicated computationally on the set of documents Swanson had at the time of his discovery. Computational follow-on has not been widespread because of the enormous cost. The achievement of the Core of Discovery is to have reduced this cost drastically and to have made it practical for non-experts to do such studies.
  2. In laboratory experiments conducted by Prof. Kimbrough at the University of Pennsylvania, the concept searching algorithm used by the Clark module of the Core of Discovery was shown to be remarkably effective in ranking relevant documents. This is true both absolutely and in comparison with leading information retrieval products.
  3. The concept for the Lewis module was originated by Mr. Garett Dworman and explored systematically in his Ph.D. thesis at the University of Pennsylvania (supervised by Prof. Kimbrough). Through controlled experiments on diverse document collections, Dworman demonstrated that the Lewis module can be highly effective in helping users reliably and accurately gain information that is implicit in a collection as a whole. In particular, Dworman studied the Practice Problem as described above, using hospital emergency room records. The experimental results were validated by a blind panel of medical experts.
  4. The surprising effectiveness of carefully considered attacks on the Anonymous Problem has been amply documented in a recent book, Author Unknown (2000, Henry Holt and Company), by Prof. Don Foster of Vassar College. The Core of Discovery brings into play valuable automation in support of such detecting activities, thereby reducing their cost and increasing their scope of application.

How Do We Do It?

The modules in Practical Reasoning's Core of Discovery product make essential, and innovative, use of insights into the mathematics of indexing. These insights were developed in Prof. Kimbrough's laboratory at the University of Pennsylvania, many of them under funded research projects since 1990. Some of these insights have been published in the open literature; others remain proprietary.

In addition, and more importantly from a business perspective, the modules in the Core of Discovery rely essentially on proprietary computational technology for exploiting the mathematical insights. This computational technology was developed by Prof. Kimbrough and is proprietary.

For more information...

Contact: Professor Steven O. Kimbrough, University of Pennsylvania, Steinberg Hall - Dietrich Hall, Suite 1300, Philadelphia, PA 19104-6366. (215) 898-5133. kimbrough@wharton.upenn.edu. http://grace.wharton.upenn.edu/~sok/

Prof. Kimbrough is President of Practical Reasoning, Inc.

