Imagine that you have access to a large document collection and are interested in obtaining information about Plato, the ancient Greek philosopher. You successfully retrieve all documents containing the word 'Plato'. When you examine these documents you find what you expected: many of them are indeed about the Greek philosopher; others are about various commercial products trading on the philosopher's name, but having nothing to do with him.
You are then surprised when a friend presents you with a completely different set of documents, also retrieved from the same large document collection. Nearly all of these new documents are very much about Plato the philosopher, yet none of them mention 'Plato' (or any close term, e.g., 'Platon', a foreign spelling).
How can this be? Your friend explains. 'Plato' was not the philosopher's original name. It was a nickname given to Aristocles by his wrestling coach ('personal trainer' in today's language) and it means 'chubby' or 'chubs'. Somehow it stuck. Many authors, however, have written about the philosopher using only his original name. These writings would not be retrieved by a keyword search using 'Plato'.
The Plato Problem is the problem of designing a computer system that helps in finding relevant and useful documents that do not contain the search term(s) you have in mind. The scope of this problem extends far beyond the world of classical scholarship. In fact, except in relatively trivial cases, it appears much more often than not. Here is a representative account of a search in a small (40,000 documents) collection by a leading expert in information retrieval. The area of application is litigation support.
You have a daughter who has been diagnosed with an unpleasant disease called Raynaud's Syndrome. You are told that there is no known effective treatment of the disease, and you quickly verify this by a thorough search of all the on-line medical research papers.
Concerned about your daughter, you wonder if, somehow, a solution has been found, but no one has "made the connection" with Raynaud's Syndrome. The Raynaud Problem is the problem of designing a computer system that will help you in following up on your hunches, specifically here and in general. Can you find a plausible hypothesis for a treatment in the existing literature?
The Raynaud Problem appears very often, although it is rarely attacked, for lack of effective means. Investors seek to find trends or developments that will affect markets in surprising ways. This is a form of the Raynaud Problem. Litigators, and other investigators, look for evidence of links in an unknown (and surprising) causal chain. This, too, is a form of the Raynaud Problem. Many other forms occur naturally.
You are a hospital administrator and you would like to have information about how the practices of your emergency room doctors differ. Do they treat similar patients differently? If so, how are the treatments different? Do they see different kinds of patients? And so on. Unfortunately, the only electronic records you have of the patient visits are in document form. There is no database to mine, just a series of emergency room documents.
The Practice Problem is the problem of designing a computer system that will help you find and see meaningful patterns of behavior, based on information extracted from weakly-structured documents. These patterns of behavior can only emerge from the document collection as a whole. The information is not present in any small number of documents. It is implicit in the collection, not explicit in any document.
Like the Raynaud Problem, the Practice Problem appears quite often, although it too is rarely attacked, for lack of effective means. You have a Practice Problem if (a) you are interested in observing change (or lack of change) across departments, industries, organizations, countries, regions, time periods, etc., and (b) you have to rely on a document collection (rather than a database) for your information.
You are presented with a text (a sample of speech, a document, or a portion of a document) whose authorship is unknown or in doubt. Can you identify the author, or must the text remain anonymous? Can the date or influencing circumstances of the text be identified?
There are many applications for which even a significant narrowing of possibilities is of value. Automated text classification of this kind will be useful to museums and archives, legal services, and intelligence gathering, both in government and commerce.
The Core of Discovery's Clark module is a concept retrieval tool. It is the right tool for anyone facing the Plato Problem. In traditional search engines, words are taken literally, without regard to their larger meanings and associations. We search for documents containing a "key word". Instead, we need to search for documents pertaining to the concept. We don't just want documents with "Plato" (the word) in them; we want documents about Plato (the ancient Greek philosopher, who was taught by Socrates and who taught Aristotle, etc.).
Users of course often do not have strong knowledge of the concepts corresponding to the search terms they wish to use. The Core of Discovery automatically finds concepts when it indexes collections. Search terms given to the Clark module are treated as concepts, not as key words. Consequently, a Clark search often finds and ranks highly documents that are very relevant to the given concept but that do not contain the actual word used to initiate the search. In our Plato example, Clark would likely give a high ranking to a document about the author of The Meno and a low ranking to a document about Plato's Retreat (a sex club in New York), even though the first document did not contain the word 'Plato'. (The reader might try such a query on the Web today.)
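To make the idea of treating a search term as a concept more concrete, here is a minimal sketch in Python. It is not the Clark module's algorithm, which is proprietary and not described here. It assumes a deliberately simple stand-in: the "concept" for a word is the bag of words that share a document with it, and documents are ranked by cosine similarity to that bag. The names tokenize, build_concepts, cosine, concept_search and TOY_DOCS are all hypothetical, and the four documents are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase, split on whitespace, and trim surrounding punctuation."""
    words = (w.strip(".,;:!?'\"()").lower() for w in text.split())
    return [w for w in words if w]

def build_concepts(docs):
    """Map each word to a Counter of the words that share a document with it.

    Assumption: document-level co-occurrence stands in for a 'concept'.
    """
    concepts = defaultdict(Counter)
    for text in docs:
        words = set(tokenize(text))
        for w in words:
            concepts[w].update(words)  # each co-occurring word counts once per document
    return concepts

def cosine(a, b):
    """Cosine similarity between two Counters."""
    dot = sum(v * b[k] for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def concept_search(query, docs):
    """Return (document index, score) pairs, best match first."""
    concept = build_concepts(docs).get(query.lower(), Counter())
    scores = [(i, cosine(concept, Counter(tokenize(d)))) for i, d in enumerate(docs)]
    return sorted(scores, key=lambda pair: -pair[1])

# Invented toy collection; document 1 never uses the word 'Plato'.
TOY_DOCS = [
    "Plato taught Aristotle and wrote dialogues about Socrates",
    "Aristocles wrote dialogues about Socrates and taught Aristotle",
    "Plato brand clay is sold in toy stores",
    "The stock market fell sharply on Tuesday",
]

for index, score in concept_search("Plato", TOY_DOCS):
    print(index, round(score, 3))
```

In this toy collection the Aristocles document receives a positive score although it never contains 'Plato', while the unrelated stock market document scores zero. The sketch does not demote the commercial-product document; separating the senses of a word would need a richer notion of concept than raw co-occurrence.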
The Core of Discovery's concept retrieval algorithms can be used for concept searching on either words (search terms) or entire documents. Creative, iterative employment of this feature will be valuable for attacking Anonymous problems.
The Core of Discovery's Empirical Thesaurus module is useful both in concept searching and in traditional key word searching. A user can specify any word appearing in a collection and the Empirical Thesaurus will return a report that shows the pattern of association between the user's word and the other words appearing in the collection. The returned pattern will usually be surprising because it will show associations (or lack of associations!) between words that were not anticipated by the user. In this way, for example, users might discover that the word they had in mind is not much used, but that a near-synonym is. Thus, they should direct searches based on the newly found synonym.
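A report of this general kind can be sketched in a few lines of Python. This is an illustration only, not the Empirical Thesaurus itself. It assumes one simple association measure, lift: the observed rate at which two words share a document, divided by the rate expected if they were independent. The function association_report and the sample collection SAMPLE are hypothetical.

```python
from collections import Counter

def association_report(target, docs):
    """List (word, lift) pairs for every other word in the collection.

    Lift above 1 suggests the word appears with the target more often than
    chance; lift of 0 means the two never share a document.
    """
    doc_sets = [set(d.lower().split()) for d in docs]
    n = len(doc_sets)
    doc_freq = Counter(w for s in doc_sets for w in s)
    if doc_freq[target] == 0:
        return []  # the target word does not occur in the collection
    joint = Counter(w for s in doc_sets if target in s for w in s)
    report = []
    for word, freq in doc_freq.items():
        if word == target:
            continue
        lift = (joint[word] / n) / ((doc_freq[target] / n) * (freq / n))
        report.append((word, round(lift, 2)))
    return sorted(report, key=lambda pair: (-pair[1], pair[0]))

# Invented four-document collection.
SAMPLE = [
    "attorney filed motion",
    "attorney filed appeal",
    "lawyer drafted contract",
    "attorney drafted motion",
]

for word, lift in association_report("attorney", SAMPLE):
    print(word, lift)
```

In this sample, 'lawyer' has a lift of zero with 'attorney'. A lack of association like this can itself be informative: near-synonyms often do not appear together, so a user searching only on 'attorney' would miss the third document.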
The Core of Discovery's Swanson module is named in honor of Don Swanson who did in fact have a daughter diagnosed with Raynaud's Syndrome and who, through extremely diligent reading and information searches, discovered the hypothesis that fish oil could be used to treat Raynaud's Syndrome, because it reduces blood viscosity.
Swanson found that many of the documents about Raynaud's mentioned blood viscosity. He then searched for a topic such that:
The Core of Discovery's Swanson module duplicates the algorithm Swanson implicitly followed in finding the fish oil hypothesis. Human judgment and follow-on investigation will always be needed to validate these found hypotheses, but finding "the contenders" in the first place is a major achievement. In addition, the Swanson module is an important tool for approaching Anonymous problems.
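The general shape of such a search can be sketched in Python. The sketch follows the pattern commonly described for this kind of literature-based discovery: start from a topic A, collect the topics B that appear with it, then look for topics C that appear with some B but never directly with A. It is an assumption that this pattern resembles what the Swanson module does; the module's actual algorithm is not given here. The function find_contenders and the sample data LITERATURE are hypothetical, and each document is reduced by hand to a set of topic terms.

```python
from collections import defaultdict

def find_contenders(start, docs_terms):
    """Return (candidate, linking terms) pairs for topics indirectly tied to start.

    docs_terms is a list of sets, one set of topic terms per document.
    A candidate shares a document with some neighbor of start, but never
    shares a document with start itself.
    """
    neighbors = defaultdict(set)
    for terms in docs_terms:
        for term in terms:
            neighbors[term] |= terms - {term}
    direct = neighbors[start]
    candidates = defaultdict(set)
    for link in direct:
        for other in neighbors[link]:
            if other != start and other not in direct:
                candidates[other].add(link)
    # Candidates supported by more linking terms come first.
    return sorted(candidates.items(), key=lambda kv: (-len(kv[1]), kv[0]))

# Invented toy literature, one set of topic terms per document.
LITERATURE = [
    {"raynaud", "blood viscosity"},
    {"fish oil", "blood viscosity"},
    {"raynaud", "cold exposure"},
    {"fish oil", "diet"},
]

for candidate, links in find_contenders("raynaud", LITERATURE):
    print(candidate, "via", sorted(links))
```

On the toy data the only contender is 'fish oil', reached through 'blood viscosity'. In a real collection the list of contenders would be long, which is why human judgment is still needed to decide which ones deserve follow-up.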
The Core of Discovery's Lewis module finds and displays variations in word patterns across categories of documents. As such, it is an ideal tool for anyone facing the Practice Problem or the Anonymous Problem.
The Core of Discovery provides the basic indexing used by the Lewis module. Documents in collections are categorized (by time period, by gender of subject, by organization, by subject matter, or in any other way the user desires) and indexed. Once this is done, the Lewis module can be used to present and explore word pattern differences among the categories. The display of this information is striking even to novices and has proved to be highly revealing of information in the collection. Again, this information is implicit in the collection, not explicit in any small number of documents. Lewis finds implicit information.
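As a rough illustration of comparing word patterns across categories, the Python sketch below computes each word's relative frequency within a category and then lists the words whose frequencies differ most between two categories. This is an assumed, simplified measure and not the Lewis module's method or display. The functions category_profiles and distinctive_words, the category labels, and the emergency room notes in ER_NOTES are all hypothetical.

```python
from collections import Counter

def category_profiles(labeled_docs):
    """Return {category: {word: relative frequency within that category}}."""
    counts = {}
    for category, text in labeled_docs:
        counts.setdefault(category, Counter()).update(text.lower().split())
    profiles = {}
    for category, counter in counts.items():
        total = sum(counter.values())
        profiles[category] = {w: c / total for w, c in counter.items()}
    return profiles

def distinctive_words(profiles, cat_a, cat_b, top=3):
    """Return (word, difference) pairs with the largest frequency gaps.

    A positive difference means the word is relatively more common in cat_a.
    """
    a, b = profiles[cat_a], profiles[cat_b]
    diffs = {w: a.get(w, 0.0) - b.get(w, 0.0) for w in set(a) | set(b)}
    ranked = sorted(diffs, key=lambda w: (-abs(diffs[w]), w))
    return [(w, diffs[w]) for w in ranked[:top]]

# Invented notes, categorized by attending doctor.
ER_NOTES = [
    ("dr_a", "chest pain ordered ecg"),
    ("dr_a", "chest pain ordered ecg admitted"),
    ("dr_b", "chest pain ordered xray"),
    ("dr_b", "chest pain discharged"),
]

profiles = category_profiles(ER_NOTES)
for word, diff in distinctive_words(profiles, "dr_a", "dr_b"):
    print(word, round(diff, 3))
```

No single note says that the two doctors practice differently. The difference shows up only when the notes are grouped by category and compared as a whole.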
Practical Reasoning's Core of Discovery is unusual among search engines and knowledge management systems, in that there is strong and growing evidence that it indeed is effective. Briefly:
The modules in Practical Reasoning's Core of Discovery product make essential, and innovative, use of insights into the mathematics of indexing. These insights were developed in Prof. Kimbrough's laboratory at the University of Pennsylvania, many of them under funded research projects since 1990. Some of these insights have been published in the open literature; others remain proprietary.
In addition, and more importantly from a business perspective, the modules in the Core of Discovery rely essentially on proprietary computational technology for exploiting the mathematical insights. This computational technology was developed by Prof. Kimbrough and is proprietary.
Prof. Kimbrough is President of Practical Reasoning, Inc.