14 Jun 2010
With little fanfare, OSTI has introduced a powerful new semantic search tool on Information Bridge, the flagship collection of DOE research reports. (See http://www.osti.gov/bridge/) The technical name for this new tool is “term vector similarity analysis,” or simply More-Like-This (MLT). MLT promises not only to greatly improve the discovery of relevant reports, but also to make it easier to see the underlying structure of DOE research. It does this by looking beyond one’s original search terms. So far as I know, OSTI is the first federal STI agency to implement MLT.
As the name More-Like-This suggests, MLT introduces a two-stage search process. The user first conducts a conventional search by entering one or more search terms. The search engine then finds just those documents that make heavy use of the search terms. So far so good. With MLT the user then selects an “anchor document” in the hit list that represents what they are looking for, and opens that document’s bibliographic page. When the user clicks the MLT button (under the title), the search engine re-searches the entire database for similar documents. Three MLT hits are displayed. These are the three documents that show the greatest MLT similarity to the anchor document.
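To make the first stage concrete, here is a minimal sketch of a conventional keyword search that ranks documents by how heavily they use the search terms. The function name `keyword_search`, the simple counting score, and the three sample documents are all made up for illustration; this is not a description of how OSTI's search engine is actually built.

```python
import re
from collections import Counter

def keyword_search(query, documents):
    """Stage one: rank documents by how often they use the query terms."""
    terms = re.findall(r"[a-z]+", query.lower())
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        score = sum(counts[term] for term in terms)
        if score > 0:  # documents that never use the terms are not hits
            scored.append((score, doc_id))
    scored.sort(reverse=True)  # heaviest use of the search terms first
    return [doc_id for _, doc_id in scored]

# Hypothetical sample documents, invented for this sketch.
documents = {
    "d1": "information sharing in science speeds discovery, and sharing results matters",
    "d2": "journal clubs help emergency response teams exchange findings",
    "d3": "policies for sharing of laboratory data",
}

hits = keyword_search("information sharing", documents)
print(hits)
```

Note that "d2" never appears in the hit list, because it uses none of the search terms. That is the limitation the second, MLT stage is meant to get around.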
MLT similarity is based on all of the terms used in the anchor document, not just the original search terms. In effect the entire anchor document is the “search term” for the new search. Thus the MLT search may reveal similar reports that make little or no use of the original search terms. Because the anchor document and the MLT hit documents share so much of their overall language, the MLT hits are quite likely to be closely related to the anchor document. This is the power of MLT: finding related research that does not show up on the conventional hit list, or that does not look relevant when it does show up.
For example, when I first tried MLT I found an important DOE report that I probably never would have found without it. The title is “Using architectures for semantic interoperability to create journal clubs for emergency response.” I have no interest in emergency response and never heard of a journal club, so this title meant nothing to me. But it appeared on an MLT search which suggested it was closely related to the topic of information sharing in science. On inspection it turned out to be quite important.
Another useful feature of MLT is that it reveals clusters of related research. This happens when one does MLT searches on MLT search results, which I call MLT browsing. That is, one gets the three original MLT hits, then does MLT searches on each hit, and so on. As one moves out from the anchor document it becomes apparent that there are distinct groups of related reports, which correspond to small research communities. MLT can be used to explore these communities. Given that one of the leading uses of search in science and technology is to understand what is going on, this browsing and clustering should be a very useful feature. In principle one could map the MLT structure of all of DOE research this way.
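MLT browsing, as described above, amounts to a breadth-first walk outward from the anchor document. The following is a minimal sketch under stated assumptions: `mlt_hits` is a hypothetical stand-in for "click the MLT button on this document," and the hit lists in `toy_hits` are invented so that two small clusters show up.

```python
from collections import deque

def mlt_browse(anchor_id, mlt_hits, max_depth=2):
    """Walk outward from an anchor document by following MLT hits.

    mlt_hits is a hypothetical callable: document id -> list of similar ids.
    Returns a dict mapping each document reached to its number of MLT
    clicks from the anchor.
    """
    depth = {anchor_id: 0}
    queue = deque([anchor_id])
    while queue:
        current = queue.popleft()
        if depth[current] == max_depth:
            continue  # do not browse further out than max_depth clicks
        for hit in mlt_hits(current):
            if hit not in depth:  # first time this report is seen
                depth[hit] = depth[current] + 1
                queue.append(hit)
    return depth

# Invented hit lists: a, b, c, d point mostly at each other (one cluster),
# and e, f, g, h point mostly at each other (a second cluster).
toy_hits = {
    "a": ["b", "c", "d"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "e"],
    "d": ["a", "b", "c"],
    "e": ["c", "f", "g"],
    "f": ["e", "g", "h"],
    "g": ["e", "f", "h"],
    "h": ["f", "g", "e"],
}

reached = mlt_browse("a", toy_hits.get, max_depth=2)
print(reached)
```

In this toy data the first round of hits keeps pointing back into the same small group, and only one link ("c" to "e") leads out toward the second group. That pattern of dense mutual hits with a few bridges is what a cluster looks like when browsing.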
It is the mathematical basis for MLT that makes it so powerful (although users do not have to know about the math). Here are the mathematical basics. To begin with, most people are familiar with X-Y graphs. Two points A and B, plotted on such a graph, are a certain distance apart. If we add a third point C, we can say whether C is closer to A than B is, or further away, or the same distance. So we can tell which point is closest to any other given point.
Less familiar perhaps is the fact that we can draw three arrows from the origin of the graph (where the X and Y axes intersect) to the three points A, B and C. These arrows are called vectors. Instead of comparing distances we can compare angles: is the angle between A’s vector and B’s smaller than the angle between A’s vector and C’s? We can also still consider the distances between the points A, B and C, but now they are the distances between the tips of the arrows. This is basic vector math, and lots of measurements in the world can be treated as vectors.
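A small worked example shows that the two measures can disagree. The points A, B and C below are arbitrary values chosen for illustration, and the helper names are my own.

```python
import math

def angle_between(u, v):
    """Angle in degrees between two 2-D vectors drawn from the origin."""
    dot = u[0] * v[0] + u[1] * v[1]
    norm_u = math.hypot(u[0], u[1])
    norm_v = math.hypot(v[0], v[1])
    # Clamp to guard against tiny floating-point overshoot outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))

def distance(p, q):
    """Straight-line distance between the tips of two arrows."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

A = (4.0, 1.0)
B = (8.0, 2.0)  # points in the same direction as A, but twice as long
C = (3.0, 3.0)  # nearby on the graph, but pointing in a different direction

print("angle A-B:", angle_between(A, B), "angle A-C:", angle_between(A, C))
print("distance A-B:", distance(A, B), "distance A-C:", distance(A, C))
```

By distance, C is the closer point to A. By angle, B is closer, because its arrow points the same way as A's. Comparing angles ignores how long the arrows are and looks only at their direction.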
What makes MLT mathematically hairy is that, instead of this simple X-Y graph with its two axes, there is a huge graph in which every term used in the documents has its own axis. No one can see this graph, which has thousands of axes, but the computer can process it. Each document becomes a single arrow in this graph, based on how many times each term occurs. The closer together the arrows are, the more similar the documents, taking all the terms into account. It is elegant and it works.
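The idea can be sketched in a few lines. This is a minimal illustration, assuming raw term counts and the cosine of the angle between arrows as the closeness measure; production search engines commonly also weight rare terms more heavily, and nothing here is taken from OSTI's actual implementation. The function names and the four sample reports are invented.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """One axis per term: count how many times each word occurs."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (1.0 = same direction)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction
    return dot / (norm_a * norm_b)

def more_like_this(anchor_id, documents, top_n=3):
    """Return the ids of the top_n documents closest in angle to the anchor."""
    anchor = term_vector(documents[anchor_id])
    scored = [
        (cosine_similarity(anchor, term_vector(text)), doc_id)
        for doc_id, text in documents.items()
        if doc_id != anchor_id
    ]
    scored.sort(reverse=True)  # largest cosine (smallest angle) first
    return [doc_id for _, doc_id in scored[:top_n]]

# Hypothetical sample reports, invented for this sketch.
documents = {
    "r1": "semantic search of research reports using term vectors",
    "r2": "term vectors support semantic search across research reports",
    "r3": "neutron scattering measurements in thin films",
    "r4": "search interfaces for research reports",
}

similar = more_like_this("r1", documents, top_n=2)
print(similar)
```

Here "r2" shares six of its eight terms with the anchor "r1" and ranks first, "r4" shares three and ranks second, and "r3" shares none, so its arrow is at right angles to the anchor's.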
What makes MLT powerful is that people working on related problems tend to use related language, even if they do not use exactly the same language. We coin and use the words we do in order to talk about what is important in the world, especially in science and technology. Specialties often emerge, and diverge, as core problems are explored. In these cases the language of the core problem remains, even if it does not appear in the report titles or the abstracts, because these tend to focus on the specialty. MLT can reveal the underlying structure of the research. For more on this diverging structure of science and technology see:
http://www.osti.gov/ostiblog/home/entry/sharing_results_is_the_engine
More-Like-This is more than just an add-on; it is a powerful new way of searching. MLT goes beyond the user’s original search terms to consider all of the words used in all of the documents in the database. Moreover, it does this without using thesauri, taxonomies, ontologies, or any other human-crafted semantic system. It just uses the natural language of science and technology.
David Wojick
Senior Consultant for Innovation
OSTI