Accelerating Science Discovery - Join the Discussion

Making Scientific Databases Work Together—For You (psst . . . that's "search interoperability")
by Dr. Walt Warnick on Mon, 13 Feb, 2012

Sometimes something complex can work so seamlessly that it’s easy to miss. We think that’s the case with our solution for achieving search interoperability.

As you may know, “search interoperability” is just a fancy way of saying that lots of scientific databases scattered far and wide can be made to work together so that your job as a seeker of science information is easy. You can go to one search box, say Science.gov, type in your search term, and get results from over a hundred important repositories and a couple of thousand scientific websites – with one click.

And you know that this is a good thing because, as a practical matter, you cannot be expected to know in advance which database might hold the information you seek. Nor can you be expected to search dozens of sources one by one; that would be an onerous task. Also, as an experienced seeker of quality science information, you are well aware that commercial search engines (read: Google, Bing, etc.) sometimes cannot mine the deep web for you, thus missing R&D results residing there (see Federated Search - The Wave of the Future?).

So achieving search interoperability with OSTI’s federated search tools, such as Science.gov, WorldWideScience.org, and the E-print Network, has been an important development, though by no means easily accomplished. There are myriad obstacles that can block information exchanges between systems.  (To learn more about the broad topic of interoperability and obstacles to exchanging information, see the Wikipedia article on interoperability.)

Specific to our world of scientific and technical information, the challenge of interoperability stems from the simple fact that database designers rarely build in interoperability. Why would they? Their main concern is ensuring your needs are met within the individual database you are using. Designers have little or no need to worry about exchanging information with other databases.

Here is where the beauty of OSTI comes into play: we set out—in our design phase—to make databases work together even when the databases themselves were not originally designed to be interoperable with other databases. We’re pretty proud of that fact.  We’re even prouder that we achieved it.

But we had to clear a significant hurdle: convincing the owners of the various databases that their content could be federated without adding to their burden. We knew from experience that our vision would not be achieved if we started out by asking, “If only you would voluntarily perform an added, non-trivial task … .”  So, early on, we decided that OSTI’s integration of information sources, whether by federation or other means, would place no burden on the source owner other than handling the increased traffic.

Science.gov, WorldWideScience.org, and the E-print Network achieve search interoperability for the covered databases and websites. This means that each search query on these tools searches about a hundred repositories, whether the repositories use XML, PDF, LaTeX, PostScript, or HTML.  If a database makes its full text searchable, the federated search covers full text, too.

Federated search has evolved considerably over the last decade, thanks in part to OSTI’s innovations in the field.  We now routinely integrate just about any kind of information resource.  For example, we make e-print databases such as arXiv and CERN Preprints, along with many other databases such as PubMed Central, full-text searchable and field searchable via a single federated search query; we then integrate about 35,000 other repositories hosted by universities, professional societies, and others using discovery service technology.  We make all this as convenient to the user as searching a single integrated entity.  We search by metadata field, like author or date.  In addition, we had to re-invent and deploy relevance ranking so that it would work in the federated environment.  We have expanded the number of databases that can be integrated by a single federation to the point that the number of databases is no longer a practical limitation.
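
For readers who like to see the idea in code, here is a minimal sketch of the general federated search pattern: send one query to several sources at once, put each source's scores on a common scale, and merge the results into one ranked list. This is not OSTI's implementation. The function name `federated_search`, the shape of the `sources` argument, and the simple divide-by-the-top-score normalization are all assumptions made for illustration; production relevance ranking is considerably more sophisticated.

```python
from concurrent.futures import ThreadPoolExecutor


def federated_search(query, sources, top_k=10):
    """Query every source in parallel and merge the hits into one ranked list.

    `sources` is a hypothetical mapping of source name -> callable(query)
    that returns a list of (title, raw_score) pairs.
    """

    def run(item):
        name, search = item
        try:
            hits = search(query)
        except Exception:
            # One failing source should not sink the whole query.
            return []
        if not hits:
            return []
        # Raw scores from different sources are not comparable, so scale
        # each source's scores by its own best hit (an illustrative choice).
        top = max(score for _, score in hits)
        return [(title, name, score / top if top else 0.0)
                for title, score in hits]

    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        per_source = list(pool.map(run, sources.items()))

    merged = [hit for hits in per_source for hit in hits]
    merged.sort(key=lambda hit: hit[2], reverse=True)
    return merged[:top_k]
```

Note that the sources themselves do nothing special here: each one simply answers a query, which matches the goal of placing no added burden on the content owners.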

We fully recognize that there is more to interoperability than search, as indicated in the Wikipedia article referenced above.  We always encourage the adoption of functional standards, which if adopted broadly would achieve interoperability beyond search interoperability.  If and when standards are widely adopted, we will gladly use those standards and retire the custom solutions that are necessary in our federations.  Our federations already include those few databases that have adopted such standards.  We have developed a reusable piece of technology based on OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) that enables near-zero-effort federation of databases that support this standard.
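
To give a feel for why a standard like OAI-PMH makes federation so much easier, here is a small sketch of the client side of the protocol: building a `ListRecords` request and pulling identifiers and titles out of a Dublin Core (`oai_dc`) response. It is an illustration only, not OSTI's reusable component. The function names and the `https://example.org/oai` endpoint in the usage note are hypothetical; the verb, parameters, and XML namespaces come from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if resumption_token:
        # A resumptionToken is an exclusive argument: it replaces the others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def parse_records(xml_text):
    """Return (records, next_token) from a ListRecords response body."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    # An empty or missing token means there are no more pages to harvest.
    return records, (token or None)
```

A harvester would fetch `list_records_url("https://example.org/oai")`, pass the response body to `parse_records`, and keep requesting with the returned token until it comes back as `None`. Because every compliant repository answers the same request in the same format, one piece of client code can serve them all.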

In essence, thanks to our seamless implementation of search interoperability, our customers are handed a single point of entry to many widely disparate sources of science information.  This is good for our customers, and good for science.  We laid the groundwork to achieve search interoperability; we then achieved it by using search industry standards where we could and innovating solutions where standards were not available.  This is what innovators do.  And OSTI will continue to innovate.

About the Author

Dr. Walt Warnick
Former Director, U.S. DOE Office of Scientific and Technical Information