How to Integrate Anything on the Web

by Walt Warnick on Wed, August 03, 2011

OSTI is especially proud of its web integration work, in which we take multiple web pages, documents, and web databases and make them appear to the user as an integrated whole. Once the sources are virtually integrated by OSTI, the virtual collection becomes searchable via a single query. Because information on the web appears in a variety of formats, from HTML web pages to PDF documents to searchable databases, OSTI has developed and uses a suite of integration approaches to make them all searchable via a single query.

OSTI has two goals that make it critical for us to understand multiple solutions for integrating science content on the web. First, we make DOE science information widely available and searchable by appropriate audiences wherever they may be; and second, we make science information from around the world searchable by DOE researchers. Since migrating to a fully electronic operation in the late 1990s, OSTI has met these goals by deploying various search architectures for integrating content via the web.

Within the information science circles in which we are active, we are well known for our pioneering work with the integration technology known as federated search. However, we also employ other, possibly lesser-known technologies to integrate web content.

To integrate information sources that are not interoperable, we see three categories of solutions: 1) you can create a data warehouse, where you copy the information items, standardize their metadata, and host them on your own servers; 2) you can create a discovery service, wherein you index source items without copying them and then host the index on your server (this technology is similar to that used by the major search engines, except that you carefully direct the indexing tools, i.e., the crawler, so that only pre-selected material is indexed); and 3) you can use federated search to take advantage of the existing search interfaces associated with the information sources.
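The third category can be sketched in a few lines of Python. This is only an illustration of the general idea, not OSTI's implementation: the source names, the toy in-memory "search interfaces," and the helper functions `make_source` and `federated_search` are all hypothetical. Each source keeps its own search capability; the federator just sends the same query to every source and merges what comes back.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# A "search interface" here is any function that takes a query and returns
# matching titles. Real sources would be remote databases (assumption).
SearchFn = Callable[[str], List[str]]


def make_source(records: List[str]) -> SearchFn:
    """Build a toy search interface over an in-memory list of titles."""
    def search(query: str) -> List[str]:
        q = query.lower()
        return [r for r in records if q in r.lower()]
    return search


def federated_search(query: str, sources: Dict[str, SearchFn]) -> List[Tuple[str, str]]:
    """Send one query to every source in parallel and merge the hits."""
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in sources.items()}
        merged = []
        # Iterate sources in name order so the merged list is deterministic.
        for name in sorted(futures):
            for hit in futures[name].result():
                merged.append((name, hit))
    return merged


# Hypothetical sources; nothing is copied or indexed ahead of time.
sources = {
    "reports": make_source(["Solar cell efficiency report", "Wind turbine study"]),
    "patents": make_source(["Solar collector patent"]),
}
results = federated_search("solar", sources)
```

Note that nothing is stored on the federator's side. By contrast, a data warehouse would hold copies of the records themselves, and a discovery service would hold an index built in advance by a crawler.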

DOE Project Summaries is an example of the data warehouse approach. It makes tens of thousands of web pages, each describing a single DOE R&D project, searchable via a single query. The institutional repository search capability in E-print Network is an example of the discovery service approach. It makes tens of thousands of institutional repositories hosted by universities and professional societies searchable via a single query. The federated search approach is used to make over 40 large scientific and technical information databases and resources from 14 agencies of the U.S. government searchable via a single query. All together, about 200 million “pages” of information are searched with each such query.

Each solution has advantages and disadvantages and each is appropriate for certain situations. 

If you are the owner of the information, or if the owners of the information sources allow their sources to be copied, and if you have sufficient resources, then creating a data warehouse is usually the way to go. 

If resources are scarcer, or if the owners of the information sources do not allow them to be copied but will allow them to be indexed, then creating a discovery service works well.

If the collection of information you are targeting is enormous, or the information owners will not allow copying or indexing of their content, and the underlying information sources provide sufficient search interfaces, then a carefully done federated search is the way to go. It is often the case that a product that relies upon federated search is in fact a federation of federations. For example, one such federated source is DOE ScienceAccelerator, which is itself a federation of OSTI products that cover technical reports, project summaries, conference proceedings, patents, and more.
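The rules of thumb above can be condensed into a small decision function. This is a simplified reading, not an official OSTI procedure: the function name `choose_approach`, its parameters, and the order in which the rules are checked are all assumptions made for illustration. In particular, the sketch assumes that permission to copy a source also implies permission to index it.

```python
from typing import Optional


def choose_approach(
    can_copy: bool,
    can_index: bool,
    has_search_interface: bool,
    ample_resources: bool,
    enormous: bool = False,
) -> Optional[str]:
    """Pick an integration approach from the rules of thumb in the text."""
    # Enormous collections, or sources that may be neither copied nor
    # indexed, call for federated search if a search interface exists.
    if has_search_interface and (enormous or not (can_copy or can_index)):
        return "federated search"
    # Owning the content (or being allowed to copy it) plus sufficient
    # resources favors a data warehouse.
    if can_copy and ample_resources:
        return "data warehouse"
    # Scarce resources, or indexing-only permission, favor a discovery
    # service. Assumption: permission to copy implies permission to index.
    if can_copy or can_index:
        return "discovery service"
    # No rule applies: the source cannot be copied, indexed, or queried.
    return None
```

For instance, `choose_approach(can_copy=False, can_index=True, has_search_interface=False, ample_resources=True)` returns `"discovery service"`, matching the case where owners forbid copying but permit indexing.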

The upshot is that OSTI has the tools it needs to virtually integrate just about any web sites, documents hosted on institutional repositories, or databases that one encounters on the web. Because these are the architectures used for just about all text found on the web, generally speaking, OSTI can now integrate anything.

