by Dr. Walt Warnick on Thu, 13 Mar, 2008
by Walt Warnick and Sol Lederman
This is the second in a three-part series of articles about the deficiencies of web crawling and indexing, the superiority of federated search to the serious researcher, and the value of OSTI federated search applications in advancing science. Part 1 identified a number of serious limitations of Google and the other crawlers. This article shows how federated search overcomes these limitations. The final article in the series highlights a number of federated search applications and databases that OSTI makes available to the public.
In Part 1, we explained that Google, being a surface web crawler, cannot access the deep web, which consists of content that resides in databases. We also noted that the deep web is several hundred times larger than the surface web and that a large percentage of the highly sought-after scientific and technical information resides in the deep web. Finally, we explained that there is no way to determine the quality of any particular document in the surface web. Any web citizen can post a document to the web and it will likely be indexed.
Federated search applications overcome the two aforementioned limitations of surface crawlers: (1) limited access to content, and (2) the difficulty in determining its quality. Limited access is overcome by the federated search engine's specialized knowledge of how to query a database and how to retrieve its documents. The quality concern is overcome by the complementary efforts of database owners and creators of federated search applications. First, databases that are made available to federated search applications are managed by owners, or organizations, who have criteria for determining what documents are included in their collections. Content owners usually document the criteria, allowing researchers to know how the documents in the collection they are searching have been vetted prior to inclusion. Every one of OSTI's federated search applications provides access to content whose criteria for inclusion are strictly managed by a respected organization or governing body, usually US or foreign governments or agencies acting on their behalf. Second, the creators of the federated search application select quality databases that best meet the needs of users.
We discussed scalability limitations of Google's index in Part 1. Managing a gigantic index costs Google in terms of hardware, software, and performance. Federated search does not suffer from these limitations because the deep web content it accesses is distributed among numerous sources, and there is no index to maintain. The decentralized organization allows for scalability; one can build federated search applications that federate (aggregate) content from other federated search applications, which in turn federate other such applications, and so on, in a hierarchical fashion. While this cascading of applications is not without growth limits, research into this and other approaches to scalability is in its infancy and preliminary results are very promising.
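The hierarchical arrangement described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any OSTI application: the names ListSource and Federator are hypothetical, and the in-memory document lists stand in for real deep web databases reached through connectors. The point it shows is that a federator exposes the same search interface as a source, so federators can be nested, and that no central index is built.

```python
from concurrent.futures import ThreadPoolExecutor


class ListSource:
    """Hypothetical stand-in for one deep web database and its connector."""

    def __init__(self, name, documents):
        self.name = name
        self.documents = documents

    def search(self, query):
        # The source is queried live; nothing is crawled or indexed in advance.
        q = query.lower()
        return [(self.name, doc) for doc in self.documents if q in doc.lower()]


class Federator:
    """Sends a query to every source in parallel and merges the results."""

    def __init__(self, name, sources):
        self.name = name
        self.sources = sources

    def search(self, query):
        workers = max(1, len(self.sources))
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # pool.map returns result lists in the same order as self.sources.
            result_lists = list(pool.map(lambda s: s.search(query), self.sources))
        merged = []
        for results in result_lists:
            merged.extend(results)
        return merged


# A federator of federators: "top" treats "energy" like any other source.
reports = ListSource("reports", ["Solar cell efficiency", "Wind turbines"])
patents = ListSource("patents", ["Solar collector design"])
energy = Federator("energy", [reports, patents])
medicine = ListSource("medicine", ["Solar keratosis study"])
top = Federator("top", [energy, medicine])

print(top.search("solar"))
```

Because each search goes to the sources at query time, a document appended to reports.documents after the first call would appear in the next matching search without any re-indexing step.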
Note: As federated search applications grow in the scope of databases they encompass, the quality of the content available to the researcher need not decrease. Federated search discriminates among source databases, and this determines the quality of content. The larger a surface crawler's index becomes, however, the more low-quality content it is likely to contain. Not only is the quality of crawled content unpredictable, so is the time it will take for that content to be indexed. In the federated search paradigm, as soon as a content publisher makes a new document available in its collection, the very next relevant user search will find it.
Google, Yahoo!, and Microsoft, owners of the most prominent search engines, are not ignorant of the deficiencies of surface crawling. Their joint support of the sitemap protocol is intended to overcome their inability to query deep web databases by asking content publishers, as in the case of Google Scholar, to provide URLs to documents within their collections. Additionally, the sitemap protocol is intended to provide document metadata to the crawlers, such as date and time of last update, frequency of update, importance, and other information to facilitate crawling. Results from using the sitemap protocol to index deep web content have been mixed, as not everyone participates and the quality of the content provided by those who do participate is difficult to determine.
Federated search technology is not without its challenges, the greatest of which is the cost of building and maintaining the software "connectors" that search and retrieve documents from deep web databases. The increasing adoption of standards among content publishers' search interfaces (XML, Z39.50/SRU, Z39.50/SRW, Web Services) is driving down the cost and effort to build and maintain connectors. The other great challenge of federated search is the speed of obtaining search results, which is constrained by performance limitations at the content providers' sites. This challenge is being addressed through OSTI-funded research, including caching of common search results, automatic selection of the right sources to search (to minimize unnecessary load on sources not relevant to a query), strategic mirroring of content, and other approaches.
For the serious researcher and the science-attentive public, the future of federated search is bright. OSTI was a pioneer in introducing federated search into the federal government, and OSTI continues to pioneer innovative approaches to managing and overcoming the challenges to quickly delivering the most relevant high-quality content. In Part 3, the final article of this series, we will look at OSTI federated search applications that advance science, and we will suggest that federated search might someday dominate web search architecture.
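To make the metadata mentioned above concrete, the following Python sketch builds a small sitemap file. The element names (urlset, url, loc, lastmod, changefreq, priority) and the namespace come from the sitemap protocol; the helper name build_sitemap and the example.org URL are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML for a list of dicts describing documents.

    Each dict must contain "loc" and may contain "lastmod",
    "changefreq", and "priority". (Hypothetical helper.)
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")


xml_text = build_sitemap([
    {
        "loc": "https://example.org/docs/1",  # placeholder URL
        "lastmod": "2008-03-13",              # date of last update
        "changefreq": "monthly",              # expected update frequency
        "priority": 0.8,                      # relative importance, 0.0 to 1.0
    }
])
print(xml_text)
```

Note that the publisher, not the crawler, supplies every value here, which is one reason the quality of sitemap-provided content is hard for a crawler to judge.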
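Two of the approaches named above, caching of common search results and selecting only relevant sources, can be sketched together in Python. This is an illustrative simplification, not the method used in the OSTI-funded research: the class name CachingSelector, the keyword-overlap rule for choosing sources, and the time-to-live value are all assumptions.

```python
import time


class CachingSelector:
    """Hypothetical front end that picks relevant sources and caches results."""

    def __init__(self, connectors, topics, ttl_seconds=300.0, clock=time.monotonic):
        self.connectors = connectors  # source name -> callable(query) -> list
        self.topics = topics          # source name -> set of topic keywords
        self.ttl = ttl_seconds        # how long a cached result stays fresh
        self.clock = clock
        self._cache = {}              # normalized query -> (timestamp, results)

    def select_sources(self, query):
        # Assumed rule: search a source only if a query term matches its keywords.
        terms = set(query.lower().split())
        return [name for name, keywords in self.topics.items() if terms & keywords]

    def search(self, query):
        key = query.lower().strip()
        now = self.clock()
        hit = self._cache.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # served from cache; no source is contacted
        results = []
        for name in self.select_sources(query):
            results.extend(self.connectors[name](query))
        self._cache[key] = (now, results)
        return results


calls = []  # records which sources were actually queried


def physics(query):
    calls.append("physics")
    return ["physics:" + query]


def biology(query):
    calls.append("biology")
    return ["biology:" + query]


searcher = CachingSelector(
    {"physics": physics, "biology": biology},
    {"physics": {"neutron", "quark"}, "biology": {"protein"}},
)
first = searcher.search("neutron scattering")
second = searcher.search("neutron scattering")
print(first, calls)
```

In this run the biology source is never contacted, and the repeated query is answered from the cache, so the physics source is queried only once. The trade-off is freshness: a cached result can lag a source by up to the time-to-live.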
Walter L. Warnick, Ph.D.
Consultant to OSTI