by Dr. Walt Warnick on Wed, March 12, 2008
by Walt Warnick and Sol Lederman
The web is growing.
When it comes to providing searchable access to the content that matters most to scientists and researchers, Google and the other web crawlers can't keep up. Instead, growing numbers of scientists, researchers, and science-attentive citizens turn to OSTI's federated search applications for high-quality research material that Google can't find. And, given fundamental limitations on how web crawlers find content, those conducting research will derive even more benefit from OSTI's innovation and investment in federated search in the coming years.
This is the first of three articles that discuss and compare the strengths and weaknesses of two web search architectures: the crawling and indexing architecture as used today by Google and the federated search architecture used by Science.gov and WorldWideScience.org. This article points out the limitations of the crawling architecture for serious researchers. The second article explains how federated search overcomes these obstacles. The third article highlights a number of OSTI's federated search offerings that advance science, and suggests that federated search may someday become the dominant web search architecture.
Google is a "surface web" crawler; it discovers content by taking a list of known web pages and following links to new web pages and to documents. This approach finds only documents that have links referencing them. It finds none of the content in the "deep web," which makes up the majority of web content.
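To make the link-following idea concrete, here is a minimal sketch of a crawler's discovery loop run over a toy, in-memory "web." It is an illustration under stated assumptions, not a description of how Google's crawler is actually built: the page names, the `toy_links` map, and the `crawl` function are all hypothetical, and a real crawler would fetch pages over the network rather than read a dictionary.

```python
from collections import deque


def crawl(seed_pages, links):
    """Return every page reachable from seed_pages by following links.

    `links` maps a page name to the list of pages it links to
    (a hypothetical stand-in for fetching a page and parsing its links).
    """
    discovered = set(seed_pages)
    frontier = deque(seed_pages)
    while frontier:
        page = frontier.popleft()
        for target in links.get(page, []):
            if target not in discovered:
                discovered.add(target)
                frontier.append(target)
    return discovered


# Hypothetical toy web: keys are pages, values are the pages they link to.
toy_links = {
    "home.html": ["about.html", "papers.html", "search-form.html"],
    "papers.html": ["paper1.pdf"],
    "search-form.html": [],  # the form is linked, but its results are not
}

# "db-record-42" stands for a document that exists only as the answer to a
# database query; no page links to it.
all_documents = {
    "home.html", "about.html", "papers.html",
    "search-form.html", "paper1.pdf", "db-record-42",
}

found = crawl(["home.html"], toy_links)
missed = all_documents - found
print(sorted(missed))
```

In this toy example the crawler reaches the search form itself, but the database record behind the form is never discovered, because no link points to it.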
The deep web consists of documents that typically reside in databases. These documents can only be discovered by performing queries against the search interface offered by the content's owner. Crawlers aren't designed to query databases, yet databases are where a large percentage of the world's most sought-after scientific and technical content resides. Considering that the deep web is several hundred times larger than the surface web when measured in documents, the deep web and its hidden assets are too important to ignore. We will discuss the benefits of searching the deep web in subsequent articles.
Beyond the crawlers' inability to access database content, the quality of any particular surface web document is unknown unless the researcher makes an effort to determine it. Any person can post documents to the web and, unless the web sites hosting the documents are determined to be spam, the documents may well be indexed by the crawlers and made available in search results. While relevance ranking algorithms such as Google's PageRank do somewhat reduce the chances of being presented with the lowest-quality content, these algorithms are far from perfect gauges of quality. Link popularity, the number of links that point to a document, is a major factor in determining a document's ranking in a search result list (the other factor being, of course, the occurrence of the user's search terms in the documents being ranked). Because popularity and quality are not the same thing, following the herd is not the best approach when conducting serious research.
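The gap between popularity and quality can be illustrated with a deliberately simplified ranking sketch. It is not PageRank and makes no claim about Google's actual algorithm; it simply counts inbound links and orders the documents that contain the search term by that count. The document names, texts, link map, and function names are all hypothetical.

```python
from collections import Counter


def inbound_link_counts(links):
    """Count how many distinct pages link to each target (self-links ignored)."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


def rank(query_term, documents, links):
    """Return the documents containing query_term, most-linked first."""
    counts = inbound_link_counts(links)
    matches = [
        name for name, text in documents.items()
        if query_term.lower() in text.lower()
    ]
    # Sort by descending inbound-link count, then by name for a stable order.
    return sorted(matches, key=lambda name: (-counts[name], name))


# Hypothetical documents and the pages that link to them.
documents = {
    "popular-blog": "Fusion energy is easy, says anonymous poster",
    "peer-reviewed-report": "Fusion energy confinement measurements",
    "unrelated": "Cooking tips",
}
links = {
    "site-a": ["popular-blog"],
    "site-b": ["popular-blog"],
    "site-c": ["popular-blog", "unrelated"],
    "site-d": ["peer-reviewed-report"],
}

results = rank("fusion", documents, links)
print(results)
```

Here the widely linked blog post outranks the peer-reviewed report purely because more pages point to it; nothing in the ordering reflects which document a researcher should trust.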
In addition to the content quality issue, crawling suffers from serious scaling issues as the barrier to entry for those wanting to create internet content steadily decreases. With the rapidly growing number of blogs and personal web sites, the crawlers are struggling to maintain, let alone grow, their massive indexes. Google, some time ago, quietly stopped advertising the number of documents in its index. For business, operational, or technical reasons, or some combination thereof, Google has also been pruning its index. Crawling is rapidly approaching the point of diminishing returns, if it hasn't already arrived there. The costs to Google of adding hardware to its index farm, updating and searching its index, and identifying new content are huge, and at some point the incremental benefit becomes low enough (or negative) that growth of the index will continue to slow, or will stall.
As new documents are added to Google's index and old ones are removed, there is no guarantee that the overall quality of Google's index will improve. Unless the web crawlers can somehow be trained to discriminate against low-quality content, we should expect overall web content quality to decrease; this problem is not unique to Google.
A further challenge to the surface crawlers is providing access to the most current content. The user of a surface web search engine is at the mercy of the crawler; only on the crawler's next pass over a given web site will the index be updated to reflect the existence of a document newly added to the site. That update could take hours or days following the document's addition. Worse, because the document is new, it will initially have few links pointing to it, it will not be popular, and it is unlikely to appear in the results of many of the underspecified queries that most users provide.
In Part 2 of this series we will see how federated searching overcomes the major limitations of the crawl and index architecture. We believe that federated search is indeed the wave of the future and that it will eclipse surface crawling for the serious researcher.
Walter L. Warnick, Ph.D.
Consultant to OSTI