Accelerating Science Discovery - Join the Discussion

Federated Search - The Wave of the Future?: Part 1

by Dr. Walt Warnick on Wed, 12 Mar, 2008

by Walt Warnick and Sol Lederman

The web is growing.

When it comes to providing searchable access to the content that matters most to scientists and researchers, Google and the other web crawlers can't keep up. Instead, growing numbers of scientists, researchers, and science-attentive citizens turn to OSTI's federated search applications for high-quality research material that Google can't find. And, given fundamental limitations on how web crawlers find content, those conducting research will derive even more benefit from OSTI's innovation and investment in federated search in the coming years.

This is the first of three articles that discuss and compare the strengths and weaknesses of two web search architectures: the crawling and indexing architecture as used today by Google and the federated search architecture used by Science.gov and WorldWideScience.org. This article points out the limitations of the crawling architecture for serious researchers. The second article explains how federated search overcomes these obstacles. The third article highlights a number of OSTI's federated search offerings that advance science, and suggests that federated search may someday become the dominant web search architecture. 

Google is a "surface web" crawler; it discovers content by taking a list of known web pages and following links to new web pages and to documents. This approach finds only documents that have links referencing them. It misses the majority of web content, which is contained in the "deep web."
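The link-following behavior described above can be illustrated with a minimal sketch. It is not Google's implementation; the in-memory `LINK_GRAPH`, the page names, and the `crawl` function are all hypothetical stand-ins for real pages and real HTTP fetches. The point it shows is that a page nobody links to is never discovered.

```python
from collections import deque

def crawl(link_graph, seeds):
    """Breadth-first discovery: start from known pages and follow links."""
    discovered = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        # A real crawler would fetch the page and parse its links here.
        for target in link_graph.get(page, []):
            if target not in discovered:
                discovered.add(target)
                queue.append(target)
    return discovered

# Hypothetical site: each key is a page, each value lists the pages it links to.
LINK_GRAPH = {
    "home.html": ["about.html", "papers.html"],
    "about.html": ["home.html"],
    "papers.html": ["paper-1.pdf"],
    # This record links out, but no page links to it, so a crawler
    # starting from home.html has no path that leads here.
    "db-record-42.html": ["home.html"],
}

found = crawl(LINK_GRAPH, ["home.html"])
print(sorted(found))
```

Running the sketch lists the four linked pages and omits `db-record-42.html`, which stands in for a document that exists only as a database record.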

The deep web consists of documents that typically reside in databases. These documents can be discovered only by performing queries against the search interface offered by the content's owner. Crawlers aren't designed to query databases, yet databases are where a large percentage of the world's most sought-after scientific and technical content resides. Considering that the deep web, measured in documents, is several hundred times larger than the surface web, the deep web and its hidden assets are too important to ignore. We will discuss the benefits of searching the deep web in subsequent articles.
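To make the distinction concrete, here is a small sketch of a database-backed site as a crawler and a searcher would each see it. The `DatabaseSite` class, its `links` and `search` methods, and the sample records are assumptions made for illustration; they do not describe any particular OSTI or publisher system.

```python
class DatabaseSite:
    """A hypothetical site whose documents live in a database behind a search form."""

    def __init__(self, records):
        self._records = dict(records)  # record id -> title

    def links(self):
        # The site offers a search form, not a browsable list of records,
        # so there is nothing here for a link-following crawler to follow.
        return []

    def search(self, query):
        # Return the ids of records whose title contains every query term.
        terms = query.lower().split()
        return sorted(
            record_id
            for record_id, title in self._records.items()
            if all(term in title.lower() for term in terms)
        )

site = DatabaseSite({
    "rec-1": "Neutron scattering in thin films",
    "rec-2": "Thin film solar cells",
    "rec-3": "Plasma confinement",
})

crawler_view = site.links()            # what link-following can reach
query_view = site.search("thin film")  # what a query can reach
print(crawler_view, query_view)
```

The crawler's view is empty, while the query returns the two matching records. The matching rule here is a deliberately naive substring test; real search interfaces differ.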

In addition to the limitation of crawlers not being able to access database content, the quality of any particular surface web document is unknown without an effort on the researcher's part to determine it. Anyone can post documents to the web and, unless the web sites hosting the documents are determined to be spam, the documents may well be indexed by the crawlers and made available in search results. While relevance ranking algorithms such as Google's PageRank somewhat reduce the chances of being presented with the lowest quality content, these algorithms are far from perfect gauges of quality. Link popularity, the number of links that point to a document, is a major factor in determining a document's ranking in a search result list (the other factor being, of course, the occurrence of the user's search terms in the documents being ranked). Because popularity and quality are not the same thing, following the herd is not the best approach when conducting serious research.
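The two factors named above, link popularity and the occurrence of search terms, can be sketched as follows. This is a deliberately simplified stand-in that counts inbound links directly; it is not the PageRank algorithm, and the document names, texts, and link graph are invented for illustration.

```python
from collections import Counter

def inbound_link_counts(link_graph):
    """Count how many distinct pages link to each document."""
    counts = Counter()
    for source, targets in link_graph.items():
        for target in set(targets):
            if target != source:  # ignore self-links
                counts[target] += 1
    return counts

def rank(documents, link_graph, query):
    """Keep documents containing every query term, most-linked first."""
    terms = query.lower().split()
    popularity = inbound_link_counts(link_graph)
    matches = [
        name for name, text in documents.items()
        if all(term in text.lower().split() for term in terms)
    ]
    # Sort by descending inbound links, then by name for a stable order.
    return sorted(matches, key=lambda name: (-popularity[name], name))

# Hypothetical documents and links.
DOCUMENTS = {
    "popular-blog": "fusion energy is cool",
    "peer-reviewed-report": "fusion energy measurements and uncertainty",
    "recipes": "cooking with energy bars",
}
LINKS = {
    "site-a": ["popular-blog", "peer-reviewed-report"],
    "site-b": ["popular-blog"],
    "site-c": ["popular-blog"],
}

results = rank(DOCUMENTS, LINKS, "fusion energy")
print(results)
```

In this toy data the blog post outranks the report simply because three pages link to it and only one links to the report, which is the gap between popularity and quality that the paragraph above describes.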

In addition to the content quality issue, crawling suffers from serious scaling issues as the barrier to entry for those wanting to create internet content steadily decreases. With the rapidly growing number of blogs and personal websites, the crawlers are struggling to maintain, if not grow, their massive indexes. Google, some time ago, quietly stopped advertising the number of documents in its index. For business, operational, or technical reasons, or some combination thereof, Google has also been pruning its index. Crawling is rapidly approaching the point of diminishing returns, if it hasn't already arrived there. The costs to Google of adding hardware to its index farm, updating and searching its index, and identifying new content are huge, and at some point the incremental benefit becomes low enough (or negative) that growth of the index will continue to slow, or will stall.

As new documents are added to Google's index and old ones are removed, there is no guarantee that the overall quality of Google's index will improve. Unless the web crawlers can somehow be trained to discriminate against low quality content, we should expect overall web content quality to decrease; this problem is not unique to Google.

A further challenge to the surface crawlers is providing access to the most current content. The user of a surface web search engine is at the mercy of the crawler; only in the crawler's next pass over a given web site will the index be updated to reflect the existence of a document introduced to the site. That update could take hours or days following the document's addition. Worse, because the document is new, it will not have many links to it initially, it will not be popular, and it is unlikely to be found in many of the underspecified queries that most users provide.

In Part 2 of this series we will see how federated searching overcomes the major limitations of the crawl and index architecture. We believe that federated search is indeed the wave of the future and that it will eclipse surface crawling for the serious researcher.

Walter L. Warnick, Ph.D.
Director, OSTI

Sol Lederman
Consultant to OSTI

 

Other Related Topics: doe, federated search, osti, web crawling

Comments

Google Indexing

I wholeheartedly agree with what you say about Google and other search engines in regard to how they index and crawl certain pages. Google's algorithm is quite lacking in terms of determining what is a quality site with quality content. A lot of site owners try to find new ways of "tricking" the algorithm, but some people don't realise that Google is smarter than they think. Subsequently Google makes constant changes to its algorithm to try to stay one step ahead, but because of this never-ending battle between search engines and site owners who try to game the system for self-benefit, it's going to be a long time before they figure out a formula to find the QUALITY and deeply embedded documents that so many people search for and should have access to.

Could have used this for my dissertation

I wish I had read this earlier. I recently wrote my dissertation and, although I felt like I lived in the library, I couldn't get my hands on the most up-to-date material, and Google Scholar just didn't cut it. I'll certainly be spreading the word about federated search to others writing papers.

The Question

It sounds like the Federated Search is leaps and bounds better than Google. So why is its market share so small? I've asked a few of my coworkers and not one of them has even heard of Federated. Looking forward to reading the other parts.

Had no clue

Wow! I am an SEO specialist myself, but didn't have a clue about this. Very interesting to read. Hope it can start a discussion here in the comment thread. Thanks for the article.

Why is this the way of the future?

I'd love to hear a bit more on how federated search is going to become more mainstream.

I also had no clue

A friend sent me to this article, and I always believed that there had to be a way to access "deeper" information on the web than Google. This info really needs to get out there!

Semantic Search

The real revolution in search will be semantic search. During my time at Microsoft it was all the rage, leading to Bing, the "decision engine." Google is watching, and copying, Bing very hard right now in an effort to shut them out of the market.

Love it

The federated search feature is easily one of the best resources on the web for searching edu documents. I have used it for quite a while now and it really does show more of these types of results than Google finds. I know that there are some special search operators that you can use in Google to search only edu and gov type material, but to be honest I really don't know how to use them.

Expanded Information

My grandson mentioned the federated search feature to me recently which brought me to your site looking for more information, and like Steve above I too would like to understand more about how to use the special search operators but haven't been able to find much in the way of guidance or explanation. So any advice on where to look would be appreciated.

About the Author

Dr. Walt Warnick
Former Director, U.S. DOE Office of Scientific and Technical Information