by Dr. Walt Warnick on Thu, July 08, 2010
The success of Google has been so profound that the word “Google” is now considered a verb. http://en.wikipedia.org/wiki/Google_(verb) “To Google” has come to mean to search the web via the free search engine provided by Google, Inc. The adjective derived from the verb “Google” is “Googleable.” Similarly, the antonym of “Googleable” is “non-Googleable,” which turns out to be an especially useful word. For most practical purposes, the term “non-Googleable” is synonymous with the phrase “deep web.” In a world where Google, Inc., is the largest capitalized company, the major difference between the word and the phrase is that the term “non-Googleable” is intuitively understood.
Anyway, it is generally acknowledged among students of the web that the bulk of the information on it is non-Googleable, a fact which typically comes as a surprise to people who do not study the web. In particular, the information residing in databases is often non-Googleable, and scientific and technical information often resides in databases of documents.
The reason that databases are typically non-Googleable becomes clear once one considers how search engines like Google, Yahoo!, Ask.com and Bing acquire the content they search. The search engines rely upon crawlers to visit web pages well in advance of a patron’s search. The crawler creates an index of each page it visits and then follows the hyperlinks on that page to find new pages to index. Crawlers are usually flummoxed by the front page of a database because such a page generally offers a search form rather than hyperlinks to the database content. Thus, crawlers used by companies like Google, Inc., typically cannot get past the front page of a database, leaving the database content non-Googleable.
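For readers who like to see the mechanism, here is a minimal sketch of the crawl-and-follow behavior described above. It is an illustration under stated assumptions, not how any real search engine is built: the toy web, its example.org URLs, and the names `TOY_WEB`, `LinkExtractor` and `crawl` are all hypothetical, and the pages live in a dictionary rather than on the network.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical toy web: URL -> HTML. The record pages exist, but no page
# links to them; they are reachable only through the database search form.
TOY_WEB = {
    "http://example.org/": (
        '<a href="http://example.org/about">About</a> '
        '<a href="http://example.org/db">Database</a>'
    ),
    "http://example.org/about": "<p>About us</p>",
    "http://example.org/db": '<form action="/db/search"><input name="q"></form>',
    "http://example.org/db/record/1": "<p>Technical report 1</p>",
    "http://example.org/db/record/2": "<p>Technical report 2</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, web):
    """Breadth-first crawl: index a page, then follow its hyperlinks."""
    indexed = set()
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url in indexed or url not in web:
            continue
        indexed.add(url)
        parser = LinkExtractor()
        parser.feed(web[url])
        queue.extend(parser.links)
    return indexed


indexed = crawl("http://example.org/", TOY_WEB)
unreached = sorted(set(TOY_WEB) - indexed)
print("Indexed:", sorted(indexed))
print("Never reached:", unreached)
```

The crawler indexes the database front page itself, but because that page offers only a form and no hyperlinks, the two record pages never enter the queue. In this toy world they are the non-Googleable content.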
There is one exception, which gets complicated, so if you are not a knowledge-management enthusiast, please just skip ahead to the next paragraph. The situation becomes complex when database owners make special arrangements with Google, Inc., to expose database content. A consequence of such arrangements is that database content can become Googleable. In the early 2000s, OSTI pioneered such an arrangement with Google for OSTI’s suite of databases. The effort was successful, but exposing our database content to Google was a laborious process that took many months to coordinate with Google, Inc. In the meantime, OSTI took up the same project with Yahoo! In this way, the content of OSTI’s own databases became Googleable and searchable via Yahoo! and later by Ask.com and Bing. Subsequently, Google partnered with Yahoo! and certain other search engine companies to simplify the process of exposing database content. The result is called the Site Map Protocol. http://en.wikipedia.org/wiki/Site_map_protocol Despite the availability of the Site Map Protocol, it remains true today that the bulk of scientific and technical information is non-Googleable. There are two reasons: some database owners do not use the Protocol, and search engine companies sometimes choose not to include database content in their indices.
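To make the idea of a site map concrete, the sketch below builds a small one: an XML file that simply lists the URLs a database owner wants crawlers to visit, so the crawler no longer depends on finding hyperlinks. The `urlset`, `url` and `loc` elements and the namespace come from the published protocol at sitemaps.org; the function name `build_sitemap` and the example.org record URLs are hypothetical, and this is not a description of how OSTI generated its own files.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(record_urls):
    """Return a site map XML string listing one <url><loc> per record."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for record_url in record_urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = record_url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical record pages that sit behind a database search form.
sitemap_xml = build_sitemap([
    "http://example.org/db/record/1",
    "http://example.org/db/record/2",
])
print(sitemap_xml)
```

A database owner would publish a file like this on the site and let the search engines know where it is. Whether the listed pages actually end up in an index remains the search engine's decision.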
One way to make non-Googleable content searchable is via federated search http://www.osti.gov/home/fedsearch.html, which OSTI has deployed to create Science.gov (a virtual integration of databases of R&D results from Federal government agencies) and WorldWideScience.org (a virtual integration of databases of R&D results from 60+ governments from around the world). Making this content searchable can be accomplished in no practical way other than by federated search. http://www.osti.gov/ostiblog/home/entry/no_alternative_to_federated_search1
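In rough outline, federated search sends the patron's query to each database at the moment of the search and merges what comes back, instead of crawling and indexing the content ahead of time. The sketch below shows only that fan-out-and-merge shape. Everything in it is an assumption for illustration: the two in-memory "agency" collections, the names `make_source` and `federated_search`, and the simple substring match. It does not represent the software behind Science.gov or WorldWideScience.org, which must also handle ranking, timeouts and much else.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for two separately hosted databases.
AGENCY_A_DOCS = ["Solar cell efficiency report", "Wind turbine design study"]
AGENCY_B_DOCS = ["Solar cell degradation data", "Fusion reactor materials"]


def make_source(name, documents):
    """Wrap a document list in a search function, as if it were a remote database."""
    def search(query):
        needle = query.lower()
        return [(name, title) for title in documents if needle in title.lower()]
    return search


SOURCES = [
    make_source("agency-a", AGENCY_A_DOCS),
    make_source("agency-b", AGENCY_B_DOCS),
]


def federated_search(query, sources):
    """Send the query to every source in parallel and merge the hits."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        result_lists = list(pool.map(lambda source: source(query), sources))
    # Flatten the per-source lists into one merged result list.
    return [hit for hits in result_lists for hit in hits]


results = federated_search("solar", SOURCES)
for source_name, title in results:
    print(source_name, "-", title)
```

Because each source is queried through its own search interface, nothing here requires the databases to expose hyperlinks or site maps to a crawler.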
I was recently mildly chastised by a friend from Microsoft who pointed out that the term non-Googleable reveals a kind of corporate bias. He implied that, on occasion, I should use the term “non-Bingable.” I hereby extend an apology to my friend at Microsoft, with whom OSTI has collaborated to significant mutual benefit. As soon as “to Bing” is recognized as a verb, I will immediately adopt the adjective “non-Bingable.”