U.S. Department of Energy Office of Science Office of Scientific and Technical Information

Much of Science is Non-Googleable: An Emerging Solution



It is my pleasure to be here at NSF today. Today’s workshop topic, "Connecting People to Science and Scholarly Knowledge," is precisely what we do at the Department of Energy Office of Scientific and Technical Information.



True or False?

Most useful information is available via familiar search engines such as Google and Yahoo!

We have been in this business for 60 years, but over the last 10 years we have become entirely Web-based. Of course, we are not the only ones who connect people to knowledge using the Web. Google and Yahoo! do this, too. An important misunderstanding has evolved around these search engines: the false assumption that most useful information is available through them.


Google, non-Googleable evolution

Google: v., to search for information through Google

Googleable: adj., information found by Googling

Non-Googleable: adj., information that cannot be found by Googling

In fact, much of the information on the Web is inherently unavailable to Google and Yahoo! This key limitation would come as a surprise to many Web users, especially young students. The notion that you can find anything if you "Google" long enough is so firmly entrenched in the Web-cognizant public that the word "Google" has been elevated to a verb.

In the Web-savvy vernacular, "To Google" of course means to search the Web using the Google search engine. For example, to find information about the interagency organization CENDI, one "Googles" CENDI.

So we accept that "Google" has become a verb. This has led those of us at OSTI to do some word-creation of our own. It naturally follows that the adjective derived from that verb is "Googleable," referring, of course, to information that can be found by "Googling." It is just a short jump to arrive at the antonym "non-Googleable," referring to information that cannot be found by Googling. And there is a lot of it! Folks who study this topic estimate that over 90 percent of the Web, perhaps as much as 99 percent, is non-Googleable.


Larry Page, speaking to scientists, AAAS 2007

"Virtually all economic growth (in the world) was due to technological progress. I think as a society we're not really paying attention to that."

He called on the scientists to make more of their research available digitally. "We have to unlock the wealth of scientific knowledge and get it to everyone."

By the way, I want to make it perfectly clear that I am not saying anything more about Google than Google is saying about itself. Google founder Larry Page personally delivered a speech at this year's annual meeting of the American Association for the Advancement of Science (AAAS) where he lamented that much of science is not available for Google to retrieve. The July 27 issue of Science Magazine presented an article by a Google Research Director who acknowledged the same thing.

I coined the term non-Googleable because the concept is so critically important to science. It turns out that great quantities of science knowledge are non-Googleable. This observation is profoundly important for science in general – and for my organization in particular.


OSTI Mission

• Information fuels discovery

• Superior access to quality information speeds discovery

The mission of the Office of Scientific and Technical Information is to advance science and sustain technological creativity by making R&D findings available and useful to DOE researchers and the American people.

Search capability is key to our mission. We have developed a different kind of search technology to retrieve that information which is unavailable to Google. Its use in gateways such as Science.gov and WorldWideScience.org is what Eleanor Frierson and I came here to talk about today.


Systems that crawl the Web do not typically reach below the surface

• Google "crawls" the surface Web, but scientific databases are largely found in the deep Web

• Scientific databases stump Google

The limitations of Googling are inherent in the underlying technology employed by Google and each of the other familiar search engine companies. Google and the others are committed to crawling technology.

To prepare for searches by Web patrons, a Web crawler (or "spider" or "robot") visits as many Web sites as it can find, mostly by following links. An index of each such page is created, thus slowly building a vast composite index of all the pages visited. Later, when Web patrons perform a search, they are actually searching the composite index.

Difficulty arises because vast numbers of Web pages cannot be accessed by following links. For example, to find an e-print on a database of e-prints, it is typically necessary to enter a search term on the front page of the database. At this point, a crawler is stumped. As a consequence, the content of the database is not accessed by the crawler, and that content is non-Googleable.
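The crawler limitation described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not how Google or any real crawler is built: the three-page "Web" in `PAGES`, the records in `DATABASE`, and every function name are hypothetical and invented for this example. The crawler follows links and indexes what it reaches, while the e-print records sit behind a search form that it never fills in.

```python
from collections import deque

# Hypothetical toy "Web": page name -> (page text, outgoing links).
PAGES = {
    "home": ("welcome to the science site", ["about", "eprints"]),
    "about": ("about our laboratory", ["home"]),
    # The e-print front page offers only a search box, no links to records.
    "eprints": ("search our e-print database", []),
}

# Hypothetical records that exist only behind the search form on "eprints".
DATABASE = {
    "eprint-1": "lithium battery cathode degradation",
    "eprint-2": "avian influenza transmission model",
}


def crawl(start):
    """Follow links breadth-first and build a composite word -> pages index."""
    index = {}
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        text, links = PAGES[url]
        for word in text.split():
            index.setdefault(word, set()).add(url)
        for link in links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return index


def search_index(index, term):
    """What a patron of the crawler-based engine sees: index lookups only."""
    return sorted(index.get(term, set()))


def search_database(term):
    """What the database returns when a term is typed into its own form."""
    return sorted(key for key, text in DATABASE.items() if term in text.split())


index = crawl("home")
print(search_index(index, "laboratory"))  # a linked page is indexed
print(search_index(index, "battery"))     # database content is not
print(search_database("battery"))         # the database itself finds it
```

In this sketch the word "battery" never enters the composite index, because no link leads to the record that contains it. The record is found only when the term is submitted to the database directly.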



Google well recognizes this problem and it implicitly acknowledges that Google alone cannot solve it. Rather, Google implores database owners to take special steps to accommodate its crawlers.

However, there is another way to make information in multiple databases searchable. It is called federated search. Unlike the Google solution, it places no burdens on database owners.

GOVEXEC.com article:

Google moves ahead with plan to open up federal Web sites

Google is making strides on an initiative to make information stored on public government Web sites more accessible to people looking for it, but challenges remain, officials with the search engine company said Wednesday. Three federal organizations recently agreed to structure their sites to make them accessible for nearly all Internet searches, the officials said. Information on the Plain Language Web site aimed at eliminating jargon in government communications, and on sites by the Energy Department's Office of Scientific and Technical Information and the Education Department's National Center for Education Statistics, has been opened up to the three most popular search engines: Google, Yahoo and MSN.


We need systems that probe the deep Web

Federated search can open the 90+ percent of the Web that is non-Googleable. Federated search allows users to search multiple data sources simultaneously, in parallel, using a single query from a single user interface.

Here is how it works. A Web patron seeking science information comes to a gateway site like Science.gov and enters a query, just as he or she would do at Google. But, while the patron’s experience looks like Google, the architecture behind what happens next is entirely different.

The query is transmitted to the gateway server (in Oak Ridge, Tennessee, in the case of Science.gov) and is then fanned out to each of a suite of databases spread geographically across the US, or even the entire world.

At each database, the query launches a search and brings back a hit list. The list is then transmitted back to the gateway server, where the hits are relevancy ranked and presented to the Web patron. So, in the span of about 20 seconds, the query is transmitted to numerous databases, a search is executed at each of them, and the results are brought back and ranked for the patron.

Federated search is cheap to implement, places no burdens on the database owner, and it allows for fielded searching.
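The fan-out, gather, and rank steps just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the Science.gov or WorldWideScience.org implementation: the sources in `SOURCES`, their record titles, the function names, and the term-count relevance score are all hypothetical. Real gateways query remote databases over the network; here each "database" is an in-memory list so the example runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sources: source name -> record titles held by that database.
SOURCES = {
    "source_a": ["Electric vehicles and batteries", "Wind turbine design"],
    "source_b": ["Batteries for grid storage", "Avian influenza survey"],
    "source_c": ["Solar cell efficiency"],
}


def search_source(name, terms):
    """Run the query at one source and return that source's hit list."""
    hits = []
    for title in SOURCES[name]:
        words = title.lower().split()
        # Assumed relevance measure: number of query-term occurrences.
        score = sum(words.count(term) for term in terms)
        if score > 0:
            hits.append({"source": name, "title": title, "score": score})
    return hits


def federated_search(query):
    """Send one query to every source in parallel, then merge and rank."""
    terms = query.lower().split()
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        hit_lists = list(pool.map(lambda name: search_source(name, terms), SOURCES))
    merged = [hit for hits in hit_lists for hit in hits]
    # Rank at the gateway: best score first, ties broken by source and title.
    merged.sort(key=lambda hit: (-hit["score"], hit["source"], hit["title"]))
    return merged


results = federated_search("electric batteries")
for hit in results:
    print(hit["score"], hit["source"], hit["title"])
```

The same merged list can be regrouped by source instead of by rank, for example with `sorted(results, key=lambda hit: hit["source"])`. Note that nothing is indexed ahead of time: every source is searched at query time, which is why the results are as current as the sources themselves.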


Basic Search

Search term: "electric vehicles" AND batteries

This is an example of a search on WorldWideScience.org, which is a brand new global science gateway that relies on federated search. It is built on the same architecture as Science.gov—the US national gateway—but taken to the international level.

You want to search international databases for science information on, say, electric vehicles that use batteries. You go to WorldWideScience.org, a gateway which now has 17 portals from 11 countries. You enter your search term into the WorldWideScience search box. The query is immediately sent, in parallel and in real time, to all the portals within WorldWideScience.org.


Basic Search (cont.)

You can sort results by rank.


Advanced Search

Or maybe you want to perform an advanced search on the topic of avian influenza, or bird flu. You go to the advanced search page and enter your search terms.


Advanced Search (cont.)

Again, the query is sent to all the portals in parallel, in real time, and the results are returned.


Advanced Search (cont.)

This view is sorted by source, with a drop-down menu for navigation.


Advanced Search (cont.)

Once you find a result, you can go to the bibliographical record . . .


Advanced Search (cont.)

And you can view the full text where available.


Federated Search: Advantages

• Current, real-time results

• No burden for database owner

• Inexpensive to implement

• No need-to-know for user

• No searching door-to-door

• Allows for fielded searching

• Interoperability is automatically achieved

You did not have to know the names of the portals; you did not have to search them door-to-door. Instead, you received real-time results, in fact the most current available, from authoritative information portals.

Even at this early stage, WorldWideScience.org searches across more than 200 million pages of important scientific portals worldwide. That's a lot of science information accessible from one search box, equivalent to a shelf of documents seven miles long. As far as we are aware, this is the first time federated searching has been accomplished on a global scale.

So where once we had isolated portals of information, they now work as a unit, an integrated whole. Federated search, through gateways such as WorldWideScience.org and Science.gov, speeds communication, accelerates discovery and expedites scientific and economic progress.


Additional Points

• Federated searching has its own set of limitations

• Neither crawling nor federated searching is a panacea

• Federated searching does things that crawling can’t do, and crawling does things that federated searching can't do—they are complementary technologies

• Federated searching has advanced rapidly and should continue to do so


Orbach/Brindley Signing

Development of WorldWideScience.org officially kicked off in January 2007, when Dr. Raymond Orbach, U.S. Department of Energy Under Secretary for Science, and Lynne Brindley, Chief Executive of the British Library, signed a Statement of Intent to partner in this challenging endeavor.


Signed Statement of Intent

At the time of the signing, we extended invitations to other nations to join the partnership, and committed to delivering a prototype by the end of 2007. In June, we publicly launched this prototype Global Science Gateway at the ICSTI conference in Nancy, France. As mentioned earlier, with the recent addition of a source from New Zealand, WorldWideScience now searches 17 portals from 11 countries. African portals will soon be added, at which time all the world’s inhabited continents will be represented. [Editor's note: the African portals were added on August 17, 2007.]


WorldWideScience.org Sources

• Australian Antarctic Data Centre

• Article@INIST (France)

• Canada Institute for Scientific and Technical Information

• Defence Research and Development (Canada)

• DEFF Global E-Prints (Denmark)

• DEFF Research Database (Denmark)

• J-EAST (Japan)

• Journal@rchive (Japan)

• J-STAGE (Japan)

• J-STORE (Japan)

• NARCIS (Netherlands)

• ReaD (Japan)

• Science.gov (United States)

• SciELO (Brazil)

• UK PubMed Central

• Vascoda (Germany)

• Transactions & Proceedings (New Zealand)


Current National Partners in WorldWideScience.org

We hope the flags of many other nations will soon be waving in support of this effort to "connect people to science" worldwide.

In addition to the U.S. and the UK, these countries have contributed:

• Australia, which is providing access to its Antarctic Data Centre

• Brazil, providing a Brazilian scientific journals database

• Canada, providing access to the databases of the Canada Institute for Scientific and Technical Information and Defence Research and Development Canada

• Denmark, providing access to an e-prints database and a research in progress database

• France, providing access to the research of the National Center for Scientific Research

• Germany, providing access to Vascoda, a portal to scholarly research information sponsored by the German Research Foundation and managed by TIB-Hannover

• Japan, providing access to several databases of the Japan Science and Technology Agency

• New Zealand, providing access to Transactions & Proceedings of the Royal Society of New Zealand 1868–1961

• The Netherlands, providing access to NARCIS, a portal to Dutch scientific information


Next Steps for WorldWideScience.org

• Establish governance structure

• Increase number of national sources

• Add non-English sources

• Develop translation capabilities

• Integrate limited-access sources

There is unharnessed power in federating search across national boundaries. So we won't stop here. Assuming nations want to move this prototype into a sustained, permanent global portal, we need to establish a governance structure. The International Council for Scientific and Technical Information has indicated preliminarily that it will serve as an umbrella for governance of WorldWideScience.org. This structure will fulfill the need to make decisions about the system's content, features, availability, and funding.

We can certainly expect that a governing body will want to increase the number of national sources searched. We also anticipate adding non-English sources with a translation capability. And, finally, there are many sources that require some form of user authentication, and we envision tools that will facilitate authentication to allow these sources to be searched through this portal.


Basic Search (cont.)

The results of the searches of all the portals are returned to your desktop compiled and ranked for you by relevance.


Basic Search (cont.)

You can also sort results by source.


In short, WorldWideScience.org pulls together often isolated islands of information into one easily searchable gateway

So, to recap: Google by and large does not search within scientific databases, which are thus non-Googleable. Federated search opens up a part of the Web that is non-Googleable.

Without WorldWideScience.org to search the national portals, information customers faced the tedious task of visiting portals door-to-door. They would typically first need to know that each portal existed (unlikely) and then search each portal one at a time (highly impractical). WorldWideScience.org changes all this; with WorldWideScience.org you can search portal upon portal in parallel, with only one query, saving time and effort.

Of course, there is much to be done. The world is dotted with large and often isolated Web-based collections of scientific information. Once found, any one of these databases can be searched. But finding specific databases is a challenge, and searching them all collectively, until just recently, was only a dream. Now this dream is within reach.