by Dr. Walt Warnick on Thu, March 05, 2009
As Isaac Newton said, "If I have seen further than others, it is by standing on the shoulders of giants." Newton was not alone on those shoulders. Everyone in science, from his day to ours, draws on the work of others.
Science is all about the flow of knowledge: new methods, instruments, techniques, concepts, results, questions, and data. The flows are endless, complex, and run in all directions. This is rightly called a diffusion process. The concept is reflected in a host of statutes that form the legislative basis for OSTI.
Often the Knowledge Scientists Need Resides in Distant Communities
Nor do we depend just on the work of giants. We also depend on our colleagues down the hall, or at another lab, as well as a myriad of other researchers we do not know.
According to the National Science Foundation, there are over 2.5 million research workers worldwide, with more than 1.2 million in the U.S. alone.1 If we look at all the articles, reports, emails and conversations that pass between them, we could count billions of knowledge transactions every year. This incredible diffusion of knowledge is the very fabric of science.
Given that the diffusion of knowledge is central to science, it behooves us to see if we can accelerate it. We note that diffusion takes time. Sometimes it takes a long time. Every diffusion process has a speed. Our thesis is that speeding up diffusion will accelerate the advancement of science.
The millions of researchers are grouped into thousands of communities. A community may be defined as a group of researchers working on a single scientific problem.
The Web of Science indexes about 8,700 journals2, representing many different research communities. That's a lot of science to keep up with. Currently it is difficult for researchers, who primarily track journals within their specific discipline, to hear about discoveries made in distant scientific communities.
In fact, diffusion across distant communities can take years. In contrast, within an individual scientific community, internal communication systems are normally quicker. These include journals, conferences, email groups, and other outlets that ease communication.
Many communities use related methods and concepts: mathematics, instrumentation, and computer applications. Thus there is significant potential for diffusion ACROSS communities, including very distant communities. We see this as an opportunity.
Sequential Diffusion is Too Slow!
Diffusion to distant communities takes a long time because it often proceeds sequentially, typically spreading from the community of origin (A) to a neighbor (B), then to community (C), a neighbor of B, and so on. This happens because neighboring communities are in fairly close contact.
Science will progress faster if this diffusion lag time is diminished. The concept of global discovery is to transform this sequential diffusion process into a parallel process. This means that new knowledge flows directly to distant communities. The goal is to reduce the lag time from years to months and from months to days.
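The difference between sequential and parallel diffusion can be made concrete with a small sketch. The number of communities and the lag per hop below are assumed, illustrative values, not measurements; the function names are hypothetical.

```python
def sequential_arrival_times(num_communities, hop_lag_months):
    """Knowledge passes neighbor to neighbor: community k hears after k hops."""
    return [k * hop_lag_months for k in range(num_communities)]


def parallel_arrival_times(num_communities, hop_lag_months):
    """Knowledge flows directly from the origin: every other community is one hop away."""
    return [0.0] + [hop_lag_months] * (num_communities - 1)


communities = 6   # communities A through F (illustrative)
hop_lag = 6.0     # months for knowledge to cross one hop (assumed)

seq = sequential_arrival_times(communities, hop_lag)
par = parallel_arrival_times(communities, hop_lag)

print("sequential arrival (months):", seq)
print("parallel arrival (months):  ", par)
```

Under these assumptions the most distant community waits for every intermediate hop in the sequential case, but only for a single hop in the parallel case.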
Modeling Knowledge Diffusion Suggests How to Accelerate It
In thinking about how to speed up diffusion across distant communities, we have looked at diffusion research, including computer modeling. We are particularly interested in recent work that applies models of disease dynamics to the spread of scientific ideas. The spread of new ideas in science is mathematically similar to the spread of disease, even though one produces positive results, the other negative. Our goal is to foster epidemics of new knowledge.
You might ask "Why is the math of disease related to the math of knowledge diffusion?" It is because neither involves considerations of conservation of mass. This makes disease and knowledge diffusion unlike many other kinds of diffusion that obey laws of conservation of mass. Consider, for example, diffusion of pollution. If pollution diffuses from point A to point B, point A now has less of it. But if knowledge diffuses from person X to person Y, person X still has what he started with.
We have been working with a group of modelers led by Luis Bettencourt of Los Alamos National Laboratory. They have written an important new paper, currently in press in Physica A: Statistical Mechanics and Its Applications, entitled: "The power of a good idea: quantitative modeling of the spread of ideas from epidemiological models."3 This paper applies a disease model to the spread of Feynman diagrams just after World War II. Feynman diagrams are a central method of analysis in particle physics.4
Looking at these models has led us to focus on a parameter called the contact rate. In the disease model, this is the rate at which people come into contact with a person who has the disease. Increasing the contact rate speeds up the spread of the disease. In the case of the diffusion of knowledge, the contact rate involves the number of people, or communities, that find out about the new discovery.
Increasing the contact rate speeds up the spread of the idea. By way of example, we asked the modeling team to increase the contact rate that they derived for the spread of Feynman diagrams. Simply doubling the contact rate dramatically reduced the time it took for the idea to diffuse.
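To illustrate why the contact rate matters, here is a minimal sketch using a simple susceptible-infected (SI) model. This is a textbook simplification, not the model used in the Bettencourt paper, and the population size, contact rates, and function name are assumptions chosen only for illustration.

```python
def time_to_adoption(contact_rate, population=1000, initial_adopters=1,
                     target_fraction=0.5, dt=0.01):
    """Return the years until target_fraction of the population holds the idea.

    Simple SI dynamics, integrated with small Euler steps:
        d(adopters)/dt = contact_rate * adopters * (population - adopters) / population
    """
    adopters = float(initial_adopters)
    t = 0.0
    while adopters < target_fraction * population:
        # New adopters arise from contacts between adopters and non-adopters.
        new_adopters = contact_rate * adopters * (population - adopters) / population * dt
        adopters += new_adopters
        t += dt
    return t


base_rate = 1.0  # effective contacts per adopter per year (assumed)
t_base = time_to_adoption(base_rate)
t_fast = time_to_adoption(2 * base_rate)

print(f"baseline contact rate: {t_base:.2f} years to reach half the population")
print(f"doubled contact rate:  {t_fast:.2f} years to reach half the population")
```

In this simplified model the time to reach a given share of the population scales inversely with the contact rate, so doubling the rate roughly halves the diffusion time.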
Our focus, therefore, is on increasing the contact rate. To do this we must reduce a huge gap in how the Internet works today.
Two Conferences Laid the Groundwork
The seeds for the concepts you are about to hear are traceable to a workshop held at the National Academy of Sciences in 2000. The workshop was chaired by our moderator, Al Trivelpiece.
The participants recognized the enormous potential for advancing science presented by new and emerging electronic media. Even today, the Internet must still be considered new. Its emergence as a useful medium for communication really only dates from the first half of the 1990s, just a dozen years ago or so. It is a transformational technology still in its infancy.
Our goal is to forge its application to the advancement of science.
The Deep Web is Huge
Basically, there are two ways to get to knowledge on the Web. In fact, one can think of them as two kinds of knowledge. The first is the ordinary Web page, of which there are several billions. This sea of Web pages is what is searched when you use a search engine like Google. We call it the surface Web.
We Need Systems that Probe the Deep Web
However, beneath this surface Web there are vast document repositories. They often have their own search tool for searching within the repository, but popular search engines like Google do not reach within these databases, even though they are Web accessible. We call the part of the Web in which repositories and databases reside the deep Web.
Analysts estimate that perhaps 99 percent of all the Web-accessible scientific documents are in deep Web databases. Because these documents are not accessible to search engines and robots, there is a huge gap in knowledge searchability.
The problem of accessing all this deep Web science mirrors the problem of diffusion across distant communities. This is because many of the deep Web databases are maintained within specific communities, including specialized journals, scientific societies, university departments, or with individual researchers.
Within each community the deep Web document repositories are typically well known. But they are hard for a scientist in a distant community to find. Worse, once found, each repository must be searched sequentially, making widespread search prohibitively difficult.
We have Begun to Solve the Problem
We have begun to close this gap and solve the sequential search problem. Conceptually the solution is simple. It is simultaneous deep Web search with integrated ranking of results. All it takes is virtual aggregation or federation of diverse deep Web databases. The federated databases are searched in parallel, not sequentially. This greatly increases the contact rate across distant communities, speeding up the diffusion of new knowledge.
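The idea of simultaneous search with integrated ranking can be sketched as follows. This is an illustrative toy, not the implementation behind any DOE system: the repositories are hypothetical in-memory lists, the scoring is a simple count of matching query terms, and all names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for deep Web repositories. A real system would send
# the query to each repository's own search interface instead.
REPOSITORIES = {
    "physics_eprints": [
        "Feynman diagrams in particle physics",
        "Grid computing for physics data",
    ],
    "bio_reports": [
        "Epidemic models of disease spread",
        "Diffusion of ideas modeled as epidemic spread",
    ],
    "conference_papers": [
        "Federated search of scientific databases",
        "Epidemic spread on networks",
    ],
}


def search_repository(name, documents, query):
    """Score each document by how many query terms appear in its title."""
    terms = query.lower().split()
    hits = []
    for title in documents:
        words = title.lower().split()
        score = sum(1 for term in terms if term in words)
        if score > 0:
            hits.append((score, name, title))
    return hits


def federated_search(query, repositories=REPOSITORIES):
    """Query every repository in parallel, then rank the combined hits."""
    with ThreadPoolExecutor(max_workers=len(repositories)) as pool:
        futures = [
            pool.submit(search_repository, name, docs, query)
            for name, docs in repositories.items()
        ]
        merged = [hit for future in futures for hit in future.result()]
    # Integrated ranking: highest score first, ties broken by title.
    return sorted(merged, key=lambda hit: (-hit[0], hit[2]))


for score, source, title in federated_search("epidemic spread"):
    print(score, source, title)
```

The repositories stay where they are and are only queried; the one step that happens centrally is the merging and ranking of the returned hits.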
We call this result Global Discovery. It means making each original discovery globally available. Federated deep Web search transforms local discovery into global discovery.
While the concept is simple, making it a reality is not. The current challenge of federated search is that the number of databases that can be searched simultaneously is limited. That's a tough problem to solve, and one that we're working on.
There is a practical way and an impractical way to search databases in parallel. The impractical way is to combine repositories. This is impractical because of the huge diversity of formats and the enormous size of the repositories.
Rather, the practical way to search repositories in parallel is through a technique called federated search. Among government agencies, DOE pioneered the application of federated search on the Web.
Current Systems Provide Examples for Global Discovery
Currently we in DOE have three production systems that we consider examples or prototypes for Global Discovery. Each provides simultaneous search and ranking of multiple document repositories with millions of pages of scientific content. These sites differ primarily in the parts of the deep Web that each explores. Each addresses different technological and access issues as well.
- Science.gov. Science.gov is an interagency initiative of 17 U.S. government science organizations within 12 federal agencies. These agencies form the voluntary Science.gov Alliance. The Science.gov site aggregates 30 databases containing about 50 million pages. All of the major science-funding agencies are represented, so the content spans most of science. We have just launched Version 3.0, with important new ranking capability.
Science.gov is a true Global Discovery facility. You'll hear more about Science.gov in an upcoming talk in this session, given by Science.gov Alliance co-chair Eleanor Frierson of the Department of Agriculture's National Agricultural Library, and Abe Lederman, founder and president of Deep Web Technologies Incorporated.
I invite you to visit the exhibit booth for Science.gov (booth # 506) near the exhibit booth for the DOE Office of Science.
- The E-print Network makes 12 million pages searchable, drawn from hundreds of thousands of e-prints on 20,000 sites. In effect, it makes 20,000 isolated islands of information act as if they were an integrated whole.
- OSTI's Science Conferences portal. This new site aggregates the conference proceedings from 16 databases run by scientific societies, DOE facilities and national labs. Emphasis is on the physical sciences and engineering. It searches several hundred thousand articles.
Of course, these are just our attempts at Global Discovery. Others offer tools as well, and each has strengths and weaknesses. The tools that we offer have the common characteristic that they do not require owners of repositories to do anything other than handle more traffic.
Our Current Projects Initiate more High-impact Innovations
Each of these examples is a working Global Discovery tool. With these tools, scientists can begin to approach Global Discovery in a meaningful way. They can seek out relevant knowledge across all scientific communities. However, even taken together, these tools search only a small percentage of the vast resources within the deep Web.
I will share here what we believe is a key to success. When trying to integrate information from diverse sources, it is important to avoid adding burdens to information owners. The history of information management has seen a number of instances where seemingly promising efforts to integrate information have been hampered because too few information owners signed on: Government Information Locator Service (GILS), Open Archives Initiative (OAI), Institutional Repositories, and others. While DOE adopted the protocols advanced by these efforts, too few other information owners did so. Our view is that these efforts stumbled because they placed demands on information owners, who did not themselves enjoy the benefits. In contrast, we believe that those who seek to integrate information from diverse sources need to bear the burdens themselves.
A second key to success is precision searching. The problem that is inherent with federated search techniques is information glut. Customers can get so many hits that it is beyond human capacity to sort through them. The way around this problem is with precision searching, one version of which is relevance ranking. We are all familiar with relevance ranking. It is what Google does. Google itself credits its relevance ranking for its success.
But the methodology for relevance ranking on the deep Web is far different from that on the surface Web. We have broken new ground regarding relevance ranking on the deep Web. Try Science.gov and see for yourself.
A Global Discovery Gateway would Advance Science
Our ultimate goal is to have a true Global Discovery facility. To help create it, we have undertaken a number of activities which, collectively, we call Innovations in Scientific Knowledge and Advancement, or ISKA.5 The Global Discovery facility would aggregate, search and rank all of the important, Web-accessible databases. It has the same goal as the fabled Library of Alexandria, namely to make all of science available in one place. Except in this case the place is everywhere at once, because anyone in the world could access the Global Discovery facility.
We Stand on the Rim of a New Era of Global Discovery: Innovations in Scientific Knowledge and Advancement
Imagine, if you will, a Google-like search capability that returns results across the whole of science, giving our scientists information on research they didn't even know existed - except that this search would tap into the great stores of knowledge on the deep Web not readily accessed by Google. This search, rather than crawling across indexed information, would rapidly probe the world's most comprehensive databases and delve into the world's great scientific laboratories in real time.
But we have a long way to go before this vision of Global Discovery becomes a reality. To that end we are conducting applied research on a number of challenges related to the vision. To begin with, we are still in the early stages of virtually aggregating databases that number in the hundreds or thousands. We are looking at grid-based systems to do this, and we are looking at advanced algorithms to rank search results.
- Expand content . . .
  - Increase searchable content within databases
  - Incorporate databases from all scientific communities
  - Beyond text - numeric data, audio, video, etc.
- Enhance precision searching . . .
  - Integrate analytical tools
  - Improve sophistication and speed of relevancy ranking
  - Next-generation algorithms
  - Visualization techniques
- Amplify computing power . . .
  - Deploy emerging technologies to enable extraction of ever increasing content
  - Increase computer power, storage capacity
  - Special architecture, grid technology
"The calculus of innovation is really quite simple"
I will conclude by noting that, to the extent that knowledge about new methods and concepts is spread more quickly, science itself will be accelerated. The stakes are enormous. William Brody, member of the Executive Committee of The Council on Competitiveness and President of my alma mater, Johns Hopkins University, recently testified before Congress as follows:6
The calculus of innovation is really quite simple:
Knowledge drives innovation;
Innovation drives productivity;
Productivity drives our economic growth.
That's all there is to it.
1. National Science Foundation, "Scientists and engineers engaged in R&D, by country," Science and Technology Pocket Data Book, 2000, NSF 00-328 (Arlington, VA, 2000), February 2006.
2. Thomson Scientific, Web of Science, 2005, The Thomson Corporation, February 2006.
3. Luís M.A. Bettencourt, et al., "The power of a good idea: quantitative modeling of the spread of ideas from epidemiological models," Physica A: Statistical Mechanics and Its Applications, 2006, Los Alamos National Laboratory, February 2006.
4. David Kaiser, "Drawing Theories Apart: The Dispersion of Feynman Diagrams in Postwar Physics," (University of Chicago Press, 2005), February 2006.
5. About Innovations in Scientific and Knowledge Advancement, The U.S. Department of Energy Office of Scientific and Technical Information Home Page, 2005, February 2006.
6. William R. Brody, U.S. Competitiveness: The Innovation Challenge, Testimony to the House Committee on Science, July 21, 2005.