Enabling Scientific Discovery through a Science Information Infrastructure

Walter L. Warnick, Ph.D., Director
Department of Energy Office of Scientific and Technical Information
April 17, 2002

Keynote Address
DOD Information Management Conference:
From Information Delivery to Knowledge Management

 

Slide 1- Right Information, Right Place, Right Time

I will begin by noting what a privilege it has been for me to work with the leader of the Defense Technical Information Center (DTIC), Kurt Molholm. For a number of years now, he and I have represented our two agencies on an interagency group called CENDI. During several of those years, Kurt was the chairman of CENDI. I have learned a great deal from him both about the information business and about leadership. DOD is very fortunate to have someone of his vision and management skill to lead its STI operations through a very challenging period.

I am doubly honored to be invited as a keynote speaker at this Conference. First, I am honored because DOD is an agency where I invested several years early in my federal career. I did hands-on research during six years with the Naval Research Laboratory, and it is nice to be back. Secondly, I am honored that my DOD colleagues in the information business think that what we are doing in DOE is worth hearing about. Thank you for inviting me.

My vision for a comprehensive information infrastructure for the sciences is to fulfill the need to have the right science information at the right time. For the past several years, this vision has been my daily inspiration.

Slide 2 - Shared Knowledge

Traditional Mission

All scientists well appreciate that scientific progress is enabled only if knowledge is shared. Isaac Newton said it best 300 years ago, "If I have been able to see further, it was only because I stood on the shoulders of giants." Newton credited his own incredible advances to the knowledge he obtained from others. Today, the same notion is reflected in numerous laws that make the sharing of knowledge a mission of DOE and other science agencies. My Office of Scientific and Technical Information (OSTI) does its best to advance that part of the DOE mission.

Sometimes we in science agencies tend to lose sight of the fact that, at the most fundamental level, the purpose of the R&D enterprise is not to fund researchers. Rather, the fundamental purpose of the R&D enterprise is to make progress on key science topics. Funding researchers is but the main mechanism by which the more fundamental goal of science progress is pursued.

Yes, researchers explore science questions and make discoveries. But the R&D cycle is incomplete until the knowledge thus gained is shared. Sharing of science knowledge is what my office in DOE and DTIC in DOD are all about.

New Way

Periodically throughout history, there have come along technological developments that have transformed society: the railroads, the telephone, electricity, the automobile. I believe that we are now on the front edge of one of those transformational technologies: the Internet.

I say "front edge," for what we have seen so far is barely the beginning. As broadband becomes ubiquitous--and by broadband I mean 100 megabits per second, not the 1.5 megabits per second which is the speed of DSL--its impact will be profound in ways that cannot now be fully anticipated. It is impossible to predict how millions of intelligent people will apply this wonderful new technology to the problems they face. All kinds of human interactions will be changed, from the location of the workplace, to the design of the workplace, to all manner of education and entertainment. The way that scientists share knowledge will also be transformed.

This is where all of us come in. Throughout the science agencies, the knowledge community is working toward the new future. We are building on our traditions and proceeding forward one step at a time. Together, we are making progress that I will review with you today.

Within DOE, the nature of our business is changing faster and more profoundly than anything else that is happening in my agency.

Slide 3 - science.gov's Web Debut

Science.Gov: A Shared Vision

In March 2002, DOE and nine other Federal science agencies, including DOD's DTIC, took a major step in information technology deployment when the interagency science.gov Web site went online in test mode. Science.gov provides a gateway to authoritative science information of U.S. Government agencies, including R&D results. It serves as the official FirstGov for Science portal in support of the Administration's e-government initiatives.

A formal launch is scheduled later this spring. Science.gov will raise the visibility and use of the results of the R&D at all the science agencies.

Science.gov began as a vision and gained momentum as ten Federal science agencies first began collaborating and then persevered in making the vision a reality. This is how it happened.

Slide 4 - science.gov Foundation:  Workshops

When the Web emerged in the mid-1990s, organizations like mine in DOE and DTIC already had years of experience in compiling and managing huge databases of information. The Web offered a natural vehicle for making these databases accessible as never before.

The Web not only allowed us to post what we were already doing more broadly, it also opened the door for us to do it better.

The Web also opened up our customer base. OSTI's tools, which once served primarily librarians, began to serve researchers and students directly as well. These new customers eagerly used our databases, but they needed help in discovering the databases in the first place. Thus, OSTI began to offer tools that allowed researchers and students to do cross-database searching. We did what we could within DOE, but it was clear that the boundaries in science did not coincide with the boundaries between agencies. We needed to reach out beyond DOE.

In 2000, DOE sponsored a blue ribbon workshop at the National Academy of Sciences for furthering the vision for a future sciences information infrastructure. We at DOE believed that the deployment of current technology could create an inexpensive network of dispersed science resources. From that workshop came a report that yielded a high-level vision that encompassed more than just DOE.

Following the workshop, it was agreed that an interagency strategy was needed to achieve the vision. Various science agencies had much to offer from the science disciplines covered by their missions. For DOE, our strength is a wealth of scientific and technical information, mostly in the physical sciences, generated over the last 60 years.

To translate the vision of the blue-ribbon panel into a working strategy, a workshop was held at NIST in the spring of 2001. Federal agency representatives from information centers like DTIC recognized the opportunities for making agencies' science information more accessible to researchers, students, and the science-attentive citizen. The concept of science.gov was endorsed as the interagency science portal or gateway.

No longer would patrons need to know ahead of time which agency held which information. Recognizing that "the building blocks are available now" and there is "no need to wait - no need to experiment," the principal science agencies broke new ground and formed the Science.gov Alliance.

As noted previously, Science.gov was launched in test mode in March 2002. It includes the latest in web tools, including Distributed Search. Beyond its technology, science.gov is at least as noteworthy as an organizational achievement. Having all these agencies work together voluntarily to produce something this significant is extremely rare. I am not aware of any milestone of comparable significance ever having been achieved by voluntary collaboration among government agencies.

The result is a vision realized. Science.gov is the new interagency gateway to the science information of several Federal agencies. It is the recognized FirstGov for Science portal.

The current incarnation of Science.gov is but the latest step toward a science information infrastructure. There is much more to be done. Now that real interagency collaboration has been demonstrated, there is reason to think that more great things can be accomplished.

Slide 5 - Nature Quote

What We Have Accomplished

Evidence shows that usage increases when access is more convenient. As Nature magazine noted:

"Articles freely available online are more highly cited. For greater impact and faster scientific progress, authors and publishers should aim to make research easy to access." 

Nature, Vol. 411, No. 6837, p. 521, 2001

Making research easy to access has been our major motivation in the development of Science.gov.

I pause here a moment to add that the 9/11 attack has made us more aware than ever that there is a downside to information accessibility. Whereas we once asked whether information might be useful to hostile governments, and, if it was, we classified it, now we ask if it might be useful to terrorists.

Such concerns have caused DOE to limit public access to a quantity of information on the Internet. There has been much controversy about this. DOE took quick action in an effort to safeguard homeland security following guidance from the DOE Secretary's office. On March 19, a memorandum by the White House Chief of Staff provided dissemination guidance to all federal departments and agencies.

The White House guidance, while emphasizing the need for homeland security, acknowledged the commitment to open and efficient exchange of scientific and technical information. OSTI continues to honor this commitment.

Slide 6 - OSTI Web Tools Collage

OSTI has developed and deployed a series of "E-Government" tools. Today, scientists generally record and communicate their findings in three main ways: journal literature, gray literature, and preprints. OSTI has developed tools to encourage information discovery and retrieval in each of these three ways. These tools are described in the handout at the corner table. They are also described at the OSTI home page.

Time does not permit me to review OSTI's Web Tools, nor those of the Labs and other DOE organizations. I will, however, elaborate on one of the technologies OSTI is using to improve access, retrievability and discovery of science information in several of its Web resources--Distributed Search, the means to access the Deep Web.

Slide 7 - Deep Web Searching

While the best known search tools focus on those Web pages that are easily found and indexed by traditional Web crawlers (called the "surface Web"), OSTI is primarily focused on the "Deep Web" because that is where the best information is. The term "Deep Web" refers to that vast repository of underlying content, such as documents in online databases, that general-purpose Web crawlers cannot reach. Such databases typically have search engines of their own. Estimated to be 500 times larger than the Surface Web, Deep Web content has remained mostly unavailable to cross-site searches due to the limitations of traditional search engines.

As an example of how the content of the Surface Web and the Deep Web differs, OSTI's Surface Web resources comprise about 100 Web pages, while our Deep Web resources comprise over 8 million pages! Our Deep Web collection contains, for example, the gray literature scientific and technical output of the entire agency since 1995. To support the search and retrieval of Deep Web content, OSTI applied a novel Distributed Search tool, Distributed Explorit. It was developed by Abe Lederman of Deep Web Technologies in collaboration with OSTI.

Explorit has since served as the cornerstone for additional OSTI Web tools requiring Deep Web searches and also serves as the search tool for Science.gov.

Bill Arms of Cornell University, who is a leading thinker about information management, has noted that simple search algorithms applied to enormous collections can be a tremendous aid to human thought. Explorit is just such an algorithm.

In the following sequence of slides, we begin with a Web patron who wants to initiate a search on Science.gov. This patron happens to be located in Washington, DC, but he could be any place with internet access (Slide 8). The patron enters a search term and selects the databases he wishes to search.  The query is communicated to a server at Oak Ridge, Tennessee (Slide 9).  From there, Explorit pulses the selected databases wherever they happen to be located around the country (Slide 10).  The search engine at each of those databases performs a search and reports the results back to the server at Oak Ridge (Slide 11).  The compiled hit list is communicated to the patron in DC (Slide 12).
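The sequence in Slides 8 through 12 can be sketched in a few lines of code. This is only an illustrative sketch, not how Explorit itself is implemented: the database names, their contents, and the helper functions `search_one` and `distributed_search` are all hypothetical, and the remote search engines are simulated by simple substring matching on in-memory lists.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for remote databases; real ones would sit behind
# their own search engines at sites around the country.
DATABASES = {
    "db_a": ["fusion energy report", "solar cell study"],
    "db_b": ["fusion reactor design", "wind turbine survey"],
    "db_c": ["genome mapping notes"],
}

def search_one(name, query):
    """Stand-in for one database's own search engine (simple substring match)."""
    return [(name, record) for record in DATABASES[name] if query in record]

def distributed_search(query, selected):
    """Send the query to every selected database in parallel, then compile one hit list."""
    with ThreadPoolExecutor(max_workers=len(selected)) as pool:
        per_db = pool.map(lambda name: search_one(name, query), selected)
        # Hits come back grouped by database; nothing here ranks them.
        return [hit for hits in per_db for hit in hits]

hits = distributed_search("fusion", ["db_a", "db_b", "db_c"])
```

Note that the compiled list is ordered only by database, not by relevance.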

Next Major Advance in Information Retrieval

Current Situation

Great quantities of scientific and technical information of special interest to DOE have been made available on the Web via tools created by OSTI. In addition, even greater quantities of information of some interest to DOE have been made available on the Web by other government agencies and by other organizations. To serve our patrons, OSTI has accepted the challenge of offering up cutting edge ways to help our patrons discover and retrieve information from all these sources. This is the fundamental purpose of our efforts to create a digital library.

Helping patrons find information across disparate databases poses technical difficulties. Because various entities have posted information primarily for their own purposes, there is little interoperability between information databases. There are three conceptual approaches for establishing interoperability across digital library collections: Federation, Harvesting and Gathering.(1)

The three approaches differ in the extent of the burden accepted by owners of the information collections. Correspondingly, the three approaches differ in the extent of the benefits offered to patrons intent on information discovery and retrieval.

At the high burden end of the spectrum is the Federation approach. Through this approach, participating organizations agree on interoperability standards and protocols - and then build the systems around the protocol. Through the Federation, extensive up-front efforts are made to reach agreement on organizational, content and technical issues. The library community is an excellent example of such a Federation: libraries share on-line catalog records using Z39.50, MARC, and the Anglo-American Cataloguing Rules (AACR2). The cost of participation is normally high for federated activities. It is resource intensive to implement, adhere to, and maintain standards. As a result, membership in Federations tends to be limited. Many owners of important content are not part of any Federation, and are not likely to join.

At the low burden end of the spectrum is the Gathering approach. It requires no cooperation among the owners of the information collections from which information is gathered. There are two main ways by which Gathering can be accomplished: the Web Crawling approach and the Distributed Search approach. Web Crawling depends on Web-based search engines, such as Google and AltaVista, to provide information discovery, or on special crawler applications like our use of Webinator in our PrePRINT Network. Web Crawling is dependent upon the availability of information in permanently accessible Web sites (i.e., the surface Web). It cannot be used if the information resides in databases accessible only by search engines dedicated to those databases, such as OSTI's Information Bridge, NLM's PubMed, or DTIC's STINET.

The other method to accomplish Gathering is Distributed Search, which we have already discussed. Distributed Search is well suited to databases accessible only by search engines dedicated to those databases (the Deep Web), such as OSTI's Information Bridge, NLM's PubMed, or DTIC's STINET. As Web Crawling and Distributed Search each work best on different "parts" of the Web, they are complementary.

In the middle of the burden spectrum is the Harvesting approach. It envisions that each owner of an information collection (or some intermediary) adopt metadata standards so as to facilitate the efforts of other entities to harvest the information in the collection. The Harvesting approach facilitates accessibility of digital library collections by making it easy to harvest and combine disparate collections. The Harvesting approach is exemplified by the Open Archives Initiative (OAI)(3). OAI is new. As of this writing, only a modest number of small collections are compliant with the OAI protocol.

State of the Art

In OSTI's view, today, the most important information collections are on the Deep Web. Thus, the practical state of the art in retrieving information from the major collections hosted by government agencies is Distributed Search. OSTI, working with Abe Lederman, deployed one of the first, if not THE first, government Distributed Search system when, on April 21, 1999, we launched EnergyFiles. Since then, Distributed Search has been more broadly deployed. Most recently, it is a featured tool on Science.gov, launched last month. There is tremendous potential for further application of Distributed Search.

Distributed Search has great strengths. It, in effect, integrates disparate underlying databases so that they can be accessed as if they were one. For the purposes of the patron seeking government information, it is an ideal tool, as it relieves the patron from first having to identify which agency might hold the information he/she seeks. Distributed Searching places no burdens, except increased traffic, on the owners of databases. Distributed Search systems are automatically kept as current as the underlying databases. They are capable of capitalizing on the special strengths of the search engines of the underlying databases, including full-text searching, if that is available.

Limitations of Distributed Search

Distributed Search also has limitations. The most serious is the limitation on the number of underlying databases that can be usefully searched at one time. The limit is set, not by the number of databases that can be queried, nor by the number of hits that can be brought back to the patron. Rather, the limit is set by the number of hits that the patron can reasonably be expected to assimilate. For example, if 50 databases each bring back 10 hits, the patron has to filter through 500 hits. This is not reasonable.

Michael Nelson (NASA), who operates the NASA Technical Reports Server, which uses Distributed Search, estimates that the practical limit on the number of underlying databases that can be reasonably handled is about 20. OSTI's experience is consistent with that estimate.

Upon closer examination, one recognizes that it is not so much the total number of hits that sets a practical limit on Distributed Search, but rather it is the fact that the patron has no basis upon which to value any one hit more than another. It is the absence of relevancy ranking that limits Distributed Search. To illustrate this point, consider the following example: it is not uncommon when doing a search on Google to get back a large number of hits--even more than 500. But Google rescues the patron by rank ordering the hits according to relevancy: hits most likely to be important are presented first. Thus, by simply examining the first several hits that Google presents, patrons are likely to find the answer they seek. Patrons ignore the remaining hits.

Overcoming the Limitation

So, how can the limitation of Distributed Search be overcome? Two alternatives present themselves.

(1) Abandon Distributed Search in favor of an alternative technology, namely, the Harvesting approach, as exemplified by the Open Archives Initiative.

(2) Figure out a way to build relevancy ranking into Distributed Search.

I will discuss each of these two alternatives, beginning with the Open Archives Initiative.

Open Archives Initiative (OAI)

As the most prominent Harvesting approach, the Open Archives Initiative is strongly touted by its champions as the next major advance in information discovery and retrieval. OAI envisions that databases will be made available by their owners in XML with certain metadata standards. The provider of such a database is termed a Data Provider. Using this resource, relatively simple software can be developed to harvest the data and bring it to a central server, where it can be accessed by patrons. The entity doing the harvesting is called the Service Provider.

The first thing that must be done under the Harvesting approach is for the Service Provider to compile a metadata index. Such an index must be in place before patrons can use products coming from this approach. To begin this process, the Service Provider accesses the site of a Data Provider (Slide 13).  The Data Provider allows the Service Provider to harvest metadata. The metadata is stored by the Service Provider (Slide 14).  For each additional Data Provider, the process is repeated. The Service Provider accesses the site of the Data Providers, harvests metadata, and compiles it (Slide 15, Slide 16, Slide 17, Slide 18).  Having accessed all the Data Providers, the Service Provider is ready to deal with patrons. Again, we have a patron in Washington, DC, but the patron could well be anywhere with internet access (Slide 19).  The patron launches a search against the Service Provider in Tennessee (Slide 20).  The Service Provider returns the results to the patron (Slide 21).
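The two phases in Slides 13 through 21, harvesting first and searching second, can be sketched as follows. This is a simplified illustration under stated assumptions: the provider names, the record fields, and the functions `harvest` and `search` are hypothetical, and the Data Providers are plain in-memory lists rather than sites answering OAI protocol requests.

```python
# Hypothetical Data Providers, each exposing metadata records (not full text).
DATA_PROVIDERS = {
    "provider_a": [{"id": "a1", "title": "Fusion energy report"}],
    "provider_b": [
        {"id": "b1", "title": "Fusion reactor design"},
        {"id": "b2", "title": "Wind turbine survey"},
    ],
}

def harvest(providers):
    """Service Provider step: copy every provider's metadata into one central index."""
    index = []
    for name, records in providers.items():
        for record in records:
            # Remember which Data Provider each record came from.
            index.append({"source": name, **record})
    return index

def search(index, term):
    """Patron step: the query runs against the central index only."""
    term = term.lower()
    return [r["id"] for r in index if term in r["title"].lower()]

central_index = harvest(DATA_PROVIDERS)   # done once, before any patron arrives
result = search(central_index, "fusion")  # no Data Provider is contacted here
```

The point of the sketch is the ordering: the Data Providers are contacted only during harvesting, so the central index is only as current as the most recent harvest.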

Incidentally, the sequence of events envisioned under the Open Archives Initiative is very similar to the sequence of events performed by traditional search engines, like Google. Google crawls web sites, harvests an index of the text on those sites, compiles a central index, and then uses that index to handle patrons' queries.

OSTI has successfully adopted OAI in a proof-of-principle exercise. If OSTI were to fully adopt OAI, we would need to make each of our relevant databases compliant with the OAI protocol. Such a task is doable. Once complete, OSTI would be a Data Provider. Further, OSTI would want to create tools by which such data are harvested and combined with data from other agencies and other sources, and thus be a Service Provider. Government information centers like OSTI and DTIC are ideally suited for the OAI protocol, as they have responsibilities consistent with the roles of both Data Providers and Service Providers.

One of the main advantages claimed for OAI is that, once a master data set exists on a central server, it is possible to do relevancy ranking. Thus, OAI has the potential to overcome the main limitation of Distributed Search.

As OAI specifies a metadata protocol, it requires a degree of voluntary cooperation on the part of independent data providers. Such cooperation might well pose a problem for the broad adoption of OAI. The prospects for OAI were boosted considerably by the recent test launch of Science.gov, for Science.gov demonstrated that major information centers can indeed cooperate to produce a beneficial result. These centers might well take a further step and adopt OAI, but only if a solid case can be made that benefits would then accrue to the centers and their patrons.

Limitation of OAI

OAI has its own set of limitations. First is the quality of the relevancy ranking that can be done on OAI data. Conceptually, there are multiple ways to do relevancy ranking. I will discuss two: (1) semantic analysis, where the words of the data source are examined, and (2) linkage analysis, where the number of links from other web sites is counted, such as is done by Google. Either way, there are reasons to question the quality of relevancy ranking that can be done with OAI. With semantic analysis, one can examine, for example, the number of times that a search term appears in the record and the proximity of multiple search terms to one another, and rank the different hits accordingly. OAI is somewhat limited for semantic analysis because what is harvested is metadata, rather than full text. OAI also has limitations for doing relevancy ranking through linkage analysis, but I will defer discussing this topic to another occasion.
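To make the semantic analysis idea concrete, here is a minimal sketch of a scoring function that uses the two signals mentioned above: how often the search terms appear and how close together they sit. The formula, the function name `score`, and the sample records are assumptions chosen for illustration; they do not describe any particular search engine.

```python
def score(text, terms):
    """Score a record: occurrences of the terms, plus a bonus when two terms sit close together."""
    words = text.lower().split()
    terms = [t.lower() for t in terms]
    positions = {t: [i for i, w in enumerate(words) if w == t] for t in terms}
    frequency = sum(len(p) for p in positions.values())
    bonus = 0.0
    # Proximity bonus (illustrative): only for two distinct terms that both occur.
    if len(set(terms)) == 2 and all(positions[t] for t in terms):
        first, second = (positions[t] for t in terms)
        gap = min(abs(i - j) for i in first for j in second)
        bonus = 1.0 / gap
    return frequency + bonus

# Hypothetical records for illustration.
records = {
    "r1": "neutron scattering in thin films",
    "r2": "scattering of light and later neutron capture",
    "r3": "thin films",
}
query_terms = ["neutron", "scattering"]
ranked = sorted(records, key=lambda r: score(records[r], query_terms), reverse=True)
```

Both r1 and r2 contain each term once, but r1 ranks higher because the terms are adjacent. With metadata alone there are fewer words to count, which is the limitation noted above.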

A second limitation for OAI relates to the mass transfer of records from one computer to another. Because harvesting envisions a copying of metadata from one server (the Data Provider) to another (the Service Provider), especially large databases, like PubMed with over 10 million records, may pose a problem. Harvesting such large databases has not yet been attempted. This problem is complicated by the fact that in order for the Service Provider to be kept current, some means must be employed to periodically repeat the transfer of records from the Data Provider to the Service Provider.

In conclusion, OAI is well suited for bibliographic data sources whose owners sign on to the OAI protocol. My personal view is that government agencies should continue to explore OAI, offer up trial databases as Data Providers, and launch a trial Service Provider. OSTI will be a part of such trials.

Slide 22 - Overcoming the Limitation of Distributed Search  

Conceptually, there is an approach that can overcome the limitation of Distributed Search. For full-text databases, imagine that each one has a means for ranking the relevancy of hits from that database. Next, imagine that the means by which relevancy is measured and ranked is the same for each of the databases. Then, it follows that the hits from each of the databases can be interleaved according to relevancy, just as if the original databases were all federated into one database.(2)  Bingo!

Thus, the next major advance in information retrieval will be accomplished if owners of databases adopt compatible methods of relevancy ranking. Patrons could launch a Distributed Search against all such databases, retrieve hundreds or thousands of hits, but focus their attention on only the most relevant hits. To the patron, the distributed databases would give the same result as if the databases were combined into one.
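The interleaving step can be sketched with a standard merge of sorted lists. The hit lists and scores below are invented for illustration, and the sketch assumes exactly what the argument requires: every database scores relevancy the same way, so the scores are directly comparable.

```python
import heapq

# Hypothetical hit lists, each already sorted by its own database from most
# to least relevant. Scores are assumed comparable across databases.
hits_a = [(0.92, "A: doc 1"), (0.40, "A: doc 2")]
hits_b = [(0.85, "B: doc 1"), (0.10, "B: doc 2")]
hits_c = [(0.77, "C: doc 1")]

def interleave(*hit_lists):
    """Merge per-database ranked lists into one list ordered by relevance score."""
    return list(heapq.merge(*hit_lists, key=lambda hit: hit[0], reverse=True))

merged = interleave(hits_a, hits_b, hits_c)
top_three = [title for _, title in merged[:3]]
```

The merged list is the same one a single combined database would have produced, provided the scoring really is identical everywhere. If the databases rank differently, the scores are not comparable and the merge is not meaningful.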

My personal view is that conquering the challenge of relevancy ranking should be the next goal for Science.gov and CENDI. This goal will not be easy to achieve. As the first step, OSTI is getting its own house in order by deploying a tool for relevancy ranking in our gray literature product, the Information Bridge. To be part of the larger vision, it is important that the relevancy algorithm that we deploy be as compatible as possible with similar tools deployed by other database owners.

Next, OSTI will deploy a compatible relevancy ranking tool in our other full text products. Once we have achieved these goals, Science.gov and CENDI could lead the way to the next major advance for information retrieval.

These are near term steps. In the long term, once broadband becomes ubiquitous, there will surely be techniques available to overcome all the limitations that we now struggle with.

Slide 23 - Promise of...Discovery

In Closing

In closing, Science.gov is an important step toward fulfilling the need to have the right information at the right time. Making this happen was no fluke. It takes collaboration and input from many parties, as science is not bounded by organization or agency. Working together to forge partnerships and alliances, our future capabilities to advance science are unlimited.

We have already come a long way. We have made progress by working together, both within DOE and with DOD and other agencies. On the other hand, the knowledge sharing business is changing fast, and there is much more to be done. Now that the first major steps are complete, our challenge is to properly select among the myriad choices for our future direction.

The promise of greater access, retrievability, and discovery of more and more knowledge is too great to be denied. While none of us in the STI community has perfect foresight, our prospects for the future are best if we continue to work together. I pledge DOE's support to this end. To my colleagues in other agencies, we say, "Let's roll!"

Thank you.

 

1. Arms, William Y., "Thoughts about Interoperability in the NSDL." Draft for discussion, August 2000.

2. Voorhees, E.M., and Tong, R.M., "Multiple Search Engines in Database Merging." In Proceedings of the Second ACM International Conference on Digital Libraries (DL'97), July 1997.

3. Open Archives Initiative, "The Open Archives Initiative Protocol for Metadata Harvesting," edited by Herbert Van de Sompel and Carl Lagoze. Version 1.1, July 2001.