WorldWideScience.org Accelerating Global Scientific Discovery
Walter L. Warnick, Ph.D.
United States Department of Energy
Office of Scientific & Technical Information
(Operating Agent for WorldWideScience.org)
Washington, D.C., United States
Good afternoon and thank you for that introduction. Thank you to KISTI for hosting this conference and to both KISTI and ICSTI for organizing it.
I will begin my talk by explaining what WorldWideScience is. WorldWideScience begins with the observation that many countries publish scientific and technical information on the web. We call a site where this information is posted a "national science portal." National portals can take a variety of forms, from a collection of conventional web pages, to a single database on the web, to multiple unconnected databases, to a federation of many such databases that users can search with a single query.
Of course, I was particularly familiar with the national science portal of the United States, Science.gov, which is hosted on behalf of the United States Science.gov Alliance by my Office of Scientific and Technical Information.
A couple of years ago, it occurred to me that researchers and science-attentive citizens would greatly benefit if all the national science portals could be searched via a single query. Eventually, the tool that allowed single-query searching of multiple national portals came to be called WorldWideScience. Let me now recap the short history of WorldWideScience.org.
International partnership kicks off global science gateway
Just two years ago, at the ICSTI June 2006 meeting in Bethesda, Maryland, U.S., I expressed a vision for a "global science gateway." The concept was endorsed by the British Library, which offered to collaborate with my agency, the United States Department of Energy (DOE).
In January 2007, a bilateral partnership was formalized when the British Library and DOE signed a Statement of Intent to partner in the development of the Global Science Gateway. Dr. Raymond Orbach, DOE Under Secretary for Science, and Dame Lynne Brindley, Chief Executive of the British Library, participated in the signing ceremony in London. The Statement of Intent invited other nations to join the collaboration.
WorldWideScience.org was launched in June 2007
WorldWideScience.org was developed by the U.S. Department of Energy's Office of Scientific and Technical Information. It was unveiled to ICSTI members and the public at last year’s ICSTI conference in Nancy, France. ICSTI members also endorsed the proposal that ICSTI serve as the umbrella organization for the future long-term governance of WorldWideScience.org. We thank them for their support.
A Terms of Reference document was developed to define this governance structure, now called the WorldWideScience Alliance. The Terms of Reference were accepted by ICSTI members at the 2008 Winter Meeting in Paris.
Tomorrow, there will be a special ceremony to launch the WorldWideScience Alliance. Please attend. I think it is a historic event.
WorldWideScience.org, the Global Science Gateway, enables anyone with Internet access to launch a single query search of over 200 million pages of scientific and technical information from more than 40 countries, covering all of the world's inhabited continents and nearly half the world's population. WorldWideScience.org improves the visibility of R&D outputs and expands scientific communication by providing equal access to diverse geographic settings. It does what Professor Min discussed this morning—WorldWideScience overcomes national boundaries. WorldWideScience relies upon an unconventional technology, called federated search. The technology is provided by Deep Web Technologies. Its president is Abe Lederman, who is here today.
- Grown from 12 databases from 10 countries one year ago to 32 databases from 44 countries today
- A quantity of science (more than 200 million pages from every inhabited continent)
- A breakthrough in content enabled by breakthrough technology
WorldWideScience has grown rapidly.
Today, it has grown to the point that the quantity of science searched—over 200 million pages—is comparable to the quantity of science searched by conventional web search engines, such as Google.
Equally important, the science searched by WorldWideScience tends to be scholarly, as opposed to the lay content searched by conventional web search engines. The bulk of the content on WorldWideScience is not accessible by conventional search engines. We say that the content searchable via WorldWideScience is non-Googleable. Thus, WorldWideScience is a complementary source to Google, rather than being duplicative.
We have generally limited the sources to those with English language interfaces. We hope to include non-English sources in the future as machine translation capabilities address this need.
Current WorldWideScience.org Sources
- African Journals Online
- Article@INIST (France)
- Australian Antarctic Data Centre
- Canada Institute for Scientific and Technical Information
- CSIR Research Space (South Africa)
- Defence Research and Development Canada (Canada)
- Directory of Open Access Journals (Sweden)
- DEFF Global E Prints (Denmark)
- DEFF Research Database (Denmark)
- Electronic Table of Contents (ETOC) (United Kingdom)
- Indian Academy of Sciences
- Indian Institute of Science Eprints
- Indian Institute of Science Theses & Dissertations
- Indian Medlars Centre
- J-EAST (Japan)
- J-STAGE (Japan)
- J-STORE (Japan)
- Journal@rchive (Japan)
- Korea Science (Korea)
- NARCIS (Netherlands)
- Science.gov (United States)
- Scientific Electronic Library Online (Argentina, Brazil, Chile, Colombia, Portugal, Spain)
- Transactions and Proceedings of the Royal Society of New Zealand 1868-1961 (New Zealand)
- UK PubMed Central (United Kingdom)
- Vascoda (Germany)
- VTT Technical Research Centre of Finland Publications Register
- VTT Technical Research Centre of Finland Research Register
Our current sources include large prominent collections such as Science.gov (the U.S. contribution) in addition to smaller sources of highly valuable science. Many of the geographic areas included in WorldWideScience.org, such as some of the African countries included in African Journals Online, are not traditionally well represented in standard scientific and technical resources.
Current Information Partners in WorldWideScience.org
- Burkina Faso
- Congo, DR
- Cote d'Ivoire
Current Information Partners in WorldWideScience.org (cont.)
- Libyan Arab Jamahiriya
- The Netherlands
- New Zealand
- South Africa
- United Kingdom
- United States
Founding Alliance Members
- Canada Institute for Scientific and Technical Information (CISTI) – Canada
- VTT Technical Research Centre of Finland (VTT) – Finland
- Institut de l'Information Scientifique et Technique (INIST) – France
- TIB – German National Library of Science and Technology – Germany
- Japan Science and Technology Agency (JST) – Japan
- Korea Institute of Science and Technology Information (KISTI) – Korea
- Scientific Electronic Library Online (SciELO) – Argentina, Brazil, Chile, Colombia, Portugal, Spain
- Council for Scientific and Industrial Research (CSIR) – South Africa
- African Journals Online (AJOL) – Representing 24 African countries
- British Library – United Kingdom
- Science.gov Alliance – United States
- International Council for Scientific and Technical Information (ICSTI)
- International Network for the Availability of Scientific Publications (INASP)
Along with vastly increasing its content since its inception, WorldWideScience.org has also transitioned from bilateral management to a multilateral governance structure, called the WorldWideScience Alliance.
The list of Founding Alliance members is in a state of flux, as new members sign up.
The Alliance consists of twelve founding member organizations representing 38 countries. ICSTI serves as an Alliance member and primary sponsor.
Other countries and organizations are invited to participate.
Alliance Executive Board:
- Chair – Richard Boulderstone, British Library
- Deputy Chair – Pam Bjornson, Canada Institute for Scientific and Technical Information
- Treasurer – Tae-Sul Seo, Korea Institute of Science and Technology Information
- Ex-Officio Member – Walter Warnick, WorldWideScience.org Operating Agent, U.S. DOE Office of Scientific and Technical Information
- Ex-Officio Member – Herbert Gruttemeier, ICSTI President, French Institut de l'Information Scientifique et Technique
- At-Large Member – Yvonne Halland, Council for Scientific and Industrial Research, South Africa
An election for the Alliance's Executive Board was held in early April 2008. Richard Boulderstone, Director, E-Strategy and Information Systems, British Library, was elected Chair. Pam Bjornson, Director General, Canada Institute for Scientific and Technical Information, was elected Deputy Chair. Tae-Sul Seo, Senior Researcher, Korea Institute of Science and Technology Information, was elected Treasurer. The Executive Board also includes two ex-officio members, myself as the head of OSTI, which serves as the WorldWideScience.org Operating Agent, and ICSTI President Herbert Gruttemeier, French Institut de l'Information Scientifique et Technique. The At-Large Member is Yvonne Halland, Strategic Information Resources Coordinator at the Council for Scientific and Industrial Research in South Africa.
A formal ceremony commemorating the establishment of the Alliance will be held tomorrow in the conference's closing session. Founding members of the Alliance will participate in the events. Everyone is welcome to attend.
WorldWideScience.org & Federated Search Technology
Many popular search engines rely on crawler-based technology.
Now that I have shared some of the background and history of WorldWideScience.org, I would like to talk about the technology behind it.
WorldWideScience.org implements federated searching to provide its coverage of global research and development results. It is very different from conventional web search technologies, such as those used by Google. You might be wondering why we used a different search technology.
The reason is that information in databases, where most of the world's high-quality science resides, can seldom be searched by conventional technology. It is a little-known fact that many of the popular search engines overlook a large portion of the web. Their technology relies upon crawlers, which find and visit web pages one at a time by following hyperlinks. Each time a crawler finds a page, it indexes it. That index is then merged with a master index, and when the user does a search, the query is actually applied against the master index. When there is a match, the results are hyperlinks to pages that were indexed sometime in the past. Crawlers are inherently incapable of getting inside a database.
Federated search drills down to the deep Web where scientific databases reside.
Federated search systems
Probe the Deep Web
The bulk of science information, especially scholarly science information, resides in databases. Crawlers, like Google's, can get to the first page of a database, but typically they cannot get past the front page. The database's own search box is often the only systematic way to see the contents of the database, and crawlers are unable to process the search box. This part of the web that is off limits to crawlers is called the Deep Web. It is possible for database owners to take special steps to expose their database content to crawlers; however, many organizations who own databases do not pursue these options.
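To make the limitation of crawlers concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy web, the page names, and the single database record are hypothetical, and real crawlers are far more elaborate. The point is only that a crawler that follows hyperlinks indexes the database's front page but never sees the record behind the search box.

```python
from collections import deque

# Hypothetical toy web: each URL maps to (page text, outgoing hyperlinks).
TOY_WEB = {
    "site/home": ("welcome to the science site", ["site/about", "db/search"]),
    "site/about": ("about our fuel cell research", []),
    "db/search": ("search our database", []),  # front page: a search box, no links
}

# Hypothetical record that lives only inside the database. No hyperlink
# points to it; it is reachable only through the database's search box.
DATABASE_RECORDS = {
    "db/record/1": "nanotechnology in fuel cell membranes",
}

def crawl(start):
    """Follow hyperlinks breadth-first and build a word -> set-of-URLs index."""
    index = {}
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        text, links = TOY_WEB[url]
        for word in text.split():
            index.setdefault(word, set()).add(url)
        for link in links:
            if link in TOY_WEB and link not in seen:
                seen.add(link)
                queue.append(link)
    return index

index = crawl("site/home")
print(sorted(index.get("database", set())))        # the front page is indexed
print(sorted(index.get("nanotechnology", set())))  # the record inside is not
```

In this sketch the word "database" is found on the front page, while "nanotechnology" is absent from the index because the crawler has no hyperlink to follow into the database.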
Federated search is a different kind of web search architecture. When the user places a query on a federated search application, like WorldWideScience.org, the query is transmitted to all the servers that host the databases. Each of those servers then translates the query for its own database and executes the search. Each remote database reports its results back to the WorldWideScience.org server, which combines the hits from all the databases and sorts them in relevance-ranked order. Finally, the ranked list is returned to the user. The whole process can take anywhere from about a second to around twenty seconds, depending on the complexity of the search and the speed of the source databases. Thus, WorldWideScience.org allows the user to search multiple data sources with a single query in real time.
As a specific example, if a user searches on the term "nanotechnology", the WorldWideScience.org interface sends the query to the 32 source databases, which independently run this search and begin returning results.
As results are returned to WorldWideScience.org, the combined results from all sources are run against WorldWideScience.org's relevance-ranking algorithm. The arrow on the right shows where the patron has the flexibility to reorder the results by source, date, title, or author. The arrow under the first hit shows how the system uses stars to indicate the degree of relevance of the search result to the search query. And the arrow under the second hit shows how an icon is presented if there is an ultimate link to full text.
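The query flow just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the WorldWideScience.org implementation: the three sources, their titles, and the scoring rule (counting occurrences of the query term in a title) are all hypothetical, and the actual relevance-ranking algorithm is not described in this talk.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for remote databases: each holds a few titles.
SOURCES = {
    "SourceA": ["Nanotechnology for fuel cells", "Polar ice cores"],
    "SourceB": ["Advances in nanotechnology",
                "Nanotechnology and nanotechnology policy"],
    "SourceC": ["Marine biology survey"],
}

def search_source(name, query):
    """Run the query inside one source and return its hits as (source, title)."""
    q = query.lower()
    return [(name, title) for title in SOURCES[name] if q in title.lower()]

def score(title, query):
    """Toy relevance score (an assumption): occurrences of the query in the title."""
    return title.lower().count(query.lower())

def federated_search(query):
    # Send the same query to every source at the same time.
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        per_source = list(
            pool.map(lambda name: search_source(name, query), sorted(SOURCES))
        )
    # Combine the hits from all sources and sort them by descending score,
    # breaking ties by source name and then title.
    merged = [hit for hits in per_source for hit in hits]
    return sorted(merged, key=lambda hit: (-score(hit[1], query), hit[0], hit[1]))

results = federated_search("nanotechnology")
for source, title in results:
    print(source, "|", title)
```

Reordering the combined list by source or title, as the patron can do on the results page, amounts to sorting the same merged list with a different key.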
Users can view the complete record for each result, and the full-text document if it is available. Here we have a record from KISTI's KoreaScience database, which has a very nice, complete full record as well as a link to the full text.
Here, we've gone from the bibliographic record at KoreaScience to the full text journal article – demonstrating how a patron in another country, Germany, for example, can use WorldWideScience.org to find materials from KoreaScience. With a large number of open access sources, WorldWideScience.org provides a single point of access to vast quantities of full-text science literature.
Recent & Future Enhancements:
- Results Clustering
- Personalized Alert Service
- Translation Capabilities
Along with increasing the number and diversity of scientific sources searched by WorldWideScience.org, the Alliance has several other near-term goals to incorporate Web 2.0 functionality. First, let me discuss results clustering.
Just this week, we have released a new version of WorldWideScience.org that includes the Clustering capability. I will illustrate this capability with a screen shot of it from Science.gov. In many ways WorldWideScience.org is modeled after Science.gov. A search on "fuel cells" has produced the relevance ranked results list shown in the middle of the screen. To the left, clusters of results based on similar keywords and concepts also add value for the user. Clustering search results into more precise subtopics helps the patron to further refine his or her search and drill down to the information needed more quickly. By simply clicking on the cluster of interest, those results are immediately displayed. Then, each one of those clusters can be displayed in an even more granular fashion.
To illustrate, the first cluster "fuel cell systems" can be expanded as shown. If the user is primarily interested in documents dealing with "storage system", clicking on that term will bring up the 16 results on that topic.
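One simple way to picture clustering is to group results that share a phrase. The sketch below, in Python, is only an illustration: the titles are hypothetical, and grouping by shared two-word phrases is an assumed stand-in for whatever method the clustering software actually uses, which is not described here.

```python
from collections import defaultdict

# Hypothetical result titles for a search on "fuel cells".
TITLES = [
    "Fuel cell systems for vehicles",
    "Hydrogen storage system design",
    "Fuel cell systems and storage system integration",
    "Solid oxide fuel cell materials",
]

def bigrams(title):
    """Return the set of two-word phrases in a title, lowercased."""
    words = title.lower().split()
    return {" ".join(pair) for pair in zip(words, words[1:])}

def cluster(titles, min_size=2):
    """Group titles under every two-word phrase shared by at least min_size titles."""
    groups = defaultdict(list)
    for title in titles:
        for phrase in bigrams(title):
            groups[phrase].append(title)
    return {p: ts for p, ts in groups.items() if len(ts) >= min_size}

clusters = cluster(TITLES)
for phrase in sorted(clusters):
    print(phrase, "->", len(clusters[phrase]), "results")
```

Clicking on a cluster corresponds to showing only the titles stored under that phrase; running the same grouping again on that smaller list would give the more granular view.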
We have also included a new feature to pull up the Wikipedia entry corresponding to the user’s search term. We envision this feature becoming very valuable, especially for users looking for related terms and concepts.
Within the next couple of months, we plan to incorporate an Alerts service, which will allow users to set up a profile and then generate automatic queries against the WorldWideScience.org sources on a routine basis. As shown here in Science.gov, the user creates the initial search to be run against the sources.
New documents are then delivered weekly to the user's email account. As this screen shows, the weekly Alert was run on May 9, and documents entered into the databases within the last few days are retrieved.
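The logic of an alert can be sketched as a saved query plus the date of the last run. The Python below is a hypothetical illustration, not the Science.gov or WorldWideScience.org code: the documents, their dates, and the profile fields are invented for the example, and email delivery is left out.

```python
from datetime import date

# Hypothetical documents with the date each entered its source database.
DOCUMENTS = [
    {"title": "Fuel cell durability study", "added": date(2008, 5, 1)},
    {"title": "New fuel cell catalysts", "added": date(2008, 5, 7)},
    {"title": "Wind turbine noise", "added": date(2008, 5, 8)},
]

def run_alert(profile, today):
    """Return titles matching the saved query that were added since the last run."""
    query = profile["query"].lower()
    new_hits = [
        doc["title"]
        for doc in DOCUMENTS
        if query in doc["title"].lower()
        and profile["last_run"] < doc["added"] <= today
    ]
    profile["last_run"] = today  # the next run only reports newer documents
    return new_hits

# Hypothetical user profile: a saved query and the date of the previous run.
profile = {"query": "fuel cell", "last_run": date(2008, 5, 2)}
print(run_alert(profile, date(2008, 5, 9)))   # documents added since May 2
print(run_alert(profile, date(2008, 5, 16)))  # nothing new a week later
```

Because the profile remembers when it last ran, each weekly run reports only documents the user has not already been sent.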
Courtesy of Teragram Corporation:
We are also looking into translation capabilities in order to include non-English sources in WorldWideScience.org. The Department of Energy is engaged in several research projects that might prove fruitful in late 2008 or 2009. This slide shows an early prototype of work done by Teragram Corporation on Chinese-English translation. A search for the word “gene” is translated into the corresponding Chinese characters.
Courtesy of Teragram Corporation:
Along with a translated title and abstract, the red text at the bottom shows a possible English version of the same work described in the original Chinese paper.
The WorldWideScience Alliance will continue to evaluate translation capabilities, and we welcome the participation of others interested in this research.
Through such efforts as clustering, alerts services, and translation, WorldWideScience.org complements other trends in global scientific communication. National research organizations, even in very small countries, recognize the importance of increasing the visibility of their R&D outputs. At the same time, full-text information accessibility has increased. Through the innovative combination of federated search and other technologies, scientists and citizens across the globe now have unprecedented access to scientific knowledge.
Thus, the availability of WorldWideScience and the formation of the Alliance tomorrow are historic events.
Thank you again to ICSTI for their sponsorship, and to all of the WorldWideScience Alliance Members and source owners. We look forward to a long and beneficial partnership.