by Sol Lederman on Tue, Feb 26, 2008
The E-print Network is one of OSTI's most popular and powerful research offerings, yet few of its users know about the advanced technology that drives it and makes it simple to use. Professional researchers in basic and applied science can access over 5 million E-prints gathered from nearly 28,000 databases and web-sites worldwide. Numerous OSTI innovations ensure that the E-print Network's documents are of extremely high quality, are highly relevant to researchers, and are quick and easy to find. This is the first in a series of articles about the technology behind this very important component of the Science Accelerator. This article serves as an overview; subsequent articles will provide more technical information.
The E-print Network is a federated search application. In response to a single user query, it federates (aggregates) search results from over 50 content databases across a number of scientific disciplines. The E-print Network, however, uses federated search in an innovative way: one of the databases it searches is a special collection formed by harvesting over 1.3 million E-prints from nearly 28,000 hand-picked web-sites. A custom-designed crawler performs the harvesting, and custom software builds an index of the 1.3 million E-prints so that they can be searched quickly together with the non-harvested databases. Most E-print Network users are unaware that the application is, in fact, a blend of federated search and Google-like crawling technologies. This marriage of the two technologies reflects OSTI's insight that E-prints reside not only in certain well-known repositories, but also on many thousands of web-sites belonging to leading scientists, researchers, and members of the academic community. Furthermore, it is OSTI's innovation that drives the development and evolution of the tools that make this unique portal a huge success.
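To make the blend of federated search and a locally harvested index more concrete, here is a minimal sketch in Python. It is not OSTI's implementation: the names `Hit`, `make_local_index`, and `federated_search`, the term-overlap scoring, and the sample documents are all assumptions made for illustration. Both sources are modeled with the same in-memory stand-in; in a real federated system the remote sources would be queried over the network.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Hit:
    title: str
    score: float  # hypothetical relevance score in [0, 1]
    source: str


# A source is anything that takes a query and returns a list of hits.
SearchFn = Callable[[str], List[Hit]]


def make_local_index(name: str, docs: List[str]) -> SearchFn:
    """Stand-in for one searchable source (illustrative only)."""

    def search(query: str) -> List[Hit]:
        terms = query.lower().split()
        hits = []
        for doc in docs:
            words = doc.lower().split()
            matched = sum(1 for term in terms if term in words)
            if matched:
                # Score = fraction of query terms found in the document.
                hits.append(Hit(doc, matched / len(terms), name))
        return hits

    return search


def federated_search(
    query: str, sources: Dict[str, SearchFn], limit: int = 10
) -> List[Hit]:
    """Send one query to every source in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = [pool.submit(fn, query) for fn in sources.values()]
        merged = [hit for future in futures for hit in future.result()]
    # Highest score first; ties are broken alphabetically by title.
    merged.sort(key=lambda hit: (-hit.score, hit.title))
    return merged[:limit]


if __name__ == "__main__":
    sources = {
        "harvested": make_local_index(
            "harvested", ["quantum dots in silicon", "plasma physics notes"]
        ),
        "remote": make_local_index(
            "remote", ["silicon quantum computing", "protein folding"]
        ),
    }
    for hit in federated_search("quantum silicon", sources):
        print(hit.source, hit.title, hit.score)
```

The point of the sketch is that the harvested collection is treated as just one more source: the user issues a single query, and the merged list hides which results came from the crawled index and which came from other databases.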
Critical to the success of the E-print Network is the combination of automation and human effort. While maintenance of most of the E-print collections is a straightforward process of ensuring that access to the searched databases is not disrupted, the harvested collection, also known as "E-prints on Web Sites," requires ongoing and careful attention. Dr. Dennis Traylor is OSTI's Director of the E-print Network and is responsible for the care and feeding of "E-prints on Web Sites." Dr. Traylor and his staff work tirelessly to ensure that each of the crawled web-sites meets OSTI's strict criteria for inclusion in the E-print Network. Additionally, the E-print staff categorizes web-sites and tracks the movement of content contributors from one university or research center to another. OSTI has developed a number of specialized tools to support the large maintenance effort, yet no technology exists that can replace the "high touch" that keeps the quality of this critical component of the E-print Network high.
Other OSTI innovations that keep the quality of the crawled and harvested content high include automatic de-duplication of documents, management of collaborations among individuals from different organizations who contribute identical documents, automatic removal of documents that are highly unlikely to be E-prints, and automatic removal of non-English documents. Additionally, the sites that are crawled to obtain E-prints are recrawled on a regular basis to ensure that this special collection remains current.
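The article does not say how these filters are implemented, so the following Python sketch is only one plausible approach under stated assumptions: duplicates are detected by hashing whitespace- and case-normalized text, and non-English documents are detected with a crude stopword-ratio heuristic. The function names, the stopword list, and the 0.15 threshold are all hypothetical.

```python
import hashlib
import re
from typing import Iterable, List

# Tiny illustrative stopword list; a real filter would use a proper
# language-identification method.
ENGLISH_STOPWORDS = {"the", "of", "and", "in", "to", "a", "is", "for", "we", "that"}


def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def looks_english(text: str, threshold: float = 0.15) -> bool:
    """Heuristic: enough of the words are common English stopwords."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return False
    common = sum(1 for word in words if word in ENGLISH_STOPWORDS)
    return common / len(words) >= threshold


def clean_collection(docs: Iterable[str]) -> List[str]:
    """Drop non-English documents and exact (normalized) duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        if not looks_english(doc):
            continue
        key = fingerprint(doc)
        if key in seen:
            continue  # same content already kept from another contributor
        seen.add(key)
        kept.append(doc)
    return kept


if __name__ == "__main__":
    docs = [
        "We study the dynamics of plasma in a tokamak.",
        "We  study the dynamics of PLASMA in a tokamak.",
        "Wir untersuchen die Dynamik von Plasma im Tokamak.",
    ]
    print(clean_collection(docs))
```

Exact-hash matching only catches documents that are identical after normalization. Identical papers posted by collaborators at different institutions often differ in formatting, so a production system would likely need near-duplicate detection rather than this simple fingerprint.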
Beyond the ability to perform live searches of "E-prints on Web Sites," the E-print Network uses special technology to create web pages where a user can browse lists of individual scientists who contribute E-prints; these pages are categorized by scientific discipline and are in alphabetical order by scientists' names. The special technology also automatically generates web pages for over 3,000 scientific societies and professional associations whose work is of interest to the US Department of Energy. Users can browse these web pages by language, or they can search for societies by language and/or discipline.
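As a rough illustration of how such browse pages could be generated, the Python sketch below groups contributors by discipline and sorts each group alphabetically by name. The helper names `build_browse_index` and `render_page`, the "Last, First" name format, and the plain-text page layout are assumptions for illustration, not a description of OSTI's software.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_browse_index(entries: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (name, discipline) pairs by discipline, names sorted A to Z."""
    by_discipline = defaultdict(set)
    for name, discipline in entries:
        by_discipline[discipline].add(name)  # a set drops repeated entries
    return {
        discipline: sorted(names, key=str.casefold)
        for discipline, names in sorted(by_discipline.items())
    }


def render_page(discipline: str, names: List[str]) -> str:
    """Render one browse page as plain text (a real site would emit HTML)."""
    lines = [discipline, "=" * len(discipline)]
    lines += [f"- {name}" for name in names]
    return "\n".join(lines)


if __name__ == "__main__":
    entries = [
        ("Curie, Marie", "Physics"),
        ("Bohr, Niels", "Physics"),
        ("Pauling, Linus", "Chemistry"),
        ("Bohr, Niels", "Physics"),
    ]
    for discipline, names in build_browse_index(entries).items():
        print(render_page(discipline, names))
        print()
```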
I hope that you appreciate that the sophistication of the E-print Network's components doesn't compromise its ease of use. Users can perform flexible searches, selecting individual collections or categories of collections; integrated search of "E-prints on Web Sites" is seamless; results are relevance ranked; and users can create alert profiles that cause the system to perform weekly searches on their behalf and email them new documents that match their queries.
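The alert feature can be pictured as a saved query plus a record of what the user has already been sent. The Python sketch below shows that idea only; `AlertProfile`, `run_alert`, the document identifiers, and the example address are hypothetical, and actually scheduling the weekly run and sending the email are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class AlertProfile:
    email: str
    query: str
    # Identifiers of documents this user has already been notified about.
    seen_ids: Set[str] = field(default_factory=set)


def run_alert(profile: AlertProfile, search: Callable[[str], List[str]]) -> List[str]:
    """Run the saved query and return only documents not reported before."""
    results = search(profile.query)
    # dict.fromkeys removes repeats within one result list, keeping order.
    new_ids = [
        doc_id for doc_id in dict.fromkeys(results)
        if doc_id not in profile.seen_ids
    ]
    profile.seen_ids.update(new_ids)
    return new_ids


if __name__ == "__main__":
    profile = AlertProfile(email="reader@example.org", query="plasma")
    # Two fake weekly result sets stand in for real searches.
    print(run_alert(profile, lambda q: ["doc-1", "doc-2"]))
    print(run_alert(profile, lambda q: ["doc-2", "doc-3"]))
```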
I have worked directly with Dr. Traylor for several years and I can attest to the passion that this man brings to the E-print Network. Also, I have proudly developed software that supports various components of the system's technology, and I look forward to describing them in more detail in upcoming articles.
Consultant to OSTI