by Sol Lederman on Wed, 5 Mar, 2008
In Part 1 of this series I provided an overview of the technology that drives the E-print Network. In this article I will provide some detail about how the harvested collection, the "E-prints on Web Sites" component of the E-print Network, is constructed. In Part 3, I will discuss the technology of the portion of the E-print Network that relies on federated search of databases.
In Part 1 I explained that the E-print Network combines federated sources searched in real-time with harvested content. The harvested content, consisting of over 1.3 million e-prints, is found by directing a crawler to 28,000 web sites belonging to scientists, researchers, and members of the academic community. In OSTI terminology, harvesting is synonymous with conducting a directed crawl of web sites.
Before we look at the technology behind the harvesting, let's consider why the content is harvested at all. Why not search the contributors' web sites in real-time in the same way that other collections are searched in real-time via federated search? There are several reasons. First, a large number of e-prints are not found in databases. They are predominantly stored as document files in web server directories. Accessing files stored this way is the job of a web crawler, not that of a federated search engine, because a crawler, once it locates the index page for a set of e-prints, easily harvests all e-prints referenced in that index page. Second, storage of e-print documents is highly decentralized. Simultaneously searching 28,000 web sites is neither practical nor necessary, given that the contents of most of these sites don't change in the very short term. Federated search works best where there are silos of information to mine. Third, specialized software performs a significant amount of processing to weed out documents that are not e-prints; it is not practical to perform this level of processing on live sets of documents.
One might also ask why we harvest at all, rather than search the contributors' web sites via free commercial search engines such as Google and Yahoo!, which crawl billions of web pages. The reason is that a crawl directed by scientists weeds out less relevant sites and focuses attention on quality sources at universities and professional societies.
Now that we understand the rationale for harvesting e-prints, we can look at how the harvesting takes place. A custom-built crawler crawls the list of web sites developed by E-print Director Dr. Dennis Traylor and his staff. The list consists of hand-picked sites selected because they contain content in subject areas of interest to DOE. Labor-intensive as it may be, this first step is crucial to ensuring the quality of the harvested collection. The crawler identifies documents of type PDF, PostScript, and DVI; nearly every e-print is published in one of these three formats. This first of two crawling phases produces a list of candidate e-prints. Special processing will shorten the list before the crawler proceeds to harvest documents.
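To make the first crawling phase concrete, here is a minimal sketch of how candidate e-prints might be picked out of an index page by file type. It is not the E-print Network's crawler: the function name `candidate_eprints`, the extension list, and the example URLs are assumptions made for illustration, and a real crawler would also need to fetch pages, follow links, and respect robots rules.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath

# Assumed file extensions for the three document types named above.
EPRINT_EXTENSIONS = {".pdf", ".ps", ".dvi"}


class LinkCollector(HTMLParser):
    """Collects absolute URLs from the <a href="..."> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the index page URL.
                    self.links.append(urljoin(self.base_url, value))


def candidate_eprints(index_html, base_url):
    """Return links on the page whose path ends in a candidate extension."""
    collector = LinkCollector(base_url)
    collector.feed(index_html)
    candidates = []
    for url in collector.links:
        extension = posixpath.splitext(urlparse(url).path)[1].lower()
        if extension in EPRINT_EXTENSIONS:
            candidates.append(url)
    return candidates
```

Called on the HTML of a researcher's publications page, `candidate_eprints` returns only the links that look like PDF, PostScript, or DVI files, which corresponds to the candidate list described above.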
The harvesting system crawls potential e-print web sites on a regular basis; continuous maintenance of the list results in new sites being added and some old sites being removed between crawls. An important time-saving task of the harvesting system is to eliminate candidate e-prints that already exist in the special collection. Thus, once the crawler's first pass produces a list of candidate documents based on their URLs, these URLs are compared to those of documents previously harvested and added to the collection. Duplicates are removed, the list of documents to retrieve is greatly shortened, and the second crawling phase can proceed much more quickly.
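The URL comparison described above can be sketched as a simple set lookup. This is an illustrative sketch only: the helper names `normalize` and `new_candidates` are hypothetical, and the light normalization (lowercasing the scheme and host, dropping the fragment) is an assumption rather than a description of what the harvesting system actually does.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase scheme and host and drop any #fragment (assumed rules)."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


def new_candidates(candidates, harvested):
    """Keep candidate URLs that are not already in the harvested collection.

    Order is preserved, and repeats within the candidate list are dropped.
    """
    seen = {normalize(url) for url in harvested}
    fresh = []
    for url in candidates:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            fresh.append(url)
    return fresh
```

Because membership tests on a set take roughly constant time, the comparison stays cheap even when the previously harvested collection is large, which is consistent with the time savings the second crawling phase depends on.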
The second crawling phase retrieves potential e-print documents. Special processing allows for later recrawling of sites that were unavailable during the first retrieval attempt. The text and other attributes of each potential e-print document are extracted and analyzed using proprietary heuristics to eliminate documents that are not likely to be e-prints. Additionally, a complex comparison is performed among retrieved documents to eliminate duplicates, including those that have different file names or are of different file types.
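The actual duplicate comparison is proprietary and more complex than anything shown here. As a simplified stand-in, the sketch below fingerprints each document's extracted text after normalizing case, punctuation, and whitespace, so that the same paper is recognized even when it appears under a different file name or file type. The function names and the choice of an exact hash are assumptions; an exact hash will miss near-duplicates that differ by even one word.

```python
import hashlib
import re


def text_fingerprint(text):
    """Hash the text after collapsing case, punctuation, and whitespace."""
    normalized = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(documents):
    """Keep the first document seen for each distinct text fingerprint.

    `documents` is a list of (url, extracted_text) pairs.
    """
    seen = set()
    unique = []
    for url, text in documents:
        fingerprint = text_fingerprint(text)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append((url, text))
    return unique
```

The key design point is that the comparison works on extracted text rather than on URLs or raw bytes, which is what makes it possible to match a PDF against a PostScript copy of the same paper.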
The final phase of the harvesting process is the creation of a searchable index. The text that was extracted from each document in the previous phase is used to build the index that will become the "E-prints on Web Sites" special collection. Coded into the index is information regarding collaborations. This information allows E-print users to discover multiple sources of a document, which is quite common given that researchers frequently collaborate with others in writing e-prints. A special interface to the index was also built so that a connector can easily search it together with the other E-print collections.
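To illustrate the idea of a searchable index that records multiple sources for a document, here is a toy inverted index. It is a sketch under stated assumptions, not the structure of the real "E-prints on Web Sites" index: the document layout (a `text` field and a `sources` list), the function names, and the all-terms-must-match search rule are invented for the example.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids whose text contains it.

    `documents` maps a document id to {"text": str, "sources": [urls]}.
    """
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for term in set(tokenize(doc["text"])):
            index[term].add(doc_id)
    return index


def search(index, documents, query):
    """Return (doc_id, sources) for documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return [(doc_id, documents[doc_id]["sources"]) for doc_id in sorted(matches)]
```

Keeping a list of source URLs alongside each indexed document is one simple way a single search hit could point to every site where collaborators have posted the same e-print.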
I hope that you have enjoyed this brief "under the hood" tour of the making of the E-print special collection index. In Part 3, the final part of this series, I will share some interesting "facts and figures" about the E-print Network federated collections, and discuss some of the unique features of the application.
Consultant to OSTI