Architecture: What Is Under the Hood
E-print Network allows the information patron to search multiple data sources with a single query from the user interface. While the user gets a seamless, Google-like search and retrieval experience, sophisticated Web technology is used behind the scenes. This sophistication is necessary because different researchers and research organizations use different mechanisms to publish e-prints. The publication mechanisms fall into two general categories. First, "papers" and articles can be published in databases or portals. Typically, such databases and portals have their own search engines, and they are often not readily crawled and indexed. Alternatively, "papers" and articles can be published as simple Web site documents. To make all of these documents searchable and retrievable, E-print Network combines federated search, which queries the databases and portals, with Web harvesting, which makes Web site documents searchable.
When the information patron enters a query in the search box, the query is sent to every individual database or portal searched by E-print Network. Each data source returns a list of results for the query to E-print Network. The information patron can review this hit list and travel to the host site of a particular hit for more detailed information.
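The fan-out described above can be sketched in a few lines of Python. This is only an illustration, not E-print Network's actual implementation: the `federated_search` function, the connector callables, and the hit fields (`title`, `url`, `source`) are all hypothetical names chosen for this example. It sends one query to every source in parallel and skips any source that fails.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List

# Hypothetical connector type: takes a query string, returns a list of hit dicts.
SourceSearch = Callable[[str], List[dict]]


def federated_search(query: str, sources: Dict[str, SourceSearch]) -> List[dict]:
    """Send one query to every source in parallel and collect all hits."""
    hits: List[dict] = []
    if not sources:
        return hits
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        # Map each pending search back to the name of the source it queries.
        futures = {pool.submit(search, query): name
                   for name, search in sources.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results = future.result()
            except Exception:
                # One failing source should not sink the whole query.
                continue
            for hit in results:
                # Tag each hit with its origin so the user can visit the host site.
                hits.append({**hit, "source": name})
    return hits


# Stand-in connectors (assumptions; real ones would call each portal's search engine).
def physics_portal(query: str) -> List[dict]:
    return [{"title": f"{query} in plasma", "url": "https://example.org/p/1"}]


def broken_portal(query: str) -> List[dict]:
    raise ConnectionError("source unavailable")


if __name__ == "__main__":
    print(federated_search("turbulence",
                           {"physics": physics_portal, "broken": broken_portal}))
```

Hits arrive in completion order, so the list is unordered until a later ranking step sorts it. A production system would presumably also enforce a per-source timeout, which this sketch omits.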
In addition to this federated approach, E-print Network searches an internally maintained index of harvested Web content. This internal index is restricted to the subject domain of E-print Network and to a specific set of Web addresses, or URLs. These URLs are pre-selected and screened before they are added to the internal E-print Network index.
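To make the idea of a screened internal index concrete, here is a minimal sketch assuming a simple in-memory inverted index. The `HarvestIndex` class and its methods are hypothetical, and the source does not describe how the real index is built. The one point the sketch tries to capture is that a document is indexed only if its URL is on the pre-approved list.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


class HarvestIndex:
    """Toy inverted index that accepts only pre-screened URLs (illustrative)."""

    def __init__(self, approved_urls: Set[str]) -> None:
        self.approved_urls = set(approved_urls)
        # term -> set of URLs whose text contains that term
        self.postings: Dict[str, Set[str]] = defaultdict(set)

    @staticmethod
    def tokenize(text: str) -> List[str]:
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, url: str, text: str) -> bool:
        """Index a harvested page; refuse URLs that were not screened."""
        if url not in self.approved_urls:
            return False
        for term in self.tokenize(text):
            self.postings[term].add(url)
        return True

    def search(self, query: str) -> Set[str]:
        """Return URLs whose text contains every query term."""
        terms = self.tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


if __name__ == "__main__":
    index = HarvestIndex({"https://example.edu/a"})
    index.add("https://example.edu/a", "Quantum error correction")
    index.add("https://spam.example.com/x", "quantum")  # rejected: not screened
    print(index.search("quantum correction"))
```

Screening at insertion time, rather than filtering at query time, keeps unvetted pages out of the index entirely.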
Whether the search results come via federated search or Web harvesting, E-print Network then ranks the hits and presents them to the user in relevance order.
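The source does not say how relevance is computed, so the following sketch substitutes a deliberately naive stand-in: the fraction of query terms that appear in a hit's title. The function names and the `title` and `url` fields are assumptions made for illustration. What it shows is the overall shape of the step: merge hits from both paths, drop duplicate URLs, and sort by score.

```python
import re
from typing import List


def score(query: str, hit: dict) -> float:
    """Fraction of distinct query terms found in the hit's title (illustrative only)."""
    q_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    t_terms = set(re.findall(r"[a-z0-9]+", hit.get("title", "").lower()))
    if not q_terms:
        return 0.0
    return len(q_terms & t_terms) / len(q_terms)


def merge_and_rank(query: str,
                   federated_hits: List[dict],
                   harvested_hits: List[dict]) -> List[dict]:
    """Combine hits from both search paths, deduplicate by URL, rank by score."""
    seen = set()
    merged = []
    for hit in federated_hits + harvested_hits:
        if hit["url"] in seen:
            continue  # the same document was found by both paths
        seen.add(hit["url"])
        merged.append(hit)
    # sorted() is stable, so equally scored hits keep their merge order.
    return sorted(merged, key=lambda h: score(query, h), reverse=True)


if __name__ == "__main__":
    federated = [{"title": "Dark matter halos", "url": "u1"},
                 {"title": "Matter", "url": "u2"}]
    harvested = [{"title": "Dark matter search results", "url": "u3"},
                 {"title": "Dark matter halos", "url": "u1"}]
    for hit in merge_and_rank("dark matter", federated, harvested):
        print(hit["url"], hit["title"])
```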
This process gives E-print Network some key advantages over general-purpose, crawler-based search engines. Federated search does not place any requirements or burdens on owners of the individual data sources, other than handling increased traffic. Federated searches are inherently as current as the individual data sources, because they are searched in real time. Web harvesting-based searches focus exclusively on quality e-print sources.