by Peter Lincoln on Mon, Oct 5, 2009
As OSTI Director Walt Warnick likes to say, today's Web is like the Model T Ford -- revolutionary but ready for vast improvement. This is especially true when it comes to making the Web work for science and technology. In that spirit I want to describe a new kind of Web Portal, one which has yet to be built. It is called the X-Portal.
An X-Portal provides comprehensive coverage for a specific science or technology community, where X refers to that community. In other words, an X-Portal for biofuels is a comprehensive biofuels portal. X = Neutron Science gives a comprehensive neutron science portal, and so on. There can be as many X-Portals as there are communities, but each has a similar design.
The need for X-Portals
The need for X-Portals is based on the fact that today's search engines and portals typically provide less than 5% coverage of any given science community. Today's Web portals and search engines, while revolutionary, are technologically immature and far from comprehensive. As a result they do relatively little to overcome the cognitive barrier of findability. One can usually find something relevant, but it is seldom the best thing out there. With 5% coverage the odds are 19 to 1 against finding the best accessible content. Moreover, if the coverage extends to a large number of other communities, as with Google, even that 5% may be swamped by hits from other communities.
There are two principal reasons for these deficiencies. First, today's portals try to be too broad, so they wind up being shallow. This means they capture only a small fragment of any given technical community. Second, because they are so broad, they cannot make use of the emerging technologies of federation, semantic analysis, mapping and visualization. These new technologies require a certain amount of analytical effort that is specific to each community. When the content is too broad, these technologies are prohibitively difficult to apply.
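The "19 to 1" figure follows directly from the 5% coverage estimate. As a minimal sketch, assuming the best item is equally likely to sit anywhere in the community's content (an assumption made here for illustration; the function name is hypothetical), the odds can be computed like this:

```python
from fractions import Fraction


def odds_against(coverage: Fraction) -> Fraction:
    """Odds against the best item falling inside the covered fraction.

    Assumes the best item is equally likely to be anywhere in the
    community's content (an illustrative assumption).
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in the interval (0, 1]")
    return (1 - coverage) / coverage


# 5% coverage: 95 chances of missing for every 5 chances of hitting.
print(odds_against(Fraction(5, 100)))  # 19
```

At 95% coverage the same calculation gives odds of 1 to 19 against, which is the gap the X-Portal is meant to close.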
Given this lack of coverage, the present state of affairs is that search engines and portals serve merely as starting points for complex, laborious navigational searches. The promise of the Web is simply not being realized. Thus a key concept in the X-Portal design is Comprehensive Community Coverage. In order to put everybody in touch with everybody else, everybody's content has to be included.
The technological challenges
This leads to the technological challenge of finding and including everybody's content. For example, what would it take to find and collect all the Web-accessible content of everyone working on developing and using biofuels, either nationally or globally? This is the X-Portal challenge. It is a labor-intensive effort that cannot be entirely automated. In particular, deep Web resources must be ferreted out and federated. Surface Web resources, such as individual author publication collections, are also hard to find.
However, there are emerging technologies for collecting and screening candidate Web resources. Moreover, much like building a library, once the X-Portal resources are found, collected and federated, the upkeep is relatively modest. What is needed is to develop standard, efficient methods for performing the comprehensive resource collection function. Since no one has ever tried to build a portal with comprehensive coverage, we don't know how to do it, but surely it can be done. The difference between 5% coverage and 95% coverage is enormous, but how do we get there?
Moreover, there are many emerging analytical techniques and technologies that promise to be extremely fruitful, given comprehensive community coverage. Consider mapping, for example. Mapping a random 5% of a community is not particularly useful. A highway map that showed only a 5% sample of roadways would be useless. It is unlikely that such a map would show a single through route. This is why Web mapping of scientific and technical communities is not generally useful today: the data is too sparse.
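The highway analogy can be made rough-and-ready quantitative. The sketch below assumes each road segment is included in the sample independently with the same probability, which is a simplification introduced here for illustration; the function name is hypothetical.

```python
def through_route_probability(sample_rate: float, segments: int) -> float:
    """Chance that every segment of one route appears in a random sample.

    Assumes segments are sampled independently at the same rate
    (an illustrative simplification, not a model from the article).
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    if segments < 1:
        raise ValueError("a route needs at least one segment")
    return sample_rate ** segments


# Compare a sparse 5% sample with near-complete 95% coverage
# for a route made of 10 consecutive segments.
for rate in (0.05, 0.95):
    print(rate, through_route_probability(rate, 10))
```

Under these assumptions a ten-segment route is almost never intact at 5% sampling but survives more often than not at 95%, which is the sense in which sparse maps fail to show connections.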
The OSTI model
The best existing model of the federation of Web content is a set of three portals developed by DOE's Office of Scientific and Technical Information. OSTI pioneered the basic federation technology used in these sites. The three portals are: E-print Network http://www.osti.gov/eprints, Science.gov http://www.science.gov, and Worldwidescience.org http://worldwidescience.org.
Each portal federates several dozen deep Web databases, plus a large number of surface websites, about 30,000 in the case of E-print Network. Moreover, the portals are nested. Worldwidescience.org federates national portals from many countries, including Science.gov. Science.gov federates numerous U.S. federal portals, including E-print Network.
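The nesting described above can be sketched in a few lines. This is only an illustration of the idea that a portal is itself a searchable source, not a description of how OSTI's software works; the class names, the substring matching, and all of the sample records are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    """A leaf source holding records directly (hypothetical sample data)."""
    name: str
    records: list

    def search(self, query: str) -> list:
        q = query.lower()
        return [r for r in self.records if q in r.lower()]


@dataclass
class Portal:
    """A federated source: forwards the query to every member source.

    Members may be databases or other portals, which gives the nesting.
    """
    name: str
    sources: list = field(default_factory=list)

    def search(self, query: str) -> list:
        seen, results = set(), []
        for source in self.sources:
            for hit in source.search(query):
                if hit not in seen:  # drop duplicates across sources
                    seen.add(hit)
                    results.append(hit)
        return results


# Invented records; only the nesting order mirrors the text.
eprints = Portal("E-print Network", [
    Database("author-site", ["Biofuel yields from algae",
                             "Neutron scattering notes"]),
])
science_gov = Portal("Science.gov", [
    eprints,
    Database("agency-db", ["Biofuel policy report",
                           "Biofuel yields from algae"]),
])
worldwide = Portal("Worldwidescience.org", [
    science_gov,
    Database("national-portal", ["Cellulosic biofuel pilot plant"]),
])

print(worldwide.search("biofuel"))
```

A query sent to the outermost portal reaches every database beneath it, and a record that appears in two sources is returned once.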
The OSTI system of nested federated portals demonstrates that the basic federation technology works. However, because the scope of each of these portals is all of science and technology, none achieves greater than 5% community coverage, if that. There are also significant technological issues related to scaling up the non-nested federation of deep Web databases beyond 50 or so. Nor do any of these portals attempt to make use of the emerging semantic, mapping and visualization tools, except for some crude clustering.
The community collection challenge
Perhaps the greatest challenge is developing technologies to support the comprehensive collecting of Web-accessible content. A technology community is a cluster of related activities. Different people are working on different aspects of the same problem, so the challenge is to find all of those people, many of whom will not know about one another, and aggregate their Web content into the X-Portal. The scope might be either national or global, the latter being much more difficult, but potentially much more important.
There are a large number of emerging semantic mapping tools that hold great promise for supporting this community aggregation job. In the scientific community these include citation and co-authorship mapping, relational thesauri and taxonomies, and possibly the semantic web. There are a host of other mapping techniques under development as well. See, for example, http://scimaps.org for a large collection of these.
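To make the co-authorship idea concrete, here is one hedged sketch of how such a map might help find community members: start from a few known authors and follow co-authorship links outward. The function names, the hop limit, and the author lists are all hypothetical; a real effort would need actual publication data and human screening of the candidates.

```python
from collections import deque


def build_coauthor_graph(papers):
    """Map each author to the set of people they have published with."""
    graph = {}
    for authors in papers:
        for author in authors:
            others = (a for a in authors if a != author)
            graph.setdefault(author, set()).update(others)
    return graph


def expand_community(graph, seeds, max_hops):
    """Breadth-first search outward from seed authors.

    Returns each reachable author with their distance, in co-authorship
    links, from the nearest seed. Authors beyond max_hops are left out.
    """
    distance = {s: 0 for s in seeds if s in graph}
    queue = deque(distance)
    while queue:
        author = queue.popleft()
        if distance[author] == max_hops:
            continue
        for coauthor in graph[author]:
            if coauthor not in distance:
                distance[coauthor] = distance[author] + 1
                queue.append(coauthor)
    return distance


# Invented author lists, one per paper.
papers = [["A", "B"], ["B", "C"], ["C", "D"], ["E", "F"]]
graph = build_coauthor_graph(papers)
print(expand_community(graph, seeds=["A"], max_hops=2))
```

Authors E and F never appear because no paper links them to the seed, which illustrates why co-authorship alone is unlikely to find everybody and why the other mapping techniques matter.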
The challenge is clear. Given that an X-Portal is designed to put everybody in a national or global science or technology community in touch with everybody else, how do we find everybody in the first place? Note by the way that this question can only now be posed, because the Web has only recently become sufficiently comprehensive. The X-Portal is a technology whose time has just come.