Summary: Scalable Information Organization
Javed Aslam, Fred Reiss, and Daniela Rus
Department of Computer Science
Hanover, NH 03755 USA
{jaa, frr, rus}@cs.dartmouth.edu
We present three scalable extensions of the star algorithm for information organization that use
sampling. The star algorithm organizes a document collection into clusters that are naturally induced
by the topic structure of the collection, via a computationally efficient cover by dense subgraphs. We
also provide supporting data from extensive experiments.
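To make "cover by dense subgraphs" concrete, the following is a minimal sketch of a greedy star cover, not the implementation evaluated here. It assumes that documents are vertices of a similarity graph that has already been thresholded, so that an edge joins two documents only when they are sufficiently similar, and that star centers are chosen in order of decreasing degree. The function name star_cover, the adjacency-map input, and the toy graph are illustrative assumptions; the sampling-based extensions are not shown.

```python
from typing import Dict, Set


def star_cover(adj: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Greedily cover a thresholded similarity graph with star-shaped clusters.

    adj maps each document id to the set of ids it is similar to
    (assumed symmetric, without self-loops). The result maps each
    chosen star center to its cluster: the center plus all its neighbors.
    """
    unmarked = set(adj)
    clusters: Dict[str, Set[str]] = {}
    # Visit vertices by decreasing degree; ties are broken by id so the
    # result is deterministic (an assumption made for this sketch).
    for v in sorted(adj, key=lambda u: (-len(adj[u]), u)):
        if v in unmarked:
            # v becomes a star center; its neighbors become satellites.
            clusters[v] = {v} | adj[v]
            unmarked -= clusters[v]
    return clusters


# Hypothetical toy graph: five documents, edges mean "similar enough".
graph = {
    "a": {"b", "c"},
    "b": {"a"},
    "c": {"a", "d"},
    "d": {"c"},
    "e": set(),
}
clusters = star_cover(graph)
```

In this sketch a satellite can belong to more than one star (document "c" above is adjacent to two centers), so the clusters form a cover of the collection rather than a partition.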
Our goal is to develop a completely automated information organization system for digital libraries,
automated tools for librarians to classify the information in such libraries, automated tools to create
reference pointers into such collections, and automated tools that allow users to locate information effectively.
We focus on static and dynamic digital collections of unstructured text. We consider the problem of
determining the topic structure of text data, without a priori knowledge of the number of topics in the
data or any other information about their composition. We assume that the collections may be static
(for example, digital legacy collections) or dynamic (for example, news wires). We look to discover