Aggregation of a Term Vocabulary for Peer-to-Peer Information Retrieval: a DHT Stress Test

Fabius Klemm and Karl Aberer
School of Computer and Communication Sciences
Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
{Fabius.Klemm, Karl.Aberer}@epfl.ch
Abstract. There has been increasing research interest in developing full-text
retrieval based on peer-to-peer (P2P) technology. So far, these research efforts
have largely concentrated on efficiently distributing an index. However, ranking
the results retrieved from the index is a crucial part of information retrieval.
To determine the relevance of a document to a query, ranking algorithms
use collection-wide statistics. Term frequency-inverse document frequency (TF-IDF),
for example, relies on the number of documents in the whole collection that
contain a given term. Such global frequencies are not readily available in a
distributed system. In this paper, we study the feasibility of aggregating global
frequencies for a large term vocabulary in a P2P setting. We use a distributed
hash table (DHT) for our analysis. Traditional applications of DHTs, such as file
sharing, index keys on the order of tens of thousands. Aggregating a vocabulary
consisting of millions of terms places extreme demands on a DHT implementation.
We study different aggregation strategies and propose optimizations to
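
To make the aggregation problem concrete, the following is a minimal sketch of how per-term document frequencies could be summed across peers. It is not the authors' method, and it does not reproduce the aggregation strategies studied in the paper. It assumes that a stable hash taken modulo the number of peers stands in for DHT routing, that documents are whitespace-tokenized strings, and that IDF is the common variant log(N / df). All function names are hypothetical.

```python
import hashlib
import math
from collections import Counter, defaultdict


def responsible_peer(term, num_peers):
    """Map a term to a peer index with a stable hash (assumed stand-in for DHT routing)."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_peers


def local_document_frequencies(documents):
    """Count, per term, how many of one peer's documents contain it."""
    df = Counter()
    for doc in documents:
        # set() so a term counts once per document, however often it occurs
        df.update(set(doc.lower().split()))
    return df


def aggregate_global_df(peer_collections):
    """Each peer sends its local counts to the peer responsible for each term."""
    num_peers = len(peer_collections)
    # stores[p] holds the summed counts for the terms that peer p is responsible for
    stores = [defaultdict(int) for _ in range(num_peers)]
    messages = 0
    for documents in peer_collections:
        for term, count in local_document_frequencies(documents).items():
            stores[responsible_peer(term, num_peers)][term] += count
            messages += 1  # one (term, count) message per distinct local term
    return stores, messages


def lookup_idf(term, stores, total_docs):
    """Fetch the global document frequency from the responsible peer, then compute IDF."""
    df = stores[responsible_peer(term, len(stores))].get(term, 0)
    return math.log(total_docs / df) if df else None


# Two peers with two documents each (illustrative data only)
peers = [
    ["p2p index retrieval", "dht index"],
    ["dht routing", "ranking retrieval index"],
]
stores, messages = aggregate_global_df(peers)
total_docs = sum(len(docs) for docs in peers)
print(messages, lookup_idf("index", stores, total_docs))
```

In this naive scheme the message count grows with the number of distinct terms held by each peer, which is why a vocabulary of millions of terms puts far more load on the DHT than a file-sharing workload does.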


