Collection Statistics for Fast Duplicate Document Detection
 

Summary: Collection Statistics for Fast Duplicate Document Detection
ABDUR CHOWDHURY, OPHIR FRIEDER, DAVID GROSSMAN,
and MARY CATHERINE McCABE
Illinois Institute of Technology
We present a new algorithm for duplicate document detection that uses collection statistics. We
compare our approach with the state-of-the-art approach using multiple collections. These collections
include a 30 MB collection of 18,577 web documents developed by Excite@Home and three NIST
collections. The first NIST collection consists of 100 MB of LA-Times documents (18,232 documents),
which is roughly similar in the number of documents to the Excite@Home collection. The other two
collections are both 2 GB: the 247,491-document web collection and the 528,023-document TREC
disks 4 and 5 collection. We show that our approach, called I-Match, scales in terms of the number of
documents and works well for documents of all sizes. We compared our solution to the state of the
art and found that, in addition to improved accuracy of detection, our approach executed in roughly
one-fifth the time.
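The abstract does not spell out how collection statistics enter the detection step. As an illustration only, the sketch below assumes one plausible reading: compute the inverse document frequency (idf) of every term across the collection, keep only the terms of each document whose idf clears a threshold, and hash the sorted set of kept terms, so that documents differing only in very common terms receive the same fingerprint. The function names, the tokenizer, the threshold value, and the choice of SHA-1 are all assumptions made for this example and are not taken from the text above; this is not the authors' implementation.

```python
import hashlib
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_table(docs):
    """Collection statistic: idf(term) = log(N / document frequency)."""
    n = len(docs)
    df = Counter()
    for text in docs.values():
        df.update(set(tokenize(text)))
    return {term: math.log(n / count) for term, count in df.items()}


def fingerprint(text, idf, min_idf):
    """Hash the sorted set of terms whose idf is at least min_idf (assumed filter)."""
    kept = sorted({t for t in tokenize(text) if idf.get(t, 0.0) >= min_idf})
    if not kept:
        return None  # nothing distinctive left to hash
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def duplicate_groups(docs, min_idf=0.5):
    """Group document ids that share a fingerprint; one pass after the idf table is built."""
    idf = idf_table(docs)
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        fp = fingerprint(text, idf, min_idf)
        if fp is not None:
            buckets[fp].append(doc_id)
    return [sorted(ids) for ids in buckets.values() if len(ids) > 1]


if __name__ == "__main__":
    sample = {
        "a": "The quick brown fox jumps over the lazy dog",
        "b": "Quick brown fox jumps over lazy dog",
        "c": "The stock market fell sharply today",
        "d": "A the quick market update",
    }
    print(duplicate_groups(sample))
```

In this toy collection, documents "a" and "b" differ only in the word "the", which occurs in most documents and so falls below the assumed idf threshold; both therefore hash to the same value. Because each document yields a single hash, grouping is one dictionary lookup per document rather than a pairwise comparison.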
1. INTRODUCTION
Data portals are everywhere. The tremendous growth of the Internet has
spurred the existence of data portals for nearly every topic. Some of these
portals are of general interest; some are highly domain specific. Independent of the
focus, the vast majority of the portals obtain data, loosely called documents,

  

Source: Argamon, Shlomo - Department of Computer Science, Illinois Institute of Technology

 

Collections: Computer Technologies and Information Sciences