Efficient Similarity Estimation for Systems Exploiting Data Redundancy
 

Kanat Tangwongsan (Carnegie Mellon University), Himabindu Pucha (IBM Research Almaden),
David G. Andersen (Carnegie Mellon University), Michael Kaminsky (Intel Labs Pittsburgh)
Abstract--Many modern systems exploit data redundancy to improve efficiency. These systems split data into chunks, generate an identifier for each chunk, and compare identifiers across data items to detect duplicate chunks. As a result, chunk size becomes a critical parameter for the efficiency of these systems: smaller chunks can improve similarity detection, but they increase the overhead of representing more chunks. Unfortunately, the similarity between files increases unpredictably with smaller chunk sizes, even for data of the same type. Existing systems often pick one chunk size that is "good enough" for many cases because they lack efficient techniques to determine the benefits at other chunk sizes. This paper addresses this deficiency via two contributions: (1) we present multi-resolution
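The abstract describes chunk-based duplicate detection: split data into chunks, hash each chunk to an identifier, and count matching identifiers across files. A minimal sketch of that idea, assuming fixed-size chunking and SHA-1 identifiers for illustration (the paper's own chunking scheme is not shown here):

```python
import hashlib


def chunk_ids(data: bytes, chunk_size: int) -> list[str]:
    """Split data into fixed-size chunks and return a hash identifier per chunk."""
    return [
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def shared_fraction(a: bytes, b: bytes, chunk_size: int) -> float:
    """Fraction of a's chunk identifiers that also appear among b's chunks."""
    ids_a = chunk_ids(a, chunk_size)
    ids_b = set(chunk_ids(b, chunk_size))
    return sum(1 for c in ids_a if c in ids_b) / len(ids_a)


a = b"hello world, hello world, hello again"
b = b"hello world, goodbye world, hello again"

# Smaller chunks expose more fine-grained sharing, but each chunk costs an
# identifier to store and compare -- the trade-off the abstract describes.
for size in (4, 8, 16):
    print(size, round(shared_fraction(a, b, size), 2))
```

Running the loop at several chunk sizes makes the trade-off concrete: the detected overlap generally changes as chunks shrink, while the number of identifiers to store grows in proportion.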

  

Source: Andersen, Dave - School of Computer Science, Carnegie Mellon University

 

Collections: Computer Technologies and Information Sciences