Summary: Efficient Similarity Estimation for Systems Exploiting Data Redundancy
Kanat Tangwongsan (1), Himabindu Pucha (2), David G. Andersen (1), Michael Kaminsky (3)
(1) Carnegie Mellon University, (2) IBM Research Almaden, (3) Intel Labs Pittsburgh
Abstract: Many modern systems exploit data redundancy to improve efficiency. These systems split data into chunks, generate an identifier for each chunk, and compare these identifiers against those of other data items to find duplicate chunks. As a result, chunk size becomes a critical parameter for the efficiency of these systems: it trades potentially improved similarity detection (smaller chunks) against the increased overhead of representing more chunks.
Unfortunately, the similarity between files increases unpredictably as chunk size shrinks, even for data of the same type. Existing systems often pick a single chunk size that is "good enough" for many cases because they lack efficient techniques for determining the benefit at other chunk sizes. This paper addresses this deficiency via two contributions: (1) we present multi-resolution