Title: A Short Survey of Document Structure Similarity Algorithms

Abstract

This paper provides a brief survey of document structural similarity algorithms, including the optimal Tree Edit Distance algorithm and various approximation algorithms. The approximation algorithms include the simple weighted tag similarity algorithm, Fourier transforms of the structure, and a new application of the shingle technique to structural similarity. We show three surprising results. First, the Fourier transform technique proves to be the least accurate of the approximation algorithms, while also being the slowest. Second, optimal Tree Edit Distance algorithms may not be the best technique for clustering pages from different sites. Third, the simplest approximation to structure may be the most effective and efficient mechanism for many applications.

Authors:
Buttler, D.
Publication Date:
2004-02-27
Research Org.:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
15013935
Report Number(s):
UCRL-CONF-202728
TRN: US200803%%983
DOE Contract Number:  
W-7405-ENG-48
Resource Type:
Conference
Resource Relation:
Conference: Presented at: The 5th International Conference on Internet Computing, Las Vegas, NV, United States, Jun 21 - Jun 24, 2004
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; ALGORITHMS; APPROXIMATIONS; INTERNET

Citation Formats

Buttler, D. A Short Survey of Document Structure Similarity Algorithms. United States: N. p., 2004. Web.
Buttler, D. A Short Survey of Document Structure Similarity Algorithms. United States.
Buttler, D. 2004. "A Short Survey of Document Structure Similarity Algorithms". United States. https://www.osti.gov/servlets/purl/15013935.
@article{osti_15013935,
title = {A Short Survey of Document Structure Similarity Algorithms},
author = {Buttler, D},
abstractNote = {This paper provides a brief survey of document structural similarity algorithms, including the optimal Tree Edit Distance algorithm and various approximation algorithms. The approximation algorithms include the simple weighted tag similarity algorithm, Fourier transforms of the structure, and a new application of the shingle technique to structural similarity. We show three surprising results. First, the Fourier transform technique proves to be the least accurate of the approximation algorithms, while also being the slowest. Second, optimal Tree Edit Distance algorithms may not be the best technique for clustering pages from different sites. Third, the simplest approximation to structure may be the most effective and efficient mechanism for many applications.},
url = {https://www.osti.gov/biblio/15013935},
place = {United States},
year = {2004},
month = {2}
}

