A Short Survey of Document Structure Similarity Algorithms
This paper provides a brief survey of document structural similarity algorithms, including the optimal Tree Edit Distance algorithm and various approximation algorithms. The approximation algorithms include the simple weighted tag similarity algorithm, Fourier transforms of the structure, and a new application of the shingle technique to structural similarity. We show three surprising results. First, the Fourier transform technique proves to be the least accurate of the approximation algorithms, while also being the slowest. Second, optimal Tree Edit Distance algorithms may not be the best technique for clustering pages from different sites. Third, the simplest approximation to structure may be the most effective and efficient mechanism for many applications.
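As a rough illustration of what applying shingles to structure can look like, the sketch below reduces an HTML page to its sequence of opening tags, forms overlapping k-tag windows (shingles), and compares two pages by the Jaccard overlap of their shingle sets. This record does not describe the paper's actual formulation, so everything here is an assumption made for illustration: the function names, the choice of opening tags only, the window size `k=3`, and the use of Jaccard similarity.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects opening tag names in document order; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names found in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def shingles(tags, k=3):
    """Return the set of overlapping k-tag windows (hypothetical default k=3)."""
    if len(tags) < k:
        # Too short for a full window: treat the whole sequence as one shingle.
        return {tuple(tags)} if tags else set()
    return {tuple(tags[i:i + k]) for i in range(len(tags) - k + 1)}


def shingle_similarity(html_a, html_b, k=3):
    """Jaccard similarity of the two pages' structural shingle sets (assumed measure)."""
    a = shingles(tag_sequence(html_a), k)
    b = shingles(tag_sequence(html_b), k)
    if not a and not b:
        return 1.0  # two pages with no tags are treated as structurally identical
    return len(a & b) / len(a | b)
```

Under these assumptions, two pages with the same tag layout but different text score 1.0, and pages sharing no k-tag window score 0.0. Because only tag order is compared, the cost is linear in page size, in contrast to the optimal Tree Edit Distance computation mentioned in the abstract.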
- Research Organization: Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
- Sponsoring Organization: USDOE
- DOE Contract Number: W-7405-ENG-48
- OSTI ID: 15013935
- Report Number(s): UCRL-CONF-202728; TRN: US200803%%983
- Resource Relation: Conference: Presented at: The 5th International Conference on Internet Computing, Las Vegas, NV, United States, Jun 21 - Jun 24, 2004
- Country of Publication: United States
- Language: English