Summary: A Tool for Computing the Visual Similarity of Web Pages
María Alpuente and Daniel Romero
DSIC-ELP, Universidad Politécnica de Valencia
Camino de Vera s/n, 46022 Valencia, Spain
{alpuente,dromero}@dsic.upv.es
Abstract--Recently, we proposed a functional technique for
identifying similar Web pages that is based on measuring tree
similarity. The key idea behind the method is to transform each
Web page into a compressed, normalized tree that effectively
represents its visual structure. In this work, we develop an
optimization of this technique that is based on memoization
and that significantly improves efficiency in both time and
space. This work also presents a tool that implements the
proposed technique, as well as two case studies drawn from
real scenarios. Experiments on real documents show
that the optimized algorithm performs significantly better than
the original technique and demonstrate the practicality of our
approach.
Keywords-Web page comparison; visual similarity; tree edit
distance; Web document clustering.
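
To make the idea of memoized tree comparison concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each page has already been reduced to an ordered, labeled tree written as a nested tuple (label, children), and it uses a textbook recursive forest edit distance with unit costs, cached with functools.lru_cache. The names size, forest_distance, tree_distance and similarity, as well as the normalization used in similarity, are hypothetical choices made for illustration only.

```python
from functools import lru_cache

# Assumption: a tree is a pair (label, children), where children is a
# tuple of trees. A forest is a tuple of trees. Tuples are hashable,
# which lets lru_cache memoize repeated subproblems.

def size(forest):
    """Count the nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f:
        return size(g)          # insert every node of g
    if not g:
        return size(f)          # delete every node of f
    (v_label, v_kids), (w_label, w_kids) = f[-1], g[-1]
    # Delete the root of the rightmost tree of f; its children move up.
    delete = forest_distance(f[:-1] + v_kids, g) + 1
    # Insert the root of the rightmost tree of g.
    insert = forest_distance(f, g[:-1] + w_kids) + 1
    # Match the two rightmost roots, relabeling if the labels differ.
    match = (forest_distance(v_kids, w_kids)
             + forest_distance(f[:-1], g[:-1])
             + (0 if v_label == w_label else 1))
    return min(delete, insert, match)

def tree_distance(t1, t2):
    """Edit distance between two single trees."""
    return forest_distance((t1,), (t2,))

def similarity(t1, t2):
    """Hypothetical normalization of the distance into [0, 1]."""
    total = size((t1,)) + size((t2,))
    return 1.0 - tree_distance(t1, t2) / total

# Two toy "pages" that differ by a single table element.
page_a = ("html", (("body", (("div", ()), ("table", ()))),))
page_b = ("html", (("body", (("div", ()),)),))
print(tree_distance(page_a, page_b), similarity(page_a, page_b))
```

Without the cache, the recursion revisits the same pairs of sub-forests many times; memoizing on the (f, g) pair computes each one once, which is the general effect that a memoization-based optimization aims for.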