Summary: An Investigation of Documents from the World Wide Web
Allison Woodruff
Paul M. Aoki
Eric Brewer
Paul Gauthier
Lawrence A. Rowe
Computer Science Division
University of California at Berkeley
Berkeley, CA 94720-1776
email: {woodruff,aoki,brewer,gauthier,rowe}@cs.berkeley.edu
Abstract:
We report on our examination of pages from the World Wide Web. We have analyzed data
collected by the Inktomi Web crawler; these data currently comprise over 2.6 million HTML
documents. We have examined many characteristics of these documents, including document
size; the number and types of tags, attributes, file extensions, protocols, and ports; the number
of in-links; and the ratio of document size to the number of tags and attributes. For a more
limited set of documents, we have also examined the number and types of syntax errors, as well
as readability scores. These data have been aggregated to create a number of ranked lists, e.g.,
the ten most-used tags and the ten most common HTML errors.
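
As an illustration of the kind of aggregation described above, the following is a minimal sketch of how tag and attribute counts, a top-ten ranking, and a size-to-markup ratio might be computed. It is not the tooling used in the study, which is not described here; it assumes Python's standard `html.parser` module, and the names `TagCounter`, `summarize`, and `size_to_markup_ratio` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags and attribute names in one HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.attrs = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        self.tags[tag] += 1
        for name, _value in attrs:
            self.attrs[name] += 1


def count_markup(doc):
    """Return (tag counts, attribute counts) for a single document."""
    parser = TagCounter()
    parser.feed(doc)
    parser.close()
    return parser.tags, parser.attrs


def summarize(documents, top_n=10):
    """Aggregate counts over many documents and return ranked lists."""
    all_tags, all_attrs = Counter(), Counter()
    for doc in documents:
        tags, attrs = count_markup(doc)
        all_tags.update(tags)
        all_attrs.update(attrs)
    return all_tags.most_common(top_n), all_attrs.most_common(top_n)


def size_to_markup_ratio(doc):
    """Document size in bytes divided by the number of tags plus attributes.

    Returns None when the document contains no markup at all.
    """
    tags, attrs = count_markup(doc)
    markup = sum(tags.values()) + sum(attrs.values())
    if markup == 0:
        return None
    return len(doc.encode("utf-8")) / markup
```

Calling `summarize(documents)` on a list of HTML strings returns the ten most-used tags and the ten most-used attributes with their counts. Only start tags are counted in this sketch; a different counting convention (for example, including end tags) would change the ratio.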