An Investigation of Documents from the World Wide Web

Summary: An Investigation of Documents from the World Wide Web
Allison Woodruff
Paul M. Aoki
Eric Brewer
Paul Gauthier
Lawrence A. Rowe
Computer Science Division
University of California at Berkeley
Berkeley, CA 94720-1776
email: {woodruff,aoki,brewer,gauthier,rowe}@cs.berkeley.edu
We report on our examination of pages from the World Wide Web. We have analyzed data
collected by the Inktomi Web crawler (this data currently comprises over 2.6 million HTML
documents). We have examined many characteristics of these documents, including: document
size; the number and types of tags, attributes, file extensions, protocols, and ports; the number of
in-links; and the ratio of document size to the number of tags and attributes. For a more limited set
of documents, we have also examined the number and types of syntax errors, as well as
readability scores. These data have been aggregated to create a number of ranked lists, e.g., the ten
most-used tags and the ten most common HTML errors.
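
To make the kind of aggregation described above concrete, the following is a minimal sketch, not the authors' tooling. It assumes Python's standard `html.parser` module as the tokenizer, and the names `TagCounter` and `summarize` are hypothetical. It tallies tags and attributes over a small set of documents, builds a ranked top-ten list, and computes one aggregate size-to-markup ratio over the whole set (the summary does not say how the original ratio was computed).

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Tally start tags and attribute names seen in one HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.attrs = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        self.tags[tag] += 1
        for name, _value in attrs:
            self.attrs[name] += 1


def summarize(documents, top_n=10):
    """Aggregate tag/attribute counts over an iterable of HTML strings.

    Hypothetical helper: returns ranked lists plus the ratio of total
    document size (UTF-8 bytes) to the total number of tags and attributes.
    """
    tags, attrs = Counter(), Counter()
    total_bytes = 0
    for html in documents:
        parser = TagCounter()
        parser.feed(html)
        parser.close()
        tags.update(parser.tags)
        attrs.update(parser.attrs)
        total_bytes += len(html.encode("utf-8"))
    markup_count = sum(tags.values()) + sum(attrs.values())
    return {
        "top_tags": tags.most_common(top_n),
        "top_attrs": attrs.most_common(top_n),
        "bytes_per_markup_item": (
            total_bytes / markup_count if markup_count else None
        ),
    }
```

Calling `summarize(pages)` on a list of HTML strings returns a dictionary whose `top_tags` entry is a list of `(tag, count)` pairs in descending order of frequency. Syntax-error and readability analysis, which the summary also mentions, are outside the scope of this sketch.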


Source: Aoki, Paul M. - Intel Research Berkeley
California at Irvine, University of - Department of Information and Computer Science, Database Research Group
Palo Alto Research Center (PARC), User Interface Research


Collections: Computer Technologies and Information Sciences