U.S. Department of Energy
Office of Scientific and Technical Information

Hashes are not suitable to verify fixity of the public archived web

Journal Article · PLoS ONE
 [1];  [2];  [3];  [4];  [5];  [5]
  1. Columbia College, Columbia, MO (United States)
  2. Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
  3. Data Archiving and Networked Services, The Hague (Netherlands)
  4. Internet Archive, San Francisco, CA (United States)
  5. Old Dominion Univ., Norfolk, VA (United States)

Web archives, such as the Internet Archive, preserve the web and allow access to prior states of web pages. We implicitly trust their versions of archived pages, but as their role moves from preserving curios of the past to facilitating present day adjudication, we are concerned with verifying the fixity of archived web pages, or mementos, to ensure they have always remained unaltered. A widely used technique in digital preservation to verify the fixity of an archived resource is to periodically compute a cryptographic hash value on a resource and then compare it with a previous hash value. If the hash values generated on the same resource are identical, then the fixity of the resource is verified. We tested this process by conducting a study on 16,627 mementos from 17 public web archives. We replayed and downloaded the mementos 39 times using a headless browser over a period of 442 days and generated a hash for each memento after each download, resulting in 39 hashes per memento. The hash is calculated by including not only the content of the base HTML of a memento but also all embedded resources, such as images and style sheets. We expected to always observe the same hash for a memento regardless of the number of downloads. However, our results indicate that 88.45% of mementos produce more than one unique hash value, and about 16% (or one in six) of those mementos always produce different hash values. We identify and quantify the types of changes that cause the same memento to produce different hashes. These results point to the need for defining an archive-aware hashing function, as conventional hashing functions are not suitable for replayed archived web pages.
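The hash-and-compare fixity check described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which hash function was used or how the base HTML and embedded resources were combined, so SHA-256, the sorted-URL ordering, the length prefixes, and the names `composite_hash` and `fixity_verified` are all assumptions made for illustration.

```python
import hashlib
from typing import Iterable, Mapping


def composite_hash(base_html: bytes, embedded: Mapping[str, bytes]) -> str:
    """Hash a memento's base HTML together with its embedded resources.

    Assumption: SHA-256, with resources fed in sorted-URL order so that
    download order alone does not change the result.
    """
    h = hashlib.sha256()
    # Length-prefix every part so that part boundaries are unambiguous.
    h.update(len(base_html).to_bytes(8, "big"))
    h.update(base_html)
    for url in sorted(embedded):
        for part in (url.encode("utf-8"), embedded[url]):
            h.update(len(part).to_bytes(8, "big"))
            h.update(part)
    return h.hexdigest()


def fixity_verified(hashes: Iterable[str]) -> bool:
    """Fixity holds only if every download produced the same hash."""
    return len(set(hashes)) == 1


# Three hypothetical downloads of the same memento (made-up content).
html = b"<html><body>archived page</body></html>"
download_1 = composite_hash(html, {"style.css": b"body{}", "logo.png": b"\x89PNG"})
# Same resources, fetched in a different order.
download_2 = composite_hash(html, {"logo.png": b"\x89PNG", "style.css": b"body{}"})
# One embedded resource came back empty on this replay.
download_3 = composite_hash(html, {"style.css": b"body{}", "logo.png": b""})

stable = fixity_verified([download_1, download_2])
unstable = fixity_verified([download_1, download_2, download_3])
```

Under these assumptions `stable` is true and `unstable` is false: a single embedded resource that replays differently is enough to change the composite hash, even though the base HTML is byte-identical across downloads.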

Research Organization:
Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
Sponsoring Organization:
USDOE
Grant/Contract Number:
89233218CNA000001
OSTI ID:
2406553
Report Number(s):
LA-UR--23-25690
Journal Information:
PLoS ONE, Vol. 18, Issue 6; ISSN 1932-6203
Publisher:
Public Library of Science
Country of Publication:
United States
Language:
English
