Title: Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content

Journal Article · PLoS ONE
[1]; [1]; [1]; [1]; [2]; [2]
  1. Los Alamos National Lab. (LANL), Los Alamos, NM (United States). Research Library. Digital Library Research and Prototyping Team
  2. Univ. of Edinburgh, Scotland (United Kingdom). Language Technology Group

Increasingly, scholarly articles contain URI references to “web at large” resources including project web sites, scholarly wikis, ontologies, online debates, presentations, blogs, and videos. Authors reference such resources to provide essential context for the research they report on. A reader who visits a web at large resource by following a URI reference in an article, some time after its publication, is led to believe that the resource’s content is representative of what the author originally referenced. However, due to the dynamic nature of the web, that may very well not be the case. We reuse a dataset from a previous study in which several authors of this paper were involved, and investigate to what extent the textual content of web at large resources referenced in a vast collection of Science, Technology, and Medicine (STM) articles published between 1997 and 2012 has remained stable since the publication of the referencing article. We do so in a two-step approach that relies on various well-established similarity measures to compare textual content. In a first step, we use 19 web archives to find snapshots of referenced web at large resources that have textual content that is representative of the state of the resource around the time of publication of the referencing paper. We find that representative snapshots exist for about 30% of all URI references. In a second step, we compare the textual content of representative snapshots with that of their live web counterparts. We find that for over 75% of references the content has drifted away from what it was when referenced. These results raise significant concerns regarding the long-term integrity of the web-based scholarly record and call for the deployment of techniques to combat these problems.
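
The second step described in the abstract, comparing the text of an archived snapshot with its live web counterpart, can be illustrated with a minimal sketch. The abstract does not name the similarity measures or any thresholds, so everything here is an assumption made for illustration: the choice of the Jaccard and Sørensen-Dice coefficients, the simple word tokenizer, the function names, and the rule that any score below 1.0 counts as drift.

```python
import re


def tokens(text):
    """Return the set of lowercase alphanumeric word tokens (a deliberately simple tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard coefficient on token sets: |A ∩ B| / |A ∪ B|."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


def dice(a, b):
    """Sørensen-Dice coefficient on token sets: 2|A ∩ B| / (|A| + |B|)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def has_drifted(snapshot_text, live_text, threshold=1.0):
    """Hypothetical decision rule: report drift if any measure falls below the threshold."""
    scores = (jaccard(snapshot_text, live_text), dice(snapshot_text, live_text))
    return min(scores) < threshold


if __name__ == "__main__":
    snapshot = "Project home page with software downloads"
    live = "This domain is for sale"
    print(jaccard(snapshot, live), dice(snapshot, live), has_drifted(snapshot, live))
```

Set-based measures like these ignore word order and repetition, which is why a study of this kind would plausibly combine several measures rather than rely on one.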

Research Organization:
Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
Grant/Contract Number:
AC52-06NA25396
OSTI ID:
1627808
Journal Information:
PLoS ONE, Vol. 11, Issue 12; ISSN 1932-6203
Publisher:
Public Library of Science
Country of Publication:
United States
Language:
English

Cited By (3)

Identifying PIDs playing FAIR journal November 2019
Geo-DMP: A DTN-Based Mobile Prototype for Geospatial Data Retrieval journal December 2019
Verified, Shared, Modular, and Provenance Based Research Communication with the Dat Protocol journal June 2019