U.S. Department of Energy
Office of Scientific and Technical Information

Data Science and Machine Learning for Genome Security

Technical Report · DOI: https://doi.org/10.2172/1855003 · OSTI ID: 1855003
This report describes research that uses data science and machine learning methods to distinguish targeted genome editing from natural mutation and sequencer noise. Genome editing capabilities have existed for more than 20 years, and the efficiency of these techniques has improved dramatically in the last 5+ years, notably with the rise of CRISPR-Cas technology. Whether a specific genome has been the target of an edit is a concern for U.S. national security, and the research detailed in this report provides first steps toward addressing it. Our research requires a large amount of data, so we invested considerable time in collecting and processing it. We use an ensemble of decision tree and deep neural network machine learning methods, together with anomaly detection, to detect genome edits given either whole-exome or whole-genome DNA reads. When tested against samples held out during training, our algorithms detect edits significantly better than random guessing, achieving high F1 and recall scores as well as high precision overall.
Research Organization:
Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
DOE Contract Number:
NA0003525
OSTI ID:
1855003
Report Number(s):
SAND2021-12015; 700525
Country of Publication:
United States
Language:
English

Similar Records

Epigenetic Footprints of CRISPR/Cas9-Mediated Genome Editing in Plants
Journal Article · Thu Jan 30 19:00:00 EST 2020 · Frontiers in Plant Science · OSTI ID:1606824

CROPSR: an automated platform for complex genome-wide CRISPR gRNA design and validation
Journal Article · Tue Feb 15 19:00:00 EST 2022 · BMC Bioinformatics · OSTI ID:1980954