Data Science and Machine Learning for Genome Security
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
This report describes research that uses data science and machine learning methods to distinguish targeted genome editing from natural mutation and sequencing-machine noise. Genome editing capabilities have existed for more than 20 years, and the efficiency of these techniques has improved dramatically in the last 5+ years, notably with the rise of CRISPR-Cas technology. Whether or not a specific genome has been the target of an edit is a concern for U.S. national security. The research detailed in this report provides first steps toward addressing this concern. Our research requires a large amount of data, so we invested considerable time in collecting and processing it. We use an ensemble of decision tree and deep neural network machine learning methods, together with anomaly detection, to detect genome edits given either whole-exome or whole-genome DNA reads. Tested against samples held out during training, our algorithms detect edits significantly better than random guessing, achieving high F1, recall, and precision scores overall.
- Research Organization:
- Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)
- Sponsoring Organization:
- USDOE National Nuclear Security Administration (NNSA)
- DOE Contract Number:
- NA0003525
- OSTI ID:
- 1855003
- Report Number(s):
- SAND2021-12015; 700525
- Country of Publication:
- United States
- Language:
- English
Similar Records
Epigenetic Footprints of CRISPR/Cas9-Mediated Genome Editing in Plants
Journal Article · Thu Jan 30 19:00:00 EST 2020 · Frontiers in Plant Science · OSTI ID:1606824
CROPSR: an automated platform for complex genome-wide CRISPR gRNA design and validation
Journal Article · Tue Feb 15 19:00:00 EST 2022 · BMC Bioinformatics · OSTI ID:1980954