DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: A FAIR and AI-ready Higgs boson decay dataset

Journal Article · Scientific Data

Abstract: To enable the reusability of massive scientific datasets by humans and machines, researchers aim to adhere to the principles of findability, accessibility, interoperability, and reusability (FAIR) for data and artificial intelligence (AI) models. This article provides a domain-agnostic, step-by-step assessment guide to evaluate whether or not a given dataset meets these principles. We demonstrate how to use this guide to evaluate the FAIRness of an open simulated dataset produced by the CMS Collaboration at the CERN Large Hadron Collider. This dataset consists of Higgs boson decays and quark and gluon background, and is available through the CERN Open Data Portal. We use additional available tools to assess the FAIRness of this dataset, and incorporate feedback from members of the FAIR community to validate our results. This article is accompanied by a Jupyter notebook to visualize and explore this dataset. This study marks the first in a planned series of articles that will guide scientists in the creation of FAIR AI models and datasets in high energy particle physics.

Sponsoring Organization:
USDOE
Grant/Contract Number:
SC0021258; SC0021396; SC0021225; SC0021395
OSTI ID:
1845043
Journal Information:
Scientific Data, Vol. 9, Issue 1; ISSN 2052-4463
Publisher:
Nature Publishing Group
Country of Publication:
United Kingdom
Language:
English
