U.S. Department of Energy
Office of Scientific and Technical Information

AI Benchmark Democratization and Carpentry

Journal Article · No journal information
OSTI ID:3008660
Benchmarks are a cornerstone of modern machine learning, enabling reproducibility, comparison, and scientific progress. However, AI benchmarks are increasingly complex, requiring dynamic, AI-focused workflows. Rapid evolution in model architectures, scale, datasets, and deployment contexts makes evaluation a moving target. Large language models often memorize static benchmarks, opening a gap between benchmark results and real-world performance. Beyond traditional static benchmarks, continuous, adaptive benchmarking frameworks are needed to align scientific assessment with deployment risks. This calls for skills and education in AI Benchmark Carpentry. Our experience with MLCommons, educational initiatives, and programs such as the DOE's Trillion Parameter Consortium points to key barriers: high resource demands, limited access to specialized hardware, a lack of benchmark design expertise, and uncertainty in relating results to application domains. Current benchmarks often emphasize peak performance on top-tier hardware, offering limited guidance for diverse, real-world scenarios. Benchmarking must become dynamic, incorporating evolving models, updated data, and heterogeneous platforms while maintaining transparency, reproducibility, and interpretability. Democratization requires both technical innovation and systematic education across levels, building sustained expertise in benchmark design and use. Benchmarks should support application-relevant comparisons, enabling informed, context-sensitive decisions. Dynamic, inclusive benchmarking will ensure that evaluation keeps pace with AI evolution and supports responsible, reproducible, and accessible AI deployment. Community efforts can provide a foundation for AI Benchmark Carpentry.
Research Organization:
Helmholtz-Zentrum, Berlin; Boston U.; Virginia U.; Illinois U., Urbana; MIT, Lincoln Lab; Johannesburg U.; Google Inc.; McGill U.; UC, San Diego; Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States); Wisconsin U., Madison; Texas U., El Paso; Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States); Unlisted; Harvard U.; INFN, Padua; Duke U.; Microsoft, Redmond; Rutherford; Argonne National Laboratory (ANL), Argonne, IL (United States); Haimson Res., Santa Clara; New York U.; Christian Brothers U.; Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States); Prince Mohammad U., Al Khobar
Sponsoring Organization:
US Department of Energy
DOE Contract Number:
89243024CSC000002
OSTI ID:
3008660
Report Number(s):
FERMILAB-PUB-25-0835-CSAID; oai:inspirehep.net:3093130; arXiv:2512.11588
Journal Information:
No journal information
Country of Publication:
United States
Language:
English

Similar Records

Intern-Artificial Intelligence Benchmarking
Journal Article · Jan 19, 2026 · No journal information · OSTI ID:3014039

An MLCommons Scientific Benchmarks Ontology
Journal Article · Nov 5, 2025 · No journal information · OSTI ID:3004873

Democratizing uncertainty quantification
Journal Article · Oct 29, 2024 · Journal of Computational Physics · OSTI ID:2584646
