
MLCommons Science Benchmarks

Conference
DOI: https://doi.org/10.2172/3019259 · OSTI ID: 3019259
Benchmarks are a cornerstone of modern machine learning practice, providing standardized evaluations that enable reproducibility, comparison, and scientific progress. Yet, as AI systems, particularly deep learning models, become increasingly dynamic, traditional static benchmarking approaches are losing their relevance. Models rapidly evolve in architecture, scale, and capability; datasets shift; and deployment contexts continuously change, creating a moving target for evaluation. Without adaptive benchmarking frameworks, both scientific assessment and real-world deployment risk becoming misaligned with actual system behavior.

Drawing on our experience from MLCommons, educational initiatives, and government programs such as the DOE's Million Parameter Consortium, we identify key barriers that hinder the broader adoption and utility of benchmarking in AI. These include substantial resource demands, limited access to specialized hardware, lack of expertise in benchmark design, and uncertainty among practitioners about how to relate benchmark results to their own application domains. Moreover, current benchmarks often emphasize peak performance on leadership-class hardware, offering limited guidance for more diverse, real-world deployment scenarios.

We argue that benchmarking itself must become dynamic, incorporating evolving models, updated data, and heterogeneous computational platforms while maintaining transparency, reproducibility, and interpretability. Democratizing this process requires not only technical innovation but also systematic educational efforts, spanning undergraduate to professional levels, to develop sustained expertise in benchmark design and use. Finally, benchmarks should be framed and communicated to support application-relevant comparisons, enabling both developers and users to make informed, context-sensitive decisions. Advancing dynamic and inclusive benchmarking practices will be essential to ensure that evaluation keeps pace with the evolving AI landscape and supports responsible, reproducible, and accessible AI deployment.
Research Organization:
Illinois U., Urbana; Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States); Virginia U.; Cornell U.
Sponsoring Organization:
US Department of Energy
DOE Contract Number:
89243024CSC000002
OSTI ID:
3019259
Report Number(s):
FERMILAB-POSTER-25-0223-CSAID-STUDENT; oai:inspirehep.net:2965397
Resource Type:
Conference poster
Country of Publication:
United States
Language:
English

Similar Records

AI Benchmark Democratization and Carpentry
Journal Article · December 2025 · OSTI ID: 3008660

Intern-Artificial Intelligence Benchmarking
Journal Article · January 2026 · OSTI ID: 3014039

An MLCommons Scientific Benchmarks Ontology
Journal Article · November 2025 · OSTI ID: 3004873
