
MLCommons Science Benchmarks

Conference
DOI: https://doi.org/10.2172/3019259 · OSTI ID: 3019259
Benchmarks are a cornerstone of modern machine learning practice, providing standardized evaluations that enable reproducibility, comparison, and scientific progress. Yet, as AI systems, particularly deep learning models, become increasingly dynamic, traditional static benchmarking approaches are losing their relevance. Models rapidly evolve in architecture, scale, and capability; datasets shift; and deployment contexts continuously change, creating a moving target for evaluation. Without adaptive benchmarking frameworks, both scientific assessment and real-world deployment risk becoming misaligned with actual system behavior.

Drawing on our experience from MLCommons, educational initiatives, and government programs such as the DOE's Million Parameter Consortium, we identify key barriers that hinder the broader adoption and utility of benchmarking in AI. These include substantial resource demands, limited access to specialized hardware, lack of expertise in benchmark design, and uncertainty among practitioners about how to relate benchmark results to their own application domains. Moreover, current benchmarks often emphasize peak performance on leadership-class hardware, offering limited guidance for more diverse, real-world deployment scenarios.

We argue that benchmarking itself must become dynamic, incorporating evolving models, updated data, and heterogeneous computational platforms while maintaining transparency, reproducibility, and interpretability. Democratizing this process requires not only technical innovation but also systematic educational efforts, spanning undergraduate to professional levels, to develop sustained expertise in benchmark design and use. Finally, benchmarks should be framed and communicated to support application-relevant comparisons, enabling both developers and users to make informed, context-sensitive decisions. Advancing dynamic and inclusive benchmarking practices will be essential to ensure that evaluation keeps pace with the evolving AI landscape and supports responsible, reproducible, and accessible AI deployment.
Research Organization:
Illinois U., Urbana; Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States); Virginia U.; Cornell U.
Sponsoring Organization:
US Department of Energy
DOE Contract Number:
89243024CSC000002
OSTI ID:
3019259
Report Number(s):
FERMILAB-POSTER-25-0223-CSAID-STUDENT; oai:inspirehep.net:2965397
Resource Type:
Conference poster
Country of Publication:
United States
Language:
English

Similar Records

AI Benchmark Democratization and Carpentry
Journal Article · December 2025 · OSTI ID: 3008660

Intern-Artificial Intelligence Benchmarking
Journal Article · January 2026 · OSTI ID: 3014039

An MLCommons Scientific Benchmarks Ontology
Journal Article · November 2025 · OSTI ID: 3004873
