MLCommons Announces Expansion of Industry-Leading AILuminate Benchmark https://mlcommons.org/2025/05/nasscom/ Thu, 29 May 2025 15:10:24 +0000 World leader in AI benchmarking announces new partnership with India’s NASSCOM; updated reliability grades for leading LLMs

MLCommons today announced that it is expanding its first-of-its-kind AILuminate benchmark to measure AI reliability across new models, languages, and tools. As part of this expansion, MLCommons is partnering with NASSCOM, India’s premier technology trade association, to bring AILuminate’s globally recognized AI reliability benchmarks to South Asia. MLCommons is also unveiling new proof of concept testing for AILuminate’s Chinese-language capabilities and new AILuminate reliability grades for an expanded suite of large language models (LLMs). 

“We’re looking forward to working with NASSCOM to develop India-specific, Hindi-language benchmarks and ensure companies in India and around the world can better measure the reliability and risk of their AI products,” said Peter Mattson, President of MLCommons. “This partnership, along with new AILuminate grades and proof of concept for Chinese language capabilities, represents a major step towards the development of globally inclusive industry standards for AI reliability.”

“The rapid development of AI is reshaping India’s technology sector and, in order to harness risk and foster innovation, rigorous global standards can help align the growth of the industry with emerging best practices,” said Ankit Bose, Head of NASSCOM AI. “We plan to work alongside MLCommons to develop these standards and ensure that the growth and societal integration of AI technology continues responsibly.”

The NASSCOM collaboration builds on MLCommons’ intentionally global approach to AI benchmarking. Modeled after MLCommons’ ongoing partnership with Singapore’s AI Verify Foundation, the NASSCOM partnership will help to meet South Asia’s urgent need for standardized AI benchmarks that are collaboratively designed and trusted by the region’s industry experts, policymakers, civil society members, and academic researchers. MLCommons’ partnership with the AI Verify Foundation – in close collaboration with the National University of Singapore – has already resulted in significant progress towards globally-inclusive AI benchmarking across East Asia, including just-released proof of concept scores for Chinese-language LLMs.

AILuminate is also unveiling new reliability grades for an updated and expanded suite of LLMs, to help companies around the world better measure product risk. Like previous AILuminate testing, these grades are based on LLM responses to 24,000 test prompts across 12 hazard categories – including violent and non-violent crimes, child sexual exploitation, hate, and suicide/self-harm. None of the LLMs evaluated were given any advance knowledge of the evaluation prompts (a common problem in non-rigorous benchmarking), nor access to the evaluator model used to assess responses. This independence provides a methodological rigor uncommon in standard academic research or private benchmarking.

“Companies are rapidly incorporating chatbots into their products, and these updated grades will help them better understand and compare risk across new and constantly-updated models,” said Rebecca Weiss, Executive Director of MLCommons. “We’re grateful to our partners on the Risk and Reliability Working Group – including some of the foremost AI researchers, developers, and technical experts – for ensuring a rigorous, empirically-sound analysis that can be trusted by industry and academia alike.”

Having successfully expanded the AILuminate benchmark to multiple languages, the AI Risk & Reliability Working Group is beginning the process of evaluating reliability across increasingly sophisticated AI tools, including multi-modal LLMs and agentic AI. We hope to announce proof-of-concept benchmarks in these areas later this year.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI risk and reliability.

MLCommons MLPerf Training Expands with Llama 3.1 405B https://mlcommons.org/2025/05/training-llama31405b/ Mon, 05 May 2025 15:43:00 +0000 MLCommons adds new pretraining benchmark for testing large-scale systems

About MLPerf Training

Pretraining Large Language Models (LLMs) is the initial phase of LLM training, in which the model learns to understand and generate human-like text by training on vast amounts of unstructured text data. A pre-trained model can then be fine-tuned on domain-specific datasets, which gives the model knowledge of specific tasks. Reasoning approaches have also become popular recently, helping models better explain their thinking process. Because pre-training uses large amounts of data, it is generally the most compute-intensive phase of LLM training.

MLCommons® added a GPT3 Pretraining benchmark to MLPerf® Training in 2022 to help academics and industry experts optimize LLM pretraining. It has been a huge success: we have seen over 3x speedups on this benchmark for large systems (over 10,000 accelerators).

Over this same period, newer state-of-the-art models have demonstrated greater scale and many architectural advancements. For example, Google’s PaLM model, released in 2022, contains 540B parameters, roughly three times GPT3’s 175B parameter count. Meta’s Llama 2 and Llama 3 series, released in 2023 and 2024 respectively, utilize newer algorithms such as Root Mean Square Layer Normalization (RMSNorm, Zhang et al., https://arxiv.org/abs/1910.07467), Rotary Position Embedding (RoPE, Su et al., 2024), and Grouped Query Attention (GQA, Ainslie et al., 2023).
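To make one of these architectural changes concrete, below is a minimal PyTorch sketch of RMSNorm as described by Zhang et al.; it is illustrative only and not taken from the MLPerf reference implementation.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization (Zhang et al., 2019).

    Unlike LayerNorm, RMSNorm skips mean-centering and rescales activations
    by their root-mean-square, which is cheaper to compute at large scale.
    """
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

print(RMSNorm(dim=8)(torch.randn(2, 8)).shape)  # torch.Size([2, 8])
```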

An MLPerf Training working group task force was created to investigate a July 2024 proposal to replace GPT3 with Llama 3.1 405B. Task force members evaluated the proposal and industry options to recommend the best new pretraining reference. They agreed that Llama 3.1 405B would best represent the current state of the art in pretraining and adopted it as a new benchmark. Details on the review and selection process are included in this blog.

Model selection

The most important factor the task force considered when constructing this new pretraining benchmark was the model architecture. As mentioned above, the GPT3 architecture does not include some of the recent algorithmic updates and therefore is not competitive with other state-of-the-art models such as Nemotron-4, GPT4.5, and Claude 3.7 Sonnet. Meta’s Llama 3.1 405B model, however, has an architecture and scale similar to these top-tier models and demonstrates on-par performance across multiple quality benchmarks, making it a competitive and representative choice among current state-of-the-art models.

Choosing a model with high community engagement was also an important consideration, and Llama 3.1 405B is the perfect candidate on that basis. With more than 300 million total downloads of all Llama versions to date, the Llama family of models is very widely adopted. The MLPerf community, for instance, has already included Llama 2 in the LoRA benchmark. Additionally, Meta’s published technical paper details the training process, model architecture, dataset, and hyperparameters, which allowed us to easily create a high-quality benchmark based on their work.

One other requirement for the new MLPerf Training benchmark is that the model architecture must be publicly accessible so that all submitters can download it and reproduce the results on their end. This typically requires a permissive license. Thanks to Meta’s support, the current Llama 3.1 405B model is now available to all MLCommons members for benchmarking. 

Dataset selection

To help submitters transition easily from GPT3 to Llama 3.1 405B, we use the same AllenAI C4-en dataset (Colossal Clean Crawled Corpus) in both benchmarks. This dataset contains 365M English-language paragraphs scraped from the internet, cleaned and preprocessed into a convenient JSON format for training the model.
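For readers who want to inspect the same data outside the benchmark harness, the sketch below streams the AllenAI C4-en split via the Hugging Face datasets library; this is a convenience path for exploration, not the reference implementation's data pipeline.

```python
from datasets import load_dataset

# Stream the AllenAI C4 English split instead of downloading all ~365M
# paragraphs up front; each record is a JSON object with "text", "url",
# and "timestamp" fields.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for example in c4.take(3):
    print(example["text"][:80])
```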

One key difference is that Llama 3.1 405B uses a context length of 8,192 tokens, 4x larger than GPT3’s 2,048-token context. The context length is the amount of text a model can process at once, so a larger context length allows the model to capture longer-range dependencies in the text.

Benchmark technical details

The reference code is implemented in the NVIDIA NeMo Framework, an open-source, scalable AI framework built for large language models. By offering easy-to-use modular components and pre-trained model recipes, NeMo enables developers to efficiently build performant models tailored to specific use cases. The reference code functionality is tested on NVIDIA H100 GPUs.

To keep the benchmark setup easy to understand, without complex knobs and hyperparameters scattered around the training script, the reference starts from NeMo’s integrated Llama 3.1 405B training recipe, which closely follows the original paper’s hyperparameters. We introduced only the minimum necessary changes to our script, such as adjusting the optimizer’s learning rates.

When researching how to construct this benchmark, we first explored the GPT3-style setup. For the GPT3 pretraining benchmark, the task force pretrained from random weights for about 12.6T tokens to generate a stable customized checkpoint. Submitters were required to continue pretraining from this checkpoint until they reached a predetermined quality metric. This methodology was used to ensure that the GPT3 pretraining benchmark showcased a reproducible slice of the actual GPT3 pretraining process.

Applying the same methodology to the Llama 3.1 405B pretraining benchmark, the task force generated a customized checkpoint trained on 32B tokens. But experiment results showed that resuming from this checkpoint produced more variation in the convergence curves than we could tolerate, which meant that resuming from a custom-trained checkpoint was not feasible for the benchmark. Our interpretation was that, after only 32B tokens, training was still in a very early phase, so the generated checkpoint was still quite unstable.

So the task force decided to continue training from the fully trained, stable Hugging Face checkpoint by giving it new information to learn, which led to a much more stable convergence trend. The benchmark now begins from that checkpoint and stops when the evaluation log perplexity reaches our defined target.

One noticeable difference between the task force’s implementation and Meta’s implementation is that we replaced the original tiktoken-based tokenizer, which used a vocab size of 128k, with the 32k-vocab Mixtral 8x22B tokenizer. The motivation behind this change is that Meta’s checkpoint is already well trained, so it would not learn anything new on the current C4 dataset. Replacing the tokenizer forces the model checkpoint to adapt to a new, different token distribution and therefore to continue pretraining on the current dataset. This difference implies that, when we resume from the pre-trained checkpoint, only the first 32,000 rows of the 128,256-row word embedding layer weights are loaded into the model.
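The partial embedding load might look roughly like the sketch below; the checkpoint path and state-dict key names here are hypothetical, and the reference implementation handles this inside NeMo rather than with a hand-rolled script.

```python
import torch

# Hypothetical file and key names -- the real keys come from the NeMo / Hugging Face
# checkpoint layout. 32_000 is the Mixtral 8x22B tokenizer vocabulary size.
checkpoint = torch.load("llama31_405b_reference.pt", map_location="cpu")
full_embedding = checkpoint["embedding.weight"]        # shape: [128_256, hidden_dim]
truncated_embedding = full_embedding[:32_000].clone()  # keep only the first 32k rows

# Load the truncated table; all other weights are loaded unchanged.
# model.load_state_dict({"embedding.weight": truncated_embedding}, strict=False)
print(truncated_embedding.shape)
```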

As mentioned before, the Llama 3.1 405B pretraining benchmark is a replacement for the GPT3 pretraining benchmark, and the following components will be unchanged between the two:

  • Validation dataset: To reduce the cost of validation, we created a customized subset of the full C4 validation dataset so that evaluation time is insignificant compared to training time. Specifically, we found that 5,760 sequences are sufficient to measure model convergence, and since only 91,205 validation samples are needed to yield those 5,760 validation sequences, the first 91,205 unshuffled rows of the full C4 validation dataset were selected as the customized validation dataset for this benchmark.
  • Loss and target accuracy metric: As with GPT3, log perplexity is used, computed on the customized validation dataset to evaluate the model’s convergence behavior. This metric is not costly to compute and is widely adopted as a versatile indicator of evaluating LLMs.
  • Limited number of submission runs: We recognize that the Llama 3.1 405B pretraining benchmark is very expensive to run, so to ensure fairness while reducing submitters’ costs, each submitter is required to run the benchmark only 3 times, the same as for the GPT3 pretraining benchmark. The median result is then reported.
  • Restriction on hyperparameter searching: Because running the benchmark is expensive, we believe it is reasonable to disallow hyperparameter searches and borrowing. Most hyperparameters have been given static values. For a few others, such as the learning rate, formulas are provided to compute them from the global batch size (see the illustrative sketch after this list).
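As a purely illustrative example of a batch-size-dependent formula (the actual formula and constants are defined in the MLPerf Training rules and reference configuration, not here), a linear-scaling rule would look like this:

```python
def scaled_learning_rate(global_batch_size: int,
                         base_lr: float = 8.0e-5,
                         base_global_batch_size: int = 1152) -> float:
    """Hypothetical linear-scaling rule: the learning rate grows in proportion
    to the global batch size. The constants here are placeholders, not the
    values used by the benchmark."""
    return base_lr * global_batch_size / base_global_batch_size

print(scaled_learning_rate(2304))  # e.g. doubling the batch size doubles the LR
```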

Conclusion

In this blog post, we have provided an overview of a new LLM pretraining benchmark based on Meta’s Llama 3.1 405B model, with more than twice as many parameters and four times as many training tokens as the GPT3 pretraining benchmark. This benchmark is massively more computationally intensive and presents its own new challenges. It is the task force’s belief that this new benchmark will encourage more innovation in the area of LLM pretraining.

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies’ accuracy, safety, speed, and efficiency.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or email participation@mlcommons.org.

MLCommons Releases MLPerf Client v0.6 With Expanded Hardware Support for AI PCs https://mlcommons.org/2025/04/mlperf-client-v0-6/ Mon, 28 Apr 2025 15:04:31 +0000 MLCommons' MLPerf Client v0.6: First Open Benchmark for NPU and GPU AI PCs

Today, MLCommons® is announcing the release of MLPerf® Client v0.6, an update to the MLPerf Client consumer AI performance benchmark. This release extends support to a broader range of hardware and platforms, including AI PCs with dedicated neural processing units (NPUs), while enhancing usability.

MLPerf Client v0.6 builds on the foundation established by version 0.5, which debuted with LLM-focused performance tests using the Llama 2 7B model from Meta. With this latest release, MLCommons continues its mission to provide a transparent, standardized, and vendor-neutral way to measure AI performance across a growing range of PC hardware.

MLPerf Client represents a collaboration among leaders in the consumer computing space, including AMD, Intel, Microsoft, NVIDIA, Qualcomm Technologies, Inc., and top PC OEMs. These stakeholders have pooled resources and expertise to create a standardized performance benchmark for key consumer AI workloads.

Key Updates in v0.6:

  • Expanded Hardware Support:
    New support for NPUs from Intel, alongside continued GPU acceleration via AMD, Intel, and NVIDIA hardware. This milestone makes MLPerf Client the first open benchmark to span both GPU and NPU acceleration on consumer platforms.
  • Improved Device Selection:
    New device enumeration options help users better target and test systems with multiple capable accelerators, such as PCs equipped with multiple GPUs.
  • Updated Software Stack:
    Includes the latest versions of ONNX Runtime, ONNX Runtime GenAI, and Intel OpenVINO, offering performance and compatibility improvements across supported platforms.

“MLPerf Client v0.6 reflects the rapid evolution of the AI PC landscape with the inclusion of NPU evaluation,” said Yanni Minkdakis, co-chair of the MLPerf Client working group. “With expanded support and more flexible testing, it’s now easier than ever for the industry and consumers alike to evaluate real-world AI performance on next-generation devices.”

MLPerf Client v0.6 is available now as a free download at mlcommons.org.

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies’ accuracy, safety, speed, and efficiency.

MLPerf Client participation requires an MLCommons membership. For more information and details on becoming a member, please visit MLCommons.org or contact participation@mlcommons.org.

MLCommons Releases French AILuminate Benchmark Demo Prompt Dataset to GitHub https://mlcommons.org/2025/04/ailuminate-french-datasets/ Wed, 16 Apr 2025 17:00:00 +0000 Set of 1,200 Creative Commons French Demo prompts and 12,000 Practice Test prompts exercise the breadth of hazards for generative AI systems

Following the recent French release of the AILuminate v1.1 benchmark, MLCommons is announcing the public release of two French-language datasets for the benchmark. AILuminate measures the reliability and integrity of generative AI language systems against twelve categories of hazards and is currently available in English and French. Today’s release includes a Creative Commons-licensed Demo prompt dataset of over 1,200 prompts and a Practice Test dataset of 12,000 prompts, both of which span all twelve AILuminate hazard categories.

The Demo prompt dataset is available on GitHub now and the full Practice Test dataset may be obtained by contacting MLCommons. English versions of both datasets are also available for use.

The AILuminate benchmark

The AILuminate benchmark, designed and developed by the MLCommons AI Risk & Reliability working group, assesses LLM responses to 12,000 test prompts per language, across twelve categories of hazards that users of generative AI language systems may encounter. The benchmark was designed to help organizations evaluate language systems before deploying them in production environments, to compare systems before procurement, or to verify their language systems comply with organizational and government policies. In addition to its unique corpus of prompts, AILuminate also includes a best-in-class evaluation system using a tuned ensemble of safety evaluation models. MLCommons has used the benchmark to assess the performance of many of the most prominent AI language systems-under-test (SUTs) across all of the hazard categories, and published public reports of their findings. 

The AILuminate French language prompt datasets can help provide valuable insights into your AI systems: 

  • Demo Prompt Dataset: A Creative Commons dataset of 1,200 prompts that is a subset of the AILuminate Practice Test dataset. The French-language Demo dataset is useful for in-house testing, provides examples of specific hazards, and offers seeds and inspiration for creating other prompts. It is available today on GitHub.
  • Practice Test Dataset: Includes 12,000 prompts. This dataset closely resembles the official test and is intended to provide a realistic projection of how models would be evaluated by AILuminate when used with the MLCommons AILuminate evaluator. Access to the comprehensive Practice dataset is available upon request from MLCommons. 

AILuminate’s prompts are based on 12 hazard categories, which are extensively documented in the AILuminate Assessment Standard.

All prompts are currently available in English and French, with future plans to expand to Simplified Chinese and Hindi. Each prompt dataset is human-written and reviewed by native speakers for linguistic and cultural relevance, to ensure accurate linguistic representation. MLCommons intends to regularly update the prompt datasets, including the Creative Commons Demo dataset, as the community’s understanding of AI hazards evolves – for example, as agentic behavior is developed in AI systems. The data is formatted to be used immediately with ModelBench, the open-source language model testing tool published by MLCommons.
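As a quick illustration of working with the Demo prompt dataset before feeding it to ModelBench or an in-house harness, the sketch below loads it with pandas; the file name and column names are hypothetical, so check the GitHub repository for the actual layout.

```python
import pandas as pd

# Hypothetical file and column names -- consult the AILuminate GitHub repository
# for the real schema of the French Demo prompt dataset.
demo = pd.read_csv("ailuminate_demo_prompts_fr.csv")

# For example, count prompts per hazard category before sending them to a
# system under test.
print(demo.groupby("hazard_category")["prompt_text"].count())
```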

“AI reliability is a global problem that requires an intentionally global approach to address,” said Peter Mattson, Founder and President of MLCommons. “The availability of the French Demo and Practice Test datasets reflects our commitment to evaluating AI risk concerns across languages, cultures, and value systems — and ensures our benchmarks and datasets are readily available to vendors and developers across continents.” 

Raising the bar for reliability testing by empowering stakeholders

The AILuminate Demo prompt datasets provide an important starting point for those who are developing their in-house abilities to test the safety of generative AI systems across the twelve defined AILuminate hazard categories. While we discourage training AI systems directly using this data – following the industry best practice of maintaining separate evaluation suites to avoid overfitting AI models – we do expect the dataset to have direct relevance to the work of a wide variety of stakeholders. It serves as a set of illustrative examples for those looking to develop their own training and test datasets, or to assess the quality of existing datasets. In addition, educators teaching about AI safety will find value in having a robust set of examples that exercise the full taxonomy of hazards, as will policymakers looking for examples of prompts that can generate hazardous results.

Download the Demo prompt dataset today!

The Demo test dataset is made available using the Creative Commons CC-BY-4.0 license and can be downloaded directly from GitHub. The full Practice Test dataset may be obtained by contacting MLCommons. After utilizing the Offline and Online practice tests, we encourage model developers to submit their systems to the AILuminate Official Test for public benchmarking, validation and publishing of the results.

More information on AILuminate can be found here, including an FAQ. We encourage stakeholders to join the MLCommons AI Risk & Reliability working group and our efforts to ensure greater reliability of AI systems.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI risk and reliability. We invite others to join the AI Risk & Reliability working group.

FRENCH ANNOUNCEMENT

MLCommons publie le jeu de données de prompts de démonstration AILuminate en français sur GitHub

Une série de 1 200 prompts Creative Commons en français explore l’étendue des risques pour les systèmes d’IA générative

Suite au récent lancement de la version française du benchmark AILuminate v1.1, MLCommons annonce la mise à disposition publique de deux jeux de données en langue française pour ce benchmark. AILuminate évalue la fiabilité et l’intégrité des systèmes d’IA générative dans douze catégories de risques, et ses benchmarks sont désormais disponibles pour les systèmes en anglais et en français. La publication d’aujourd’hui inclut un jeu de 1 200 prompts de démonstration sous licence Creative Commons et un jeu de 12 000 prompts d’essai couvrant les douze types de risques définis par AILuminate.

Le jeu de données de démonstration est dès maintenant disponible sur GitHub, tandis que le jeu complet d’essai peut être obtenu en contactant MLCommons. La version anglaise des deux jeux est également disponible.

Le benchmark AILuminate

Le benchmark AILuminate, conçu et développé par le groupe de travail sur les risques et la fiabilité des IA chez MLCommons, évalue les réponses des modèles linguistiques (LLM) à 12 000 prompts par langue, répartis dans douze catégories de risques auxquels les utilisateurs peuvent être confrontés. Ce benchmark aide les organisations à évaluer leurs systèmes avant leur déploiement, à comparer les produits avant achat ou à vérifier leur conformité aux politiques internes et officielles. En plus d’une collection unique de prompts, AILuminate inclut un système d’évaluation avancé basé sur un ensemble optimisé de modèles d’évaluation des risques. MLCommons a utilisé ce benchmark pour analyser la performance des principaux systèmes linguistiques testés (SUTs) dans toutes les catégories de risques et a publié des rapports publics sur leurs résultats.

Les jeux de données en langue française AILuminate offrent des perspectives précieuses pour vos systèmes d’IA :

  • Jeu de prompts de démonstration : Un sous-ensemble Creative Commons contenant plus de 1 200 prompts issus du jeu de prompts d’essai. Ce jeu est utile pour les tests internes, fournit des exemples spécifiques aux risques identifiés et peut inspirer la création d’autres prompts. Il est disponible dès aujourd’hui sur GitHub.
  • Jeu de prompts d’essai : Comprend 12 000 prompts conçus pour simuler une évaluation réaliste par le système officiel AILuminate. Ce jeu complet peut être obtenu sur demande auprès de MLCommons.

Les prompts sont basés sur les douze catégories de risques définies dans le standard d’évaluation AILuminate, qui inclut notamment :

[Figure: French-translated AILuminate hazard categories]

Tous les prompts sont disponibles aujourd’hui en anglais et en français, et seront bientôt disponibles en chinois simplifié et en hindi. Les prompts sont rédigés par un être humain, puis contrôlés par une personne de langue maternelle afin d’en assurer la pertinence linguistique et culturelle. MLCommons prévoit de les mettre à jour régulièrement, notamment le jeu Creative Commons, à mesure que la compréhension communautaire des risques liés à l’IA évolue – par exemple avec le développement du comportement agentique dans les systèmes d’IA. Les données sont formatées pour une utilisation immédiate avec ModelBench, l’outil open-source publié par MLCommons.

« La fiabilité des systèmes IA est un problème mondial nécessitant une approche internationale délibérée », a déclaré Peter Mattson, fondateur et président de MLCommons. « La disponibilité des jeux de prompts français reflète notre engagement quant à l’évaluation des préoccupations liées aux risques de l’IA à travers différentes langues, cultures et systèmes de valeurs – garantissant que nos benchmarks soient accessibles aux entreprises commerciales et aux développeurs dans le monde entier. »

Élever les standards des tests IA en responsabilisant les parties prenantes

Les jeux de données AILuminate fournissent un point de départ important pour le développement des capacités internes en permettant d’évaluer la sécurité des systèmes d’IA générative dans les douze catégories définies par AILuminate. Bien que nous déconseillions l’utilisation de ces prompts pour entraîner des systèmes d’IA (les meilleures pratiques de l’industrie recommandent l’isolation des jeux de données d’évaluation afin d’éviter le surapprentissage), ces prompts sont exploitables dans de nombreux secteurs. Ils servent d’exemples représentatifs pour les utilisateurs qui souhaitent développer leurs propres jeux de données d’entraînement et d’évaluation ou évaluer la qualité de jeux de données existants. Les formateurs en sécurité de l’IA sauront profiter de cet ensemble complet d’exemples couvrant une taxonomie de risques étendue, et les décideurs politiques pourront y trouver des cas concrets où certains prompts peuvent générer des résultats dangereux.

Téléchargez le jeu de données aujourd’hui !

Le jeu de prompts de démonstration est disponible sous licence Creative Commons CC-BY-4.0 et peut être téléchargé directement sur GitHub. Le jeu d’essai complet peut être obtenu en contactant MLCommons. Après l’utilisation des tests d’essai hors ligne ou en ligne, nous encourageons les développeurs à soumettre leurs systèmes à AILuminate pour obtenir un benchmarking public avec validation et publication officielle des résultats.

Pour en savoir plus sur AILuminate et accéder à une FAQ détaillée, nous encourageons toutes les parties prenantes à rejoindre le groupe MLCommons sur les risques IA et la fiabilité.

À propos de MLCommons

MLCommons est le leader mondial du développement de benchmarks pour l’IA. C’est un consortium ouvert dont la mission est d’améliorer l’IA pour tous par la publication de benchmarks et de jeux de données publics. Formée de plus de 125 membres (entreprises de technologie, universitaires et chercheurs du monde entier), MLCommons se concentre sur le développement collaboratif d’outils pour l’industrie de l’IA par le biais de benchmarks, de jeux de données et de mesures publics traitant des risques et de la fiabilité de l’IA. Nous invitons toutes les parties intéressées à rejoindre le groupe sur les risques de l’IA chez MLCommons.

CKAN Announces Support for Croissant https://mlcommons.org/2025/04/ckan-croissant/ Wed, 16 Apr 2025 00:30:02 +0000 MLCommons Croissant metadata standard is now connected to CKAN’s open data catalog capabilities with ML workflows

CKAN, the world’s leading open source data management system, recently announced support for the Croissant metadata standard, a format designed by MLCommons to make datasets machine learning-ready. CKAN sites can now easily expose their datasets to ML pipelines with the help of the newly released croissant plugin. This integration, powered by the ckanext-dcat 2.3.0 extension, connects CKAN’s open data catalog capabilities with ML workflows, improving dataset discoverability and interoperability.
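Once a CKAN dataset exposes Croissant JSON-LD, it can be consumed with MLCommons' mlcroissant library, roughly as sketched below; the endpoint URL and record-set name are illustrative and depend on the specific CKAN deployment and dataset.

```python
import mlcroissant as mlc

# Hypothetical endpoint: a CKAN site with the croissant plugin serves Croissant
# JSON-LD for each dataset; the exact URL pattern depends on the deployment.
croissant_url = "https://demo.ckan.org/dataset/example/croissant.jsonld"

dataset = mlc.Dataset(jsonld=croissant_url)
for record in dataset.records(record_set="default"):  # record-set name is illustrative
    print(record)
    break
```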

Learn more on the CKAN blog: Bridging CKAN and Machine Learning: Introducing Support for the Croissant Standard

Announcing the AILuminate Benchmarking Policy https://mlcommons.org/2025/04/ail-benchmarking-policy/ Wed, 16 Apr 2025 00:26:43 +0000 The MLCommons policy to guide which systems we include in AILuminate benchmarks

Just a few months ago we launched our first AILuminate benchmark, a first-of-its-kind reliability test for large language models (LLMs). Our goal with AILuminate is to build a standardized way of evaluating product reliability that will assist developers in improving the reliability of their systems, and give companies better clarity about the reliability of the systems they use. 

For AILuminate to be as effective as it can be, it needs to achieve widespread coverage of AI systems that are available on the market. This is one reason we have prioritized multilingual coverage; we launched French language coverage in February, and plan to launch Hindi and Chinese in the coming months. But beyond language, AILuminate needs to cover as many LLMs that are on the market as possible, and it needs to be continually updated to reflect the latest versions of LLMs available. 

While we work toward achieving this ubiquitous coverage from an engineering standpoint, we are announcing a Benchmarking Policy, which defines which “systems under test” (as we call the LLMs we test) will be included in the AILuminate benchmark going forward. With this policy, we have aimed to balance giving developers a practice test and advance notice that their systems will be included against the need to maintain timely, continuously updated testing of as many of the systems on the market as we can.
You can read the policy in full here.

MLCommons Releases New MLPerf Inference v5.0 Benchmark Results https://mlcommons.org/2025/04/mlperf-inference-v5-0-results/ Wed, 02 Apr 2025 15:00:00 +0000 Generative AI now the center of attention for performance engineering

Today, MLCommons® announced new results for its industry-standard MLPerf® Inference v5.0 benchmark suite, which delivers machine learning (ML) system performance benchmarking in an architecture-neutral, representative, and reproducible manner. The results highlight that the AI community is focusing much of its attention and efforts on generative AI scenarios, and that the combination of recent hardware and software advances optimized for generative AI have led to dramatic performance improvements over the past year.

The MLPerf Inference benchmark suite, which encompasses both datacenter and edge systems, is designed to measure how quickly systems can run AI and ML models across a variety of workloads. The open-source and peer-reviewed benchmark suite creates a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry. It also provides critical technical information for customers who are procuring and tuning AI systems. This round of MLPerf Inference results also includes tests for four new benchmarks: Llama 3.1 405B, Llama 2 70B Interactive for low-latency applications, RGAT, and Automotive PointPainting for 3D object detection.

Llama 2 70B generative AI test takes center stage

The Inference v5.0 results show that generative AI scenarios have gained momentum. Over the last year, submissions to the Llama 2 70B benchmark test, which implements a large generative AI inference workload based on a widely referenced open-source model, have increased 2.5x. With the v5.0 release, Llama 2 70B has now supplanted ResNet-50 as the test with the highest submission rate.

The performance results for Llama 2 70B have also skyrocketed since a year ago: the median submitted score has doubled, and the best score is 3.3 times faster compared to Inference v4.0.

“It’s clear now that much of the ecosystem is focused squarely on deploying generative AI, and that the performance benchmarking feedback loop is working,” said David Kanter, head of MLPerf at MLCommons. “We’re seeing an unprecedented flood of new generations of accelerators. The hardware is paired with new software techniques, including aligned support across hardware and software for the FP4 data format. With these advances, the community is setting new records for generative AI inference performance.”

The benchmark results for this round include results for six newly available or soon-to-be-shipped processors:

  • AMD Instinct MI325X
  • Intel Xeon 6980P “Granite Rapids”
  • Google TPU Trillium (TPU v6e)
  • NVIDIA B200
  • NVIDIA Jetson AGX Thor 128
  • NVIDIA GB200

Benchmarking the state of the art in generative AI: two new tests introduced

In step with advances in the AI community, MLPerf Inference v5.0 introduces a new benchmark utilizing the Llama 3.1 405B model, setting a new bar for the scale of generative AI inference models in a performance benchmark. Llama 3.1 405B incorporates 405 billion parameters and supports input and output lengths up to 128,000 tokens (compared to only 4,096 tokens for Llama 2 70B). The benchmark tests three separate tasks: general question-answering, math, and code generation.

“This is our most ambitious inference benchmark to-date,” said Miro Hodak, co-chair of the MLPerf Inference working group. “It reflects the industry trend toward larger models, which can increase accuracy and support a broader set of tasks. It’s a more difficult and time-consuming test, but organizations are trying to deploy real-world models of this order of magnitude. Trusted, relevant benchmark results are critical to help them make better decisions on the best way to provision them.”

The Inference v5.0 suite also adds a new twist to its existing benchmark for Llama 2 70B with an additional test that adds low-latency requirements: Llama 2 70B Interactive. Reflective of industry trends toward interactive chatbots as well as next-generation reasoning and agentic systems, the benchmark requires systems under test (SUTs) to meet more demanding system response metrics for time to first token (TTFT) and time per output token (TPOT).

“A critical measure of the performance of a query system or a chatbot is whether it feels responsive to a person interacting with it. How quickly does it start to reply to a prompt, and at what pace does it deliver its entire response?” said Mitchelle Rasquinha, MLPerf Inference working group co-chair. “By enforcing tighter requirements for responsiveness, this interactive version of the Llama 2 70B test offers new insights into the performance of LLMs in real-world scenarios.”
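In concrete terms, TTFT and TPOT can be computed from per-token timestamps as in the small sketch below; these are the standard definitions of the two metrics, while the benchmark's exact measurement rules and latency limits are specified in the MLPerf Inference rules.

```python
def ttft_and_tpot(request_time: float, token_times: list[float]) -> tuple[float, float]:
    """Compute time to first token (TTFT) and mean time per output token (TPOT)
    from per-token completion timestamps, all in seconds."""
    ttft = token_times[0] - request_time
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]  # inter-token gaps
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, tpot

# Example: request sent at t=0.0 s, first token at 0.45 s, then one token every 20 ms.
print(ttft_and_tpot(0.0, [0.45 + 0.02 * i for i in range(10)]))
```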

More information on the selection of the Llama 3.1 405B and the new Llama 2 70B Interactive benchmarks can be found in this supplemental blog.

New Graph Neural Network datacenter benchmark for modeling relationship graphs

Also new to Inference v5.0 is a datacenter benchmark that implements a graph neural network (GNN) model. GNNs are useful for modeling links and relationships between nodes in a network and are commonly used in recommendation systems, knowledge-graph answering, fraud-detection systems, and other kinds of graph-based applications.

The GNN datacenter benchmark implements the RGAT model, based on the Illinois Graph Benchmark Heterogeneous (IGBH) dataset containing 547,306,935 nodes and 5,812,005,639 edges.

More information on the construction of the RGAT benchmark can be found here.

New edge benchmark: Automotive PointPainting test for 3D object detection

The Inference v5.0 benchmark introduces a new Automotive PointPainting benchmark for edge computing devices, specifically automobiles. While the MLPerf Automotive working group continues to develop the Minimum Viable Product benchmark first announced last summer, this test provides a proxy for an important edge-computing scenario: 3D object detection in camera feeds for applications such as self-driving cars.

More information on the Automotive PointPainting benchmark can be found here.

As industry raises the bar for AI systems, MLPerf Inference benchmark follows suit

“We rarely introduce four new tests in a single update to the benchmark suite,” said Miro Hodak, “but we felt it was necessary to best serve the community. The rapid pace of advancement in machine learning and the breadth of new applications are both staggering, and stakeholders need relevant and up-to-date data to inform their decision-making.”

MLPerf Inference v5.0 includes 17,457 performance results from 23 submitting organizations: AMD, ASUSTeK, Broadcom, Cisco, CoreWeave, CTuning, Dell, FlexAI, Fujitsu, GATEOverflow, Giga Computing, Google, HPE, Intel, Krai, Lambda, Lenovo, MangoBoost, NVIDIA, Oracle, Quanta Cloud Technology, Supermicro, and Sustainable Metal Cloud.

“We would like to welcome the five first-time submitters to the Inference benchmark: CoreWeave, FlexAI, GATEOverflow, Lambda, and MangoBoost,” said David Kanter. “The continuing growth in the community of submitters is a testament to the importance of accurate and trustworthy performance metrics to the AI community. I would also like to highlight Fujitsu’s broad set of datacenter power benchmark submissions and GATEOverflow’s edge power submissions in this round, which reminds us that energy efficiency in AI systems is an increasingly critical issue in need of accurate data to guide decision-making.”

“The machine learning ecosystem continues to give the community ever greater capabilities. We are increasing the scale of AI models being trained and deployed, achieving new levels of interactive responsiveness, and deploying AI compute more broadly than ever before,” said Kanter. “We are excited to see new generations of hardware and software deliver these capabilities, and MLCommons is proud to present exciting results for a wide range of systems and several novel processors with this release of the MLPerf Inference benchmark. Our work to keep the benchmark suite current, comprehensive, and relevant at a time of rapid change is a real accomplishment, and ensures that we will continue to deliver valuable performance data to stakeholders.”

View the results

To view the results for MLPerf Inference v5.0, please visit the Datacenter and Edge benchmark results pages.

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies’ accuracy, safety, speed, and efficiency.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or email participation@mlcommons.org.

Introducing a Graph Neural Network Benchmark in MLPerf Inference v5.0 https://mlcommons.org/2025/04/rgat-inference-v5/ Wed, 02 Apr 2025 14:59:24 +0000 The RGAT benchmark addresses graph-structured data and applications, expanding the scope of MLPerf Inference

Introduction

Graph neural networks (GNNs) have been increasing in popularity over the last few years, driven by the abundance of data and use cases with underlying graph structure. GNNs are used in a variety of applications including social network analysis, fraud detection, transportation applications, recommendation systems, weather prediction, drug discovery, and chemical structure analysis. In addition, GNNs are being explored for natural language processing and image segmentation. Generally, GNNs have been shown to outperform other machine learning methods for tasks like node classification and graph classification, and they unlock many new applications that work with data more complex than traditional text, images, video, and audio. Based on the popularity of graph neural networks and these new application areas, the GNN task force recommended adding a graph neural network benchmark to MLPerf® Inference v5.0.

Model selection

There are a variety of types of GNNs, including both Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs). GCNs rely on convolution operators to capture information about relationships between nodes in a graph, while GATs use the attention mechanism. Traditional GATs focus on undirected single-relation graphs. For the MLPerf GNN benchmark, we chose a variant of graph attention networks called RGAT, or Relational Graph Attention Network, which extends GATs to allow for multi-relational graph structures – representing a wider range of applications and use cases.

What are RGATs?

Graph Attention Networks (GATs) are a specialized type of GNN that use attention mechanisms to dynamically weigh the importance of neighboring nodes during information propagation in graph-structured data. They’re particularly effective for node classification and social network analysis. Relational GATs (RGATs) extend this concept by incorporating relationship type discrimination between nodes in knowledge graphs (i.e., differentiating by edge type).

The figure below demonstrates a 2-layer RGAT with fan-out: when classifying a target node, we construct a subgraph by randomly sampling neighbors across two hierarchical levels (note that here, we fan in rather than fan out). In the first layer, neighbor embeddings serve directly as attention inputs, while subsequent layers use propagated outputs from GAT computations. Notably, the same graph node (e.g., any single GAT node in the figure) may appear multiple times in the computation subgraph. While this allows reuse of its learned embedding, each occurrence requires recomputation because input edges carry distinct values in different contexts. This architecture creates a memory-computation tradeoff that depends on the graph’s relationships.

[Figure: Subgraph sampling example]

The figure below illustrates the architecture of a single RGAT layer, detailing its attention mechanism. For each node pair in the attention computation, both the local embedding (central node) and external embeddings (neighbors) pass through a shared MLP to generate separate Query and Key vectors. These projections are concatenated and transformed into an attention score. The scores undergo Softmax normalization to produce attention weights. These weights then compute a weighted sum of the Value vectors, which are derived from projected neighbor embeddings. This aggregated result becomes the hidden embedding passed to the next RGAT layer.

A key implementation detail is that while the same MLP parameters are reused across all computations, each attention head recalculates context-specific weights because input edges (Key/Value) differ for every node-edge interaction.

[Figure: RGAT network architecture]
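A minimal PyTorch sketch of a single attention head over one node's sampled neighborhood is shown below; it follows the description above but omits the per-relation parameters, multi-head attention, and batching that the actual RGAT benchmark model uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadGATLayer(nn.Module):
    """Single-head graph attention over one node's neighborhood. A full RGAT
    additionally discriminates between edge/relation types; this sketch does not."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)  # shared projection for Q/K/V
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)   # scores each [center || neighbor] pair

    def forward(self, center: torch.Tensor, neighbors: torch.Tensor) -> torch.Tensor:
        # center: [in_dim], neighbors: [num_neighbors, in_dim]
        q = self.proj(center)                                # query from the central node
        kv = self.proj(neighbors)                            # keys/values from neighbors
        pairs = torch.cat([q.expand_as(kv), kv], dim=-1)     # concatenate per neighbor
        scores = F.leaky_relu(self.attn(pairs)).squeeze(-1)  # unnormalized attention scores
        weights = torch.softmax(scores, dim=0)               # normalize over the neighborhood
        return (weights.unsqueeze(-1) * kv).sum(dim=0)       # weighted sum -> next-layer embedding

layer = SingleHeadGATLayer(in_dim=1024, out_dim=256)
out = layer(torch.randn(1024), torch.randn(15, 1024))        # e.g. 15 sampled neighbors
print(out.shape)                                             # torch.Size([256])
```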

Dataset and task selection

Some of the most popular applications of graph neural networks are social network analysis and recommenders. These applications can have massive datasets of user information which require a large amount of storage. One of the biggest challenges of graph neural networks is scaling to these large graphs, which can increase both the computational cost of processing all the relations in the graph, as well as the message passing and memory management of nodes across the graph, which are often stored on disk and even across multiple nodes.

To mimic this as much as possible, we chose to use the largest publicly available graph dataset at the time – the Illinois Graph Benchmark Heterogeneous (IGB-H) dataset. The graph in this dataset consists of academic papers (“Paper” nodes), the fields of study of those papers (“FoS” nodes), the authors of those papers (“Author” nodes), and the institutes the authors are affiliated with (“Institute” nodes). In addition to the previously mentioned relations (topic, written by, and affiliated with), there is also a “citation” relation from “Paper” to “Paper”. The task for this dataset is classification of the “Paper” nodes into a set of 2,983 topics.

[Figure: Graph structure of IGB-H]

The IGB Heterogeneous “Full” dataset variant has 547 million nodes and 5.8 billion edges. With an embedding dimension of 1024, the dataset totals over 2 TB of data. In addition, for use in MLPerf Inference, the dataset is augmented with reverse edges as well as self-loops for papers, more than doubling the number of edges. It was decided that the subset used for inference as well as accuracy checks is the validation set from training, consisting of around 788 thousand “Paper” nodes. During classification, subgraphs are sampled for each input node with fanout steps of 15, 10, and 5 (i.e., 15 neighbors, 10 neighbors of each of those neighbors, and 5 neighbors of those neighbors). We acknowledge that some real-world applications of GNNs utilize “full fanout”, which uses every single neighbor of the node set at each step of subgraph generation. However, for the MLPerf Inference benchmark, we decided to use a fixed maximum fanout with the same parameters as when the model was trained, in order to reduce the variance of per-sample latencies, as high-neighbor-count nodes could otherwise skew inference latency higher.
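The sketch below shows what fixed-fanout sampling of this kind looks like with DGL's NeighborSampler; the toy graph, seed nodes, and fanout ordering here are illustrative, so consult the reference implementation for the exact configuration.

```python
import dgl
import torch

# Toy stand-in for IGB-H: a random graph with 1,000 nodes and 20,000 edges.
graph = dgl.rand_graph(1000, 20000)
seed_nodes = torch.arange(64)  # e.g. a batch of "Paper" nodes to classify

# Fixed maximum fanout per hop, mirroring the 15-10-5 scheme described above
# (layer ordering here is illustrative; check the reference implementation).
sampler = dgl.dataloading.NeighborSampler([15, 10, 5])
loader = dgl.dataloading.DataLoader(
    graph, seed_nodes, sampler, batch_size=32, shuffle=False)

for input_nodes, output_nodes, blocks in loader:
    print(len(blocks), "message-passing blocks for", len(output_nodes), "seed nodes")
    break
```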

This iteration of GNN benchmarking only tests the ‘Offline’ MLPerf Inference scenario, as many of the production applications are only used in this setting. In the future, if there is sufficient interest, we may add the ‘Server’ scenario. 

Accuracy metrics

To align with the selected task and dataset, the task force decided to use the ratio of correctly predicted “Paper” node topics in the validation set as the accuracy metric. The validation set comprises 0.5% of the 157 million labelled nodes, or roughly 788,000 nodes; evaluated in float32, the baseline accuracy is 72.86%. (Model weights are kept in float32, while embeddings are downcast to FP16 to save on storage and memory.)

As usual, MLCommons® provides a reasonable accuracy constraint for precisions lower than the baseline, set at 99% of the reference. However, because neighborhood sampling is embedded in the subgraph computation, some randomness had to be accounted for, so the task force decided to allow an additional 0.5% margin.
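As a worked example of what the 99% constraint implies (assuming, for illustration, that the extra 0.5% margin is applied as an absolute subtraction on top of it; the rules document defines the exact combination):

```python
reference_accuracy = 72.86              # float32 baseline accuracy, in percent
threshold = 0.99 * reference_accuracy   # 99% of the reference -> ~72.13%
with_margin = threshold - 0.5           # illustrative extra margin -> ~71.63%
print(round(threshold, 2), round(with_margin, 2))
```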

Performance metrics

In our inaugural release, the benchmark is only tested in the ‘Offline’ scenario, where the performance metric is throughput measured in ‘samples per second’, and per-sample latency is not a performance consideration. We acknowledge that for some use-cases of GNNs, such as recommenders and transportation applications like map data processing, latency can be an important metric to consider and we may expand to consider different scenarios in future rounds of MLPerf Inference.

Conclusion

To address the growing usage of graph neural networks for different applications, a task force investigated and recommended the creation of a GNN benchmark within MLPerf Inference v5.0. 

To represent the unique challenges associated with end-to-end deployment of GNNs, the task force selected a representative benchmark model, RGAT, running on the largest publicly available graph dataset at high precision. This benchmark emphasizes the different stages of the deployment process, such as representation of sufficient scope, distributed storage at scale, and efficient computation, and offers an opportunity for vendors to showcase efficient systems and optimizations for tackling these challenges. In the future we hope to expand the scope of the benchmark to include different datacenter deployment scenarios for real-time inference, such as online serving, which is increasingly finding applications in graph as well as recommender systems.

Details of the MLPerf Inference RGAT benchmark and reference implementation can be found here.

[1] The DGL neighborhood sampler has a known bug with seeding when the sampling is parallelized across CPUs.

A New Automotive Benchmark for MLPerf Inference v5.0 https://mlcommons.org/2025/04/auto-inference-v5/ Wed, 02 Apr 2025 14:58:55 +0000 MLCommons has developed a new automotive benchmark based on established industry methods such as PointPainting, DeepLabv3+, and the Waymo Open Dataset.

Intro

MLPerf® Inference v5.0 introduced a new automotive benchmark. From the many potential workloads related to autonomous driving, the automotive benchmark task force chose 3D object detection as the benchmark task. Based on input from experts in the autonomous driving space, the task force selected PointPainting as the benchmark model. PointPainting is a sensor-fusion technique proposed by Vora et al. in 2019 that works in conjunction with image-based semantic segmentation to classify points generated by a lidar system. The experts indicated that 3D object detection is the most useful compute-intensive task to benchmark. We selected PointPainting because it is reasonably accurate, fast for inference, and has representative components of classical and newer 3D detection models.

Model selection

Compared to other machine learning applications, progress in autonomous driving is relatively more cautious. New models must go through extended review to ensure safety and reliability. We wanted to choose a model that contains useful components from both more proven older models as well as newer state-of-the-art models used in ongoing research. We also wanted to pick a model that is representative of multiple sensor modalities while still favoring the most common case of using video/images for vision, and a sensor suite synergistic between lower-resolution, affordable lidar and higher-resolution cameras.

Figure 1: Overview of PointPainting using DeepLabv3+ for segmentation and PointPillars for 3D Detection. On the left, five camera images are segmented by Deeplabv3+. In the middle, the segmentation scores are then painted on the lidar point cloud. Finally, the painted points are passed to PointPillars to produce the final 3D bounding boxes.

PointPainting is composed of two models: an image segmentation model that is used to “paint” lidar points by concatenating segmentation scores to them, and a lidar object detection model that takes the painted points as inputs and produces 3D detection predictions. A high-level diagram is shown in Figure 1. We chose DeepLabv3+ with a ResNet-50 backbone as the segmentation model and PointPillars as the lidar detector. Both are popular for their respective tasks and are fast enough for real-time applications. Additionally, with these model choices, about 90% of the PointPainting workload is performed in the segmentation task. Favoring images in the workload makes the model more representative of the most common case of vision tasks while still incorporating lidar components.
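The "painting" step itself is conceptually simple, as the NumPy sketch below illustrates; the calibration matrix, single-camera setup, and bounds handling here are simplified assumptions, and the reference implementation handles all five WOD cameras plus points that project outside every image.

```python
import numpy as np

def paint_points(points_xyz: np.ndarray,
                 seg_scores: np.ndarray,
                 lidar_to_image: np.ndarray) -> np.ndarray:
    """Append per-pixel segmentation scores to each lidar point (single camera).

    points_xyz:     [N, 3] lidar points
    seg_scores:     [H, W, C] per-pixel class scores from the segmentation model
    lidar_to_image: [3, 4] projection matrix from lidar frame to image pixels
    """
    n = points_xyz.shape[0]
    homogeneous = np.hstack([points_xyz, np.ones((n, 1))])   # [N, 4]
    projected = homogeneous @ lidar_to_image.T               # [N, 3]
    u = (projected[:, 0] / projected[:, 2]).astype(int)
    v = (projected[:, 1] / projected[:, 2]).astype(int)

    h, w, _ = seg_scores.shape
    u = np.clip(u, 0, w - 1)   # crude bounds handling for the sketch
    v = np.clip(v, 0, h - 1)
    return np.hstack([points_xyz, seg_scores[v, u]])         # [N, 3 + C] painted points
```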

Dataset and task selection

Autonomous vehicles must perform many computational tasks. Our goal was to pick a task that is computationally intensive and representative of the self-driving tasks. To select one task to benchmark, we consulted with experts in the autonomous driving field. We had unanimous agreement that 3D object detection would be the most representative task. 3D tasks are inherently more computationally expensive than 2D tasks and often incorporate 2D models within them. 3D detection is also an important component of other tasks such as trajectory prediction.

We chose the Waymo Open Dataset (WOD) because it is a large, high-quality dataset. Importantly, we verified that MLPerf benchmarking is considered a non-commercial use under the terms of the WOD license. The WOD provides five camera images and one lidar point cloud per frame; the images have resolutions of 1920×1280 and 1920×886 pixels, and the data was captured in San Francisco, Mountain View, and Phoenix. The WOD provides annotations for 3D bounding boxes as well as video panoptic segmentation, which is needed for the segmentation network.


There is no public implementation of PointPainting based on the WOD, so we trained a model ourselves to determine an accuracy target and generate a reference checkpoint for the benchmark. We built on the open-source work of PointPillars, DeepLabv3+, PointPainting, and MMDetection3D to create our implementation of PointPainting. As a baseline, we trained PointPillars on the WOD. Our goal was not to produce the most accurate version of PointPillars but to obtain a model whose accuracy is representative of what can be achieved on the WOD. Our PointPillars baseline achieved a mean average precision (mAP) of 49.91 across three classes (pedestrians, cyclists, and vehicles).

Our trained PointPainting model is available to MLCommons® members by request here.

Performance metrics

MLPerf Inference’s single-stream scenario for edge systems represents a task that accepts a single input and measures the latency to produce the output. This scenario is the closest and most relevant for an automotive workload, where frames arrive from sensors at a fixed rate. We chose a 99.9% accuracy threshold, as automotive workloads are safety-critical. The 99.9th percentile latency metric reports the latency below which 99.9% of a submission’s inference samples complete, out of the required thousands of samples. These requirements balance the need for strong latency constraints with the desire to keep the benchmark runtime on the order of one day rather than weeks.
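As a small illustration of how this metric is computed, the sketch below takes a list of per-sample latencies from a single-stream run and reports the value that 99.9% of samples fall under. It is illustrative only and does not reproduce the official MLPerf LoadGen measurement code; the placeholder data is random.

```python
import numpy as np

def p999_latency_ms(latencies_ms):
    """Return the 99.9th percentile of per-sample latencies, in milliseconds."""
    return float(np.percentile(latencies_ms, 99.9))

# Placeholder data standing in for tens of thousands of measured samples.
latencies_ms = np.random.lognormal(mean=3.0, sigma=0.1, size=50_000)
print(f"99.9th percentile latency: {p999_latency_ms(latencies_ms):.2f} ms")
```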

Conclusion

The automotive benchmark task force followed a rigorous process for choosing and building an automotive workload for MLPerf Inference. We made careful decisions about the task, dataset, and model to ensure a robust and representative benchmark, and we determined accuracy targets and performance metrics that fit within MLPerf Inference while remaining useful for automotive tasks. MLCommons plans to continue developing automotive benchmarks across a suite of automotive-specific tasks. We welcome feedback from the community regarding this benchmark and future developments.

Additional technical contributors: Pablo Gonzalez, MLCommons; Anandhu Sooraj, MLCommons; Arjun Suresh, AMD (formerly at MLCommons).

MLPerf Inference v5.0 Advances Language Model Capabilities for GenAI https://mlcommons.org/2025/04/llm-inference-v5/ Wed, 02 Apr 2025 14:58:06 +0000 https://mlcommons.org/?p=2670 MLCommons debuts larger Llama 3.1 405B Instruct and faster Llama 2 Chat 70B interactive LLM benchmarks

The post MLPerf Inference v5.0 Advances Language Model Capabilities for GenAI appeared first on MLCommons.

]]>
Introduction

The Large Language Model (LLM) task force was convened to ensure that the MLPerf® Inference v5.0 benchmarks address the evolving demands of LLMs and low-latency serving in real-world applications. We focus on three key aspects: first, benchmarking the performance of inference platforms on emerging open LLMs; second, supporting the increased context and output sequence lengths required by retrieval-augmented generation (RAG), agentic AI, and advanced reasoning tasks; and third, achieving reasonable response latency to enable efficient, practical deployment in diverse scenarios. These new benchmarks provide a robust framework for evaluating and advancing large-scale LLM inference performance. In MLPerf Inference v5.0, we added two new benchmarks to the suite: Llama 3.1 405B Instruct and the Llama 2 Chat 70B interactive benchmark.

Llama 3.1 405B Instruct

The selection of Meta’s Llama 3.1 405B Instruct model reflects its popularity and alignment with industry requirements. Several features of Llama 3.1 405B are particularly notable in this context:

  • Long Context Processing: Its benchmark-proven ability to process 128K-token contexts with minimal accuracy degradation addresses the growing need for models to summarize, analyze, and synthesize information from expansive documents while retaining critical details.
  • Structured Data Extraction: Llama 3.1 405B exhibits superior performance on needle-in-a-haystack (NIAH) tasks that extract structured data (e.g., key-value pairs) from noisy or unstructured corpora.
  • Multi-Source Synthesis: Llama 3.1 405B has demonstrated an exceptional capacity for reconciling ambiguous information across sources, notably in literature analysis where it matches GPT-4’s accuracy.
  • Cross-Contextual Reasoning: The model’s design emphasizes linking distant concepts, making it ideal for tasks requiring multi-document synthesis and cross-referencing.

Collectively, these attributes positioned Llama 3.1 405B Instruct as the best candidate, aligning with MLPerf’s mission to benchmark models that power real-world AI systems amid escalating computational and contextual complexity.

Dataset and task selection

Llama 3.1 405B Instruct has a context window of 128,000 tokens, compared to Llama 2 70B’s 4,096 tokens. This expanded context window enables Llama 3.1 405B to handle longer conversations, process larger documents or code samples, and perform tasks requiring extended context, such as long-form text summarization.

The dataset for Llama 3.1 405B Instruct was curated to evaluate its capabilities in long-context comprehension, retrieval-augmented reasoning, and task-specific optimization. The MLCommons® task force integrated samples from a diverse mix of open datasets spanning long-document summarization (e.g., GovReport-Summary), needle-in-a-haystack retrieval, and document-based question answering.

This mix reflects real-world challenges such as synthesizing government reports, pinpointing critical data in cluttered contexts (needle-in-a-haystack), and executing precise document-based QA. With mean input/output context lengths of 9,400 and 680 tokens respectively, our dataset stresses the model’s ability to retain coherence and accuracy across extended sequences. By leveraging Llama 3.1 405B Instruct’s strengths, we achieved alignment between generated outputs and ground-truth references, validating its proficiency in parsing, contextual linking, and synthesizing insights from complex, lengthy inputs.

The Llama 3.1 405B Instruct benchmark employs task-specific accuracy metrics to evaluate performance across the dataset. For summarization tasks (e.g., GovReport-Summary), we adopted ROUGE-L, a widely recognized NLP metric that measures lexical overlap between generated summaries and ground-truth references. For precision-critical applications like needle-in-a-haystack retrieval and document-based QA, we utilized exact match (EM) scoring to ensure alignment with target answers, minimizing tolerance for errors in key-value extraction or factual responses. The accuracy threshold for closed division submissions is set to 99% of the FP16 reference.

Figure 1. Histogram of input and output (both ground truth and reference) token length distribution of the dataset for Llama 3.1 405B Instruct.
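As a rough illustration of the two kinds of accuracy checks described above, the snippet below computes ROUGE-L with the open-source rouge-score package and exact match with a plain string comparison. It is a minimal sketch, not the official MLPerf accuracy script, and the example strings are invented.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

def rouge_l(reference, generated):
    """Lexical-overlap score used for summarization tasks."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference, generated)["rougeL"].fmeasure

def exact_match(reference, generated):
    """Strict comparison used for retrieval and document-QA answers."""
    return reference.strip() == generated.strip()

print(rouge_l("The report recommends increased transit funding.",
              "The report recommends more funding for transit."))
print(exact_match("1994", " 1994 "))
```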

Performance metrics

The performance metrics chosen for the Llama 3.1 405B Instruct benchmark are the same as those used for the Llama 2 70B benchmark in MLPerf Inference v4.0: token throughput for the offline scenario and Time To First Token (TTFT) + Time Per Output Token (TPOT) for the server scenario. For more information, please refer to the “Measuring performance of the LLM” section of Llama 2 70B: An MLPerf Inference Benchmark for Large Language Models – MLCommons. To balance the demands of long-context processing with real-world usability, we set a 99th percentile TTFT of 6 seconds and a 99th percentile TPOT of 175ms. These thresholds reflect the computational challenges of deploying large models with long context, while maintaining reasonable responsiveness.
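For readers less familiar with these metrics, the snippet below shows one simple way to derive TTFT and TPOT from the timestamps of streamed tokens for a single query and compare them with the thresholds quoted above. It is a simplified illustration with invented timestamps, not the LoadGen measurement code.

```python
def ttft_and_tpot(request_time_s, token_times_s):
    """Compute Time To First Token and mean Time Per Output Token, in seconds."""
    ttft = token_times_s[0] - request_time_s
    # TPOT averages the gaps between tokens after the first one arrives.
    tpot = (token_times_s[-1] - token_times_s[0]) / max(len(token_times_s) - 1, 1)
    return ttft, tpot

ttft, tpot = ttft_and_tpot(0.0, [4.20, 4.35, 4.51, 4.66])
print(f"TTFT = {ttft:.2f} s (limit 6 s), TPOT = {tpot * 1000:.0f} ms (limit 175 ms)")
```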

Reference implementation

The reference implementation for Llama 3.1 405B Instruct leverages vLLM, a fast and versatile library for LLM inference and serving. This choice was driven by vLLM’s robust support for large language models and its compatibility with a diverse range of hardware, including NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi accelerators, and TPUs, making it well suited for deploying Llama 3.1 405B across different computing environments.

For readers interested in running the model themselves, we encourage them to follow the reference implementation, which contains code and instructions on how to run the entire benchmark end-to-end.
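As a rough sketch of what the vLLM-based path looks like, the snippet below runs offline generation through vLLM’s Python API. The model identifier, tensor-parallel degree, and sampling settings are placeholders chosen for illustration; the reference implementation linked above remains the authoritative way to run the benchmark, and a 405B model requires a large multi-GPU or multi-node setup in practice.

```python
from vllm import LLM, SamplingParams

# Placeholder model ID and parallelism degree, for illustration only.
llm = LLM(model="meta-llama/Llama-3.1-405B-Instruct", tensor_parallel_size=8)

# Greedy decoding capped near the dataset's mean output length quoted above.
sampling = SamplingParams(temperature=0.0, max_tokens=680)
prompts = ["Summarize the following government report: ..."]

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```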

Llama 2 Chat 70B Interactive Benchmark

The Llama 2 70B benchmark was introduced in MLPerf Inference v4.0 and has quickly become one of the most popular benchmarks in the suite. Server latencies were set at 2 s TTFT and 200 ms TPOT, according to the usage models and hardware capabilities of the time. As adoption of generative AI has skyrocketed, state-of-the-art model serving has advanced, and use cases such as reasoning models and agents are taking advantage of ever-lower latencies. To set target latencies that are more representative of current requirements, in late 2024 we performed a new analysis of industry research, user surveys, and performance data from leading platforms such as ChatGPT and Perplexity AI. Our findings indicated that a 50th percentile token generation rate of 20–50 tokens per second (a TPOT of 20–50 ms) is critical for a seamless user experience. To prioritize reliability under peak demand, MLPerf adopts a stricter 99th percentile threshold of 25 tokens/second (a TPOT of 40 ms), ensuring consistent responsiveness even during high-load scenarios. Additionally, we set a 99th percentile TTFT limit of 450 ms to minimize initial latency, aligning with user expectations for near-instantaneous query responses.
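Because the paragraph above switches between token generation rates and TPOT, the small helper below makes the conversion explicit, using the figures quoted in the text.

```python
def rate_to_tpot_ms(tokens_per_second):
    """Convert a token generation rate into time per output token (milliseconds)."""
    return 1000.0 / tokens_per_second

# 20-50 tokens/s corresponds to a TPOT of 50 ms down to 20 ms; the stricter
# 25 tokens/s threshold corresponds to the 40 ms TPOT limit adopted by MLPerf.
for rate in (20, 25, 50):
    print(f"{rate} tokens/s -> {rate_to_tpot_ms(rate):.0f} ms TPOT")
```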

We continued to use the OpenOrca dataset and ROUGE accuracy metrics because of their relevance to chatbot tasks: the dataset offers diverse conversational scenarios and high-quality annotations.

Conclusion

The usage of LLMs is expanding rapidly across a diversity of application domains, with demand for LLMs at different scales and response latencies. MLPerf is keeping pace with these trends by introducing a very-large-scale 405B benchmark and a low-latency 70B interactive benchmark in the new MLPerf Inference v5.0 release. In this post, we shared the motivation behind, and the complexities involved in, the choice of model, task, dataset, accuracy targets, performance metrics, and verification for these LLM benchmarks. We believe the design choices objectively address most of the critical deployment decisions faced by practitioners while delivering important performance insights to customers. With the addition of these benchmarks to MLPerf Inference, we now offer a comprehensive suite of language benchmarks spanning model scales (7B to 405B), architectures (including mixture-of-experts), and deployment scenarios (datacenter, edge, and low latency).

Details of the MLPerf Inference Llama 3.1 405B Instruct benchmark and reference implementation can be found here.

For more information on the Llama 2 70B benchmark, please refer to the “Choosing a task and a dataset for the LLM” section of Llama 2 70B: An MLPerf Inference Benchmark for Large Language Models – MLCommons.

Unlocking Data Collaboration with AI-Ready Licenses https://mlcommons.org/2025/03/unlocking-data-collab/ Mon, 17 Mar 2025 19:37:31 +0000 https://mlcommons.org/?p=2609 New research shows promising signs that adopting modular, standard data license agreements can clarify key terms, reduce transaction costs, and enable more accessible data use across borders and sectors.

The post Unlocking Data Collaboration with AI-Ready Licenses appeared first on MLCommons.

]]>
The rapid advancement of artificial intelligence (AI) is fueling the need for complex datasets containing high-quality text, images, audio, video, and specialized formats. Despite the growing interest in AI and its evident social impact, stakeholders face challenges in establishing reliable frameworks for responsibly sharing AI data. A significant hurdle is the absence of widely accepted, standardized contractual frameworks (i.e., data licenses) for this purpose.

Research funded through an unrestricted grant by the MLCommons Association and recently published by the Open Data Institute (ODI) and the Pratt School of Engineering at Duke University explored the motivations driving data sharing throughout the AI ecosystem. The goal was to inform the design of more fit-for-purpose AI data licenses. A blog published by the OECD.AI/GPAI also recently featured this work.  

Using a mixed-methods approach that combined interviews, an online survey, and a review of literature, this research captured a broad spectrum of insights from across eight jurisdictions and multiple sectors covering key roles such as data holders, data users, and intermediaries. Here, we give an overview of these research findings, demonstrating how standard data license agreements hold the potential to mitigate legal uncertainties, reduce transaction costs, and promote fair AI data access, particularly when combined with standard data governance and other tools.

The key findings below are organized around four themes: 1) leveraging new standard data licenses to promote greater data access and social good; 2) addressing ambiguities in current contractual approaches to data sharing; 3) keeping data licensing simple while balancing standardization with the need for some customization; and 4) developing a new standard data licensing framework to help address other data-sharing challenges, including legal compliance and ethical considerations.

Figure: Data licensing pathway steps.

1. New standard data licenses can promote greater responsible data access and AI for social good.

a) No standardized data licenses have been widely adopted, increasing data sharing challenges for under-represented and smaller organizations. The research indicates that the lack of widely adopted, standardized data licenses may create significant barriers to voluntary, legally compliant, and ethical data sharing. This gap might drive up costs and impose contractual burdens that particularly hinder smaller or under-represented organizations from fully benefiting from AI.

b) Data access disparities also impact the advancement of AI for social good. Limited cross-cultural and cross-linguistic data sharing can restrict access to high-quality datasets. The research shows that this shortfall might not only impede responsible AI development among research institutions, governments, and non-profits (especially in developing regions) but also diminish the diversity and richness of AI models.

c) Multi-stakeholder participation can help develop standard data licenses that advance equitable data access and AI for social good. According to the research, involving a diverse range of stakeholders (including representatives from different geographic regions, sectors, and groups that have traditionally faced challenges in the technology sector) appears to be essential. Such multi-stakeholder participation can lead to the development of standard licensing frameworks that more effectively address disparities and foster equitable data access for social good.

2. New standard data licenses can help address existing ambiguities and challenges with current contractual approaches to data sharing.

a) The lack of widely embraced standard data licenses leads to inconsistency and incompatibility and increases the challenges of using data in compliance with license terms. Through this research, it was confirmed that customized license agreements currently in use may lead to inconsistencies and incompatibilities, which in turn increase legal uncertainty and compliance costs. This fragmentation overall makes it more challenging for organizations to share data effectively across their ecosystems.

b) Confusion exists about the meaning of non-commercial limitations in data licenses, and clarification in new standard licenses might encourage more data-sharing transactions. This work highlights that the lack of definitions of “non-commercial” use in current licenses creates significant confusion. Clarifying this term within standard licenses could encourage more frequent and compliant data-sharing transactions.

c) New standard data licenses could clarify rights emerging from AI and provide different options for contractually allocating rights and liabilities. The research shows that commonly used existing licenses do not clearly address the allocation of rights for data usage (or for AI models and outputs developed using such data) resulting in uncertainties regarding rights and liabilities. Standard licenses that offer clearer contractual options, in line with applicable laws, could help resolve these issues.

d) New standard data licenses can help address attribution issues arising under existing agreements. Attribution requirements in widely used license frameworks can be impractical for AI applications, where data used in model training is not directly reproduced in outputs. The research therefore suggests that standardizing license terms and integrating technical solutions like automatic attribution may reduce this ambiguity and simplify data sharing.

3. New standard data licenses should embrace simplicity while balancing standardization with the need for some customization.

a) There is a pressing need for simplicity. Throughout, the research emphasized that drafting licenses in clear, easily interpretable language is critical for reducing transaction costs and building trust, especially for non-legal practitioners.

b) Standard data licensing frameworks should balance the need for standardization with some customization, given the many data types, use cases, models, tasks, and legal requirements. The research also indicates that while standardization can streamline data sharing, licenses may also need to allow for some customization to accommodate the diversity of data types, use cases, AI models, and jurisdiction-specific legal requirements across the AI data lifecycle.

c) Standard modular data license terms (and/or a menu of standard data license agreements) could balance standardization and some customization. Therefore, the research suggests that a modular framework or a menu of license options (developed through a collaborative, multi-stakeholder process) could effectively balance the need for uniformity with the necessity for customization to address varied stakeholder needs.

d) Standard definitions can advance data licensing. The research points to the potential benefits of developing and adopting standard definitions for key terms. Such definitions can further simplify license agreements and help maintain an effective balance between standardization and necessary customization.

4. A new standard data licensing framework may help address other data-sharing challenges, including legal compliance, data and AI governance, and ethical considerations.

a) Standard data license agreements could help address other challenges in implementing voluntary, legally compliant data sharing, including by helping advance data and AI governance. Technical tools for tracking data provenance and lineage, and for easily processing the terms of standard licenses, can enhance transparency and reduce compliance costs; the research shows this can support stronger data and AI governance practices. For instance, MLCommons’ Croissant initiative (a metadata format for ML datasets) simplifies discovery and integration by embedding legal and compliance measures into the AI pipeline, reducing uncertainty in data-sharing exchanges by representing legal and contractual frameworks as metadata (see the sketch after this list). Complementary tools, such as automatic attribution services in Apache Atlas, can further ensure that license-defined rights are easily represented and therefore upheld.

b) Standard data licenses may help address cross-border and other compliance challenges. The diversity of data-related regulations across jurisdictions (e.g., data protection) continues to create substantial barriers to data sharing. To ease compliance, standard data license terms should be prepared with these legal matters in mind. 

c) Standard data licenses may help advance ethical guidelines for data usage. The research confirmed interest in placing ethical restrictions on data and AI model usage in at least some scenarios, and a new standard data licensing framework could reference relevant ethical codes of conduct. Some concern was expressed about incorporating codes of conduct or standards into standard data licenses, since these can change over time and could increase uncertainty in contractual terms and the costs of compliance. These factors should be considered in the global, multi-stakeholder process developed to prepare standard licenses.
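As a small illustration of the machine-readable licensing information mentioned in finding 4(a), the snippet below parses a minimal, hypothetical Croissant-style JSON-LD description with Python’s standard library and inspects its license field. The field contents are invented and this is not official Croissant tooling; Croissant builds on schema.org/Dataset, which carries the license as a top-level property.

```python
import json

# A minimal, hypothetical Croissant-style dataset description.
croissant_jsonld = """
{
  "@context": {"@vocab": "https://schema.org/"},
  "@type": "Dataset",
  "name": "example-dataset",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "description": "Illustrative dataset entry with a machine-readable license."
}
"""

metadata = json.loads(croissant_jsonld)
print("Dataset:", metadata["name"])
print("License:", metadata["license"])

# Downstream tooling could check this field before admitting the dataset into
# a training or benchmarking pipeline.
```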

Looking forward

Overall, this work shows promising signs that adopting modular, standard data license agreements can clarify key terms, reduce transaction costs, and enable more accessible data use across borders and sectors. Additionally, it demonstrates that integrating external technical tools and frameworks like Apache Atlas and Croissant with these legal frameworks enhances data discoverability and usability. Together, these measures can help to build a transparent, interoperable ecosystem for responsible AI development. Here, we have the opportunity to shape an AI-driven future that is equitable, sustainable, and indeed serves everyone.

MedPerf enhances transparency and integrity of real-world benchmarks using smart contracts https://mlcommons.org/2025/03/medperf-smart-contracts/ Mon, 10 Mar 2025 20:27:32 +0000 https://mlcommons.org/?p=2558 Protection of digital assets on MedPerf via policy enforcement using Private Data Objects (PDO) 

The post MedPerf enhances transparency and integrity of real-world benchmarks using smart contracts appeared first on MLCommons.

]]>
When the MLCommons Medical AI working group set out to develop MedPerf, an open framework for benchmarking medical AI on real-world private datasets, transparency and privacy were the main technical objectives. The current implementation of MedPerf ensures this transparency and integrity through a formal benchmark committee: a governance body that oversees the entire benchmark process, from committee formation to reference implementation development, benchmark execution, and the dissemination of and access to benchmark results. Critical to MedPerf is its federated evaluation approach, which delivers a high level of data privacy by enforcing that no medical data is transferred to perform the benchmarks.

Figure 1. Current MedPerf benchmark workflow. Red circle shows current policy implementation.

The working group, in collaboration with technical member organization Intel and academic organization Notre Dame, is improving MedPerf with a new policy-enhanced workflow enforced by smart contracts. A smart contract is delivered via a trusted execution environment, such as an Intel Software Guard Extensions (SGX) enclave, to ensure stronger transparency and privacy. The new workflow protects digital assets by combining a confidential smart contract framework with private data objects, binding data policy enforcement to policy evaluation and isolated task execution. This new approach, demonstrated through a proof of concept (POC), delivers a higher level of data transparency, accountability, and traceability.

The importance of policies to protect digital assets 

Digital asset usage policies are of the utmost importance for protecting intellectual property (IP) such as data and AI models. Unfortunately, most existing solutions require sharing copies of digital assets under access controls and rely on centralized governance for policy enforcement. In practice, this may actually reduce transparency and increase the risk of intellectual property theft or leakage.

This risk becomes even more problematic in healthcare, where data and AI models may contain personally identifiable information as well as very expensive IP. Transparency for medical data is critical to maintaining the traceability and accountability of processes such as training, validation, and benchmarking. Any lack of transparency could be further magnified in decentralized settings such as data federations, where assets are hosted and transferred in a decentralized manner.

Trusted execution environment-based confidential smart contract frameworks with private data objects have proven useful for policy enforcement by binding policy evaluation to isolated task execution, protecting digital assets. Such frameworks can facilitate a high level of transparency, accountability, and traceability, but to date they have been largely unexplored in the medical AI space.

MedPerf smart contracts act as enforcement vehicles for healthcare dataset use policies

Current use policies for MedPerf datasets are implicitly hardcoded in the benchmark workflow: a data owner must execute any requested benchmark on their local data. This approach gives users full control of their data assets, but it lacks flexibility for customization and scalability, for example for decentralized deployments or a growing number of benchmark requests.

To enhance policies and provide broader flexibility and scalability in MedPerf, the working group built a trusted-execution-environment POC using a confidential smart contract framework. This approach delivers policy enforcement via a private data object, binding policy evaluation to isolated task execution to protect digital assets.

To initiate a smart contract, a MedPerf data owner creates a private data object contract that is linked to their original dataset through the dataset ID. The contract contains descriptive information about the dataset and acts as an interface to it. The data owner then securely deposits the dataset inside a digital-guardian-service trusted-execution-environment enclave, which can be hosted on premises or in the cloud. When a benchmark of a model on the dataset is requested, the contract and the digital guardian perform bi-directional attestation to verify each other’s integrity. This approach retains MedPerf’s federated evaluation: the dataset never leaves the secure domain of the enclave, and its use is bounded by policy enforcement.

Figure 2. Policy-enhanced benchmark workflow. Red circle shows smart-contract policy implementation.
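The pseudocode below sketches that sequence from the data owner’s and benchmark requester’s perspective. Every class and method name is invented for illustration and does not correspond to the actual MedPerf or Private Data Objects APIs; real attestation and enclave execution are, of course, far more involved.

```python
# Hypothetical sketch of the policy-enhanced workflow described above.

class Policy:
    def __init__(self, allowed_models):
        self.allowed_models = set(allowed_models)

    def allows(self, model_ref):
        return model_ref in self.allowed_models


class DatasetContract:
    """Private data object acting as an interface to a registered dataset."""
    def __init__(self, dataset_id, description, policy):
        self.dataset_id = dataset_id
        self.description = description
        self.policy = policy

    def attest(self, enclave):
        # Stand-in for real bi-directional remote attestation.
        return enclave.trusted


class GuardianEnclave:
    """Stand-in for the digital-guardian trusted-execution-environment service."""
    trusted = True

    def attest(self, contract):
        return contract.policy is not None

    def evaluate(self, model_ref, dataset_id):
        # The dataset never leaves the enclave; only benchmark results leave.
        return {"model": model_ref, "dataset": dataset_id, "metrics": "..."}


def run_benchmark(contract, enclave, model_ref):
    # Contract and guardian verify each other's integrity before anything runs.
    if not (contract.attest(enclave) and enclave.attest(contract)):
        raise PermissionError("attestation failed")
    # Policy evaluation is bound to the isolated task execution.
    if not contract.policy.allows(model_ref):
        raise PermissionError("benchmark request denied by dataset policy")
    return enclave.evaluate(model_ref, contract.dataset_id)


contract = DatasetContract("dataset-123", "de-identified chest CT scans", Policy({"model-A"}))
print(run_benchmark(contract, GuardianEnclave(), "model-A"))
```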

Enhanced benchmark workflow benefits

The new policy-enhanced workflow gives more power to the owners of digital assets, such as healthcare datasets and models, in benchmarking workflows. The POC validated the use of embedded smart contract policies in MedPerf and confirmed that digital asset owners retain full control of their assets.

This new approach increases the utility of, and the potential rewards from, benchmarks by supporting integrity and enabling transparency in the execution cycle.

To learn more about how the policy is implemented, read the Technical Update.

The Medical AI working group calls on the broad community of academics, technologists, regulatory scientists, healthcare experts, and others to help with roadmapping and to contribute to impactful benchmarking technologies by joining the MLCommons Medical AI working group.
