News Archives - MLCommons
https://mlcommons.org/category/news/ | Better AI for Everyone

MLCommons Announces Expansion of Industry-Leading AILuminate Benchmark
https://mlcommons.org/2025/05/nasscom/ | Thu, 29 May 2025
World leader in AI benchmarking announces new partnership with India’s NASSCOM; updated reliability grades for leading LLMs

MLCommons today announced that it is expanding its first-of-its-kind AILuminate benchmark to measure AI reliability across new models, languages, and tools. As part of this expansion, MLCommons is partnering with NASSCOM, India’s premier technology trade association, to bring AILuminate’s globally recognized AI reliability benchmarks to South Asia. MLCommons is also unveiling new proof of concept testing for AILuminate’s Chinese-language capabilities and new AILuminate reliability grades for an expanded suite of large language models (LLMs). 

“We’re looking forward to working with NASSCOM to develop India-specific, Hindi-language benchmarks and ensure companies in India and around the world can better measure the reliability and risk of their AI products,” said Peter Mattson, President of MLCommons. “This partnership, along with new AILuminate grades and proof-of-concept testing for Chinese-language capabilities, represents a major step towards the development of globally inclusive industry standards for AI reliability.”

“The rapid development of AI is reshaping India’s technology sector and, in order to manage risk and foster innovation, rigorous global standards can help align the growth of the industry with emerging best practices,” said Ankit Bose, Head of NASSCOM AI. “We plan to work alongside MLCommons to develop these standards and ensure that the growth and societal integration of AI technology continues responsibly.”

The NASSCOM collaboration builds on MLCommons’ intentionally global approach to AI benchmarking. Modeled after MLCommons’ ongoing partnership with Singapore’s AI Verify Foundation, the NASSCOM partnership will help to meet South Asia’s urgent need for standardized AI benchmarks that are collaboratively designed and trusted by the region’s industry experts, policymakers, civil society members, and academic researchers. MLCommons’ partnership with the AI Verify Foundation – in close collaboration with the National University of Singapore – has already resulted in significant progress towards globally-inclusive AI benchmarking across East Asia, including just-released proof of concept scores for Chinese-language LLMs.

AILuminate is also unveiling new reliability grades for an updated and expanded suite of LLMs, to help companies around the world better measure product risk. Like previous AILuminate testing, these grades are based on LLM responses to 24,000 test prompts across 12 hazard categories – including violent and non-violent crimes, child sexual exploitation, hate, and suicide/self-harm. None of the LLMs evaluated were given any advance knowledge of the evaluation prompts (a common problem in non-rigorous benchmarking), nor access to the evaluator model used to assess responses. This independence provides a methodological rigor uncommon in standard academic research or private benchmarking.

“Companies are rapidly incorporating chatbots into their products, and these updated grades will help them better understand and compare risk across new and constantly-updated models,” said Rebecca Weiss, Executive Director of MLCommons. “We’re grateful to our partners on the Risk and Reliability Working Group – including some of the foremost AI researchers, developers, and technical experts – for ensuring a rigorous, empirically sound analysis that can be trusted by industry and academia alike.”

Having successfully expanded the AILuminate benchmark to multiple languages, the AI Risk & Reliability Working Group is beginning the process of evaluating reliability across increasingly sophisticated AI tools, including multi-modal LLMs and agentic AI. We hope to announce proof-of-concept benchmarks in these spaces later this year.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI risk and reliability.

MLCommons MLPerf Training Expands with Llama 3.1 405B
https://mlcommons.org/2025/05/training-llama31405b/ | Mon, 05 May 2025
MLCommons adds new pretraining benchmark for testing large-scale systems

About MLPerf Training

Pretraining Large Language Models (LLMs) is the initial phase of LLM training, in which the model learns to understand and generate human-like text by training on vast amounts of unstructured textual data. A pre-trained model can then be fine-tuned on domain-specific datasets, which gives the model knowledge of specific tasks. Reasoning approaches have also become popular recently, helping models better explain their thinking process. Because pretraining uses such large amounts of data, it is generally the most compute-intensive phase of LLM training.

MLCommons® added a GPT3 Pretraining benchmark to MLPerf® Training in 2022 to help academics and industry experts optimize LLM pretraining. It has been a huge success, as we have seen over 3x speedups on the results of this benchmark for large systems (over 10,000 accelerators).

Over this same period, newer state-of-the-art models have demonstrated greater scale and many architectural advancements. For example, Google’s PaLM model, released in 2022, contains 540B parameters, roughly three times GPT3’s 175B parameter count. Meta’s Llama 2 and Llama 3 series, released in 2023 and 2024 respectively, utilize new algorithms such as Root Mean Square Layer Normalization (RMSNorm; Zhang et al., 2019, https://arxiv.org/abs/1910.07467), Rotary Position Embedding (RoPE; Su et al., 2024), and Grouped Query Attention (GQA; Ainslie et al., 2023).

An MLPerf Training working group task force was created to investigate a July 2024 proposal to replace GPT3 with Llama 3.1 405B. Task force members evaluated the proposal and industry options to recommend the best new pretraining reference. They agreed that Llama 3.1 405B would best represent the current state of the art in pretraining and adopted it as the new benchmark. Details on the review and selection process are included in this blog.

Model selection

The most important factor the task force considered when constructing this new pretraining benchmark was the model architecture. As mentioned above, the GPT3 architecture does not include some of the recent algorithmic updates and is therefore not competitive with state-of-the-art models such as Nemotron-4, GPT-4.5, and Claude 3.7 Sonnet. Meta’s Llama 3.1 405B model, however, with an architecture and scale similar to these top-tier models, demonstrates on-par performance with them across multiple quality benchmarks, positioning it as a competitive representative of the current state of the art.

Choosing a model with high community engagement was also an important consideration, and Llama 3.1 405B is the perfect candidate on that basis. With more than 300 million total downloads of all Llama versions to date, the Llama family of models is very widely adopted. The MLPerf community, for instance, has already included Llama 2 in the LoRA fine-tuning benchmark. Additionally, the technical paper released by Meta details the training process, model architecture, dataset, and hyperparameters, which allowed us to easily create a high-quality benchmark based on their work.

One other requirement for the new MLPerf Training benchmark is that the model architecture must be publicly accessible so that all submitters can download it and reproduce the results on their end. This typically requires a permissive license. Thanks to Meta’s support, the current Llama 3.1 405B model is now available to all MLCommons members for benchmarking. 

Dataset selection

To help submitters transition easily from GPT3 to Llama 3.1 405B, we use the same AllenAI C4-en dataset (the English split of the Colossal Clean Crawled Corpus) in both benchmarks. This dataset contains 365M English-language paragraphs scraped from the internet, cleaned and preprocessed into a convenient JSON format for training the model.
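For readers who want to inspect the corpus itself, the short sketch below streams a few records from the public Hugging Face mirror of AllenAI's C4-en dataset. The mirror name and record fields reflect the public release and are not part of the MLPerf reference data pipeline; treat this as an exploration aid rather than the benchmark's preprocessing code.

```python
# A minimal sketch for peeking at C4-en via the public "allenai/c4" mirror on
# Hugging Face; streaming avoids downloading the full multi-hundred-GB corpus.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(c4):
    # Each record carries the cleaned document text (plus URL and timestamp).
    print(example["text"][:200])
    if i == 2:
        break
```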

One key difference is that Llama 3.1 405B uses a context length of 8,192 tokens, 4x larger than GPT3’s 2,048-token context. The context length is the amount of text that a model can process at once; a larger context length allows the model to learn from longer-range dependencies in the text.

Benchmark technical details

The reference code is implemented in the NVIDIA NeMo Framework, an open-source, scalable AI framework built for large language models. By offering easy-to-use modular components and pre-trained model recipes, NeMo enables developers to efficiently build performant models tailored to specific use cases. The reference code functionality is tested on NVIDIA H100 GPUs.

To ensure that the new benchmark’s model setup is easy to understand, without complex knobs and hyperparameters scattered throughout the training script, the benchmark starts from NeMo’s integrated Llama 3.1 405B training recipe. The recipe closely follows the original paper’s hyperparameters, and we introduced only the minimum necessary changes, such as adjusting the optimizer’s learning rates.

When researching how to construct this benchmark, we first explored the GPT3-style setup. For the GPT3 pretraining benchmark, the task force pretrained from random weights for about 12.6T tokens to generate a stable customized checkpoint. Submitters were required to continue pretraining from this checkpoint until they reached a predetermined quality metric. This methodology was used to ensure that the GPT3 pretraining benchmark showcased a reproducible slice of the actual GPT3 pretraining process.

Applying the same methodology to the Llama 3.1 405B pretraining benchmark, the task force generated a customized checkpoint trained on 32B tokens. However, experiment results showed that resuming from this checkpoint produced more variation in the convergence curves than we could tolerate, which meant that resuming from a custom-trained checkpoint was not feasible for this benchmark. Our interpretation is that, after only 32B tokens, the model was still in a very early phase of training, so the generated checkpoint was still quite unstable.

So the task force decided to start from the fully trained and stable Hugging Face checkpoint and continue training it on new data, which led to a much more stable convergence trend. The benchmark now begins from that checkpoint and stops when the evaluation log perplexity reaches our defined target.

One noticeable difference between the task force’s implementation and Meta’s implementation is that we replaced the original tiktoken-based tokenizer, which uses a 128k vocabulary, with the 32k-vocabulary Mixtral 8x22B tokenizer. The motivation behind this change is that Meta’s checkpoint is already well trained, so it would not learn anything new on the current C4 dataset. Replacing the tokenizer forces the model checkpoint to adapt to a new, different token distribution and therefore to continue pretraining on the current dataset. This difference implies that, when we resume from the pre-trained checkpoint, only the first 32,000 rows of the 128,256-row word embedding layer weights are loaded into the model.
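A minimal sketch of this embedding-truncation step is shown below, assuming a PyTorch-style state dict; the tensor key and the toy hidden size are placeholders for illustration, not the benchmark's reference implementation.

```python
# Illustrative only: keep the first 32,000 rows of a 128,256-row word embedding
# so the checkpoint matches the smaller replacement tokenizer's vocabulary.
import torch

OLD_VOCAB = 128_256   # rows in the released word embedding
NEW_VOCAB = 32_000    # vocabulary size of the replacement 32k tokenizer

def shrink_embedding(state_dict: dict, key: str = "model.embed_tokens.weight") -> dict:
    full = state_dict[key]
    assert full.shape[0] == OLD_VOCAB, "unexpected embedding size"
    state_dict[key] = full[:NEW_VOCAB].clone()
    return state_dict

# Toy stand-in for the real checkpoint (hidden size shrunk to keep it cheap).
toy = {"model.embed_tokens.weight": torch.randn(OLD_VOCAB, 16)}
toy = shrink_embedding(toy)
print(toy["model.embed_tokens.weight"].shape)  # torch.Size([32000, 16])
```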

As mentioned before, the Llama 3.1 405B pretraining benchmark is a replacement for the GPT3 pretraining benchmark, and the following components will be unchanged between the two:

  • Validation dataset: To reduce the cost of validation, we created a customized subset of the full C4 validation dataset so that evaluation time is not significant compared to training time. Specifically, we found that using only 5,760 sequences is sufficient to measure model convergence; since only 91,205 validation samples are needed to yield those 5,760 validation sequences, the first 91,205 unshuffled rows of the full C4 validation dataset were selected as the customized validation dataset for this benchmark.
  • Loss and target accuracy metric: As with GPT3, log perplexity, computed on the customized validation dataset, is used to evaluate the model’s convergence behavior. This metric is not costly to compute and is widely adopted as a versatile indicator for evaluating LLMs (a minimal computation sketch appears after this list).
  • Limit on the number of submission runs: We recognize that the Llama 3.1 405B pretraining benchmark is very expensive to run, so to ensure fairness while reducing submitters’ costs, each submitter is required to run the benchmark only three times, the same as for the GPT3 pretraining benchmark. The median result is then reported.
  • Restriction on hyperparameter searching: Because running the benchmark is expensive, we believe it is reasonable to disallow hyperparameter searches and borrowing. Most hyperparameters have been given static values. For a few others, such as the learning rate, formulas are provided to compute them from the global batch size.
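As a concrete reference for the loss metric above, the sketch below computes evaluation log perplexity as the mean per-token negative log-likelihood over a validation loader. It assumes a Hugging Face-style model whose forward pass returns `.logits`; it illustrates the metric, not the benchmark's reference evaluation code.

```python
# Minimal log-perplexity evaluation: average per-token negative log-likelihood.
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_log_perplexity(model, val_loader, device="cuda"):
    total_nll, total_tokens = 0.0, 0
    for batch in val_loader:
        input_ids = batch["input_ids"].to(device)        # (B, T)
        labels = input_ids[:, 1:]                        # next-token targets
        logits = model(input_ids[:, :-1]).logits         # (B, T-1, V)
        nll = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            labels.reshape(-1),
            reduction="sum",
        )
        total_nll += nll.item()
        total_tokens += labels.numel()
    return total_nll / total_tokens   # log perplexity; exp() gives perplexity
```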

Conclusion

In this blog post, we have provided an overview of a new LLM pretraining benchmark based on Meta’s Llama 3.1 405B model, with more than twice as many parameters and four times as many training tokens as GPT3. This benchmark is massively more computationally intensive and brings its own new challenges. It is the task force’s belief that this new benchmark will encourage further innovation in LLM pretraining.

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies’ accuracy, safety, speed, and efficiency.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or email participation@mlcommons.org.

MLCommons Releases MLPerf Client v0.6 With Expanded Hardware Support for AI PCs
https://mlcommons.org/2025/04/mlperf-client-v0-6/ | Mon, 28 Apr 2025
MLCommons' MLPerf Client v0.6: First Open Benchmark for NPU and GPU AI PCs

Today, MLCommons® is announcing the release of MLPerf® Client v0.6, an update to the MLPerf Client consumer AI performance benchmark. This release extends support to a broader range of hardware and platforms, including AI PCs with dedicated neural processing units (NPUs), while enhancing usability.

MLPerf Client v0.6 builds on the foundation established by version 0.5, which debuted with LLM-focused performance tests using the Llama 2 7B model from Meta. With this latest release, MLCommons continues its mission to provide a transparent, standardized, and vendor-neutral way to measure AI performance across a growing range of PC hardware.

MLPerf Client represents a collaboration among leaders in the consumer computing space, including AMD, Intel, Microsoft, NVIDIA, Qualcomm Technologies, Inc., and top PC OEMs. These stakeholders have pooled resources and expertise to create a standardized performance benchmark for key consumer AI workloads.

Key Updates in v0.6:

  • Expanded Hardware Support:
    New support for NPUs from Intel, alongside continued GPU acceleration via AMD, Intel, and NVIDIA hardware. This milestone makes MLPerf Client the first open benchmark to span both GPU and NPU acceleration on consumer platforms.
  • Improved Device Selection:
    New device enumeration options help users better target and test systems with multiple capable accelerators, such as PCs equipped with multiple GPUs.
  • Updated Software Stack:
    Includes the latest versions of ONNX Runtime, ONNX Runtime GenAI, and Intel OpenVINO, offering performance and compatibility improvements across supported platforms.

“MLPerf Client v0.6 reflects the rapid evolution of the AI PC landscape with the inclusion of NPU evaluation,” said Yanni Minkdakis, co-chair of the MLPerf Client working group. “With expanded support and more flexible testing, it’s now easier than ever for the industry and consumers alike to evaluate real-world AI performance on next-generation devices.”

MLPerf Client v0.6 is available now as a free download at mlcommons.org.

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies’ accuracy, safety, speed, and efficiency.

MLPerf Client participation requires an MLCommons membership. For more information and details on becoming a member, please visit MLCommons.org or contact participation@mlcommons.org.

MLCommons Releases French AILuminate Benchmark Demo Prompt Dataset to GitHub
https://mlcommons.org/2025/04/ailuminate-french-datasets/ | Wed, 16 Apr 2025
Set of 1,200 Creative Commons French Demo prompts and 12,000 Practice Test prompts exercise the breadth of hazards for generative AI systems

Following the recent French-language release of the AILuminate v1.1 benchmark, MLCommons is announcing the public release of two French-language datasets for the benchmark. AILuminate measures the reliability and integrity of generative AI language systems against twelve categories of hazards and is currently available in English and French. Today’s dataset release includes a Creative Commons-licensed Demo prompt dataset of over 1,200 prompts and a Practice Test dataset of 12,000 prompts, both of which span all twelve AILuminate hazard categories.

The Demo prompt dataset is available on GitHub now and the full Practice Test dataset may be obtained by contacting MLCommons. English versions of both datasets are also available for use.

The AILuminate benchmark

The AILuminate benchmark, designed and developed by the MLCommons AI Risk & Reliability working group, assesses LLM responses to 12,000 test prompts per language, across twelve categories of hazards that users of generative AI language systems may encounter. The benchmark was designed to help organizations evaluate language systems before deploying them in production environments, to compare systems before procurement, or to verify their language systems comply with organizational and government policies. In addition to its unique corpus of prompts, AILuminate also includes a best-in-class evaluation system using a tuned ensemble of safety evaluation models. MLCommons has used the benchmark to assess the performance of many of the most prominent AI language systems-under-test (SUTs) across all of the hazard categories, and published public reports of their findings. 

The AILuminate French language prompt datasets can help provide valuable insights into your AI systems: 

  • Demo Prompt Dataset: A Creative Commons dataset of 1,200 prompts that is a subset of the AILuminate Practice Test dataset. The French-language Demo dataset is useful for in-house testing, provides examples of specific hazards, and offers seeds and inspiration for the creation of other prompts. It is available today on GitHub (a quick loading sketch appears after this list).
  • Practice Test Dataset: Includes 12,000 prompts. This dataset closely resembles the official test and is intended to provide a realistic projection of how models would be evaluated by AILuminate when used with the MLCommons AILuminate evaluator. Access to the comprehensive Practice dataset is available upon request from MLCommons. 
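As a quick illustration of in-house spot checking with the Demo dataset, the sketch below loads a local copy of the prompt file with pandas. The file name and column names used here (`prompt_text`, `hazard`) are assumptions for illustration; consult the GitHub repository's documentation for the actual schema.

```python
# Hypothetical quick-start: group the Demo prompts by hazard category and
# sample a few for manual review. File and column names are assumed.
import pandas as pd

demo = pd.read_csv("ailuminate_demo_prompts_fr.csv")   # assumed local file name

# Confirm coverage across the twelve hazard categories.
print(demo["hazard"].value_counts())

# Pull a handful of prompts from one category for human review.
print(demo[demo["hazard"] == "hate"].sample(3, random_state=0)["prompt_text"])
```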

AILuminate’s prompts are based on 12 hazard categories – including violent and non-violent crimes, child sexual exploitation, hate, and suicide/self-harm – which are extensively documented in the AILuminate Assessment Standard.

All prompts are currently available in English and French, with future plans to expand to Simplified Chinese and Hindi. Each prompt dataset is human-made and reviewed by native language speakers for linguistic and cultural relevancy, to ensure accurate linguistic representation. MLCommons intends to regularly update the prompt datasets, including the Creative Commons Demo dataset as the community’s understanding of AI hazards evolves – for example, as agentic behavior is developed in AI systems. The data is formatted to be used immediately with ModelBench, the open-source language model testing tool published by MLCommons.

“AI reliability is a global problem that requires an intentionally global approach to address,” said Peter Mattson, Founder and President of MLCommons. “The availability of the French Demo and Practice Test datasets reflects our commitment to evaluating AI risk concerns across languages, cultures, and value systems — and ensures our benchmarks and datasets are readily available to vendors and developers across continents.” 

Raising the bar for reliability testing by empowering stakeholders

The AILuminate Demo prompt datasets provide an important starting point for those who are developing their in-house abilities to test the safety of generative AI systems across the twelve defined AILuminate hazard categories. While we discourage training AI systems directly using this data – following the industry best practice of maintaining separate evaluation suites to avoid overfitting AI models – we do expect the dataset to have direct relevance to the work of a wide variety of stakeholders. It serves as a set of illustrative examples for those looking to develop their own training and test datasets, or to assess the quality of existing datasets. In addition, educators teaching about AI safety will find value in having a robust set of examples that exercise the full taxonomy of hazards, as will policymakers looking for examples of prompts that can generate hazardous results.

Download the Demo prompt dataset today!

The Demo test dataset is made available using the Creative Commons CC-BY-4.0 license and can be downloaded directly from GitHub. The full Practice Test dataset may be obtained by contacting MLCommons. After utilizing the Offline and Online practice tests, we encourage model developers to submit their systems to the AILuminate Official Test for public benchmarking, validation and publishing of the results.

More information on AILuminate can be found here, including a FAQ. We encourage stakeholders to join the MLCommons AI Risk & Reliability working group and our efforts to ensure greater reliability of AI systems.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI risk and reliability. We invite others to join the AI Risk & Reliability working group.

FRENCH ANNOUNCEMENT

MLCommons publie le jeu de données de prompts de démonstration AILuminate en français sur GitHub

Une série de 1 200 prompts Creative Commons en français explore l’étendue des risques pour les systèmes d’IA générative

Suite au récent lancement de la version française du benchmark AILuminate v1.1, MLCommons annonce la mise à disposition publique de deux jeux de données en langue française pour ce benchmark. AILuminate évalue la fiabilité et l’intégrité des systèmes d’IA générative dans douze catégories de risques, et ses benchmarks sont désormais disponibles pour les systèmes en anglais et en français. La publication d’aujourd’hui inclut un jeu de 1 200 prompts de démonstration sous licence Creative Commons et un jeu de 12 000 prompts d’essai couvrant les douze types de risques définis par AILuminate.

Le jeu de données de démonstration est dès maintenant disponible sur GitHub, tandis que le jeu complet d’essai peut être obtenu en contactant MLCommons. La version anglaise des deux jeux est également disponible.

Le benchmark AILuminate

Le benchmark AILuminate, conçu et développé par le groupe de travail sur les risques et la fiabilité des IA chez MLCommons, évalue les réponses des modèles linguistiques (LLM) à 12 000 prompts par langue, répartis dans douze catégories de risques auxquels les utilisateurs peuvent être confrontés. Ce benchmark aide les organisations à évaluer leurs systèmes avant leur déploiement, à comparer les produits avant achat ou à vérifier leur conformité aux politiques internes et officielles. En plus d’une collection unique de prompts, AILuminate inclut un système d’évaluation avancé basé sur un ensemble optimisé de modèles d’évaluation des risques. MLCommons a utilisé ce benchmark pour analyser la performance des principaux systèmes linguistiques testés (SUTs) dans toutes les catégories de risques et a publié des rapports publics sur leurs résultats.

Les jeux de données en langue française AILuminate offrent des perspectives précieuses pour vos systèmes d’IA :

  • Jeu de prompts de démonstration : Un sous-ensemble Creative Commons contenant plus de 1 200 prompts issus du jeu de prompts d’essai. Ce jeu est utile pour les tests internes, fournit des exemples spécifiques aux risques identifiés et peut inspirer la création d’autres prompts. Il est disponible dès aujourd’hui sur GitHub.
  • Jeu de prompts d’essai : Comprend 12 000 prompts conçus pour simuler une évaluation réaliste par le système officiel AILuminate. Ce jeu complet peut être obtenu sur demande auprès de MLCommons.

Les prompts sont basés sur les douze catégories de risques définies dans le standard d’évaluation AILuminate, qui inclut notamment :

[Image : catégories de risques AILuminate traduites en français]

Tous les prompts sont disponibles aujourd’hui en anglais et en français, et seront bientôt disponibles en chinois simplifié et en hindi. Les prompts sont rédigés par un être humain, puis contrôlés par une personne de langue maternelle afin d’en assurer la pertinence linguistique et culturelle. MLCommons prévoit de les mettre à jour régulièrement, notamment le jeu Creative Commons, à mesure que la compréhension communautaire des risques liés à l’IA évolue – par exemple avec le développement du comportement agentique dans les systèmes d’IA. Les données sont formatées pour une utilisation immédiate avec ModelBench, l’outil open-source publié par MLCommons.

« La fiabilité des systèmes IA est un problème mondial nécessitant une approche internationale délibérée », a déclaré Peter Mattson, fondateur et président de MLCommons. « La disponibilité des jeux de prompts français reflète notre engagement quant à l’évaluation des préoccupations liées aux risques de l’IA à travers différentes langues, cultures et systèmes de valeurs – garantissant que nos benchmarks soient accessibles aux entreprises commerciales et aux développeurs dans le monde entier. »

Élever les standards des tests IA en responsabilisant les parties prenantes

Les jeux de données AILuminate fournissent un point de départ important pour le développement des capacités internes en permettant d’évaluer la sécurité des systèmes d’IA générative dans les douze catégories définies par AILuminate. Bien que nous déconseillions l’utilisation de ces prompts pour entraîner des systèmes d’IA (les meilleures pratiques de l’industrie recommandent l’isolation des jeux de données d’évaluation afin d’éviter le surapprentissage), ces prompts sont exploitables dans de nombreux secteurs. Ils servent d’exemples représentatifs pour les utilisateurs qui souhaitent développer leurs propres jeux de données d’entraînement et d’évaluation ou évaluer la qualité de jeux de données existants. Les formateurs en sécurité de l’IA sauront profiter de cet ensemble complet d’exemples couvrant une taxonomie de risques étendue, et les décideurs politiques pourront y trouver des cas concrets où certains prompts peuvent générer des résultats dangereux.

Téléchargez le jeu de données aujourd’hui !

Le jeu de prompts de démonstration est disponible sous licence Creative Commons CC-BY-4.0 et peut être téléchargé directement sur GitHub. Le jeu d’essai complet peut être obtenu en contactant MLCommons. Après l’utilisation des tests d’essai hors ligne ou en ligne, nous encourageons les développeurs à soumettre leurs systèmes à AILuminate pour obtenir un benchmarking public avec validation et publication officielle des résultats.

Pour en savoir plus sur AILuminate et accéder à une FAQ détaillée, nous encourageons toutes les parties prenantes à rejoindre le groupe MLCommons sur les risques IA et la fiabilité.

À propos de MLCommons

MLCommons est le leader mondial du développement de benchmarks pour l’IA. C’est un consortium ouvert dont la mission est d’améliorer l’IA pour tous par la publication de benchmarks et de jeux de données publics. Formée de plus de 125 membres (entreprises de technologie, universitaires et chercheurs du monde entier), MLCommons se concentre sur le développement collaboratif d’outils pour l’industrie de l’IA par le biais de benchmarks, de jeux de données et de mesures publics traitant des risques et de la fiabilité de l’IA. Nous invitons toutes les parties intéressées à rejoindre le groupe sur les risques de l’IA chez MLCommons.

CKAN Announces Support for Croissant
https://mlcommons.org/2025/04/ckan-croissant/ | Wed, 16 Apr 2025
MLCommons Croissant metadata standard is now connected to CKAN’s open data catalog capabilities with ML workflows


CKAN, the world’s leading open source data management system, recently announced support for the Croissant metadata standard, a format designed by MLCommons to make datasets machine learning-ready. CKAN sites can now easily expose their datasets to ML pipelines with the help of the newly released croissant plugin. This integration, powered by the ckanext-dcat 2.3.0 extension, connects CKAN’s open data catalog capabilities with ML workflows, improving dataset discoverability and interoperability.
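To give a sense of what Croissant metadata looks like, the sketch below builds a minimal Croissant-style JSON-LD description in Python. The field values and the simplified `@context` are illustrative assumptions, not an actual CKAN export or the full Croissant 1.0 context.

```python
# Illustrative Croissant-style dataset description (simplified; not a real export).
import json

croissant_metadata = {
    "@context": {"@vocab": "https://schema.org/", "cr": "http://mlcommons.org/croissant/"},
    "@type": "Dataset",
    "conformsTo": "http://mlcommons.org/croissant/1.0",
    "name": "example-ckan-catalog-entry",
    "description": "Tabular dataset exposed by a CKAN catalog for ML pipelines.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/dataset/example",
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "data.csv",
            "contentUrl": "https://example.org/dataset/example/data.csv",
            "encodingFormat": "text/csv",
        }
    ],
}

print(json.dumps(croissant_metadata, indent=2))
```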

Learn more on the CKAN blog: Bridging CKAN and Machine Learning: Introducing Support for the Croissant Standard

MLCommons Releases New MLPerf Inference v5.0 Benchmark Results
https://mlcommons.org/2025/04/mlperf-inference-v5-0-results/ | Wed, 02 Apr 2025
Generative AI now the center of attention for performance engineering

Today, MLCommons® announced new results for its industry-standard MLPerf® Inference v5.0 benchmark suite, which delivers machine learning (ML) system performance benchmarking in an architecture-neutral, representative, and reproducible manner. The results highlight that the AI community is focusing much of its attention and efforts on generative AI scenarios, and that the combination of recent hardware and software advances optimized for generative AI have led to dramatic performance improvements over the past year.

The MLPerf Inference benchmark suite, which encompasses both datacenter and edge systems, is designed to measure how quickly systems can run AI and ML models across a variety of workloads. The open-source and peer-reviewed benchmark suite creates a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry. It also provides critical technical information for customers who are procuring and tuning AI systems. This round of MLPerf Inference results also includes tests for four new benchmarks: Llama 3.1 405B, Llama 2 70B Interactive for low-latency applications, RGAT, and Automotive PointPainting for 3D object detection.

Llama 2 70B generative AI test takes center stage

The Inference v5.0 results show that generative AI scenarios have gained momentum. Over the last year, submissions to the Llama 2 70B benchmark test – which implements a large generative AI inference workload based on a widely referenced open-source model – have increased 2.5x. With the v5.0 release, Llama 2 70B has now supplanted ResNet50 as the test with the highest submission rate.

The performance results for Llama 2 70B have also skyrocketed since a year ago: the median submitted score has doubled, and the best score is 3.3 times faster compared to Inference v4.0.

“It’s clear now that much of the ecosystem is focused squarely on deploying generative AI, and that the performance benchmarking feedback loop is working,” said David Kanter, head of MLPerf at MLCommons. “We’re seeing an unprecedented flood of new generations of accelerators. The hardware is paired with new software techniques, including aligned support across hardware and software for the FP4 data format. With these advances, the community is setting new records for generative AI inference performance.”

The benchmark results for this round include results for six newly available or soon-to-be-shipped processors:

  • AMD Instinct MI325X
  • Intel Xeon 6980P “Granite Rapids”
  • Google TPU Trillium (TPU v6e)
  • NVIDIA B200
  • NVIDIA Jetson AGX Thor 128
  • NVIDIA GB200

Benchmarking the state of the art in generative AI: two new tests introduced

In step with advances in the AI community, MLPerf Inference v5.0 introduces a new benchmark utilizing the Llama 3.1 405B model, setting a new bar for the scale of a generative AI inference model in a performance benchmark. Llama 3.1 405B incorporates 405 billion parameters and supports input and output lengths up to 128,000 tokens (compared to only 4,096 tokens for Llama 2 70B). The benchmark tests three separate tasks: general question-answering, math, and code generation.

“This is our most ambitious inference benchmark to-date,” said Miro Hodak, co-chair of the MLPerf Inference working group. “It reflects the industry trend toward larger models, which can increase accuracy and support a broader set of tasks. It’s a more difficult and time-consuming test, but organizations are trying to deploy real-world models of this order of magnitude. Trusted, relevant benchmark results are critical to help them make better decisions on the best way to provision them.”

The Inference v5.0 suite also adds a new twist to its existing benchmark for Llama 2 70B with an additional test that adds low-latency requirements: Llama 2 70B Interactive. Reflective of industry trends toward interactive chatbots as well as next-generation reasoning and agentic systems, the benchmark requires systems under test (SUTs) to meet more demanding system response metrics for time to first token (TTFT) and time per output token (TPOT).
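For readers unfamiliar with these two latency metrics, the sketch below shows one straightforward way to measure TTFT and TPOT around a streaming generation call; `stream_tokens` is a placeholder for whatever streaming client is in use, not an MLPerf API.

```python
# Measure time-to-first-token (TTFT) and time-per-output-token (TPOT) for one
# streamed response; stream_tokens(prompt) is assumed to yield tokens as they arrive.
import time

def measure_latency(stream_tokens, prompt: str):
    start = time.perf_counter()
    first_token_time = None
    n_tokens = 0
    for _token in stream_tokens(prompt):
        if first_token_time is None:
            first_token_time = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    if first_token_time is None:
        raise RuntimeError("no tokens were produced")
    ttft = first_token_time - start
    # TPOT is commonly reported as the average inter-token gap after the first token.
    tpot = (end - first_token_time) / max(n_tokens - 1, 1)
    return ttft, tpot
```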

“A critical measure of the performance of a query system or a chatbot is whether it feels responsive to a person interacting with it. How quickly does it start to reply to a prompt, and at what pace does it deliver its entire response?” said Mitchelle Rasquinha, MLPerf Inference working group co-chair. “By enforcing tighter requirements for responsiveness, this interactive version of the Llama 2 70B test offers new insights into the performance of LLMs in real-world scenarios.”

More information on the selection of the Llama 3.1 405B and the new Llama 2 70B Interactive benchmarks can be found in this supplemental blog.

New Graph Neural Network datacenter benchmark for modeling relationship graphs

Also new to Inference v5.0 is a datacenter benchmark that implements a graph neural network (GNN) model. GNNs are useful for modeling links and relationships between nodes in a network and are commonly used in recommendation systems, knowledge-graph answering, fraud-detection systems, and other kinds of graph-based applications.

The GNN datacenter benchmark implements the RGAT model, based on the Illinois Graph Benchmark Heterogeneous (IGBH) dataset containing 547,306,935 nodes and 5,812,005,639 edges.

More information on the construction of the RGAT benchmark can be found here.

New edge benchmark: Automotive PointPainting test for 3D object detection

The Inference v5.0 benchmark introduces a new Automotive PointPainting benchmark for edge computing devices, specifically automobiles. While the MLPerf Automotive working group continues to develop the Minimum Viable Product benchmark first announced last summer, this test provides a proxy for an important edge-computing scenario: 3D object detection in camera feeds for applications such as self-driving cars.

More information on the Automotive PointPainting benchmark can be found here.

As industry raises the bar for AI systems, MLPerf Inference benchmark follows suit

“We rarely introduce four new tests in a single update to the benchmark suite,” said Miro Hodak, “but we felt it was necessary to best serve the community. The rapid pace of advancement in machine learning and the breadth of new applications are both staggering, and stakeholders need relevant and up-to-date data to inform their decision-making.”

MLPerf Inference v5.0 includes 17,457 performance results from 23 submitting organizations: AMD, ASUSTeK, Broadcom, Cisco, CoreWeave, CTuning, Dell, FlexAI, Fujitsu, GATEOverflow, Giga Computing, Google, HPE, Intel, Krai, Lambda, Lenovo, MangoBoost, NVIDIA, Oracle, Quanta Cloud Technology, Supermicro, and Sustainable Metal Cloud.

“We would like to welcome the five first-time submitters to the Inference benchmark: CoreWeave, FlexAI, GATEOverflow, Lambda, and MangoBoost,” said David Kanter. “The continuing growth in the community of submitters is a testament to the importance of accurate and trustworthy performance metrics to the AI community. I would also like to highlight Fujitsu’s broad set of datacenter power benchmark submissions and GATEOverflow’s edge power submissions in this round, which reminds us that energy efficiency in AI systems is an increasingly critical issue in need of accurate data to guide decision-making.”

“The machine learning ecosystem continues to give the community ever greater capabilities. We are increasing the scale of AI models being trained and deployed, achieving new levels of interactive responsiveness, and deploying AI compute more broadly than ever before,” said Kanter. “We are excited to see new generations of hardware and software deliver these capabilities, and MLCommons is proud to present exciting results for a wide range of systems and several novel processors with this release of the MLPerf Inference benchmark. Our work to keep the benchmark suite current, comprehensive, and relevant at a time of rapid change is a real accomplishment, and ensures that we will continue to deliver valuable performance data to stakeholders.”

View the results

To view the results for MLPerf Inference v5.0, please visit the Datacenter and Edge benchmark results pages.

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies’ accuracy, safety, speed, and efficiency.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or email participation@mlcommons.org.

Unlocking Data Collaboration with AI-Ready Licenses
https://mlcommons.org/2025/03/unlocking-data-collab/ | Mon, 17 Mar 2025
New research shows promising signs that adopting modular, standard data license agreements can clarify key terms, reduce transaction costs, and enable more accessible data use across borders and sectors.

The rapid advancement of artificial intelligence (AI) is fueling the need for complex datasets containing various high-quality text, images, audio, video, and specialized formats. Despite the growing interest in AI and its evident social impact, stakeholders face challenges in establishing reliable frameworks for responsibly sharing AI data. A significant hurdle is the absence of widely accepted, standardized contractual frameworks (i.e., data licenses) for this purpose.

Research funded through an unrestricted grant by the MLCommons Association and recently published by the Open Data Institute (ODI) and the Pratt School of Engineering at Duke University explored the motivations driving data sharing throughout the AI ecosystem. The goal was to inform the design of more fit-for-purpose AI data licenses. A blog published by the OECD.AI/GPAI also recently featured this work.  

Using a mixed-methods approach that combined interviews, an online survey, and a review of literature, this research captured a broad spectrum of insights from across eight jurisdictions and multiple sectors covering key roles such as data holders, data users, and intermediaries. Here, we give an overview of these research findings, demonstrating how standard data license agreements hold the potential to mitigate legal uncertainties, reduce transaction costs, and promote fair AI data access, particularly when combined with standard data governance and other tools.

The following key findings are organized around four main themes: 1) leveraging new standard data licenses to promote greater data access and social good; 2) addressing existing ambiguities in current contractual approaches to data sharing; 3) the need for data licensing simplicity and for balancing standardization with some customization; and 4) developing a new standard data licensing framework to help address other data-sharing challenges, including legal compliance and ethical considerations.

[Figure: data licensing pathway steps]

1. New standard data licenses can promote greater responsible data access and AI for social good.

a) No standardized data licenses have been widely adopted, increasing data sharing challenges for under-represented and smaller organizations. The research indicates that the lack of widely adopted, standardized data licenses may create significant barriers to voluntary, legally compliant, and ethical data sharing. This gap might drive up costs and impose contractual burdens that particularly hinder smaller or under-represented organizations from fully benefiting from AI.

b) Data access disparities also impact the advancement of AI for social good. Limited cross-cultural and cross-linguistic data sharing can restrict access to high-quality datasets. The research shows that this shortfall might not only impede responsible AI development among research institutions, governments, and non-profits (especially in developing regions) but also diminish the diversity and richness of AI models.

c) Multi-stakeholder participation can help develop standard data licenses that advance equitable data access and AI for social good. According to the research, involving a diverse range of stakeholders (including representatives from different geographic regions, sectors, and groups that have traditionally faced challenges in the technology sector) appears to be essential. Such multi-stakeholder participation can lead to the development of standard licensing frameworks that more effectively address disparities and foster equitable data access for social good.

2. New standard data licenses can help address existing ambiguities and challenges with current contractual approaches to data sharing.

a) The lack of widely embraced standard data licenses leads to inconsistency and incompatibility and increases the challenges of using data in compliance with license terms. Through this research, it was confirmed that customized license agreements currently in use may lead to inconsistencies and incompatibilities, which in turn increase legal uncertainty and compliance costs. This fragmentation overall makes it more challenging for organizations to share data effectively across their ecosystems.

b) Confusion exists about the meaning of non-commercial limitations in data licenses, and clarification in new standard licenses might encourage more data-sharing transactions. This work highlights that the lack of definitions of “non-commercial” use in current licenses creates significant confusion. Clarifying this term within standard licenses could encourage more frequent and compliant data-sharing transactions.

c) New standard data licenses could clarify rights emerging from AI and provide different options for contractually allocating rights and liabilities. The research shows that commonly used existing licenses do not clearly address the allocation of rights for data usage (or for AI models and outputs developed using such data) resulting in uncertainties regarding rights and liabilities. Standard licenses that offer clearer contractual options, in line with applicable laws, could help resolve these issues.

d) New standard data licenses can help address attribution issues arising under existing agreements. Attribution requirements in widely used license frameworks can be impractical for AI applications, where data used in model training is not directly reproduced in outputs. The research therefore suggests that standardizing license terms and integrating technical solutions like automatic attribution may reduce this ambiguity and simplify data sharing.

3. New standard data licenses should embrace simplicity while balancing standardization with the need for some customization.

a) There is a pressing need for simplicity. Throughout, the research emphasized that drafting licenses in clear, easily interpretable language is critical for reducing transaction costs and building trust, especially for non-legal practitioners.

b) Standard data licensing frameworks should balance the need for standardization with some customization, given the many data types, use cases, models, tasks, and legal requirements. The research also indicates that while standardization can streamline data sharing, licenses may also need to allow for some customization to accommodate the diversity of data types, use cases, AI models, and jurisdiction-specific legal requirements across the AI data lifecycle.

c) Standard modular data license terms (and/or a menu of standard data license agreements) could balance standardization and some customization. Therefore, the research suggests that a modular framework or a menu of license options (developed through a collaborative, multi-stakeholder process) could effectively balance the need for uniformity with the necessity for customization to address varied stakeholder needs.

d) Standard definitions can advance data licensing. The research points to the potential benefits of developing and adopting standard definitions for key terms. Such definitions can further simplify license agreements and help maintain an effective balance between standardization and necessary customization.

4. A new standard data licensing framework may help address other data-sharing challenges, including legal compliance, data and AI governance, and ethical considerations.

a) Standard data license agreements could help address other challenges in implementing voluntary, legally compliant data sharing, including by helping advance data and AI governance. Technical tools for tracking data provenance and lineage, and for easily processing the terms of standard licenses, can enhance transparency and reduce compliance costs. This, the research shows, can support stronger data and AI governance practices. For instance, MLCommons’ Croissant initiative (a metadata format for ML datasets) simplifies discovery and integration by embedding legal and compliance measures into the AI pipeline; representing legal and contractual frameworks as metadata can help reduce uncertainty in data-sharing exchanges. Lastly, complementary tools, such as automatic attribution services in Apache Atlas, can further ensure that license-defined rights are easily represented and therefore upheld.

b) Standard data licenses may help address cross-border and other compliance challenges. The diversity of data-related regulations across jurisdictions (e.g., data protection) continues to create substantial barriers to data sharing. To ease compliance, standard data license terms should be prepared with these legal matters in mind. 

c) Standard data licenses may help advance ethical guidelines for data usage. The research confirmed interest in having ethical restrictions on data and AI model usage in at least some scenarios. A new standard data licensing framework could reference relevant ethical codes of conduct. Some concern was expressed about incorporating codes of conduct or standards into standard data licenses, since these could change over time, increasing uncertainty in the contractual terms and the costs of compliance. These factors should be considered in the global, multi-stakeholder process developed to prepare standard licenses.

Looking forward

Overall, this work shows promising signs that adopting modular, standard data license agreements can clarify key terms, reduce transaction costs, and enable more accessible data use across borders and sectors. Additionally, it demonstrates that integrating external technical tools and frameworks like Apache Atlas and Croissant with these legal frameworks enhances data discoverability and usability. Together, these measures can help to build a transparent, interoperable ecosystem for responsible AI development. Here, we have the opportunity to shape an AI-driven future that is equitable, sustainable, and indeed serves everyone.

MedPerf enhances transparency and integrity of real-world benchmarks using smart contracts
https://mlcommons.org/2025/03/medperf-smart-contracts/ | Mon, 10 Mar 2025
Protection of digital assets on MedPerf via policy enforcement using Private Data Objects (PDO)

When the MLCommons Medical AI working group set out to develop MedPerf, an open framework for benchmarking medical AI on real-world private datasets, transparency and privacy were the main technical objectives. The current implementation of MedPerf ensures this transparency and integrity through a formal benchmark committee: a governance body that oversees the entire benchmark process, from committee formation to reference implementation development, dissemination of and access to benchmark results, and execution of the benchmark. Critical to MedPerf is its federated evaluation approach, which delivers a high level of data privacy by enforcing that no medical data is transferred to perform the benchmarks.

[Figure 1: Current MedPerf benchmark workflow. The red circle shows the current policy implementation.]

The working group, in collaboration with technical member organization Intel and academic partner the University of Notre Dame, is improving MedPerf with a new policy-enhanced workflow enforced by smart contracts. A smart contract is delivered via a trusted execution environment, such as an Intel Software Guard Extensions (SGX) enclave, to ensure stronger transparency and privacy. The new workflow protects digital assets by combining a confidential smart contract framework with private data objects, binding data policy enforcement to policy evaluation and isolated data task executions. This new approach, proven through a proof of concept (POC), delivers a higher level of data transparency, accountability, and traceability.

The importance of policies to protect digital assets 

Digital asset usage policies are of the utmost importance for protecting intellectual property (IP) such as data and AI models. Unfortunately, most existing solutions require copies of digital assets with access controls when sharing, and rely on centralized governance for policy enforcement. In reality, this may actually reduce transparency and increase the risk of intellectual property theft or leakage.

This risk becomes more problematic in the healthcare space where data and AI models may contain personally identifiable information as well as very expensive IP. Transparency for medical data is critical to maintain traceability and accountability of various processes such as training, validation, and benchmarking. Any lack of transparency could be further magnified in decentralized settings such as data federations where assets are hosted and transferred in a decentralized manner. 

Trusted execution environment-based confidential smart contract frameworks with private data objects have proven to help with policy enforcement by binding policy evaluation and isolated task executions to protect digital assets. Frameworks such as these can facilitate a high level of transparency, accountability, and traceability, but to date they have been largely unexplored in the medical AI space.

MedPerf smart contracts act as enforcement vehicles for healthcare dataset use policies

Current use policies for MedPerf datasets are implicitly hardcoded in the benchmark workflow. This means a data owner must execute any requested benchmark on their local data. This approach offers users full control of their data assets, but it lacks flexibility for customization and scalability, such as decentralized hosting or handling an increasing number of requests.

To enhance policies and provide broader flexibility and scalability in MedPerf, the working group built a trusted-execution-environment POC using a confidential smart contract framework. This approach delivers policy enforcement via a private data object by binding policy evaluation and isolated task executions to protect digital assets. 

To initiate a smart contract, a MedPerf data owner creates a private data object contract that is linked to their original dataset through the dataset ID. The contract contains descriptive information about the dataset and acts as an interface to it. The data owner then securely deposits the dataset inside a digital-guardian-service trusted-execution-environment enclave, which can be hosted on premises or in the cloud. When a benchmark of a model on the dataset is requested, the contract and the digital guardian perform bi-directional attestation to ensure each other’s integrity. This approach retains MedPerf’s federated evaluation: the dataset never leaves the secure domain of the enclave, and its use is bounded by policy enforcement.
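A highly simplified sketch of this flow is shown below. The class and method names are hypothetical stand-ins for the contract and digital guardian roles described above; they are not the MedPerf, PDO, or SGX APIs.

```python
# Illustrative sequence only: attestation, policy evaluation, then in-enclave execution.
from dataclasses import dataclass

@dataclass
class DatasetContract:
    """Private-data-object contract fronting a dataset held inside an enclave."""
    dataset_id: str
    description: str
    allowed_benchmarks: set   # usage policy embedded in the contract

    def authorize(self, benchmark_id: str, guardian) -> bool:
        # 1. Bi-directional attestation between the contract and the digital guardian.
        if not (guardian.attest(self) and guardian.enclave_measurement is not None):
            return False
        # 2. Policy evaluation: is this benchmark permitted on this dataset?
        return benchmark_id in self.allowed_benchmarks

class DigitalGuardian:
    """Stand-in for the trusted-execution-environment enclave holding the data."""
    def __init__(self, enclave_measurement: str):
        self.enclave_measurement = enclave_measurement

    def attest(self, contract: DatasetContract) -> bool:
        # Placeholder: a real deployment would verify cryptographic quotes here.
        return contract.dataset_id is not None

    def run_benchmark(self, contract: DatasetContract, benchmark_id: str, model_ref: str):
        if not contract.authorize(benchmark_id, self):
            raise PermissionError("benchmark not permitted by the dataset's policy")
        # Evaluation executes inside the enclave; only metrics leave it.
        return {"benchmark": benchmark_id, "model": model_ref, "metrics": "..."}

# Example: the request is served only if the contract's embedded policy allows it.
guardian = DigitalGuardian(enclave_measurement="sgx-quote-placeholder")
contract = DatasetContract("dset-001", "example imaging dataset", {"bench-042"})
print(guardian.run_benchmark(contract, "bench-042", model_ref="model-xyz"))
```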

Figure 2. Policy-enhanced benchmark workflow. Red circle shows smart-contract policy implementation.
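
To make the workflow concrete, here is a minimal, hypothetical Python sketch of a policy-gated benchmark request. The class names, policy fields, and attestation checks are illustrative assumptions rather than the actual MedPerf or POC interfaces; a real deployment would rely on an SGX-backed confidential smart contract framework and remote attestation rather than plain Python objects.

```python
# Hypothetical sketch only: class names, policy fields, and attestation checks
# are illustrative, not the actual MedPerf or POC interfaces.
from dataclasses import dataclass

TRUSTED_MEASUREMENTS = {"sha256:enclave-build-1234"}  # placeholder value


@dataclass
class DataUsePolicy:
    """Dataset-use policy evaluated by the smart contract."""
    allowed_benchmark_ids: set
    max_requests: int
    requests_served: int = 0

    def permits(self, benchmark_id: str) -> bool:
        return (benchmark_id in self.allowed_benchmark_ids
                and self.requests_served < self.max_requests)


class DigitalGuardian:
    """Stands in for the digital guardian service hosting the TEE enclave."""
    enclave_measurement = "sha256:enclave-build-1234"

    def attest(self, contract: "PrivateDataObjectContract") -> bool:
        # Placeholder integrity check of the contract.
        return contract.dataset_id.startswith("medperf-")

    def run_in_enclave(self, dataset_id: str, benchmark_id: str) -> dict:
        # The dataset never leaves the enclave; only metrics are returned.
        return {"dataset": dataset_id, "benchmark": benchmark_id, "metrics": {}}


@dataclass
class PrivateDataObjectContract:
    """Contract linked to the original dataset through its dataset ID."""
    dataset_id: str
    description: str
    policy: DataUsePolicy

    def attest(self, guardian: DigitalGuardian) -> bool:
        return guardian.enclave_measurement in TRUSTED_MEASUREMENTS

    def request_benchmark(self, benchmark_id: str, guardian: DigitalGuardian) -> dict:
        # Bi-directional attestation: contract and guardian verify each other.
        if not (guardian.attest(self) and self.attest(guardian)):
            raise PermissionError("attestation failed")
        if not self.policy.permits(benchmark_id):
            raise PermissionError("policy denies this benchmark request")
        self.policy.requests_served += 1
        # Execution is isolated inside the enclave; only results leave it.
        return guardian.run_in_enclave(self.dataset_id, benchmark_id)


contract = PrivateDataObjectContract(
    dataset_id="medperf-0042",
    description="placeholder dataset registered with MedPerf",
    policy=DataUsePolicy(allowed_benchmark_ids={"bench-001"}, max_requests=10),
)
print(contract.request_benchmark("bench-001", DigitalGuardian()))
```

The key point the sketch mirrors is that the contract, not the requester, decides whether a benchmark may run, and the dataset itself is only ever touched inside the guardian’s enclave.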

Enhanced benchmark workflow benefits

The new policy-enhanced workflow gives more power to owners of digital assets, such as healthcare datasets and models, in benchmarking workflows. The POC validated the use of embedded smart contract policies in MedPerf and confirmed that digital asset owners retain full control of their assets.

This new approach increases the utility and potential rewards of benchmarks by supporting integrity and enabling transparency throughout the execution cycle.

To learn more about how the policy is implemented, read the Technical Update.

The Medical AI working group calls on the broad community of academics, technologists, regulatory scientists, healthcare experts, and others to help with roadmapping and to contribute to impactful benchmarking technologies by joining the MLCommons Medical AI working group.

The post MedPerf enhances transparency and integrity of real-world benchmarks using smart contracts appeared first on MLCommons.

]]>
MLCommons Power Working Group Presents MLPerf Power benchmark at IEEE HPCA Symposium https://mlcommons.org/2025/03/ml-commons-power-hpca/ Tue, 04 Mar 2025 23:15:00 +0000 https://mlcommons.org/?p=2546 MLPerf Power Benchmark focuses on measuring energy efficiency along with power consumption

The post MLCommons Power Working Group Presents MLPerf Power benchmark at IEEE HPCA Symposium appeared first on MLCommons.

]]>
This week, members of the MLCommons® Power working group are presenting a paper at the 2025 IEEE International Symposium on High-Performance Computer Architecture (HPCA) conference. The paper presents MLPerf® Power, a framework for benchmarking the energy efficiency of AI systems.

MLPerf Power is designed to measure energy efficiency across diverse applications and deployments of machine learning technologies, “from microwatts to megawatts.” That includes datacenters, edge deployments, mobile devices, and tiny “Internet of Things” implementations. It also includes a range of applications covering both training and inference.

“We cannot improve what we do not measure,” said Arun Tejusve (Tejus) Raghunath Rajan (Meta), Co-chair of the MLCommons Power working group. “Across the board, MLPerf has driven performance improvements for 6+ years. Our goal is to do the same for MLPerf Power. By providing a way to consistently measure and report power, we aim to move the needle of energy efficiency across the industry in a positive and sustainable manner.”

Fig. 1. The power consumption range across MLPerf divisions, highlighting the need for scalable power measurement.

Requirements for benchmarking energy efficiency

The MLPerf Power framework defines eleven requirements for benchmarking and measuring the energy efficiency of Machine Learning (ML) systems under three broad themes:

  1. Power measurements: ensuring that measurements are compatible with a wide range and scale of platforms and configurations, account for system-level interactions and shared resources, and where feasible, measure physical power with an appropriate analyzer.
  2. Diverse ML benchmarking: defining consistent rules and reporting guidelines, using industry-standard workloads, and collecting data on power consumption and performance at different throughputs and ML model accuracy/quality targets.
  3. Evaluation, trends, and impact: communicating power consumption and energy efficiency trends and optimizations, providing case studies relevant to real-world scenarios, and being driven and audited by the industry.

The MLPerf Power framework addresses several myths and pitfalls related to measuring the power usage of AI systems. One is that measuring power consumption alone is sufficient. Instead, the benchmark focuses on measuring power efficiency: system performance per unit of power. 
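
As a rough illustration of that metric, the sketch below computes performance per unit of power for two hypothetical systems; the throughput and power figures are invented for illustration and are not MLPerf Power results.

```python
# Illustrative only: the throughput and power numbers below are made up,
# not MLPerf Power results.

def energy_efficiency(throughput_samples_per_s: float, avg_power_w: float) -> float:
    """Work done per joule of energy (samples/J); higher is better."""
    return throughput_samples_per_s / avg_power_w

system_a = energy_efficiency(throughput_samples_per_s=12_000, avg_power_w=3_000)
system_b = energy_efficiency(throughput_samples_per_s=15_000, avg_power_w=4_500)

print(f"System A: {system_a:.2f} samples/J")  # 4.00 samples/J
print(f"System B: {system_b:.2f} samples/J")  # 3.33 samples/J
# System B has higher raw throughput, but System A does more work per joule,
# which is what the efficiency metric is meant to surface.
```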

Another myth is that isolating and measuring just the ML components is sufficient. While they do make up a large portion of the power consumption, fully understanding AI system power efficiency requires measuring all the system components, including compute, memory/storage, the interconnect, and the cooling system. This requires a careful, comprehensive approach since different components vary their level of activity at different phases of an AI workload.
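
The sketch below makes that point under simplified assumptions: hypothetical per-component power traces (not benchmark data) are summed into system-level energy, showing that the non-accelerator components contribute a meaningful share.

```python
# Hypothetical per-component power traces (watts); not benchmark data.
import numpy as np

SAMPLE_PERIOD_S = 1.0  # one reading per second per component

traces_w = {
    "accelerators": np.array([2200.0, 2450.0, 2400.0, 1800.0]),
    "cpu_memory":   np.array([350.0, 380.0, 375.0, 320.0]),
    "interconnect": np.array([60.0, 140.0, 150.0, 55.0]),
    "cooling":      np.array([200.0, 260.0, 265.0, 230.0]),
}

system_power_w = sum(traces_w.values())  # element-wise total at each sample
system_energy_j = float(np.sum(system_power_w)) * SAMPLE_PERIOD_S
accelerator_share = float(np.sum(traces_w["accelerators"]) / np.sum(system_power_w))

print(f"total system energy: {system_energy_j / 1e3:.1f} kJ")   # ~11.6 kJ
print(f"accelerator share of energy: {accelerator_share:.0%}")  # ~76%
# Measuring only the accelerators would miss roughly a quarter of the energy
# consumed here, and that share shifts across phases of the workload.
```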

MLPerf Power benchmark submissions provide early and impactful insights

MLCommons has received 1,841 MLPerf Power benchmark submissions to date, spanning several years and four workload versions. The benchmark results are already providing useful insights that inform the design of future AI systems. 

One important insight relates to scaling up training systems and the complex relationship between system scale, training time, and energy consumption. Specifically, scaling up the number of accelerators reduces absolute training time, but the amount of energy consumed increases at a non-linear rate. This is due to two factors: the increased communication requirements between a larger set of accelerators, and the decreased utilization of each accelerator. This points to the need to design AI systems with an understanding of these tradeoffs to achieve the necessary balance between scale, speed, and energy consumption.

“In the AI era, where scaling laws are driving computational demands, energy efficiency is a fundamental imperative for sustainable and responsible technological advancement,” noted Sachin Idgunji (NVIDIA), founding Co-Chair of the MLCommons Power working group. “Benchmarks such as MLPerf Power give AI system builders the information they need to help them achieve their design goals.”

Fig. 2. Energy consumption and time-to-train for Llama2-70b LoRA fine-tuning across different numbers of accelerators.

Another insight derived from the MLPerf Power results involves the relationship and trade-off between AI model accuracy and energy efficiency. In earlier versions of the benchmark, results showed that organizations were sacrificing energy efficiency – by up to 50% – to increase inference accuracy from 99% to 99.9%. We believe that these benchmark results compelled organizations to deploy significant optimization techniques. In more recent benchmark versions, we now see that reduced precision techniques such as quantization have significantly narrowed the gap in energy efficiency between inference systems running at 99% and 99.9% accuracy. Moving forward, it’s expected that these techniques will continue to be a driving force in designing energy efficient AI systems.

The Path to Environmentally Sustainable AI Systems

Establishing an industry-wide standard for ML system power measurement is a critical step toward creating a set of metrics for what qualifies as “environmentally sustainable AI”.

“That’s our north star,” said Vijay Reddi, Professor at Harvard University and Vice President of MLCommons.  “Our mission is to enable building sustainable AI systems, but to do so we need to define metrics that we can use to quantify energy efficiency – and ultimately sustainability. MLCommons is an engineering organization; we build things. We’ve built the framework, the metrics, the benchmark, and the benchmarking process to get us the data we need to understand, quantify, and ultimately build sustainable AI systems, and we’re continuing to iterate and refine it to keep pace with technology.”

The HPCA paper makes several recommendations, including developing tools that can estimate the carbon footprint of an AI system. Doing so requires measuring the system’s energy efficiency, but it must also be contextualized in several other factors, including the nature of the workload as well as the carbon emissions of the system’s power source.

In the short term, the team’s most critical need is for more submissions to the MLPerf Power benchmark. “The roughly 1,800 submissions we have now is enough to deliver a first set of qualitative insights,” said Arya Tschand, PhD student at Harvard University who led the work, “but if we can increase that by an order of magnitude we can extrapolate the data further, identify and quantify additional trends, and make more specific and actionable design recommendations that are grounded in hard data. That benefits both the industry and the environment: higher energy efficiency saves AI companies money by reducing their substantial power bills, and it also reduces the demand for both existing and new power generation.”

Visit the MLPerf Power working group page to join the working group and learn more, including how to submit to the MLPerf Power Benchmark. 

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies’ accuracy, safety, speed, and efficiency.

The post MLCommons Power Working Group Presents MLPerf Power benchmark at IEEE HPCA Symposium appeared first on MLCommons.

]]>
Croissant Gains Momentum within the Data Community https://mlcommons.org/2025/02/croissant-qa-community/ Wed, 12 Feb 2025 14:44:00 +0000 https://mlcommons.org/?p=2435 Croissant working group co-chairs share insights on the success of Croissant, what’s next, and how to join the effort.

The post Croissant Gains Momentum within the Data Community appeared first on MLCommons.

]]>
Since its introduction in March 2024, the Croissant metadata format for ML-ready datasets has quickly gained momentum within the data community. It was recently featured at the NeurIPS conference in December 2024. Croissant is supported by major ML dataset repositories, such as Kaggle and HuggingFace, and over 700,000 Croissant-described datasets are accessible and searchable via Google Dataset Search. Croissant datasets can be loaded automatically to train ML models using popular toolkits such as JAX and PyTorch. 
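
As a rough illustration, loading a Croissant-described dataset programmatically can look like the sketch below, which uses the open-source mlcroissant Python package; the URL and record-set name are placeholders, and the exact API may vary across package versions.

```python
# Sketch of loading a Croissant-described dataset. The URL and record-set name
# are placeholders, and the mlcroissant API may differ between versions.
import mlcroissant as mlc

# Point at a dataset's Croissant JSON-LD description (hosted by a repository
# such as Kaggle or HuggingFace, or stored locally).
dataset = mlc.Dataset(jsonld="https://example.org/my-dataset/croissant.json")

# Iterate over one of the dataset's record sets; each record is a dict of fields.
for i, record in enumerate(dataset.records(record_set="default")):
    print(record)
    if i >= 4:  # peek at the first few records only
        break
```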

MLCommons’ Croissant working group co-chairs, Omar Benjelloun and Elena Simperl, sat down to share their thoughts on Croissant’s initial popularity, the approach the working group is taking to develop it, and next steps.

[MLC] Why do you think Croissant has become so popular?

[ES] We’re delighted to see it being discussed in many contexts, beyond AI and machine learning research. I think we’re solving a problem that is easy to recognize, a problem that many practitioners who work with machine learning have, but one whose solution requires a range of skills that traditionally haven’t been that important in that community. Those techniques have to do with how you structure and organize data – data used by machine learning systems or any other type of data – in a way that makes it easy for that data to be shared across applications, to be found and reused. That’s a topic that Omar and I and other people in the working group know a lot about. And with Croissant, we are applying these techniques, which have been around for 10 to 15 years, to a new use case: machine learning data work. And I suppose the magic happened when we brought together all the relevant expertise in one working group to deliver something that ticks lots of boxes and heals many pains.

[MLC] What was the approach you took to developing Croissant?

[OB] We took an approach that was very concrete. A lot of this metadata standards work happens in a way where some experts get in a working group and they create what they think is the perfect standard. And then they try to convince others to adopt it and use it in their products.

Instead, from the start we said, “Okay, let’s define the standards at the same time and in close collaboration with the tools and systems that are going to use it.” So we worked with the repositories of machine learning datasets, like Kaggle and Hugging Face and OpenML initially, to implement support for this format as we were defining it, as well as to participate in the definition of the format. That’s on the “supply side” of datasets. On the “demand side,” we worked with the teams that develop tools for loading datasets for machine learning to implement support for loading Croissant datasets. And they were also part of the definition of the format.

So getting these people to work together helped us to create something which was not, “Here’s a specification. We think it’s going to make things better. Now, please go and implement this so that we have some data on one side and some tools on the other side.” Instead we could say, “Here’s the specification and here are some datasets that are already available in this format because these repositories support it. And here are some tools that you can use to work with these datasets.” And of course, it’s just the beginning. There are more repositories and datasets that should be available in this format. And there are a lot more tools that should add support for Croissant so that ML developers have what they need to work with this data. But the initial release of Croissant was a strong start with a useful, concrete proposition. And I think this is part of what made it so successful.

[MLC] Do you consider Croissant a “standard” now?

[ES] Personally, I have mixed feelings about the word “standard.” Sometimes when I hear someone talking about standards, I think about those lengthy, tedious, and to some degree inaccessible processes that are run in standardization committees, where there is a lot of work that is put into coming to a proposal, with lots of different stakeholders, with many competing interests to agree upon. And there’s hundreds and hundreds and hundreds of pages explaining what someone needs to do to use the standard in the intended way. And then there’s some implementations. And then if you’re an academic researcher in a small lab or a startup who doesn’t have the resources to be part of this exercise, then you feel left out. 

Having said that, standards, once they are agreed, and if there is critical mass in adoption, can be very, very powerful, because they do reduce those frictions of doing business. So if I am a startup that is creating a new type of machine learning functionality, for example to manage workflows or to do some sort of fair machine learning or any specialist functionality that the community needs, then having a standard to be able to process data that has been created and processed using any other machine learning tool is hugely valuable. But I love what Omar said: we’re following a slightly different approach. We really want to release everything to the community. We want to build everything with that adoption pathway in mind so that we put the results in the hands of practitioners as quickly as possible and in the most accessible way possible.

The other thing that I would say is that even if a competing proposal emerges, the way in which the Croissant vocabulary – and the technology it is built with – is designed means that it would be quite easy to use both. There are many vocabularies that describe datasets, and something like Google Dataset Search can work with all of them because of the technologies that are used to do that. But of course, if you just look at the number of datasets that use this or any other machine-readable vocabulary that may share similar aims, at the moment we seem to have an advantage. And the interest in adopting Croissant among more and more platforms is still there, almost a year after the launch of the initial version.

[OB] We released the first version of the Croissant format, but I don’t consider it to be standard yet.  I prefer to say  “community-driven specification”. And we’re not afraid to say Croissant will evolve and continue to change. Once we have a critical mass of data platforms and tools that have adopted it and use it and make their users happy thanks to Croissant, then it will feel sort of stable and we can say, “we don’t see this changing much from here.” At that point declaring it a standard makes sense. And of course, all the companies, startups and university labs that build tools and participate in the community will have contributed to that standard through their work.

[MLC] How important is the support from industry?

[ES] Speaking as someone who is in academia, I think industry support is very important. I am personally very happy with the diversity of industry participants in the group. We have OpenML, which is an open source initiative that is driven among others by academics, but it is a project with lots of contributors. We have obviously Google, we have participants from Meta as well, so household names in the AI space. Then we have Hugging Face and Kaggle, which have huge communities behind them. And we have NASA as well, which brings in a very interesting use case around geospatial datasets and their use in machine learning. Going forward I’d love for the working group to attract more industries that work beyond tech, that perhaps have their own machine learning use cases and their own enterprise data that they want to prepare for machine learning usage. And in that sort of scenario, if you think about a big company and the types of data management problems they have, having something like Croissant would be very valuable as well.

We have a good initial proposal for what Croissant can be, how it should be. But equally, we also have a very long list of things that we need to work on this year to get to that stable version that you could perhaps call a standard. And once you reach that, then you can actually make a case to any enterprise to explore this internally.

[MLC] What are your goals for expanding participation in the working group and in finding new partners?

[OB] First of all, this is an open group, and everyone is welcome! We would love to see more companies involved that develop tools that are used by machine learning practitioners. This speaks to the promise of Croissant: that once all datasets are described and edited in the same format, then if you have a tool that understands that format, that tool can work with any of those datasets. So if tools that provide functionality around loading of data for training ML models or fine tuning or evaluation of models, or data labeling or data analysis for machine learning, if these tools look at Croissant and consider supporting it, which does not require a huge amount of effort because it’s not a very complex format, then that will really make Croissant a lot more useful to machine learning practitioners. A goal for us this year is to get more tool vendors to come and join the Croissant effort and add support for it. Any vendor, no matter what size, that does data work, AI data work, whether that’s data augmentation, data collection, maybe responsible data collection, or data labeling or data debiasing, any type of data related activity that would make it more likely for a dataset to be ready to be used for machine learning, would add a lot of value to what we do right now. And it will also bring in new perspectives and new pathways to adoption. 

[ES] And would using Croissant be valuable for them? Absolutely. I think it was a researcher from Google who said, “No one wants to do the data work in AI,” a few years ago, and it still rings true. The vendors of data-related products solve some data challenges, but at the same time they also struggle with the fact that many people still think, “I’m just going to find the data somehow, and I’m just going to try to get to training my model as quickly as possible.” Croissant and these vendors are saying the same thing: data is important, we need to pay attention to it. We have very similar aims, and we’re providing these developers with a data portability solution. We’re helping them to interoperate with a range of other tools in the ecosystem. And that’s an advantage, especially when you’re starting out. 

[MLC] Have there been any surprises for you in how the community has been using Croissant?

[ES] We see a lot of excitement around use cases that we hadn’t really thought about so much. For instance, anything related to responsible AI, including AI safety, that’s a huge area that is looking for a whole range of solutions, some of them technical, some of them socio-technical. We’ve seen a lot of interest to use Croissant for many scenarios where there is a need to operationalize responsible AI practices. And to some degree, the challenge that we’re facing is to decide which of those use cases are really part of the core roadmap, and which ones are best served by communities that have that sectorial expertise.

We’ve also seen a lot of interest from the open science community. They’ve been working on practices and solutions for scientists to share not just their data, but their workflows, in other words, the protocols, experiment designs and other ways to do science. Scientific datasets are now increasingly used with machine learning, so we’ve had lots of interest and there are conversations on how to take advantage of those decades of experience that they bring in.

[MLC] How was Croissant received at the NeurIPS 2024 conference?

[OB] We presented a poster at NeurIPS that opened a lot of conversations. We also conducted what turned out to be perhaps the most interesting community engagement activity all year for us. We asked authors in the conference’s Datasets and Benchmarks track to submit their datasets with a Croissant metadata description. It was not a strict requirement; it was more of a recommended thing for them to do. I think this was great for Croissant as a budding metadata format to get that exposure and the real-life testing. We learned a lot from that experience about the details in our spec that were not clear or things that were not explained as well as they should have been. And we also got real-world feedback on the tools that we built, including an initial metadata editor with which users can create the Croissant metadata for their datasets. It works in some cases, but we also knew that it was not feature-complete. So, users ran into issues using it. For instance, researchers often create bespoke datasets for their own research problems. And they asked us, “I have my custom binary format that I created for this; how do I write Croissant for it? It’s not CSV or images or text.” And so it pushed us into facing some of Croissant’s limits. This experience also taught us the importance of good creation tools for users to prepare their datasets, because nobody wants to write by hand some JSON file that describes their dataset. I think for us it’s a strong incentive, as a community, to invest in building good editing tools that can also connect to all the platforms like Hugging Face or Kaggle so that users can publish their datasets on those platforms. So this is something that we learned from this interaction – and at the next conference, the experience will be much smoother for the authors who contribute to this.
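
For readers unfamiliar with the format, a hand-written Croissant description is, very roughly, a JSON-LD document along the lines of the abbreviated sketch below, expressed here as a Python dict. It is illustrative only: it omits required pieces such as the full @context and conformance declaration, so consult the Croissant specification or use the Croissant Editor for real datasets.

```python
# Abbreviated, illustrative approximation of a Croissant description. It omits
# required pieces such as the full @context; use the specification or the
# Croissant Editor for real datasets.
import json

croissant_like = {
    "@type": "sc:Dataset",
    "name": "toy_sentiment_reviews",  # placeholder dataset
    "description": "Short product reviews with sentiment labels.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/datasets/toy_sentiment_reviews",
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "reviews.csv",
            "contentUrl": "https://example.org/datasets/reviews.csv",
            "encodingFormat": "text/csv",
        }
    ],
}

with open("croissant.json", "w") as f:
    json.dump(croissant_like, f, indent=2)
```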

[MLC] What does the roadmap for Croissant look like moving forward?

[ES] In general, we are working on extending the metadata format so that we can capture more information. There are a few features that we’ve heard over and over again would be very important. For instance, anything related to the provenance of the data. We’re also working on tools, as Omar mentioned, and that includes technical tools like editors, but also tutorials and accessible material. 

In terms of what I’d like to see personally, as I was saying earlier, we’d love to have more contributors who are working in companies or on some sort of data work. For instance, data labeling. And a topic that we haven’t really touched on is the quality of that metadata. Croissant can solve many problems, but that assumes that the metadata is going to be of good quality, detailed, complete. That will almost certainly require some manual effort: from people who publish the data and want to describe it, maybe even from people who use the data and have made experiences with it. So, we’d love to have more datasets featuring excellent Croissant metadata. That will require a campaign to involve many users of, say, the 100 most popular machine learning datasets to come together and join efforts and create those excellent Croissant records.

[OB] We also have a new effort to describe not just datasets, but also machine learning tasks that use those datasets, so that we go to the next level of interoperability and describe things like machine learning evaluation benchmarks, or even competitions and things like that, and how they relate to datasets. That’s also a very exciting effort that is happening in the working group.

[ES] It’s quite an ambitious agenda – we’ll be busy this year and we invite others to join the working group to continue to shape the future of Croissant!

We invite others to join the Croissant Working Group, contribute to the GitHub repository, and stay informed about the latest updates. You can download the Croissant Editor to implement the Croissant vocabulary on your existing datasets today! Together, we can reduce the data development burden and enable a richer ecosystem of AI and ML research and development.

The post Croissant Gains Momentum within the Data Community appeared first on MLCommons.

]]>
MLCommons Releases AILuminate LLM v1.1, Adding French Language Capabilities to Industry-Leading AI Safety Benchmark https://mlcommons.org/2025/02/ailumiate-v1-1-fr/ Tue, 11 Feb 2025 05:00:00 +0000 https://mlcommons.org/?p=2412 Announcement comes as AI experts gather at the Paris AI Action Summit, is the first of several AILuminate updates to be released in 2025

The post MLCommons Releases AILuminate LLM v1.1, Adding French Language Capabilities to Industry-Leading AI Safety Benchmark appeared first on MLCommons.

]]>
Paris, France – February 11, 2025. MLCommons, in partnership with the AI Verify Foundation, today released v1.1 of AILuminate, incorporating new French language capabilities into its first-of-its-kind AI safety benchmark. The new update – which was announced at the Paris AI Action Summit – marks the next step towards a global standard for AI safety and comes as AI purchasers across the globe seek to evaluate and limit product risk in an emerging regulatory landscape. Like its v1.0 predecessor, the French-language v1.1 benchmark was developed collaboratively by AI researchers and industry experts, ensuring a trusted, rigorous analysis of chatbot risk that can be immediately incorporated into company decision-making.

“Companies around the world are increasingly incorporating AI in their products, but they have no common, trusted means of comparing model risk,” said Rebecca Weiss, Executive Director of MLCommons. “By expanding AILuminate’s language capabilities, we are ensuring that global AI developers and purchasers have access to the type of independent, rigorous benchmarking proven to reduce product risk and increase industry safety.”

Like the English v1.0, the French v1.1 version of AILuminate assesses LLM responses to over 24,000 French-language test prompts across twelve hazard categories – including violent crime, hate, and privacy. Unlike many peer benchmarks, none of the LLMs evaluated are given advance access to specific evaluation prompts or to the evaluator model. This ensures a methodological rigor uncommon in standard academic research and an empirical analysis that can be trusted by industry and academia alike.

“Building safe and reliable AI is a global problem – and we all have an interest in coordinating on our approach,” said Peter Mattson, Founder and President of MLCommons. “Today’s release marks our commitment to championing a solution to AI safety that’s global by design and is a first step toward evaluating safety concerns across diverse languages, cultures, and value systems.”

The AILuminate benchmark was developed by the MLCommons AI Risk and Reliability working group, a team of leading AI researchers from institutions including Stanford University, Columbia University, and TU Eindhoven, civil society representatives, and technical experts from Google, Intel, NVIDIA, Microsoft, Qualcomm Technologies, Inc., and other industry giants committed to a standardized approach to AI safety. Cognizant that AI safety requires a coordinated global approach, MLCommons also collaborated with international organizations such as the AI Verify Foundation to design the AILuminate benchmark.

“MLCommons’ work in pushing the industry toward a global safety standard is more important now than ever,” said Nicolas Miailhe, Founder and CEO of PRISM Eval. “PRISM is proud to support this work with our latest Behavior Elicitation Technology (BET), and we look forward to continuing to collaborate on this important trust-building effort – in France and beyond.”

Currently available in English and French, AILuminate will be made available in Chinese and Hindi later this year. For more information on MLCommons and the AILuminate Benchmark, please visit mlcommons.org.

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued to use collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve the accuracy, safety, speed, and efficiency of AI technologies.

The post MLCommons Releases AILuminate LLM v1.1, Adding French Language Capabilities to Industry-Leading AI Safety Benchmark appeared first on MLCommons.

]]>
Unsupervised People’s Speech: A Massive Multilingual Audio Dataset https://mlcommons.org/2025/01/new-unsupervised-peoples-speech/ Thu, 30 Jan 2025 21:00:00 +0000 https://mlcommons.org/?p=2200 The MLCommons Datasets working group is pleased to announce the release of the Unsupervised People’s Speech dataset. Built in collaboration with HuggingFace, the Unsupervised People’s Speech dataset contains over 1 million hours of audio spanning dozens of languages.

The post Unsupervised People’s Speech: A Massive Multilingual Audio Dataset appeared first on MLCommons.

]]>
Introducing the Unsupervised People’s Speech dataset

The MLCommons Datasets working group is pleased to announce the release of the Unsupervised People’s Speech dataset. Built in collaboration with HuggingFace, the Unsupervised People’s Speech dataset contains over 1 million hours of audio spanning dozens of languages. Given the impact the previously released Supervised People’s Speech dataset had on models such as Whisper, we expect this new version to drive innovations across different tasks, including self-supervised implementations as well as improvements to automatic speech recognition pipelines in numerous languages. 

The Rationale Behind the Dataset

As the field of speech technology continues to advance, the need for large, diverse audio datasets becomes ever more pressing. The Unsupervised People’s Speech dataset aims to address this need by providing a vast collection of multilingual audio data to support research and development in various areas of speech technology. Supporting broader Natural Language Processing (NLP) research for languages other than English helps bring communication technologies to more people globally, including those speaking low-resource languages.

Ensuring Useful Data

To ensure the dataset is as useful as possible, the MLCommons Datasets working group ran several data pipelines to understand its contents across dimensions such as language distribution and speech detection.

  • Speech detection: the working group created a custom data loader that resampled and converted the audio to a single channel (available in the HuggingFace dataset card), then ran Silero’s Voice Activity Detection (VAD) pipeline, a model that uses a multi-head attention mechanism with short-time Fourier transform (STFT) features; a minimal sketch of this kind of VAD pass follows this list. This pipeline detected a total of more than 821,412 hours of speech. 
  • Language identification: language identification is often the first task in a series of NLP tasks, so it’s crucial to get it right. The working group used Nvidia’s TensorRT-LLM implementation of Whisper Large v3 to run inference on the subset of the dataset in which the speech detection pipeline found speech utterances. This pipeline identified a total of 89 languages. We are certain there are more: the third most common category the model inferred was a “no speech” tag, meaning that for many utterances the model was unable to determine the language.
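
A minimal sketch of the kind of voice-activity pass described above is shown below; it uses Silero VAD’s published torch.hub entry point, the audio path is a placeholder, and it is not the working group’s production loader.

```python
# Minimal VAD sketch using Silero VAD's torch.hub entry point. The audio path
# is a placeholder, and the utils signature may differ across VAD versions.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(get_speech_timestamps, _, read_audio, _, _) = utils

SAMPLE_RATE = 16_000
wav = read_audio("example_clip.wav", sampling_rate=SAMPLE_RATE)  # mono, resampled

# Each timestamp dict carries 'start' and 'end' offsets in samples.
timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLE_RATE)
speech_seconds = sum(t["end"] - t["start"] for t in timestamps) / SAMPLE_RATE
print(f"detected {speech_seconds / 3600:.4f} hours of speech in this clip")
```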

Technical Hurdles

While the focus of the Unsupervised People’s Speech dataset is on its potential applications, it’s worth noting some of the technical challenges the working group overcame:

  1. Data upload and storage: we developed a custom script utilizing a Git LFS backend to efficiently upload the 48+ TB dataset to S3 and HuggingFace, overcoming typical speed limitations.
  2. Self-supervised learning potential: we’ve included a training pipeline to train a Wav2Vec model, which could unlock new possibilities in unsupervised speech representation learning.
  3. Deduplication: we are in the process of creating embedding representations of the entire dataset using Meta’s Encodec, which will allow us to identify and remove duplicate content and ensure the dataset’s uniqueness and quality (a simplified sketch of this idea follows this list).
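
As a simplified illustration of the deduplication idea, the sketch below drops clips whose normalized embeddings are nearly identical. The embedding function is a placeholder standing in for the Encodec-based representations, and the similarity threshold is arbitrary.

```python
# Simplified deduplication sketch. embed() is a placeholder for the
# Encodec-based representations; the similarity threshold is arbitrary.
import numpy as np


def embed(audio_clip: np.ndarray) -> np.ndarray:
    """Placeholder: deterministic fixed-size embedding for an audio clip."""
    rng = np.random.default_rng(abs(hash(audio_clip.tobytes())) % (2**32))
    return rng.standard_normal(128)


def deduplicate(clips: list, threshold: float = 0.95) -> list:
    """Return indices of clips to keep, dropping near-duplicate embeddings."""
    kept_indices, kept_embeddings = [], []
    for idx, clip in enumerate(clips):
        e = embed(clip)
        e = e / np.linalg.norm(e)
        if all(float(np.dot(e, other)) < threshold for other in kept_embeddings):
            kept_indices.append(idx)
            kept_embeddings.append(e)
    return kept_indices


clips = [np.zeros(16_000), np.zeros(16_000), np.ones(16_000)]  # toy "audio"
print(deduplicate(clips))  # identical clips collapse to a single entry -> [0, 2]
```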

Future Work

The Unsupervised People’s Speech dataset is built from audio data on Archive.org that is either public domain or available with CC-BY or CC-BY-SA licenses. As part of our commitment to updating and maintaining the dataset, we are also releasing the software as open-source to empower the community to build on our contributions.

As the working group completes the deduplication and self-supervised efforts, we anticipate several avenues for the research community to continue to build and develop, especially in improving low-resource language speech models, enhancing speech recognition across different accents and dialects, and developing novel applications in speech synthesis. 

Join Us

There are many ways to get involved in the MLCommons data effort. We are currently seeking speakers of over 130 languages to create a benchmark for a text language identification task. If this sounds interesting, head to Dynabench’s text classification task.  

If you want to be part of the Datasets working group or any other working group at MLCommons, please visit the Get Involved page. We meet weekly to share ideas and implement projects. If you want to read more about the MLCommons Unsupervised People’s Speech dataset, visit https://mlcommons.org/datasets/unsupervised-peoples-speech/

The post Unsupervised People’s Speech: A Massive Multilingual Audio Dataset appeared first on MLCommons.

]]>