MLCommons Releases MLPerf Client v1.0: A New Standard for AI PC and Client LLM Benchmarking
July 30, 2025 | https://mlcommons.org/2025/07/mlperf-client-v1-0/
Standardizing AI PC Performance: MLCommons Releases MLPerf Client v1.0 with Expanded Models, Prompts, and Hardware Support

MLCommons®, the consortium behind the industry-standard MLPerf® benchmarks, today announced the release of MLPerf Client v1.0, a benchmark that sets a new standard for measuring the performance of large language models (LLMs) on PCs and other client-class systems. This release marks a major milestone in the effort to bring standardized, transparent AI performance metrics to the fast-emerging AI PC market.

MLPerf Client v1.0 introduces an expanded set of supported models, including Llama 2 7B Chat, Llama 3.1 8B Instruct, and Phi 3.5 Mini Instruct. It also adds Phi 4 Reasoning 14B as an experimental option to preview the next generation of high-reasoning-capable LLMs. These additions allow the benchmark to reflect real-world use cases across a broader range of model sizes and capabilities.

The benchmark also expands its evaluation scope with new prompt categories. These include structured prompts for code analysis and experimental long-context summarization tests using roughly 4,000- and 8,000-token inputs, representing workloads increasingly relevant to both developers and advanced users.

Hardware and platform support have also expanded significantly. MLPerf Client v1.0 now supports AMD NPUs and GPUs working together via the ONNX Runtime and the Ryzen AI SDK. Intel NPUs and GPUs are supported through OpenVINO. GPUs from AMD, Intel, and NVIDIA are supported across the board through the ONNX Runtime GenAI with DirectML, offering wide compatibility for GPU-equipped systems. Qualcomm Technologies NPUs and CPUs are supported in hybrid operation using Qualcomm Genie and the QAIRT SDK. Also, Apple Mac GPUs are supported through MLX.

Additionally, the benchmark offers early, experimental support for several other acceleration paths. Intel NPUs and GPUs are supported via Microsoft Windows ML using the OpenVINO execution provider. NVIDIA GPUs are supported via llama.cpp with CUDA, and Apple Mac GPUs are supported via llama.cpp with Metal.

For users and developers, MLPerf Client v1.0 provides both command-line and graphical user interfaces. The newly developed GUI offers intuitive, cross-platform benchmarking with key usability enhancements, such as real-time readouts of compute and memory usage, persistent results history, comparison tables across test runs, and CSV exports for offline analysis. The CLI enables easy automation and scripting for regression testing or large-scale evaluations.
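To illustrate one way the CSV exports could feed a simple automated regression check, here is a minimal Python sketch. The file names and column names (model, tokens_per_second) are hypothetical placeholders, not the benchmark's documented output schema.

```python
# Minimal sketch: compare two hypothetical MLPerf Client CSV exports and flag
# throughput regressions. Column names here are illustrative placeholders,
# not the benchmark's documented schema.
import pandas as pd

BASELINE = "results_baseline.csv"    # hypothetical export from a previous run
CANDIDATE = "results_candidate.csv"  # hypothetical export from a new run
THRESHOLD = 0.95                     # flag runs below 95% of baseline throughput

baseline = pd.read_csv(BASELINE).set_index("model")
candidate = pd.read_csv(CANDIDATE).set_index("model")

for model in baseline.index.intersection(candidate.index):
    old_tps = baseline.loc[model, "tokens_per_second"]
    new_tps = candidate.loc[model, "tokens_per_second"]
    ratio = new_tps / old_tps
    status = "OK" if ratio >= THRESHOLD else "REGRESSION"
    print(f"{model}: {old_tps:.1f} -> {new_tps:.1f} tok/s ({ratio:.2%}) {status}")
```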

MLPerf Client v1.0 is the result of collaboration among major industry stakeholders, including AMD, Intel, Microsoft, NVIDIA, Qualcomm Technologies, and leading PC OEMs. The benchmark is available now as an open and free download from mlcommons.org, and it will continue to evolve alongside the rapidly growing AI PC ecosystem.

“MLPerf Client v1.0 is a major step forward for benchmarking AI capabilities on consumer systems,” said Ramesh Jaladi, co-chair of the MLPerf Client working group at MLCommons. “It provides a reliable, vendor-neutral standard that OEMs, silicon providers, reviewers, and end users can trust.”

About MLCommons

MLCommons is an open engineering consortium with a mission to make machine learning better for everyone. The organization produces industry-leading benchmarks, datasets, and best practices that span the full range of ML applications—from massive cloud training to resource-constrained edge devices. Its MLPerf benchmark suite has become the de facto standard for evaluating AI performance.

Learn more at www.mlcommons.org.

MLCommons Launches MLPerf Mobile on Google Play Store
July 10, 2025 | https://mlcommons.org/2025/07/mlperfmobile-android/
Industry-standard AI benchmarking tool for smartphones and tablets now broadly available to Android users.

MLCommons®, the open engineering consortium behind the widely used MLPerf® suite of AI benchmarks, today announced that MLPerf® Mobile, its flagship mobile AI performance benchmarking tool, is now available for the first time on the Google Play Store.

Previously offered only as a downloadable APK, MLPerf Mobile can now be easily accessed and installed by a global audience of Android users directly through Google Play. The free, open-source app empowers users to measure how well their smartphones and tablets handle real-world artificial intelligence (AI) and machine learning (ML) tasks.

“Publishing MLPerf Mobile to the Play Store is a major milestone in making AI benchmarking more accessible and transparent,” said Mostafa El-Khamy, chair of the Mobile working group at MLCommons. “Now anyone—from developers and hardware reviewers to everyday users—can assess their device’s AI performance using benchmarks developed collaboratively through MLCommons.”

What is MLPerf Mobile?

MLPerf Mobile is an AI benchmarking app designed to evaluate mobile device performance on state-of-the-art ML workloads, including:

  • Image classification
  • Object detection
  • Image segmentation
  • Language understanding
  • Super-resolution upscaling
  • Text-to-image generation

The app uses on-device hardware acceleration where supported and is optimized for Android via TensorFlow Lite delegate fallback. MLPerf Mobile supports both casual usage for quick performance insights and official benchmarking for MLCommons member submissions.
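For readers curious about what delegate-based acceleration with CPU fallback looks like in practice, the following Python sketch shows the general TensorFlow Lite pattern. The model file and vendor delegate library named here are hypothetical placeholders; MLPerf Mobile's own implementation is part of the Android app and is not reproduced here.

```python
# Illustrative sketch of TensorFlow Lite delegate loading with CPU fallback.
# The model path and delegate library name below are hypothetical placeholders.
import numpy as np
import tensorflow as tf

MODEL_PATH = "image_classification.tflite"   # hypothetical model file
DELEGATE_LIB = "libvendor_npu_delegate.so"   # hypothetical vendor delegate

try:
    delegate = tf.lite.experimental.load_delegate(DELEGATE_LIB)
    interpreter = tf.lite.Interpreter(model_path=MODEL_PATH,
                                      experimental_delegates=[delegate])
    print("Running on hardware delegate")
except (ValueError, OSError):
    # Fall back to the default CPU kernels if the delegate is unavailable.
    interpreter = tf.lite.Interpreter(model_path=MODEL_PATH)
    print("Falling back to CPU")

interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
dummy = np.zeros(inp["shape"], dtype=inp["dtype"])  # placeholder input
interpreter.set_tensor(inp["index"], dummy)
interpreter.invoke()
output = interpreter.get_tensor(interpreter.get_output_details()[0]["index"])
print("Output shape:", output.shape)
```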

Key Features:

  • First-time availability on the Google Play Store
  • Benchmarks based on the latest AI models and workloads
  • Tuned performance for modern mobile SoCs with hardware AI acceleration
  • Multiple test modes: from fast casual tests to full MLPerf-compliant runs
  • Configurable cool-down periods to reduce thermal throttling
  • Optional cloud-based result storage with multi-device access (free account required)

MLPerf Mobile is maintained by the MLPerf Mobile Working Group, part of MLCommons®, a non-profit AI engineering organization with over 125 member institutions spanning academia and industry.

Download MLPerf Mobile today.

MLPerf Mobile is available for free on the Google Play Store.

For technical users, source code and documentation remain available on the MLCommons GitHub repository. Issues, contributions, and feedback are welcome.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or email participation@mlcommons.org.

About MLCommons®

MLCommons is an open engineering consortium with a mission to make machine learning better for everyone. The organization produces industry-leading benchmarks, datasets, and best practices that span the full range of ML applications—from massive cloud training to resource-constrained edge devices. Its MLPerf benchmark suite has become the de facto standard for evaluating AI performance.

Learn more at www.mlcommons.org.

MLCommons Builds New Agentic Reliability Evaluation Standard in Collaboration with Industry Leaders
June 27, 2025 | https://mlcommons.org/2025/06/ares-announce/
MLCommons and partners unite to create actionable reliability standards for next-generation AI agents.

As agentic AI rapidly transforms the enterprise adoption landscape, MLCommons is convening experts from across academia, civil society, and industry to better understand the risks associated with practical, near-term deployments of AI agents.

The challenge we posed to our community was: Imagine you are the decision maker on a product launch in 2026 for an agent that could take limited actions, such as placing an order or issuing a refund – what evaluations would you need to make an informed decision?

Last month, MLCommons co-hosted the Agentic Reliability Evaluation Summit (ARES) with the Oxford Martin School’s AI Governance Initiative, Schmidt Sciences, Laboratoire National de Metrologie et D’Essais (LNE), and the AI Verify Foundation. 34 organizations attended the convening — including AI labs, large software companies, AI safety institutes, startups, and civil society organizations — to discuss new and emerging evaluation methods for agentic AI. 

Today, MLCommons is announcing a new collaboration with contributors from Advai, AI Verify Foundation, Anthropic, Arize AI, Cohere, Google, Intel, LNE, Meta, Microsoft, NASSCOM, OpenAI, Patronus AI, Polytechnique Montreal, Qualcomm, QuantumBlack – AI by McKinsey, Salesforce, Schmidt Sciences, ServiceNow, University of Cambridge, University of Oxford, University of Illinois Urbana-Champaign, and University of California, Santa Barbara to co-develop an open agent reliability evaluation standard to operationalize trust in agentic deployments. This collaboration brings together a diverse ecosystem spanning AI builders, major enterprise deployers, specialized testing and safety providers, and important global policy organizations to ensure that this standard is both technically sound and practically implementable in the near-term. By uniting these different perspectives, this collaboration will create frameworks and benchmarks that are grounded in technical reality while addressing the practical needs of the global AI ecosystem.

We have outlined a set of principles to evaluate the reliability of near-term agentic use cases, focusing on scenarios with limited actions within structured contexts. The evaluation framework will be organized into four key categories: Correctness, Safety, Security, and Control. The outcomes from this group will comprise: 1) design principles for agentic development, and 2) benchmarks that are relevant to business for correctness, safety and control, and security.

We welcome collaboration with a diverse range of enterprises, technology leaders, and business builders who share our vision. If you are interested in joining, please email ai-risk-and-reliability-chairs@mlcommons.org.

Introducing the 2025 MLCommons Rising Stars
June 12, 2025 | https://mlcommons.org/2025/06/2025-mlc-rising-stars/
Fostering a community of talented young researchers at the intersection of ML and systems research.

We are pleased to announce the third annual MLCommons Rising Stars cohort of 38 junior researchers from 25 institutions globally! These promising researchers, drawn from over 150 applicants, have demonstrated excellence in Machine Learning (ML) and Systems research and stand out for their current and future contributions and potential. This year’s program has been expanded to include data systems research as an area of interest.

The MLCommons Rising Stars program provides a platform for talented young researchers working at the intersection of ML and systems to build connections with a vibrant research community, engage with industry and academic experts, and develop their skills. The program continues to promote diversity in the research community by seeking researchers from historically underrepresented backgrounds. We are pleased to welcome nine international Rising Stars to this year’s cohort.

As part of our commitment to fostering the growth of our Rising Stars, we organized the Rising Stars workshop at the Meta Headquarters in Menlo Park, CA, in May, where the cohort showcased their work, explored research opportunities, gained new skill sets via career building sessions, and had the opportunity to network with researchers across academia and industry. 

“ML is a fast-growing field with rapid adoption across all industries, and we believe that the biggest breakthroughs are yet to come. By nurturing and supporting the next generation of researchers, both domestically and globally, we aim to foster an inclusive environment where these individuals can make groundbreaking contributions that will shape the future of ML and systems research. The Rising Stars program is our investment in the future, and we are excited to see the innovative ideas and solutions that these talented researchers will bring to the table,” said Vijay Janapa Reddi, MLCommons VP and Research Chair and steering committee member of the Rising Stars program.

This year’s program marked a pivotal step toward strengthening the community at the intersection of ML and systems, especially by connecting early-career researchers who share common challenges and goals. The shared experiences, discussions on generative AI, and connections forged at the workshop are expected to fuel new collaborations and research directions. As we reflect on this year’s success, we are more committed than ever to growing a supportive, interdisciplinary, and inclusive ecosystem for future Rising Stars.

As part of the two-day Rising Stars workshop held at Meta Headquarters in Menlo Park, our 2025 cohort engaged in dynamic sessions with leading researchers and alumni. One of the highlights was the Alumni Panel, moderated by Abdulrahman Mahmoud (MBZUAI), featuring Rising Stars ’23 alumni: Sercan Aygun (University of Louisiana at Lafayette), Muhammad Husnain Mubarik (AMD), Francisco Romero (Plix & Georgia Tech), and Zishen Wan (Georgia Tech). The panel shared valuable insights into their career journeys, academic-industry transitions, and lessons learned since completing the program. Attendees had the opportunity to ask questions about career growth, building a research identity, and balancing long-term research goals. 

The workshop’s keynotes and invited talks came from distinguished leaders in ML and systems, including Vikas Chandra (Meta) on “Generative AI for Immersive Worlds”, Dan Fu (Stanford University) on “Kittens and Chipmunks: More Efficient ML Algorithms with Kernels”, Joel Emer (MIT) in a candid AMA session, Vijay Janapa Reddi (Harvard University) on “Architecture 2.0”, and Jason Cong (UCLA) on “Efficient LLMs: From Model Innovation to Customized Acceleration”. In addition, the AR/VR Panel, moderated by Muhammad Husnain Mubarik, brought together Huichu Liu (Meta) and Hyoukjun Kwon (UC Irvine) for an engaging discussion on immersive technologies. The program also spotlighted MLCommons Research Presentations by David Kanter (MLCommons) and Arun Tejusve Raghunath Rajan (Meta), offering a forward-looking perspective on collaborative benchmarking and infrastructure for ML innovation. This was followed by a moderated Q&A session on ML Systems at Meta, led by Abdulrahman Mahmoud (MBZUAI), with insights from Zachary DeVito (Meta), one of the first contributors to PyTorch, who shared his experiences in API design. Through lightning talks, poster sessions, breakout research discussions, and panels, the event fostered a vibrant environment for exchanging ideas and forging collaborations.

We extend our warmest congratulations to this year’s Rising Stars and express our gratitude to everyone who applied. 

We would also like to thank Rising Stars organizers Udit Gupta (Cornell Tech), Abdulrahman Mahmoud (MBZUAI), Sercan Aygun (University of Louisiana at Lafayette), Muhammad Husnain Mubarik (AMD), Lillian Pentecost (Amherst College), Akanksha Atrey (Nokia Bell Labs), and Vijay Janapa Reddi (Harvard University) for all their efforts in putting together the program and selecting an impressive cohort of recipients.

Finally, we would like to extend our appreciation to Kelly Berschauer (MLCommons), Vikas Chandra (Meta), Petrina Mitchell (Meta), and David Scott (Meta) for their support in organizing the program and workshop.

Peter Mattson Joins Leaders to Discuss AI Innovation at Asia Tech x Conference in Singapore
June 6, 2025 | https://mlcommons.org/2025/06/atx-panel/
MLCommons President Peter Mattson joined industry leaders and experts to discuss the road forward for responsible AI innovation at the Asia Tech x summit in Singapore.

Last month, MLCommons President Peter Mattson joined several industry leaders and experts to discuss the road forward for responsible AI innovation at the Asia Tech x summit in Singapore. The panel discussion, titled “Innovating Within a Secure, Reliable and Safe AI Landscape,” focused on finding a balance between mitigating risk and fostering innovation as AI continues to evolve. 

Mattson was joined by panel moderator Denise Wong, Assistant Chief Executive at the Infocomm Media Development Authority, alongside the following experts:

  • Yoichi Iida, Special Policy Adviser to Minister, Ministry of Internal Affairs and Communications, Japan
  • Dr. Bo Li, Professor, University of Illinois at Urbana-Champaign and Chief Executive Officer, Virtue AI
  • Dr. Chris Meserole, Executive Director, Frontier Model Forum
  • Professor Lam Kwok Yan, Executive Director, Digital Trust Centre Singapore and Singapore AI Safety Institute

Throughout the discussion, Mattson emphasized the importance of utilizing accessible data in developing reliable and secure AI systems — including via MLCommons’ Croissant tool, which summarizes and describes datasets. “Data is what determines the behavior of your system,” said Mattson. “It’s not the python, it’s the data that was used to train, fine tune, and prompt the AI. That is the key to success.”

Mattson also highlighted MLCommons’ partnership with the AI Verify Foundation in Singapore to develop benchmarks — like AILuminate — that help enterprise leaders understand and mitigate the risks of AI deployment. As part of the summit, MLCommons and AI Verify Foundation were proud to release proof-of-concept AILuminate testing in Chinese – a major milestone in ensuring clear, rigorous LLM benchmarking in one of the world’s most widely spoken languages.

While in Singapore, MLCommons also released updated reliability scores for a new and expanded list of LLMs, as well as a new partnership with NASSCOM — India’s top tech trade association — to expand independent, empirically-sound benchmarking across South Asia. 

“MLCommons has a mission to make AI better for everyone,” said Mattson. “We interpret that as growing the market in a way that delivers benefits to us all.”

The MLCommons AI Risk & Reliability working group is a highly collaborative, diverse set of experts committed to building a safer AI ecosystem. We welcome others to join us.

New MLCommons MLPerf Training v5.0 Benchmark Results Reflect Rapid Growth and Evolution of the Field of AI
June 4, 2025 | https://mlcommons.org/2025/06/mlperf-training-v5-0-results/
More submissions, new hardware accelerators, and more multi-node systems.

Today, MLCommons® announced new results for the MLPerf® Training v5.0 benchmark suite, highlighting the rapid growth and evolution of the field of AI. This round of benchmark results includes a record number of total submissions, as well as increased submissions for most benchmarks in the suite compared to the v4.1 benchmark.

MLPerf Training v5.0 introduces new Llama 3.1 405B benchmark

The MLPerf Training benchmark suite comprises full system tests that stress models, software, and hardware for a range of machine learning (ML) applications. The open-source and peer-reviewed benchmark suite provides a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry.

Version 5.0 introduces a new large language model pretraining benchmark based on the Llama 3.1 405B generative AI system, which is the largest model to be introduced in the training benchmark suite. It replaces the gpt3-based benchmark included in previous versions of the MLPerf Training benchmark suite. An MLPerf Training task force selected the new benchmark because it is a competitive model representative of the current state-of-the-art LLMs, including recent algorithmic updates and training on more tokens. More information on the new benchmark can be found here. Despite just being introduced, the Llama 3.1 405B benchmark is already receiving more submissions than the gpt3-based predecessor saw in previous rounds – demonstrating the popularity and importance of large-scale training.

Rapid performance improvements for key training scenarios

The MLPerf Training working group regularly adds emerging training workloads to the benchmark suite to ensure that it reflects industry trends. The Training 5.0 benchmark results show notable performance improvements for newer benchmarks, indicating that the industry is prioritizing emerging training workloads over older ones. The Stable Diffusion benchmark saw a 2.28x speed increase for 8-processor systems compared to the 4.1 version six months ago, and the Llama 2 70B LoRA benchmark increased its speed 2.10x versus version 4.1, with both outpacing historical expectations for computing performance improvements over time as per Moore’s Law. Older benchmarks in the suite saw more modest performance improvements.

On multi-node, 64-processor systems, the RetinaNet benchmark saw a 1.43x speedup compared to the prior v3.1 benchmark round (the most recent to include comparable scale systems), while the Stable Diffusion benchmark had a dramatic 3.68x increase.

“This is the sign of a robust technology innovation cycle and co-design: AI takes advantage of new systems, but the systems are also evolving to support high-priority scenarios,” said Shriya Rishab, MLPerf Training working group co-chair. 

Increasing diversity of processors, increasing scale of systems, broadening ecosystem

Submissions to MLPerf Training 5.0 utilized 12 unique processors, all in the available (production) category. Five of the processors have become publicly available since the last version of the benchmark suite.

  • AMD Instinct MI300X 192GB HBM3
  • AMD Instinct MI325X 256GB HBM3e
  • NVIDIA Blackwell GPU (GB200)
  • NVIDIA Blackwell GPU (B200-SXM-180GB)
  • TPU-trillium

Submissions also included three new processor families: 

  • 5th Generation AMD Epyc Processor (“Turin”)
  • Intel Xeon 6 Processor (“Granite Rapids”)
  • Neoverse V2 as part of NVIDIA GB200

In addition, the number of multi-node systems submitted increased more than 1.8x when compared to version 4.1.

“The picture is clear: AI workloads are scaling up, systems are scaling up to run them, and hardware innovation continues to boost performance for key scenarios,” said Hiwot Kassa, MLPerf Training working group co-chair. “In-house large scale systems were built by few companies, but the increased proliferation – and competition – in AI-optimized systems is enabling the broader community to scale up their own infrastructure. Most notably, we see an increasing cadre of cloud service providers offering access to large-scale systems, democratizing access to training large models. 

“The industry is not standing still, and neither can we. MLCommons is committed to continuing to evolve our benchmark suite so that we can capture and report on the innovation that is happening in the field of AI.”

Record industry participation

The MLPerf Training v5.0 round includes 201 performance results from 20 submitting organizations: AMD, ASUSTeK, Cisco Systems Inc., CoreWeave, Dell Technologies, GigaComputing, Google Cloud, Hewlett Packard Enterprise, IBM, Krai, Lambda, Lenovo, MangoBoost, Nebius, NVIDIA, Oracle, Quanta Cloud Technology, SCITIX, Supermicro, and TinyCorp.

“We would especially like to welcome first-time MLPerf Training submitters AMD, IBM, MangoBoost, Nebius, and SCITIX,” said David Kanter, Head of MLPerf at MLCommons. “I would also like to highlight Lenovo’s first set of power benchmark submissions in this round – energy efficiency in AI training systems is an increasingly critical issue in need of accurate measurement.”

MLPerf Training v5.0 set a new high-water mark with more than 200 submissions. The vast majority of the individual benchmark tests that carried over from the previous round saw an increase in submissions.

Robust participation by a broad set of industry stakeholders strengthens the AI/ML ecosystem as a whole and helps to ensure that the benchmark is serving the community’s needs. We invite submitters and other stakeholders to join the MLPerf Training working group and help us continue to evolve the benchmark.

View the results

Please visit the Training benchmark page to view the full results for MLPerf Training v5.0 and find additional information about the benchmarks.

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies’ accuracy, safety, speed, and efficiency.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or email participation@mlcommons.org.

Press Inquiries: contact press@mlcommons.org

Emerging themes from AAAI 2025: standardisation, evaluation & collaboration in AI safety
June 3, 2025 | https://mlcommons.org/2025/06/aaai2025/
Advancing AI safety is as much about creating ethical, adaptive, and collaborative governance mechanisms as it is about embracing rigorous technical benchmarks.

The Association for the Advancement of Artificial Intelligence (AAAI) 2025 workshop on Datasets and Evaluators for AI Safety, held on March 3, 2025, provided a forum to discuss the challenges and strategies in AI safety. Drawing on contributions from leading experts across academia, industry, and policy, the discussions revealed that advancing AI safety is as much about creating ethical, adaptive, and collaborative governance mechanisms as it is about embracing rigorous technical benchmarks.

Establishing robust standards for low-risk AI

The workshop highlighted the need for standardisation and reliable benchmarks to reduce AI risks. In his talk on “Building an Ecosystem for Reliable, Low-Risk AI,” Dr Peter Mattson, President of MLCommons, outlined the important role of standardised safety benchmarks. The introduction of frameworks like the AILuminate Benchmark, which evaluates AI systems across 12 distinct safety hazards (e.g., physical, non-physical, and contextual risks), has sparked a conversation on making safety evaluations robust and accessible. Such benchmarks provide a quantitative measure of risks (from privacy and intellectual property challenges to non-physical harms) and are a foundation for industry-wide standards.

The workshop also showcased metadata tools such as the Croissant vocabulary, which standardises ML datasets to improve reproducibility and transparency. These initiatives address a core barrier in verifying and replicating AI results by ensuring that datasets are FAIR (findable, accessible, interoperable, and reusable). Ultimately, the workshop showed that safer and more reliable AI deployments can be achieved through greater technical rigour in evaluation.

Balancing technical rigor with ethical considerations

Another significant insight from the workshop was the call to integrate ethical governance into technical evaluations. Participants highlighted that technical benchmarks must be complemented by sociotechnical frameworks that manage AI’s current fairness, inclusivity, and broader societal impacts. For instance, a presentation by Prof Virginia Dignum explored the tension between long-term speculative risks and immediate ethical challenges. Prof Dignum argued that an excessive focus on future harms might detract from addressing present-day issues such as bias and fairness.

Panel discussions underscored that effective AI safety requires both robust technical metrics and frameworks for social responsibility to tackle complex real-world environments. Without the latter, unforeseen biases, or worse, are more likely to be perpetuated when AI systems “in the wild” encounter safety challenges not anticipated during model training.

Adapting evaluation methodologies in a dynamic landscape

The rapidly evolving nature of AI technologies calls for a dynamic approach to safety evaluations. The workshop continuously illustrated that safety frameworks must be adaptive; what is deemed safe today may not hold in a future where AI systems evolve unexpectedly. Panelists agreed that traditional, static evaluation methods are often ill-equipped to handle the nuances of modern AI applications.

Emergent harms, manifesting as indirect or second-order effects, challenge conventional safety paradigms. Practitioners from multiple disciplinary backgrounds emphasised how all AI safety metrics require continuous re-evaluation and adjustment as models scale in complex environments. This interdisciplinary call to action reminds us that AI safety evaluations must be as flexible and dynamic as the technologies they are designed to safeguard.

Creating collaborative and interdisciplinary approaches

Underlying the discussions at AAAI 2025 was the recognition that addressing AI safety is inherently an interdisciplinary challenge. From data scientists and software engineers to policymakers and ethicists, the workshop brought together diverse experts, enumerating unique priorities in AI safety. For instance, during the panel discussions, some experts advocated for technical measures such as implementing self-customised safety filters and automated source-referencing. Meanwhile, others stressed governance priorities like a broad citizen-first approach and frameworks to safeguard vulnerable groups.

Integrating these viewpoints is critical for developing safety frameworks that are both technically sound and socially responsive. Collaborative discussions of this kind help to bridge the gap between theoretical research and practical implementation. Unsurprisingly, there was broad consensus that some form of collective action will drive the evolution of safer, more reliable AI systems. Initiatives such as the development of the Croissant vocabulary and the ongoing work on standardised benchmarks (e.g., AILuminate) effectively capture this spirit of cooperation, given both rely on cross-sector engagement to succeed.

Conclusion

Ultimately, the critical themes of the AAAI 2025 Datasets and Evaluators for AI Safety workshop were the need for adaptive evaluation frameworks that can handle real-world complexity, and for ongoing interdisciplinary discussion to ensure that our notion of “risk” is as robust as possible. Many points were discussed throughout the day, but these themes are particularly important for ongoing work such as MLCommons’ AI Risk and Reliability initiative. By the end of the event, we felt a desire for more interdisciplinary workshops like this one. The common theme of adaptability tells us that it will be critical for practitioners to stay abreast of AI safety responses to the fast-moving developments we continue to see across the ecosystem.

MLCommons Announces Expansion of Industry-Leading AILuminate Benchmark
May 29, 2025 | https://mlcommons.org/2025/05/nasscom/
World leader in AI benchmarking announces new partnership with India’s NASSCOM; updated reliability grades for leading LLMs.

MLCommons today announced that it is expanding its first-of-its-kind AILuminate benchmark to measure AI reliability across new models, languages, and tools. As part of this expansion, MLCommons is partnering with NASSCOM, India’s premier technology trade association, to bring AILuminate’s globally recognized AI reliability benchmarks to South Asia. MLCommons is also unveiling new proof of concept testing for AILuminate’s Chinese-language capabilities and new AILuminate reliability grades for an expanded suite of large language models (LLMs). 

“We’re looking forward to working with NASSCOM to develop India-specific, Hindi-language benchmarks and ensure companies in India and around the world can better measure the reliability and risk of their AI products,” said Peter Mattson, President of MLCommons. “This partnership, along with new AILuminate grades and proof of concept for Chinese language capabilities, represents a major step towards the development of globally inclusive industry standards for AI reliability.”

“The rapid development of AI is reshaping India’s technology sector and, in order to harness risk and foster innovation, rigorous global standards can help align the growth of the industry with emerging best practices,” said Ankit Bose, Head of NASSCOM AI. “We plan to work alongside MLCommons to develop these standards and ensure that the growth and societal integration of AI technology continues responsibly.”

The NASSCOM collaboration builds on MLCommons’ intentionally global approach to AI benchmarking. Modeled after MLCommons’ ongoing partnership with Singapore’s AI Verify Foundation, the NASSCOM partnership will help to meet South Asia’s urgent need for standardized AI benchmarks that are collaboratively designed and trusted by the region’s industry experts, policymakers, civil society members, and academic researchers. MLCommons’ partnership with the AI Verify Foundation – in close collaboration with the National University of Singapore – has already resulted in significant progress towards globally-inclusive AI benchmarking across East Asia, including just-released proof of concept scores for Chinese-language LLMs.

AILuminate is also unveiling new reliability grades for an updated and expanded suite of LLMs, to help companies around the world better measure product risk. Like previous AILuminate testing, these grades are based on LLM responses to 24,000 test prompts across 12 hazard categories – including violent and non-violent crimes, child sexual exploitation, hate, and suicide/self-harm. None of the LLMs evaluated were given any advance knowledge of the evaluation prompts (a common problem in non-rigorous benchmarking), nor access to the evaluator model used to assess responses. This independence provides a methodological rigor uncommon in standard academic research or private benchmarking.

“Companies are rapidly incorporating chatbots into their products, and these updated grades will help them better understand and compare risk across new and constantly-updated models,” said Rebecca Weiss, Executive Director of MLCommons. “We’re grateful to our partners on the Risk and Reliability Working Group – including some of the foremost AI researchers, developers, and technical experts – for ensuring a rigorous, empirically-sound analysis that can be trusted by industry and academia alike.”

Having successfully expanded the AILuminate benchmark to multiple languages, the AI Risk & Reliability working group is beginning the process of evaluating reliability across increasingly sophisticated AI tools, including multi-modal LLMs and agentic AI. We hope to announce proof-of-concept benchmarks in these spaces later this year.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI risk and reliability.

MLCommons MLPerf Training Expands with Llama 3.1 405B
May 5, 2025 | https://mlcommons.org/2025/05/training-llama31405b/
MLCommons adds new pretraining benchmark for testing large-scale systems.

About MLPerf Training

Pretraining large language models (LLMs) is the initial phase of LLM training, in which the model learns to understand and generate human-like text by training on vast amounts of unstructured textual data. A pre-trained model can then be fine-tuned on domain-specific datasets, which gives the model knowledge of specific tasks. Reasoning approaches have become popular recently, helping models better explain their thinking process. Because pre-training uses large amounts of data, it is generally the most compute-intensive phase of LLM training.

MLCommons® added a GPT3 Pretraining benchmark to MLPerf® Training in 2022 to help academics and industry experts optimize LLM pretraining. It has been a huge success, as we have seen over 3x speedups on the results of this benchmark for large systems (over 10,000 accelerators).

Over this same period, newer state-of-the-art models have demonstrated greater scale and many architectural advancements. For example, Google’s PaLM model, released in 2022, contains 540B parameters, three times as many as GPT3’s 175B parameter count. Meta’s Llama 2 and Llama 3 series, released in 2023, utilize new algorithms such as Root Mean Square Layer Normalization (RMSNorm, Zhang et al., https://arxiv.org/abs/1910.07467), Rotary Position Embedding (RoPE, Su et al., 2024), and Grouped Query Attention (GQA, Ainslie et al., 2023). 

An MLPerf Training working group task force was created to investigate a July 2024 proposal to replace GPT3 with Llama 3.1 405B. Task force members evaluated the proposal and industry options to recommend the best new pretraining reference. They agreed that Llama 3.1 405B would best represent the current state of the art in pretraining and adopted it as a new benchmark. Details on the review and selection process are included in this blog.

Model selection

The most important factor the task force considered when constructing this new pretraining benchmark was the model architecture. As mentioned above, the current GPT3 architecture does not include some of the recent algorithm updates and therefore is not competitive with other state-of-the-art models such as Nemotron-4, GPT4.5 and Claude 3.7 Sonnet. However, Meta’s Llama 3.1 405B model, with an architecture and scale similar to these top-tier models, demonstrates on-par performance with these models across multiple quality benchmarks, thus positioning itself as a competitive model representative of the current state-of-the-art models. 

Choosing a model with high community engagement was also an important consideration, and Llama 3.1 405B is the perfect candidate on that basis. With more than 300 million total downloads of all Llama versions to date, the Llama family of models is very widely adopted. The MLPerf community, for instance, has already included Llama 2 in the LoRA benchmark. Additionally, Meta’s released technical paper details the training process, model architecture, dataset and hyperparameters, which allowed us to easily create a high-quality benchmark based on their work.

One other requirement for the new MLPerf Training benchmark is that the model architecture must be publicly accessible so that all submitters can download it and reproduce the results on their end. This typically requires a permissive license. Thanks to Meta’s support, the current Llama 3.1 405B model is now available to all MLCommons members for benchmarking. 

Dataset selection

To help submitters transition easily from GPT3 to Llama 3.1 405B, we use the same AllenAI C4-en dataset (Colossal Cleaned Common Crawl dataset) in both benchmarks. This dataset contains 365M English-language paragraphs scraped from the internet, cleaned and preprocessed to convenient JSON formats for training the model. 

One key difference is that Llama 3.1 405B uses a context length of 8,192, which is 4x larger than the 2,048-token context of GPT3. The context length represents the amount of text that a model can process at once, thus a larger context length leads to better understanding. 

Benchmark technical details

The reference code is implemented in the NVIDIA NeMo Framework, an open-sourced and scalable AI framework built for large language models. By offering easy-to-use modular components and pre-trained model recipes, NeMo enables developers to efficiently build performant models tailored to specific use cases. The reference code functionality is tested on NVIDIA H100 GPUs. 

To ensure that the new benchmark’s model setup is easily understandable, without complex knobs and hyperparameters scattered around the training script, the reference starts with NeMo’s integrated Llama 3.1 405B training recipe, which closely follows the original paper’s hyperparameters; we introduced only the minimum necessary changes, such as adjusting the optimizer’s learning rates.

When researching how to construct this benchmark, we first explored the GPT3-style setup. For the GPT3 pretraining benchmark, the task force pretrained from random weights for about 12.6T tokens to generate a stable customized checkpoint. Submitters were required to continue pretraining from this checkpoint until they reached a predetermined quality metric. This methodology was used to ensure that the GPT3 pretraining benchmark showcased a reproducible slice of the actual GPT3 pretraining process.

Applying the same methodology to the Llama 3.1 405B pretraining benchmark, the task force generated a customized checkpoint trained on 32B tokens. But experimental results showed that resuming from this checkpoint produced more variation in the convergence curves than we could tolerate, which meant that resuming from a custom-trained checkpoint was not feasible for this benchmark. Our interpretation is that, after only 32B tokens, training was still in a very early phase, so the generated checkpoint was still quite unstable.

So the task force decided to train from the fully trained and stable Hugging Face checkpoint, giving it some new information to learn (via the tokenizer change described below), which led to a much more stable convergence trend. The benchmark now begins from this checkpoint and stops when the evaluation log perplexity reaches our defined target.

One noticeable difference between the task force’s implementation and Meta’s implementation is that we replaced the original Tiktoken-based tokenizer, which used a vocabulary size of 128k, with the 32k-vocabulary Mixtral 8x22B tokenizer. The motivation behind this change is that Meta’s checkpoint is already well trained, so it would not learn anything new on the current C4 dataset. Replacing the tokenizer forces the model checkpoint to adapt to a new, different token distribution and therefore to continue pretraining on the current dataset. This difference implies that, when we resume from the pre-trained checkpoint, only the first 32,000 rows of the 128,256-row word embedding layer weights are loaded into the model.
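As a rough illustration of that last point, the PyTorch sketch below copies only the first 32,000 rows of a larger embedding table into a smaller-vocabulary embedding layer. The tensor shapes, key name, and toy hidden size are illustrative assumptions, not the reference implementation’s actual code.

```python
# Simplified PyTorch illustration of reusing the first 32,000 rows of a
# 128,256-row embedding matrix when switching to a 32k-vocabulary tokenizer.
# Key and tensor names are illustrative, not the reference code's real names.
import torch
import torch.nn as nn

OLD_VOCAB, NEW_VOCAB, HIDDEN = 128_256, 32_000, 64  # toy hidden size

# Stand-in for the pre-trained checkpoint's embedding weights.
checkpoint = {"embed_tokens.weight": torch.randn(OLD_VOCAB, HIDDEN)}

# New embedding layer sized for the smaller tokenizer vocabulary.
embed = nn.Embedding(NEW_VOCAB, HIDDEN)

with torch.no_grad():
    # Keep only the first NEW_VOCAB rows of the original embedding table.
    embed.weight.copy_(checkpoint["embed_tokens.weight"][:NEW_VOCAB])

print(embed.weight.shape)  # torch.Size([32000, 64])
```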

As mentioned before, the Llama 3.1 405B pretraining benchmark is a replacement for the GPT3 pretraining benchmark, and the following components will be unchanged between the two:

  • Validation dataset: To reduce the cost of validation, we created a customized subset of the full C4 validation dataset so that evaluation time is not significant compared to training time. Specifically, we found that using only 5,760 sequences is sufficient to measure model convergence, and since only 91,205 validation samples are needed to yield the required 5,760 validation sequences, the first 91,205 unshuffled rows of the full C4 validation dataset were selected as the customized validation dataset for this benchmark.
  • Loss and target accuracy metric: As with GPT3, log perplexity, computed on the customized validation dataset, is used to evaluate the model’s convergence behavior (a toy illustration appears in the sketch after this list). This metric is inexpensive to compute and is widely adopted as a versatile indicator for evaluating LLMs.
  • Limited number of submission runs: We recognize that the Llama 3.1 405B pretraining benchmark is very expensive to run, so to ensure fairness while reducing submitters’ costs, each submitter is required to run the benchmark only three times, the same as for the GPT3 pretraining benchmark. The median result is then reported.
  • Restriction on hyperparameter searching: Because running the benchmark is expensive, we believe it is reasonable to disallow hyperparameter searching and borrowing. Most hyperparameters have been given static values. For a very few others, such as the learning rate, formulas are provided to compute them based on the global batch size.
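To make the log perplexity metric concrete, here is a toy PyTorch sketch that computes it as the mean token-level cross-entropy over a batch of validation sequences. The random "model" outputs, vocabulary size, and sequence length are placeholders; the benchmark's actual target value and evaluation pipeline are defined in the reference implementation.

```python
# Toy PyTorch sketch of the log-perplexity metric (mean token cross-entropy).
# Model outputs, vocabulary size, and data are placeholders, not the benchmark's own.
import torch
import torch.nn.functional as F

VOCAB, SEQ_LEN, BATCH = 32_000, 16, 4

# Stand-in "model": random next-token logits for each position.
logits = torch.randn(BATCH, SEQ_LEN, VOCAB)
targets = torch.randint(0, VOCAB, (BATCH, SEQ_LEN))

# Log perplexity is the average negative log-likelihood per token;
# perplexity itself is exp() of this value.
log_ppl = F.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1))
print(f"log perplexity: {log_ppl.item():.3f}")
print(f"perplexity:     {log_ppl.exp().item():.1f}")
```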

Conclusion

In this blog post, we have provided an overview of a new LLM pretraining benchmark based on Meta’s Llama 3.1 405B model, with more than twice as many parameters and four times as many training tokens as GPT3. This benchmark is massively more computationally intensive and presents new challenges of its own. It is the task force’s belief that this new benchmark will encourage more innovation in the area of LLM pretraining.

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies’ accuracy, safety, speed, and efficiency.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or email participation@mlcommons.org.

MLCommons Releases MLPerf Client v0.6 With Expanded Hardware Support for AI PCs
April 28, 2025 | https://mlcommons.org/2025/04/mlperf-client-v0-6/
MLCommons' MLPerf Client v0.6: First Open Benchmark for NPU and GPU AI PCs.

Today, MLCommons® is announcing the release of MLPerf® Client v0.6, an update to the MLPerf Client consumer AI performance benchmark. This release extends support to a broader range of hardware and platforms, including AI PCs with dedicated neural processing units (NPUs), while enhancing usability.

MLPerf Client v0.6 builds on the foundation established by version 0.5, which debuted with LLM-focused performance tests using the Llama 2 7B model from Meta. With this latest release, MLCommons continues its mission to provide a transparent, standardized, and vendor-neutral way to measure AI performance across a growing range of PC hardware.

MLPerf Client represents a collaboration among leaders in the consumer computing space, including AMD, Intel, Microsoft, NVIDIA, Qualcomm Technologies, Inc., and top PC OEMs. These stakeholders have pooled resources and expertise to create a standardized performance benchmark for key consumer AI workloads.

Key Updates in v0.6:

  • Expanded Hardware Support:
    New support for NPUs from Intel, alongside continued GPU acceleration via AMD, Intel, and NVIDIA hardware. This milestone makes MLPerf Client the first open benchmark to span both GPU and NPU acceleration on consumer platforms.
  • Improved Device Selection:
    New device enumeration options help users better target and test systems with multiple capable accelerators, such as PCs equipped with multiple GPUs.
  • Updated Software Stack:
    Includes the latest versions of ONNX Runtime, ONNX Runtime GenAI, and Intel OpenVINO, offering performance and compatibility improvements across supported platforms.

“MLPerf Client v0.6 reflects the rapid evolution of the AI PC landscape with the inclusion of NPU evaluation,” said Yanni Minkdakis, co-chair of the MLPerf Client working group. “With expanded support and more flexible testing, it’s now easier than ever for the industry and consumers alike to evaluate real-world AI performance on next-generation devices.”

MLPerf Client v0.6 is available now as a free download at mlcommons.org.

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies’ accuracy, safety, speed, and efficiency.

MLPerf Client participation requires an MLCommons membership. For more information and details on becoming a member, please visit MLCommons.org or contact participation@mlcommons.org.

MLCommons Releases French AILuminate Benchmark Demo Prompt Dataset to GitHub
April 16, 2025 | https://mlcommons.org/2025/04/ailuminate-french-datasets/
Set of 1,200 Creative Commons French Demo prompts and 12,000 Practice Test prompts exercise the breadth of hazards for generative AI systems.

Following the recent French release of the AILuminate v1.1 benchmark, MLCommons is announcing the public release of two French-language datasets for the benchmark. AILuminate measures the reliability and integrity of generative AI language systems against twelve categories of hazards and is currently available in English and French. Today’s dataset release includes a Creative Commons-licensed Demo prompt dataset of over 1,200 prompts and a Practice Test dataset of 12,000 prompts, both of which span all twelve AILuminate hazard categories.

The Demo prompt dataset is available on GitHub now and the full Practice Test dataset may be obtained by contacting MLCommons. English versions of both datasets are also available for use.

The AILuminate benchmark

The AILuminate benchmark, designed and developed by the MLCommons AI Risk & Reliability working group, assesses LLM responses to 12,000 test prompts per language, across twelve categories of hazards that users of generative AI language systems may encounter. The benchmark was designed to help organizations evaluate language systems before deploying them in production environments, to compare systems before procurement, or to verify their language systems comply with organizational and government policies. In addition to its unique corpus of prompts, AILuminate also includes a best-in-class evaluation system using a tuned ensemble of safety evaluation models. MLCommons has used the benchmark to assess the performance of many of the most prominent AI language systems-under-test (SUTs) across all of the hazard categories, and published public reports of their findings. 

The AILuminate French language prompt datasets can help provide valuable insights into your AI systems: 

  • Demo Prompt Dataset: A Creative Commons dataset with 1,200 prompts that is a subset of the AILuminate Practice Test dataset. The French-language Demo dataset is useful for in-house testing, provides examples of specific hazards, and offers seeds and inspiration for the creation of other prompts. It is available today on GitHub.
  • Practice Test Dataset: Includes 12,000 prompts. This dataset closely resembles the official test and is intended to provide a realistic projection of how models would be evaluated by AILuminate when used with the MLCommons AILuminate evaluator. Access to the comprehensive Practice dataset is available upon request from MLCommons. 

AILuminate’s prompts are based on 12 hazard categories, which are extensively documented in the AILuminate Assessment Standard.

All prompts are currently available in English and French, with future plans to expand to Simplified Chinese and Hindi. Each prompt dataset is human-made and reviewed by native language speakers for linguistic and cultural relevancy, to ensure accurate linguistic representation. MLCommons intends to regularly update the prompt datasets, including the Creative Commons Demo dataset as the community’s understanding of AI hazards evolves – for example, as agentic behavior is developed in AI systems. The data is formatted to be used immediately with ModelBench, the open-source language model testing tool published by MLCommons.
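As a quick sketch of how a downloaded copy of the Demo prompt dataset might be explored before plugging it into ModelBench, the Python snippet below uses pandas. The file name and column names (hazard, prompt_text) are hypothetical placeholders rather than the dataset's actual schema.

```python
# Sketch of exploring a downloaded AILuminate Demo prompt CSV with pandas.
# The file name and column names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("ailuminate_demo_fr_prompts.csv")  # hypothetical file name

# Count prompts per hazard category and preview a few examples.
print(df["hazard"].value_counts())
print(df[["hazard", "prompt_text"]].head())
```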

“AI reliability is a global problem that requires an intentionally global approach to address,” said Peter Mattson, Founder and President of MLCommons. “The availability of the French Demo and Practice Test datasets reflects our commitment to evaluating AI risk concerns across languages, cultures, and value systems — and ensures our benchmarks and datasets are readily available to vendors and developers across continents.” 

Raising the bar for reliability testing by empowering stakeholders

The AILuminate Demo prompt datasets provide an important starting point for those who are developing their in-house abilities to test the safety of generative AI systems across the twelve defined AILuminate hazard categories. While we discourage training AI systems directly using this data – following the industry best practice of maintaining separate evaluation suites to avoid overfitting AI models – we do expect the dataset to have direct relevance to the work of a wide variety of stakeholders. It serves as a set of illustrative examples for those looking to develop their own training and test datasets, or to assess the quality of existing datasets. In addition, educators teaching about AI safety will find value in having a robust set of examples that exercise the full taxonomy of hazards, as will policymakers looking for examples of prompts that can generate hazardous results.

Download the Demo prompt dataset today!

The Demo prompt dataset is made available under the Creative Commons CC-BY-4.0 license and can be downloaded directly from GitHub. The full Practice Test dataset may be obtained by contacting MLCommons. After working with the offline and online practice tests, model developers are encouraged to submit their systems to the AILuminate Official Test for public benchmarking, validation, and publication of the results.

More information on AILuminate, including an FAQ, can be found here. We encourage stakeholders to join the MLCommons AI Risk & Reliability working group and contribute to our efforts to ensure greater reliability of AI systems.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. In collaboration with its 125+ members (global technology providers, academics, and researchers), MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry: benchmarks and metrics, public datasets, and measurements of AI risk and reliability. We invite others to join the AI Risk & Reliability working group.

FRENCH ANNOUNCEMENT

MLCommons releases the French-language AILuminate demo prompt dataset on GitHub

A set of 1,200 Creative Commons prompts in French explores the breadth of hazards for generative AI systems

Following the recent launch of the French-language version of the AILuminate v1.1 benchmark, MLCommons announces the public availability of two French-language datasets for the benchmark. AILuminate evaluates the reliability and integrity of generative AI systems across twelve hazard categories, and its benchmarks are now available for English- and French-language systems. Today’s release includes a set of 1,200 demonstration prompts under a Creative Commons license and a set of 12,000 practice prompts covering the twelve hazard types defined by AILuminate.

The demonstration dataset is available now on GitHub, while the full practice dataset can be obtained by contacting MLCommons. English versions of both datasets are also available.

The AILuminate benchmark

The AILuminate benchmark, designed and developed by the MLCommons AI Risk & Reliability working group, evaluates the responses of language models (LLMs) to 12,000 prompts per language, spread across twelve categories of hazards that users may encounter. The benchmark helps organizations evaluate their systems before deployment, compare products before purchase, or verify compliance with internal and official policies. In addition to a unique collection of prompts, AILuminate includes an advanced evaluation system based on a tuned ensemble of safety evaluation models. MLCommons has used the benchmark to analyze the performance of leading language systems-under-test (SUTs) across all hazard categories and has published public reports of the results.

The AILuminate French-language datasets offer valuable insights into your AI systems:

  • Demo prompt dataset: A Creative Commons subset of 1,200 prompts drawn from the practice prompt set. This set is useful for in-house testing, provides examples of the identified hazards, and can inspire the creation of additional prompts. It is available today on GitHub.
  • Practice prompt dataset: Includes 12,000 prompts designed to simulate a realistic evaluation by the official AILuminate system. This full set can be obtained on request from MLCommons.

The prompts are based on the twelve hazard categories defined in the AILuminate Assessment Standard, which include:

[Figure: AILuminate hazard categories, translated into French]

All prompts are available today in English and French, and will soon be available in Simplified Chinese and Hindi. The prompts are written by humans and then reviewed by native speakers to ensure linguistic and cultural relevance. MLCommons plans to update them regularly, particularly the Creative Commons set, as the community’s understanding of AI-related hazards evolves, for example with the development of agentic behavior in AI systems. The data is formatted for immediate use with ModelBench, the open-source tool published by MLCommons.

“The reliability of AI systems is a global problem that requires a deliberately international approach,” said Peter Mattson, Founder and President of MLCommons. “The availability of the French prompt datasets reflects our commitment to evaluating AI risk concerns across different languages, cultures, and value systems, ensuring that our benchmarks are accessible to commercial vendors and developers around the world.”

Raising the bar for AI testing by empowering stakeholders

The AILuminate datasets provide an important starting point for developing in-house capabilities to assess the safety of generative AI systems across the twelve categories defined by AILuminate. Although we discourage using these prompts to train AI systems (industry best practice recommends isolating evaluation datasets to avoid overfitting), the prompts are useful across many sectors. They serve as representative examples for users who want to develop their own training and evaluation datasets or assess the quality of existing datasets. AI safety educators will benefit from this comprehensive set of examples covering an extensive taxonomy of hazards, and policymakers will find concrete cases where certain prompts can generate hazardous results.

Download the dataset today!

The demo prompt dataset is available under the Creative Commons CC-BY-4.0 license and can be downloaded directly from GitHub. The full practice dataset can be obtained by contacting MLCommons. After using the offline or online practice tests, we encourage developers to submit their systems to AILuminate for public benchmarking, with official validation and publication of the results.

To learn more about AILuminate and access a detailed FAQ, we encourage all stakeholders to join the MLCommons AI Risk & Reliability working group.

About MLCommons

MLCommons is the world leader in developing benchmarks for AI. It is an open consortium whose mission is to make AI better for everyone through the publication of benchmarks and public datasets. Made up of more than 125 members (technology companies, academics, and researchers from around the world), MLCommons focuses on the collaborative development of tools for the AI industry through public benchmarks, datasets, and measurements addressing AI risk and reliability. We invite all interested parties to join the MLCommons AI Risk & Reliability working group.

The post MLCommons Releases French AILuminate Benchmark Demo Prompt Dataset to Github appeared first on MLCommons.

]]>
CKAN Announces Support for Croissant https://mlcommons.org/2025/04/ckan-croissant/ Wed, 16 Apr 2025 00:30:02 +0000 https://mlcommons.org/?p=2778 The MLCommons Croissant metadata standard now connects CKAN’s open data catalog capabilities with ML workflows

The post CKAN Announces Support for Croissant appeared first on MLCommons.

]]>

CKAN, the world’s leading open-source data management system, recently announced support for the Croissant metadata standard, a format designed by MLCommons to make datasets machine-learning-ready. CKAN sites can now easily expose their datasets to ML pipelines with the help of the newly released croissant plugin. This integration, powered by the ckanext-dcat 2.3.0 extension, connects CKAN’s open data catalog capabilities with ML workflows, improving dataset discoverability and interoperability.
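As a rough illustration of what this integration enables, the sketch below loads a dataset’s Croissant metadata with the mlcroissant Python package and iterates over a few records. The CKAN endpoint URL and the record set name are hypothetical placeholders rather than the plugin’s documented paths; the general mlc.Dataset / records() usage follows the Croissant project’s published examples.

```python
# Minimal sketch: read Croissant (JSON-LD) metadata exposed by a CKAN site and
# stream a few records through mlcroissant. The endpoint URL and record set
# name below are placeholders for illustration only.
import mlcroissant as mlc

# Hypothetical URL where a CKAN instance running the croissant plugin might
# serve a dataset's Croissant metadata.
CROISSANT_URL = "https://demo.ckan.org/dataset/example-dataset/croissant.jsonld"

def preview_records(jsonld_url: str, record_set: str = "default", limit: int = 5) -> None:
    """Print dataset-level metadata and the first few records of a record set."""
    ds = mlc.Dataset(jsonld=jsonld_url)
    print(ds.metadata.name, "-", ds.metadata.description)
    for i, record in enumerate(ds.records(record_set=record_set)):
        print(record)
        if i + 1 >= limit:
            break

if __name__ == "__main__":
    preview_records(CROISSANT_URL)
```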

Learn more on the CKAN blog: Bridging CKAN and Machine Learning: Introducing Support for the Croissant Standard

The post CKAN Announces Support for Croissant appeared first on MLCommons.

]]>