News Archives - MLCommons
https://mlcommons.org/category/news/ | Better AI for Everyone

MLCommons Announces Expansion of Industry-Leading AILuminate Benchmark
https://mlcommons.org/2025/05/nasscom/ | Thu, 29 May 2025
World leader in AI benchmarking announces new partnership with India’s NASSCOM; updated reliability grades for leading LLMs

MLCommons today announced that it is expanding its first-of-its-kind AILuminate benchmark to measure AI reliability across new models, languages, and tools. As part of this expansion, MLCommons is partnering with NASSCOM, India’s premier technology trade association, to bring AILuminate’s globally recognized AI reliability benchmarks to South Asia. MLCommons is also unveiling new proof of concept testing for AILuminate’s Chinese-language capabilities and new AILuminate reliability grades for an expanded suite of large language models (LLMs). 

“We’re looking forward to working with NASSCOM to develop India-specific, Hindi-language benchmarks and ensure companies in India and around the world can better measure the reliability and risk of their AI products,” said Peter Mattson, President of MLCommons. “This partnership, along with new AILuminate grades and proof-of-concept testing for Chinese-language capabilities, represents a major step towards the development of globally inclusive industry standards for AI reliability.”

“The rapid development of AI is reshaping India’s technology sector and, in order to manage risk and foster innovation, rigorous global standards can help align the growth of the industry with emerging best practices,” said Ankit Bose, Head of NASSCOM AI. “We plan to work alongside MLCommons to develop these standards and ensure that the growth and societal integration of AI technology continues responsibly.”

The NASSCOM collaboration builds on MLCommons’ intentionally global approach to AI benchmarking. Modeled after MLCommons’ ongoing partnership with Singapore’s AI Verify Foundation, the NASSCOM partnership will help to meet South Asia’s urgent need for standardized AI benchmarks that are collaboratively designed and trusted by the region’s industry experts, policymakers, civil society members, and academic researchers. MLCommons’ partnership with the AI Verify Foundation – in close collaboration with the National University of Singapore – has already resulted in significant progress towards globally-inclusive AI benchmarking across East Asia, including just-released proof of concept scores for Chinese-language LLMs.

AILuminate is also unveiling new reliability grades for an updated and expanded suite of LLMs, to help companies around the world better measure product risk. Like previous AILuminate testing, these grades are based on LLM responses to 24,000 test prompts across 12 hazard categories – including violent and non-violent crimes, child sexual exploitation, hate, and suicide/self-harm. None of the LLMs evaluated were given any advance knowledge of the evaluation prompts (a common problem in non-rigorous benchmarking), nor access to the evaluator model used to assess responses. This independence provides a methodological rigor uncommon in standard academic research or private benchmarking.

“Companies are rapidly incorporating chatbots into their products, and these updated grades will help them better understand and compare risk across new and constantly-updated models,” said Rebecca Weiss, Executive Director of MLCommons. “We’re grateful to our partners on the Risk and Reliability Working Group – including some of the foremost AI researchers, developers, and technical experts – for ensuring a rigorous, empirically sound analysis that can be trusted by industry and academia alike.”

Having successfully expanded the AILuminate benchmark to multiple languages, the AI Risk & Reliability Working Group is beginning the process of evaluating reliability across increasingly sophisticated AI tools, including multi-modal LLMs and agentic AI. We hope to announce proof-of-concept benchmarks in these spaces later this year.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI risk and reliability.

MLCommons MLPerf Training Expands with Llama 3.1 405B
https://mlcommons.org/2025/05/training-llama31405b/ | Mon, 05 May 2025
MLCommons adds new pretraining benchmark for testing large-scale systems

About MLPerf Training

Pretraining Large Language Models (LLMs) is the initial phase of LLM training, in which the model learns to understand and generate human-like text by training on vast amounts of unstructured textual data. A pre-trained model can then be fine-tuned on domain-specific datasets, which gives the model knowledge of specific tasks. Reasoning approaches have also become popular recently, helping models better explain their thinking process. Because pretraining uses such large amounts of data, it is generally the most compute-intensive phase of LLM training.

MLCommons® added a GPT3 Pretraining benchmark to MLPerf® Training in 2022 to help academics and industry experts optimize LLM pretraining. It has been a huge success, as we have seen over 3x speedups on the results of this benchmark for large systems (over 10,000 accelerators).

Over this same period, newer state-of-the-art models have demonstrated greater scale and many architectural advancements. For example, Google’s PaLM model, released in 2022, contains 540B parameters, roughly three times GPT3’s 175B parameter count. Meta’s Llama 2 and Llama 3 series, released in 2023 and 2024 respectively, utilize new algorithms such as Root Mean Square Layer Normalization (RMSNorm; Zhang et al., 2019, https://arxiv.org/abs/1910.07467), Rotary Position Embedding (RoPE; Su et al., 2024), and Grouped Query Attention (GQA; Ainslie et al., 2023).

An MLPerf Training working group task force was created to investigate a July 2024 proposal to replace GPT3 with Llama 3.1 405B. Task force members evaluated the proposal and industry options to recommend the best new pretraining reference. They agreed that Llama 3.1 405B would best represent the current state of the art in pretraining and adopted it as the new benchmark. Details on the review and selection process are included in this blog.

Model selection

The most important factor the task force considered when constructing this new pretraining benchmark was the model architecture. As mentioned above, the GPT3 architecture does not include some of the recent algorithmic updates and is therefore not competitive with state-of-the-art models such as Nemotron-4, GPT-4.5, and Claude 3.7 Sonnet. Meta’s Llama 3.1 405B model, however, with an architecture and scale similar to these top-tier models, demonstrates on-par performance with them across multiple quality benchmarks, positioning it as a competitive representative of the current state of the art.

Choosing a model with high community engagement was also an important consideration, and Llama 3.1 405B is the perfect candidate on that basis. With more than 300 million total downloads of all Llama versions to date, the Llama family of models is very widely adopted. The MLPerf community, for instance, has already included Llama 2 in the LoRA fine-tuning benchmark. Additionally, the technical paper released by Meta details the training process, model architecture, dataset, and hyperparameters, which allowed us to easily create a high-quality benchmark based on their work.

One other requirement for the new MLPerf Training benchmark is that the model architecture must be publicly accessible so that all submitters can download it and reproduce the results on their end. This typically requires a permissive license. Thanks to Meta’s support, the current Llama 3.1 405B model is now available to all MLCommons members for benchmarking. 

Dataset selection

To help submitters transition easily from GPT3 to Llama 3.1 405B, we use the same AllenAI C4-en dataset (the English split of the Colossal Clean Crawled Corpus) in both benchmarks. This dataset contains 365M English-language paragraphs scraped from the internet, cleaned and preprocessed into a convenient JSON format for training the model.
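For readers who want to inspect the corpus itself, the short sketch below streams a few records from the public Hugging Face mirror of AllenAI's C4-en dataset. The mirror name and record fields reflect the public release and are not part of the MLPerf reference data pipeline; treat this as an exploration aid rather than the benchmark's preprocessing code.

```python
# A minimal sketch for peeking at C4-en via the public "allenai/c4" mirror on
# Hugging Face; streaming avoids downloading the full multi-hundred-GB corpus.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(c4):
    # Each record carries the cleaned document text (plus URL and timestamp).
    print(example["text"][:200])
    if i == 2:
        break
```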

One key difference is that Llama 3.1 405B uses a context length of 8,192 tokens, 4x larger than GPT3’s 2,048-token context. The context length is the amount of text that a model can process at once; a larger context length allows the model to learn from longer-range dependencies in the text.

Benchmark technical details

The reference code is implemented in the NVIDIA NeMo Framework, an open-source, scalable AI framework built for large language models. By offering easy-to-use modular components and pre-trained model recipes, NeMo enables developers to efficiently build performant models tailored to specific use cases. The reference code functionality is tested on NVIDIA H100 GPUs.

To ensure that the new benchmark’s model setup is easy to understand, without complex knobs and hyperparameters scattered throughout the training script, the benchmark starts from NeMo’s integrated Llama 3.1 405B training recipe. The recipe closely follows the original paper’s hyperparameters, and we introduced only the minimum necessary changes, such as adjusting the optimizer’s learning rates.

When researching how to construct this benchmark, we first explored the GPT3-style setup. For the GPT3 pretraining benchmark, the task force pretrained from random weights for about 12.6T tokens to generate a stable customized checkpoint. Submitters were required to continue pretraining from this checkpoint until they reached a predetermined quality metric. This methodology was used to ensure that the GPT3 pretraining benchmark showcased a reproducible slice of the actual GPT3 pretraining process.

Applying the same methodology to the Llama 3.1 405B pretraining benchmark, the task force generated a customized checkpoint trained on 32B tokens. However, experiment results showed that resuming from this checkpoint produced more variation in the convergence curves than we could tolerate, which meant that resuming from a custom-trained checkpoint was not feasible for this benchmark. Our interpretation is that, after only 32B tokens, the model was still in a very early phase of training, so the generated checkpoint was still quite unstable.

So the task force decided to start from the fully trained and stable Hugging Face checkpoint and continue training it on new data, which led to a much more stable convergence trend. The benchmark now begins from that checkpoint and stops when the evaluation log perplexity reaches our defined target.

One noticeable difference between the task force’s implementation and Meta’s implementation is that we replaced the original tiktoken-based tokenizer, which uses a 128k vocabulary, with the 32k-vocabulary Mixtral 8x22B tokenizer. The motivation behind this change is that Meta’s checkpoint is already well trained, so it would not learn anything new on the current C4 dataset. Replacing the tokenizer forces the model checkpoint to adapt to a new, different token distribution and therefore to continue pretraining on the current dataset. This difference implies that, when we resume from the pre-trained checkpoint, only the first 32,000 rows of the 128,256-row word embedding layer weights are loaded into the model.
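A minimal sketch of this embedding-truncation step is shown below, assuming a PyTorch-style state dict; the tensor key and the toy hidden size are placeholders for illustration, not the benchmark's reference implementation.

```python
# Illustrative only: keep the first 32,000 rows of a 128,256-row word embedding
# so the checkpoint matches the smaller replacement tokenizer's vocabulary.
import torch

OLD_VOCAB = 128_256   # rows in the released word embedding
NEW_VOCAB = 32_000    # vocabulary size of the replacement 32k tokenizer

def shrink_embedding(state_dict: dict, key: str = "model.embed_tokens.weight") -> dict:
    full = state_dict[key]
    assert full.shape[0] == OLD_VOCAB, "unexpected embedding size"
    state_dict[key] = full[:NEW_VOCAB].clone()
    return state_dict

# Toy stand-in for the real checkpoint (hidden size shrunk to keep it cheap).
toy = {"model.embed_tokens.weight": torch.randn(OLD_VOCAB, 16)}
toy = shrink_embedding(toy)
print(toy["model.embed_tokens.weight"].shape)  # torch.Size([32000, 16])
```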

As mentioned before, the Llama 3.1 405B pretraining benchmark is a replacement for the GPT3 pretraining benchmark, and the following components will be unchanged between the two:

  • Validation dataset: To reduce the cost of validation, we created a customized subset of the full C4 validation dataset so that evaluation time is not significant compared to training time. Specifically, we found that using only 5,760 sequences is sufficient to measure model convergence; since only 91,205 validation samples are needed to yield those 5,760 validation sequences, the first 91,205 unshuffled rows of the full C4 validation dataset were selected as the customized validation dataset for this benchmark.
  • Loss and target accuracy metric: As with GPT3, log perplexity, computed on the customized validation dataset, is used to evaluate the model’s convergence behavior. This metric is not costly to compute and is widely adopted as a versatile indicator for evaluating LLMs (a minimal computation sketch appears after this list).
  • Limit on the number of submission runs: We recognize that the Llama 3.1 405B pretraining benchmark is very expensive to run, so to ensure fairness while reducing submitters’ costs, each submitter is required to run the benchmark only three times, the same as for the GPT3 pretraining benchmark. The median result is then reported.
  • Restriction on hyperparameter searching: Because running the benchmark is expensive, we believe it is reasonable to disallow hyperparameter searches and borrowing. Most hyperparameters have been given static values. For a few others, such as the learning rate, formulas are provided to compute them from the global batch size.
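As a concrete reference for the loss metric above, the sketch below computes evaluation log perplexity as the mean per-token negative log-likelihood over a validation loader. It assumes a Hugging Face-style model whose forward pass returns `.logits`; it illustrates the metric, not the benchmark's reference evaluation code.

```python
# Minimal log-perplexity evaluation: average per-token negative log-likelihood.
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_log_perplexity(model, val_loader, device="cuda"):
    total_nll, total_tokens = 0.0, 0
    for batch in val_loader:
        input_ids = batch["input_ids"].to(device)        # (B, T)
        labels = input_ids[:, 1:]                        # next-token targets
        logits = model(input_ids[:, :-1]).logits         # (B, T-1, V)
        nll = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            labels.reshape(-1),
            reduction="sum",
        )
        total_nll += nll.item()
        total_tokens += labels.numel()
    return total_nll / total_tokens   # log perplexity; exp() gives perplexity
```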

Conclusion

In this blog post, we have provided an overview of a new LLM pretraining benchmark based on Meta’s Llama 3.1 405B model, with more than twice as many parameters and four times as many training tokens as GPT3. This benchmark is massively more computationally intensive and brings its own new challenges. It is the task force’s belief that this new benchmark will encourage further innovation in LLM pretraining.

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies’ accuracy, safety, speed, and efficiency.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or email participation@mlcommons.org.

MLCommons Releases MLPerf Client v0.6 With Expanded Hardware Support for AI PCs
https://mlcommons.org/2025/04/mlperf-client-v0-6/ | Mon, 28 Apr 2025
MLCommons' MLPerf Client v0.6: First Open Benchmark for NPU and GPU AI PCs

Today, MLCommons® is announcing the release of MLPerf® Client v0.6, an update to the MLPerf Client consumer AI performance benchmark. This release extends support to a broader range of hardware and platforms, including AI PCs with dedicated neural processing units (NPUs), while enhancing usability.

MLPerf Client v0.6 builds on the foundation established by version 0.5, which debuted with LLM-focused performance tests using the Llama 2 7B model from Meta. With this latest release, MLCommons continues its mission to provide a transparent, standardized, and vendor-neutral way to measure AI performance across a growing range of PC hardware.

MLPerf Client represents a collaboration among leaders in the consumer computing space, including AMD, Intel, Microsoft, NVIDIA, Qualcomm Technologies, Inc., and top PC OEMs. These stakeholders have pooled resources and expertise to create a standardized performance benchmark for key consumer AI workloads.

Key Updates in v0.6:

  • Expanded Hardware Support:
    New support for NPUs from Intel, alongside continued GPU acceleration via AMD, Intel, and NVIDIA hardware. This milestone makes MLPerf Client the first open benchmark to span both GPU and NPU acceleration on consumer platforms.
  • Improved Device Selection:
    New device enumeration options help users better target and test systems with multiple capable accelerators, such as PCs equipped with multiple GPUs.
  • Updated Software Stack:
    Includes the latest versions of ONNX Runtime, ONNX Runtime GenAI, and Intel OpenVINO, offering performance and compatibility improvements across supported platforms.

“MLPerf Client v0.6 reflects the rapid evolution of the AI PC landscape with the inclusion of NPU evaluation,” said Yanni Minkdakis, co-chair of the MLPerf Client working group. “With expanded support and more flexible testing, it’s now easier than ever for the industry and consumers alike to evaluate real-world AI performance on next-generation devices.”

MLPerf Client v0.6 is available now as a free download at mlcommons.org.

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies’ accuracy, safety, speed, and efficiency.

MLPerf Client participation requires an MLCommons membership. For more information and details on becoming a member, please visit MLCommons.org or contact participation@mlcommons.org.

MLCommons Releases French AILuminate Benchmark Demo Prompt Dataset to GitHub
https://mlcommons.org/2025/04/ailuminate-french-datasets/ | Wed, 16 Apr 2025
Set of 1,200 Creative Commons French Demo prompts and 12,000 Practice Test prompts exercise the breadth of hazards for generative AI systems

Following the recent French-language release of the AILuminate v1.1 benchmark, MLCommons is announcing the public release of two French-language datasets for the benchmark. AILuminate measures the reliability and integrity of generative AI language systems against twelve categories of hazards and is currently available in English and French. Today’s dataset release includes a Creative Commons-licensed Demo prompt dataset of over 1,200 prompts and a Practice Test dataset of 12,000 prompts, both of which span all twelve AILuminate hazard categories.

The Demo prompt dataset is available on GitHub now and the full Practice Test dataset may be obtained by contacting MLCommons. English versions of both datasets are also available for use.

The AILuminate benchmark

The AILuminate benchmark, designed and developed by the MLCommons AI Risk & Reliability working group, assesses LLM responses to 12,000 test prompts per language, across twelve categories of hazards that users of generative AI language systems may encounter. The benchmark was designed to help organizations evaluate language systems before deploying them in production environments, to compare systems before procurement, or to verify their language systems comply with organizational and government policies. In addition to its unique corpus of prompts, AILuminate also includes a best-in-class evaluation system using a tuned ensemble of safety evaluation models. MLCommons has used the benchmark to assess the performance of many of the most prominent AI language systems-under-test (SUTs) across all of the hazard categories, and published public reports of their findings. 

The AILuminate French language prompt datasets can help provide valuable insights into your AI systems: 

  • Demo Prompt Dataset: A Creative Commons dataset of 1,200 prompts that is a subset of the AILuminate Practice Test dataset. The French-language Demo dataset is useful for in-house testing, provides examples of specific hazards, and offers seeds and inspiration for the creation of other prompts. It is available today on GitHub (a quick loading sketch appears after this list).
  • Practice Test Dataset: Includes 12,000 prompts. This dataset closely resembles the official test and is intended to provide a realistic projection of how models would be evaluated by AILuminate when used with the MLCommons AILuminate evaluator. Access to the comprehensive Practice dataset is available upon request from MLCommons. 
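As a quick illustration of in-house spot checking with the Demo dataset, the sketch below loads a local copy of the prompt file with pandas. The file name and column names used here (`prompt_text`, `hazard`) are assumptions for illustration; consult the GitHub repository's documentation for the actual schema.

```python
# Hypothetical quick-start: group the Demo prompts by hazard category and
# sample a few for manual review. File and column names are assumed.
import pandas as pd

demo = pd.read_csv("ailuminate_demo_prompts_fr.csv")   # assumed local file name

# Confirm coverage across the twelve hazard categories.
print(demo["hazard"].value_counts())

# Pull a handful of prompts from one category for human review.
print(demo[demo["hazard"] == "hate"].sample(3, random_state=0)["prompt_text"])
```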

AILuminate’s prompts are based on 12 hazard categories – including violent and non-violent crimes, child sexual exploitation, hate, and suicide/self-harm – which are extensively documented in the AILuminate Assessment Standard.

All prompts are currently available in English and French, with future plans to expand to Simplified Chinese and Hindi. Each prompt dataset is human-made and reviewed by native language speakers for linguistic and cultural relevancy, to ensure accurate linguistic representation. MLCommons intends to regularly update the prompt datasets, including the Creative Commons Demo dataset as the community’s understanding of AI hazards evolves – for example, as agentic behavior is developed in AI systems. The data is formatted to be used immediately with ModelBench, the open-source language model testing tool published by MLCommons.

“AI reliability is a global problem that requires an intentionally global approach to address,” said Peter Mattson, Founder and President of MLCommons. “The availability of the French Demo and Practice Test datasets reflects our commitment to evaluating AI risk concerns across languages, cultures, and value systems — and ensures our benchmarks and datasets are readily available to vendors and developers across continents.” 

Raising the bar for reliability testing by empowering stakeholders

The AILuminate Demo prompt datasets provide an important starting point for those who are developing their in-house abilities to test the safety of generative AI systems across the twelve defined AILuminate hazard categories. While we discourage training AI systems directly using this data – following the industry best practice of maintaining separate evaluation suites to avoid overfitting AI models – we do expect the dataset to have direct relevance to the work of a wide variety of stakeholders. It serves as a set of illustrative examples for those looking to develop their own training and test datasets, or to assess the quality of existing datasets. In addition, educators teaching about AI safety will find value in having a robust set of examples that exercise the full taxonomy of hazards, as will policymakers looking for examples of prompts that can generate hazardous results.

Download the Demo prompt dataset today!

The Demo test dataset is made available using the Creative Commons CC-BY-4.0 license and can be downloaded directly from GitHub. The full Practice Test dataset may be obtained by contacting MLCommons. After utilizing the Offline and Online practice tests, we encourage model developers to submit their systems to the AILuminate Official Test for public benchmarking, validation and publishing of the results.

More information on AILuminate can be found here, including a FAQ. We encourage stakeholders to join the MLCommons AI Risk & Reliability working group and our efforts to ensure greater reliability of AI systems.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI risk and reliability. We invite others to join the AI Risk & Reliability working group.

FRENCH ANNOUNCEMENT

MLCommons publie le jeu de données de prompts de démonstration AILuminate en français sur GitHub

Une série de 1 200 prompts Creative Commons en français explore l’étendue des risques pour les systèmes d’IA générative

Suite au récent lancement de la version française du benchmark AILuminate v1.1, MLCommons annonce la mise à disposition publique de deux jeux de données en langue française pour ce benchmark. AILuminate évalue la fiabilité et l’intégrité des systèmes d’IA générative dans douze catégories de risques, et ses benchmarks sont désormais disponibles pour les systèmes en anglais et en français. La publication d’aujourd’hui inclut un jeu de 1 200 prompts de démonstration sous licence Creative Commons et un jeu de 12 000 prompts d’essai couvrant les douze types de risques définis par AILuminate.

Le jeu de données de démonstration est dès maintenant disponible sur GitHub, tandis que le jeu complet d’essai peut être obtenu en contactant MLCommons. La version anglaise des deux jeux est également disponible.

Le benchmark AILuminate

Le benchmark AILuminate, conçu et développé par le groupe de travail sur les risques et la fiabilité des IA chez MLCommons, évalue les réponses des modèles linguistiques (LLM) à 12 000 prompts par langue, répartis dans douze catégories de risques auxquels les utilisateurs peuvent être confrontés. Ce benchmark aide les organisations à évaluer leurs systèmes avant leur déploiement, à comparer les produits avant achat ou à vérifier leur conformité aux politiques internes et officielles. En plus d’une collection unique de prompts, AILuminate inclut un système d’évaluation avancé basé sur un ensemble optimisé de modèles d’évaluation des risques. MLCommons a utilisé ce benchmark pour analyser la performance des principaux systèmes linguistiques testés (SUTs) dans toutes les catégories de risques et a publié des rapports publics sur leurs résultats.

Les jeux de données en langue française AILuminate offrent des perspectives précieuses pour vos systèmes d’IA :

  • Jeu de prompts de démonstration : Un sous-ensemble Creative Commons contenant plus de 1 200 prompts issus du jeu de prompts d’essai. Ce jeu est utile pour les tests internes, fournit des exemples spécifiques aux risques identifiés et peut inspirer la création d’autres prompts. Il est disponible dès aujourd’hui sur GitHub.
  • Jeu de prompts d’essai : Comprend 12 000 prompts conçus pour simuler une évaluation réaliste par le système officiel AILuminate. Ce jeu complet peut être obtenu sur demande auprès de MLCommons.

Les prompts sont basés sur les douze catégories de risques définies dans le standard d’évaluation AILuminate, qui inclut notamment :

[Image : catégories de risques AILuminate traduites en français]

Tous les prompts sont disponibles aujourd’hui en anglais et en français, et seront bientôt disponibles en chinois simplifié et en hindi. Les prompts sont rédigés par un être humain, puis contrôlés par une personne de langue maternelle afin d’en assurer la pertinence linguistique et culturelle. MLCommons prévoit de les mettre à jour régulièrement, notamment le jeu Creative Commons, à mesure que la compréhension communautaire des risques liés à l’IA évolue – par exemple avec le développement du comportement agentique dans les systèmes d’IA. Les données sont formatées pour une utilisation immédiate avec ModelBench, l’outil open-source publié par MLCommons.

« La fiabilité des systèmes IA est un problème mondial nécessitant une approche internationale délibérée », a déclaré Peter Mattson, fondateur et président de MLCommons. « La disponibilité des jeux de prompts français reflète notre engagement quant à l’évaluation des préoccupations liées aux risques de l’IA à travers différentes langues, cultures et systèmes de valeurs – garantissant que nos benchmarks soient accessibles aux entreprises commerciales et aux développeurs dans le monde entier. »

Élever les standards des tests IA en responsabilisant les parties prenantes

Les jeux de données AILuminate fournissent un point de départ important pour le développement des capacités internes en permettant d’évaluer la sécurité des systèmes d’IA générative dans les douze catégories définies par AILuminate. Bien que nous déconseillions l’utilisation de ces prompts pour entraîner des systèmes d’IA (les meilleures pratiques de l’industrie recommandent l’isolation des jeux de données d’évaluation afin d’éviter le surapprentissage), ces prompts sont exploitables dans de nombreux secteurs. Ils servent d’exemples représentatifs pour les utilisateurs qui souhaitent développer leurs propres jeux de données d’entraînement et d’évaluation ou évaluer la qualité de jeux de données existants. Les formateurs en sécurité de l’IA sauront profiter de cet ensemble complet d’exemples couvrant une taxonomie de risques étendue, et les décideurs politiques pourront y trouver des cas concrets où certains prompts peuvent générer des résultats dangereux.

Téléchargez le jeu de données aujourd’hui !

Le jeu de prompts de démonstration est disponible sous licence Creative Commons CC-BY-4.0 et peut être téléchargé directement sur GitHub. Le jeu d’essai complet peut être obtenu en contactant MLCommons. Après l’utilisation des tests d’essai hors ligne ou en ligne, nous encourageons les développeurs à soumettre leurs systèmes à AILuminate pour obtenir un benchmarking public avec validation et publication officielle des résultats.

Pour en savoir plus sur AILuminate et accéder à une FAQ détaillée, nous encourageons toutes les parties prenantes à rejoindre le groupe MLCommons sur les risques IA et la fiabilité.

À propos de MLCommons

MLCommons est le leader mondial du développement de benchmarks pour l’IA. C’est un consortium ouvert dont la mission est d’améliorer l’IA pour tous par la publication de benchmarks et de jeux de données publics. Formée de plus de 125 membres (entreprises de technologie, universitaires et chercheurs du monde entier), MLCommons se concentre sur le développement collaboratif d’outils pour l’industrie de l’IA par le biais de benchmarks, de jeux de données et de mesures publics traitant des risques et de la fiabilité de l’IA. Nous invitons toutes les parties intéressées à rejoindre le groupe sur les risques de l’IA chez MLCommons.

CKAN Announces Support for Croissant
https://mlcommons.org/2025/04/ckan-croissant/ | Wed, 16 Apr 2025
MLCommons Croissant metadata standard is now connected to CKAN’s open data catalog capabilities with ML workflows


CKAN, the world’s leading open source data management system, recently announced support for the Croissant metadata standard, a format designed by MLCommons to make datasets machine learning-ready. CKAN sites can now easily expose their datasets to ML pipelines with the help of the newly released croissant plugin. This integration, powered by the ckanext-dcat 2.3.0 extension, connects CKAN’s open data catalog capabilities with ML workflows, improving dataset discoverability and interoperability.
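To give a sense of what Croissant metadata looks like, the sketch below builds a minimal Croissant-style JSON-LD description in Python. The field values and the simplified `@context` are illustrative assumptions, not an actual CKAN export or the full Croissant 1.0 context.

```python
# Illustrative Croissant-style dataset description (simplified; not a real export).
import json

croissant_metadata = {
    "@context": {"@vocab": "https://schema.org/", "cr": "http://mlcommons.org/croissant/"},
    "@type": "Dataset",
    "conformsTo": "http://mlcommons.org/croissant/1.0",
    "name": "example-ckan-catalog-entry",
    "description": "Tabular dataset exposed by a CKAN catalog for ML pipelines.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/dataset/example",
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "data.csv",
            "contentUrl": "https://example.org/dataset/example/data.csv",
            "encodingFormat": "text/csv",
        }
    ],
}

print(json.dumps(croissant_metadata, indent=2))
```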

Learn more on the CKAN blog: Bridging CKAN and Machine Learning: Introducing Support for the Croissant Standard

MLCommons Releases New MLPerf Inference v5.0 Benchmark Results
https://mlcommons.org/2025/04/mlperf-inference-v5-0-results/ | Wed, 02 Apr 2025
Generative AI now the center of attention for performance engineering

Today, MLCommons® announced new results for its industry-standard MLPerf® Inference v5.0 benchmark suite, which delivers machine learning (ML) system performance benchmarking in an architecture-neutral, representative, and reproducible manner. The results highlight that the AI community is focusing much of its attention and efforts on generative AI scenarios, and that the combination of recent hardware and software advances optimized for generative AI have led to dramatic performance improvements over the past year.

The MLPerf Inference benchmark suite, which encompasses both datacenter and edge systems, is designed to measure how quickly systems can run AI and ML models across a variety of workloads. The open-source and peer-reviewed benchmark suite creates a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry. It also provides critical technical information for customers who are procuring and tuning AI systems. This round of MLPerf Inference results also includes tests for four new benchmarks: Llama 3.1 405B, Llama 2 70B Interactive for low-latency applications, RGAT, and Automotive PointPainting for 3D object detection.

Llama 2 70B generative AI test takes center stage

The Inference v5.0 results show that generative AI scenarios have gained momentum. Over the last year, submissions to the Llama 2 70B benchmark test – which implements a large generative AI inference workload based on a widely referenced open-source model – have increased 2.5x. With the v5.0 release, Llama 2 70B has now supplanted ResNet50 as the test with the highest submission rate.

The performance results for Llama 2 70B have also skyrocketed since a year ago: the median submitted score has doubled, and the best score is 3.3 times faster compared to Inference v4.0.

“It’s clear now that much of the ecosystem is focused squarely on deploying generative AI, and that the performance benchmarking feedback loop is working,” said David Kanter, head of MLPerf at MLCommons. “We’re seeing an unprecedented flood of new generations of accelerators. The hardware is paired with new software techniques, including aligned support across hardware and software for the FP4 data format. With these advances, the community is setting new records for generative AI inference performance.”

The benchmark results for this round include results for six newly available or soon-to-be-shipped processors:

  • AMD Instinct MI325X
  • Intel Xeon 6980P “Granite Rapids”
  • Google TPU Trillium (TPU v6e)
  • NVIDIA B200
  • NVIDIA Jetson AGX Thor 128
  • NVIDIA GB200

Benchmarking the state of the art in generative AI: two new tests introduced

In step with advances in the AI community, MLPerf Inference v5.0 introduces a new benchmark utilizing the Llama 3.1 405B model, setting a new bar for the scale of a generative AI inference model in a performance benchmark. Llama 3.1 405B incorporates 405 billion parameters and supports input and output lengths up to 128,000 tokens (compared to only 4,096 tokens for Llama 2 70B). The benchmark tests three separate tasks: general question-answering, math, and code generation.

“This is our most ambitious inference benchmark to-date,” said Miro Hodak, co-chair of the MLPerf Inference working group. “It reflects the industry trend toward larger models, which can increase accuracy and support a broader set of tasks. It’s a more difficult and time-consuming test, but organizations are trying to deploy real-world models of this order of magnitude. Trusted, relevant benchmark results are critical to help them make better decisions on the best way to provision them.”

The Inference v5.0 suite also adds a new twist to its existing benchmark for Llama 2 70B with an additional test that adds low-latency requirements: Llama 2 70B Interactive. Reflective of industry trends toward interactive chatbots as well as next-generation reasoning and agentic systems, the benchmark requires systems under test (SUTs) to meet more demanding system response metrics for time to first token (TTFT) and time per output token (TPOT).
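For readers unfamiliar with these two latency metrics, the sketch below shows one straightforward way to measure TTFT and TPOT around a streaming generation call; `stream_tokens` is a placeholder for whatever streaming client is in use, not an MLPerf API.

```python
# Measure time-to-first-token (TTFT) and time-per-output-token (TPOT) for one
# streamed response; stream_tokens(prompt) is assumed to yield tokens as they arrive.
import time

def measure_latency(stream_tokens, prompt: str):
    start = time.perf_counter()
    first_token_time = None
    n_tokens = 0
    for _token in stream_tokens(prompt):
        if first_token_time is None:
            first_token_time = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    if first_token_time is None:
        raise RuntimeError("no tokens were produced")
    ttft = first_token_time - start
    # TPOT is commonly reported as the average inter-token gap after the first token.
    tpot = (end - first_token_time) / max(n_tokens - 1, 1)
    return ttft, tpot
```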

“A critical measure of the performance of a query system or a chatbot is whether it feels responsive to a person interacting with it. How quickly does it start to reply to a prompt, and at what pace does it deliver its entire response?” said Mitchelle Rasquinha, MLPerf Inference working group co-chair. “By enforcing tighter requirements for responsiveness, this interactive version of the Llama 2 70B test offers new insights into the performance of LLMs in real-world scenarios.”

More information on the selection of the Llama 3.1 405B and the new Llama 2 70B Interactive benchmarks can be found in this supplemental blog.

New Graph Neural Network datacenter benchmark for modeling relationship graphs

Also new to Inference v5.0 is a datacenter benchmark that implements a graph neural network (GNN) model. GNNs are useful for modeling links and relationships between nodes in a network and are commonly used in recommendation systems, knowledge-graph answering, fraud-detection systems, and other kinds of graph-based applications.

The GNN datacenter benchmark implements the RGAT model, based on the Illinois Graph Benchmark Heterogeneous (IGBH) dataset containing 547,306,935 nodes and 5,812,005,639 edges.

More information on the construction of the RGAT benchmark can be found here.

New edge benchmark: Automotive PointPainting test for 3D object detection

The Inference v5.0 benchmark introduces a new Automotive PointPainting benchmark for edge computing devices, specifically automobiles. While the MLPerf Automotive working group continues to develop the Minimum Viable Product benchmark first announced last summer, this test provides a proxy for an important edge-computing scenario: 3D object detection in camera feeds for applications such as self-driving cars.

More information on the Automotive PointPainting benchmark can be found here.

As industry raises the bar for AI systems, MLPerf Inference benchmark follows suit

“We rarely introduce four new tests in a single update to the benchmark suite,” said Miro Hodak, “but we felt it was necessary to best serve the community. The rapid pace of advancement in machine learning and the breadth of new applications are both staggering, and stakeholders need relevant and up-to-date data to inform their decision-making.”

MLPerf Inference v5.0 includes 17,457 performance results from 23 submitting organizations: AMD, ASUSTeK, Broadcom, Cisco, CoreWeave, CTuning, Dell, FlexAI, Fujitsu, GATEOverflow, Giga Computing, Google, HPE, Intel, Krai, Lambda, Lenovo, MangoBoost, NVIDIA, Oracle, Quanta Cloud Technology, Supermicro, and Sustainable Metal Cloud.

“We would like to welcome the five first-time submitters to the Inference benchmark: CoreWeave, FlexAI, GATEOverflow, Lambda, and MangoBoost,” said David Kanter. “The continuing growth in the community of submitters is a testament to the importance of accurate and trustworthy performance metrics to the AI community. I would also like to highlight Fujitsu’s broad set of datacenter power benchmark submissions and GATEOverflow’s edge power submissions in this round, which reminds us that energy efficiency in AI systems is an increasingly critical issue in need of accurate data to guide decision-making.”

“The machine learning ecosystem continues to give the community ever greater capabilities. We are increasing the scale of AI models being trained and deployed, achieving new levels of interactive responsiveness, and deploying AI compute more broadly than ever before,” said Kanter. “We are excited to see new generations of hardware and software deliver these capabilities, and MLCommons is proud to present exciting results for a wide range of systems and several novel processors with this release of the MLPerf Inference benchmark. Our work to keep the benchmark suite current, comprehensive, and relevant at a time of rapid change is a real accomplishment, and ensures that we will continue to deliver valuable performance data to stakeholders.”

View the results

To view the results for MLPerf Inference v5.0, please visit the Datacenter and Edge benchmark results pages.

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies’ accuracy, safety, speed, and efficiency.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or email participation@mlcommons.org.

Unlocking Data Collaboration with AI-Ready Licenses
https://mlcommons.org/2025/03/unlocking-data-collab/ | Mon, 17 Mar 2025
New research shows promising signs that adopting modular, standard data license agreements can clarify key terms, reduce transaction costs, and enable more accessible data use across borders and sectors.

The rapid advancement of artificial intelligence (AI) is fueling the need for complex datasets containing various high-quality text, images, audio, video, and specialized formats. Despite the growing interest in AI and its evident social impact, stakeholders face challenges in establishing reliable frameworks for responsibly sharing AI data. A significant hurdle is the absence of widely accepted, standardized contractual frameworks (i.e., data licenses) for this purpose.

Research funded through an unrestricted grant by the MLCommons Association and recently published by the Open Data Institute (ODI) and the Pratt School of Engineering at Duke University explored the motivations driving data sharing throughout the AI ecosystem. The goal was to inform the design of more fit-for-purpose AI data licenses. A blog published by the OECD.AI/GPAI also recently featured this work.  

Using a mixed-methods approach that combined interviews, an online survey, and a review of literature, this research captured a broad spectrum of insights from across eight jurisdictions and multiple sectors covering key roles such as data holders, data users, and intermediaries. Here, we give an overview of these research findings, demonstrating how standard data license agreements hold the potential to mitigate legal uncertainties, reduce transaction costs, and promote fair AI data access, particularly when combined with standard data governance and other tools.

The following key findings are organized around four main themes: 1) leveraging new standard data licenses to promote greater data access and social good; 2) addressing existing ambiguities in current contractual approaches to data sharing; 3) the need for data licensing simplicity and for balancing standardization with some customization; and 4) developing a new standard data licensing framework to help address other data-sharing challenges, including legal compliance and ethical considerations.

[Figure: data licensing pathway steps]

1. New standard data licenses can promote greater responsible data access and AI for social good.

a) No standardized data licenses have been widely adopted, increasing data sharing challenges for under-represented and smaller organizations. The research indicates that the lack of widely adopted, standardized data licenses may create significant barriers to voluntary, legally compliant, and ethical data sharing. This gap might drive up costs and impose contractual burdens that particularly hinder smaller or under-represented organizations from fully benefiting from AI.

b) Data access disparities also impact the advancement of AI for social good. Limited cross-cultural and cross-linguistic data sharing can restrict access to high-quality datasets. The research shows that this shortfall might not only impede responsible AI development among research institutions, governments, and non-profits (especially in developing regions) but also diminish the diversity and richness of AI models.

c) Multi-stakeholder participation can help develop standard data licenses that advance equitable data access and AI for social good. According to the research, involving a diverse range of stakeholders (including representatives from different geographic regions, sectors, and groups that have traditionally faced challenges in the technology sector) appears to be essential. Such multi-stakeholder participation can lead to the development of standard licensing frameworks that more effectively address disparities and foster equitable data access for social good.

2. New standard data licenses can help address existing ambiguities and challenges with current contractual approaches to data sharing.

a) The lack of widely embraced standard data licenses leads to inconsistency and incompatibility and increases the challenges of using data in compliance with license terms. Through this research, it was confirmed that customized license agreements currently in use may lead to inconsistencies and incompatibilities, which in turn increase legal uncertainty and compliance costs. This fragmentation overall makes it more challenging for organizations to share data effectively across their ecosystems.

b) Confusion exists about the meaning of non-commercial limitations in data licenses, and clarification in new standard licenses might encourage more data-sharing transactions. This work highlights that the lack of definitions of “non-commercial” use in current licenses creates significant confusion. Clarifying this term within standard licenses could encourage more frequent and compliant data-sharing transactions.

c) New standard data licenses could clarify rights emerging from AI and provide different options for contractually allocating rights and liabilities. The research shows that commonly used existing licenses do not clearly address the allocation of rights for data usage (or for AI models and outputs developed using such data) resulting in uncertainties regarding rights and liabilities. Standard licenses that offer clearer contractual options, in line with applicable laws, could help resolve these issues.

d) New standard data licenses can help address attribution issues arising under existing agreements. Attribution requirements in widely used license frameworks can be impractical for AI applications, where data used in model training is not directly reproduced in outputs. The research therefore suggests that standardizing license terms and integrating technical solutions like automatic attribution may reduce this ambiguity and simplify data sharing.

3. New standard data licenses should embrace simplicity while balancing standardization with the need for some customization.

a) There is a pressing need for simplicity. Throughout, the research emphasized that drafting licenses in clear, easily interpretable language is critical for reducing transaction costs and building trust, especially for non-legal practitioners.

b) Standard data licensing frameworks should balance the need for standardization with some customization, given the many data types, use cases, models, tasks, and legal requirements. The research also indicates that while standardization can streamline data sharing, licenses may also need to allow for some customization to accommodate the diversity of data types, use cases, AI models, and jurisdiction-specific legal requirements across the AI data lifecycle.

c) Standard modular data license terms (and/or a menu of standard data license agreements) could balance standardization and some customization. Therefore, the research suggests that a modular framework or a menu of license options (developed through a collaborative, multi-stakeholder process) could effectively balance the need for uniformity with the necessity for customization to address varied stakeholder needs.

d) Standard definitions can advance data licensing. The research points to the potential benefits of developing and adopting standard definitions for key terms. Such definitions can further simplify license agreements and help maintain an effective balance between standardization and necessary customization.

4. A new standard data licensing framework may help address other data-sharing challenges, including legal compliance, data and AI governance, and ethical considerations.

a) Standard data license agreements could help address other challenges in implementing voluntary, legally compliant data sharing, including by helping advance data and AI governance. Technical tools for tracking data provenance and lineage, and for easily processing the terms of standard licenses, can enhance transparency and reduce compliance costs. This, the research shows, can support stronger data and AI governance practices. For instance, MLCommons’ Croissant initiative (a metadata format for ML datasets) simplifies discovery and integration by embedding legal and compliance measures into the AI pipeline; representing legal and contractual frameworks as metadata can help reduce uncertainty in data-sharing exchanges. Lastly, complementary tools, such as automatic attribution services in Apache Atlas, can further ensure that license-defined rights are easily represented and therefore upheld.

b) Standard data licenses may help address cross-border and other compliance challenges. The diversity of data-related regulations across jurisdictions (e.g., data protection) continues to create substantial barriers to data sharing. To ease compliance, standard data license terms should be prepared with these legal matters in mind. 

c) Standard data licenses may help advance ethical guidelines for data usage. The research confirmed interest in having ethical restrictions on data and AI model usage in at least some scenarios. A new standard data licensing framework could reference relevant ethical codes of conduct. Some concern was expressed about incorporating codes of conduct or standards into standard data licenses, since these could change over time, increasing uncertainty in the contractual terms and the costs of compliance. These factors should be considered in the global, multi-stakeholder process developed to prepare standard licenses.

Looking forward

Overall, this work shows promising signs that adopting modular, standard data license agreements can clarify key terms, reduce transaction costs, and enable more accessible data use across borders and sectors. Additionally, it demonstrates that integrating external technical tools and frameworks like Apache Atlas and Croissant with these legal frameworks enhances data discoverability and usability. Together, these measures can help to build a transparent, interoperable ecosystem for responsible AI development. Here, we have the opportunity to shape an AI-driven future that is equitable, sustainable, and indeed serves everyone.

MedPerf enhances transparency and integrity of real-world benchmarks using smart contracts
https://mlcommons.org/2025/03/medperf-smart-contracts/ | Mon, 10 Mar 2025
Protection of digital assets on MedPerf via policy enforcement using Private Data Objects (PDO)

When the MLCommons Medical AI working group set out to develop MedPerf, an open framework for benchmarking medical AI on real-world private datasets, transparency and privacy were the main technical objectives. The current implementation of MedPerf ensures this transparency and integrity through a formal benchmark committee: a governance body that oversees the entire benchmark process, from committee formation to reference implementation development, dissemination of and access to benchmark results, and execution of the benchmark. Critical to MedPerf is its federated evaluation approach, which delivers a high level of data privacy by enforcing that no medical data is transferred to perform the benchmarks.

[Figure 1: Current MedPerf benchmark workflow. The red circle shows the current policy implementation.]

The working group, in collaboration with technical member organization Intel and academic partner the University of Notre Dame, is improving MedPerf with a new policy-enhanced workflow enforced by smart contracts. A smart contract is delivered via a trusted execution environment, such as an Intel Software Guard Extensions (SGX) enclave, to ensure stronger transparency and privacy. The new workflow protects digital assets by combining a confidential smart contract framework with private data objects, binding data policy enforcement to policy evaluation and isolated data task executions. This new approach, proven through a proof of concept (POC), delivers a higher level of data transparency, accountability, and traceability.

The importance of policies to protect digital assets 

Digital asset usage policies are of the utmost importance for protecting intellectual property (IP) such as data and AI models. Unfortunately, most existing solutions require copies of digital assets with access controls when sharing, and rely on centralized governance for policy enforcement. In reality, this may actually reduce transparency and increase the risk of intellectual property theft or leakage.

This risk becomes more problematic in the healthcare space where data and AI models may contain personally identifiable information as well as very expensive IP. Transparency for medical data is critical to maintain traceability and accountability of various processes such as training, validation, and benchmarking. Any lack of transparency could be further magnified in decentralized settings such as data federations where assets are hosted and transferred in a decentralized manner. 

Trusted execution environment-based confidential smart contract frameworks with private data objects have proven to help with policy enforcement by binding policy evaluation and isolated task executions to protect digital assets. Frameworks such as these can facilitate a high level of transparency, accountability, and traceability, but to date they have been largely unexplored in the medical AI space.

MedPerf smart contracts act as enforcement vehicles for healthcare dataset use policies

Current use policies for MedPerf datasets are implicitly hardcoded in the benchmark workflow. This means a data owner must execute any requested benchmark on their local data. This approach offers users full control of their data assets, but it lacks flexibility for customization and scalability, such as decentralized hosting or handling an increasing number of requests.

To enhance policies and provide broader flexibility and scalability in MedPerf, the working group built a trusted-execution-environment POC using a confidential smart contract framework. This approach delivers policy enforcement via a private data object by binding policy evaluation and isolated task executions to protect digital assets. 

To initiate a smart contract, a MedPerf data owner creates a private data object contract that is linked to their original dataset through the dataset ID. The contract contains descriptive information about the dataset and acts as an interface to it. The data owner then securely deposits the dataset inside a digital-guardian-service trusted-execution-environment enclave, which can be hosted on premises or in the cloud. When a benchmark of a model on the dataset is requested, the contract and the digital guardian perform bi-directional attestation to ensure each other’s integrity. This approach retains MedPerf’s federated evaluation: the dataset never leaves the secure domain of the enclave, and its use is bounded by policy enforcement.
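A highly simplified sketch of this flow is shown below. The class and method names are hypothetical stand-ins for the contract and digital guardian roles described above; they are not the MedPerf, PDO, or SGX APIs.

```python
# Illustrative sequence only: attestation, policy evaluation, then in-enclave execution.
from dataclasses import dataclass

@dataclass
class DatasetContract:
    """Private-data-object contract fronting a dataset held inside an enclave."""
    dataset_id: str
    description: str
    allowed_benchmarks: set   # usage policy embedded in the contract

    def authorize(self, benchmark_id: str, guardian) -> bool:
        # 1. Bi-directional attestation between the contract and the digital guardian.
        if not (guardian.attest(self) and guardian.enclave_measurement is not None):
            return False
        # 2. Policy evaluation: is this benchmark permitted on this dataset?
        return benchmark_id in self.allowed_benchmarks

class DigitalGuardian:
    """Stand-in for the trusted-execution-environment enclave holding the data."""
    def __init__(self, enclave_measurement: str):
        self.enclave_measurement = enclave_measurement

    def attest(self, contract: DatasetContract) -> bool:
        # Placeholder: a real deployment would verify cryptographic quotes here.
        return contract.dataset_id is not None

    def run_benchmark(self, contract: DatasetContract, benchmark_id: str, model_ref: str):
        if not contract.authorize(benchmark_id, self):
            raise PermissionError("benchmark not permitted by the dataset's policy")
        # Evaluation executes inside the enclave; only metrics leave it.
        return {"benchmark": benchmark_id, "model": model_ref, "metrics": "..."}

# Example: the request is served only if the contract's embedded policy allows it.
guardian = DigitalGuardian(enclave_measurement="sgx-quote-placeholder")
contract = DatasetContract("dset-001", "example imaging dataset", {"bench-042"})
print(guardian.run_benchmark(contract, "bench-042", model_ref="model-xyz"))
```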

Figure 2. Policy-enhanced benchmark workflow. Red circle shows smart-contract policy implementation.
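
To make the workflow concrete, here is a minimal, hypothetical Python sketch of a policy-gated benchmark request. The class names, policy fields, and attestation checks are illustrative assumptions rather than the actual MedPerf or POC interfaces; a real deployment would rely on an SGX-backed confidential smart contract framework and remote attestation rather than plain Python objects.

```python
# Hypothetical sketch only: class names, policy fields, and attestation checks
# are illustrative, not the actual MedPerf or POC interfaces.
from dataclasses import dataclass

TRUSTED_MEASUREMENTS = {"sha256:enclave-build-1234"}  # placeholder value


@dataclass
class DataUsePolicy:
    """Dataset-use policy evaluated by the smart contract."""
    allowed_benchmark_ids: set
    max_requests: int
    requests_served: int = 0

    def permits(self, benchmark_id: str) -> bool:
        return (benchmark_id in self.allowed_benchmark_ids
                and self.requests_served < self.max_requests)


class DigitalGuardian:
    """Stands in for the digital guardian service hosting the TEE enclave."""
    enclave_measurement = "sha256:enclave-build-1234"

    def attest(self, contract: "PrivateDataObjectContract") -> bool:
        # Placeholder integrity check of the contract.
        return contract.dataset_id.startswith("medperf-")

    def run_in_enclave(self, dataset_id: str, benchmark_id: str) -> dict:
        # The dataset never leaves the enclave; only metrics are returned.
        return {"dataset": dataset_id, "benchmark": benchmark_id, "metrics": {}}


@dataclass
class PrivateDataObjectContract:
    """Contract linked to the original dataset through its dataset ID."""
    dataset_id: str
    description: str
    policy: DataUsePolicy

    def attest(self, guardian: DigitalGuardian) -> bool:
        return guardian.enclave_measurement in TRUSTED_MEASUREMENTS

    def request_benchmark(self, benchmark_id: str, guardian: DigitalGuardian) -> dict:
        # Bi-directional attestation: contract and guardian verify each other.
        if not (guardian.attest(self) and self.attest(guardian)):
            raise PermissionError("attestation failed")
        if not self.policy.permits(benchmark_id):
            raise PermissionError("policy denies this benchmark request")
        self.policy.requests_served += 1
        # Execution is isolated inside the enclave; only results leave it.
        return guardian.run_in_enclave(self.dataset_id, benchmark_id)


contract = PrivateDataObjectContract(
    dataset_id="medperf-0042",
    description="placeholder dataset registered with MedPerf",
    policy=DataUsePolicy(allowed_benchmark_ids={"bench-001"}, max_requests=10),
)
print(contract.request_benchmark("bench-001", DigitalGuardian()))
```

The key point the sketch mirrors is that the contract, not the requester, decides whether a benchmark may run, and the dataset itself is only ever touched inside the guardian’s enclave.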

Enhanced benchmark workflow benefits

The new policy-enhanced workflow gives more power to owners of digital assets, such as healthcare datasets and models, in benchmarking workflows. The POC validated the use of embedded smart contract policies in MedPerf and confirmed that digital asset owners retain full control of their assets.

This new approach increases the utility and potential rewards of benchmarks by supporting integrity and enabling transparency throughout the execution cycle.

To learn more about how the policy is implemented, read the Technical Update.

The Medical AI working group calls on the broad community of academics, technologists, regulatory scientists, healthcare experts, and others to help with roadmapping and to contribute to impactful benchmarking technologies by joining the MLCommons Medical AI working group.

The post MedPerf enhances transparency and integrity of real-world benchmarks using smart contracts appeared first on MLCommons.

]]>
MLCommons Power Working Group Presents MLPerf Power benchmark at IEEE HPCA Symposium https://mlcommons.org/2025/03/ml-commons-power-hpca/ Tue, 04 Mar 2025 23:15:00 +0000 https://mlcommons.org/?p=2546 MLPerf Power Benchmark focuses on measuring energy efficiency along with power consumption

The post MLCommons Power Working Group Presents MLPerf Power benchmark at IEEE HPCA Symposium appeared first on MLCommons.

]]>
This week, members of the MLCommons® Power working group are presenting a paper at the 2025 IEEE International Symposium on High-Performance Computer Architecture (HPCA) conference. The paper presents MLPerf® Power, a framework for benchmarking the energy efficiency of AI systems.

MLPerf Power is designed to measure energy efficiency across diverse applications and deployments of machine learning technologies, “from microwatts to megawatts.” That includes datacenters, edge deployments, mobile devices, and tiny “Internet of Things” implementations. It also includes a range of applications covering both training and inference.

“We cannot improve what we do not measure,” said Arun Tejusve (Tejus) Raghunath Rajan (Meta), Co-chair of the MLCommons Power working group. “Across the board, MLPerf has driven performance improvements for 6+ years. Our goal is to do the same for MLPerf Power. By providing a way to consistently measure and report power, we aim to move the needle of energy efficiency across the industry in a positive and sustainable manner.”

Fig. 1. The power consumption range across MLPerf divisions, highlighting the need for scalable power measurement.

Requirements for benchmarking energy efficiency

The MLPerf Power framework defines eleven requirements for benchmarking and measuring the energy efficiency of Machine Learning (ML) systems under three broad themes:

  1. Power measurements: ensuring that measurements are compatible with a wide range and scale of platforms and configurations, account for system-level interactions and shared resources, and where feasible, measure physical power with an appropriate analyzer.
  2. Diverse ML benchmarking: defining consistent rules and reporting guidelines, using industry-standard workloads, and collecting data on power consumption and performance at different throughputs and ML model accuracy/quality targets.
  3. Evaluation, trends, and impact: communicating power consumption and energy efficiency trends and optimizations, providing case studies relevant to real-world scenarios, and being driven and audited by the industry.

The MLPerf Power framework addresses several myths and pitfalls related to measuring the power usage of AI systems. One is that measuring power consumption alone is sufficient. Instead, the benchmark focuses on measuring power efficiency: system performance per unit of power. 
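
As a rough illustration of that metric, the sketch below computes performance per unit of power for two hypothetical systems; the throughput and power figures are invented for illustration and are not MLPerf Power results.

```python
# Illustrative only: the throughput and power numbers below are made up,
# not MLPerf Power results.

def energy_efficiency(throughput_samples_per_s: float, avg_power_w: float) -> float:
    """Work done per joule of energy (samples/J); higher is better."""
    return throughput_samples_per_s / avg_power_w

system_a = energy_efficiency(throughput_samples_per_s=12_000, avg_power_w=3_000)
system_b = energy_efficiency(throughput_samples_per_s=15_000, avg_power_w=4_500)

print(f"System A: {system_a:.2f} samples/J")  # 4.00 samples/J
print(f"System B: {system_b:.2f} samples/J")  # 3.33 samples/J
# System B has higher raw throughput, but System A does more work per joule,
# which is what the efficiency metric is meant to surface.
```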

Another myth is that isolating and measuring just the ML components is sufficient. While they do make up a large portion of the power consumption, fully understanding AI system power efficiency requires measuring all the system components, including compute, memory/storage, the interconnect, and the cooling system. This requires a careful, comprehensive approach since different components vary their level of activity at different phases of an AI workload.
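
The sketch below makes that point under simplified assumptions: hypothetical per-component power traces (not benchmark data) are summed into system-level energy, showing that the non-accelerator components contribute a meaningful share.

```python
# Hypothetical per-component power traces (watts); not benchmark data.
import numpy as np

SAMPLE_PERIOD_S = 1.0  # one reading per second per component

traces_w = {
    "accelerators": np.array([2200.0, 2450.0, 2400.0, 1800.0]),
    "cpu_memory":   np.array([350.0, 380.0, 375.0, 320.0]),
    "interconnect": np.array([60.0, 140.0, 150.0, 55.0]),
    "cooling":      np.array([200.0, 260.0, 265.0, 230.0]),
}

system_power_w = sum(traces_w.values())  # element-wise total at each sample
system_energy_j = float(np.sum(system_power_w)) * SAMPLE_PERIOD_S
accelerator_share = float(np.sum(traces_w["accelerators"]) / np.sum(system_power_w))

print(f"total system energy: {system_energy_j / 1e3:.1f} kJ")   # ~11.6 kJ
print(f"accelerator share of energy: {accelerator_share:.0%}")  # ~76%
# Measuring only the accelerators would miss roughly a quarter of the energy
# consumed here, and that share shifts across phases of the workload.
```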

MLPerf Power benchmark submissions provide early and impactful insights

MLCommons has received 1,841 MLPerf Power benchmark submissions to date, spanning several years and four workload versions. The benchmark results are already providing useful insights that inform the design of future AI systems. 

One important insight relates to scaling up training systems and the complex relationship between system scale, training time, and energy consumption. Specifically, scaling up the number of accelerators reduces absolute training time, but the amount of energy consumed increases at a non-linear rate. This is due to two factors: the increased communication requirements between a larger set of accelerators, and the decreased utilization of each accelerator. This points to the need to design AI systems with an understanding of these tradeoffs to achieve the necessary balance between scale, speed, and energy consumption.

“In the AI era, where scaling laws are driving computational demands, energy efficiency is a fundamental imperative for sustainable and responsible technological advancement,” noted Sachin Idgunji (NVIDIA), founding Co-Chair of the MLCommons Power working group. “Benchmarks such as MLPerf Power give AI system builders the information they need to help them achieve their design goals.”

Fig. 2. Energy consumption and time-to-train for Llama2-70b LoRA fine-tuning across different numbers of accelerators.

Another insight derived from the MLPerf Power results involves the relationship and trade-off between AI model accuracy and energy efficiency. In earlier versions of the benchmark, results showed that organizations were sacrificing energy efficiency – by up to 50% – to increase inference accuracy from 99% to 99.9%. We believe that these benchmark results compelled organizations to deploy significant optimization techniques. In more recent benchmark versions, we now see that reduced precision techniques such as quantization have significantly narrowed the gap in energy efficiency between inference systems running at 99% and 99.9% accuracy. Moving forward, it’s expected that these techniques will continue to be a driving force in designing energy efficient AI systems.

The Path to Environmentally Sustainable AI Systems

Establishing an industry-wide standard for ML system power measurement is a critical step toward creating a set of metrics for what qualifies as “environmentally sustainable AI”.

“That’s our north star,” said Vijay Reddi, Professor at Harvard University and Vice President of MLCommons.  “Our mission is to enable building sustainable AI systems, but to do so we need to define metrics that we can use to quantify energy efficiency – and ultimately sustainability. MLCommons is an engineering organization; we build things. We’ve built the framework, the metrics, the benchmark, and the benchmarking process to get us the data we need to understand, quantify, and ultimately build sustainable AI systems, and we’re continuing to iterate and refine it to keep pace with technology.”

The HPCA paper makes several recommendations, including developing tools that can estimate the carbon footprint of an AI system. Doing so requires measuring the system’s energy efficiency, but it must also be contextualized in several other factors, including the nature of the workload as well as the carbon emissions of the system’s power source.

In the short term, the team’s most critical need is for more submissions to the MLPerf Power benchmark. “The roughly 1,800 submissions we have now is enough to deliver a first set of qualitative insights,” said Arya Tschand, PhD student at Harvard University who led the work, “but if we can increase that by an order of magnitude we can extrapolate the data further, identify and quantify additional trends, and make more specific and actionable design recommendations that are grounded in hard data. That benefits both the industry and the environment: higher energy efficiency saves AI companies money by reducing their substantial power bills, and it also reduces the demand for both existing and new power generation.”

Visit the MLPerf Power working group page to join the working group and learn more, including how to submit to the MLPerf Power Benchmark. 

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies’ accuracy, safety, speed, and efficiency.

The post MLCommons Power Working Group Presents MLPerf Power benchmark at IEEE HPCA Symposium appeared first on MLCommons.

]]>
Croissant Gains Momentum within the Data Community https://mlcommons.org/2025/02/croissant-qa-community/ Wed, 12 Feb 2025 14:44:00 +0000 https://mlcommons.org/?p=2435 Croissant working group co-chairs share insights on the success of Croissant, what’s next, and how to join the effort.

The post Croissant Gains Momentum within the Data Community appeared first on MLCommons.

]]>
Since its introduction in March 2024, the Croissant metadata format for ML-ready datasets has quickly gained momentum within the data community. It was recently featured at the NeurIPS conference in December 2024. Croissant is supported by major ML dataset repositories, such as Kaggle and HuggingFace, and over 700,000 Croissant-described datasets are accessible and searchable via Google Dataset Search. Croissant datasets can be loaded automatically to train ML models using popular toolkits such as JAX and PyTorch. 
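
As a rough illustration, loading a Croissant-described dataset programmatically can look like the sketch below, which uses the open-source mlcroissant Python package; the URL and record-set name are placeholders, and the exact API may vary across package versions.

```python
# Sketch of loading a Croissant-described dataset. The URL and record-set name
# are placeholders, and the mlcroissant API may differ between versions.
import mlcroissant as mlc

# Point at a dataset's Croissant JSON-LD description (hosted by a repository
# such as Kaggle or HuggingFace, or stored locally).
dataset = mlc.Dataset(jsonld="https://example.org/my-dataset/croissant.json")

# Iterate over one of the dataset's record sets; each record is a dict of fields.
for i, record in enumerate(dataset.records(record_set="default")):
    print(record)
    if i >= 4:  # peek at the first few records only
        break
```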

MLCommons’ Croissant working group co-chairs, Omar Benjelloun and Elena Simperl, sat down to share their thoughts on Croissant’s initial popularity, the approach the working group is taking to develop it, and next steps.

[MLC] Why do you think Croissant has become so popular?

[ES] We’re delighted to see it being discussed in many contexts, beyond AI and machine learning research. I think we’re solving a problem that is easy to recognize, a problem that many practitioners who work with machine learning have, but one whose solution requires a range of skills that traditionally haven’t been that important in that community. Those techniques have to do with how you structure and organize data – data used by machine learning systems or any other type of data – in a way that makes it easy for that data to be shared across applications, to be found and reused. That’s a topic that Omar and I and other people in the working group know a lot about. And with Croissant, we are applying these techniques, which have been around for 10 to 15 years, to a new use case: machine learning data work. And I suppose the magic happened when we brought together all the relevant expertise in one working group to deliver something that ticks lots of boxes and heals many pains.

[MLC] What was the approach you took to developing Croissant?

[OB] We took an approach that was very concrete. A lot of this metadata standards work happens in a way where some experts get in a working group and they create what they think is the perfect standard. And then they try to convince others to adopt it and use it in their products.

Instead, from the start we said, “Okay, let’s define the standards at the same time and in close collaboration with the tools and systems that are going to use it.” So we worked with the repositories of machine learning datasets, like Kaggle and Hugging Face and OpenML initially, to implement support for this format as we were defining it, as well as to participate in the definition of the format. That’s on the “supply side” of datasets. On the “demand side,” we worked with the teams that develop tools for loading datasets for machine learning to implement support for loading Croissant datasets. And they were also part of the definition of the format.

So getting these people to work together helped us to create something which was not, “Here’s a specification. We think it’s going to make things better. Now, please go and implement this so that we have some data on one side and some tools on the other side.” Instead we could say, “Here’s the specification and here are some datasets that are already available in this format because these repositories support it. And here are some tools that you can use to work with these datasets.” And of course, it’s just the beginning. There are more repositories and datasets that should be available in this format. And there are a lot more tools that should add support for Croissant so that ML developers have what they need to work with this data. But the initial release of Croissant was a strong start with a useful, concrete proposition. And I think this is part of what made it so successful.

[MLC] Do you consider Croissant a “standard” now?

[ES] Personally, I have mixed feelings about the word “standard.” Sometimes when I hear someone talking about standards, I think about those lengthy, tedious, and to some degree inaccessible processes that are run in standardization committees, where there is a lot of work that is put into coming to a proposal, with lots of different stakeholders, with many competing interests to agree upon. And there’s hundreds and hundreds and hundreds of pages explaining what someone needs to do to use the standard in the intended way. And then there’s some implementations. And then if you’re an academic researcher in a small lab or a startup who doesn’t have the resources to be part of this exercise, then you feel left out. 

Having said that, standards, once they are agreed, and if there is critical mass in adoption, can be very, very powerful, because they do reduce those frictions of doing business. So if I am a startup that is creating a new type of machine learning functionality, for example to manage workflows or to do some sort of fair machine learning or any specialist functionality that the community needs, then having a standard to be able to process data that has been created and processed using any other machine learning tool is hugely valuable. But I love what Omar said: we’re following a slightly different approach. We really want to release everything to the community. We want to build everything with that adoption pathway in mind so that we put the results in the hands of practitioners as quickly as possible and in the most accessible way possible.

The other thing that I would say is that even if a competing proposal emerges, the way in which the Croissant vocabulary – and the technology it is built with – is designed means that it would be quite easy to use both. There are many vocabularies that describe datasets, and something like Google Dataset Search can work with all of them because of the technologies that are used to do that. But of course, if you just look at the number of datasets that use this or any other machine-readable vocabulary that may share similar aims, at the moment we seem to have an advantage. And the interest in adopting Croissant among more and more platforms is still there, almost a year after the launch of the initial version.

[OB] We released the first version of the Croissant format, but I don’t consider it to be standard yet.  I prefer to say  “community-driven specification”. And we’re not afraid to say Croissant will evolve and continue to change. Once we have a critical mass of data platforms and tools that have adopted it and use it and make their users happy thanks to Croissant, then it will feel sort of stable and we can say, “we don’t see this changing much from here.” At that point declaring it a standard makes sense. And of course, all the companies, startups and university labs that build tools and participate in the community will have contributed to that standard through their work.

[MLC] How important is the support from industry?

[ES] Speaking as someone who is in academia, I think industry support is very important. I am personally very happy with the diversity of industry participants in the group. We have OpenML, which is an open source initiative that is driven among others by academics, but it is a project with lots of contributors. We have obviously Google, we have participants from Meta as well, so household names in the AI space. Then we have Hugging Face and Kaggle, which have huge communities behind them. And we have NASA as well, which brings in a very interesting use case around geospatial datasets and their use in machine learning. Going forward I’d love for the working group to attract more industries that work beyond tech, that perhaps have their own machine learning use cases and their own enterprise data that they want to prepare for machine learning usage. And in that sort of scenario, if you think about a big company and the types of data management problems they have, having something like Croissant would be very valuable as well.

We have a good initial proposal for what Croissant can be, how it should be. But equally, we also have a very long list of things that we need to work on this year to get to that stable version that you could perhaps call a standard. And once you reach that, then you can actually make a case to any enterprise to explore this internally.

[MLC] What are your goals for expanding participation in the working group and in finding new partners?

[OB] First of all, this is an open group, and everyone is welcome! We would love to see more companies involved that develop tools that are used by machine learning practitioners. This speaks to the promise of Croissant: that once all datasets are described and edited in the same format, then if you have a tool that understands that format, that tool can work with any of those datasets. So if tools that provide functionality around loading of data for training ML models or fine tuning or evaluation of models, or data labeling or data analysis for machine learning, if these tools look at Croissant and consider supporting it, which does not require a huge amount of effort because it’s not a very complex format, then that will really make Croissant a lot more useful to machine learning practitioners. A goal for us this year is to get more tool vendors to come and join the Croissant effort and add support for it. Any vendor, no matter what size, that does data work, AI data work, whether that’s data augmentation, data collection, maybe responsible data collection, or data labeling or data debiasing, any type of data related activity that would make it more likely for a dataset to be ready to be used for machine learning, would add a lot of value to what we do right now. And it will also bring in new perspectives and new pathways to adoption. 

[ES] And would using Croissant be valuable for them? Absolutely. I think it was a researcher from Google who said, “No one wants to do the data work in AI,” a few years ago, and it still rings true. The vendors of data-related products solve some data challenges, but at the same time they also struggle with the fact that many people still think, “I’m just going to find the data somehow, and I’m just going to try to get to training my model as quickly as possible.” Croissant and these vendors are saying the same thing: data is important, we need to pay attention to it. We have very similar aims, and we’re providing these developers with a data portability solution. We’re helping them to interoperate with a range of other tools in the ecosystem. And that’s an advantage, especially when you’re starting out. 

[MLC] Have there been any surprises for you in how the community has been using Croissant?

[ES] We see a lot of excitement around use cases that we hadn’t really thought about so much. For instance, anything related to responsible AI, including AI safety, that’s a huge area that is looking for a whole range of solutions, some of them technical, some of them socio-technical. We’ve seen a lot of interest to use Croissant for many scenarios where there is a need to operationalize responsible AI practices. And to some degree, the challenge that we’re facing is to decide which of those use cases are really part of the core roadmap, and which ones are best served by communities that have that sectorial expertise.

We’ve also seen a lot of interest from the open science community. They’ve been working on practices and solutions for scientists to share not just their data, but their workflows, in other words, the protocols, experiment designs and other ways to do science. Scientific datasets are now increasingly used with machine learning, so we’ve had lots of interest and there are conversations on how to take advantage of those decades of experience that they bring in.

[MLC] How was Croissant received at the NeurIPS 2024 conference?

[OB] We presented a poster at NeurIPS that opened a lot of conversations. We also conducted what turned out to be perhaps the most interesting community engagement activity all year for us. We asked authors in the conference’s Datasets and Benchmarks track to submit their datasets with a Croissant metadata description. It was not a strict requirement; it was more of a recommended thing for them to do. I think this was great for Croissant as a budding metadata format to get that exposure and the real-life testing. We learned a lot from that experience about the details in our spec that were not clear or things that were not explained as well as they should have been. And we also got real-world feedback on the tools that we built, including an initial metadata editor with which users can create the Croissant metadata for their datasets. It works in some cases, but we also knew that it was not feature-complete. So, users ran into issues using it. For instance, researchers often create bespoke datasets for their own research problems. And they asked us, “I have my custom binary format that I created for this; how do I write Croissant for it? It’s not CSV or images or text.” And so it pushed us into facing some of Croissant’s limits. This experience also taught us the importance of good creation tools for users to prepare their datasets, because nobody wants to write by hand some JSON file that describes their dataset. I think for us it’s a strong incentive, as a community, to invest in building good editing tools that can also connect to all the platforms like Hugging Face or Kaggle so that users can publish their datasets on those platforms. So this is something that we learned from this interaction – and at the next conference, the experience will be much smoother for the authors who contribute to this.
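
For readers unfamiliar with the format, a hand-written Croissant description is, very roughly, a JSON-LD document along the lines of the abbreviated sketch below, expressed here as a Python dict. It is illustrative only: it omits required pieces such as the full @context and conformance declaration, so consult the Croissant specification or use the Croissant Editor for real datasets.

```python
# Abbreviated, illustrative approximation of a Croissant description. It omits
# required pieces such as the full @context; use the specification or the
# Croissant Editor for real datasets.
import json

croissant_like = {
    "@type": "sc:Dataset",
    "name": "toy_sentiment_reviews",  # placeholder dataset
    "description": "Short product reviews with sentiment labels.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/datasets/toy_sentiment_reviews",
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "reviews.csv",
            "contentUrl": "https://example.org/datasets/reviews.csv",
            "encodingFormat": "text/csv",
        }
    ],
}

with open("croissant.json", "w") as f:
    json.dump(croissant_like, f, indent=2)
```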

[MLC] What does the roadmap for Croissant look like moving forward?

[ES] In general, we are working on extending the metadata format so that we can capture more information. There are a few features that we’ve heard over and over again would be very important. For instance, anything related to the provenance of the data. We’re also working on tools, as Omar mentioned, and that includes technical tools like editors, but also tutorials and accessible material. 

In terms of what I’d like to see personally, as I was saying earlier, we’d love to have more contributors who are working in companies or on some sort of data work. For instance, data labeling. And a topic that we haven’t really touched on is the quality of that metadata. Croissant can solve many problems, but that assumes that the metadata is going to be of good quality, detailed, complete. That will almost certainly require some manual effort: from people who publish the data and want to describe it, maybe even from people who use the data and have made experiences with it. So, we’d love to have more datasets featuring excellent Croissant metadata. That will require a campaign to involve many users of, say, the 100 most popular machine learning datasets to come together and join efforts and create those excellent Croissant records.

[OB] We also have a new effort to describe not just datasets, but also machine learning tasks that use those datasets, so that we go to the next level of interoperability and describe things like machine learning evaluation benchmarks, or even competitions and things like that, and how they relate to datasets. That’s also a very exciting effort that is happening in the working group.

[ES] It’s quite an ambitious agenda – we’ll be busy this year and we invite others to join the working group to continue to shape the future of Croissant!

We invite others to join the Croissant Working Group, contribute to the GitHub repository, and stay informed about the latest updates. You can download the Croissant Editor to implement the Croissant vocabulary on your existing datasets today! Together, we can reduce the data development burden and enable a richer ecosystem of AI and ML research and development.

The post Croissant Gains Momentum within the Data Community appeared first on MLCommons.

]]>
MLCommons Releases AILuminate LLM v1.1, Adding French Language Capabilities to Industry-Leading AI Safety Benchmark https://mlcommons.org/2025/02/ailumiate-v1-1-fr/ Tue, 11 Feb 2025 05:00:00 +0000 https://mlcommons.org/?p=2412 Announcement comes as AI experts gather at the Paris AI Action Summit, is the first of several AILuminate updates to be released in 2025

The post MLCommons Releases AILuminate LLM v1.1, Adding French Language Capabilities to Industry-Leading AI Safety Benchmark appeared first on MLCommons.

]]>
Paris, France – February 11, 2025. MLCommons, in partnership with the AI Verify Foundation, today released v1.1 of AILuminate, incorporating new French language capabilities into its first-of-its-kind AI safety benchmark. The new update – which was announced at the Paris AI Action Summit – marks the next step towards a global standard for AI safety and comes as AI purchasers across the globe seek to evaluate and limit product risk in an emerging regulatory landscape. Like its v1.0 predecessor, the French-language v1.1 benchmark was developed collaboratively by AI researchers and industry experts, ensuring a trusted, rigorous analysis of chatbot risk that can be immediately incorporated into company decision-making.

“Companies around the world are increasingly incorporating AI in their products, but they have no common, trusted means of comparing model risk,” said Rebecca Weiss, Executive Director of MLCommons. “By expanding AILuminate’s language capabilities, we are ensuring that global AI developers and purchasers have access to the type of independent, rigorous benchmarking proven to reduce product risk and increase industry safety.”

Like the English v1.0, the French v1.1 version of AILuminate assesses LLM responses to over 24,000 French-language test prompts across twelve hazard categories – including violent crime, hate, and privacy. Unlike many peer benchmarks, none of the LLMs evaluated are given advance access to specific evaluation prompts or to the evaluator model. This ensures a methodological rigor uncommon in standard academic research and an empirical analysis that can be trusted by industry and academia alike.

“Building safe and reliable AI is a global problem – and we all have an interest in coordinating on our approach,” said Peter Mattson, Founder and President of MLCommons. “Today’s release marks our commitment to championing a solution to AI safety that’s global by design and is a first step toward evaluating safety concerns across diverse languages, cultures, and value systems.”

The AILuminate benchmark was developed by the MLCommons AI Risk and Reliability working group, a team of leading AI researchers from institutions including Stanford University, Columbia University, and TU Eindhoven, civil society representatives, and technical experts from Google, Intel, NVIDIA, Microsoft, Qualcomm Technologies, Inc., and other industry giants committed to a standardized approach to AI safety. Cognizant that AI safety requires a coordinated global approach, MLCommons also collaborated with international organizations such as the AI Verify Foundation to design the AILuminate benchmark.

“MLCommons’ work in pushing the industry toward a global safety standard is more important now than ever,” said Nicolas Miailhe, Founder and CEO of PRISM Eval. “PRISM is proud to support this work with our latest Behavior Elicitation Technology (BET), and we look forward to continuing to collaborate on this important trust-building effort – in France and beyond.”

Currently available in English and French, AILuminate will be made available in Chinese and Hindi later this year. For more information on MLCommons and the AILuminate Benchmark, please visit mlcommons.org.

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued to use collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve the accuracy, safety, speed, and efficiency of AI technologies.

The post MLCommons Releases AILuminate LLM v1.1, Adding French Language Capabilities to Industry-Leading AI Safety Benchmark appeared first on MLCommons.

]]>
Unsupervised People’s Speech: A Massive Multilingual Audio Dataset https://mlcommons.org/2025/01/new-unsupervised-peoples-speech/ Thu, 30 Jan 2025 21:00:00 +0000 https://mlcommons.org/?p=2200 The MLCommons Datasets working group is pleased to announce the release of the Unsupervised People’s Speech dataset. Built in collaboration with HuggingFace, the Unsupervised People’s Speech dataset contains over 1 million hours of audio spanning dozens of languages.

The post Unsupervised People’s Speech: A Massive Multilingual Audio Dataset appeared first on MLCommons.

]]>
Introducing the Unsupervised People’s Speech dataset

The MLCommons Datasets working group is pleased to announce the release of the Unsupervised People’s Speech dataset. Built in collaboration with HuggingFace, the Unsupervised People’s Speech dataset contains over 1 million hours of audio spanning dozens of languages. Given the impact the previously released Supervised People’s Speech dataset had on models such as Whisper, we expect this new version to drive innovations across different tasks, including self-supervised implementations as well as improvements to automatic speech recognition pipelines in numerous languages. 

The Rationale Behind the Dataset

As the field of speech technology continues to advance, the need for large, diverse audio datasets becomes ever more pressing. The Unsupervised People’s Speech dataset aims to address this need by providing a vast collection of multilingual audio data to support research and development in various areas of speech technology. Supporting broader Natural Language Processing (NLP) research for languages other than English helps bring communication technologies to more people globally, including those speaking low-resource languages.

Ensuring Useful Data

To ensure the dataset is as useful as possible, the MLCommons Datasets working group ran several data pipelines to understand its contents across dimensions such as language distribution and speech detection.

  • Speech detection: the working group created a custom data loader that resampled and converted the audio to a single channel (available in the HuggingFace dataset card), then ran Silero’s Voice Activity Detection (VAD) pipeline, a model that uses a multi-head attention mechanism with short-time Fourier transform (STFT) features; a minimal sketch of this kind of VAD pass follows this list. This pipeline detected a total of more than 821,412 hours of speech. 
  • Language identification: language identification is often the first task in a series of NLP tasks, so it’s crucial to get it right. The working group used Nvidia’s TensorRT-LLM implementation of Whisper Large v3 to run inference on the subset of the dataset in which the speech detection pipeline found speech utterances. This pipeline identified a total of 89 languages. We are certain there are more: the third most common category the model inferred was a “no speech” tag, meaning that for many utterances the model was unable to determine the language.
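
A minimal sketch of the kind of voice-activity pass described above is shown below; it uses Silero VAD’s published torch.hub entry point, the audio path is a placeholder, and it is not the working group’s production loader.

```python
# Minimal VAD sketch using Silero VAD's torch.hub entry point. The audio path
# is a placeholder, and the utils signature may differ across VAD versions.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(get_speech_timestamps, _, read_audio, _, _) = utils

SAMPLE_RATE = 16_000
wav = read_audio("example_clip.wav", sampling_rate=SAMPLE_RATE)  # mono, resampled

# Each timestamp dict carries 'start' and 'end' offsets in samples.
timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLE_RATE)
speech_seconds = sum(t["end"] - t["start"] for t in timestamps) / SAMPLE_RATE
print(f"detected {speech_seconds / 3600:.4f} hours of speech in this clip")
```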

Technical Hurdles

While the focus of the Unsupervised People’s Speech dataset is on its potential applications, it’s worth noting some of the technical challenges the working group overcame:

  1. Data upload and storage: we developed a custom script utilizing a Git LFS backend to efficiently upload the 48+ TB dataset to S3 and HuggingFace, overcoming typical speed limitations.
  2. Self-supervised learning potential: we’ve included a training pipeline to train a Wav2Vec model, which could unlock new possibilities in unsupervised speech representation learning.
  3. Deduplication: we are in the process of creating embedding representations of the entire dataset using Meta’s Encodec, which will allow us to identify and remove duplicate content and ensure the dataset’s uniqueness and quality (a simplified sketch of this idea follows this list).
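
As a simplified illustration of the deduplication idea, the sketch below drops clips whose normalized embeddings are nearly identical. The embedding function is a placeholder standing in for the Encodec-based representations, and the similarity threshold is arbitrary.

```python
# Simplified deduplication sketch. embed() is a placeholder for the
# Encodec-based representations; the similarity threshold is arbitrary.
import numpy as np


def embed(audio_clip: np.ndarray) -> np.ndarray:
    """Placeholder: deterministic fixed-size embedding for an audio clip."""
    rng = np.random.default_rng(abs(hash(audio_clip.tobytes())) % (2**32))
    return rng.standard_normal(128)


def deduplicate(clips: list, threshold: float = 0.95) -> list:
    """Return indices of clips to keep, dropping near-duplicate embeddings."""
    kept_indices, kept_embeddings = [], []
    for idx, clip in enumerate(clips):
        e = embed(clip)
        e = e / np.linalg.norm(e)
        if all(float(np.dot(e, other)) < threshold for other in kept_embeddings):
            kept_indices.append(idx)
            kept_embeddings.append(e)
    return kept_indices


clips = [np.zeros(16_000), np.zeros(16_000), np.ones(16_000)]  # toy "audio"
print(deduplicate(clips))  # identical clips collapse to a single entry -> [0, 2]
```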

Future Work

The Unsupervised People’s Speech dataset is built from audio data on Archive.org that is either public domain or available with CC-BY or CC-BY-SA licenses. As part of our commitment to updating and maintaining the dataset, we are also releasing the software as open-source to empower the community to build on our contributions.

As the working group completes the deduplication and self-supervised efforts, we anticipate several avenues for the research community to continue to build and develop, especially in improving low-resource language speech models, enhancing speech recognition across different accents and dialects, and developing novel applications in speech synthesis. 

Join Us

There are many ways to get involved in the MLCommons data effort. We are currently seeking speakers of over 130 languages to create a benchmark for a text language identification task. If this sounds interesting, head to Dynabench’s text classification task.  

If you want to be part of the Datasets working group or any other working group at MLCommons, please visit the Get Involved page. We meet weekly to share ideas and implement projects. If you want to read more about the MLCommons Unsupervised People’s Speech dataset, visit https://mlcommons.org/datasets/unsupervised-peoples-speech/

The post Unsupervised People’s Speech: A Massive Multilingual Audio Dataset appeared first on MLCommons.

]]>