MLCommons Announces Expansion of Industry-Leading AILuminate Benchmark https://mlcommons.org/2025/05/nasscom/ Thu, 29 May 2025 15:10:24 +0000 World leader in AI benchmarking announces new partnership with India’s NASSCOM; updated reliability grades for leading LLMs

MLCommons today announced that it is expanding its first-of-its-kind AILuminate benchmark to measure AI reliability across new models, languages, and tools. As part of this expansion, MLCommons is partnering with NASSCOM, India’s premier technology trade association, to bring AILuminate’s globally recognized AI reliability benchmarks to South Asia. MLCommons is also unveiling new proof of concept testing for AILuminate’s Chinese-language capabilities and new AILuminate reliability grades for an expanded suite of large language models (LLMs). 

“We’re looking forward to working with NASSCOM to develop India-specific, Hindi-language benchmarks and ensure companies in India and around the world can better measure the reliability and risk of their AI products,” said Peter Mattson, President of MLCommons. “This partnership, along with new AILuminate grades and proof of concept for Chinese language capabilities, represents a major step towards the development of globally inclusive industry standards for AI reliability.”

“The rapid development of AI is reshaping India’s technology sector and, in order to harness risk and foster innovation, rigorous global standards can help align the growth of the industry with emerging best practices,” said Ankit Bose, Head of NASSCOM AI. “We plan to work alongside MLCommons to develop these standards and ensure that the growth and societal integration of AI technology continues responsibly.”

The NASSCOM collaboration builds on MLCommons’ intentionally global approach to AI benchmarking. Modeled after MLCommons’ ongoing partnership with Singapore’s AI Verify Foundation, the NASSCOM partnership will help to meet South Asia’s urgent need for standardized AI benchmarks that are collaboratively designed and trusted by the region’s industry experts, policymakers, civil society members, and academic researchers. MLCommons’ partnership with the AI Verify Foundation – in close collaboration with the National University of Singapore – has already resulted in significant progress towards globally-inclusive AI benchmarking across East Asia, including just-released proof of concept scores for Chinese-language LLMs.

AILuminate is also unveiling new reliability grades for an updated and expanded suite of LLMs, to help companies around the world better measure product risk. Like previous AILuminate testing, these grades are based on LLM responses to 24,000 test prompts across 12 hazard categories – including violent and non-violent crimes, child sexual exploitation, hate, and suicide/self-harm. None of the LLMs evaluated were given any advance knowledge of the evaluation prompts (a common problem in non-rigorous benchmarking), nor access to the evaluator model used to assess responses. This independence provides a methodological rigor uncommon in standard academic research or private benchmarking.

“Companies are rapidly incorporating chatbots into their products, and these updated grades will help them better understand and compare risk across new and constantly-updated models,” said Rebecca Weiss, Executive Director of MLCommons. “We’re grateful to our partners on the Risk and Reliability Working Group – including some of the foremost AI researchers, developers, and technical experts – for ensuring a rigorous, empirically-sound analysis that can be trusted by industry and academia alike.”

Having successfully expanded the AILuminate benchmark to multiple languages, the AI Risk & Reliability Working Group is beginning the process of evaluating reliability across increasingly sophisticated AI tools, including multi-modal LLMs and agentic AI. We hope to announce proof-of-concept benchmarks in these areas later this year.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI risk and reliability.

MLCommons MLPerf Training Expands with Llama 3.1 405B https://mlcommons.org/2025/05/training-llama31405b/ Mon, 05 May 2025 15:43:00 +0000 MLCommons adds new pretraining benchmark for testing large-scale systems

About MLPerf Training

Pretraining Large Language Models (LLMs) is the initial phase of LLM training, in which the model learns to understand and generate human-like text by training on vast amounts of unstructured text data. A pre-trained model can then be fine-tuned on domain-specific datasets, which gives the model knowledge of specific tasks. Reasoning approaches have also become popular recently, helping models better explain their thinking process. Because pre-training uses large amounts of data, it is generally the most compute-intensive phase of LLM training.

MLCommons® added a GPT3 Pretraining benchmark to MLPerf® Training in 2022 to help academics and industry experts optimize LLM pretraining. It has been a huge success: we have seen over 3x speedups on this benchmark for large systems (over 10,000 accelerators).

Over this same period, newer state-of-the-art models have demonstrated greater scale and many architectural advancements. For example, Google’s PaLM model, released in 2022, contains 540B parameters, roughly three times GPT3’s 175B parameter count. Meta’s Llama 2 and Llama 3 series, released in 2023 and 2024 respectively, utilize newer algorithms such as Root Mean Square Layer Normalization (RMSNorm, Zhang et al., https://arxiv.org/abs/1910.07467), Rotary Position Embedding (RoPE, Su et al., 2024), and Grouped Query Attention (GQA, Ainslie et al., 2023).
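To make one of these architectural changes concrete, below is a minimal PyTorch sketch of RMSNorm as described by Zhang et al.; it is illustrative only and not taken from the MLPerf reference implementation.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization (Zhang et al., 2019).

    Unlike LayerNorm, RMSNorm skips mean-centering and rescales activations
    by their root-mean-square, which is cheaper to compute at large scale.
    """
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

print(RMSNorm(dim=8)(torch.randn(2, 8)).shape)  # torch.Size([2, 8])
```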

An MLPerf Training working group task force was created to investigate a July 2024 proposal to replace GPT3 with Llama 3.1 405B. Task force members evaluated the proposal and industry options to recommend the best new pretraining reference. They agreed that Llama 3.1 405B would best represent the current state of the art in pretraining and adopted it as a new benchmark. Details on the review and selection process are included in this blog.

Model selection

The most important factor the task force considered when constructing this new pretraining benchmark was the model architecture. As mentioned above, the GPT3 architecture does not include some of the recent algorithmic updates and therefore is not competitive with other state-of-the-art models such as Nemotron-4, GPT4.5, and Claude 3.7 Sonnet. Meta’s Llama 3.1 405B model, however, has an architecture and scale similar to these top-tier models and demonstrates on-par performance across multiple quality benchmarks, making it a competitive and representative choice among current state-of-the-art models.

Choosing a model with high community engagement was also an important consideration, and Llama 3.1 405B is the perfect candidate on that basis. With more than 300 million total downloads of all Llama versions to date, the Llama family of models is very widely adopted. The MLPerf community, for instance, has already included Llama 2 in the LoRA benchmark. Additionally, Meta’s published technical paper details the training process, model architecture, dataset, and hyperparameters, which allowed us to easily create a high-quality benchmark based on their work.

One other requirement for the new MLPerf Training benchmark is that the model architecture must be publicly accessible so that all submitters can download it and reproduce the results on their end. This typically requires a permissive license. Thanks to Meta’s support, the current Llama 3.1 405B model is now available to all MLCommons members for benchmarking. 

Dataset selection

To help submitters transition easily from GPT3 to Llama 3.1 405B, we use the same AllenAI C4-en dataset (Colossal Clean Crawled Corpus) in both benchmarks. This dataset contains 365M English-language paragraphs scraped from the internet, cleaned and preprocessed into a convenient JSON format for training the model.
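For readers who want to inspect the same data outside the benchmark harness, the sketch below streams the AllenAI C4-en split via the Hugging Face datasets library; this is a convenience path for exploration, not the reference implementation's data pipeline.

```python
from datasets import load_dataset

# Stream the AllenAI C4 English split instead of downloading all ~365M
# paragraphs up front; each record is a JSON object with "text", "url",
# and "timestamp" fields.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for example in c4.take(3):
    print(example["text"][:80])
```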

One key difference is that Llama 3.1 405B uses a context length of 8,192 tokens, 4x larger than GPT3’s 2,048-token context. The context length is the amount of text a model can process at once, so a larger context length allows the model to capture longer-range dependencies in the text.

Benchmark technical details

The reference code is implemented in the NVIDIA NeMo Framework, an open-source, scalable AI framework built for large language models. By offering easy-to-use modular components and pre-trained model recipes, NeMo enables developers to efficiently build performant models tailored to specific use cases. The reference code functionality is tested on NVIDIA H100 GPUs.

To keep the benchmark setup easy to understand, without complex knobs and hyperparameters scattered around the training script, the reference starts from NeMo’s integrated Llama 3.1 405B training recipe, which closely follows the original paper’s hyperparameters. We introduced only the minimum necessary changes to our script, such as adjusting the optimizer’s learning rates.

When researching how to construct this benchmark, we first explored the GPT3-style setup. For the GPT3 pretraining benchmark, the task force pretrained from random weights for about 12.6T tokens to generate a stable customized checkpoint. Submitters were required to continue pretraining from this checkpoint until they reached a predetermined quality metric. This methodology was used to ensure that the GPT3 pretraining benchmark showcased a reproducible slice of the actual GPT3 pretraining process.

Applying the same methodology to the Llama 3.1 405B pretraining benchmark, the task force generated a customized checkpoint trained on 32B tokens. But experiment results showed that resuming from this checkpoint produced more variation in the convergence curves than we could tolerate, which meant that resuming from a custom-trained checkpoint was not feasible for the benchmark. Our interpretation was that, after only 32B tokens, training was still in a very early phase, so the generated checkpoint was still quite unstable.

So the task force decided to continue training from the fully trained, stable Hugging Face checkpoint by giving it new information to learn, which led to a much more stable convergence trend. The benchmark now begins from that checkpoint and stops when the evaluation log perplexity reaches our defined target.

One noticeable difference between the task force’s implementation and Meta’s implementation is that we replaced the original tiktoken-based tokenizer, which used a vocab size of 128k, with the 32k-vocab Mixtral 8x22B tokenizer. The motivation behind this change is that Meta’s checkpoint is already well trained, so it would not learn anything new on the current C4 dataset. Replacing the tokenizer forces the model checkpoint to adapt to a new, different token distribution and therefore to continue pretraining on the current dataset. This difference implies that, when we resume from the pre-trained checkpoint, only the first 32,000 rows of the 128,256-row word embedding layer weights are loaded into the model.
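The partial embedding load might look roughly like the sketch below; the checkpoint path and state-dict key names here are hypothetical, and the reference implementation handles this inside NeMo rather than with a hand-rolled script.

```python
import torch

# Hypothetical file and key names -- the real keys come from the NeMo / Hugging Face
# checkpoint layout. 32_000 is the Mixtral 8x22B tokenizer vocabulary size.
checkpoint = torch.load("llama31_405b_reference.pt", map_location="cpu")
full_embedding = checkpoint["embedding.weight"]        # shape: [128_256, hidden_dim]
truncated_embedding = full_embedding[:32_000].clone()  # keep only the first 32k rows

# Load the truncated table; all other weights are loaded unchanged.
# model.load_state_dict({"embedding.weight": truncated_embedding}, strict=False)
print(truncated_embedding.shape)
```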

As mentioned before, the Llama 3.1 405B pretraining benchmark is a replacement for the GPT3 pretraining benchmark, and the following components will be unchanged between the two:

  • Validation dataset: To reduce the cost of validation, we created a customized subset of the full C4 validation dataset so that evaluation time is insignificant compared to training time. Specifically, we found that 5,760 sequences are sufficient to measure model convergence, and since only 91,205 validation samples are needed to yield those 5,760 validation sequences, the first 91,205 unshuffled rows of the full C4 validation dataset were selected as the customized validation dataset for this benchmark.
  • Loss and target accuracy metric: As with GPT3, log perplexity is used, computed on the customized validation dataset to evaluate the model’s convergence behavior. This metric is not costly to compute and is widely adopted as a versatile indicator of evaluating LLMs.
  • Limited number of submission runs: We recognize that the Llama 3.1 405B pretraining benchmark is very expensive to run, so to ensure fairness while reducing submitters’ costs, each submitter is required to run the benchmark only 3 times, the same as for the GPT3 pretraining benchmark. The median result is then reported.
  • Restriction on hyperparameter searching: Because running the benchmark is expensive, we believe it is reasonable to disallow hyperparameter searches and borrowing. Most hyperparameters have been given static values. For a few others, such as the learning rate, formulas are provided to compute them from the global batch size (see the illustrative sketch after this list).
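As a purely illustrative example of a batch-size-dependent formula (the actual formula and constants are defined in the MLPerf Training rules and reference configuration, not here), a linear-scaling rule would look like this:

```python
def scaled_learning_rate(global_batch_size: int,
                         base_lr: float = 8.0e-5,
                         base_global_batch_size: int = 1152) -> float:
    """Hypothetical linear-scaling rule: the learning rate grows in proportion
    to the global batch size. The constants here are placeholders, not the
    values used by the benchmark."""
    return base_lr * global_batch_size / base_global_batch_size

print(scaled_learning_rate(2304))  # e.g. doubling the batch size doubles the LR
```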

Conclusion

In this blog post, we have provided an overview of a new LLM pretraining benchmark based on Meta’s Llama 3.1 405B model, with more than twice as many parameters and four times as many training tokens as the GPT3 pretraining benchmark. This benchmark is massively more computationally intensive and presents its own new challenges. It is the task force’s belief that this new benchmark will encourage more innovation in the area of LLM pretraining.

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies’ accuracy, safety, speed, and efficiency.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or email participation@mlcommons.org.

MLCommons Releases MLPerf Client v0.6 With Expanded Hardware Support for AI PCs https://mlcommons.org/2025/04/mlperf-client-v0-6/ Mon, 28 Apr 2025 15:04:31 +0000 MLCommons' MLPerf Client v0.6: First Open Benchmark for NPU and GPU AI PCs

Today, MLCommons® is announcing the release of MLPerf® Client v0.6, an update to the MLPerf Client consumer AI performance benchmark. This release extends support to a broader range of hardware and platforms, including AI PCs with dedicated neural processing units (NPUs), while enhancing usability.

MLPerf Client v0.6 builds on the foundation established by version 0.5, which debuted with LLM-focused performance tests using the Llama 2 7B model from Meta. With this latest release, MLCommons continues its mission to provide a transparent, standardized, and vendor-neutral way to measure AI performance across a growing range of PC hardware.

MLPerf Client represents a collaboration among leaders in the consumer computing space, including AMD, Intel, Microsoft, NVIDIA, Qualcomm Technologies, Inc., and top PC OEMs. These stakeholders have pooled resources and expertise to create a standardized performance benchmark for key consumer AI workloads.

Key Updates in v0.6:

  • Expanded Hardware Support:
    New support for NPUs from Intel, alongside continued GPU acceleration via AMD, Intel, and NVIDIA hardware. This milestone makes MLPerf Client the first open benchmark to span both GPU and NPU acceleration on consumer platforms.
  • Improved Device Selection:
    New device enumeration options help users better target and test systems with multiple capable accelerators, such as PCs equipped with multiple GPUs.
  • Updated Software Stack:
    Includes the latest versions of ONNX Runtime, ONNX Runtime GenAI, and Intel OpenVINO, offering performance and compatibility improvements across supported platforms.

“MLPerf Client v0.6 reflects the rapid evolution of the AI PC landscape with the inclusion of NPU evaluation,” said Yanni Minkdakis, co-chair of the MLPerf Client working group. “With expanded support and more flexible testing, it’s now easier than ever for the industry and consumers alike to evaluate real-world AI performance on next-generation devices.”

MLPerf Client v0.6 is available now as a free download at mlcommons.org.

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies’ accuracy, safety, speed, and efficiency.

MLPerf Client participation requires an MLCommons membership. For more information and details on becoming a member, please visit MLCommons.org or contact participation@mlcommons.org.

MLCommons Releases French AILuminate Benchmark Demo Prompt Dataset to GitHub https://mlcommons.org/2025/04/ailuminate-french-datasets/ Wed, 16 Apr 2025 17:00:00 +0000 Set of 1,200 Creative Commons French Demo prompts and 12,000 Practice Test prompts exercise the breadth of hazards for generative AI systems

Following the recent French release of the AILuminate v1.1 benchmark, MLCommons is announcing the public release of two French-language datasets for the benchmark. AILuminate measures the reliability and integrity of generative AI language systems against twelve categories of hazards and is currently available in English and French. Today’s release includes a Creative Commons-licensed Demo prompt dataset of over 1,200 prompts and a Practice Test dataset of 12,000 prompts, both of which span all twelve AILuminate hazard categories.

The Demo prompt dataset is available on GitHub now and the full Practice Test dataset may be obtained by contacting MLCommons. English versions of both datasets are also available for use.

The AILuminate benchmark

The AILuminate benchmark, designed and developed by the MLCommons AI Risk & Reliability working group, assesses LLM responses to 12,000 test prompts per language, across twelve categories of hazards that users of generative AI language systems may encounter. The benchmark was designed to help organizations evaluate language systems before deploying them in production environments, to compare systems before procurement, or to verify their language systems comply with organizational and government policies. In addition to its unique corpus of prompts, AILuminate also includes a best-in-class evaluation system using a tuned ensemble of safety evaluation models. MLCommons has used the benchmark to assess the performance of many of the most prominent AI language systems-under-test (SUTs) across all of the hazard categories, and published public reports of their findings. 

The AILuminate French language prompt datasets can help provide valuable insights into your AI systems: 

  • Demo Prompt Dataset: A Creative Commons dataset of 1,200 prompts that is a subset of the AILuminate Practice Test dataset. The French-language Demo dataset is useful for in-house testing, provides examples of specific hazards, and offers seeds and inspiration for creating other prompts. It is available today on GitHub.
  • Practice Test Dataset: Includes 12,000 prompts. This dataset closely resembles the official test and is intended to provide a realistic projection of how models would be evaluated by AILuminate when used with the MLCommons AILuminate evaluator. Access to the comprehensive Practice dataset is available upon request from MLCommons. 

AILuminate’s prompts are based on 12 hazard categories, which are extensively documented in the AILuminate Assessment Standard.

All prompts are currently available in English and French, with future plans to expand to Simplified Chinese and Hindi. Each prompt dataset is human-written and reviewed by native speakers for linguistic and cultural relevance, to ensure accurate linguistic representation. MLCommons intends to regularly update the prompt datasets, including the Creative Commons Demo dataset, as the community’s understanding of AI hazards evolves – for example, as agentic behavior is developed in AI systems. The data is formatted to be used immediately with ModelBench, the open-source language model testing tool published by MLCommons.
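As a quick illustration of working with the Demo prompt dataset before feeding it to ModelBench or an in-house harness, the sketch below loads it with pandas; the file name and column names are hypothetical, so check the GitHub repository for the actual layout.

```python
import pandas as pd

# Hypothetical file and column names -- consult the AILuminate GitHub repository
# for the real schema of the French Demo prompt dataset.
demo = pd.read_csv("ailuminate_demo_prompts_fr.csv")

# For example, count prompts per hazard category before sending them to a
# system under test.
print(demo.groupby("hazard_category")["prompt_text"].count())
```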

“AI reliability is a global problem that requires an intentionally global approach to address,” said Peter Mattson, Founder and President of MLCommons. “The availability of the French Demo and Practice Test datasets reflects our commitment to evaluating AI risk concerns across languages, cultures, and value systems — and ensures our benchmarks and datasets are readily available to vendors and developers across continents.” 

Raising the bar for reliability testing by empowering stakeholders

The AILuminate Demo prompt datasets provide an important starting point for those who are developing their in-house abilities to test the safety of generative AI systems across the twelve defined AILuminate hazard categories. While we discourage training AI systems directly using this data – following the industry best practice of maintaining separate evaluation suites to avoid overfitting AI models – we do expect the dataset to have direct relevance to the work of a wide variety of stakeholders. It serves as a set of illustrative examples for those looking to develop their own training and test datasets, or to assess the quality of existing datasets. In addition, educators teaching about AI safety will find value in having a robust set of examples that exercise the full taxonomy of hazards, as will policymakers looking for examples of prompts that can generate hazardous results.

Download the Demo prompt dataset today!

The Demo test dataset is made available using the Creative Commons CC-BY-4.0 license and can be downloaded directly from GitHub. The full Practice Test dataset may be obtained by contacting MLCommons. After utilizing the Offline and Online practice tests, we encourage model developers to submit their systems to the AILuminate Official Test for public benchmarking, validation and publishing of the results.

More information on AILuminate can be found here, including an FAQ. We encourage stakeholders to join the MLCommons AI Risk & Reliability working group and our efforts to ensure greater reliability of AI systems.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI risk and reliability. We invite others to join the AI Risk & Reliability working group.

FRENCH ANNOUNCEMENT

MLCommons publie le jeu de données de prompts de démonstration AILuminate en français sur GitHub

Une série de 1 200 prompts Creative Commons en français explore l’étendue des risques pour les systèmes d’IA générative

Suite au récent lancement de la version française du benchmark AILuminate v1.1, MLCommons annonce la mise à disposition publique de deux jeux de données en langue française pour ce benchmark. AILuminate évalue la fiabilité et l’intégrité des systèmes d’IA générative dans douze catégories de risques, et ses benchmarks sont désormais disponibles pour les systèmes en anglais et en français. La publication d’aujourd’hui inclut un jeu de 1 200 prompts de démonstration sous licence Creative Commons et un jeu de 12 000 prompts d’essai couvrant les douze types de risques définis par AILuminate.

Le jeu de données de démonstration est dès maintenant disponible sur GitHub, tandis que le jeu complet d’essai peut être obtenu en contactant MLCommons. La version anglaise des deux jeux est également disponible.

Le benchmark AILuminate

Le benchmark AILuminate, conçu et développé par le groupe de travail sur les risques et la fiabilité des IA chez MLCommons, évalue les réponses des modèles linguistiques (LLM) à 12 000 prompts par langue, répartis dans douze catégories de risques auxquels les utilisateurs peuvent être confrontés. Ce benchmark aide les organisations à évaluer leurs systèmes avant leur déploiement, à comparer les produits avant achat ou à vérifier leur conformité aux politiques internes et officielles. En plus d’une collection unique de prompts, AILuminate inclut un système d’évaluation avancé basé sur un ensemble optimisé de modèles d’évaluation des risques. MLCommons a utilisé ce benchmark pour analyser la performance des principaux systèmes linguistiques testés (SUTs) dans toutes les catégories de risques et a publié des rapports publics sur leurs résultats.

Les jeux de données en langue française AILuminate offrent des perspectives précieuses pour vos systèmes d’IA :

  • Jeu de prompts de démonstration : Un sous-ensemble Creative Commons contenant plus de 1 200 prompts issus du jeu de prompts d’essai. Ce jeu est utile pour les tests internes, fournit des exemples spécifiques aux risques identifiés et peut inspirer la création d’autres prompts. Il est disponible dès aujourd’hui sur GitHub.
  • Jeu de prompts d’essai : Comprend 12 000 prompts conçus pour simuler une évaluation réaliste par le système officiel AILuminate. Ce jeu complet peut être obtenu sur demande auprès de MLCommons.

Les prompts sont basés sur les douze catégories de risques définies dans le standard d’évaluation AILuminate, qui inclut notamment :

[Figure: French-translated AILuminate hazard categories]

Tous les prompts sont disponibles aujourd’hui en anglais et en français, et seront bientôt disponibles en chinois simplifié et en hindi. Les prompts sont rédigés par un être humain, puis contrôlés par une personne de langue maternelle afin d’en assurer la pertinence linguistique et culturelle. MLCommons prévoit de les mettre à jour régulièrement, notamment le jeu Creative Commons, à mesure que la compréhension communautaire des risques liés à l’IA évolue – par exemple avec le développement du comportement agentique dans les systèmes d’IA. Les données sont formatées pour une utilisation immédiate avec ModelBench, l’outil open-source publié par MLCommons.

« La fiabilité des systèmes IA est un problème mondial nécessitant une approche internationale délibérée », a déclaré Peter Mattson, fondateur et président de MLCommons. « La disponibilité des jeux de prompts français reflète notre engagement quant à l’évaluation des préoccupations liées aux risques de l’IA à travers différentes langues, cultures et systèmes de valeurs – garantissant que nos benchmarks soient accessibles aux entreprises commerciales et aux développeurs dans le monde entier. »

Élever les standards des tests IA en responsabilisant les parties prenantes

Les jeux de données AILuminate fournissent un point de départ important pour le développement des capacités internes en permettant d’évaluer la sécurité des systèmes d’IA générative dans les douze catégories définies par AILuminate. Bien que nous déconseillions l’utilisation de ces prompts pour entraîner des systèmes d’IA (les meilleures pratiques de l’industrie recommandent l’isolation des jeux de données d’évaluation afin d’éviter le surapprentissage), ces prompts sont exploitables dans de nombreux secteurs. Ils servent d’exemples représentatifs pour les utilisateurs qui souhaitent développer leurs propres jeux de données d’entraînement et d’évaluation ou évaluer la qualité de jeux de données existants. Les formateurs en sécurité de l’IA sauront profiter de cet ensemble complet d’exemples couvrant une taxonomie de risques étendue, et les décideurs politiques pourront y trouver des cas concrets où certains prompts peuvent générer des résultats dangereux.

Téléchargez le jeu de données aujourd’hui !

Le jeu de prompts de démonstration est disponible sous licence Creative Commons CC-BY-4.0 et peut être téléchargé directement sur GitHub. Le jeu d’essai complet peut être obtenu en contactant MLCommons. Après l’utilisation des tests d’essai hors ligne ou en ligne, nous encourageons les développeurs à soumettre leurs systèmes à AILuminate pour obtenir un benchmarking public avec validation et publication officielle des résultats.

Pour en savoir plus sur AILuminate et accéder à une FAQ détaillée, nous encourageons toutes les parties prenantes à rejoindre le groupe MLCommons sur les risques IA et la fiabilité.

À propos de MLCommons

MLCommons est le leader mondial du développement de benchmarks pour l’IA. C’est un consortium ouvert dont la mission est d’améliorer l’IA pour tous par la publication de benchmarks et de jeux de données publics. Formée de plus de 125 membres (entreprises de technologie, universitaires et chercheurs du monde entier), MLCommons se concentre sur le développement collaboratif d’outils pour l’industrie de l’IA par le biais de benchmarks, de jeux de données et de mesures publics traitant des risques et de la fiabilité de l’IA. Nous invitons toutes les parties intéressées à rejoindre le groupe sur les risques de l’IA chez MLCommons.

CKAN Announces Support for Croissant https://mlcommons.org/2025/04/ckan-croissant/ Wed, 16 Apr 2025 00:30:02 +0000 MLCommons Croissant metadata standard is now connected to CKAN’s open data catalog capabilities with ML workflows

CKAN, the world’s leading open source data management system, recently announced support for the Croissant metadata standard, a format designed by MLCommons to make datasets machine learning-ready. CKAN sites can now easily expose their datasets to ML pipelines with the help of the newly released croissant plugin. This integration, powered by the ckanext-dcat 2.3.0 extension, connects CKAN’s open data catalog capabilities with ML workflows, improving dataset discoverability and interoperability.
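Once a CKAN dataset exposes Croissant JSON-LD, it can be consumed with MLCommons' mlcroissant library, roughly as sketched below; the endpoint URL and record-set name are illustrative and depend on the specific CKAN deployment and dataset.

```python
import mlcroissant as mlc

# Hypothetical endpoint: a CKAN site with the croissant plugin serves Croissant
# JSON-LD for each dataset; the exact URL pattern depends on the deployment.
croissant_url = "https://demo.ckan.org/dataset/example/croissant.jsonld"

dataset = mlc.Dataset(jsonld=croissant_url)
for record in dataset.records(record_set="default"):  # record-set name is illustrative
    print(record)
    break
```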

Learn more on the CKAN blog: Bridging CKAN and Machine Learning: Introducing Support for the Croissant Standard

Announcing the AILuminate Benchmarking Policy https://mlcommons.org/2025/04/ail-benchmarking-policy/ Wed, 16 Apr 2025 00:26:43 +0000 The MLCommons policy to guide which systems we include in AILuminate benchmarks

Just a few months ago we launched our first AILuminate benchmark, a first-of-its-kind reliability test for large language models (LLMs). Our goal with AILuminate is to build a standardized way of evaluating product reliability that will assist developers in improving the reliability of their systems, and give companies better clarity about the reliability of the systems they use. 

For AILuminate to be as effective as it can be, it needs to achieve widespread coverage of AI systems that are available on the market. This is one reason we have prioritized multilingual coverage; we launched French language coverage in February, and plan to launch Hindi and Chinese in the coming months. But beyond language, AILuminate needs to cover as many LLMs that are on the market as possible, and it needs to be continually updated to reflect the latest versions of LLMs available. 

While we work toward achieving this ubiquitous coverage from an engineering standpoint, we are announcing a Benchmarking Policy, which defines which “systems under test” (as we call the LLMs we test) will be included in the AILuminate benchmark going forward. With this policy, we have aimed to balance giving developers a practice test and advance notice that their systems will be included against the need to maintain timely, continuously updated testing of as many of the systems on the market as we can.
You can read the policy in full here.

MLCommons Releases New MLPerf Inference v5.0 Benchmark Results https://mlcommons.org/2025/04/mlperf-inference-v5-0-results/ Wed, 02 Apr 2025 15:00:00 +0000 Generative AI now the center of attention for performance engineering

Today, MLCommons® announced new results for its industry-standard MLPerf® Inference v5.0 benchmark suite, which delivers machine learning (ML) system performance benchmarking in an architecture-neutral, representative, and reproducible manner. The results highlight that the AI community is focusing much of its attention and efforts on generative AI scenarios, and that the combination of recent hardware and software advances optimized for generative AI have led to dramatic performance improvements over the past year.

The MLPerf Inference benchmark suite, which encompasses both datacenter and edge systems, is designed to measure how quickly systems can run AI and ML models across a variety of workloads. The open-source and peer-reviewed benchmark suite creates a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry. It also provides critical technical information for customers who are procuring and tuning AI systems. This round of MLPerf Inference results also includes tests for four new benchmarks: Llama 3.1 405B, Llama 2 70B Interactive for low-latency applications, RGAT, and Automotive PointPainting for 3D object detection.

Llama 2 70B generative AI test takes center stage

The Inference v5.0 results show that generative AI scenarios have gained momentum. Over the last year, submissions to the Llama 2 70B benchmark test, which implements a large generative AI inference workload based on a widely referenced open-source model, have increased 2.5x. With the v5.0 release, Llama 2 70B has now supplanted ResNet-50 as the test with the highest submission rate.

The performance results for Llama 2 70B have also skyrocketed since a year ago: the median submitted score has doubled, and the best score is 3.3 times faster compared to Inference v4.0.

“It’s clear now that much of the ecosystem is focused squarely on deploying generative AI, and that the performance benchmarking feedback loop is working,” said David Kanter, head of MLPerf at MLCommons. “We’re seeing an unprecedented flood of new generations of accelerators. The hardware is paired with new software techniques, including aligned support across hardware and software for the FP4 data format. With these advances, the community is setting new records for generative AI inference performance.”

The benchmark results for this round include results for six newly available or soon-to-be-shipped processors:

  • AMD Instinct MI325X
  • Intel Xeon 6980P “Granite Rapids”
  • Google TPU Trillium (TPU v6e)
  • NVIDIA B200
  • NVIDIA Jetson AGX Thor 128
  • NVIDIA GB200

Benchmarking the state of the art in generative AI: two new tests introduced

In step with advances in the AI community, MLPerf Inference v5.0 introduces a new benchmark utilizing the Llama 3.1 405B model, setting a new bar for the scale of generative AI inference models in a performance benchmark. Llama 3.1 405B incorporates 405 billion parameters and supports input and output lengths up to 128,000 tokens (compared to only 4,096 tokens for Llama 2 70B). The benchmark tests three separate tasks: general question-answering, math, and code generation.

“This is our most ambitious inference benchmark to-date,” said Miro Hodak, co-chair of the MLPerf Inference working group. “It reflects the industry trend toward larger models, which can increase accuracy and support a broader set of tasks. It’s a more difficult and time-consuming test, but organizations are trying to deploy real-world models of this order of magnitude. Trusted, relevant benchmark results are critical to help them make better decisions on the best way to provision them.”

The Inference v5.0 suite also adds a new twist to its existing benchmark for Llama 2 70B with an additional test that adds low-latency requirements: Llama 2 70B Interactive. Reflective of industry trends toward interactive chatbots as well as next-generation reasoning and agentic systems, the benchmark requires systems under test (SUTs) to meet more demanding system response metrics for time to first token (TTFT) and time per output token (TPOT).

“A critical measure of the performance of a query system or a chatbot is whether it feels responsive to a person interacting with it. How quickly does it start to reply to a prompt, and at what pace does it deliver its entire response?” said Mitchelle Rasquinha, MLPerf Inference working group co-chair. “By enforcing tighter requirements for responsiveness, this interactive version of the Llama 2 70B test offers new insights into the performance of LLMs in real-world scenarios.”
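In concrete terms, TTFT and TPOT can be computed from per-token timestamps as in the small sketch below; these are the standard definitions of the two metrics, while the benchmark's exact measurement rules and latency limits are specified in the MLPerf Inference rules.

```python
def ttft_and_tpot(request_time: float, token_times: list[float]) -> tuple[float, float]:
    """Compute time to first token (TTFT) and mean time per output token (TPOT)
    from per-token completion timestamps, all in seconds."""
    ttft = token_times[0] - request_time
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]  # inter-token gaps
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, tpot

# Example: request sent at t=0.0 s, first token at 0.45 s, then one token every 20 ms.
print(ttft_and_tpot(0.0, [0.45 + 0.02 * i for i in range(10)]))
```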

More information on the selection of the Llama 3.1 405B and the new Llama 2 70B Interactive benchmarks can be found in this supplemental blog.

New Graph Neural Network datacenter benchmark for modeling relationship graphs

Also new to Inference v5.0 is a datacenter benchmark that implements a graph neural network (GNN) model. GNNs are useful for modeling links and relationships between nodes in a network and are commonly used in recommendation systems, knowledge-graph answering, fraud-detection systems, and other kinds of graph-based applications.

The GNN datacenter benchmark implements the RGAT model, based on the Illinois Graph Benchmark Heterogeneous (IGBH) dataset containing 547,306,935 nodes and 5,812,005,639 edges.

More information on the construction of the RGAT benchmark can be found here.

New edge benchmark: Automotive PointPainting test for 3D object detection

The Inference v5.0 benchmark introduces a new Automotive PointPainting benchmark for edge computing devices, specifically automobiles. While the MLPerf Automotive working group continues to develop the Minimum Viable Product benchmark first announced last summer, this test provides a proxy for an important edge-computing scenario: 3D object detection in camera feeds for applications such as self-driving cars.

More information on the Automotive PointPainting benchmark can be found here.

As industry raises the bar for AI systems, MLPerf Inference benchmark follows suit

“We rarely introduce four new tests in a single update to the benchmark suite,” said Miro Hodak, “but we felt it was necessary to best serve the community. The rapid pace of advancement in machine learning and the breadth of new applications are both staggering, and stakeholders need relevant and up-to-date data to inform their decision-making.”

MLPerf Inference v5.0 includes 17,457 performance results from 23 submitting organizations: AMD, ASUSTeK, Broadcom, Cisco, CoreWeave, CTuning, Dell, FlexAI, Fujitsu, GATEOverflow, Giga Computing, Google, HPE, Intel, Krai, Lambda, Lenovo, MangoBoost, NVIDIA, Oracle, Quanta Cloud Technology, Supermicro, and Sustainable Metal Cloud.

“We would like to welcome the five first-time submitters to the Inference benchmark: CoreWeave, FlexAI, GATEOverflow, Lambda, and MangoBoost,” said David Kanter. “The continuing growth in the community of submitters is a testament to the importance of accurate and trustworthy performance metrics to the AI community. I would also like to highlight Fujitsu’s broad set of datacenter power benchmark submissions and GATEOverflow’s edge power submissions in this round, which reminds us that energy efficiency in AI systems is an increasingly critical issue in need of accurate data to guide decision-making.”

“The machine learning ecosystem continues to give the community ever greater capabilities. We are increasing the scale of AI models being trained and deployed, achieving new levels of interactive responsiveness, and deploying AI compute more broadly than ever before,” said Kanter. “We are excited to see new generations of hardware and software deliver these capabilities, and MLCommons is proud to present exciting results for a wide range of systems and several novel processors with this release of the MLPerf Inference benchmark. Our work to keep the benchmark suite current, comprehensive, and relevant at a time of rapid change is a real accomplishment, and ensures that we will continue to deliver valuable performance data to stakeholders.”

View the results

To view the results for MLPerf Inference v5.0, please visit the Datacenter and Edge benchmark results pages.

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies’ accuracy, safety, speed, and efficiency.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or email participation@mlcommons.org.

Introducing a Graph Neural Network Benchmark in MLPerf Inference v5.0 https://mlcommons.org/2025/04/rgat-inference-v5/ Wed, 02 Apr 2025 14:59:24 +0000 The RGAT benchmark addresses graph-structured data and applications, expanding the scope of MLPerf Inference

Introduction

Graph neural networks (GNNs) have been increasing in popularity over the last few years, driven by the abundance of data and use cases with underlying graph structure. GNNs are used in a variety of applications including social network analysis, fraud detection, transportation applications, recommendation systems, weather prediction, drug discovery, and chemical structure analysis. In addition, GNNs are being explored for natural language processing and image segmentation. Generally, GNNs have been shown to outperform other machine learning methods for tasks like node classification and graph classification, and they unlock many new applications that work with data more complex than traditional text, images, video, and audio. Based on the popularity of graph neural networks and these new application areas, the GNN task force recommended adding a graph neural network benchmark to MLPerf® Inference v5.0.

Model selection

There are a variety of types of GNNs, including both Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs). GCNs rely on convolution operators to capture information about relationships between nodes in a graph, while GATs use the attention mechanism. Traditional GATs focus on undirected single-relation graphs. For the MLPerf GNN benchmark, we chose a variant of graph attention networks called RGAT, or Relational Graph Attention Network, which extends GATs to allow for multi-relational graph structures – representing a wider range of applications and use cases.

What are RGATs?

Graph Attention Networks (GATs) are a specialized type of GNN that use attention mechanisms to dynamically weigh the importance of neighboring nodes during information propagation in graph-structured data. They’re particularly effective for node classification and social network analysis. Relational GATs (RGATs) extend this concept by incorporating relationship type discrimination between nodes in knowledge graphs (i.e., differentiating by edge type).

The figure below demonstrates a 2-layer RGAT with fan-out: when classifying a target node, we construct a subgraph by randomly sampling neighbors across two hierarchical levels (note that here, we fan in rather than fan out). In the first layer, neighbor embeddings serve directly as attention inputs, while subsequent layers use propagated outputs from GAT computations. Notably, the same graph node (e.g., any single GAT node in the figure) may appear multiple times in the computation subgraph. While this allows reuse of its learned embedding, each occurrence requires recomputation because input edges carry distinct values in different contexts. This architecture creates a memory-computation tradeoff that depends on the graph’s relationships.

[Figure: Subgraph sampling example]

The figure below illustrates the architecture of a single RGAT layer, detailing its attention mechanism. For each node pair in the attention computation, both the local embedding (central node) and external embeddings (neighbors) pass through a shared MLP to generate separate Query and Key vectors. These projections are concatenated and transformed into an attention score. The scores undergo Softmax normalization to produce attention weights. These weights then compute a weighted sum of the Value vectors, which are derived from projected neighbor embeddings. This aggregated result becomes the hidden embedding passed to the next RGAT layer.

A key implementation detail is that while the same MLP parameters are reused across all computations, each attention head recalculates context-specific weights because input edges (Key/Value) differ for every node-edge interaction.

[Figure: RGAT network architecture]
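A minimal PyTorch sketch of a single attention head over one node's sampled neighborhood is shown below; it follows the description above but omits the per-relation parameters, multi-head attention, and batching that the actual RGAT benchmark model uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadGATLayer(nn.Module):
    """Single-head graph attention over one node's neighborhood. A full RGAT
    additionally discriminates between edge/relation types; this sketch does not."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)  # shared projection for Q/K/V
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)   # scores each [center || neighbor] pair

    def forward(self, center: torch.Tensor, neighbors: torch.Tensor) -> torch.Tensor:
        # center: [in_dim], neighbors: [num_neighbors, in_dim]
        q = self.proj(center)                                # query from the central node
        kv = self.proj(neighbors)                            # keys/values from neighbors
        pairs = torch.cat([q.expand_as(kv), kv], dim=-1)     # concatenate per neighbor
        scores = F.leaky_relu(self.attn(pairs)).squeeze(-1)  # unnormalized attention scores
        weights = torch.softmax(scores, dim=0)               # normalize over the neighborhood
        return (weights.unsqueeze(-1) * kv).sum(dim=0)       # weighted sum -> next-layer embedding

layer = SingleHeadGATLayer(in_dim=1024, out_dim=256)
out = layer(torch.randn(1024), torch.randn(15, 1024))        # e.g. 15 sampled neighbors
print(out.shape)                                             # torch.Size([256])
```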

Dataset and task selection

Some of the most popular applications of graph neural networks are social network analysis and recommenders. These applications can have massive datasets of user information which require a large amount of storage. One of the biggest challenges of graph neural networks is scaling to these large graphs, which can increase both the computational cost of processing all the relations in the graph, as well as the message passing and memory management of nodes across the graph, which are often stored on disk and even across multiple nodes.

To mimic this as much as possible, we chose to use the largest publicly available graph dataset at the time – the Illinois Graph Benchmark Heterogeneous (IGB-H) dataset. The graph in this dataset consists of academic papers (“Paper” nodes), the fields of study of those papers (“FoS” nodes), the authors of those papers (“Author” nodes), and the institutes the authors are affiliated with (“Institute” nodes). In addition to the previously mentioned relations (topic, written by, and affiliated with), there is also a “citation” relation from “Paper” to “Paper”. The task for this dataset is classification of the “Paper” nodes into a set of 2,983 topics.

[Figure: Graph structure of IGB-H]

The IGB Heterogeneous “Full” dataset variant has 547 million nodes and 5.8 billion edges. With an embedding dimension of 1024, the dataset totals over 2 TB of data. In addition, for use in MLPerf Inference, the dataset is augmented with reverse edges as well as self-loops for papers, more than doubling the number of edges. It was decided that the subset used for inference as well as accuracy checks is the validation set from training, consisting of around 788 thousand “Paper” nodes. During classification, subgraphs are sampled for each input node with fanout steps of 15, 10, and 5 (i.e., 15 neighbors, 10 neighbors of each of those neighbors, and 5 neighbors of those neighbors). We acknowledge that some real-world applications of GNNs utilize “full fanout”, which uses every single neighbor of the node set at each step of subgraph generation. However, for the MLPerf Inference benchmark, we decided to use a fixed maximum fanout with the same parameters as when the model was trained, in order to reduce the variance of per-sample latencies, as high-neighbor-count nodes could otherwise skew inference latency higher.
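The sketch below shows what fixed-fanout sampling of this kind looks like with DGL's NeighborSampler; the toy graph, seed nodes, and fanout ordering here are illustrative, so consult the reference implementation for the exact configuration.

```python
import dgl
import torch

# Toy stand-in for IGB-H: a random graph with 1,000 nodes and 20,000 edges.
graph = dgl.rand_graph(1000, 20000)
seed_nodes = torch.arange(64)  # e.g. a batch of "Paper" nodes to classify

# Fixed maximum fanout per hop, mirroring the 15-10-5 scheme described above
# (layer ordering here is illustrative; check the reference implementation).
sampler = dgl.dataloading.NeighborSampler([15, 10, 5])
loader = dgl.dataloading.DataLoader(
    graph, seed_nodes, sampler, batch_size=32, shuffle=False)

for input_nodes, output_nodes, blocks in loader:
    print(len(blocks), "message-passing blocks for", len(output_nodes), "seed nodes")
    break
```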

This iteration of GNN benchmarking only tests the ‘Offline’ MLPerf Inference scenario, as many of the production applications are only used in this setting. In the future, if there is sufficient interest, we may add the ‘Server’ scenario. 

Accuracy metrics

To align with the selected task and dataset, the task force decided to use the ratio of correctly predicted “Paper” node topics in the validation set as the accuracy metric. The validation set comprises 0.5% of the 157 million labelled nodes, or roughly 788,000 nodes; evaluated in float32, the baseline accuracy is 72.86%. (Model weights are kept in float32, while embeddings are downcast to FP16 to save on storage and memory.)

As usual, MLCommons® provides a reasonable accuracy constraint for precisions lower than the baseline, set at 99% of the reference. However, because neighborhood sampling is embedded in the subgraph computation, some randomness had to be accounted for, so the task force decided to allow an additional 0.5% margin.
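As a worked example of what the 99% constraint implies (assuming, for illustration, that the extra 0.5% margin is applied as an absolute subtraction on top of it; the rules document defines the exact combination):

```python
reference_accuracy = 72.86              # float32 baseline accuracy, in percent
threshold = 0.99 * reference_accuracy   # 99% of the reference -> ~72.13%
with_margin = threshold - 0.5           # illustrative extra margin -> ~71.63%
print(round(threshold, 2), round(with_margin, 2))
```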

Performance metrics

In our inaugural release, the benchmark is only tested in the ‘Offline’ scenario, where the performance metric is throughput measured in ‘samples per second’, and per-sample latency is not a performance consideration. We acknowledge that for some use-cases of GNNs, such as recommenders and transportation applications like map data processing, latency can be an important metric to consider and we may expand to consider different scenarios in future rounds of MLPerf Inference.

Conclusion

To address the growing usage of graph neural networks for different applications, a task force investigated and recommended the creation of a GNN benchmark within MLPerf Inference v5.0. 

To represent the unique challenges associated with end-to-end deployment of GNNs, the task force selected a representative benchmark model, RGAT, running on the largest publicly available graph dataset at high precision. This benchmark emphasizes the different stages of the deployment process, such as representation of sufficient scope, distributed storage at scale, and efficient computation, and offers an opportunity for vendors to showcase efficient systems and optimizations for tackling these challenges. In the future we hope to expand the scope of the benchmark to include different datacenter deployment scenarios for real-time inference, such as online serving, which is increasingly finding applications in graph as well as recommender systems.

Details of the MLPerf Inference RGAT benchmark and reference implementation can be found here.

[1] The DGL neighborhood sampler has a known bug with seeding when the sampling is parallelized across CPUs.

A New Automotive Benchmark for MLPerf Inference v5.0 https://mlcommons.org/2025/04/auto-inference-v5/ Wed, 02 Apr 2025 14:58:55 +0000 MLCommons has developed a new automotive benchmark based on established industry methods such as PointPainting, DeepLabv3+, and the Waymo Open Dataset.

Intro

MLPerf® Inference v5.0 introduced a new automotive benchmark. From the many potential workloads related to autonomous driving, the automotive benchmark task force chose 3D object detection as the benchmark task. Based on input from experts in the autonomous driving space, the task force selected PointPainting as the benchmark model. PointPainting is a sensor-fusion technique proposed by Vora et al. in 2019 that works in conjunction with image-based semantic segmentation to classify points generated by a lidar system. The experts indicated that 3D object detection is the most useful compute-intensive task to benchmark. We selected PointPainting because it is reasonably accurate, fast for inference, and has representative components of classical and newer 3D detection models.

Model selection

Compared to other machine learning applications, progress in autonomous driving is relatively more cautious. New models must go through extended review to ensure safety and reliability. We wanted to choose a model that contains useful components from both more proven older models as well as newer state-of-the-art models used in ongoing research. We also wanted to pick a model that is representative of multiple sensor modalities while still favoring the most common case of using video/images for vision, and a sensor suite synergistic between lower-resolution, affordable lidar and higher-resolution cameras.

Figure 1: Overview of PointPainting using DeepLabv3+ for segmentation and PointPillars for 3D Detection. On the left, five camera images are segmented by Deeplabv3+. In the middle, the segmentation scores are then painted on the lidar point cloud. Finally, the painted points are passed to PointPillars to produce the final 3D bounding boxes.

PointPainting is composed of two models: an image segmentation model that is used to “paint” lidar points by concatenating segmentation scores to them, and a lidar object detection model that takes the painted points as inputs and produces 3D detection predictions. A high-level diagram is shown in Figure 1. We chose DeepLabv3+ with a ResNet-50 backbone as the segmentation model and PointPillars as the lidar detector. Both are popular for their respective tasks and are fast enough for real-time applications. Additionally, with these model choices, about 90% of the PointPainting workload is performed in the segmentation task. Favoring images in the workload makes the model more representative of the most common case of vision tasks while still incorporating lidar components.
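The "painting" step itself is conceptually simple, as the NumPy sketch below illustrates; the calibration matrix, single-camera setup, and bounds handling here are simplified assumptions, and the reference implementation handles all five WOD cameras plus points that project outside every image.

```python
import numpy as np

def paint_points(points_xyz: np.ndarray,
                 seg_scores: np.ndarray,
                 lidar_to_image: np.ndarray) -> np.ndarray:
    """Append per-pixel segmentation scores to each lidar point (single camera).

    points_xyz:     [N, 3] lidar points
    seg_scores:     [H, W, C] per-pixel class scores from the segmentation model
    lidar_to_image: [3, 4] projection matrix from lidar frame to image pixels
    """
    n = points_xyz.shape[0]
    homogeneous = np.hstack([points_xyz, np.ones((n, 1))])   # [N, 4]
    projected = homogeneous @ lidar_to_image.T               # [N, 3]
    u = (projected[:, 0] / projected[:, 2]).astype(int)
    v = (projected[:, 1] / projected[:, 2]).astype(int)

    h, w, _ = seg_scores.shape
    u = np.clip(u, 0, w - 1)   # crude bounds handling for the sketch
    v = np.clip(v, 0, h - 1)
    return np.hstack([points_xyz, seg_scores[v, u]])         # [N, 3 + C] painted points
```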

Dataset and task selection

Autonomous vehicles must perform many computational tasks. Our goal was to pick a task that is computationally intensive and representative of the self-driving tasks. To select one task to benchmark, we consulted with experts in the autonomous driving field. We had unanimous agreement that 3D object detection would be the most representative task. 3D tasks are inherently more computationally expensive than 2D tasks and often incorporate 2D models within them. 3D detection is also an important component of other tasks such as trajectory prediction.

We chose the Waymo Open Dataset (WOD) because it is a large, high-quality dataset. Importantly, we verified that MLPerf benchmarking is considered a non-commercial use under the terms of the WOD license. The WOD provides five camera images and one lidar point cloud per frame; the images have resolutions of 1920×1280 and 1920×886 pixels, and the data was captured in San Francisco, Mountain View, and Phoenix. The WOD provides annotations for 3D bounding boxes as well as video panoptic segmentation, which is needed for the segmentation network.


There is no public implementation of PointPainting based on the WOD, so we trained a model ourselves to determine an accuracy target and generate a reference checkpoint for the benchmark. We built on the open-source work of PointPillars, DeepLabv3+, PointPainting, and MMDetection3D to create our implementation of PointPainting. As a baseline, we trained PointPillars on the WOD. Our goal was not to produce the most accurate version of PointPillars but to obtain a model whose accuracy is representative of what can be achieved on the WOD. Our PointPillars baseline achieved a mean average precision (mAP) of 49.91 across three classes (pedestrians, cyclists, and vehicles).

Our trained PointPainting model is available to MLCommons® members by request here.

Performance metrics

MLPerf Inference’s single-stream scenario for edge systems represents a task that accepts a single input and measures the latency to produce the output. This scenario is the closest and most relevant for an automotive workload, where frames arrive from sensors at a fixed rate. We chose a 99.9% accuracy threshold, as automotive workloads are safety-critical. The 99.9th percentile latency metric reports the latency below which 99.9% of a submission’s inference samples complete, out of the required thousands of samples. These requirements balance the need for strong latency constraints with the desire to keep the benchmark runtime on the order of one day rather than weeks.
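As a small illustration of how this metric is computed, the sketch below takes a list of per-sample latencies from a single-stream run and reports the value that 99.9% of samples fall under. It is illustrative only and does not reproduce the official MLPerf LoadGen measurement code; the placeholder data is random.

```python
import numpy as np

def p999_latency_ms(latencies_ms):
    """Return the 99.9th percentile of per-sample latencies, in milliseconds."""
    return float(np.percentile(latencies_ms, 99.9))

# Placeholder data standing in for tens of thousands of measured samples.
latencies_ms = np.random.lognormal(mean=3.0, sigma=0.1, size=50_000)
print(f"99.9th percentile latency: {p999_latency_ms(latencies_ms):.2f} ms")
```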

Conclusion

The automotive benchmark task force followed a rigorous process for choosing and building an automotive workload for MLPerf Inference. We made careful decisions about the task, dataset, and model to ensure a robust and representative benchmark, and we determined accuracy targets and performance metrics that fit within MLPerf Inference while remaining useful for automotive tasks. MLCommons plans to continue developing automotive benchmarks across a suite of automotive-specific tasks. We welcome feedback from the community regarding this benchmark and future developments.

Additional technical contributors: Pablo Gonzalez, MLCommons; Anandhu Sooraj, MLCommons; Arjun Suresh, AMD (formerly at MLCommons).

MLPerf Inference v5.0 Advances Language Model Capabilities for GenAI https://mlcommons.org/2025/04/llm-inference-v5/ Wed, 02 Apr 2025 14:58:06 +0000 https://mlcommons.org/?p=2670 MLCommons debuts larger Llama 3.1 405B Instruct and faster Llama 2 Chat 70B interactive LLM benchmarks

The post MLPerf Inference v5.0 Advances Language Model Capabilities for GenAI appeared first on MLCommons.

]]>
Introduction

The Large Language Model (LLM) task force was convened to ensure that the MLPerf® Inference v5.0 benchmarks address the evolving demands of LLMs and low-latency serving in real-world applications. We focus on three key aspects: first, benchmarking the performance of inference platforms on emerging open LLMs; second, supporting the increased context and output sequence lengths required by retrieval-augmented generation (RAG), agentic AI, and advanced reasoning tasks; and third, achieving reasonable response latency to enable efficient, practical deployment in diverse scenarios. These new benchmarks provide a robust framework for evaluating and advancing large-scale LLM inference performance. In MLPerf Inference v5.0, we added two new benchmarks to the suite: Llama 3.1 405B Instruct and the Llama 2 Chat 70B interactive benchmark.

Llama 3.1 405B Instruct

The selection of Meta’s Llama 3.1 405B Instruct model reflects its popularity and alignment with industry requirements. Several features of Llama 3.1 405B are particularly notable in this context:

  • Long Context Processing: Its benchmark-proven ability to process 128K-token contexts with minimal accuracy degradation addresses the growing need for models to summarize, analyze, and synthesize information from expansive documents while retaining critical details.
  • Structured Data Extraction: Llama 3.1 405B exhibits superior performance on needle-in-a-haystack (NIAH) tasks that extract structured data (e.g., key-value pairs) from noisy or unstructured corpora.
  • Multi-Source Synthesis: Llama 3.1 405B has demonstrated an exceptional capacity for reconciling ambiguous information across sources, notably in literature analysis where it matches GPT-4’s accuracy.
  • Cross-Contextual Reasoning: The model’s design emphasizes linking distant concepts, making it ideal for tasks requiring multi-document synthesis and cross-referencing.

Collectively, these attributes positioned Llama 3.1 405B Instruct as the best candidate, aligning with MLPerf’s mission to benchmark models that power real-world AI systems amid escalating computational and contextual complexity.

Dataset and task selection

Llama 3.1 405B Instruct has a context window of 128,000 tokens, compared to Llama 2 70B’s 4,096 tokens. This expanded context window enables Llama 3.1 405B to handle longer conversations, process larger documents or code samples, and perform tasks requiring extended context, such as long-form text summarization.

The dataset for Llama 3.1 405B Instruct was curated to evaluate its capabilities in long-context comprehension, retrieval-augmented reasoning, and task-specific optimization. The MLCommons® task force integrated samples from a diverse mix of open datasets spanning long-document summarization (e.g., GovReport-Summary), needle-in-a-haystack retrieval, and document-based question answering.

This mix reflects real-world challenges such as synthesizing government reports, pinpointing critical data in cluttered contexts (needle-in-a-haystack), and executing precise document-based QA. With mean input/output context lengths of 9,400 and 680 tokens respectively, our dataset stresses the model’s ability to retain coherence and accuracy across extended sequences. By leveraging Llama 3.1 405B Instruct’s strengths, we achieved alignment between generated outputs and ground-truth references, validating its proficiency in parsing, contextual linking, and synthesizing insights from complex, lengthy inputs.

The Llama 3.1 405B Instruct benchmark employs task-specific accuracy metrics to evaluate performance across the dataset. For summarization tasks (e.g., GovReport-Summary), we adopted ROUGE-L, a widely recognized NLP metric that measures lexical overlap between generated summaries and ground-truth references. For precision-critical applications like needle-in-a-haystack retrieval and document-based QA, we utilized exact match (EM) scoring to ensure alignment with target answers, minimizing tolerance for errors in key-value extraction or factual responses. The accuracy threshold for closed division submissions is set to 99% of the FP16 reference.

Figure 1. Histogram of input and output (both ground truth and reference) token length distribution of the dataset for Llama 3.1 405B Instruct.
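As a rough illustration of the two kinds of accuracy checks described above, the snippet below computes ROUGE-L with the open-source rouge-score package and exact match with a plain string comparison. It is a minimal sketch, not the official MLPerf accuracy script, and the example strings are invented.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

def rouge_l(reference, generated):
    """Lexical-overlap score used for summarization tasks."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference, generated)["rougeL"].fmeasure

def exact_match(reference, generated):
    """Strict comparison used for retrieval and document-QA answers."""
    return reference.strip() == generated.strip()

print(rouge_l("The report recommends increased transit funding.",
              "The report recommends more funding for transit."))
print(exact_match("1994", " 1994 "))
```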

Performance metrics

The performance metrics chosen for the Llama 3.1 405B Instruct benchmark are the same as those used for the Llama 2 70B benchmark in MLPerf Inference v4.0: token throughput for the offline scenario and Time To First Token (TTFT) + Time Per Output Token (TPOT) for the server scenario. For more information, please refer to the “Measuring performance of the LLM” section of Llama 2 70B: An MLPerf Inference Benchmark for Large Language Models – MLCommons. To balance the demands of long-context processing with real-world usability, we set a 99th percentile TTFT of 6 seconds and a 99th percentile TPOT of 175ms. These thresholds reflect the computational challenges of deploying large models with long context, while maintaining reasonable responsiveness.
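For readers less familiar with these metrics, the snippet below shows one simple way to derive TTFT and TPOT from the timestamps of streamed tokens for a single query and compare them with the thresholds quoted above. It is a simplified illustration with invented timestamps, not the LoadGen measurement code.

```python
def ttft_and_tpot(request_time_s, token_times_s):
    """Compute Time To First Token and mean Time Per Output Token, in seconds."""
    ttft = token_times_s[0] - request_time_s
    # TPOT averages the gaps between tokens after the first one arrives.
    tpot = (token_times_s[-1] - token_times_s[0]) / max(len(token_times_s) - 1, 1)
    return ttft, tpot

ttft, tpot = ttft_and_tpot(0.0, [4.20, 4.35, 4.51, 4.66])
print(f"TTFT = {ttft:.2f} s (limit 6 s), TPOT = {tpot * 1000:.0f} ms (limit 175 ms)")
```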

Reference implementation

The reference implementation for Llama 3.1 405B Instruct leverages vLLM, a fast and versatile library for LLM inference and serving. This choice was driven by vLLM’s robust support for large language models and its compatibility with a diverse range of hardware, including NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi accelerators, and TPUs, making it well suited for deploying Llama 3.1 405B across different computing environments.

For readers interested in running the model themselves, we encourage them to follow the reference implementation, which contains code and instructions on how to run the entire benchmark end-to-end.
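As a rough sketch of what the vLLM-based path looks like, the snippet below runs offline generation through vLLM’s Python API. The model identifier, tensor-parallel degree, and sampling settings are placeholders chosen for illustration; the reference implementation linked above remains the authoritative way to run the benchmark, and a 405B model requires a large multi-GPU or multi-node setup in practice.

```python
from vllm import LLM, SamplingParams

# Placeholder model ID and parallelism degree, for illustration only.
llm = LLM(model="meta-llama/Llama-3.1-405B-Instruct", tensor_parallel_size=8)

# Greedy decoding capped near the dataset's mean output length quoted above.
sampling = SamplingParams(temperature=0.0, max_tokens=680)
prompts = ["Summarize the following government report: ..."]

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```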

Llama 2 Chat 70B Interactive Benchmark

The Llama 2 70B benchmark was introduced in MLPerf Inference v4.0 and has quickly become one of the most popular benchmarks in the suite. Server latencies were set at 2 s TTFT and 200 ms TPOT, according to the usage models and hardware capabilities of the time. As adoption of generative AI has skyrocketed, state-of-the-art model serving has advanced, and use cases such as reasoning models and agents are taking advantage of ever-lower latencies. To set target latencies that are more representative of current requirements, in late 2024 we performed a new analysis of industry research, user surveys, and performance data from leading platforms such as ChatGPT and Perplexity AI. Our findings indicated that a 50th percentile token generation rate of 20–50 tokens per second (a TPOT of 20–50 ms) is critical for a seamless user experience. To prioritize reliability under peak demand, MLPerf adopts a stricter 99th percentile threshold of 25 tokens/second (a TPOT of 40 ms), ensuring consistent responsiveness even during high-load scenarios. Additionally, we set a 99th percentile TTFT limit of 450 ms to minimize initial latency, aligning with user expectations for near-instantaneous query responses.
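Because the paragraph above switches between token generation rates and TPOT, the small helper below makes the conversion explicit, using the figures quoted in the text.

```python
def rate_to_tpot_ms(tokens_per_second):
    """Convert a token generation rate into time per output token (milliseconds)."""
    return 1000.0 / tokens_per_second

# 20-50 tokens/s corresponds to a TPOT of 50 ms down to 20 ms; the stricter
# 25 tokens/s threshold corresponds to the 40 ms TPOT limit adopted by MLPerf.
for rate in (20, 25, 50):
    print(f"{rate} tokens/s -> {rate_to_tpot_ms(rate):.0f} ms TPOT")
```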

We continued to use the OpenOrca dataset and ROUGE accuracy metrics because of their relevance to chatbot tasks: the dataset offers diverse conversational scenarios and high-quality annotations.

Conclusion

The usage of LLMs is expanding rapidly across a diversity of application domains, with demand for LLMs at different scales and response latencies. MLPerf is keeping pace with these trends by introducing a very-large-scale 405B benchmark and a low-latency 70B interactive benchmark in the new MLPerf Inference v5.0 release. In this post, we shared the motivation behind, and the complexities involved in, the choice of model, task, dataset, accuracy targets, performance metrics, and verification for these LLM benchmarks. We believe the design choices objectively address most of the critical deployment decisions faced by practitioners while delivering important performance insights to customers. With the addition of these benchmarks to MLPerf Inference, we now offer a comprehensive suite of language benchmarks spanning model scales (7B to 405B), architectures (including mixture-of-experts), and deployment scenarios (datacenter, edge, and low latency).

Details of the MLPerf Inference Llama 3.1 405B Instruct benchmark and reference implementation can be found here.

For more information on the Llama 2 70B benchmark, please refer to the “Choosing a task and a dataset for the LLM” section of Llama 2 70B: An MLPerf Inference Benchmark for Large Language Models – MLCommons.

Unlocking Data Collaboration with AI-Ready Licenses https://mlcommons.org/2025/03/unlocking-data-collab/ Mon, 17 Mar 2025 19:37:31 +0000 https://mlcommons.org/?p=2609 New research shows promising signs that adopting modular, standard data license agreements can clarify key terms, reduce transaction costs, and enable more accessible data use across borders and sectors.

The post Unlocking Data Collaboration with AI-Ready Licenses appeared first on MLCommons.

]]>
The rapid advancement of artificial intelligence (AI) is fueling the need for complex datasets containing high-quality text, images, audio, video, and specialized formats. Despite the growing interest in AI and its evident social impact, stakeholders face challenges in establishing reliable frameworks for responsibly sharing AI data. A significant hurdle is the absence of widely accepted, standardized contractual frameworks (i.e., data licenses) for this purpose.

Research funded through an unrestricted grant by the MLCommons Association and recently published by the Open Data Institute (ODI) and the Pratt School of Engineering at Duke University explored the motivations driving data sharing throughout the AI ecosystem. The goal was to inform the design of more fit-for-purpose AI data licenses. A blog published by the OECD.AI/GPAI also recently featured this work.  

Using a mixed-methods approach that combined interviews, an online survey, and a review of literature, this research captured a broad spectrum of insights from across eight jurisdictions and multiple sectors covering key roles such as data holders, data users, and intermediaries. Here, we give an overview of these research findings, demonstrating how standard data license agreements hold the potential to mitigate legal uncertainties, reduce transaction costs, and promote fair AI data access, particularly when combined with standard data governance and other tools.

The key findings below are organized around four themes: 1) leveraging new standard data licenses to promote greater data access and social good; 2) addressing ambiguities in current contractual approaches to data sharing; 3) keeping data licensing simple while balancing standardization with the need for some customization; and 4) developing a new standard data licensing framework to help address other data-sharing challenges, including legal compliance and ethical considerations.

Figure: Data licensing pathway steps.

1. New standard data licenses can promote greater responsible data access and AI for social good.

a) No standardized data licenses have been widely adopted, increasing data sharing challenges for under-represented and smaller organizations. The research indicates that the lack of widely adopted, standardized data licenses may create significant barriers to voluntary, legally compliant, and ethical data sharing. This gap might drive up costs and impose contractual burdens that particularly hinder smaller or under-represented organizations from fully benefiting from AI.

b) Data access disparities also impact the advancement of AI for social good. Limited cross-cultural and cross-linguistic data sharing can restrict access to high-quality datasets. The research shows that this shortfall might not only impede responsible AI development among research institutions, governments, and non-profits (especially in developing regions) but also diminish the diversity and richness of AI models.

c) Multi-stakeholder participation can help develop standard data licenses that advance equitable data access and AI for social good. According to the research, involving a diverse range of stakeholders (including representatives from different geographic regions, sectors, and groups that have traditionally faced challenges in the technology sector) appears to be essential. Such multi-stakeholder participation can lead to the development of standard licensing frameworks that more effectively address disparities and foster equitable data access for social good.

2. New standard data licenses can help address existing ambiguities and challenges with current contractual approaches to data sharing.

a) The lack of widely embraced standard data licenses leads to inconsistency and incompatibility and increases the challenges of using data in compliance with license terms. Through this research, it was confirmed that customized license agreements currently in use may lead to inconsistencies and incompatibilities, which in turn increase legal uncertainty and compliance costs. This fragmentation overall makes it more challenging for organizations to share data effectively across their ecosystems.

b) Confusion exists about the meaning of non-commercial limitations in data licenses, and clarification in new standard licenses might encourage more data-sharing transactions. This work highlights that the lack of definitions of “non-commercial” use in current licenses creates significant confusion. Clarifying this term within standard licenses could encourage more frequent and compliant data-sharing transactions.

c) New standard data licenses could clarify rights emerging from AI and provide different options for contractually allocating rights and liabilities. The research shows that commonly used existing licenses do not clearly address the allocation of rights for data usage (or for AI models and outputs developed using such data) resulting in uncertainties regarding rights and liabilities. Standard licenses that offer clearer contractual options, in line with applicable laws, could help resolve these issues.

d) New standard data licenses can help address attribution issues arising under existing agreements. Attribution requirements in widely used license frameworks can be impractical for AI applications, where data used in model training is not directly reproduced in outputs. The research therefore suggests that standardizing license terms and integrating technical solutions like automatic attribution may reduce this ambiguity and simplify data sharing.

3. New standard data licenses should embrace simplicity while balancing standardization with the need for some customization.

a) There is a pressing need for simplicity. Throughout, the research emphasized that drafting licenses in clear, easily interpretable language is critical for reducing transaction costs and building trust, especially for non-legal practitioners.

b) Standard data licensing frameworks should balance the need for standardization with some customization, given the many data types, use cases, models, tasks, and legal requirements. The research also indicates that while standardization can streamline data sharing, licenses may also need to allow for some customization to accommodate the diversity of data types, use cases, AI models, and jurisdiction-specific legal requirements across the AI data lifecycle.

c) Standard modular data license terms (and/or a menu of standard data license agreements) could balance standardization and some customization. Therefore, the research suggests that a modular framework or a menu of license options (developed through a collaborative, multi-stakeholder process) could effectively balance the need for uniformity with the necessity for customization to address varied stakeholder needs.

d) Standard definitions can advance data licensing. The research points to the potential benefits of developing and adopting standard definitions for key terms. Such definitions can further simplify license agreements and help maintain an effective balance between standardization and necessary customization.

4. A new standard data licensing framework may help address other data-sharing challenges, including legal compliance, data and AI governance, and ethical considerations.

a) Standard data license agreements could help address other challenges in implementing voluntary, legally compliant data sharing, including by helping advance data and AI governance. Technical tools for tracking data provenance and lineage, and for easily processing the terms of standard licenses, can enhance transparency and reduce compliance costs; the research shows this can support stronger data and AI governance practices. For instance, MLCommons’ Croissant initiative (a metadata format for ML datasets) simplifies discovery and integration by embedding legal and compliance measures into the AI pipeline, reducing uncertainty in data-sharing exchanges by representing legal and contractual frameworks as metadata (see the sketch after this list). Complementary tools, such as automatic attribution services in Apache Atlas, can further ensure that license-defined rights are easily represented and therefore upheld.

b) Standard data licenses may help address cross-border and other compliance challenges. The diversity of data-related regulations across jurisdictions (e.g., data protection) continues to create substantial barriers to data sharing. To ease compliance, standard data license terms should be prepared with these legal matters in mind. 

c) Standard data licenses may help advance ethical guidelines for data usage. The research confirmed interest in placing ethical restrictions on data and AI model usage in at least some scenarios, and a new standard data licensing framework could reference relevant ethical codes of conduct. Some concern was expressed about incorporating codes of conduct or standards into standard data licenses, since these can change over time and could increase uncertainty in contractual terms and the costs of compliance. These factors should be considered in the global, multi-stakeholder process developed to prepare standard licenses.
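As a small illustration of the machine-readable licensing information mentioned in finding 4(a), the snippet below parses a minimal, hypothetical Croissant-style JSON-LD description with Python’s standard library and inspects its license field. The field contents are invented and this is not official Croissant tooling; Croissant builds on schema.org/Dataset, which carries the license as a top-level property.

```python
import json

# A minimal, hypothetical Croissant-style dataset description.
croissant_jsonld = """
{
  "@context": {"@vocab": "https://schema.org/"},
  "@type": "Dataset",
  "name": "example-dataset",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "description": "Illustrative dataset entry with a machine-readable license."
}
"""

metadata = json.loads(croissant_jsonld)
print("Dataset:", metadata["name"])
print("License:", metadata["license"])

# Downstream tooling could check this field before admitting the dataset into
# a training or benchmarking pipeline.
```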

Looking forward

Overall, this work shows promising signs that adopting modular, standard data license agreements can clarify key terms, reduce transaction costs, and enable more accessible data use across borders and sectors. Additionally, it demonstrates that integrating external technical tools and frameworks like Apache Atlas and Croissant with these legal frameworks enhances data discoverability and usability. Together, these measures can help to build a transparent, interoperable ecosystem for responsible AI development. Here, we have the opportunity to shape an AI-driven future that is equitable, sustainable, and indeed serves everyone.

MedPerf enhances transparency and integrity of real-world benchmarks using smart contracts https://mlcommons.org/2025/03/medperf-smart-contracts/ Mon, 10 Mar 2025 20:27:32 +0000 https://mlcommons.org/?p=2558 Protection of digital assets on MedPerf via policy enforcement using Private Data Objects (PDO) 

The post MedPerf enhances transparency and integrity of real-world benchmarks using smart contracts appeared first on MLCommons.

]]>
When the MLCommons Medical AI working group set out to develop MedPerf, an open framework for benchmarking medical AI on real-world private datasets, transparency and privacy were the main technical objectives. The current implementation of MedPerf ensures this transparency and integrity through a formal benchmark committee: a governance body that oversees the entire benchmark process, from committee formation to reference implementation development, benchmark execution, and the dissemination of and access to benchmark results. Critical to MedPerf is its federated evaluation approach, which delivers a high level of data privacy by enforcing that no medical data is transferred to perform the benchmarks.

Figure 1. Current MedPerf benchmark workflow. Red circle shows current policy implementation.

The working group, in collaboration with technical member organization Intel and academic organization Notre Dame, is improving MedPerf with a new policy-enhanced workflow enforced by smart contracts. A smart contract is delivered via a trusted execution environment, such as an Intel Software Guard Extensions (SGX) enclave, to ensure stronger transparency and privacy. The new workflow protects digital assets by combining a confidential smart contract framework with private data objects, binding data policy enforcement to policy evaluation and isolated task execution. This new approach, demonstrated through a proof of concept (POC), delivers a higher level of data transparency, accountability, and traceability.

The importance of policies to protect digital assets 

Digital asset usage policies are of the utmost importance for protecting intellectual property (IP) such as data and AI models. Unfortunately, most existing solutions require sharing copies of digital assets under access controls and rely on centralized governance for policy enforcement. In practice, this may actually reduce transparency and increase the risk of intellectual property theft or leakage.

This risk becomes even more problematic in healthcare, where data and AI models may contain personally identifiable information as well as very expensive IP. Transparency for medical data is critical to maintaining the traceability and accountability of processes such as training, validation, and benchmarking. Any lack of transparency could be further magnified in decentralized settings such as data federations, where assets are hosted and transferred in a decentralized manner.

Trusted execution environment-based confidential smart contract frameworks with private data objects have proven useful for policy enforcement by binding policy evaluation to isolated task execution, protecting digital assets. Such frameworks can facilitate a high level of transparency, accountability, and traceability, but to date they have been largely unexplored in the medical AI space.

MedPerf smart contracts act as enforcement vehicles for healthcare dataset use policies

Current use policies for MedPerf datasets are implicitly hardcoded in the benchmark workflow: a data owner must execute any requested benchmark on their local data. This approach gives users full control of their data assets, but it lacks flexibility for customization and scalability, for example for decentralized deployments or a growing number of benchmark requests.

To enhance policies and provide broader flexibility and scalability in MedPerf, the working group built a trusted-execution-environment POC using a confidential smart contract framework. This approach delivers policy enforcement via a private data object, binding policy evaluation to isolated task execution to protect digital assets.

To initiate a smart contract, a MedPerf data owner creates a private data object contract that is linked to their original dataset through the dataset ID. The contract contains descriptive information about the dataset and acts as an interface to it. The data owner then securely deposits the dataset inside a digital-guardian-service trusted-execution-environment enclave, which can be hosted on premises or in the cloud. When a benchmark of a model on the dataset is requested, the contract and the digital guardian perform bi-directional attestation to verify each other’s integrity. This approach retains MedPerf’s federated evaluation: the dataset never leaves the secure domain of the enclave, and its use is bounded by policy enforcement.

Figure 2. Policy-enhanced benchmark workflow. Red circle shows smart-contract policy implementation.
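The pseudocode below sketches that sequence from the data owner’s and benchmark requester’s perspective. Every class and method name is invented for illustration and does not correspond to the actual MedPerf or Private Data Objects APIs; real attestation and enclave execution are, of course, far more involved.

```python
# Hypothetical sketch of the policy-enhanced workflow described above.

class Policy:
    def __init__(self, allowed_models):
        self.allowed_models = set(allowed_models)

    def allows(self, model_ref):
        return model_ref in self.allowed_models


class DatasetContract:
    """Private data object acting as an interface to a registered dataset."""
    def __init__(self, dataset_id, description, policy):
        self.dataset_id = dataset_id
        self.description = description
        self.policy = policy

    def attest(self, enclave):
        # Stand-in for real bi-directional remote attestation.
        return enclave.trusted


class GuardianEnclave:
    """Stand-in for the digital-guardian trusted-execution-environment service."""
    trusted = True

    def attest(self, contract):
        return contract.policy is not None

    def evaluate(self, model_ref, dataset_id):
        # The dataset never leaves the enclave; only benchmark results leave.
        return {"model": model_ref, "dataset": dataset_id, "metrics": "..."}


def run_benchmark(contract, enclave, model_ref):
    # Contract and guardian verify each other's integrity before anything runs.
    if not (contract.attest(enclave) and enclave.attest(contract)):
        raise PermissionError("attestation failed")
    # Policy evaluation is bound to the isolated task execution.
    if not contract.policy.allows(model_ref):
        raise PermissionError("benchmark request denied by dataset policy")
    return enclave.evaluate(model_ref, contract.dataset_id)


contract = DatasetContract("dataset-123", "de-identified chest CT scans", Policy({"model-A"}))
print(run_benchmark(contract, GuardianEnclave(), "model-A"))
```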

Enhanced benchmark workflow benefits

The new policy-enhanced workflow gives more power to the owners of digital assets, such as healthcare datasets and models, in benchmarking workflows. The POC validated the use of embedded smart contract policies in MedPerf and confirmed that digital asset owners retain full control of their assets.

This new approach increases the utility of, and the potential rewards from, benchmarks by supporting integrity and enabling transparency in the execution cycle.

To learn more about how the policy is implemented, read the Technical Update.

The Medical AI working group calls on the broad community of academics, technologists, regulatory scientists, healthcare experts, and others to help with roadmapping and to contribute to impactful benchmarking technologies by joining the MLCommons Medical AI working group.
