News Archives - MLCommons
https://mlcommons.org/category/news/

Croissant Gains Momentum within the Data Community
https://mlcommons.org/2025/02/croissant-qa-community/ (published February 12, 2025)
Croissant working group co-chairs share insights on the success of Croissant, what’s next, and how to join the effort.

Since its introduction in March 2024, the Croissant metadata format for ML-ready datasets has quickly gained momentum within the data community, and it was recently featured at the NeurIPS conference in December 2024. Croissant is supported by major ML dataset repositories such as Kaggle and Hugging Face, and over 700,000 Croissant datasets are accessible and searchable via Google Dataset Search. Croissant datasets can be loaded automatically to train ML models using popular toolkits such as JAX and PyTorch.
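
To make the loading story concrete, here is a minimal, hypothetical Python sketch using the open-source mlcroissant library; the dataset URL and record-set name are placeholders, and the exact API may differ across library versions, so treat it as an illustration rather than official MLCommons sample code.

```python
# Minimal sketch of loading a Croissant dataset with the mlcroissant library.
# The JSON-LD URL and record-set name below are placeholders.
import mlcroissant as mlc

# Any repository that serves Croissant JSON-LD for its datasets can be used here.
ds = mlc.Dataset(jsonld="https://example.org/my-dataset/croissant.json")

print(ds.metadata.name)  # dataset name read from the Croissant metadata
for record in ds.records(record_set="default"):  # record-set names are dataset-specific
    print(record)  # each record maps field names to values
    break
```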

MLCommons’ Croissant working group co-chairs, Omar Benjelloun and Elena Simperl, sat down to share their thoughts on Croissant’s initial popularity, the approach the working group is taking to develop it, and next steps.

[MLC] Why do you think Croissant has become so popular?

[ES] We’re delighted to see it being discussed in many contexts, beyond AI and machine learning research. I think we’re solving a problem that is easy to recognize, a problem that many practitioners who work with machine learning have, but one whose solution requires a range of skills that traditionally haven’t been that important in that community. Those techniques have to do with how you structure and organize data – data used by machine learning systems or any other type of data – in a way that makes it easy for that data to be shared across applications, to be found and reused. That’s a topic that Omar and I and other people in the working group know a lot about. And with Croissant, we are applying these techniques, which have been around for 10-15 years, to a new use case: machine learning data work. And I suppose the magic happened when we brought together all the relevant expertise in one working group to deliver something that ticks lots of boxes and heals many pains.

[MLC] What was the approach you took to developing Croissant?

[OB] We took an approach that was very concrete. A lot of this metadata standards work happens in a way where some experts get in a working group and they create what they think is the perfect standard. And then they try to convince others to adopt it and use it in their products.

Instead, from the start we said, “Okay, let’s define the standards at the same time and in close collaboration with the tools and systems that are going to use it.” So we worked with the repositories of machine learning datasets, like Kaggle and Hugging Face and OpenML initially, to implement support for this format as we were defining it, as well as to participate in the definition of the format. That’s on the “supply side” of datasets. On the “demand side,” we worked with the teams that develop tools for loading datasets for machine learning to implement support for loading Croissant datasets. And they were also part of the definition of the format.

So getting these people to work together helped us to create something which was not, “Here’s a specification. We think it’s going to make things better. Now, please go and implement this so that we have some data on one side and some tools on the other side.” Instead we could say, “Here’s the specification and here are some datasets that are already available in this format because these repositories support it. And here are some tools that you can use to work with these datasets.” And of course, it’s just the beginning. There are more repositories and datasets that should be available in this format. And there are a lot more tools that should add support for Croissant so that ML developers have what they need to work with this data. But the initial release of Croissant was a strong start with a useful, concrete proposition. And I think this is part of what made it so successful.

[MLC] Do you consider Croissant a “standard” now?

[ES] Personally, I have mixed feelings about the word “standard.” Sometimes when I hear someone talking about standards, I think about those lengthy, tedious, and to some degree inaccessible processes that are run in standardization committees, where there is a lot of work that is put into coming to a proposal, with lots of different stakeholders, with many competing interests to agree upon. And there’s hundreds and hundreds and hundreds of pages explaining what someone needs to do to use the standard in the intended way. And then there’s some implementations. And then if you’re an academic researcher in a small lab or a startup who doesn’t have the resources to be part of this exercise, then you feel left out. 

Having said that, standards, once they are agreed, and if there is critical mass in adoption, can be very, very powerful, because they do reduce those frictions of doing business. So if I am a startup that is creating a new type of machine learning functionality, for example to manage workflows or to do some sort of fair machine learning or any specialist functionality that the community needs, then having a standard to be able to process data that has been created and processed using any other machine learning tool is hugely valuable. But I love what Omar said: we’re following a slightly different approach. We really want to release everything to the community. We want to build everything with that adoption pathway in mind so that we put the results in the hands of practitioners as quickly as possible and in the most accessible way possible.

The other thing that I would say is that even if a competing proposal emerges, the Croissant vocabulary, and the technology it is built with, make it quite easy to use both. There are many vocabularies that describe datasets, and something like Google Dataset Search can work with all of them because of the technologies that are used to do that. But of course, if you just look at the number of datasets that use this or any other machine-readable vocabulary that may share similar aims, at the moment we seem to have an advantage. And the interest in adopting Croissant from more and more platforms is still there, almost a year after the launch of the initial version.

[OB] We released the first version of the Croissant format, but I don’t consider it to be a standard yet. I prefer to say “community-driven specification”. And we’re not afraid to say Croissant will evolve and continue to change. Once we have a critical mass of data platforms and tools that have adopted it, use it, and make their users happy thanks to Croissant, then it will feel stable and we can say, “we don’t see this changing much from here.” At that point, declaring it a standard makes sense. And of course, all the companies, startups and university labs that build tools and participate in the community will have contributed to that standard through their work.

[MLC] How important is the support from industry?

[ES] Speaking as someone who is in academia, I think industry support is very important. I am personally very happy with the diversity of industry participants in the group. We have OpenML, which is an open source initiative driven among others by academics, but it is a project with lots of contributors. We have obviously Google, we have participants from Meta as well, so household names in the AI space. Then we have Hugging Face and Kaggle, which have huge communities behind them. And we have NASA as well, which brings in a very interesting use case around geospatial datasets and their use in machine learning. Going forward I’d love for the working group to attract more industries beyond tech, that perhaps have their own machine learning use cases and their own enterprise data that they want to prepare for machine learning usage. And in that sort of scenario, if you think about a big company and the types of data management problems they have, having something like Croissant would be very valuable as well.

We have a good initial proposal for what Croissant can be, how it should be. But equally, we also have a very long list of things that we need to work on this year to get to that stable version that you could perhaps call a standard. And once you reach that, then you can actually make a case to any enterprise to explore this internally.

[MLC] What are your goals for expanding participation in the working group and in finding new partners?

[OB] First of all, this is an open group, and everyone is welcome! We would love to see more companies involved that develop tools used by machine learning practitioners. This speaks to the promise of Croissant: once all datasets are described and edited in the same format, any tool that understands that format can work with any of those datasets. So if the tools that provide functionality around loading data for training ML models, fine-tuning, or evaluating models, or around data labeling or data analysis for machine learning, look at Croissant and consider supporting it (which does not require a huge amount of effort because it’s not a very complex format), then that will really make Croissant a lot more useful to machine learning practitioners. A goal for us this year is to get more tool vendors to come and join the Croissant effort and add support for it. Any vendor, no matter what size, that does AI data work, whether that’s data augmentation, data collection (maybe responsible data collection), data labeling, or data debiasing, any type of data-related activity that makes a dataset more ready to be used for machine learning, would add a lot of value to what we do right now. And it will also bring in new perspectives and new pathways to adoption.

[ES] And would using Croissant be valuable for them? Absolutely. I think it was a researcher from Google who said, “No one wants to do the data work in AI,” a few years ago, and it still rings true. The vendors of data-related products solve some data challenges, but at the same time they also struggle with the fact that many people still think, “I’m just going to find the data somehow, and I’m just going to try to get to training my model as quickly as possible.” Croissant and these vendors are saying the same thing: data is important, we need to pay attention to it. We have very similar aims, and we’re providing these developers with a data portability solution. We’re helping them to interoperate with a range of other tools in the ecosystem. And that’s an advantage, especially when you’re starting out. 

[MLC] Have there been any surprises for you in how the community has been using Croissant?

[ES] We see a lot of excitement around use cases that we hadn’t really thought about so much. For instance, anything related to responsible AI, including AI safety, is a huge area that is looking for a whole range of solutions, some of them technical, some of them socio-technical. We’ve seen a lot of interest in using Croissant in many scenarios where there is a need to operationalize responsible AI practices. And to some degree, the challenge that we’re facing is to decide which of those use cases are really part of the core roadmap, and which ones are best served by communities that have that sectorial expertise.

We’ve also seen a lot of interest from the open science community. They’ve been working on practices and solutions for scientists to share not just their data, but their workflows, in other words, the protocols, experiment designs and other ways to do science. Scientific datasets are now increasingly used with machine learning, so we’ve had lots of interest, and there are conversations on how to take advantage of those decades of experience that they bring in.

[MLC] How was Croissant received at the NeurIPS 2024 conference?

[OB] We presented a poster at NeurIPS that opened a lot of conversations. We also conducted what turned out to be perhaps the most interesting community engagement activity of the year for us. We asked authors in the conference’s Datasets and Benchmarks track to submit their datasets with a Croissant metadata description. It was not a strict requirement; it was more of a recommended thing for them to do. I think this was great for Croissant as a budding metadata format to get that exposure and real-life testing. We learned a lot from that experience about the details in our spec that were not clear or things that were not explained as well as they should have been. And we also got real-world feedback on the tools that we built, including an initial metadata editor with which users can create the Croissant metadata for their datasets. It works in some cases, but we also knew that it was not feature-complete, so users ran into issues using it. For instance, researchers often create bespoke datasets for their own research problems. And they asked us, “I have my custom binary format that I created for this; how do I write Croissant for it? It’s not CSV or images or text.” And so it pushed us into facing some of Croissant’s limits. This experience also taught us the importance of good creation tools for users to prepare their datasets, because nobody wants to write a JSON file that describes their dataset by hand. I think for us it’s a strong incentive, as a community, to invest in building good editing tools that can also connect to platforms like Hugging Face or Kaggle so that users can publish their datasets on those platforms. So this is something that we learned from this interaction – and at the next conference, the experience will be much smoother for the authors who contribute to this.

[MLC] What does the roadmap for Croissant look like moving forward?

[ES] In general, we are working on extending the metadata format so that we can capture more information. There are a few features that we’ve heard over and over again would be very important. For instance, anything related to the provenance of the data. We’re also working on tools, as Omar mentioned, and that includes technical tools like editors, but also tutorials and accessible material. 

In terms of what I’d like to see personally, as I was saying earlier, we’d love to have more contributors who are working in companies or on some sort of data work, for instance data labeling. And a topic that we haven’t really touched on is the quality of that metadata. Croissant can solve many problems, but that assumes that the metadata is going to be of good quality, detailed, and complete. That will almost certainly require some manual effort: from people who publish the data and want to describe it, and maybe even from people who use the data and have gained experience with it. So, we’d love to have more datasets featuring excellent Croissant metadata. That will require a campaign to involve many users of, say, the 100 most popular machine learning datasets to come together, join efforts, and create those excellent Croissant records.

[OB] We also have a new effort to describe not just datasets, but also machine learning tasks that use those datasets, so that we go to the next level of interoperability and describe things like machine learning evaluation benchmarks, or even competitions and things like that, and how they relate to datasets. That’s also a very exciting effort that is happening in the working group.

[ES] It’s quite an ambitious agenda – we’ll be busy this year and we invite others to join the working group to continue to shape the future of Croissant!

We invite others to join the Croissant Working Group, contribute to the GitHub repository, and stay informed about the latest updates. You can download the Croissant Editor to implement the Croissant vocabulary on your existing datasets today! Together, we can reduce the data development burden and enable a richer ecosystem of AI and ML research and development.

MLCommons Releases AILuminate LLM v1.1, Adding French Language Capabilities to Industry-Leading AI Safety Benchmark
https://mlcommons.org/2025/02/ailumiate-v1-1-fr/ (published February 11, 2025)
The announcement comes as AI experts gather at the Paris AI Action Summit and is the first of several AILuminate updates to be released in 2025.

Paris, France – February 11, 2025. MLCommons, in partnership with the AI Verify Foundation, today released v1.1 of AILuminate, incorporating new French language capabilities into its first-of-its-kind AI safety benchmark. The new update – which was announced at the Paris AI Action Summit – marks the next step towards a global standard for AI safety and comes as AI purchasers across the globe seek to evaluate and limit product risk in an emerging regulatory landscape. Like its v1.0 predecessor, the French-language v1.1 benchmark was developed collaboratively by AI researchers and industry experts, ensuring a trusted, rigorous analysis of chatbot risk that can be immediately incorporated into company decision-making.

“Companies around the world are increasingly incorporating AI in their products, but they have no common, trusted means of comparing model risk,” said Rebecca Weiss, Executive Director of MLCommons. “By expanding AILuminate’s language capabilities, we are ensuring that global AI developers and purchasers have access to the type of independent, rigorous benchmarking proven to reduce product risk and increase industry safety.”

Like the English v1.0, the French v1.1 version of AILuminate assesses LLM responses to over 24,000 French-language test prompts across twelve categories of hazards – including violent crime, hate, and privacy. Unlike many peer benchmarks, none of the LLMs evaluated are given advance access to the specific evaluation prompts or to the evaluator model. This ensures a methodological rigor uncommon in standard academic research and an empirical analysis that can be trusted by industry and academia alike.

“Building safe and reliable AI is a global problem – and we all have an interest in coordinating on our approach,” said Peter Mattson, Founder and President of MLCommons. “Today’s release marks our commitment to championing a solution to AI safety that’s global by design and is a first step toward evaluating safety concerns across diverse languages, cultures, and value systems.”

The AILuminate benchmark was developed by the MLCommons AI Risk and Reliability working group, a team of leading AI researchers from institutions including Stanford University, Columbia University, and TU Eindhoven, civil society representatives, and technical experts from Google, Intel, NVIDIA, Microsoft, Qualcomm Technologies, Inc., and other industry giants committed to a standardized approach to AI safety. Cognizant that AI safety requires a coordinated global approach, MLCommons also collaborated with international organizations such as the AI Verify Foundation to design the AILuminate benchmark.

“MLCommons’ work in pushing the industry toward a global safety standard is more important now than ever,” said Nicolas Miailhe, Founder and CEO of PRISM Eval. “PRISM is proud to support this work with our latest Behavior Elicitation Technology (BET), and we look forward to continuing to collaborate on this important trust-building effort – in France and beyond.”

Currently available in English and French, AILuminate will be made available in Chinese and Hindi later this year. For more information on MLCommons and the AILuminate Benchmark, please visit mlcommons.org.

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued to use collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve the accuracy, safety, speed, and efficiency of AI technologies.

Unsupervised People’s Speech: A Massive Multilingual Audio Dataset
https://mlcommons.org/2025/01/new-unsupervised-peoples-speech/ (published January 30, 2025)
The MLCommons Dataset working group announces the release of the Unsupervised People’s Speech dataset, built in collaboration with Hugging Face and containing over 1 million hours of audio spanning dozens of languages.

Introducing the Unsupervised People’s Speech dataset

The MLCommons Dataset working group is pleased to announce the release of the Unsupervised People’s Speech dataset. Built in collaboration with Hugging Face, the Unsupervised People’s Speech dataset contains over 1 million hours of audio spanning dozens of languages. Given the impact the previously released Supervised People’s Speech dataset had on models such as Whisper, we expect this new dataset to drive innovation across different tasks, including self-supervised approaches as well as improvements to automatic speech recognition pipelines in numerous languages.

The Rationale Behind the Dataset

As the field of speech technology continues to advance, the need for large, diverse audio datasets has become increasingly critical. The Unsupervised People’s Speech dataset aims to address this need by providing a vast collection of multilingual audio data to support research and development in various areas of speech technology. Supporting broader Natural Language Processing (NLP) research for languages other than English helps bring communication technologies to more people globally, including those speaking low-resource languages.

Ensuring Useful Data

To ensure the dataset is as useful as possible, the MLCommons Datasets working group ran several data pipelines to understand its contents across dimensions such as language distribution and speech detection.

  • Speech detection: the working group created a custom data loader that resampled and converted the audio to a single channel (available in the Hugging Face dataset card), then ran Silero’s Voice Activity Detection (VAD) pipeline, a model that uses a multi-head attention mechanism with STFT features. This pass identified a total of more than 821,412 hours of speech (a minimal sketch of this kind of VAD pass appears after this list).
  • Language identification: language identification is often the first step in a series of NLP tasks, so it’s crucial to get it right. The working group used NVIDIA’s TensorRT-LLM implementation of Whisper Large v3 to run inference on the subset of the dataset in which the speech detection pipeline found speech utterances. This pipeline identified a total of 89 languages. We are certain there are more: the third most common label the model produced was a “no speech” tag, meaning the model could not determine the language for those segments.
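
As a rough illustration of the speech-detection step described above, here is a minimal, hypothetical sketch using the publicly available Silero VAD model via torch.hub; the audio file path is a placeholder, and this is not the working group’s production pipeline.

```python
# Hypothetical sketch of a voice-activity-detection pass with Silero VAD.
# Not the working group's production pipeline; the file path is a placeholder.
import torch
import torchaudio

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(get_speech_timestamps, _, read_audio, _, _) = utils

waveform, sr = torchaudio.load("example_clip.wav")               # placeholder file
waveform = torch.mean(waveform, dim=0)                            # downmix to mono
waveform = torchaudio.functional.resample(waveform, sr, 16000)    # Silero expects 16 kHz

speech = get_speech_timestamps(waveform, model, sampling_rate=16000)
total_seconds = sum(seg["end"] - seg["start"] for seg in speech) / 16000
print(f"Detected {total_seconds:.1f} s of speech in example_clip.wav")
```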

Technical Hurdles

While the focus of the Unsupervised People’s Speech dataset is on its potential applications, it’s worth noting some of the technical challenges the working group overcame:

  1. Data Upload and Storage: we developed a custom script utilizing a Git LFS backend to efficiently upload the 48+ TB dataset to S3 and Hugging Face, overcoming typical speed limitations.
  2. Self-Supervised Learning Potential: we’ve included a training pipeline to train a Wav2Vec model, which could unlock new possibilities in unsupervised speech representation learning.
  3. Deduplication: we are in the process of creating embedding representations of the entire dataset using Meta’s EnCodec, which will allow us to identify and remove duplicate content and thereby ensure the dataset’s uniqueness and quality (a minimal sketch of extracting such representations follows this list).
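
To make the deduplication idea concrete, here is a minimal, hypothetical sketch of extracting an EnCodec-based representation for a single clip with the Hugging Face transformers implementation of Meta’s EnCodec; the checkpoint, file path, and naive pooling step are assumptions for illustration, not the working group’s actual deduplication code.

```python
# Hypothetical sketch: turn an audio clip into a compact EnCodec-based representation
# that could be compared across clips to flag near-duplicates.
# Checkpoint, file path, and pooling strategy are placeholders.
import torch
import torchaudio
from transformers import AutoProcessor, EncodecModel

processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
model = EncodecModel.from_pretrained("facebook/encodec_24khz")

waveform, sr = torchaudio.load("example_clip.wav")                       # placeholder file
waveform = torchaudio.functional.resample(waveform.mean(dim=0), sr, 24000)

inputs = processor(raw_audio=waveform.numpy(), sampling_rate=24000, return_tensors="pt")
with torch.no_grad():
    codes = model.encode(inputs["input_values"]).audio_codes

# Crude pooling for illustration only: average the discrete codes into one vector
# per clip, so near-duplicate clips land close together under cosine similarity.
embedding = codes.float().mean(dim=(0, 1, 3))
print(embedding.shape)
```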

Future Work

The Unsupervised People’s Speech dataset is built from audio data on Archive.org that is either public domain or available with CC-BY or CC-BY-SA licenses. As part of our commitment to updating and maintaining the dataset, we are also releasing the software as open-source to empower the community to build on our contributions.

As the working group completes the deduplication and self-supervised efforts, we anticipate several avenues for the research community to continue to build and develop, especially in the areas of improving low-resource language speech models, enhanced speech recognition across different accents and dialects, and novel applications in speech synthesis. 

Join Us

There are many ways to get involved in the MLCommons data effort. We are currently seeking speakers of over 130 languages to create a benchmark for a text language identification task. If this sounds interesting, head to Dynabench’s text classification task.  

If you want to be part of the Datasets working group or any other working group at MLCommons, please visit the Get Involved page. We meet weekly to share ideas and implement projects. If you want to read more about the MLCommons Unsupervised People’s Speech dataset, visit https://mlcommons.org/datasets/unsupervised-peoples-speech/

MLCommons Medical Working Group Co-authors Book Chapter on Collaborative Evaluation for Medical Imaging
https://mlcommons.org/2025/01/mlc-medical-collab-eval-medical-imaging/ (published January 27, 2025)
Authors explore best practices for surmounting organizational, technical, and logistical challenges.

Several members of the MLCommons® Medical working group have co-authored a chapter for a new book on AI in medical imaging. The chapter, “Collaborative evaluation for performance assessment of medical imaging applications” appears in the book Trustworthy AI in Medical Imaging, part of a book series produced by the Medical Image Computing and Computer Assisted Intervention Society (MICCAI) and Elsevier. It provides an introduction to the concept of collaborative evaluation: how healthcare stakeholders form a community to jointly evaluate AI medical imaging systems.

Collaborative Evaluation: organizing to overcome barriers to evaluating medical AI applications

AI has been used in medical imaging for a while, but its adoption in clinical settings has been slow due to a lack of thorough and robust evaluation enabling trust by healthcare stakeholders. The rise of “generative AI”, or simply “GenAI”, in this field makes comprehensive evaluation even more crucial. This requires a large, diverse set of test data, which no single organization usually has. Therefore, organizations need to pool their data, but privacy and security regulations can make this difficult. A collaborative structure can help overcome these barriers.

Such a collaborative structure can organize essential evaluation activities like defining data specifications, preparing data, conducting evaluations, specifying evaluation metrics, and publishing results. All of these steps should happen under a clear governance structure, with a high level of integrity and transparency enforced by open technology, to earn stakeholders’ trust and sustain collaboration. Teams can decide how much to automate and whether to centralize or distribute these tasks. For example, a method called federated evaluation can be used to avoid data sharing entirely, thus preserving data ownership, at the cost of more automation to handle the logistics.
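
To illustrate the federated evaluation idea in its simplest form, here is a minimal, hypothetical Python sketch in which each site scores a model on its own private data and only aggregate metrics leave the site; it is a conceptual toy, not the chapter’s or any COFE component’s implementation.

```python
# Conceptual toy example of federated evaluation: raw patient data never leaves
# a site; only per-site summary metrics are shared with the coordinator.
from typing import Callable, Dict, List, Tuple

Example = Tuple[list, int]  # (input features, ground-truth label) -- placeholder types

def evaluate_locally(model: Callable[[list], int], local_data: List[Example]) -> Dict[str, float]:
    """Run the model on one site's private data and return only summary metrics."""
    correct = sum(1 for x, y in local_data if model(x) == y)
    return {"accuracy": correct / len(local_data), "n": len(local_data)}

def federated_evaluation(model, sites: Dict[str, List[Example]]) -> float:
    """Coordinator: collect per-site metrics and compute a sample-weighted average."""
    reports = {name: evaluate_locally(model, data) for name, data in sites.items()}
    total = sum(r["n"] for r in reports.values())
    return sum(r["accuracy"] * r["n"] for r in reports.values()) / total

# Usage with a trivial stand-in model and synthetic "site" data:
model = lambda x: int(sum(x) > 0)
sites = {"hospital_a": [([1, 2], 1), ([-3, 1], 0)], "hospital_b": [([0, 5], 1)]}
print(f"Pooled accuracy: {federated_evaluation(model, sites):.2f}")
```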

The book chapter lays out a typical workflow for a collaborative evaluation process, and includes a case study of its implementation. It also delves into some of the key considerations in designing and managing collaborative evaluation, including:

  • Orchestration and scalability: how to design, provision and coordinate the evaluation process efficiently, across multiple sites;
  • Security and data/model protection: validating that the key resources are secure, free from tampering, and compliant with regulatory requirements for medical data;
  • Integrity and transparency: ensuring that the evaluation process is secure, confidential, and free from tampering, for accountability and traceability purposes;
  • Data quality: avoiding “garbage in, garbage out” issues by ensuring that the data used to evaluate AI imaging systems is of consistent, high quality.

The chapter closes by touching on some of the challenges and opportunities that lie ahead for collaborative evaluation. It discusses issues around sustainability: given the high cost of acquiring and preparing both data resources and AI technology systems, it is critical to ensure that stakeholders have sufficient financial incentives to continue participating. It also speaks to the need to “democratize” collaborative evaluation so that smaller industry stakeholders can join, contribute, and benefit; and the need to ensure that collaborative evaluation aligns with clinical workflow processes.

Order the book today

More information about the book Trustworthy AI in Medical Imaging, published by Elsevier, can be found here. Our chapter “Collaborative evaluation for performance assessment of medical imaging applications” is available to download here.

MLCommons Releases AILuminate Creative Commons DEMO Benchmark Prompt Dataset to Github
https://mlcommons.org/2025/01/ailuminate-demo-benchmark-prompt-dataset/ (published January 15, 2025)
Set of 1,200 DEMO prompts exercises the breadth of hazards for generative AI systems.

Today, MLCommons® publicly released the AILuminate DEMO prompt dataset, a collection of 1,200 prompts for generative AI systems to benchmark safety against twelve categories of hazards. The DEMO prompt set is a component of the full AILuminate Practice Test dataset.

The dataset is available at: https://github.com/mlcommons/ailuminate.

The AILuminate benchmark, designed and developed by the MLCommons AI Risk and Reliability working group, assesses LLM responses to over 24,000 test prompts across twelve categories of hazards that users of generative AI systems may encounter. It also includes a best-in-class evaluation system using a tuned ensemble of safety evaluation models, and public reports of 13 systems-under-test (SUTs) using the AILuminate v1.0 grading scale to assess the quantitative and qualitative performance of the SUTs in each hazard category.

The full 24,000 test prompts are broken out into the following datasets:

  • A Practice Test dataset with a standard, well-documented set of prompts that can be used for in-house testing, as examples of specific hazards, and as seeds and inspiration for the creation of other prompts. The Creative Commons DEMO set being released today is a 10% subset of the Practice dataset. Access to the full Practice dataset is available upon request from MLCommons.
  • An Official Test dataset, used for an official benchmark of a System Under Test (SUT), which can be a single LLM or an ensemble. The Official Test is run by MLCommons, using the official AILuminate independent evaluator, to provide a detailed set of performance scores measured against a standard baseline. The Official Test dataset is private to avoid SUTs “gaming” the benchmark by training on the specific test prompts. The Practice Test and Official Test datasets are designed to be statistically equivalent in their coverage and testing of the taxonomy of hazards. To run an Official Test, contact MLCommons.

The 12 hazard categories included in the test prompts are extensively documented in the AILuminate Assessment Standard.

All prompts are currently in English, with plans to expand to French, Chinese and Hindi. MLCommons intends to update the prompt datasets, including the Creative Commons DEMO dataset, on a regular basis as the community’s understanding of AI hazards evolves – for example, as agentic behavior is developed in AI systems. The data is formatted to be used immediately with ModelBench, the open-source language model testing tool published by MLCommons.
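
As a rough illustration of how a team might begin exploring the DEMO prompt dataset, here is a minimal, hypothetical Python sketch that loads the file from the GitHub repository and tallies prompts per hazard category; the file name and column names are assumptions that may differ from the published dataset, so check the repository’s README first.

```python
# Hypothetical sketch: load the AILuminate DEMO prompt dataset and count prompts
# per hazard category. The file name and column names below are assumptions;
# consult https://github.com/mlcommons/ailuminate for the actual layout.
import pandas as pd

DEMO_CSV = (
    "https://raw.githubusercontent.com/mlcommons/ailuminate/main/"
    "demo_prompt_set.csv"  # placeholder file name
)

df = pd.read_csv(DEMO_CSV)
print(f"{len(df)} prompts loaded")
print(df.columns.tolist())              # inspect the actual column names first
# Assuming a column identifying the hazard category exists, e.g. "hazard":
if "hazard" in df.columns:
    print(df["hazard"].value_counts())
```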

“We have already started using AILuminate to help us robustly evaluate the safety of our models at Contextual. Its rigor and careful design has helped us gain a much deeper understanding of their strengths and weaknesses,” said Bertie Vidgen, Head of Human Data at Contextual AI.

Raising the Bar for Safety Testing by Empowering Stakeholders

The AILuminate Creative Commons-licensed DEMO prompt dataset provides an important starting point for those who are developing their in-house abilities to test the safety of generative AI systems for the twelve defined AILuminate hazard categories. While we discourage training AI systems directly on this data – following the industry best practice of maintaining separate evaluation suites to avoid overfitting AI models – we do expect the dataset to have direct relevance to the work of a wide variety of stakeholders. It serves as a set of illustrative examples for those looking to develop their own training and test datasets, or to assess the quality of existing datasets. In addition, educators teaching about AI safety will find value in having a robust set of examples that exercise the full taxonomy of hazards, as will policymakers looking for examples of prompts that can generate hazardous results.

Download the DEMO Prompt Dataset Today

The DEMO test set is made available under the Creative Commons CC-BY-4.0 license and can be downloaded directly from GitHub. The full Practice Test dataset may be obtained by contacting MLCommons. After utilizing the offline and online practice tests, we encourage model developers to submit their systems to the AILuminate Official Test for public benchmarking, validation, and publication of the results.

More information on AILuminate can be found here, including a FAQ. We encourage stakeholders to join the MLCommons AI Risk and Reliability working group and our efforts to ensure the safety of AI systems.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI risk and reliability. We invite others to join the AI Risk and Reliability working group.

MLCommons Introduces MLPerf Client v0.5
https://mlcommons.org/2024/12/mlc-mlperf-client-v0-5/ (published December 11, 2024)
A New Benchmark for Consumer AI Performance

MLCommons®, the leading open engineering consortium dedicated to advancing machine learning (ML), is excited to announce the public release of the MLPerf® Client v0.5 benchmark. This benchmark sets a new standard for evaluating consumer AI performance, enabling users, press, and the industry to measure how effectively laptops, desktops, and workstations can run cutting-edge large language models (LLMs).

A Collaborative Effort by Industry Leaders

MLPerf Client represents a collaboration among technology leaders, including AMD, Intel, Microsoft, NVIDIA, Qualcomm Technologies, Inc., and top PC OEMs. These stakeholders have pooled resources and expertise to create a standardized benchmark, offering new insight into performance on key consumer AI workloads.

“MLPerf Client is a pivotal step forward in measuring consumer AI PC performance, bringing together industry heavyweights to set a new standard for evaluating generative AI applications on personal computers,” said David Kanter, Head of MLPerf at MLCommons.

Key Features of MLPerf Client v0.5

  • AI model: The benchmark’s tests are based on Meta’s Llama 2 7B large language model, optimized for reduced memory and computational requirements via 4-bit integer quantization.
  • Tests and metrics: Includes four AI tasks—content generation, creative writing, and text summarization of two different document lengths—evaluated using familiar metrics like time-to-first-token (TTFT) and tokens-per-second (TPS); a toy calculation of both metrics appears after this list.
  • Hardware optimization: Supports hardware-accelerated execution on integrated and discrete GPUs via two distinct paths: ONNX Runtime GenAI and Intel OpenVINO.
  • Platform support: This initial release supports Windows 11 on x86-64 systems, with future updates planned for Windows on Arm and macOS.
  • Freely accessible: The benchmark is freely downloadable from MLCommons.org, empowering anyone to measure AI performance on supported systems.
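
To make the two reported metrics concrete, here is a small, hypothetical Python sketch of how time-to-first-token and tokens-per-second could be computed from timestamps taken around a streaming response; it is illustrative only and not MLPerf Client’s actual measurement code.

```python
# Toy illustration of the two MLPerf Client metrics. Not the benchmark's own code.
import time

def measure_streaming_response(token_stream):
    """token_stream: an iterator that yields tokens as the model produces them."""
    start = time.perf_counter()
    first_token_time = None
    count = 0
    for _ in token_stream:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        count += 1
    end = time.perf_counter()
    ttft = first_token_time - start                  # time-to-first-token (seconds)
    tps = (count - 1) / (end - first_token_time)     # tokens/second after the first token
    return ttft, tps

# Usage with a fake generator standing in for a real LLM stream:
def fake_stream(n=50, delay=0.01):
    for i in range(n):
        time.sleep(delay)
        yield f"token{i}"

ttft, tps = measure_streaming_response(fake_stream())
print(f"TTFT: {ttft * 1000:.1f} ms, decode rate: {tps:.1f} tokens/s")
```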

Future Development

While version 0.5 marks the benchmark’s debut, MLCommons plans to expand its capabilities in future releases, including support for additional hardware acceleration paths and a broader set of test scenarios incorporating a range of AI models.

Availability

MLPerf Client v0.5 is available for download now from MLCommons.org. See the website for additional details on the benchmark’s hardware and software support requirements. 

For more information and details on becoming a member, please visit MLCommons.org or contact participation@mlcommons.org.

MLCommons Launches AILuminate, First-of-its-Kind Benchmark to Measure the Safety of Large Language Models
https://mlcommons.org/2024/12/mlcommons-ailuminate-v1-0-release/ (published December 4, 2024)
Benchmark developed by leading AI researchers and industry experts; marks a significant milestone towards a global standard on AI safety.

MLCommons today released AILuminate, a first-of-its-kind safety test for large language models (LLMs). The v1.0 benchmark – which provides a series of safety grades for the most widely-used LLMs – is the first AI safety benchmark designed collaboratively by AI researchers and industry experts. It builds on MLCommons’ track record of producing trusted AI performance benchmarks, and offers a scientific, independent analysis of LLM risk that can be immediately incorporated into company decision-making. 

“Companies are increasingly incorporating AI into their products, but they have no standardized way of evaluating product safety,” said Peter Mattson, Founder and President of MLCommons. “Just like other complex technologies like cars or planes, AI models require industry-standard testing to guide responsible development. We hope this benchmark will assist developers in improving the safety of their systems, and will give companies better clarity about the safety of the systems they use.”  

The AILuminate benchmark assesses LLM responses to over 24,000 test prompts across twelve categories of hazards. None of the LLMs evaluated were given any advance knowledge of the evaluation prompts (a common problem in non-rigorous benchmarking), nor access to the evaluator model used to assess responses. This independence provides a methodological rigor uncommon in standard academic research and ensures an empirical analysis that can be trusted by industry and academia alike.

“With roots in well-respected research institutions, an open and transparent process, and buy-in across the industry, MLCommons is uniquely equipped to advance a global baseline on AI risk and reliability,” said Rebecca Weiss, Executive Director of MLCommons. “We are proud to release our v1.0 benchmark, which marks a major milestone in our work to build a harmonized approach to safer AI. By making AI more transparent, more reliable, and more trusted, we can ensure its positive use in society and move the industry forward.” 

This benchmark was developed by the MLCommons AI Risk and Reliability working group — a team of leading AI researchers from institutions including Stanford University, Columbia University, and TU Eindhoven, civil society representatives, and technical experts from Google, Intel, NVIDIA, Meta, Microsoft, Qualcomm Technologies, Inc., and other industry giants committed to a standardized approach to AI safety. The working group plans to release ongoing updates as AI technologies continue to advance. 

“Industry-wide safety benchmarks are an important step in fostering trust across an often-fractured AI safety ecosystem,” said Camille François, Professor at Columbia University. “AILuminate will provide a shared foundation and encourage transparent collaboration and ongoing research into the nuanced challenges of AI safety.”

“The developers of AI technologies and organizations using AI have a shared interest in transparent and practical safety assessments. AI will only be adopted and used to address society’s greatest challenges if people trust that it is safe. The AILuminate benchmark represents important progress in developing research-based, effective evaluation techniques for AI safety testing,” said Natasha Crampton, Chief Responsible AI Officer, Microsoft.

“As a member of the working group, I experienced firsthand the rigor and thoughtfulness that went into the development of AILuminate,” said Percy Liang, computer science professor at Stanford University. “The global, multi-stakeholder process is crucial for building trustworthy safety evals. I was pleased with the extensiveness of the evaluation, including multilingual support, and I look forward to seeing how AILuminate evolves with the technology.”

“Enterprise AI adoption depends on trust, transparency, and safety,” said Navrina Singh, working group member and founder and CEO of Credo AI. “The AILuminate benchmark, developed through rigorous collaboration between industry leaders and researchers, offers a trusted and fair framework for assessing model risk. This milestone sets a critical foundation for AI safety standards, enabling organizations to confidently and responsibly integrate AI into their operations.”

Cognizant that AI safety requires a coordinated global approach, MLCommons intentionally collaborated with international organizations such as the AI Verify Foundation to design the v1.0 AILuminate benchmark. The v1.0 benchmark is initially available for use in English, with versions in French, Chinese and Hindi coming in early 2025.

Learn more about the AILuminate Benchmark.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI risk and reliability.

Comprehensive Open Federated Ecosystem (COFE) showcased at MICCAI 2024
https://mlcommons.org/2024/11/medical-miccai-cofe/ (published November 21, 2024)
MLCommons Medical working group showcases new extensions to the Comprehensive Open Federated Ecosystem (COFE) at MICCAI 2024.

During the 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), the MLCommons® Medical working group presented a “Federated Learning in Healthcare” tutorial. Previously presented at ISBI 2024, the tutorial taught participants about the Comprehensive Open Federated Ecosystem (COFE). COFE is an open ecosystem of tools and best practices distilled from years of industrial and academic research experience, aiming to streamline the research and development of AI/ML models within the clinical setting.

Figure 1. Updated diagram of latest COFE iteration. HuggingFace Hub was integrated as part of the COFE ecosystem.

During the tutorial the team shared the latest updates in the COFE ecosystem:

  1. Integration of Hugging Face Hub to enable a one-click solution for model dissemination, which allows easier research collaboration in the community. Hugging Face Hub also allows for a streamlined approach to a GaNDLF Model Zoo, where users can select appropriate licensing terms (including non-commercial use) for their models.
  2. Full integration with the Medical Open Network for Artificial Intelligence (MONAI) enables developers to reuse engineering and research efforts from different groups while allowing a more cohesive pathway for code reproducibility.
  3. GaNDLF-Synth is a general-purpose synthesis extension of GaNDLF. It provides a vehicle for researchers to train multiple synthesis approaches (autoencoders, GANs, and diffusion) for the same task, thereby allowing them to quantify the efficacy of each approach for their task. Additionally, providing a robust abstraction layer allows algorithmic developers to have a common framework to distribute their methods, allowing easier benchmarking against other approaches.
  4. Privacy-enabled training across various tasks using Opacus. By integrating with Opacus, GaNDLF enables the training of models using differential privacy (DP). DP is one of the most widely used methods to incorporate privacy into trained models, and with GaNDLF, users can now tune their models for various DP settings to achieve the best model utility while training private models (a generic sketch of DP training with Opacus follows this list).
  5. Single-command model distribution through Hugging Face.
  6. MedPerf’s new web user interface makes evaluating and benchmarking healthcare AI models easier than ever. Designed for seamless interaction, it allows dataset owners and benchmark stakeholders to engage with the platform without requiring technical expertise. This update ensures that anyone involved in the process can contribute to benchmarking effortlessly.
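
For readers unfamiliar with differential privacy in a training loop, below is a minimal, generic sketch of wrapping a PyTorch model with Opacus; it is a self-contained toy, not GaNDLF’s actual integration, and the model, data, and privacy parameters are placeholders.

```python
# Generic toy example of differentially private training with Opacus.
# Not GaNDLF's integration; model, data, and DP parameters are placeholders.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Tiny stand-in model and synthetic data.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
data = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))
loader = DataLoader(data, batch_size=32)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
criterion = nn.CrossEntropyLoss()

# PrivacyEngine clips per-sample gradients and adds calibrated noise.
privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,   # placeholder DP setting
    max_grad_norm=1.0,      # per-sample gradient clipping bound
)

for epoch in range(2):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

print(f"epsilon spent: {privacy_engine.get_epsilon(delta=1e-5):.2f}")
```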

Siddhesh Thakur, a Data Engineer at Indiana University School of Medicine, showcased optimization workflows and their practical application in deploying AI models for whole slide image analysis in resource-constrained environments. Thakur shared key concepts of model optimization and demonstrated how GaNDLF facilitates this process by integrating with Intel’s OpenVINO toolkit. Through online examples, attendees learned how to leverage these tools to create efficient, deployable AI solutions for digital pathology, even in settings with limited computational resources.

Cyril Zakka, a physician and head of ML Research in Health at Hugging Face, demonstrated Hugging Face’s commitment to the democratization of state-of-the-art machine learning and highlighted various open-source tools and offerings for model development, fine-tuning, and deployment across different computing budgets and infrastructures. Dr Zakka outlined the newly developed Health Team’s mission and ongoing projects, highlighting the tight integration within the GaNDLF ecosystem to enable one-click deployment and checkpointing of scientific models, further emphasizing COFE’s goal of reproducibility and scientific accountability.

The community was invited to learn more through a self-guided hands-on session. To learn more about the Medical WG and its work in real-world clinical studies, please email medperf@mlcommons.org. The hands-on session of this tutorial was made possible through GitHub Codespaces, with generous technical support from their engineers and program managers working in close collaboration with the tutorial organizers. The tutorial is available to everyone here.

The team would also like to thank the organizing committee of MICCAI2024 and the conference organizers.

Participants attending the “Federated Learning in Healthcare” tutorial at MICCAI 2024 in Marrakesh, Morocco.

Join Us

COFE addresses key challenges that the research community faces in the federated healthcare AI space through open community tools. As an organization, we strive to collect feedback from the broader healthcare research community to help build tools that advance AI/ML for everyone. If you want to participate in the MLCommons Medical working group, please join us here.

AI Safety in Practice: Two Key Takeaways from MLCommons and PAI Workshop
https://mlcommons.org/2024/11/mlc-and-pai-workshop/ (published November 21, 2024)
A joint exploration of collaborative approaches for responsible evaluation and governance of AI across the value chain.

In October 2024, MLCommons and Partnership on AI (PAI) virtually convened practitioners and policy analysts to assess the current state of general-purpose AI evaluations and the challenges to their adoption. The workshop explored collaborative approaches to responsibly evaluating and governing AI across the value chain. Against a backdrop of unprecedented regulatory activity — including over 300 federal bills, 600 state-level bills, a U.S. AI executive order, and the EU’s new draft General-Purpose AI Code of Practice — the workshop provided an opportunity to examine how each actor in the AI ecosystem contributes to safe deployment and accountability. 

From foundation model providers to downstream actors, discussions emphasized that each actor has unique roles, obligations, and needs for guidance, highlighting the limitations of a “one-size-fits-all” approach. PAI’s recent work on distributed risk mitigation strategies for open foundation models, along with MLCommons’ development of safety benchmarks for AI, highlights the need for evaluation and governance practices tailored to each actor’s unique role. Together, these efforts blend normative guidance with technical evaluation, supporting a comprehensive approach to AI safety.

First Workshop Takeaway: Layered and Role-Specific Evaluation Approaches

The first takeaway is the importance of layered and role-specific evaluation approaches across the AI ecosystem. Workshop participants recognized that different actors—such as foundation model providers, model adapters who fine-tune these models, model hubs and model integrators—play unique roles that require tailored evaluation strategies. This includes comprehensive evaluation immediately after a model is initially trained, using both benchmarking and adversarial testing. For model adapters, post-training evaluation and adjustments during fine-tuning are essential to uncover new risks and ensure safety.

Participants emphasized proactive, layered approaches where early benchmarking and regular red-teaming work together to support safety, reliability, and compliance as AI technology advances. Benchmarking early establishes a foundation for continuous improvement, helping identify performance gaps and ensuring models meet necessary safety standards. Red-teaming, or adversarial testing, is critical to pair with benchmarks, as it helps uncover vulnerabilities not apparent through standard testing and stress-tests the model’s resilience to misuse. Frequent red-teaming enables developers to stay ahead of emerging risks, especially in the fast-evolving field of generative AI, and to address potential misuse cases proactively before widespread deployment. For model integrators, continuous testing and re-evaluation procedures were also recommended, particularly to manage the dynamic changes that occur as generative AI models are updated.

Second Workshop Takeaway: Adaptability in Evaluation Methods 

The second takeaway is the necessity of adaptable evaluation methods to ensure that safety assessments remain realistic and resistant to manipulation as AI models evolve. A critical part of adaptability is using high-quality test data sets and avoiding overfitting risks—where models become too tailored to specific test scenarios, leading them to perform well on those tests but potentially failing in new or real-world situations. Overfitting can result in evaluations that give a false sense of security, as the model may appear safe but lacks robustness.

To address this, participants discussed the importance of keeping elements of the evaluation test-set “held back” privately, to ensure that model developers cannot over-train to a public test-set. By using private test-sets as a component of the evaluation process, evaluations can better simulate real-world uses and identify vulnerabilities that traditional static benchmarks and leaderboards might miss.

Workshop participants agreed that responsibility for implementing AI evaluations will need to be shared across the AI value chain. The recent U.S. AI Safety Institute guidance highlights the importance of involving all actors in managing misuse risks throughout the AI lifecycle. While the EU’s draft General-Purpose AI Code of Practice currently focuses on model providers, the broader regulatory landscape indicates a growing recognition of the need for shared responsibility across all stakeholders involved in AI development and deployment. Together, these perspectives underscore the importance of multi-stakeholder collaboration in ensuring that general purpose  AI systems are governed safely and responsibly.

Learn more about the MLCommons AI Risk and Reliability working group building a benchmark to evaluate the safety of AI.

Learn more about Partnership on AI’s Open Foundation Model Value Chain

New MLPerf Training v4.1 Benchmarks Highlight Industry’s Focus on New Systems and Generative AI Applications
https://mlcommons.org/2024/11/new-mlperf-training-v4-1-benchmarks-highlight-industrys-focus-on-new-systems-and-generative-ai-applications/ (published November 13, 2024)
The latest Gen AI benchmarks added to the suite show significant performance improvements.

Today, MLCommons® announced new results for the MLPerf® Training v4.1 benchmark suite, including several preview category submissions using the next generation of accelerator hardware. The v4.1 round also saw increased participation in the benchmarks that represent generative AI model training, highlighting the strong alignment between the benchmark suite and the current direction of the AI industry.

New generations of hardware accelerators

MLPerf Training v4.1 includes preview category submissions using new hardware accelerators that will be generally available in the next round:

  • Google “Trillium” TPUv6 accelerator (preview)
  • NVIDIA “Blackwell” B200 accelerator (preview)

“As AI-targeted hardware rapidly advances, the value of the MLPerf Training benchmark becomes more important as an open, transparent forum for apples-to-apples comparisons,” said Hiwot Kassa, MLPerf Training working group co-chair. “Everyone benefits: vendors know where they stand versus their competitors, and their customers have better information as they procure AI training systems.”

MLPerf Training v4.1

The MLPerf Training benchmark suite comprises full system tests that stress models, software, and hardware for a range of machine learning (ML) applications. The open-source and peer-reviewed benchmark suite provides a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry.

The latest Training v4.1 results show a substantial shift in submissions for the three benchmarks that represent “generative AI” training workloads: GPT3, Stable Diffusion, and Llama 2 70B LoRA fine-tuning, with a 46% increase in total submissions across those three benchmarks.

The two newest benchmarks in the MLPerf Training suite, Llama 2 70B LoRA and Graph Neural Network (GNN), both had notably higher submission rates: a 16% increase for Llama 2 and a 55% increase for GNN. Both also saw significant performance improvements in the v4.1 round compared to v4.0, when they were first introduced, with a 1.26X speedup in the best training time for Llama 2 and a 1.23X speedup for GNN.
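For readers less familiar with how such figures are derived, the speedup is simply the ratio of the previous round’s best training time to the current round’s best time. The snippet below illustrates the arithmetic with placeholder times; the numbers are not actual MLPerf submission results.

```python
# Illustrative only: how a round-over-round speedup such as "1.26X" is computed.
# The training times below are placeholders, not actual MLPerf submission results.
def speedup(prev_best_minutes: float, new_best_minutes: float) -> float:
    return prev_best_minutes / new_best_minutes

print(f"Llama 2 70B LoRA: {speedup(25.2, 20.0):.2f}X")  # hypothetical times -> 1.26X
print(f"GNN:              {speedup(12.3, 10.0):.2f}X")  # hypothetical times -> 1.23X
```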

“The MLCommons Training working group regularly updates the benchmark to keep pace with industry trends, and the trend in submissions we are seeing validates that our open process is giving stakeholders the information they are looking for. In this round we see continued steady, incremental progress in improving AI training performance,” observed Shriya Rishab, MLPerf Training working group co-chair.

Continued Robust Industry Participation

The MLPerf Training v4.1 round includes 155 results from 17 submitting organizations: ASUSTeK, Azure, Cisco, Clemson University Research Computing and Data, Dell, FlexAI, Fujitsu, GigaComputing, Google, Krai, Lambda, Lenovo, NVIDIA, Oracle, Quanta Cloud Technology, Supermicro, and Tiny Corp.

“We would especially like to welcome first-time MLPerf Training submitters FlexAI and Lambda,” said David Kanter, Head of MLPerf at MLCommons. “We are also very excited to see Dell’s first MLPerf Training results that include power measurement; as AI adoption skyrockets, it is critical to measure both the performance and the energy efficiency of AI training.”

Participation in the benchmarking process by a broad set of industry stakeholders, as well as by academic groups, strengthens the AI/ML ecosystem as a whole and helps to ensure that the benchmark is serving the community’s needs. We invite submitters and other stakeholders to join the MLPerf Training working group and help us continue to evolve the benchmark.

View the results

To view the full results for MLPerf Training v4.1 and find additional information about the benchmarks, please visit the Training benchmark page.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI risk and reliability.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or contact participation@mlcommons.org.

The post New MLPerf Training v4.1 Benchmarks Highlight Industry’s Focus on New Systems and Generative AI Applications appeared first on MLCommons.

New MLPerf Storage v1.0 Benchmark Results Show Storage Systems Play a Critical Role in AI Model Training Performance https://mlcommons.org/2024/09/mlperf-storage-v1-0-benchmark-results/ Wed, 25 Sep 2024 14:55:00 +0000 http://local.mlcommons/2024/09/mlperf-storage-v1-0-benchmark-results/ Storage system providers invent new, creative solutions to keep pace with faster accelerators, but the challenge continues to escalate

Storage system providers showcase innovative solutions to keep pace with faster accelerators.

Today, MLCommons® announced results for its industry-standard MLPerf® Storage v1.0 benchmark suite, which is designed to measure the performance of storage systems for machine learning (ML) workloads in an architecture-neutral, representative, and reproducible manner. The results show that as accelerator technology has advanced and datasets continue to increase in size, ML system providers must ensure that their storage solutions keep up with the compute needs. This is a time of rapid change in ML systems, where progress in one technology area drives new demands in other areas. High-performance AI training now requires storage systems that are both large-scale and high-speed, lest access to stored data become the bottleneck in the entire system. With the v1.0 release of MLPerf Storage benchmark results, it is clear that storage system providers are innovating to meet that challenge.

Version 1.0 storage benchmark breaks new ground

The MLPerf Storage benchmark is the first and only open, transparent benchmark to measure storage performance in a diverse set of ML training scenarios. It emulates the storage demands across several scenarios and system configurations covering a range of accelerators, models, and workloads. By simulating the accelerators’ “think time,” the benchmark can generate accurate storage patterns without the need to run the actual training, making it more accessible to all. The benchmark focuses the test on a given storage system’s ability to keep pace, as it requires the simulated accelerators to maintain a required level of utilization.
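The following is a minimal sketch of that emulation idea (our own illustration with hypothetical timings, not the actual MLPerf Storage harness). Each emulated accelerator alternates between reading a batch from the storage system under test and sleeping for a fixed “think time”; utilization is the fraction of wall-clock time spent “computing” rather than waiting on I/O, and a run passes only if utilization stays above the required threshold.

```python
# Hypothetical illustration of the "think time" emulation idea: an emulated
# accelerator alternates between reading a batch from the storage system under
# test and sleeping for the model's compute time; utilization = compute / total.
import time

def emulate_accelerator(read_batch, think_time_s: float, num_batches: int) -> float:
    """read_batch() should block until one batch worth of data has been delivered."""
    io_time = 0.0
    compute_time = 0.0
    for _ in range(num_batches):
        t0 = time.perf_counter()
        read_batch()                     # storage under test supplies the data
        io_time += time.perf_counter() - t0
        time.sleep(think_time_s)         # stand-in for the accelerator's compute
        compute_time += think_time_s
    return compute_time / (compute_time + io_time)

if __name__ == "__main__":
    fast_storage = lambda: time.sleep(0.001)   # pretend each batch takes ~1 ms to read
    util = emulate_accelerator(fast_storage, think_time_s=0.010, num_batches=50)
    print(f"achieved utilization: {util:.1%}")  # compare against the required threshold
```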

Three models are included in the benchmark to ensure diverse patterns of AI training are tested: 3D-UNet, Resnet50, and CosmoFlow. These workloads offer a variety of sample sizes, ranging from hundreds of kilobytes to hundreds of megabytes, as well as wide-ranging simulated “think times,” from a few milliseconds to a few hundred milliseconds.

The benchmark emulates NVIDIA A100 and H100 models as representatives of the currently available accelerator technologies. The H100 accelerator reduces the per-batch computation time for the 3D-UNet workload by 76% compared to the earlier V100 accelerator in the v0.5 round, turning what was typically a bandwidth-sensitive workload into much more of a latency-sensitive workload. 
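One way to see the pressure this puts on storage is a back-of-the-envelope calculation (the batch size and baseline time below are made-up numbers, not figures from the benchmark specification): if the per-batch compute time drops by 76% while the amount of data read per batch stays the same, the storage system has roughly a quarter of the time to deliver each batch, so the sustained throughput it must supply rises by about 1 / (1 - 0.76), a bit over 4x.

```python
# Back-of-the-envelope only: how a 76% drop in per-batch compute time scales the
# storage throughput needed to keep one accelerator fed (illustrative numbers).
batch_bytes = 150e6                     # hypothetical bytes read per batch
v100_batch_time_s = 1.00                # hypothetical per-batch compute time on V100
h100_batch_time_s = v100_batch_time_s * (1 - 0.76)

required_v100 = batch_bytes / v100_batch_time_s / 1e9   # GB/s
required_h100 = batch_bytes / h100_batch_time_s / 1e9   # GB/s
print(f"V100: {required_v100:.2f} GB/s, H100: {required_h100:.2f} GB/s "
      f"({required_h100 / required_v100:.1f}x more)")
```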

In addition, MLPerf Storage v1.0 includes support for distributed training. Distributed training is an important scenario for the benchmark because it represents a common real-world practice for faster training of models with large datasets, and it presents specific challenges for a storage system not only in delivering higher throughput but also in serving multiple training nodes simultaneously.

V1.0 benchmark results show performance improvement in storage technology for ML systems

The broad scope of workloads submitted to the benchmark reflects the wide range and diversity of storage systems and architectures. This is a testament to how important ML workloads are to all types of storage solutions, and it demonstrates the active innovation happening in this space.

“The MLPerf Storage v1.0 results demonstrate a renewal in storage technology design,” said Oana Balmau, MLPerf Storage working group co-chair. “At the moment, there doesn’t appear to be a consensus ‘best of breed’ technical architecture for storage in ML systems: the submissions we received for the v1.0 benchmark took a wide range of unique and creative approaches to providing high-speed, high-scale storage.”

The results in the distributed training scenario show the delicate balance needed between the number of hosts, the number of simulated accelerators per host, and the storage system in order to serve all accelerators at the required utilization. Adding more nodes and accelerators to serve ever-larger training datasets increases the throughput demands. Distributed training adds another twist, because historically different technologies – with different throughputs and latencies – have been used for moving data within a node and between nodes. The maximum number of accelerators a single node can support may not be limited by the node’s own hardware but instead by the ability to move enough data quickly to that node in a distributed environment (up to 2.7 GiB/s per emulated accelerator). Storage system architects now have few design tradeoffs available to them: the systems must be high-throughput and low-latency, to keep a large-scale AI training system running at peak load.
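As a rough illustration of that balance (the node sizes and network figure below are our own assumptions; only the 2.7 GiB/s per-accelerator figure comes from the results above), the aggregate bandwidth a single node must ingest grows linearly with the number of emulated accelerators it hosts and quickly exceeds what one ordinary network link can carry.

```python
# Illustrative arithmetic: aggregate ingest bandwidth per node at the benchmark's
# observed upper bound of 2.7 GiB/s per emulated accelerator (node sizes are hypothetical).
PER_ACCEL_GIBPS = 2.7
GBE100_GIBPS = 100e9 / 8 / 2**30        # ~11.6 GiB/s of raw payload on a 100 Gb/s link

for accels_per_node in (4, 8, 16):
    node_demand = accels_per_node * PER_ACCEL_GIBPS
    print(f"{accels_per_node:2d} accelerators -> {node_demand:5.1f} GiB/s "
          f"(~{node_demand / GBE100_GIBPS:.1f}x a 100 GbE link)")
```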

“As we anticipated, the new, faster accelerator hardware significantly raised the bar for storage, making it clear that storage access performance has become a gating factor for overall training speed,” said Curtis Anderson, MLPerf Storage working group co-chair. “To prevent expensive accelerators from sitting idle, system architects are moving to the fastest storage they can procure – and storage providers are innovating in response.” 

MLPerf Storage v1.0

The MLPerf Storage benchmark was created through a collaborative engineering process across more than a dozen leading storage solution providers and academic research groups. The open-source and peer-reviewed benchmark suite offers a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry. It also provides critical technical information for customers who are procuring and tuning AI training systems.

The v1.0 benchmark results, from a broad set of technology providers, demonstrate the industry’s recognition of the importance of high-performance storage solutions. MLPerf Storage v1.0 includes over 100 performance results from 14 submitting organizations: DDN, Hammerspace, Hewlett Packard Enterprise, Huawei, IEIT SYSTEMS, Juicedata, Lightbits Labs, MangoBoost, Micron, Nutanix, Simplyblock, Volumez, WEKA, and YanRong Tech.

“We’re excited to see so many storage providers, both large and small, participate in the first-of-its-kind v1.0 Storage benchmark,” said David Kanter, Head of MLPerf at MLCommons. “It shows both that the industry is recognizing the need to keep innovating in storage technologies to keep pace with the rest of the AI technology stack, and also that the ability to measure the performance of those technologies is critical to the successful deployment of ML training systems. As a trusted provider of open, fair, and transparent benchmarks, MLCommons ensures that technology providers know the performance target they need to meet, and consumers can procure and tune ML systems to maximize their utilization – and ultimately their return on investment.”

We invite stakeholders to join the MLPerf Storage working group and help us continue to evolve the benchmark. Future work includes improving the accelerator emulations and expanding the range of AI training scenarios covered.

View the Results

To view the results for MLPerf Storage v1.0, please visit the Storage benchmark results.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI Safety.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or contact participation@mlcommons.org.

The post New MLPerf Storage v1.0 Benchmark Results Show Storage Systems Play a Critical Role in AI Model Training Performance appeared first on MLCommons.

New MLPerf Inference v4.1 Benchmark Results Highlight Rapid Hardware and Software Innovations in Generative AI Systems https://mlcommons.org/2024/08/mlperf-inference-v4-1-results/ Wed, 28 Aug 2024 14:56:00 +0000 http://local.mlcommons/2024/08/mlperf-inference-v4-1-results/ New mixture of experts benchmark tracks emerging architectures for AI models

Today, MLCommons® announced new results for its industry-standard MLPerf® Inference v4.1 benchmark suite, which delivers machine learning (ML) system performance benchmarking in an architecture-neutral, representative, and reproducible manner. This release includes first-time results for a new benchmark based on a mixture of experts (MoE) model architecture. It also presents new findings on power consumption related to inference execution.

MLPerf Inference v4.1

The MLPerf Inference benchmark suite, which encompasses both data center and edge systems, is designed to measure how quickly hardware systems can run AI and ML models across a variety of deployment scenarios. The open-source and peer-reviewed benchmark suite creates a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry. It also provides critical technical information for customers who are procuring and tuning AI systems.

The benchmark results for this round demonstrate broad industry participation and include the debut of six newly available or soon-to-be-shipped processors:

  • AMD MI300x accelerator (available)
  • AMD EPYC “Turin” CPU (preview)
  • Google “Trillium” TPUv6e accelerator (preview)
  • Intel “Granite Rapids” Xeon CPUs (preview)
  • NVIDIA “Blackwell” B200 accelerator (preview)
  • UntetherAI SpeedAI 240 Slim (available) and SpeedAI 240 (preview) accelerators

MLPerf Inference v4.1 includes 964 performance results from 22 submitting organizations: AMD, ASUSTek, Cisco Systems, Connect Tech Inc, CTuning Foundation, Dell Technologies, Fujitsu, Giga Computing, Google Cloud, Hewlett Packard Enterprise, Intel, Juniper Networks, KRAI, Lenovo, Neural Magic, NVIDIA, Oracle, Quanta Cloud Technology, Red Hat, Supermicro, Sustainable Metal Cloud, and Untether AI.

“There is now more choice than ever in AI system technologies, and it’s heartening to see providers embracing the need for open, transparent performance benchmarks to help stakeholders evaluate their technologies,” said Mitchelle Rasquinha, MLCommons Inference working group co-chair.

New mixture of experts benchmark

Keeping pace with today’s ever-changing AI landscape, MLPerf Inference v4.1 introduces a new benchmark to the suite: mixture of experts. MoE is an architectural design for AI models that departs from the traditional approach of employing a single, massive model; it instead uses a collection of smaller “expert” models. Inference queries are directed to a subset of the expert models to generate results. Research and industry leaders have found that this approach can yield accuracy equivalent to a single monolithic model, often with a significant performance advantage, because only a fraction of the parameters are invoked with each query.
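The routing idea can be sketched in a few lines (a generic toy with top-k gating and random weights, for illustration only; it is not the Mixtral 8x7B implementation used in the benchmark): a gate scores every expert for a given input, only the top-scoring few are actually evaluated, and their outputs are combined, so most of the model’s parameters stay idle on any single query.

```python
# Toy top-k mixture-of-experts routing, for illustration only. Real MoE models
# such as Mixtral 8x7B apply this inside transformer layers at a much larger scale.
import numpy as np

rng = np.random.default_rng(0)
d_model, num_experts, top_k = 16, 8, 2

gate_w = rng.normal(size=(d_model, num_experts))              # router weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(num_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = x @ gate_w                                        # one score per expert
    chosen = np.argsort(scores)[-top_k:]                       # indices of the top-k experts
    weights = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()  # softmax over chosen
    # Only the chosen experts run; the remaining parameters are untouched for this query.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

print(moe_forward(rng.normal(size=d_model)).shape)  # (16,)
```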

The MoE benchmark is unique and one of the most complex implemented by MLCommons to date. It uses the open-source Mixtral 8x7B model as a reference implementation and performs inferences using datasets covering three independent tasks: general Q&A, solving math problems, and code generation.

“When determining to add a new benchmark, the MLPerf Inference working group observed that many key players in the AI ecosystem are strongly embracing MoE as part of their strategy,” said Miro Hodak, MLCommons Inference working group co-chair. “Building an industry-standard benchmark for measuring system performance on MoE models is essential to address this trend in AI adoption. We’re proud to be the first AI benchmark suite to include MoE tests to fill this critical information gap.”

Benchmarking Power Consumption

The MLPerf Inference v4.1 benchmark includes 31 power consumption test results across three submitted systems covering both datacenter and edge scenarios. These results demonstrate the continued importance of understanding the power requirements for AI systems running inference tasks, as power costs are a substantial portion of the overall expense of operating AI systems.
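For context on how such results are often read (a simplified illustration, not the full MLPerf Power methodology), a common derived figure is throughput per watt, which is equivalent to inferences per joule of energy consumed. The numbers in the sketch below are placeholders.

```python
# Illustrative only: deriving an efficiency figure from a throughput measurement and
# an average power reading. The values below are placeholders, not MLPerf results.
def inferences_per_joule(throughput_inf_per_s: float, avg_power_watts: float) -> float:
    return throughput_inf_per_s / avg_power_watts   # 1 watt = 1 joule per second

print(f"{inferences_per_joule(12000.0, 800.0):.1f} inferences per joule")
```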

The Increasing Pace of AI Innovation

Today, we are witnessing an incredible groundswell of technological advances across the AI ecosystem, driven by a wide range of providers including AI pioneers; large, well-established technology companies; and small startups. 

MLCommons would especially like to welcome first-time MLPerf Inference submitters AMD and Sustainable Metal Cloud, as well as Untether AI, which delivered both performance and power efficiency results. 

“It’s encouraging to see the breadth of technical diversity in the systems submitted to the MLPerf Inference benchmark as vendors adopt new techniques for optimizing system performance such as vLLM and sparsity-aware inference,” said David Kanter, Head of MLPerf at MLCommons. “Farther down the technology stack, we were struck by the substantial increase in unique accelerator technologies submitted to the benchmark this time. We are excited to see that systems are now evolving at a much faster pace – at every layer – to meet the needs of AI. We are delighted to be a trusted provider of open, fair, and transparent benchmarks that help stakeholders get the data they need to make sense of the fast pace of AI innovation and drive the industry forward.”

View the Results

To view the results for MLPerf Inference v4.1, please visit the Datacenter and Edge benchmark results pages.

To learn more about the selection of the new MoE benchmark, read the blog.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI Safety.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or contact participation@mlcommons.org.

The post New MLPerf Inference v4.1 Benchmark Results Highlight Rapid Hardware and Software Innovations in Generative AI Systems appeared first on MLCommons.
