Datasets Archives - MLCommons
https://mlcommons.org/category/datasets/

Unsupervised People's Speech: A Massive Multilingual Audio Dataset
January 30, 2025 | https://mlcommons.org/2025/01/new-unsupervised-peoples-speech/

Introducing the Unsupervised People’s Speech dataset

The MLCommons Datasets working group is pleased to announce the release of the Unsupervised People’s Speech dataset. Built in collaboration with Hugging Face, the Unsupervised People’s Speech dataset contains over 1 million hours of audio spanning dozens of languages. Given the impact the previously released, supervised People’s Speech dataset had on models such as Whisper, we expect this new dataset to drive innovation across a range of tasks, from self-supervised pretraining to improved automatic speech recognition pipelines in numerous languages.

The Rationale Behind the Dataset

As the field of speech technology continues to advance, the need for large, diverse audio datasets becomes increasingly pressing. The Unsupervised People’s Speech dataset aims to address this need by providing a vast collection of multilingual audio data to support research and development across speech technology. Supporting broader Natural Language Processing (NLP) research for languages other than English helps bring communication technologies to more people globally, including speakers of low-resource languages.

Ensuring Useful Data

To ensure the dataset is as useful as possible, the MLCommons Datasets working group ran several data pipelines to understand its contents across dimensions such as language distribution and speech detection.

  • Speech detection: the working group created a custom data loader that resamples the audio and converts it to a single channel (available in the Hugging Face dataset card), then ran Silero’s Voice Activity Detection pipeline, a model that uses a multi-head attention mechanism with STFT features. This pipeline identified a total of more than 821,412 hours of speech (a rough sketch of this step follows the list). 
  • Language identification: language identification is often the first task in a series of NLP tasks, so it’s crucial to get it right. The working group used NVIDIA’s TensorRT-LLM implementation of Whisper Large v3 to run inference on the subset of the dataset where the speech detection pipeline found a speech utterance. This pipeline detected a total of 89 languages. We are confident there are more: the third most common category the model inferred was a “no speech” tag, meaning the model could not determine the language for those segments.
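For readers who want a concrete picture of the speech-detection step, the sketch below shows one way to run Silero VAD over resampled, single-channel audio using its publicly documented torch.hub entry point. This is an illustrative sketch, not the working group’s actual pipeline; file paths are placeholders.

```python
# Illustrative sketch only (not the working group's pipeline): downmix to mono,
# resample to 16 kHz, and run Silero VAD to estimate seconds of detected speech.
import torch
import torchaudio

# Silero VAD is distributed with helper utilities via torch.hub.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps = utils[0]  # first helper: speech timestamp extraction

def speech_seconds(path: str, target_sr: int = 16000) -> float:
    """Return the number of seconds of detected speech in one audio file."""
    waveform, sr = torchaudio.load(path)            # shape: (channels, samples)
    waveform = waveform.mean(dim=0)                 # downmix to a single channel
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    stamps = get_speech_timestamps(waveform, model, sampling_rate=target_sr)
    return sum(s["end"] - s["start"] for s in stamps) / target_sr

# Example (placeholder path): hours = speech_seconds("clip_0001.flac") / 3600
```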

Technical Hurdles

While the focus of the Unsupervised People’s Speech dataset is on its potential applications, it’s worth noting some of the technical challenges the working group overcame:

  1. Data Upload and Storage: we developed a custom script using a Git LFS backend to efficiently upload the 48+ TB dataset to S3 and Hugging Face, overcoming typical speed limitations.
  2. Self-Supervised Learning Potential: we’ve included a training pipeline to train a Wav2Vec model, which could unlock new possibilities in unsupervised speech representation learning.
  3. Deduplication: we are creating embedding representations of the entire dataset using Meta’s EnCodec so that duplicate content can be identified and removed, ensuring the dataset’s uniqueness and quality (a rough sketch of the embedding step follows this list).
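As a rough illustration of the embedding step mentioned above, the sketch below encodes an audio file with the Hugging Face transformers port of Meta’s EnCodec; the resulting discrete codes could then serve as compact fingerprints for near-duplicate detection. This is an assumption-laden sketch rather than the working group’s actual deduplication code.

```python
# Hedged sketch: compute EnCodec codes for an audio file as a compact fingerprint.
# Not the working group's deduplication pipeline; checkpoint name per transformers docs.
import torch
import torchaudio
from transformers import AutoProcessor, EncodecModel

processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
model = EncodecModel.from_pretrained("facebook/encodec_24khz")

def encodec_codes(path: str) -> torch.Tensor:
    """Return the discrete EnCodec codes for one audio file."""
    wav, sr = torchaudio.load(path)
    wav = wav.mean(dim=0)                                            # mono
    wav = torchaudio.functional.resample(wav, sr, processor.sampling_rate)
    inputs = processor(raw_audio=wav.numpy(),
                       sampling_rate=processor.sampling_rate,
                       return_tensors="pt")
    with torch.no_grad():
        encoded = model.encode(inputs["input_values"], inputs["padding_mask"])
    return encoded.audio_codes  # duplicates should yield near-identical codes
```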

Future Work

The Unsupervised People’s Speech dataset is built from audio data on Archive.org that is either public domain or available with CC-BY or CC-BY-SA licenses. As part of our commitment to updating and maintaining the dataset, we are also releasing the software as open-source to empower the community to build on our contributions.

As the working group completes the deduplication and self-supervised efforts, we anticipate several avenues for the research community to continue to build and develop, especially in the areas of improving low-resource language speech models, enhanced speech recognition across different accents and dialects, and novel applications in speech synthesis. 

Join Us

There are many ways to get involved in the MLCommons data effort. We are currently seeking speakers of over 130 languages to create a benchmark for a text language identification task. If this sounds interesting, head to Dynabench’s text classification task.  

If you want to be part of the Datasets working group or any other working group at MLCommons, please visit the Get Involved page. We meet weekly to share ideas and implement projects. If you want to read more about the MLCommons Unsupervised People’s Speech dataset, visit https://mlcommons.org/datasets/unsupervised-peoples-speech/

MLCommons Releases AILuminate Creative Commons DEMO Benchmark Prompt Dataset to GitHub
January 15, 2025 | https://mlcommons.org/2025/01/ailuminate-demo-benchmark-prompt-dataset/

Set of 1,200 DEMO prompts exercises the breadth of hazards for generative AI systems

Today, MLCommons® publicly released the AILuminate DEMO prompt dataset, a collection of 1,200 prompts for generative AI systems to benchmark safety against twelve categories of hazards. The DEMO prompt set is a component of the full AILuminate Practice Test dataset.

The dataset is available at: https://github.com/mlcommons/ailuminate.

The AILuminate benchmark, designed and developed by the MLCommons AI Risk and Reliability working group, assesses LLM responses to over 24,000 test prompts across twelve categories of hazards that users of generative AI systems may encounter. It also includes a best-in-class evaluation system that uses a tuned ensemble of safety evaluation models, as well as public reports of 13 systems under test (SUTs), graded on the AILuminate v1.0 benchmark scale to assess the quantitative and qualitative performance of each SUT in every hazard category.

The full 24,000 test prompts are broken out into the following datasets:

  • A Practice Test dataset with a standard, well-documented set of prompts that can be used for in-house testing, as examples of specific hazards, and as seeds and inspiration for the creation of other prompts. The Creative Commons DEMO set being released today is a 10% subset of the Practice Test dataset. Access to the full Practice Test dataset is available upon request from MLCommons. 
  • An Official Test dataset, used for an official benchmark of a System Under Test (SUT), which can be a single LLM or an ensemble. The Official Test is run by MLCommons, using the official AILuminate independent evaluator to provide a detailed set of performance scores measured against a standard baseline. The Official Test dataset is private to prevent SUTs from “gaming” the benchmark by training on the specific test prompts. The Practice Test and Official Test datasets are designed to be statistically equivalent in their coverage and testing of the taxonomy of hazards. To run an Official Test, contact MLCommons.

The 12 hazard categories included in the test prompts are extensively documented in the AILuminate Assessment Standard.

All prompts are currently in English, with plans to expand to French, Chinese and Hindi. MLCommons intends to update the prompt datasets, including the Creative Commons DEMO dataset, on a regular basis as the community’s understanding of AI hazards evolves – for example, as agentic behavior is developed in AI systems. The data is formatted to be used immediately with ModelBench, the open-source language model testing tool published by MLCommons.
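As a starting point for in-house experimentation, the sketch below shows one way to iterate over the DEMO prompt file and collect responses from a system under test. The file name and column names are assumptions for illustration only; consult the GitHub repository for the actual schema.

```python
# Illustrative only: run the DEMO prompts through a model wrapper and keep the
# responses grouped by hazard category. Column names below are assumed, not official.
import csv

def collect_responses(prompt_csv: str, generate) -> list:
    """`generate` is any callable mapping a prompt string to a model response."""
    results = []
    with open(prompt_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            prompt = row.get("prompt_text", "")     # assumed column name
            hazard = row.get("hazard", "unknown")   # assumed column name
            results.append({"hazard": hazard,
                            "prompt": prompt,
                            "response": generate(prompt)})
    return results

# Example usage with a placeholder model wrapper:
# responses = collect_responses("ailuminate_demo_prompts.csv", my_model.generate)
```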

“We have already started using AILuminate to help us robustly evaluate the safety of our models at Contextual. Its rigor and careful design has helped us gain a much deeper understanding of their strengths and weaknesses,” said Bertie Vidgen, Head of Human Data at Contextual AI.

Raising the Bar for Safety Testing by Empowering Stakeholders

The AILuminate Creative Commons-licensed DEMO prompt dataset provides an important starting point for those developing in-house abilities to test the safety of generative AI systems across the twelve defined AILuminate hazard categories. While we discourage training AI systems directly on this data, following the industry best practice of maintaining separate evaluation suites to avoid overfitting AI models, we do expect the dataset to have direct relevance to the work of a wide variety of stakeholders. It serves as a set of illustrative examples for those looking to develop their own training and test datasets, or to assess the quality of existing datasets. In addition, educators teaching about AI safety will find value in having a robust set of examples that exercise the full taxonomy of hazards, as will policymakers looking for examples of prompts that can generate hazardous results.

Download the DEMO Prompt Dataset Today

The DEMO test set is made available under the Creative Commons CC-BY-4.0 license and can be downloaded directly from GitHub. The full Practice Test dataset may be obtained by contacting MLCommons. After using the offline and online practice tests, we encourage model developers to submit their systems to the AILuminate Official Test for public benchmarking, validation, and publication of the results.

More information on AILuminate can be found here, including a FAQ. We encourage stakeholders to join the MLCommons AI Risk and Reliability working group and our efforts to ensure the safety of AI systems.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. In collaboration with its 125+ members (global technology providers, academics, and researchers), MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI risk and reliability. We invite others to join the AI Risk and Reliability working group.

Bringing Open Source Principles to AI Data through Croissant
December 16, 2024 | https://mlcommons.org/2024/12/croissant-medium-post/

Republished from the ODI research team Medium ("Canvas") post, authored by Thomas Carey Wilson, November 27, 2024.

Croissant, a metadata format that helps standardize machine learning (ML) datasets, has gained significant popularity within the open-source AI community by offering a powerful tool for describing ML datasets and enhancing their accessibility. Developed by the MLCommons Croissant working group, it serves as a community-driven metadata standard that simplifies how datasets are used, making them more understandable and shareable. This surge in adoption underscores a growing awareness of the critical role transparency and collaboration play in advancing AI development.

By adhering to the Open Source AI Definition, Croissant guarantees that datasets are FAIR (findable, accessible, interoperable, and reusable) and advocates for responsible data management practices. As an increasing number of organizations and individuals embrace Croissant, cultivating innovation and collaboration across platforms becomes essential.

Thomas Carey Wilson’s recent Medium post, “Bringing Open Source Principles to AI Data through Croissant,” explains why embracing Croissant is a decisive step toward an open-source AI future where transparency and collective advancement are the norms. We are republishing the Medium post here for the community. To get involved in the Croissant working group, join it here.

Republished from the ODI research team Medium account (“Canvas“), November 27, 2024, authored by Thomas Carey Wilson.

Bringing Open Source Principles to AI Data through Croissant

Introduction

Artificial Intelligence (AI) is progressively influencing various sectors, including industries and governmental operations. Despite some risks, its potential can be unlocked by greater and more responsible openness, enabling freedom, transparency, and collaboration. The Open Source Initiative’s (OSI) Open Source AI Definition seeks to bring these core principles to AI systems. To our delight, central to this mission is the availability and standardisation of AI data, in addition to models and weights. This is where MLCommons’ Croissant comes into play. Croissant is a community-driven metadata standard that simplifies how datasets are used in machine learning (ML), making them more accessible, understandable, and shareable.

Understanding the Open Source AI Definition

The Open Source AI Definition extends the foundational freedoms of open source software to AI systems. It outlines four essential freedoms:

  • Use: the right to use the AI system for any purpose without seeking permission.
  • Study: examining the system’s workings by accessing its components.
  • Modify: the freedom to alter the system to change its outputs or behaviour.
  • Share: the right to distribute the system to others, with or without modifications, for any purpose.

These freedoms apply not only to complete AI systems but also to their components, such as models, data, and parameters. The OSI explains that a critical enabler of these freedoms is ensuring access to the “preferred form to make modifications”: the essential resources required for others to understand and alter certain aspects of the AI system.

This includes providing detailed information about the data used to train the system (“data information”), the complete source code for training and running the system (“code”), and the model parameters such as weights or configuration settings (“parameters”). Among these, data information is particularly crucial because it empowers users to study and modify AI systems by understanding the data foundation upon which they are built, a core area where Croissant makes a significant contribution.

We value the focus on data transparency but suggest, as highlighted in our data for AI taxonomy, specifying what measures enabling transparency could look like for different data types in AI systems. This can support more equitable access, informed debate, and the implementation of responsible practices across the AI ecosystem.

Croissant’s alignment with the Open Source AI Definition

Croissant, as an open-source tool, ensures that ML datasets are FAIR (findable, accessible, interoperable, and reusable), aligning with open science and open data practices. It embodies the principles of the Open Source AI Definition by providing a standardized, machine-readable format for ML dataset metadata. Here’s how Croissant supports each of the core freedoms and data information requirements:

Use

To be useful, core Croissant metadata should be openly available for all AI datasets, even if the datasets themselves cannot be made public. For an open-source AI model based on open data, Croissant can enable free use of AI datasets by providing detailed metadata that makes datasets easier to find and utilize. Once the required dataset(s) are located, it allows practitioners to quickly understand data collection methods, structures, and processing steps by standardizing how datasets are described. This ease of access and clarity empowers users to employ datasets for any purpose without needing permission or facing unnecessary barriers.

Moreover, Croissant’s design ensures interoperability by leveraging its attributes and ensuring tools understand what those attributes mean. The use of widely recognized vocabularies like schema.org allows integration with existing data and metadata management tools, such as data crawlers, search engines, data catalogues, and metadata assurance systems. This aligns with the Open Source AI Definition’s emphasis on freedom of use by enabling seamless integration across various ML frameworks like PyTorch, TensorFlow, and JAX.

Study

Studying an AI system requires access to detailed information about its components. Croissant facilitates this by offering rich metadata that supports a range of responsible AI use cases, including the structured discovery of datasets, representation of human contributions, and analysis of data diversity and bias. Specifically, it provides:

  • Dataset-level information: Includes descriptions, licenses, creators, and versioning.
  • Resources: Details about files and file sets within the dataset.
  • RecordSets: Structures that describe how data is organized, including fields and data types.

For example, Croissant enables researchers to analyze the demographic characteristics of annotators and contributors to assess dataset diversity or potential biases. For this purpose and others, the inclusion of data types, hierarchical record sets, and fields supports machine interoperability, allowing tools to interpret and integrate datasets effectively.
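To make the structure concrete, here is a heavily simplified sketch of Croissant-style metadata, written as a Python dictionary that mirrors the JSON-LD layout. Property names and values are illustrative and abbreviated; the actual specification (and its schema.org context) should be consulted for a valid record.

```python
# Simplified, illustrative Croissant-style metadata (not a spec-complete record).
croissant_metadata = {
    "@type": "sc:Dataset",                      # dataset-level information
    "name": "example-speech-dataset",
    "description": "Toy example of Croissant-style metadata.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "version": "1.0.0",
    "distribution": [                           # resources: files and file sets
        {
            "@type": "cr:FileObject",
            "name": "clips.tar.gz",
            "contentUrl": "https://example.org/clips.tar.gz",  # placeholder URL
            "encodingFormat": "application/gzip",
            "sha256": "<checksum goes here>",
        }
    ],
    "recordSet": [                              # how records and fields are organized
        {
            "@type": "cr:RecordSet",
            "name": "utterances",
            "field": [
                {"name": "audio", "dataType": "sc:AudioObject"},
                {"name": "transcript", "dataType": "sc:Text"},
                {"name": "speaker_region", "dataType": "sc:Text"},
            ],
        }
    ],
}
```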

Modify

Croissant’s open and extensible format encourages modification and adaptation. Users can:

  • Extend metadata: Croissant is designed to be modular, allowing for extensions like the Geo-Croissant extension, which captures information specific to geospatial data.
  • Adapt datasets: the standardization of metadata means that datasets can be modified more easily, whether by adding new data, altering existing records, or adjusting data structures.
  • Integrate with tools: Croissant metadata can be edited through a visual editor or a Python library, making it accessible for developers to modify datasets programmatically.
  • Enable interoperability across repositories: Croissant acts as a glue between platforms like Kaggle and Hugging Face, providing adapters that enable seamless dataset interchange while retaining provenance and versioning information for compatibility across frameworks.

For example, if a practitioner wants to add new fields to a dataset or alter the way data is processed, Croissant’s standardized format makes these modifications straightforward. The ability to represent complex data types and transformations within the metadata ensures that changes are accurately captured and can be shared with others.

Share

Sharing is at the heart of open source, and Croissant facilitates this by making datasets and their metadata easily distributable. The use of JSON-LD and alignment with schema.org means that metadata can be embedded in web pages, enabling search engines to index and discover datasets. Croissant also supports:

  • Portability: Croissant datasets can be seamlessly loaded into different ML frameworks, promoting sharing across platforms.
  • Collaboration: Croissant fosters collaboration among practitioners, researchers, and organizations by providing a common language for dataset metadata.

An illustrative example is how Croissant enables dataset repositories to infer metadata from existing documentation like data cards, making it easier to publish and share datasets with metadata attached.
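In practice, this portability is what dataset-loading tooling builds on. The sketch below shows roughly how a Croissant description might be consumed with the mlcroissant reference library; the entry-point names and the metadata URL are written from memory and should be treated as assumptions, with the project README as the authority.

```python
# Rough sketch, assuming the mlcroissant reference library's documented entry points.
# The metadata URL and record-set name below are placeholders.
import mlcroissant as mlc

ds = mlc.Dataset(jsonld="https://example.org/path/to/croissant/metadata.json")

# Iterate a few records from one record set; field names depend on the dataset.
for i, record in enumerate(ds.records(record_set="utterances")):
    print(record)
    if i == 2:
        break
```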

Data information compliance

As for the practical requirements which enable these freedoms, the Open Source AI Definition places significant emphasis on data information, requiring:

  1. A complete description of all data used for training: Croissant provides a detailed schema for dataset-level information, including descriptions, licenses, creators, and data versions. It also specifies how to represent resources like files and file sets, ensuring that all data used in training is thoroughly documented.
  2. Listing of publicly available training data: through the distribution property and detailed resource descriptions, Croissant makes it clear where data is stored and how it can be accessed. The use of URLs and content descriptions aligns with the requirement to list publicly available data and where to obtain it.
  3. Listing of training data obtainable from third parties: Croissant’s metadata can include references to external data sources, including third-party datasets. Specifying licenses and access URLs ensures that users know how to obtain additional data, even if it requires a fee.

Moreover, Croissant’s support for versioning and checksums addresses the need for transparency in data changes over time. Croissant ensures that users can verify the integrity of data and understand its provenance by documenting dataset versions and providing file checksums.

Conclusion

Croissant epitomizes the core principles of the Open Source AI Definition by supporting AI datasets that are fully open, accessible, and adaptable. Through its standardized format, Croissant empowers everyone to use, study, modify, and share AI data freely, driving innovation, fostering collaboration, and building trust in AI systems. While details on how the Open Source AI Definition will work in practice are still emerging, embracing Croissant is a decisive step toward an open-source AI future where transparency and collective advancement are the norms.

Announcing the DataPerf Challenges 2023 Winners
December 13, 2023 | https://mlcommons.org/2023/12/dataperf-challenges-2023-winners/

Pushing the boundaries of data-centric AI

The MLCommons® DataPerf working group is excited to announce the winners of the inaugural DataPerf 2023 challenges, which centered around pushing the boundaries of data-centric AI and addressing the crucial ‘data bottleneck’ faced by the machine learning (ML) community.

In the challenges, participants were asked to approach ML from a novel perspective: by evolving training and test data alongside models to overcome data bottlenecks. The results were inspiring. Without further ado, here are the winners of the DataPerf 2023 challenges:

Quality of Training Data

The first two challenges targeted the quality of training data in the Vision and Speech domains. The winners successfully designed selection strategies to determine the best training sets from large candidate pools of weakly labeled images and automatically extracted spoken word clips.

Quality of Training Data: Vision Domain 

In the Vision domain, Paolo Climaco, a PhD candidate at the Institute for Numerical Simulation, University of Bonn, triumphed by presenting a robust and efficient method for training data selection. His approach is based on farthest point sampling: negative examples for each label are selected by iteratively searching the feature space for the point that maximizes the l2 distance to those already chosen. The best core set is then selected under nested cross-validation. Paolo’s best-performing model scored a mean F1 of 81.00 across the ‘cupcake’, ‘hawk’ and ‘sushi’ classes.
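The following is a minimal sketch of farthest point sampling for core-set selection, capturing the general idea rather than Paolo’s exact implementation: greedily add the point whose minimum l2 distance to the already-selected points is largest.

```python
# Minimal farthest point sampling (FPS) sketch for core-set selection.
import numpy as np

def farthest_point_sampling(features: np.ndarray, k: int, seed: int = 0) -> list:
    """Greedily pick k indices that are maximally spread out in feature space."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(features)))]        # arbitrary starting point
    min_dist = np.linalg.norm(features - features[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))                   # farthest from the current set
        selected.append(nxt)
        min_dist = np.minimum(min_dist,
                              np.linalg.norm(features - features[nxt], axis=1))
    return selected

# Example: idx = farthest_point_sampling(image_embeddings, k=100)
```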

Second place was awarded to Danilo Brajovic, Research Associate at Fraunhofer IPA, for his pseudo-label generation technique: he trained multiple neural networks and classical machine learning models on a subset of the data to classify the remaining points into the positive and negative image pools. The best-performing model is then selected for core-set proposal under multiple sampling experiments, achieving a best mean F1 score of 78.98.

Third place was awarded to Steve Mussmann, a Computer Science postdoc at the University of Washington, who used modified uncertainty sampling to score a mean F1 of 78.06 on the leaderboard. He trained a binary classifier on noisy labels from OpenImages and used this classifier to assign images to positive and negative pools, with the core set randomly sampled from both pools. 

We would also like to highlight Margaret Warren from the Institute for Human and Machine Cognition for her submission using innovative techniques of human-centered axiomatic data selection.

Quality of Training Data: Speech Domain

The Speech domain was won by the Cleanlab team, led by Chief Scientist Jonas Mueller, who showcased a data-driven approach that uses out-of-sample predicted probabilities to select training sets of spoken words in multiple languages. The Cleanlab solution achieved a macro F1 of 46.76 across English, Portuguese, and Indonesian, versus a cross-validation baseline score of 42.23. 
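The general idea behind selection via out-of-sample predicted probabilities can be sketched as follows: score each candidate example by the cross-validated probability the model assigns to its own label, then keep the most reliable examples. This illustrates the technique in spirit; it is not the Cleanlab team’s actual code, and the simple classifier and parameters are placeholders.

```python
# Illustrative sketch of training-set selection via out-of-sample predicted probabilities.
# Assumes y holds integer class labels 0..K-1 matching the classifier's class ordering.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def select_confident_examples(X, y, keep_fraction=0.5):
    probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                              cv=5, method="predict_proba")
    self_confidence = probs[np.arange(len(y)), y]   # probability of the given label
    keep = int(len(y) * keep_fraction)
    return np.argsort(-self_confidence)[:keep]      # indices of the most reliable examples

# Example: idx = select_confident_examples(audio_embeddings, labels, keep_fraction=0.3)
```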

Training Data Cleaning

The third challenge centered on data cleaning within the Vision domain. Participants were tasked with devising data cleaning strategies to select samples from noisy training sets that would benefit from re-labeling. Team Akridata, led by Sudhir Kumar Suman, took first place with an ingenious, adaptive cleaning algorithm. Impressively, by addressing only 11.58% of the samples, Akridata achieved an accuracy that was 95% of what a model trained on a purely clean dataset could achieve. The team’s approach began with a thorough analysis of the dataset to identify any significant imbalance, followed by comprehensive and meticulous identification of misclassified samples. They delved deep into understanding the inherent limitations of the model, which helped in making more efficient corrections. They also embraced a boosting-like iterative methodology, which progressively honed the model’s performance with each cycle. Notably, they adopted a weighted approach that combines Shapley values into the scoring system to better represent the importance of each data sample. Their effort echoed work from other participants (namely Shaopeng Wei from ETH Zurich & SWUFE China), but differed from the more generic calculation of importance values.

Training Dataset Acquisition

Data marketplaces are increasingly crucial in a world driven by data. The dataset acquisition challenge focused on a critical problem in making data marketplaces work: participants had to devise a data acquisition strategy that selects the best training set from multiple data sellers based on limited exchanged information. There was a close tie between the submissions of Hanrui Lyu and Yifan Sun, PhD students from Columbia University advised by Prof. Yongchan Kwon, and Feiyang Kang, a PhD student from Virginia Tech advised by Prof. Ruoxi Jia. Both approaches achieved high accuracy. Team Columbia’s approach, which reached an accuracy of 76.45% (7.92% better than the baseline), was based on a brute-force search to find a single seller to allocate the entire budget to. This approach leveraged multiple submissions to Dynabench, which was not forbidden by our rules, although it is not a practical approach. Feiyang’s approach, a customized distribution-matching method with dimension reduction, achieved an accuracy of 76.17%. Because of the close match between the submissions, both were declared co-winners.

DataPerf and the Future

We are immensely proud of the work done by all the participants in these challenges. These challenges represent a pivotal moment in data-centric AI research, highlighting the crucial role data plays in ML’s evolution and how we can continue to improve it.

Congratulations to all the winners and thank you to all participants. Your efforts and solutions have provided a significant leap forward in our understanding and tackling of data bottlenecks.

If you’re interested in getting involved in more challenges, the MLCommons DataPerf working group is currently accepting submissions to the Adversarial Nibbler Challenge, in which users interact with an image generation model to identify “safe” prompts that lead to “unsafe” image generations. Adversarial Nibbler has a dedicated challenge track for the ART of Safety Workshop (AACL), and challenge participants are encouraged to write up their findings as red-teaming papers for the workshop. In addition, Adversarial Nibbler is intended as a continuous challenge with multiple rounds, so follow the Nibbler Twitter account to stay tuned for future rounds of the challenge. 

For those inspired by the work done here, we will shortly announce the DataPerf 2024 challenges; details coming soon. Please visit the DataPerf website to learn more about the challenges, join the upcoming challenges, or contribute to the design of new data-centric tasks. Join the MLCommons DataPerf working group to stay abreast of the exciting developments in the data-centric ML community. We presented DataPerf results at NeurIPS 2023 in New Orleans in December 2023. 

Appendix: DataPerf Resources

  • Challenge Creators
  • Challenge Submissions
    • Vision Selection
      • GitHub repository of the winning submission. It includes code to run the algorithm, slides presenting the developed method and a challenge report explaining the proposed approach.
    • Speech Selection
    • Adversarial Nibbler
      • Releases planned by the end of the year for (i) the dataset and (ii) a paper describing the dataset and the challenge results
    • Vision Cleaning
      • Akridata won the competition, but did not release an open-source solution.
      • Leaderboard: https://dynabench.org/tasks/vision-debugging

Acknowledgements

The DataPerf benchmarks were created over the last year by engineers and scientists from: Coactive AI, Eidgenössische Technische Hochschule (ETH) Zurich, Google, Harvard University, Meta, MLCommons, and Stanford University. In addition, this would not have been possible without the support of DataPerf working group members from Carnegie Mellon University, Mod Op, Factored, Hugging Face, Institute for Human and Machine Cognition, Landing.ai, San Diego Supercomputing Center, Thomson Reuters Lab, and TU Eindhoven.

DataPerf: the Leaderboard for Data
March 30, 2023 | https://mlcommons.org/2023/03/dataperf-the-leaderboard-for-data/

Introducing a data-centric platform and community for competitions and building better ML.

As Machine Learning (ML) has grown rapidly and become more powerful, new challenges have arisen. The latest are limitations in the availability and utility of training and testing datasets for ML models. We propose DataPerf, the first platform and community to develop competitions and leaderboards for data and data-centric AI algorithms. Competitions and leaderboards have traditionally focused on improving ML models and have inspired decades of innovation, including accomplishments such as AlexNet, whose breakthrough image recognition performance on the ImageNet dataset catalyzed the latest revolution in machine learning. By applying these proven approaches in a data-centric fashion, we believe that we can break through these dataset limitations and encourage better machine learning in the decades to come.

Data is the new bottleneck for ML

Could the keys to diagnosing and treating cancer earlier, engineering safe self-driving cars, automating fact-checking, detecting fraud, and reducing litigation costs have something in common? Machine learning (ML) promises new and exciting solutions to these problems, provided that appropriate training and testing data is available.

However, training and testing datasets are often too small, too specialized, too limited or of poor quality, and too static, leading to what we call the data bottleneck. The lack of appropriate training and testing data hinders ML development and deployment in various domains, including healthcare, transportation, finance, cybersecurity, and more.

In this blogpost, we identify the current data bottlenecks we are facing, and explain how DataPerf provides incentive for research and engineering communities to overcome these bottlenecks.

Figure 1: Changing the ML paradigm: On the left, standard ML assumes static training and test datasets. On the right, DataPerf evolves training and test data alongside models, overcoming data bottlenecks.

The first bottleneck is data quality. Datasets for ML come from a wide range of sources. For example, most ML research relies on public training and test datasets built from easily available sources such as web scrapes, forums, Wikipedia, and crowdsourcing. However, these sources often suffer from quality, distribution, and bias issues, among others. For example, as Dollar Street has illustrated, most visual data comes from the wealthy portions of the developed world, leading to bias.

These data quality bottlenecks in turn lead to data quantity bottlenecks. As data grows, a large portion of it is low-quality, which delivers incrementally less benefit and consequently leads to non-linear, rapid growth in model size[1]. This ultimately drives up the size and computational cost of models, with recent innovations requiring staggering and costly infrastructure[2]. As we exhaust our public data sources, ML models may even saturate in terms of accuracy, stalling progress[3]. Last, there is a growing gap between performance on these public research datasets and the real world. Therefore, it is crucial for the AI community to actively benchmark and conduct research to enhance the quality of training and test data. This is the core principle of the data-centric AI community.

DataPerf, the leaderboard for data

DataPerf is inspired by the success of ML Leaderboards. Leaderboards such as SQuAD, SuperGLUE, Kaggle and others have played a significant role in advancing research in ML models. These leaderboards create a shared metric for measuring progress and act as a compass for shaping ML research.

DataPerf is the first platform and community to develop leaderboards for data and data-centric AI algorithms. We aim to have a similar impact on data-centric AI research as ML leaderboards had on ML model research. DataPerf leverages Dynabench, a platform that supports benchmarking of data, data-centric algorithms, and models.

DataPerf v0.5 features five challenges that focus on five common data-centric tasks across four different application domains:

  • Training data selection (Vision) – Design a data selection strategy that selects the best training set from a large candidate pool of weakly labeled training images.
  • Training data selection (Speech) – Design a data selection strategy that selects the best training set from a large candidate pool of automatically extracted clips of spoken words.
  • Training data cleaning (Vision) – Design a data cleaning strategy that chooses samples to relabel from a noisy training set. The current version of this challenge targets image classification.
  • Training dataset valuation (NLP) – Design a data acquisition strategy that selects the best training set from multiple data-sellers based on limited information exchanged between buyers and sellers.
  • Adversarial Nibbler (Multimodal text-to-image) – Design safe-looking prompts that lead to unsafe image generations.

These challenges aim to benchmark and improve the performance of data-centric algorithms and models. Each challenge on the DataPerf website includes design documents that outline the problem, model, quality target, rules and submission guidelines. The Dynabench platform enables us to offer a live leaderboard, an online evaluation framework, and track submissions over time.

How to get involved?

As a community of machine learning researchers, data scientists, and engineers, we are committed to improving machine learning by enhancing data quality. DataPerf is calling on innovators in academia and industry to measure and validate data-centric algorithms and techniques, and to enhance datasets through these benchmarks. The first round of DataPerf 2023 challenges opens for participation on March 30th, 2023, and submissions will close on May 26th, 2023. The winners of the challenges will be invited to present their work at ICML 2023 and to publish it in the forthcoming Data for ML Research (DMLR) journal.

To help, please share this blog post with your colleagues. Visit the DataPerf website to participate in the DataPerf challenges, or to design a new data-centric challenge. Join the DataPerf working group to stay up-to-date on the ongoing changes in the data-centric ML community.

We would like to thank the Dynabench engineering team for providing the infrastructure and support for DataPerf. The DataPerf benchmarks were created over the last year by engineers and scientists from: Coactive.ai, ETH Zurich, Google, Harvard University, Meta, MLCommons, and Stanford University. In addition, this effort would not have been possible without the support of DataPerf working group members from Carnegie Mellon University, Digital Prism Advisors, Hugging Face, Institute for Human and Machine Cognition, Landing.ai, San Diego Supercomputing Center, Thomson Reuters Lab, and TU Eindhoven.

References


  1. Hestness, Joel, et al. “Deep learning scaling is predictable, empirically.” arXiv preprint arXiv:1712.00409 (2017). 
  2. Sevilla, Jaime, et al. “Compute trends across three eras of machine learning.” 2022 International Joint Conference on Neural Networks (IJCNN). IEEE, 2022. 
  3. Villalobos, Pablo, et al. “Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning.” arXiv preprint arXiv:2211.04325 (2022). 

Dollar Street – Bringing Diversity to Computer Vision
February 7, 2023 | https://mlcommons.org/2023/02/dollar-street-bringing-diversity-to-computer-vision/

Harnessing the power of diverse data to reduce bias and build better machine learning for everyone

We are extremely excited to announce the release of the Dollar Street Dataset, an open access, diverse computer vision (CV) dataset. It is designed to reduce geographic and socioeconomic bias and is a collaboration between MLCommons®, Coactive AI, Gapminder (a Swedish non-profit), and Harvard University. The Dollar Street Dataset includes 38K high-quality images of household objects from across the world, labeled with object tags and demographic data. The images were collected from dozens of different countries and all metadata was manually verified. The dataset is available under CC-BY 4.0 for academic and commercial usage. We improved classification accuracy for items from lower-income households by 50%, highlighting the impact of data in reducing bias and ultimately empowering the community to build better machine learning for everyone.

Computer vision is an incredibly powerful discipline within AI for enriching society and the human experience, and in the last decade machine learning has become the linchpin of CV. Advances in machine learning make life subtly better – for example, helping us take vivid photos that capture beautiful and unique memories. Other applications are profoundly shaping society and saving lives by enabling cars to detect pedestrians or helping detect, diagnose, and treat cancers better.

However, most existing commercial and open access datasets for training ML models disproportionately skew towards the developed world and more specifically high-income populations. This systematic bias in datasets means that many ML models perform unevenly across the broader population. In particular, many CV tools perform poorly for women and people of color. Just as one example, studies have found that existing image classification tools perform >30X worse for darker-skinned women compared to lighter-skinned men. This variation in performance can have tremendous real-world consequences, given the rapid adoption of CV throughout modern society.

The Dollar Street Dataset is an open access, diverse CV dataset designed to combat bias and enable machine learning to work better for everyone. It comprises 38K high-quality images of 289 everyday objects from households across the world, labeled with object tags and demographic data such as region, country, and household monthly income. The images were collected at scale across 63 different countries, including many communities without internet access, and all metadata was manually verified.

We also demonstrated that the Dollar Street Dataset can profoundly improve machine learning models by dramatically reducing bias and improving robustness, especially for lower socioeconomic communities. We used five standard CV models trained on existing datasets to classify images of household objects such as stoves, toilets, and toothbrushes. These models achieved an acceptable 57% accuracy for high-income households, but dropped to 18% for lower-income households, illustrating the bias of models trained on existing datasets. By fine-tuning with the Dollar Street Dataset, we achieved 70% or better accuracy across all household income levels, an increase of 50% or more.
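As an illustration of the kind of fine-tuning involved, the sketch below adapts a pretrained ImageNet classifier to Dollar Street’s household-object classes; it is a generic recipe under stated assumptions, not the authors’ exact training setup.

```python
# Generic fine-tuning sketch (not the paper's exact setup): replace the head of a
# pretrained ImageNet model with a new classifier for Dollar Street's 289 classes.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 289  # everyday-object categories in the Dollar Street Dataset

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)   # new classification head

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def finetune_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on a batch of Dollar Street images."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```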

Contributors to the Dollar Street Datasets include researchers from Coactive AI, Harvard University, and MLCommons. It can be downloaded at mlcommons.org/dollar-street and for more information, please read our paper accepted to the 2022 Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks.

The People’s Speech Dataset just got better – dynamic data in action
January 26, 2023 | https://mlcommons.org/2023/01/the-peoples-speech-dataset-just-got-better-dynamic-data-in-action/

Updating datasets improves speech for everyone

Our big, freely available speech dataset just got its first upgrade. Find out what’s new, and why dynamic datasets are the future of machine learning.

What is The People’s Speech Dataset?

The People’s Speech Dataset was designed to improve the accuracy of Automatic Speech Recognition (ASR) Machine Learning (ML) models for recognizing English speakers. It includes over 30,000 hours of transcribed English-language speech from a diverse set of speakers with a wide range of accents.

We published the People’s Speech in November 2021 under a CC-BY-SA and CC-BY 4.0 license, meaning it’s free for both academic and commercial use.

Why have we already updated it to v1.1?

We believe that dynamic datasets are critical to the future advancement of machine learning. MLCommons® is committed to actively maintaining and updating its datasets, so that developers can always train their models on the most relevant and robust data.

With the People’s Speech Dataset, this is particularly important because of the speed at which modern languages evolve. Fuelled by globalization and the proliferation of the internet, the real-world manifestations of language are highly dynamic. It’s therefore vital that the datasets built to represent everyday speech are dynamic too.

When datasets are not proactively maintained, they become less relevant with every day that passes because they are increasingly reflecting historical rather than present-day trends. For developers, using static datasets also brings a major risk of reinforcing implicit algorithmic biases.

By continuously improving the quality of our datasets, we make development easier, improve the real-world performance of ML models trained on them, and make machine learning better for everyone.

What’s new for v1.1?

In this first update, our Datasets Working Group has improved the speed, quality, and usability of the dataset, particularly for those using smaller subsets of data.

A major focus for us was making the dataset more user-friendly. We’ve created a tutorial that shows users how to train an ASR model using NVIDIA NeMo, with Google Cloud Platform included.

As another example, we’ve done behind-the-scenes work to improve download speeds, fix bugs, and move the dataset hosting to Hugging Face. We also raised the bar for our “clean” data samples, which has increased the proportion of “dirty” data and improved the overall quality of the remaining “clean” data.
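With the hosting now on Hugging Face, a typical way to sample the data looks like the sketch below. The repository id, configuration name, and field names are written as assumptions; check the dataset card for the exact identifiers.

```python
# Hedged sketch: stream a few examples of the People's Speech from Hugging Face.
# Repository id, config name, and field names are assumptions -- see the dataset card.
from datasets import load_dataset

ds = load_dataset("MLCommons/peoples_speech", "clean", split="train", streaming=True)

for example in ds.take(3):
    # Each example is expected to carry audio plus its transcript.
    print(example.get("text"))
```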

Summary

We hope these upgrades to our People’s Speech dataset inspire ML practitioners to pay more attention to the quality of their training data, and the importance of dynamic datasets.

Discover more and start using the dataset here


About MLCommons

MLCommons is an open engineering consortium with a mission to benefit society by accelerating innovation in machine learning. The foundation for MLCommons began with the MLPerf benchmark in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 50+ founding partners (global technology providers, academics, and researchers), MLCommons is focused on collaborative engineering work that builds tools for the entire machine learning industry through benchmarks and metrics, public datasets, and best practices.

For additional information on MLCommons and details on becoming a Member or Affiliate of the organization, please visit MLCommons or contact participation@mlcommons.org.

To get involved with The People’s Speech, please join the Datasets Working Group, and follow @MLCommons on Twitter. Help us keep growing this community effort, and don’t hesitate to get in touch if you would like to be involved.

Press Contact:
David Kanter
press@mlcommons.org

Harnessing Human-AI Collaboration
July 11, 2022 | https://mlcommons.org/2022/07/harnessing-human-ai-collaboration/

Dynamic Adversarial Data Collection augments large-scale datasets by adding diverse and high-quality data

AI models need data to learn about the world. In the past decade, scaling up model parameters and data together has driven tremendous progress in AI. This is the motivation behind building large scale datasets at MLCommons® such as the People’s Speech and the Multilingual Spoken Words Corpus. However, datasets are not just about quantity. Research shows that data diversity and quality are also critically important for good performance. We need “bigger” data, but we also need “better” data.

Dynamic Adversarial Data Collection (DADC) is an approach to collecting data that relies on collaboration between humans and AI models, e.g. using the Dynabench platform, to identify weaknesses and improve accuracy. It is Adversarial because humans collaborate by finding examples of input data that ‘trick’ the model into misclassification errors. It is Dynamic because the model-fooling process is iterative, building a progressively stronger model as more ‘tricky’ data is discovered and incorporated into training.

DADC has already demonstrated its effectiveness on a number of tasks such as question answering, sentiment analysis, natural language inference, and hate speech detection. While it leads to data that is more creative and diverse, the DADC-collected data often represents unusual corner cases and should be combined with real-world examples to be truly effective.

In the next decade, we envision a shift in focus from quantity alone towards a more balanced approach that incorporates quality, creativity, diversity and collaboration. DADC operates at the next frontier of model capabilities, providing a data- and human-centric tool to find and begin to bridge the gaps between machine and human intelligence.

In a large-scale collaboration to address these challenges, researchers from University College London, the University of Oxford, The Alan Turing Institute, Stanford University, the University of North Carolina at Chapel Hill, Carnegie Mellon University, the University of Sheffield, Simon Fraser University, Meta/FAIR, Hugging Face, and MLCommons are rethinking model benchmarking through Dynabench: a platform to facilitate DADC research. Dynabench brings models and humans together in an easy, collaborative, and community-driven way to push the boundaries of current data collection methods.

To encourage even more exciting collaboration, we will be holding the First Workshop on Dynamic Adversarial Data Collection, co-located with NAACL ‘22 in Seattle, on the 14th of July. We have an exciting day planned with amazing keynote speakers, a diverse and engaging panel, and presentations from our Shared Task participants. We encourage you to get involved! To stay up to date with all DADC-related developments, please follow @DADCWorkshop, @DynabenchAI, and @MLCommons on Twitter. Help us keep growing this community effort, and don’t hesitate to get in touch if you would like to be involved.

MLCommons Unveils Open Datasets and Tools to Drive Democratization of Machine Learning
December 14, 2021 | https://mlcommons.org/2021/12/mlcommons-unveils-open-datasets-and-tools-to-drive-democratization-of-machine-learning/

Engineering consortium advances data-centric AI with pioneering datasets and tools

The MLCommons® Association, an open engineering consortium dedicated to improving machine learning for everyone, today announced the general availability of the People’s Speech Dataset and the Multilingual Spoken Words Corpus (MSWC). These trail-blazing and permissively licensed datasets advance innovation in machine learning research and commercial applications. Also today, the MLCommons Association is issuing a call for participation in the new DataPerf benchmark suite, which measures and encourages innovation in data-centric AI.

The People’s Speech Dataset
The People’s Speech Dataset is among the world’s largest diverse English speech recognition datasets licensed for academic and commercial usage. The 30,000-hour supervised conversational dataset is an order of magnitude larger than what was available just a few years ago. The dataset, released under a Creative Commons license, democratizes access to speech technology such as voice assistants and transcription, and unlocks innovation in the machine learning community. Contributors to the dataset include researchers from Baidu, Factored, Harvard University, Intel, Landing AI, and NVIDIA. It can be downloaded at mlcommons.org/speech.

Multilingual Spoken Words Corpus
Also available today is the Multilingual Spoken Words Corpus (MSWC), a rich audio speech dataset with more than 340,000 keywords in 50 languages and upwards of 23.4 million examples. Previous datasets relied on manual efforts to collect and validate thousands of utterances for each keyword and were commonly restricted to a single language. A diverse multilingual dataset that spans languages spoken by over 5 billion people, MSWC advances the research and development of applications such as voice interfaces for a broad global audience. Contributors to the MSWC include researchers from Coqui, Factored, Google, Harvard University, Intel, Landing AI, NVIDIA, and the University of Michigan. It can be downloaded at mlcommons.org/words.

DataPerf
The new DataPerf benchmark suite supports data-centric AI innovation by measuring the quality of datasets for common ML tasks and the impact of enhancing datasets. Training and test datasets are a key part of creating an ML solution — the solution can only be as good as the data — but much less effort is spent on understanding and improving datasets than on mastering and improving models. DataPerf fosters and measures progress in this vital area. The MLCommons Association will support a series of challenges with leaderboards in 2022 to encourage participation in DataPerf. Contributors to the suite include researchers from Alibaba, Coactive.AI, ETH Zurich, Google, Harvard University, Landing.AI, Meta, Stanford University, and TU Eindhoven, drawing on the teams responsible for Cats4ML, the Data-Centric AI Competition, DCBench, Dynabench, and the MLPerf™ benchmarks. The MLCommons Association invites other participants to join the DataPerf effort at dataperf.ai.

Historically, most AI research has focused on improving model architectures and making them available to the community; in contrast, attention to engineering and maintaining datasets has lagged and is often manual and ad-hoc. The MLCommons Association is a firm proponent of Data-Centric AI (DCAI), the discipline of systematically engineering the data for AI systems by developing efficient software tools and engineering practices to make dataset creation and curation easier. Our open datasets and tools like DataPerf concretely support the DCAI movement and drive machine learning innovation.

“The machine learning model architecture for many applications is basically a solved problem. In many cases, focusing on engineering the data is more important for unlocking successful AI applications. Data is food for AI, and our systems need not just massive amounts of calories, but also high quality nutrition. We need not just big data, but good data,” said Andrew Ng, founder and CEO of Landing AI, founding lead of Google Brain, co-founder and chairman of Coursera, and adjunct professor at Stanford University. “Thanks to the shared efforts by the community, including the work initiated by the MLCommons Association and its members, the movement demonstrates the potential for Data-Centric AI, and how we can collectively implement a greater AI adoption.”

“Speech technology can empower billions of people across the planet, but there’s a real need for large, open, and diverse datasets to catalyze innovation,” said David Kanter, the MLCommons Association co-founder and executive director. “The People’s Speech is a large scale dataset in English, while MSWC offers a tremendous breadth of languages. I’m excited for these datasets to improve everyday experiences like voice-enabled consumer devices and speech recognition.”

About the MLCommons Association
The foundation for the MLCommons Association began with the MLPerf benchmark in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 50+ founding member partners (global technology providers, academics, and researchers), the MLCommons Association is focused on collaborative engineering work that builds tools for the entire machine learning industry through benchmarks and metrics, public datasets, and best practices.
For additional information on the MLCommons Association and details on becoming a member of the organization, please visit https://mlcommons.org/ or contact membership@mlcommons.org.

Press Contact:
press@mlcommons.org

The post MLCommons Unveils Open Datasets and Tools to Drive Democratization of Machine Learning appeared first on MLCommons.

]]>
Introducing the People’s Speech dataset – 30,000+ hours of diverse speech data to drive ML innovation https://mlcommons.org/2021/12/introducing-the-peoples-speech-dataset-30000-hours-of-diverse-speech-data-to-drive-ml-innovation/ Tue, 14 Dec 2021 08:35:00 +0000 http://local.mlcommons/2021/12/introducing-the-peoples-speech-dataset-30000-hours-of-diverse-speech-data-to-drive-ml-innovation/ Bigger, better, and for everyone

The post Introducing the People’s Speech dataset – 30,000+ hours of diverse speech data to drive ML innovation appeared first on MLCommons.

We are thrilled to announce the public release of the People’s Speech, a 30,000-hour and growing supervised conversational English speech recognition dataset. It is the largest diverse English speech recognition corpus available today that is licensed for both academic and commercial use under CC-BY-SA and CC-BY 4.0. Crucially, the People’s Speech will drive innovation by democratizing machine learning – it is large enough to train fully end-to-end models and to improve applications such as automatic speech recognition, transcription, diarization, and noise reduction. In future releases we will expand its reach and impact by adding more complex acoustic conditions and additional languages.
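For readers who want to explore the data, here is a minimal sketch of streaming a few examples with the Hugging Face `datasets` library. The repository id `MLCommons/peoples_speech`, the `clean` configuration, and the field names are assumptions rather than guarantees, so check the dataset card and license terms before relying on them.

```python
# Sketch: stream a few examples of the People's Speech from the Hugging Face Hub.
# Repository id, config name, and field names are assumptions - verify against the dataset card.
# Requires a recent version of the `datasets` library (streaming + .take()).
from datasets import load_dataset

ds = load_dataset("MLCommons/peoples_speech", "clean", split="train", streaming=True)

for example in ds.take(3):
    audio = example["audio"]  # assumed schema: dict with "array" and "sampling_rate"
    seconds = len(audio["array"]) / audio["sampling_rate"]
    print(example["text"][:80], f"... ({seconds:.1f}s)")
```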

Language and voice are natural and universal for nearly everyone on the planet and foundational to human interaction. Voice interfaces have long captured our imagination in movies and science fiction, and that vision of the future is increasingly becoming real – voice assistants are common in everything from smartphones to cars, and are expected to reach the majority of the world’s population by the middle of the decade. More subtly, today’s phone calls and audio conferences routinely benefit from transcription and audio enhancement.

But the machine learning community is only just scratching the surface of the possibilities unlocked by speech technologies. One of the biggest obstacles to progress is the limited availability of diverse, large-scale datasets with permissive licensing; experts estimate that training an end-to-end speech recognition model requires 10,000-15,000 hours of labeled speech. Most existing public speech datasets are held back by some combination of restrictive licensing that prohibits commercial use, relatively small size, and simpler types of speech such as read audiobooks.

The People’s Speech is built from audio data on Archive.org that is either in the public domain or available under CC-BY or CC-BY-SA licenses. Rather than having humans transcribe everything, which is expensive and time-consuming, we collected only audio that already had a transcript. We then used an innovative software pipeline to automatically match the audio with the transcript. As part of our commitment to updating and maintaining the dataset, we are also releasing our software as open source to empower the community to build on our contributions.
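The matching idea can be illustrated with a hedged sketch: transcribe a candidate audio chunk with any ASR model, compare the hypothesis to the pre-existing transcript using word-level edit distance, and keep only segments that agree closely. This is not the released pipeline; `transcribe` is a placeholder for whatever ASR model is available, and the 0.5 word-error-rate threshold is arbitrary.

```python
# Illustrative filter for pairing audio chunks with pre-existing transcripts.
# `transcribe(chunk)` stands in for any ASR call (e.g. a wav2vec2 or Whisper model);
# it is a placeholder, not part of the released MLCommons pipeline.

def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over words (insertions, deletions, substitutions)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def keep_segment(reference_text: str, chunk, transcribe, max_wer: float = 0.5) -> bool:
    """Accept a chunk only if the ASR output roughly matches the claimed transcript."""
    ref = reference_text.lower().split()
    hyp = transcribe(chunk).lower().split()
    wer = word_edit_distance(ref, hyp) / max(len(ref), 1)
    return wer <= max_wer
```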

Why does this matter? By making a dataset of this scale openly available to more people, machine learning research and commercial applications can advance significantly. In short, the People’s Speech provides a solid jumping-off point for other companies and individuals to innovate and experiment.

Contributors to the dataset include researchers from Baidu, Factored, Harvard University, Intel, Landing AI, and NVIDIA. It can be downloaded at mlcommons.org/speech. For more information, please read our paper accepted to the 2021 Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks.

The post Introducing the People’s Speech dataset – 30,000+ hours of diverse speech data to drive ML innovation appeared first on MLCommons.

Multilingual Spoken Words Corpus – 50 Languages and Over 23 Million Audio Keyword Examples https://mlcommons.org/2021/12/multilingual-spoken-words-corpus-50-languages-and-over-23-million-audio-keyword-examples/ Tue, 14 Dec 2021 08:33:00 +0000 http://local.mlcommons/2021/12/multilingual-spoken-words-corpus-50-languages-and-over-23-million-audio-keyword-examples/ Giving the gift of voice to 5 billion people in 50 languages

The post Multilingual Spoken Words Corpus – 50 Languages and Over 23 Million Audio Keyword Examples appeared first on MLCommons.

Today we are extremely excited to announce the initial release of the Multilingual Spoken Words Corpus (MSWC), a large and growing audio dataset of spoken words in 50 different languages. These languages are collectively spoken by over 5 billion people, and for most of them this is the first publicly available, free-to-use dataset for training voice interfaces. The corpus is licensed under CC-BY 4.0 to inspire academic research and commercial work in keyword spotting, spoken term search, and other applications that can benefit people across the planet. Our ultimate goal is to make keyword-spotting voice interfaces available for any keyword in any language.

Voice-based interaction is already democratizing access to technology. For example, keyword spotting is a common application in many smart devices and assistants (such as Apple’s Siri, Amazon’s Alexa, or the Google Assistant). Keyword spotting systems use very low-power hardware to continuously listen for a key phrase in order to trigger an action such as turning on lights or waking up a more sophisticated interface. For some people this kind of interaction is a modern convenience, but for others, like the visually impaired, it can be a life-changing capability.
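As a rough illustration of how such a system operates, the sketch below scores overlapping one-second windows of an audio stream and reports when a window exceeds a detection threshold. The scoring model (`score_window`) is a placeholder, and the window, hop, and threshold values are arbitrary assumptions rather than anything prescribed by MSWC.

```python
# Minimal sliding-window keyword spotter, for illustration only.
# `score_window` stands in for whatever small model runs on-device
# (e.g. a tiny CNN over log-mel features); the 0.8 threshold is arbitrary.
import numpy as np

SAMPLE_RATE = 16_000
WINDOW = SAMPLE_RATE          # 1-second windows, matching the MSWC clip length
HOP = SAMPLE_RATE // 4        # re-score every 250 ms

def detect_keyword(stream: np.ndarray, score_window, threshold: float = 0.8):
    """Yield the start time (in seconds) of every window scored above the threshold."""
    for start in range(0, len(stream) - WINDOW + 1, HOP):
        window = stream[start : start + WINDOW]
        if score_window(window) >= threshold:
            yield start / SAMPLE_RATE
```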

Robust voice interaction requires training machine learning models on large datasets. Traditionally, these keyword datasets require enormous effort to collect and validate many thousands of utterances for each keyword of interest from a diverse set of speakers and environmental contexts. Unfortunately, most existing public keyword datasets are monolingual and contain only a handful of keywords. Many commonly spoken languages lack any public datasets, which makes it extremely difficult to provide basic voice capabilities to speakers of these languages.

To address these challenges, MLCommons® has developed, and will maintain and update, MSWC, a large dataset of spoken words in 50 languages. In total, the dataset contains over 340,000 words and 23 million one-second audio samples, adding up to over 6,000 hours of speech. We constructed the dataset by applying open source tools to extract individual words from crowdsourced sentences donated to the Common Voice project; the resulting word clips can then be used to train keyword spotting models for voice assistants across a diverse array of languages.
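The extraction step can be pictured as follows: once an aligner has produced word-level timestamps for a Common Voice sentence, each word is cut out as a fixed one-second clip. The snippet below shows only that clipping step under assumed inputs; it is not the released tooling, and the alignment itself is assumed to happen upstream.

```python
# Sketch of the clipping step: given word-level timings from a forced aligner,
# cut a fixed one-second clip centered on each word. Producing `words` as
# (word, start_sec, end_sec) tuples is assumed to be done upstream.
import numpy as np

SAMPLE_RATE = 16_000
CLIP_SAMPLES = SAMPLE_RATE  # one second per clip, as in MSWC

def extract_word_clips(waveform: np.ndarray, words):
    clips = {}
    for word, start, end in words:
        center = int((start + end) / 2 * SAMPLE_RATE)
        lo = max(0, center - CLIP_SAMPLES // 2)
        clip = waveform[lo : lo + CLIP_SAMPLES]
        # Zero-pad short clips (e.g. words near the end of the recording).
        clip = np.pad(clip, (0, CLIP_SAMPLES - len(clip)))
        clips.setdefault(word, []).append(clip)
    return clips
```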

Of the languages in our dataset, 12 are “high-resource” with over 100 hours of data, 12 are “medium-resource” with 10 to 100 hours, and 26 are “low-resource” with under 10 hours. To our knowledge, MSWC is the only open-source spoken word dataset for 46 of these languages. Each keyword has predefined training, validation, and testing splits, and we are also releasing the open-source tools used to build the dataset and categorize the keywords.

Contributors to the MSWC include researchers from Coqui, Factored, Google, Harvard University, Intel, Landing AI, NVIDIA, and the University of Michigan. It can be downloaded at mlcommons.org/words and for more information, please read our paper accepted to the 2021 Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks.

The post Multilingual Spoken Words Corpus – 50 Languages and Over 23 Million Audio Keyword Examples appeared first on MLCommons.
