Datasets Archives - MLCommons
https://mlcommons.org/category/datasets/

Unsupervised People's Speech: A Massive Multilingual Audio Dataset
January 30, 2025 | https://mlcommons.org/2025/01/new-unsupervised-peoples-speech/

Introducing the Unsupervised People’s Speech dataset

The MLCommons Datasets working group is pleased to announce the release of the Unsupervised People’s Speech dataset. Built in collaboration with Hugging Face, the Unsupervised People’s Speech dataset contains over 1 million hours of audio spanning dozens of languages. Given the impact the previously released, supervised People’s Speech dataset had on models such as Whisper, we expect this new dataset to drive innovation across a range of tasks, from self-supervised pretraining to improved automatic speech recognition pipelines in numerous languages.

The Rationale Behind the Dataset

As the field of speech technology continues to advance, the need for large, diverse audio datasets becomes increasingly pressing. The Unsupervised People’s Speech dataset aims to address this need by providing a vast collection of multilingual audio data to support research and development across speech technology. Supporting broader Natural Language Processing (NLP) research for languages other than English helps bring communication technologies to more people globally, including speakers of low-resource languages.

Ensuring Useful Data

To ensure the dataset is as useful as possible, the MLCommons Datasets working group ran several data pipelines to understand its contents across dimensions such as language distribution and speech detection.

  • Speech detection: the working group created a custom data loader that resamples the audio and converts it to a single channel (available in the Hugging Face dataset card), then ran Silero’s Voice Activity Detection pipeline, a model that uses a multi-head attention mechanism with STFT features. This pipeline identified a total of more than 821,412 hours of speech (a rough sketch of this step follows the list). 
  • Language identification: language identification is often the first task in a series of NLP tasks, so it’s crucial to get it right. The working group used NVIDIA’s TensorRT-LLM implementation of Whisper Large v3 to run inference on the subset of the dataset where the speech detection pipeline found a speech utterance. This pipeline detected a total of 89 languages. We are confident there are more: the third most common category the model inferred was a “no speech” tag, meaning the model could not determine the language for those segments.
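For readers who want a concrete picture of the speech-detection step, the sketch below shows one way to run Silero VAD over resampled, single-channel audio using its publicly documented torch.hub entry point. This is an illustrative sketch, not the working group’s actual pipeline; file paths are placeholders.

```python
# Illustrative sketch only (not the working group's pipeline): downmix to mono,
# resample to 16 kHz, and run Silero VAD to estimate seconds of detected speech.
import torch
import torchaudio

# Silero VAD is distributed with helper utilities via torch.hub.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps = utils[0]  # first helper: speech timestamp extraction

def speech_seconds(path: str, target_sr: int = 16000) -> float:
    """Return the number of seconds of detected speech in one audio file."""
    waveform, sr = torchaudio.load(path)            # shape: (channels, samples)
    waveform = waveform.mean(dim=0)                 # downmix to a single channel
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    stamps = get_speech_timestamps(waveform, model, sampling_rate=target_sr)
    return sum(s["end"] - s["start"] for s in stamps) / target_sr

# Example (placeholder path): hours = speech_seconds("clip_0001.flac") / 3600
```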

Technical Hurdles

While the focus of the Unsupervised People’s Speech dataset is on its potential applications, it’s worth noting some of the technical challenges the working group overcame:

  1. Data Upload and Storage: we developed a custom script using a Git LFS backend to efficiently upload the 48+ TB dataset to S3 and Hugging Face, overcoming typical speed limitations.
  2. Self-Supervised Learning Potential: we’ve included a training pipeline to train a Wav2Vec model, which could unlock new possibilities in unsupervised speech representation learning.
  3. Deduplication: we are creating embedding representations of the entire dataset using Meta’s EnCodec so that duplicate content can be identified and removed, ensuring the dataset’s uniqueness and quality (a rough sketch of the embedding step follows this list).
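As a rough illustration of the embedding step mentioned above, the sketch below encodes an audio file with the Hugging Face transformers port of Meta’s EnCodec; the resulting discrete codes could then serve as compact fingerprints for near-duplicate detection. This is an assumption-laden sketch rather than the working group’s actual deduplication code.

```python
# Hedged sketch: compute EnCodec codes for an audio file as a compact fingerprint.
# Not the working group's deduplication pipeline; checkpoint name per transformers docs.
import torch
import torchaudio
from transformers import AutoProcessor, EncodecModel

processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
model = EncodecModel.from_pretrained("facebook/encodec_24khz")

def encodec_codes(path: str) -> torch.Tensor:
    """Return the discrete EnCodec codes for one audio file."""
    wav, sr = torchaudio.load(path)
    wav = wav.mean(dim=0)                                            # mono
    wav = torchaudio.functional.resample(wav, sr, processor.sampling_rate)
    inputs = processor(raw_audio=wav.numpy(),
                       sampling_rate=processor.sampling_rate,
                       return_tensors="pt")
    with torch.no_grad():
        encoded = model.encode(inputs["input_values"], inputs["padding_mask"])
    return encoded.audio_codes  # duplicates should yield near-identical codes
```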

Future Work

The Unsupervised People’s Speech dataset is built from audio data on Archive.org that is either public domain or available with CC-BY or CC-BY-SA licenses. As part of our commitment to updating and maintaining the dataset, we are also releasing the software as open-source to empower the community to build on our contributions.

As the working group completes the deduplication and self-supervised efforts, we anticipate several avenues for the research community to continue to build and develop, especially in the areas of improving low-resource language speech models, enhanced speech recognition across different accents and dialects, and novel applications in speech synthesis. 

Join Us

There are many ways to get involved in the MLCommons data effort. We are currently seeking speakers of over 130 languages to create a benchmark for a text language identification task. If this sounds interesting, head to Dynabench’s text classification task.  

If you want to be part of the Datasets working group or any other working group at MLCommons, please visit the Get Involved page. We meet weekly to share ideas and implement projects. If you want to read more about the MLCommons Unsupervised People’s Speech dataset, visit https://mlcommons.org/datasets/unsupervised-peoples-speech/

MLCommons Releases AILuminate Creative Commons DEMO Benchmark Prompt Dataset to GitHub
January 15, 2025 | https://mlcommons.org/2025/01/ailuminate-demo-benchmark-prompt-dataset/

Set of 1,200 DEMO prompts exercises the breadth of hazards for generative AI systems

Today, MLCommons® publicly released the AILuminate DEMO prompt dataset, a collection of 1,200 prompts for generative AI systems to benchmark safety against twelve categories of hazards. The DEMO prompt set is a component of the full AILuminate Practice Test dataset.

The dataset is available at: https://github.com/mlcommons/ailuminate.

The AILuminate benchmark, designed and developed by the MLCommons AI Risk and Reliability working group, assesses LLM responses to over 24,000 test prompts across twelve categories of hazards that users of generative AI systems may encounter. It also includes a best-in-class evaluation system that uses a tuned ensemble of safety evaluation models, as well as public reports of 13 systems under test (SUTs), graded on the AILuminate v1.0 benchmark scale to assess the quantitative and qualitative performance of each SUT in every hazard category.

The full 24,000 test prompts are broken out into the following datasets:

  • A Practice Test dataset with a standard, well-documented set of prompts that can be used for in-house testing, as examples of specific hazards, and as seeds and inspiration for the creation of other prompts. The Creative Commons DEMO set being released today is a 10% subset of the Practice Test dataset. Access to the full Practice Test dataset is available upon request from MLCommons. 
  • An Official Test dataset, used for an official benchmark of a System Under Test (SUT), which can be a single LLM or an ensemble. The Official Test is run by MLCommons, using the official AILuminate independent evaluator to provide a detailed set of performance scores measured against a standard baseline. The Official Test dataset is private to prevent SUTs from “gaming” the benchmark by training on the specific test prompts. The Practice Test and Official Test datasets are designed to be statistically equivalent in their coverage and testing of the taxonomy of hazards. To run an Official Test, contact MLCommons.

The 12 hazard categories included in the test prompts are extensively documented in the AILuminate Assessment Standard.

All prompts are currently in English, with plans to expand to French, Chinese and Hindi. MLCommons intends to update the prompt datasets, including the Creative Commons DEMO dataset, on a regular basis as the community’s understanding of AI hazards evolves – for example, as agentic behavior is developed in AI systems. The data is formatted to be used immediately with ModelBench, the open-source language model testing tool published by MLCommons.
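As a starting point for in-house experimentation, the sketch below shows one way to iterate over the DEMO prompt file and collect responses from a system under test. The file name and column names are assumptions for illustration only; consult the GitHub repository for the actual schema.

```python
# Illustrative only: run the DEMO prompts through a model wrapper and keep the
# responses grouped by hazard category. Column names below are assumed, not official.
import csv

def collect_responses(prompt_csv: str, generate) -> list:
    """`generate` is any callable mapping a prompt string to a model response."""
    results = []
    with open(prompt_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            prompt = row.get("prompt_text", "")     # assumed column name
            hazard = row.get("hazard", "unknown")   # assumed column name
            results.append({"hazard": hazard,
                            "prompt": prompt,
                            "response": generate(prompt)})
    return results

# Example usage with a placeholder model wrapper:
# responses = collect_responses("ailuminate_demo_prompts.csv", my_model.generate)
```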

“We have already started using AILuminate to help us robustly evaluate the safety of our models at Contextual. Its rigor and careful design has helped us gain a much deeper understanding of their strengths and weaknesses,” said Bertie Vidgen, Head of Human Data at Contextual AI.

Raising the Bar for Safety Testing by Empowering Stakeholders

The AILuminate Creative Commons-licensed DEMO prompt dataset provides an important starting point for those developing in-house abilities to test the safety of generative AI systems across the twelve defined AILuminate hazard categories. While we discourage training AI systems directly on this data, following the industry best practice of maintaining separate evaluation suites to avoid overfitting AI models, we do expect the dataset to have direct relevance to the work of a wide variety of stakeholders. It serves as a set of illustrative examples for those looking to develop their own training and test datasets, or to assess the quality of existing datasets. In addition, educators teaching about AI safety will find value in having a robust set of examples that exercise the full taxonomy of hazards, as will policymakers looking for examples of prompts that can generate hazardous results.

Download the DEMO Prompt Dataset Today

The DEMO test set is made available under the Creative Commons CC-BY-4.0 license and can be downloaded directly from GitHub. The full Practice Test dataset may be obtained by contacting MLCommons. After using the offline and online practice tests, we encourage model developers to submit their systems to the AILuminate Official Test for public benchmarking, validation, and publication of the results.

More information on AILuminate can be found here, including a FAQ. We encourage stakeholders to join the MLCommons AI Risk and Reliability working group and our efforts to ensure the safety of AI systems.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. In collaboration with its 125+ members (global technology providers, academics, and researchers), MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI risk and reliability. We invite others to join the AI Risk and Reliability working group.

Bringing Open Source Principles to AI Data through Croissant
December 16, 2024 | https://mlcommons.org/2024/12/croissant-medium-post/

Republished from the ODI research team Medium ("Canvas") post, authored by Thomas Carey Wilson, November 27, 2024.

Croissant, a metadata format that helps standardize machine learning (ML) datasets, has gained significant popularity within the open-source AI community by offering a powerful tool for describing ML datasets and enhancing their accessibility. Developed by the MLCommons Croissant working group, it serves as a community-driven metadata standard that simplifies how datasets are used, making them more understandable and shareable. This surge in adoption underscores a growing awareness of the critical role transparency and collaboration play in advancing AI development.

By adhering to the Open Source AI Definition, Croissant guarantees that datasets are FAIR (findable, accessible, interoperable, and reusable) and advocates for responsible data management practices. As an increasing number of organizations and individuals embrace Croissant, cultivating innovation and collaboration across platforms becomes essential.

Thomas Carey Wilson’s recent Medium post, “Bringing Open Source Principles to AI Data through Croissant,” explains why embracing Croissant is a decisive step toward an open-source AI future where transparency and collective advancement are the norms. We are republishing the Medium post here for the community. To get involved in the Croissant working group, join it here.

Republished from the ODI research team Medium account (“Canvas“), November 27, 2024, authored by Thomas Carey Wilson.

Bringing Open Source Principles to AI Data through Croissant

Introduction

Artificial Intelligence (AI) is progressively influencing various sectors, including industries and governmental operations. Despite some risks, its potential can be unlocked by greater and more responsible openness, enabling freedom, transparency, and collaboration. The Open Source Initiative’s (OSI) Open Source AI Definition seeks to bring these core principles to AI systems. To our delight, central to this mission is the availability and standardisation of AI data, in addition to models and weights. This is where MLCommons’ Croissant comes into play. Croissant is a community-driven metadata standard that simplifies how datasets are used in machine learning (ML), making them more accessible, understandable, and shareable.

Understanding the Open Source AI Definition

The Open Source AI Definition extends the foundational freedoms of open source software to AI systems. It outlines four essential freedoms:

  • Use: the right to use the AI system for any purpose without seeking permission.
  • Study: examining the system’s workings by accessing its components.
  • Modify: the freedom to alter the system to change its outputs or behaviour.
  • Share: the right to distribute the system to others, with or without modifications, for any purpose.

These freedoms apply not only to complete AI systems but also to their components, such as models, data, and parameters. The OSI explains that a critical enabler of these freedoms is ensuring access to the “preferred form to make modifications”: the essential resources required for others to understand and alter certain aspects of the AI system.

This includes providing detailed information about the data used to train the system (“data information”), the complete source code for training and running the system (“code”), and the model parameters such as weights or configuration settings (“parameters”). Among these, data information is particularly crucial because it empowers users to study and modify AI systems by understanding the data foundation upon which they are built, a core area where Croissant makes a significant contribution.

We value the focus on data transparency but suggest, as highlighted in our data for AI taxonomy, specifying what measures enabling transparency could look like for different data types in AI systems. This can support more equitable access, informed debate, and the implementation of responsible practices across the AI ecosystem.

Croissant’s alignment with the Open Source AI Definition

Croissant, as an open-source tool, ensures that ML datasets are FAIR (findable, accessible, interoperable, and reusable), aligning with open science and open data practices. It embodies the principles of the Open Source AI Definition by providing a standardized, machine-readable format for ML dataset metadata. Here’s how Croissant supports each of the core freedoms and data information requirements:

Use

To be useful, core Croissant metadata should be openly available for all AI datasets, even if the datasets themselves cannot be made public. For an open-source AI model based on open data, Croissant can enable free use of AI datasets by providing detailed metadata that makes datasets easier to find and utilize. Once the required dataset(s) are located, it allows practitioners to quickly understand data collection methods, structures, and processing steps by standardizing how datasets are described. This ease of access and clarity empowers users to employ datasets for any purpose without needing permission or facing unnecessary barriers.

Moreover, Croissant’s design ensures interoperability by leveraging its attributes and ensuring tools understand what those attributes mean. The use of widely recognized vocabularies like schema.org allows integration with existing data and metadata management tools, such as data crawlers, search engines, data catalogues, and metadata assurance systems. This aligns with the Open Source AI Definition’s emphasis on freedom of use by enabling seamless integration across various ML frameworks like PyTorch, TensorFlow, and JAX.

Study

Studying an AI system requires access to detailed information about its components. Croissant facilitates this by offering rich metadata that supports a range of responsible AI use cases, including the structured discovery of datasets, representation of human contributions, and analysis of data diversity and bias. Specifically, it provides:

  • Dataset-level information: Includes descriptions, licenses, creators, and versioning.
  • Resources: Details about files and file sets within the dataset.
  • RecordSets: Structures that describe how data is organized, including fields and data types.

For example, Croissant enables researchers to analyze the demographic characteristics of annotators and contributors to assess dataset diversity or potential biases. For this purpose and others, the inclusion of data types, hierarchical record sets, and fields supports machine interoperability, allowing tools to interpret and integrate datasets effectively.
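To make the structure concrete, here is a heavily simplified sketch of Croissant-style metadata, written as a Python dictionary that mirrors the JSON-LD layout. Property names and values are illustrative and abbreviated; the actual specification (and its schema.org context) should be consulted for a valid record.

```python
# Simplified, illustrative Croissant-style metadata (not a spec-complete record).
croissant_metadata = {
    "@type": "sc:Dataset",                      # dataset-level information
    "name": "example-speech-dataset",
    "description": "Toy example of Croissant-style metadata.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "version": "1.0.0",
    "distribution": [                           # resources: files and file sets
        {
            "@type": "cr:FileObject",
            "name": "clips.tar.gz",
            "contentUrl": "https://example.org/clips.tar.gz",  # placeholder URL
            "encodingFormat": "application/gzip",
            "sha256": "<checksum goes here>",
        }
    ],
    "recordSet": [                              # how records and fields are organized
        {
            "@type": "cr:RecordSet",
            "name": "utterances",
            "field": [
                {"name": "audio", "dataType": "sc:AudioObject"},
                {"name": "transcript", "dataType": "sc:Text"},
                {"name": "speaker_region", "dataType": "sc:Text"},
            ],
        }
    ],
}
```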

Modify

Croissant’s open and extensible format encourages modification and adaptation. Users can:

  • Extend metadata: Croissant is designed to be modular, allowing for extensions like the Geo-Croissant extension, which captures information specific to geospatial data.
  • Adapt datasets: the standardization of metadata means that datasets can be modified more easily, whether by adding new data, altering existing records, or adjusting data structures.
  • Integrate with tools: Croissant metadata can be edited through a visual editor or a Python library, making it accessible for developers to modify datasets programmatically.
  • Enable interoperability across repositories: Croissant acts as a glue between platforms like Kaggle and Hugging Face, providing adapters that enable seamless dataset interchange while retaining provenance and versioning information for compatibility across frameworks.

For example, if a practitioner wants to add new fields to a dataset or alter the way data is processed, Croissant’s standardized format makes these modifications straightforward. The ability to represent complex data types and transformations within the metadata ensures that changes are accurately captured and can be shared with others.

Share

Sharing is at the heart of open source, and Croissant facilitates this by making datasets and their metadata easily distributable. The use of JSON-LD and alignment with schema.org means that metadata can be embedded in web pages, enabling search engines to index and discover datasets. Croissant also supports:

  • Portability: Croissant datasets can be seamlessly loaded into different ML frameworks, promoting sharing across platforms.
  • Collaboration: Croissant fosters collaboration among practitioners, researchers, and organizations by providing a common language for dataset metadata.

An illustrative example is how Croissant enables dataset repositories to infer metadata from existing documentation like data cards, making it easier to publish and share datasets with metadata attached.
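In practice, this portability is what dataset-loading tooling builds on. The sketch below shows roughly how a Croissant description might be consumed with the mlcroissant reference library; the entry-point names and the metadata URL are written from memory and should be treated as assumptions, with the project README as the authority.

```python
# Rough sketch, assuming the mlcroissant reference library's documented entry points.
# The metadata URL and record-set name below are placeholders.
import mlcroissant as mlc

ds = mlc.Dataset(jsonld="https://example.org/path/to/croissant/metadata.json")

# Iterate a few records from one record set; field names depend on the dataset.
for i, record in enumerate(ds.records(record_set="utterances")):
    print(record)
    if i == 2:
        break
```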

Data information compliance

As for the practical requirements which enable these freedoms, the Open Source AI Definition places significant emphasis on data information, requiring:

  1. A complete description of all data used for training: Croissant provides a detailed schema for dataset-level information, including descriptions, licenses, creators, and data versions. It also specifies how to represent resources like files and file sets, ensuring that all data used in training is thoroughly documented.
  2. Listing of publicly available training data: through the distribution property and detailed resource descriptions, Croissant makes it clear where data is stored and how it can be accessed. The use of URLs and content descriptions aligns with the requirement to list publicly available data and where to obtain it.
  3. Listing of training data obtainable from third parties: Croissant’s metadata can include references to external data sources, including third-party datasets. Specifying licenses and access URLs ensures that users know how to obtain additional data, even if it requires a fee.

Moreover, Croissant’s support for versioning and checksums addresses the need for transparency in data changes over time. Croissant ensures that users can verify the integrity of data and understand its provenance by documenting dataset versions and providing file checksums.

Conclusion

Croissant epitomizes the core principles of the Open Source AI Definition by supporting AI datasets that are fully open, accessible, and adaptable. Through its standardized format, Croissant empowers everyone to use, study, modify, and share AI data freely, driving innovation, fostering collaboration, and building trust in AI systems. While details on how the Open Source AI Definition will work in practice are still emerging, embracing Croissant is a decisive step toward an open-source AI future where transparency and collective advancement are the norms.

Announcing the DataPerf Challenges 2023 Winners
December 13, 2023 | https://mlcommons.org/2023/12/dataperf-challenges-2023-winners/

Pushing the boundaries of data-centric AI

The MLCommons® DataPerf working group is excited to announce the winners of the inaugural DataPerf 2023 challenges, which centered around pushing the boundaries of data-centric AI and addressing the crucial ‘data bottleneck’ faced by the machine learning (ML) community.

In the challenges, participants were asked to approach ML from a novel perspective: by evolving training and test data alongside models to overcome data bottlenecks. The results were inspiring. Without further ado, here are the winners of the DataPerf 2023 challenges:

Quality of Training Data

The first two challenges targeted the quality of training data in the Vision and Speech domains. The winners successfully designed selection strategies to determine the best training sets from large candidate pools of weakly labeled images and automatically extracted spoken word clips.

Quality of Training Data: Vision Domain 

In the Vision domain, Paolo Climaco, a PhD candidate at the Institute for Numerical Simulation, University of Bonn, triumphed by presenting a robust and efficient method for training data selection. His approach is based on farthest point sampling: negative examples for each label are selected by iteratively searching the feature space for the point that maximizes the l2 distance to those already chosen. The best core set is then selected under nested cross-validation. Paolo’s best-performing model scored a mean F1 of 81.00 across the ‘cupcake’, ‘hawk’ and ‘sushi’ classes.
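The following is a minimal sketch of farthest point sampling for core-set selection, capturing the general idea rather than Paolo’s exact implementation: greedily add the point whose minimum l2 distance to the already-selected points is largest.

```python
# Minimal farthest point sampling (FPS) sketch for core-set selection.
import numpy as np

def farthest_point_sampling(features: np.ndarray, k: int, seed: int = 0) -> list:
    """Greedily pick k indices that are maximally spread out in feature space."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(features)))]        # arbitrary starting point
    min_dist = np.linalg.norm(features - features[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))                   # farthest from the current set
        selected.append(nxt)
        min_dist = np.minimum(min_dist,
                              np.linalg.norm(features - features[nxt], axis=1))
    return selected

# Example: idx = farthest_point_sampling(image_embeddings, k=100)
```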

Second place was awarded to Danilo Brajovic, Research Associate at Fraunhofer IPA, for his pseudo-label generation technique: he trained multiple neural networks and classical machine learning models on a subset of the data to classify the remaining points into the positive and negative image pools. The best-performing model is then selected for core-set proposal under multiple sampling experiments, achieving a best mean F1 score of 78.98.

Third place was awarded to Steve Mussmann, a Computer Science postdoc at the University of Washington, who used modified uncertainty sampling to score a mean F1 of 78.06 on the leaderboard. He trained a binary classifier on noisy labels from OpenImages and used this classifier to assign images to positive and negative pools, with the core set randomly sampled from both pools. 

We would also like to highlight Margaret Warren from the Institute for Human and Machine Cognition for her submission using innovative techniques of human-centered axiomatic data selection.

Quality of Training Data: Speech Domain

The Speech domain was won by the Cleanlab team, led by Chief Scientist Jonas Mueller, who showcased a data-driven approach that uses out-of-sample predicted probabilities to select training sets of spoken words in multiple languages. The Cleanlab solution achieved a macro F1 of 46.76 across English, Portuguese, and Indonesian, versus a cross-validation baseline score of 42.23. 
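The general idea behind selection via out-of-sample predicted probabilities can be sketched as follows: score each candidate example by the cross-validated probability the model assigns to its own label, then keep the most reliable examples. This illustrates the technique in spirit; it is not the Cleanlab team’s actual code, and the simple classifier and parameters are placeholders.

```python
# Illustrative sketch of training-set selection via out-of-sample predicted probabilities.
# Assumes y holds integer class labels 0..K-1 matching the classifier's class ordering.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def select_confident_examples(X, y, keep_fraction=0.5):
    probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                              cv=5, method="predict_proba")
    self_confidence = probs[np.arange(len(y)), y]   # probability of the given label
    keep = int(len(y) * keep_fraction)
    return np.argsort(-self_confidence)[:keep]      # indices of the most reliable examples

# Example: idx = select_confident_examples(audio_embeddings, labels, keep_fraction=0.3)
```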

Training Data Cleaning

The third challenge centered on data cleaning within the Vision domain. Participants were tasked with devising data cleaning strategies to select samples from noisy training sets that would benefit from re-labeling. Team Akridata, led by Sudhir Kumar Suman, took first place with an ingenious, adaptive cleaning algorithm. Impressively, by addressing only 11.58% of the samples, Akridata achieved an accuracy that was 95% of what a model trained on a purely clean dataset could achieve. The team’s approach began with a thorough analysis of the dataset to identify any significant imbalance, followed by comprehensive and meticulous identification of misclassified samples. They delved deep into understanding the inherent limitations of the model, which helped in making more efficient corrections. They also embraced a boosting-like iterative methodology, which progressively honed the model’s performance with each cycle. Notably, they adopted a weighted approach that combines Shapley values into the scoring system to better represent the importance of each data sample. Their effort echoed work from other participants (namely Shaopeng Wei from ETH Zurich & SWUFE China), but differed from the more generic calculation of importance values.

Training Dataset Acquisition

Data marketplaces are increasingly crucial in a world driven by data. The dataset acquisition challenge focused on a critical problem in making data marketplaces work: participants had to devise a data acquisition strategy that selects the best training set from multiple data sellers based on limited exchanged information. There was a close tie between the submissions of Hanrui Lyu and Yifan Sun, PhD students from Columbia University advised by Prof. Yongchan Kwon, and Feiyang Kang, a PhD student from Virginia Tech advised by Prof. Ruoxi Jia. Both approaches achieved high accuracy. Team Columbia’s approach, which reached an accuracy of 76.45% (7.92% better than the baseline), was based on a brute-force search to find a single seller to allocate the entire budget to. This approach leveraged multiple submissions to Dynabench, which was not forbidden by our rules, although it is not a practical approach. Feiyang’s approach, a customized distribution-matching method with dimension reduction, achieved an accuracy of 76.17%. Because of the close match between the submissions, both were declared co-winners.

DataPerf and the Future

We are immensely proud of the work done by all the participants in these challenges. These challenges represent a pivotal moment in data-centric AI research, highlighting the crucial role data plays in ML’s evolution and how we can continue to improve it.

Congratulations to all the winners and thank you to all participants. Your efforts and solutions have provided a significant leap forward in our understanding and tackling of data bottlenecks.

If you’re interested in getting involved in more challenges, the MLCommons DataPerf working group is currently accepting submissions to the Adversarial Nibbler Challenge, in which users interact with an image generation model to identify “safe” prompts that lead to “unsafe” image generations. Adversarial Nibbler has a dedicated challenge track for the ART of Safety Workshop (AACL), and challenge participants are encouraged to write up their findings as red-teaming papers for the workshop. In addition, Adversarial Nibbler is intended as a continuous challenge with multiple rounds, so follow the Nibbler Twitter account to stay tuned for future rounds of the challenge. 

For those inspired by the work done here, we will shortly announce the DataPerf 2024 challenges; details coming soon. Please visit the DataPerf website to learn more about the challenges, join the upcoming challenges, or contribute to the design of new data-centric tasks. Join the MLCommons DataPerf working group to stay abreast of the exciting developments in the data-centric ML community. We presented DataPerf results at NeurIPS 2023 in New Orleans in December 2023. 

Appendix: DataPerf Resources

  • Challenge Creators
  • Challenge Submissions
    • Vision Selection
      • GitHub repository of the winning submission. It includes code to run the algorithm, slides presenting the developed method and a challenge report explaining the proposed approach.
    • Speech Selection
    • Adversarial Nibbler
      • Releases planned by the end of the year for (i) the dataset and (ii) a paper describing the dataset and the challenge results
    • Vision Cleaning
      • Akridata won the competition, but did not release an open-source solution.
      • Leaderboard: https://dynabench.org/tasks/vision-debugging

Acknowledgements

The DataPerf benchmarks were created over the last year by engineers and scientists from: Coactive AI, Eidgenössische Technische Hochschule (ETH) Zurich, Google, Harvard University, Meta, MLCommons, and Stanford University. In addition, this would not have been possible without the support of DataPerf working group members from Carnegie Mellon University, Mod Op, Factored, Hugging Face, Institute for Human and Machine Cognition, Landing.ai, San Diego Supercomputing Center, Thomson Reuters Lab, and TU Eindhoven.

DataPerf: the Leaderboard for Data
March 30, 2023 | https://mlcommons.org/2023/03/dataperf-the-leaderboard-for-data/

Introducing a data-centric platform and community for competitions and building better ML.

As Machine Learning (ML) has grown rapidly and become more powerful, new challenges have arisen. The latest are limitations in the availability and utility of training and testing datasets for ML models. We propose DataPerf, the first platform and community to develop competitions and leaderboards for data and data-centric AI algorithms. Competitions and leaderboards have traditionally focused on improving ML models and have inspired decades of innovation, including accomplishments such as AlexNet, whose breakthrough image recognition performance on the ImageNet dataset catalyzed the latest revolution in machine learning. By applying these proven approaches in a data-centric fashion, we believe that we can break through these dataset limitations and encourage better machine learning in the decades to come.

Data is the new bottleneck for ML

Could the keys to diagnosing and treating cancer earlier, engineering safe self-driving cars, automating fact-checking, detecting fraud, and reducing litigation costs have something in common? Machine learning (ML) promises new and exciting solutions to these problems, provided that appropriate training and testing data is available.

However, training and testing datasets are often too small, too specialized, too limited or of poor quality, and too static, leading to what we call the data bottleneck. The lack of appropriate training and testing data hinders ML development and deployment in various domains, including healthcare, transportation, finance, cybersecurity, and more.

In this blogpost, we identify the current data bottlenecks we are facing, and explain how DataPerf provides incentive for research and engineering communities to overcome these bottlenecks.

Figure 1: Changing the ML paradigm: On the left, standard ML assumes static training and test datasets. On the right, DataPerf evolves training and test data alongside models, overcoming data bottlenecks.

The first bottleneck is data quality. Datasets for ML come from a wide range of sources. For example, most ML research relies on public training and test datasets built from easily available sources such as web scrapes, forums, Wikipedia, and crowdsourcing. However, these sources often suffer from quality, distribution, and bias issues, among others. For example, as Dollar Street has illustrated, most visual data comes from the wealthy portions of the developed world, leading to bias.

These data quality bottlenecks in turn lead to data quantity bottlenecks. As data grows, a large portion of it is low-quality, which delivers incrementally less benefit and consequently leads to non-linear, rapid growth in model size[1]. This ultimately drives up the size and computational cost of models, with recent innovations requiring staggering and costly infrastructure[2]. As we exhaust our public data sources, ML models may even saturate in terms of accuracy, stalling progress[3]. Last, there is a growing gap between performance on these public research datasets and the real world. Therefore, it is crucial for the AI community to actively benchmark and conduct research to enhance the quality of training and test data. This is the core principle of the data-centric AI community.

DataPerf, the leaderboard for data

DataPerf is inspired by the success of ML Leaderboards. Leaderboards such as SQuAD, SuperGLUE, Kaggle and others have played a significant role in advancing research in ML models. These leaderboards create a shared metric for measuring progress and act as a compass for shaping ML research.

DataPerf is the first platform and community to develop leaderboards for data and data-centric AI algorithms. We aim to have a similar impact on data-centric AI research as ML leaderboards had on ML model research. DataPerf leverages Dynabench, a platform that supports benchmarking of data, data-centric algorithms, and models.

DataPerf v0.5 features five challenges that focus on five common data-centric tasks across four different application domains:

  • Training data selection (Vision) – Design a data selection strategy that selects the best training set from a large candidate pool of weakly labeled training images.
  • Training data selection (Speech) – Design a data selection strategy that selects the best training set from a large candidate pool of automatically extracted clips of spoken words.
  • Training data cleaning (Vision) – Design a data cleaning strategy that chooses samples to relabel from a noisy training set. The current version of this challenge targets image classification.
  • Training dataset valuation (NLP) – Design a data acquisition strategy that selects the best training set from multiple data-sellers based on limited information exchanged between buyers and sellers.
  • Adversarial Nibbler (Multimodal text-to-image) – Design safe-looking prompts that lead to unsafe image generations.

These challenges aim to benchmark and improve the performance of data-centric algorithms and models. Each challenge on the DataPerf website includes design documents that outline the problem, model, quality target, rules and submission guidelines. The Dynabench platform enables us to offer a live leaderboard, an online evaluation framework, and track submissions over time.

How to get involved?

As a community of machine learning researchers, data scientists, and engineers, we are committed to improving machine learning by enhancing data quality. DataPerf is calling on innovators in academia and industry to measure and validate data-centric algorithms and techniques, and to enhance datasets through these benchmarks. The first round of DataPerf 2023 challenges opens for participation on March 30th, 2023, and submissions will close on May 26th, 2023. The winners of the challenges will be invited to present their work at ICML 2023 and to publish it in the forthcoming Data for ML Research (DMLR) journal.

To help, please share this blog post with your colleagues. Visit the DataPerf website to participate in the DataPerf challenges, or to design a new data-centric challenge. Join the DataPerf working group to stay up-to-date on the ongoing changes in the data-centric ML community.

We would like to thank the Dynabench engineering team for providing the infrastructure and support for DataPerf. The DataPerf benchmarks were created over the last year by engineers and scientists from: Coactive.ai, ETH Zurich, Google, Harvard University, Meta, MLCommons, and Stanford University. In addition, this effort would not have been possible without the support of DataPerf working group members from Carnegie Mellon University, Digital Prism Advisors, Hugging Face, Institute for Human and Machine Cognition, Landing.ai, San Diego Supercomputing Center, Thomson Reuters Lab, and TU Eindhoven.

References


  1. Hestness, Joel, et al. “Deep learning scaling is predictable, empirically.” arXiv preprint arXiv:1712.00409 (2017). 
  2. Sevilla, Jaime, et al. “Compute trends across three eras of machine learning.” 2022 International Joint Conference on Neural Networks (IJCNN). IEEE, 2022. 
  3. Villalobos, Pablo, et al. “Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning.” arXiv preprint arXiv:2211.04325 (2022). 

Dollar Street – Bringing Diversity to Computer Vision
February 7, 2023 | https://mlcommons.org/2023/02/dollar-street-bringing-diversity-to-computer-vision/

Harnessing the power of diverse data to reduce bias and build better machine learning for everyone

We are extremely excited to announce the release of the Dollar Street Dataset, an open access, diverse computer vision (CV) dataset. It is designed to reduce geographic and socioeconomic bias and is a collaboration between MLCommons®, Coactive AI, Gapminder (a Swedish non-profit), and Harvard University. The Dollar Street Dataset includes 38K high-quality images of household objects from across the world, labeled with object tags and demographic data. The images were collected from dozens of different countries and all metadata was manually verified. The dataset is available under CC-BY 4.0 for academic and commercial usage. We improved classification accuracy for items from lower-income households by 50%, highlighting the impact of data in reducing bias and ultimately empowering the community to build better machine learning for everyone.

Computer vision is an incredibly powerful discipline within AI for enriching society and the human experience, and in the last decade machine learning has become the linchpin of CV. Advances in machine learning make life subtly better – for example, helping us take vivid photos that capture beautiful and unique memories. Other applications are profoundly shaping society and saving lives by enabling cars to detect pedestrians or helping detect, diagnose, and treat cancers better.

However, most existing commercial and open access datasets for training ML models disproportionately skew towards the developed world and more specifically high-income populations. This systematic bias in datasets means that many ML models perform unevenly across the broader population. In particular, many CV tools perform poorly for women and people of color. Just as one example, studies have found that existing image classification tools perform >30X worse for darker-skinned women compared to lighter-skinned men. This variation in performance can have tremendous real-world consequences, given the rapid adoption of CV throughout modern society.

The Dollar Street Dataset is an open access, diverse CV dataset designed to combat bias and enable machine learning to work better for everyone. It comprises 38K high-quality images of 289 everyday objects from households across the world, labeled with object tags and demographic data such as region, country, and household monthly income. The images were collected at scale across 63 different countries, including many communities without internet access, and all metadata was manually verified.

We also demonstrated that the Dollar Street Dataset can profoundly improve machine learning models by dramatically reducing bias and improving robustness, especially for lower socioeconomic communities. We used five standard CV models trained on existing datasets to classify images of household objects such as stoves, toilets, and toothbrushes. These models achieved an acceptable 57% accuracy for high-income households, but dropped to 18% for lower-income households, illustrating the bias of models trained on existing datasets. By fine-tuning with the Dollar Street Dataset, we achieved 70% or better accuracy across all household income levels, an increase of 50% or more.
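As an illustration of the kind of fine-tuning involved, the sketch below adapts a pretrained ImageNet classifier to Dollar Street’s household-object classes; it is a generic recipe under stated assumptions, not the authors’ exact training setup.

```python
# Generic fine-tuning sketch (not the paper's exact setup): replace the head of a
# pretrained ImageNet model with a new classifier for Dollar Street's 289 classes.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 289  # everyday-object categories in the Dollar Street Dataset

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)   # new classification head

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def finetune_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on a batch of Dollar Street images."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```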

Contributors to the Dollar Street Datasets include researchers from Coactive AI, Harvard University, and MLCommons. It can be downloaded at mlcommons.org/dollar-street and for more information, please read our paper accepted to the 2022 Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks.

The People’s Speech Dataset just got better – dynamic data in action
January 26, 2023 | https://mlcommons.org/2023/01/the-peoples-speech-dataset-just-got-better-dynamic-data-in-action/

Updating datasets improves speech for everyone

Our big, freely available speech dataset just got its first upgrade. Find out what’s new, and why dynamic datasets are the future of machine learning.

What is The People’s Speech Dataset?

The People’s Speech Dataset was designed to improve the accuracy of Automatic Speech Recognition (ASR) Machine Learning (ML) models for recognizing English speakers. It includes over 30,000 hours of transcribed English-language speech from a diverse set of speakers with a wide range of accents.

We published the People’s Speech in November 2021 under a CC-BY-SA and CC-BY 4.0 license, meaning it’s free for both academic and commercial use.

Why have we already updated it to v1.1?

We believe that dynamic datasets are critical to the future advancement of machine learning. MLCommons® is committed to actively maintaining and updating its datasets, so that developers can always train their models on the most relevant and robust data.

With the People’s Speech Dataset, this is particularly important because of the speed at which modern languages evolve. Fuelled by globalization and the proliferation of the internet, the real-world manifestations of language are highly dynamic. It’s therefore vital that the datasets built to represent everyday speech are dynamic too.

When datasets are not proactively maintained, they become less relevant with every day that passes because they are increasingly reflecting historical rather than present-day trends. For developers, using static datasets also brings a major risk of reinforcing implicit algorithmic biases.

By continuously improving the quality of our datasets, we make development easier, improve the real-world performance of ML models trained on them, and make machine learning better for everyone.

What’s new for v1.1?

In this first update, our Datasets Working Group has improved the speed, quality, and usability of the dataset, particularly for those using smaller subsets of data.

A major focus for us was making the dataset more user-friendly. We’ve created a tutorial that shows users how to train an ASR model using NVIDIA NeMo, with Google Cloud Platform included.

As another example, we’ve done behind-the-scenes work to improve download speeds, fix bugs, and move the dataset hosting to Hugging Face. We also raised the bar for our “clean” data samples, which has increased the proportion of “dirty” data and improved the overall quality of the remaining “clean” data.
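With the hosting now on Hugging Face, a typical way to sample the data looks like the sketch below. The repository id, configuration name, and field names are written as assumptions; check the dataset card for the exact identifiers.

```python
# Hedged sketch: stream a few examples of the People's Speech from Hugging Face.
# Repository id, config name, and field names are assumptions -- see the dataset card.
from datasets import load_dataset

ds = load_dataset("MLCommons/peoples_speech", "clean", split="train", streaming=True)

for example in ds.take(3):
    # Each example is expected to carry audio plus its transcript.
    print(example.get("text"))
```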

Summary

We hope these upgrades to our People’s Speech dataset inspire ML practitioners to pay more attention to the quality of their training data, and the importance of dynamic datasets.

Discover more and start using the dataset here


About MLCommons

MLCommons is an open engineering consortium with a mission to benefit society by accelerating innovation in machine learning. The foundation for MLCommons began with the MLPerf benchmark in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 50+ founding partners (global technology providers, academics, and researchers), MLCommons is focused on collaborative engineering work that builds tools for the entire machine learning industry through benchmarks and metrics, public datasets, and best practices.

For additional information on MLCommons and details on becoming a Member or Affiliate of the organization, please visit MLCommons or contact participation@mlcommons.org.

To get involved with The People’s Speech, please join the Datasets Working Group, and follow @MLCommons on Twitter. Help us keep growing this community effort, and don’t hesitate to get in touch if you would like to be involved.

Press Contact:
David Kanter
press@mlcommons.org

Harnessing Human-AI Collaboration
July 11, 2022 | https://mlcommons.org/2022/07/harnessing-human-ai-collaboration/

Dynamic Adversarial Data Collection augments large-scale datasets by adding diverse and high-quality data

AI models need data to learn about the world. In the past decade, scaling up model parameters and data together has driven tremendous progress in AI. This is the motivation behind building large scale datasets at MLCommons® such as the People’s Speech and the Multilingual Spoken Words Corpus. However, datasets are not just about quantity. Research shows that data diversity and quality are also critically important for good performance. We need “bigger” data, but we also need “better” data.

Dynamic Adversarial Data Collection (DADC) is an approach to collecting data that relies on collaboration between humans and AI models, e.g. using the Dynabench platform, to identify weaknesses and improve accuracy. It is Adversarial because humans collaborate by finding examples of input data that ‘trick’ the model into misclassification errors. It is Dynamic because the model-fooling process is iterative, building a progressively stronger model as more ‘tricky’ data is discovered and incorporated into training.

DADC has already demonstrated its effectiveness on a number of tasks such as question answering, sentiment analysis, natural language inference, and hate speech detection. While it leads to data that is more creative and diverse, the DADC-collected data often represents unusual corner cases and should be combined with real-world examples to be truly effective.

In the next decade, we envision a shift in focus from quantity alone towards a more balanced approach that incorporates quality, creativity, diversity and collaboration. DADC operates at the next frontier of model capabilities, providing a data- and human-centric tool to find and begin to bridge the gaps between machine and human intelligence.

In a large-scale collaboration to address these challenges, researchers from University College London, the University of Oxford, The Alan Turing Institute, Stanford University, the University of North Carolina at Chapel Hill, Carnegie Mellon University, the University of Sheffield, Simon Fraser University, Meta/FAIR, Hugging Face, and MLCommons are rethinking model benchmarking through Dynabench: a platform to facilitate DADC research. Dynabench brings models and humans together in an easy, collaborative, and community-driven way to push the boundaries of current data collection methods.

To encourage even more exciting collaboration, we will be holding the First Workshop on Dynamic Adversarial Data Collection, co-located with NAACL ‘22 in Seattle, on the 14th of July. We have an exciting day planned with amazing keynote speakers, a diverse and engaging panel, and presentations from our Shared Task participants. We encourage you to get involved! To stay up to date with all DADC-related developments, please follow @DADCWorkshop, @DynabenchAI, and @MLCommons on Twitter. Help us keep growing this community effort, and don’t hesitate to get in touch if you would like to be involved.

MLCommons Unveils Open Datasets and Tools to Drive Democratization of Machine Learning
December 14, 2021 | https://mlcommons.org/2021/12/mlcommons-unveils-open-datasets-and-tools-to-drive-democratization-of-machine-learning/

Engineering consortium advances data-centric AI with pioneering datasets and tools

The MLCommons® Association, an open engineering consortium dedicated to improving machine learning for everyone, today announced the general availability of the People’s Speech Dataset and the Multilingual Spoken Words Corpus (MSWC). These trail-blazing and permissively licensed datasets advance innovation in machine learning research and commercial applications. Also today, the MLCommons Association is issuing a call for participation in the new DataPerf benchmark suite, which measures and encourages innovation in data-centric AI.

The People’s Speech Dataset
The People’s Speech Dataset is among the world’s largest diverse English speech recognition datasets licensed for academic and commercial usage. The 30,000-hour supervised conversational dataset is an order of magnitude larger than what was available just a few years ago. The dataset, released under a Creative Commons license, democratizes access to speech technology such as voice assistants and transcription, and unlocks innovation in the machine learning community. Contributors to the dataset include researchers from Baidu, Factored, Harvard University, Intel, Landing AI, and NVIDIA. It can be downloaded at mlcommons.org/speech.

Multilingual Spoken Words Corpus
Also available today is the Multilingual Spoken Words Corpus (MSWC), a rich audio speech dataset with more than 340,000 keywords in 50 languages and upwards of 23.4 million examples. Previous datasets relied on manual efforts to collect and validate thousands of utterances for each keyword and were commonly restricted to a single language. A diverse multilingual dataset that spans languages spoken by over 5 billion people, MSWC advances the research and development of applications such as voice interfaces for a broad global audience. Contributors to the MSWC include researchers from Coqui, Factored, Google, Harvard University, Intel, Landing AI, NVIDIA, and the University of Michigan. It can be downloaded at mlcommons.org/words.

DataPerf
The new DataPerf benchmark suite supports data-centric AI innovation by measuring the quality of datasets for common ML tasks and the impact of enhancing datasets. Training and test datasets are a key part of creating an ML solution — the solution can only be as good as the data — but much less effort is spent on understanding and improving datasets than on mastering and improving models. DataPerf fosters and measures progress in this vital area. The MLCommons Association will support a series of challenges with leaderboards in 2022 to encourage participation in DataPerf. Contributors to the suite include researchers from Alibaba, Coactive.AI, ETH Zurich, Google, Harvard University, Landing.AI, Meta, Stanford University, and TU Eindhoven, drawing on the teams responsible for Cats4ML, the Data-Centric AI Competition, DCBench, Dynabench, and the MLPerf™ benchmarks. The MLCommons Association invites other participants to join the DataPerf effort at dataperf.ai.

Historically, most AI research has focused on improving model architectures and making them available to the community; in contrast, attention to engineering and maintaining datasets has lagged and is often manual and ad-hoc. The MLCommons Association is a firm proponent of Data-Centric AI (DCAI), the discipline of systematically engineering the data for AI systems by developing efficient software tools and engineering practices to make dataset creation and curation easier. Our open datasets and tools like DataPerf concretely support the DCAI movement and drive machine learning innovation.

“The machine learning model architecture for many applications is basically a solved problem. In many cases, focusing on engineering the data is more important for unlocking successful AI applications. Data is food for AI, and our systems need not just massive amounts of calories, but also high quality nutrition. We need not just big data, but good data,” said Andrew Ng, founder and CEO of Landing AI, founding lead of Google Brain, co-founder and chairman of Coursera, and adjunct professor at Stanford University. “Thanks to the shared efforts by the community, including the work initiated by the MLCommons Association and its members, the movement demonstrates the potential for Data-Centric AI, and how we can collectively implement a greater AI adoption.”

“Speech technology can empower billions of people across the planet, but there’s a real need for large, open, and diverse datasets to catalyze innovation,” said David Kanter, the MLCommons Association co-founder and executive director. “The People’s Speech is a large scale dataset in English, while MSWC offers a tremendous breadth of languages. I’m excited for these datasets to improve everyday experiences like voice-enabled consumer devices and speech recognition.”

About the MLCommons Association
The foundation for the MLCommons Association began with the MLPerf benchmark in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 50+ founding member partners (global technology providers, academics, and researchers), the MLCommons Association is focused on collaborative engineering work that builds tools for the entire machine learning industry through benchmarks and metrics, public datasets, and best practices.
For additional information on the MLCommons Association and details on becoming a member of the organization, please visit https://mlcommons.org/ or contact membership@mlcommons.org.

Press Contact:
press@mlcommons.org

The post MLCommons Unveils Open Datasets and Tools to Drive Democratization of Machine Learning appeared first on MLCommons.

]]>
Introducing the People’s Speech dataset – 30,000+ hours of diverse speech data to drive ML innovation https://mlcommons.org/2021/12/introducing-the-peoples-speech-dataset-30000-hours-of-diverse-speech-data-to-drive-ml-innovation/ Tue, 14 Dec 2021 08:35:00 +0000 http://local.mlcommons/2021/12/introducing-the-peoples-speech-dataset-30000-hours-of-diverse-speech-data-to-drive-ml-innovation/ Bigger, better, and for everyone

The post Introducing the People’s Speech dataset – 30,000+ hours of diverse speech data to drive ML innovation appeared first on MLCommons.

We are thrilled to announce the public release of the People’s Speech, a 30,000-hour and growing supervised conversational English speech recognition dataset. It is the largest diverse English speech recognition corpus available today that is licensed for both academic and commercial use under CC-BY-SA and CC-BY 4.0. Crucially, the People’s Speech will drive innovation by democratizing machine learning – it is large enough to train fully end-to-end models and to improve applications such as automatic speech recognition, transcription, diarization, and noise reduction. In future releases we will expand its reach and impact by adding more complex acoustic conditions and additional languages.
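For readers who want to explore the data, here is a minimal sketch of streaming a few examples with the Hugging Face `datasets` library. The repository id `MLCommons/peoples_speech`, the `clean` configuration, and the field names are assumptions rather than guarantees, so check the dataset card and license terms before relying on them.

```python
# Sketch: stream a few examples of the People's Speech from the Hugging Face Hub.
# Repository id, config name, and field names are assumptions - verify against the dataset card.
# Requires a recent version of the `datasets` library (streaming + .take()).
from datasets import load_dataset

ds = load_dataset("MLCommons/peoples_speech", "clean", split="train", streaming=True)

for example in ds.take(3):
    audio = example["audio"]  # assumed schema: dict with "array" and "sampling_rate"
    seconds = len(audio["array"]) / audio["sampling_rate"]
    print(example["text"][:80], f"... ({seconds:.1f}s)")
```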

Language and voice are natural and universal for nearly everyone on the planet and foundational to human interaction. Voice interfaces have long captured our imagination in movies and science fiction, and that vision of the future is increasingly becoming real – voice assistants are common in everything from smartphones to cars, and are expected to reach the majority of the world’s population by the middle of the decade. More subtly, today’s phone calls and audio conferences routinely benefit from transcription and audio enhancement.

But the machine learning community is only just scratching the surface of the possibilities unlocked by speech technologies. One of the biggest obstacles to progress is the limited availability of diverse, large-scale datasets with permissive licensing; experts estimate that training an end-to-end speech recognition model requires 10,000-15,000 hours of labeled speech. Most existing public speech datasets are held back by some combination of restrictive licensing that prohibits commercial use, relatively small size, and simpler types of speech such as read audiobooks.

The People’s Speech is built from audio data on Archive.org that is either in the public domain or available under CC-BY or CC-BY-SA licenses. Rather than having humans transcribe everything, which is expensive and time-consuming, we collected only audio that already had a transcript. We then used an innovative software pipeline to automatically match the audio with the transcript. As part of our commitment to updating and maintaining the dataset, we are also releasing our software as open source to empower the community to build on our contributions.
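The matching idea can be illustrated with a hedged sketch: transcribe a candidate audio chunk with any ASR model, compare the hypothesis to the pre-existing transcript using word-level edit distance, and keep only segments that agree closely. This is not the released pipeline; `transcribe` is a placeholder for whatever ASR model is available, and the 0.5 word-error-rate threshold is arbitrary.

```python
# Illustrative filter for pairing audio chunks with pre-existing transcripts.
# `transcribe(chunk)` stands in for any ASR call (e.g. a wav2vec2 or Whisper model);
# it is a placeholder, not part of the released MLCommons pipeline.

def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over words (insertions, deletions, substitutions)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def keep_segment(reference_text: str, chunk, transcribe, max_wer: float = 0.5) -> bool:
    """Accept a chunk only if the ASR output roughly matches the claimed transcript."""
    ref = reference_text.lower().split()
    hyp = transcribe(chunk).lower().split()
    wer = word_edit_distance(ref, hyp) / max(len(ref), 1)
    return wer <= max_wer
```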

Why does this matter? By making a dataset of this scale openly available to more people, machine learning research and commercial applications can advance significantly. In short, the People’s Speech provides a solid jumping-off point for other companies and individuals to innovate and experiment.

Contributors to the dataset include researchers from Baidu, Factored, Harvard University, Intel, Landing AI, and NVIDIA. It can be downloaded at mlcommons.org/speech. For more information, please read our paper accepted to the 2021 Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks.

The post Introducing the People’s Speech dataset – 30,000+ hours of diverse speech data to drive ML innovation appeared first on MLCommons.

Multilingual Spoken Words Corpus – 50 Languages and Over 23 Million Audio Keyword Examples https://mlcommons.org/2021/12/multilingual-spoken-words-corpus-50-languages-and-over-23-million-audio-keyword-examples/ Tue, 14 Dec 2021 08:33:00 +0000 http://local.mlcommons/2021/12/multilingual-spoken-words-corpus-50-languages-and-over-23-million-audio-keyword-examples/ Giving the gift of voice to 5 billion people in 50 languages

The post Multilingual Spoken Words Corpus – 50 Languages and Over 23 Million Audio Keyword Examples appeared first on MLCommons.

Today we are extremely excited to announce the initial release of the Multilingual Spoken Words Corpus (MSWC), a large and growing audio dataset of spoken words in 50 different languages. These languages are collectively spoken by over 5 billion people, and for most of them this is the first publicly available, free-to-use dataset for training voice interfaces. The corpus is licensed under CC-BY 4.0 to inspire academic research and commercial work in keyword spotting, spoken term search, and other applications that can benefit people across the planet. Our ultimate goal is to make keyword-spotting voice interfaces available for any keyword in any language.

Voice-based interaction is already democratizing access to technology. For example, keyword spotting is a common application in many smart devices and assistants (such as Apple’s Siri, Amazon’s Alexa, or the Google Assistant). Keyword spotting systems use very low-power hardware to continuously listen for a key phrase in order to trigger an action such as turning on lights or waking up a more sophisticated interface. For some people this kind of interaction is a modern convenience, but for others, like the visually impaired, it can be a life-changing capability.
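As a rough illustration of how such a system operates, the sketch below scores overlapping one-second windows of an audio stream and reports when a window exceeds a detection threshold. The scoring model (`score_window`) is a placeholder, and the window, hop, and threshold values are arbitrary assumptions rather than anything prescribed by MSWC.

```python
# Minimal sliding-window keyword spotter, for illustration only.
# `score_window` stands in for whatever small model runs on-device
# (e.g. a tiny CNN over log-mel features); the 0.8 threshold is arbitrary.
import numpy as np

SAMPLE_RATE = 16_000
WINDOW = SAMPLE_RATE          # 1-second windows, matching the MSWC clip length
HOP = SAMPLE_RATE // 4        # re-score every 250 ms

def detect_keyword(stream: np.ndarray, score_window, threshold: float = 0.8):
    """Yield the start time (in seconds) of every window scored above the threshold."""
    for start in range(0, len(stream) - WINDOW + 1, HOP):
        window = stream[start : start + WINDOW]
        if score_window(window) >= threshold:
            yield start / SAMPLE_RATE
```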

Robust voice interaction requires training machine learning models on large datasets. Traditionally, these keyword datasets require enormous effort to collect and validate many thousands of utterances for each keyword of interest from a diverse set of speakers and environmental contexts. Unfortunately, most existing public keyword datasets are monolingual and contain only a handful of keywords. Many commonly spoken languages lack any public datasets, which makes it extremely difficult to provide basic voice capabilities to speakers of these languages.

To address these challenges, MLCommons® has developed, and will maintain and update, MSWC, a large dataset of spoken words in 50 languages. In total, the dataset contains over 340,000 words and 23 million one-second audio samples, adding up to over 6,000 hours of speech. We constructed the dataset by applying open source tools to extract individual words from crowdsourced sentences donated to the Common Voice project; the resulting word clips can then be used to train keyword spotting models for voice assistants across a diverse array of languages.
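The extraction step can be pictured as follows: once an aligner has produced word-level timestamps for a Common Voice sentence, each word is cut out as a fixed one-second clip. The snippet below shows only that clipping step under assumed inputs; it is not the released tooling, and the alignment itself is assumed to happen upstream.

```python
# Sketch of the clipping step: given word-level timings from a forced aligner,
# cut a fixed one-second clip centered on each word. Producing `words` as
# (word, start_sec, end_sec) tuples is assumed to be done upstream.
import numpy as np

SAMPLE_RATE = 16_000
CLIP_SAMPLES = SAMPLE_RATE  # one second per clip, as in MSWC

def extract_word_clips(waveform: np.ndarray, words):
    clips = {}
    for word, start, end in words:
        center = int((start + end) / 2 * SAMPLE_RATE)
        lo = max(0, center - CLIP_SAMPLES // 2)
        clip = waveform[lo : lo + CLIP_SAMPLES]
        # Zero-pad short clips (e.g. words near the end of the recording).
        clip = np.pad(clip, (0, CLIP_SAMPLES - len(clip)))
        clips.setdefault(word, []).append(clip)
    return clips
```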

Of the languages in our dataset, 12 are “high-resource” with over 100 hours of data, 12 are “medium-resource” with 10 to 100 hours, and 26 are “low-resource” with under 10 hours. To our knowledge, MSWC is the only open-source spoken word dataset for 46 of these languages. Each keyword has predefined training, validation, and testing splits, and we are also releasing the open-source tools used to build the dataset and categorize the keywords.

Contributors to the MSWC include researchers from Coqui, Factored, Google, Harvard University, Intel, Landing AI, NVIDIA, and the University of Michigan. It can be downloaded at mlcommons.org/words and for more information, please read our paper accepted to the 2021 Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks.

The post Multilingual Spoken Words Corpus – 50 Languages and Over 23 Million Audio Keyword Examples appeared first on MLCommons.
