Data-centric ML Archives - MLCommons
https://mlcommons.org/category/data-centric-ml/

Unveiling the PRISM Alignment Project
https://mlcommons.org/2024/05/prism/ (May 13, 2024)
Prioritizing the Data-Centric Human Factors for Aligning Large Language Models

Introducing PRISM

Human feedback learning for aligning large language models (LLMs) is a field that is rapidly growing in importance and influence for AI design and development. Typically, human feedback is collected as binary preference votes or numeric scores over two or more model outputs to a user prompt. To date, there have been limited data-centric efforts to source a higher quality and quantity of human feedback from a wider spectrum of people.

Recently, researchers at the University of Oxford, in collaboration with MLCommons and others, released the PRISM alignment project, a new resource for the AI community to understand how humans from around the world perceive and interact with LLMs.

PRISM aims to diversify the voices currently contributing to the human feedback used to train LLMs. To achieve this aim, researchers collected preferences from 1,500 humans across 38 countries as they interacted with 21 different LLMs in real-time conversations. These conversations were hosted on Dynabench – an MLCommons® platform designed for dynamic human-and-model-in-the-loop data collection. In total, the researchers observed 8,011 conversations, amassing over 68,000 human-AI interactions. Each interaction was assigned a numeric preference intensity score and associated with additional fine-grained feedback, such as factuality, fluency, and creativity ratings, as well as open-ended text feedback.
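To make the shape of this data concrete, here is a minimal sketch of the kind of per-interaction record described above. The class and field names are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative sketch only: field names are assumptions, not PRISM's actual schema.
@dataclass
class PrismInteraction:
    participant_id: str      # pseudo-anonymised link back to the participant survey
    conversation_id: str
    model_name: str          # one of the 21 LLMs in the study
    user_prompt: str
    model_response: str
    preference_score: int    # numeric preference intensity for this response
    fine_grained: dict = field(default_factory=dict)  # e.g. {"factuality": 4, "fluency": 5}
    open_feedback: Optional[str] = None               # free-text comments

example = PrismInteraction(
    participant_id="p_0042",
    conversation_id="c_0817",
    model_name="model_a",
    user_prompt="Is it ever acceptable to lie?",
    model_response="...",
    preference_score=73,
    fine_grained={"factuality": 4, "fluency": 5, "creativity": 2},
    open_feedback="Balanced, but a little evasive.",
)
print(example.preference_score)
```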

Hannah Rose Kirk, PRISM project lead, said, "Human feedback learning can no longer operate under simplifying assumptions of a 'generic human'. Alignment is complex precisely because it is influenced by messy human factors, interpersonal subjectivities and cross-cultural disagreements. With PRISM, we wanted to lean into these data-centric human factors and ultimately encourage more inclusive and adaptive technologies."

Max Bartolo, co-chair of the MLCommons Data-Centric ML working group, says “Data is central to training and evaluating machine learning systems. Human preference is intricate and multifaceted, but current methods simplify this complexity. PRISM is a pioneering large-scale initiative that offers profound insights and poses critical questions, shedding light on the complex nature of human preferences and the perspectives they represent.”

Features of the PRISM Alignment Dataset

PRISM has many rich features that enable future data-centric work on human preferences for LLMs. For example, every rating of an LLM output on the Dynabench platform is linked back to a detailed participant survey via a pseudo-anonymised identifier. This data could be used to tailor the behaviors of LLMs to personalized preferences.
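As a rough illustration of that linkage, the sketch below joins per-rating records to survey responses on a shared identifier using pandas. The column names are hypothetical placeholders, not the released schema.

```python
import pandas as pd

# Hypothetical columns standing in for the released schema.
ratings = pd.DataFrame({
    "participant_id": ["p_0042", "p_0042", "p_0107"],
    "model_name": ["model_a", "model_b", "model_a"],
    "preference_score": [73, 41, 88],
})
survey = pd.DataFrame({
    "participant_id": ["p_0042", "p_0107"],
    "age_group": ["25-34", "55-64"],
    "country": ["GB", "US"],
})

# Link each rating to the rater's survey profile via the pseudo-anonymised ID.
linked = ratings.merge(survey, on="participant_id", how="left")

# Per-participant averages like this could seed personalised preference models.
print(linked.groupby("participant_id")["preference_score"].mean())
```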

Attaching ratings to sociodemographics is valuable when considering how to aggregate preferences across groups of individuals, for example across members of a religion, nation, or continent. To support these aggregation decisions, PRISM includes two census-representative samples for the UK and the US. For the additional 30+ country-specific samples, the researchers sourced significant demographic diversity, for example by age, religious affiliation and ethnicity, especially compared to previous publicly-available human feedback datasets.
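Continuing the sketch above, group-level aggregation over the linked frame might look like the following; again, the column names are hypothetical, and a real analysis of the census-representative samples would also apply any survey weights the release provides.

```python
# Continuing the hypothetical `linked` frame from the previous sketch:
# aggregate preferences by demographic group rather than by individual.
by_country = (
    linked.groupby(["country", "model_name"])["preference_score"]
          .agg(["mean", "count"])
          .reset_index()
)
print(by_country)
```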

As Dr Scott A Hale, Associate Professor at the University of Oxford and co-principal investigator on the PRISM project, explains, "Having more people at the table, having more perspectives represented, ultimately leads to better technology, and it leads to technology that will raise everyone up."

PRISM has another interesting feature compared to previous human feedback datasets: it is designed to map stated preferences to contextual preferences. Stated preferences are collected in the survey, where participants are asked how important it is to them that LLMs possess general behaviors like factuality, honesty or creativity. Contextual preferences are collected for every conversation with an LLM-in-the-loop, where participants are asked how well models performed in the specific interaction, and how important these performance factors were in determining why one model was scored higher than another. Comparing these two types of preferences allows researchers to understand how different conversations cause departures from global priorities or principles of AI behaviors.
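One simple way to operationalise that comparison is to correlate what participants said up front with how they weighted the same factor in actual conversations. The sketch below does this for a single factor, with hypothetical column names.

```python
import pandas as pd

# Hypothetical columns: stated importance from the survey vs. the contextual
# weight given to the same factor in individual conversations.
stated = pd.DataFrame({
    "participant_id": ["p_0042", "p_0107", "p_0311"],
    "stated_importance_factuality": [5, 3, 4],
})
contextual = pd.DataFrame({
    "participant_id": ["p_0042", "p_0042", "p_0107", "p_0311"],
    "factuality_weight": [4, 5, 2, 5],
})

# Average each participant's contextual weights, then compare with their
# stated importance; weak correlation suggests context-driven departures.
per_person = contextual.groupby("participant_id")["factuality_weight"].mean().reset_index()
merged = stated.merge(per_person, on="participant_id")
print(merged[["stated_importance_factuality", "factuality_weight"]].corr(method="spearman"))
```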

How MLCommons Solved Engineering Challenges to Facilitate Live Interactions with Multiple LLMs

The success of PRISM relied heavily on building an interface that let participants dynamically interact with LLMs in naturalistic, low-latency settings, with a clean UX design. PRISM is the first MLCommons project to build capabilities for LLM data collection and research. The updates to the MLCommons Dynabench platform included LLM plugins in the backend, new rating scales, and a sleek chat interface. These capabilities are now being used for further research projects, such as HELP-MED, another University of Oxford-led project where users interact with LLMs to assist in the discovery of medical knowledge.
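The sketch below gives a flavour of the kind of multi-provider plugin abstraction such a backend might use to query several models concurrently; it is not Dynabench's actual code or API, and all names are hypothetical.

```python
import asyncio
import random
from abc import ABC, abstractmethod

class LLMProvider(ABC):
    """Generic plugin interface; hypothetical, not Dynabench's actual API."""
    name: str

    @abstractmethod
    async def generate(self, prompt: str) -> str:
        ...

class FakeProvider(LLMProvider):
    """Stand-in for a real API-backed model client."""
    def __init__(self, name: str):
        self.name = name

    async def generate(self, prompt: str) -> str:
        await asyncio.sleep(random.uniform(0.1, 0.5))  # simulate network latency
        return f"[{self.name}] response to: {prompt}"

async def fan_out(prompt: str, providers: list) -> dict:
    """Query several models concurrently so users can rate them side by side."""
    results = await asyncio.gather(*(p.generate(prompt) for p in providers))
    return {p.name: r for p, r in zip(providers, results)}

if __name__ == "__main__":
    providers = [FakeProvider("model_a"), FakeProvider("model_b")]
    print(asyncio.run(fan_out("What is the capital of France?", providers)))
```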

Juan Ciro, MLCommons Engineer and PRISM project author, says, "Dealing with so many users concurrently meant rebuilding some of the Dynabench platform's base elements and, although very challenging, seeing the platform evolve was truly gratifying." Rafael Mosquera, MLCommons Engineer and author, says, "From an engineering standpoint, PRISM allowed us to understand the complexities behind serving LLMs. From a research perspective, we also got a sense of the difficulties that arise when evaluating the performance of these models on standard benchmarks. It was very surprising to see PRISM participants on average rating models like GPT-4 less favorably than smaller models, like Zephyr-7B. We hope the dataset allows other researchers to rethink what alignment means, and perhaps think about different demographics when training and evaluating LLMs."

Beyond the Dataset: Should all domains be treated equally when crowdsourcing human preference data?

High-quality open-access human feedback is a scarce resource in the AI community, so it may not be wise to treat all domains as equally in need. The PRISM project authors argue that diverse perspectives on LLM behaviors are not needed equally everywhere: the prompt "what is the capital of France?" can be treated differently from "is abortion morally right or wrong?" PRISM therefore targets its efforts at areas of high interpersonal and cross-cultural disagreement, where the majority of conversations center on values-guided topics (e.g., religion, politics, family) or controversial issues (e.g., gun rights, abortion, global conflicts or euthanasia). However, there are challenges in crowdsourcing perspectives on controversial issues, especially in choosing which people to sample, how to weigh their opinions, how to balance competing or conflicting interests, and how to ensure safe responses to harmful prompts.

As co-principal investigator on the project and MLCommons AI Safety Working Group lead, Bertie Vidgen says, "PRISM shows that you shouldn't ask 'Is your model aligned?' but instead 'Who has it been aligned to?' and 'How has it been aligned?' This exciting work has huge implications for how we think about the values, safety, and design of models."

Use the Dataset

In line with the MLCommons mission to develop rich datasets for the community of researchers and developers, PRISM offers many avenues for future research and is fully open access.

The dataset can be downloaded at: https://doi.org/10.57967/hf/2113.
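The release is hosted on Hugging Face, so a minimal loading sketch with the `datasets` library might look like the following. The repository id and configuration names are assumptions about how the release behind the DOI is organised; check the dataset card for the exact identifiers.

```python
from datasets import load_dataset

# Repository id and config names are assumptions; consult the dataset card.
survey = load_dataset("HannahRoseKirk/prism-alignment", "survey", split="train")
conversations = load_dataset("HannahRoseKirk/prism-alignment", "conversations", split="train")

print(survey.column_names)                 # participant-level survey fields
print(len(conversations), "conversations") # conversation-level records
```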

A preprint describing the dataset and findings is also available on arXiv: "The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models" (arXiv:2404.16019).

PRISM was supported by a variety of funders, including Meta AI's Dynabench Grants for optimizing human-and-model-in-the-loop feedback and Microsoft's Advancing Foundation Models Research (AFMR) grant program. It also received support in the form of credits or research access from OpenAI, Hugging Face, Anthropic, Google, Aleph Alpha and Cohere.

A full acknowledgement and funding disclosure can be found in the paper.

Announcing the Formation of the MLCommons Data-centric Machine Learning Research Working Group
https://mlcommons.org/2024/01/mlc-data-centric-ml-research-working-group/ (January 3, 2024)
DataPerf and Dynabench provide the foundation for new data-centric innovation

We are pleased to announce the Data-centric Machine Learning Research (DMLR) working group. The goal of the new working group is to improve the datasets that matter, in the ways that matter to the machine learning community.

Benchmarks for machine learning solutions based on static datasets have well-known issues: they saturate quickly, are susceptible to overfitting, contain exploitable annotator artifacts, and have unclear or imperfect evaluation metrics. To help overcome these and other data-centric challenges, the DMLR working group brings together two key projects, DataPerf and Dynabench, which deliver novel approaches towards data-centric methodology benchmarks, data quality benchmarks, and benchmark data curation with dynamic adversarial data collection. 

“I’m excited to see the DMLR working group accelerate machine learning through defining, developing, and operating benchmarks for datasets and data-centric algorithms,” said David Kanter, MLCommons Executive Director. “This flexible ML benchmarking platform will unlock new approaches and methodologies in AI.”

The new paradigm of data-centric benchmarking is powered by Dynabench, a research platform for dynamic data collection and benchmarking. Dynabench challenges existing ML benchmarking dogma by embracing dynamic dataset generation with a human in the loop, where people are tasked with finding examples that fool a state-of-the-art AI model. This dynamic approach enables rapid iteration: the collected examples can be used to train even better state-of-the-art models.
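The core loop is simple to describe in code. The sketch below is a toy illustration of the human-in-the-loop idea, assuming only a prediction function and a stream of human-proposed examples; it is not Dynabench's implementation.

```python
# Toy illustration of dynamic adversarial data collection (not Dynabench code):
# keep the human-proposed examples that the current model gets wrong, so they
# can be folded into the next round of training data.
def collect_fooling_examples(model_predict, human_examples):
    """model_predict: callable(text) -> label; human_examples: iterable of (text, true_label)."""
    fooling_set = []
    for text, true_label in human_examples:
        if model_predict(text) != true_label:   # the human fooled the model
            fooling_set.append((text, true_label))
    return fooling_set

# Example: a "model" that labels every review as positive.
examples = [("great movie", "pos"), ("subtly disappointing", "neg")]
print(collect_fooling_examples(lambda text: "pos", examples))
# -> [('subtly disappointing', 'neg')], candidates for the next training round
```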

DataPerf complements Dynabench by delivering an increased research focus on data quality and excellence. DataPerf evaluates the quality of training and test data, as well as the algorithms for constructing or optimizing datasets. By enabling the construction and optimization of test sets, DataPerf plays a critical role in evaluating future AI systems for bias and advancing equity. A recent NeurIPS 2023 paper on DataPerf describes the first round of five DataPerf challenges across multiple modalities.
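To make the idea of benchmarking data rather than models concrete, the sketch below scores a data-selection algorithm by holding the model and test set fixed and varying only the chosen training subset. It is a generic illustration on synthetic data, not DataPerf's actual evaluation harness.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a candidate pool and a fixed held-out test set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

def evaluate_selection(selected_idx, budget=200):
    """Fixed model + fixed test set: only the selected training indices vary."""
    idx = np.asarray(selected_idx)[:budget]
    model = LogisticRegression(max_iter=1000).fit(X_pool[idx], y_pool[idx])
    return accuracy_score(y_test, model.predict(X_test))

# Baseline: a random selection strategy.
random_idx = np.random.default_rng(0).permutation(len(X_pool))
print("random selection accuracy:", evaluate_selection(random_idx))
```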

“We are excited to further DataPerf’s impact in close collaboration with Dynabench. Together, we will continue to address pressing research questions as part of one unified effort to advance data-centric approaches to machine learning,” said Lilith Bat-Leah, who co-chairs the DMLR working group along with Max Bartolo and Praveen Paritosh. 

Measurement, tracked on leaderboards, is core to understanding progress. DataPerf and other Dynabench challenges will continue to be hosted by the new DMLR working group, and the platform will continue to evolve to address challenges core to the working group's mission. Researchers from Coactive.AI, Cohere, Common Crawl, ETH Zurich, Factored, Google, Harvard, Meta, Mod Op, Oxford, Stanford and many other institutions have contributed to prior challenges. New collaborations are underway with Common Crawl, DataComp, NASA, and the University of Texas at Austin.

We invite others to get involved and help shape the future of the DMLR working group.

DataPerf: the Leaderboard for Data
https://mlcommons.org/2023/03/dataperf-the-leaderboard-for-data/ (March 30, 2023)
Introducing a data-centric platform and community for competitions and building better ML.

As Machine Learning (ML) has grown rapidly and become more powerful, new challenges have arisen. The latest are limitations in the availability and utility of training and testing datasets for ML models. We propose DataPerf, the first platform and community to develop competitions and leaderboards for data and data-centric AI algorithms. Competitions and leaderboards have traditionally focused on improving ML models and have inspired decades of innovation, including accomplishments such as AlexNet, whose breakthrough image recognition performance on the ImageNet dataset catalyzed the latest revolution in machine learning. By applying these proven approaches in a data-centric fashion, we believe that we can break through these dataset limitations and encourage better machine learning in the decades to come.

Data is the new bottleneck for ML

Could the keys to diagnosing and treating cancer earlier, engineering safe self-driving cars, automating fact-checking, detecting fraud, and reducing litigation costs have something in common? Machine learning (ML) promises new and exciting solutions to these problems, provided that appropriate training and testing data is available.

However, training and testing datasets are often too small, too specialized, too limited, of poor quality, or too static, leading to what we call the data bottleneck. The lack of appropriate training and testing data hinders ML development and deployment in various domains, including healthcare, transportation, finance, cybersecurity, and more.

In this blog post, we identify the data bottlenecks we currently face and explain how DataPerf provides incentives for the research and engineering communities to overcome them.

Figure 1: Changing the ML paradigm: On the left, standard ML assumes static training and test datasets. On the right, DataPerf evolves training and test data alongside models, overcoming data bottlenecks.

The first bottleneck is data quality. Datasets for ML come from a wide range of sources. For example, most ML research relies on public training and test datasets built from easily available sources such as web scrapes, forums, and Wikipedia, as well as crowdsourcing. However, these sources often suffer from quality, distribution, and bias issues, among others. For example, as Dollar Street has illustrated, most visual data comes from the wealthy portions of the developed world, leading to bias.

These data quality bottlenecks in turn lead to data quantity bottlenecks. As data grows, a large portion of it is low quality, and each additional low-quality example provides incrementally less benefit, leading to non-linear and rapid growth in model size[1]. This ultimately drives up the size and computational cost of models, with recent innovations requiring staggering and costly infrastructure[2]. As we exhaust our public data sources, ML models may even saturate in accuracy, stalling progress[3]. Finally, there is a growing gap between performance on these public research datasets and performance in the real world. It is therefore crucial for the AI community to actively benchmark and conduct research to enhance the quality of training and test data. This is the core principle of the data-centric AI community.

DataPerf, the leaderboard for data

DataPerf is inspired by the success of ML Leaderboards. Leaderboards such as SQuAD, SuperGLUE, Kaggle and others have played a significant role in advancing research in ML models. These leaderboards create a shared metric for measuring progress and act as a compass for shaping ML research.

DataPerf is the first platform and community to develop leaderboards for data and data-centric AI algorithms. We aim to have a similar impact on data-centric AI research as ML leaderboards had on ML model research. DataPerf leverages Dynabench, a platform that supports benchmarking of data, data-centric algorithms, and models.

DataPerf v0.5 features five challenges covering common data-centric tasks across four different application domains:

  • Training data selection (Vision) – Design a data selection strategy that selects the best training set from a large candidate pool of weakly labeled training images.
  • Training data selection (Speech) – Design a data selection strategy that selects the best training set from a large candidate pool of automatically extracted clips of spoken words.
  • Training data cleaning (Vision) – Design a data cleaning strategy that chooses samples to relabel from a noisy training set. The current version of this challenge targets image classification.
  • Training dataset valuation (NLP) – Design a data acquisition strategy that selects the best training set from multiple data sellers based on limited information exchanged between buyers and sellers.
  • Adversarial Nibbler (Multimodal text-to-image) – Design safe-looking prompts that lead to unsafe image generations.

These challenges aim to benchmark and improve the performance of data-centric algorithms and models. Each challenge on the DataPerf website includes design documents that outline the problem, model, quality target, rules and submission guidelines. The Dynabench platform enables us to offer a live leaderboard, an online evaluation framework, and track submissions over time.
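For a sense of what a submission to the selection challenges above might involve, here is a minimal sketch of one possible strategy: rank weakly labeled candidates by distance to their class centroid in embedding space and keep the closest. The embeddings and labels are synthetic stand-ins; a real entry would follow the challenge's released candidate pool and submission format.

```python
import numpy as np

# Synthetic stand-ins for a candidate pool of weakly labeled examples.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))      # candidate-pool embeddings
weak_labels = rng.integers(0, 10, size=1000)  # noisy / weak labels

def select_by_centroid(embeddings, labels, per_class=20):
    """Keep the examples closest to their class centroid in embedding space."""
    selected = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        centroid = embeddings[idx].mean(axis=0)
        dists = np.linalg.norm(embeddings[idx] - centroid, axis=1)
        selected.extend(idx[np.argsort(dists)[:per_class]].tolist())
    return selected

chosen = select_by_centroid(embeddings, weak_labels)
print(len(chosen), "examples selected")
```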

How to get involved?

As a community of machine learning researchers, data scientists, and engineers, we are committed to improving machine learning by enhancing data quality. DataPerf is calling on innovators in academia and industry to measure and validate data-centric algorithms and techniques, and to enhance datasets through these benchmarks. The first round of DataPerf 2023 challenges opens for participation on March 30, 2023, and submissions will close on May 26, 2023. The winners of the challenges will be invited to present their work at ICML 2023 and to publish their work in the forthcoming Data for ML Research (DMLR) journal.

To help, please share this blog post with your colleagues. Visit the DataPerf website to participate in the DataPerf challenges, or to design a new data-centric challenge. Join the DataPerf working group to stay up-to-date on the ongoing changes in the data-centric ML community.

We would like to thank the Dynabench engineering team for providing the infrastructure and support for DataPerf. The DataPerf benchmarks were created over the last year by engineers and scientists from: Coactive.ai, ETH Zurich, Google, Harvard University, Meta, MLCommons, and Stanford University. In addition, this effort would not have been possible without the support of DataPerf working group members from Carnegie Mellon University, Digital Prism Advisors, Hugging Face, Institute for Human and Machine Cognition, Landing.ai, San Diego Supercomputing Center, Thomson Reuters Lab, and TU Eindhoven.

References


  1. Hestness, Joel, et al. “Deep learning scaling is predictable, empirically.” arXiv preprint arXiv:1712.00409 (2017). 
  2. Sevilla, Jaime, et al. “Compute trends across three eras of machine learning.” 2022 International Joint Conference on Neural Networks (IJCNN). IEEE, 2022. 
  3. Villalobos, Pablo, et al. “Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning.” arXiv preprint arXiv:2211.04325 (2022). 
