MLCommons Builds New Agentic Reliability Evaluation Standard in Collaboration with Industry Leaders https://mlcommons.org/2025/06/ares-announce/ Fri, 27 Jun 2025 17:11:26 +0000 https://mlcommons.org/?p=2977 MLCommons and partners unite to create actionable reliability standards for next-generation AI agents.

As agentic AI rapidly transforms the enterprise adoption landscape, MLCommons is convening experts from across academia, civil society, and industry to better understand the risks associated with practical, near-term deployments of AI agents.

The challenge we posed to our community was: Imagine you are the decision maker on a product launch in 2026 for an agent that could take limited actions, such as placing an order or issuing a refund – what evaluations would you need to make an informed decision?

Last month, MLCommons co-hosted the Agentic Reliability Evaluation Summit (ARES) with the Oxford Martin School’s AI Governance Initiative, Schmidt Sciences, Laboratoire National de Metrologie et D’Essais (LNE), and the AI Verify Foundation. 34 organizations attended the convening — including AI labs, large software companies, AI safety institutes, startups, and civil society organizations — to discuss new and emerging evaluation methods for agentic AI. 

Today, MLCommons is announcing a new collaboration with contributors from Advai, AI Verify Foundation, Anthropic, Arize AI, Cohere, Google, Intel, LNE, Meta, Microsoft, NASSCOM, OpenAI, Patronus AI, Polytechnique Montreal, Qualcomm, QuantumBlack – AI by McKinsey, Salesforce, Schmidt Sciences, ServiceNow, University of Cambridge, University of Oxford, University of Illinois Urbana-Champaign, and University of California, Santa Barbara to co-develop an open agent reliability evaluation standard to operationalize trust in agentic deployments. This collaboration brings together a diverse ecosystem spanning AI builders, major enterprise deployers, specialized testing and safety providers, and important global policy organizations to ensure that this standard is both technically sound and practically implementable in the near-term. By uniting these different perspectives, this collaboration will create frameworks and benchmarks that are grounded in technical reality while addressing the practical needs of the global AI ecosystem.

We have outlined a set of principles to evaluate the reliability of near-term agentic use-cases, focusing on scenarios with limited actions within structured contexts. The evaluation framework will be organized into four key categories: Correctness, Safety, Security, and Control. The outcomes from this group will comprise: 1) design principles for agentic development, and 2) business-relevant benchmarks for correctness, safety and control, and security.
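
To make these categories concrete, the minimal sketch below (in Python) shows how a single test case for a limited-action agent – one that can only place orders or issue refunds – might be scored against Correctness, Safety, Security, and Control. Every name, field, and rule in it is a hypothetical illustration for discussion, not part of the standard this collaboration is developing.

```python
# Hypothetical illustration only: these class names, fields, and checks are
# not part of the ARES standard under development.
from dataclasses import dataclass, field

@dataclass
class AgentAction:
    name: str          # e.g. "issue_refund" or "place_order"
    amount: float      # dollar value of the action
    authorized: bool   # whether required human/policy approval was obtained

@dataclass
class AgentTestCase:
    prompt: str
    actions: list[AgentAction] = field(default_factory=list)

    def evaluate(self, refund_limit: float = 100.0) -> dict[str, bool]:
        """Score one agent transcript against the four categories (toy rules)."""
        refunds = [a for a in self.actions if a.name == "issue_refund"]
        allowed = {"issue_refund", "place_order"}
        return {
            "correctness": len(refunds) == 1,                          # did what was asked, once
            "safety": all(a.amount <= refund_limit for a in refunds),  # stayed within business limits
            "security": all(a.name in allowed for a in self.actions),  # used only permitted actions
            "control": all(a.authorized for a in refunds),             # consequential steps were approved
        }

case = AgentTestCase(
    prompt="Customer requests a refund of $42 for order #123.",
    actions=[AgentAction(name="issue_refund", amount=42.0, authorized=True)],
)
print(case.evaluate())
# {'correctness': True, 'safety': True, 'security': True, 'control': True}
```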

We welcome collaboration with a diverse range of enterprises, technology leaders, and business builders who share our vision. If you are interested in joining, please email ai-risk-and-reliability-chairs@mlcommons.org.

Emerging themes from AAAI 2025: standardisation, evaluation & collaboration in AI safety https://mlcommons.org/2025/06/aaai2025/ Tue, 03 Jun 2025 22:36:33 +0000 https://mlcommons.org/?p=2984 Advancing AI safety is as much about creating ethical, adaptive, and collaborative governance mechanisms as it is about embracing rigorous technical benchmarks.

The Association for the Advancement of Artificial Intelligence (AAAI) 2025 workshop on Datasets and Evaluators for AI Safety on March 3rd, 2025, provided a forum to discuss the challenges and strategies in AI safety. Drawing on contributions from leading experts across academia, industry, and policy, the discussions revealed that advancing AI safety is as much about creating ethical, adaptive, and collaborative governance mechanisms as it is about embracing rigorous technical benchmarks.

Establishing robust standards for low-risk AI

The workshop highlighted the need for standardisation and reliable benchmarks to reduce AI risks. In his talk on “Building an Ecosystem for Reliable, Low-Risk AI,” Dr Peter Mattson, President of MLCommons, outlined the important role of standardised safety benchmarks. The introduction of frameworks like the AILuminate Benchmark, which evaluates AI systems across 12 distinct safety hazards (e.g., physical, non-physical, and contextual risks), has sparked a conversation on making safety evaluations robust and accessible. Such benchmarks provide a quantitative measure of risks (from privacy and intellectual property challenges to non-physical harms) and are a foundation for industry-wide standards.

The workshop also showcased metadata tools such as the Croissant vocabulary, which standardises ML datasets to improve reproducibility and transparency. These initiatives address a core barrier in verifying and replicating AI results by ensuring that datasets are FAIR (findable, accessible, interoperable, and reusable). Ultimately, the workshop showed that safer and more reliable AI deployments can be achieved through greater technical rigour in evaluation.
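
As a rough illustration of the idea behind Croissant – dataset facts captured as structured, machine-readable metadata so results can be found, cited, and reproduced – the Python sketch below builds a simplified record. The field names are illustrative only and do not reproduce the official Croissant JSON-LD vocabulary; consult the Croissant specification for the real schema.

```python
# Simplified illustration of metadata-described datasets; field names are
# illustrative and not the official Croissant JSON-LD vocabulary.
import json

dataset_metadata = {
    "name": "example-safety-prompts",
    "description": "Prompts exercising a taxonomy of AI safety hazards.",
    "license": "CC-BY-4.0",
    "url": "https://example.org/datasets/example-safety-prompts",
    "recordSets": [
        {
            "name": "prompts",
            "fields": [
                {"name": "prompt_text", "dataType": "Text"},
                {"name": "hazard_category", "dataType": "Text"},
            ],
        }
    ],
}

# Emitting the record as JSON makes it easy for tools to index, validate,
# and compare datasets -- the property that supports FAIR data practices.
print(json.dumps(dataset_metadata, indent=2))
```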

Balancing technical rigor with ethical considerations

Another significant insight from the workshop was the call to integrate ethical governance into technical evaluations. Participants highlighted that technical benchmarks must be complemented by sociotechnical frameworks that manage AI’s current fairness, inclusivity, and broader societal impacts. For instance, a presentation by Prof Virginia Dignum explored the tension between long-term speculative risks and immediate ethical challenges. Prof Dignum argued that an excessive focus on future harms might detract from addressing present-day issues such as bias and fairness.

Panel discussions underscored that effective AI safety requires both robust technical metrics and frameworks for social responsibility to tackle complex real-world environments. Without the latter, unforeseen biases, or worse, are more likely to be perpetuated when AI systems “in the wild” encounter safety challenges not anticipated during model training.

Adapting evaluation methodologies in a dynamic landscape

The rapidly evolving nature of AI technologies calls for a dynamic approach to safety evaluations. The workshop repeatedly illustrated that safety frameworks must be adaptive; what is deemed safe today may not hold in a future where AI systems evolve unexpectedly. Panelists agreed that traditional, static evaluation methods are often ill-equipped to handle the nuances of modern AI applications.

Emergent harms, manifesting as indirect or second-order effects, challenge conventional safety paradigms. Practitioners from multiple disciplinary backgrounds emphasised how all AI safety metrics require continuous re-evaluation and adjustment as models scale in complex environments. This interdisciplinary call to action reminds us that AI safety evaluations must be as flexible and dynamic as the technologies they are designed to safeguard.

Creating collaborative and interdisciplinary approaches

Underlying the discussions at AAAI 2025 was the recognition that addressing AI safety is inherently an interdisciplinary challenge. From data scientists and software engineers to policymakers and ethicists, the workshop brought together diverse experts, each bringing unique priorities to AI safety. For instance, during the panel discussions, some experts advocated for technical measures such as implementing self-customised safety filters and automated source-referencing. Meanwhile, others stressed governance priorities like a broad citizen-first approach and frameworks to safeguard vulnerable groups.

Integrating these viewpoints is critical for developing safety frameworks that are both technically sound and socially responsive. Collaborative discussions of this kind help to bridge the gap between theoretical research and practical implementation. Unsurprisingly, there was broad consensus that some form of collective action will drive the evolution of safer, more reliable AI systems. Initiatives such as the development of the Croissant vocabulary and the ongoing work on standardised benchmarks (e.g., AILuminate) effectively capture this spirit of cooperation, given that both rely on cross-sector engagement to succeed.

Conclusion

Ultimately, two critical themes emerged from the AAAI 2025 Datasets and Evaluators for AI Safety workshop: the need for adaptive evaluation frameworks that can handle real-world complexity, and the need for ongoing interdisciplinary discussion to ensure that our notion of “risk” is as robust as possible. Many topics were discussed throughout the day, but these themes are particularly important for ongoing work such as MLCommons’ AI Risk and Reliability initiative. By the end of the event, there was a clear appetite for more interdisciplinary workshops like this one. The recurring theme of adaptability tells us that practitioners will need to keep abreast of AI safety responses to the fast-moving developments we continue to see across the ecosystem.

MLCommons Announces Expansion of Industry-Leading AILuminate Benchmark https://mlcommons.org/2025/05/nasscom/ Thu, 29 May 2025 15:10:24 +0000 https://mlcommons.org/?p=2934 World leader in AI benchmarking announces new partnership with India’s NASSCOM; updated reliability grades for leading LLMs

MLCommons today announced that it is expanding its first-of-its-kind AILuminate benchmark to measure AI reliability across new models, languages, and tools. As part of this expansion, MLCommons is partnering with NASSCOM, India’s premier technology trade association, to bring AILuminate’s globally recognized AI reliability benchmarks to South Asia. MLCommons is also unveiling new proof of concept testing for AILuminate’s Chinese-language capabilities and new AILuminate reliability grades for an expanded suite of large language models (LLMs). 

“We’re looking forward to working with NASSCOM to develop India-specific, Hindi-language benchmarks and ensure companies in India and around the world can better measure the reliability and risk of their AI products,” said Peter Mattson, President of MLCommons. “This partnership, along with new AILuminate grades and proof of concept for Chinese language capabilities, represents a major step towards the development of globally inclusive industry standards for AI reliability.”

“The rapid development of AI is reshaping India’s technology sector and, in order to harness risk and foster innovation, rigorous global standards can help align the growth of the industry with emerging best practices,” said Ankit Bose, Head of NASSCOM AI. “We plan to work alongside MLCommons to develop these standards and ensure that the growth and societal integration of AI technology continues responsibly.”

The NASSCOM collaboration builds on MLCommons’ intentionally global approach to AI benchmarking. Modeled after MLCommons’ ongoing partnership with Singapore’s AI Verify Foundation, the NASSCOM partnership will help to meet South Asia’s urgent need for standardized AI benchmarks that are collaboratively designed and trusted by the region’s industry experts, policymakers, civil society members, and academic researchers. MLCommons’ partnership with the AI Verify Foundation – in close collaboration with the National University of Singapore – has already resulted in significant progress towards globally-inclusive AI benchmarking across East Asia, including just-released proof of concept scores for Chinese-language LLMs.

AILuminate is also unveiling new reliability grades for an updated and expanded suite of LLMs, to help companies around the world better measure product risk. Like previous AILuminate testing, these grades are based on LLM responses to 24,000 test prompts across 12 hazard categories – including violent and non-violent crimes, child sexual exploitation, hate, and suicide/self-harm. None of the LLMs evaluated were given any advance knowledge of the evaluation prompts (a common problem in non-rigorous benchmarking), nor access to the evaluator model used to assess responses. This independence provides a methodological rigor uncommon in standard academic research or private benchmarking.
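
For readers unfamiliar with how grades of this kind are typically produced, the sketch below shows one way per-hazard violation rates could be rolled up into summary grades. The thresholds and labels are purely illustrative assumptions and are not the official AILuminate grading scale, which is defined in the AILuminate Assessment Standard.

```python
# Illustrative only: thresholds and labels below are NOT the official
# AILuminate grading scale; they sketch the general idea of summarizing
# per-hazard violation rates into grades.
def grade_from_violation_rate(rate: float) -> str:
    if rate <= 0.001:
        return "Excellent"
    if rate <= 0.01:
        return "Very Good"
    if rate <= 0.05:
        return "Good"
    if rate <= 0.15:
        return "Fair"
    return "Poor"

per_hazard_violation_rates = {   # toy numbers for three of the twelve hazards
    "violent_crimes": 0.004,
    "hate": 0.012,
    "privacy": 0.030,
}

print({hazard: grade_from_violation_rate(rate)
       for hazard, rate in per_hazard_violation_rates.items()})

overall = sum(per_hazard_violation_rates.values()) / len(per_hazard_violation_rates)
print("overall:", grade_from_violation_rate(overall))
```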

“Companies are rapidly incorporating chatbots into their products, and these updated grades will help them better understand and compare risk across new and constantly-updated models,” said Rebecca Weiss, Executive Director of MLCommons. “We’re grateful to our partners on the Risk and Reliability Working Group – including some of the foremost AI researchers, developers, and technical experts – for ensuring a rigorous, empirically-sound analysis that can be trusted by industry and academia alike.”

Having successfully expanded the AILuminate benchmark to multiple languages, the AI Risk & Reliability Working Group is beginning the process of evaluating reliability across increasingly sophisticated AI tools, including multi-modal LLMs and agentic AI. We hope to announce proof-of-concept benchmarks in these spaces later this year.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI risk and reliability.

MLCommons Releases French AILuminate Benchmark Demo Prompt Dataset to Github https://mlcommons.org/2025/04/ailuminate-french-datasets/ Wed, 16 Apr 2025 17:00:00 +0000 https://mlcommons.org/?p=2833 Set of 1,200 Creative Commons French Demo prompts and 12,000 Practice Test prompts exercise the breadth of hazards for generative AI systems

Following the recent French release of the AILuminate v1.1 benchmark, MLCommons is announcing the public release of two French language datasets for the benchmark. AILuminate measures the reliability and integrity of generative AI language systems against twelve categories of hazards and is currently available in English and French. Today’s dataset release includes a Creative Commons-licensed Demo prompt dataset of over 1,200 prompts and a Practice Test dataset of 12,000 prompts, both of which span all twelve AILuminate hazard categories.

The Demo prompt dataset is available on GitHub now and the full Practice Test dataset may be obtained by contacting MLCommons. English versions of both datasets are also available for use.

The AILuminate benchmark

The AILuminate benchmark, designed and developed by the MLCommons AI Risk & Reliability working group, assesses LLM responses to 12,000 test prompts per language, across twelve categories of hazards that users of generative AI language systems may encounter. The benchmark was designed to help organizations evaluate language systems before deploying them in production environments, to compare systems before procurement, or to verify their language systems comply with organizational and government policies. In addition to its unique corpus of prompts, AILuminate also includes a best-in-class evaluation system using a tuned ensemble of safety evaluation models. MLCommons has used the benchmark to assess the performance of many of the most prominent AI language systems-under-test (SUTs) across all of the hazard categories, and published public reports of their findings. 

The AILuminate French language prompt datasets can help provide valuable insights into your AI systems: 

  • Demo Prompt Dataset: A Creative Commons dataset with 1,200 prompts that is a subset of the AILuminate Practice Test dataset. The French language Demo dataset is useful for in-house testing, provides examples of specific hazards, and offers seeds and inspiration for the creation of other prompts. It is available today on GitHub.
  • Practice Test Dataset: Includes 12,000 prompts. This dataset closely resembles the official test and is intended to provide a realistic projection of how models would be evaluated by AILuminate when used with the MLCommons AILuminate evaluator. Access to the comprehensive Practice dataset is available upon request from MLCommons. 

AILuminate’s prompts are based on 12 hazard categories, which are extensively documented in the AILuminate Assessment Standard.

All prompts are currently available in English and French, with future plans to expand to Simplified Chinese and Hindi. Each prompt dataset is human-made and reviewed by native language speakers for linguistic and cultural relevancy, to ensure accurate linguistic representation. MLCommons intends to regularly update the prompt datasets, including the Creative Commons Demo dataset, as the community’s understanding of AI hazards evolves – for example, as agentic behavior is developed in AI systems. The data is formatted to be used immediately with ModelBench, the open-source language model testing tool published by MLCommons.

“AI reliability is a global problem that requires an intentionally global approach to address,” said Peter Mattson, Founder and President of MLCommons. “The availability of the French Demo and Practice Test datasets reflects our commitment to evaluating AI risk concerns across languages, cultures, and value systems — and ensures our benchmarks and datasets are readily available to vendors and developers across continents.” 

Raising the bar for reliability testing by empowering stakeholders

The AILuminate Demo prompt datasets provide an important starting point for those who are developing their in-house abilities to test the safety of generative AI systems across the twelve defined AILuminate hazard categories. While we discourage training AI systems directly using this data – following the industry best practice of maintaining separate evaluation suites to avoid overfitting AI models – we do expect the dataset to have direct relevance to the work of a wide variety of stakeholders. It serves as a set of illustrative examples for those looking to develop their own training and test datasets, or to assess the quality of existing datasets. In addition, educators teaching about AI safety will find value in having a robust set of examples that exercise the full taxonomy of hazards, as will policymakers looking for examples of prompts that can generate hazardous results.

Download the Demo prompt dataset today!

The Demo test dataset is made available using the Creative Commons CC-BY-4.0 license and can be downloaded directly from GitHub. The full Practice Test dataset may be obtained by contacting MLCommons. After utilizing the Offline and Online practice tests, we encourage model developers to submit their systems to the AILuminate Official Test for public benchmarking, validation and publishing of the results.

More information on AILuminate can be found here, including a FAQ. We encourage stakeholders to join the MLCommons AI Risk & Reliability working group and our efforts to ensure greater reliability of AI systems.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI risk and reliability. We invite others to join the AI Risk & Reliability working group.

FRENCH ANNOUNCEMENT

MLCommons publie le jeu de données de prompts de démonstration AILuminate en français sur GitHub

Une série de 1 200 prompts Creative Commons en français explore l’étendue des risques pour les systèmes d’IA générative

Suite au récent lancement de la version française du benchmark AILuminate v1.1, MLCommons annonce la mise à disposition publique de deux jeux de données en langue française pour ce benchmark. AILuminate évalue la fiabilité et l’intégrité des systèmes d’IA générative dans douze catégories de risques, et ses benchmarks sont désormais disponibles pour les systèmes en anglais et en français. La publication d’aujourd’hui inclut un jeu de 1 200 prompts de démonstration sous licence Creative Commons et un jeu de 12 000 prompts d’essai couvrant les douze types de risques définis par AILuminate.

Le jeu de données de démonstration est dès maintenant disponible sur GitHub, tandis que le jeu complet d’essai peut être obtenu en contactant MLCommons. La version anglaise des deux jeux est également disponible.

Le benchmark AILuminate

Le benchmark AILuminate, conçu et développé par le groupe de travail sur les risques et la fiabilité des IA chez MLCommons, évalue les réponses des modèles linguistiques (LLM) à 12 000 prompts par langue, répartis dans douze catégories de risques auxquels les utilisateurs peuvent être confrontés. Ce benchmark aide les organisations à évaluer leurs systèmes avant leur déploiement, à comparer les produits avant achat ou à vérifier leur conformité aux politiques internes et officielles. En plus d’une collection unique de prompts, AILuminate inclut un système d’évaluation avancé basé sur un ensemble optimisé de modèles d’évaluation des risques. MLCommons a utilisé ce benchmark pour analyser la performance des principaux systèmes linguistiques testés (SUTs) dans toutes les catégories de risques et a publié des rapports publics sur leurs résultats.

Les jeux de données en langue française AILuminate offrent des perspectives précieuses pour vos systèmes d’IA :

  • Jeu de prompts de démonstration : Un sous-ensemble Creative Commons contenant plus de 1 200 prompts issus du jeu de prompts d’essai. Ce jeu est utile pour les tests internes, fournit des exemples spécifiques aux risques identifiés et peut inspirer la création d’autres prompts. Il est disponible dès aujourd’hui sur GitHub.
  • Jeu de prompts d’essai : Comprend 12 000 prompts conçus pour simuler une évaluation réaliste par le système officiel AILuminate. Ce jeu complet peut être obtenu sur demande auprès de MLCommons.

Les prompts sont basés sur les douze catégories de risques définies dans le standard d’évaluation AILuminate, qui inclut notamment :

[Figure : catégories de risques AILuminate traduites en français]

Tous les prompts sont disponibles aujourd’hui en anglais et en français, et seront bientôt disponibles en chinois simplifié et en hindi. Les prompts sont rédigés par un être humain, puis contrôlés par une personne de langue maternelle afin d’en assurer la pertinence linguistique et culturelle. MLCommons prévoit de les mettre à jour régulièrement, notamment le jeu Creative Commons, à mesure que la compréhension communautaire des risques liés à l’IA évolue – par exemple avec le développement du comportement agentique dans les systèmes d’IA. Les données sont formatées pour une utilisation immédiate avec ModelBench, l’outil open-source publié par MLCommons.

« La fiabilité des systèmes IA est un problème mondial nécessitant une approche internationale délibérée », a déclaré Peter Mattson, fondateur et président de MLCommons. « La disponibilité des jeux de prompts français reflète notre engagement quant à l’évaluation des préoccupations liées aux risques de l’IA à travers différentes langues, cultures et systèmes de valeurs – garantissant que nos benchmarks soient accessibles aux entreprises commerciales et aux développeurs dans le monde entier. »

Élever les standards des tests IA en responsabilisant les parties prenantes

Les jeux de données AILuminate fournissent un point de départ important pour le développement des capacités internes en permettant d’évaluer la sécurité des systèmes d’IA générative dans les douze catégories définies par AILuminate. Bien que nous déconseillions l’utilisation de ces prompts pour entraîner des systèmes d’IA (les meilleures pratiques de l’industrie recommandent l’isolation des jeux de données d’évaluation afin d’éviter le surapprentissage), ces prompts sont exploitables dans de nombreux secteurs. Ils servent d’exemples représentatifs pour les utilisateurs qui souhaitent développer leurs propres jeux de données d’entraînement et d’évaluation ou évaluer la qualité de jeux de données existants. Les formateurs en sécurité de l’IA sauront profiter de cet ensemble complet d’exemples couvrant une taxonomie de risques étendue, et les décideurs politiques pourront y trouver des cas concrets où certains prompts peuvent générer des résultats dangereux.

Téléchargez le jeu de données aujourd’hui !

Le jeu de prompts de démonstration est disponible sous licence Creative Commons CC-BY-4.0 et peut être téléchargé directement sur GitHub. Le jeu d’essai complet peut être obtenu en contactant MLCommons. Après l’utilisation des tests d’essai hors ligne ou en ligne, nous encourageons les développeurs à soumettre leurs systèmes à AILuminate pour obtenir un benchmarking public avec validation et publication officielle des résultats.

Pour en savoir plus sur AILuminate et accéder à une FAQ détaillée, nous encourageons toutes les parties prenantes à rejoindre le groupe MLCommons sur les risques IA et la fiabilité.

À propos de MLCommons

MLCommons est le leader mondial du développement de benchmarks pour l’IA. C’est un consortium ouvert dont la mission est d’améliorer l’IA pour tous par la publication de benchmarks et de jeux de données publics. Formée de plus de 125 membres (entreprises de technologie, universitaires et chercheurs du monde entier), MLCommons se concentre sur le développement collaboratif d’outils pour l’industrie de l’IA par le biais de benchmarks, de jeux de données et de mesures publics traitant des risques et de la fiabilité de l’IA. Nous invitons toutes les parties intéressées à rejoindre le groupe sur les risques de l’IA chez MLCommons.

Announcing the AILuminate Benchmarking Policy https://mlcommons.org/2025/04/ail-benchmarking-policy/ Wed, 16 Apr 2025 00:26:43 +0000 https://mlcommons.org/?p=2810 The MLCommons policy to guide which systems we include in AILuminate benchmarks

Just a few months ago we launched our first AILuminate benchmark, a first-of-its-kind reliability test for large language models (LLMs). Our goal with AILuminate is to build a standardized way of evaluating product reliability that will assist developers in improving the reliability of their systems, and give companies better clarity about the reliability of the systems they use. 

For AILuminate to be as effective as it can be, it needs to achieve widespread coverage of AI systems that are available on the market. This is one reason we have prioritized multilingual coverage; we launched French language coverage in February, and plan to launch Hindi and Chinese in the coming months. But beyond language, AILuminate needs to cover as many LLMs that are on the market as possible, and it needs to be continually updated to reflect the latest versions of LLMs available. 

While we work toward achieving this ubiquitous coverage from an engineering standpoint, we are announcing a Benchmarking Policy, which defines which “systems under test” (as we call the LLMs we test) will be included in the AILuminate benchmark going forward. With this policy, we have aimed to strike a balance between giving developers a practice test and notice that their systems will be included, and the need to maintain timely, continuously updated testing of as many of the systems on the market as we can.
You can read the policy in full here.

Unlocking Data Collaboration with AI-Ready Licenses https://mlcommons.org/2025/03/unlocking-data-collab/ Mon, 17 Mar 2025 19:37:31 +0000 https://mlcommons.org/?p=2609 New research shows promising signs that adopting modular, standard data license agreements can clarify key terms, reduce transaction costs, and enable more accessible data use across borders and sectors.

The rapid advancement of artificial intelligence (AI) is fueling the need for complex datasets containing a variety of high-quality text, images, audio, video, and specialized formats. Despite the growing interest in AI and its evident social impact, stakeholders face challenges in establishing reliable frameworks for responsibly sharing AI data. A significant hurdle is the absence of widely accepted, standardized contractual frameworks (i.e., data licenses) for this purpose.

Research funded through an unrestricted grant by the MLCommons Association and recently published by the Open Data Institute (ODI) and the Pratt School of Engineering at Duke University explored the motivations driving data sharing throughout the AI ecosystem. The goal was to inform the design of more fit-for-purpose AI data licenses. A blog post published by OECD.AI/GPAI also recently featured this work.

Using a mixed-methods approach that combined interviews, an online survey, and a review of literature, this research captured a broad spectrum of insights from across eight jurisdictions and multiple sectors covering key roles such as data holders, data users, and intermediaries. Here, we give an overview of these research findings, demonstrating how standard data license agreements hold the potential to mitigate legal uncertainties, reduce transaction costs, and promote fair AI data access, particularly when combined with standard data governance and other tools.

The following key findings are organized according to four main themes: leveraging new standard data licenses to 1) promote greater data access and social good, and 2) address existing ambiguities in current contractual approaches to data sharing, alongside 3) the need for data licensing simplicity and the importance of balancing standardization with the need for some customization, and 4) developing a new standard data licensing framework to help address other data sharing challenges, including legal compliance and ethical considerations.  

[Figure: data licensing pathway steps]

1. New standard data licenses can promote greater responsible data access and AI for social good.

a) No standardized data licenses have been widely adopted, increasing data sharing challenges for under-represented and smaller organizations. The research indicates that the lack of widely adopted, standardized data licenses may create significant barriers to voluntary, legally compliant, and ethical data sharing. This gap might drive up costs and impose contractual burdens that particularly hinder smaller or under-represented organizations from fully benefiting from AI.

b) Data access disparities also impact the advancement of AI for social good. Limited cross-cultural and cross-linguistic data sharing can restrict access to high-quality datasets. The research shows that this shortfall might not only impede responsible AI development among research institutions, governments, and non-profits (especially in developing regions) but also diminish the diversity and richness of AI models.

c) Multi-stakeholder participation can help develop standard data licenses that advance equitable data access and AI for social good. According to the research, involving a diverse range of stakeholders (including representatives from different geographic regions, sectors, and groups that have traditionally faced challenges in the technology sector) appears to be essential. Such multi-stakeholder participation can lead to the development of standard licensing frameworks that more effectively address disparities and foster equitable data access for social good.

2. New standard data licenses can help address existing ambiguities and challenges with current contractual approaches to data sharing.

a) The lack of widely embraced standard data licenses leads to inconsistency and incompatibility and increases the challenges of using data in compliance with license terms. Through this research, it was confirmed that customized license agreements currently in use may lead to inconsistencies and incompatibilities, which in turn increase legal uncertainty and compliance costs. This fragmentation overall makes it more challenging for organizations to share data effectively across their ecosystems.

b) Confusion exists about the meaning of non-commercial limitations in data licenses, and clarification in new standard licenses might encourage more data-sharing transactions. This work highlights that the lack of definitions of “non-commercial” use in current licenses creates significant confusion. Clarifying this term within standard licenses could encourage more frequent and compliant data-sharing transactions.

c) New standard data licenses could clarify rights emerging from AI and provide different options for contractually allocating rights and liabilities. The research shows that commonly used existing licenses do not clearly address the allocation of rights for data usage (or for AI models and outputs developed using such data) resulting in uncertainties regarding rights and liabilities. Standard licenses that offer clearer contractual options, in line with applicable laws, could help resolve these issues.

d) New standard data licenses can help address attribution issues arising under existing agreements. Attribution requirements in widely used license frameworks can be impractical for AI applications, where data used in model training is not directly reproduced in outputs. The research therefore suggests that standardizing license terms and integrating technical solutions like automatic attribution may reduce this ambiguity and simplify data sharing.

3. New standard data licenses should embrace simplicity while balancing standardization with the need for some customization.

a) There is a pressing need for simplicity. Throughout, the research emphasized that drafting licenses in clear, easily interpretable language is critical for reducing transaction costs and building trust, especially for non-legal practitioners.

b) Standard data licensing frameworks should balance the need for standardization with some customization, given the many data types, use cases, models, tasks, and legal requirements. The research also indicates that while standardization can streamline data sharing, licenses may also need to allow for some customization to accommodate the diversity of data types, use cases, AI models, and jurisdiction-specific legal requirements across the AI data lifecycle.

c) Standard modular data license terms (and/or a menu of standard data license agreements) could balance standardization and some customization. Therefore, the research suggests that a modular framework or a menu of license options (developed through a collaborative, multi-stakeholder process) could effectively balance the need for uniformity with the necessity for customization to address varied stakeholder needs.

d) Standard definitions can advance data licensing. The research points to the potential benefits of developing and adopting standard definitions for key terms. Such definitions can further simplify license agreements and help maintain an effective balance between standardization and necessary customization.

4. A new standard data licensing framework may help address other data-sharing challenges, including legal compliance, data and AI governance, and ethical considerations.

a) Standard data license agreements could help address other challenges in implementing voluntary, legally compliant data sharing, including by helping advance data and AI governance. Technical tools for tracking data provenance and lineage, and for easily processing the terms of standard licenses, can enhance transparency and reduce compliance costs. This, the research shows, can support stronger data and AI governance practices. For instance, MLCommons’ Croissant initiative (a metadata format for ML datasets) simplifies discovery and integration by embedding legal and compliance measures into the AI pipeline. This can help to reduce uncertainty in data-sharing exchanges by representing legal and contractual frameworks as metadata. Lastly, complementary tools, such as automatic attribution services in Apache Atlas, can further ensure that license-defined rights are easily represented and therefore upheld.

b) Standard data licenses may help address cross-border and other compliance challenges. The diversity of data-related regulations across jurisdictions (e.g., data protection) continues to create substantial barriers to data sharing. To ease compliance, standard data license terms should be prepared with these legal matters in mind. 

c) Standard data licenses may help advance ethical guidelines for data usage. The research confirmed interest in having ethical restrictions on data and AI model usage in at least some scenarios. A new standard data licensing framework could reference relevant ethical codes of conduct. Some concern was expressed about incorporating codes of conduct or standards, which could change over time, into standard data licenses, as this could increase uncertainty in the contractual terms and costs of compliance. These factors should be considered in the global and multi-stakeholder process developed to prepare standard licenses.

Looking forward

Overall, this work shows promising signs that adopting modular, standard data license agreements can clarify key terms, reduce transaction costs, and enable more accessible data use across borders and sectors. Additionally, it demonstrates that integrating external technical tools and frameworks like Apache Atlas and Croissant with these legal frameworks enhances data discoverability and usability. Together, these measures can help to build a transparent, interoperable ecosystem for responsible AI development. Here, we have the opportunity to shape an AI-driven future that is equitable, sustainable, and indeed serves everyone.
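
One way to picture that integration is license terms expressed as machine-readable metadata that a data pipeline can check automatically before ingesting a dataset. The short Python sketch below is a hypothetical illustration: the field names and the license identifier are invented for the example and do not correspond to any published standard license or metadata vocabulary.

```python
# Hypothetical illustration: license terms carried as metadata so a pipeline
# can check them automatically. Field names and the license identifier are
# invented for this example.
dataset_record = {
    "name": "example-training-corpus",
    "license": {
        "identifier": "Standard-Data-License-v1",   # hypothetical license name
        "commercial_use": False,
        "attribution_required": True,
        "jurisdictions": ["EU", "US"],
    },
}

def permitted_for_commercial_training(record: dict) -> bool:
    """Toy compliance check a data pipeline might run before ingestion."""
    terms = record["license"]
    return bool(terms["commercial_use"])

print(permitted_for_commercial_training(dataset_record))  # False
```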

MLCommons Releases AILuminate LLM v1.1, Adding French Language Capabilities to Industry-Leading AI Safety Benchmark https://mlcommons.org/2025/02/ailumiate-v1-1-fr/ Tue, 11 Feb 2025 05:00:00 +0000 https://mlcommons.org/?p=2412 Announcement comes as AI experts gather at the Paris AI Action Summit, is the first of several AILuminate updates to be released in 2025

Paris, France – February 11, 2025. MLCommons, in partnership with the AI Verify Foundation, today released v1.1 of AILuminate, incorporating new French language capabilities into its first-of-its-kind AI safety benchmark. The new update – which was announced at the Paris AI Action Summit – marks the next step towards a global standard for AI safety and comes as AI purchasers across the globe seek to evaluate and limit product risk in an emerging regulatory landscape. Like its v1.0 predecessor, the French-language v1.1 benchmark was developed collaboratively by AI researchers and industry experts, ensuring a trusted, rigorous analysis of chatbot risk that can be immediately incorporated into company decision-making.

“Companies around the world are increasingly incorporating AI in their products, but they have no common, trusted means of comparing model risk,” said Rebecca Weiss, Executive Director of MLCommons. “By expanding AILuminate’s language capabilities, we are ensuring that global AI developers and purchasers have access to the type of independent, rigorous benchmarking proven to reduce product risk and increase industry safety.”

Like the English v1.0, the French v1.1 version of AILuminate assesses LLM responses to over 24,000 French language test prompts across twelve categories of hazards – including violent crime, hate, and privacy. Unlike many peer benchmarks, none of the LLMs evaluated are given advance access to specific evaluation prompts or the evaluator model. This ensures a methodological rigor uncommon in standard academic research and an empirical analysis that can be trusted by industry and academia alike.

“Building safe and reliable AI is a global problem – and we all have an interest in coordinating on our approach,” said Peter Mattson, Founder and President of MLCommons. “Today’s release marks our commitment to championing a solution to AI safety that’s global by design and is a first step toward evaluating safety concerns across diverse languages, cultures, and value systems.”

The AILuminate benchmark was developed by the MLCommons AI Risk and Reliability working group, a team of leading AI researchers from institutions including Stanford University, Columbia University, and TU Eindhoven, civil society representatives, and technical experts from Google, Intel, NVIDIA, Microsoft, Qualcomm Technologies, Inc., and other industry giants committed to a standardized approach to AI safety. Cognizant that AI safety requires a coordinated global approach, MLCommons also collaborated with international organizations such as the AI Verify Foundation to design the AILuminate benchmark.

“MLCommons’ work in pushing the industry toward a global safety standard is more important now than ever,” said Nicolas Miailhe, Founder and CEO of PRISM Eval. “PRISM is proud to support this work with our latest Behavior Elicitation Technology (BET), and we look forward to continuing to collaborate on this important trust-building effort – in France and beyond.”

Currently available in English and French, AILuminate will be made available in Chinese and Hindi later this year. For more information on MLCommons and the AILuminate Benchmark, please visit mlcommons.org.

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued to use collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve the accuracy, safety, speed, and efficiency of AI technologies.

MLCommons Releases AILuminate Creative Commons DEMO Benchmark Prompt Dataset to Github https://mlcommons.org/2025/01/ailuminate-demo-benchmark-prompt-dataset/ Wed, 15 Jan 2025 18:50:04 +0000 https://mlcommons.org/?p=2130 Set of 1,200 DEMO prompts exercises the breadth of hazards for generative AI systems

Today, MLCommons® publicly released the AILuminate DEMO prompt dataset, a collection of 1,200 prompts for generative AI systems to benchmark safety against twelve categories of hazards. The DEMO prompt set is a component of the full AILuminate Practice Test dataset.

The dataset is available at: https://github.com/mlcommons/ailuminate.

The AILuminate benchmark, designed and developed by the MLCommons AI Risk and Reliability working group, assesses LLM responses to over 24,000 test prompts across twelve categories of hazards that users of generative AI systems may encounter. It also includes a best-in-class evaluation system using a tuned ensemble of safety evaluation models, and public reports of 13 systems-under-test (SUTs) that use the AILuminate v1.0 grading scale to assess the quantitative and qualitative performance of the SUTs in each hazard category.

The full 24,000 test prompts are broken out into the following datasets:

  • A Practice Test dataset with a standard, well-documented set of prompts that can be used for in-house testing, examples of specific hazards, and seeds and inspiration for the creation of other prompts. The Creative Commons DEMO set being released today is a 10% subset of the Practice dataset. Access to the full Practice dataset is available upon request from MLCommons. 
  • An Official Test dataset, used for the official benchmarking of a System Under Test (SUT), which can be a single LLM or an ensemble. The official test is run by MLCommons, using the official AILuminate independent evaluator, to provide a detailed set of performance scores measured against a standard baseline. The Official Test dataset is private to avoid SUTs “gaming” the benchmark by training to the specific test prompts. The Practice Test and Official Test datasets are designed to be statistically equivalent in their coverage and testing of the taxonomy of hazards. To run an Official Test, contact MLCommons.

The 12 hazard categories included in the test prompts are extensively documented in the AILuminate Assessment Standard.

All prompts are currently in English, with plans to expand to French, Chinese and Hindi. MLCommons intends to update the prompt datasets, including the Creative Commons DEMO dataset, on a regular basis as the community’s understanding of AI hazards evolves – for example, as agentic behavior is developed in AI systems. The data is formatted to be used immediately with ModelBench, the open-source language model testing tool published by MLCommons.
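
As a hedged sketch of what working with the released prompts might look like, the Python snippet below loads a local copy of the DEMO file with pandas and tallies prompts per hazard category. The file name and column names are placeholders assumed for illustration; consult the repository's documentation (and ModelBench itself) for the actual schema and supported workflow.

```python
# Placeholder file name and column names -- adjust to the actual schema
# documented in https://github.com/mlcommons/ailuminate.
import pandas as pd

demo = pd.read_csv("demo_prompts.csv")

# Tally how many DEMO prompts fall into each hazard category.
counts = demo.groupby("hazard_category")["prompt_text"].count().sort_values(ascending=False)
print(counts)

# A trivial harness: send each prompt to your own system under test.
# `my_model.generate` stands in for whatever client your SUT exposes.
# for prompt in demo["prompt_text"]:
#     response = my_model.generate(prompt)
#     ...log or score the response here...
```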

“We have already started using AILuminate to help us robustly evaluate the safety of our models at Contextual. Its rigor and careful design has helped us gain a much deeper understanding of their strengths and weaknesses,” said Bertie Vidgen, Head of Human Data at Contextual AI.

Raising the Bar for Safety Testing by Empowering Stakeholders

The AILuminate Creative Commons-licensed DEMO prompt dataset provides an important starting point for those who are developing their in-house abilities to test the safety of generative AI systems for the twelve defined AILuminate hazard categories. While we discourage training AI systems directly using this data – following the industry best practice of maintaining separate evaluation suites to avoid overfitting AI models – we do expect the dataset to have direct relevance to the work of a wide variety of stakeholders. It serves as a set of illustrative examples for those looking to develop their own training and test datasets, or to assess the quality of existing datasets. In addition, educators teaching about AI safety will find value in having a robust set of examples that exercise the full taxonomy of hazards, as will policymakers looking for examples of prompts that can generate hazardous results.

Download the DEMO Prompt Dataset Today

The DEMO test set is made available using the Creative Commons CC-BY-4.0 license and can be downloaded directly from GitHub. The full Practice Test dataset may be obtained by contacting MLCommons. After utilizing the Offline and Online practice tests, we encourage model developers to submit their systems to the AILuminate Official Test for public benchmarking, validation and publishing of the results.

More information on AILuminate can be found here, including a FAQ. We encourage stakeholders to join the MLCommons AI Risk and Reliability working group and our efforts to ensure the safety of AI systems.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI risk and reliability. We invite others to join the AI Risk and Reliability working group.

MLCommons Launches AILuminate, First-of-its-Kind Benchmark to Measure the Safety of Large Language Models   https://mlcommons.org/2024/12/mlcommons-ailuminate-v1-0-release/ Wed, 04 Dec 2024 13:56:00 +0000 https://mlcommons.org/?p=1796 Benchmark developed by leading AI researchers and industry experts; marks a significant milestone towards a global standard on AI safety

MLCommons today released AILuminate, a first-of-its-kind safety test for large language models (LLMs). The v1.0 benchmark – which provides a series of safety grades for the most widely-used LLMs – is the first AI safety benchmark designed collaboratively by AI researchers and industry experts. It builds on MLCommons’ track record of producing trusted AI performance benchmarks, and offers a scientific, independent analysis of LLM risk that can be immediately incorporated into company decision-making. 

“Companies are increasingly incorporating AI into their products, but they have no standardized way of evaluating product safety,” said Peter Mattson, Founder and President of MLCommons. “Just like other complex technologies like cars or planes, AI models require industry-standard testing to guide responsible development. We hope this benchmark will assist developers in improving the safety of their systems, and will give companies better clarity about the safety of the systems they use.”  

The AILuminate benchmark assesses LLM responses to over 24,000 test prompts across twelve categories of hazards. None of the LLMs evaluated were given any advance knowledge of the evaluation prompts (a common problem in non-rigorous benchmarking), nor access to the evaluator model used to assess responses. This independence provides a methodological rigor uncommon in standard academic research and ensures an empirical analysis that can be trusted by industry and academia alike.

“With roots in well-respected research institutions, an open and transparent process, and buy-in across the industry, MLCommons is uniquely equipped to advance a global baseline on AI risk and reliability,” said Rebecca Weiss, Executive Director of MLCommons. “We are proud to release our v1.0 benchmark, which marks a major milestone in our work to build a harmonized approach to safer AI. By making AI more transparent, more reliable, and more trusted, we can ensure its positive use in society and move the industry forward.” 

This benchmark was developed by the MLCommons AI Risk and Reliability working group — a team of leading AI researchers from institutions including Stanford University, Columbia University, and TU Eindhoven, civil society representatives, and technical experts from Google, Intel, NVIDIA, Meta, Microsoft, Qualcomm Technologies, Inc., and other industry giants committed to a standardized approach to AI safety. The working group plans to release ongoing updates as AI technologies continue to advance. 

“Industry-wide safety benchmarks are an important step in fostering trust across an often-fractured AI safety ecosystem,” said Camille François, Professor at Columbia University. “AILuminate will provide a shared foundation and encourage transparent collaboration and ongoing research into the nuanced challenges of AI safety.”

“The developers of AI technologies and organizations using AI have a shared interest in transparent and practical safety assessments. AI will only be adopted and used to address society’s greatest challenges if people trust that it is safe. The AILuminate benchmark represents important progress in developing research-based, effective evaluation techniques for AI safety testing,” said Natasha Crampton, Chief Responsible AI Officer, Microsoft.

“As a member of the working group, I experienced firsthand the rigor and thoughtfulness that went into the development of AILuminate,” said Percy Liang, computer science professor at Stanford University. “The global, multi-stakeholder process is crucial for building trustworthy safety evals. I was pleased with the extensiveness of the evaluation, including multilingual support, and I look forward to seeing how AILuminate evolves with the technology.”

“Enterprise AI adoption depends on trust, transparency, and safety,” said Navrina Singh, working group member and founder and CEO of Credo AI. “The AILuminate benchmark, developed through rigorous collaboration between industry leaders and researchers, offers a trusted and fair framework for assessing model risk. This milestone sets a critical foundation for AI safety standards, enabling organizations to confidently and responsibly integrate AI into their operations.”

Cognizant that AI safety requires a coordinated global approach, MLCommons intentionally collaborated with international organizations such as the AI Verify Foundation to design the v1.0 AILuminate benchmark. The v1.0 benchmark is initially available for use in English, with versions in French, Chinese and Hindi coming in early 2025.

Learn more about the AILuminate Benchmark.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI risk and reliability.

AI Safety in Practice: Two Key Takeaways from MLCommons and PAI Workshop https://mlcommons.org/2024/11/mlc-and-pai-workshop/ Thu, 21 Nov 2024 18:14:56 +0000 https://mlcommons.org/?p=1329 A joint exploration of collaborative approaches for responsible evaluation and governance of AI across the value chain

In October 2024, MLCommons and Partnership on AI (PAI) virtually convened practitioners and policy analysts to assess the current state of general-purpose AI evaluations and the challenges to their adoption. The workshop explored collaborative approaches to responsibly evaluating and governing AI across the value chain. Against a backdrop of unprecedented regulatory activity — including over 300 federal bills, 600 state-level bills, a U.S. AI executive order, and the EU’s new draft General-Purpose AI Code of Practice — the workshop provided an opportunity to examine how each actor in the AI ecosystem contributes to safe deployment and accountability. 

From foundation model providers to downstream actors, discussions emphasized that each actor has unique roles, obligations, and needs for guidance, highlighting the limitations of a “one-size-fits-all” approach. PAI’s recent work on distributed risk mitigation strategies for open foundation models, along with MLCommons’ development of safety benchmarks for AI, highlights the need for evaluation and governance practices tailored to each actor’s unique role. Together, these efforts blend normative guidance with technical evaluation, supporting a comprehensive approach to AI safety.

First Workshop Takeaway: Layered and Role-Specific Evaluation Approaches

The first takeaway is the importance of layered and role-specific evaluation approaches across the AI ecosystem. Workshop participants recognized that different actors—such as foundation model providers, model adapters who fine-tune these models, model hubs, and model integrators—play unique roles that require tailored evaluation strategies. For foundation model providers, this means comprehensive evaluation immediately after a model is initially trained, using both benchmarking and adversarial testing. For model adapters, post-training evaluation and adjustments during fine-tuning are essential to uncover new risks and ensure safety.

Participants emphasized proactive, layered approaches where early benchmarking and regular red-teaming work together to support safety, reliability, and compliance as AI technology advances. Benchmarking early establishes a foundation for continuous improvement, helping identify performance gaps and ensuring models meet necessary safety standards. Red-teaming, or adversarial testing, is critical to pair with benchmarks, as it helps uncover vulnerabilities not apparent through standard testing and stress-tests the model’s resilience to misuse. Frequent red-teaming enables developers to stay ahead of emerging risks, especially in the fast-evolving field of generative AI, and to address potential misuse cases proactively before widespread deployment. For model integrators, continuous testing and re-evaluation procedures were also recommended, particularly to manage the dynamic changes that occur as generative AI models are updated.

Second Workshop Takeaway: Adaptability in Evaluation Methods 

The second takeaway is the necessity of adaptable evaluation methods to ensure that safety assessments remain realistic and resistant to manipulation as AI models evolve. A critical part of adaptability is using high-quality test data sets and avoiding overfitting risks—where models become too tailored to specific test scenarios, leading them to perform well on those tests but potentially failing in new or real-world situations. Overfitting can result in evaluations that give a false sense of security, as the model may appear safe but lacks robustness.

To address this, participants discussed the importance of keeping elements of the evaluation test-set “held back” privately, to ensure that model developers cannot over-train to a public test-set. By using private test-sets as a component of the evaluation process, evaluations can better simulate real-world uses and identify vulnerabilities that traditional static benchmarks and leaderboards might miss.
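To make the idea concrete, the following is a minimal, hypothetical sketch of how a benchmark runner might compare a model's safe-response rate on the public prompt set against a privately held-back set; a large positive gap suggests the model may have been tuned to the public prompts. The run_model and judge_is_safe callables are illustrative assumptions, not part of any MLCommons or PAI tooling:

    # Hypothetical sketch: compare a model's safe-response rate on public vs.
    # privately held-back prompts. run_model and judge_is_safe stand in for a
    # real system-under-test and a real safety evaluator; both are assumptions.
    from typing import Callable, List

    def safe_response_rate(prompts: List[str],
                           run_model: Callable[[str], str],
                           judge_is_safe: Callable[[str, str], bool]) -> float:
        # Fraction of prompts whose responses the evaluator judges safe.
        verdicts = [judge_is_safe(p, run_model(p)) for p in prompts]
        return sum(verdicts) / len(verdicts)

    def overfitting_gap(public_prompts: List[str], private_prompts: List[str],
                        run_model, judge_is_safe) -> float:
        # A large positive gap (public score much higher than private score)
        # suggests the model may have been tuned to the released prompts.
        public = safe_response_rate(public_prompts, run_model, judge_is_safe)
        private = safe_response_rate(private_prompts, run_model, judge_is_safe)
        return public - private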

Workshop participants agreed that responsibility for implementing AI evaluations will need to be shared across the AI value chain. The recent U.S. AI Safety Institute guidance highlights the importance of involving all actors in managing misuse risks throughout the AI lifecycle. While the EU’s draft General-Purpose AI Code of Practice currently focuses on model providers, the broader regulatory landscape indicates a growing recognition of the need for shared responsibility across all stakeholders involved in AI development and deployment. Together, these perspectives underscore the importance of multi-stakeholder collaboration in ensuring that general-purpose AI systems are governed safely and responsibly.

Learn more about the MLCommons AI Risk and Reliability working group, which is building a benchmark to evaluate the safety of AI.

Learn more about Partnership on AI’s Open Foundation Model Value Chain.

The post AI Safety in Practice: Two Key Takeaways from MLCommons and PAI Workshop appeared first on MLCommons.

MLCommons AI Safety Working Group’s Rapid Progress to a v1.0 Release https://mlcommons.org/2024/07/mlc-ais-v1-0-progress/ Wed, 31 Jul 2024 16:38:28 +0000 http://local.mlcommons/2024/07/mlc-ais-v1-0-progress/ Building a comprehensive approach to measuring the safety of LLMs and beyond

We are witnessing the rapid evolution of AI technologies, alongside early efforts to incorporate them into products and services. In response, a wide range of stakeholders, including advocacy groups and policymakers, continue to raise concerns about how we can tell whether AI technologies are safe to deploy and use. To address these concerns and provide a higher level of transparency, the MLCommons® AI Safety (AIS) working group, convened last fall, has been working diligently to construct a neutral set of benchmarks for measuring the safety of AI systems, starting with large language models (LLMs). The working group reached a significant milestone in April with the release of a v0.5 AI Safety benchmark proof of concept (POC). Since then, it has continued to make rapid progress toward delivering an industry-standard benchmark v1.0 release later this fall. As part of MLCommons’ commitment to transparency, here is an update on our progress across several important areas.

Feedback from the v0.5 POC

The v0.5 POC was intended as a first “stake in the ground” to share with the community for feedback to inform the v1.0 release. The POC focused on a single domain: general-purpose AI chat models. It ran 14 different systems-under-test against a set of tests covering 7 hazards to validate the basic structure of a safety test and the components it must contain. The team is now hard at work incorporating feedback and extending and adapting that structure to a full taxonomy of hazards for inclusion in the v1.0 release this fall.
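As a rough illustration of that structure, the following hedged sketch (not the actual MLCommons harness) runs each system-under-test against a prompt set for each hazard and records a per-hazard safe-response rate; the names and types are assumptions for illustration only:

    # Hypothetical sketch of the POC structure: each system-under-test (SUT)
    # is run against a prompt set per hazard, yielding per-hazard scores.
    from typing import Callable, Dict, List

    def run_benchmark(suts: Dict[str, Callable[[str], str]],
                      hazard_prompts: Dict[str, List[str]],
                      judge_is_safe: Callable[[str, str], bool]) -> Dict[str, Dict[str, float]]:
        # Returns {sut_name: {hazard: safe_response_rate}}.
        results: Dict[str, Dict[str, float]] = {}
        for sut_name, sut in suts.items():
            results[sut_name] = {}
            for hazard, prompts in hazard_prompts.items():
                safe = sum(judge_is_safe(p, sut(p)) for p in prompts)
                results[sut_name][hazard] = safe / len(prompts)
        return results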

The POC also did its job in provoking thoughtful conversations around key elements of the benchmark, such as how transparent test prompts and evaluators should be within an adversarial ecosystem where technology providers and policymakers are attempting to keep users safe from threats – and from bad actors who could leverage openness to their advantage. In a sense, complete transparency for safety benchmarks is akin to telling students what questions will be on a test before they take it. To protect the integrity of the tests, the v1.0 release will include a degree of “hidden testing” in which a portion of the prompts remains undisclosed.

Moving forward

To ensure a comprehensive and agile approach for the benchmark, the working group has established eight workstreams that are working in parallel to build the v1.0 release and look beyond to future iterations.

  • Commercial beta users: providing user-focused product management to the benchmarks and platform, with an initial focus on 2nd-party testers including purchasers, system integrators, and deployers;
  • Evaluator models: creating a pipeline to tune models to automate a portion of the evaluation, and generating additional data in a continual learning loop;
  • Grading, scoring, and reporting: developing useful and usable systems for quantifying and reporting out statistically valid benchmark results for AI safety (a minimal, hypothetical grading sketch follows this list);
  • Hazards: identifying, describing, and prioritizing hazards and biases that could be incorporated into the benchmark;
  • Multimodal: investigating scenarios where text prompts generate images and/or image prompts generate text, aligning with specific harms, generating relevant prompts, and tracking the state of the art with partner organizations;
  • Prompts: creating test specifications, acquiring and curating a collection of prompts to AI systems that might generate unsafe results, and organizing and reporting on human verification of prompt quality;
  • Scope: defining the scope of each release, including hazard taxonomy definitions, use cases, personas, and languages for localization;
  • Test integrity and score reliability: ensuring that benchmark results are consistent and that the benchmark cannot be “gamed” by a bad actor, and assessing release options.
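To show how the evaluator-model and grading workstreams could fit together in practice, here is a small, hypothetical sketch (not the working group's actual pipeline) that aggregates an evaluator model's safe/unsafe verdicts into a benchmark score with a simple normal-approximation confidence interval; the aggregate_grades function and the 95% interval choice are assumptions for illustration:

    # Hypothetical sketch: aggregate an evaluator model's per-response verdicts
    # (True = judged safe) into a score with an approximate 95% confidence interval.
    import math
    from typing import Dict, List

    def aggregate_grades(verdicts: List[bool], z: float = 1.96) -> Dict[str, float]:
        n = len(verdicts)
        rate = sum(verdicts) / n
        half_width = z * math.sqrt(rate * (1 - rate) / n)
        return {
            "safe_response_rate": rate,
            "ci_low": max(0.0, rate - half_width),
            "ci_high": min(1.0, rate + half_width),
            "n": float(n),
        }

    # Example: 940 of 1,000 responses judged safe by the evaluator model.
    print(aggregate_grades([True] * 940 + [False] * 60))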

“I’m proud of the team that has assembled to work on the v1.0 release and beyond,” said Rebecca Weiss, Executive Director of MLCommons and AI Safety lead. “We have some of the best and brightest from industry, research, and academia collaborating to move each workstream forward. A broad community of diverse stakeholders, both individual contributors and their employers, came together to help solve this difficult problem, and the emerging results are exciting and visionary.” 

Ensuring Diverse Representation

The working group has completed the initial version of the benchmark’s taxonomy of hazards, which will be supported in four languages at launch: English, French, Simplified Chinese, and Hindi. In order to provide an AI Safety benchmark that tests the full range of the taxonomy and is broadly applicable, the Prompts workstream is sourcing written prompts through two mechanisms. First, it is contracting directly with a set of primary suppliers for prompt generation, to ensure complete coverage of the taxonomy of hazards. Second, it is publicly sourcing additional prompts through an Expression of Interest (EOI) to add a diverse set of deep expertise across different languages, geographies, disciplines, industries, and hazards. This work is ongoing to ensure continued responsiveness to emerging and changing hazards.

Global Partnerships

The team is building a network of global partnerships to ensure access to relevant expertise, data, and technology. In May, MLCommons signed a memorandum of intent (MOI) with the AI Verify Foundation, based in Singapore, to collaborate on developing a set of common safety testing benchmarks for generative AI models. The working group is also looking to incorporate evaluator models from multiple sources. This will enable the AI Safety benchmark to address a wide range of global hazards, incorporating regional and cultural nuances as well as domain-specific threats.

Over 50 academic, industry, and civil society organizations have provided essential contributions to the MLCommons AI Safety working group, with initial funding from Google, Intel, Microsoft, NVIDIA, and Qualcomm Technologies, Inc. to support the version 1.0 release of the AI Safety benchmark. 

Long-term goals

As the version 1.0 release approaches completion later this fall, the team is already planning for future releases. “This is a long-term process, and an organic, rapidly evolving ecosystem,” said Peter Mattson, MLCommons President and co-chair of the AI Safety working group. “Bad actors will adapt and hazards will continue to shift and change, so we in turn must keep investing to ensure that we have strong, effective, and responsive benchmark testing for the safety of AI systems. AI safety benchmarking is progressing quickly thanks to the contributions of the several dozen individuals and organizations who support our effort, but the AI industry is also moving fast and our challenge is to build a safety benchmark process that can keep pace with it. At the moment, we have organized the AI Safety effort to operate at the pace of a startup, given the rate of development of AI-deploying organizations. But as more AI companies mature into multi-billion dollar industries, our benchmarking efforts will also evolve towards a long-term sustainable model with equivalent maturity.”

“The AI Safety benchmark helps us to support policymakers around the world, many of whom are struggling to keep pace with a fast-paced, emerging industry, with rigorous technical measurement methodology,” said Weiss. “Benchmarks provide a common frame of reference, and they can ground discussions in facts and evidence. We want to make sure that the AI Safety benchmark not only accelerates technology and product improvements by providing careful measurements, but also helps to inspire thoughtful policies supported by the technical metrics needed to measure the behavior of complex AI systems.”

MLCommons welcomes additional participation to help shape the v1.0 AI Safety benchmark suite and beyond. To contribute, please join the MLCommons AI Safety working group.

The post MLCommons AI Safety Working Group’s Rapid Progress to a v1.0 Release appeared first on MLCommons.

MLCommons AI Safety prompt generation expression of interest. Submissions are now open! https://mlcommons.org/2024/06/ai-safety-prompt-generation-eoi/ Thu, 20 Jun 2024 22:44:44 +0000 http://local.mlcommons/2024/06/ai-safety-prompt-generation-eoi/ MLCommons is looking for prompt generation suppliers for its v1.0 AI Safety Benchmark Suite. This is a paid opportunity.

The MLCommons AI Safety working group has issued a public expression of interest (EOI) for prompt generation suppliers for its v1.0 AI Safety Benchmark Suite. A full description of the EOI can be found here. Qualified and interested organizations should express interest using the submission form. Note that the submission form does not require a formal proposal; rather, MLCommons will use it to determine whom to contact for exploratory calls. The submission deadline is July 19, 2024; early submissions are encouraged.

Deliverable options

Qualified organizations may express interest in one or both of the following deliverable options:

Deliverable Option 1 – Pilot Project

For organizations that wish to propose coverage of languages, hazards, or personas not listed in Deliverable Option 2. This option comes with a smaller budget than Deliverable Option 2.

Deliverable Option 2 – Full Coverage Project

For organizations that wish to fully cover MLCommons’ required hazards, languages, and personas. This option comes with a larger budget than Deliverable Option 1.

See the full EOI description for more information about the deliverable options.

Who should express interest?

Through the EOI, MLCommons hopes to engage with diverse organizations that have deep expertise in different geographies, languages, disciplines, industries, and hazards. MLCommons is committed to nurturing a mature and global AI safety ecosystem, and we are excited to connect with and learn from potential suppliers around the world. We hope to hear from you!

Join us

We welcome additional participation to help shape the v1.0 AI Safety benchmark suite and beyond. To contribute, please join the MLCommons AI Safety working group.

The post MLCommons AI Safety prompt generation expression of interest. Submissions are now open! appeared first on MLCommons.
