Croissant Archives - MLCommons
https://mlcommons.org/category/croissant/

Croissant Gains Momentum within the Data Community
https://mlcommons.org/2025/02/croissant-qa-community/ | Wed, 12 Feb 2025
Croissant working group co-chairs share insights on the success of Croissant, what’s next, and how to join the effort.

Since its introduction in March 2024, the Croissant metadata format for ML-ready datasets has quickly gained momentum within the Data community. It was recently featured at the NeurIPS conference in December 2024. Croissant is supported by major ML dataset repositories, such as Kaggle and HuggingFace. Over 700k datasets are accessible and searchable via Google Dataset Search. Croissant datasets can be loaded automatically to train ML models using popular toolkits such as JAX and PyTorch. 
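To give a sense of what that looks like in practice, here is a minimal sketch of loading a Croissant dataset with the open-source mlcroissant Python library; the dataset URL and record set name below are placeholders, and the exact API should be checked against the library's documentation.

    # A minimal sketch of loading a Croissant dataset, assuming the
    # mlcroissant library's Dataset/records API. The URL and record set
    # name are placeholders for illustration only.
    import mlcroissant as mlc

    # Point at any Croissant JSON-LD description (hosted or local).
    ds = mlc.Dataset(jsonld="https://example.org/my-dataset/croissant.json")

    # Iterate over the records of one of the dataset's record sets.
    for record in ds.records(record_set="default"):
        print(record)  # each record is a dict mapping field names to values
        break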

MLCommons’ Croissant working group co-chairs, Omar Benjelloun and Elena Simperl, sat down to share their thoughts on Croissant’s initial popularity, the approach the working group is taking to develop it, and next steps.

[MLC] Why do you think Croissant has become so popular?

[ES] We’re delighted to see it being discussed in many contexts, beyond AI and machine learning research. I think we’re solving a problem that is easy to recognize, a problem that many practitioners who work with machine learning have, but one whose solution requires a range of skills that traditionally haven’t been that important in that community. Those techniques have to do with how you structure and organize data – data used by machine learning systems or any other type of data – in a way that makes it easy for that data to be shared across applications, to be found and reused. That’s a topic that Omar and I and other people in the working group know a lot about. And with Croissant, we are applying these techniques, which have been around for 10-15 years, to a new use case: machine learning data work. And I suppose the magic happened when we brought together all the relevant expertise in one working group to deliver something that ticks lots of boxes and heals many pains.

[MLC] What was the approach you took to developing Croissant?

[OB] We took an approach that was very concrete. A lot of this metadata standards work happens in a way where some experts get in a working group and they create what they think is the perfect standard. And then they try to convince others to adopt it and use it in their products.

Instead, from the start we said, “Okay, let’s define the standards at the same time and in close collaboration with the tools and systems that are going to use it.” So we worked with the repositories of machine learning datasets, like Kaggle and Hugging Face and OpenML initially, to implement support for this format as we were defining it, as well as to participate in the definition of the format. That’s on the “supply side” of datasets. On the “demand side,” we worked with the teams that develop tools for loading datasets for machine learning to implement support for loading Croissant datasets. And they were also part of the definition of the format.

So getting these people to work together helped us to create something which was not, “Here’s a specification. We think it’s going to make things better. Now, please go and implement this so that we have some data on one side and some tools on the other side.” Instead we could say, “Here’s the specification and here are some datasets that are already available in this format because these repositories support it. And here are some tools that you can use to work with these datasets.” And of course, it’s just the beginning. There are more repositories and datasets that should be available in this format. And there are a lot more tools that should add support for Croissant so that ML developers have what they need to work with this data. But the initial release of Croissant was a strong start with a useful, concrete proposition. And I think this is part of what made it so successful.

[MLC] Do you consider Croissant a “standard” now?

[ES] Personally, I have mixed feelings about the word “standard.” Sometimes when I hear someone talking about standards, I think about those lengthy, tedious, and to some degree inaccessible processes that are run in standardization committees, where there is a lot of work that is put into coming to a proposal, with lots of different stakeholders, with many competing interests to agree upon. And there’s hundreds and hundreds and hundreds of pages explaining what someone needs to do to use the standard in the intended way. And then there’s some implementations. And then if you’re an academic researcher in a small lab or a startup who doesn’t have the resources to be part of this exercise, then you feel left out. 

Having said that, standards, once they are agreed, and if there is critical mass in adoption, can be very, very powerful, because they do reduce those frictions of doing business. So if I am a startup that is creating a new type of machine learning functionality, for example to manage workflows or to do some sort of fair machine learning or any specialist functionality that the community needs, then having a standard to be able to process data that has been created and processed using any other machine learning tool is hugely valuable. But I love what Omar said: we’re following a slightly different approach. We really want to release everything to the community. We want to build everything with that adoption pathway in mind so that we put the results in the hands of practitioners as quickly as possible and in the most accessible way possible.

The other thing that I would say is that even if a competing proposal emerges, the way the Croissant vocabulary – and the technology it is built with – is designed means that it would be quite easy to use both. There are many vocabularies that describe datasets. And something like Google Dataset Search can work with all of them because of the technologies that are used to do that. But of course, if you just look at the number of datasets that use this or any other machine-readable vocabulary that may share similar aims, at the moment we seem to have an advantage. And the interest from more and more platforms in adopting Croissant is still there, almost a year after the launch of the initial version.

[OB] We released the first version of the Croissant format, but I don’t consider it to be a standard yet. I prefer to say “community-driven specification”. And we’re not afraid to say Croissant will evolve and continue to change. Once we have a critical mass of data platforms and tools that have adopted it and use it and make their users happy thanks to Croissant, then it will feel sort of stable and we can say, “we don’t see this changing much from here.” At that point, declaring it a standard makes sense. And of course, all the companies, startups and university labs that build tools and participate in the community will have contributed to that standard through their work.

[MLC] How important is the support from industry?

[ES] Speaking as someone who is in academia, I think industry support is very important. I am personally very happy with the diversity of industry participants in the group. We have OpenML, which is an open source initiative that is driven, among others, by academics, but it is a project with lots of contributors. We have obviously Google, and we have participants from Meta as well, so household names in the AI space. Then we have Hugging Face and Kaggle, which have huge communities behind them. And we have NASA as well, which brings in a very interesting use case around geospatial datasets and their use in machine learning. Going forward, I’d love for the working group to attract more industries beyond tech that perhaps have their own machine learning use cases and their own enterprise data that they want to prepare for machine learning usage. And in that sort of scenario, if you think about a big company and the types of data management problems they have, having something like Croissant would be very valuable as well.

We have a good initial proposal for what Croissant can be, how it should be. But equally, we also have a very long list of things that we need to work on this year to get to that stable version that you could perhaps call a standard. And once you reach that, then you can actually make a case to any enterprise to explore this internally.

[MLC] What are your goals for expanding participation in the working group and in finding new partners?

[OB] First of all, this is an open group, and everyone is welcome! We would love to see more companies involved that develop tools that are used by machine learning practitioners. This speaks to the promise of Croissant: that once all datasets are described and edited in the same format, then if you have a tool that understands that format, that tool can work with any of those datasets. So if tools that provide functionality around loading of data for training ML models or fine tuning or evaluation of models, or data labeling or data analysis for machine learning, if these tools look at Croissant and consider supporting it, which does not require a huge amount of effort because it’s not a very complex format, then that will really make Croissant a lot more useful to machine learning practitioners. A goal for us this year is to get more tool vendors to come and join the Croissant effort and add support for it. Any vendor, no matter what size, that does data work, AI data work, whether that’s data augmentation, data collection, maybe responsible data collection, or data labeling or data debiasing, any type of data related activity that would make it more likely for a dataset to be ready to be used for machine learning, would add a lot of value to what we do right now. And it will also bring in new perspectives and new pathways to adoption. 

[ES] And would using Croissant be valuable for them? Absolutely. I think it was a researcher from Google who said, “No one wants to do the data work in AI,” a few years ago, and it still rings true. The vendors of data-related products solve some data challenges, but at the same time they also struggle with the fact that many people still think, “I’m just going to find the data somehow, and I’m just going to try to get to training my model as quickly as possible.” Croissant and these vendors are saying the same thing: data is important, we need to pay attention to it. We have very similar aims, and we’re providing these developers with a data portability solution. We’re helping them to interoperate with a range of other tools in the ecosystem. And that’s an advantage, especially when you’re starting out. 

[MLC] Have there been any surprises for you in how the community has been using Croissant?

[ES] We see a lot of excitement around use cases that we hadn’t really thought about so much. For instance, anything related to responsible AI, including AI safety, that’s a huge area that is looking for a whole range of solutions, some of them technical, some of them socio-technical. We’ve seen a lot of interest in using Croissant for many scenarios where there is a need to operationalize responsible AI practices. And to some degree, the challenge that we’re facing is to decide which of those use cases are really part of the core roadmap, and which ones are best served by communities that have that sectoral expertise.

We’ve also seen a lot of interest from the open science community. They’ve been working on practices and solutions for scientists to share not just their data, but their workflows, in other words, the protocols, experiment designs and other ways to do science. Scientific datasets are now increasingly used with machine learning, so we’ve had lots of interest, and there are conversations on how to take advantage of those decades of experience that they bring in.

[MLC] How was Croissant received at the NeurIPS 2024 conference?

[OB] We presented a poster at NeurIPS that opened a lot of conversations. We also conducted what turned out to be perhaps the most interesting community engagement activity of the year for us. We asked authors in the conference’s Datasets and Benchmarks track to submit their datasets with a Croissant metadata description. It was not a strict requirement; it was more of a recommended thing for them to do. I think this was great for Croissant as a budding metadata format to get that exposure and real-life testing. We learned a lot from that experience about the details in our spec that were not clear or things that were not explained as well as they should have been. And we also got real-world feedback on the tools that we built, including an initial metadata editor with which users can create the Croissant metadata for their datasets. It works in some cases, but we also knew that it was not feature-complete. So, users ran into issues using it. For instance, researchers often create bespoke datasets for their own research problems. And they asked us, “I have my custom binary format that I created for this; how do I write Croissant for it? It’s not CSV or images or text.” And so it pushed us into facing some of Croissant’s limits. This experience also taught us the importance of good creation tools for users to prepare their datasets, because nobody wants to write by hand a JSON file that describes their dataset. I think for us it’s a strong incentive, as a community, to invest in building good editing tools that can also connect to platforms like Hugging Face or Kaggle so that users can publish their datasets on those platforms. So this is something that we learned from this interaction – and at the next conference, the experience will be much smoother for the authors who contribute to this.

[MLC] What does the roadmap for Croissant look like moving forward?

[ES] In general, we are working on extending the metadata format so that we can capture more information. There are a few features that we’ve heard over and over again would be very important. For instance, anything related to the provenance of the data. We’re also working on tools, as Omar mentioned, and that includes technical tools like editors, but also tutorials and accessible material. 

In terms of what I’d like to see personally, as I was saying earlier, we’d love to have more contributors who are working in companies or on some sort of data work. For instance, data labeling. And a topic that we haven’t really touched on is the quality of that metadata. Croissant can solve many problems, but that assumes that the metadata is going to be of good quality, detailed, complete. That will almost certainly require some manual effort: from people who publish the data and want to describe it, maybe even from people who use the data and have gained experience with it. So, we’d love to have more datasets featuring excellent Croissant metadata. That will require a campaign to involve many users of, say, the 100 most popular machine learning datasets to come together and join efforts and create those excellent Croissant records.

[OB] We also have a new effort to describe not just datasets, but also machine learning tasks that use those datasets, so that we go to the next level of interoperability and describe things like machine learning evaluation benchmarks, or even competitions and things like that, and how they relate to datasets. That’s also a very exciting effort that is happening in the working group.

[ES] It’s quite an ambitious agenda – we’ll be busy this year and we invite others to join the working group to continue to shape the future of Croissant!

We invite others to join the Croissant Working Group, contribute to the GitHub repository, and stay informed about the latest updates. You can download the Croissant Editor to implement the Croissant vocabulary on your existing datasets today! Together, we can reduce the data development burden and enable a richer ecosystem of AI and ML research and development.

Bringing Open Source Principles to AI Data through Croissant
https://mlcommons.org/2024/12/croissant-medium-post/ | Mon, 16 Dec 2024
Republished from the ODI research team Medium ("Canvas") post, authored by Thomas Carey Wilson, November 27, 2024.

The Croissant metadata format for machine learning (ML) datasets has gained significant popularity within the open-source AI community by offering a powerful way to standardize ML datasets and make them more accessible. Developed by the MLCommons Croissant working group, it serves as a community-driven metadata standard that simplifies how datasets are used, making them more understandable and shareable. This surge in adoption underscores a growing awareness of the critical role transparency and collaboration play in advancing AI development.

In line with the Open Source AI Definition, Croissant helps ensure that datasets are FAIR (findable, accessible, interoperable, and reusable) and advocates for responsible data management practices. As an increasing number of organizations and individuals embrace Croissant, cultivating innovation and collaboration across platforms becomes essential.

Thomas Carey Wilson’s recent Medium post, “Bringing Open Source Principles to AI Data through Croissant,” shares why embracing Croissant is a decisive step toward an open-source AI future where transparency and collective advancement are the norms. We are republishing the Medium post here for the community. To get involved in the Croissant working group, join it here.

Republished from the ODI research team Medium account (“Canvas”), November 27, 2024, authored by Thomas Carey Wilson.

Bringing Open Source Principles to AI Data through Croissant

Introduction

Artificial Intelligence (AI) is progressively influencing various sectors, including industries and governmental operations. Despite some risks, its potential can be unlocked by greater, more responsible openness – enabling freedom, transparency, and collaboration. The Open Source Initiative’s (OSI) Open Source AI Definition seeks to bring these core principles to AI systems. To our delight, central to this mission is the availability and standardisation of AI data, in addition to models and weights. This is where MLCommons’ Croissant comes into play. Croissant is a community-driven metadata standard that simplifies how datasets are used in machine learning (ML), making them more accessible, understandable, and shareable.

Understanding the Open Source AI Definition

The Open Source AI Definition extends the foundational freedoms of open source software to AI systems. It outlines four essential freedoms:

  • Use: the right to use the AI system for any purpose without seeking permission.
  • Study: examining the system’s workings by accessing its components.
  • Modify: the freedom to alter the system to change its outputs or behaviour.
  • Share: the right to distribute the system to others, with or without modifications, for any purpose.

These freedoms apply not only to complete AI systems but also to their components, such as models, data, and parameters. The OSI explains that a critical enabler of these freedoms is ensuring access to the “preferred form to make modifications”: the essential resources required for others to understand and alter certain aspects of the AI system.

This includes providing detailed information about the data used to train the system (“data information”), the complete source code for training and running the system (“code”), and the model parameters like weights or configuration settings (“parameters”). Among these, data information is particularly crucial because it empowers users to study and modify AI systems by understanding the data foundation upon which they are built, a core area where Croissant makes a significant contribution.

We value the focus on data transparency but suggest, as highlighted in our data for AI taxonomy, specifying what measures enabling transparency could look like for different data types in AI systems. This can support more equitable access, informed debate, and the implementation of responsible practices across the AI ecosystem.

Croissant’s alignment with the Open Source AI Definition

Croissant, as an open-source tool, ensures that ML datasets are FAIR (findable, accessible, interoperable, and reusable), aligning with open science and open data practices. It embodies the principles of the Open Source AI Definition by providing a standardized, machine-readable format for ML dataset metadata. Here’s how Croissant supports each of the core freedoms and data information requirements:

Use

To be useful, core Croissant metadata should be openly available for all AI datasets, even if the datasets themselves cannot be made public. For an open-source AI model based on open data, Croissant can enable free use of AI datasets by providing detailed metadata that makes datasets easier to find and utilize. Once the required dataset(s) are located, it allows practitioners to quickly understand data collection methods, structures, and processing steps by standardizing how datasets are described. This ease of access and clarity empowers users to employ datasets for any purpose without needing permission or facing unnecessary barriers.

Moreover, Croissant’s design ensures interoperability by defining a common set of attributes and ensuring tools understand what those attributes mean. The use of widely recognized vocabularies like schema.org allows integration with existing data and metadata management tools, such as data crawlers, search engines, data catalogues, and metadata assurance systems. This aligns with the Open Source AI Definition’s emphasis on freedom of use by enabling seamless integration across various ML frameworks like PyTorch, TensorFlow, and JAX.

Study

Studying an AI system requires access to detailed information about its components. Croissant facilitates this by offering rich metadata that supports a range of responsible AI use cases, including the structured discovery of datasets, representation of human contributions, and analysis of data diversity and bias. Specifically, it provides:

  • Dataset-level information: Includes descriptions, licenses, creators, and versioning.
  • Resources: Details about files and file sets within the dataset.
  • RecordSets: Structures that describe how data is organized, including fields and data types.

For example, Croissant enables researchers to analyze the demographic characteristics of annotators and contributors to assess dataset diversity or potential biases. For this purpose and others, the inclusion of data types, hierarchical record sets, and fields supports machine interoperability, allowing tools to interpret and integrate datasets effectively.
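To make those three layers concrete, here is an abridged sketch of a Croissant description, written as a Python dictionary that serializes to JSON-LD. The property names follow our reading of the Croissant 1.0 specification; the names, URLs, and checksum are placeholders, and the JSON-LD @context is omitted for brevity.

    # An abridged, illustrative Croissant description showing dataset-level
    # information, resources (FileObjects), and a RecordSet with typed fields.
    # Names, URLs, and the checksum are placeholders; consult the Croissant
    # 1.0 spec for the full vocabulary (the @context is omitted here).
    import json

    croissant_metadata = {
        "@type": "sc:Dataset",
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        "name": "example_dataset",
        "description": "A toy dataset used to illustrate Croissant metadata.",
        "license": "https://creativecommons.org/licenses/by/4.0/",
        "version": "1.0.0",
        # Resources: the files that make up the dataset.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "data.csv",
                "name": "data.csv",
                "contentUrl": "https://example.org/data.csv",
                "encodingFormat": "text/csv",
                "sha256": "<checksum goes here>",
            }
        ],
        # RecordSets: how the data inside those files is organized.
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "name": "examples",
                "field": [
                    {
                        "@type": "cr:Field",
                        "name": "text",
                        "dataType": "sc:Text",
                        "source": {"fileObject": {"@id": "data.csv"},
                                   "extract": {"column": "text"}},
                    },
                    {
                        "@type": "cr:Field",
                        "name": "label",
                        "dataType": "sc:Integer",
                        "source": {"fileObject": {"@id": "data.csv"},
                                   "extract": {"column": "label"}},
                    },
                ],
            }
        ],
    }

    print(json.dumps(croissant_metadata, indent=2))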

Modify

Croissant’s open and extensible format encourages modification and adaptation. Users can:

  • Extend metadata: Croissant is designed to be modular, allowing for extensions like the Geo-Croissant extension, which captures information specific to geospatial data.
  • Adapt datasets: the standardization of metadata means that datasets can be modified more easily, whether by adding new data, altering existing records, or adjusting data structures.
  • Integrate with tools: Croissant metadata can be edited through a visual editor or a Python library, making it accessible for developers to modify datasets programmatically.
  • Enable interoperability across repositories: Croissant acts as a glue between platforms like Kaggle and Hugging Face, providing adapters that enable seamless dataset interchange while retaining provenance and versioning information for compatibility across frameworks.

For example, if a practitioner wants to add new fields to a dataset or alter the way data is processed, Croissant’s standardized format makes these modifications straightforward. The ability to represent complex data types and transformations within the metadata ensures that changes are accurately captured and can be shared with others.
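As a rough sketch of what such a modification can look like programmatically, the snippet below loads an existing Croissant JSON-LD file with plain Python, appends a hypothetical new field to its first record set, and writes the result back. The file path and field definition are placeholders, and the edited file should be validated with the Croissant tooling afterwards.

    # A hedged sketch of modifying Croissant metadata programmatically:
    # load the JSON-LD, append a hypothetical field, and save it again.
    # The file path and field definition are placeholders.
    import json

    with open("croissant.json", "r", encoding="utf-8") as f:
        metadata = json.load(f)

    new_field = {
        "@type": "cr:Field",
        "name": "annotator_id",
        "dataType": "sc:Text",
        "source": {"fileObject": {"@id": "data.csv"},
                   "extract": {"column": "annotator_id"}},
    }

    # Append the new field to the first record set, then persist the change.
    metadata["recordSet"][0]["field"].append(new_field)

    with open("croissant.json", "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2)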

Share

Sharing is at the heart of open source, and Croissant facilitates this by making datasets and their metadata easily distributable. The use of JSON-LD and alignment with schema.org means that metadata can be embedded in web pages, enabling search engines to index and discover datasets. Croissant also supports:

  • Portability: Croissant datasets can be seamlessly loaded into different ML frameworks, promoting sharing across platforms.
  • Collaboration: Croissant fosters collaboration among practitioners, researchers, and organizations by providing a common language for dataset metadata.

An illustrative example is how Croissant enables dataset repositories to infer metadata from existing documentation like data cards, making it easier to publish and share datasets with metadata attached.
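As a small sketch of the web-embedding pathway mentioned above, the snippet below wraps a Croissant description in the standard JSON-LD script tag that dataset search engines crawl; the metadata dictionary here is a trivial placeholder.

    # Embedding Croissant metadata in a dataset web page. JSON-LD is
    # conventionally placed in a <script type="application/ld+json"> tag,
    # which search engines can crawl and index. The metadata is a placeholder.
    import json

    croissant_metadata = {"@type": "sc:Dataset", "name": "example_dataset"}

    snippet = (
        '<script type="application/ld+json">\n'
        + json.dumps(croissant_metadata, indent=2)
        + "\n</script>"
    )

    # Paste `snippet` into the dataset's landing page HTML.
    print(snippet)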

Data information compliance

As for the practical requirements which enable these freedoms, the Open Source AI Definition places significant emphasis on data information, requiring:

  1. A complete description of all data used for training: Croissant provides a detailed schema for dataset-level information, including descriptions, licenses, creators, and data versions. It also specifies how to represent resources like files and file sets, ensuring that all data used in training is thoroughly documented.
  2. Listing of publicly available training data: through the distribution property and detailed resource descriptions, Croissant makes it clear where data is stored and how it can be accessed. The use of URLs and content descriptions aligns with the requirement to list publicly available data and where to obtain it.
  3. Listing of training data obtainable from third parties: Croissant’s metadata can include references to external data sources, including third-party datasets. Specifying licenses and access URLs ensures that users know how to obtain additional data, even if it requires a fee.

Moreover, Croissant’s support for versioning and checksums addresses the need for transparency in data changes over time. Croissant ensures that users can verify the integrity of data and understand its provenance by documenting dataset versions and providing file checksums.
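For instance, a dataset publisher could compute a checksum like this and record it on the corresponding FileObject entry; the file name is a placeholder, and the sha256 property name follows our reading of the Croissant specification.

    # A minimal sketch of computing a file checksum for a Croissant
    # FileObject entry. The file name is a placeholder.
    import hashlib

    def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    file_object = {
        "@type": "cr:FileObject",
        "@id": "data.csv",
        "contentUrl": "https://example.org/data.csv",
        "sha256": sha256_of("data.csv"),  # lets users verify their local copy
    }
    print(file_object["sha256"])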

Conclusion

Croissant epitomizes the core principles of the Open Source AI Definition by enabling AI datasets to be fully open, accessible, and adaptable. Through its standardized format, Croissant empowers everyone to use, study, modify, and share AI data freely, driving innovation, fostering collaboration, and building trust in AI systems. While details on how the Open Source AI Definition will work in practice are still emerging, embracing Croissant is a decisive step toward an open-source AI future where transparency and collective advancement are the norms.

New Croissant Metadata Format helps Standardize ML Datasets
https://mlcommons.org/2024/03/croissant_metadata_announce/ | Wed, 06 Mar 2024
Support from Hugging Face, Google Dataset Search, Kaggle, OpenML, and TFDS makes datasets easily discoverable and usable.

Today, MLCommons is announcing the release of Croissant, a metadata format to help standardize machine learning datasets. The aim of Croissant is to make datasets easily discoverable and usable across tools and platforms. Today’s release includes the format documentation, an open source library, and a visual editor, with industry support from Hugging Face, Google Dataset Search, Kaggle, and OpenML, amongst others.

Data is at the core of every AI and ML model. However, there is currently no standardized method of organizing and arranging the data and files that make up each dataset. As a result, finding, understanding, and using ML datasets can be tedious and time-consuming.

One of the goals of Croissant is to make data more easily accessible and discoverable. The Croissant vocabulary is an extension of schema.org, a machine-readable standard for describing structured data that is used by over 40M datasets on the Web; this allows Croissant datasets to be discovered through dataset search engines such as Google Dataset Search.

Croissant is easy to adopt because it doesn’t require changing the data itself or how it is represented. Instead, it adds a layer of metadata that represents the contents of the dataset in a standardized way, describing key attributes and properties.

Croissant enables datasets to be loaded into different ML platforms without the need for reformatting. Popular ML frameworks like TensorFlow, JAX, and PyTorch can already load Croissant datasets via the TensorFlow Datasets library. Additionally, because Croissant provides operationalized documentation, users can easily understand the best practices for contributing to and utilizing the data.
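As one rough illustration of this kind of framework integration (a generic sketch, not the TensorFlow Datasets path itself), the snippet below feeds Croissant records into a PyTorch DataLoader. It assumes the mlcroissant library's Dataset/records API, and the dataset URL and record set name are placeholders.

    # A hedged sketch of feeding Croissant records into PyTorch. The JSON-LD
    # URL and record set name are placeholders; a real pipeline would convert
    # field values to tensors before training.
    import mlcroissant as mlc
    from torch.utils.data import DataLoader

    ds = mlc.Dataset(jsonld="https://example.org/my-dataset/croissant.json")
    records = list(ds.records(record_set="default"))  # materialize records as dicts

    # A pass-through collate function keeps each batch as a list of dicts.
    loader = DataLoader(records, batch_size=32, collate_fn=lambda batch: batch)

    for batch in loader:
        print(len(batch), sorted(batch[0].keys()))
        break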

Users looking to publish a dataset in the Croissant format benefit from the Croissant editor, which allows them to easily inspect, create, or modify Croissant descriptions for their dataset.

“Data is a critical element of any model’s performance, and some experts suggest it will run out, making the need to harness it even more important,” said Elena Simperl, professor of Computer Science at King’s College London and a Croissant working group co-chair. “Croissant allows more people to do more with data. As co-chair of the working group, it is a privilege to collaborate with world-class machine learning scientists and engineers around the globe making an enormous contribution to the AI data ecosystem.”

In addition to the Croissant 1.0 specification and the release of the Croissant open source library, we are announcing support from major ML repositories, including Kaggle, Hugging Face, and OpenML. You can also find public datasets using the Croissant format through Google Dataset Search.

“The development of Croissant was grounded in the needs of ML practitioners, and the technical requirements of ML tools, platforms, and datasets. Our goal with Croissant is to unlock real value for users by enabling the tools they use to work seamlessly together, while keeping the format as simple and intuitive as possible.” said Omar Benjelloun, software engineer at Google and Croissant working group co-chair.

We encourage dataset creators to provide Croissant descriptions, and dataset hosts to provide Croissant dataset files for download, as well as embed Croissant metadata into their corresponding dataset pages. Tool creators should include Croissant dataset support in data analysis and labeling tools to make datasets easier to find and work with. 

Croissant is made possible thanks to efforts by the MLCommons Croissant working group, which includes contributors from these organizations: Bayer, cTuning Foundation, DANS-KNAW, Dotphoton, Google, Harvard, Hugging Face, Kaggle, King’s College London – Open Data Institute, Meta, NASA, NASA IMPACT – UAH, North Carolina State University, Open University of Catalonia – Luxembourg Institute of Science and Technology, Sage Bionetworks, and TU Eindhoven.

We invite others to join the Croissant Working Group, contribute to the GitHub repository, and stay informed about the latest updates. You can download the Croissant Editor to start implementing the Croissant vocabulary on your existing datasets today! Together, we can reduce the data development burden and enable a richer ecosystem of AI and ML research and development.
