MLPerf Training Archives - MLCommons

New MLCommons MLPerf Training v5.0 Benchmark Results Reflect Rapid Growth and Evolution of the Field of AI

MLCommons — Wed, 04 Jun 2025 14:56:30 +0000

Today, MLCommons^®announced new results for the MLPerf^® Training v5.0 benchmark suite, highlighting the rapid growth and evolution of the field of AI. This round of benchmark results includes a record number of total submissions, as well as increased submissions for most benchmarks in the suite compared to the v4.1 benchmark.

MLPerf Training v5.0 introduces new Llama 3.1 405B benchmark

The MLPerf Training benchmark suite comprises full system tests that stress models, software, and hardware for a range of machine learning (ML) applications. The open-source and peer-reviewed benchmark suite provides a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry.

Version 5.0 introduces a new large language model pretraining benchmark based on the Llama 3.1 405B generative AI system, which is the largest model to be introduced in the training benchmark suite. It replaces the gpt3-based benchmark included in previous versions of the MLPerf Training benchmark suite. An MLPerf Training task force selected the new benchmark because it is a competitive model representative of the current state-of-the-art LLMs, including recent algorithmic updates and training on more tokens. More information on the new benchmark can be found here. Despite just being introduced, the Llama 3.1 405B benchmark is already receiving more submissions than the gpt3-based predecessor saw in previous rounds – demonstrating the popularity and importance of large-scale training.

Rapid performance improvements for key training scenarios

The MLPerf Training working group regularly adds emerging training workloads to the benchmark suite to ensure that it reflects industry trends. The Training 5.0 benchmark results show notable performance improvements for newer benchmarks, indicating that the industry is prioritizing emerging training workloads over older ones. The Stable Diffusion benchmark saw a 2.28x speed increase for 8-processor systems compared to the 4.1 version six months ago, and the Llama 2.0 70B LoRA benchmark increased its speed 2.10x versus version 4.1; both outpacing historical expectations for computing performance improvements over time as per Moore’s Law. Older benchmarks in the suite saw more modest performance improvements.

On multi-node, 64-processor systems, the RetinaNet benchmark saw a 1.43x speedup compared to the prior v3.1 benchmark round (the most recent to include comparable scale systems), while the Stable Diffusion benchmark had a dramatic 3.68x increase.

“This is the sign of a robust technology innovation cycle and co-design: AI takes advantage of new systems, but the systems are also evolving to support high-priority scenarios,” said Shriya Rishab, MLPerf Training working group co-chair.

Increasing diversity of processors, increasing scale of systems, broadening ecosystem

Submissions to MLPerf Training 5.0 utilized 12 unique processors, all in the available (production) category. Five of the processors have become publicly available since the last version of the benchmark suite.

AMD Instinct MI300X 192GB HBM3
AMD Instinct MI325X 256GB HBM3e
NVIDIA Blackwell GPU (GB200)
NVIDIA Blackwell GPU (B200-SXM-180GB)
TPU-trillium

Submissions also included three new processor families:

5th Generation AMD Epyc Processor (“Turin”)
Intel Xeon 6 Processor (“Granite Rapids”)
Neoverse V2 as part of NVIDIA GB200

In addition, the number of multi-node systems submitted increased more than 1.8x when compared to version 4.1.

“The picture is clear: AI workloads are scaling up, systems are scaling up to run them, and hardware innovation continues to boost performance for key scenarios,” said Hiwot Kassa, MLPerf Training working group co-chair. “In-house large scale systems were built by few companies, but the increased proliferation – and competition – in AI-optimized systems is enabling the broader community to scale up their own infrastructure. Most notably, we see an increasing cadre of cloud service providers offering access to large-scale systems, democratizing access to training large models.

“The industry is not standing still, and neither can we. MLCommons is committed to continuing to evolve our benchmark suite so that we can capture and report on the innovation that is happening in the field of AI.”

Record industry participation

The MLPerf Training v5.0 round includes 201 performance results from 20 submitting organizations: AMD, ASUSTeK, Cisco Systems Inc., CoreWeave, Dell Technologies, GigaComputing, Google Cloud, Hewlett Packard Enterprise, IBM, Krai, Lambda, Lenovo, MangoBoost, Nebius, NVIDIA, Oracle, Quanta Cloud Technology, SCITIX, Supermicro, and TinyCorp.

“We would especially like to welcome first-time MLPerf Training submitters AMD, IBM, MangoBoost, Nebius, and SCITIX,” said David Kanter, Head of MLPerf at MLCommons. ”I would also like to highlight Lenovo’s first set of power benchmark submissions in this round – energy efficiency in AI training systems is an increasingly critical issue in need of accurate measurement.”

MLPerf Training v5.0 set a new high-water mark for the >200 submissions. The vast majority of the individual benchmark tests that carried over from the previous round saw an increase in submissions.

Robust participation by a broad set of industry stakeholders strengthens the AI/ML ecosystem as a whole and helps to ensure that the benchmark is serving the community’s needs. We invite submitters and other stakeholders to join the MLPerf Training working group and help us continue to evolve the benchmark.

View the results

Please visit the Training benchmark page to view the full results for MLPerf Training v5.0 and find additional information about the benchmarks.

About ML Commons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies’ accuracy, safety, speed, and efficiency.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or email participation@mlcommons.org.

Press Inquiries: contact press@mlcommons.org

The post New MLCommons MLPerf Training v5.0 Benchmark Results Reflect Rapid Growth and Evolution of the Field of AI appeared first on MLCommons.

MLCommons MLPerf Training Expands with Llama 3.1 405B

lori@mlcommons.org — Mon, 05 May 2025 15:43:00 +0000

About MLPerf Training

Pretraining Large Language Models (LLMs) is the initial phase of LLM training where the model learns to understand and generate human-like text by training the model on vast amounts of unstructured textual data. A pre-trained model can then be fine-tuned on domain-specific datasets, which gives the model knowledge of specific tasks. Reasoning approaches have become popular recently, helping models better explain their thinking process. Because pre-training uses large amounts of data, it is generally the most compute-intensive phase of LLM training.

MLCommons^® added a GPT3 Pretraining benchmark to MLPerf^® Training in 2022 to help academics and industry experts optimize LLM pretraining. It has been a huge success, as we have seen over 3x speedups on the results of this benchmark for large systems (over 10,000 accelerators).

Over this same period, newer state-of-the-art models have demonstrated greater scale and many architectural advancements. For example, Google’s PaLM model, released in 2022, contains 540B parameters, three times as many as GPT3’s 175B parameter count. Meta’s Llama 2 and Llama 3 series, released in 2023, utilize new algorithms such as Root Mean Square Layer Normalization (RMSNorm, Zhang et al., https://arxiv.org/abs/1910.07467), Rotary Position Embedding (RoPE, Su et al., 2024), and Grouped Query Attention (GQA, Ainslie et al., 2023).

A MLPerf Training working group task force was created to investigate a July 2024 proposal to replace GPT3 with Llama 3.1 405B. Task force members evaluated the proposal and industry options to recommend the best new pretraining reference. They agreed that Llama 3.1 405B would best represent the current state-of-the-art pretraining and adopted it as a new benchmark. Details on the review and selection process are included in this blog.

Model selection

The most important factor the task force considered when constructing this new pretraining benchmark was the model architecture. As mentioned above, the current GPT3 architecture does not include some of the recent algorithm updates and therefore is not competitive with other state-of-the-art models such as Nemotron-4, GPT4.5 and Claude 3.7 Sonnet. However, Meta’s Llama 3.1 405B model, with an architecture and scale similar to these top-tier models, demonstrates on-par performance with these models across multiple quality benchmarks, thus positioning itself as a competitive model representative of the current state-of-the-art models.

Choosing a model with high community engagement was also an important consideration, and Llama 3.1 405B is the perfect candidate on that basis. With more than 300 million total downloads of all Llama versions to date, the Llama family of models is very widely adopted. The MLPerf community, for instance, has already included Llama 2 in the LoRA benchmark. Additionally, Meta’s released technical paper details the training process, model architecture, dataset and hyperparameters, which allowed us to easily create a high-quality benchmark based on their work.

One other requirement for the new MLPerf Training benchmark is that the model architecture must be publicly accessible so that all submitters can download it and reproduce the results on their end. This typically requires a permissive license. Thanks to Meta’s support, the current Llama 3.1 405B model is now available to all MLCommons members for benchmarking.

Dataset selection

To help submitters transition easily from GPT3 to Llama 3.1 405B, we use the same AllenAI C4-en dataset (Colossal Cleaned Common Crawl dataset) in both benchmarks. This dataset contains 365M English-language paragraphs scraped from the internet, cleaned and preprocessed to convenient JSON formats for training the model.

One key difference is that Llama 3.1 405B uses a context length of 8,192, which is 4x larger than the 2,048-token context of GPT3. The context length represents the amount of text that a model can process at once, thus a larger context length leads to better understanding.

Benchmark technical details

The reference code is implemented in the NVIDIA NeMo Framework, an open-sourced and scalable AI framework built for large language models. By offering easy-to-use modular components and pre-trained model recipes, NeMo enables developers to efficiently build performant models tailored to specific use cases. The reference code functionality is tested on NVIDIA H100 GPUs.

To ensure that the new benchmark model setup is easily understandable without complex knobs and hyperparameters scattered around the training script, it starts with NeMo’s integrated Llama 3.1 405B training recipe, which closely followed the original paper’s hyperparameters and introduced only the minimum necessary changes to our script, such as adjusting the optimizer’s learning rates.

When researching how to construct this benchmark, we first explored the GPT3-style setup. For the GPT3 pretraining benchmark, the task force pretrained from random weights for about 12.6T tokens to generate a stable customized checkpoint. Submitters were required to continue pretraining from this checkpoint until they reached a predetermined quality metric. This methodology was used to ensure that the GPT3 pretraining benchmark showcased a reproducible slice of the actual GPT3 pretraining process.

Applying the same methodology to the Llama 3.1 405B pretraining benchmark, the task force generated a customized checkpoint trained on 32B tokens. But experiment results showed that resuming from this checkpoint had more variation in the convergence curves than we could tolerate. This meant that resuming from a custom-trained checkpoint in the benchmark was not feasible. The interpretation that we were still in very early phases of training after the 32B tokens, meant that the checkpoint generated was still quite unstable.

So the taskforce decided to train from the fully trained and stable HuggingFace checkpoint by giving it some new information, which led to a much more stable convergence trend. The benchmark now begins from the checkpoint and stops when the evaluation log perplexity reaches our defined target.

One noticeable difference between the task force’s implementation and Meta’s implementation is that we replaced the original Tiktokenizer, which used a vocab size of 128k, with the 32k vocab size Mixtral 8x22b tokenizer. The motivation behind this change was that Meta’s checkpoint is already well-trained, so it will not learn anything new on the current C4 dataset. Replacing the tokenizer forces the model checkpoint to adapt to a new, different token distribution, therefore continuing to pretrain on the current dataset. This difference implies that, when we resume from the pre-trained checkpoint, only the first 32,000 rows of the 128,256-row word embedding layer weights will be loaded to the model.

As mentioned before, the Llama 3.1 405B pretraining benchmark is a replacement for the GPT3 pretraining benchmark, and the following components will be unchanged between the two:

Validation dataset: To reduce the cost of validation, we chose to create a customized subset of the full C4 validation dataset so that the evaluation time would not be significant compared to training time. Specifically, we found out that using only 5,760 sequences is sufficient to measure model convergence, and since only 91,205 validation samples are needed to yield the required 5,760 validation sequences, the first 91,205 unshuffled rows of the full C4 validation dataset were selected as the customized validation dataset for this benchmark.

Loss and target accuracy metric: As with GPT3, log perplexity is used, computed on the customized validation dataset to evaluate the model’s convergence behavior. This metric is not costly to compute and is widely adopted as a versatile indicator of evaluating LLMs.
Limit number of submission logs: We recognize that the Llama 3.1 405B pretraining benchmark is very expensive to run, so to ensure fairness while reducing submitters costs, each submitter is required to only run the benchmark 3 times, the same as the GPT3 pretraining benchmark. The median result is then reported.
Restriction on hyperparameter searching: Because running the benchmark is expensive, we believe it is reasonable to disallow hyperparameter searches and borrowing. Most hyperparameters have been given static values. For a very few others, such as the learning rate, formulas are provided to compute them based on the global batch size.

Conclusion

In this blog post, we have provided an overview of a new LLM pretraining benchmark based on Meta’s Llama 3.1 405B model with twice as many parameters and four times as many training tokens as GPT3. This benchmark is massively more computationally intensive and contains its own new challenges. It is the task force’s belief that this new benchmark will encourage more innovation in the area of LLM pretraining.

About MLCommons

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or email participation@mlcommons.org.

The post MLCommons MLPerf Training Expands with Llama 3.1 405B appeared first on MLCommons.

New MLPerf Training v4.1 Benchmarks Highlight Industry’s Focus on New Systems and Generative AI Applications

MLCommons — Wed, 13 Nov 2024 19:51:49 +0000

Today, MLCommons^®announced new results for the MLPerf^® Training v4.1 benchmark suite, including several preview category submissions using the next generation of accelerator hardware. The v4.1 round also saw increased participation in the benchmarks that represent generative AI model training, highlighting the strong alignment between the benchmark suite and the current direction of the AI industry.

New generations of hardware accelerators

MLPerf Training v4.1 includes preview category submissions using new hardware accelerators that will be generally available in the next round:

Google “Trillium” TPUv6 accelerator (preview)
NVIDIA “Blackwell” B200 accelerator (preview)

“As AI-targeted hardware rapidly advances, the value of the MLPerf Training benchmark becomes more important as an open, transparent forum for apples-to-apples comparisons,” said Hiwot Kassa, MLPerf Training working group co-chair. “Everyone benefits: vendors know where they stand versus their competitors, and their customers have better information as they procure AI training systems.”

MLPerf Training v4.1

The latest Training v4.1 results show a substantial shift in submissions for the three benchmarks that represent “generative AI” training workloads: GPT3, Stable Diffusion, and Llama 2 70B LoRA fine-tuning benchmark, with a 46% increase in submissions in total across those three.

The two newest benchmarks in the MLPerf Training suite, Llama 2 70B LoRA and Graph Neural Network (GNN), both had notably higher submission rates: a 16% increase for Llama 2, and a 55% increase for GNN. They both also saw significant performance improvements in the v4.1 round compared to v4.0 when they were first introduced, with a 1.26X speedup in the best training time for Llama 2 and a 1.23X speedup for GNN.

“The MLCommons Training working group regularly updates the benchmark to keep pace with industry trends, and the trend in submissions we are seeing validates that our open process is giving stakeholders the information they are looking for. In this round we see continued steady, incremental progress in improving AI training performance,” observed Shriya Rishab, MLPerf Training working group co-chair.

Continued Robust Industry Participation

The MLPerf Training v4.1 round includes 155 results from 17 submitting organizations: ASUSTeK, Azure, Cisco, Clemson University Research Computing and Data, Dell, FlexAI, Fujitsu, GigaComputing, Google, Krai, Lambda, Lenovo, NVIDIA, Oracle, Quanta Cloud Technology, Supermicro, and Tiny Corp.

“We would especially like to welcome first-time MLPerf Training submitters FlexAI and Lambda,” said David Kanter, Head of MLPerf at MLCommons. “We are also very excited to see Dell’s first MLPerf Training results that includes power measurement; as AI adoption skyrockets, it is critical to measure both the performance and the energy efficiency of AI training.”

Participation in the benchmarking process by a broad set of industry stakeholders, as well as by academic groups, strengthens the AI/ML ecosystem as a whole and helps to ensure that the benchmark is serving the community’s needs. We invite submitters and other stakeholders to join the MLPerf Training working group and help us continue to evolve the benchmark.

View the results

To view the full results for MLPerf Training v4.1 and find additional information about the benchmarks, please visit the Training benchmark page.

About ML Commons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI risk and reliability.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or contact participation@mlcommons.org.

The post New MLPerf Training v4.1 Benchmarks Highlight Industry’s Focus on New Systems and Generative AI Applications appeared first on MLCommons.

New MLPerf Training Benchmark Results Highlight Hardware and Software Innovations in AI Systems

MLCommons — Wed, 12 Jun 2024 14:55:00 +0000

Today, MLCommons^®announced new results for the MLPerf^® Training v4.0 benchmark suite, including first-time results for two benchmarks: LoRA fine-tuning of LLama 2 70B and GNN.

MLPerf Training v4.0

The MLPerf Training benchmark suite comprises full system tests that stress machine learning (ML) models, software, and hardware for a broad range of applications. The open-source and peer-reviewed benchmark suite provides a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry.

MLPerf Training v4.0 includes over 205 performance results from 17 submitting organizations: ASUSTeK, Dell, Fujitsu, Giga Computing, Google, HPE, Intel (Habana Labs), Juniper Networks, Lenovo, NVIDIA, NVIDIA + CoreWeave, Oracle, Quanta Cloud Technology, Red Hat + Supermicro, Supermicro, Sustainable Metal Cloud (SMC), and tiny corp.

MLCommons would like to especially welcome first-time MLPerf Training submitters Juniper Networks, Oracle, SMC, and tiny corp.

Congratulations to first-time participant SMC for submitting the first-ever set of power results for MLPerf Training. These results highlight the impact of SMC’s immersion cooling solutions for data center systems. Our industry-standard power measurement works with MLPerf Training and is the first and only method to accurately measure full system power draw and energy consumption for both cloud and on-premise systems in a trusted and consistent fashion. These metrics are critical for the entire community to understand and improve the overall efficiency for training ML models – which will ultimately reduce the energy use and improve the environmental impact of AI in the coming years.

The Training v4.0 results demonstrate broad industry participation and showcase substantial performance gains in ML systems and software. Compared to the last round of results six months ago, this round brings a 1.8X speed-up in training time for Stable Diffusion. Meanwhile, the best results in the RetinaNet and GPT3 tests are 1.2X and 1.13X faster, respectively, thanks to performance scaling at increased system sizes.

“I’m thrilled by the performance gains we are seeing especially for generative AI,” said David Kanter, executive director of MLCommons. “Together with our first power measurement results for MLPerf Training we are increasing capabilities and reducing the environmental footprint – making AI better for everyone.”

New LLM fine-tuning benchmark

The MLPerf Training v4.0 suite introduces a new benchmark to target fine-tuning a large language model (LLM). An LLM that has been pre-trained on a general corpus of text can be fine-tuned to improve its accuracy on specific tasks, and the computational costs of doing so can differ from pre-training.

A variety of approaches to fine-tuning an LLM at lower computational costs have been introduced over the past few years. The MLCommons Training working group evaluated several of these algorithms and ultimately selected LoRA as the basis for its new benchmark. First introduced in 2021, LoRA freezes original pre-trained parameters in a network layer and injects trainable rank decomposition matrices. Since LoRA fine-tuning trains only a small portion of the network parameters, this approach dramatically reduces the computational and memory demands compared to pre-training or supervised fine-tuning.

“Fine-tuning LLMs is a notable workload because AI practitioners across many organizations make use of this technology. LoRA was the optimal choice for a state-of-the-art fine-tuning technique; it significantly reduces trainable parameters while maintaining performance comparable to fully fine-tuned models,” said Hiwot Kassa, MLPerf Training working group co-chair.

The new LoRA benchmark uses the Llama 2 70B general LLM as its base. This model is fine-tuned with the Scrolls dataset of government documents with a goal of generating more accurate document summaries. Accuracy is measured using the ROUGE algorithm for evaluating the quality of document summaries. The model uses a context length of 8,192 tokens, keeping pace with the industry’s rapid evolution toward longer context lengths.

The LLM fine-tuning benchmark is already achieving widespread adoption, with over 30 submissions in its initial round.

Learn more about the selection of the LoRA fine-tuning algorithm for the MLPerf Training benchmark in this blog.

New GNN benchmark for classification in graphs

MLPerf Training v4.0 also introduces a graph neural network (GNN) benchmark for measuring the performance of ML systems on problems that are represented by large graph-structured data, such as those used to implement literary databases, drug discovery applications, fraud detection systems, social networks, and recommender systems.

“Training on large graph-structured datasets poses unique system challenges, demanding optimizations for sparse operations and inter-node communication. We hope the addition of a GNN based benchmark in MLPerf Training broadens the challenges offered by the suite and spurs software and hardware innovations for this critical class of workload,” said Ritika Borkar MLPerf Training working group co-chair.

The MLPerf Training GNN benchmark is used for a node classification task where the goal is to predict a label for each node in a graph. The benchmark uses an R-GAT model and is trained on the 2.2 terabyte IGBH full dataset, the largest available open-source graph dataset with 547 million nodes and 5.8 billion edges. The IGBH database is a graph showing the relationships between academic authors, papers, and institutes. Each node in the graph can be classified into one of 2,983 classes.

The MLPerf Training team recently submitted MLPerf R-Gat to the Illinois Graph Benchmark (IGB) leaderboard which helps the industry keep track of the state of the art for GNN models, encouraging reproducibility. We are pleased to announce that their submission is currently #1 with a 72% test accuracy.

Learn more about the selection of the GNN benchmark in this blog.

View the results

To view the full results for MLPerf Training v4.0 and find additional information about the benchmarks, please visit the Training benchmark page.

About MLCommons

For additional information on MLCommons and details on becoming a member or affiliate, please visit MLCommons.org or contact participation@mlcommons.org.

The post New MLPerf Training Benchmark Results Highlight Hardware and Software Innovations in AI Systems appeared first on MLCommons.

Introducing the MLPerf Training Benchmark for Graph Neural Networks

MLCommons — Wed, 12 Jun 2024 14:45:00 +0000

About MLPerf Training

The field of AI has been rapidly advancing over the past decade, and the MLPerf^® Training benchmark suite is constantly evolving to capture the zeitgeist. New benchmarks are added based on MLCommons^® member consensus and customer interests.

For MLPerf Training v4.0, we are pleased to introduce a graph neural network (GNN) training benchmark based on the Relational Graph Attention Network (R-GAT) model. The benchmark is a result of almost a year of close collaboration between teams from Alibaba, Intel, and NVIDIA.

About graph neural networks

Graph neural networks (GNNs) have emerged as an important class of models to employ with graph-structured data such as social graphs. GNNs are used in a wide range of areas such as recommendation systems, fraud detection, knowledge graph answering, and drug discovery, to name a few. From a computational perspective, sparse operations and message passing between nodes of the graph make GNNs stand apart from other benchmarks in the training suite and present new challenges for system optimization and scalability.

Dataset and model

Commercial applications of GNNs typically use very large graphs that do not fit into the memory of a single node. To reflect the multi-node nature of the workload, we sought to use the largest available public dataset. Our initial proposal was to use the MAG-240M dataset from Open Graph Benchmarks, consisting of ~240 million nodes and ~1.7 billion edges. With the release of the Illinois Graph Benchmark (IGB) dataset in March 2023, we settled upon using the heterogeneous version of the dataset (IGBH-Full) for the benchmark. For more details, please refer to the IGB publication and repo.

The IGBH dataset is a real-world citation graph consisting of ~547 million nodes and ~5.8 billion edges, with more than 40% labeled nodes. The dataset schema is illustrated in Figure 1. The benchmark’s task is to classify the paper nodes of the graph among 2,983 available topics. We also augment the dataset by adding reverse edges to improve accuracy.

Figure 1: IGB-Heterogeneous dataset schema (source).

While there are a variety of architectures for GNNs, we chose to use the R-GAT model due to its popularity. This model leverages the attention mechanism proposed in the transformer model to focus on data subsets that are of greatest importance. After a thorough exploration of model architectures, we arrived at a configuration that achieves 72% classification accuracy on the validation set.

Technical details

We use a R-GAT model with three layers with [5,10,15] fanout, hidden dimension of 512, and four attention heads. The model is trained from scratch on a predetermined training subset of the graph (60% of the nodes) using the Adam optimizer. The eval dataset is fixed to be a 1/40th fraction of the full validation dataset–about 0.8 million nodes–with accuracy evaluated 20 times per epoch. To obtain a representative result, submitters must run the benchmark 10 times with different random seeds. Results are then scored using Olympic averaging.

The reference code is implemented using Alibaba’s GraphLearn-for-Pytorch (GLT) framework. GLT is a library for large-scale GNN training in Pytorch which supports distributed CPU/GPU-based sampling and training and is fully compatible with PyG. The reference code is designed to run on both NVIDIA GPUs and Intel Xeon platforms.

As we developed the benchmark, we had a healthy debate on the rules governing the benchmark so that the typical industry use case is reflected. For multi-node runs, the graph must be partitioned so that features of a given graph node are read exclusively from one training node to ensure feature-fetching over the network. We also spent considerable effort understanding the sampling differences between PyG and DGL, the two popular GNN frameworks. While the two samplers are not mathematically equivalent, we allow submitters to use either version out of the box because the convergence differences between them were minimal in our experiments. We have also contributed a PyG-style sampler in DGL for community usage.

We recently submitted MLPerf R-GAT to the Illinois Graph Benchmark (IGB) leaderboard which helps the industry keep track of the state of the art for GNN models, encouraging reproducibility, and we are pleased to announce that our submission is currently #1 with a 72% test accuracy.

Conclusion

We have described the motivation and the process of creating the new MLPerf GNN training benchmark. This benchmark will allow for fair evaluation of various systems, frameworks, and optimization techniques today and is expected to influence the design of future AI systems. We are delighted by the community interest in this benchmark, as shown by the number of submissions to MLPerf Training v4.0, and hope to see this interest grow further.

Further details of the benchmark and the reference implementation can be found here.

The post Introducing the MLPerf Training Benchmark for Graph Neural Networks appeared first on MLCommons.

LoRA selected as the fine-tuning technique added to MLPerf Training v4.0

MLCommons — Wed, 12 Jun 2024 14:45:00 +0000

Introduction to MLPerf Training

Generative AI has captured the public’s attention and imagination; its uses can be widespread and revolutionary. Large language models (LLMs) can perform language-related tasks and power the advanced conversational abilities of services such as ChatGPT, Gemini, Perplexity AI, and many more.

The MLPerf Training benchmark suite seeks to provide a standard means for measuring performance in machine learning (ML) model training, so naturally, the MLPerf Training working group has been watching these developments closely and considering how to expand the benchmark suite to cover emerging training scenarios.

Training LLMs can be broadly classified into two phases: pre-training and fine-tuning.

Pre-training is the crucial initial phase where the neural network is trained on a large and diverse dataset for generic language understanding such as grammar, idioms, facts, and the subtleties of different contexts. The dataset for pre-training typically consists of trillions of words from books, articles, websites, and social media. This phase is akin to developing a general understanding of language as the model learns to predict the next token based on past context. The MLPerf GPT-3 benchmark captures the performance of the pre-training phase of the LLM development life-cycle.

Fine-tuning adapts a pre-trained model to specific tasks or domains by further training the model on a smaller, task-specific dataset to enhance its performance. This process boosts training efficiency by reducing computational intensity and enhances performance on specific tasks without starting training over from scratch.

Fine-tuning has been widely adopted by the industry because of its cost effectiveness and reduced infrastructure requirements which make it highly accessible. With this consideration in mind, the MLPerf Training working group formed a special task force in 2023 to explore and prioritize the inclusion of a fine-tuning benchmark in MLPerf Training.

Model selection

The task force evaluated numerous model candidates for inclusion, ranging from smaller models like Llama 2 7B and Mixtral 7B to mid-range models such as Falcon 40B, MPT 30B, and Llama 2 70B. After thorough evaluation and deliberation, the task force selected Llama 2 70B as the model that most aligned with its objectives.

Several factors contributed to this decision:

Model diversity: One of the goals of MLPerf Training is to curate a small set of models with diverse characteristics so that the overall suite represents the much larger universe of trained models. With the right mix of models, the benchmark results can be generally and broadly useful. The existing MLPerf Training suite includes two language models: BERT (340 million parameters) and GPT-3 (175 billion parameters). The addition of Llama 2 70B (70 billion parameters) adds another distinct model size along with some architectural diversity thanks to features such as group query attention (GQA), SwiGLU activation function, and rotary embeddings.

Community engagement: The task force considered Hugging Face contributions and leaderboards as solid indicators of community engagement and industry adoption. The Llama 2 70B model emerged as the frontrunner in terms of community engagement at the time of benchmark development. Since then, Meta has released Llama 3 70B, which was trained on a larger dataset with an improved tokenizer. However, the majority of the network layers are common across the two models, so per-iteration performance on Llama 2 will likely track closely to performance on Llama 3.

Licensing flexibility: For an open benchmark consortium, model licenses must be flexible enough to offer unlimited access to participating submitters at minimum and, ideally, to the press and public in general. This requirement typically reduces the candidate pool to models (and datasets) with permissive licenses such as Apache 2.0, MIT, or similar. Thanks to Meta’s support, MLCommons is making the Llama 2 family of models available to MLCommons members for the purpose of conducting benchmarking with the MLPerf benchmark suites.

Efficient benchmarking integration: Llama 2 70B offers easy deployment, simplifying benchmarking and integration for the task force. Also, model scales are growing exponentially, so large language models may not fit on a single accelerator anymore. To train on multiple accelerators, AI practitioners will have to employ methods like model parallelism to split the model efficiently across devices. Choosing a moderate-sized model like Llama 2 70B creates opportunities for submitters to showcase such software innovations.

Selecting a fine-tuning technique

Next, the task force investigated three different techniques for optimizing large language models for downstream tasks: supervised fine-tuning (SFT) which operates on training an entire model on curated datasets; reinforcement learning (RLHF) which utilizes human feedback to guide model training; and parameter-efficient fine tuning (PEFT) which updates only a subset of the model parameters.

PEFT stood out among these three approaches as the best fine-tuning technique for training because of its computational efficiency and architectural simplicity. Low-rank adaptation (LoRA) is a PEFT technique which freezes pre-trained model weights while injecting trainable low-rank decomposition matrices. We chose LoRA as the preferred fine-tuning technique for MLPerf Training due to its reduced parameter counts and memory footprint, efficient training process, lower barrier to entry, and popularity (which was likely driven by the prior factors.) Also, LoRA fine-tuning has different performance characteristics than existing MLPerf benchmarks.

Dataset selection

For this benchmark, the task force chose the SCROLLS (Standardized CompaRison Over Long Language Sequences) government report dataset. The dataset is made up of question-summary pairs based on reports written by government research agencies including Congressional Research Service and U.S. Government Accountability Office.

Context length is the amount of historical data the model uses to understand the context before making a forecast. Enabling LLMs to process and understand larger context lengths is crucial for capturing nuanced relationships and generating coherent and relevant text. The task force experimented with various context lengths and opted for the longest context that fit within a single, eight-accelerator system: 8192 tokens. That’s the equivalent of roughly 6,144 words and is significantly longer context than the existing language models in MLPerf Training, which have sequence lengths of 512 and 2048 for BERT and GPT3, respectively.

Benchmark technical details

The Llama 2 70B LoRA benchmark is based on the Hugging Face ecosystem and tools in order to represent a common developer experience. Hugging Face is one of the most widely used environments for fine-tuning due to the extensive and easy-to-use Transformers library, which provides researchers and developers with pre-trained models. Developers can adapt these models with user-friendly and well-documented APIs, enabling efficient and effective fine-tuning.

To ensure the representative nature of this benchmark, the task force followed the most common convention among Hugging Face users for fine-tuning LLMs by leveraging AutoModelForCausalLM. AutoModelForCausalLM is a versatile tool provided by Hugging Face designed for tasks involving causal language modeling. This model is primarily used for generating text, where the goal is to predict the next word in a sequence, given the preceding words.

The MLPerf Training v4.0 benchmark also includes:

Loss metric: The benchmark uses cross entropy of the next token prediction for the loss function (much like MLPerf GPT3).

Target accuracy metric: The standard metric for text summarization is the ROUGE score. Due to the computationally intensive and time-consuming nature of computing ROUGE scores, the typical approach involves analyzing the convergence graph of the evaluation loss. Once convergence is detected, ROUGE scores are computed to evaluate performance. Our testing of this method unveiled a strong correlation (R score of 0.9) between evaluation loss and ROUGE scores. Note that, unlike the training loss, the evaluation loss is calculated only over the target tokens.

Selected region of convergence: To ensure fairness, each submitter is required to run the benchmark 10 times. Results are then scored using Olympic averaging.

Conclusion

In this blog post, we shared the insights, motivation, and process behind the creation of the new MLPerf Llama 2 70B LoRA fine-tuning benchmark. This benchmark is less computationally intensive than GPT-3, so it should serve as an excellent starting point for working with language models and lower the barrier to entry for MLPerf participants. We are thrilled by the community’s interest in the benchmark, as evidenced by 30 submissions in MLPerf Training v4.0 for this benchmark, and we hope to see this enthusiasm continue to grow.

The post LoRA selected as the fine-tuning technique added to MLPerf Training v4.0 appeared first on MLCommons.

New MLPerf Training and HPC Benchmark Results Showcase 49X Performance Gains in 5 Years

MLCommons — Wed, 08 Nov 2023 16:55:00 +0000

Today, MLCommons® announced new results from two industry-standard MLPerf benchmark suites:

The MLPerf Training v3.1 suite, which measures the performance of training machine learning models.
The MLPerf HPC (High Performance Computing) v.3.0 benchmark suite, which is targeted at supercomputers and measures the performance of training machine learning models for scientific applications and data.

MLPerf Training v3.1
The MLPerf Training benchmark suite comprises full system tests that stress machine learning models, software, and hardware for a broad range of applications. The open-source and peer-reviewed benchmark suite provides a level playing field for competition that drives innovation, performance, and energy-efficiency for the entire industry.

MLPerf Training v3.1 includes over 200 performance results from 19 submitting organizations: Ailiverse, ASUSTek, Azure, Azure+NVIDIA, Clemson University Research Computing and Data, CTuning, Dell, Fujitsu, GigaComputing, Google, Intel+Habana Labs, Krai, Lenovo, NVIDIA, NVIDIA+CoreWeave, Quanta Cloud Technology, Supermicro, Supermicro+Red Hat, and xFusion. MLCommons would like to especially congratulate first-time MLPerf Training submitters Ailiverse, Clemson University Research Computing and Data, CTuning Foundation, and Red Hat.

The results demonstrate broad industry participation and highlight performance gains of up to 2.8X compared to just 5 months ago and 49X over the first results, reflecting the tremendous rate of innovation in systems for machine learning.

Significant to this round, is the largest system ever submitted to MLPerf Training. Comprising over 10K accelerators, it demonstrates the extraordinary progress by the machine learning community in scaling system size to advance the training of neural networks.

MLPerf Training v3.1 introduces the new Stable Diffusion generative AI benchmark model to the suite. Based on Stability AI’s Stable Diffusion v2 latent diffusion model, Stable Diffusion takes text prompts as inputs and generates photorealistic images as output. It is the core technology behind an emerging and exciting class of tools and applications such as Midjourney and Lensa.

“Adding Stable Diffusion to the benchmark suite is timely, given how image generation has exploded in popularity,” said Eric Han, MLPerf Training co-chair. “This is a critical new area – extending Generative AI to the visual domain.”

MLCommons added the GPT-3 benchmark to MLPerf Training v3.0 last June. In just five months, the LLM benchmark has shown over 2.8X in performance gains. Eleven submissions in this round include this large language model (LLM) using the GPT-3 reference model, reflecting the tremendous popularity of generative AI.

“GPT-3 is among the fastest growing benchmarks we’ve launched,” said David Kanter, Executive Director, MLCommons. “It’s one of our goals to ensure that our benchmarks are representative of real-world workloads and it’s exciting to see 2.8X better performance in mere months.”

MLPerf HPC v3.0 Benchmarks
The MLPerf HPC benchmark is similar to MLPerf Training, but is specifically intended for high-performance computing systems that are commonly employed in leading-edge scientific research. It emphasizes training machine learning models for scientific applications and data, such as quantum molecular dynamics, and also incorporates an optional throughput metric for large systems that commonly support multiple users.

MLCommons added a new protein-folding benchmark in the HPC v3.0 benchmark suite: the OpenFold generative AI model, which predicts the 3D structure of a protein given a 1D amino acid sequence. Developed by Columbia University, OpenFold is an open-source reproduction of the AlphaFold 2 foundation model and has been the cornerstone of a large number of research projects since its creation.

MLPerf HPC v3.0 includes over 30 results – a 50% increase in participation over last year, and includes submissions by 8 organizations with some of the world’s largest supercomputers: Clemson University Research Computing and Data, Dell, Fujitsu+RIKEN, HPE+Lawrence Berkeley National Laboratory, NVIDIA, and Texas Advanced Computing Center. MLCommons congratulates first-time MLPerf HPC submitters Clemson University Research Computing and Data and HPE+Lawrence Berkeley National Laboratory.

The new OpenFold benchmark includes submissions from 5 organizations: Clemson University Research Computing and Data, HPE+Lawrence Berkeley National Laboratory, NVIDIA, and Texas Advanced Computing Center,

HPC v3.0 Performance Gains
The MLPerf HPC benchmark suite demonstrates considerable progress in AI for science that will help unlock new discoveries. For example, the DeepCAM weather modeling benchmark is 14X faster than when it debuted, illustrating how rapid innovations in machine learning systems can empower scientists with better tools to address critical research areas and advance our understanding of the world.

“The addition of OpenFold follows the spirit of the MLPerf HPC benchmark suite: Accelerating workloads with potential for global-scale contribution. We are excited for the new addition as well as the increased participation in the latest submission round.” said Andreas Prodromou, MLCommons HPC co-chair.

View the Results
To view the results for MLPerf Training v3.1 and MLPerf HPC v3.0 and find additional information about the benchmarks, please visit the Training and HPC benchmark pages.

About MLCommons
MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make machine learning better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire machine learning industry through benchmarks and metrics, public datasets, and best practices.

For additional information on MLCommons and details on becoming a member or affiliate, please visit MLCommons.org or contact participation@mlcommons.org.

The post New MLPerf Training and HPC Benchmark Results Showcase 49X Performance Gains in 5 Years appeared first on MLCommons.

MLPerf Results Show Rapid AI Performance Gains

MLCommons — Tue, 27 Jun 2023 08:24:00 +0000

Today we announced new results from two industry-standard MLPerf benchmark suites: Training v3.0, which measures the performance of training machine learning models, and Tiny v1.1, which measures how quickly a trained neural network can process new data for extremely low-power devices in the smallest form factors.

Faster training paves the way for more capable intelligent systems

Training models faster empowers researchers to unlock new capabilities, such as the latest advances in generative AI. The latest MLPerf Training round demonstrates broad industry participation and highlights performance gains of up to 1.54x compared to just six months ago and 33-49x over the first round, reflecting the tremendous rate of innovation in systems for machine learning.

The MLPerf Training benchmark suite comprises full system tests that stress machine learning models, software, and hardware for a broad range of applications. The open-source and peer-reviewed benchmark suite provides a level playing field for competition that drives innovation, performance, and energy-efficiency for the entire industry.

In this round, MLPerf Training added two new benchmarks to the suite. The first is a large language model (LLM) using the GPT-3 reference model that reflects the rapid adoption of generative AI. The second is an updated recommender, modified to be more representative of industry practices, using the DLRM-DCNv2 reference model. These new tests help advance AI by ensuring that industry-standard benchmarks are representative of the latest trends in adoption and can help guide customers, vendors, and researchers alike.

“I’m excited to see the debut of GPT-3 and DLRM-DCNv2, which were built based on extensive feedback from the community and leading customers and demonstrate our commitment to keep the MLPerf benchmarks representative of modern machine learning,” said David Kanter, executive director of MLCommons®.

The MLPerf Training v3.0 round includes over 250 performance results, an increase of 62% over the last round, from 16 different submitters: ASUSTek, Azure, Dell, Fujitsu, GIGABYTE, H3C, IEI, Intel & Habana Labs, Krai, Lenovo, NVIDIA, NVIDIA + CoreWeave, Quanta Cloud Technology, Supermicro, and xFusion. In particular, MLCommons would like to congratulate first time MLPerf Training submitters CoreWeave, IEI, and Quanta Cloud Technology.

“It is truly remarkable to witness system engineers continuously pushing the boundaries of performance on workloads that hold utmost value for users via MLPerf,” said Ritika Borkar, co-chair of the MLPerf Training Working Group. “We are particularly thrilled to incorporate an LLM benchmark in this round, as it will inspire system innovation for a workload that has the potential of revolutionizing countless applications.”

MLPerf Tiny Results Reflect the Rapid Pace of Embedded Devices Innovation

Tiny compute devices are a pervasive part of everyone’s everyday life, from tire sensors in your vehicles to your appliances and even your fitness tracker. Tiny devices bring intelligence to life at very little cost.

ML inference on the edge is increasingly attractive to increase energy efficiency, privacy, responsiveness, and autonomy of edge devices. Tiny ML breaks the traditional paradigm of energy and compute hungry ML by eliminating networking overhead, allowing for greater overall efficiency and security relative to a cloud-centric approach. The MLPerf Tiny benchmark suite captures a variety of inference use cases that involve “tiny” neural networks, typically 100 kB and below, that process sensor data, such as audio and vision, to provide endpoint intelligence for low-power devices in the smallest form factors. MLPerf Tiny tests these capabilities in a fair and reproducible manner, in addition to offering optional power measurement.

In this round, the Tiny ML v1.1 benchmarks include 10 submissions from academic, industry organizations, and national labs, producing 159 peer-reviewed results. Submitters include: Bosch, cTuning, fpgaConvNet, Kai Jiang, Krai, Nuvoton, Plumerai, Skymizer, STMicroelectronics, and Syntiant. This round includes 41 power measurements, as well. MLCommons congratulates Bosch, cTuning, fpgaConvNet, Kai Jiang, Krai, Nuvoton, and Skymizer on their first submissions to MLPerf Tiny.

“I’m particularly excited to see so many companies embrace the Tiny ML benchmark suite,” said David Kanter, Executive Director of MLCommons. “We had 7 new submitters this round which demonstrates the value and importance of a standard benchmark to enable device makers and researchers to choose the best solution for their use case.”

“With so many new companies adopting the benchmark suite it’s really extended the range of hardware solutions and innovative software frameworks covered. The v1.1 release includes submissions ranging from tiny and inexpensive microcontrollers to larger FPGAs, showing a large variety of design choices,” said Dr. Csaba Kiraly, co-chair of the MLPerf Tiny Working Group. “And the combined effect of software and hardware performance improvements are 1000-fold in some areas compared to our initial reference benchmark results, which shows the pace that innovation is happening in the field.”

View the Results

To view the results for MLPerf Training v3.0 and MLPerf Tiny v1.1, and to find additional information about the benchmarks please visit:
Training v3.0 and Tiny v1.1.

About MLCommons

MLCommons is an open engineering consortium with a mission to make machine learning better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmark in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 50+ members – global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire machine learning industry through benchmarks and metrics, public datasets, and best practices.

For additional information on MLCommons and details on becoming a Member or Affiliate, please visit MLCommmons or contact participation@mlcommons.org.

The post MLPerf Results Show Rapid AI Performance Gains appeared first on MLCommons.

Perspective: Unlocking ML requires an ecosystem approach

MLCommons — Fri, 10 Mar 2023 08:04:00 +0000

Over the past 10 years, machine learning (ML) has grown exponentially, from a small research field to a broad community spanning universities, technology platforms, industries, nonprofits, and governments. The global funding (PE/VC) in AI companies increased by 390% from $16 billion in 2015 to $79 billion in 2022.^[1] Recent advances in ML have made it possible for machines to not only consume but also create content to a level that approaches what humans can do. This capability is sometimes called Generative AI^[2]. This is primarily enabled by foundation models^[3] – large, pre-trained models trained on copious amounts of diverse data through self-supervised learning that can be adapted to a wide array of downstream tasks. The disruptive potential of foundation models lies in their ability to achieve state-of-the-art performance on tasks with little or no task-specific training required (though such “fine-tuning” further improves capabilities).

While considerable innovation is taking place across the ML research-to-production life cycle, it often occurs organically and in silos, resulting in uneven impact and a low transfer of innovation across industry verticals. A handful of technology companies are experiencing benefits from deploying cutting-edge ML at scale, while others are still learning to operationalize it: to quote one study, nearly 80 percent of AI solutions are abandoned because companies struggle to scale them.^[4] Even in companies with advanced ML capabilities, there is considerable friction and uncertainty within the ML development cycle. Reasons for this include:

ML has a fundamentally different programming paradigm than traditional software. In traditional software engineering, a programmer writes explicit instructions for the computer to execute, while in machine learning—especially for deep neural networks—the programmer provides a data set and specifies a model architecture and other training code from which the program learns the desired behavior. While this is an incredibly powerful programming tool, it creates an inherent unpredictability that must be accounted for throughout the ML development life cycle.
The ML programming paradigm requires a very different development cycle. The best practice for traditional software development uses a collaborative, flexible, agile methodology and the scrum framework, whereby teams make rapid, incremental advancements toward well-defined goals. ML development, by contrast, is relatively opaque and unpredictable; in effect, data is the new code, and the quality and quantity of data determine the ML’s functionality—though this connection is still poorly understood. Teams continuously revisit data sets and retrain models as they move ML innovations from research into production – a process we term the “ML development flywheel” (Exhibit 1).
ML lacks a mature ecosystem – a set of shared tools, resources, standards, and best practices – to support this development cycle. Traditional computer programming has had decades to develop an ecosystem that has become so prevalent that nearly 80 percent of all sites on the internet today rely on a common software stack^[5]. ML development, however, is still considered an art as much as a science, often requiring a custom solution for each new problem and highly experienced engineers who can try various techniques gained through experience to arrive at the desired model behavior and convergence.

Exhibit 1: ML Development Flywheel

The result is that the ML field today resembles the internet of the late 1990s. It has high potential but uneven impact, knowledge is spread across myriad organizations and individuals, and programming requires a collection of custom, often vertically integrated tech stacks.

Motivated by the challenge of increasing ML’s impact in society, the MLCommons Association collaborated with several partners on a high-level analysis of how to develop a healthier ML ecosystem (Exhibit 2). We wanted to answer two key questions—where are the opportunities for unlocking ML’s greatest impact on society, and what are the best ways to address them—with the goal of providing a perspective on how various ML ecosystem players, including enterprises and technology hyperscalers,^[6] can contribute to driving greater ML adoption.

Exhibit 2: Types of ML Ecosystem players

We interviewed 32 experts across these ML ecosystem players^[7]—including engineers and product managers from leading tech companies, business leaders, the heads of industry and civil-society groups, and academics—to explore the challenges they face in developing ML and attempt to prioritize solutions for addressing them.

Opportunities to improve ML

Exhibit 3: Opportunities to improve ML fall into three broad categories

Opportunities to unlock the full impact of ML at scale fall into three categories: increasing the capabilities of ML technology, broadening access to and adoption of ML, and ensuring that the results of ML benefit society (Exhibit 3). All ecosystem players have a role to play for ML to reach its full potential, and contributions can be in the form of tools, data, funding, education, training, benchmarks, best practices, partnerships and coalitions, and governance and policy.

Deep dives on critical ecosystem areas to improve

In this section, we discuss three key opportunities that emerged in our discussions with business leaders and ML practitioners: data discoverability, access, and quality; operational scaling and hardening; and ethical applied ML.

Data discoverability, access, and quality: ML developer productivity is data-limited, but public data is primitive and data tools are poor

ML developers face two broad categories of data challenges:

Finding, accessing, and labeling data. Developers need to locate relevant data, gain access to it—which can be especially difficult if it’s held by multiple organizations—format it for consistency, and sometimes efficiently and correctly label it.
Making data useful. Collected data can be incomplete, incorrect, biased, or misleading in numerous ways, problems which are hard to detect prior to deployment.

These challenges are exacerbated by the relatively low investment companies are making in public ML data (compared with the substantial investment in model development), the nonexistent to primitive nature of metrics and standards for ML data sets, and the lack of a widely used tool stack for working with data. Opportunities for improving the ML data ecosystem include:

Improving public data sets. There is opportunity for a network of partners to work together to create open data sets that support multiple contributors and continuous quality improvements. Datasets such as the Pile^[8] and LAION^[9] are early, promising examples of the potential of this approach, though other issues such as licensing still need to be addressed. One estimate states that the boost to the economy from adoption of open-data ecosystems in finance could range from about 1 to 1.5 percent of GDP in 2030 in the European Union, the United Kingdom, and the United States, to as much as 4 to 5 percent in India.^[10] There is a particularly strong need for open datasets that define community consensus on usable public data, serve as tests and benchmarks for different approaches, or address societal issues without commercial interest.
Increasing data sharing. Public data sets should be populated by standardizing best practices for anonymizing and releasing data or finding innovative ways to leverage private data without releasing it. For example, MedPerf is an open framework for benchmarking machine learning in the medical domain that securely evaluates medical AI models on distributed healthcare data sets while keeping the data private.^[11]
Agreeing on common data metrics and standards. First, metrics, such as DataPerf, should be created that quantify data quality (DataPerf is a benchmark suite for ML datasets and data-centric algorithms^[12]). Second, data formats and metadata content should be standardized; Google model cards, IBM fact sheets, datasheets for datasets, and the Data Nutrition Project all provide examples of how to do this.^[13] Third, best practices for ethical data sourcing and consistent licensing should be developed.
Improving interoperable data-centric tools. These metrics and standards should be leveraged to improve data quality and reduce costs through a data-development stack of interoperable tools akin to those widely used for code.

Operational scaling and hardening: deploying models from research into production too often relies on costly and fragile bespoke solutions

Delivering stellar results using a prototype model trained on an experimental data set is a significant accomplishment, but putting the model into production to deliver business value brings new challenges that generally fall into two categories:

Building the ML data platform. Experimental data is often hand-curated or has different characteristics from the actual, production data at a company. As a result, teams need to invest in building persistent pipelines to access and transform production data and design features to approximate those used for the experimental model. Often the new model does not match the evaluation performance of the experimental one and additional work is needed to bring it up to par. An abrupt handoff between the research and production teams can increase the challenge of this work if the team responsible for operationalizing the model into a product is unaware of the context and datasets used in model development. Finally, live data needs to be collected as the model is put through dark launches, live experiments, and full launch, all while monitoring the model output to ensure the query and training distribution match, and it isn’t producing unintended outcomes. Building this end-to-end data story, with an ability to track artifact provenance, requires significant up-front investment before any business impact is realized.
Navigating infrastructure fragmentation. Much of the open-source infrastructure used to write ML models and compile them down to hardware-native code was created by researchers for their own specific, fast-changing needs. This has resulted in an ecosystem of toolchains that is highly siloed or patched together using custom bridges, which creates a host of interoperability problems—for example, trying to port a server-side model written in Pytorch to a mobile back-end using TFLite can be a challenging process—and means teams often have to throw out their existing model code and pipelines and start from scratch as their business needs evolve, significantly slowing down development velocity.

ML tooling innovation is addressing some of these challenges, in the process attracting high levels of VC funding and some of the best talent in the industry. An ecosystem perspective suggests new solutions are needed to accelerate ML adoption:

Data-platform standards and best practices. Best practices for building robust data platforms could enable a smooth transition from prototyping to live monitoring. While it is common for different companies to use different tools for each stage of the pipeline, data sharing and security standards would ensure these tools work seamlessly together.
Standard interchange formats. Decoupling higher and lower parts of the core ML stack could drive interoperability and reduce fragmentation. Developers should be able to design the toolchain (for example, PyTorch framework with XLA compiler) that best meets their needs and works across hardware back-ends. Defining a standard set of operators at various levels of the stack would make ML development much more scalable and make it easier for hardware and software vendors to focus on their strengths rather than on needing to support multiple independent code paths.
Include people and organizations. Promising ML applications often fail to gain traction for nontechnical reasons such as a lack of stakeholder involvement during development and skepticism by front-line users during early iterations. Organizations as a whole, not just ML engineering teams, need to be aligned around the ML development flywheel.

Ethical applied ML: developers lack universally accepted standards, technologies and benchmarks to build responsible ML applications

ML has driven phenomenal advances in many areas, from autonomous vehicles and drug discovery to financial fraud-detection and retail-product recommendations. However, it has also had negative effects, including perpetuating systemic biases^[15], making many companies cautious about adopting it. Building a trust-based AI architecture will expand the scope of industries and use cases where ML can be applied and will require work across several dimensions – bias and fairness, explainability and transparency, human oversight and accountability, privacy and data ethics, performance and safety, security and sustainability^[16]. These can be addressed through several approaches:

Privacy-enhancing technologies. Many tech hyperscalers and open-source communities provide privacy-enhancing technology solutions, such as federated learning and homomorphic encryption, but these solutions lack standardization, which can lead to reinventing the wheel in silos rather than alignment among ecosystem players. A common set of community backed standards can help ensure these technologies are more widely adopted, benefiting everyone.
Best practices. There is a need for universally accepted best practices so that organizations with hundreds of models can carry out the full gamut of fairness checks and evaluate data risks. Instituting best practices while the technology is still young can minimize problems caused by proprietary systems not “talking” to one another later, help standardize models, and contribute to curbing bias. Early development of such practices can help avoid what occurred with electronic health records; the promise was that they would help patients and providers easily access relevant information, but the reality is that records remain fragmented across platforms, often causing frustration for stakeholders at every level.
Policy and governance. Global standards for AI regulation are still being developed and governments, technology companies, and civil society groups can help shape legislation to focus on key principles like transparency, human centricity, contestability, data protection, and fairness. For regulations already in place, there remains an opportunity to provide tools and certifications to help companies implement these guidelines.
Tools and benchmarks. There is a need for comprehensive benchmarks tailored to the needs of specific industries for identifying and addressing bias. An example of this is the Monk Skin Tone Scale^[17] developed and released by Google that can be leveraged to evaluate computer vision datasets and models for skin tone representation. There is also a need for better documentation and artifact tracking at each stage of the ML lifecycle.^[18]
Education and training. To accelerate the adoption of tools, benchmarks, and best practices, training resources should be customized to the needs and roles of each of the players involved in each step of ML development flywheel. One example of such a resource is provided by EDM Council, a global association focused on industry-validated trainings and certifications and other practitioner resources.

Concluding thoughts

The ML field is still nascent—barely 10 years old—and significant work is needed to transform it into an established discipline that achieves its full potential of social and economic impact. Mainstream ML adopters and tech hyperscalers will be key to this endeavor.

Key takeaways for Enterprise adopters of ML

In recent years, the momentum of R&D in AI has been exponential; more than 30 times as many patents were filed in the field in 2021 than in 2015.^[19] Given this momentum, as well as other factors such as increased investment, AI is not only performing better but also becoming more affordable, leading to increased adoption across industries. According to McKinsey & Company’s Global Survey on the state of AI^[20] 2022, 50 percent of respondents reported their organizations had adopted AI in at least one function in 2021, compared with 20 percent in 2017. While AI adoption has grown 2.5 times since 2017, it has more recently plateaued. While the consensus is that AI will be a mainstream technology at their companies, a 2020 survey found that only 76 percent of organizations were close to breaking even on their AI investments when considering the costs and not just the benefits.^[21] To address these challenges and realize the full potential of AI for themselves and other ecosystem players, enterprise adopters could adopt the following strategies:

Reassess the path to ML production. The path to performance observability and model conformance lies as much in the data that is fueling the infrastructure as it does in the design of the infrastructure itself. Thus, enterprise ML adopters should be as metrics-driven on evaluating data quality as they are on enabling high-performance hardware. This would enable the next level of scale and impact for enterprises. Enterprise adopters could work to prevent bias and drift by building in time for iterative, unpredictable developments and bringing in all relevant organizational stakeholders, including legal and ethics leads, early in the ML development flywheel. This would help organizations identify regulatory requirements up-front, mitigate bias, and communicate clearly about privacy and governance laws.
Support emerging standards and best practices, and tailor them to current and future challenges. Many current ML development tools are either domain-specific or don’t scale well when moving from research prototyping to deploying ML at scale. Adopting open-source frameworks and best practices and eschewing one-off solutions can ensure an organization does not get stuck on a tech island as it matures but continues to learn and adapt as tooling solutions, policies, and innovations in the ecosystem evolve and scale.
Invest in people. Today’s AI heroes are often just those who write model code. However, enterprises need to ensure that everyone involved in the ML Development Flywheel, such as those who enable the data flow, provide tooling solutions, assure unbiased outcomes, and monitor live operations are also rewarded and upskilled.

Key takeaways for tech hyperscalers

As some of the largest investors in ML research, infrastructure, and applications, tech hyperscalers such as Google and Meta are critical to advancing the ML ecosystem. In addition to the significant contributions they already make through university collaborations and open-source projects, we would encourage them to invest in the following:

Support the development of public resources, especially data sets and tooling, to allow researchers to continue to innovate and communicate as ML technology advances.
Contribute to legal and business best practices that help all companies use ML effectively and responsibly. This can be achieved by partnering with mainstream ML adopters to share best practices on translating business problems into ML solutions and providing a solid technical foundation for emerging regulatory frameworks.
Work cross-organizationally or with industry groups to drive interoperability in open-source frameworks and tooling. This could enable more companies to participate in ML collaboratively and flexibly, growing overall adoption.
Encourage efforts to raise public understanding and awareness of AI’s potential for positive societal impact, such as through research into AI applications for social good and efforts to reduce bias and increase inclusion.
Be transparent about the limitations and risks of ML-based solutions.

Authors

Peter Mattson is a Senior Staff Engineer at Google. He co-founded and is President of MLCommons®, and co-founded and was General Chair of the MLPerf consortium that preceded it.

Aarush Selvan is a Product Manager at Google where he works on Natural Language Processing for the Google Assistant.

David Kanter is a co-founder and the Executive Director of MLCommons, where he helps lead the MLPerf benchmarks and other initiatives.

Vijay Janapa Reddi is an Associate Professor at Harvard University. His research interests include computer architecture and runtime systems, specifically in the context of autonomous machines and mobile and edge computing system.

Roger Roberts is a Partner at McKinsey & Company’s Bay Area office and advises clients in a broad range of industries as they address their toughest technology challenges especially in retail and consumer industries

Jacomo Corbo is Founder and co-CEO of PhysicsX, a company building AI generative models for physics and engineering; previously Chief Scientist and Founder of QuantumBlack and a Partner at McKinsey & Company

Authors would like to thank Bryan Richardson, Chaitanya Adabala Viswa, Chris Anagnostopoulos, Christopher McGrillen, Daniel Herde, David Harvey, Kasia Tokarska, Liz Grennan, Martin Harrysson, Medha Bankhwal, Michael Chui, Michael Rudow, Rasmus Kastoft-Christensen, Rory Walsh, Pablo Illanes, Saurabh Sanghvi, Stephen Xu, Tinashe Handina, Tom Taylor-Vigrass, Yetunde Dada, and Douwe Kiela as well as Adam Yala from University of California, Berkeley, Kasia Chmielinski from Data Nutrition Project and contributors from Google, Harvard University, Hugging Face, Meta AI, and Partnership on AI for their contributions to this article.

Pitchbook ︎
https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-is-generative-ai ︎
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N., Chen, A., Creel, K., Davis, J. Q., Demszky, D., … Liang, P. (2022, July 12). On the opportunities and risks of Foundation models. arXiv.org. Retrieved January 20, 2023, from https://arxiv.org/abs/2108.07258 ︎
“How to operationalize machine learning for maximum business impact: Frameworks for machine learning operationalisation are key,” Wired, in partnership with QuantumBlack, June 12, 2021, https://www.wired.co.uk/bc/article/mckinsey-machine-learning ︎
“Usage statistics of PHP for websites,” W3 Techs, n.d., https://w3techs.com/technologies/details/pl-php ︎
Tech hyperscalers comprises technology companies like Google, Microsoft, and Amazon that are making efforts to scale in software products and services but also expand to numerous industry verticals ︎
Data Nutrition Project, Google, Harvard University, Hugging Face, Meta AI, Partnership on AI, and the University of California, Berkeley ︎
https://pile.eleuther.ai/ ︎
https://laion.ai/blog/laion-400-open-dataset/ ︎
White, Olivia, Anu Madgavkar, Zac Townsend, James Manyika, Tunde Olanrewaju, Tawanda Sibanda, and Scott Kaufman. “Financial Data Unbound: The Value of Open Data for Individuals and Institutions.” McKinsey & Company June 29, 2021. https://www.mckinsey.com/industries/financial-services/our-insights/financial-data-unbound-the-value-of-open-data-for-individuals-and-institutions ︎
Alexandros Karargyris et al., “MedPerf: Open benchmarking platform or medical artificial intelligence using federated evaluation,” n.d., https://arxiv.org/ftp/arxiv/papers/2110/2110.01406.pdf ︎
https://dataperf.org/ ︎
“The value of a shared understanding of AI models,” n.d., https://modelcards.withgoogle.com/about; Michael Hind, “IBM FactSheets further advances trust in AI,” July 9, 2020, https://www.ibm.com/blogs/research/2020/07/aifactsheets/; Timnet Gebru et al., “Datasheets for datasets,” Cornell University, March 23, 2018, https://arxiv.org/abs/1803.09010; “The Data Nutrition Project,” n.d., https://datanutrition.org/ ︎
https://openai.com/pricing ︎
Silberg, Jake, and James Manyika. “Tackling Bias in Artificial Intelligence (and in Humans).” McKinsey & Company, July 22, 2020. https://www.mckinsey.com/featured-insights/artificial-intelligence/tackling-bias-in-artificial-intelligence-and-in-humans. ︎
https://www.mckinsey.com/featured-insights/in-the-balance/from-principles-to-practice-putting-ai-ethics-into-action ︎
https://skintone.google/ ︎
“Machine learning life cycle,” Partnership on AI, n.d., https://docs.google.com/presentation/d/1fh4AGTeXRN0hInvoZsYMz7Bb3mgzfZktdGZtKvBRED8/edit#slide=id.gdf40b9b06b_2_491 ︎
Daniel Zhang et al., “Artificial Intelligence Index Report 2022,” Stanford Institute for Human-Centered AI, Stanford University, March 2022, https://aiindex.stanford.edu/wp-content/uploads/2022/03/2022-AI-Index-Report_Master.pdf ︎
https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2022-and-a-half-decade-in-review ︎
“2021 AI Predictions: No uncertainty here,” PwC, October 2020, https://www.pwc.com/mt/en/publications/technology/ai-predictions-2021/ai-predictions-2021-no-uncertainty-here.html ︎
https://www.mckinsey.com/capabilities/quantumblack/our-insights/generative-ai-is-here-how-tools-like-chatgpt-could-change-your-business ︎

The post Perspective: Unlocking ML requires an ecosystem approach appeared first on MLCommons.

Latest MLPerf Results Display Gains for All

MLCommons — Wed, 09 Nov 2022 08:41:00 +0000

Today, MLCommons®, an open engineering consortium, announced new results from the industry-standard MLPerf Training, HPC and Tiny benchmark suites. Collectively, these benchmark suites scale from ultra-low power devices that draw just a few microwatts for inference all the way up to the most powerful multi-megawatt data center training platforms and supercomputers. The latest MLPerf results demonstrate up to a 5X improvement in performance helping deliver faster insights and deploy more intelligent capabilities in systems at all scales and power levels.

The MLPerf benchmark suites are comprehensive system tests that stress machine learning models including underlying software and hardware and in some cases, optionally measuring energy usage. The open-source and peer-reviewed benchmark suites create a level playing ground for competition, which fosters innovation and benefits society at large through better performance and energy efficiency for AI and ML applications.

The MLPerf Training benchmark suite measures the performance for training machine learning models that are used in commercial applications such as recommending movies, speech-to-text, autonomous vehicles, and medical imaging. MLPerf Training v2.1 includes nearly 200 results from 18 different submitters spanning all the way from small workstations up to large scale data center systems with thousands of processors.

The MLPerf HPC benchmark suite is targeted at supercomputers and measures the time it takes to train machine learning models for scientific applications and also incorporates an optional throughput metric for large systems that commonly support multiple users. The scientific workloads include weather modeling, cosmological simulation, and predicting chemical reactions based on quantum mechanics. MLPerf HPC 2.0 includes over 20 results from 5 organizations with time-to-train and throughput for all models and submissions from some of the world’s largest supercomputers.

The MLPerf Tiny benchmark suite is intended for the lowest power devices and smallest form factors, such as deeply embedded, intelligent sensing, and internet-of-things applications. It measures inference performance – how quickly a trained neural network can process new data and includes an optional energy measurement component. MLPerf Tiny 1.0 encompasses submissions from 8 different organizations including 59 performance results with 39 energy measurements or just over 66% – an all-time record.

“We are pleased to see the growth in the machine learning community and especially excited to see the first submissions from xFusion for MLPerf Training, Dell in MLPerf HPC and GreenWaves Technologies, OctoML, and Qualcomm in MLPerf Tiny,” said MLCommons Executive Director David Kanter. “The increasing adoption of energy measurement is particularly exciting, as a demonstration of the industry’s outstanding commitment to efficiency.”

To view the results and find additional information about the benchmarks please visit: https://mlcommons.org/en/training-normal-21/, https://mlcommons.org/en/training-hpc-20/, and https://www.mlcommons.org/en/inference-tiny-10/

About MLCommons

MLCommons is an open engineering consortium with a mission to benefit society by accelerating innovation in machine learning. The foundation for MLCommons began with the MLPerf benchmark in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 50+ founding partners – global technology providers, academics and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire machine learning industry through benchmarks and metrics, public datasets and best practices.

For additional information on MLCommons and details on becoming a Member or Affiliate of the organization, please visit http://mlcommons.org/ and contact participation@mlcommons.org.

Press Contact:
David Kanter
press@mlcommons.org

The post Latest MLPerf Results Display Gains for All appeared first on MLCommons.

MLPerf Results Highlight More Capable ML Training

MLCommons — Wed, 29 Jun 2022 08:30:00 +0000

Today, MLCommons®, an open engineering consortium, released new results from MLPerf Training v2.0, which measures the performance of training machine learning models. Training models faster empowers researchers to unlock new capabilities such as diagnosing tumors, automatic speech recognition or improving movie recommendations. The latest MLPerf Training results demonstrate broad industry participation and up to 1.8X greater performance ultimately paving the way for more capable intelligent systems to benefit society at large.

In this round, MLPerf Training added a new object detection benchmark that trains the new RetinaNet reference model on the larger and more diverse Open Images dataset. This new test more accurately reflects state-of-the-art ML training for applications like collision avoidance for vehicles and robotics, retail analytics, and many others.

“I’m excited to release our new object detection benchmark, which was built based on extensive feedback from a customer advisory board and is an excellent tool for purchasing decisions, designing new accelerators and improving software,” said David Kanter, executive director of MLCommons.

The MLPerf Training v2.0 results include over 250 performance results from 21 different submitters including Azure, Baidu, Dell, Fujitsu, GIGABYTE, Google, Graphcore, HPE, Inspur, Intel-HabanaLabs, Lenovo, Nettrix, NVIDIA, Samsung, and Supermicro. In particular, MLCommons would like to congratulate first time MLPerf Training submitters ASUSTeK, CASIA, H3C, HazyResearch, Krai, and MosaicML.

“We are thrilled with the greater participation and the breadth, diversity, and performance of the MLPerf Training results,” said Eric Han, Co-Chair of the MLPerf Training Working Group. “We are especially excited about many of the novel software techniques highlighted in the latest round.”

To view the results and find additional information about the benchmarks please visit https://mlcommons.org/en/training-normal-20/

About MLCommons

For additional information on MLCommons and details on becoming a Member or Affiliate of the organization, please visit http://mlcommons.org/ and contact participation@mlcommons.org.

The post MLPerf Results Highlight More Capable ML Training appeared first on MLCommons.

MLPerf Training v1.1 Results

MLCommons — Wed, 01 Dec 2021 08:56:00 +0000

Today, MLCommons®, an open engineering consortium, released new results for MLPerf Training v1.1, the organization’s machine learning training performance benchmark suite. MLPerf Training measures the time it takes to train machine learning models to a standard quality target in a variety of tasks including image classification, object detection, NLP, recommendation, and reinforcement learning.

MLPerf Training is a full system benchmark, testing machine learning models, software, and hardware. MLPerf creates a reliable and consistent way to track performance over time, and the fair and representative benchmarks create a “level playing field” where competition drives the industry forward, accelerating innovation. Compared to the previous submission round, the best benchmark results improved by up to 2.3X, showing substantial improvement in hardware, software, and system scale.

Similar to past MLPerf Training results, the submissions consist of two divisions: closed and open. Closed submissions use the same reference model to ensure a level playing field across systems, while participants in the open division are permitted to submit a variety of models. Submissions are additionally classified by availability within each division, including systems commercially available, in preview, and RDI.

MLPerf Training v1.1 results further MLCommons’ goal to provide benchmarks and metrics that level the industry playing field through the comparison of ML systems, software, and solutions. The latest benchmark round received submissions from 14 organizations and released over 185 peer-reviewed results for machine learning systems spanning from edge devices to data center servers. Submissions this round included software and hardware innovations from Azure, Baidu, Dell, Fujitsu, GIGABYTE, Google, Graphcore, HabanaLabs, HPE, Inspur, Lenovo, NVIDIA, Samsung, and Supermicro. To view the results, please visit https://mlcommons.org/en/training-normal-11/.

“We’re thrilled to have such broad participation in MLPerf Training,” said Victor Bittorf, Co-Chair of the MLPerf Training Working Group. “Congratulations to all of our participants in this round, especially the first-time submitters. It’s particularly exciting to see the advances in the Open Division.”

“Looking back to the first MLPerf Training round in 2018, it’s remarkable that performance has improved by 30X for some of our benchmarks,” said David Kanter, Executive Director of MLCommons. “That rapid increase in performance will ultimately unleash new machine learning innovations that will benefit society.”

Additional information about the Training v1.1 benchmarks is available at https://mlcommons.org/en/training-normal-11/.

About MLCommons

For additional information on MLCommons and details on becoming a Member or Affiliate of the organization, please visit http://mlcommons.org/ or contact participation@mlcommons.org.

Press Contact:
press@mlcommons.org

The post MLPerf Training v1.1 Results appeared first on MLCommons.