MLCommons Releases New MLPerf Inference v5.0 Benchmark Results
https://mlcommons.org/2025/04/mlperf-inference-v5-0-results/ | Wed, 02 Apr 2025 15:00:00 +0000
Generative AI now the center of attention for performance engineering

Today, MLCommons® announced new results for its industry-standard MLPerf® Inference v5.0 benchmark suite, which delivers machine learning (ML) system performance benchmarking in an architecture-neutral, representative, and reproducible manner. The results highlight that the AI community is focusing much of its attention and effort on generative AI scenarios, and that the combination of recent hardware and software advances optimized for generative AI has led to dramatic performance improvements over the past year.

The MLPerf Inference benchmark suite, which encompasses both datacenter and edge systems, is designed to measure how quickly systems can run AI and ML models across a variety of workloads. The open-source and peer-reviewed benchmark suite creates a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry. It also provides critical technical information for customers who are procuring and tuning AI systems. This round of MLPerf Inference results also includes tests for four new benchmarks: Llama 3.1 405B, Llama 2 70B Interactive for low-latency applications, RGAT, and Automotive PointPainting for 3D object detection.

Llama 2 70B generative AI test takes center stage

The Inference v5.0 results show that generative AI scenarios have gained momentum. Over the last year, submissions to the Llama 2 70B benchmark test, which implements a large generative AI inference workload based on a widely referenced open-source model, have increased 2.5x. With the v5.0 release, Llama 2 70B has supplanted ResNet-50 as the test with the highest submission rate.

The performance results for Llama 2 70B have also skyrocketed over the past year: the median submitted score has doubled, and the best score is 3.3 times faster than in Inference v4.0.

“It’s clear now that much of the ecosystem is focused squarely on deploying generative AI, and that the performance benchmarking feedback loop is working,” said David Kanter, head of MLPerf at MLCommons. “We’re seeing an unprecedented flood of new generations of accelerators. The hardware is paired with new software techniques, including aligned support across hardware and software for the FP4 data format. With these advances, the community is setting new records for generative AI inference performance.”

The benchmark results for this round include results for six newly available or soon-to-be-shipped processors:

  • AMD Instinct MI325X
  • Intel Xeon 6980P “Granite Rapids”
  • Google TPU Trillium (TPU v6e)
  • NVIDIA B200
  • NVIDIA Jetson AGX Thor 128
  • NVIDIA GB200

Benchmarking the state of the art in generative AI: two new tests introduced

In step with advances in the AI community, MLPerf Inference v5.0 introduces a new benchmark utilizing the Llama 3.1 405B model, setting a new bar for the scale of a generative AI inference model in a performance benchmark. Llama 3.1 405B has 405 billion parameters and supports input and output lengths of up to 128,000 tokens (compared to only 4,096 tokens for Llama 2 70B). The benchmark tests three separate tasks: general question-answering, math, and code generation.

“This is our most ambitious inference benchmark to date,” said Miro Hodak, co-chair of the MLPerf Inference working group. “It reflects the industry trend toward larger models, which can increase accuracy and support a broader set of tasks. It’s a more difficult and time-consuming test, but organizations are trying to deploy real-world models of this order of magnitude. Trusted, relevant benchmark results are critical to help them make better decisions on the best way to provision them.”

The Inference v5.0 suite also adds a new twist to its existing Llama 2 70B benchmark with an additional test that imposes low-latency requirements: Llama 2 70B Interactive. Reflecting industry trends toward interactive chatbots as well as next-generation reasoning and agentic systems, the benchmark requires systems under test (SUTs) to meet more demanding response metrics for time to first token (TTFT) and time per output token (TPOT).

“A critical measure of the performance of a query system or a chatbot is whether it feels responsive to a person interacting with it. How quickly does it start to reply to a prompt, and at what pace does it deliver its entire response?” said Mitchelle Rasquinha, MLPerf Inference working group co-chair. “By enforcing tighter requirements for responsiveness, this interactive version of the Llama 2 70B test offers new insights into the performance of LLMs in real-world scenarios.”
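
As a rough illustration of how these two metrics relate to a stream of generated tokens, the sketch below computes TTFT and TPOT from per-token arrival times. The timestamps and helper function are hypothetical, not the official MLPerf LoadGen logic.

    # Minimal sketch (not the official MLPerf LoadGen logic): compute TTFT and TPOT
    # for one query from the wall-clock times at which output tokens arrived.
    def ttft_and_tpot(request_time, token_arrival_times):
        """request_time: when the prompt was issued (seconds).
        token_arrival_times: sorted arrival times of each generated token (seconds)."""
        ttft = token_arrival_times[0] - request_time  # time to first token
        if len(token_arrival_times) > 1:
            # average gap between consecutive output tokens after the first one
            tpot = (token_arrival_times[-1] - token_arrival_times[0]) / (len(token_arrival_times) - 1)
        else:
            tpot = 0.0
        return ttft, tpot

    # Example: prompt issued at t=0.0 s, tokens arrive every 40 ms after a 0.4 s prefill
    times = [0.4 + 0.04 * i for i in range(100)]
    print(ttft_and_tpot(0.0, times))  # -> approximately (0.4, 0.04)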

More information on the selection of the Llama 3.1 405B and the new Llama 2 70B Interactive benchmarks can be found in this supplemental blog.

New Graph Neural Network datacenter benchmark for modeling relationship graphs

Also new to Inference v5.0 is a datacenter benchmark that implements a graph neural network (GNN) model. GNNs are useful for modeling links and relationships between nodes in a network and are commonly used in recommendation systems, knowledge-graph answering, fraud-detection systems, and other kinds of graph-based applications.

The GNN datacenter benchmark implements the RGAT model, based on the Illinois Graph Benchmark Heterogeneous (IGBH) dataset containing 547,306,935 nodes and 5,812,005,639 edges.

More information on the construction of the RGAT benchmark can be found here.

New edge benchmark: Automotive PointPainting test for 3D object detection

The Inference v5.0 benchmark introduces a new Automotive PointPainting benchmark for edge computing devices, specifically automobiles. While the MLPerf Automotive working group continues to develop the Minimum Viable Product benchmark first announced last summer, this test provides a proxy for an important edge-computing scenario: 3D object detection in camera feeds for applications such as self-driving cars.

More information on the Automotive PointPainting benchmark can be found here.

As industry raises the bar for AI systems, MLPerf Inference benchmark follows suit

“We rarely introduce four new tests in a single update to the benchmark suite,” said Miro Hodak, “but we felt it was necessary to best serve the community. The rapid pace of advancement in machine learning and the breadth of new applications are both staggering, and stakeholders need relevant and up-to-date data to inform their decision-making.”

MLPerf Inference v5.0 includes 17,457 performance results from 23 submitting organizations: AMD, ASUSTeK, Broadcom, Cisco, CoreWeave, CTuning, Dell, FlexAI, Fujitsu, GATEOverflow, Giga Computing, Google, HPE, Intel, Krai, Lambda, Lenovo, MangoBoost, NVIDIA, Oracle, Quanta Cloud Technology, Supermicro, and Sustainable Metal Cloud.

“We would like to welcome the five first-time submitters to the Inference benchmark: CoreWeave, FlexAI, GATEOverflow, Lambda, and MangoBoost,” said David Kanter. “The continuing growth in the community of submitters is a testament to the importance of accurate and trustworthy performance metrics to the AI community. I would also like to highlight Fujitsu’s broad set of datacenter power benchmark submissions and GATEOverflow’s edge power submissions in this round, which reminds us that energy efficiency in AI systems is an increasingly critical issue in need of accurate data to guide decision-making.”

“The machine learning ecosystem continues to give the community ever greater capabilities. We are increasing the scale of AI models being trained and deployed, achieving new levels of interactive responsiveness, and deploying AI compute more broadly than ever before,” said Kanter. “We are excited to see new generations of hardware and software deliver these capabilities, and MLCommons is proud to present exciting results for a wide range of systems and several novel processors with this release of the MLPerf Inference benchmark. Our work to keep the benchmark suite current, comprehensive, and relevant at a time of rapid change is a real accomplishment, and ensures that we will continue to deliver valuable performance data to stakeholders.”

View the results

To view the results for MLPerf Inference v5.0, please visit the Datacenter and Edge benchmark results pages.

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies’ accuracy, safety, speed, and efficiency.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or email participation@mlcommons.org.

Introducing a Graph Neural Network Benchmark in MLPerf Inference v5.0
https://mlcommons.org/2025/04/rgat-inference-v5/ | Wed, 02 Apr 2025 14:59:24 +0000
The RGAT benchmark addresses graph-structured data and applications, expanding the scope of MLPerf Inference

Introduction

Graph neural networks (GNNs) have been increasing in popularity over the last few years with the abundance of data and use cases that have underlying graph structures. GNNs are used in a variety of applications including social network analysis, fraud detection, transportation applications, recommendation systems, weather prediction, drug discovery, and chemical structure analysis. In addition, GNNs are being explored for natural language processing and image segmentation. Generally, GNNs have been shown to outperform other machine learning methods for tasks like node and graph classification, and they unlock many new applications that work with data more complex in structure than traditional text, images, video, and audio. Based on the popularity of graph neural networks and these new application areas, the GNN task force recommended adding a graph neural network benchmark to MLPerf® Inference v5.0.

Model selection

There are a variety of types of GNNs, including both Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs). GCNs rely on convolution operators to capture information about relationships between nodes in a graph, while GATs use the attention mechanism. Traditional GATs focus on undirected single-relation graphs. For the MLPerf GNN benchmark, we chose a variant of graph attention networks called RGAT, or Relational Graph Attention Network, which extends GATs to allow for multi-relational graph structures – representing a wider range of applications and use cases.

What are RGATs?

Graph Attention Networks (GATs) are a specialized type of GNN that use attention mechanisms to dynamically weigh the importance of neighboring nodes during information propagation in graph-structured data. They’re particularly effective for node classification and social network analysis. Relational GATs (RGATs) extend this concept by incorporating relationship type discrimination between nodes in knowledge graphs (i.e., differentiating by edge type).

The figure below demonstrates a 2-layer RGAT with fan-out: when classifying a target node, we construct a subgraph by randomly sampling neighbors across two hierarchical levels (note that here, we fan in rather than fan out). In the first layer, neighbor embeddings serve directly as attention inputs, while subsequent layers use propagated outputs from GAT computations. Notably, the same graph node (e.g., any single GAT node in the figure) may appear multiple times in the computation subgraph. While this allows reuse of its learned embedding, each occurrence requires recomputation because input edges carry distinct values in different contexts. This architecture creates a memory-computation tradeoff dependent on graph relationships.

Subgraph Sampling Example

The figure below illustrates the architecture of a single RGAT layer, detailing its attention mechanism. For each node pair in the attention computation, both the local embedding (central node) and external embeddings (neighbors) pass through a shared MLP to generate separate Query and Key vectors. These projections are concatenated and transformed into an attention score. The scores undergo Softmax normalization to produce attention weights. These weights then compute a weighted sum of the Value vectors, which are derived from projected neighbor embeddings. This aggregated result becomes the hidden embedding passed to the next RGAT layer.

A key implementation detail is that while the same MLP parameters are reused across all computations, each attention head recalculates context-specific weights because input edges (Key/Value) differ for every node-edge interaction.

RGAT network architecture
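
To make the attention computation above more concrete, here is a minimal, single-head PyTorch sketch of a relational attention layer. The layer sizes, the per-relation parameterization, and all names are illustrative assumptions on our part; this is not the benchmark’s reference code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyRGATLayer(nn.Module):
        # Simplified single-head sketch: one shared projection per relation type,
        # attention scores from the concatenated (query, key) pair, softmax over
        # neighbors, and a weighted sum of the projected neighbor embeddings.
        def __init__(self, in_dim, out_dim, num_relations):
            super().__init__()
            self.proj = nn.ModuleList(nn.Linear(in_dim, out_dim) for _ in range(num_relations))
            self.attn = nn.ModuleList(nn.Linear(2 * out_dim, 1) for _ in range(num_relations))

        def forward(self, center, neighbors, rel_types):
            # center: (D,) embedding of the target node
            # neighbors: (N, D) embeddings of sampled neighbors
            # rel_types: (N,) relation type id of each incident edge
            scores, values = [], []
            for i in range(neighbors.size(0)):
                r = rel_types[i].item()
                q = self.proj[r](center)        # query from the central node
                k = self.proj[r](neighbors[i])  # key from the neighbor
                scores.append(self.attn[r](torch.cat([q, k])))
                values.append(k)                # value reuses the projected neighbor
            alpha = F.softmax(torch.stack(scores).squeeze(-1), dim=0)  # attention weights
            return (alpha.unsqueeze(-1) * torch.stack(values)).sum(dim=0)

    layer = TinyRGATLayer(in_dim=1024, out_dim=512, num_relations=4)
    out = layer(torch.randn(1024), torch.randn(15, 1024), torch.randint(0, 4, (15,)))
    print(out.shape)  # torch.Size([512])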

Dataset and task selection

Some of the most popular applications of graph neural networks are social network analysis and recommenders. These applications can have massive datasets of user information that require a large amount of storage. One of the biggest challenges for graph neural networks is scaling to these large graphs, which increases both the computational cost of processing all the relations in the graph and the overhead of message passing and memory management for nodes that are often stored on disk or even spread across multiple machines.

To mimic this as much as possible, we chose to use the largest publicly available graph dataset at the time: the Illinois Graph Benchmark Heterogeneous (IGB-H) dataset. The graph in this dataset consists of academic papers (“Paper” nodes), the fields of study of those papers (“FoS” nodes), the authors of those papers (“Author” nodes), and the institutes the authors are affiliated with (“Institute” nodes). In addition to the previously mentioned relations (topic, written by, and affiliated with), there is an additional “citation” relation from “Paper” to “Paper”. The task for this dataset is classification of the “Paper” nodes into a set of 2,983 topics.

Graph structure of IGB-H

The IGB Heterogeneous “Full” dataset variant has 547 million nodes and 5.8 billion edges. With an embedding dimension of 1024, the dataset totals over 2 TB of data. In addition, for use in MLPerf Inference, the dataset is augmented with reverse edges as well as self-loops for papers, more than doubling the number of edges. The subset used for inference and accuracy checks is the validation set from training, consisting of around 788,000 “Paper” nodes. During classification, subgraphs are sampled for each input node with fanout steps of 15-10-5 (i.e., 15 neighbors, 10 neighbors of each of those neighbors, and 5 neighbors of those neighbors). We acknowledge that some real-world applications of GNNs use “full fanout,” which takes every neighbor of the node set at each step of subgraph generation. For the MLPerf Inference benchmark, however, we decided to use a fixed maximum fanout with the same parameters used when the model was trained, in order to reduce the variance of per-sample latencies, since nodes with high neighbor counts can skew inference latency higher.
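
As a rough illustration of the fixed-fanout sampling described above, the sketch below draws a 15-10-5 subgraph from a plain adjacency dictionary. The data structures and names are ours; the reference implementation uses the DGL neighborhood sampler rather than this code.

    import random

    # Illustrative sketch of fixed-fanout subgraph sampling (not the DGL sampler used
    # by the reference implementation). At each hop, keep at most `fanout` randomly
    # chosen neighbors of every frontier node.
    def sample_subgraph(adj, seed_node, fanouts=(15, 10, 5), seed=0):
        """adj: dict mapping node id -> list of neighbor ids."""
        rng = random.Random(seed)
        layers, frontier = [], [seed_node]
        for fanout in fanouts:
            sampled = []
            for node in frontier:
                neighbors = adj.get(node, [])
                sampled.extend(rng.sample(neighbors, min(fanout, len(neighbors))))
            layers.append(sampled)
            frontier = sampled        # the next hop expands from the newly sampled nodes
        return layers                 # one list of sampled nodes per hop

    toy_adj = {n: [m for m in range(50) if m != n] for n in range(50)}
    print([len(hop) for hop in sample_subgraph(toy_adj, seed_node=0)])  # [15, 150, 750]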

This iteration of GNN benchmarking only tests the ‘Offline’ MLPerf Inference scenario, as many of the production applications are only used in this setting. In the future, if there is sufficient interest, we may add the ‘Server’ scenario. 

Accuracy metrics

To align with the selected task and dataset, the task force decided to use the ratio of correctly classified “Paper” nodes in the validation set as the accuracy metric. The validation set comprises 0.5% of the 157 million labeled nodes, or roughly 788,000 nodes, and the baseline accuracy evaluated in float32 is 72.86%. (Model weights are kept in float32, while embeddings are downcast to FP16 to save on storage and memory requirements.)

As with other benchmarks, MLCommons® places an accuracy constraint on submissions that run at precisions lower than the baseline: results must reach at least 99% of the reference accuracy. However, with neighborhood sampling embedded in the subgraph computation, some randomness had to be accounted for.¹ The task force therefore decided to allow an additional 0.5% margin.
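
Expressed as a simple check, the accuracy gate looks roughly like the sketch below. The baseline and thresholds restate the numbers above; treating the extra 0.5% margin as additive percentage points is our assumption, and the function is illustrative rather than the official accuracy script.

    # Illustrative sketch of the closed-division accuracy gate described above.
    BASELINE_ACC = 72.86  # float32 reference accuracy (%)

    def passes_accuracy_check(measured_acc_pct, rel_threshold=0.99, sampling_margin=0.5):
        # 99% of the reference, minus an assumed additive 0.5-point sampling margin
        required = BASELINE_ACC * rel_threshold - sampling_margin
        return measured_acc_pct >= required

    print(passes_accuracy_check(72.20))  # True  (72.86 * 0.99 - 0.5 is about 71.63)
    print(passes_accuracy_check(71.00))  # False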

Performance metrics

In our inaugural release, the benchmark is only tested in the ‘Offline’ scenario, where the performance metric is throughput measured in ‘samples per second’, and per-sample latency is not a performance consideration. We acknowledge that for some use-cases of GNNs, such as recommenders and transportation applications like map data processing, latency can be an important metric to consider and we may expand to consider different scenarios in future rounds of MLPerf Inference.

Conclusion

To address the growing usage of graph neural networks for different applications, a task force investigated and recommended the creation of a GNN benchmark within MLPerf Inference v5.0. 

To represent the unique challenges associated with end-to-end deployment of GNNs, the task force selected a representative benchmark model, RGAT, that works with the largest public graph dataset and a high-precision reference. The benchmark exercises the different stages of the deployment process, including representation at sufficient scope, distributed storage at scale, and efficient computation, and it offers an opportunity for vendors to showcase efficient systems and optimizations for tackling these challenges. In the future, we hope to expand the scope of the benchmark to include real-time datacenter deployment scenarios such as online serving, which is increasingly finding applications in graph-based as well as recommender systems.

Details of the MLPerf Inference RGAT benchmark and reference implementation can be found here.

¹ The DGL neighborhood sampler has a known bug with seeding when the sampling is parallelized across CPUs.

A New Automotive Benchmark for MLPerf Inference v5.0
https://mlcommons.org/2025/04/auto-inference-v5/ | Wed, 02 Apr 2025 14:58:55 +0000
MLCommons has developed a new automotive benchmark based on established industry methods such as PointPainting, DeepLabv3+, and the Waymo Open Dataset.

Intro

MLPerf® Inference v5.0 introduced a new automotive benchmark. From the many potential workloads related to autonomous driving, the automotive benchmark task force chose 3D object detection as the benchmark task. Based on input from experts in the autonomous driving space, the task force selected PointPainting as the benchmark model. PointPainting is a sensor-fusion technique proposed by Vora et al. in 2019 that works in conjunction with image-based semantic segmentation to classify points generated by a lidar system. The experts indicated that 3D object detection is the most useful compute-intensive task to benchmark. We selected PointPainting because it is reasonably accurate, fast for inference, and has representative components of classical and newer 3D detection models.

Model selection

Compared to other machine learning applications, progress in autonomous driving is more cautious: new models must go through extended review to ensure safety and reliability. We wanted to choose a model that contains useful components from proven older models as well as from newer state-of-the-art models used in ongoing research. We also wanted a model that represents multiple sensor modalities while still favoring the most common case of using video/images for vision, with a sensor suite that pairs lower-resolution, affordable lidar with higher-resolution cameras.

Figure 1: Overview of PointPainting using DeepLabv3+ for segmentation and PointPillars for 3D Detection. On the left, five camera images are segmented by Deeplabv3+. In the middle, the segmentation scores are then painted on the lidar point cloud. Finally, the painted points are passed to PointPillars to produce the final 3D bounding boxes.

PointPainting is composed of two models: an image segmentation model that is used to “paint” lidar points by concatenating segmentation scores to them, and a lidar object detection model that takes the painted points as inputs and produces 3D detection predictions. A high-level diagram is shown in Figure 1. We chose DeepLabv3+ with a ResNet-50 backbone as the segmentation model and PointPillars as the lidar detector. Both are popular for their respective tasks and are fast enough for real-time applications. Additionally, with these model choices, about 90% of the PointPainting workload is performed in the segmentation task. Favoring images in the workload makes the model more representative of the most common case of vision tasks while still incorporating lidar components.
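
The “painting” step itself can be sketched in a few lines of NumPy: project each lidar point into the image, look up the per-pixel class scores produced by the segmentation network, and concatenate them onto the point features. The projection function, shapes, and names below are illustrative assumptions, not the reference implementation.

    import numpy as np

    def paint_points(points, seg_scores, project_to_image):
        """points: (N, 4) lidar points (x, y, z, intensity);
        seg_scores: (H, W, C) per-pixel class scores from the segmentation model;
        project_to_image: callable mapping (N, 3) xyz -> (N, 2) pixel coordinates."""
        h, w, _ = seg_scores.shape
        uv = project_to_image(points[:, :3]).round().astype(int)
        uv[:, 0] = uv[:, 0].clip(0, w - 1)        # clamp to image bounds
        uv[:, 1] = uv[:, 1].clip(0, h - 1)
        painted = seg_scores[uv[:, 1], uv[:, 0]]  # (N, C) scores gathered per point
        return np.concatenate([points, painted], axis=1)  # (N, 4 + C) painted points

    # Toy usage with a dummy projection that just scales x/y into pixel space
    pts = np.random.rand(1000, 4)
    scores = np.random.rand(886, 1920, 5)
    print(paint_points(pts, scores, lambda xyz: xyz[:, :2] * [1920, 886]).shape)  # (1000, 9)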

Dataset and task selection

Autonomous vehicles must perform many computational tasks. Our goal was to pick a task that is computationally intensive and representative of the self-driving tasks. To select one task to benchmark, we consulted with experts in the autonomous driving field. We had unanimous agreement that 3D object detection would be the most representative task. 3D tasks are inherently more computationally expensive than 2D tasks and often incorporate 2D models within them. 3D detection is also an important component of other tasks such as trajectory prediction.

We chose the Waymo Open Dataset (WOD) because it is a large, high-quality dataset. Importantly, we verified that MLPerf benchmarking is considered a non-commercial use under the terms of the WOD license. The WOD provides five camera images and one lidar point cloud per frame. The images have resolutions of 1920×1280 and 1920×886 pixels. The data was captured in San Francisco, Mountain View, and Phoenix. The WOD provides annotations for 3D bounding boxes as well as video panoptic segmentation, which is needed for the segmentation network.


There is no public implementation of PointPainting based on WOD, so we trained a model ourselves to determine an accuracy target and generate a reference checkpoint for the benchmark. We built on the open-source work of PointPillars, DeepLabv3+, PointPainting, and MMDetection3D to create our implementation of PointPainting. As a baseline, we trained PointPillars on WOD. Our goal was not to produce the most accurate version of PointPillars but to get a model with an accuracy that is representative of what can be achieved on the WOD dataset. Our PointPillars baseline achieved a mean average precision (mAP) of 49.91 across three classes (Pedestrians, Cyclists, and Vehicles).

Our trained PointPainting model is available to MLCommons® members by request here.

Performance metrics

MLPerf Inference’s single-stream scenario for edge systems represents a task that accepts a single input and measures the latency for producing the output. This scenario is the closest and most relevant for an automotive workload, where frames are received at a fixed rate from sensors. We chose a 99.9% accuracy threshold, as automotive workloads are safety-critical. The reported latency is the 99.9th-percentile latency across the required thousands of inference samples. These requirements balance the need for strong latency constraints with the desire to ensure the benchmark runtime is on the order of one day rather than weeks.

Conclusion

The automotive benchmark task force followed a rigorous process for choosing and building an automotive workload for MLPerf Inference. We made careful decisions regarding the task, dataset, and model to ensure the creation of a robust and representative benchmark. Additionally, we determined accuracy targets and performance metrics that fit within MLPerf Inference while being useful for automotive tasks. MLCommons plans to continue developing automotive benchmarks across a suite of automotive-specific tasks. We welcome feedback from the community regarding this benchmark and future developments.

Additional technical contributors: Pablo Gonzalez, MLCommons; Anandhu Sooraj, MLCommons; Arjun Suresh, AMD (formerly at MLCommons).

MLPerf Inference v5.0 Advances Language Model Capabilities for GenAI
https://mlcommons.org/2025/04/llm-inference-v5/ | Wed, 02 Apr 2025 14:58:06 +0000
MLCommons debuts larger Llama 3.1 405B Instruct and faster Llama 2 Chat 70B interactive LLM benchmarks

Introduction

The Large Language Model (LLM) task force was convened to ensure that the MLPerf® Inference v5.0 benchmarks address the evolving demands of LLMs and low-latency serving in real-world applications. We focus on three key aspects: first, benchmarking the performance of inference platforms on emerging open LLMs; second, ensuring support for the increased context and output sequence lengths required by Retrieval-Augmented Generation (RAG), agentic AI, and advanced reasoning tasks; and third, achieving reasonable response latency to enable efficient and practical deployment in diverse scenarios. These new benchmarks provide a robust framework for evaluating and advancing large-scale LLM inference performance. In MLPerf Inference v5.0, we added two new benchmarks to the suite: Llama 3.1 405B Instruct and Llama 2 70B Interactive.

Llama 3.1 405B Instruct

The selection of Meta’s Llama 3.1 405B Instruct model reflects its popularity and alignment with industry requirements. Several features of Llama 3.1 405B are particularly notable in this context:

  • Long Context Processing: Its benchmark-proven ability to process 128K-token contexts with minimal accuracy degradation addresses the growing need for models to summarize, analyze, and synthesize information from expansive documents while retaining critical details.
  • Structured Data Extraction: Llama 3.1 405B exhibits superior performance on needle-in-a-haystack (NIAH) tasks that extract structured data (e.g., key-value pairs) from noisy or unstructured corpora.
  • Multi-Source Synthesis: Llama 3.1 405B has demonstrated an exceptional capacity for reconciling ambiguous information across sources, notably in literature analysis where it matches GPT-4’s accuracy.
  • Cross-Contextual Reasoning: The model’s design emphasizes linking distant concepts, making it ideal for tasks requiring multi-document synthesis and cross-referencing.

Collectively, these attributes positioned Llama 3.1 405B Instruct as the best candidate, aligning with MLPerf’s mission to benchmark models that power real-world AI systems amid escalating computational and contextual complexity.

Dataset and task selection

Llama 3.1 405B Instruct has a context window of 128,000 tokens, compared to Llama 2 70B’s 4,096 tokens. This expanded context window enables Llama 3.1 405B to handle longer conversations, process larger documents or code samples, and perform tasks requiring extended context, such as long-form text summarization.

The dataset for Llama 3.1 405B Instruct was curated to evaluate its capabilities in long-context comprehension, retrieval-augmented reasoning, and task-specific optimization. The MLCommons® task force integrated samples from a diverse mix of open datasets.

This mix reflects real-world challenges such as synthesizing government reports, pinpointing critical data in cluttered contexts (needle-in-a-haystack), and executing precise document-based QA. With mean input/output context lengths of 9,400 and 680 tokens respectively, our dataset stresses the model’s ability to retain coherence and accuracy across extended sequences. By leveraging Llama 3.1 405B Instruct’s strengths, we achieved alignment between generated outputs and ground-truth references, validating its proficiency in parsing, contextual linking, and synthesizing insights from complex, lengthy inputs.

The Llama 3.1 405B Instruct benchmark employs task-specific accuracy metrics to evaluate performance across the dataset. For summarization tasks (e.g., GovReport-Summary), we adopted ROUGE-L, a widely recognized NLP metric that measures lexical overlap between generated summaries and ground-truth references. For precision-critical applications like needle-in-a-haystack retrieval and document-based QA, we utilized exact match (EM) scoring to ensure alignment with target answers, minimizing tolerance for errors in key-value extraction or factual responses. The accuracy threshold for closed division submissions is set to 99% of the FP16 reference.
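
As a rough illustration of the two scoring styles, the sketch below uses the rouge-score package for ROUGE-L and a naive string comparison for exact match; the normalization here is simplified relative to the benchmark’s actual accuracy scripts.

    from rouge_score import rouge_scorer  # pip install rouge-score

    # Illustrative scoring sketch: ROUGE-L for summarization samples,
    # exact match for retrieval and document-QA samples.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

    def rouge_l(reference, generated):
        return scorer.score(reference, generated)["rougeL"].fmeasure

    def exact_match(reference, generated):
        return float(generated.strip().lower() == reference.strip().lower())

    print(rouge_l("The report recommends increased funding.",
                  "The report recommends more funding."))
    print(exact_match("42", " 42 "))  # 1.0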

Figure 1. Histogram of input and output (both ground truth and reference) token length distribution of the dataset for Llama 3.1 405B Instruct.

Performance metrics

The performance metrics chosen for the Llama 3.1 405B Instruct benchmark are the same as those used for the Llama 2 70B benchmark in MLPerf Inference v4.0: token throughput for the offline scenario and Time To First Token (TTFT) + Time Per Output Token (TPOT) for the server scenario. For more information, please refer to the “Measuring performance of the LLM” section of Llama 2 70B: An MLPerf Inference Benchmark for Large Language Models – MLCommons. To balance the demands of long-context processing with real-world usability, we set a 99th-percentile TTFT limit of 6 seconds and a 99th-percentile TPOT limit of 175 ms. These thresholds reflect the computational challenges of deploying large models with long context while maintaining reasonable responsiveness.

Reference implementation

The reference implementation for Llama 3.1 405B Instruct leverages vLLM, a fast and versatile library for LLM inference and serving. This choice was driven by vLLM’s robust support for large language models and its broad compatibility with a diverse range of hardware, including NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi accelerators, and TPUs, making it an ideal choice for deploying Llama 3.1 405B across different computing environments.

For readers interested in running the model themselves, we encourage them to follow the reference implementation, which contains code and instructions on how to run the entire benchmark end-to-end.
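
For orientation, a minimal offline-style generation loop with vLLM looks roughly like the sketch below. The model identifier, parallelism degree, and sampling settings are illustrative placeholders, not the benchmark configuration, and running a 405B model requires a suitably large multi-accelerator system.

    from vllm import LLM, SamplingParams

    # Minimal offline-style generation sketch with vLLM (settings are illustrative).
    llm = LLM(model="meta-llama/Llama-3.1-405B-Instruct", tensor_parallel_size=8)
    params = SamplingParams(temperature=0.0, max_tokens=680)

    prompts = ["Summarize the key findings of the attached government report."]
    for output in llm.generate(prompts, params):
        print(output.outputs[0].text)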

Llama 2 Chat 70B Interactive Benchmark

The Llama 2 70B benchmark was introduced in MLPerf Inference v4.0 and has quickly become one of the most popular benchmarks in the suite. Server latencies were set, according to the usage models and hardware capabilities of the time, at a 2 s TTFT and 200 ms TPOT. As adoption of generative AI has skyrocketed, state-of-the-art model serving has advanced, and use cases such as reasoning models and agents are taking advantage of ever lower latencies. To set target latencies that are more representative of current requirements, we performed a new analysis of industry research, user surveys, and performance data from leading platforms like ChatGPT and Perplexity AI in late 2024. Our findings indicated that a 50th-percentile token generation rate of 20-50 tokens per second (TPOT of 20-50 ms) is critical for a seamless user experience. To prioritize reliability under peak demand, MLPerf adopts a stricter 99th-percentile threshold of 25 tokens/second (TPOT of 40 ms), ensuring consistent responsiveness even during high-load scenarios. Additionally, we set a 99th-percentile TTFT limit of 450 ms to minimize initial latency, aligning with user expectations for near-instantaneous query response.
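
The interactive latency gate can be pictured as a simple percentile check over per-query measurements, as in the sketch below; the measurements are synthetic and the check is illustrative, not the LoadGen implementation.

    import numpy as np

    # Sketch of the interactive latency gate described above: a valid server-scenario
    # run must keep the 99th-percentile TTFT under 450 ms and the 99th-percentile
    # TPOT under 40 ms across all completed queries.
    def meets_interactive_limits(ttft_s, tpot_s, ttft_limit=0.450, tpot_limit=0.040):
        return (np.percentile(ttft_s, 99) <= ttft_limit and
                np.percentile(tpot_s, 99) <= tpot_limit)

    rng = np.random.default_rng(0)
    ttfts = rng.normal(0.30, 0.03, 10000)   # synthetic per-query TTFT samples (seconds)
    tpots = rng.normal(0.03, 0.002, 10000)  # synthetic per-query TPOT samples (seconds)
    print(meets_interactive_limits(ttfts, tpots))  # True for this synthetic data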

We continued to utilize the OpenOrca dataset and ROUGE accuracy metrics due to their relevance and proximity to chatbot tasks, which align well with the dataset’s diverse conversational scenarios and high-quality annotations.

Conclusion

The usage of LLMs is expanding rapidly across a diversity of application domains, with demand for LLMs at different scales and response latencies. MLPerf is keeping pace with these trends by introducing a very-large-scale 405B benchmark and a low-latency 70B interactive benchmark in the new MLPerf Inference v5.0 release. In this post, we shared the motivation and the complexities involved in the choice of model, task, dataset, accuracy, performance, and verification for such LLM benchmarks. We believe the design choices of these benchmarks objectively address most of the critical deployment decisions faced by practitioners while delivering important performance insights to customers. With the addition of these benchmarks to MLPerf Inference, we now offer a comprehensive suite of language benchmarks spanning a range of scales (7B to 405B), architectures (including MoE), and deployment scenarios (datacenter, edge, and low-latency).

Details of the MLPerf Inference Llama 3.1 405B Instruct benchmark and reference implementation can be found here.

For more information on the Llama 2 70B benchmark, please refer to the “Choosing a task and a dataset for the LLM” section of Llama 2 70B: An MLPerf Inference Benchmark for Large Language Models – MLCommons.

New MLPerf Inference v4.1 Benchmark Results Highlight Rapid Hardware and Software Innovations in Generative AI Systems
https://mlcommons.org/2024/08/mlperf-inference-v4-1-results/ | Wed, 28 Aug 2024 14:56:00 +0000
New mixture of experts benchmark tracks emerging architectures for AI models

Today, MLCommons® announced new results for its industry-standard MLPerf® Inference v4.1 benchmark suite, which delivers machine learning (ML) system performance benchmarking in an architecture-neutral, representative, and reproducible manner. This release includes first-time results for a new benchmark based on a mixture of experts (MoE) model architecture. It also presents new findings on power consumption related to inference execution.

MLPerf Inference v4.1

The MLPerf Inference benchmark suite, which encompasses both data center and edge systems, is designed to measure how quickly hardware systems can run AI and ML models across a variety of deployment scenarios. The open-source and peer-reviewed benchmark suite creates a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry. It also provides critical technical information for customers who are procuring and tuning AI systems.

The benchmark results for this round demonstrate broad industry participation and include the debut of six newly available or soon-to-be-shipped processors:

  • AMD MI300x accelerator (available)
  • AMD EPYC “Turin” CPU (preview)
  • Google “Trillium” TPUv6e accelerator (preview)
  • Intel “Granite Rapids” Xeon CPUs (preview)
  • NVIDIA “Blackwell” B200 accelerator (preview)
  • UntetherAI SpeedAI 240 Slim (available) and SpeedAI 240 (preview) accelerators

MLPerf Inference v4.1 includes 964 performance results from 22 submitting organizations: AMD, ASUSTek, Cisco Systems, Connect Tech Inc, CTuning Foundation, Dell Technologies, Fujitsu, Giga Computing, Google Cloud, Hewlett Packard Enterprise, Intel, Juniper Networks, KRAI, Lenovo, Neural Magic, NVIDIA, Oracle, Quanta Cloud Technology, Red Hat, Supermicro, Sustainable Metal Cloud, and Untether AI.

“There is now more choice than ever in AI system technologies, and it’s heartening to see providers embracing the need for open, transparent performance benchmarks to help stakeholders evaluate their technologies,” said Mitchelle Rasquinha, MLCommons Inference working group co-chair.

New mixture of experts benchmark

Keeping pace with today’s ever-changing AI landscape, MLPerf Inference v4.1 introduces a new benchmark to the suite: mixture of experts. MoE is an architectural design for AI models that departs from the traditional approach of employing a single, massive model; it instead uses a collection of smaller “expert” models. Inference queries are directed to a subset of the expert models to generate results. Research and industry leaders have found that this approach can yield equivalent accuracy to a single monolithic model but often at a significant performance advantage because only a fraction of the parameters are invoked with each query.

The MoE benchmark is unique and one of the most complex implemented by MLCommons to date. It uses the open-source Mixtral 8x7B model as a reference implementation and performs inferences using datasets covering three independent tasks: general Q&A, solving math problems, and code generation.

“When determining to add a new benchmark, the MLPerf Inference working group observed that many key players in the AI ecosystem are strongly embracing MoE as part of their strategy,” said Miro Hodak, MLCommons Inference working group co-chair. “Building an industry-standard benchmark for measuring system performance on MoE models is essential to address this trend in AI adoption. We’re proud to be the first AI benchmark suite to include MoE tests to fill this critical information gap.”

Benchmarking Power Consumption

The MLPerf Inference v4.1 benchmark includes 31 power consumption test results across three submitted systems covering both datacenter and edge scenarios. These results demonstrate the continued importance of understanding the power requirements for AI systems running inference tasks, as power costs are a substantial portion of the overall expense of operating AI systems.

The Increasing Pace of AI Innovation

Today, we are witnessing an incredible groundswell of technological advances across the AI ecosystem, driven by a wide range of providers including AI pioneers; large, well-established technology companies; and small startups. 

MLCommons would especially like to welcome first-time MLPerf Inference submitters AMD and Sustainable Metal Cloud, as well as Untether AI, which delivered both performance and power efficiency results. 

“It’s encouraging to see the breadth of technical diversity in the systems submitted to the MLPerf Inference benchmark as vendors adopt new techniques for optimizing system performance such as vLLM and sparsity-aware inference,” said David Kanter, Head of MLPerf at MLCommons. “Farther down the technology stack, we were struck by the substantial increase in unique accelerator technologies submitted to the benchmark this time. We are excited to see that systems are now evolving at a much faster pace – at every layer – to meet the needs of AI. We are delighted to be a trusted provider of open, fair, and transparent benchmarks that help stakeholders get the data they need to make sense of the fast pace of AI innovation and drive the industry forward.”

View the Results

To view the results for MLPerf Inference v4.1, please visit the Datacenter and Edge benchmark results pages.

To learn more about selection of the new MoE benchmark read the blog.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI Safety.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or contact participation@mlcommons.org.

Mixtral 8x7B: a new MLPerf Inference benchmark for mixture of experts
https://mlcommons.org/2024/08/moe-mlperf-inference-benchmark/ | Wed, 28 Aug 2024 14:54:00 +0000
MLPerf task force shares insights on the design of its mixture of experts large language model benchmark

As increasingly complex large language models (LLMs) emerge, the need for benchmarks that accurately reflect real-world applications becomes more pressing. MLPerf aims to select representative models widely used by customers in the industry, ensuring that the benchmarks are both realistic and relevant. By doing so, MLPerf provides valuable insights that help consumers make informed decisions about application deployment and budgeting. Additionally, these benchmarks enable vendors to optimize workloads effectively, catering to the practical needs of their customers while navigating the evolving AI technology landscape.

Building on LLM benchmarks already in MLPerf Inference suite

The MLPerf Inference suite already contains GPT-J and Llama 2 70B large language models, which were added in the v3.1 and v4.0 rounds, respectively. We set out to expand on the existing LLM benchmarks in the MLPerf Inference v4.1 suite by incorporating a mixture of experts (MoE) model.

The MoE architecture is behind some of the most popular and powerful large language models (LLMs) in the field, making it an ideal candidate for inclusion in our benchmark suite. MoE models are particularly appealing due to their ability to manage computational resources efficiently while maintaining high performance. By adding an MoE model, we can provide a more comprehensive evaluation of LLM capabilities that reflects the latest advancements in AI technology.

For readers interested in learning more about mixture of experts, please refer to the detailed explanation available in this article: Mixture of Experts Explained (huggingface.co).

Illustration of the sparse Switch FFN layer in the mixture of experts network architecture (from the Switch Transformers paper).
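
To make the figure concrete, here is a minimal top-k gated mixture-of-experts feed-forward layer in PyTorch. The dimensions, expert structure, and routing details are simplified illustrative assumptions; this is not Mixtral 8x7B's implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyMoEFFN(nn.Module):
        # A router scores all experts per token, only the top-k experts run for that
        # token, and their outputs are combined using the renormalized router weights.
        def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
            super().__init__()
            self.router = nn.Linear(d_model, num_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(num_experts))
            self.top_k = top_k

        def forward(self, x):                     # x: (tokens, d_model)
            scores = self.router(x)               # (tokens, num_experts)
            weights, idx = scores.topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)  # renormalize over the chosen experts
            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e      # tokens routed to expert e in this slot
                    if mask.any():
                        out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
            return out

    moe = TinyMoEFFN()
    print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])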

The multi-faceted abilities of MoE models require sophisticated accuracy and performance testing that goes beyond the single-task evaluation that has been the norm for MLPerf Inference benchmarks to date.

To take on these challenges, the MLPerf Inference working group created a technical task force to study the workload and agree on a representative benchmark. The task force goals were:

  1. Understand how MoE models are used in the industry
  2. Survey MoE models and identify those that can be used for MLPerf MoE benchmark
  3. Choose an MoE model for the benchmark 
  4. Identify tasks and datasets for the benchmark
  5. Create and test the reference implementation

Below, we describe how and why the task force evaluated the choices that led to the creation of the new MoE benchmark.

Model selection

In choosing models for the MLPerf Inference benchmark, we considered several mixture of experts (MoE) models, including DBRX, Grok, Mixtral 8x7B, and Mixtral 8x22B. Ultimately, Mixtral 8x7B was selected due to its widespread deployment and strong problem-solving capabilities across various domains. Mixtral 8x7B has gained popularity for its robust performance in handling diverse tasks, making it a good candidate for evaluating reasoning abilities. Its versatility in solving different types of problems provides a reliable basis for assessing the model’s effectiveness and enables the creation of a benchmark that is both relevant and comprehensive. In addition, Mixtral 8x7B’s size strikes a solid balance between performance and accuracy; it also allows for implementations across a wide range of processors.

Dataset and task selection

Using a single task for evaluation would not be sufficient to capture the full spectrum of the capabilities of a MoE model. Therefore, we thoroughly examined available literature to identify multiple tasks to include in the benchmark that would be both challenging and representative of real-world applications. Three were selected – Q&A, math reasoning, and coding – and we explain the rationale and selection processes in the following sections. The dataset for this combination of tasks is composed of 5,000 samples for each task, making the final dataset 15,000 samples.

Q&A task

For the Q&A benchmark dataset, we continue to use the Open Orca dataset, which aligns with the approach taken by our Llama 2 70B benchmark. Open Orca is recognized for its high quality and extensive coverage, making it an excellent choice for evaluating the question-answering capabilities of LLMs. This dataset combines augmented data from the FLAN Collection with a million GPT-4 completions, providing a rich resource for assessing natural language processing abilities.

We randomly sampled 5,000 examples after filtering the original dataset following the same curation methodology as the Llama 2 70B benchmark. We used ROUGE metrics for accuracy measurement, following in the footsteps of MLPerf Inference v4.0.

For more details, please refer to Llama 2 70B: An MLPerf Inference Benchmark for Large Language Models – MLCommons

Question:
Leo: If “A woman sitting on some steps smokes a cigarette while accessing her laptop.” does that mean that “A person sitting on some steps smokes a cigarette while accessing her laptop.”?
Options:
- yes
- it is not possible to tell
- no

Answer:
Yes, the sentence “A woman sitting on some steps smokes a cigarette while accessing her laptop” can be changed to “A person sitting on some steps smokes a cigarette while accessing her laptop.” The gender of the person has been changed from specifically a woman to a person in general. However, it is important to note that the pronoun “her” is still used, which indicates that the person referred to is female. If you want to make the sentence completely gender-neutral, you could change “her laptop” to “the laptop” or “a laptop.”

A sample question from the Open Orca dataset (GPT-4 partition), and the response of Mixtral 8x7B.

Math reasoning task

We selected GSM8K (Grade School Math 8K) as the math reasoning dataset to evaluate the reasoning capabilities of LLMs like Mixtral 8x7B. GSM8K consists of 8,792 high-quality, linguistically diverse, grade school math word problems that require multi-step reasoning to solve. These problems are designed to be solvable by a bright middle school student and primarily involve performing a sequence of elementary calculations using basic arithmetic operations. The dataset’s importance lies in its ability to assess an LLM’s capacity for complex problem-solving, as the questions typically require two to eight steps to reach the final answer. We randomly sampled 5,000 questions from the dataset.

In setting up the benchmarks, we utilize exact match (EM) as the evaluation score, which measures the model’s ability to produce the precise correct answer, reflecting its accuracy in reasoning and computation. Additionally, a 5-shot approach is employed, where the model is provided with five examples before attempting to solve the problems. This method helps gauge the model’s ability to learn from limited examples and apply that learning to new, similar tasks, further highlighting its reasoning prowess.

Question:
The square footage of the two bedrooms in the apartment that Jenny and Martha share totals 300 square feet. If Jenny’s bedroom is 60 square feet larger than Martha’s, how large, in square feet, is Martha’s bedroom?

Answer:
Let’s denote the size of Martha’s bedroom as M. Jenny’s bedroom is then M + 60 square feet. The total size of the two bedrooms is 300 square feet. So we have the equation M + (M + 60) = 300. Solving this equation for M, we get 2M + 60 = 300, which simplifies to 2M = 240, so M = 120. The answer is 120 square feet.

A sample question from GSM8K, and the response of 5-shot Mixtral 8x7B.
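
As a rough sketch of how exact-match scoring works for answers like the one above, the code below pulls the final number out of a response and compares it with the reference; the extraction rule is ours and is simpler than the benchmark's evaluation code.

    import re

    # Illustrative exact-match scoring for GSM8K-style answers: take the last number
    # in the model's response and compare it with the reference answer.
    def extract_last_number(text):
        numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
        return numbers[-1] if numbers else None

    def exact_match(response, reference):
        return float(extract_last_number(response) == extract_last_number(reference))

    resp = "So we have 2M + 60 = 300, which gives M = 120. The answer is 120 square feet."
    print(exact_match(resp, "120"))  # 1.0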

Coding task

For the code generation task, the task force chose the MBXP (Mostly Basic X [Programming language] Problems) dataset created by the AWS AI Labs team. We came to this decision after evaluating many popular code generation benchmarks such as MBPP (Mostly Basic Python Problems), HumanEval, Multi-HumanEval, CoNaLa, and CodeAlpaca. We chose MBXP because of its wide range of languages and longer input prompts.

MBXP extends the popular MBPP dataset by including 12 other languages: Java, C++, JavaScript, TypeScript, Kotlin, Scala, PHP, Ruby, C#, Go, Swift, and Perl. The same core problems from MBPP are included, but they are extended with the function signature in the language of choice, which hints at which language the model should use.

We made several modifications to the dataset when adapting it for our use. The first was to apply the Mistral prompt format to the prompts in the dataset so that the model could better understand the task it was being presented with. Furthermore, we appended the footer “Here’s the completed code:\n\n```$LANGUAGE” to each prompt after discovering that Mixtral 8x7B’s code generation performance sharply decreases when generating code outside of markdown blocks.
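
The preprocessing described above can be sketched as a small helper that wraps an MBXP problem in the Mistral instruction format and appends the footer. The helper name and exact template below are our paraphrase, not the benchmark's code.

    # Illustrative sketch of the prompt preprocessing described above: wrap the MBXP
    # problem in the Mistral instruction format and append the markdown footer that
    # steers Mixtral 8x7B toward emitting code inside a fenced block.
    def build_mbxp_prompt(problem_code, language):
        instruction = ("Complete the following code. Be concise, don't output anything "
                       "that isn't necessary.\n" + problem_code)
        footer = "Here's the completed code:\n\n```" + language
        return "<s> [INST] " + instruction + " [/INST]" + footer

    print(build_mbxp_prompt("vector<int> combineLists(vector<int> a, vector<int> b) {", "cpp"))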

In many code generation benchmarks, the metric used to evaluate performance is pass@k, which gives the language model k attempts to generate code that successfully passes a set of tests. The MBXP benchmark utilizes this metric by providing a comprehensive series of tests for each problem in each language, all conveniently organized within the mxeval library. For each problem in the dataset, mxeval integrates the generated code with the corresponding test suite into a single file and executes it. A successful run, indicated by an exit code of 0, is marked as a pass by mxeval. Conversely, if errors occur, such as test failures or issues with code compilation or interpretation, the run is marked as a failure. In our evaluation, we opted to use pass@1 to minimize unnecessarily long inference times, ensuring a more efficient assessment process.
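
A generic pass@1 harness mirroring the behavior described above might look like the sketch below; it is illustrative and is not the mxeval library.

    import os
    import subprocess
    import tempfile

    # Sketch of pass@1 scoring: write the generated solution and its test suite to one
    # file, execute it, and treat an exit code of 0 as a pass.
    def pass_at_1(solution_code, test_code, interpreter="python3", timeout=30):
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(solution_code + "\n\n" + test_code)
            path = f.name
        try:
            result = subprocess.run([interpreter, path], capture_output=True, timeout=timeout)
            passed = result.returncode == 0
        except subprocess.TimeoutExpired:
            passed = False
        finally:
            os.remove(path)
        return 1.0 if passed else 0.0

    solution = "def add(a, b):\n    return a + b"
    tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
    print(pass_at_1(solution, tests))  # 1.0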

For the split of the final dataset, we selected a subset of 5,000 samples. Instead of sampling equally from every language, we first decided to restrict our search to the languages in which Mixtral 8x7B is most fluent. The reasoning behind this was that the accuracy metric aims to highlight accuracy regressions, which are easily hidden in languages where accuracy is already poor. Furthermore, JVM-based languages were omitted because evaluating them led to long evaluation times. These restrictions led us to sample problems from C++, JavaScript, TypeScript, PHP, Python, and Ruby, providing a mix of syntaxes (C-style versus more exotic) and a mix of compiled and interpreted languages.

Problem:

    <s> [INST] Complete the following code. Be concise, don't output anything that isn't necessary.
    #include <bits/stdc++.h>
    using namespace std;

    /**
     * Write a function to combine two given sorted lists using heapq module.
     * > combineLists(vector<int>{1, 3, 5, 7, 9, 11}, vector<int>{0, 2, 4, 6, 8, 10})
     * {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
     * > combineLists(vector<int>{1, 3, 5, 6, 8, 9}, vector<int>{2, 5, 7, 11})
     * {1, 2, 3, 5, 5, 6, 7, 8, 9, 11}
     * > combineLists(vector<int>{1, 3, 7}, vector<int>{2, 4, 6})
     * {1, 2, 3, 4, 6, 7}
     */
    vector<int> combineLists(vector<int> num1, vector<int> num2) { [/INST]Here's the completed code:
    ```cpp

Solution:

    #include <bits/stdc++.h>
    using namespace std;

    vector<int> combineLists(vector<int> num1, vector<int> num2) {
        vector<int> result;
        merge(num1.begin(), num1.end(), num2.begin(), num2.end(), back_inserter(result));
        return result;
    }
    ```

A sample problem from MBXP, and the response of Mixtral 8x7B. The footer (“Here’s the completed code:” and the opening ```cpp fence) is inserted via preprocessing, as described above.

Performance metrics

We used the same performance metrics as those used in the Llama 2 70B benchmark: token throughput for the offline scenario, and TTFT + TPOT for the server scenario. For more information, please refer to the “Measuring performance of the LLM” section of Llama 2 70B: An MLPerf Inference Benchmark for Large Language Models – MLCommons.

Reference Implementation

For readers interested in running the MoE model themselves, we encourage them to follow the reference implementation, which contains code and instructions on how to run the entire benchmark end-to-end and is available in the MLCommons inference GitHub repository.

Conclusion

In the MLPerf Inference v4.1 round, there were eight Mixtral 8x7B submissions with 34 results! We hope to see more submissions in subsequent rounds as this model becomes more popular with users and as more hardware and software options optimized for the model come to market.

The post Mixtral 8x7B: a new MLPerf Inference benchmark for mixture of experts appeared first on MLCommons.

]]>
SDXL: An MLPerf Inference benchmark for text-to-image generation
https://mlcommons.org/2024/08/sdxl-mlperf-text-to-image-generation-benchmark/ | Wed, 28 Aug 2024 14:45:00 +0000
MLPerf task force shares insights on the design of its text-to-image benchmark

Introduction

Generative AI models have been getting a lot of attention in recent years, and image generation models in particular have piqued interest among researchers, developers, and the public. The widespread adoption of image generation motivated us to add a new benchmark to the v4.0 round of MLPerf Inference using the Stable Diffusion XL model. This blog describes the complexities involved in benchmark construction such as model and dataset selection, with a particular focus on accuracy metrics. We believe these design choices address most of the critical deployment decisions faced by practitioners while delivering important performance insights to customers.

The evolution of image generation models can be grouped into three architectural eras: autoencoders, generative adversarial networks (GANs), and diffusion models. 

Autoencoders are characterized by an encoder-decoder architecture where the encoder compresses a high-dimensional image into a low-dimensional representation in a latent space. When the latent space is mapped into a probability distribution, this network turns into a generative model where we sample latent representations and project them into image space using a decoder. This class of models is called variational autoencoders (VAEs). VAEs set a benchmark in terms of image quality and ease of training, and this type of model is still used as a key component in diffusion models.

GANs, first published in 2014, employ the novel idea of dueling networks: a generator and a discriminator. The generator is trained to take in Gaussian noise and produce realistic images, and the discriminator is trained to identify whether the generated image is real or fake. Once trained, the generator can be used as a stand-alone network to produce realistic images. One key drawback of GANs is that they tend to be challenging to train; it can be difficult to get them to converge, and sometimes they experience mode collapse where the GAN converges on a very limited set of outputs, limiting their expressiveness. 

Diffusion models, the latest development in generative image models, offer the potential to generate a broader range of images with higher fidelity than GANs. Inspired by physics, the idea behind diffusion models is to learn the structure of the dataset by modeling the way data diffuses through the latent space. Diffusion models are trained to add and remove small amounts of Gaussian noise in repeated steps. During inference or generation, the model undergoes a reverse diffusion process where it considers a text prompt, starts with a randomly generated noisy image, and gradually works to denoise it until a predefined number of steps is reached.

Many diffusion models with varying approaches, such as DALL-E, Midjourney, and Imagen, have shown great potential for generating realistic images in recent years. One of the most important breakthroughs for diffusion models was the introduction of latent diffusion models (LDMs) in December 2021. The key differentiator in this family of models is that the entire diffusion process happens in latent space instead of pixel space. A VAE encoder compresses the images into a low-dimensional latent vector, the diffusion process is performed with a U-Net backbone, and a VAE decoder finally projects the latent back to pixel space. This arrangement reduces the cost of training and inference substantially because the expensive denoising process happens in the latent space and therefore involves fewer parameters.

With increasing interest and adoption of LDMs and text-to-image models in general, MLCommons believed that a fair and robust benchmark in this domain for the deep learning community was needed. With this goal in mind, the MLPerf Inference working group formed a special task force in September 2023 to explore, define, and create a text-to-image-diffusion benchmark.

Choosing the model and its scheduler

The task force looked for candidate models that were openly available with compatible licenses, were popular in use, and produced state-of-the-art results. The candidates considered included Stable Diffusion (SD) 1.5, SD 2.1, and Stable Diffusion XL (SDXL) 1.0. After deliberation, the task force chose SDXL 1.0 for the benchmark, which was subsequently introduced in the MLPerf Inference v4.0 benchmark round.

Stable Diffusion XL 1.0 is a diffusion model developed by Stability AI. By September 2023, it was the latest and largest open source text-to-image model that generated high-resolution (1024×1024) images with default settings. In comparison, SD 2.1 generated 768×768 images, and SD 1.5 generated 512×512 images. Higher resolution matters because it allows for greater detail and clarity in the generated images, making them more visually appealing and useful for various applications.

SDXL has two variants: SDXL-base (roughly 3.5 billion parameters) and SDXL-refiner (about 6.6 billion parameters). Both are significantly larger and more challenging than SD v2.1 (860 million parameters) and SD v1.5 (983 million parameters), adding a quantitative novelty to the MLPerf Inference benchmark suite. The refiner model improves the fine-grained details in the generated images, as illustrated in Figure 1. However, this modest qualitative improvement comes at the cost of nearly double the parameters and compute requirements. Overall, we believe SDXL-base is a good compromise between practicality and pushing the limits of our machine-learning capabilities.

Figure 1: Left is output from SDXL-base, right is output from SDXL-refiner. Note the slight improvement in the shadowed skin texture on the neck in the right image compared to the left. Image courtesy: https://www.reddit.com/r/StableDiffusion/comments/15amidh/do_we_really_need_the_refiner_for_sdxl/

As depicted in Figure 2, SDXL has two dedicated networks, CLIP1 and CLIP2, that encode text prompt tokens into an embedding space. Using two networks captures semantic information better than the single CLIP network used in other diffusion models. SDXL also leverages a larger U-Net backbone than SD. The scheduler and the U-Net together denoise the image in latent space iteratively. The VAE decoder then decodes the image latent into a 1024×1024 RGB image at the final stage.

Figure 2: Components of the text-to-image SDXL base pipeline

Schedulers are the algorithms that run multiple times iteratively to create a clean image from a completely random noisy image. The choice of scheduler determines the exact algorithm used to obtain a less-noisy latent. Some of the popular schedulers used for Stable Diffusion are DDIM, DDPM, and EulerDiscrete. Many scheduler algorithms involve a trade-off between denoising speed and quality, and each one works best with a particular number of steps. The speed of denoising mostly depends on the number of steps needed to reach a high-quality image.

The task force chose EulerDiscrete because it is one of the fastest schedulers available and has been widely adopted by users. From a qualitative standpoint, EulerDiscrete typically can produce clear images with almost no color artifacts in 20-30 steps, while DDIM requires around 50 steps. Quantitatively, the task force experimented with Hugging Face diffusers and showed that EulerDiscrete with 20-30 steps produced FID scores (a common quality metric that we discuss later) similar to those of DDIM scheduler with 50 steps.
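
For readers who want to try this configuration themselves, the snippet below is a minimal sketch (not the official reference implementation) of loading SDXL-base with the Hugging Face diffusers library, swapping in the EulerDiscrete scheduler, and generating with 20 denoising steps; the model identifier and prompt are illustrative.

```python
# Minimal sketch: SDXL-base with the EulerDiscrete scheduler (illustrative, not the
# official MLPerf reference implementation).
import torch
from diffusers import StableDiffusionXLPipeline, EulerDiscreteScheduler

model_id = "stabilityai/stable-diffusion-xl-base-1.0"  # assumed Hugging Face model id

pipe = StableDiffusionXLPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
# Replace the default scheduler with EulerDiscrete, reusing its configuration.
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

# 20 denoising steps, matching the benchmark setting discussed above.
image = pipe(
    prompt="some meat and vegetables are arranged on a white plate",
    num_inference_steps=20,
).images[0]
image.save("sdxl_sample.png")
```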

Choosing the dataset

The choice of dataset is pivotal for the evaluation of an image-generation model. Each image in the dataset should be paired with a descriptive prompt that effectively communicates the visual content and context. This pairing enables the model to be assessed on its ability to generate images that are not only visually appealing but also contextually appropriate. The dataset should be diverse enough to cover various scenarios and object types in order to ensure the model’s performance is tested across different contexts.

The task force determined that the Common Objects in Context (COCO) dataset was an excellent match for this task. This dataset is recognized in the ML community for its diversity and for the richness of context in its image-caption pairs. A random subset of 5,000 prompt-image pairs was selected for the benchmark. This subset size is sufficient to ensure broad coverage and variety while remaining manageable in computational cost. Cross-validation tests verified that this random selection does not significantly alter the quality metrics. Additionally, COCO’s widespread use for similar tasks in the research community ensures relevance and comparability with the overall field.

Downloading the COCO images is not necessary for the benchmark evaluation. Instead, the FID score can be computed using pre-calculated statistics from the COCO dataset. Similarly, when using the CLIP score, the focus is on how well the output images correlate with the input prompts rather than with the ground-truth images.

Negative prompts are an extremely common guiding mechanism for image generation. Users can specify what they want to exclude from generation and also steer the output image away from undesired stylistic elements (e.g., avoiding cartoonish styles or unwanted elements like cars or buildings). Unfortunately, the task force could not find a suitable dataset with negative prompts. Since this is a feature that is crucial to end users and also impacts performance, several prompts were analyzed and a single common negative prompt was selected for all samples. The prompt “normal quality, low quality, worst quality, low res, blurry, nsfw, nude” generated the best results while adding content-moderation safeguards beyond what a filtered training dataset provides, and it was therefore used in the benchmark.
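
As an illustration of how a single common negative prompt enters the generation call, here is a hedged sketch reusing the diffusers pipeline from the earlier snippet; the exact wiring in the reference implementation may differ.

```python
# Sketch: applying one common negative prompt to every sample.
# Assumes `pipe` is the SDXL pipeline constructed earlier.
NEGATIVE_PROMPT = "normal quality, low quality, worst quality, low res, blurry, nsfw, nude"

def generate(pipe, prompt: str, steps: int = 20):
    """Generate one image while steering away from undesired characteristics."""
    return pipe(
        prompt=prompt,
        negative_prompt=NEGATIVE_PROMPT,
        num_inference_steps=steps,
    ).images[0]
```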

Challenges of measuring accuracy

Evaluating generative AI models designed for image generation presents several unique challenges that we had to resolve to construct a representative and effective benchmark. A primary issue is the absence of a definitive “ground truth,” since the success of these models often depends on how well the generated images meet varying user expectations rather than adhering to an objective standard. Furthermore, the stochastic nature of these models means that output depends on the initial noise that seeds the model. Although the initial noise input can be controlled, as seen in Figures 3 and 4, the model pipeline remains sensitive to changes in hardware and software configurations. Minor adjustments to the pipeline (such as the use of quantization) can lead to wildly varying outputs. This variability rules out the use of error-based metrics such as mean squared error (MSE), root mean squared error (RMSE), or peak signal-to-noise ratio (PSNR), while also challenging the applicability of structural similarity index measure (SSIM)-based methods.

Figure 3: Two images created from the same prompt showcasing different optimization techniques. The image on the left was generated by an FP16 model, while the image on the right comes from an FP16+INT8 model featuring deep cache.

Figure 4: Visualization of the absolute error on each pixel when FP8 output from Intel Gaudi is compared to FP32 reference. The prompt was “some meat and vegetables are arranged on a white plate.”

In response to these challenges, the community has developed metrics to assess image-generating models quantitatively. The Fréchet inception distance (FID) measures the similarity between two sets of images (generated and real) by computing the Fréchet distance between two Gaussians fitted to feature vectors in an embedded space, typically derived from a pretrained inception network. However, the FID score alone is insufficient since it does not assess whether images accurately respond to the given prompts. So, the contrastive language-image pre-training (CLIP) score was chosen as an additional accuracy metric to evaluate the semantic relevance of the generated images to their textual descriptions. Since it is possible to optimize for one of these two metrics at the expense of the other, we explicitly require that both FID and CLIP scores be used in conjunction to ensure balanced optimization.
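
To make the two metrics concrete, the sketch below shows one way to compute FID and CLIP scores with the torchmetrics package; the benchmark’s accuracy scripts may use different tooling, and the image tensors here are random placeholders.

```python
# Sketch: computing FID and CLIP score with torchmetrics (placeholder tensors).
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

# FID compares the distribution of generated images against real (COCO) images.
fid = FrechetInceptionDistance(feature=2048)
real_images = torch.randint(0, 255, (8, 3, 299, 299), dtype=torch.uint8)       # placeholder
generated_images = torch.randint(0, 255, (8, 3, 299, 299), dtype=torch.uint8)  # placeholder
fid.update(real_images, real=True)
fid.update(generated_images, real=False)
print("FID:", fid.compute().item())

# CLIP score measures how well each generated image matches its text prompt.
clip = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
prompts = ["some meat and vegetables are arranged on a white plate"] * 8
print("CLIP:", clip(generated_images, prompts).item())
```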

While these metrics can gauge certain aspects of the accuracy or quality of an image generation model, they have some shortcomings. Both metrics rely on pre-trained neural networks and can be sensitive to the choice of feature space and vulnerable to adversarial attacks. Also, they may not always correlate with human judgment of image quality. Moreover, FID requires a large sample of images to measure the distribution of a generated image dataset representatively, which can increase the cost and complexity of evaluation cycles. We acknowledge these limitations and are open to updating the benchmark in response to advances in this field.

Determining the accuracy threshold

Even though the U-Net latent noise input is fixed and the randomness is suppressed in our benchmark pipeline, different hardware and software can still produce varied outputs. To determine a reasonable accuracy threshold, the task force obtained the mean FP32-precision FID and CLIP scores on accelerators from multiple vendors for the 5,000-sample evaluation dataset. A ±2% variation limit for FID and a ±0.2% variation limit for CLIP were set around this mean; FID was found to be more sensitive to hardware variation, hence the larger range. In this benchmark, a valid submission needs to achieve an FID score within the range [23.01085758, 23.95007626] and a CLIP score within [31.68631873, 31.81331801].
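
As a worked example of how those ranges follow from the tolerances, the snippet below back-computes the implied mean scores and re-derives the published bounds; the mean values are inferred from the ranges in this post and are illustrative, not official benchmark constants.

```python
# Sketch: re-deriving the accuracy bounds from an implied mean and the stated tolerances.
fid_range = (23.01085758, 23.95007626)
clip_range = (31.68631873, 31.81331801)

fid_mean = sum(fid_range) / 2    # ~23.4805 (implied from the range, not an official number)
clip_mean = sum(clip_range) / 2  # ~31.7498 (implied from the range, not an official number)

fid_bounds = (fid_mean * (1 - 0.02), fid_mean * (1 + 0.02))       # +/- 2%
clip_bounds = (clip_mean * (1 - 0.002), clip_mean * (1 + 0.002))  # +/- 0.2%

print(fid_bounds)   # ~ (23.0109, 23.9501)
print(clip_bounds)  # ~ (31.6863, 31.8133)
```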

Measuring performance

Figure 5: Generated images using SDXL with EulerDiscrete scheduler with 50 denoising steps (left) and 20 steps (right). There is a slight difference observed between 50 and 20 steps but no significant noise.

The run time of this particular benchmark depends on the number of denoising steps and the size of the validation set. The task force conducted experiments to determine the optimum parameters that minimize the run time for performance and accuracy runs while maintaining the quality of generated images (an example comparison is shown in Figure 5). As a result, 20 denoising steps and 5,000 validation samples were chosen.

Robustness of the benchmark

Output reproducibility

SDXL starts image generation with a latent space filled with random numbers drawn from the standard normal distribution. In each subsequent denoising step, a new latent with less noise is generated. In most applications, the initial latent state is defined by a user-specified seed value for the random number generator. This arrangement allows the same image to be regenerated later with more denoising steps and/or different sampler settings.

The StableDiffusionXLPipeline used in our reference implementation relies on torch.randn, which does not guarantee determinism across PyTorch releases or between different accelerators. MLPerf Inference rules restrict non-determinism, but submitters are free to use any framework for their submission. To ensure determinism, the task force decided to generate the latent representations in advance and provide them along with the reference implementation. Images generated from different accelerators with such a constant latent can be seen in Figure 6. This configuration helps verify whether a submission maintains the model equivalence rule, and it also enables the tightening of the FID and CLIP accuracy thresholds.
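
A minimal sketch of how a fixed latent can be produced with a seeded generator and fed to the pipeline is shown below; the actual latents shipped with the reference implementation are precomputed files, so the seed and shapes here are illustrative only.

```python
# Sketch: generating a fixed latent with a seeded generator and passing it to the
# SDXL pipeline so every run starts from the same noise (seed/shape are illustrative).
import torch

generator = torch.Generator(device="cpu").manual_seed(0)  # illustrative seed
# SDXL-base denoises 4-channel 128x128 latents for 1024x1024 output images.
latents = torch.randn((1, 4, 128, 128), generator=generator)

image = pipe(  # `pipe` is the SDXL pipeline constructed earlier
    prompt="some meat and vegetables are arranged on a white plate",
    num_inference_steps=20,
    latents=latents.to(device="cuda", dtype=torch.float16),
).images[0]
```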


Figure 6: Similar images are generated from four different accelerators (Qualcomm Cloud AI 100, NVIDIA H100, Intel Gaudi 2, and the Hugging Face FP32 reference) thanks to setting a constant latent in the benchmark.

Compliance check

As noted earlier, FID and CLIP have limited capability to capture the aesthetic quality of generated images. Furthermore, because non-determinism is restricted, the generated images from each submission are expected to be visually similar to the reference images produced by the PyTorch FP32 reference implementation. The task force therefore chose to add a human compliance check to the submission and review process for SDXL.

The task force decided to include a check that targets 10 image indices chosen using a random seed generator. Submitters are required to include the 10 images generated for these prompts along with their submissions. These images are inspected by submitters during the review stage of the results submission process. Any image determined to be pure noise, such as a white-noise image, or an image that presents completely different semantic information than its text prompt, will disqualify the corresponding submission. After successfully passing this human inspection, these images are removed from each submission before its publication.

Conclusion

In this blog post, we described the new SDXL benchmark introduced in the v4.0 round of MLPerf Inference, including the complexities involved in the choices of model and dataset. In particular, the current accuracy metrics for image comparisons have notable limitations, so we undertook several measures to ensure a robust benchmark. We believe these design choices capture the real challenges faced by practitioners and accurately and fairly represent performance for customers looking to adopt image generation.

Additional details about the MLPerf Inference Stable Diffusion benchmark and reference implementation can be found here.

New MLPerf Inference Benchmark Results Highlight The Rapid Growth of Generative AI Models https://mlcommons.org/2024/03/mlperf-inference-v4/ Wed, 27 Mar 2024 14:56:00 +0000 http://local.mlcommons/2024/03/mlperf-inference-v4/ With 70 billion parameters, Llama 2 70B is the largest model added to the MLPerf Inference benchmark suite.

Today, MLCommons announced new results from our industry-standard MLPerf Inference v4.0 benchmark suite, which delivers industry standard machine learning (ML) system performance benchmarking in an architecture-neutral, representative, and reproducible manner.

MLPerf Inference v4.0

The MLPerf Inference benchmark suite, which encompasses both data center and edge systems, is designed to measure how quickly hardware systems can run AI and ML models in a variety of deployment scenarios. In order to keep pace with today’s ever-changing generative AI landscape, the working group created a new task force to determine which of these models should be added to the v4.0 version of the benchmark. This task force analyzed factors including model licenses, ease of use and deployment, and accuracy in their decision-making process.

After careful consideration, we opted to include two new benchmarks to the suite. The Llama 2 70B model was chosen to represent the “larger” LLMs with 70 billion parameters, while Stable Diffusion XL was selected to represent text-to-image generative AI models.

Llama 2 70B is an order of magnitude larger than the GPT-J model introduced in MLPerf Inference v3.1 and correspondingly more accurate. One of the reasons it was selected for inclusion in the MLPerf Inference v4.0 release is that this larger model size requires a different class of hardware than smaller LLMs, which provides a great benchmark for higher-end systems. We are thrilled to collaborate with Meta to bring Llama 2 70B to the MLPerf Inference v4.0 benchmark suite. You can learn more about the selection of Llama 2 in this deep-dive post.

“Generative AI use-cases are front and center in our v4.0 submission round,” said Mitchelle Rasquinha, co-chair of the MLPerf Inference working group. “In terms of model parameters, Llama 2 is a dramatic increase to the models in the inference suite. Dedicated task-forces worked around the clock to set up the benchmarks and both models received competitive submissions. Congratulations to all!”

For the second new generative AI benchmark, the task force chose Stability AI’s Stable Diffusion XL, with 2.6 billion parameters. This popular model is used to create compelling images through a text-based prompt. By generating a high number of images, the benchmark is able to calculate metrics such as latency and throughput to understand overall performance.

“The v4.0 release of MLPerf Inference represents a full embrace of generative AI within the benchmark suite,” said Miro Hodak, MLPerf Inference co-chair. “A full third of the benchmarks are generative AI workloads: including a small and a large LLM and a text-to-image generator, ensuring that MLPerf Inference benchmark suite captures the current state of the art.”

MLPerf Inference v4.0 includes over 8500 performance results and 900 Power results from 23 submitting organizations, including: ASUSTeK, Azure, Broadcom, Cisco, CTuning, Dell, Fujitsu, Giga Computing, Google, Hewlett Packard Enterprise, Intel, Intel Habana Labs, Juniper Networks, Krai, Lenovo, NVIDIA, Oracle, Qualcomm Technologies, Inc., Quanta Cloud Technology, Red Hat, Supermicro, SiMa, and Wiwynn.

Four firms (Dell, Fujitsu, NVIDIA, and Qualcomm Technologies, Inc.) submitted data center-focused power numbers for MLPerf Inference v4.0. The power tests require power-consumption measurements to be captured while the MLPerf Inference benchmarks are running, and the ensuing results indicate the power-efficient performance of the systems tested. The latest submissions demonstrate continued progress in efficient AI acceleration.

“Submitting to MLPerf is quite challenging and a real accomplishment”, said David Kanter, executive director of MLCommons. “Due to the complex nature of machine-learning workloads, each submitter must ensure that both their hardware and software stacks are capable, stable, and performant for running these types of ML workloads. This is a considerable undertaking. In that spirit, we celebrate the hard work and dedication of Juniper Networks, Red Hat, and Wiwynn, who are all first-time submitters for the MLPerf Inference benchmark.”

View the Results

To view the results for MLPerf Inference v4.0 visit the Datacenter and Edge results pages.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire machine learning industry through benchmarks and metrics, public datasets, and best practices.

For additional information on MLCommons and details on becoming a member or affiliate, please visit MLCommons.org or email participation@mlcommons.org.

Llama 2 70B: An MLPerf Inference Benchmark for Large Language Models https://mlcommons.org/2024/03/mlperf-llama2-70b/ Wed, 27 Mar 2024 14:55:00 +0000 http://local.mlcommons/2024/03/mlperf-llama2-70b/ MLPerf Inference task force shares insights on the selection of Llama 2 for the latest MLPerf Inference benchmark round.

The MLCommons MLPerf Inference benchmark suite is an industry standard for measuring the performance of machine learning (ML) and artificial intelligence (AI) workloads from diverse domains including vision, speech, and natural language processing. For each of these domains, the suite includes a carefully selected set of workloads that represents the state of the art in the industry across different application segments. These benchmarks not only provide key information to consumers for making application deployment and budget decisions but also enable vendors to deliver critical workload optimizations within certain practical constraints to their customers.

With the advent and widespread adoption of large language models (LLMs), it has become even more crucial to make objective, fair and robust benchmarks readily available to the community. With this goal in mind, the MLPerf Inference working group formed a special task force in early 2023 to explore, define, and create language benchmarks.  

Building on small LLM in Inference v3.1

In deliberations leading up to the MLPerf Inference v3.1 round, which was available for submission in August 2023, the task force initially considered two principal LLM segments: a “smaller” LLM represented by GPT-J with six billion parameters, and a “larger” LLM based on GPT-3 with 175 billion parameters.

The smaller LLM was fine-tuned for text summarization, which is an application of increasing importance, aiding in human understanding and decision making from large bodies of text. This benchmark had a successful debut in the v3.1 round, with 11 organizations submitting 30 results across CPUs, GPUs, and custom accelerators. Details of the MLPerf Inference GPT-J benchmark and reference implementation can be found here.

The task force faced a number of technical challenges with the larger, more compute-intensive LLM, and the team eventually chose to postpone work on this benchmark to the MLPerf v4.0 round.

Introducing Llama 2 70B in MLPerf Inference v4.0

For the MLPerf Inference v4.0 round, the working group decided to revisit the “larger” LLM task and spawned a new task force. The task force examined several potential candidates for inclusion: GPT-175B, Falcon-40B, Falcon-180B, BLOOMZ, and Llama 2 70B.

After careful evaluation and discussion, the task force chose Llama 2 70B as the model that best suited the goals of the benchmark. The choice of Llama 2 70B as the flagship “larger” LLM was determined by several factors:

  • Licensing restrictions: For an open benchmark consortium, model licenses must be flexible enough to offer unlimited access and use to participating submitters at minimum and, ideally, to the press and public in general. This requirement typically reduces the candidate pool to models (and datasets) with permissive licenses such as Apache 2.0, MIT, or similar. Thanks to Meta’s support, MLCommons is making the Llama 2 family of models available to MLCommons’ members, as part of the MLPerf benchmark suite, for the purpose of members conducting the benchmarking.
  • Ease of use and deployment (technical): Llama 2 70B stands out for its accessibility and ease of deployment compared to models like GPT-175B that demand substantial resources and engineering effort for pre-training, fine-tuning, and framework alignment. The Llama-2-70B-Chat-HF model dramatically streamlined the MLPerf benchmark development process, allowing the task force to integrate an advanced LLM model without extensive resource allocation.
  • Community interests: While LLM model parameter sizes have increased tremendously in the recent past, the broader ML/AI developer community’s interest also plays an important role in shaping industry adoption across different application segments. Using Hugging Face contributions and leaderboards as a proxy for community engagement, Llama 2 70B was the leading candidate for inclusion during the task force’s discussions.
  • Quality: In gauging the suitability of candidate models, we considered their accuracy scores and output text quality across popular datasets curated for different language modeling tasks. Llama 2 70B was one of the best-performing models in our evaluation under these criteria.

Choosing a task and a dataset for the LLM

The task force evaluated many of the popular tasks for LLMs and, after careful deliberation, chose to use a question-answering (or Q&A) scenario. We picked the Q&A task because it is one of the most common ways that LLMs are being used in serving applications today. Q&A tasks reflect real-world demands for natural-language understanding and interaction. We considered dialogue-based tasks, like an interactive chat session that accumulates context over a series of interactions, but their evaluation complexity posed challenges, particularly in ensuring accurate assessments and also in the difficulty of measuring performance during such a dialogue. Thus, to streamline our benchmarking process and ensure robust evaluation, we opted for the Q&A scenario, with discrete pairs of questions and answers, due to its widespread applicability and straightforward evaluation metrics.

Examining the field of available datasets for this task, we chose the Open Orca dataset among various top-rated datasets such as AllenAI/soda. Open Orca is recognized as one of the highest quality and most extensive datasets available for question answering. The dataset contains a fusion of augmented FLAN Collection data with one million GPT-4 completions, making it a rich source for evaluating NLP capabilities. 

In order to shorten benchmark runtime, we decided to select a subset of 24,576 samples from this dataset. The samples were further curated for characteristics such as prompt quality, sample source (e.g., cot, flan), maximum input length (1,024 tokens), maximum output length (1,024 tokens), and minimum reference response length (at least three tokens) to create the validation set for the benchmark. Interested readers can refer to processorca.py for more details.
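
The sketch below illustrates the spirit of that curation with a simple length-based filter; the helper function is hypothetical and the tokenizer is a placeholder, and the authoritative filtering logic lives in processorca.py.

```python
# Sketch of the dataset-curation idea (hypothetical helper; see processorca.py for
# the authoritative filtering logic used by the benchmark).
from transformers import AutoTokenizer

MAX_INPUT_TOKENS = 1024
MAX_OUTPUT_TOKENS = 1024
MIN_REFERENCE_TOKENS = 3

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-chat-hf")  # placeholder

def keep_sample(prompt: str, reference: str) -> bool:
    """Apply the length-based criteria described above to one Open Orca sample."""
    n_in = len(tokenizer(prompt)["input_ids"])
    n_ref = len(tokenizer(reference)["input_ids"])
    return (
        n_in <= MAX_INPUT_TOKENS
        and n_ref <= MAX_OUTPUT_TOKENS
        and n_ref >= MIN_REFERENCE_TOKENS
    )
```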

Figure 1: Input length distribution in the curated dataset

We observed that in general Llama 2 70B generates slightly longer outputs than GPT-4 on our dataset. Additionally, some of the output sequences from our model do not finish their generation within our bound of 1,024 tokens.

Choosing accuracy metrics for the LLM

For comprehensive evaluation, the LLM inference task force carefully considered how to adhere to a high standard of model quality. Given the potential accuracy and performance trade-offs during optimization, submissions should not deviate significantly from the output that is generated by the 32-bit floating-point (FP32) reference model that defines canonical correctness. 

The task force investigated various accuracy metrics for language models, including perplexity, MMLU, MAUVE, and ExactMatch. Some of these metrics do not take the entire generated text into consideration (e.g., perplexity and MMLU), while others have non-deterministic behavior (e.g., MAUVE). The task force ultimately chose ROUGE scores as the accuracy metrics. In essence, n-gram based metrics such as ROUGE measure how close a given text is to its reference. With robustness in mind, the task force chose a set of three ROUGE metrics for this benchmark: ROUGE-1, ROUGE-2, and ROUGE-L.

The reference FP32 implementation achieves the following reference ROUGE scores:

  • ROUGE-1 = 44.4312
  • ROUGE-2 = 22.0352
  • ROUGE-L = 28.6162

The Llama 2 70B benchmark allows submissions under two categories: in the first category, all achieved scores must be above 99.9% of the reference scores; in the second category, all achieved scores must be above 99% of the reference scores. However, since there were no v4.0 submissions with accuracy lower than 99.9% of the reference, the v4.0 results table contains only the higher-accuracy category.
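
For illustration, the sketch below scores generated answers with the rouge_score package and checks the results against 99.9% of the reference values; the official accuracy script aggregates scores in its own way, so treat this as an approximation.

```python
# Sketch: scoring generated answers with ROUGE-1/2/L and checking the 99.9% threshold.
# (Illustrative only; the official accuracy script may aggregate scores differently.)
from rouge_score import rouge_scorer

REFERENCE = {"rouge1": 44.4312, "rouge2": 22.0352, "rougeL": 28.6162}

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def mean_rouge(predictions, references):
    """Average F-measure (as a percentage) over all prediction/reference pairs."""
    totals = {k: 0.0 for k in REFERENCE}
    for pred, ref in zip(predictions, references):
        scores = scorer.score(ref, pred)
        for k in totals:
            totals[k] += scores[k].fmeasure * 100
    return {k: v / len(predictions) for k, v in totals.items()}

def passes_high_accuracy(predictions, references) -> bool:
    achieved = mean_rouge(predictions, references)
    return all(achieved[k] >= 0.999 * REFERENCE[k] for k in REFERENCE)
```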

To further tighten this benchmark, the task force decided to allow only submissions with an average generation length of 294.45 tokens per sample, with a ±10% tolerance.

Figure 2: ROUGE scores distribution in outputs from the reference implementation

Measuring performance of the LLM

An LLM inference takes several tokens as input (a sentence or paragraph) and generates tokens based on the instructions. In its most basic form, one could treat the problem as asking the model to fill in the last word of a sentence. Internally, the processing happens in two main steps. The input tokens are processed together to generate the first output token; this step is commonly referred to as the prompt phase or the context phase. After this phase, computation is primarily limited to generating the rest of the tokens in the response, one token at a time, reusing the context from prior computation in each step. This phase is generally called the generation phase. It is critical for the performance and accuracy metrics to capture the application execution behavior in both of these phases of computation.
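
To make the two phases concrete, here is a hedged sketch of a prefill-then-decode loop using the Hugging Face transformers API; the model name is a placeholder (any causal LM works), and real serving stacks batch and schedule these phases very differently.

```python
# Sketch: the two phases of LLM inference. The prompt (context) phase processes all
# input tokens at once; the generation phase then emits one token at a time, reusing
# the cached attention state from prior computation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-70b-chat-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)

# Prompt / context phase: one forward pass over the whole prompt yields the first token.
with torch.no_grad():
    out = model(**inputs, use_cache=True)
next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
past = out.past_key_values
generated = [next_token]

# Generation phase: one token per step, reusing the KV cache.
for _ in range(63):
    with torch.no_grad():
        out = model(input_ids=next_token, past_key_values=past, use_cache=True)
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    past = out.past_key_values
    generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```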

While other applications could use throughput metrics such as queries/second, the amount of computational work done by one LLM inference can be extremely different from another. This delta is due to the fact that queries can have different input and output token lengths. Therefore, the task force decided to use tokens/second as the metric for throughput measurement, which is a well understood and standard metric for LLM throughput measurement across the industry. 

Latency constraints for the server scenario

The server scenario in MLPerf Inference benchmarks simulates online applications where query arrival is random and the response latency is important. The query arrival traffic is described with a Poisson distribution where the query arrival rate (queries per second or “target_qps”) is the Poisson parameter. The test is successful if latency constraints for the benchmark are met.
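
The sketch below shows how such an arrival trace can be generated: the inter-arrival gaps of a Poisson process are exponentially distributed with mean 1/target_qps. This only mirrors the idea conceptually; the official LoadGen implements its own query scheduling.

```python
# Sketch: generating Poisson query arrivals for a server-style load test.
# Inter-arrival gaps of a Poisson process are exponential with mean 1 / target_qps.
import numpy as np

def poisson_arrival_times(target_qps: float, num_queries: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    gaps = rng.exponential(scale=1.0 / target_qps, size=num_queries)
    return np.cumsum(gaps)  # absolute issue times in seconds

# Example: 100 queries at a target rate of 10 queries/second.
print(poisson_arrival_times(target_qps=10.0, num_queries=100)[:5])
```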

In industrial applications of LLMs, the following latency metrics are considered important:

  • TTFT: Time to first token
  • TPOT: Time per output token, sometimes also called TBT (time between tokens)

Using the typical human reading speed as a logical anchor, these latency constraints were set as:

  • TTFT: <= 2 seconds
  • TPOT: <= 200 milliseconds

A TPOT of 200 ms corresponds to a generation rate of five tokens per second, which maps to roughly 240 words per minute (depending on the tokenizer), often cited as the average human reading speed.

While reading speed is a good yardstick for generation tasks such as LLM-based chat or tech support, other use cases have tighter latency constraints. For example, code generation, real-time slide generation, and smart agents all require much faster response times. 

Ensuring compliance to the benchmark criteria

To ensure compliance and prevent potential exploitation, the task force devised a compliance check mechanism. This test is designed to verify that the tokens generated by submitters adhere to predefined criteria. Specifically, three key tests are applied to the submissions (see the sketch after this list):

  1. First token consistency: The first token generated must match the first token reported to the LoadGen for timing purposes, ensuring synchronization and accuracy in the token generation processes.
  2. EOS token validation: The end-of-sequence (EOS) token should not be generated more than once in the generated output, maintaining coherence and integrity in the completion of sequences.
  3. Token count verification: The total number of tokens generated by the model must align with the number reported to the LoadGen, validating the completeness and accuracy of the generated output.
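
A hedged sketch of what these three checks amount to is shown below; the field names are hypothetical, and the authoritative logic lives in the MLPerf Inference compliance test scripts.

```python
# Sketch of the three compliance checks (hypothetical field names; the official
# logic lives in the MLPerf Inference compliance test scripts).
def check_submission(reported_first_token: int,
                     reported_token_count: int,
                     generated_tokens: list,
                     eos_token_id: int) -> bool:
    # 1. First token consistency: the first generated token must match the one
    #    reported to LoadGen for first-token timing.
    if not generated_tokens or generated_tokens[0] != reported_first_token:
        return False
    # 2. EOS token validation: EOS must not appear more than once in the output.
    if generated_tokens.count(eos_token_id) > 1:
        return False
    # 3. Token count verification: the reported count must match what was generated.
    if len(generated_tokens) != reported_token_count:
        return False
    return True
```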

Conclusion

In this post, we described the complexities involved in the choices of model, task, dataset, accuracy, performance, and verification of a contemporary LLM benchmark. We believe the design choices of this benchmark objectively address most of the critical deployment decisions faced by practitioners while delivering important performance insights to customers.

Details of the MLPerf Inference Llama 2 70B benchmark and reference implementation can be found here.

Given the overwhelming interest from vendors, as demonstrated by the number of submissions, we hope to continue to improve on the benchmark in future iterations, accounting for the different application domains that may steer LLM adoption.

Additional technical contributors: Jinho Suh, NVIDIA;  Junwei Yang, Google; Pablo Gonzalez, MLCommons.

MLPerf Results Highlight Growing Importance of Generative AI and Storage https://mlcommons.org/2023/09/mlperf-results-highlight-growing-importance-of-generative-ai-and-storage/ Mon, 11 Sep 2023 08:00:00 +0000 http://local.mlcommons/2023/09/mlperf-results-highlight-growing-importance-of-generative-ai-and-storage/ Latest benchmarks include LLM in inference and the first results for storage benchmark

Today we announced new results from two MLPerf™ benchmark suites: MLPerf Inference v3.1, which delivers industry standard Machine Learning (ML) system performance benchmarking in an architecture-neutral, representative, and reproducible manner, and MLPerf Storage v0.5. This publication marks the first ever release of results from the MLPerf Storage benchmark, which measures the performance of storage systems in the context of ML training workloads.

MLPerf Inference v3.1 Introduces New LLM and Recommendation Benchmarks

MLPerf Inference v3.1 includes record participation, with over 13,500 performance results and up to 40% performance gains, from 26 different submitters, and over 2000 power results. Submitters include: ASUSTeK, Azure, cTuning, Connect Tech, Dell, Fujitsu, Giga Computing, Google, H3C, HPE, IEI, Intel, Intel-Habana-Labs, Krai, Lenovo, Moffett, Neural Magic, NVIDIA, Nutanix, Oracle, Qualcomm, Quanta Cloud Technology, SiMA, Supermicro, TTA, and xFusion. In particular, MLCommons® would like to congratulate first time MLPerf Inference submitters Connect Tech, Nutanix, Oracle, and TTA.

“Submitting to MLPerf is not trivial. It’s a significant accomplishment, as this is not a simple point and click benchmark. It requires real engineering work and is a testament to our submitters’ commitment to AI, to their customers, and to ML.” said David Kanter, Executive Director of MLCommons.

The MLCommons MLPerf Inference benchmark suite measures how fast systems can run models in a variety of deployment scenarios. ML inference is behind everything from the latest generative AI chatbots to safety features in vehicles, such as automatic lane-keeping and speech-to-text interfaces. Improving performance and power efficiency is key to deploying more capable AI systems that benefit society.

MLPerf Inference v3.1 introduces two new benchmarks to the suite. The first is a large language model (LLM) using the GPT-J reference model to summarize CNN news articles, which garnered results from 15 different submitters, reflecting the rapid adoption of generative AI. The second is an updated recommender, modified to be more representative of industry practices, using the DLRM-DCNv2 reference model and a much larger dataset, with nine submissions. These new tests help advance AI by ensuring that industry-standard benchmarks represent the latest trends in AI adoption to help guide customers, vendors, and researchers.

“The submissions for MLPerf inference v3.1 are indicative of a wide range of accelerators being developed to serve ML workloads. The current benchmark suite has broad coverage among ML domains and the most recent addition of GPT-J is a welcome addition to the generative AI space. The results should be very helpful to users when selecting the best accelerators for their respective domains,” said Mitchelle Rasquinha, MLPerf Inference Working Group co-chair.

MLPerf Inference benchmarks primarily focus on datacenter and edge systems. The v3.1 submissions showcase several different processors and accelerators across use cases in computer vision, recommender systems, and language processing. There are both open and closed submissions related to performance, power, and networking categories. Closed submissions use the same reference model to ensure a level playing field across systems, while participants in the open division are permitted to submit a variety of models.

First Results for New MLPerf Storage Benchmark

The MLPerf Storage Benchmark Suite is the first open-source AI/ML benchmark suite that measures the performance of storage for ML training workloads. The benchmark was created through a collaboration spanning more than a dozen leading industry and academic organizations and includes a variety of storage setups including: parallel file systems, local storage, and software defined storage. The MLPerf Storage Benchmark will be an effective tool for purchasing, configuring, and optimizing storage for machine learning applications, as well as for designing next-generation systems and technologies.

Training neural networks is both a compute and data-intensive workload that demands high-performance storage to sustain good overall system performance and availability. For many customers developing the next generation of ML models, it is a challenge to find the right balance between storage and compute resources while making sure that both are efficiently utilized. MLPerf Storage helps overcome this problem by accurately modeling the I/O patterns posed by ML workloads, providing the flexibility to mix and match different storage systems with different accelerator types.

“Our first benchmark has over 28 performance results from five companies which is a great start given this is the first submission round,” explains Oana Balmau, Storage Working Group co-chair. “We’d like to congratulate MLPerf Storage submitters: Argonne National Laboratory (ANL), DDN, Micron, Nutanix, WEKA for their outstanding results and accomplishments.”

The MLPerf Storage benchmark suite is built on the codebase of DLIO, a benchmark designed for I/O measurement in high performance computing, adapted to meet current storage needs.

View the Results

To view the results for MLPerf Inference v3.1 and MLPerf Storage v0.5 and find additional information about the benchmarks please visit the following pages:

About MLCommons

MLCommons is an open engineering consortium with a mission to make machine learning better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmark in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 50+ members – global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire machine learning industry through benchmarks and metrics, public datasets, and best practices.

For additional information on MLCommons and details on becoming a Member or Affiliate, please visit MLCommons.org or contact participation@mlcommons.org.

MLPerf Inference Delivers Power Efficiency and Performance Gains https://mlcommons.org/2023/04/mlperf-inference-delivers-power-efficiency-and-performance-gains/ Wed, 05 Apr 2023 08:39:00 +0000 http://local.mlcommons/2023/04/mlperf-inference-delivers-power-efficiency-and-performance-gains/ Record participation in MLCommons’ benchmark suite showcases improvements in efficiency and capabilities for deploying machine learning

Today, MLCommons®, the leading open AI engineering consortium, announced new results from the industry-standard MLPerf™ Inference v3.0 and Mobile v3.0 benchmark suites, which measure the performance and power-efficiency of applying a trained machine learning model to new data. The latest benchmark results illustrate the industry’s emphasis on power efficiency, with 50% more power efficiency results, and significant gains in performance by over 60% in some benchmark tests.

Inference is the critical operational step in machine learning, where a trained model is deployed for actual use, bringing intelligence into a vast array of applications and systems. Machine learning inference is behind everything from the latest generative AI chatbots to safety features in vehicles such as automatic lane-keeping, and speech-to-text interfaces. Improving performance and power efficiency will lead the way for deploying more capable AI systems that benefit society.

The MLPerf benchmark suites are comprehensive system tests that stress machine learning models including the underlying software and hardware and in some cases, optionally measuring power efficiency. The open-source and peer-reviewed benchmark suites create a level playing ground for competition, which fosters innovation and benefits society at large through better performance and power efficiency for AI and ML applications.

The MLPerf Inference benchmarks primarily focus on datacenter and edge systems. This round featured even greater participation across the community with a record-breaking 25 submitting organizations, over 6,700 performance results, and more than 2,400 performance and power efficiency measurements. The submitters include Alibaba, ASUSTeK, Azure, cTuning, Deci.ai, Dell, Gigabyte, H3C, HPE, Inspur, Intel, Krai, Lenovo, Moffett, Nettrix, NEUCHIPS, Neural Magic, NVIDIA, Qualcomm Technologies, Inc., Quanta Cloud Technology, rebellions, SiMa, Supermicro, VMware, and xFusion, with nearly half of the submitters also measuring power efficiency.

MLCommons congratulates our many first time MLPerf Inference submitters on their outstanding results and accomplishments. cTuning, Quanta Cloud Technology, rebellions, SiMa, and xFusion all debuted their first performance results. cTuning, NEUCHIPS, and SiMa also weighed in with their first power efficiency measurements. Lastly, HPE, NVIDIA, and Qualcomm all submitted their first results for inference over the network.

The MLPerf Mobile benchmark suite is tailored for smartphones, tablets, notebooks, and other client systems. The MLPerf Mobile application for Android and iOS is expected to be available shortly.

To view the results and find additional information about the benchmarks, please visit https://mlcommons.org/en/inference-datacenter-30, https://mlcommons.org/en/inference-edge-30, and https://mlcommons.org/en/inference-mobile-30.

About MLCommons

MLCommons is an open engineering consortium with a mission to make machine learning better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmark in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 50+ founding partners – global technology providers, academics and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire machine learning industry through benchmarks and metrics, public datasets and best practices.

For additional information on MLCommons and details on becoming a Member or Affiliate of the organization, please visit http://mlcommons.org and contact participation@mlcommons.org.

Press Contact:
Kelly Berschauer
kelly@mlcommons.org

Perspective: Unlocking ML requires an ecosystem approach https://mlcommons.org/2023/03/unlocking-ml-requires-an-ecosystem-approach/ Fri, 10 Mar 2023 08:04:00 +0000 http://local.mlcommons/2023/03/unlocking-ml-requires-an-ecosystem-approach/ Factories need good roads to deliver value

Over the past 10 years, machine learning (ML) has grown exponentially, from a small research field to a broad community spanning universities, technology platforms, industries, nonprofits, and governments. The global funding (PE/VC) in AI companies increased by 390% from $16 billion in 2015 to $79 billion in 2022.[1] Recent advances in ML have made it possible for machines to not only consume but also create content to a level that approaches what humans can do. This capability is sometimes called Generative AI[2]. This is primarily enabled by foundation models[3] – large, pre-trained models trained on copious amounts of diverse data through self-supervised learning that can be adapted to a wide array of downstream tasks. The disruptive potential of foundation models lies in their ability to achieve state-of-the-art performance on tasks with little or no task-specific training required (though such “fine-tuning” further improves capabilities).

While considerable innovation is taking place across the ML research-to-production life cycle, it often occurs organically and in silos, resulting in uneven impact and a low transfer of innovation across industry verticals. A handful of technology companies are experiencing benefits from deploying cutting-edge ML at scale, while others are still learning to operationalize it: to quote one study, nearly 80 percent of AI solutions are abandoned because companies struggle to scale them.[4] Even in companies with advanced ML capabilities, there is considerable friction and uncertainty within the ML development cycle. Reasons for this include:

  • ML has a fundamentally different programming paradigm than traditional software. In traditional software engineering, a programmer writes explicit instructions for the computer to execute, while in machine learning—especially for deep neural networks—the programmer provides a data set and specifies a model architecture and other training code from which the program learns the desired behavior. While this is an incredibly powerful programming tool, it creates an inherent unpredictability that must be accounted for throughout the ML development life cycle.
  • The ML programming paradigm requires a very different development cycle. The best practice for traditional software development uses a collaborative, flexible, agile methodology and the scrum framework, whereby teams make rapid, incremental advancements toward well-defined goals. ML development, by contrast, is relatively opaque and unpredictable; in effect, data is the new code, and the quality and quantity of data determine the ML’s functionality—though this connection is still poorly understood. Teams continuously revisit data sets and retrain models as they move ML innovations from research into production – a process we term the “ML development flywheel” (Exhibit 1).
  • ML lacks a mature ecosystem – a set of shared tools, resources, standards, and best practices – to support this development cycle. Traditional computer programming has had decades to develop an ecosystem that has become so prevalent that nearly 80 percent of all sites on the internet today rely on a common software stack[5]. ML development, however, is still considered an art as much as a science, often requiring a custom solution for each new problem and highly experienced engineers who can try various techniques gained through experience to arrive at the desired model behavior and convergence.

Exhibit 1: ML Development Flywheel

The result is that the ML field today resembles the internet of the late 1990s. It has high potential but uneven impact, knowledge is spread across myriad organizations and individuals, and programming requires a collection of custom, often vertically integrated tech stacks.

Motivated by the challenge of increasing ML’s impact in society, the MLCommons Association collaborated with several partners on a high-level analysis of how to develop a healthier ML ecosystem (Exhibit 2). We wanted to answer two key questions—where are the opportunities for unlocking ML’s greatest impact on society, and what are the best ways to address them—with the goal of providing a perspective on how various ML ecosystem players, including enterprises and technology hyperscalers,[6] can contribute to driving greater ML adoption.

Exhibit 2: Types of ML Ecosystem players

We interviewed 32 experts across these ML ecosystem players[7]—including engineers and product managers from leading tech companies, business leaders, the heads of industry and civil-society groups, and academics—to explore the challenges they face in developing ML and attempt to prioritize solutions for addressing them.

Opportunities to improve ML

Exhibit 3: Opportunities to improve ML fall into three broad categories

Opportunities to unlock the full impact of ML at scale fall into three categories: increasing the capabilities of ML technology, broadening access to and adoption of ML, and ensuring that the results of ML benefit society (Exhibit 3). All ecosystem players have a role to play for ML to reach its full potential, and contributions can be in the form of tools, data, funding, education, training, benchmarks, best practices, partnerships and coalitions, and governance and policy.

Deep dives on critical ecosystem areas to improve

In this section, we discuss three key opportunities that emerged in our discussions with business leaders and ML practitioners: data discoverability, access, and quality; operational scaling and hardening; and ethical applied ML.

Data discoverability, access, and quality: ML developer productivity is data-limited, but public data is primitive and data tools are poor

ML developers face two broad categories of data challenges:

  • Finding, accessing, and labeling data. Developers need to locate relevant data, gain access to it—which can be especially difficult if it’s held by multiple organizations—format it for consistency, and sometimes efficiently and correctly label it.
  • Making data useful. Collected data can be incomplete, incorrect, biased, or misleading in numerous ways, problems which are hard to detect prior to deployment.

These challenges are exacerbated by the relatively low investment companies are making in public ML data (compared with the substantial investment in model development), the nonexistent to primitive nature of metrics and standards for ML data sets, and the lack of a widely used tool stack for working with data. Opportunities for improving the ML data ecosystem include:

  • Improving public data sets. There is opportunity for a network of partners to work together to create open data sets that support multiple contributors and continuous quality improvements. Datasets such as the Pile[8] and LAION[9] are early, promising examples of the potential of this approach, though other issues such as licensing still need to be addressed. One estimate states that the boost to the economy from adoption of open-data ecosystems in finance could range from about 1 to 1.5 percent of GDP in 2030 in the European Union, the United Kingdom, and the United States, to as much as 4 to 5 percent in India.[10] There is a particularly strong need for open datasets that define community consensus on usable public data, serve as tests and benchmarks for different approaches, or address societal issues without commercial interest.
  • Increasing data sharing. Public data sets should be populated by standardizing best practices for anonymizing and releasing data or finding innovative ways to leverage private data without releasing it. For example, MedPerf is an open framework for benchmarking machine learning in the medical domain that securely evaluates medical AI models on distributed healthcare data sets while keeping the data private.[11]
  • Agreeing on common data metrics and standards. First, metrics, such as DataPerf, should be created that quantify data quality (DataPerf is a benchmark suite for ML datasets and data-centric algorithms[12]). Second, data formats and metadata content should be standardized; Google model cards, IBM fact sheets, datasheets for datasets, and the Data Nutrition Project all provide examples of how to do this.[13] Third, best practices for ethical data sourcing and consistent licensing should be developed.
  • Improving interoperable data-centric tools. These metrics and standards should be leveraged to improve data quality and reduce costs through a data-development stack of interoperable tools akin to those widely used for code.

Operational scaling and hardening: deploying models from research into production too often relies on costly and fragile bespoke solutions

Delivering stellar results using a prototype model trained on an experimental data set is a significant accomplishment, but putting the model into production to deliver business value brings new challenges that generally fall into two categories:

  • Building the ML data platform. Experimental data is often hand-curated or has different characteristics from the actual, production data at a company. As a result, teams need to invest in building persistent pipelines to access and transform production data and design features to approximate those used for the experimental model. Often the new model does not match the evaluation performance of the experimental one and additional work is needed to bring it up to par. An abrupt handoff between the research and production teams can increase the challenge of this work if the team responsible for operationalizing the model into a product is unaware of the context and datasets used in model development. Finally, live data needs to be collected as the model is put through dark launches, live experiments, and full launch, all while monitoring the model output to ensure the query and training distribution match, and it isn’t producing unintended outcomes. Building this end-to-end data story, with an ability to track artifact provenance, requires significant up-front investment before any business impact is realized.
  • Navigating infrastructure fragmentation. Much of the open-source infrastructure used to write ML models and compile them down to hardware-native code was created by researchers for their own specific, fast-changing needs. This has resulted in an ecosystem of toolchains that is highly siloed or patched together using custom bridges, which creates a host of interoperability problems—for example, trying to port a server-side model written in Pytorch to a mobile back-end using TFLite can be a challenging process—and means teams often have to throw out their existing model code and pipelines and start from scratch as their business needs evolve, significantly slowing down development velocity.

ML tooling innovation is addressing some of these challenges, in the process attracting high levels of VC funding and some of the best talent in the industry. An ecosystem perspective suggests new solutions are needed to accelerate ML adoption:

  • Data-platform standards and best practices. Best practices for building robust data platforms could enable a smooth transition from prototyping to live monitoring. While it is common for different companies to use different tools for each stage of the pipeline, data sharing and security standards would ensure these tools work seamlessly together.
  • Standard interchange formats. Decoupling higher and lower parts of the core ML stack could drive interoperability and reduce fragmentation. Developers should be able to design the toolchain (for example, PyTorch framework with XLA compiler) that best meets their needs and works across hardware back-ends. Defining a standard set of operators at various levels of the stack would make ML development much more scalable and make it easier for hardware and software vendors to focus on their strengths rather than on needing to support multiple independent code paths.
  • Include people and organizations. Promising ML applications often fail to gain traction for nontechnical reasons such as a lack of stakeholder involvement during development and skepticism by front-line users during early iterations. Organizations as a whole, not just ML engineering teams, need to be aligned around the ML development flywheel.
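The interchange sketch referenced above: exporting a PyTorch model to ONNX, one widely used standard operator set, so that downstream runtimes can consume the artifact without the authoring framework. The model, tensor shapes, and file name are illustrative assumptions, not a prescribed workflow.

import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """Small stand-in model; any trained PyTorch module would export the same way."""
    def __init__(self, in_features=16, num_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 32),
            nn.ReLU(),
            nn.Linear(32, num_classes),
        )

    def forward(self, x):
        return self.net(x)

model = TinyClassifier().eval()
example_input = torch.randn(1, 16)  # example shape used to trace the graph

# The exported file uses standard ONNX operators, so ONNX Runtime, mobile
# runtimes, or vendor compilers can load it without the original PyTorch code.
torch.onnx.export(
    model,
    example_input,
    "tiny_classifier.onnx",
    input_names=["features"],
    output_names=["logits"],
    dynamic_axes={"features": {0: "batch"}},  # allow variable batch size
)

The broader point is that the exported artifact, rather than the authoring framework, becomes the contract between teams and hardware vendors; defining similar operator sets at other levels of the stack would extend the same decoupling.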

Ethical applied ML: developers lack universally accepted standards, technologies and benchmarks to build responsible ML applications

ML has driven phenomenal advances in many areas, from autonomous vehicles and drug discovery to financial fraud detection and retail product recommendations. However, it has also had negative effects, including perpetuating systemic biases,[15] making many companies cautious about adopting it. Building a trust-based AI architecture would expand the range of industries and use cases where ML can be applied, and it requires work across several dimensions: bias and fairness, explainability and transparency, human oversight and accountability, privacy and data ethics, performance and safety, and security and sustainability.[16] These dimensions can be addressed through several approaches:

  • Privacy-enhancing technologies. Many tech hyperscalers and open-source communities provide privacy-enhancing technology solutions, such as federated learning and homomorphic encryption, but these solutions lack standardization, which can lead to reinventing the wheel in silos rather than alignment among ecosystem players (a minimal federated-averaging sketch follows this list). A common set of community-backed standards can help ensure these technologies are more widely adopted, benefiting everyone.
  • Best practices. There is a need for universally accepted best practices so that organizations with hundreds of models can carry out the full gamut of fairness checks and evaluate data risks. Instituting best practices while the technology is still young can minimize problems caused by proprietary systems not “talking” to one another later, help standardize models, and contribute to curbing bias. Early development of such practices can help avoid what occurred with electronic health records; the promise was that they would help patients and providers easily access relevant information, but the reality is that records remain fragmented across platforms, often causing frustration for stakeholders at every level.
  • Policy and governance. Global standards for AI regulation are still being developed, and governments, technology companies, and civil society groups can help shape legislation to focus on key principles such as transparency, human centricity, contestability, data protection, and fairness. For regulations already in place, there remains an opportunity to provide tools and certifications that help companies implement these guidelines.
  • Tools and benchmarks. There is a need for comprehensive, industry-specific benchmarks for identifying and addressing bias. One example is the Monk Skin Tone Scale[17] developed and released by Google, which can be used to evaluate computer vision datasets and models for skin tone representation. There is also a need for better documentation and artifact tracking at each stage of the ML lifecycle.[18]
  • Education and training. To accelerate the adoption of tools, benchmarks, and best practices, training resources should be tailored to the needs and roles of each of the players involved in each step of the ML development flywheel. One example of such a resource is provided by the EDM Council, a global association focused on industry-validated training, certifications, and other practitioner resources.
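The federated-averaging sketch referenced in the first item of this list: each client fits a model on data that never leaves its device, and a server aggregates only the resulting weights. The linear model, squared loss, client counts, and hyperparameters are illustrative assumptions; production systems layer secure aggregation, differential privacy, and client sampling on top of this core loop, which is where common standards matter most.

import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local gradient-descent steps on its private data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_average(global_weights, client_datasets, rounds=10):
    """Each round, every client trains locally and the server averages the
    returned weights (weighted by local dataset size) without seeing raw data."""
    w = global_weights
    sizes = np.array([len(y) for _, y in client_datasets])
    for _ in range(rounds):
        local_weights = [local_update(w, X, y) for X, y in client_datasets]
        w = np.average(local_weights, axis=0, weights=sizes)
    return w

# Synthetic example: three clients whose private data share one underlying model.
rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0])
clients = []
for n in (50, 80, 120):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    clients.append((X, y))

print("Recovered weights:", federated_average(np.zeros(2), clients))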

Concluding thoughts

The ML field is still nascent, barely 10 years old, and significant work is needed to transform it into an established discipline that achieves its full potential for social and economic impact. Mainstream ML adopters and tech hyperscalers will be key to this endeavor.

Key takeaways for enterprise adopters of ML

In recent years, the momentum of AI R&D has been exponential; more than 30 times as many patents were filed in the field in 2021 as in 2015.[19] Given this momentum, along with factors such as increased investment, AI is not only performing better but also becoming more affordable, leading to increased adoption across industries. According to McKinsey & Company's 2022 Global Survey on the state of AI,[20] 50 percent of respondents reported that their organizations had adopted AI in at least one function in 2021, compared with 20 percent in 2017. Although adoption has grown 2.5 times since 2017, it has more recently plateaued. And while the consensus among respondents is that AI will become a mainstream technology at their companies, a 2020 survey found that only 76 percent of organizations were close to breaking even on their AI investments when considering the costs and not just the benefits.[21] To address these challenges and realize the full potential of AI for themselves and other ecosystem players, enterprise adopters could pursue the following strategies:

  • Reassess the path to ML production. The path to performance observability and model conformance lies as much in the data fueling the infrastructure as in the design of the infrastructure itself. Enterprise ML adopters should therefore be as metrics-driven about evaluating data quality as they are about enabling high-performance hardware; this would unlock the next level of scale and impact. Enterprise adopters can also work to prevent bias and drift by building in time for iterative, unpredictable developments and by bringing all relevant organizational stakeholders, including legal and ethics leads, into the ML development flywheel early. This helps organizations identify regulatory requirements up front, mitigate bias, and communicate clearly about privacy and governance laws.
  • Support emerging standards and best practices, and tailor them to current and future challenges. Many current ML development tools are either domain-specific or do not scale well when moving from research prototyping to production deployment. Adopting open-source frameworks and best practices, and eschewing one-off solutions, can keep an organization from getting stuck on a technology island as it matures, allowing it to keep learning and adapting as tooling solutions, policies, and innovations in the ecosystem evolve and scale.
  • Invest in people. Today's AI heroes are often just those who write model code. However, enterprises need to ensure that everyone involved in the ML development flywheel, including those who enable data flow, provide tooling solutions, assure unbiased outcomes, and monitor live operations, is also rewarded and upskilled.

Key takeaways for tech hyperscalers

As some of the largest investors in ML research, infrastructure, and applications, tech hyperscalers such as Google and Meta are critical to advancing the ML ecosystem. Beyond the significant contributions they already make through university collaborations and open-source projects, we would encourage them to invest in the following:

  • Support the development of public resources, especially data sets and tooling, to allow researchers to continue to innovate and communicate as ML technology advances.
  • Contribute to legal and business best practices that help all companies use ML effectively and responsibly. This can be achieved by partnering with mainstream ML adopters to share best practices on translating business problems into ML solutions and providing a solid technical foundation for emerging regulatory frameworks.
  • Work cross-organizationally or with industry groups to drive interoperability in open-source frameworks and tooling. This could enable more companies to participate in ML collaboratively and flexibly, growing overall adoption.
  • Encourage efforts to raise public understanding and awareness of AI’s potential for positive societal impact, such as through research into AI applications for social good and efforts to reduce bias and increase inclusion.
  • Be transparent about the limitations and risks of ML-based solutions.

Authors


Peter Mattson is a Senior Staff Engineer at Google. He co-founded and is President of MLCommons®, and co-founded and was General Chair of the MLPerf™ consortium that preceded it.

Aarush Selvan is a Product Manager at Google where he works on Natural Language Processing for the Google Assistant.

David Kanter is a co-founder and the Executive Director of MLCommons, where he helps lead the MLPerf benchmarks and other initiatives.

Vijay Janapa Reddi is an Associate Professor at Harvard University. His research interests include computer architecture and runtime systems, specifically in the context of autonomous machines and mobile and edge computing systems.

Roger Roberts is a Partner in McKinsey & Company's Bay Area office and advises clients in a broad range of industries, especially retail and consumer businesses, as they address their toughest technology challenges.

Jacomo Corbo is Founder and co-CEO of PhysicsX, a company building AI generative models for physics and engineering. He was previously Chief Scientist and Founder of QuantumBlack and a Partner at McKinsey & Company.

The authors would like to thank Bryan Richardson, Chaitanya Adabala Viswa, Chris Anagnostopoulos, Christopher McGrillen, Daniel Herde, David Harvey, Kasia Tokarska, Liz Grennan, Martin Harrysson, Medha Bankhwal, Michael Chui, Michael Rudow, Rasmus Kastoft-Christensen, Rory Walsh, Pablo Illanes, Saurabh Sanghvi, Stephen Xu, Tinashe Handina, Tom Taylor-Vigrass, Yetunde Dada, and Douwe Kiela, as well as Adam Yala from the University of California, Berkeley, Kasia Chmielinski from the Data Nutrition Project, and contributors from Google, Harvard University, Hugging Face, Meta AI, and Partnership on AI, for their contributions to this article.


  1. Pitchbook ↩
  2. https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-is-generative-ai ↩
  3. Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N., Chen, A., Creel, K., Davis, J. Q., Demszky, D., … Liang, P. (2022, July 12). On the opportunities and risks of Foundation models. arXiv.org. Retrieved January 20, 2023, from https://arxiv.org/abs/2108.07258 ↩
  4. “How to operationalize machine learning for maximum business impact: Frameworks for machine learning operationalisation are key,” Wired, in partnership with QuantumBlack, June 12, 2021, https://www.wired.co.uk/bc/article/mckinsey-machine-learning ↩
  5. “Usage statistics of PHP for websites,” W3 Techs, n.d., https://w3techs.com/technologies/details/pl-php ↩
  6. Tech hyperscalers comprise technology companies like Google, Microsoft, and Amazon that are scaling software products and services while also expanding into numerous industry verticals ↩
  7. Data Nutrition Project, Google, Harvard University, Hugging Face, Meta AI, Partnership on AI, and the University of California, Berkeley ↩
  8. https://pile.eleuther.ai/ ↩
  9. https://laion.ai/blog/laion-400-open-dataset/ ↩
  10. White, Olivia, Anu Madgavkar, Zac Townsend, James Manyika, Tunde Olanrewaju, Tawanda Sibanda, and Scott Kaufman. “Financial Data Unbound: The Value of Open Data for Individuals and Institutions.” McKinsey & Company June 29, 2021. https://www.mckinsey.com/industries/financial-services/our-insights/financial-data-unbound-the-value-of-open-data-for-individuals-and-institutions ↩
  11. Alexandros Karargyris et al., “MedPerf: Open benchmarking platform for medical artificial intelligence using federated evaluation,” n.d., https://arxiv.org/ftp/arxiv/papers/2110/2110.01406.pdf ↩
  12. https://dataperf.org/ ↩
  13. “The value of a shared understanding of AI models,” n.d., https://modelcards.withgoogle.com/about; Michael Hind, “IBM FactSheets further advances trust in AI,” July 9, 2020, https://www.ibm.com/blogs/research/2020/07/aifactsheets/; Timnit Gebru et al., “Datasheets for datasets,” Cornell University, March 23, 2018, https://arxiv.org/abs/1803.09010; “The Data Nutrition Project,” n.d., https://datanutrition.org/ ↩
  14. https://openai.com/pricing ↩
  15. Silberg, Jake, and James Manyika. “Tackling Bias in Artificial Intelligence (and in Humans).” McKinsey & Company, July 22, 2020. https://www.mckinsey.com/featured-insights/artificial-intelligence/tackling-bias-in-artificial-intelligence-and-in-humans ↩
  16. https://www.mckinsey.com/featured-insights/in-the-balance/from-principles-to-practice-putting-ai-ethics-into-action ↩
  17. https://skintone.google/ ↩
  18. “Machine learning life cycle,” Partnership on AI, n.d., https://docs.google.com/presentation/d/1fh4AGTeXRN0hInvoZsYMz7Bb3mgzfZktdGZtKvBRED8/edit#slide=id.gdf40b9b06b_2_491 ↩
  19. Daniel Zhang et al., “Artificial Intelligence Index Report 2022,” Stanford Institute for Human-Centered AI, Stanford University, March 2022, https://aiindex.stanford.edu/wp-content/uploads/2022/03/2022-AI-Index-Report_Master.pdf ↩
  20. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2022-and-a-half-decade-in-review ↩
  21. “2021 AI Predictions: No uncertainty here,” PwC, October 2020, https://www.pwc.com/mt/en/publications/technology/ai-predictions-2021/ai-predictions-2021-no-uncertainty-here.html ↩
  22. https://www.mckinsey.com/capabilities/quantumblack/our-insights/generative-ai-is-here-how-tools-like-chatgpt-could-change-your-business ↩

The post Perspective: Unlocking ML requires an ecosystem approach appeared first on MLCommons.

]]>