New MLPerf Inference v4.1 Benchmark Results Highlight Rapid Hardware and Software Innovations in Generative AI Systems (August 28, 2024)
https://mlcommons.org/2024/08/mlperf-inference-v4-1-results/
New mixture of experts benchmark tracks emerging architectures for AI models

Today, MLCommons® announced new results for its industry-standard MLPerf® Inference v4.1 benchmark suite, which delivers machine learning (ML) system performance benchmarking in an architecture-neutral, representative, and reproducible manner. This release includes first-time results for a new benchmark based on a mixture of experts (MoE) model architecture. It also presents new findings on power consumption related to inference execution.

MLPerf Inference v4.1

The MLPerf Inference benchmark suite, which encompasses both data center and edge systems, is designed to measure how quickly hardware systems can run AI and ML models across a variety of deployment scenarios. The open-source and peer-reviewed benchmark suite creates a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry. It also provides critical technical information for customers who are procuring and tuning AI systems.

The benchmark results for this round demonstrate broad industry participation and include the debut of six newly available or soon-to-be-shipped processors:

  • AMD MI300x accelerator (available)
  • AMD EPYC “Turin” CPU (preview)
  • Google “Trillium” TPUv6e accelerator (preview)
  • Intel “Granite Rapids” Xeon CPUs (preview)
  • NVIDIA “Blackwell” B200 accelerator (preview)
  • UntetherAI SpeedAI 240 Slim (available) and SpeedAI 240 (preview) accelerators

MLPerf Inference v4.1 includes 964 performance results from 22 submitting organizations: AMD, ASUSTek, Cisco Systems, Connect Tech Inc, CTuning Foundation, Dell Technologies, Fujitsu, Giga Computing, Google Cloud, Hewlett Packard Enterprise, Intel, Juniper Networks, KRAI, Lenovo, Neural Magic, NVIDIA, Oracle, Quanta Cloud Technology, Red Hat, Supermicro, Sustainable Metal Cloud, and Untether AI.

“There is now more choice than ever in AI system technologies, and it’s heartening to see providers embracing the need for open, transparent performance benchmarks to help stakeholders evaluate their technologies,” said Mitchelle Rasquinha, MLCommons Inference working group co-chair.

New mixture of experts benchmark

Keeping pace with today’s ever-changing AI landscape, MLPerf Inference v4.1 introduces a new benchmark to the suite: mixture of experts. MoE is an architectural design for AI models that departs from the traditional approach of employing a single, massive model; it instead uses a collection of smaller “expert” models. Inference queries are directed to a subset of the expert models to generate results. Research and industry leaders have found that this approach can yield equivalent accuracy to a single monolithic model but often at a significant performance advantage because only a fraction of the parameters are invoked with each query.
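To make the routing idea concrete, here is a minimal top-2 gating layer sketched in PyTorch. This is an illustration of the general MoE pattern, not the Mixtral architecture or the MLPerf reference implementation; the layer sizes and expert count are arbitrary assumptions chosen for clarity.

```python
# Minimal sketch of top-k mixture-of-experts routing (illustrative only; not the
# MLPerf reference implementation). Sizes and expert count are arbitrary assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)          # router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                    # x: [tokens, d_model]
        scores = F.softmax(self.gate(x), dim=-1)             # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)       # keep only top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                          # only the chosen experts run
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * self.experts[e](x[mask])
        return out

tokens = torch.randn(4, 64)
print(TinyMoE()(tokens).shape)   # torch.Size([4, 64])
```

Because only a fraction of the experts run per token, the compute per query grows much more slowly than the total parameter count, which is the performance advantage described above.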

The MoE benchmark is unique and one of the most complex implemented by MLCommons to date. It uses the open-source Mixtral 8x7B model as a reference implementation and performs inferences using datasets covering three independent tasks: general Q&A, solving math problems, and code generation.

“When determining to add a new benchmark, the MLPerf Inference working group observed that many key players in the AI ecosystem are strongly embracing MoE as part of their strategy,” said Miro Hodak, MLCommons Inference working group co-chair. “Building an industry-standard benchmark for measuring system performance on MoE models is essential to address this trend in AI adoption. We’re proud to be the first AI benchmark suite to include MoE tests to fill this critical information gap.”

Benchmarking Power Consumption

The MLPerf Inference v4.1 benchmark includes 31 power consumption test results across three submitted systems covering both datacenter and edge scenarios. These results demonstrate the continued importance of understanding the power requirements for AI systems running inference tasks, as power costs are a substantial portion of the overall expense of operating AI systems.

The Increasing Pace of AI Innovation

Today, we are witnessing an incredible groundswell of technological advances across the AI ecosystem, driven by a wide range of providers including AI pioneers; large, well-established technology companies; and small startups. 

MLCommons would especially like to welcome first-time MLPerf Inference submitters AMD and Sustainable Metal Cloud, as well as Untether AI, which delivered both performance and power efficiency results. 

“It’s encouraging to see the breadth of technical diversity in the systems submitted to the MLPerf Inference benchmark as vendors adopt new techniques for optimizing system performance such as vLLM and sparsity-aware inference,” said David Kanter, Head of MLPerf at MLCommons. “Farther down the technology stack, we were struck by the substantial increase in unique accelerator technologies submitted to the benchmark this time. We are excited to see that systems are now evolving at a much faster pace – at every layer – to meet the needs of AI. We are delighted to be a trusted provider of open, fair, and transparent benchmarks that help stakeholders get the data they need to make sense of the fast pace of AI innovation and drive the industry forward.”

View the Results

To view the results for MLPerf Inference v4.1, please visit the Datacenter and Edge benchmark results pages.

To learn more about the selection of the new MoE benchmark, read the blog.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI Safety.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or contact participation@mlcommons.org.

Mixtral 8x7B: a new MLPerf Inference benchmark for mixture of experts (August 28, 2024)
https://mlcommons.org/2024/08/moe-mlperf-inference-benchmark/
MLPerf task force shares insights on the design of its mixture of experts large language model benchmark

As increasingly complex large language models (LLMs) emerge, the need for benchmarks that accurately reflect real-world applications becomes more pressing. MLPerf aims to select representative models widely used by customers in the industry, ensuring that the benchmarks are both realistic and relevant. By doing so, MLPerf provides valuable insights that help consumers make informed decisions about application deployment and budgeting. Additionally, these benchmarks enable vendors to optimize workloads effectively, catering to the practical needs of their customers while navigating the evolving AI technology landscape.

Building on LLM benchmarks already in the MLPerf Inference suite

The MLPerf Inference suite already contains GPT-J and Llama 2 70B large language models, which were added in the v3.1 and v4.0 rounds, respectively. We set out to expand on the existing LLM benchmarks in the MLPerf Inference v4.1 suite by incorporating a mixture of experts (MoE) model.

The MoE architecture is behind some of the most popular and powerful large language models (LLMs) in the field, making it an ideal candidate for inclusion in our benchmark suite. MoE models are particularly appealing due to their ability to manage computational resources efficiently while maintaining high performance. By adding an MoE model, we can provide a more comprehensive evaluation of LLM capabilities that reflects the latest advancements in AI technology.

For readers interested in learning more about mixture of experts, please refer to the detailed explanation available in this article: Mixture of Experts Explained (huggingface.co).

Illustration of the sparse Switch FFN layer in the mixture of experts network architecture (from the Switch Transformers paper).

The multi-faceted ability of the MoE model requires sophisticated accuracy and performance testing that goes beyond the single-task evaluation that has been the norm for MLPerf Inference benchmarks to date.

To take on these challenges, the MLPerf Inference working group created a technical task force to study the workload and agree on a representative benchmark. The task force goals were:

  1. Understand how MoE models are used in the industry
  2. Survey MoE models and identify those that can be used for MLPerf MoE benchmark
  3. Choose an MoE model for the benchmark 
  4. Identify tasks and datasets for the benchmark
  5. Create and test the reference implementation

Below, we describe how and why the task force evaluated the choices that led to the creation of the new MoE benchmark.

Model selection

In choosing models for the MLPerf Inference benchmark, we considered several mixture of experts (MoE) models, including DBRX, Grok, Mixtral 8x7B, and Mixtral 8x22B. Ultimately, Mixtral 8x7B was selected due to its widespread deployment and strong problem-solving capabilities across various domains. Mixtral 8x7B has gained popularity for its robust performance in handling diverse tasks, making it a good candidate for evaluating reasoning abilities. Its versatility in solving different types of problems provides a reliable basis for assessing the model’s effectiveness and enables the creation of a benchmark that is both relevant and comprehensive. In addition, Mixtral 8x7B’s size strikes a solid balance between performance and accuracy; it also allows for implementations across a wide range of processors.

Dataset and task selection

Using a single task for evaluation would not be sufficient to capture the full spectrum of the capabilities of an MoE model. Therefore, we thoroughly examined available literature to identify multiple tasks to include in the benchmark that would be both challenging and representative of real-world applications. Three were selected – Q&A, math reasoning, and coding – and we explain the rationale and selection processes in the following sections. The dataset for this combination of tasks is composed of 5,000 samples for each task, making the final dataset 15,000 samples.

Q&A task

For the Q&A benchmark dataset, we continue to use the Open Orca dataset, which aligns with the approach taken by our Llama 2 70B benchmark. Open Orca is recognized for its high quality and extensive coverage, making it an excellent choice for evaluating the question-answering capabilities of LLMs. This dataset combines augmented data from the FLAN Collection with a million GPT-4 completions, providing a rich resource for assessing natural language processing abilities.

We randomly selected 5,000 samples after filtering the original dataset according to the curation methodology. We used ROUGE metrics for accuracy measurement, following in the footsteps of MLPerf Inference v4.0.

For more details, please refer to Llama 2 70B: An MLPerf Inference Benchmark for Large Language Models – MLCommons

Question: Leo: If “A woman sitting on some steps smokes a cigarette while accessing her laptop.” does that mean that “A person sitting on some steps smokes a cigarette while accessing her laptop.”?
Options:
- yes
- it is not possible to tell
- no

Answer: Yes, the sentence “A woman sitting on some steps smokes a cigarette while accessing her laptop” can be changed to “A person sitting on some steps smokes a cigarette while accessing her laptop.” The gender of the person has been changed from specifically a woman to a person in general. However, it is important to note that the pronoun “her” is still used, which indicates that the person referred to is female. If you want to make the sentence completely gender-neutral, you could change “her laptop” to “the laptop” or “a laptop.”

A sample question from the open_orca dataset (GPT4 partition), and the response of Mixtral 8x7B.

Math reasoning task

We selected GSM8K (Grade School Math 8K) as the math reasoning dataset to evaluate the reasoning capabilities of LLMs like Mixtral 8x7B. GSM8K consists of 8,792 high-quality, linguistically diverse, grade school math word problems that require multi-step reasoning to solve. These problems are designed to be solvable by a bright middle school student and primarily involve performing a sequence of elementary calculations using basic arithmetic operations. The dataset’s importance lies in its ability to assess an LLM’s capacity for complex problem-solving, as the questions typically require two to eight steps to reach the final answer. We randomly sampled 5,000 questions from the dataset.

In setting up the benchmarks, we utilize exact match (EM) as the evaluation score, which measures the model’s ability to produce the precise correct answer, reflecting its accuracy in reasoning and computation. Additionally, a 5-shot approach is employed, where the model is provided with five examples before attempting to solve the problems. This method helps gauge the model’s ability to learn from limited examples and apply that learning to new, similar tasks, further highlighting its reasoning prowess.
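For readers who want a concrete picture of what 5-shot prompting and exact-match scoring involve, here is a minimal sketch. The prompt template and answer-extraction regex are illustrative assumptions rather than the benchmark's exact code.

```python
# Illustrative sketch of 5-shot prompting and exact-match (EM) scoring for GSM8K.
# The template and the answer-extraction regex are assumptions, not the MLPerf
# reference implementation.
import re

def build_prompt(few_shot_examples, question):
    # few_shot_examples: list of (question, worked_answer) pairs, typically 5 of them
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in few_shot_examples]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

def extract_number(text):
    # GSM8K answers are numeric; take the last number the model produced
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def exact_match(prediction_text, reference_answer):
    pred = extract_number(prediction_text)
    return pred is not None and float(pred) == float(reference_answer)

# Example: scoring the sample response shown below
print(exact_match("... so M = 120. The answer is 120 square feet.", "120"))  # True
```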

Question: The square footage of the two bedrooms in the apartment that Jenny and Martha share totals 300 square feet. If Jenny’s bedroom is 60 square feet larger than Martha’s, how large, in square feet, is Martha’s bedroom?
Answer: Let’s denote the size of Martha’s bedroom as M. Jenny’s bedroom is then M + 60 square feet. The total size of the two bedrooms is 300 square feet. So we have the equation M + (M + 60) = 300. Solving this equation for M, we get 2M + 60 = 300, which simplifies to 2M = 240, so M = 120. The answer is 120 square feet.

A sample question from the GSM8K, and the response of 5-shot Mixtral 8x7B.

Coding task

For the code generation task, the task force chose the MBXP (Mostly Basic X [Programming language] Problems) dataset created by the AWS AI Labs team. We came to this decision after evaluating many popular code generation benchmarks such as MBPP (Mostly Basic Python Problems), HumanEval, Multi-HumanEval, CoNaLa, and CodeAlpaca. We chose MBXP because of its wide range of languages and longer input prompts.

MBXP extends the popular MBPP dataset by including 12 other languages: Java, C++, JavaScript, TypeScript, Kotlin, Scala, PHP, Ruby, C#, Go, Swift, and Perl. The same core problems in MBPP are included, but they are extended with the function signature in the language of choice, which hints at which language the model should use.

We made several modifications to the dataset when adapting it for our use. The first was to apply the Mistral Prompt Format to the prompts present in the dataset to allow the model to better understand the task it was being presented with. Furthermore, we appended the footer “Here’s the completed code:\n\n```$LANGUAGE” to each prompt after discovering that Mixtral 8x7B’s code generation performance sharply decreases when generating code outside of markdown blocks.

In many code generation benchmarks, the metric used to evaluate performance is pass@k, which gives the language model k attempts to generate code that successfully passes a set of tests. The MBXP benchmark utilizes this metric by providing a comprehensive series of tests for each problem in each language, all conveniently organized within the mxeval library. For each problem in the dataset, mxeval integrates the generated code with the corresponding test suite into a single file and executes it. A successful run, indicated by an exit code of 0, is marked as successful by mxeval. Conversely, if errors occur—such as test failures or issues with code compilation or interpretation—the run is marked as a failure. In our evaluation, we have opted to use pass@1 to minimize unnecessarily long inference times, ensuring a more efficient assessment process.
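Conceptually, the pass@1 check boils down to concatenating the candidate code with its tests, executing the result, and inspecting the exit code. The sketch below is a simplified stand-in for what mxeval does; the use of sys.executable, the file layout, and the timeout value are assumptions for illustration.

```python
# Simplified stand-in for an mxeval-style pass@1 check: concatenate the generated
# code with its test suite, execute it, and treat exit code 0 as a pass.
# The interpreter invocation and timeout are illustrative assumptions.
import os
import subprocess
import sys
import tempfile

def passes_tests(generated_code: str, test_code: str, timeout_s: int = 30) -> bool:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return result.returncode == 0      # exit code 0 => marked successful
    except subprocess.TimeoutExpired:
        return False                       # hangs count as failures
    finally:
        os.remove(path)

# pass@1: a single attempt per problem
sample = "def add(a, b):\n    return a + b"
tests = "assert add(1, 2) == 3"
print(passes_tests(sample, tests))   # True
```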

For the split of the final dataset, we selected a subset of 5,000 samples. Instead of sampling equally from every language, we first decided to restrict our search to the languages in which Mixtral 8x7B is most fluent. The reasoning behind this was that the accuracy metric aims to highlight accuracy regressions, which are well hidden in languages where accuracy is already poor. Furthermore, JVM-based languages were omitted because evaluating them led to long evaluation times. These restrictions led us to sample problems from C++, JavaScript, TypeScript, PHP, Python, and Ruby, providing us with a mix of syntaxes (C-style versus more exotic syntaxes) and a mix of compiled and interpreted languages.

Problem:

<s> [INST] Complete the following code. Be concise, don’t output anything that isn’t necessary.
#include <bits/stdc++.h>
using namespace std;

/**
 * Write a function to combine two given sorted lists using heapq module.
 * > combineLists(vector<int>{1, 3, 5, 7, 9, 11}, vector<int>{0, 2, 4, 6, 8, 10})
 * {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
 * > combineLists(vector<int>{1, 3, 5, 6, 8, 9}, vector<int>{2, 5, 7, 11})
 * {1, 2, 3, 5, 5, 6, 7, 8, 9, 11}
 * > combineLists(vector<int>{1, 3, 7}, vector<int>{2, 4, 6})
 * {1, 2, 3, 4, 6, 7}
 */
vector<int> combineLists(vector<int> num1, vector<int> num2) { [/INST]Here’s the completed code:
```cpp

Solution:

#include <bits/stdc++.h>
using namespace std;

vector<int> combineLists(vector<int> num1, vector<int> num2) {
    vector<int> result;
    merge(num1.begin(), num1.end(), num2.begin(), num2.end(), back_inserter(result));
    return result;
}
```

A sample problem from MBXP, and the response of Mixtral 8x7B. The green text is inserted via preprocessing.

Performance metrics

We used the same performance metrics as those used in the Llama 2 70B benchmark: token throughput for the offline scenario, and TTFT+TPOT for the server scenario. For more information, please refer to the “Measuring performance of the LLM” section of Llama 2 70B: An MLPerf Inference Benchmark for Large Language Models – MLCommons.
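As a rough sketch, both metrics can be derived from per-query timestamps and token counts as shown below. The field names and aggregation are assumptions made for illustration; in the actual benchmark, LoadGen performs the timing and reporting.

```python
# Sketch of the two LLM performance metrics from per-query timing records.
# Field names and the single-query example are illustrative assumptions; LoadGen
# performs these measurements in the actual benchmark.

def offline_throughput(records):
    # records: list of dicts with issue/finish times (seconds) and output token counts
    total_tokens = sum(r["output_tokens"] for r in records)
    wall_time = max(r["finish_time"] for r in records) - min(r["issue_time"] for r in records)
    return total_tokens / wall_time            # tokens per second

def ttft_and_tpot(record):
    ttft = record["first_token_time"] - record["issue_time"]
    decode_time = record["finish_time"] - record["first_token_time"]
    tpot = decode_time / max(record["output_tokens"] - 1, 1)   # time per generated token
    return ttft, tpot

r = {"issue_time": 0.0, "first_token_time": 0.8, "finish_time": 30.8, "output_tokens": 151}
print(offline_throughput([r]))   # ~4.9 tokens/s for this single query
print(ttft_and_tpot(r))          # (0.8, 0.2)
```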

Reference Implementation

For readers interested in running the MoE model themselves, we encourage them to follow the reference implementation, which contains code and instructions on how to run the entire benchmark end-to-end. This can be found in the following GitHub link.

Conclusion

In the MLPerf Inference v4.1 round, there were eight Mixtral 8x7B submissions with 34 results! We hope to see more submissions in subsequent rounds as this model becomes more popular with users and as more hardware and software options optimized for the model come to market.

SDXL: An MLPerf Inference benchmark for text-to-image generation (August 28, 2024)
https://mlcommons.org/2024/08/sdxl-mlperf-text-to-image-generation-benchmark/
MLPerf task force shares insights on the design of its text-to-image benchmark

Introduction

Generative AI models have been getting a lot of attention in recent years, and image generation models in particular have piqued interest among researchers, developers, and the public. The widespread adoption of image generation motivated us to add a new benchmark to the v4.0 round of MLPerf Inference using the Stable Diffusion XL model. This blog describes the complexities involved in benchmark construction such as model and dataset selection, with a particular focus on accuracy metrics. We believe these design choices address most of the critical deployment decisions faced by practitioners while delivering important performance insights to customers.

The evolution of image generation models can be grouped into three architectural eras: autoencoders, generative adversarial networks (GANs), and diffusion models. 

Autoencoders are characterized by an encoder-decoder architecture where the encoder compresses a high-dimensional image into a low-dimensional representation in a latent space. When the latent space is mapped into a probability distribution, this network turns into a generative model where we sample latent representations and project them to image space using a decoder. This class of models is called variational autoencoders (VAEs). VAEs set a benchmark in terms of image quality and ease of training, and this type of model is still being used as a key component in diffusion models.

GANs, first published in 2014, employ the novel idea of dueling networks: a generator and a discriminator. The generator is trained to take in Gaussian noise and produce realistic images, and the discriminator is trained to identify whether the generated image is real or fake. Once trained, the generator can be used as a stand-alone network to produce realistic images. One key drawback of GANs is that they tend to be challenging to train; it can be difficult to get them to converge, and sometimes they experience mode collapse where the GAN converges on a very limited set of outputs, limiting their expressiveness. 

Diffusion models, the latest development in generative image models, offer the potential to generate a broader range of images with higher fidelity than GANs. Inspired by physics, the idea behind diffusion models is to learn the structure of the dataset by modeling the way data diffuses through the latent space. Diffusion models are trained to add and remove small amounts of Gaussian noise in repeated steps. During inference or generation, the model undergoes a reverse diffusion process where it considers a text prompt, starts with a randomly generated noisy image, and gradually works to denoise it until a predefined number of steps is reached.

Many diffusion models with varying approaches, such as Dall-E, Midjourney, and Imagen, have shown great potential in generating realistic images in recent years. One of the most important breakthroughs for diffusion models was the introduction of latent diffusion models (LDMs) in December 2021. The key differentiator in this family of models is that the entire diffusion process happens in latent space instead of pixel space. This is done using a VAE encoder which compresses the images into a low-dimensional latent vector, performs the diffusion process with a U-Net backbone, and then finally uses a VAE decoder to project the latent back to pixel space. This arrangement reduces the cost of training and inference substantially because the expensive denoising process happens in the latent space and therefore involves fewer parameters.

With increasing interest and adoption of LDMs and text-to-image models in general, MLCommons believed that a fair and robust benchmark in this domain for the deep learning community was needed. With this goal in mind, the MLPerf Inference working group formed a special task force in September 2023 to explore, define, and create a text-to-image-diffusion benchmark.

Choosing the model and its scheduler

The task force looked for candidate models that were openly available with compatible licenses, were popular in use, and produced state-of-the-art results. The candidates considered included Stable Diffusion (SD) 1.5 and 2.1 and Stable Diffusion XL (SDXL) 1.0. After deliberation, the task force chose SDXL 1.0 for the benchmark, which was subsequently introduced in the MLPerf Inference v4.0 benchmark round.

Stable Diffusion XL 1.0 is a diffusion model developed by Stability AI. By September 2023, it was the latest and largest open source text-to-image model that generated high-resolution (1024×1024) images with default settings. In comparison, SD 2.1 generated 768×768 images, and SD 1.5 generated 512×512 images. Higher resolution matters because it allows for greater detail and clarity in the generated images, making them more visually appealing and useful for various applications.

SDXL has two variants: SDXL-base (roughly 3.5 billion parameters) and SDXL-refiner (about 6.6 billion parameters). Both are significantly larger and more challenging than SD v2.1 (860 million parameters) and SD v1.5 (983 million parameters), adding a quantitative novelty to the MLPerf Inference benchmark. The refiner model improves the fine-grained details in the generated images, as illustrated in Figure 1. However, this modest qualitative improvement comes at the cost of nearly double the parameters and compute requirements. Overall, we believe SDXL-base is a good compromise between practicality and pushing the limits of our machine-learning capabilities.

Figure 1: Left is output from SDXL-base, right is output from SDXL-refiner. Note the slight improvement in the shadowed skin texture on the neck in the right image compared to the left. Image courtesy: https://www.reddit.com/r/StableDiffusion/comments/15amidh/do_we_really_need_the_refiner_for_sdxl/

As depicted in Figure 2, SDXL has two dedicated networks, CLIP1 and CLIP2, that encode text prompt tokens into an embedding space. Using two networks better captures semantic information compared to the use of a single CLIP network in other diffusion models. SDXL also leverages a larger U-Net backbone than SD. The scheduler and the U-Net together denoise the image in latent space iteratively. The VAE decoder then recovers the image latent to a 1024×1024 RGB image output at the final stage.

Figure 2: Components of the text-to-image SDXL base pipeline

Schedulers are the algorithms that run multiple times iteratively to create a clean image from a completely random noisy image. The choice of scheduler determines the exact algorithm used to obtain a less-noisy latent. Some of the popular schedulers used for Stable Diffusion are DDIM, DDPM, and EulerDiscrete. Many scheduler algorithms involve a trade-off between denoising speed and quality, and each one works best with a particular number of steps. The speed of denoising mostly depends on the number of steps needed to reach a high-quality image.

The task force chose EulerDiscrete because it is one of the fastest schedulers available and has been widely adopted by users. From a qualitative standpoint, EulerDiscrete typically can produce clear images with almost no color artifacts in 20-30 steps, while DDIM requires around 50 steps. Quantitatively, the task force experimented with Hugging Face diffusers and showed that EulerDiscrete with 20-30 steps produced FID scores (a common quality metric that we discuss later) similar to those of DDIM scheduler with 50 steps.
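Readers who want to reproduce this comparison with Hugging Face diffusers can start from a sketch like the one below. The model ID and the 20-step setting follow the choices described here, while the dtype, device, and guidance scale are assumptions for illustration (the prompt and negative prompt are reused from examples later in this post); the official reference implementation pins its own settings.

```python
# Sketch: running SDXL-base with the EulerDiscrete scheduler at 20 steps using
# Hugging Face diffusers. dtype, device, and guidance scale are illustrative
# assumptions, not the pinned benchmark configuration.
import torch
from diffusers import StableDiffusionXLPipeline, EulerDiscreteScheduler

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Swap in the EulerDiscrete scheduler, keeping the pipeline's scheduler config
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)

image = pipe(
    prompt="some meat and vegetables are arranged on a white plate",
    negative_prompt="normal quality, low quality, worst quality, low res, blurry, nsfw, nude",
    num_inference_steps=20,        # EulerDiscrete reaches good quality in 20-30 steps
    guidance_scale=8.0,            # assumed value
).images[0]
image.save("sdxl_euler_20_steps.png")
```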

Choosing the dataset

The choice of dataset is pivotal for the evaluation of an image-generation model. Each image in the dataset should be paired with a descriptive prompt that effectively communicates the visual content and context. This pairing enables the model to be assessed on its ability to generate images that are not only visually appealing but also contextually appropriate. The dataset should be diverse enough to cover various scenarios and object types in order to ensure the model’s performance is tested across different contexts.

The task force determined that the Common Objects in Context (COCO) dataset was an excellent match for this task. This dataset is recognized in the ML community for its diversity and for the richness of context in its image-caption pairs. A random subset of 5,000 prompt-image pairs were selected for the benchmark. This subset size is sufficient to ensure broad coverage and variety while being manageable for computational cost. Cross-validation tests verified that this random selection does not significantly alter the quality metrics. Additionally, COCO’s widespread use for similar tasks in the research community ensures relevance and comparability with the overall field.

Downloading the COCO images is not necessary for the benchmark evaluation. Instead, the FID score can be computed using pre-calculated statistics from the COCO dataset. Similarly, when using the CLIP score, the focus is on how well the output images correlate with the input prompts rather than with the ground-truth images.

Negative prompts are an extremely common guiding mechanism for image generation. Users can specify what they want to exclude from generation and also guide the output image away from undesired stylistic elements (e.g., avoiding cartoonish styles or unwanted elements like cars or buildings). Unfortunately, the task force could not find a suitable dataset with negative prompts. Since this is a crucial feature to the end user that also impacts performance, several prompts were analyzed, and a single common negative prompt was selected for all samples. Prompt “normal quality, low quality, worst quality, low res, blurry, nsfw, nude” generated the best results while adding additional content moderation safeguards beyond what a filtered training dataset provides and was therefore used in the benchmark.

Challenges of measuring accuracy

Evaluating generative AI models designed for image generation presents several unique challenges that we had to resolve to construct a representative and effective benchmark. A primary issue is the absence of a definitive “ground truth,” since the success of these models often depends on how well the generated images meet varying user expectations rather than adhering to an objective standard. Furthermore, the stochastic nature of these models means that output depends on the initial noise that seeds the model. Although the initial noise input can be controlled, as seen in Figures 3 and 4, the model pipeline remains sensitive to changes in hardware and software configurations. Minor adjustments to the pipeline (such as the use of quantization) can lead to wildly varying outputs. This variability rules out the use of error-based metrics such as mean squared error (MSE), root mean squared error (RMSE), or peak signal-to-noise ratio (PSNR), while also challenging the applicability of structural similarity index measure (SSIM)-based methods.

Figure 3: Two images created from the same prompt showcasing different optimization techniques. The image on the left was generated by an FP16 model, while the image on the right comes from an FP16+INT8 model featuring deep cache.

Figure 4: Visualization of the absolute error on each pixel when FP8 output from Intel Gaudi is compared to FP32 reference. The prompt was “some meat and vegetables are arranged on a white plate.”

In response to these challenges, the community has developed metrics to assess image-generating models quantitatively. The Fréchet inception distance (FID) measures the similarity between two sets of images (generated and real) by computing the Fréchet distance between two Gaussians fitted to feature vectors in an embedded space, typically derived from a pretrained inception network. However, the FID score alone is insufficient since it does not assess whether images accurately respond to the given prompts. So, the contrastive language-image pre-training (CLIP) score was chosen as an additional accuracy metric to evaluate the semantic relevance of the generated images to their textual descriptions. Since it is possible to optimize for one of these two metrics at the expense of the other, we explicitly require that both FID and CLIP scores be used in conjunction to ensure balanced optimization.
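For intuition, the FID value itself is just the Fréchet distance between two Gaussians fitted to inception features. The sketch below assumes the feature statistics (means and covariances) have already been extracted, for example from the pre-calculated COCO statistics mentioned earlier; the toy data at the end is purely illustrative.

```python
# Sketch of the FID formula, given mean/covariance statistics of inception features
# for the reference set (e.g., pre-calculated COCO statistics) and the generated set.
# Extracting those statistics with an inception network is assumed to happen elsewhere.
import numpy as np
from scipy import linalg

def frechet_inception_distance(mu_ref, sigma_ref, mu_gen, sigma_gen):
    diff = mu_ref - mu_gen
    covmean, _ = linalg.sqrtm(sigma_ref @ sigma_gen, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real              # discard tiny imaginary parts from numerics
    return diff @ diff + np.trace(sigma_ref + sigma_gen - 2.0 * covmean)

# Toy example with random 8-dimensional "features" (illustrative only)
rng = np.random.default_rng(0)
ref = rng.normal(size=(1000, 8))
gen = rng.normal(loc=0.1, size=(1000, 8))
fid = frechet_inception_distance(ref.mean(0), np.cov(ref, rowvar=False),
                                 gen.mean(0), np.cov(gen, rowvar=False))
print(round(float(fid), 4))
```

The CLIP score complements this by comparing each generated image against its text prompt in CLIP's joint embedding space, which is why the benchmark requires both metrics together.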

While these metrics can gauge certain aspects of the accuracy or quality of an image generation model, they have some shortcomings. Both metrics rely on pre-trained neural networks and can be sensitive to the choice of feature space and vulnerable to adversarial attacks. Also, they may not always correlate with human judgment of image quality. Moreover, FID requires a large sample of images to measure the distribution of a generated image dataset representatively, which can increase the cost and complexity of evaluation cycles. We acknowledge these limitations and are open to updating the benchmark in response to advances in this field.

Determining the accuracy threshold

Even though the U-Net latent noise input is fixed and the randomness is suppressed in our benchmark pipeline, different hardware and software can still produce varied outputs. To determine a reasonable accuracy threshold, the task force obtained the mean FP32 precision FID and CLIP scores on accelerators from multiple vendors for the 5,000-sample evaluation dataset. A ±2% variation limit for FID and a ±0.2% variation limit for CLIP were set around these means. FID was found to be more sensitive to hardware variation, hence the larger range. In this benchmark, a valid submission needs to achieve an FID score within the range of [23.01085758, 23.95007626] and CLIP scores within [31.68631873, 31.81331801].

Measuring performance

Figure 5: Generated images using SDXL with EulerDiscrete scheduler with 50 denoising steps (left) and 20 steps (right). There is a slight difference observed between 50 and 20 steps but no significant noise.

The run time of this particular benchmark depends on the number of denoising steps and the size of the validation set. An example can be seen in Figure 5. The task force conducted experiments to determine the optimum parameters that minimize the run time for performance and accuracy runs while maintaining the quality of generated images. As a result, 20 denoising steps and 5000 validation samples were chosen.

Robustness of the benchmark

Output reproducibility

SDXL starts the image generation with a latent space filled with random numbers from the standard normal distribution. In every following denoising step, a new image with less noise in the latent space is generated. In most applications, the initial latent state is defined by a user-specified seed value for the random number generator. This arrangement allows for regenerating the same image later with more denoising steps and/or with different sampler settings.

The StableDiffusionXLPipeline used in our reference implementation relies on torch.randn, which does not guarantee determinism across PyTorch releases and between different accelerators. MLPerf Inference rules restrict non-determinism, but submitters are free to use any framework for their submission. To ensure determinism, the task force decided to generate latent representations and provide them along with the reference implementation. Images from different accelerators with such a constant latent can be seen in Figure 6. Such a configuration helps to verify whether a submission maintains the model equivalence rule. This setup also enables the tightening of the FID and CLIP accuracy thresholds.
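A minimal sketch of how such fixed latents could be produced and reused is shown below. The latent shape follows SDXL-base's 1024×1024 output with a VAE downsampling factor of 8 and 4 latent channels; the seed, file name, and the commented pipeline call are assumptions for illustration rather than the exact reference code.

```python
# Sketch: generate a fixed latent tensor once with a seeded generator, save it,
# and feed it to the pipeline so every system starts from the same noise.
# Seed, file name, and batch size are illustrative assumptions.
import torch

generator = torch.Generator(device="cpu").manual_seed(2023)
# SDXL-base at 1024x1024: 4 latent channels, spatial dims 1024 / 8 = 128
latents = torch.randn(1, 4, 128, 128, generator=generator, dtype=torch.float32)
torch.save(latents, "fixed_latents.pt")

# Later, every run loads the same tensor instead of calling torch.randn itself:
fixed = torch.load("fixed_latents.pt")
# image = pipe(prompt, latents=fixed.to(pipe.device, pipe.dtype), num_inference_steps=20)
```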


Figure 6: Similar images are generated from four different systems (Qualcomm Cloud AI 100 submission, NVIDIA H100 submission, Intel Gaudi 2 submission, and the Hugging Face FP32 reference) thanks to setting a constant latent in the benchmark.

Compliance check

As noted earlier, FID and CLIP have limited capability in capturing the aesthetic assessment of generated images. Furthermore, the restricted non-determinism means the generated images from each submission are expected to be visually similar to the reference images generated by the PyTorch FP32 reference implementation. The task force therefore chose to add a human compliance check to the submission and review process for SDXL.

The task force decided to include a check that targets 10 image indices chosen using a random seed generator. Submitters are required to include the 10 images generated for these prompts along with their submissions. These images are inspected by submitters during the review stage of the results submission process. Any image determined to be a pure-noise image, such as a white noise image, or an image that presents completely different semantic information than its text prompt will disqualify the corresponding submission. After successfully passing this human inspection, these images are removed from each submission before its publication.

Conclusion

In this blog post, we described the new SDXL benchmark introduced in the v4.0 round of MLPerf Inference, including the complexities involved in the choices of model and dataset. In particular, the current accuracy metrics for image comparisons have notable limitations, so we undertook several measures to ensure a robust benchmark. We believe these design choices capture the real challenges faced by practitioners and accurately and fairly represent performance for customers looking at adopting image generation.

Additional details about the MLPerf Inference Stable Diffusion benchmark and reference implementation can be found here.

New MLPerf Inference Benchmark Results Highlight The Rapid Growth of Generative AI Models (March 27, 2024)
https://mlcommons.org/2024/03/mlperf-inference-v4/
With 70 billion parameters, Llama 2 70B is the largest model added to the MLPerf Inference benchmark suite.

Today, MLCommons announced new results from our industry-standard MLPerf Inference v4.0 benchmark suite, which delivers machine learning (ML) system performance benchmarking in an architecture-neutral, representative, and reproducible manner.

MLPerf Inference v4.0

The MLPerf Inference benchmark suite, which encompasses both data center and edge systems, is designed to measure how quickly hardware systems can run AI and ML models in a variety of deployment scenarios. In order to keep pace with today’s ever-changing generative AI landscape, the working group created a new task force to determine which of these models should be added to the v4.0 version of the benchmark. This task force analyzed factors including model licenses, ease of use and deployment, and accuracy in their decision-making process.

After careful consideration, we opted to add two new benchmarks to the suite. The Llama 2 70B model was chosen to represent the “larger” LLMs with 70 billion parameters, while Stable Diffusion XL was selected to represent text-to-image generative AI models.

Llama 2 70B is an order of magnitude larger than the GPT-J model introduced in MLPerf Inference v3.1 and correspondingly more accurate. One of the reasons it was selected for inclusion in the MLPerf Inference v4.0 release is that this larger model size requires a different class of hardware than smaller LLMs, which provides a great benchmark for higher-end systems. We are thrilled to collaborate with Meta to bring Llama 2 70B to the MLPerf Inference v4.0 benchmark suite. You can learn more about the selection of Llama 2 in this deep-dive post.

“Generative AI use-cases are front and center in our v4.0 submission round,” said Mitchelle Rasquinha, co-chair of the MLPerf Inference working group. “In terms of model parameters, Llama 2 is a dramatic increase to the models in the inference suite. Dedicated task-forces worked around the clock to set up the benchmarks and both models received competitive submissions. Congratulations to all!”

For the second new generative AI benchmark, the task force chose Stability AI’s Stable Diffusion XL, with 2.6 billion parameters. This popular model is used to create compelling images through a text-based prompt. By generating a high number of images, the benchmark is able to calculate metrics such as latency and throughput to understand overall performance.

“The v4.0 release of MLPerf Inference represents a full embrace of generative AI within the benchmark suite,” said Miro Hodak, MLPerf Inference co-chair. “A full third of the benchmarks are generative AI workloads: including a small and a large LLM and a text-to-image generator, ensuring that MLPerf Inference benchmark suite captures the current state of the art.”

MLPerf Inference v4.0 includes over 8,500 performance results and 900 power results from 23 submitting organizations, including: ASUSTeK, Azure, Broadcom, Cisco, CTuning, Dell, Fujitsu, Giga Computing, Google, Hewlett Packard Enterprise, Intel, Intel Habana Labs, Juniper Networks, Krai, Lenovo, NVIDIA, Oracle, Qualcomm Technologies, Inc., Quanta Cloud Technology, Red Hat, Supermicro, SiMa, and Wiwynn.

Four firms–Dell, Fujitsu, NVIDIA, and Qualcomm Technologies, Inc.–submitted data center-focused power numbers for MLPerf Inference v4.0. The power tests require power-consumption measurements to be captured while the MLPerf Inference benchmarks are running, and the ensuing results indicate the power-efficient performance of the systems tested. The latest submissions demonstrate continued progress in efficient AI acceleration.

“Submitting to MLPerf is quite challenging and a real accomplishment”, said David Kanter, executive director of MLCommons. “Due to the complex nature of machine-learning workloads, each submitter must ensure that both their hardware and software stacks are capable, stable, and performant for running these types of ML workloads. This is a considerable undertaking. In that spirit, we celebrate the hard work and dedication of Juniper Networks, Red Hat, and Wiwynn, who are all first-time submitters for the MLPerf Inference benchmark.”

View the Results

To view the results for MLPerf Inference v4.0, visit the Datacenter and Edge results pages.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire machine learning industry through benchmarks and metrics, public datasets, and best practices.

For additional information on MLCommons and details on becoming a member or affiliate, please visit MLCommons.org or email participation@mlcommons.org.

Llama 2 70B: An MLPerf Inference Benchmark for Large Language Models (March 27, 2024)
https://mlcommons.org/2024/03/mlperf-llama2-70b/
MLPerf Inference task force shares insights on the selection of Llama 2 for the latest MLPerf Inference benchmark round.

The MLCommons MLPerf Inference benchmark suite is an industry standard for measuring the performance of machine learning (ML) and artificial intelligence (AI) workloads from diverse domains including vision, speech, and natural language processing. For each of these domains, the suite includes a carefully selected set of workloads that represents the state of the art in the industry across different application segments. These benchmarks not only provide key information to consumers for making application deployment and budget decisions but also enable vendors to deliver critical workload optimizations within certain practical constraints to their customers.

With the advent and widespread adoption of large language models (LLMs), it has become even more crucial to make objective, fair and robust benchmarks readily available to the community. With this goal in mind, the MLPerf Inference working group formed a special task force in early 2023 to explore, define, and create language benchmarks.  

Building on a small LLM in Inference v3.1

In deliberations leading up to the MLPerf Inference v3.1 round, which was available for submission in August 2023, the task force initially considered two principal LLM segments: a “smaller” LLM represented by GPT-J with six billion parameters, and a “larger” LLM based on GPT-3 with 175 billion parameters.

The smaller LLM was fine-tuned for text summarization, which is an application of increasing importance, aiding in human understanding and decision making from large bodies of text. This benchmark had a successful debut in the v3.1 round, with 11 organizations submitting 30 results across CPUs, GPUs, and custom accelerators. Details of the MLPerf Inference GPT-J benchmark and reference implementation can be found here.

The task force faced a number of technical challenges with the larger, more compute-intensive LLM, and the team eventually chose to postpone work on this benchmark to the MLPerf v4.0 round.

Introducing Llama 2 70B in MLPerf Inference v4.0

For the MLPerf Inference v4.0 round, the working group decided to revisit the “larger” LLM task and spawned a new task force. The task force examined several potential candidates for inclusion: GPT-175B, Falcon-40B, Falcon-180B, BLOOMZ, and Llama 2 70B.

After careful evaluation and discussion, the task force chose Llama 2 70B as the model that best suited the goals of the benchmark. The choice of Llama 2 70B as the flagship “larger” LLM was determined by several factors:

  • Licensing restrictions: For an open benchmark consortium, model licenses must be flexible enough to offer unlimited access and use to participating submitters at minimum and, ideally, to the press and public in general. This requirement typically reduces the candidate pool to models (and datasets) with permissive licenses such as Apache 2.0, MIT, or similar. Thanks to Meta’s support, MLCommons is making the Llama 2 family of models available to MLCommons’ members, as part of the MLPerf benchmark suite, for the purpose of members conducting the benchmarking.
  • Ease of use and deployment (technical): Llama 2 70B stands out for its accessibility and ease of deployment compared to models like GPT-175B that demand substantial resources and engineering effort for pre-training, fine-tuning, and framework alignment. The Llama-2-70B-Chat-HF model dramatically streamlined the MLPerf benchmark development process, allowing the task force to integrate an advanced LLM model without extensive resource allocation.
  • Community interests: While LLM model parameter sizes have increased tremendously in the recent past, the broader ML/AI developer community’s interest also plays an important role in shaping industry adoption across different application segments. Using Hugging Face contributions and leaderboards as a proxy for community engagement, Llama 2 70B was the leading candidate for inclusion during the task force’s discussions.
  • Quality: In gauging the suitability of candidate models, we considered their accuracy scores and output text quality across popular datasets curated for different language modeling tasks. Llama 2 70B was one of the best-performing models in our evaluation under these criteria.

Choosing a task and a dataset for the LLM

The task force evaluated many of the popular tasks for LLMs and, after careful deliberation, chose to use a question-answering (or Q&A) scenario. We picked the Q&A task because it is one of the most common ways that LLMs are being used in serving applications today. Q&A tasks reflect real-world demands for natural-language understanding and interaction. We considered dialogue-based tasks, like an interactive chat session that accumulates context over a series of interactions, but their evaluation complexity posed challenges, particularly in ensuring accurate assessments and also in the difficulty of measuring performance during such a dialogue. Thus, to streamline our benchmarking process and ensure robust evaluation, we opted for the Q&A scenario, with discrete pairs of questions and answers, due to its widespread applicability and straightforward evaluation metrics.

Examining the field of available datasets for this task, we chose the Open Orca dataset among various top-rated datasets such as AllenAI/soda. Open Orca is recognized as one of the highest quality and most extensive datasets available for question answering. The dataset contains a fusion of augmented FLAN Collection data with one million GPT-4 completions, making it a rich source for evaluating NLP capabilities. 

In order to shorten benchmark runtime, we decided to select a subset of 24,576 samples from this dataset. The samples in this dataset were further curated for characteristics such as prompt quality, sample source (e.g., cot, flan), maximum input length (1,024 tokens), maximum output length (1,024 tokens), and minimum reference response length (at least three tokens) to create the validation set for the benchmark. Interested readers can refer to processorca.py for more details.
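A simplified sketch of that kind of length-based filtering is shown below. The column names and tokenizer choice are assumptions for illustration (the Llama 2 tokenizer is gated on Hugging Face); processorca.py contains the actual curation logic.

```python
# Simplified sketch of the dataset curation filters described above: keep samples
# whose input, output, and reference lengths fall inside the stated bounds.
# Column names and the tokenizer choice are assumptions; see processorca.py for
# the actual curation script.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-chat-hf")

def keep_sample(prompt: str, reference: str) -> bool:
    input_len = len(tokenizer.encode(prompt))
    ref_len = len(tokenizer.encode(reference))
    return (
        input_len <= 1024        # maximum input length
        and ref_len <= 1024      # maximum output length
        and ref_len >= 3         # minimum reference response length
    )

samples = [{"question": "What is 2+2?", "response": "The answer is 4."}]
curated = [s for s in samples if keep_sample(s["question"], s["response"])]
print(len(curated))
```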

 Figure 1: Input length distribution in the curated dataset

We observed that in general Llama 2 70B generates slightly longer outputs than GPT-4 on our dataset. Additionally, some of the output sequences from our model do not finish their generation within our bound of 1,024 tokens.

Choosing accuracy metrics for the LLM

For comprehensive evaluation, the LLM inference task force carefully considered how to adhere to a high standard of model quality. Given the potential accuracy and performance trade-offs during optimization, submissions should not deviate significantly from the output that is generated by the 32-bit floating-point (FP32) reference model that defines canonical correctness. 

The task force investigated various accuracy metrics for language models, including perplexity, MMLU, MAUVE, and ExactMatch. Some of these metrics do not take the entire generated text into consideration (e.g., perplexity and MMLU), while others have non-deterministic behavior (e.g., MAUVE). The task force ended up choosing ROUGE scores as the accuracy metrics. In essence, n-gram based metrics, such as ROUGE scores, measure how close a given text is to its reference. With the robustness of the accuracy in mind, the task force chose a set of three ROUGE metrics for this benchmark: ROUGE-1, ROUGE-2, and ROUGE-L.

The reference FP32 implementation achieves the following reference ROUGE scores:

  • ROUGE-1 = 44.4312
  • ROUGE-2 = 22.0352
  • ROUGE-L = 28.6162

The Llama 2 70B benchmark allows submissions under two categories: in the first category, all achieved scores must be above 99.9% of the reference scores; in the second category, all achieved scores must be above 99% of the reference scores. However, since there were no v4.0 submissions with accuracy lower than 99.9% of the reference, the v4.0 results table contains only the highest-accuracy category.

To further tighten this benchmark, the task force decided to allow only submissions with an average generation length of 294.45 tokens per sample, within a ±10% tolerance.
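A minimal sketch of how a submission's accuracy could be checked against those targets is shown below, using the rouge_score package. The aggregation details and field names are simplifying assumptions; the benchmark's official accuracy scripts are authoritative.

```python
# Sketch: score generated outputs with ROUGE-1/2/L and check them against 99.9%
# of the reference FP32 scores, plus the mean generation-length tolerance.
# Aggregation is simplified; the official accuracy script is authoritative.
from rouge_score import rouge_scorer

REFERENCE = {"rouge1": 44.4312, "rouge2": 22.0352, "rougeL": 28.6162}
REF_GEN_LEN = 294.45   # mean generated tokens per sample, +/- 10% allowed

def check_accuracy(predictions, references, gen_token_counts):
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    totals = {k: 0.0 for k in REFERENCE}
    for pred, ref in zip(predictions, references):
        scores = scorer.score(ref, pred)
        for k in totals:
            totals[k] += scores[k].fmeasure * 100.0      # match the percentage scale above
    means = {k: v / len(predictions) for k, v in totals.items()}

    rouge_ok = all(means[k] >= 0.999 * REFERENCE[k] for k in REFERENCE)
    mean_len = sum(gen_token_counts) / len(gen_token_counts)
    length_ok = 0.9 * REF_GEN_LEN <= mean_len <= 1.1 * REF_GEN_LEN
    return rouge_ok and length_ok, means, mean_len
```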

Figure 2: ROUGE scores distribution in outputs from the reference implementation

Measuring performance of the LLM

An LLM inference takes as input several tokens (a sentence or paragraph) and generates tokens based on the instructions. In its most basic form, one could treat the problem as asking the benchmark to fill in the last word of a sentence. Internally, the processing happens in two main steps. The input tokens are processed together to generate the first output token. This step is commonly referred to as the prompt phase or the context phase. After this phase, computation is primarily limited to generating the rest of the tokens in the response, one token at a time, reusing the context from prior computation in each step. This phase is generally called the generation phase. It is critical for the performance and accuracy metrics to capture the application execution behavior in both of these phases of computation. 

While other applications could use throughput metrics such as queries/second, the amount of computational work done by one LLM inference can be extremely different from another. This delta is due to the fact that queries can have different input and output token lengths. Therefore, the task force decided to use tokens/second as the metric for throughput measurement, which is a well understood and standard metric for LLM throughput measurement across the industry. 

Latency constraints for the server scenario

The server scenario in MLPerf Inference benchmarks simulates online applications where query arrival is random and the response latency is important. The query arrival traffic is described with a Poisson distribution where the query arrival rate (queries per second or “target_qps”) is the Poisson parameter. The test is successful if latency constraints for the benchmark are met.
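Because a Poisson arrival process has exponentially distributed inter-arrival times, a query schedule for the server scenario can be generated as sketched below. The rate and query count are arbitrary assumptions, and LoadGen handles this scheduling in the actual benchmark.

```python
# Sketch: generating server-scenario query arrival times for a Poisson process.
# Inter-arrival gaps are exponential with mean 1/target_qps. Values are assumptions;
# LoadGen generates the real schedule in the benchmark.
import numpy as np

target_qps = 10.0           # Poisson parameter: expected queries per second
num_queries = 100
rng = np.random.default_rng(seed=42)

inter_arrival = rng.exponential(scale=1.0 / target_qps, size=num_queries)
arrival_times = np.cumsum(inter_arrival)      # seconds since test start

print(f"mean observed rate: {num_queries / arrival_times[-1]:.2f} qps")
```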

In industrial applications of LLMs, the following latency metrics are considered important:

  • TTFT: Time to first token
  • TPOT: Time per output token, sometimes also called TBT (time between tokens)

Using the typical human reading speed as a logical anchor, these latency constraints were set as:

  • TTFT: <= 2 seconds
  • TPOT: <= 200 milliseconds

A TPOT of 200 ms translates to a maximum allowed generation latency that maps to ~240 words per minute (depending on the tokenizer), which is often cited as the average human reading speed.
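The arithmetic behind that figure is a quick sanity check; the words-per-token ratio below is an assumed typical value for English text and varies by tokenizer.

```python
# Back-of-the-envelope check: 200 ms per output token -> words per minute.
# The words-per-token ratio is an assumed typical value; it depends on the tokenizer.
tpot_s = 0.200                      # time per output token, seconds
tokens_per_minute = 60.0 / tpot_s   # 300 tokens per minute
words_per_token = 0.8               # rough assumption for English text
print(tokens_per_minute * words_per_token)   # ~240 words per minute
```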

While reading speed is a good yardstick for generation tasks such as LLM-based chat or tech support, other use cases have tighter latency constraints. For example, code generation, real-time slide generation, and smart agents all require much faster response times. 

Ensuring compliance to the benchmark criteria

To ensure compliance and prevent potential exploitation, the task force devised a compliance check mechanism. This test is designed to verify that the tokens generated by submitters adhere to predefined criteria. Specifically, three key tests are applied to the submissions, as sketched in the example following the list:

  1. First token consistency: The first token generated must match the first token reported to the LoadGen for timing purposes, ensuring synchronization and accuracy in the token generation processes.
  2. EOS token validation: The end-of-sequence (EOS) token should not be generated more than once in the generated output, maintaining coherence and integrity in the completion of sequences.
  3. Token count verification: The total number of tokens generated by the model must align with the number reported to the LoadGen, validating the completeness and accuracy of the generated output.
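A simplified sketch of what those three checks amount to is shown below. The field names are illustrative assumptions; the actual compliance tests are distributed with the benchmark.

```python
# Simplified sketch of the three Llama 2 70B compliance checks. Field names are
# illustrative assumptions; the official compliance tests ship with the benchmark.
def check_submission(entry, eos_token_id):
    generated = entry["generated_tokens"]          # tokens produced by the system
    reported_first = entry["loadgen_first_token"]  # first token reported to LoadGen
    reported_count = entry["loadgen_token_count"]  # token count reported to LoadGen

    # 1. First token consistency
    first_ok = len(generated) > 0 and generated[0] == reported_first
    # 2. EOS token validation: at most one EOS in the generated output
    eos_ok = generated.count(eos_token_id) <= 1
    # 3. Token count verification
    count_ok = len(generated) == reported_count

    return first_ok and eos_ok and count_ok

entry = {"generated_tokens": [523, 29871, 2], "loadgen_first_token": 523, "loadgen_token_count": 3}
print(check_submission(entry, eos_token_id=2))   # True
```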

Conclusion

In this post, we described the complexities involved in the choices of model, task, dataset, accuracy, performance, and verification of a contemporary LLM benchmark. We believe the design choices of this benchmark objectively address most of the critical deployment decisions faced by practitioners while delivering important performance insights to customers.

Details of the MLPerf Inference Llama 2 70B benchmark and reference implementation can be found here.

Given the overwhelming interest from vendors, as demonstrated by the number of submissions, we hope to continue to improve on the benchmark in future iterations, accounting for the different application domains that may steer LLM adoption.

Additional technical contributors: Jinho Suh, NVIDIA;  Junwei Yang, Google; Pablo Gonzalez, MLCommons.

MLPerf Results Highlight Growing Importance of Generative AI and Storage (September 11, 2023)
https://mlcommons.org/2023/09/mlperf-results-highlight-growing-importance-of-generative-ai-and-storage/
Latest benchmarks include LLM in inference and the first results for storage benchmark

The post MLPerf Results Highlight Growing Importance of Generative AI and Storage appeared first on MLCommons.

Today we announced new results from two MLPerf™ benchmark suites: MLPerf Inference v3.1, which delivers industry standard Machine Learning (ML) system performance benchmarking in an architecture-neutral, representative, and reproducible manner, and MLPerf Storage v0.5. This publication marks the first ever release of results from the MLPerf Storage benchmark, which measures the performance of storage systems in the context of ML training workloads.

MLPerf Inference v3.1 Introduces New LLM and Recommendation Benchmarks

MLPerf Inference v3.1 includes record participation, with over 13,500 performance results and over 2,000 power results from 26 different submitters, and performance gains of up to 40%. Submitters include: ASUSTeK, Azure, cTuning, Connect Tech, Dell, Fujitsu, Giga Computing, Google, H3C, HPE, IEI, Intel, Intel-Habana-Labs, Krai, Lenovo, Moffett, Neural Magic, NVIDIA, Nutanix, Oracle, Qualcomm, Quanta Cloud Technology, SiMA, Supermicro, TTA, and xFusion. In particular, MLCommons® would like to congratulate first-time MLPerf Inference submitters Connect Tech, Nutanix, Oracle, and TTA.

“Submitting to MLPerf is not trivial. It’s a significant accomplishment, as this is not a simple point-and-click benchmark. It requires real engineering work and is a testament to our submitters’ commitment to AI, to their customers, and to ML,” said David Kanter, Executive Director of MLCommons.

The MLCommons MLPerf Inference benchmark suite measures how fast systems can run models in a variety of deployment scenarios. ML inference is behind everything from the latest generative AI chatbots to safety features in vehicles, such as automatic lane-keeping and speech-to-text interfaces. Improving performance and power efficiency is key to deploying more capable AI systems that benefit society.

MLPerf Inference v3.1 introduces two new benchmarks to the suite. The first is a large language model (LLM) using the GPT-J reference model to summarize CNN news articles; it garnered results from 15 different submitters, reflecting the rapid adoption of generative AI. The second is an updated recommender, modified to be more representative of industry practices, using the DLRM-DCNv2 reference model and a much larger dataset, with 9 submissions. These new tests help advance AI by ensuring that industry-standard benchmarks represent the latest trends in AI adoption, helping to guide customers, vendors, and researchers.

“The submissions for MLPerf inference v3.1 are indicative of a wide range of accelerators being developed to serve ML workloads. The current benchmark suite has broad coverage among ML domains and the most recent addition of GPT-J is a welcome addition to the generative AI space. The results should be very helpful to users when selecting the best accelerators for their respective domains,” said Mitchelle Rasquinha, MLPerf Inference Working Group co-chair.

MLPerf Inference benchmarks primarily focus on datacenter and edge systems. The v3.1 submissions showcase several different processors and accelerators across use cases in computer vision, recommender systems, and language processing. There are both open and closed submissions related to performance, power, and networking categories. Closed submissions use the same reference model to ensure a level playing field across systems, while participants in the open division are permitted to submit a variety of models.

First Results for New MLPerf Storage Benchmark

The MLPerf Storage Benchmark Suite is the first open-source AI/ML benchmark suite that measures the performance of storage for ML training workloads. The benchmark was created through a collaboration spanning more than a dozen leading industry and academic organizations and covers a variety of storage setups, including parallel file systems, local storage, and software-defined storage. The MLPerf Storage Benchmark will be an effective tool for purchasing, configuring, and optimizing storage for machine learning applications, as well as for designing next-generation systems and technologies.

Training neural networks is both a compute and data-intensive workload that demands high-performance storage to sustain good overall system performance and availability. For many customers developing the next generation of ML models, it is a challenge to find the right balance between storage and compute resources while making sure that both are efficiently utilized. MLPerf Storage helps overcome this problem by accurately modeling the I/O patterns posed by ML workloads, providing the flexibility to mix and match different storage systems with different accelerator types.
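
As a toy illustration of the balance the benchmark probes (this is not DLIO or the MLPerf Storage code, and all numbers below are assumptions), a storage system keeps accelerators busy only if it can deliver training samples at least as fast as the accelerators consume them:

    # Toy illustration only; not DLIO or MLPerf Storage. All numbers are assumptions.
    measured_storage_throughput = 9_000   # samples/s the storage system can deliver
    samples_per_accelerator = 1_200       # samples/s one accelerator consumes during training
    num_accelerators = 8

    required_throughput = samples_per_accelerator * num_accelerators
    utilization = min(1.0, measured_storage_throughput / required_throughput)
    print(f"Accelerator utilization limited to {utilization:.0%} by storage")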

“Our first benchmark has over 28 performance results from five companies, which is a great start given this is the first submission round,” explains Oana Balmau, Storage Working Group co-chair. “We’d like to congratulate MLPerf Storage submitters Argonne National Laboratory (ANL), DDN, Micron, Nutanix, and WEKA for their outstanding results and accomplishments.”

The MLPerf Storage benchmark suite is built on the codebase of DLIO, a benchmark designed for I/O measurement in high performance computing, adapted to meet current storage needs.

View the Results

To view the results for MLPerf Inference v3.1 and MLPerf Storage v0.5 and find additional information about the benchmarks, please visit the results pages on mlcommons.org.

About MLCommons

MLCommons is an open engineering consortium with a mission to make machine learning better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmark in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 50+ members – global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire machine learning industry through benchmarks and metrics, public datasets, and best practices.

For additional information on MLCommons and details on becoming a Member or Affiliate, please visit MLCommons or contact participation@mlcommons.org.

MLPerf Inference Delivers Power Efficiency and Performance Gains https://mlcommons.org/2023/04/mlperf-inference-delivers-power-efficiency-and-performance-gains/ Wed, 05 Apr 2023 08:39:00 +0000 http://local.mlcommons/2023/04/mlperf-inference-delivers-power-efficiency-and-performance-gains/ Record participation in MLCommons’ benchmark suite showcases improvements in efficiency and capabilities for deploying machine learning

The post MLPerf Inference Delivers Power Efficiency and Performance Gains appeared first on MLCommons.

Today, MLCommons®, the leading open AI engineering consortium, announced new results from the industry-standard MLPerf™ Inference v3.0 and Mobile v3.0 benchmark suites, which measure the performance and power efficiency of applying a trained machine learning model to new data. The latest benchmark results illustrate the industry’s emphasis on power efficiency, with 50% more power-efficiency results and performance gains of over 60% in some benchmark tests.

Inference is the critical operational step in machine learning, where a trained model is deployed for actual use, bringing intelligence into a vast array of applications and systems. Machine learning inference is behind everything from the latest generative AI chatbots to safety features in vehicles such as automatic lane-keeping, and speech-to-text interfaces. Improving performance and power efficiency will lead the way for deploying more capable AI systems that benefit society.

The MLPerf benchmark suites are comprehensive system tests that stress machine learning models, along with the underlying software and hardware, and in some cases optionally measure power efficiency. The open-source and peer-reviewed benchmark suites create a level playing field for competition, which fosters innovation and benefits society at large through better performance and power efficiency for AI and ML applications.

The MLPerf Inference benchmarks primarily focus on datacenter and edge systems. This round featured even greater participation across the community with a record-breaking 25 submitting organizations, over 6,700 performance results, and more than 2,400 performance and power efficiency measurements. The submitters include Alibaba, ASUSTeK, Azure, cTuning, Deci.ai, Dell, Gigabyte, H3C, HPE, Inspur, Intel, Krai, Lenovo, Moffett, Nettrix, NEUCHIPS, Neural Magic, NVIDIA, Qualcomm Technologies, Inc., Quanta Cloud Technology, rebellions, SiMa, Supermicro, VMware, and xFusion, with nearly half of the submitters also measuring power efficiency.

MLCommons congratulates our many first time MLPerf Inference submitters on their outstanding results and accomplishments. cTuning, Quanta Cloud Technology, rebellions, SiMa, and xFusion all debuted their first performance results. cTuning, NEUCHIPS, and SiMa also weighed in with their first power efficiency measurements. Lastly, HPE, NVIDIA, and Qualcomm all submitted their first results for inference over the network.

The MLPerf Mobile benchmark suite is tailored for smartphones, tablets, notebooks, and other client systems. The MLPerf Mobile application for Android and iOS is expected to be available shortly.

To view the results and find additional information about the benchmarks, please visit https://mlcommons.org/en/inference-datacenter-30, https://mlcommons.org/en/inference-edge-30, and https://mlcommons.org/en/inference-mobile-30.

About MLCommons

MLCommons is an open engineering consortium with a mission to make machine learning better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmark in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 50+ founding partners – global technology providers, academics and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire machine learning industry through benchmarks and metrics, public datasets and best practices.

For additional information on MLCommons and details on becoming a Member or Affiliate of the organization, please visit http://mlcommons.org and contact participation@mlcommons.org.

Press Contact:
Kelly Berschauer
kelly@mlcommons.org

Perspective: Unlocking ML requires an ecosystem approach https://mlcommons.org/2023/03/unlocking-ml-requires-an-ecosystem-approach/ Fri, 10 Mar 2023 08:04:00 +0000 http://local.mlcommons/2023/03/unlocking-ml-requires-an-ecosystem-approach/ Factories need good roads to deliver value

The post Perspective: Unlocking ML requires an ecosystem approach appeared first on MLCommons.

Over the past 10 years, machine learning (ML) has grown exponentially, from a small research field to a broad community spanning universities, technology platforms, industries, nonprofits, and governments. The global funding (PE/VC) in AI companies increased by 390% from $16 billion in 2015 to $79 billion in 2022.[1] Recent advances in ML have made it possible for machines to not only consume but also create content to a level that approaches what humans can do. This capability is sometimes called Generative AI[2]. This is primarily enabled by foundation models[3] – large, pre-trained models trained on copious amounts of diverse data through self-supervised learning that can be adapted to a wide array of downstream tasks. The disruptive potential of foundation models lies in their ability to achieve state-of-the-art performance on tasks with little or no task-specific training required (though such “fine-tuning” further improves capabilities).

While considerable innovation is taking place across the ML research-to-production life cycle, it often occurs organically and in silos, resulting in uneven impact and a low transfer of innovation across industry verticals. A handful of technology companies are experiencing benefits from deploying cutting-edge ML at scale, while others are still learning to operationalize it: to quote one study, nearly 80 percent of AI solutions are abandoned because companies struggle to scale them.[4] Even in companies with advanced ML capabilities, there is considerable friction and uncertainty within the ML development cycle. Reasons for this include:

  • ML has a fundamentally different programming paradigm than traditional software. In traditional software engineering, a programmer writes explicit instructions for the computer to execute, while in machine learning—especially for deep neural networks—the programmer provides a data set and specifies a model architecture and other training code from which the program learns the desired behavior. While this is an incredibly powerful programming tool, it creates an inherent unpredictability that must be accounted for throughout the ML development life cycle.
  • The ML programming paradigm requires a very different development cycle. The best practice for traditional software development uses a collaborative, flexible, agile methodology and the scrum framework, whereby teams make rapid, incremental advancements toward well-defined goals. ML development, by contrast, is relatively opaque and unpredictable; in effect, data is the new code, and the quality and quantity of data determine the ML’s functionality—though this connection is still poorly understood. Teams continuously revisit data sets and retrain models as they move ML innovations from research into production – a process we term the “ML development flywheel” (Exhibit 1).
  • ML lacks a mature ecosystem – a set of shared tools, resources, standards, and best practices – to support this development cycle. Traditional computer programming has had decades to develop an ecosystem that has become so prevalent that nearly 80 percent of all sites on the internet today rely on a common software stack[5]. ML development, however, is still considered an art as much as a science, often requiring a custom solution for each new problem and highly experienced engineers who draw on hard-won intuition to try various techniques until they arrive at the desired model behavior and convergence.

Exhibit 1: ML Development Flywheel

The result is that the ML field today resembles the internet of the late 1990s. It has high potential but uneven impact, knowledge is spread across myriad organizations and individuals, and programming requires a collection of custom, often vertically integrated tech stacks.

Motivated by the challenge of increasing ML’s impact in society, the MLCommons Association collaborated with several partners on a high-level analysis of how to develop a healthier ML ecosystem (Exhibit 2). We wanted to answer two key questions—where are the opportunities for unlocking ML’s greatest impact on society, and what are the best ways to address them—with the goal of providing a perspective on how various ML ecosystem players, including enterprises and technology hyperscalers,[6] can contribute to driving greater ML adoption.

Exhibit 2: Types of ML Ecosystem players

We interviewed 32 experts across these ML ecosystem players[7]—including engineers and product managers from leading tech companies, business leaders, the heads of industry and civil-society groups, and academics—to explore the challenges they face in developing ML and attempt to prioritize solutions for addressing them.

Opportunities to improve ML

Exhibit 3: Opportunities to improve ML fall into three broad categories

Opportunities to unlock the full impact of ML at scale fall into three categories: increasing the capabilities of ML technology, broadening access to and adoption of ML, and ensuring that the results of ML benefit society (Exhibit 3). All ecosystem players have a role to play for ML to reach its full potential, and contributions can be in the form of tools, data, funding, education, training, benchmarks, best practices, partnerships and coalitions, and governance and policy.

Deep dives on critical ecosystem areas to improve

In this section, we discuss three key opportunities that emerged in our discussions with business leaders and ML practitioners: data discoverability, access, and quality; operational scaling and hardening; and ethical applied ML.

Data discoverability, access, and quality: ML developer productivity is data-limited, but public data is primitive and data tools are poor

ML developers face two broad categories of data challenges:

  • Finding, accessing, and labeling data. Developers need to locate relevant data, gain access to it—which can be especially difficult if it’s held by multiple organizations—format it for consistency, and sometimes efficiently and correctly label it.
  • Making data useful. Collected data can be incomplete, incorrect, biased, or misleading in numerous ways, problems which are hard to detect prior to deployment.

These challenges are exacerbated by the relatively low investment companies are making in public ML data (compared with the substantial investment in model development), the nonexistent to primitive nature of metrics and standards for ML data sets, and the lack of a widely used tool stack for working with data. Opportunities for improving the ML data ecosystem include:

  • Improving public data sets. There is opportunity for a network of partners to work together to create open data sets that support multiple contributors and continuous quality improvements. Datasets such as the Pile[8] and LAION[9] are early, promising examples of the potential of this approach, though other issues such as licensing still need to be addressed. One estimate states that the boost to the economy from adoption of open-data ecosystems in finance could range from about 1 to 1.5 percent of GDP in 2030 in the European Union, the United Kingdom, and the United States, to as much as 4 to 5 percent in India.[10] There is a particularly strong need for open datasets that define community consensus on usable public data, serve as tests and benchmarks for different approaches, or address societal issues without commercial interest.
  • Increasing data sharing. Public data sets should be populated by standardizing best practices for anonymizing and releasing data or finding innovative ways to leverage private data without releasing it. For example, MedPerf is an open framework for benchmarking machine learning in the medical domain that securely evaluates medical AI models on distributed healthcare data sets while keeping the data private.[11]
  • Agreeing on common data metrics and standards. First, metrics, such as DataPerf, should be created that quantify data quality (DataPerf is a benchmark suite for ML datasets and data-centric algorithms[12]). Second, data formats and metadata content should be standardized; Google model cards, IBM fact sheets, datasheets for datasets, and the Data Nutrition Project all provide examples of how to do this.[13] Third, best practices for ethical data sourcing and consistent licensing should be developed.
  • Improving interoperable data-centric tools. These metrics and standards should be leveraged to improve data quality and reduce costs through a data-development stack of interoperable tools akin to those widely used for code.

Operational scaling and hardening: deploying models from research into production too often relies on costly and fragile bespoke solutions

Delivering stellar results using a prototype model trained on an experimental data set is a significant accomplishment, but putting the model into production to deliver business value brings new challenges that generally fall into two categories:

  • Building the ML data platform. Experimental data is often hand-curated or has different characteristics from the actual, production data at a company. As a result, teams need to invest in building persistent pipelines to access and transform production data and design features to approximate those used for the experimental model. Often the new model does not match the evaluation performance of the experimental one and additional work is needed to bring it up to par. An abrupt handoff between the research and production teams can increase the challenge of this work if the team responsible for operationalizing the model into a product is unaware of the context and datasets used in model development. Finally, live data needs to be collected as the model is put through dark launches, live experiments, and full launch, all while monitoring the model output to ensure the query and training distribution match, and it isn’t producing unintended outcomes. Building this end-to-end data story, with an ability to track artifact provenance, requires significant up-front investment before any business impact is realized.
  • Navigating infrastructure fragmentation. Much of the open-source infrastructure used to write ML models and compile them down to hardware-native code was created by researchers for their own specific, fast-changing needs. This has resulted in an ecosystem of toolchains that is highly siloed or patched together using custom bridges, which creates a host of interoperability problems—for example, trying to port a server-side model written in Pytorch to a mobile back-end using TFLite can be a challenging process—and means teams often have to throw out their existing model code and pipelines and start from scratch as their business needs evolve, significantly slowing down development velocity.

ML tooling innovation is addressing some of these challenges, in the process attracting high levels of VC funding and some of the best talent in the industry. An ecosystem perspective suggests new solutions are needed to accelerate ML adoption:

  • Data-platform standards and best practices. Best practices for building robust data platforms could enable a smooth transition from prototyping to live monitoring. While it is common for different companies to use different tools for each stage of the pipeline, data sharing and security standards would ensure these tools work seamlessly together.
  • Standard interchange formats. Decoupling higher and lower parts of the core ML stack could drive interoperability and reduce fragmentation. Developers should be able to design the toolchain (for example, PyTorch framework with XLA compiler) that best meets their needs and works across hardware back-ends. Defining a standard set of operators at various levels of the stack would make ML development much more scalable and make it easier for hardware and software vendors to focus on their strengths rather than on needing to support multiple independent code paths.
  • Include people and organizations. Promising ML applications often fail to gain traction for nontechnical reasons such as a lack of stakeholder involvement during development and skepticism by front-line users during early iterations. Organizations as a whole, not just ML engineering teams, need to be aligned around the ML development flywheel.

Ethical applied ML: developers lack universally accepted standards, technologies and benchmarks to build responsible ML applications

ML has driven phenomenal advances in many areas, from autonomous vehicles and drug discovery to financial fraud-detection and retail-product recommendations. However, it has also had negative effects, including perpetuating systemic biases[15], making many companies cautious about adopting it. Building a trust-based AI architecture will expand the scope of industries and use cases where ML can be applied and will require work across several dimensions – bias and fairness, explainability and transparency, human oversight and accountability, privacy and data ethics, performance and safety, security and sustainability[16]. These can be addressed through several approaches:

  • Privacy-enhancing technologies. Many tech hyperscalers and open-source communities provide privacy-enhancing technology solutions, such as federated learning and homomorphic encryption, but these solutions lack standardization, which can lead to reinventing the wheel in silos rather than alignment among ecosystem players. A common set of community backed standards can help ensure these technologies are more widely adopted, benefiting everyone.
  • Best practices. There is a need for universally accepted best practices so that organizations with hundreds of models can carry out the full gamut of fairness checks and evaluate data risks. Instituting best practices while the technology is still young can minimize problems caused by proprietary systems not “talking” to one another later, help standardize models, and contribute to curbing bias. Early development of such practices can help avoid what occurred with electronic health records; the promise was that they would help patients and providers easily access relevant information, but the reality is that records remain fragmented across platforms, often causing frustration for stakeholders at every level.
  • Policy and governance. Global standards for AI regulation are still being developed and governments, technology companies, and civil society groups can help shape legislation to focus on key principles like transparency, human centricity, contestability, data protection, and fairness. For regulations already in place, there remains an opportunity to provide tools and certifications to help companies implement these guidelines.
  • Tools and benchmarks. There is a need for comprehensive benchmarks tailored to the needs of specific industries for identifying and addressing bias. An example of this is the Monk Skin Tone Scale[17] developed and released by Google that can be leveraged to evaluate computer vision datasets and models for skin tone representation. There is also a need for better documentation and artifact tracking at each stage of the ML lifecycle.[18]
  • Education and training. To accelerate the adoption of tools, benchmarks, and best practices, training resources should be customized to the needs and roles of each of the players involved in each step of the ML development flywheel. One example of such a resource is provided by the EDM Council, a global association focused on industry-validated trainings and certifications and other practitioner resources.

Concluding thoughts

The ML field is still nascent—barely 10 years old—and significant work is needed to transform it into an established discipline that achieves its full potential of social and economic impact. Mainstream ML adopters and tech hyperscalers will be key to this endeavor.

Key takeaways for Enterprise adopters of ML

In recent years, the momentum of R&D in AI has been exponential; more than 30 times as many patents were filed in the field in 2021 as in 2015.[19] Given this momentum, as well as other factors such as increased investment, AI is not only performing better but also becoming more affordable, leading to increased adoption across industries. According to McKinsey & Company’s 2022 Global Survey on the state of AI[20], 50 percent of respondents reported their organizations had adopted AI in at least one function in 2021, compared with 20 percent in 2017. While AI adoption has grown 2.5 times since 2017, it has more recently plateaued. And while the consensus among respondents is that AI will be a mainstream technology at their companies, a 2020 survey found that only 76 percent of organizations were close to breaking even on their AI investments when considering the costs and not just the benefits.[21] To address these challenges and realize the full potential of AI for themselves and other ecosystem players, enterprise adopters could adopt the following strategies:

  • Reassess the path to ML production. The path to performance observability and model conformance lies as much in the data that is fueling the infrastructure as it does in the design of the infrastructure itself. Thus, enterprise ML adopters should be as metrics-driven on evaluating data quality as they are on enabling high-performance hardware. This would enable the next level of scale and impact for enterprises. Enterprise adopters could work to prevent bias and drift by building in time for iterative, unpredictable developments and bringing in all relevant organizational stakeholders, including legal and ethics leads, early in the ML development flywheel. This would help organizations identify regulatory requirements up-front, mitigate bias, and communicate clearly about privacy and governance laws.
  • Support emerging standards and best practices, and tailor them to current and future challenges. Many current ML development tools are either domain-specific or don’t scale well when moving from research prototyping to deploying ML at scale. Adopting open-source frameworks and best practices and eschewing one-off solutions can ensure an organization does not get stuck on a tech island as it matures but continues to learn and adapt as tooling solutions, policies, and innovations in the ecosystem evolve and scale.
  • Invest in people. Today’s AI heroes are often just those who write model code. However, enterprises need to ensure that everyone involved in the ML development flywheel – including those who enable the data flow, provide tooling solutions, assure unbiased outcomes, and monitor live operations – is also rewarded and upskilled.

Key takeaways for tech hyperscalers

As some of the largest investors in ML research, infrastructure, and applications, tech hyperscalers such as Google and Meta are critical to advancing the ML ecosystem. In addition to the significant contributions they already make through university collaborations and open-source projects, we would encourage them to invest in the following:

  • Support the development of public resources, especially data sets and tooling, to allow researchers to continue to innovate and communicate as ML technology advances.
  • Contribute to legal and business best practices that help all companies use ML effectively and responsibly. This can be achieved by partnering with mainstream ML adopters to share best practices on translating business problems into ML solutions and providing a solid technical foundation for emerging regulatory frameworks.
  • Work cross-organizationally or with industry groups to drive interoperability in open-source frameworks and tooling. This could enable more companies to participate in ML collaboratively and flexibly, growing overall adoption.
  • Encourage efforts to raise public understanding and awareness of AI’s potential for positive societal impact, such as through research into AI applications for social good and efforts to reduce bias and increase inclusion.
  • Be transparent about the limitations and risks of ML-based solutions.

Authors


Peter Mattson is a Senior Staff Engineer at Google. He co-founded and is President of MLCommons®, and co-founded and was General Chair of the MLPerf™ consortium that preceded it.

Aarush Selvan is a Product Manager at Google where he works on Natural Language Processing for the Google Assistant.

David Kanter is a co-founder and the Executive Director of MLCommons, where he helps lead the MLPerf benchmarks and other initiatives.

Vijay Janapa Reddi is an Associate Professor at Harvard University. His research interests include computer architecture and runtime systems, specifically in the context of autonomous machines and mobile and edge computing systems.

Roger Roberts is a Partner at McKinsey & Company’s Bay Area office and advises clients in a broad range of industries as they address their toughest technology challenges, especially in the retail and consumer sectors.

Jacomo Corbo is Founder and co-CEO of PhysicsX, a company building AI generative models for physics and engineering; he was previously Chief Scientist and Founder of QuantumBlack and a Partner at McKinsey & Company.

The authors would like to thank Bryan Richardson, Chaitanya Adabala Viswa, Chris Anagnostopoulos, Christopher McGrillen, Daniel Herde, David Harvey, Kasia Tokarska, Liz Grennan, Martin Harrysson, Medha Bankhwal, Michael Chui, Michael Rudow, Rasmus Kastoft-Christensen, Rory Walsh, Pablo Illanes, Saurabh Sanghvi, Stephen Xu, Tinashe Handina, Tom Taylor-Vigrass, Yetunde Dada, and Douwe Kiela, as well as Adam Yala from the University of California, Berkeley, Kasia Chmielinski from the Data Nutrition Project, and contributors from Google, Harvard University, Hugging Face, Meta AI, and Partnership on AI for their contributions to this article.


  1. Pitchbook ↩
  2. https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-is-generative-ai ↩
  3. Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N., Chen, A., Creel, K., Davis, J. Q., Demszky, D., … Liang, P. (2022, July 12). On the opportunities and risks of Foundation models. arXiv.org. Retrieved January 20, 2023, from https://arxiv.org/abs/2108.07258 ↩
  4. “How to operationalize machine learning for maximum business impact: Frameworks for machine learning operationalisation are key,” Wired, in partnership with QuantumBlack, June 12, 2021, https://www.wired.co.uk/bc/article/mckinsey-machine-learning ↩
  5. “Usage statistics of PHP for websites,” W3 Techs, n.d., https://w3techs.com/technologies/details/pl-php ↩
  6. Tech hyperscalers are technology companies, such as Google, Microsoft, and Amazon, that operate at massive scale in software products and services and are also expanding into numerous industry verticals ↩
  7. Data Nutrition Project, Google, Harvard University, Hugging Face, Meta AI, Partnership on AI, and the University of California, Berkeley ↩
  8. https://pile.eleuther.ai/ ↩
  9. https://laion.ai/blog/laion-400-open-dataset/ ↩
  10. White, Olivia, Anu Madgavkar, Zac Townsend, James Manyika, Tunde Olanrewaju, Tawanda Sibanda, and Scott Kaufman. “Financial Data Unbound: The Value of Open Data for Individuals and Institutions.” McKinsey & Company June 29, 2021. https://www.mckinsey.com/industries/financial-services/our-insights/financial-data-unbound-the-value-of-open-data-for-individuals-and-institutions ↩
  11. Alexandros Karargyris et al., “MedPerf: Open benchmarking platform for medical artificial intelligence using federated evaluation,” n.d., https://arxiv.org/ftp/arxiv/papers/2110/2110.01406.pdf ↩
  12. https://dataperf.org/ ↩
  13. “The value of a shared understanding of AI models,” n.d., https://modelcards.withgoogle.com/about; Michael Hind, “IBM FactSheets further advances trust in AI,” July 9, 2020, https://www.ibm.com/blogs/research/2020/07/aifactsheets/; Timnit Gebru et al., “Datasheets for datasets,” Cornell University, March 23, 2018, https://arxiv.org/abs/1803.09010; “The Data Nutrition Project,” n.d., https://datanutrition.org/ ↩
  14. https://openai.com/pricing ↩
  15. Silberg, Jake, and James Manyika. “Tackling Bias in Artificial Intelligence (and in Humans).” McKinsey & Company, July 22, 2020. https://www.mckinsey.com/featured-insights/artificial-intelligence/tackling-bias-in-artificial-intelligence-and-in-humans ↩
  16. https://www.mckinsey.com/featured-insights/in-the-balance/from-principles-to-practice-putting-ai-ethics-into-action ↩
  17. https://skintone.google/ ↩
  18. “Machine learning life cycle,” Partnership on AI, n.d., https://docs.google.com/presentation/d/1fh4AGTeXRN0hInvoZsYMz7Bb3mgzfZktdGZtKvBRED8/edit#slide=id.gdf40b9b06b_2_491 ↩
  19. Daniel Zhang et al., “Artificial Intelligence Index Report 2022,” Stanford Institute for Human-Centered AI, Stanford University, March 2022, https://aiindex.stanford.edu/wp-content/uploads/2022/03/2022-AI-Index-Report_Master.pdf ↩
  20. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2022-and-a-half-decade-in-review ↩
  21. “2021 AI Predictions: No uncertainty here,” PwC, October 2020, https://www.pwc.com/mt/en/publications/technology/ai-predictions-2021/ai-predictions-2021-no-uncertainty-here.html ↩
  22. https://www.mckinsey.com/capabilities/quantumblack/our-insights/generative-ai-is-here-how-tools-like-chatgpt-could-change-your-business ↩

MLPerf Results Show Advances in Machine Learning Inference https://mlcommons.org/2022/09/mlperf-results-show-advances-in-machine-learning-inference/ Thu, 08 Sep 2022 08:43:00 +0000 http://local.mlcommons/2022/09/mlperf-results-show-advances-in-machine-learning-inference/ MLCommons establishes a new record with over 5,300 performance results and 2,400 power measurement results, 1.37X and 1.09X more than the previous round.

The post MLPerf Results Show Advances in Machine Learning Inference appeared first on MLCommons.

SAN FRANCISCO – September 8, 2022 – Today, the open engineering consortium MLCommons® announced fresh results from MLPerf™ Inference v2.1, which analyzes the performance of inference – the application of a trained machine learning model to new data. Inference allows for the intelligent enhancement of a vast array of applications and systems. This round set new records, with nearly 5,300 performance results and 2,400 power measurements, 1.37X and 1.09X more than the previous round, respectively, reflecting the community’s vigor.

MLPerf benchmarks are comprehensive system tests that stress machine learning models, software, and hardware, and optionally monitor energy consumption. The open-source and peer-reviewed benchmark suites level the playing field for competition, which fosters innovation, performance, and energy efficiency for the whole sector.

The MLPerf Inference benchmarks are focused on datacenter and edge systems, and Alibaba, ASUSTeK, Azure, Biren, Dell, Fujitsu, GIGABYTE, H3C, HPE, Inspur, Intel, Krai, Lenovo, Moffett, Nettrix, Neural Magic, NVIDIA, OctoML, Qualcomm Technologies, Inc., SAPEON, and Supermicro are among the contributors to the submission round.

To view the results and find additional information about the benchmarks, please visit https://mlcommons.org/en/inference-datacenter-21/ and https://mlcommons.org/en/inference-edge-21/. These results reveal extensive industry participation and a focus on energy savings, paving the way for more capable intelligent systems that will benefit society as a whole.

About MLCommons
MLCommons is an open engineering consortium with a mission to benefit society by accelerating innovation in machine learning. The foundation for MLCommons began with the MLPerf benchmark in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 50+ founding partners – global technology providers, academics and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire machine learning industry through benchmarks and metrics, public datasets and best practices.

For additional information on MLCommons and details on becoming a Member or Affiliate of the organization, please visit http://mlcommons.org/ and contact participation@mlcommons.org.

Press Contact:
David Kanter
press@mlcommons.org

MLPerf Results Show Advances in Machine Learning Inference Performance and Efficiency https://mlcommons.org/2022/04/mlperf-results-show-advances-in-machine-learning-inference-performance-and-efficiency/ Wed, 06 Apr 2022 08:32:00 +0000 http://local.mlcommons/2022/04/mlperf-results-show-advances-in-machine-learning-inference-performance-and-efficiency/ MLCommons’ latest benchmarks illustrate focus on energy efficiency and up to 3.3X performance gains

The post MLPerf Results Show Advances in Machine Learning Inference Performance and Efficiency appeared first on MLCommons.

Today, MLCommons®, an open engineering consortium, released new results for three MLPerf™ benchmark suites – Inference v2.0, Mobile v2.0, and Tiny v0.7. These three benchmark suites measure the performance of inference – applying a trained machine learning model to new data. Inference enables adding intelligence to a wide range of applications and systems. Collectively, these benchmark suites scale from ultra-low power devices that draw just a few microwatts all the way up to the most powerful datacenter computing platforms. The latest MLPerf results demonstrate wide industry participation, an emphasis on energy efficiency, and up to 3.3X greater performance ultimately paving the way for more capable intelligent systems to benefit society at large.

The MLPerf benchmarks are full system tests that stress machine learning models, software, and hardware and optionally measure power usage. The open-source and peer-reviewed benchmark suites provide a level playing field for competition that drives innovation, performance, and energy-efficiency for the entire industry.

“This was an outstanding effort by the ML community with so many new participants and the tremendous increase in the number and diversity of submissions.” said David Kanter, Executive Director of MLCommons. “I’m especially excited to see greater adoption of power and energy measurements, highlighting the industry’s focus on efficient AI.”

The MLPerf Inference benchmarks primarily focus on datacenter and edge systems and submitters include Alibaba, ASUSTeK, Azure, Deci.ai, Dell, Fujitsu, FuriosaAI, Gigabyte, H3C, Inspur, Intel, Krai, Lenovo, Nettrix, Neuchips, NVIDIA, Qualcomm Technologies, Inc., Supermicro, and ZhejiangLab. This round set new records with over 3,900 performance results and 2,200 power measurements, respectively 2X and 6X more than the prior round, demonstrating the momentum of the community.

The MLPerf Mobile benchmark suite targets smartphones, tablets, notebooks, and other client systems with the latest submissions highlighting an average 2X performance gain over the previous round. MLPerf Mobile v2.0 includes a new image segmentation model, MOSAIC, that was developed by Google Research with feedback from MLCommons. The MLPerf Mobile application and the corresponding source code, which incorporates the latest updates and submitting vendors’ backends, are expected to be available in the second quarter of 2022.

The MLPerf Tiny benchmark suite is intended for the lowest power devices and smallest form factors, such as deeply embedded, intelligent sensing, and internet-of-things applications. The second round of MLPerf Tiny results showed tremendous growth in collaboration with submissions from Alibaba, Andes, hls4ml-FINN team, Plumerai, Renesas, Silicon Labs, STMicroelectronics, and Syntiant. Collectively, these organizations submitted 19 different systems with 3X more results than the first round and over half the results incorporating energy measurements, an impressive achievement for the first benchmarking round with energy measurement.

MLCommons would like to congratulate first time MLPerf Inference submitters ASUSTeK, Azure, H3C, and ZhejiangLab and also Gigabyte and Fujitsu for their first power measurements along with first time MLPerf Tiny submitters Alibaba, Andes, Plumerai, Renesas, Silicon Labs, and STMicroelectronics and also the hls4ml-FINN team and Syntiant on their first energy measurements.

To view the results and find additional information about the benchmarks, please visit https://mlcommons.org/en/inference-datacenter-20/, https://mlcommons.org/en/inference-edge-20/, https://mlcommons.org/en/inference-mobile-20/, and https://www.mlcommons.org/en/inference-tiny-07/.

About MLCommons

MLCommons is an open engineering consortium with a mission to benefit society by accelerating innovation in machine learning. The foundation for MLCommons began with the MLPerf benchmark in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 50+ founding partners – global technology providers, academics and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire machine learning industry through benchmarks and metrics, public datasets and best practices.

For additional information on MLCommons and details on becoming a Member or Affiliate of the organization, please visit http://mlcommons.org/ and contact participation@mlcommons.org.

Press Contact:
David Kanter
press@mlcommons.org

MLPerf Inference v1.1 Results https://mlcommons.org/2021/09/mlperf-inference-v1-1-results/ Wed, 22 Sep 2021 08:01:00 +0000 http://local.mlcommons/2021/09/mlperf-inference-v1-1-results/ Latest round includes over 1,800 performance and 350 power results for leading ML inference systems

The post MLPerf Inference v1.1 Results appeared first on MLCommons.

Today, MLCommons®, an open engineering consortium, released new results for MLPerf™ Inference v1.1, the organization’s machine learning inference performance benchmark suite. MLPerf Inference measures the performance of applying a trained machine learning model to new data for a wide variety of applications and form factors, and optionally includes system power measurement.

MLPerf Inference is a full system benchmark, testing machine learning models, software, and hardware. The open-source and peer-reviewed benchmark suite provides a level playing field for competition that drives innovation and performance for the entire industry. While the majority of systems improved by 5-30% in just 5 months, some submissions improved to more than twice their previous performance, demonstrating the value of software optimization that will have a real impact on AI workloads.

Similar to past MLPerf Inference results, the submissions consist of two divisions: closed and open. Closed submissions use the same reference model to ensure a level playing field across systems, while participants in the open division are permitted to submit a variety of models. Submissions are additionally classified by availability within each division, including systems commercially available, in preview, and RDI (research, development, and internal).

MLPerf Inference v1.1 results further MLCommons’ goal to provide benchmarks and metrics that level the industry playing field through the comparison of ML systems, software, and solutions. The latest benchmark round received submissions from 20 organizations and released over 1,800 peer-reviewed performance results for machine learning systems spanning from edge devices to data center servers. This is the second round of MLPerf Inference to offer power measurement, with over 350 power results.

Submissions this round included software and hardware innovations from Alibaba, Centaur Technology, cTuning, Dell, EdgeCortix, Fujitsu, FuriosaAI, Gigabyte, HPE, Inspur, Intel, Krai, Lenovo, LTechKorea, Nettrix, Neuchips, NVIDIA, OctoML, Qualcomm Technologies, Inc., and Supermicro. In particular, MLCommons would like to congratulate new submitters cTuning, LTechKorea, and OctoML, and also Inspur for submitting their first power measurements. To view the results, please visit https://mlcommons.org/en/inference-datacenter-11/ and https://www.mlcommons.org/en/inference-edge-11/.

“We had an outstanding set of results that showed significant improvement across the MLPerf Inference benchmark suite,” said Ramesh Chukka, Co-Chair of the MLPerf Inference Working Group. “We are excited about bringing these performance gains to the AI community. Congratulations to all of our submitters, especially the first time submission teams.”

“The progress demonstrated in this round of results is outstanding over such a short time frame,” said David Kanter, Executive Director of MLCommons. “We are especially excited to see more software solution providers joining the MLPerf community to help improve machine learning.”

Additional information about the Inference v1.1 benchmarks is available at https://mlcommons.org/en/inference-datacenter-11/ and https://www.mlcommons.org/en/inference-edge-11/.

About MLCommons

MLCommons is an open engineering consortium with a mission to accelerate machine learning innovation, raise all boats and increase its positive impact on society. The foundation for MLCommons began with the MLPerf benchmark in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 50+ founding partners – global technology providers, academics and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire machine learning industry through benchmarks and metrics, public datasets and best practices.

For additional information on MLCommons and details on becoming a Member or Affiliate of the organization, please visit http://mlcommons.org/ or contact participation@mlcommons.org.

Press Contact:
press@mlcommons.org

MLPerf Inference v1.0 Results with First Power Measurements https://mlcommons.org/2021/04/mlperf-inference-v1-0-results-with-first-power-measurements/ Wed, 21 Apr 2021 08:08:00 +0000 http://local.mlcommons/2021/04/mlperf-inference-v1-0-results-with-first-power-measurements/ The latest benchmark includes 1,994 performance and 862 power efficiency results for leading ML inference systems

The post MLPerf Inference v1.0 Results with First Power Measurements appeared first on MLCommons.

Today, MLCommons®, an open engineering consortium, released results for MLPerf™ Inference v1.0, the organization’s machine learning inference performance benchmark suite. In its third round of submissions, the results measured how quickly a trained neural network can process new data for a wide range of applications on a variety of form factors and for the first-time, a system power measurement methodology.

MLPerf Inference v1.0 is a cornerstone of MLCommons’ initiative to provide benchmarks and metrics that level the industry playing field through the comparison of ML systems, software, and solutions. The latest benchmark round received submissions from 17 organizations and released 1,994 peer-reviewed results for machine learning systems spanning from edge devices to data center servers. To view the results, please visit https://www.mlcommons.org/en/inference-datacenter-10/ and https://www.mlcommons.org/en/inference-edge-10/.

MLPerf Power Measurement – A new metric to understand system efficiency

The MLPerf Inference v1.0 suite introduces new power measurement techniques, tools, and metrics to complement the performance benchmarks. These new metrics enable reporting and comparing the energy consumption, performance, and power of submitting systems. In this round, power measurement was optional, and 864 power results were released. The power measurement methodology was developed in partnership with the Standard Performance Evaluation Corp. (SPEC), the leading provider of standardized benchmarks and tools for evaluating the performance of today’s computing systems. MLPerf adopted and built on the industry-standard SPEC PTDaemon power measurement interface.
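
As a back-of-the-envelope illustration of how such metrics relate (the numbers below are assumptions, and the official rules define the exact measurement windows and reporting), energy follows from average power and run time, and efficiency can be expressed as throughput per watt:

    # Back-of-the-envelope illustration; numbers are assumptions, not MLPerf rules.
    avg_power_watts = 350.0      # average system power over the measured run
    run_seconds = 600.0          # length of the measured run
    throughput_qps = 12_000.0    # performance result for the same run

    energy_joules = avg_power_watts * run_seconds             # 210,000 J
    efficiency_qps_per_watt = throughput_qps / avg_power_watts
    print(f"{energy_joules:.0f} J consumed, {efficiency_qps_per_watt:.1f} queries/s per watt")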

“As we look at the accelerating adoption of machine learning, artificial intelligence, and the anticipated scale of ML projects, the ability to measure power consumption in ML environments will be critical for sustainability goals all around the world,” said Klaus-Dieter Lange, SPECpower Committee Chair. “MLCommons developed MLPerf in the best tradition of vendor-neutral standardized benchmarks, and SPEC was very excited to be a partner in their development process. We look forward to widespread adoption of this extremely valuable benchmark.”

“We are pleased to see the ongoing engagement from the machine learning community with MLPerf,” said Sachin Idgunji, Chair of the MLPerf Power Working Group. “The addition of a power methodology will highlight energy efficiency and bring a valuable, new level of transparency to the industry.”

“We wanted to add a metric that could showcase the power and energy cost from different levels of ML performance across workloads,” said Arun Tejusve, Chair of the MLPerf Power Working Group. “MLPerf Power v1.0 is a monumental step toward this goal, and will help drive the creation of more energy-efficient algorithms and systems across the industry.”

Submitters this round include: Alibaba, Centaur Technology, Dell Technologies, EdgeCortix, Fujitsu, Gigabyte, HPE, Inspur, Intel, Lenovo, Krai, Moblint, Neuchips, NVIDIA, Qualcomm Technologies, Supermicro, and Xilinx. Additional information about the Inference v1.0 benchmarks is available at www.mlcommons.org/en/inference-datacenter-10/ and www.mlcommons.org/en/inference-edge-10/.

About MLCommons

MLCommons is an open engineering consortium with a mission to accelerate machine learning innovation, raise all boats and increase its positive impact on society. The foundation for MLCommons began with the MLPerf benchmark in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 50+ founding member partners – global technology providers, academics and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire machine learning industry through benchmarks and metrics, public datasets and best practices.

For additional information on MLCommons and details on becoming a member of the organization, please visit http://mlcommons.org/ or contact membership@mlcommons.org.

Press Contact:
press@mlcommons.org

PTDaemon® and SPEC® are trademarks of the Standard Performance Evaluation Corporation. All other product and company names herein may be trademarks of their registered owners.
