Power Archives - MLCommons
https://mlcommons.org/category/power/

New MLPerf Inference v4.1 Benchmark Results Highlight Rapid Hardware and Software Innovations in Generative AI Systems
https://mlcommons.org/2024/08/mlperf-inference-v4-1-results/
Wed, 28 Aug 2024

New mixture of experts benchmark tracks emerging architectures for AI models

Today, MLCommons® announced new results for its industry-standard MLPerf® Inference v4.1 benchmark suite, which delivers machine learning (ML) system performance benchmarking in an architecture-neutral, representative, and reproducible manner. This release includes first-time results for a new benchmark based on a mixture of experts (MoE) model architecture. It also presents new findings on power consumption related to inference execution.

MLPerf Inference v4.1

The MLPerf Inference benchmark suite, which encompasses both data center and edge systems, is designed to measure how quickly hardware systems can run AI and ML models across a variety of deployment scenarios. The open-source and peer-reviewed benchmark suite creates a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry. It also provides critical technical information for customers who are procuring and tuning AI systems.

The benchmark results for this round demonstrate broad industry participation and include the debut of six newly available or soon-to-be-shipped processors:

  • AMD MI300x accelerator (available)
  • AMD EPYC “Turin” CPU (preview)
  • Google “Trillium” TPUv6e accelerator (preview)
  • Intel “Granite Rapids” Xeon CPUs (preview)
  • NVIDIA “Blackwell” B200 accelerator (preview)
  • UntetherAI SpeedAI 240 Slim (available) and SpeedAI 240 (preview) accelerators

MLPerf Inference v4.1 includes 964 performance results from 22 submitting organizations: AMD, ASUSTek, Cisco Systems, Connect Tech Inc, CTuning Foundation, Dell Technologies, Fujitsu, Giga Computing, Google Cloud, Hewlett Packard Enterprise, Intel, Juniper Networks, KRAI, Lenovo, Neural Magic, NVIDIA, Oracle, Quanta Cloud Technology, Red Hat, Supermicro, Sustainable Metal Cloud, and Untether AI.

“There is now more choice than ever in AI system technologies, and it’s heartening to see providers embracing the need for open, transparent performance benchmarks to help stakeholders evaluate their technologies,” said Mitchelle Rasquinha, MLCommons Inference working group co-chair.

New mixture of experts benchmark

Keeping pace with today’s ever-changing AI landscape, MLPerf Inference v4.1 introduces a new benchmark to the suite: mixture of experts. MoE is an architectural design for AI models that departs from the traditional approach of employing a single, massive model; it instead uses a collection of smaller “expert” models. Inference queries are directed to a subset of the expert models to generate results. Research and industry leaders have found that this approach can yield equivalent accuracy to a single monolithic model but often at a significant performance advantage because only a fraction of the parameters are invoked with each query.
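
To make the routing idea concrete, here is a minimal sketch of a mixture-of-experts layer with top-k gating, written in PyTorch. The layer sizes, expert count, and `top_k` value are illustrative assumptions for this sketch, not the configuration of the MLPerf reference model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Sketch of a mixture-of-experts layer with top-k routing.

    Only the top_k experts chosen by the gating network run for each
    token, so just a fraction of the layer's parameters are used per
    query. Sizes below are illustrative, not the Mixtral configuration.
    """

    def __init__(self, d_model=512, d_hidden=1024, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                          # x: (tokens, d_model)
        scores = self.gate(x)                      # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # normalize top-k gate scores
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e              # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

if __name__ == "__main__":
    layer = TopKMoELayer()
    tokens = torch.randn(16, 512)
    print(layer(tokens).shape)                     # torch.Size([16, 512])
```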

The MoE benchmark is unique and one of the most complex implemented by MLCommons to date. It uses the open-source Mixtral 8x7B model as a reference implementation and performs inferences using datasets covering three independent tasks: general Q&A, solving math problems, and code generation.

“When determining to add a new benchmark, the MLPerf Inference working group observed that many key players in the AI ecosystem are strongly embracing MoE as part of their strategy,” said Miro Hodak, MLCommons Inference working group co-chair. “Building an industry-standard benchmark for measuring system performance on MoE models is essential to address this trend in AI adoption. We’re proud to be the first AI benchmark suite to include MoE tests to fill this critical information gap.”

Benchmarking Power Consumption

The MLPerf Inference v4.1 benchmark includes 31 power consumption test results across three submitted systems covering both datacenter and edge scenarios. These results demonstrate the continued importance of understanding the power requirements for AI systems running inference tasks, as power costs are a substantial portion of the overall expense of operating AI systems.

The Increasing Pace of AI Innovation

Today, we are witnessing an incredible groundswell of technological advances across the AI ecosystem, driven by a wide range of providers including AI pioneers; large, well-established technology companies; and small startups. 

MLCommons would especially like to welcome first-time MLPerf Inference submitters AMD and Sustainable Metal Cloud, as well as Untether AI, which delivered both performance and power efficiency results. 

“It’s encouraging to see the breadth of technical diversity in the systems submitted to the MLPerf Inference benchmark as vendors adopt new techniques for optimizing system performance such as vLLM and sparsity-aware inference,” said David Kanter, Head of MLPerf at MLCommons. “Farther down the technology stack, we were struck by the substantial increase in unique accelerator technologies submitted to the benchmark this time. We are excited to see that systems are now evolving at a much faster pace – at every layer – to meet the needs of AI. We are delighted to be a trusted provider of open, fair, and transparent benchmarks that help stakeholders get the data they need to make sense of the fast pace of AI innovation and drive the industry forward.”

View the Results

To view the results for MLPerf Inference v4.1, please visit the Datacenter and Edge benchmark results pages.

To learn more about the selection of the new MoE benchmark, read the blog.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI Safety.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or contact participation@mlcommons.org.

New MLPerf Training Benchmark Results Highlight Hardware and Software Innovations in AI Systems
https://mlcommons.org/2024/06/mlperf-training-v4-benchmark-results/
Wed, 12 Jun 2024

Two new benchmarks added, highlighting language model fine-tuning and classification for graph data

Today, MLCommons® announced new results for the MLPerf® Training v4.0 benchmark suite, including first-time results for two benchmarks: LoRA fine-tuning of Llama 2 70B and GNN.

MLPerf Training v4.0

The MLPerf Training benchmark suite comprises full system tests that stress machine learning (ML) models, software, and hardware for a broad range of applications. The open-source and peer-reviewed benchmark suite provides a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry.

MLPerf Training v4.0 includes over 205 performance results from 17 submitting organizations: ASUSTeK, Dell, Fujitsu, Giga Computing, Google, HPE, Intel (Habana Labs), Juniper Networks, Lenovo, NVIDIA, NVIDIA + CoreWeave, Oracle, Quanta Cloud Technology, Red Hat + Supermicro, Supermicro, Sustainable Metal Cloud (SMC), and tiny corp.

MLCommons would like to especially welcome first-time MLPerf Training submitters Juniper Networks, Oracle, SMC, and tiny corp. 

Congratulations to first-time participant SMC for submitting the first-ever set of power results for MLPerf Training. These results highlight the impact of SMC’s immersion cooling solutions for data center systems. Our industry-standard power measurement works with MLPerf Training and is the first and only method to accurately measure full system power draw and energy consumption for both cloud and on-premise systems in a trusted and consistent fashion. These metrics are critical for the entire community to understand and improve the overall efficiency for training ML models – which will ultimately reduce the energy use and improve the environmental impact of AI in the coming years.

The Training v4.0 results demonstrate broad industry participation and showcase substantial performance gains in ML systems and software. Compared to the last round of results six months ago, this round brings a 1.8X speed-up in training time for Stable Diffusion. Meanwhile, the best results in the RetinaNet and GPT-3 tests are 1.2X and 1.13X faster, respectively, thanks to performance scaling at increased system sizes.

“I’m thrilled by the performance gains we are seeing especially for generative AI,” said David Kanter, executive director of MLCommons. “Together with our first power measurement results for MLPerf Training we are increasing capabilities and reducing the environmental footprint – making AI better for everyone.”

New LLM fine-tuning benchmark

The MLPerf Training v4.0 suite introduces a new benchmark to target fine-tuning a large language model (LLM). An LLM that has been pre-trained on a general corpus of text can be fine-tuned to improve its accuracy on specific tasks, and the computational costs of doing so can differ from pre-training.

A variety of approaches to fine-tuning an LLM at lower computational costs have been introduced over the past few years. The MLCommons Training working group evaluated several of these algorithms and ultimately selected LoRA as the basis for its new benchmark. First introduced in 2021, LoRA freezes the original pre-trained parameters in a network layer and injects trainable rank-decomposition matrices. Since LoRA fine-tuning trains only a small portion of the network parameters, this approach dramatically reduces the computational and memory demands compared to pre-training or supervised fine-tuning.
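
A minimal sketch of this idea in PyTorch is shown below, wrapping a frozen linear layer with trainable low-rank factors; the rank, scaling factor, and layer dimensions here are illustrative assumptions, not the benchmark's hyperparameters.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of LoRA applied to a frozen linear layer.

    The pre-trained weight W is frozen; only the low-rank factors A and B
    are trained, so the layer computes base(x) + scaling * x A^T B^T with
    far fewer trainable parameters than W itself. Rank and scaling are
    illustrative choices.
    """

    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():           # freeze pre-trained weights
            p.requires_grad = False
        in_f, out_f = base.in_features, base.out_features
        self.lora_a = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_f, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # frozen path + low-rank trainable update
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)

base = nn.Linear(4096, 4096)
layer = LoRALinear(base)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.3%}")   # well under 1%
```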

“Fine-tuning LLMs is a notable workload because AI practitioners across many organizations make use of this technology. LoRA was the optimal choice for a state-of-the-art fine-tuning technique; it significantly reduces trainable parameters while maintaining performance comparable to fully fine-tuned models,” said Hiwot Kassa, MLPerf Training working group co-chair.

The new LoRA benchmark uses the Llama 2 70B general LLM as its base. This model is fine-tuned with the Scrolls dataset of government documents with a goal of generating more accurate document summaries. Accuracy is measured using the ROUGE algorithm for evaluating the quality of document summaries. The model uses a context length of 8,192 tokens, keeping pace with the industry’s rapid evolution toward longer context lengths.
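
As an illustration of this style of scoring, the sketch below averages ROUGE F-measures over a pair of hypothetical reference and generated summaries using the open-source rouge-score package; the example strings and choice of ROUGE variants are assumptions, and the benchmark's actual evaluation harness may differ.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# Illustrative reference/generated pairs; the benchmark's harness and exact
# ROUGE configuration may differ from this sketch.
references = [
    "The committee recommends increasing the agency's audit budget.",
    "The report finds the program met its 2020 enrollment targets.",
]
generated = [
    "The committee recommends a larger audit budget for the agency.",
    "According to the report, 2020 enrollment targets were met.",
]

metrics = ["rouge1", "rouge2", "rougeLsum"]
scorer = rouge_scorer.RougeScorer(metrics, use_stemmer=True)
totals = {m: 0.0 for m in metrics}
for ref, gen in zip(references, generated):
    scores = scorer.score(ref, gen)          # target first, prediction second
    for key, value in scores.items():
        totals[key] += value.fmeasure
for key, total in totals.items():
    print(f"{key}: {total / len(references):.3f}")
```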

The LLM fine-tuning benchmark is already achieving widespread adoption, with over 30 submissions in its initial round.

Learn more about the selection of the LoRA fine-tuning algorithm for the MLPerf Training benchmark in this blog.

New GNN benchmark for classification in graphs

MLPerf Training v4.0 also introduces a graph neural network (GNN) benchmark for measuring the performance of ML systems on problems that are represented by large graph-structured data, such as those used to implement literary databases, drug discovery applications, fraud detection systems, social networks, and recommender systems. 

“Training on large graph-structured datasets poses unique system challenges, demanding optimizations for sparse operations and inter-node communication. We hope the addition of a GNN-based benchmark in MLPerf Training broadens the challenges offered by the suite and spurs software and hardware innovations for this critical class of workload,” said Ritika Borkar, MLPerf Training working group co-chair.

The MLPerf Training GNN benchmark is used for a node classification task where the goal is to predict a label for each node in a graph. The benchmark uses an R-GAT model and is trained on the 2.2 terabyte IGBH full dataset, the largest available open-source graph dataset with 547 million nodes and 5.8 billion edges. The IGBH database is a graph showing the relationships between academic authors, papers, and institutes. Each node in the graph can be classified into one of 2,983 classes.
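
For readers unfamiliar with the task, the sketch below trains a small graph attention network for node classification on a toy graph using PyTorch Geometric; the graph, features, and use of plain GATConv layers (rather than the relational R-GAT model and IGBH dataset used by the benchmark) are illustrative assumptions.

```python
# pip install torch torch-geometric
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GATConv

# Toy graph: 6 nodes, a handful of edges, 3 classes. The real benchmark
# trains a relational GAT on the 547M-node IGBH graph; this sketch only
# illustrates the node-classification setup.
x = torch.randn(6, 16)                                   # node features
edge_index = torch.tensor([[0, 1, 1, 2, 3, 4, 5, 0],
                           [1, 0, 2, 1, 4, 3, 0, 5]])    # directed edges
y = torch.tensor([0, 1, 1, 2, 2, 0])                     # node labels
data = Data(x=x, edge_index=edge_index, y=y)

class NodeClassifier(torch.nn.Module):
    def __init__(self, in_dim=16, hidden=32, num_classes=3, heads=4):
        super().__init__()
        self.conv1 = GATConv(in_dim, hidden, heads=heads)
        self.conv2 = GATConv(hidden * heads, num_classes, heads=1)

    def forward(self, x, edge_index):
        h = F.elu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)                  # per-node logits

model = NodeClassifier()
opt = torch.optim.Adam(model.parameters(), lr=0.01)
for epoch in range(50):
    opt.zero_grad()
    loss = F.cross_entropy(model(data.x, data.edge_index), data.y)
    loss.backward()
    opt.step()
pred = model(data.x, data.edge_index).argmax(dim=1)
print("train accuracy:", (pred == data.y).float().mean().item())
```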

The MLPerf Training team recently submitted the MLPerf R-GAT implementation to the Illinois Graph Benchmark (IGB) leaderboard, which helps the industry track the state of the art for GNN models and encourages reproducibility. We are pleased to announce that this submission is currently #1 with 72% test accuracy.

Learn more about the selection of the GNN benchmark in this blog.

View the results

To view the full results for MLPerf Training v4.0 and find additional information about the benchmarks, please visit the Training benchmark page.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI Safety.

For additional information on MLCommons and details on becoming a member or affiliate, please visit MLCommons.org or contact participation@mlcommons.org.

MLPerf Inference v1.0 Results with First Power Measurements
https://mlcommons.org/2021/04/mlperf-inference-v1-0-results-with-first-power-measurements/
Wed, 21 Apr 2021

The latest benchmark includes 1,994 performance and 862 power efficiency results for leading ML inference systems

Today, MLCommons®, an open engineering consortium, released results for MLPerf™ Inference v1.0, the organization’s machine learning inference performance benchmark suite. In its third round of submissions, the results measured how quickly a trained neural network can process new data for a wide range of applications on a variety of form factors and, for the first time, included results from a system power measurement methodology.

MLPerf Inference v1.0 is a cornerstone of MLCommons’ initiative to provide benchmarks and metrics that level the industry playing field through the comparison of ML systems, software, and solutions. The latest benchmark round received submissions from 17 organizations and released 1,994 peer-reviewed results for machine learning systems spanning from edge devices to data center servers. To view the results, please visit https://www.mlcommons.org/en/inference-datacenter-10/ and https://www.mlcommons.org/en/inference-edge-10/.

MLPerf Power Measurement – A new metric to understand system efficiency

The MLPerf Inference v1.0 suite introduces new power measurement techniques, tools, and metrics to complement the performance benchmarks. These new metrics enable reporting and comparing energy consumption, performance, and power for submitted systems. In this round, power measurement was optional, and 862 power results were released. The power measurement methodology was developed in partnership with the Standard Performance Evaluation Corp. (SPEC), the leading provider of standardized benchmarks and tools for evaluating the performance of today’s computing systems. MLPerf adopted and built on the industry-standard SPEC PTDaemon power measurement interface.
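
As a rough illustration of the kind of metric these measurements enable, the sketch below integrates sampled power readings over a run to estimate average power and energy per query; the sample values and simple trapezoidal integration are assumptions for illustration, not the SPEC PTDaemon methodology itself.

```python
# Hedged sketch: derive average power and energy-per-query from power
# readings sampled over a benchmark run. Numbers are made up; the actual
# methodology relies on SPEC PTDaemon-instrumented power analyzers.
power_samples_watts = [412.0, 418.5, 425.1, 430.2, 428.7, 421.3]  # 1 Hz samples
sample_interval_s = 1.0
queries_completed = 15_000

run_time_s = sample_interval_s * (len(power_samples_watts) - 1)
# Trapezoidal integration of power over time yields energy in joules.
energy_j = sum(
    (a + b) / 2 * sample_interval_s
    for a, b in zip(power_samples_watts, power_samples_watts[1:])
)
avg_power_w = energy_j / run_time_s
print(f"average power:    {avg_power_w:.1f} W")
print(f"total energy:     {energy_j / 1000:.2f} kJ")
print(f"energy per query: {energy_j / queries_completed * 1000:.1f} mJ")
```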

“As we look at the accelerating adoption of machine learning, artificial intelligence, and the anticipated scale of ML projects, the ability to measure power consumption in ML environments will be critical for sustainability goals all around the world,” said Klaus-Dieter Lange, SPECpower Committee Chair. “MLCommons developed MLPerf in the best tradition of vendor-neutral standardized benchmarks, and SPEC was very excited to be a partner in their development process. We look forward to widespread adoption of this extremely valuable benchmark.”

“We are pleased to see the ongoing engagement from the machine learning community with MLPerf,” said Sachin Idgunji, co-chair of the MLPerf Power Working Group. “The addition of a power methodology will highlight energy efficiency and bring a valuable, new level of transparency to the industry.”

“We wanted to add a metric that could showcase the power and energy cost from different levels of ML performance across workloads,” said Arun Tejusve, co-chair of the MLPerf Power Working Group. “MLPerf Power v1.0 is a monumental step toward this goal, and will help drive the creation of more energy-efficient algorithms and systems across the industry.”

Submitters this round include: Alibaba, Centaur Technology, Dell Technologies, EdgeCortix, Fujitsu, Gigabyte, HPE, Inspur, Intel, Lenovo, Krai, Mobilint, Neuchips, NVIDIA, Qualcomm Technologies, Supermicro, and Xilinx. Additional information about the Inference v1.0 benchmarks is available at www.mlcommons.org/en/inference-datacenter-10/ and www.mlcommons.org/en/inference-edge-10/.

About MLCommons

MLCommons is an open engineering consortium with a mission to accelerate machine learning innovation, raise all boats, and increase its positive impact on society. The foundation for MLCommons began with the MLPerf benchmark in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 50+ founding member partners, including global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire machine learning industry through benchmarks and metrics, public datasets, and best practices.

For additional information on MLCommons and details on becoming a member of the organization, please visit http://mlcommons.org/ or contact membership@mlcommons.org.

Press Contact:
press@mlcommons.org

PTDaemon® and SPEC® are trademarks of the Standard Performance Evaluation Corporation. All other product and company names herein may be trademarks of their respective owners.
