New MLPerf Storage v2.0 Benchmark Results Demonstrate the Critical Role of Storage Performance in AI Training Systems
https://mlcommons.org/2025/08/mlperf-storage-v2-0-results/ | Mon, 04 Aug 2025
New checkpoint benchmarks provide “must-have” information for optimizing AI training

San Francisco, CA. Today, MLCommons® announced results for its industry-standard MLPerf® Storage v2.0 benchmark suite, which is designed to measure the performance of storage systems for machine learning (ML) workloads in an architecture-neutral, representative, and reproducible manner. This round of the benchmark saw dramatically increased participation, more geographic representation from submitting organizations, and greater diversity of the systems submitted for testing. 

The benchmark results show that storage system performance continues to improve rapidly, with tested systems serving roughly twice as many accelerators as in the v1.0 benchmark round.

Additionally, the v2.0 benchmark adds new tests that replicate real-world checkpointing for AI training systems. The benchmark results provide essential information for stakeholders who need to configure the frequency of checkpoints to optimize for high performance – particularly at scale.

Version 2.0 adds checkpointing tasks, delivers essential insights 

As AI training systems have continued to scale up to billions and even trillions of parameters, and the largest clusters of processors have reached one hundred thousand accelerators or more, system failures have become a prominent technical challenge. Because data centers tend to run accelerators at near-maximum utilization for their entire lifecycle, both the accelerators themselves and the supporting hardware (power supplies, memory, cooling systems, etc.) are heavily stressed, which shortens their expected lifetime. This is a chronic issue, especially in large clusters: if the mean time to failure for an accelerator is 50,000 hours, then a 100,000-accelerator cluster running for extended periods at full utilization will likely experience a failure every half-hour. A cluster with one million accelerators would expect to see a failure every three minutes. Worse, because AI training usually involves massively parallel computation where all the accelerators are moving in lockstep on the same iteration of training, a failure of one processor can grind an entire cluster to a halt.
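As a quick sanity check on that arithmetic, the expected time between cluster-level failures is roughly the per-accelerator mean time to failure divided by the number of accelerators (a back-of-the-envelope sketch, assuming independent, uniformly distributed failures):

```python
# Back-of-the-envelope estimate of cluster-level failure frequency.
# Assumes failures are independent and uniformly distributed in time.

def cluster_mtbf_hours(per_accelerator_mttf_hours: float, num_accelerators: int) -> float:
    """Expected time between failures for the whole cluster, in hours."""
    return per_accelerator_mttf_hours / num_accelerators

for n in (100_000, 1_000_000):
    mtbf_h = cluster_mtbf_hours(50_000, n)
    print(f"{n:>9,} accelerators -> a failure roughly every {mtbf_h * 60:.0f} minutes")

#   100,000 accelerators -> a failure roughly every 30 minutes
# 1,000,000 accelerators -> a failure roughly every 3 minutes
```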

It is now broadly accepted that saving checkpoints of intermediate training results at regular intervals is essential to keep AI training systems running at high performance. The AI training community has developed mathematical models that can optimize cluster performance and utilization by trading off the overhead of regular checkpoints against the expected frequency and cost of failure recovery (rolling back the computation, restoring the most recent checkpoint, restarting the training from that point, and duplicating the lost work). Those models, however, require accurate data on the scale and performance of the storage systems that are used to implement the checkpointing system. 
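The article does not spell out which model the working group uses, but one widely cited approximation for this trade-off is Young's formula, shown below purely as an illustration (the checkpoint write time and MTBF values are hypothetical):

```python
import math

def young_checkpoint_interval(checkpoint_write_s: float, cluster_mtbf_s: float) -> float:
    """Young's approximation for the checkpoint interval that minimizes total
    overhead (time spent writing checkpoints plus expected lost work)."""
    return math.sqrt(2 * checkpoint_write_s * cluster_mtbf_s)

# Hypothetical inputs: a 30-second checkpoint write on a cluster whose mean
# time between failures is 30 minutes.
interval_s = young_checkpoint_interval(30, 30 * 60)
print(f"Suggested checkpoint interval: about {interval_s / 60:.1f} minutes")
# -> about 5.5 minutes
```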

The MLPerf Storage v2.0 checkpoint benchmark tests provide precisely that data, and the results from this round suggest that stakeholders procuring AI training systems need to carefully consider the performance of the storage systems they buy, to ensure that they can store and retrieve a cluster’s checkpoints without slowing the system down to an unacceptable level. For a deeper understanding of the issues around storage systems and checkpointing, as well as of the design of the checkpointing benchmarks, we encourage you to read this post from Wes Vaske, a member of the MLPerf Storage working group.

“At the scale of computation being implemented for training large AI models, regular component failures are simply a fact of life,” said Curtis Anderson, MLPerf Storage working group co-chair. “Checkpointing is now a standard practice in these systems to mitigate failures, and we are proud to be providing critical benchmark data on storage systems to allow stakeholders to optimize their training performance. This initial round of checkpoint benchmark results shows us that current storage systems offer a wide range of performance specifications, and not all systems are well-matched to every checkpointing scenario. It also highlights the critical role of software frameworks such as PyTorch and TensorFlow  in coordinating training, checkpointing, and failure recovery, as well as some opportunities for enhancing those frameworks to further improve overall system performance.”

Workload benchmarks show rapid innovation in support of larger-scale training systems

Continuing from the v1.0 benchmark suite, the v2.0 suite measures storage performance in a diverse set of ML training scenarios. It emulates the storage demands across several scenarios and system configurations covering a range of accelerators, models, and workloads. By simulating the accelerators’ “think time,” the benchmark can generate accurate storage patterns without the need to run the actual training, making it more accessible to all. The benchmark focuses the test on a given storage system’s ability to keep pace, as it requires the simulated accelerators to maintain a required level of utilization.
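As a rough illustration of the emulation idea (a minimal sketch with hypothetical names and thresholds; the actual benchmark is built on DLIO and is considerably more sophisticated):

```python
import time

def emulate_accelerator(read_batch, num_batches: int, think_time_s: float) -> float:
    """Replay one accelerator's I/O pattern and return its utilization:
    emulated compute (think) time divided by total wall-clock time."""
    io_time = 0.0
    for _ in range(num_batches):
        start = time.perf_counter()
        read_batch()                      # real read against the storage under test
        io_time += time.perf_counter() - start
        time.sleep(think_time_s)          # stand-in for the accelerator's compute
    compute_time = num_batches * think_time_s
    return compute_time / (compute_time + io_time)

# Example with a placeholder reader; a storage system "keeps pace" only if the
# resulting utilization stays above the benchmark's required threshold.
def dummy_read_batch():
    with open("/dev/urandom", "rb") as f:  # stand-in for reading a training sample
        f.read(1 << 20)

print(f"Accelerator utilization: {emulate_accelerator(dummy_read_batch, 20, 0.05):.1%}")
```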

The v2.0 results show that submitted storage systems have substantially increased the number of accelerators they can simultaneously support: roughly twice as many as the systems in the v1.0 benchmark round.

“Everything is scaling up: models, parameters, training datasets, clusters, and accelerators. It’s no surprise to see that storage system providers are innovating to support ever larger scale systems,” said Oana Balmau, MLPerf Storage working group co-chair.

The v2.0 submissions also included a much more diverse set of technical approaches to delivering high-performance storage for AI training, including:

  • 6 local storage solutions;
  • 2 solutions using in-storage accelerators;
  • 13 software-defined solutions;
  • 12 block systems;
  • 16 on-prem shared storage solutions;
  • 2 object stores.

“Necessity continues to be the mother of invention: faced with the need to deliver storage solutions that are both high-performance and at unprecedented scale, the technical community has stepped up once again and is innovating at a furious pace,” said Balmau.

MLPerf Storage v2.0: skyrocketing participation and diversity of submitters

The MLPerf Storage benchmark was created through a collaborative engineering process by 35 leading storage solution providers and academic research groups across 3 years. The open-source and peer-reviewed benchmark suite offers a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry. It also provides critical technical information for customers who are procuring and tuning AI training systems.

The v2.0 benchmark results, from a broad set of technology providers, reflect the industry’s recognition of the importance of high-performance storage solutions. MLPerf Storage v2.0 includes >200 performance results from 26 submitting organizations: Alluxio, Argonne National Lab, DDN, ExponTech, FarmGPU, H3C, Hammerspace, HPE, JNIST/Huawei, Juicedata, Kingston, KIOXIA, Lightbits Labs, MangoBoost, Micron, Nutanix, Oracle, Quanta Computer, Samsung, Sandisk, Simplyblock, TTA, UBIX, IBM, WDC, and YanRong. The submitters represent seven different countries, demonstrating the value of the MLPerf Storage benchmark to the global community of stakeholders.

“The MLPerf Storage benchmark has set new records for an MLPerf benchmark, both for the number of organizations participating and the total number of submissions,” said David Kanter, Head of MLPerf at MLCommons. “The AI community clearly sees the importance of our work in publishing accurate, reliable, unbiased performance data on storage systems, and it has stepped up globally to be a part of it. I would especially like to welcome first-time submitters Alluxio, ExponTech, FarmGPU, H3C, Kingston, KIOXIA, Oracle, Quanta Computer, Samsung, Sandisk, TTA, UBIX, IBM, and WDC.”

“This level of participation is a game-changer for benchmarking: it enables us to openly publish more accurate and more representative data on real-world systems,” Kanter continued. “That, in turn, gives the stakeholders on the front lines the information and tools they need to succeed at their jobs. The checkpoint benchmark results are an excellent case in point: now that we can measure checkpoint performance, we can think about optimizing it.”

We invite stakeholders to join the MLPerf Storage working group and help us continue to evolve the benchmark suite.

View the Results

To view the results for MLPerf Storage v2.0, please visit the Storage benchmark results.

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies’ accuracy, safety, speed, and efficiency.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or email participation@mlcommons.org.

Announcing the MLPerf Storage v2.0 Checkpointing Workload
https://mlcommons.org/2025/08/storage-2-checkpointing/ | Mon, 04 Aug 2025
MLPerf Storage v2.0 - Addressing Backup and Recovery Speed for Training Large Language Models on Scale-Out Clusters

The previous versions of MLCommons® MLPerf® Storage (v0.5 and v1.0) focused on ingestion for training data-intensive models like 3D U-Net, ResNet-50, and CosmoFlow. The new checkpointing workload addresses backup and recovery speed for training, with a focus on LLMs on scale-out systems.

In this blog, we’ll cover the motivation for checkpointing, how failure rates of AI systems depend on the scale of the cluster, the details of how the workload is run, and how to interpret the results of this new workload.

Failures Halt AI Training Clusters

AI training today is typically a synchronous process. In particular, as a model is being trained on a large cluster, every participating member in the cluster is using the same set of weights. Conceptually, each node will read in data and perform calculations, and then all the nodes will share their calculations to update the weights of the model. The weights must be updated across all nodes before starting the next set of calculations.
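The following toy simulation illustrates that lockstep structure (illustrative only, not benchmark code): the shared weights advance only once every node's gradient has been combined, which is why a single missing contribution stalls the whole cluster.

```python
import numpy as np

# Toy simulation of synchronous data-parallel training. Every "node" computes a
# gradient on its own data, and the shared weights advance only after ALL
# gradients are combined, so a single missing node would stall every step.
rng = np.random.default_rng(0)
weights = np.zeros(4)
NUM_NODES, LR = 8, 0.1

for step in range(3):
    local_gradients = []
    for node in range(NUM_NODES):
        data = rng.normal(size=4)                      # each node reads its own batch
        local_gradients.append(2 * (weights - data))   # toy gradient of ||w - data||^2
    # Synchronization point: the weight update needs every node's contribution.
    weights = weights - LR * np.mean(local_gradients, axis=0)
    print(f"step {step}: weights = {np.round(weights, 3)}")
```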

Synchronous training makes the entire process much more predictable and reproducible for machine learning researchers and engineers. One downside of synchronous training is that straggler nodes can limit the progress of training jobs. In particular, the failure of a single component that prevents one node's work from contributing to the model update results in a cluster-wide stall of the training job.

Server uptimes are typically measured in months if not years, so failures are often not a major consideration. However, two things set AI training apart from traditional enterprise or cloud workloads.

First, AI training systems are typically run at maximum performance for their entire lifecycle. This sustained, high-utilization usage model results in higher failure rates than in traditional systems.

Second, AI training is typically synchronous, while traditional workloads are decoupled. If a traditional web server fails, its load can easily be shifted to another web server. However, due to the synchronous nature of AI training, the reliability of a cluster compounds with the number of components. For example, if each accelerator in a 100,000-accelerator cluster has a probability R of surviving a given interval (that is, a failure rate of 1 - R), then the probability that the entire cluster survives that interval is R^100,000. For large-scale systems, even a tiny per-component failure rate is therefore highly problematic.

Checkpointing Ensures Forward Progress

The MLPerf Storage Working Group has developed a model of the failure rates of AI clusters and estimates of the target checkpoint intervals based on real-world data. Meta’s “The Llama 3 Herd of Models” paper provides information about failure rates seen at scale (https://arxiv.org/abs/2407.21783). In that paper, Meta describes the 16,000-accelerator cluster used to train the Llama 3 models. The training process took 54 days and encountered 419 failures. The majority of these failures scale with the size of the cluster, and assuming the failures are uniformly distributed over time yields a model for the Mean Time To Failure (MTTF) as the cluster size scales up:
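A sketch of how such a model can be derived from the published numbers (an approximation that assumes all 419 failures scale with cluster size and occur at a constant rate; the working group's model counts failure classes slightly differently, which is why it predicts 7.67 rather than the fitted 7.76 failures per day at 16,000 accelerators):

```python
# Fit a per-accelerator failure rate from the reported Llama 3 training data:
# 419 failures over 54 days on a 16,000-accelerator cluster.
OBSERVED_FAILURES = 419
OBSERVED_DAYS = 54
OBSERVED_ACCELERATORS = 16_000

failures_per_accel_day = OBSERVED_FAILURES / (OBSERVED_DAYS * OBSERVED_ACCELERATORS)

def failures_per_day(num_accelerators: int) -> float:
    return failures_per_accel_day * num_accelerators

def mttf_minutes(num_accelerators: int) -> float:
    return 24 * 60 / failures_per_day(num_accelerators)

for n in (16_000, 100_000, 1_000_000):
    print(f"{n:,} accelerators: {failures_per_day(n):.1f} failures/day, "
          f"MTTF ~ {mttf_minutes(n):.1f} minutes")
# 16,000 -> ~7.8 failures/day (MTTF ~3.1 hours); 100,000 -> ~48.5/day (~30 min);
# 1,000,000 -> ~485/day (~3 min)
```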

Because the failure probability of the cluster compounds across every component, the MTTF drops sharply as cluster size increases, falling roughly in inverse proportion to the number of accelerators. The model correlates fairly closely with Meta’s data for 16,000 accelerators, predicting 7.67 failures per day versus the observed 7.76 failures per day.

Generally, checkpointing strategies are designed to carefully balance the overhead of frequent checkpoints against the protection they provide. A common rule of thumb for large-scale systems is that losing 5% or less of the progress made since the last checkpoint is acceptable. Put another way, there should be roughly 20 checkpoints between expected failures.

The impact of cluster size on failure rates translates directly into a similar impact on checkpointing frequency. For 16,000 accelerators (similar in size to some clusters deployed today), 155 checkpoints per day are necessary, or roughly one every 9.3 minutes. At the scale of 100,000 accelerators, this equates to 967 checkpoints per day, or a checkpoint every 1.5 minutes.

Finally, the last metric we want to pull from this model is the time available to perform a checkpoint. So far we have ensured that checkpoints are taken frequently enough to limit the work lost when a failure occurs; now we also want to ensure we are not spending too much time with the cluster stopped while taking those checkpoints.

As with most forms of redundancy and data protection, checkpointing is not free: it imposes some overhead on the overall AI training effort. As the frequency of checkpoints increases with cluster size, this overhead becomes more pronounced, and keeping it below 5% becomes increasingly challenging as clusters grow. For a 16,000-accelerator cluster that checkpoints every 9.3 minutes, the checkpointing process can take at most about 28 seconds. For a larger cluster comprising 100,000 accelerators, this budget falls to a mere 4.4 seconds. These examples illustrate how increasing cluster size for AI training makes the storage system an increasingly severe potential bottleneck.
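Combining the 20-checkpoints-per-expected-failure rule of thumb with the 5% overhead budget reproduces the figures above (a sketch; the failure rates come from the model described earlier):

```python
CHECKPOINTS_PER_EXPECTED_FAILURE = 20   # "lose no more than ~5% of progress"
MAX_OVERHEAD = 0.05                     # checkpointing may consume <= 5% of wall-clock time

def checkpoint_budget(failures_per_day: float):
    checkpoints_per_day = failures_per_day * CHECKPOINTS_PER_EXPECTED_FAILURE
    interval_s = 24 * 3600 / checkpoints_per_day
    return checkpoints_per_day, interval_s / 60, interval_s * MAX_OVERHEAD

for label, f_per_day in (("16,000 accelerators", 7.76), ("100,000 accelerators", 48.35)):
    ckpts, interval_min, budget_s = checkpoint_budget(f_per_day)
    print(f"{label}: {ckpts:.0f} checkpoints/day, one every {interval_min:.1f} min, "
          f"each must finish in ~{budget_s:.1f} s")
# 16,000 accelerators: ~155 checkpoints/day, one every ~9.3 min, budget ~28 s
# 100,000 accelerators: ~967 checkpoints/day, one every ~1.5 min, budget ~4.5 s
```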

Checkpointing Workload in v2.0

The new checkpointing workload is implemented to be as representative of real systems as possible. The checkpoint workload defined here creates an in-memory representation of the model and optimizer state, then uses PyTorch save and load to perform the write and read operations.
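A minimal sketch of that save/load pattern is shown below (the tensor shapes, dtypes, and file path are placeholders; the benchmark's actual model/optimizer layout, sharding, and file naming are defined internally):

```python
import torch

# Build an in-memory stand-in for one process's shard of model and optimizer
# state. Shapes and names here are placeholders, not the benchmark's layout.
state = {
    "model": {f"layer_{i}.weight": torch.randn(4096, 4096).to(torch.bfloat16)
              for i in range(4)},
    "optimizer": {f"layer_{i}.exp_avg": torch.randn(4096, 4096) for i in range(4)},
}

# Checkpoint save: serialize the state dict to the storage under test.
torch.save(state, "checkpoint_rank0.pt")

# Checkpoint load: read it back, e.g. when resuming after a failure.
restored = torch.load("checkpoint_rank0.pt")
```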

For the workload, we define four different model sizes:

  • 8 Billion parameters
  • 70 Billion parameters
  • 405 Billion parameters
  • 1 Trillion parameters

The model sizes are intended to emulate the Llama 3 Herd of Models, with a 1T-parameter model added (using extrapolated values) to represent larger models currently in development.

For CLOSED submissions, each of the model sizes has a required number of processes for saving or loading a checkpoint. These process numbers represent a reasonable minimum size of a system that can train a model of a given size.

OPEN submissions allow changing the number of processes to a multiple of the model’s defined Data Parallelism (see the table below). This represents sharding the model over more hosts, which results in more processes initiating the checkpoint save or load at the same time, each handling less data (the total amount of data stays the same, because the model and optimizer state are simply replicated across data-parallel instances).

Each process executing the checkpoint will write one or more large files (hundreds of MBs to GBs in size), depending on the model size and on benchmark-internal parameters representing the parallelism configuration and the model and optimizer sharding method.

Additionally, there is an operating mode called Subset mode that is used only for client-local storage solutions, such as local NVMe devices. When run in Subset mode, only eight processes are used, regardless of the required number of processes.

Subset mode represents using an NVMe drive in each of the accelerator nodes as the first tier of checkpointing storage. Subset mode will perform a portion of the checkpoint depending on the model size.

Finally, to ensure the test is measuring the performance of the underlying storage, a successful checkpoint workload run consists of 10 Save operations followed by 10 Load operations. If the client memory is above a threshold (3x the size of a node’s checkpoint), then the submitter must clear the filesystem caches between the save and load phases.
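A sketch of that run structure (assuming a Linux client and sufficient privileges to drop the page cache; the benchmark harness handles these details itself):

```python
import os

def drop_page_cache():
    """Flush dirty pages and drop the Linux page cache so that subsequent loads
    hit the storage system rather than client RAM. Requires root privileges."""
    os.sync()
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3")

def run_checkpoint_workload(save_fn, load_fn, node_memory_bytes, checkpoint_bytes):
    """Sketch of the measured run: 10 saves, an optional cache flush, 10 loads."""
    for _ in range(10):
        save_fn()
    # If client RAM could hold the checkpoint comfortably (>= 3x its size),
    # the page cache would otherwise serve the loads, so it must be cleared.
    if node_memory_bytes >= 3 * checkpoint_bytes:
        drop_page_cache()
    for _ in range(10):
        load_fn()
```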

Table 1 – Checkpointing Parameters by Model Size

Model                    8B        70B       405B      1T
Data Parallelism         1         1         2         2
Total Processes          8         64        512       1024
Checkpoint size          105 GB    912 GB    5.29 TB   15 TB
Subset: 8-Process Size   105 GB    114 GB    94 GB     144 GB

One important takeaway is the sheer size of the checkpoints. The actual model only accounts for a portion of the data in a given checkpoint. The bulk of the data is actually the optimizer state of the cluster at that point in time.  

For example, the checkpoint for the 8 Billion parameter model totals 105 GB, and the optimizer state makes up 90 GB of it. For the 1 Trillion parameter model, the optimizer makes up a whopping 13.2 TB of the 15 TB of checkpoint data.
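Those proportions are roughly what a simple bytes-per-parameter estimate would suggest, assuming bf16 weights and fp32 Adam-style optimizer state (a hedged approximation; the benchmark's exact layout may differ):

```python
# Rough bytes-per-parameter estimate (illustrative assumptions, not the
# benchmark's exact layout): bf16 weights plus fp32 optimizer state made up of
# a master copy of the weights and two Adam moments.
BYTES_WEIGHTS = 2            # bf16
BYTES_OPTIMIZER = 4 + 4 + 4  # fp32 master weights + first and second moments

def estimate_checkpoint_gb(params_billions: float):
    return params_billions * BYTES_WEIGHTS, params_billions * BYTES_OPTIMIZER

weights_gb, optimizer_gb = estimate_checkpoint_gb(8)
print(f"8B model: ~{weights_gb:.0f} GB weights + ~{optimizer_gb:.0f} GB optimizer state")
# ~16 GB + ~96 GB: the same ballpark as the 105 GB checkpoint (90 GB of it
# optimizer state) reported above.
```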

This helps to put a real sense of scale around the challenge of checkpointing. For a very large model, such as a 1-trillion-parameter model training on a cluster of 100,000 accelerators, writing out the 15 TB checkpoint every 1.5 minutes results in over 14 petabytes of writes per day for checkpoints alone.

For synchronous checkpointing, this implies writing 15 TB of data in under 5 seconds: a burst of roughly 3.6 TB/s across the cluster, which averages out to roughly 200 GB/s sustained over the entire duration of the training job.
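The arithmetic behind those figures, as a quick sketch (using the 4.4-second window derived earlier; the rounder numbers quoted above reflect slightly different rounding):

```python
CHECKPOINT_TB = 15
CHECKPOINTS_PER_DAY = 967
CHECKPOINT_WINDOW_S = 4.4            # ~5% of a 1.5-minute checkpoint interval

burst_tb_per_s = CHECKPOINT_TB / CHECKPOINT_WINDOW_S
daily_writes_pb = CHECKPOINT_TB * CHECKPOINTS_PER_DAY / 1000
sustained_gb_per_s = CHECKPOINT_TB * 1000 * CHECKPOINTS_PER_DAY / 86_400

print(f"Burst write rate:   ~{burst_tb_per_s:.1f} TB/s during each checkpoint")
print(f"Daily write volume: ~{daily_writes_pb:.1f} PB of checkpoint data per day")
print(f"Sustained average:  ~{sustained_gb_per_s:.0f} GB/s over the whole training job")
# ~3.4 TB/s burst, ~14.5 PB/day, ~168 GB/s sustained
```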

Interpreting the Results

The results of a checkpointing workload will include one or more of the supported model sizes with the associated Save and Load performance scores.

The default metric is throughput (GiB/s) for the operation, with the option to expand the table to see the result as a duration number as well (in seconds).

On the surface, the large reads and writes may seem like a trivial workload, but the large number of processes performing the operations in parallel results in interleaving of data at the physical layer in a way that can inhibit performance in storage systems.

Furthermore, a checkpoint Save or Load is only complete when every process is done. This means that the numbers represented in the results table are based on the time taken by the worst-performing process rather than an average of steady-state performance.
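A sketch of how such a slowest-process-gated metric could be computed (hypothetical numbers; the benchmark's exact accounting is defined in its rules):

```python
def checkpoint_throughput_gib_s(bytes_per_process, durations_s):
    """Effective checkpoint throughput: total data moved divided by the time of
    the slowest process, since the operation only completes when every process
    is done."""
    return (sum(bytes_per_process) / 2**30) / max(durations_s)

# Hypothetical example: 8 processes, one straggler at 9 s gates the result.
sizes = [16 * 2**30] * 8                 # 16 GiB per process
times = [5.1, 5.3, 5.0, 5.2, 5.4, 5.1, 5.2, 9.0]
print(f"{checkpoint_throughput_gib_s(sizes, times):.1f} GiB/s")   # ~14.2 GiB/s
```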

End users and customers will have differing views on the relative importance of Save vs Load. Save operations happen more frequently, which means more chances for a bad save to impact operations, while every Load operation is likely being performed during the most critical time: while the cluster is stopped.

Additionally, comparing the subset and full checkpoint modes with each other is difficult. As executed, the subset mode does not provide full fault tolerance and requires either a secondary storage solution or additional software enhancements by the end user.

Conclusion

In this blog post, we have provided an overview of the motivation for checkpointing as a new workload within MLPerf Storage, the relationship between AI training cluster size and checkpointing frequency and volume, an overview of checkpointing in MLPerf Storage v2.0, and a rough guide on interpreting the results. This is the first industry standard benchmark for checkpointing in AI training and will ultimately help the community understand performance requirements and enable greater optimization and capabilities for the entire ecosystem.

You can read more about the MLPerf Storage v2.0 results on our blog.

New MLPerf Storage v1.0 Benchmark Results Show Storage Systems Play a Critical Role in AI Model Training Performance
https://mlcommons.org/2024/09/mlperf-storage-v1-0-benchmark-results/ | Wed, 25 Sep 2024
Storage system providers invent new, creative solutions to keep pace with faster accelerators, but the challenge continues to escalate

Storage system providers showcase innovative solutions to keep pace with faster accelerators.

Today, MLCommons® announced results for its industry-standard MLPerf® Storage v1.0 benchmark suite, which is designed to measure the performance of storage systems for machine learning (ML) workloads in an architecture-neutral, representative, and reproducible manner. The results show that as accelerator technology has advanced and datasets continue to increase in size, ML system providers must ensure that their storage solutions keep up with the compute needs. This is a time of rapid change in ML systems, where progress in one technology area drives new demands in other areas. High-performance AI training now requires storage systems that are both large-scale and high-speed, lest access to stored data becomes the bottleneck in the entire system. With the v1.0 release of MLPerf Storage benchmark results, it is clear that storage system providers are innovating to meet that challenge.

Version 1.0 storage benchmark breaks new ground

The MLPerf Storage benchmark is the first and only open, transparent benchmark to measure storage performance in a diverse set of ML training scenarios. It emulates the storage demands across several scenarios and system configurations covering a range of accelerators, models, and workloads. By simulating the accelerators’ “think time” the benchmark can generate accurate storage patterns without the need to run the actual training, making it more accessible to all. The benchmark focuses the test on a given storage system’s ability to keep pace, as it requires the simulated accelerators to maintain a required level of utilization.

Three models are included in the benchmark to ensure diverse patterns of AI training are tested: 3D-UNet, ResNet-50, and CosmoFlow. These workloads offer a variety of sample sizes, ranging from hundreds of kilobytes to hundreds of megabytes, as well as wide-ranging simulated "think times," from a few milliseconds to a few hundred milliseconds.

The benchmark emulates NVIDIA A100 and H100 models as representatives of the currently available accelerator technologies. The H100 accelerator reduces the per-batch computation time for the 3D-UNet workload by 76% compared to the earlier V100 accelerator in the v0.5 round, turning what was typically a bandwidth-sensitive workload into much more of a latency-sensitive workload. 

In addition, MLPerf Storage v1.0 includes support for distributed training. Distributed training is an important scenario for the benchmark because it represents a common real-world practice for faster training of models with large datasets, and it presents specific challenges for a storage system not only in delivering higher throughput but also in serving multiple training nodes simultaneously.

V1.0 benchmark results show performance improvement in storage technology for ML systems

The broad scope of workloads submitted to the benchmark reflects the wide range and diversity of different storage systems and architectures. This is a testament to how important ML workloads are to all types of storage solutions, and it demonstrates the active innovation happening in this space.

“The MLPerf Storage v1.0 results demonstrate a renewal in storage technology design,” said Oana Balmau, MLPerf Storage working group co-chair. “At the moment, there doesn’t appear to be a consensus ‘best of breed’ technical architecture for storage in ML systems: the submissions we received for the v1.0 benchmark took a wide range of unique and creative approaches to providing high-speed, high-scale storage.”

The results in the distributed training scenario show the delicate balance needed between the number of hosts, the number of simulated accelerators per host, and the storage system in order to serve all accelerators at the required utilization. Adding more nodes and accelerators to serve ever-larger training datasets increases the throughput demands. Distributed training adds another twist, because historically different technologies – with different throughputs and latencies – have been used for moving data within a node and between nodes. The maximum number of accelerators a single node can support may not be limited by the node’s own hardware but instead by the ability to move enough data quickly to that node in a distributed environment (up to 2.7 GiB/s per emulated accelerator). Storage system architects now have few design tradeoffs available to them: the systems must be high-throughput and low-latency, to keep a large-scale AI training system running at peak load.
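To make the scale concrete, the aggregate bandwidth the storage system must sustain grows linearly with the number of emulated accelerators (a sketch using the per-accelerator figure above; the host and accelerator counts are hypothetical):

```python
GIB_PER_EMULATED_ACCELERATOR = 2.7   # upper end of the per-accelerator demand cited above

def required_aggregate_gib_s(num_hosts: int, accelerators_per_host: int) -> float:
    return num_hosts * accelerators_per_host * GIB_PER_EMULATED_ACCELERATOR

# e.g. a hypothetical 10-host configuration emulating 16 accelerators per host
print(f"{required_aggregate_gib_s(10, 16):.0f} GiB/s sustained from storage")
# -> 432 GiB/s
```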

“As we anticipated, the new, faster accelerator hardware significantly raised the bar for storage, making it clear that storage access performance has become a gating factor for overall training speed,” said Curtis Anderson, MLPerf Storage working group co-chair. “To prevent expensive accelerators from sitting idle, system architects are moving to the fastest storage they can procure – and storage providers are innovating in response.” 

MLPerf Storage v1.0

The MLPerf Storage benchmark was created through a collaborative engineering process across more than a dozen leading storage solution providers and academic research groups. The open-source and peer-reviewed benchmark suite offers a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry. It also provides critical technical information for customers who are procuring and tuning AI training systems.

The v1.0 benchmark results, from a broad set of technology providers, demonstrate the industry’s recognition of the importance of high-performance storage solutions. MLPerf Storage v1.0 includes over 100 performance results from 13 submitting organizations: DDN, Hammerspace, Hewlett Packard Enterprise, Huawei, IEIT SYSTEMS, Juicedata, Lightbits Labs, MangoBoost, Micron, Nutanix, Simplyblock, Volumez, WEKA, and YanRong Tech.

“We’re excited to see so many storage providers, both large and small, participate in the first-of-its-kind v1.0 Storage benchmark,” said David Kanter, Head of MLPerf at MLCommons. “It shows both that the industry is recognizing the need to keep innovating in storage technologies to keep pace with the rest of the AI technology stack, and also that the ability to measure the performance of those technologies is critical to the successful deployment of ML training systems. As a trusted provider of open, fair, and transparent benchmarks, MLCommons ensures that technology providers know the performance target they need to meet, and consumers can procure and tune ML systems to maximize their utilization – and ultimately their return on investment.”

We invite stakeholders to join the MLPerf Storage working group and help us continue to evolve the benchmark. Future work includes improving and increasing accelerator emulations and AI training scenarios.

View the Results

To view the results for MLPerf Storage v1.0, please visit the Storage benchmark results.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI Safety.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or contact participation@mlcommons.org.

MLPerf Results Highlight Growing Importance of Generative AI and Storage
https://mlcommons.org/2023/09/mlperf-results-highlight-growing-importance-of-generative-ai-and-storage/ | Mon, 11 Sep 2023
Latest benchmarks include LLM in inference and the first results for storage benchmark

Today we announced new results from two MLPerf™ benchmark suites: MLPerf Inference v3.1, which delivers industry standard Machine Learning (ML) system performance benchmarking in an architecture-neutral, representative, and reproducible manner, and MLPerf Storage v0.5. This publication marks the first ever release of results from the MLPerf Storage benchmark, which measures the performance of storage systems in the context of ML training workloads.

MLPerf Inference v3.1 Introduces New LLM and Recommendation Benchmarks

MLPerf Inference v3.1 includes record participation, with over 13,500 performance results (including over 2,000 power results) from 26 different submitters, and performance gains of up to 40%. Submitters include: ASUSTeK, Azure, cTuning, Connect Tech, Dell, Fujitsu, Giga Computing, Google, H3C, HPE, IEI, Intel, Intel-Habana-Labs, Krai, Lenovo, Moffett, Neural Magic, NVIDIA, Nutanix, Oracle, Qualcomm, Quanta Cloud Technology, SiMA, Supermicro, TTA, and xFusion. In particular, MLCommons® would like to congratulate first-time MLPerf Inference submitters Connect Tech, Nutanix, Oracle, and TTA.

“Submitting to MLPerf is not trivial. It’s a significant accomplishment, as this is not a simple point-and-click benchmark. It requires real engineering work and is a testament to our submitters’ commitment to AI, to their customers, and to ML,” said David Kanter, Executive Director of MLCommons.

The MLCommons MLPerf Inference benchmark suite measures how fast systems can run models in a variety of deployment scenarios. ML inference is behind everything from the latest generative AI chatbots to safety features in vehicles, such as automatic lane-keeping and speech-to-text interfaces. Improving performance and power efficiency is key to deploying more capable AI systems that benefit society.

MLPerf Inference v3.1 introduces two new benchmarks to the suite. The first is a large language model (LLM) using the GPT-J reference model to summarize CNN news articles, which garnered results from 15 different submitters, reflecting the rapid adoption of generative AI. The second is an updated recommender, modified to be more representative of industry practices, using the DLRM-DCNv2 reference model and a much larger dataset, with 9 submissions. These new tests help advance AI by ensuring that industry-standard benchmarks represent the latest trends in AI adoption to help guide customers, vendors, and researchers.

“The submissions for MLPerf inference v3.1 are indicative of a wide range of accelerators being developed to serve ML workloads. The current benchmark suite has broad coverage among ML domains and the most recent addition of GPT-J is a welcome addition to the generative AI space. The results should be very helpful to users when selecting the best accelerators for their respective domains,” said Mitchelle Rasquinha, MLPerf Inference Working Group co-chair.

MLPerf Inference benchmarks primarily focus on datacenter and edge systems. The v3.1 submissions showcase several different processors and accelerators across use cases in computer vision, recommender systems, and language processing. There are both open and closed submissions related to performance, power, and networking categories. Closed submissions use the same reference model to ensure a level playing field across systems, while participants in the open division are permitted to submit a variety of models.

First Results for New MLPerf Storage Benchmark

The MLPerf Storage Benchmark Suite is the first open-source AI/ML benchmark suite that measures the performance of storage for ML training workloads. The benchmark was created through a collaboration spanning more than a dozen leading industry and academic organizations and includes a variety of storage setups, including parallel file systems, local storage, and software-defined storage. The MLPerf Storage Benchmark will be an effective tool for purchasing, configuring, and optimizing storage for machine learning applications, as well as for designing next-generation systems and technologies.

Training neural networks is both a compute and data-intensive workload that demands high-performance storage to sustain good overall system performance and availability. For many customers developing the next generation of ML models, it is a challenge to find the right balance between storage and compute resources while making sure that both are efficiently utilized. MLPerf Storage helps overcome this problem by accurately modeling the I/O patterns posed by ML workloads, providing the flexibility to mix and match different storage systems with different accelerator types.

“Our first benchmark has over 28 performance results from five companies which is a great start given this is the first submission round,” explains Oana Balmau, Storage Working Group co-chair. “We’d like to congratulate MLPerf Storage submitters: Argonne National Laboratory (ANL), DDN, Micron, Nutanix, WEKA for their outstanding results and accomplishments.”

The MLPerf Storage benchmark suite is built on the codebase of DLIO, a benchmark designed for I/O measurement in high performance computing, adapted to meet current storage needs.

View the Results

To view the results for MLPerf Inference v3.1 and MLPerf Storage v0.5, and to find additional information about the benchmarks, please visit the MLPerf Inference and MLPerf Storage benchmark results pages.

About MLCommons

MLCommons is an open engineering consortium with a mission to make machine learning better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmark in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 50+ members – global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire machine learning industry through benchmarks and metrics, public datasets, and best practices.

For additional information on MLCommons and details on becoming a Member or Affiliate, please visit MLCommons.org or contact participation@mlcommons.org.

Introducing the MLPerf Storage Benchmark Suite
https://mlcommons.org/2023/06/introducing-the-mlperf-storage-benchmark-suite/ | Tue, 20 Jun 2023
The first benchmark suite that measures the performance of storage for machine learning workloads

The MLCommons® Storage Working Group is excited to announce the availability of the MLPerf™ Storage Benchmark Suite. This is the first artificial intelligence/machine learning benchmark suite that measures the performance of storage for machine learning (ML) workloads. It was created through a collaboration across more than a dozen leading organizations. The MLPerf Storage Benchmark will be an effective tool for purchasing, configuring, and optimizing storage for ML applications, as well as designing next-generation systems and technologies.

Striking a Balance Between Storage and Compute

Training neural networks is both a compute and data-intensive workload that demands high-performance storage to sustain good overall system performance and availability. For many customers developing the next generation of ML models, it is a challenge to find the right balance between storage and compute resources while making sure that both are efficiently utilized. MLPerf Storage helps overcome this problem by accurately modeling the I/O patterns posed by ML workloads for various accelerator types, providing the flexibility to mix and match different storage systems with different accelerator types.

How it Works

MLPerf Storage measures the sustained performance of a storage system for MLPerf Training and HPC workloads on both PyTorch and TensorFlow, without requiring the use of expensive accelerators. It instead relies on a novel and elegant emulation mechanism that captures the full realistic behavior of neural network training. The first round of MLPerf Storage includes the BERT model, which pioneered transformers for language modeling, and 3D U-Net, which performs segmentation of 3D medical images. Subsequent rounds will expand the number and variety of workloads, the variety of emulated accelerators, and other features.

MLPerf Storage overview: Data loading and on-line pre-processing are performed using the ML frameworks, while model training is emulated.

The MLPerf Storage Benchmark Suite marks a major milestone in MLCommons’ line-up of MLPerf benchmark suites. With the addition of MLPerf Storage, we hope to stimulate innovation in the academic and research communities to push the state-of-the-art in storage for ML, as well as to provide a flexible tool for cluster designers to understand the interplay between storage and compute resources at scale. We are excited to see how the community will accelerate storage performance for some of the world’s most valuable and challenging emerging applications.

The MLPerf Storage benchmarks were created thanks to the contributions and leadership of our working group members over the last 18 months.

MLPerf Storage Benchmark timeline

  • MLPerf Storage opens for submissions: June 19, 2023
  • All submissions due: August 4, 2023
  • Benchmark results published: September 13, 2023

How to Get Involved

  • Submit a benchmark result.
  • Spread the word to companies who would be interested in submitting.
  • Help us continue the development of the MLPerf Storage Benchmark by joining the working group.
