Over the last 18 months, AMD has rolled out three successive generations of its Instinct GPUs, reflecting an accelerated roadmap to meet the growing demands of generative AI. The MI300X, launched in 2023, was the first widely deployed GPU in the series and established its readiness for large-scale inference. The MI325X followed in 2024 with an expanded 256GB of HBM3E memory and higher compute density, enabling support for larger models and higher concurrency. Most recently, the MI355X, introduced in 2025, incorporated the new CDNA 4 architecture, 288GB of HBM3E, and the capacity to support models with up to 520 billion parameters on a single GPU.
This rapid cadence underpins AMD's push beyond raw performance, prioritising efficiency, cost optimisation, and flexible scaling for real-world AI workloads.
Breakthroughs in MLPerf v5.1
AMD's submission to MLPerf Inference v5.1 emphasised efficiency and scalability over peak benchmark figures, with several innovations standing out.
FP4 precision emerged as a notable development, first introduced on the MI355X GPU. This advancement delivered a 2.7-fold increase in tokens per second compared with the MI325X running in FP8 on the Llama 2 70B benchmark. Crucially, it maintained accuracy while lowering costs and improving throughput, demonstrating production readiness for ultra-large model inference.
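To make the idea behind FP4 concrete, the sketch below quantises a weight tensor to a 4-bit floating-point grid (the E2M1 magnitudes 0, 0.5, 1, 1.5, 2, 3, 4 and 6) with a per-block scale. It is a minimal illustration of FP4 weight quantisation in general, not AMD's ROCm kernels; the block size and scaling rule are assumptions chosen for clarity.

```python
import numpy as np

# Representable magnitudes of the 4-bit E2M1 floating-point format
# (sign bit + 2 exponent bits + 1 mantissa bit), as used in FP4 weight formats.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_quantize_dequantize(weights: np.ndarray, block_size: int = 32) -> np.ndarray:
    """Quantise a 1-D weight tensor to FP4 with a per-block scale, then dequantise."""
    out = np.empty_like(weights)
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        scale = np.abs(block).max() / FP4_GRID[-1]   # map the largest magnitude to 6.0
        if scale == 0.0:
            scale = 1.0
        scaled = np.abs(block) / scale
        # Snap each magnitude to the nearest representable FP4 value.
        idx = np.abs(scaled[:, None] - FP4_GRID[None, :]).argmin(axis=1)
        out[start:start + block_size] = np.sign(block) * FP4_GRID[idx] * scale
    return out

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=4096).astype(np.float32)
w_fp4 = fp4_quantize_dequantize(w)
print("mean abs quantisation error:", np.abs(w - w_fp4).mean())
print("weights shrink from 8 bits (FP8) to 4 bits (FP4) per value, ~2x smaller")
```

Halving the bits per weight is what drives both the memory headroom and the throughput gains: more of the model fits in HBM, and each matrix multiply moves half as much data.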
Structured pruning was another highlight. Using the MI355X alongside ROCm software libraries, AMD optimised the Llama 3.1 405B model. Results included an 82% throughput boost from a 21% depth-pruned model and a 90% uplift from a 33% pruned and fine-tuned variant, all without loss of accuracy. This approach offered enterprises a practical means to reduce infrastructure costs while accelerating performance.
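As a rough illustration of what depth pruning does, the sketch below drops the transformer blocks whose removal perturbs the hidden state least and keeps the rest. The scoring heuristic, block count, and keep ratio are assumptions for illustration only; AMD has not published its pruning recipe at this level of detail, and a pruned model would normally be fine-tuned afterwards, as the 33% variant was.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def depth_prune(layers: nn.ModuleList, sample: torch.Tensor, keep_ratio: float = 0.79) -> nn.ModuleList:
    """Keep the transformer blocks that change the hidden state most; drop the rest.

    keep_ratio=0.79 roughly mirrors a 21% depth-pruned model. The importance score
    (1 - cosine similarity between a block's input and output) is a common heuristic,
    not AMD's published method.
    """
    scores = []
    h = sample
    for layer in layers:
        out = layer(h)
        sim = torch.nn.functional.cosine_similarity(h.flatten(), out.flatten(), dim=0)
        scores.append(1.0 - sim.item())   # higher score = more important block
        h = out
    n_keep = max(1, int(round(keep_ratio * len(layers))))
    keep = sorted(sorted(range(len(layers)), key=lambda i: scores[i], reverse=True)[:n_keep])
    return nn.ModuleList(layers[i] for i in keep)

# Toy model: 24 identical feed-forward "blocks" standing in for transformer layers.
blocks = nn.ModuleList(nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(24))
pruned = depth_prune(blocks, torch.randn(8, 64))
print(f"kept {len(pruned)} of {len(blocks)} blocks")
```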
Scaling was also a focus. Demonstrations showed predictable linear improvements from single-node inference to multi-node clusters. A four-node MI355X cluster achieved a 3.4-fold token throughput increase compared to a previous-generation four-node MI300X system, while an eight-node MI355X cluster showed smooth scalability, reinforcing its suitability for enterprise-level deployments.
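Scaling efficiency here is simply measured throughput divided by the ideal linear projection from a single node. A minimal sketch with hypothetical tokens-per-second figures (no absolute numbers from the submission are reproduced):

```python
def scaling_efficiency(single_node_tps: float, cluster_tps: float, nodes: int) -> float:
    """Fraction of ideal linear scaling actually achieved by a multi-node cluster."""
    return cluster_tps / (single_node_tps * nodes)

# Hypothetical tokens/second figures, for illustration only; MLPerf reports the
# real numbers per scenario.
single_node = 10_000.0
four_node = 38_000.0
print(f"4-node efficiency: {scaling_efficiency(single_node, four_node, 4):.0%}")  # -> 95%
```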
MI325X in competitive workloads
The MI325X GPU was also included in the latest submission across a range of generative AI workloads. Tests showed competitive performance compared with averaged NVIDIA H200 submissions. On the Llama 2 70B model, the MI325X reached 91% of the H200 average in interactive inference, while leading by 11% in throughput on Mixtral offline tests.
Performance across LLMs and image-generation workloads, including SD-XL, showed parity or near-parity with H200 results, supported by continuous ROCm software enhancements. These optimisations improved inference kernels, framework integrations, and communication libraries, ensuring efficient utilisation across diverse deployments.
For organisations, the MI325X results confirmed flexibility and cost efficiency, demonstrating balanced throughput across offline, server, and interactive inference scenarios.
Partner ecosystem and heterogeneous scaling
Partner submissions were central to the v5.1 results, with systems from companies such as Asus, Giga Computing, Dell, and Supermicro landing within 1–3% of AMD's own results. This highlighted the consistency and maturity of the Instinct ecosystem, offering enterprises confidence in reproducible, deployment-ready performance across platforms.
In a first for MLPerf, AMD also supported a heterogeneous GPU submission. MangoBoost combined MI300X and MI325X nodes into a mixed cluster, delivering roughly 94% of the throughput predicted by AMD's reference calculations, i.e. near-linear scaling. This result underscored the potential for organisations to extend existing infrastructure with newer hardware without compromising efficiency.
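The same arithmetic extends to a mixed-generation cluster, where the ideal aggregate is the sum of what each node type delivers on its own. Again with placeholder per-node figures chosen purely for illustration:

```python
# For a mixed MI300X / MI325X cluster, the "ideal" aggregate is the sum of what
# each node type achieves individually; efficiency is measured vs. that sum.
# The per-node rates below are placeholders, not published figures.
mi300x_tps, mi325x_tps = 8_000.0, 11_000.0          # hypothetical tokens/s per node
nodes = {"MI300X": 2, "MI325X": 2}
ideal = nodes["MI300X"] * mi300x_tps + nodes["MI325X"] * mi325x_tps
measured = 35_700.0                                  # hypothetical measured aggregate
print(f"heterogeneous efficiency: {measured / ideal:.0%}")  # -> 94%
```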
The role of ROCm software
Underlying these achievements was AMD's ROCm software platform. Optimised libraries, framework integration, and orchestration capabilities ensured that Instinct GPUs performed consistently across different configurations, from single-GPU setups to mixed-generation clusters. ROCm enabled reproducible scaling results and smooth integration with widely used frameworks such as PyTorch and TensorFlow.
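One practical consequence of that integration is that ROCm builds of PyTorch expose AMD GPUs through the same torch.cuda device API used on NVIDIA hardware, so existing inference code generally runs unchanged. A minimal sketch (the single linear layer is a stand-in for a real model):

```python
import torch

# On a ROCm build of PyTorch, torch.cuda.* routes to AMD GPUs via HIP,
# so the same device-selection code works on Instinct hardware.
device = "cuda" if torch.cuda.is_available() else "cpu"
print("running on:", torch.cuda.get_device_name(0) if device == "cuda" else "cpu")

dtype = torch.float16 if device == "cuda" else torch.float32
model = torch.nn.Linear(4096, 4096).to(device=device, dtype=dtype)  # stand-in layer
x = torch.randn(8, 4096, dtype=dtype, device=device)
with torch.no_grad():
    y = model(x)
print("output shape:", tuple(y.shape))
```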
Conclusion
The MLPerf Inference v5.1 submission highlighted AMD's continued progress in enabling practical, large-scale generative AI deployments. With the introduction of FP4 precision, innovations in structured pruning, and seamless scaling across multi-node and mixed-generation environments, AMD demonstrated that its Instinct GPUs and ROCm software provide an efficient, flexible foundation for enterprise AI workloads.
The results point to a maturing ecosystem that balances hardware advancements with software-driven efficiency, offering organisations the tools to deploy generative AI cost-effectively and at scale.