Today we are proud to introduce Maia 200, a breakthrough inference accelerator designed to dramatically improve the economics of AI token generation. Maia 200 is purpose-built for AI inference: an accelerator built on TSMC’s 3nm process with native FP8/FP4 tensor cores, a redesigned memory system with 216 GB of HBM3e at 7 TB/s and 272 MB of on-chip SRAM, plus data movement engines that keep massive models fed fast. This makes Maia 200 the most efficient first-party silicon from any hyperscaler, with three times the FP4 performance of third-generation Amazon Trainium and three times the FP8 performance of Google’s seventh-generation TPU. Maia 200 is also the most powerful inference system Microsoft has ever deployed, delivering 30% better performance per dollar than the latest generation of hardware in our fleet today.
Maia 200 is part of our heterogeneous AI infrastructure and will serve multiple models, including the latest GPT-5.2 models from OpenAI, bringing a performance-per-dollar advantage to Microsoft Foundry and Microsoft 365 Copilot. The Microsoft Superintelligence team will use Maia 200 to generate synthetic data and run reinforcement learning to improve next-generation internal models. For synthetic data generation, Maia 200’s design accelerates how quickly high-quality, domain-specific data can be generated and filtered, providing fresher, more targeted signals for subsequent training.
Maia 200 is deployed in our US Central data center region near Des Moines, Iowa, with the US West 3 data center region near Phoenix, Arizona, and additional regions to follow. Maia 200 integrates seamlessly with Azure, and we’re previewing the Maia SDK, a complete set of tools for creating and optimizing models for Maia 200. It includes PyTorch integration, the Triton compiler, an optimized core library, and access to the low-level Maia programming language. This gives developers fine-grained control when needed, while allowing models to be ported easily across heterogeneous hardware accelerators.
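To give a sense of what that porting flow could look like, here is a minimal, hypothetical sketch in PyTorch. The "maia" device and backend names below are placeholders for illustration only, not the SDK’s confirmed API surface; the point is that an existing PyTorch model would move to Maia 200 the same way it moves between other accelerator types.

```python
# Hypothetical sketch: running an existing PyTorch model on Maia 200.
# The "maia" device/backend names are assumptions for illustration only;
# consult the Maia SDK preview documentation for the actual API surface.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
).eval()

# Assumed: move the model to a Maia device, as you would with "cuda".
device = torch.device("maia")          # assumed device string
model = model.to(device)

# torch.compile routes the graph through a backend; "maia" here stands in
# for whatever backend name the SDK's Triton-based compiler registers.
compiled = torch.compile(model, backend="maia")   # assumed backend name

with torch.inference_mode():
    x = torch.randn(1, 4096, device=device)
    y = compiled(x)
```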
Designed for AI inference
Manufactured on TSMC’s industry-leading 3-nanometer process, each Maia 200 chip contains more than 140 billion transistors and is optimized for both large-scale AI workloads and performance per dollar. On both fronts, Maia 200 is built to excel. Designed for the latest low-precision model formats, each Maia 200 chip delivers more than 10 petaFLOPS of 4-bit (FP4) and more than 5 petaFLOPS of 8-bit (FP8) performance, all within a 750W SoC TDP envelope. In practical terms, Maia 200 can comfortably run today’s largest models, with headroom for even bigger models in the future.

Crucially, FLOPS are not the only ingredient in faster AI; keeping the compute fed with data is equally important. Maia 200 attacks this bottleneck with a redesigned memory subsystem built around narrow-precision data types, dedicated DMA engines, large on-die SRAM, and a network-on-chip (NoC) structure for high-bandwidth data transfer, all of which raise token throughput.
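To make that point concrete, here is a back-of-the-envelope sketch in plain Python using only the figures quoted in this post. The 400-billion-parameter FP4 model is a hypothetical example rather than a statement about any particular workload, and the estimate ignores batching, KV-cache traffic, and SRAM reuse.

```python
# Back-of-the-envelope estimates from the figures quoted in this post.
# The 400B-parameter model size is a hypothetical example; real throughput
# also depends on batching, KV cache traffic, and how much of the working
# set stays resident in the 272 MB of on-chip SRAM.

fp4_pflops = 10.0          # > 10 petaFLOPS FP4 per chip (quoted above)
fp8_pflops = 5.0           # > 5 petaFLOPS FP8 per chip (quoted above)
tdp_watts = 750.0          # SoC TDP envelope (quoted above)
hbm_bw_tbps = 7.0          # HBM3e bandwidth in TB/s (quoted above)

# Compute efficiency implied by the quoted numbers.
print(f"FP4 efficiency: {fp4_pflops * 1e3 / tdp_watts:.1f} TFLOPS/W")   # ~13.3
print(f"FP8 efficiency: {fp8_pflops * 1e3 / tdp_watts:.1f} TFLOPS/W")   # ~6.7

# Memory-bound decode ceiling: each generated token must stream the model
# weights from HBM at least once (ignoring reuse and batching).
params = 400e9             # hypothetical 400B-parameter model
bytes_per_param = 0.5      # FP4 = 4 bits
weight_bytes = params * bytes_per_param          # ~200 GB, fits in 216 GB HBM
tokens_per_s = hbm_bw_tbps * 1e12 / weight_bytes
print(f"Bandwidth-bound decode ceiling: ~{tokens_per_s:.0f} tokens/s per pass")
```

The takeaway from the sketch is simple: at these compute levels, sustained token throughput is gated by how fast weights and activations can be streamed, which is exactly what the redesigned memory subsystem is built to address.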

Optimized AI systems
At the system level, Maia 200 introduces a new, two-layer scalable network design built on standard Ethernet. A custom transport layer and a tightly integrated NIC unlock high performance, reliability, and significant cost advantages without relying on proprietary interconnect architectures.
Each accelerator provides:
- 2.8 TB/s bi-directional dedicated bandwidth
- Predictable, high-performance collective operations across clusters of up to 6,144 accelerators
This architecture provides scalable performance for dense inference clusters while reducing power consumption and overall TCO across the global Azure fleet.
In each stack, four Maia accelerators are fully interconnected by direct, unswitched links that keep local communication high-bandwidth for optimal inference efficiency. The same Maia AI transport protocol is used for both intra-rack and inter-rack networking, enabling seamless scaling across nodes, racks, and accelerator clusters with minimal network hops. This unified fabric simplifies programming, improves workload flexibility, and reduces stranded capacity while maintaining consistent performance and cost-effectiveness at cloud scale.
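Because the same transport spans stacks, racks, and clusters, collective operations look the same to application code at any scale. The sketch below illustrates the idea with a standard PyTorch distributed all-reduce; whether the Maia SDK exposes the transport through this exact API, and under which backend name, is an assumption here, so the example uses the generic gloo backend as a placeholder.

```python
# Illustrative sketch: a collective all-reduce across accelerators in a
# cluster. In practice, the Maia SDK would register whatever process-group
# backend wraps the Maia AI transport; "gloo" is a placeholder here.
import os
import torch
import torch.distributed as dist

def run_all_reduce() -> None:
    # Rank and world size come from the usual torchrun-style environment.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    dist.init_process_group(backend="gloo",  # placeholder backend for the sketch
                            rank=rank, world_size=world_size)

    # Each accelerator contributes a tensor; all-reduce sums it everywhere.
    # With tensor-parallel inference, this is the collective that stitches
    # partial results back together after each sharded matmul.
    shard = torch.full((4096,), float(rank))
    dist.all_reduce(shard, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()

if __name__ == "__main__":
    run_all_reduce()
```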

A cloud-based development approach
A core principle of Microsoft’s silicon development programs is to validate as much of the entire system as possible before final silicon availability.
A sophisticated pre-silicon environment drove the Maia 200 architecture from its earliest stages, modeling the compute and communication patterns of large language models with high fidelity. This early co-development environment allowed us to optimize silicon, network, and system software as a unified whole, long before first silicon.
We also designed Maia 200 from the ground up for fast, seamless deployment in the data center, building in early validation of the most complex system elements, including the back-end network and our second-generation closed-loop, liquid-cooled heat exchanger. Native integration with the Azure control plane provides chip- and rack-level security, telemetry, diagnostics, and management capabilities, maximizing reliability and uptime for production-critical AI workloads.
The result of these investments was that AI models were running on Maia 200 silicon within days of the delivery of the first packaged part. The time from first silicon to first deployment in a data center rack has been reduced to less than half that of comparable AI infrastructure programs. And this end-to-end approach, from chip to software to data centers, translates directly into higher utilization, shorter time-to-production and continuous improvement in performance per dollar and per watt at cloud scale.

Sign up to preview the Maia SDK
The era of large-scale artificial intelligence is just beginning, and the infrastructure will determine what is possible. Our Maia AI accelerator program is designed to be multi-generational. As we deploy Maia 200 across our global infrastructure, we’re already designing for future generations, and we expect each generation to continually set new benchmarks for what’s possible, delivering ever-increasing performance and efficiency for the most important AI workloads.
Today, we invite developers, AI startups, and academics to begin exploring early model and workload optimization with the new Maia 200 Software Development Kit (SDK). The SDK includes the Triton compiler, PyTorch support, low-level programming in NPL, a Maia simulator, and a cost calculator for optimizing efficiency earlier in the development lifecycle. Sign up for a preview here.
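For developers who want to drop below the framework level, the Triton path accepts ordinary Triton kernels. The kernel below is standard, vendor-neutral Triton (an elementwise add); how the Maia backend lowers it onto the hardware’s FP8/FP4 paths and on-die SRAM is where the Maia-specific optimization happens, and that lowering is not shown here.

```python
# A plain Triton kernel: elementwise add over 1-D tensors. This is standard
# Triton; the Maia 200 specifics live in the compiler backend, not the kernel.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements            # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out
```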
Get more photos, videos, and resources on our Maia 200 website, where you can read more details.
Scott Guthrie is responsible for hyperscale cloud computing solutions and services including Azure, Microsoft’s cloud computing platform, generative AI solutions, data platforms, and information and cybersecurity. These platforms and services help organizations around the world address pressing challenges and drive long-term transformation.