Multipath Reliable Connection (MRC): The New Standard for Massive AI Clusters

By Dillip Chowdary · May 09, 2026 · 13 min read

Training the next generation of frontier models requires thousands of GPUs working in near-perfect synchronization. However, as clusters approach the 100,000-GPU mark, traditional networking protocols like InfiniBand and standard RoCE v2 face significant bottlenecks. To address this, an industry consortium led by OpenAI, NVIDIA, Microsoft, and AMD has unveiled Multipath Reliable Connection (MRC).

The Problem: Failure Amplification

In massive clusters, a single packet loss or switch failure can trigger a "stop-the-world" event across thousands of GPUs, wasting significant compute time (and millions of dollars in electricity). Traditional Ethernet relies on ECMP (Equal-Cost Multi-Path), which assigns each flow to a path statically and is therefore prone to hash collisions, in which multiple high-bandwidth flows are funneled through the same oversubscribed link.
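To see why static ECMP collides, consider this toy model (not any vendor's actual hash function): each flow's 5-tuple is hashed once, and every packet of that flow follows the resulting link. With a handful of elephant flows and equal-cost uplinks, nothing prevents two flows from landing on the same link.

```python
import hashlib

def ecmp_link(five_tuple: tuple, num_links: int) -> int:
    """Pick an uplink by hashing the flow's 5-tuple (static ECMP).
    Every packet of the flow gets the same answer, forever."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_links

# Four high-bandwidth flows, four equal-cost uplinks.
# Tuples are (src_ip, dst_ip, protocol, src_port, dst_port).
flows = [
    ("10.0.0.1", "10.0.1.1", 6, 49152, 4791),
    ("10.0.0.2", "10.0.1.2", 6, 49153, 4791),
    ("10.0.0.3", "10.0.1.3", 6, 49154, 4791),
    ("10.0.0.4", "10.0.1.4", 6, 49155, 4791),
]
assignments = [ecmp_link(f, 4) for f in flows]
print(assignments)  # any repeated index means two flows share one link
```

Because the mapping is fixed for a flow's lifetime, a collision persists until a flow ends, no matter how idle the other links are. This is the pathology MRC targets.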

How MRC Works

MRC introduces a dynamic, packet-level load-balancing layer directly into the NIC (Network Interface Card) hardware, allowing individual packets of a single flow to take different paths through the fabric rather than being pinned to one link.
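The consortium has not published MRC's wire format, so the following is a hypothetical sketch of the general packet-spraying technique the article describes: the sender stamps each packet with a sequence number and sprays packets across all paths, and the receiver reassembles in-order delivery in hardware. The class names and round-robin path choice are illustrative assumptions, not MRC's actual design.

```python
import heapq
from itertools import cycle

class SpraySenderSketch:
    """Toy sender: each packet carries a sequence number and is
    sprayed round-robin across all available fabric paths."""
    def __init__(self, num_paths: int):
        self.paths = cycle(range(num_paths))
        self.seq = 0

    def send(self, payload):
        pkt = (self.seq, next(self.paths), payload)  # (seq, path, data)
        self.seq += 1
        return pkt

class SprayReceiverSketch:
    """Toy receiver: buffers out-of-order arrivals in a min-heap and
    releases payloads only in sequence-number order."""
    def __init__(self):
        self.pending = []
        self.next_seq = 0

    def receive(self, pkt):
        heapq.heappush(self.pending, pkt)
        delivered = []
        while self.pending and self.pending[0][0] == self.next_seq:
            delivered.append(heapq.heappop(self.pending)[2])
            self.next_seq += 1
        return delivered

# Packets arriving out of order (different path latencies) still
# come out in order on the receive side.
tx = SpraySenderSketch(num_paths=4)
pkts = [tx.send(b) for b in ["a", "b", "c", "d"]]
rx = SprayReceiverSketch()
out = []
for p in [pkts[2], pkts[0], pkts[1], pkts[3]]:  # simulated reordering
    out += rx.receive(p)
print(out)  # → ['a', 'b', 'c', 'd']
```

The key point is that reordering moves from a per-flow constraint on the network into cheap sequencing logic at the edge, which is why the protocol requires NIC hardware support.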

Hardware Support

The protocol is designed as an open standard but requires hardware acceleration. NVIDIA has confirmed support for MRC in its ConnectX-9 NICs, while AMD will support it via its Pensando Pollara data processing units (DPUs). Microsoft and Google are expected to implement MRC in their custom silicon (Maia and TPU v7/v8) to enable larger shared-memory abstractions.

Impact on AI Training

Benchmark data released by the consortium suggests that MRC can improve All-to-All communication efficiency by up to 35% in clusters larger than 32,000 GPUs. This efficiency translates directly into faster training times for Mixture of Experts (MoE) models, whose expert-dispatch traffic is particularly sensitive to network tail latency.
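A back-of-envelope calculation shows what a 35% efficiency gain means for an MoE dispatch step. The buffer size, link speed, and baseline efficiency below are illustrative assumptions, not consortium figures; only the 35% multiplier comes from the article.

```python
def all_to_all_time_s(bytes_per_gpu: float, num_gpus: int,
                      link_bw_gbps: float, efficiency: float) -> float:
    """Idealized all-to-all: each GPU exchanges (N-1)/N of its buffer
    with peers, limited by per-GPU link bandwidth times the achieved
    network efficiency."""
    payload = bytes_per_gpu * (num_gpus - 1) / num_gpus
    return payload / (link_bw_gbps * 1e9 / 8 * efficiency)

# Hypothetical MoE dispatch: 8192 tokens x 8192 hidden dim x 2 bytes
# (bf16) per GPU, 400 Gb/s links, 32,768 GPUs, 50% baseline efficiency.
buf = 8192 * 8192 * 2
base = all_to_all_time_s(buf, 32768, 400, 0.50)
mrc  = all_to_all_time_s(buf, 32768, 400, 0.50 * 1.35)  # +35% efficiency
print(f"baseline {base*1e3:.2f} ms -> with MRC {mrc*1e3:.2f} ms")
```

Since all-to-all time scales inversely with achieved efficiency, a 35% efficiency gain shaves roughly a quarter off each dispatch, and that saving recurs on every MoE layer of every training step.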

MRC represents a significant step toward the "AI Factory" vision, where the network is no longer a bottleneck but a seamless backplane for planetary-scale intelligence.