Training the next generation of frontier models requires thousands of GPUs working in tight synchronization. However, as clusters approach the 100,000-GPU mark, traditional networking protocols like InfiniBand and standard RoCE v2 face significant bottlenecks. To address this, an industry consortium led by OpenAI, NVIDIA, Microsoft, and AMD has unveiled Multipath Reliable Connection (MRC).
In massive clusters, a single packet loss or switch failure can trigger a "stop-the-world" event across thousands of GPUs, wasting enormous amounts of compute time (and millions of dollars in electricity). Traditional Ethernet relies on ECMP (Equal-Cost Multi-Path), which pins each flow to a single statically hashed path and can therefore suffer hash collisions, funneling multiple high-bandwidth flows through the same oversubscribed link.
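To see why static ECMP collides, consider a minimal sketch (illustrative only, not any vendor's implementation): the switch hashes each flow's 5-tuple once, so every packet of that flow takes the same link no matter how loaded it is. With more heavy flows than links, collisions are guaranteed by the pigeonhole principle.

```python
# Illustrative sketch of static ECMP path selection. The hash function
# and field choices here are hypothetical; real switches use vendor-
# specific hashes over the packet 5-tuple.
import hashlib

NUM_LINKS = 4

def ecmp_link(src, dst, src_port, dst_port, proto="udp"):
    """Static ECMP: hash the flow 5-tuple once; every packet of the
    flow follows the same link, regardless of current link load."""
    key = f"{src}|{dst}|{src_port}|{dst_port}|{proto}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_LINKS

# Eight concurrent high-bandwidth flows between two GPU nodes
# (4791 is the RoCE v2 UDP destination port).
flows = [("10.0.0.1", "10.0.1.1", 40000 + i, 4791) for i in range(8)]
links = [ecmp_link(*f) for f in flows]

# With 8 flows and only 4 links, at least two flows must share a link;
# the shared link becomes the bottleneck for both.
print(links)
```

In practice the problem is worse than this toy suggests: AI training traffic consists of a few synchronized elephant flows, so even one collision drags the whole collective down to the speed of the congested link.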
MRC introduces a dynamic, packet-level load-balancing layer directly into the NIC (Network Interface Card) hardware: rather than pinning an entire flow to one path, individual packets are steered across all available links and restored to order on the receiving side.
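The core idea behind packet-level load balancing can be sketched in a few lines. This is a conceptual model, not MRC's actual wire format: the `PacketSprayer` and `Reassembler` classes, the least-loaded path heuristic, and the sequence-number scheme are all assumptions made for illustration.

```python
# Conceptual sketch of packet spraying with receiver-side reordering.
# All names and mechanisms here are hypothetical illustrations of the
# general technique, not the MRC specification.
import heapq
import random

class PacketSprayer:
    """Sender side: pick the least-loaded path for each packet."""
    def __init__(self, num_paths):
        # Min-heap of (queued_bytes, path_id) tuples.
        self.paths = [(0, p) for p in range(num_paths)]
        heapq.heapify(self.paths)
        self.seq = 0

    def send(self, payload):
        load, path = heapq.heappop(self.paths)
        heapq.heappush(self.paths, (load + 1500, path))  # assume 1500B MTU
        pkt = (self.seq, path, payload)
        self.seq += 1
        return pkt

class Reassembler:
    """Receiver side: restore original order via sequence numbers."""
    def __init__(self):
        self.buffer = {}
        self.next_seq = 0

    def receive(self, pkt):
        seq, _path, payload = pkt
        self.buffer[seq] = payload
        delivered = []
        while self.next_seq in self.buffer:
            delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1
        return delivered

sprayer = PacketSprayer(num_paths=4)
pkts = [sprayer.send(1000 + i) for i in range(8)]
paths_used = {path for _seq, path, _payload in pkts}

rx = Reassembler()
random.shuffle(pkts)  # simulate out-of-order arrival across paths
out = []
for p in pkts:
    out.extend(rx.receive(p))
```

Note the design trade-off this illustrates: spraying a single flow across every link eliminates the ECMP collision problem, but the receiver must tolerate out-of-order arrival, which is why the reordering logic has to live in NIC hardware rather than in software.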
The protocol is designed as an open standard but requires hardware acceleration. NVIDIA has confirmed support for MRC in its ConnectX-9 NICs, while AMD will support it via its Pensando Pollere data processing units. Microsoft and Google are expected to implement MRC in their custom silicon (Maia and TPU v7/v8, respectively) to enable larger shared-memory abstractions.
Benchmark data released by the consortium suggests that MRC can improve all-to-all communication efficiency by up to 35% in clusters larger than 32,000 GPUs. This efficiency translates directly into faster training times for Mixture of Experts (MoE) models, which are particularly sensitive to network tail latency.
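A back-of-envelope calculation shows how a collective-level gain translates to end-to-end step time. The 40% communication fraction below is an assumed figure for illustration, not consortium data; the 35% efficiency gain is taken from the benchmark claim above.

```python
# Amdahl-style estimate: a 35% improvement in all-to-all efficiency
# only shortens the communication portion of each training step.
# The communication fraction used below is a hypothetical example.

def step_speedup(comm_fraction, efficiency_gain=0.35):
    """End-to-end step speedup if only the communication part
    (comm_fraction of step time) gets faster by efficiency_gain."""
    new_comm = comm_fraction / (1.0 + efficiency_gain)
    return 1.0 / ((1.0 - comm_fraction) + new_comm)

# Assumed MoE workload where all-to-all occupies 40% of step time:
speedup = step_speedup(0.40)
print(f"{speedup:.3f}x")  # roughly a 1.12x end-to-end step speedup
```

The takeaway is that the headline 35% figure compounds with how communication-bound the workload is: a dense model spending 10% of its step on collectives barely notices, while an MoE model dominated by all-to-all exchanges captures most of the gain.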
MRC represents a significant step toward the "AI Factory" vision, where the network is no longer a bottleneck but a seamless backplane for planetary-scale intelligence.