NVIDIA has introduced Multipath Reliable Connection (MRC), a revolutionary transport protocol designed to eliminate the "Network Wall" in AI training.
In modern AI training, network reliability is as critical as GPU performance. A single dropped packet can stall a massive training job across thousands of GPUs. NVIDIA’s new MRC protocol addresses this by distributing traffic across multiple network paths simultaneously.
By implementing MRC, data center operators can achieve 99.9% network utilization for all-to-all collectives, which are the backbone of LLM training. This translates to a 15-20% reduction in total training time for frontier-scale models.