Kubernetes v1.35: The AI Conformance Leap
By Dillip Chowdary • Mar 24, 2026
For years, running **AI** workloads on **Kubernetes** felt like forcing a square peg into a round hole. While K8s was designed for stateless web microservices, AI models require stateful, high-throughput, and hardware-dependent environments. Today, the **CNCF** (Cloud Native Computing Foundation) has officially bridged that gap with the release of **Kubernetes v1.35** and the accompanying **AI Conformance v1.35** standard.
Dynamic Resource Allocation (DRA) goes GA
The headline feature of v1.35 is the General Availability of **Dynamic Resource Allocation** (DRA). Before DRA, managing GPUs in Kubernetes was limited to simple integer requests (e.g., `nvidia.com/gpu: 1`). This led to massive resource fragmentation and underutilization. DRA allows for much more granular control, enabling pods to request specific hardware features like **NVLink** topologies, fractional GPU slicing (MIG/MPS), and even specific memory bandwidth profiles.
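Under DRA, a workload expresses its hardware needs through a `ResourceClaim` rather than an opaque integer count. The sketch below assumes a vendor DRA driver has already registered a `DeviceClass` — the class name, claim name, and image are all placeholders:

```yaml
# ResourceClaim: asks the DRA driver for one device from a
# (hypothetical) "gpu.example.com" DeviceClass.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: training-gpu
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.example.com
---
# Pod: references the claim; the scheduler only binds the pod to a
# node whose driver can actually satisfy it.
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  resourceClaims:
  - name: gpu
    resourceClaimName: training-gpu
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest  # placeholder image
    resources:
      claims:
      - name: gpu
```

Compared to `nvidia.com/gpu: 1`, the claim is a first-class API object, so it can be inspected, shared between pods, and extended with vendor-specific constraints.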
Technically, DRA introduces a new resource claim model that decouples device discovery from the core kubelet. This allows vendors to write specialized resource drivers that can handle the complex "gang scheduling" required for large-scale **distributed training** jobs. If a pod needs 8 GPUs connected via a specific high-speed interconnect, DRA ensures that the scheduler only places that pod on a node that meets the exact topological requirements.
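Those topological requirements are expressed as CEL selectors on the device request. A sketch of an eight-GPU claim — the attribute domain and value here are entirely hypothetical and depend on what the vendor's driver publishes:

```yaml
# ResourceClaim for 8 GPUs that the (hypothetical) driver advertises
# as connected over an NVLink interconnect.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: nvlink-gpus
spec:
  devices:
    requests:
    - name: gpus
      deviceClassName: gpu.example.com   # placeholder class
      allocationMode: ExactCount
      count: 8
      selectors:
      - cel:
          # attribute name is illustrative; real names come from the driver
          expression: device.attributes["dra.example.com"].interconnect == "nvlink"
```

If no node can satisfy all eight devices under that selector, the pod simply stays pending rather than landing on a topologically unsuitable node.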
The AI Conformance Program
Alongside the software release, the CNCF has launched the **AI Conformance Program**. Similar to the standard Kubernetes conformance tests, this program ensures that cloud providers (AWS, Google, Azure) provide a consistent experience for AI developers. A "CNCF AI Certified" cluster must support specific APIs for **VRAM** isolation, automated driver lifecycle management, and native integration with the **Kueue** job queueing system.
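In practice, the Kueue integration means batch workloads are submitted in a suspended state and tagged with a queue name; Kueue admits them once quota is available. A minimal sketch, with a placeholder queue name and image:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-llm
  labels:
    kueue.x-k8s.io/queue-name: team-ml  # LocalQueue name (placeholder)
spec:
  suspend: true        # Kueue unsuspends the Job when quota is granted
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/finetune:latest  # placeholder
        resources:
          requests:
            nvidia.com/gpu: "1"
```

Because admission is gated on the whole Job, all four workers start together, which is exactly the gang-scheduling behavior distributed training needs.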
This is a major win for **Sovereign Cloud** providers. By adhering to the AI Conformance standard, smaller providers can offer a "Kubernetes-native" AI experience that is compatible with the same **Helm** charts and **Kubeflow** pipelines used in the major public clouds. This reduces vendor lock-in and allows enterprises to move their training workloads to the most cost-effective region without rewriting their infrastructure code.
Technical Insight: Multi-Cluster AI Mesh
Kubernetes v1.35 also introduces alpha support for Multi-Cluster AI Mesh. This allows a single training job to span multiple physical clusters, using Submariner or Cilium ClusterMesh to handle the high-speed cross-cluster networking required for gradient synchronization.
Optimizing for Inference: Sidecar Containers and WASM
While training gets the headlines, v1.35 also brings significant improvements for **AI Inference**. The new **Inference Sidecar** pattern allows model weights to be loaded from a shared volume into a sidecar container, reducing pod startup times (cold starts) by up to 80%. There is also improved support for **WebAssembly** (Wasm) runtimes, which are increasingly being used to run lightweight models at the edge with minimal overhead.
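The sidecar pattern builds on Kubernetes-native sidecars: an init container with `restartPolicy: Always`, which starts before the main container and keeps running alongside it. A sketch of the weight-loading layout, with placeholder images and mount paths:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference
spec:
  volumes:
  - name: weights
    emptyDir: {}            # shared volume holding the model weights
  initContainers:
  - name: model-loader      # native sidecar: fetches and caches weights
    image: registry.example.com/loader:latest     # placeholder
    restartPolicy: Always   # marks this init container as a sidecar
    volumeMounts:
    - name: weights
      mountPath: /models
  containers:
  - name: server
    image: registry.example.com/inference:latest  # placeholder
    volumeMounts:
    - name: weights
      mountPath: /models
      readOnly: true
```

Because the loader is a sidecar rather than a plain init container, the inference server can restart or scale its readiness independently while the cached weights stay warm in the shared volume.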
With v1.35, Kubernetes has officially transitioned from a container orchestrator to an **AI Operating System**. The ecosystem is now focused on "Day 2" operations: observability for GPU metrics via **Prometheus**, automated scaling based on inference latency, and secure multi-tenancy for shared AI clusters. The road to **Artificial General Intelligence** (AGI) is being paved with YAML.