Apple Siri + Gemini on Private Cloud Compute: A Deep Dive into the 3-Tier Architecture, Model Distillation, and What It Means for Developers
Top Highlights
- iOS 26.4 shipped Gemini-powered Siri on March 1, 2026 — complex reasoning, multi-step planning, on-screen awareness via Neural Engine pixel reading
- 3-tier routing architecture: on-device Apple Foundation Models → Apple's Private Cloud Compute (PCC) → Google Gemini, with each tier handling a different complexity band
- Model distillation confirmed March 25, 2026: Apple can compress Gemini into smaller on-device models using knowledge distillation, reducing cloud dependency over time
- PCC uses stateless Apple Silicon servers — ephemeral processing, no persistent storage, PII stripped before any Gemini call; cryptographic attestation allows public verification
- WWDC 2026 (June 8–12) will expand SiriKit and App Intents APIs to expose cross-app reasoning and on-screen context to third-party developers
The Partnership: What Apple and Google Actually Agreed To
On January 12, 2026, Apple and Google announced a multi-year strategic collaboration in which Google's Gemini model family powers a rebuilt version of Siri. The financial terms disclosed include a $1 billion annual payment from Apple to Google — structured not as a search deal (Apple already pays Google ~$20B/year for default search) but as a separate AI infrastructure and licensing arrangement. This makes it among the largest AI licensing contracts publicly disclosed to date.
What Apple gets: access to a 1.2 trillion parameter Gemini model, Google's multimodal vision capabilities (used for on-screen awareness), and critically, the ability to distill Gemini's knowledge into smaller Apple Foundation Models that run entirely on-device. What Google gets: deployment across more than 1 billion active iOS devices, and the largest production stress-test of Gemini inference throughput ever attempted.
The arrangement contains a notable tension that surfaced in February 2026 when Google executives publicly contradicted Apple's privacy messaging: Google stated that some Siri queries do route to Google's own servers, not exclusively through Apple's PCC. Apple has since clarified that only queries above a complexity threshold that PCC cannot resolve independently fall through to Google's infrastructure — and those queries pass through additional PII-stripping layers first. Understanding exactly where queries go requires understanding the three-tier routing model.
The 3-Tier Processing Architecture
Every Siri query in iOS 26.4+ is evaluated by a routing classifier that runs entirely on-device. This classifier — itself a small fine-tuned model — determines which tier should handle the query based on complexity, context requirements, and whether the response requires real-time data or large-context reasoning. The three tiers operate as a waterfall with increasing privacy cost at each level.
On-Device — Apple Foundation Models
~70% of queries. Runs on the Apple Neural Engine using distilled Gemini student models and Apple's own Foundation Models. Handles timers, reminders, app launches, simple Q&A, on-screen awareness for displayed content, and short-context reasoning. Zero data leaves the device. Latency: <200ms.
Examples: "Set a timer for 10 minutes" · "What's on my screen?" · "Read this email to me"
Apple Private Cloud Compute (PCC)
~25% of queries. Routes to Apple Silicon servers running a hardened OS with stateless, ephemeral computation. PII (names, addresses, emails, phone numbers) is stripped before processing. Handles cross-app reasoning, multi-turn conversations requiring broader context, and tasks requiring Apple's larger Foundation Model variants. No data persisted, no logs retained.
Examples: "Summarise all my emails from Sarah this month" · "Plan a trip to Tokyo next week using my calendar"
Google Gemini (via PCC Privacy Proxy)
~5% of queries. The most complex queries — real-time knowledge, visual understanding beyond on-device Neural Engine capabilities, long-context document analysis — route through PCC's privacy proxy to the full 1.2T-parameter Gemini model on Google's infrastructure. The privacy proxy applies an additional anonymisation layer; Google processes the query in stateless compute containers and returns the response, retaining no query data.
Examples: "Analyse this 200-page contract and flag unusual clauses" · "What does this sign say?" (foreign language)
```swift
func routeQuery(query: SiriQuery) -> ProcessingTier {
    let complexity = onDeviceClassifier.evaluate(query)
    if complexity.requiresRealTimeKnowledge || complexity.contextTokens > 32_000 {
        // Strip PII, route via PCC privacy proxy to Gemini
        return .geminiViaPrivacyProxy(piiStripped: true)
    }
    if complexity.requiresLargeContext || complexity.crossAppReasoning {
        // Route to Apple PCC — stateless, ephemeral, PII stripped
        return .applePrivateCloudCompute(persist: false)
    }
    // Handle entirely on-device with Apple Neural Engine
    return .onDevice(model: .appleFoundationModel)
}
```
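The PII-stripping step that precedes any cloud hop is not publicly documented. As a minimal sketch of the idea — regex redaction of emails and phone numbers into typed placeholders, with the function name and patterns being our own illustration rather than Apple's pipeline — it might look like this:

```swift
import Foundation

/// Replaces obvious PII patterns with typed placeholders before a query
/// leaves the device. A toy redactor for illustration only — the real
/// pipeline would also use entity recognition for names and addresses.
func stripPII(from query: String) -> String {
    let patterns: [(regex: String, placeholder: String)] = [
        ("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}", "<EMAIL>"),
        ("\\+?[0-9][0-9()\\-\\s]{7,}[0-9]", "<PHONE>")
    ]
    var redacted = query
    for (pattern, placeholder) in patterns {
        redacted = redacted.replacingOccurrences(
            of: pattern, with: placeholder, options: .regularExpression)
    }
    return redacted
}
```

For example, `stripPII(from: "Email jane@example.com or call +1 555 010 9999")` yields `"Email <EMAIL> or call <PHONE>"` — the cloud tier sees the query's intent, not the identifiers.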
Private Cloud Compute: How Apple Built a Verifiable Privacy Cloud
Apple's Private Cloud Compute is not a conventional cloud infrastructure. It was purpose-built to satisfy a single constraint: technically enforceable privacy guarantees — meaning the system cannot be coerced into revealing user data even under a legal subpoena or an insider threat from Apple's own site reliability engineers.
Apple Silicon Server Nodes
PCC nodes run on custom Apple Silicon servers that carry the same Secure Enclave architecture found in iPhones and Macs. The Secure Enclave provides hardware-level cryptographic isolation — the OS running inference cannot read the keys that protect user data in transit. This means even a fully compromised PCC OS cannot retrospectively decrypt previously processed queries.
Stateless Ephemeral Processing
Every PCC compute node is stateless by design. When a query arrives, it is processed in an isolated memory space that is zeroed immediately after the response is generated. There is no write-back to persistent storage, no session state, and no logging of query content. The PCC OS enforces this at the kernel level — application code running inference cannot call any storage API; those syscalls are blocked in the hardened OS configuration.
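PCC's kernel-level enforcement cannot be reproduced in application code, but the zero-after-use discipline it describes can be sketched in user space. This is purely illustrative — the function and its shape are our own, not PCC's actual implementation:

```swift
/// Processes a request in an isolated working buffer and zeroes that buffer
/// before the function exits — a user-space illustration of PCC's
/// zero-after-use rule (the real enforcement happens in the hardened kernel).
func processEphemerally(_ request: [UInt8], handler: ([UInt8]) -> [UInt8]) -> [UInt8] {
    var working = request               // isolated copy for this query only
    defer {
        for i in working.indices {      // zero the working memory on exit
            working[i] = 0
        }
    }
    return handler(working)
}
```

The `defer` runs after the response is produced but before the function returns, mirroring the "zeroed immediately after the response is generated" guarantee at toy scale.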
Public Verifiability via Cryptographic Attestation
Apple has open-sourced key PCC components and provides a Virtual Research Environment that allows security researchers to install and run the PCC software stack on Apple Silicon Macs. Every PCC node publishes a cryptographic attestation of its software configuration — a signed hash of the exact binary stack running on that node — which iOS clients verify before routing any query to it. If Apple were to deploy a modified PCC build that weakened privacy guarantees, the attestation would change, iOS clients would reject it, and the deviation would be publicly detectable.
```swift
// Before routing any query to PCC, iOS verifies:
// 1. The PCC node's attestation matches a known-good software hash
// 2. The attestation is signed by Apple's Secure Enclave key
// 3. The software hash is in Apple's public transparency log
let attestation = await pccNode.fetchAttestation()
guard attestation.softwareHash == knownGoodHash,
      attestation.signedBy == AppleSecureEnclavePublicKey,
      transparencyLog.contains(attestation.softwareHash) else {
    // Refuse to route — node software not verified
    throw PCCError.attestationFailed
}
// Only now send the (PII-stripped) query
```
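The transparency log that makes deviations "publicly detectable" behaves like an append-only hash chain: each entry commits to everything before it, so history cannot be silently rewritten. A toy model — the type and the stand-in hash function are ours, not Apple's log format:

```swift
/// Append-only transparency log: each entry's chain hash commits to all
/// prior entries, so removing or rewriting history changes every later hash.
struct TransparencyLog {
    private(set) var entries: [(softwareHash: String, chainHash: UInt64)] = []

    // djb2 — a stand-in for a real cryptographic hash, for illustration only
    private func hash(_ s: String) -> UInt64 {
        s.utf8.reduce(UInt64(5381)) { ($0 &* 33) &+ UInt64($1) }
    }

    mutating func append(_ softwareHash: String) {
        let previous = entries.last?.chainHash ?? 0
        entries.append((softwareHash, hash("\(previous):\(softwareHash)")))
    }

    func contains(_ softwareHash: String) -> Bool {
        entries.contains { $0.softwareHash == softwareHash }
    }
}
```

Because every client checks membership against the same chained log, a PCC build published to some users but omitted from the log would fail the `transparencyLog.contains` check above.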
Model Distillation: Bringing Gemini On-Device
The most strategically significant detail confirmed on March 25, 2026 is that Apple's licensing agreement includes the right to distill Gemini's capabilities into smaller models that run entirely on-device. Knowledge distillation is a training technique where a large "teacher" model's output probability distributions — not just its final answers, but its entire uncertainty profile across all possible responses — are used to train a smaller "student" model. The student learns to reason like the teacher, not just mimic its outputs.
For Apple, this has a profound architectural implication: every tier-3 Gemini query that runs today generates training signal that Apple can use to improve its on-device and PCC-tier models tomorrow. Over time, the fraction of queries requiring Google's servers should decrease as distilled models become capable of handling progressively more complex requests. This means Google's role in the architecture is structurally self-eroding — Apple is licensing both compute and training data in a single deal.
Distillation in Practice
- Teacher: Gemini 1.2T parameter model — generates probability distribution over token space for each response
- Student: Apple Foundation Model (3B–7B parameters) — trained on the teacher's distributions via KL-divergence loss, not just on ground-truth labels
- Result: Student model inherits teacher's reasoning patterns on in-distribution tasks at ~1/100th the inference cost
- Limitation: Distillation transfers in-distribution capability — out-of-distribution tasks (novel reasoning, rare languages) still require the teacher
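The KL-divergence loss in the list above can be written out directly. In the standard formulation, both teacher and student logits are softened with a temperature before the distributions are compared — a self-contained sketch (function names are ours; real training also scales by T² and mixes in a cross-entropy term on ground-truth labels, which this omits):

```swift
import Foundation

/// Temperature-softened softmax over raw logits.
func softmax(_ logits: [Double], temperature: Double = 2.0) -> [Double] {
    let scaled = logits.map { $0 / temperature }
    let maxVal = scaled.max() ?? 0            // subtract max for numerical stability
    let exps = scaled.map { exp($0 - maxVal) }
    let total = exps.reduce(0, +)
    return exps.map { $0 / total }
}

/// KL(teacher ‖ student) for one token position: the distillation loss term
/// that pulls the student's full distribution toward the teacher's — not
/// just its argmax answer.
func distillationLoss(teacherLogits: [Double], studentLogits: [Double],
                      temperature: Double = 2.0) -> Double {
    let p = softmax(teacherLogits, temperature: temperature)
    let q = softmax(studentLogits, temperature: temperature)
    return zip(p, q).reduce(0) { loss, pair in
        pair.0 > 0 ? loss + pair.0 * log(pair.0 / pair.1) : loss
    }
}
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the distributions diverge — this is what "inherits the teacher's uncertainty profile" means in practice.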
Apple has not disclosed a timeline for which capabilities will be distilled first. Based on the current Tier 2 query distribution, the most likely near-term candidates are longer-context summarisation and cross-document reasoning — tasks currently handled by PCC's larger model variants — which would move the Tier 1 / Tier 2 boundary significantly toward the device.
On-Screen Awareness: How Siri Sees Your Display
On-screen awareness is the feature most visible to end users and the one with the most direct developer implications. Using the Apple Neural Engine to process display pixels in real time, Siri builds a semantic model of what is currently visible — not just the text, but the visual layout, interactive elements, and spatial relationships between UI components. This happens entirely on-device; no screenshot ever leaves the phone.
When a user says "Send this to Sarah" while viewing a photo, Siri identifies the photo, resolves "Sarah" against the user's contacts graph, determines the appropriate sharing mechanism (Messages, Mail, AirDrop), and executes the multi-step action — all without the user specifying app names or navigating menus. The same mechanism allows Siri to extract calendar events from emails, fill forms using data visible on-screen, and describe or translate foreign-language content in the viewport.
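Apple has not published this internal representation, but the semantic screen model described above can be pictured as a tree of typed elements with frames and supported actions, over which a deictic reference like "this" is resolved. Every type and function below is hypothetical:

```swift
// Hypothetical on-device representation of what Siri "sees" — not an Apple API.
enum ElementRole { case text, image, button, textField, container }

struct ScreenFrame {
    let x, y, width, height: Double
    var area: Double { width * height }
}

struct ScreenElement {
    let role: ElementRole
    let frame: ScreenFrame       // spatial position, used to resolve "this"
    let text: String?            // recognised or accessibility text
    let actions: [String]        // semantic actions the element supports
    var children: [ScreenElement] = []
}

/// Resolves "send this" to the most prominent on-screen element supporting
/// the requested action (largest matching area wins — a naive heuristic).
func resolveThis(in root: ScreenElement, supporting action: String) -> ScreenElement? {
    var best: ScreenElement?
    func walk(_ node: ScreenElement) {
        if node.actions.contains(action), node.frame.area > (best?.frame.area ?? 0) {
            best = node
        }
        node.children.forEach(walk)
    }
    walk(root)
    return best
}
```

The same structure explains the App Intents incentive below: an app that declares no semantic actions contributes only pixels to this tree, so Siri can describe it but not act inside it.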
For developers, the current on-screen awareness API surface is limited to what apps already expose through App Intents and NSUserActivity. Apps that have not adopted App Intents are treated as opaque by Siri's on-screen model — Siri can see the pixels but cannot perform semantic actions within the app. This creates an immediate developer incentive: adopting App Intents now puts your app in scope for Siri's cross-app reasoning before WWDC 2026 expands the API further.
Developer Implications: What to Build Before WWDC
App Intents — The Gateway to Siri Intelligence
App Intents is the framework Apple introduced in iOS 16 that allows apps to expose their capabilities as structured, parameterised actions. With Gemini-powered Siri, App Intents has become the integration surface for cross-app AI reasoning. An app that implements App Intents can have its actions invoked by Siri's planner component — the Gemini-backed system that chains multiple actions to complete a user's high-level goal.
```swift
import AppIntents

// Register a structured action Siri's planner can invoke
struct CreateInvoiceIntent: AppIntent {
    static var title: LocalizedStringResource = "Create Invoice"
    static var description = IntentDescription("Creates a new invoice in the app")

    @Parameter(title: "Client Name") var clientName: String
    @Parameter(title: "Amount") var amount: Double
    @Parameter(title: "Due Date") var dueDate: Date?

    func perform() async throws -> some IntentResult & ReturnsValue<Invoice> {
        let invoice = try await InvoiceService.create(
            client: clientName, amount: amount, dueDate: dueDate
        )
        return .result(value: invoice)
    }
}

// Siri can now chain this with other intents:
// "Create an invoice for the client I emailed yesterday for $2,500 due next Friday"
// → Siri reads email → extracts client name → calls CreateInvoiceIntent
```
Privacy-Aware Design for Siri Integration
Apple's privacy architecture has a direct impact on what data your App Intents should expose. When Siri routes a query to PCC or Gemini, it carries the intent parameters you declared — not raw app data. This means: design your intent parameters to contain the minimum information required for the action. An intent that takes a clientName: String is better than one taking a full Client object with email addresses and payment history, since only what you declare reaches the cloud tiers.
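The minimum-disclosure principle can be made concrete: project the rich in-app record down to only the fields the action needs before those fields ever become intent parameters. Plain structs for illustration (the types and function are ours — in a real app the narrow projection would back the `@Parameter` declarations of an intent like `CreateInvoiceIntent`):

```swift
// A full in-app record — far more data than the invoice action needs.
struct Client {
    let name: String
    let email: String
    let address: String
    let paymentHistory: [Double]
}

// The minimal projection an intent should declare as parameters: only these
// fields can ever reach the PCC or Gemini tiers, because only they are declared.
struct InvoiceIntentParameters {
    let clientName: String
    let amount: Double
}

/// Projects the rich record down to the minimum the action requires,
/// deliberately dropping email, address, and payment history on-device.
func makeIntentParameters(for client: Client, amount: Double) -> InvoiceIntentParameters {
    InvoiceIntentParameters(clientName: client.name, amount: amount)
}
```

The projection happens before routing, so even a Tier 3 query carries nothing the schema did not explicitly declare.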
WWDC 2026 — What to Expect
WWDC 2026 runs June 8–12 at Apple Park. Based on the architecture described above and Apple's historical WWDC pattern, expect the following developer-facing announcements:
- App Intents v3 — expanded parameter types, streaming responses, and background intent execution for long-running agentic tasks
- On-screen awareness API — programmatic hooks for apps to declare the semantic meaning of their screen content, improving Siri's understanding beyond pixel-level vision
- SiriKit deprecation path — legacy SiriKit domains (messaging, payments, workouts) likely folded into the App Intents model with migration guides
- PCC transparency tools for developers — APIs to verify PCC attestation from within apps, allowing privacy-sensitive apps to confirm their data routes only to verified PCC nodes
Strategic Analysis: Who Benefits Most
Apple gets Gemini's capabilities without the reputational cost of training a general-purpose internet-scale model. Apple's competitive advantage has always been hardware-software integration and privacy positioning — not foundation model research. Outsourcing the model while owning the privacy layer and the distillation pipeline is architecturally coherent with Apple's strengths.
Google gets Gemini deployed at 1 billion+ device scale, with Apple's privacy architecture providing cover for regulatory concerns about personal data. The $1B annual payment is less important than the inference volume — every complex Siri query processed by Gemini generates implicit feedback about real-world usage patterns that Google cannot buy anywhere else.
Developers face the most immediate pressure. Apps that do not adopt App Intents will progressively lose Siri surface area as the intelligence layer expands. By WWDC 2026, Siri will have cross-app reasoning capability that treats App Intents as the integration contract. Apps that have not invested in this integration will not participate in agentic workflows that iOS 26.5 enables — effectively ceding a growing share of user interaction to the OS layer.
The Privacy Controversy
Google's February 2026 statement that some queries reach Google's servers — contradicting Apple's pure-PCC messaging — highlights a structural ambiguity: Apple controls the routing classifier, and users cannot inspect routing decisions. Apple's transparency log covers PCC node software, not individual query routing. Security researchers have called for per-query routing transparency. Watch for WWDC 2026 to address this, likely through enhanced user-facing privacy reporting in Settings → Privacy → Siri Intelligence.
5 Actionable Takeaways for Developers
1. Adopt App Intents now, before WWDC. Apps already exposing structured intents will be in scope for Gemini-powered cross-app reasoning from day one of iOS 26.5. Apps that wait for WWDC documentation will miss the initial intelligence layer deployment window.
2. Design intent parameters for minimum necessary disclosure. Only the parameters you declare in App Intents reach the cloud processing tiers. Keep intent schemas narrow — this is both a privacy best practice and a performance optimisation (smaller payloads route faster).
3. Understand the 3-tier routing boundary for privacy-sensitive apps. If your app handles sensitive data (health, finance, legal), audit which of your App Intent parameters could reach Google's servers via Tier 3. Consider designing a PCC-only intent variant for sensitive actions if Apple provides a routing hint API at WWDC.
4. Watch the distillation roadmap. As Apple distills more Gemini capabilities on-device, the Tier 1 boundary expands. Apps optimised for on-device intelligence — low-latency, no network dependency — will benefit disproportionately as the on-device model becomes more capable.
5. Migrate off legacy SiriKit before WWDC. Apple's history with deprecated frameworks suggests a 2-year migration window. If your app uses legacy SiriKit messaging or payment domains, begin moving to App Intents equivalents now to avoid an emergency rewrite post-WWDC.