
Event-Driven APIs for Real-Time LLM Streaming [2026]

Dillip Chowdary
Tech Entrepreneur & Innovator · May 14, 2026 · 9 min read

Bottom Line

Treat model output as typed events, not raw string chunks. Once text, audio, and transcript deltas share one internal contract, your frontend can render real-time UX without model-specific branches.

Key Takeaways

  • Use response.output_text.delta for text streams and response.output_audio.delta for audio streams.
  • OpenAI recommends WebRTC for browser/mobile realtime clients and WebSocket for server-to-server.
  • The Realtime API session lifetime is capped at 60 minutes, so reconnect logic is part of the design.
  • Normalize provider events into one envelope with ordering, completion, and error semantics.

Real-time LLM UX breaks down when your API treats streaming as a special case. The stable approach is to model every token, transcript fragment, and audio chunk as an event with ordering, lifecycle, and payload metadata. OpenAI now exposes typed streaming events in the Responses API and low-latency multimodal events in the Realtime API, which makes it practical to design one event-driven backend that can serve chat, voice, and mixed-modality clients.

  • Responses API streams semantic SSE events such as response.created, response.output_text.delta, and response.completed.
  • Realtime API emits multimodal server events including response.output_audio.delta, response.output_audio_transcript.delta, and response.done.
  • Use one internal envelope so web, mobile, and voice clients subscribe to the same contract.
  • Keep ordering explicit with request IDs and provider sequence fields instead of relying on arrival time.

Design the event contract

Prerequisites

  • Node.js server with support for fetch, SSE, and WebSockets
  • An OPENAI_API_KEY stored on the server, never in the browser
  • Basic familiarity with Express or another HTTP framework
  • A client that can consume EventSource or WebSocket messages
  • A log policy that redacts prompts and transcripts before storage

Bottom Line

Do not leak provider-specific event trees into your app. Translate them once at the edge into started, delta, completed, and failed events, then let every client render against that contract.

Step 1: Define a canonical envelope

OpenAI gives you rich event types, but product code should not know whether a token came from Responses or Realtime. Define one internal schema and keep it boring. That schema should answer four questions for every message: what stream it belongs to, where it belongs in order, what kind of content arrived, and whether the stream is still alive.

type StreamEvent =
  | {
      type: "stream.started";
      requestId: string;
      provider: "openai";
      model: string;
      at: number;
    }
  | {
      type: "text.delta";
      requestId: string;
      sequence: number;
      text: string;
    }
  | {
      type: "audio.delta";
      requestId: string;
      sequence: number;
      mimeType: "audio/pcm" | "audio/pcmu" | "audio/pcma";
      base64: string;
    }
  | {
      type: "transcript.delta";
      requestId: string;
      sequence: number;
      text: string;
    }
  | {
      type: "stream.completed";
      requestId: string;
      at: number;
    }
  | {
      type: "stream.failed";
      requestId: string;
      code: string;
      message: string;
    };

The win is architectural, not cosmetic:

  • Provider upgrades stay isolated in one adapter layer.
  • Your frontend state machine only handles a small, predictable set of events.
  • Text-only and voice features share the same persistence, tracing, and retry logic.
  • Replay tests become easy because event logs are provider-agnostic JSON.

Watch out: OpenAI’s streaming guide notes that partial output is harder to moderate than a fully buffered response. If you need approval gates, add them before you fan out partial events to end users.
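Once the contract is this small, the client can be a pure reducer over events. This is a minimal sketch, assuming a simplified subset of the envelope above (only the fields the reducer reads are typed), not a full state machine:

```typescript
// Minimal client-side state for one stream, driven only by envelope events.
type StreamState = {
  status: "idle" | "streaming" | "done" | "failed";
  text: string;
};

// A simplified subset of the StreamEvent union from the schema above.
type Envelope =
  | { type: "stream.started" }
  | { type: "text.delta"; text: string }
  | { type: "stream.completed" }
  | { type: "stream.failed"; message: string };

function reduce(state: StreamState, event: Envelope): StreamState {
  switch (event.type) {
    case "stream.started":
      // A new stream resets accumulated text.
      return { status: "streaming", text: "" };
    case "text.delta":
      // Deltas only ever append; ordering is the adapter's job.
      return { ...state, text: state.text + event.text };
    case "stream.completed":
      return { ...state, status: "done" };
    case "stream.failed":
      return { ...state, status: "failed" };
  }
}
```

Because the reducer is pure, the same function drives live rendering, replay tests, and time-travel debugging without transport-specific branches.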

Build the text stream path

Step 2: Proxy the Responses stream over SSE

For typed text streaming, the simplest production shape is server-side SSE. OpenAI’s Responses API supports text and image inputs and can stream semantic events when you set stream: true. Your server should consume the provider stream, translate event types, and emit your internal contract to the browser.

import express from "express";
import OpenAI from "openai";
import crypto from "node:crypto";

const app = express();
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

function emit(res, event, data) {
  res.write(`event: ${event}\n`);
  res.write(`data: ${JSON.stringify(data)}\n\n`);
}

app.get("/api/stream", async (req, res) => {
  const prompt = String(req.query.q ?? "Summarize event-driven streaming briefly.");
  const requestId = crypto.randomUUID();

  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache, no-transform");
  res.setHeader("Connection", "keep-alive");

  emit(res, "stream.started", {
    type: "stream.started",
    requestId,
    provider: "openai",
    model: "gpt-5.2",
    at: Date.now()
  });

  // Cancel the upstream provider stream when the client disconnects,
  // so abandoned requests do not keep burning tokens.
  const abort = new AbortController();
  req.on("close", () => abort.abort());

  try {
    const stream = await client.responses.create(
      {
        model: "gpt-5.2",
        input: [
          {
            role: "user",
            content: [{ type: "input_text", text: prompt }]
          }
        ],
        stream: true
      },
      { signal: abort.signal }
    );

    for await (const event of stream) {
      if (event.type === "response.output_text.delta") {
        emit(res, "text.delta", {
          type: "text.delta",
          requestId,
          sequence: event.sequence_number,
          text: event.delta
        });
      }

      if (event.type === "response.completed") {
        emit(res, "stream.completed", {
          type: "stream.completed",
          requestId,
          at: Date.now()
        });
        res.end();
        break; // stop reading once the stream reaches a terminal state
      }
    }
  } catch (err) {
    emit(res, "stream.failed", {
      type: "stream.failed",
      requestId,
      code: "upstream_error",
      message: err instanceof Error ? err.message : "unknown upstream error"
    });
    res.end();
  }
});

If your team shares or reviews payload samples often, run the event examples through the Code Formatter so the schema stays readable in docs and PRs.

Step 3: Keep the browser dumb

Your client should subscribe, append deltas, and react to lifecycle events. It should not parse provider-specific event names, infer ordering from timing, or guess when a stream is done.

const source = new EventSource("/api/stream?q=Explain%20event-driven%20APIs");
let text = "";

source.addEventListener("text.delta", (e) => {
  const payload = JSON.parse(e.data);
  text += payload.text;
  renderText(text);
});

source.addEventListener("stream.completed", () => {
  source.close();
  setStatus("done");
});

source.onerror = () => {
  source.close();
  setStatus("retrying");
};

This division of labor matters. Once your browser only understands text.delta and friends, you can swap transport, provider, or model without rewriting UI state.
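If your transport can deliver deltas out of order, the adapter should repair ordering before events reach the client. A sketch of a reorder buffer under that assumption (the names `createReorderBuffer` and `onDelta` are illustrative, not part of any SDK):

```typescript
// Release deltas strictly by sequence number, buffering any that arrive
// ahead of the next expected one. Arrival time is never used for ordering.
function createReorderBuffer(onDelta: (text: string) => void) {
  const pending = new Map<number, string>();
  let next = 1; // first expected sequence number

  return function push(sequence: number, text: string) {
    pending.set(sequence, text);
    // Flush every contiguous delta starting from the next expected one.
    while (pending.has(next)) {
      onDelta(pending.get(next)!);
      pending.delete(next);
      next += 1;
    }
  };
}
```

Pushing sequence 2 before sequence 1 holds it in the buffer; pushing 1 then flushes both in order, so the UI always appends contiguous text.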

Add multimodal output

Step 4: Bridge Realtime audio and transcript events

When you need sub-second voice experiences, switch to the Realtime API. OpenAI documents it as a low-latency multimodal interface for audio, images, and text inputs with audio and text outputs. For browser and mobile clients, OpenAI recommends WebRTC. For server-to-server work, use WebSocket and forward normalized events to your application clients.

import WebSocket from "ws";

const ws = new WebSocket("wss://api.openai.com/v1/realtime?model=gpt-realtime", {
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`
  }
});

ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      type: "realtime",
      audio: {
        output: {
          format: { type: "audio/pcm", rate: 24000 }
        }
      }
    }
  }));

  ws.send(JSON.stringify({
    type: "response.create",
    response: {
      output_modalities: ["audio", "text"]
    }
  }));
});

let sequence = 0; // adapter-side monotonic counter; never derive ordering from time

ws.on("message", (message) => {
  const event = JSON.parse(message.toString());

  if (event.type === "response.output_audio.delta") {
    fanOut({
      type: "audio.delta",
      requestId: event.response_id,
      sequence: ++sequence,
      mimeType: "audio/pcm",
      base64: event.delta
    });
  }

  if (event.type === "response.output_audio_transcript.delta") {
    fanOut({
      type: "transcript.delta",
      requestId: event.response_id,
      sequence: ++sequence,
      text: event.delta
    });
  }

  if (event.type === "response.done") {
    fanOut({
      type: "stream.completed",
      requestId: event.response.id,
      at: Date.now()
    });
  }
});

Three implementation notes matter here:

  • Audio bytes arrive Base64-encoded, so decode them with Buffer.from(delta, "base64") only at the point where you need playback or storage.
  • The Realtime guide states that the full response.done payload omits raw audio bytes, so you must capture audio from response.output_audio.delta as it arrives.
  • The Realtime conversations guide caps a session at 60 minutes, so long-running voice apps need reconnect and session handoff logic.
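The decode-at-the-edge rule from the first note can be sketched in a few lines: keep Base64 in transit and decode to raw PCM bytes only where playback or storage actually needs them. The function names here are illustrative:

```typescript
// Decode one Base64 audio delta into raw PCM bytes.
function decodeAudioDelta(base64: string): Buffer {
  return Buffer.from(base64, "base64");
}

// Concatenate decoded chunks in delta order for later storage.
// A real playback path would feed chunks to an audio sink instead
// of accumulating them in memory.
function assembleAudio(deltas: string[]): Buffer {
  return Buffer.concat(deltas.map(decodeAudioDelta));
}
```

Keeping Base64 in the envelope also means the fan-out layer never touches binary data, which keeps JSON serialization and logging uniform across text and audio events.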

Verify the event flow

Step 5: Test lifecycle, ordering, and cleanup

You are done when the stream is correct under failure, not when the happy path prints tokens. Verification should focus on invariants that stay true across both APIs.

  1. Assert that every stream emits one start event and exactly one terminal event.
  2. Assert that deltas can be replayed in order to reconstruct the final text or transcript.
  3. Assert that disconnects close upstream provider streams and release sockets.
  4. Assert that the UI disables input only for the active request ID, not globally.
A healthy stream, captured as raw SSE frames, looks like this:

event: stream.started
data: {"type":"stream.started","requestId":"req_123"}

event: text.delta
data: {"type":"text.delta","requestId":"req_123","sequence":1,"text":"Design "}

event: text.delta
data: {"type":"text.delta","requestId":"req_123","sequence":2,"text":"around events"}

event: stream.completed
data: {"type":"stream.completed","requestId":"req_123"}

Expected behavior:

  • The client renders partial text immediately after the first delta.
  • The final UI state flips only after stream.completed or stream.failed.
  • Audio playback continues from chunked deltas even if transcript deltas arrive on a different cadence.
  • Logs store redacted payloads, not raw prompts or user speech. If you need a quick redaction pass for debugging samples, the Data Masking Tool is a practical preprocessing step.
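The first two invariants are easy to assert over a provider-agnostic event log. This is a sketch assuming the internal envelope shape, with `checkStream` as a hypothetical test helper:

```typescript
// The subset of logged envelope fields the checks need.
type LoggedEvent =
  | { type: "stream.started"; requestId: string }
  | { type: "text.delta"; requestId: string; sequence: number; text: string }
  | { type: "stream.completed"; requestId: string }
  | { type: "stream.failed"; requestId: string };

// Verify one start, exactly one terminal event, and that deltas
// replayed in sequence order reconstruct the final text.
function checkStream(events: LoggedEvent[]): { ok: boolean; finalText: string } {
  const starts = events.filter((e) => e.type === "stream.started").length;
  const terminals = events.filter(
    (e) => e.type === "stream.completed" || e.type === "stream.failed"
  ).length;
  const finalText = events
    .filter((e): e is Extract<LoggedEvent, { type: "text.delta" }> =>
      e.type === "text.delta"
    )
    .sort((a, b) => a.sequence - b.sequence)
    .map((e) => e.text)
    .join("");
  return { ok: starts === 1 && terminals === 1, finalText };
}
```

Because the log is plain JSON, the same checker runs against recordings from the Responses path and the Realtime bridge without modification.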

Troubleshoot and what’s next

Troubleshooting: top 3

  1. Tokens arrive out of order. Do not concatenate by arrival time. Preserve sequence_number from the Responses stream when it exists, and otherwise add your own monotonic sequence at the adapter boundary.
  2. Audio never plays back after completion. In Realtime, raw audio is not included in response.done. Buffer or forward every response.output_audio.delta chunk as it arrives.
  3. Streams leak after users navigate away. Wire client disconnects to upstream cancellation and close EventSource or WebSocket objects explicitly. Leaked streams silently inflate cost and exhaust connection pools.

What’s next

  • Add a resumable event log so clients can reconnect and replay from the last acknowledged sequence.
  • Introduce backpressure metrics for slow mobile clients and oversized audio buffers.
  • Split transport from semantics so SSE, WebSocket, and WebRTC all publish the same internal event envelope.
  • Version your contract early. A simple schemaVersion field prevents painful client migrations later.
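Contract versioning can start as one extra field on every envelope. A minimal sketch, with `wrap` and `accepts` as hypothetical helper names:

```typescript
// Every outgoing envelope carries the contract version it was built for.
const SCHEMA_VERSION = 1;

function wrap<T extends { type: string }>(event: T) {
  return { schemaVersion: SCHEMA_VERSION, ...event };
}

// Clients reject unknown or newer versions instead of silently
// misreading fields that may have changed meaning.
function accepts(message: { schemaVersion?: number }): boolean {
  return message.schemaVersion === SCHEMA_VERSION;
}
```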

The core pattern is simple: provider events in, product events out. Once that boundary is explicit, real-time LLM streaming stops being a special feature and becomes a standard event pipeline your whole stack can reason about.

Frequently Asked Questions

How do I design a streaming API for LLM tokens without coupling the frontend to one provider?

Translate provider-specific events into a small internal contract such as stream.started, text.delta, stream.completed, and stream.failed. Keep provider event names like response.output_text.delta inside your backend adapter layer so the frontend only depends on product semantics.

Should I use SSE or WebSocket for real-time LLM streaming?

Use SSE when you mainly need server-to-client text streaming with simple browser support. Use WebSocket when you need bidirectional messaging, low-latency audio, or a server-to-server bridge to the Realtime API.

Why is my Realtime response.done event missing audio bytes?

Because the Realtime API sends raw audio incrementally in response.output_audio.delta events. The final response.done event is for response state and assembled metadata, not for replaying the full binary stream.

What should I log from streamed LLM responses in production?

Log request IDs, model names, event types, timings, and redacted payload summaries. Avoid storing raw prompts, transcript fragments, or Base64 audio unless you have a clear retention policy and a privacy reason to keep them.
