Event-Driven APIs for Real-Time LLM Streaming [2026]
Bottom Line
Treat model output as typed events, not raw string chunks. Once text, audio, and transcript deltas share one internal contract, your frontend can render real-time UX without model-specific branches.
Key Takeaways
- Use `response.output_text.delta` for text streams and `response.output_audio.delta` for audio streams.
- OpenAI recommends WebRTC for browser/mobile realtime clients and WebSocket for server-to-server.
- The Realtime API session lifetime is capped at 60 minutes, so reconnect logic is part of the design.
- Normalize provider events into one envelope with ordering, completion, and error semantics.
Real-time LLM UX breaks down when your API treats streaming as a special case. The stable approach is to model every token, transcript fragment, and audio chunk as an event with ordering, lifecycle, and payload metadata. OpenAI now exposes typed streaming events in the Responses API and low-latency multimodal events in the Realtime API, which makes it practical to design one event-driven backend that can serve chat, voice, and mixed-modality clients.
- The Responses API streams semantic SSE events such as `response.created`, `response.output_text.delta`, and `response.completed`.
- The Realtime API emits multimodal server events including `response.output_audio.delta`, `response.output_audio_transcript.delta`, and `response.done`.
- Use one internal envelope so web, mobile, and voice clients subscribe to the same contract.
- Keep ordering explicit with request IDs and provider sequence fields instead of relying on arrival time.
Design the event contract
Prerequisites
- A Node.js server with support for `fetch`, SSE, and WebSockets
- An `OPENAI_API_KEY` stored on the server, never in the browser
- Basic familiarity with Express or another HTTP framework
- A client that can consume `EventSource` or WebSocket messages
- A log policy that redacts prompts and transcripts before storage
Bottom Line
Do not leak provider-specific event trees into your app. Translate them once at the edge into started, delta, completed, and failed events, then let every client render against that contract.
Step 1: Define a canonical envelope
OpenAI gives you rich event types, but product code should not know whether a token came from Responses or Realtime. Define one internal schema and keep it boring. That schema should answer four questions for every message: what stream it belongs to, where it belongs in order, what kind of content arrived, and whether the stream is still alive.
```typescript
type StreamEvent =
| {
type: "stream.started";
requestId: string;
provider: "openai";
model: string;
at: number;
}
| {
type: "text.delta";
requestId: string;
sequence: number;
text: string;
}
| {
type: "audio.delta";
requestId: string;
sequence: number;
mimeType: "audio/pcm" | "audio/pcmu" | "audio/pcma";
base64: string;
}
| {
type: "transcript.delta";
requestId: string;
sequence: number;
text: string;
}
| {
type: "stream.completed";
requestId: string;
at: number;
}
| {
type: "stream.failed";
requestId: string;
code: string;
message: string;
};
```

The win is architectural, not cosmetic:
- Provider upgrades stay isolated in one adapter layer.
- Your frontend state machine only handles a small, predictable set of events.
- Text-only and voice features share the same persistence, tracing, and retry logic.
- Replay tests become easy because event logs are provider-agnostic JSON, as the sketch below shows.
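Replay is worth making concrete. A minimal sketch of reconstructing the final text from a logged event array, assuming the `StreamEvent` union above (the `reduceText` name is illustrative):

```typescript
// Rebuild the final text for one request from a provider-agnostic log.
// Sorting by `sequence` makes the result independent of arrival order.
function reduceText(events: StreamEvent[], requestId: string): string {
  return events
    .filter(
      (e): e is Extract<StreamEvent, { type: "text.delta" }> =>
        e.type === "text.delta" && e.requestId === requestId
    )
    .sort((a, b) => a.sequence - b.sequence)
    .map((e) => e.text)
    .join("");
}
```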
Build the text stream path
Step 2: Proxy the Responses stream over SSE
For typed text streaming, the simplest production shape is server-side SSE. OpenAI’s Responses API supports text and image inputs and can stream semantic events when you set stream: true. Your server should consume the provider stream, translate event types, and emit your internal contract to the browser.
```javascript
import express from "express";
import OpenAI from "openai";
import crypto from "node:crypto";
const app = express();
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
function emit(res, event, data) {
res.write(`event: ${event}\n`);
res.write(`data: ${JSON.stringify(data)}\n\n`);
}
app.get("/api/stream", async (req, res) => {
const prompt = String(req.query.q ?? "Summarize this image and answer briefly.");
const requestId = crypto.randomUUID();
res.setHeader("Content-Type", "text/event-stream");
res.setHeader("Cache-Control", "no-cache, no-transform");
res.setHeader("Connection", "keep-alive");
emit(res, "stream.started", {
type: "stream.started",
requestId,
provider: "openai",
model: "gpt-5.2",
at: Date.now()
});
const stream = await client.responses.create({
model: "gpt-5.2",
input: [
{
role: "user",
content: [{ type: "input_text", text: prompt }]
}
],
stream: true
});
for await (const event of stream) {
if (event.type === "response.output_text.delta") {
emit(res, "text.delta", {
type: "text.delta",
requestId,
sequence: event.sequence_number,
text: event.delta
});
}
if (event.type === "response.completed") {
emit(res, "stream.completed", {
type: "stream.completed",
requestId,
at: Date.now()
});
res.end();
}
}
});If your team shares or reviews payload samples often, run the event examples through the Code Formatter so the schema stays readable in docs and PRs.
Step 3: Keep the browser dumb
Your client should subscribe, append deltas, and react to lifecycle events. It should not parse provider-specific event names, infer ordering from timing, or guess when a stream is done.
```javascript
const source = new EventSource("/api/stream?q=Explain%20event-driven%20APIs");
let text = "";
source.addEventListener("text.delta", (e) => {
const payload = JSON.parse(e.data);
text += payload.text;
renderText(text);
});
source.addEventListener("stream.completed", () => {
source.close();
setStatus("done");
});
source.onerror = () => {
source.close();
setStatus("retrying");
};This division of labor matters. Once your browser only understands text.delta and friends, you can swap transport, provider, or model without rewriting UI state.
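To make the transport swap concrete, here is a sketch of the same client state machine over a WebSocket, assuming your server fans out the internal envelope at a hypothetical `/ws/stream` endpoint:

```typescript
// Same small event set, different transport: no provider names leak in.
const socket = new WebSocket("wss://app.example.com/ws/stream");
let wsText = "";

socket.onmessage = (e) => {
  const event = JSON.parse(e.data);
  switch (event.type) {
    case "text.delta":
      wsText += event.text;
      renderText(wsText);
      break;
    case "stream.completed":
      socket.close();
      setStatus("done");
      break;
    case "stream.failed":
      socket.close();
      setStatus("failed");
      break;
  }
};
```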
Add multimodal output
Step 4: Bridge Realtime audio and transcript events
When you need sub-second voice experiences, switch to the Realtime API. OpenAI documents it as a low-latency multimodal interface for audio, images, and text inputs with audio and text outputs. For browser and mobile clients, OpenAI recommends WebRTC. For server-to-server work, use WebSocket and forward normalized events to your application clients.
```javascript
import WebSocket from "ws";
const ws = new WebSocket("wss://api.openai.com/v1/realtime?model=gpt-realtime", {
headers: {
Authorization: `Bearer ${process.env.OPENAI_API_KEY}`
}
});
ws.on("open", () => {
ws.send(JSON.stringify({
type: "session.update",
session: {
type: "realtime",
audio: {
output: {
format: { type: "audio/pcm", rate: 24000 }
}
}
}
}));
ws.send(JSON.stringify({
type: "response.create",
response: {
output_modalities: ["audio", "text"]
}
}));
});
ws.on("message", (message) => {
const event = JSON.parse(message.toString());
if (event.type === "response.output_audio.delta") {
fanOut({
type: "audio.delta",
requestId: event.response_id,
sequence: Date.now(),
mimeType: "audio/pcm",
base64: event.delta
});
}
if (event.type === "response.output_audio_transcript.delta") {
fanOut({
type: "transcript.delta",
requestId: event.response_id,
sequence: Date.now(),
text: event.delta
});
}
if (event.type === "response.done") {
fanOut({
type: "stream.completed",
requestId: event.response.id,
at: Date.now()
});
}
});Two implementation notes matter here:
- Audio bytes arrive Base64-encoded, so decode them with `Buffer.from(delta, "base64")` only at the point where you need playback or storage (see the sketch after this list).
- The Realtime guide states that the full `response.done` payload omits raw audio bytes, so you must capture audio from `response.output_audio.delta` as it arrives.
- The Realtime conversations guide caps a session at 60 minutes, so long-running voice apps need reconnect and session handoff logic.
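The capture path can stay tiny. A sketch, assuming per-request in-memory buffering (the `audioBuffers`, `onAudioDelta`, and `takeAudio` names are illustrative):

```typescript
import { Buffer } from "node:buffer";

// Keep Base64 chunks per request; decode only when the audio is
// actually needed for playback or storage.
const audioBuffers = new Map<string, string[]>();

function onAudioDelta(requestId: string, base64: string): void {
  const chunks = audioBuffers.get(requestId) ?? [];
  chunks.push(base64);
  audioBuffers.set(requestId, chunks);
}

function takeAudio(requestId: string): Buffer {
  // Decode each chunk separately, then join the raw PCM bytes.
  const pcm = (audioBuffers.get(requestId) ?? []).map((b64) => Buffer.from(b64, "base64"));
  audioBuffers.delete(requestId);
  return Buffer.concat(pcm);
}
```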
Verify the event flow
Step 5: Test lifecycle, ordering, and cleanup
You are done when the stream is correct under failure, not when the happy path prints tokens. Verification should focus on invariants that stay true across both APIs; a test sketch follows the checklist below.
- Assert that every stream emits one start event and exactly one terminal event.
- Assert that deltas can be replayed in order to reconstruct the final text or transcript.
- Assert that disconnects close upstream provider streams and release sockets.
- Assert that the UI disables input only for the active request ID, not globally.
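A minimal sketch of the lifecycle invariant, assuming a recorded array of `StreamEvent`s and plain `node:assert` (any test runner works):

```typescript
import assert from "node:assert";

// Every stream must emit exactly one start and exactly one terminal
// event; this invariant holds for both the Responses and Realtime paths.
function assertLifecycle(events: StreamEvent[], requestId: string): void {
  const mine = events.filter((e) => e.requestId === requestId);
  const starts = mine.filter((e) => e.type === "stream.started").length;
  const terminals = mine.filter(
    (e) => e.type === "stream.completed" || e.type === "stream.failed"
  ).length;
  assert.strictEqual(starts, 1, "expected exactly one stream.started");
  assert.strictEqual(terminals, 1, "expected exactly one terminal event");
}
```

At the wire level, a healthy SSE session for this contract looks like the transcript below.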
```text
event: stream.started
data: {"type":"stream.started","requestId":"req_123"}
event: text.delta
data: {"type":"text.delta","requestId":"req_123","sequence":1,"text":"Design "}
event: text.delta
data: {"type":"text.delta","requestId":"req_123","sequence":2,"text":"around events"}
event: stream.completed
data: {"type":"stream.completed","requestId":"req_123"}Expected behavior:
- The client renders partial text immediately after the first delta.
- The final UI state flips only after `stream.completed` or `stream.failed`.
- Audio playback continues from chunked deltas even if transcript deltas arrive on a different cadence.
- Logs store redacted payloads, not raw prompts or user speech; a minimal redaction pass is sketched below. If you need a quick redaction pass for debugging samples, the Data Masking Tool is a practical preprocessing step.
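One way to keep streamed content out of logs, as a sketch only: hash and measure content-bearing fields before storage (the `redactForLog` name and field choices are illustrative):

```typescript
import crypto from "node:crypto";

// Replace content-bearing fields with a short digest and a length so
// traces stay debuggable without retaining prompts, text, or audio.
function redactForLog(event: Record<string, unknown>): Record<string, unknown> {
  const copy: Record<string, unknown> = { ...event };
  for (const field of ["text", "base64"]) {
    const value = copy[field];
    if (typeof value === "string") {
      copy[field] = {
        sha256: crypto.createHash("sha256").update(value).digest("hex").slice(0, 12),
        length: value.length
      };
    }
  }
  return copy;
}
```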
Troubleshoot and what’s next
Troubleshooting: top 3
- Tokens arrive out of order. Do not concatenate by arrival time. Preserve `sequence_number` from the Responses stream when it exists, and otherwise add your own monotonic sequence at the adapter boundary.
- Audio never plays back after completion. In Realtime, raw audio is not included in `response.done`. Buffer or forward every `response.output_audio.delta` chunk as it arrives.
- Streams leak after users navigate away. Wire client disconnects to upstream cancellation and close `EventSource` or WebSocket objects explicitly. Leaked streams silently inflate cost and exhaust connection pools.
What’s next
- Add a resumable event log so clients can reconnect and replay from the last acknowledged sequence, as sketched after this list.
- Introduce backpressure metrics for slow mobile clients and oversized audio buffers.
- Split transport from semantics so SSE, WebSocket, and WebRTC all publish the same internal event envelope.
- Version your contract early. A simple `schemaVersion` field prevents painful client migrations later.
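Resumability can start small. A sketch, assuming an in-memory log keyed by request ID (the `eventLog`, `record`, and `replayFrom` names are illustrative; production would persist the log):

```typescript
// Append every event to a per-request log, then replay everything past
// the last sequence the client acknowledged when it reconnects.
const eventLog = new Map<string, StreamEvent[]>();

function record(event: StreamEvent): void {
  const log = eventLog.get(event.requestId) ?? [];
  log.push(event);
  eventLog.set(event.requestId, log);
}

function replayFrom(requestId: string, lastAcked: number): StreamEvent[] {
  return (eventLog.get(requestId) ?? []).filter(
    (e) => "sequence" in e && e.sequence > lastAcked
  );
}
```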
The core pattern is simple: provider events in, product events out. Once that boundary is explicit, real-time LLM streaming stops being a special feature and becomes a standard event pipeline your whole stack can reason about.
Frequently Asked Questions
How do I design a streaming API for LLM tokens without coupling the frontend to one provider?

Define a canonical internal envelope with a small set of lifecycle events such as `stream.started`, `text.delta`, `stream.completed`, and `stream.failed`. Keep provider event names like `response.output_text.delta` inside your backend adapter layer so the frontend only depends on product semantics.

Should I use SSE or WebSocket for real-time LLM streaming?

Use SSE for server-to-browser text streaming, WebSocket for server-to-server Realtime work, and WebRTC for browser or mobile voice clients. Whichever transport you pick, publish the same internal event envelope so clients stay transport-agnostic.

Why is my Realtime response.done event missing audio bytes?

Because raw audio is delivered incrementally through `response.output_audio.delta` events. The final `response.done` event is for response state and assembled metadata, not for replaying the full binary stream.

What should I log from streamed LLM responses in production?

Log lifecycle events, request IDs, sequence numbers, and error codes, and redact prompts, transcripts, and audio payloads before storage. That keeps traces useful for debugging ordering and failures without retaining sensitive user content.