RAG for Video [2026]: Multimodal Pipeline Tutorial
Bottom Line
In 2026, the reliable way to build video RAG is still to decompose video into text-bearing evidence: speech, visible text, scene descriptions, and timestamps. Better embeddings help, but the real recall gains come from chunk design, modality alignment, and keeping every hit tied to a precise time range.
Key Takeaways
- As of May 2, 2026, OpenAI embedding models are text-only, so video RAG starts with modality extraction.
- Use gpt-4o-transcribe-diarize for timestamped speech and gpt-4.1-mini for low-cost frame summaries.
- Build time-bounded evidence chunks that combine speech, visual state, and metadata before embedding.
- Start with text-embedding-3-large and a reduced dimension size for a solid recall-to-cost baseline.
Video RAG sounds like a multimodal indexing problem, but the practical architecture is still text-centric. As of May 02, 2026, OpenAI’s embedding models remain text-only, which means the winning pattern is to convert video into aligned evidence: speech, visible text, frame-level scene summaries, and timing metadata. In this tutorial, you’ll build that pipeline end to end, verify retrieval quality, and leave with an MVP you can harden for production.
- OpenAI embedding models are text-only, so raw video files are not the retrieval unit.
- Speech + frame summaries + timestamps usually beat transcript-only indexing on video search tasks.
- Chunk shape matters more than most teams expect; poor temporal grouping kills recall.
- Evidence metadata is mandatory if you want explainable answers and clip-level playback.
Prerequisites
You need a Python environment, ffmpeg on your path, an OPENAI_API_KEY, and a test video. Keep the first pass deliberately small: one five- to fifteen-minute recording is enough to validate the design.
- Python with openai and numpy installed.
- ffmpeg for audio extraction and frame sampling.
- A video that includes both spoken context and meaningful on-screen state.
- A place to store temporary artifacts like audio.wav and sampled frames.
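Before the first run, it helps to confirm the two external dependencies are actually wired up. A minimal preflight sketch, assuming the key is exported as OPENAI_API_KEY and ffmpeg is discoverable on PATH:

import os
import shutil

# Fail fast if either dependency is missing before spending any tokens.
assert shutil.which("ffmpeg") is not None, "ffmpeg not found on PATH"
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
print("Environment looks ready.")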
If your video includes customer names, account numbers, or support screenshots, scrub them before sharing debug artifacts with teammates. A lightweight option is TechBytes’ Data Masking Tool.
Bottom Line
Do not try to retrieve over raw video binaries. Convert video into timestamped, text-rich evidence chunks, then embed those chunks with text-embedding-3-large.
Step 1: Extract modalities
The first step is to split the video into artifacts that models can process cheaply and predictably. For a baseline, sample frames at 1 fps and extract a single audio track. That gives you enough coverage for meetings, walkthroughs, support recordings, and most UI demos.
Extract audio and frames
mkdir -p frames
ffmpeg -i demo.mp4 -map 0:a? -vn audio.wav
ffmpeg -i demo.mp4 -vf "fps=1,scale=960:-1" frames/frame_%05d.jpg

The -map 0:a? form keeps the command resilient when a video has no audio stream. The -vf alias applies a simple filter graph, and fps=1 gives you one frame per second. Increase that later for fast-changing interfaces.
Transcribe audio and summarize frames
Use gpt-4o-transcribe-diarize for timestamped speech, then run a low-detail vision pass over every fifth frame. The goal is not literary description. You want retrieval evidence: visible text, scene state, UI transitions, and notable actions.
from openai import OpenAI
import base64, glob, os

client = OpenAI()

def as_data_url(path):
    # Inline the frame as a base64 data URL so it can be sent to the vision model.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/jpeg;base64,{b64}"

# Timestamped, diarized transcript; Step 2 consumes transcript.segments.
with open("audio.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe-diarize",
        file=audio_file,
    )

# Caption every fifth sampled frame (about one every five seconds at 1 fps).
frame_notes = []
for path in sorted(glob.glob("frames/*.jpg"))[::5]:
    resp = client.responses.create(
        model="gpt-4.1-mini",
        input=[{
            "role": "user",
            "content": [
                {"type": "input_text", "text": "Describe the scene, visible text, UI state, and notable actions in one tight paragraph."},
                {"type": "input_image", "image_url": as_data_url(path), "detail": "low"}
            ]
        }],
    )
    frame_notes.append({"frame": os.path.basename(path), "caption": resp.output_text})

This is the architectural shift many teams miss: your retrieval corpus is not the original video. It is the normalized output of multiple passes over the video.
Step 2: Build retrieval units
Now merge modalities into chunks a retriever can rank. A good first chunk is a short, time-bounded evidence unit built around a transcript segment, plus nearby visual context. Keep each chunk self-contained enough that a reranker or answer generator can cite it without reopening the source video.
- Keep start and end timestamps on every chunk.
- Include speech, visible text, and visual state in one normalized text field.
- Store raw metadata separately for debugging and playback links.
- Prefer chunks of roughly 10-30 seconds of meaning, not arbitrary file boundaries.
import re

def frame_second(name):
    # frame_00001.jpg corresponds to t=0 s because ffmpeg numbers frames from 1.
    idx = int(re.search(r"(\d+)", name).group(1))
    return idx - 1

docs = []
for seg in transcript.segments:
    # Pull captions from frames within ~2 seconds of the spoken segment.
    nearby = [
        note["caption"] for note in frame_notes
        if seg.start - 2 <= frame_second(note["frame"]) <= seg.end + 2
    ]
    text = (
        f"time={seg.start:.1f}-{seg.end:.1f}\n"
        f"speaker={getattr(seg, 'speaker', 'unknown')}\n"
        f"speech={seg.text}\n"
        f"visual={' | '.join(nearby[:2])}"
    )
    docs.append({
        "id": len(docs),
        "start": seg.start,
        "end": seg.end,
        "text": text,
    })

If your corpus is mostly screen recordings, visual captions often carry the key answer. Error banners, disabled buttons, tab names, and form states may never be spoken aloud.
Step 3: Embed and index
Once each chunk is normalized into text, embed it with text-embedding-3-large. OpenAI’s docs also confirm that text-embedding-3 models support a dimensions parameter, which is useful when you want a cheaper baseline without changing the overall architecture.
import numpy as np

embed_resp = client.embeddings.create(
    model="text-embedding-3-large",
    dimensions=1024,
    input=[d["text"] for d in docs],
)

# Normalize rows so a plain dot product equals cosine similarity.
matrix = np.array([row.embedding for row in embed_resp.data], dtype="float32")
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

def search(query, k=5):
    q = client.embeddings.create(
        model="text-embedding-3-large",
        dimensions=1024,
        input=query,
    ).data[0].embedding
    q = np.array(q, dtype="float32")
    q /= np.linalg.norm(q)
    scores = matrix @ q
    top = scores.argsort()[-k:][::-1]
    return [{"score": float(scores[i]), **docs[i]} for i in top]

This local cosine index is enough for validation. Once the shape is working, you can swap in a managed vector system. If you use OpenAI vector stores later, the current auto chunker uses 800-token chunks with 400 tokens of overlap, which is a decent default for text-heavy corpora but not a replacement for your own video-aware chunking strategy.
Verification and expected output
Before you wire an answer generator on top, test retrieval directly. Ask for moments you already know exist in the video. Strong video RAG should return timestamped hits that combine what was said with what was visible.
- Query for spoken facts: product names, decisions, bug numbers, action items.
- Query for visual-only facts: button labels, dialog titles, error toasts, menu states.
- Query for mixed facts: “where does the speaker explain the failed payment while the form shows the validation error?”
for hit in search("Show me where the app fails payment validation"):
    print(f"{hit['score']:.3f} {hit['start']:.1f}-{hit['end']:.1f} {hit['text'][:160]}")

Expected output should look like this:
0.842 132.0-145.7 time=132.0-145.7 speaker=Agent speech=The card form rejects the ZIP code... visual=Checkout page with red validation banner and disabled Submit button
0.801 146.1-154.9 time=146.1-154.9 speaker=Agent speech=We can reproduce it on guest checkout... visual=Billing modal open with highlighted postal code field

If the top hits are semantically close but visually wrong, increase frame density around scene changes. If they are visually right but temporally noisy, tighten chunk boundaries.
Troubleshooting and what's next
Troubleshooting top 3
- Visual-only moments are missing. Increase sampling from 1 fps to 2-4 fps for UI demos, or resample only around slide changes and scene cuts.
- Transcript dominates retrieval. Keep speech and visual evidence both in the chunk text, and do not let long monologues swallow short but important visual states.
- Transcription requests fail on long files. OpenAI’s speech-to-text uploads are currently limited to 25 MB, so split or compress long audio before sending it; one approach is sketched after this list.
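One way to stay under that limit is to compress and segment the audio before transcription, then offset each piece's timestamps when you merge results. A minimal sketch, assuming ffmpeg's segment muxer, mono 64 kbps MP3 output, and ten-minute pieces (all of those numbers are illustrative):

import glob
import subprocess

# Re-encode to mono 64 kbps MP3 and cut into 10-minute pieces; each piece lands
# around 5 MB, comfortably under the 25 MB upload limit.
subprocess.run([
    "ffmpeg", "-i", "audio.wav",
    "-ac", "1", "-b:a", "64k",
    "-f", "segment", "-segment_time", "600",
    "audio_%03d.mp3",
], check=True)

for index, part in enumerate(sorted(glob.glob("audio_*.mp3"))):
    # Transcribe each piece as in Step 1, then add index * 600 seconds to its
    # segment timestamps before building retrieval chunks.
    ...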
What's next
- Add a reranking pass that scores the top 20-50 hits using the original query plus full chunk text; a minimal version is sketched after this list.
- Introduce scene-cut detection so frame sampling gets denser only when visuals actually change.
- Store playback URLs like video.mp4?t=132 beside every chunk so answers can deep-link to evidence.
- Run an eval set with known timestamped answers before you optimize cost, batch size, or storage backend.
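Reranking does not need new infrastructure to prototype. A minimal sketch, assuming gpt-4.1-mini as the scorer and the search function from Step 3; the prompt and the 0-10 scale are illustrative choices, not a fixed recipe:

def rerank(query, k=20, keep=5):
    scored = []
    for hit in search(query, k=k):
        # Ask the model for a single relevance number per candidate chunk.
        resp = client.responses.create(
            model="gpt-4.1-mini",
            input=(
                "Rate from 0 to 10 how well this evidence answers the query. "
                "Reply with only the number.\n"
                f"Query: {query}\nEvidence:\n{hit['text']}"
            ),
        )
        try:
            score = float(resp.output_text.strip())
        except ValueError:
            score = 0.0
        scored.append({**hit, "rerank_score": score})
    return sorted(scored, key=lambda h: h["rerank_score"], reverse=True)[:keep]

Because every chunk keeps its start and end timestamps, reranked hits still map directly to the same playback links and evaluation checks.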
The core lesson is simple: better embeddings help, but they do not remove the need for a disciplined multimodal ingestion layer. If you make the evidence units clean, timestamped, and modality-aware, the retrieval stack becomes much easier to trust.
Frequently Asked Questions

Can I embed an MP4 file directly for video RAG?
No. As of May 2026, OpenAI embedding models accept text only, so the video has to be converted into text-bearing evidence (speech, visible text, frame summaries) before anything can be embedded.

Should I create one embedding per video frame?
Usually not. The pattern in this tutorial captions sampled frames and merges nearby captions into time-bounded chunks built around transcript segments, rather than embedding every frame on its own.

What chunk size works best for video retrieval?
Roughly 10-30 seconds of meaning, with start/end timestamps on every chunk. The right size depends on pacing: meetings tolerate longer chunks, while product demos and screen recordings usually need tighter windows.

Do I need diarization for a video RAG pipeline?
It is not strictly required, but this tutorial uses gpt-4o-transcribe-diarize so each chunk carries a speaker label, which pays off for meetings and support recordings where who said something is part of the answer.