RAG for Video [2026]: Multimodal Pipeline Tutorial
Bottom Line
In 2026, the reliable way to build video RAG is still to decompose video into text-bearing evidence: speech, visible text, scene descriptions, and timestamps. Better embeddings help, but the real recall gains come from chunk design, modality alignment, and keeping every hit tied to a precise time range.
Key Takeaways
- As of May 2, 2026, OpenAI embedding models are text-only, so video RAG starts with modality extraction.
- Use gpt-4o-transcribe-diarize for timestamped speech and gpt-4.1-mini for low-cost frame summaries.
- Build time-bounded evidence chunks that combine speech, visual state, and metadata before embedding.
- Start with text-embedding-3-large and a reduced dimension size for a solid recall-to-cost baseline.
Video RAG sounds like a multimodal indexing problem, but the practical architecture is still text-centric. As of May 02, 2026, OpenAI’s embedding models remain text-only, which means the winning pattern is to convert video into aligned evidence: speech, visible text, frame-level scene summaries, and timing metadata. In this tutorial, you’ll build that pipeline end to end, verify retrieval quality, and leave with an MVP you can harden for production.
- OpenAI embedding models are text-only, so raw video files are not the retrieval unit.
- Speech + frame summaries + timestamps usually beat transcript-only indexing on video search tasks.
- Chunk shape matters more than most teams expect; poor temporal grouping kills recall.
- Evidence metadata is mandatory if you want explainable answers and clip-level playback.
Prerequisites
You need a Python environment, ffmpeg on your path, an OPENAI_API_KEY, and a test video. Keep the first pass deliberately small: one five- to fifteen-minute recording is enough to validate the design.
- Python with openai and numpy installed.
- ffmpeg for audio extraction and frame sampling.
- A video that includes both spoken context and meaningful on-screen state.
- A place to store temporary artifacts like audio.wav and sampled frames.
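Before the first run, it helps to confirm the two external dependencies are actually wired up. A minimal preflight sketch, assuming the key is exported as OPENAI_API_KEY and ffmpeg is discoverable on PATH:

import os
import shutil

# Fail fast if either dependency is missing before spending any tokens.
assert shutil.which("ffmpeg") is not None, "ffmpeg not found on PATH"
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
print("Environment looks ready.")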
If your video includes customer names, account numbers, or support screenshots, scrub them before sharing debug artifacts with teammates. A lightweight option is TechBytes’ Data Masking Tool.
Bottom Line
Do not try to retrieve over raw video binaries. Convert video into timestamped, text-rich evidence chunks, then embed those chunks with text-embedding-3-large.
Step 1: Extract modalities
The first step is to split the video into artifacts that models can process cheaply and predictably. For a baseline, sample frames at 1 fps and extract a single audio track. That gives you enough coverage for meetings, walkthroughs, support recordings, and most UI demos.
Extract audio and frames
mkdir -p frames
ffmpeg -i demo.mp4 -map 0:a? -vn audio.wav
ffmpeg -i demo.mp4 -vf "fps=1,scale=960:-1" frames/frame_%05d.jpg

The -map 0:a? form keeps the command resilient when a video has no audio stream. The -vf alias applies a simple filter graph, and fps=1 gives you one frame per second. Increase that later for fast-changing interfaces.
Transcribe audio and summarize frames
Use gpt-4o-transcribe-diarize for timestamped speech, then run a low-detail vision pass over every fifth frame. The goal is not literary description. You want retrieval evidence: visible text, scene state, UI transitions, and notable actions.
from openai import OpenAI
import base64, glob, os

client = OpenAI()

def as_data_url(path):
    # Inline the frame as a base64 data URL so it can be sent to the vision model.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/jpeg;base64,{b64}"

# Timestamped, diarized transcript; Step 2 consumes transcript.segments.
with open("audio.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe-diarize",
        file=audio_file,
    )

# Caption every fifth sampled frame (about one every five seconds at 1 fps).
frame_notes = []
for path in sorted(glob.glob("frames/*.jpg"))[::5]:
    resp = client.responses.create(
        model="gpt-4.1-mini",
        input=[{
            "role": "user",
            "content": [
                {"type": "input_text", "text": "Describe the scene, visible text, UI state, and notable actions in one tight paragraph."},
                {"type": "input_image", "image_url": as_data_url(path), "detail": "low"}
            ]
        }],
    )
    frame_notes.append({"frame": os.path.basename(path), "caption": resp.output_text})

This is the architectural shift many teams miss: your retrieval corpus is not the original video. It is the normalized output of multiple passes over the video.
Step 2: Build retrieval units
Now merge modalities into chunks a retriever can rank. A good first chunk is a short, time-bounded evidence unit built around a transcript segment, plus nearby visual context. Keep each chunk self-contained enough that a reranker or answer generator can cite it without reopening the source video.
- Keep start and end timestamps on every chunk.
- Include speech, visible text, and visual state in one normalized text field.
- Store raw metadata separately for debugging and playback links.
- Prefer chunks of roughly 10-30 seconds of meaning, not arbitrary file boundaries.
import re

def frame_second(name):
    # frame_00001.jpg corresponds to t=0 s because ffmpeg numbers frames from 1.
    idx = int(re.search(r"(\d+)", name).group(1))
    return idx - 1

docs = []
for seg in transcript.segments:
    # Pull captions from frames within ~2 seconds of the spoken segment.
    nearby = [
        note["caption"] for note in frame_notes
        if seg.start - 2 <= frame_second(note["frame"]) <= seg.end + 2
    ]
    text = (
        f"time={seg.start:.1f}-{seg.end:.1f}\n"
        f"speaker={getattr(seg, 'speaker', 'unknown')}\n"
        f"speech={seg.text}\n"
        f"visual={' | '.join(nearby[:2])}"
    )
    docs.append({
        "id": len(docs),
        "start": seg.start,
        "end": seg.end,
        "text": text,
    })

If your corpus is mostly screen recordings, visual captions often carry the key answer. Error banners, disabled buttons, tab names, and form states may never be spoken aloud.
Step 3: Embed and index
Once each chunk is normalized into text, embed it with text-embedding-3-large. OpenAI’s docs also confirm that text-embedding-3 models support a dimensions parameter, which is useful when you want a cheaper baseline without changing the overall architecture.
import numpy as np

embed_resp = client.embeddings.create(
    model="text-embedding-3-large",
    dimensions=1024,
    input=[d["text"] for d in docs],
)

# Normalize rows so a plain dot product equals cosine similarity.
matrix = np.array([row.embedding for row in embed_resp.data], dtype="float32")
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

def search(query, k=5):
    q = client.embeddings.create(
        model="text-embedding-3-large",
        dimensions=1024,
        input=query,
    ).data[0].embedding
    q = np.array(q, dtype="float32")
    q /= np.linalg.norm(q)
    scores = matrix @ q
    top = scores.argsort()[-k:][::-1]
    return [{"score": float(scores[i]), **docs[i]} for i in top]

This local cosine index is enough for validation. Once the shape is working, you can swap in a managed vector system. If you use OpenAI vector stores later, the current auto chunker uses 800-token chunks with 400 tokens of overlap, which is a decent default for text-heavy corpora but not a replacement for your own video-aware chunking strategy.
Verification and expected output
Before you wire an answer generator on top, test retrieval directly. Ask for moments you already know exist in the video. Strong video RAG should return timestamped hits that combine what was said with what was visible.
- Query for spoken facts: product names, decisions, bug numbers, action items.
- Query for visual-only facts: button labels, dialog titles, error toasts, menu states.
- Query for mixed facts: “where does the speaker explain the failed payment while the form shows the validation error?”
for hit in search("Show me where the app fails payment validation"):
    print(f"{hit['score']:.3f} {hit['start']:.1f}-{hit['end']:.1f} {hit['text'][:160]}")

Expected output should look like this:
0.842 132.0-145.7 time=132.0-145.7 speaker=Agent speech=The card form rejects the ZIP code... visual=Checkout page with red validation banner and disabled Submit button
0.801 146.1-154.9 time=146.1-154.9 speaker=Agent speech=We can reproduce it on guest checkout... visual=Billing modal open with highlighted postal code field

If the top hits are semantically close but visually wrong, increase frame density around scene changes. If they are visually right but temporally noisy, tighten chunk boundaries.
Troubleshooting and what's next
Troubleshooting top 3
- Visual-only moments are missing. Increase sampling from 1 fps to 2-4 fps for UI demos, or resample only around slide changes and scene cuts.
- Transcript dominates retrieval. Keep speech and visual evidence both in the chunk text, and do not let long monologues swallow short but important visual states.
- Transcription requests fail on long files. OpenAI’s speech-to-text uploads are currently limited to 25 MB, so split or compress long audio before sending it; one approach is sketched after this list.
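One way to stay under that limit is to compress and segment the audio before transcription, then offset each piece's timestamps when you merge results. A minimal sketch, assuming ffmpeg's segment muxer, mono 64 kbps MP3 output, and ten-minute pieces (all of those numbers are illustrative):

import glob
import subprocess

# Re-encode to mono 64 kbps MP3 and cut into 10-minute pieces; each piece lands
# around 5 MB, comfortably under the 25 MB upload limit.
subprocess.run([
    "ffmpeg", "-i", "audio.wav",
    "-ac", "1", "-b:a", "64k",
    "-f", "segment", "-segment_time", "600",
    "audio_%03d.mp3",
], check=True)

for index, part in enumerate(sorted(glob.glob("audio_*.mp3"))):
    # Transcribe each piece as in Step 1, then add index * 600 seconds to its
    # segment timestamps before building retrieval chunks.
    ...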
What's next
- Add a reranking pass that scores the top 20-50 hits using the original query plus full chunk text; a minimal version is sketched after this list.
- Introduce scene-cut detection so frame sampling gets denser only when visuals actually change.
- Store playback URLs like video.mp4?t=132 beside every chunk so answers can deep-link to evidence.
- Run an eval set with known timestamped answers before you optimize cost, batch size, or storage backend.
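Reranking does not need new infrastructure to prototype. A minimal sketch, assuming gpt-4.1-mini as the scorer and the search function from Step 3; the prompt and the 0-10 scale are illustrative choices, not a fixed recipe:

def rerank(query, k=20, keep=5):
    scored = []
    for hit in search(query, k=k):
        # Ask the model for a single relevance number per candidate chunk.
        resp = client.responses.create(
            model="gpt-4.1-mini",
            input=(
                "Rate from 0 to 10 how well this evidence answers the query. "
                "Reply with only the number.\n"
                f"Query: {query}\nEvidence:\n{hit['text']}"
            ),
        )
        try:
            score = float(resp.output_text.strip())
        except ValueError:
            score = 0.0
        scored.append({**hit, "rerank_score": score})
    return sorted(scored, key=lambda h: h["rerank_score"], reverse=True)[:keep]

Because every chunk keeps its start and end timestamps, reranked hits still map directly to the same playback links and evaluation checks.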
The core lesson is simple: better embeddings help, but they do not remove the need for a disciplined multimodal ingestion layer. If you make the evidence units clean, timestamped, and modality-aware, the retrieval stack becomes much easier to trust.
Frequently Asked Questions

Can I embed an MP4 file directly for video RAG?
No. As of May 2026, OpenAI embedding models accept text only, so the video has to be converted into text-bearing evidence (speech, visible text, frame summaries) before anything can be embedded.

Should I create one embedding per video frame?
Usually not. The pattern in this tutorial captions sampled frames and merges nearby captions into time-bounded chunks built around transcript segments, rather than embedding every frame on its own.

What chunk size works best for video retrieval?
Roughly 10-30 seconds of meaning, with start/end timestamps on every chunk. The right size depends on pacing: meetings tolerate longer chunks, while product demos and screen recordings usually need tighter windows.

Do I need diarization for a video RAG pipeline?
It is not strictly required, but this tutorial uses gpt-4o-transcribe-diarize so each chunk carries a speaker label, which pays off for meetings and support recordings where who said something is part of the answer.