Semantic Reranking for RAG Docs [Deep Dive Guide]
Bottom Line
High-accuracy documentation RAG usually comes from a two-stage pipeline: retrieve fast with embeddings, then rerank the top candidates with a cross-encoder. The win is precision, not raw recall, so chunk quality and candidate depth still decide whether reranking can help.
Key Takeaways
- Use a bi-encoder for fast top-K retrieval, then a cross-encoder for final ranking.
- For technical docs, rerank only the best 20-50 chunks to control latency.
- Prepend titles or paths to chunks so the reranker sees local document context.
- Measure MRR or hit rate before and after reranking instead of trusting anecdotal wins.
If your documentation RAG system retrieves vaguely related pages instead of the exact migration note, config flag, or API behavior a user asked for, the weak point is usually ranking. Semantic reranking fixes that by adding a second pass: first retrieve candidates quickly with embeddings, then score each query-document pair jointly with a CrossEncoder. For technical docs, that usually means better precision on edge-case queries without rebuilding your whole stack.
Prerequisites
This tutorial uses the documented retrieve-and-rerank pattern from Sentence Transformers. The code stays intentionally small so you can drop it into an existing RAG pipeline.
- Python 3 environment with package install access.
- A folder of Markdown or text-based technical docs.
- A first-stage embedding model such as sentence-transformers/all-MiniLM-L6-v2.
- A reranker model such as cross-encoder/ms-marco-MiniLM-L6-v2.
- A baseline query set so you can compare ranking quality before and after the change.
```shell
pip install -U sentence-transformers numpy
```

If your docs contain credentials, customer IDs, or production snippets, redact them before building an eval set. For quick cleanup of pasted samples, the Data Masking Tool is a practical pre-processing step.
Bottom Line
Use embeddings to find likely matches fast, then rerank only that short list with a cross-encoder. In documentation RAG, that usually produces more exact answers because the reranker evaluates the full query against the full chunk instead of comparing two independent vectors.
1. Prepare doc chunks the reranker can judge
Reranking cannot recover chunks that never make it into the candidate set. For technical documentation, chunking quality is the first lever.
- Split by headings before splitting by size.
- Keep each chunk focused on one procedure, flag, endpoint, or failure mode.
- Include document metadata in the text passed to retrieval and reranking.
- Avoid mixing unrelated commands into the same chunk.
A simple loader for Markdown docs:
```python
from pathlib import Path
import re

def load_chunks(doc_dir: str, min_chars: int = 120, max_chars: int = 1200):
    chunks = []
    for path in Path(doc_dir).glob('**/*.md'):
        raw = path.read_text(encoding='utf-8')
        # Split on first- and second-level headings before splitting by size.
        blocks = re.split(r'\n(?=##?\s)', raw)
        for i, block in enumerate(blocks):
            block = block.strip()
            if len(block) < min_chars:
                continue
            block = block[:max_chars]
            chunks.append({
                'id': f'{path.stem}:{i}',
                'path': str(path),
                'text': f'FILE: {path.name}\n{block}'
            })
    return chunks

chunks = load_chunks('docs')
print(f'Loaded {len(chunks)} chunks')
```

That FILE: prefix looks minor, but it often helps on technical corpora because chunks from auth.md and cli.md can share vocabulary while meaning very different things. If you plan to publish or review the sample script internally, run it through the Code Formatter first so teammates do not waste time on style cleanup.
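One caveat in the loader above: slicing at max_chars silently drops the tail of long sections. If that matters for your corpus, an alternative is to window oversized blocks instead of truncating them. This `window` helper is a hypothetical sketch using plain character offsets, not part of any library API:

```python
def window(text: str, max_chars: int = 1200, overlap: int = 200):
    """Split oversized text into overlapping character windows
    so no tail content is lost to truncation."""
    if len(text) <= max_chars:
        return [text]
    step = max_chars - overlap
    return [text[i:i + max_chars] for i in range(0, len(text), step)]
```

Each window repeats `overlap` characters from the previous one, so a short procedure that straddles a boundary still appears whole in at least one chunk.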
2. Retrieve candidates with a bi-encoder
The first stage is about recall and speed. According to the official SentenceTransformer usage docs, bi-encoders are commonly used as the first step in a two-stage retrieval process because embedding generation and similarity search are efficient.
Use encode_document for your corpus, encode_query for the query, and semantic_search to fetch the top candidates.
```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

corpus_texts = [chunk['text'] for chunk in chunks]
corpus_embeddings = embedder.encode_document(
    corpus_texts,
    convert_to_tensor=True,
    normalize_embeddings=True
)

query = 'How do I replace deprecated CLI auth flags with the API key header?'
query_embedding = embedder.encode_query(
    query,
    convert_to_tensor=True,
    normalize_embeddings=True
)

hits = util.semantic_search(
    query_embedding,
    corpus_embeddings,
    top_k=8
)[0]

for rank, hit in enumerate(hits, start=1):
    chunk = chunks[hit['corpus_id']]
    print(rank, round(hit['score'], 4), chunk['path'], chunk['id'])
```

At this point, do not worry if the ranking looks only “pretty good.” The goal of stage one is to ensure the right chunk appears somewhere in the top K. For documentation RAG, K=20 to K=50 is a common operating range when you plan to rerank only a short list.
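A quick way to confirm stage one is doing its job is to check whether the known-correct chunk id appears anywhere in the candidate list at all. This `hit_at_k` helper is a hypothetical convenience function, not part of the Sentence Transformers API; it assumes the `hits` and `chunks` structures from the code above:

```python
def hit_at_k(hits, chunks, expected_id, k=8):
    """True if the expected chunk id appears in the top-k
    results returned by util.semantic_search."""
    top_ids = [chunks[hit['corpus_id']]['id'] for hit in hits[:k]]
    return expected_id in top_ids
```

If this returns False at K=8 but True at K=50, raise the retrieval top_k before blaming the reranker.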
3. Rerank the candidate list with a CrossEncoder
This is the accuracy pass. The official CrossEncoder docs note that it processes both texts jointly and is more accurate for pairwise tasks like reranking, but cannot precompute embeddings for individual documents. That is exactly why it belongs in stage two, not stage one.
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L6-v2')

candidate_chunks = [chunks[hit['corpus_id']] for hit in hits]
pairs = [(query, chunk['text']) for chunk in candidate_chunks]
rerank_scores = reranker.predict(pairs)

reranked = sorted(
    zip(candidate_chunks, rerank_scores),
    key=lambda item: item[1],
    reverse=True
)

print('\nRERANKED RESULTS')
for rank, (chunk, score) in enumerate(reranked[:5], start=1):
    print(rank, round(float(score), 4), chunk['path'], chunk['id'])
```

In a live RAG system, the final answer stage should read from the reranked list, not the raw embedding list. The clean pattern is:
- Retrieve top K chunks with the bi-encoder.
- Score those K query-document pairs with the cross-encoder.
- Return the top N reranked chunks to the generator.
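The three steps above can be sketched as a single function. To keep it framework-agnostic, the retrieval and scoring steps are injected as callables; `search_fn` and `score_fn` are hypothetical names, where in this tutorial's code `search_fn` would wrap `util.semantic_search` and `score_fn` would be `reranker.predict`:

```python
def retrieve_and_rerank(query, chunks, search_fn, score_fn, top_k=30, top_n=5):
    """Two-stage retrieval: search_fn(query, top_k) returns candidate
    chunk indices; score_fn(pairs) returns one relevance score per pair."""
    candidate_ids = search_fn(query, top_k)
    pairs = [(query, chunks[i]['text']) for i in candidate_ids]
    scores = score_fn(pairs)
    # Sort candidates by cross-encoder score, highest first.
    ranked = sorted(zip(candidate_ids, scores), key=lambda t: t[1], reverse=True)
    return [chunks[i] for i, _ in ranked[:top_n]]
```

Because both stages hide behind callables, you can later swap the in-memory search for FAISS or pgvector without touching the rerank logic.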
Verification and expected output
Run the script with a small query set where you already know the correct document or section. Your goal is not “higher scores.” Your goal is that the exact chunk moves higher in the list.
```shell
python rerank_demo.py
```

Example console output:

```
Top retrieved candidates
1 0.7124 docs/auth_overview.md auth_overview:2
2 0.7061 docs/cli_migration.md cli_migration:1
3 0.6977 docs/api_headers.md api_headers:4
4 0.6912 docs/getting_started.md getting_started:3

RERANKED RESULTS
1 8.8421 docs/api_headers.md api_headers:4
2 7.9365 docs/cli_migration.md cli_migration:1
3 4.2874 docs/auth_overview.md auth_overview:2
4 1.9038 docs/getting_started.md getting_started:3
```

Exact numbers will vary, but the pattern should be stable: chunks that discuss the exact migration or API behavior should climb above broader overview pages. For a real evaluation, label a small set of queries and compare MRR, Hit Rate, or NDCG before and after reranking.
```python
def reciprocal_rank(results, expected_id):
    """Return 1/rank of the expected chunk in a ranked list
    of chunk dicts, or 0.0 if it never appears."""
    for i, item in enumerate(results, start=1):
        if item['id'] == expected_id:
            return 1 / i
    return 0.0
```

Troubleshooting and What’s next
Top 3 issues
- The reranker barely helps. Usually the correct chunk never entered the candidate set. Increase retrieval top_k, improve chunk boundaries, or add better titles and paths into chunk text.
- Latency jumps too much. That is normal because a cross-encoder scores every query-document pair directly. Lower rerank depth, batch predictions, or use a smaller reranker before moving to heavier models.
- Generic overview pages still win. Your chunks are probably too broad. Split long pages by heading and keep procedural content separate from conceptual overviews.
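On the latency bullet: `CrossEncoder.predict` already accepts a `batch_size` argument, but if you score through a custom wrapper or a remote scoring service, manual batching is simple to add. A minimal sketch, where `score_fn` is an assumed stand-in for whatever scorer you call:

```python
def predict_in_batches(score_fn, pairs, batch_size=32):
    """Score query-document pairs in fixed-size batches to bound
    per-call memory while preserving result order."""
    scores = []
    for start in range(0, len(pairs), batch_size):
        scores.extend(score_fn(pairs[start:start + batch_size]))
    return scores
```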
What’s next
- Swap the in-memory search for your production retriever such as FAISS, OpenSearch, or pgvector, but keep the rerank stage identical.
- Add offline evaluation with 25-100 labeled developer queries so ranking changes are measurable.
- Cache corpus embeddings and rebuild them only when the docs change.
- Review the official Retrieve & Re-Rank guide and the published CrossEncoder model card before tuning models further.
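For the offline-evaluation item above, aggregating reciprocal ranks over a labeled query set is enough to make before/after comparisons concrete. A minimal sketch, assuming each run is a (ranked chunk ids, expected id) pair:

```python
def mean_reciprocal_rank(runs):
    """runs: iterable of (ranked_ids, expected_id) pairs, where
    ranked_ids is the ordered list a pipeline returned for one query."""
    total = 0.0
    count = 0
    for ranked_ids, expected_id in runs:
        count += 1
        for i, chunk_id in enumerate(ranked_ids, start=1):
            if chunk_id == expected_id:
                total += 1 / i
                break
    return total / count if count else 0.0
```

Compute it once on the raw bi-encoder rankings and once on the reranked rankings; the extra rerank latency is only worth paying if this number moves.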
The practical takeaway is straightforward: if your RAG system already retrieves roughly relevant documentation, semantic reranking is often the shortest path to noticeably better answer grounding. It adds one extra model call over a small candidate set, and for technical documentation that trade usually favors accuracy.
Frequently Asked Questions
What is semantic reranking in RAG?
How many documents should I rerank in a documentation RAG pipeline?
Why does reranking help technical documentation more than generic content?
Technical docs reuse the same vocabulary across unrelated pages: auth, token, config, and migration. A reranker evaluates the full pair jointly, so it is better at distinguishing the exact procedure or API behavior from a broad overview page.
Can I use semantic reranking with FAISS, OpenSearch, or pgvector?