Semantic Reranking for RAG Docs [Deep Dive Guide]
Bottom Line
High-accuracy documentation RAG usually comes from a two-stage pipeline: retrieve fast with embeddings, then rerank the top candidates with a cross-encoder. The win is precision, not raw recall, so chunk quality and candidate depth still decide whether reranking can help.
Key Takeaways
- Use a bi-encoder for fast top-K retrieval, then a cross-encoder for final ranking.
- For technical docs, rerank only the best 20-50 chunks to control latency.
- Prepend titles or paths to chunks so the reranker sees local document context.
- Measure MRR or hit rate before and after reranking instead of trusting anecdotal wins.
If your documentation RAG system retrieves vaguely related pages instead of the exact migration note, config flag, or API behavior a user asked for, the weak point is usually ranking. Semantic reranking fixes that by adding a second pass: first retrieve candidates quickly with embeddings, then score each query-document pair jointly with a CrossEncoder. For technical docs, that usually means better precision on edge-case queries without rebuilding your whole stack.
Prerequisites
This tutorial uses the documented retrieve-and-rerank pattern from Sentence Transformers. The code stays intentionally small so you can drop it into an existing RAG pipeline.
- Python 3 environment with package install access.
- A folder of Markdown or text-based technical docs.
- A first-stage embedding model such as sentence-transformers/all-MiniLM-L6-v2.
- A reranker model such as cross-encoder/ms-marco-MiniLM-L6-v2.
- A baseline query set so you can compare ranking quality before and after the change.
```shell
pip install -U sentence-transformers numpy
```

If your docs contain credentials, customer IDs, or production snippets, redact them before building an eval set. For quick cleanup of pasted samples, the Data Masking Tool is a practical pre-processing step.
Bottom Line
Use embeddings to find likely matches fast, then rerank only that short list with a cross-encoder. In documentation RAG, that usually produces more exact answers because the reranker evaluates the full query against the full chunk instead of comparing two independent vectors.
1. Prepare doc chunks the reranker can judge
Reranking cannot recover chunks that never make it into the candidate set. For technical documentation, chunking quality is the first lever.
- Split by headings before splitting by size.
- Keep each chunk focused on one procedure, flag, endpoint, or failure mode.
- Include document metadata in the text passed to retrieval and reranking.
- Avoid mixing unrelated commands into the same chunk.
A simple loader for Markdown docs:
```python
from pathlib import Path
import re

def load_chunks(doc_dir: str, min_chars: int = 120, max_chars: int = 1200):
    chunks = []
    for path in Path(doc_dir).glob('**/*.md'):
        raw = path.read_text(encoding='utf-8')
        # Split on first- and second-level headings before splitting by size.
        blocks = re.split(r'\n(?=##?\s)', raw)
        for i, block in enumerate(blocks):
            block = block.strip()
            if len(block) < min_chars:
                continue
            block = block[:max_chars]
            chunks.append({
                'id': f'{path.stem}:{i}',
                'path': str(path),
                'text': f'FILE: {path.name}\n{block}'
            })
    return chunks

chunks = load_chunks('docs')
print(f'Loaded {len(chunks)} chunks')
```

That FILE: prefix looks minor, but it often helps on technical corpora because chunks from auth.md and cli.md can share vocabulary while meaning very different things. If you plan to publish or review the sample script internally, run it through the Code Formatter first so teammates do not waste time on style cleanup.
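One caveat in the loader above: slicing at max_chars silently drops the tail of long sections. If that matters for your corpus, an alternative is to window oversized blocks instead of truncating them. This `window` helper is a hypothetical sketch using plain character offsets, not part of any library API:

```python
def window(text: str, max_chars: int = 1200, overlap: int = 200):
    """Split oversized text into overlapping character windows
    so no tail content is lost to truncation."""
    if len(text) <= max_chars:
        return [text]
    step = max_chars - overlap
    return [text[i:i + max_chars] for i in range(0, len(text), step)]
```

Each window repeats `overlap` characters from the previous one, so a short procedure that straddles a boundary still appears whole in at least one chunk.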
2. Retrieve candidates with a bi-encoder
The first stage is about recall and speed. According to the official SentenceTransformer usage docs, bi-encoders are commonly used as the first step in a two-stage retrieval process because embedding generation and similarity search are efficient.
Use encode_document for your corpus, encode_query for the query, and semantic_search to fetch the top candidates.
```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

corpus_texts = [chunk['text'] for chunk in chunks]
corpus_embeddings = embedder.encode_document(
    corpus_texts,
    convert_to_tensor=True,
    normalize_embeddings=True
)

query = 'How do I replace deprecated CLI auth flags with the API key header?'
query_embedding = embedder.encode_query(
    query,
    convert_to_tensor=True,
    normalize_embeddings=True
)

hits = util.semantic_search(
    query_embedding,
    corpus_embeddings,
    top_k=8
)[0]

for rank, hit in enumerate(hits, start=1):
    chunk = chunks[hit['corpus_id']]
    print(rank, round(hit['score'], 4), chunk['path'], chunk['id'])
```

At this point, do not worry if the ranking looks only “pretty good.” The goal of stage one is to ensure the right chunk appears somewhere in the top K. For documentation RAG, K=20 to K=50 is a common operating range when you plan to rerank only a short list.
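A quick way to confirm stage one is doing its job is to check whether the known-correct chunk id appears anywhere in the candidate list at all. This `hit_at_k` helper is a hypothetical convenience function, not part of the Sentence Transformers API; it assumes the `hits` and `chunks` structures from the code above:

```python
def hit_at_k(hits, chunks, expected_id, k=8):
    """True if the expected chunk id appears in the top-k
    results returned by util.semantic_search."""
    top_ids = [chunks[hit['corpus_id']]['id'] for hit in hits[:k]]
    return expected_id in top_ids
```

If this returns False at K=8 but True at K=50, raise the retrieval top_k before blaming the reranker.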
3. Rerank the candidate list with a CrossEncoder
This is the accuracy pass. The official CrossEncoder docs note that it processes both texts jointly and is more accurate for pairwise tasks like reranking, but cannot precompute embeddings for individual documents. That is exactly why it belongs in stage two, not stage one.
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L6-v2')

candidate_chunks = [chunks[hit['corpus_id']] for hit in hits]
pairs = [(query, chunk['text']) for chunk in candidate_chunks]
rerank_scores = reranker.predict(pairs)

reranked = sorted(
    zip(candidate_chunks, rerank_scores),
    key=lambda item: item[1],
    reverse=True
)

print('\nRERANKED RESULTS')
for rank, (chunk, score) in enumerate(reranked[:5], start=1):
    print(rank, round(float(score), 4), chunk['path'], chunk['id'])
```

In a live RAG system, the final answer stage should read from the reranked list, not the raw embedding list. The clean pattern is:
- Retrieve top K chunks with the bi-encoder.
- Score those K query-document pairs with the cross-encoder.
- Return the top N reranked chunks to the generator.
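The three steps above can be sketched as a single function. To keep it framework-agnostic, the retrieval and scoring steps are injected as callables; `search_fn` and `score_fn` are hypothetical names, where in this tutorial's code `search_fn` would wrap `util.semantic_search` and `score_fn` would be `reranker.predict`:

```python
def retrieve_and_rerank(query, chunks, search_fn, score_fn, top_k=30, top_n=5):
    """Two-stage retrieval: search_fn(query, top_k) returns candidate
    chunk indices; score_fn(pairs) returns one relevance score per pair."""
    candidate_ids = search_fn(query, top_k)
    pairs = [(query, chunks[i]['text']) for i in candidate_ids]
    scores = score_fn(pairs)
    # Sort candidates by cross-encoder score, highest first.
    ranked = sorted(zip(candidate_ids, scores), key=lambda t: t[1], reverse=True)
    return [chunks[i] for i, _ in ranked[:top_n]]
```

Because both stages hide behind callables, you can later swap the in-memory search for FAISS or pgvector without touching the rerank logic.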
Verification and expected output
Run the script with a small query set where you already know the correct document or section. Your goal is not “higher scores.” Your goal is that the exact chunk moves higher in the list.
```shell
python rerank_demo.py
```

Example console output:

```
Top retrieved candidates
1 0.7124 docs/auth_overview.md auth_overview:2
2 0.7061 docs/cli_migration.md cli_migration:1
3 0.6977 docs/api_headers.md api_headers:4
4 0.6912 docs/getting_started.md getting_started:3

RERANKED RESULTS
1 8.8421 docs/api_headers.md api_headers:4
2 7.9365 docs/cli_migration.md cli_migration:1
3 4.2874 docs/auth_overview.md auth_overview:2
4 1.9038 docs/getting_started.md getting_started:3
```

Exact numbers will vary, but the pattern should be stable: chunks that discuss the exact migration or API behavior should climb above broader overview pages. For a real evaluation, label a small set of queries and compare MRR, Hit Rate, or NDCG before and after reranking.
```python
def reciprocal_rank(results, expected_id):
    """Return 1/rank of the expected chunk in a ranked list
    of chunk dicts, or 0.0 if it never appears."""
    for i, item in enumerate(results, start=1):
        if item['id'] == expected_id:
            return 1 / i
    return 0.0
```

Troubleshooting and What’s next
Top 3 issues
- The reranker barely helps. Usually the correct chunk never entered the candidate set. Increase retrieval top_k, improve chunk boundaries, or add better titles and paths into chunk text.
- Latency jumps too much. That is normal because a cross-encoder scores every query-document pair directly. Lower rerank depth, batch predictions, or use a smaller reranker before moving to heavier models.
- Generic overview pages still win. Your chunks are probably too broad. Split long pages by heading and keep procedural content separate from conceptual overviews.
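On the latency bullet: `CrossEncoder.predict` already accepts a `batch_size` argument, but if you score through a custom wrapper or a remote scoring service, manual batching is simple to add. A minimal sketch, where `score_fn` is an assumed stand-in for whatever scorer you call:

```python
def predict_in_batches(score_fn, pairs, batch_size=32):
    """Score query-document pairs in fixed-size batches to bound
    per-call memory while preserving result order."""
    scores = []
    for start in range(0, len(pairs), batch_size):
        scores.extend(score_fn(pairs[start:start + batch_size]))
    return scores
```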
What’s next
- Swap the in-memory search for your production retriever such as FAISS, OpenSearch, or pgvector, but keep the rerank stage identical.
- Add offline evaluation with 25-100 labeled developer queries so ranking changes are measurable.
- Cache corpus embeddings and rebuild them only when the docs change.
- Review the official Retrieve & Re-Rank guide and the published CrossEncoder model card before tuning models further.
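For the offline-evaluation item above, aggregating reciprocal ranks over a labeled query set is enough to make before/after comparisons concrete. A minimal sketch, assuming each run is a (ranked chunk ids, expected id) pair:

```python
def mean_reciprocal_rank(runs):
    """runs: iterable of (ranked_ids, expected_id) pairs, where
    ranked_ids is the ordered list a pipeline returned for one query."""
    total = 0.0
    count = 0
    for ranked_ids, expected_id in runs:
        count += 1
        for i, chunk_id in enumerate(ranked_ids, start=1):
            if chunk_id == expected_id:
                total += 1 / i
                break
    return total / count if count else 0.0
```

Compute it once on the raw bi-encoder rankings and once on the reranked rankings; the extra rerank latency is only worth paying if this number moves.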
The practical takeaway is straightforward: if your RAG system already retrieves roughly relevant documentation, semantic reranking is often the shortest path to noticeably better answer grounding. It adds one extra model call over a small candidate set, and for technical documentation that trade usually favors accuracy.
Frequently Asked Questions
What is semantic reranking in RAG?
How many documents should I rerank in a documentation RAG pipeline?
Why does reranking help technical documentation more than generic content?
Technical docs reuse the same vocabulary across unrelated pages: auth, token, config, and migration. A reranker evaluates the full pair jointly, so it is better at distinguishing the exact procedure or API behavior from a broad overview page.
Can I use semantic reranking with FAISS, OpenSearch, or pgvector?