Small Language Models [2026]: Phi-4, Llama 4, Ministral
Bottom Line
Treat this as a three-way split: Phi-4 for compact text reasoning, Llama 4 Scout for massive context, and Ministral 3 3B for the smallest Apache-licensed edge deployment.
Key Takeaways
- As of May 22, 2026, Meta does not publish an official model named Llama-4-Tiny.
- As of May 22, 2026, Mistral does not publish an official model named Mistral-Nano.
- Phi-4 is a 14B text model with 16K context; Phi-4-mini extends the family to 128K.
- Llama 4 Scout leads on context at 10M; Ministral 3 3B leads on edge-friendly footprint.
Searches for Phi-4, Llama-4-Tiny, and Mistral-Nano often mix official SKUs with shorthand. This reference normalizes those names against vendor docs current on May 22, 2026, then gives you a practical selection matrix, a live search filter, keyboard shortcuts, deployment commands, and production defaults. The goal is speed: enough detail to choose the right small-model family without getting trapped in stale naming or benchmark theater.
| Dimension | Phi-4 | Llama 4 Scout | Ministral 3 3B | Advantage |
|---|---|---|---|---|
| Official naming | Phi-4 is official | Closest official match for Llama-4-Tiny | Closest official match for Mistral-Nano | Phi-4 |
| Core form factor | 14B dense text model | Scout-17B-16E multimodal model | 3.4B language model + 0.4B vision encoder | Ministral 3 3B |
| Context window | 16K | 10M | 256K | Llama 4 Scout |
| Modalities | Text | Image + text | Image + text | Tie: Llama 4 Scout / Ministral 3 3B |
| License posture | Microsoft open model release | Meta Llama license | Apache 2.0 | Ministral 3 3B |
| Local deployment signal | Easy text-only workflows; family also includes Phi-4-mini | Official repo notes 4 GPUs for full bf16 inference | Model card says it can fit in 8GB VRAM in FP8 | Ministral 3 3B |
| Best fit | Compact reasoning and text tasks | Huge-context multimodal systems | Edge agents, Apache-friendly shipping | Depends on workload |
Official Model Map
Bottom Line
Use Phi-4 when you want compact text reasoning, Llama 4 Scout when long context dominates the design, and Ministral 3 3B when the smallest multimodal edge deployment and Apache 2.0 matter most.
- Phi-4: Microsoft publishes it as a 14B dense decoder-only text model with a 16K context window. See the official model card.
- Llama-4-Tiny: Meta does not list that exact SKU. The nearest compact official Llama 4 entry is Llama 4 Scout, shown in Meta docs as a multimodal model with a 10M context window and in the official repo as Scout-17B-16E. See Meta Llama docs and the official repository.
- Mistral-Nano: Mistral does not list that exact SKU. The closest current tiny edge model is Ministral 3 3B Instruct 2512, which the official model card describes as a 3.4B language model plus a 0.4B vision encoder with a 256K context window. See the official model card.
- Important family detail: if you need longer context inside the Phi family, Microsoft’s official Phi-4-mini-instruct card lists a 128K context window. See the official model card.
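The name mapping above can be sketched as a small lookup. This is a minimal illustration, not a vendor API: the `OFFICIAL_NAMES` table and the `normalizeModelName` helper are hypothetical names, and the entries only restate the mapping given in this section.

```javascript
// Hypothetical alias table: keys are search shorthand, values are the
// closest official models named in this section.
const OFFICIAL_NAMES = new Map([
  ['phi-4', 'Phi-4'],
  ['llama-4-tiny', 'Llama 4 Scout'],
  ['mistral-nano', 'Ministral 3 3B Instruct 2512'],
]);

// Normalize a search term to an official model name.
// Returns null when the term is not in the alias table.
function normalizeModelName(term) {
  const key = term.trim().toLowerCase().replace(/[\s_]+/g, '-');
  return OFFICIAL_NAMES.get(key) ?? null;
}

console.log(normalizeModelName('Llama-4-Tiny')); // Llama 4 Scout
console.log(normalizeModelName('mistral nano')); // Ministral 3 3B Instruct 2512
console.log(normalizeModelName('gpt-nano'));     // null
```

Returning `null` for unknown terms keeps the caller honest: an unrecognized name should surface as "no official match" rather than silently resolve to a default model.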
When To Choose Each
Choose Phi-4 when:
- You need a straightforward text-only model with strong compact reasoning.
- Your prompts fit comfortably inside 16K context.
- You want a simpler serving path without multimodal packing overhead.
- You expect to stay inside the Phi family and possibly step down to Phi-4-mini later.
Choose Llama 4 Scout when:
- You need the biggest officially published context window in this set: 10M.
- You want a vendor-backed multimodal path with a broad surrounding ecosystem.
- You are building retrieval-heavy or memory-heavy systems where context length is a first-order constraint.
- You can afford the operational complexity of a larger deployment footprint.
Choose Ministral 3 3B when:
- You need the smallest official multimodal option in this comparison.
- Apache 2.0 licensing matters for packaging or redistribution.
- You want native vendor guidance around function calling and JSON output.
- You care about edge and local deployment, especially where 8GB VRAM class targets matter.
Live Filter + Shortcuts
A cheat sheet is only useful if you can scan it fast. The snippet below adds a client-side filter for cards and a small keyboard layer for faster navigation. The same section IDs also make it easy for your frontend to render a sticky ToC.
HTML shell
<input id='slm-filter' type='search' placeholder='Filter by context, vision, license, edge...' aria-label='Filter models' />
<div id='slm-list'>
<article data-filter='phi-4 text reasoning 14b 16k' tabindex='0'>Phi-4</article>
<article data-filter='llama 4 scout multimodal 10m context' tabindex='0'>Llama 4 Scout</article>
<article data-filter='ministral 3 3b apache vision 256k edge' tabindex='0'>Ministral 3 3B</article>
</div>
<nav class='toc-sticky' aria-label='On this page'></nav>
Filter + shortcut script
// Cache the filter input, the filterable cards, and every code block on the page.
const input = document.getElementById('slm-filter');
const cards = [...document.querySelectorAll('#slm-list [data-filter]')];
const codeBlocks = [...document.querySelectorAll('pre')];
let codeIndex = 0;

// Hide every card whose data-filter text does not contain the query.
function applyFilter() {
  const q = input.value.toLowerCase().trim();
  cards.forEach((card) => {
    card.hidden = q !== '' && !card.dataset.filter.includes(q);
  });
}

// Move focus to the next (step = 1) or previous (step = -1) visible card, wrapping at the ends.
function focusCard(step) {
  const visible = cards.filter((card) => !card.hidden);
  if (visible.length === 0) return;
  const current = visible.indexOf(document.activeElement);
  // With no card focused, j starts at the first card and k at the last.
  const start = current === -1 ? (step > 0 ? -1 : 0) : current;
  visible[(start + step + visible.length) % visible.length].focus();
}

document.addEventListener('keydown', (event) => {
  // Leave browser shortcuts such as Ctrl+C alone.
  if (event.ctrlKey || event.metaKey || event.altKey) return;
  const typing = document.activeElement === input;
  if (event.key === 'Escape' && typing) {
    input.blur();
    return;
  }
  // Let /, j, k, and c be typed normally while the filter has focus.
  if (typing) return;
  if (event.key === '/') {
    event.preventDefault();
    input.focus();
    input.select();
  }
  if (event.key === 'j') focusCard(1);
  if (event.key === 'k') focusCard(-1);
  if (event.key === 'c') {
    const block = codeBlocks[codeIndex % codeBlocks.length];
    if (block) block.scrollIntoView({ behavior: 'smooth', block: 'center' });
    codeIndex += 1;
  }
});

input.addEventListener('input', applyFilter);
Keyboard shortcuts
| Key | Action | Why it helps |
|---|---|---|
| / | Focus search | Jump straight into filtering. |
| j | Next visible card | Fast scan through filtered results. |
| k | Previous visible card | Reverse scan without touching the mouse. |
| c | Jump to next code block | Useful in long reference posts. |
| Esc | Blur search input | Return keyboard control to the page. |
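The filter script in this section matches the query as a single substring, so a query such as "apache edge" finds nothing even though both words appear in the Ministral card's `data-filter` text. A possible refinement, sketched here as a pure function, requires every whitespace-separated term to match independently. The helper name `matchesAllTerms` is an assumption of this sketch, not part of the original script.

```javascript
// Return true when every whitespace-separated term in `query` appears
// somewhere in `haystack` (case-insensitive). An empty query matches all.
function matchesAllTerms(haystack, query) {
  const text = haystack.toLowerCase();
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  return terms.every((term) => text.includes(term));
}

const ministral = 'ministral 3 3b apache vision 256k edge';
console.log(ministral.includes('apache edge'));         // false
console.log(matchesAllTerms(ministral, 'apache edge')); // true
console.log(matchesAllTerms(ministral, 'apache 10m'));  // false
```

To wire it in, the body of the `forEach` in `applyFilter` could become `card.hidden = !matchesAllTerms(card.dataset.filter, input.value);`.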
Commands By Purpose
These are vendor-published commands or code paths worth bookmarking. Keep the model IDs exact.
1. Acquire weights or list official models
- Llama 4 Scout: use the official llama-models CLI.
- Phi-4: Microsoft publishes direct model pages on Hugging Face, plus Docker Model Runner examples.
- Ministral 3 3B: Mistral’s official model card favors vLLM for serving.
pip install llama-models
llama-model list
llama-model download --source meta --model-id Llama-4-Scout-17B-16E-Instruct
docker model run hf.co/microsoft/phi-4
2. Serve locally
- Use Phi-4-mini-instruct if you want the longer-context compact checkpoint in the Phi-4 family.
- Use the published Ministral 3 3B launch flags exactly when enabling tools.
vllm serve "microsoft/Phi-4-mini-instruct"
vllm serve mistralai/Ministral-3-3B-Instruct-2512 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral \
  --enable-auto-tool-choice --tool-call-parser mistral
3. Run the official Meta local inference path
- Meta’s official repo notes that Llama 4 models require at least 4 GPUs for full bf16 inference.
- This is the reference script shape from the official repository.
pip install .[torch]
NGPUS=4
CHECKPOINT_DIR=~/.llama/checkpoints/Llama-4-Scout-17B-16E-Instruct
PYTHONPATH=$(git rev-parse --show-toplevel) \
torchrun --nproc_per_node=$NGPUS \
-m models.llama4.scripts.chat_completion $CHECKPOINT_DIR \
--world_size $NGPUS
4. Call an OpenAI-compatible local endpoint
- Microsoft publishes OpenAI-compatible examples for served Phi checkpoints.
- Mistral’s vLLM examples also assume an OpenAI-compatible endpoint.
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "microsoft/Phi-4-mini-instruct",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}'
Configuration
For production, the biggest mistake is choosing purely on benchmark headlines. Choose on context, modality, license, deployment target, and tool wiring first.
Default selection rules
- Route text-only reasoning traffic to Phi-4 when 16K context is enough.
- Route very long-context multimodal workflows to Llama 4 Scout.
- Route small multimodal edge workflows to Ministral 3 3B.
- Prefer Phi-4-mini over Phi-4 when you need 128K context inside the same family.
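The rules above can be sketched as a small routing function. This is one possible reading, not a vendor recommendation: the request shape (`needsVision`, `edgeTarget`, `contextTokens`) and the function name `pickModel` are hypothetical, the token limits treat "K" and "M" as decimal thousands and millions, and the fallback to Llama 4 Scout for anything the other rules do not cover is an assumption.

```javascript
// Context limits quoted in this section, as approximate token counts.
// Assumption: "K" and "M" are decimal thousands and millions.
const LIMITS = {
  phi4: 16_000,
  phi4Mini: 128_000,
  ministral: 256_000,
  scout: 10_000_000,
};

// Model IDs as they appear elsewhere in this section.
const MODELS = {
  phi4: 'microsoft/phi-4',
  phi4Mini: 'microsoft/Phi-4-mini-instruct',
  ministral: 'mistralai/Ministral-3-3B-Instruct-2512',
  scout: 'meta-llama/Llama-4-Scout-17B-16E-Instruct',
};

// Hypothetical request shape: { needsVision, edgeTarget, contextTokens }.
function pickModel({ needsVision = false, edgeTarget = false, contextTokens = 0 } = {}) {
  if (contextTokens > LIMITS.scout) {
    throw new RangeError('contextTokens exceeds every model in this comparison');
  }
  // Small multimodal edge workflows go to Ministral 3 3B.
  if (needsVision && edgeTarget && contextTokens <= LIMITS.ministral) {
    return MODELS.ministral;
  }
  // Text-only traffic stays in the Phi family while the context fits.
  if (!needsVision && contextTokens <= LIMITS.phi4) return MODELS.phi4;
  if (!needsVision && contextTokens <= LIMITS.phi4Mini) return MODELS.phi4Mini;
  // Assumed fallback: very long context, or vision away from the edge.
  return MODELS.scout;
}

console.log(pickModel({ contextTokens: 8_000 }));   // microsoft/phi-4
console.log(pickModel({ contextTokens: 100_000 })); // microsoft/Phi-4-mini-instruct
console.log(pickModel({ needsVision: true, edgeTarget: true, contextTokens: 50_000 }));
console.log(pickModel({ needsVision: true, contextTokens: 2_000_000 }));
```

Keeping the limits and IDs in plain objects means a change in vendor documentation only touches data, not the routing logic.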
Baseline config object
{
"router": {
"text_reasoning": "microsoft/phi-4",
"long_context_multimodal": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
"edge_multimodal": "mistralai/Ministral-3-3B-Instruct-2512"
},
"defaults": {
"temperature": 0.1,
"json_mode": true,
"tool_whitelist": true,
"context_budget_guardrail": true,
"prompt_log_redaction": true
}
}
Operational guardrails
- Keep tool lists short. Mistral explicitly recommends limiting the tool set to what the use case requires.
- Do not fill a 10M context window unless retrieval and summarization have already failed to solve the task.
- When storing prompts, traces, or tool-call arguments, scrub secrets and PII first. A simple way to harden logs during debugging is the Data Masking Tool.
- If you are pasting prompt templates into docs or repos, normalize indentation and JSON examples before shipping with the Code Formatter.
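The log-scrubbing guardrail above can be illustrated with a small redaction pass run before anything is written to storage. This is a rough sketch under stated assumptions: the three patterns (email addresses, `Bearer` tokens, and `sk-` prefixed keys) are illustrative choices, not a complete PII or secret list, and `redact` is a hypothetical helper name.

```javascript
// Illustrative patterns only. Real deployments need rules tuned to their
// own secret formats; these regexes are assumptions of this sketch.
const REDACTION_RULES = [
  { label: '[EMAIL]', pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: '[BEARER_TOKEN]', pattern: /Bearer\s+[A-Za-z0-9._~+\/-]+=*/g },
  { label: '[API_KEY]', pattern: /\bsk-[A-Za-z0-9_-]{16,}\b/g },
];

// Apply every rule in order and return the scrubbed text.
function redact(text) {
  return REDACTION_RULES.reduce(
    (out, { label, pattern }) => out.replace(pattern, label),
    text
  );
}

console.log(redact('mail alice@example.com with Bearer abc.def-123'));
// mail [EMAIL] with [BEARER_TOKEN]
```

Running redaction at the logging boundary, rather than at read time, means an unredacted prompt never reaches disk in the first place.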
Advanced Usage
Use family-level routing, not one-model absolutism
- Phi-4 and Phi-4-mini solve different deployment constraints inside the same family.
- Llama 4 Scout is not a drop-in “tiny” model; it is the smallest official Llama 4 entry, but still a bigger operational commitment.
- Ministral 3 3B is the best fit here when you need a tiny multimodal checkpoint with clear edge positioning.
Exploit model-specific strengths deliberately
- Use Phi-4 for compact reasoning pipelines, code-adjacent helpers, and text-heavy copilots.
- Use Llama 4 Scout for agent memory, long multimodal sessions, and retrieval systems where context packing is the bottleneck.
- Use Ministral 3 3B for local assistants, robotics-style interfaces, document/image edge tasks, and strict license-sensitive packaging.
Read benchmarks with caution
- Microsoft, Meta, and Mistral publish different benchmark mixes, prompt formats, and evaluation assumptions.
- Do not compare one vendor’s best reasoning score to another vendor’s best instruct score as if they were interchangeable.
- For real selection, weight modality, context, license, memory fit, and tooling ahead of a single leaderboard number.
Production checklist
- Lock model IDs exactly.
- Pin serving stack versions when vendor docs require them.
- Cap context aggressively even when the model advertises much more.
- Evaluate tool-calling reliability separately from pure chat quality.
- Redact logs before storing prompts or multimodal payloads.
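Two of the checklist items, exact model IDs and an aggressive context cap, lend themselves to a preflight check before a request is sent. The sketch below is illustrative: `POLICY`, `estimateTokens`, and `preflight` are hypothetical names, the 8,000-token cap is an arbitrary example value, and the four-characters-per-token estimate is a rough heuristic rather than a real tokenizer.

```javascript
// Hypothetical policy: exact model IDs copied from the baseline config in
// this section, plus an operator-chosen context cap (assumed value).
const POLICY = {
  allowedModels: new Set([
    'microsoft/phi-4',
    'meta-llama/Llama-4-Scout-17B-16E-Instruct',
    'mistralai/Ministral-3-3B-Instruct-2512',
  ]),
  maxContextTokens: 8_000,
};

// Rough estimate: about 4 characters per token. This is a heuristic, not a
// tokenizer; use the model's own tokenizer where accuracy matters.
function estimateTokens(messages) {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / 4);
}

// Validate a chat request against the policy and collect every violation.
function preflight(request, policy = POLICY) {
  const errors = [];
  if (!policy.allowedModels.has(request.model)) {
    errors.push(`model ID not in allowlist: ${request.model}`);
  }
  const tokens = estimateTokens(request.messages);
  if (tokens > policy.maxContextTokens) {
    errors.push(`estimated ${tokens} tokens exceeds cap of ${policy.maxContextTokens}`);
  }
  return { ok: errors.length === 0, errors };
}

const messages = [{ role: 'user', content: 'What is the capital of France?' }];
console.log(preflight({ model: 'microsoft/phi-4', messages }).ok); // true
console.log(preflight({ model: 'microsoft/Phi-4', messages }).ok); // false (case differs)
```

The allowlist comparison is deliberately case-sensitive, so a near-miss ID fails loudly instead of resolving to a different checkpoint.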
Frequently Asked Questions
Is there an official model called Llama-4-Tiny?
Is Mistral-Nano a real official model name?
Which of these small models is best for edge deployment?
Should I use Phi-4 or Phi-4-mini?
Are the published benchmarks directly comparable across Phi, Llama, and Mistral?