Speculative Decoding [Deep Dive]: Sub-10ms Edge LLMs
Bottom Line
Speculative decoding is the most practical way to push edge LLM serving toward sub-10 ms per token because it cuts target-model passes without changing output quality. The catch is operational: the win depends on acceptance rate, low-to-medium concurrency, and workload fit.
Key Takeaways
- Lossless speculative decoding preserves the target model’s output distribution.
- Published systems report roughly 2x to 3.6x speedups, but gains shrink at high QPS.
- Edge latency improves only when draft acceptance is high and target passes actually collapse.
- Prompt-based methods like n-gram can outperform heavier drafters on repetitive code and editing tasks.
Speculative decoding has moved from research trick to production lever because it attacks the real bottleneck in LLM serving: serial, memory-bound token generation. On April 29, 2026, the engineering question is no longer whether it works, but when it is the fastest path to a sub-10 ms streaming experience at the edge. In practice, that usually means sub-10 ms per accepted token after prefill, not a full response in 10 ms flat.
- Speculative decoding is lossless when implemented correctly: the target model keeps the same output distribution.
- Published results range from roughly 2x to 3.6x, but only on workloads where verification amortizes well.
- Edge deployments benefit most when concurrency stays modest and the target model is still compute-idle between memory-bound decoding steps.
- For repetitive workloads such as code editing, prompt-based methods like n-gram or suffix decoding can be the cleanest win.
The Lead
Bottom Line
If your edge runtime is bottlenecked by one-token-at-a-time decoding, speculative decoding is the highest-leverage latency optimization available today. The real job is not merely turning it on, but matching the speculation method to your workload, batch shape, and hardware budget.
Speculative decoding was formalized by Leviathan et al. in 2023 as a way to generate multiple tokens in parallel without changing the target model’s output distribution. The core move is simple: a cheaper drafter proposes several next tokens, then the expensive target verifies them in one pass. If the target accepts a long prefix, several autoregressive steps collapse into one.
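The control flow is easier to see than to describe. The sketch below covers the greedy case only, with draft_next and target_argmax as hypothetical callables standing in for the drafter and the target; lossless sampling additionally needs the rejection-sampling acceptance rule from Leviathan et al., and real runtimes run verification as one batched pass on the accelerator rather than a Python loop.

```python
from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],           # hypothetical: drafter's greedy next token
    target_argmax: Callable[[List[int]], List[int]],  # hypothetical: target's greedy choice after each draft position, one pass
    k: int = 4,
) -> List[int]:
    """One greedy speculation round: draft k tokens, verify, keep the longest agreeing prefix."""
    # 1. The cheap drafter proposes k tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    # 2. The target scores prefix + draft in one forward pass. verdicts[i] is the token
    #    the target itself would emit given prefix + draft[:i], so len(verdicts) == k + 1.
    verdicts = target_argmax(prefix + draft)

    # 3. Accept draft tokens while they match the target, then take the target's token
    #    at the first disagreement (or the free bonus token if all k matched).
    accepted: List[int] = []
    for i, proposed in enumerate(draft):
        if verdicts[i] == proposed:
            accepted.append(proposed)
        else:
            accepted.append(verdicts[i])
            break
    else:
        accepted.append(verdicts[k])
    return accepted
```

When the drafter is well matched, most rounds accept several tokens, so a handful of sequential target passes collapse into a single verification pass.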
That matters disproportionately on edge hardware. A laptop GPU, embedded accelerator, or compact inference appliance is usually constrained less by theoretical FLOPS than by memory traffic, cache behavior, and power ceilings. In that environment, reducing the number of target-model passes often buys more user-visible latency than another round of kernel micro-optimization.
Why plain autoregressive decoding stalls
- Each new token forces another full target-model pass, even when the next few tokens are easy to predict.
- Decoder inference is frequently memory-bound, so hardware sits underutilized while weights and KV cache move around; the back-of-envelope sketch after this list puts a number on that cost.
- At the edge, the cost of every wasted pass is amplified by smaller memory pools and tighter thermal envelopes.
- Streaming UX depends on inter-token latency, not just aggregate tokens/sec.
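A rough model makes the memory-bound point concrete. The numbers below are illustrative assumptions, not measurements of any particular device, and the model ignores KV-cache traffic and compute entirely:

```python
# Lower bound on plain decode latency when every token re-reads the full weights.
params = 7e9            # assumed 7B-parameter target model
bytes_per_param = 0.5   # assumed 4-bit quantized weights
bandwidth = 100e9       # assumed 100 GB/s effective memory bandwidth on an edge accelerator

weight_bytes = params * bytes_per_param          # ~3.5 GB touched per target pass
per_pass_ms = weight_bytes / bandwidth * 1e3
print(f"~{per_pass_ms:.0f} ms per target pass")  # ~35 ms: sub-10 ms per token is out of reach
                                                 # unless each pass yields several tokens
```

That is the arithmetic speculation exploits: if one verification pass yields several accepted tokens, the same memory traffic is amortized across all of them.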
What changed in 2025 and 2026
The ecosystem is now materially more usable. vLLM documents model-based methods such as EAGLE, MTP, draft models, PARD, and MLP, alongside prompt-based methods such as n-gram and suffix decoding. TensorRT-LLM documents Draft-Target-Model, NGram, Medusa, ReDrafter, EAGLE, and Lookahead. llama.cpp ships both draft-model and draftless n-gram-style implementations. The practical implication is that edge teams no longer need to invent the control loop from scratch.
Architecture & Implementation
The architecture decision is not really about which paper sounds smartest. It is about where your workload sits on three axes: predictability, concurrency, and verifier cost. The right drafter is the one that advances the most accepted tokens per target pass without introducing enough overhead to erase the gain.
Method selection
- Draft-target speculation: best when you can afford a separate lightweight model with the same vocabulary and good alignment to the target.
- EAGLE or ReDrafter: strong when you want higher acceptance and tighter integration around the target model’s hidden states.
- Medusa: attractive when you prefer extra decoding heads over a separate drafter model.
- n-gram or suffix decoding: ideal when prompts and continuations contain local repetition, especially code completion, editing, and iterative rewrite flows.
- Lookahead: useful when you want same-model prediction and verification without adding an external drafter.
Production request path
- Prefill the prompt and establish the KV cache for the target model.
- Generate a short draft sequence using the chosen speculation method.
- Verify the full draft in one target-model pass.
- Accept the longest valid prefix and discard the rest.
- Repeat with adaptive draft length until stop conditions fire.
The last point is the operational heart of the system. Fixed speculation depth is easy to reason about, but dynamic depth usually wins in practice. Easy continuations want longer drafts; uncertain continuations want short ones. The controller should react to acceptance history instead of pretending every token is equally predictable.
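A minimal controller for that idea tracks a window of recent acceptance and nudges the depth up or down. The thresholds and bounds below are arbitrary placeholders, not the policy any shipping runtime uses:

```python
class DraftLengthController:
    """Adaptive speculation depth from recent acceptance history (illustrative heuristic)."""

    def __init__(self, k_min: int = 1, k_max: int = 8, start: int = 4, window: int = 32):
        self.k_min, self.k_max = k_min, k_max
        self.k = start
        self.window = window
        self.history = []  # fraction of each draft that was accepted

    def update(self, proposed: int, accepted: int) -> int:
        """Record one verification round, return the draft length for the next round."""
        self.history.append(accepted / max(proposed, 1))
        self.history = self.history[-self.window:]
        rate = sum(self.history) / len(self.history)

        if rate > 0.8 and self.k < self.k_max:    # easy continuations: speculate deeper
            self.k += 1
        elif rate < 0.4 and self.k > self.k_min:  # drafts mostly rejected: back off
            self.k -= 1
        return self.k
```

Whether your runtime exposes this knob directly varies; the point of the sketch is the control signal, not the exact thresholds.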
Reference configurations
In vLLM, prompt-lookup speculation is configured through --speculative-config:
```
vllm serve <target-model> --speculative-config '{ "method": "ngram", "num_speculative_tokens": 4, "prompt_lookup_min": 2, "prompt_lookup_max": 5 }'
```

In llama.cpp, the documented knobs include --draft-max and --spec-type:

```
./llama-server [...] --spec-type ngram-simple --draft-max 64 --spec-ngram-size-n 12 --spec-ngram-size-m 48
```

Those snippets matter because they show where current runtimes expose the tradeoff surface. In vLLM, the docs also note feature constraints: pipeline parallelism is not composable with speculative decoding as of vllm<=0.15.0, and draft-model speculation is listed as unsupported in vllm<=0.10.0. That is exactly the kind of detail teams miss when they benchmark a demo and then wonder why the production topology behaves differently.
What actually makes sub-10 ms possible
- Short target-model verification paths with high acceptance (the latency model after this list shows how tightly these two set the per-token budget).
- Hot KV cache and low cache eviction pressure.
- Quantized or otherwise edge-optimized target weights.
- Low enough concurrency that the target still benefits from collapsing sequential passes.
- Tight runtime integration so draft generation, verification, and acceptance do not bounce through Python or slow scheduler edges.
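To see how those requirements combine, the expected-latency model from Leviathan et al. 2023 (i.i.d. per-token acceptance alpha, draft length k) gives a quick sanity check. The timings below are assumptions chosen to illustrate an edge-class device, not benchmarks:

```python
def ms_per_emitted_token(t_target_ms: float, t_draft_ms: float, k: int, alpha: float) -> float:
    """Expected decode latency per emitted token for one speculation round.

    Uses the i.i.d. acceptance model from Leviathan et al. 2023:
    expected tokens per round = (1 - alpha**(k+1)) / (1 - alpha).
    All inputs are illustrative assumptions.
    """
    expected_tokens = (k + 1) if alpha >= 1.0 else (1 - alpha ** (k + 1)) / (1 - alpha)
    round_ms = t_target_ms + k * t_draft_ms  # one verification pass plus k drafter steps
    return round_ms / expected_tokens

# Assumed: 30 ms verification pass, 2 ms drafter step, depth 4.
for alpha in (0.5, 0.7, 0.9):
    print(f"acceptance {alpha:.1f}: {ms_per_emitted_token(30.0, 2.0, 4, alpha):.1f} ms/token")
# ~19.6, ~13.7, ~9.3 ms: only the high-acceptance case clears a 10 ms budget
```

Note the asymmetry: a cheaper drafter buys little on its own, while a few extra points of acceptance move the per-token budget a lot.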
Benchmarks & Metrics
Benchmarks around speculation are notoriously easy to misuse. The most important recent correction comes from NVIDIA Research's SPEED-Bench, which argues that speculative decoding is inherently data-dependent and shows that synthetic inputs can overestimate real-world throughput. That is the right framing for edge work: benchmark the workload you actually serve, not the workload that flatters the paper figure.
Published numbers worth knowing
- Leviathan et al. 2023 reported roughly 2x-3x acceleration on T5-XXL with identical outputs.
- vLLM Speculators documentation describes 2-3x faster generation for interactive use cases when draft models are well matched.
- The EAGLE paper reports 2.7x-3.5x latency speedup on LLaMA2-Chat 70B and doubled throughput.
- An official TensorRT-LLM engineering post reported speculative decoding throughput gains of up to 3.6x.
Those numbers are real, but they are not portable. They do not mean your edge box instantly becomes 3x faster. They mean the ceiling is high when the drafter is accurate, the verifier is memory-bound, and batching has not already saturated the hardware.
Metrics that matter more than headline speedup
- Inter-token latency: measure median and p95 after prefill. This is the closest proxy for perceived responsiveness.
- Time to first token: speculation often helps decoding more than prefill, so do not let a better steady-state hide a weak startup path.
- Accepted tokens per verification pass: this is the core efficiency number behind the user experience; the snippet after this list shows one way to derive it from request logs.
- Acceptance rate: low acceptance means the drafter is mostly generating work the target throws away.
- Power per generated token: edge systems do not get to ignore thermal cost.
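Assuming you log per-round (proposed, accepted) counts and per-token emission timestamps (field layouts that are your own, not any runtime's schema), these metrics reduce to a few lines:

```python
import statistics

def decode_metrics(rounds, token_times_s):
    """Summarize speculation efficiency and streaming latency for one request.

    rounds:        list of (proposed, accepted) pairs, one per verification pass
    token_times_s: timestamps of emitted tokens, starting at the first post-prefill token
    Both layouts are assumptions about your own logging.
    """
    proposed = sum(p for p, _ in rounds)
    accepted = sum(a for _, a in rounds)
    gaps_ms = [(b - a) * 1e3 for a, b in zip(token_times_s, token_times_s[1:])]
    return {
        "acceptance_rate": accepted / proposed if proposed else 0.0,
        "accepted_per_pass": accepted / len(rounds) if rounds else 0.0,
        "itl_ms_p50": statistics.median(gaps_ms) if gaps_ms else 0.0,
        "itl_ms_p95": statistics.quantiles(gaps_ms, n=20)[-1] if len(gaps_ms) >= 20 else max(gaps_ms, default=0.0),
    }
```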
A sane edge benchmark protocol
- Benchmark baseline decoding and speculative decoding on the same prompt set, with the same sampler and output limits.
- Separate prefill from decode metrics so you can see where the gain originates.
- Sweep speculation depth instead of trusting a default; a sweep skeleton follows this list.
- Repeat the sweep across concurrency levels, because the optimal draft length changes with batch shape.
- Include at least one repetitive workload and one open-ended workload to expose acceptance variance.
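A harness for that protocol can stay small. In the skeleton below, run_trial is a deliberate placeholder for however you drive your runtime (an HTTP client against vllm serve or llama-server, or an in-process call); everything about it is an assumption to fill in:

```python
import itertools
import json

def run_trial(depth: int, concurrency: int, workload: str) -> dict:
    """Placeholder: configure the runtime for `depth` speculative tokens, replay the
    named prompt set at `concurrency`, and return metrics (e.g. the decode_metrics
    dict sketched earlier). Entirely hypothetical."""
    raise NotImplementedError

results = []
for depth, conc, workload in itertools.product(
    [0, 2, 4, 6, 8],                           # 0 = plain decoding baseline
    [1, 4, 16],                                # concurrency levels shift the optimum
    ["repetitive_edits", "open_ended_chat"],   # expose acceptance variance
):
    results.append({"depth": depth, "concurrency": conc, "workload": workload,
                    **run_trial(depth, conc, workload)})

print(json.dumps(results, indent=2))
```

Keeping depth 0 in the grid gives you the baseline comparison from the first bullet for free.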
If you are shipping a local coding assistant, normalize benchmark outputs before comparing edit-heavy traces. In practice that often means applying a formatter or linter pass to avoid mistaking stylistic churn for model error; TechBytes’ Code Formatter is a useful companion for that workflow.
Strategic Impact
The strategic value of speculation is that it changes the economics of local intelligence. Sub-10 ms per token is not just a nicer chart. It is the threshold where assistants feel interruptible, code tools feel attached to the cursor, and robotics or field devices can keep inference on-box instead of waiting on a round trip to a region hundreds of miles away.
Why edge teams care
- Privacy: more traffic can stay local, which reduces the need to ship raw prompts off-device.
- Reliability: inference keeps working in bandwidth-constrained or intermittently connected environments.
- Cost control: fewer target passes can translate into more useful work per watt and per accelerator.
- Product design: lower token latency enables more aggressive streaming, interruption, and turn-taking patterns.
This also changes how teams think about prompt data. Once you start replaying real production traces to tune acceptance, you need a disciplined sanitization path. Before those traces enter your benchmark harness, scrub them with a tool like the Data Masking Tool so privacy work does not get bolted on after optimization.
The deeper strategic point is that speculation is one of the rare optimizations that can improve both user experience and infrastructure efficiency at the same time. That makes it more durable than cosmetic latency wins that disappear the moment product traffic shifts.
Road Ahead
The next phase of speculative decoding will be less about one canonical algorithm and more about adaptive orchestration. Runtimes are already moving in that direction: TensorRT Edge-LLM highlights EAGLE-3 alongside quantization and chunked prefill for embedded deployments, while evaluation work like SPEED-Bench is forcing the field to confront real serving conditions instead of toy prompts.
Expect three things to define the next year. First, dynamic controllers will tune draft length and even switch speculation method mid-request. Second, prompt-based and model-based speculation will be combined instead of treated as competing camps. Third, edge runtimes will optimize speculation as a scheduling problem, not just a model problem.
The practical takeaway is blunt: if you want a genuinely fast edge LLM in 2026, you should stop treating speculative decoding as an optional research extra. It is now part of the core serving architecture. The teams that win will be the ones that benchmark it honestly, wire it tightly into the runtime, and choose the speculation strategy that matches the traffic they actually have.