LLM Edge Inference: Quantization & KV-Caching [Cheat Sheet]
Optimizing Large Language Models (LLMs) for edge deployment requires a fundamental shift from 'maximum precision' to 'maximum efficiency.' In 2026, the bottleneck for edge AI is rarely compute power—it is memory bandwidth and VRAM capacity. This guide provides a comprehensive cheat sheet for the two most impactful optimizations: Quantization and KV-Caching.
1. Optimization Overview
Before diving into specific commands, use the overview below to match a technique to your hardware constraints.
- INT4 Quantization: Reduces weights to 4 bits; ideal for mobile/NPU.
- FP8 E4M3: 8-bit floating-point quantization (4 exponent bits, 3 mantissa bits) with native support on H100/L40S hardware.
- PagedAttention: Dynamic memory allocation for KV-caches.
- FlashAttention-3: IO-aware fused attention kernel for Hopper-class GPUs; computes exact attention without materializing the full score matrix.
The Golden Rule of Edge Inference
Memory bandwidth (GB/s), not raw compute, is the primary driver of tokens-per-second. During decoding, every generated token requires streaming the active weights from memory, so shrinking the model via quantization raises the throughput ceiling in direct proportion at a fixed bandwidth.
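The rule can be sanity-checked with back-of-envelope arithmetic; the helper below is a hypothetical sketch with illustrative numbers, not benchmarks:

```python
# Decode throughput ceiling: each generated token streams every weight
# byte through memory once, so tokens/sec <= bandwidth / model_bytes.
def max_tokens_per_sec(params_billion: float, bits_per_weight: int,
                       bandwidth_gb_s: float) -> float:
    model_bytes = params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / model_bytes

# 7B model on a 100 GB/s edge device:
fp16_ceiling = max_tokens_per_sec(7, 16, 100)  # ~7 tok/s
int4_ceiling = max_tokens_per_sec(7, 4, 100)   # ~29 tok/s: 4x from quantization alone
```

Real throughput lands below these ceilings (attention, KV-cache reads, and kernel overhead all cost bandwidth too), but the 4x scaling from FP16 to INT4 holds.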
2. Quantization Reference Table
Common quantization formats and their trade-offs for Llama-3 and Mistral-Large models.
| Format | VRAM (7B Model) | Perplexity Loss | Best Use Case |
|---|---|---|---|
| FP16 | 14.0 GB | Zero | Server-grade inference |
| INT8 | 7.5 GB | Negligible | General purpose desktop |
| INT4 (GGUF) | 4.2 GB | < 1% | Mobile & Apple Silicon |
| INT2 | 2.1 GB | High | Experimental low-power |
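The VRAM column can be approximated from first principles; a sketch (the helper is hypothetical, and real files run slightly larger because per-group scales and zero-points add metadata overhead):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Lower bound on weight memory in GB (using 1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

fp16 = weight_gb(7, 16)  # 14.0 -> matches the FP16 row
int4 = weight_gb(7, 4)   # 3.5  -> INT4 GGUF lands at ~4.2 GB once metadata is added
```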
3. KV-Caching & PagedAttention
KV-Caching avoids redundant computation by storing previously computed Keys and Values in memory. PagedAttention (introduced by vLLM) solves the memory fragmentation issue by managing this cache in 'pages' similar to OS virtual memory.
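The cache PagedAttention manages is easy to size; a sketch, assuming a Llama-3-8B-like shape with grouped-query attention (32 layers, 8 KV heads, head dimension 128):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    # One K and one V tensor per layer, hence the leading factor of 2.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# 8192-token context at FP16: 1 GiB of cache per sequence.
cache = kv_cache_bytes(32, 8, 128, 8192, 1)
```

At these sizes, reserving max-length cache for every sequence up front wastes most of the memory, which is exactly the fragmentation problem paging solves.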
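The paging idea itself can be illustrated with a toy allocator; this is a simplified sketch of the concept, not vLLM's implementation (no block sharing, eviction, or copy-on-write):

```python
class PagedKVAllocator:
    """Carve the KV-cache into fixed-size blocks; sequences hold block
    tables, so memory is claimed on demand rather than reserved up front."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}  # seq_id -> tokens stored so far

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve cache space for one new token; returns (block, offset)."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # current block full (or none yet)
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return self.tables[seq_id][-1], n % self.block_size

    def release(self, seq_id: int) -> None:
        """Sequence finished: return its blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

With 16-token blocks (vLLM's default), a 17-token sequence holds exactly two blocks instead of a max-length reservation, and its memory is reusable the moment it finishes.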
4. Edge Deployment Commands
Grouped by deployment phase. Use these snippets to initialize optimized runners.
Phase 1: Quantization (AutoGPTQ)
AutoGPTQ is driven from Python rather than a standalone CLI; a minimal 4-bit quantization script looks like this (calibration samples elided):

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained("./llama-3-8b", quantize_config)
model.quantize(calibration_examples)  # list of tokenized calibration samples
model.save_quantized("./llama-3-8b-4bit")
```

Phase 2: Serving with KV-Cache (vLLM)
```shell
python -m vllm.entrypoints.openai.api_server \
  --model ./llama-3-8b-4bit \
  --gpu-memory-utilization 0.8 \
  --block-size 16 \
  --max-num-seqs 256
```
5. Advanced Configuration
For high-throughput environments, tweak the num_scheduler_steps and max_model_len parameters to balance latency and batch size.
```python
# Advanced vLLM config for Raspberry Pi 5 / Jetson Orin
config = {
    "model": "quantized-phi-4",
    "kv_cache_dtype": "fp8",
    "enforce_eager": True,
    "max_context_len_to_capture": 2048,
}
```