LLM Home Server Hardware [2026] Selection Cheat Sheet
Bottom Line
Buy for memory capacity first, then bandwidth, then watts. In 2026, the safest private LLM choices are a 16GB entry build, a 32GB NVIDIA tower, or an Apple unified-memory system when you want larger contexts in a quieter box.
Key Takeaways
- 16GB is the practical floor for modern single-GPU local inference; 32GB is the comfort tier.
- RTX 5090 brings 32GB GDDR7 but also a 575W board power budget.
- Mac Studio unified memory changes sizing math when VRAM is your real bottleneck.
- Ollama defaults to a 4096-token context unless you raise OLLAMA_CONTEXT_LENGTH.
- For concurrency, memory pressure rises with OLLAMA_NUM_PARALLEL × OLLAMA_CONTEXT_LENGTH.
As of May 08, 2026, private LLM home-server buying has become much simpler: stop shopping by benchmark slogans and shop by memory ceilings, bandwidth, power, and software support. This cheat sheet is built for fast decisions, not theory. Use it to shortlist a box for Ollama, llama.cpp, and related local stacks, then drop in the filter, shortcuts, and command blocks below for your own internal docs or homelab wiki.
- 16GB is the practical entry tier for current local LLM work.
- 32GB VRAM is where single-GPU builds stop feeling cramped.
- Unified memory is the cleanest path to larger contexts in a quiet footprint.
- System RAM, NVMe capacity, and PSU headroom still decide whether the box is pleasant to live with.
| Build class | What to buy | Best for | Main tradeoff | Edge |
|---|---|---|---|---|
| Budget lab | Intel Arc B580 12GB | Cheap local testing, small quantized models | 12GB fills up quickly | Price |
| Mainstream tower | RTX 5080 16GB | Fast single-user inference, dev boxes | 16GB is still a ceiling | Balance |
| High-end tower | RTX 5090 32GB | Serious single-GPU local work | 575W and louder thermals | Single-GPU ceiling |
| Quiet dense memory box | Mac Studio with M4 Max or M3 Ultra | Larger contexts, quiet always-on use | Less upgradeable, different toolchain | Noise and density |
Selection Matrix
Bottom Line
For most builders, the real fork is simple: buy 16GB if you are optimizing for budget, 32GB if you want a box that still feels current a year from now, or Apple unified memory if you need larger working sets without a hot tower.
Pick by memory, not marketing
- RTX 5090 officially ships with 32GB GDDR7, a 512-bit interface, PCIe Gen 5, and 575W total graphics power. Source: NVIDIA official specs.
- RTX 5080 officially ships with 16GB GDDR7, a 256-bit interface, PCIe Gen 5, and 360W total graphics power. Source: NVIDIA official specs.
- Radeon RX 9070 XT officially ships with 16GB GDDR6, a 256-bit interface, and 304W typical board power. Source: AMD official specs.
- Intel Arc B580 officially ships with 12GB GDDR6, a 192-bit interface, and 190W TBP. Source: Intel official specs.
- Mac Studio currently offers M4 Max at 36GB or 64GB unified memory, and M3 Ultra at 96GB unified memory with 819GB/s memory bandwidth. Source: Apple official specs.
Fast sizing rules
- Very rough rule: 4-bit weights take about half a byte per parameter before runtime overhead and KV cache (a quick estimator is sketched after this list).
- That means 8B models are easy on modern GPUs, 14B-32B is where memory planning matters, and 70B-class quantized runs stop being casual on 16GB boxes.
- Long contexts are often the silent budget killer because KV cache growth can erase the headroom you thought you had.
- If you want concurrent users, size for memory headroom first and for raw tokens per second only after that.
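To make the first rule concrete, here is a minimal sketch of the arithmetic, assuming roughly 0.5 bytes per parameter for 4-bit weights plus a flat overhead allowance; the parameter count and overhead figure are placeholders, not measurements.
PARAMS_B=14     # model size in billions of parameters (placeholder)
OVERHEAD_GB=2   # rough allowance for runtime buffers, before KV cache (placeholder)
awk -v p="$PARAMS_B" -v o="$OVERHEAD_GB" \
  'BEGIN { printf "~%.1f GB weights + ~%.0f GB overhead = ~%.1f GB before KV cache\n", p*0.5, o, p*0.5 + o }'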
Choose this class when
- Choose a budget lab when you mostly test prompts, agents, and small local models.
- Choose a mainstream tower when you want the broadest compatibility and easy Linux tuning.
- Choose a high-end tower when you want one GPU to carry serious local inference without constant compromises.
- Choose a quiet unified-memory box when acoustic profile, idle power behavior, and larger memory pools matter more than PCIe upgrade paths.
Live Search Filter
For internal runbooks and hardware catalogs, add a tiny client-side filter so engineers can instantly narrow by VRAM, power, or toolchain. The block below is dependency-free and works well in a docs page, wiki, or static site. Seed one card per build class; the one-line summaries below make good starting points for each card's filter text:
- RTX 5090: 32GB, single-GPU ceiling, hot and expensive.
- RTX 5080: 16GB, balanced build, easy mainstream choice.
- Radeon RX 9070 XT: 16GB, viable if you accept ROCm-specific tuning.
- Intel Arc B580: 12GB, cheap experimentation box.
- Mac Studio M4 Max: quiet local serving with unified memory.
- Mac Studio M3 Ultra: dense memory for larger contexts and bigger local runs.
<input id='hardware-filter' type='search' placeholder='Type 32GB, Apple, quiet...' />
<div id='hardware-grid'>
<div class='filter-card' data-filter='rtx 5090 32gb nvidia hot'>RTX 5090</div>
<div class='filter-card' data-filter='mac studio quiet apple unified memory'>Mac Studio</div>
</div>
<script>
const input = document.getElementById('hardware-filter');
const cards = [...document.querySelectorAll('#hardware-grid .filter-card')];
input?.addEventListener('input', (e) => {
const q = e.target.value.toLowerCase().trim();
cards.forEach((card) => {
const haystack = card.dataset.filter || '';
card.style.display = !q || haystack.includes(q) ? '' : 'none';
});
});
</script>
Keyboard Shortcuts
Keyboard shortcuts matter once your reference page gets long. The table below is enough for a premium docs feel without pulling in a full command palette.
| Shortcut | Action | Why it matters |
|---|---|---|
| / | Focus the live filter | Fast scan by GPU, memory, or software stack |
| Esc | Clear the filter | Resets the grid without touching the mouse |
| g h | Jump to hardware selection | Useful when you revisit sizing rules often |
| g c | Jump to commands | Best for operational lookups during setup |
| g a | Jump to advanced usage | Good for multi-GPU and Apple-specific notes |
<script>
let pending = '';
document.addEventListener('keydown', (e) => {
const tag = (document.activeElement?.tagName || '').toLowerCase();
const typing = tag === 'input' || tag === 'textarea';
if (e.key === '/' && !typing) {
e.preventDefault();
document.getElementById('hardware-filter')?.focus();
return;
}
if (e.key === 'Escape') {
const input = document.getElementById('hardware-filter');
if (input) {
input.value = '';
input.dispatchEvent(new Event('input'));
input.blur();
}
pending = '';
return;
}
if (typing) return;
if (pending === 'g') {
const map = { h: 'selection-matrix', c: 'commands-by-purpose', a: 'advanced-usage' };
const id = map[e.key];
if (id) document.getElementById(id)?.scrollIntoView({ behavior: 'smooth' });
pending = '';
return;
}
pending = e.key === 'g' ? 'g' : '';
if (pending) setTimeout(() => { pending = ''; }, 800);
});
</script>
Commands by Purpose
This section stays deliberately operational. Use it to inventory hardware, stand up a local server, and get enough telemetry to know whether your machine is actually behaving.
Inventory the box
uname -a                                    # kernel version and architecture
lscpu                                       # CPU model, core count, and flags
free -h                                     # system RAM and swap
lsblk -o NAME,SIZE,MODEL,TYPE,MOUNTPOINT    # disks, partitions, and mounts
lspci | grep -E 'VGA|3D|Display'            # GPUs visible on the PCIe bus
nvidia-smi                                  # NVIDIA driver, VRAM, and utilization
rocminfo                                    # AMD ROCm device details
nvme list                                   # NVMe drives present in the box
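On NVIDIA boxes, a compact VRAM readout is often more useful than the full nvidia-smi dashboard; the query form below prints one line per GPU.
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv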
Install and verify Ollama on Linux
Ollama Linux docs currently document the commands below, including the separate ROCm package for AMD GPUs.
curl -fsSL https://ollama.com/install.sh | sh    # standard Linux install script
ollama serve                                     # run the server in the foreground (skip if the systemd service is already running)
ollama -v                                        # confirm the installed version
# AMD GPUs: the separate ROCm package, unpacked into /usr
curl -fsSL https://ollama.com/download/ollama-linux-amd64-rocm.tar.zst \
  | sudo tar x -C /usr
Serve and test a model
ollama run gemma3
curl http://localhost:11434/api/generate -d '{
"model": "gemma3",
"prompt": "Summarize why VRAM matters for local inference."
}'
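Once a model is loaded, it is worth confirming it actually landed on the GPU rather than spilling to CPU; ollama ps lists the loaded models and how each is split across processors.
ollama ps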
Use llama.cpp when you want direct control
The official README currently documents both local-file and Hugging Face startup paths.
llama-cli -m my_model.gguf                    # run a local GGUF file interactively
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF     # download and run a model from Hugging Face
llama-server -hf ggml-org/gemma-3-1b-it-GGUF  # serve the same model over HTTP
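When llama-server is running, it exposes an OpenAI-compatible HTTP API; assuming the default port of 8080, a quick smoke test looks like this.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in five words."}]}'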
Measure what matters
- Track first-token delay separately from sustained decode speed.
- From the Ollama API, watch load_duration, prompt_eval_count, prompt_eval_duration, eval_count, and eval_duration (a worked example follows this list).
- Test your real context length, not just a tiny smoke prompt that flatters the hardware.
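As a concrete way to pull those numbers, the sketch below asks the Ollama API for a non-streaming response and lets jq do the arithmetic; the prompt is arbitrary and jq is assumed to be installed. Durations are reported in nanoseconds, so decode speed is eval_count divided by eval_duration, times 1e9.
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma3",
  "prompt": "Explain KV cache growth in two sentences.",
  "stream": false
}' | jq '{load_ms: (.load_duration / 1e6),
          prompt_eval_ms: (.prompt_eval_duration / 1e6),
          gen_tokens: .eval_count,
          tokens_per_sec: (.eval_count / .eval_duration * 1e9)}'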
Configuration
Most home-server mistakes are configuration mistakes, not hardware mistakes. Set context, queueing, bind address, and model location deliberately.
Useful Ollama server variables
- OLLAMA_CONTEXT_LENGTH: default is 4096 tokens according to the official FAQ.
- OLLAMA_HOST: change the bind address when you need LAN access.
- OLLAMA_MODELS: move models to a larger SSD or dedicated volume.
- OLLAMA_KEEP_ALIVE: control how aggressively models stay resident.
- OLLAMA_MAX_LOADED_MODELS, OLLAMA_NUM_PARALLEL, and OLLAMA_MAX_QUEUE: tune concurrency carefully because memory demand rises fast; a quick way to trial values is shown after this list.
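Before committing anything to systemd, these variables can be tested inline on a foreground run; the values below are illustrative, not recommendations.
OLLAMA_HOST=0.0.0.0:11434 OLLAMA_CONTEXT_LENGTH=8192 OLLAMA_NUM_PARALLEL=1 ollama serve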
systemd override
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/srv/ollama/models"
Environment="OLLAMA_CONTEXT_LENGTH=8192"
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_NUM_PARALLEL=1"
sudo systemctl edit ollama       # opens an override file in your editor; paste the [Service] block above
sudo systemctl daemon-reload
sudo systemctl restart ollama
journalctl -e -u ollama          # tail the unit log to confirm the new settings took effect
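To confirm systemd actually picked up the override, the effective environment can also be inspected directly; this is generic systemd behavior, not anything Ollama-specific.
systemctl show ollama --property=Environment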
Local-only mode
{
"disable_ollama_cloud": true
}
The official FAQ also documents OLLAMA_NO_CLOUD=1 as an environment-variable alternative.
Advanced Usage
Multi-GPU behavior
The official Ollama FAQ says a model is loaded onto a single GPU if it fits entirely there; otherwise it is spread across all available GPUs. That is usually the right default because crossing the PCIe bus is still a tax.
- Prefer one larger GPU over two smaller GPUs when your budget allows it.
- Use multiple GPUs when your working set simply cannot live on one card.
- Do not assume model sharding automatically means the best latency.
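If you need to force everything onto one specific card while testing, restricting device visibility before the server starts is a common approach on NVIDIA systems; the device index here is an assumption about which GPU you want to use.
# expose only GPU 0 to the server (index is an assumption; adjust for your layout)
CUDA_VISIBLE_DEVICES=0 ollama serve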
Apple unified-memory path
If you are building around Apple silicon, the operational story is different. MLX is Apple-backed and explicitly built around shared memory on Apple hardware, which is why compact systems can feel disproportionately useful for local LLM work. See MLX and Ollama’s 2026 MLX preview note for the current direction of travel.
- Pick Mac Studio M4 Max when you want a compact always-on box with more than entry-level memory.
- Pick Mac Studio M3 Ultra when your workflows benefit from a much larger unified-memory pool.
- Do not buy Apple expecting cheap GPU upgrades later; buy it because the whole box already matches your target working set.
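If you want to experiment with MLX directly, the mlx-lm package ships a simple generation CLI; the package is real, but the model repository below is a placeholder and the exact entry point can vary between mlx-lm releases, so treat this as a sketch.
pip install mlx-lm
# model repo is a placeholder; pick a current conversion from the mlx-community org on Hugging Face
mlx_lm.generate --model mlx-community/your-chosen-model-4bit --prompt "Summarize why unified memory helps local inference."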
Sharing logs safely
When you post systemd logs, hostnames, or path layouts to a forum, strip sensitive details first. TechBytes’ Data Masking Tool is a clean fit for redacting machine names, tokens, IPs, and internal paths before you paste troubleshooting output into tickets or chats.
Frequently Asked Questions
How much VRAM do I need for a private LLM home server in 2026?
Is an Apple Mac Studio a good private LLM server?
Should I buy two smaller GPUs or one large GPU for local inference?
What matters more for local LLMs: CPU, RAM, or GPU?