Local LLMs in VS Code: Optimization Guide for M4 Chips
The M4 chip changed the game for local inference. Here is how to configure VS Code to run quantized 70B-parameter models locally with no network round-trip.
Cloud AI is great, but offline autocomplete that never leaves your machine is better. With Apple's M4 Max, running a quantized 70B model locally is now viable as a daily driver.
The Stack
- Engine: Ollama (v0.5.1+)
- Model: `codellama:70b-instruct-q4_K_M` (or `deepseek-coder:33b` for speed)
- Extension: Continue.dev (best M4 optimization)
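Before pulling either model, it helps to estimate whether its weights even fit in memory. The sketch below is a back-of-the-envelope calculation, assuming q4_K_M averages roughly 4.5 bits per weight (a common ballpark for that quant; the true figure varies with the layer mix):

```python
def quantized_size_gib(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Rough in-RAM size of a quantized model's weights in GiB (weights only, no KV cache)."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 2**30  # bits -> bytes -> GiB

# Ballpark figures for the two models above:
print(f"codellama 70B @ q4_K_M: ~{quantized_size_gib(70):.1f} GiB")
print(f"deepseek-coder 33B:     ~{quantized_size_gib(33):.1f} GiB")
```

The 70B model lands in the high-30s of GiB for weights alone, which is why the smaller 33B model is the pragmatic pick on lower-memory SKUs.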
VS Code Config for Speed
Two settings matter: raise the keep-alive so Ollama doesn't unload the model between completions, and bump the context window to take advantage of the unified memory.
```jsonc
// settings.json
"continue.local.keepAlive": 3600,       // keep the model loaded for 1 hour
"continue.local.contextWindow": 16384,  // M4 can handle this easily
```
Warning: this will eat about 40 GB of unified memory. Make sure you have the 64 GB (or larger) SKU of the M4 Max.
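The 16384-token window costs memory on top of the weights, because the KV cache grows linearly with context length. A rough estimate, assuming Llama-2-70B-style dimensions for the 70B model (80 layers, 8 grouped-query KV heads of dim 128, fp16 cache; treat these numbers as assumptions, not a spec):

```python
def kv_cache_gib(context_len: int, n_layers: int = 80,
                 n_kv_heads: int = 8, head_dim: int = 128,
                 bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB: 2 (K and V) x layers x KV width x context x precision."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_len / 2**30

# Cache on top of the ~40 GB of weights -- part of why 64 GB is the floor.
print(f"KV cache @ 16384 tokens: ~{kv_cache_gib(16384):.1f} GiB")
```

Under these assumptions the cache adds about 5 GiB at a 16K context, so the combined footprint sits comfortably only on the 64 GB+ machines.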