Local LLMs in VS Code: Optimization Guide for M4 Chips
The M4 chip changed the game for local inference. Here is how to configure VS Code to run quantized 70B-parameter models locally with no network round-trip.
Cloud AI is great, but offline autocomplete that never leaves your machine is better. With Apple's M4 Max, running a quantized 70B model locally is now viable as a daily driver.
The Stack
- Engine: Ollama (v0.5.1+)
- Model: `codellama:70b-instruct-q4_K_M` (or `deepseek-coder:33b` for speed)
- Extension: Continue.dev (best M4 optimization)
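Before pulling either model, it helps to estimate whether its weights even fit in memory. The sketch below is a back-of-the-envelope calculation, assuming q4_K_M averages roughly 4.5 bits per weight (a common ballpark for that quant; the true figure varies with the layer mix):

```python
def quantized_size_gib(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Rough in-RAM size of a quantized model's weights in GiB (weights only, no KV cache)."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 2**30  # bits -> bytes -> GiB

# Ballpark figures for the two models above:
print(f"codellama 70B @ q4_K_M: ~{quantized_size_gib(70):.1f} GiB")
print(f"deepseek-coder 33B:     ~{quantized_size_gib(33):.1f} GiB")
```

The 70B model lands in the high-30s of GiB for weights alone, which is why the smaller 33B model is the pragmatic pick on lower-memory SKUs.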
VS Code Config for Speed
Two settings matter: raise the keep-alive so Ollama doesn't unload the model between completions, and bump the context window to take advantage of the unified memory.
```jsonc
// settings.json
"continue.local.keepAlive": 3600,       // keep the model loaded for 1 hour
"continue.local.contextWindow": 16384,  // M4 can handle this easily
```
Warning: this will eat about 40 GB of unified memory. Make sure you have the 64 GB (or larger) SKU of the M4 Max.
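The 16384-token window costs memory on top of the weights, because the KV cache grows linearly with context length. A rough estimate, assuming Llama-2-70B-style dimensions for the 70B model (80 layers, 8 grouped-query KV heads of dim 128, fp16 cache; treat these numbers as assumptions, not a spec):

```python
def kv_cache_gib(context_len: int, n_layers: int = 80,
                 n_kv_heads: int = 8, head_dim: int = 128,
                 bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB: 2 (K and V) x layers x KV width x context x precision."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_len / 2**30

# Cache on top of the ~40 GB of weights -- part of why 64 GB is the floor.
print(f"KV cache @ 16384 tokens: ~{kv_cache_gib(16384):.1f} GiB")
```

Under these assumptions the cache adds about 5 GiB at a 16K context, so the combined footprint sits comfortably only on the 64 GB+ machines.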