Tech Bytes
Local AI Feb 15, 2026

Local LLMs in VS Code: Optimization Guide for M4 Chips

The M4 chip changed the game for local inference. Here is how to configure VS Code to run a quantized 70B-parameter model locally with no network round-trip.

Cloud AI is great, but offline autocomplete with zero network latency is better. With Apple's M4 Max, running a quantized 70B model locally is now viable as a daily driver.

The Stack

  • Engine: Ollama (v0.5.1+)
  • Model: `codellama:70b-instruct-q4_K_M` (or `deepseek-coder:33b` for speed)
  • Extension: Continue.dev (best M4 optimization)
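Assuming Ollama is already installed, pulling the models above is one command each (tags as they appear in the Ollama model library):

```shell
# Pull the quantized 70B model (a large download, on the order of 40GB)
ollama pull codellama:70b-instruct-q4_K_M

# Optional: the smaller, faster alternative
ollama pull deepseek-coder:33b

# Confirm both are available locally
ollama list
```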

VS Code Config for Speed

The trick is two settings: raise the keep-alive so the model stays resident in memory between completions, and widen the context window to match what the hardware can hold.

```json
// settings.json
{
  "continue.local.keepAlive": 3600,       // keep the model loaded for 1 hour
  "continue.local.contextWindow": 16384   // M4 Max handles this comfortably
}
```
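If the extension setting doesn't take effect, Ollama itself accepts a `keep_alive` parameter on its HTTP API. This sketch assumes the default endpoint on `localhost:11434`; sending a request with no prompt loads the model without generating anything:

```shell
# Load the model and pin it in memory for 1 hour (no prompt = load only)
curl http://localhost:11434/api/generate -d '{
  "model": "codellama:70b-instruct-q4_K_M",
  "keep_alive": "1h"
}'

# Verify which models are currently loaded and when they expire
ollama ps
```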

Warning: This will eat about 40GB of unified memory. Ensure you have the 64GB+ SKU of the M4 Max.
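The 40GB figure checks out with a back-of-envelope calculation. The sketch below assumes roughly 4.5 effective bits per weight for q4_K_M quantization (a common estimate, not an exact spec):

```shell
# Back-of-envelope: params (billions) * bits/weight / 8 bits-per-byte = GB of weights
# Assumes ~4.5 effective bits per weight for q4_K_M (an estimate)
awk 'BEGIN { printf "%.1f GB\n", 70 * 4.5 / 8 }'
```

That covers the weights alone; the KV cache for a 16K context adds several more GB on top, which is why the 64GB SKU is the practical floor.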

Master AI Engineering Today 🏗️

Join 50,000+ developers getting high-signal technical briefings. Zero AI slop, just engineering patterns.


No spam. Unsubscribe anytime.