macOS Sonoma
Intermediate
8 min read
by Alex Rivera • May 14, 2024
The Cost of Cloud Autocomplete vs Local AI
With GitHub Copilot active, VS Code transmits your editor context to Microsoft's Azure-hosted OpenAI infrastructure on virtually every completion request. That's not a conspiracy theory — it's the documented product architecture. Your proprietary algorithms, internal business logic, unreleased features, and architectural decisions all leave your machine. For individual developers working on side projects, this tradeoff is often acceptable. For teams building commercially sensitive software, it's a material security and compliance risk.
Then there's the financial dimension.
GitHub Copilot Pricing Reality
| Plan | Monthly Cost | Annual Cost | Team of 10 |
|---|---|---|---|
| Copilot Individual | $10/month | $100/year | $1,000/year |
| Copilot Business | $19/month/user | $228/year/user | $2,280/year |
| Copilot Enterprise | $39/month/user | $468/year/user | $4,680/year |
For a mid-sized engineering organization of 50 developers on the Business tier, you're looking at $11,400 per year — paid indefinitely, to an external vendor, with no ownership of the underlying model or guarantee of pricing stability. Microsoft has already revised Copilot pricing upward since launch, and there is no structural reason to expect that trend to reverse.
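The arithmetic is easy to verify yourself. A quick sketch using the Business-tier rate from the table above and the article's 50-developer example:

```shell
# Back-of-envelope annual cost for a team on Copilot Business
# (rate from the pricing table; team size is the 50-developer example).
USERS=50
MONTHLY_PER_USER=19
echo "Annual Copilot Business cost for ${USERS} devs: \$$((USERS * MONTHLY_PER_USER * 12))"
# → Annual Copilot Business cost for 50 devs: $11400
```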
What "Local AI" Actually Means
Running a local model via Ollama changes the equation entirely. The compute cost is your existing hardware. The privacy model is absolute — no data leaves your machine. The latency is often lower than cloud-based completion because you eliminate the round-trip network overhead to remote inference servers.
Terminal
Cloud Copilot flow:
Keypress → VS Code → HTTPS → Azure OpenAI → Inference → Response → VS Code
Local Ollama flow:
Keypress → VS Code → localhost:11434 → Inference → Response → VS Code
The localhost inference path removes an entire category of failure modes: API outages, rate limiting, corporate proxy interference, and geographic latency variance. Developers behind strict enterprise firewalls who previously couldn't use Copilot at all can now run a fully capable code model without touching the network perimeter.
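Concretely, the entire "network" path on the local side is a single HTTP POST to the daemon. A minimal sketch of the kind of request Continue issues against Ollama's generate endpoint (the model name and prompt are illustrative; the final curl assumes the daemon is running, so it is left commented out):

```shell
# Build and sanity-check the request body locally; nothing leaves the machine.
PAYLOAD='{"model": "starcoder2:3b", "prompt": "def add(a, b):", "stream": false}'
echo "$PAYLOAD" | python3 -m json.tool

# Uncomment once the Ollama daemon is up and the model is pulled:
# curl -s http://localhost:11434/api/generate -d "$PAYLOAD"
```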
The Hidden Costs of Cloud Dependency
Beyond licensing fees, cloud autocomplete introduces operational coupling that is easy to underestimate:
- Vendor lock-in: Your team's workflow becomes dependent on a third-party service with unilateral control over pricing, availability, and terms of service.
- Compliance overhead: Many regulated industries (healthcare, finance, defense) require formal data processing agreements and security reviews for any tool that transmits code externally. Copilot's approval process is non-trivial in these environments.
- Context window leakage: Cloud models receive not just your current file but surrounding context buffers — adjacent files, recently opened tabs, and repository metadata depending on configuration.
- Audit trail gaps: When code suggestions come from an external black box, attributing IP provenance becomes complicated in litigation or licensing disputes.
Where Local Models Are Competitive Today
Modern quantized models like Code Llama 13B and StarCoder2 7B running on consumer hardware achieve completion quality that is genuinely competitive with Copilot for the most common autocomplete scenarios: boilerplate generation, function signature completion, test scaffolding, and common algorithm implementations. The gap narrows further when the model is operating on code in languages and patterns it was specifically trained on.
The calculus is straightforward: zero marginal cost, complete data sovereignty, and no external dependencies versus a growing SaaS bill and an implicit agreement to share your codebase with a cloud inference provider. For teams serious about long-term toolchain ownership, the local-first approach is not a compromise — it is the more defensible architectural choice.
Prerequisites: VS Code and Ollama
Before diving into the configuration, you need two foundational components locked in and verified. Skipping proper prerequisite validation is the single most common reason developers waste hours debugging what should be a straightforward local AI setup. Get these right, and everything downstream becomes mechanical.
Visual Studio Code
You need VS Code 1.80 or later. The Continue.dev extension depends on modern VS Code APIs for inline ghost-text completions, and older versions will silently fail or produce degraded behavior.
Verify your version from the command line:
Terminal
code --version
Expected output (yours may be newer):
Terminal
1.89.1
e170252f762678dec6ca2cc69aba1864a9a1f8ad
x64
If you're running a fork like VSCodium, Continue.dev is fully compatible — just ensure you're pulling the extension from the Open VSX Registry rather than the Microsoft Marketplace, since VSCodium ships without Microsoft's proprietary extension host by default.
Ollama
Ollama is the local model runtime that handles model downloads and quantization management, and exposes a REST API on localhost:11434 (including an OpenAI-compatible endpoint). This stable, well-documented local API is precisely why Continue.dev integrates with it so cleanly — no custom plugin architecture required.
Installation by platform:
| Platform | Command / Method |
|---|---|
| macOS | brew install ollama or download from ollama.com |
| Linux | curl -fsSL https://ollama.com/install.sh \| sh |
| Windows | Native installer from ollama.com/download |
After installation, start the Ollama daemon:
Terminal
ollama serve
On macOS, Ollama runs as a menu bar application automatically after installation. On Linux, you may want it as a systemd service:
Terminal
sudo systemctl enable ollama
sudo systemctl start ollama
Verify the API is live:
Terminal
curl http://localhost:11434/api/tags
A successful response returns a JSON object listing your locally installed models. An empty list is fine at this stage — model downloads come in a later step.
Hardware Considerations
This is where most tutorials leave you hanging. Your hardware directly determines which models are viable and at what quantization level.
| VRAM / Unified Memory | Recommended Tier | Expected Latency |
|---|---|---|
| 8 GB | 7B models at Q4_K_M | Acceptable (40–80 tok/s) |
| 16 GB | 13B models at Q4_K_M | Good (30–60 tok/s) |
| 24 GB+ | 34B models at Q4_K_M | Excellent (20–40 tok/s) |
| CPU only | 3B–7B at Q4 max | Slow but functional |
Ollama automatically offloads layers to your GPU via Metal (Apple Silicon), CUDA (NVIDIA), or ROCm (AMD). No manual configuration is required — Ollama detects your hardware and optimizes layer offloading transparently.
Critical note for Apple Silicon users: Unified memory is shared between CPU and GPU. A MacBook Pro M3 with 16 GB can run a 13B model comfortably because there is no discrete VRAM ceiling. This is one of the strongest arguments for Apple Silicon in local AI workflows.
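As a rough sizing sketch, you can estimate whether a model fits before pulling it. The 0.6 GB-per-billion-parameters figure and the ~1 GB overhead below are assumptions for Q4_K_M quantization, not official numbers:

```shell
# Estimate Q4_K_M memory footprint: ~0.6 GB per billion parameters plus
# ~1 GB of KV-cache and runtime overhead (assumed rule of thumb).
PARAMS_B=13
awk -v p="$PARAMS_B" 'BEGIN { printf "~%.1f GB needed for a %dB model at Q4\n", p * 0.6 + 1, p }'
# → ~8.8 GB needed for a 13B model at Q4
```

That lands comfortably inside the 16 GB tier from the table above, which matches the M3/16 GB example.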
Confirming Your Environment
Run this quick sanity check before proceeding:
Terminal
# Confirm VS Code version
code --version
# Confirm Ollama is running and responsive
curl -s http://localhost:11434/api/tags | python3 -m json.tool
# Confirm GPU acceleration (Ollama logs)
ollama run llama3.2:1b "say ok"
Watch the Ollama terminal output during model inference. You should see layer offload counts referencing your GPU. If everything runs on CPU, revisit your driver installation before pulling larger code models.
With both prerequisites confirmed, you're ready to install Continue.dev and wire everything together.
Step 1: Installing the Continue.dev Extension
Continue.dev is the cornerstone of this entire setup — an open-source AI code assistant that acts as a drop-in replacement for GitHub Copilot, but with one critical architectural difference: it routes your requests wherever you tell it to, including a locally running Ollama instance. No telemetry. No cloud egress. No subscription.
Installing from the VS Code Marketplace
The simplest installation path goes through the VS Code Extension Marketplace directly inside the editor:
- Open VS Code
- Press Ctrl+Shift+X (Windows/Linux) or Cmd+Shift+X (macOS) to open the Extensions panel
- Search for "Continue"
- Locate the extension published by Continue (the icon is a distinct purple gradient logo)
- Click Install
Alternatively, install it from the command line using the VS Code CLI:
Terminal
code --install-extension Continue.continue
Or if you're using VS Codium (the telemetry-free fork), the extension is available on the Open VSX Registry:
Terminal
codium --install-extension Continue.continue
Note: Continue.dev also supports JetBrains IDEs (IntelliJ, PyCharm, GoLand, etc.) via a separate plugin, but this guide focuses exclusively on the VS Code integration.
Verifying the Installation
Once installed, you should see two immediate changes in your VS Code environment:
| UI Element | Location | Purpose |
|---|---|---|
| Continue sidebar icon | Activity Bar (left panel) | Opens the chat interface |
| Inline ghost text | Code editor | Autocomplete suggestions |
| Keyboard shortcut Ctrl+L | Global | Focus the Continue chat panel |
| Keyboard shortcut Tab | Editor | Accept autocomplete suggestion |
Click the Continue icon in the Activity Bar to open the side panel. On first launch, Continue will present an onboarding flow that prompts you to connect a provider. You can skip or dismiss this — we'll wire up Ollama manually in the next step by editing config.json directly, which gives us far more control than the GUI wizard.
Understanding the Extension's Architecture
Before moving forward, it's worth understanding what Continue actually installs and how it communicates:
Terminal
VS Code Extension (Continue.dev)
│
▼
config.json ──────────────────────────────────────────────────┐
│ │
▼ ▼
Chat Provider (LLM) Autocomplete Provider (LLM)
e.g., Ollama → llama3 e.g., Ollama → starcoder2
│ │
└────────────────────────┬─────────────────────────────────┘
│
▼
http://localhost:11434 (Ollama REST API)
Continue separates chat and autocomplete into two independently configurable providers. This is a powerful distinction — you can run a large, instruction-tuned model like llama3 or deepseek-coder for conversational queries while using a lightweight, fill-in-the-middle model like starcoder2:3b for low-latency inline completions. No other mainstream Copilot alternative exposes this level of provider granularity.
Locating the config.json File
The entire behavior of Continue is controlled by a single JSON file. Know where it lives:
| Operating System | Path |
|---|---|
| macOS | ~/.continue/config.json |
| Linux | ~/.continue/config.json |
| Windows | %USERPROFILE%\.continue\config.json |
You can open it directly from within VS Code using the Continue panel's settings gear icon, or navigate to it manually in your terminal:
Terminal
# macOS / Linux
cat ~/.continue/config.json
# Windows (PowerShell)
Get-Content "$env:USERPROFILE\.continue\config.json"
With the extension installed and the config file located, you're ready to point Continue at your local Ollama instance.
Step 2: Configuring config.json for Ollama
Once Continue.dev is installed, the entire behavior of the extension is governed by a single file: config.json. This is where you wire Continue to your local Ollama instance, define which models handle chat versus autocomplete, and tune performance parameters. Getting this file right is the difference between a sluggish, unreliable setup and one that rivals GitHub Copilot in responsiveness.
Locating config.json
Continue stores its configuration in your home directory:
| Operating System | Path |
|---|---|
| macOS / Linux | ~/.continue/config.json |
| Windows | %USERPROFILE%\.continue\config.json |
You can also open it directly from VS Code using the Continue sidebar — click the gear icon at the bottom of the Continue panel, and it will open config.json in the editor.
The Minimal Ollama Configuration
Below is a production-ready baseline config.json that connects Continue to a locally running Ollama instance for both chat and tab autocomplete:
config.json
{
"models": [
{
"title": "CodeLlama 13B (Chat)",
"provider": "ollama",
"model": "codellama:13b",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "StarCoder2 3B (Autocomplete)",
"provider": "ollama",
"model": "starcoder2:3b",
"apiBase": "http://localhost:11434"
},
"tabAutocompleteOptions": {
"useCopyBuffer": false,
"maxPromptTokens": 1024,
"prefixPercentage": 0.85
},
"allowAnonymousTelemetry": false
}
Why two separate models? Chat models are optimized for instruction-following and multi-turn dialogue. Autocomplete models — particularly those trained with Fill-in-the-Middle (FIM) — are specifically tuned to predict code given a prefix and suffix context window. Conflating the two degrades both experiences.
Breaking Down Each Field
models array — This defines the models available in the Continue chat panel. You can add multiple entries and switch between them at runtime. Each object requires:
- provider: set to "ollama" for local inference
- model: the exact tag as it appears in ollama list
- apiBase: Ollama's default REST endpoint; only change this if you've remapped the port
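For example, if you have remapped the daemon's port via the OLLAMA_HOST environment variable (11500 below is purely an illustrative port), the model entry's apiBase must point at the same address — a sketch:

```json
{
  "title": "CodeLlama 13B (Chat)",
  "provider": "ollama",
  "model": "codellama:13b",
  "apiBase": "http://localhost:11500"
}
```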
tabAutocompleteModel — A dedicated entry for inline autocomplete. This model fires on every keystroke pause, so smaller and faster is better here. A 3B–7B parameter model with FIM support is the sweet spot.
tabAutocompleteOptions — Fine-grained control over the autocomplete engine:
| Option | Value | Purpose |
|---|---|---|
| useCopyBuffer | false | Prevents clipboard content from leaking into suggestions |
| maxPromptTokens | 1024 | Caps the context window sent per request — critical for latency |
| prefixPercentage | 0.85 | Allocates 85% of the token budget to code before the cursor |
allowAnonymousTelemetry — Set this to false. You're running local AI specifically to keep your code off third-party servers; there is no reason to send telemetry to Continue's analytics pipeline.
Verifying the Connection
After saving config.json, confirm Ollama is reachable from Continue by opening the VS Code Command Palette (Cmd+Shift+P / Ctrl+Shift+P) and running:
Terminal
Continue: Open Debug Panel
You should see a green status indicator and a successful model ping. If you encounter a connection refused error, verify Ollama is running with:
Terminal
ollama serve
# or check its status
curl http://localhost:11434/api/tags
A valid JSON response listing your pulled models confirms the API is live and Continue can proceed to serve completions.
Step 3: Downloading Code Llama or StarCoder2
With Continue.dev installed and your config.json wired up, the next critical decision is which model you actually pull down to your machine. For local code completion and chat, two models dominate the conversation: Code Llama (Meta) and StarCoder2 (BigCode/Hugging Face). Both run excellently through Ollama, but they have meaningfully different strengths. Let's break them down, then walk through the exact pull commands.
Model Comparison at a Glance
| Feature | Code Llama | StarCoder2 |
|---|---|---|
| Developer | Meta AI | BigCode (Hugging Face) |
| Parameter sizes | 7B, 13B, 34B, 70B | 3B, 7B, 15B |
| License | Llama 2 Community | BigCode OpenRAIL-M |
| Strengths | General coding + chat, Python | Multi-language fill-in-the-middle |
| Context window | 16K tokens | 16K tokens |
| Best use case | Chat + completion hybrid | Pure autocomplete / FIM |
| VRAM (7B variant) | ~4–5 GB | ~4–5 GB |
Recommendation: If you have 8 GB of VRAM or RAM to spare, start with codellama:7b-instruct for a balanced chat-plus-completion experience. If you're purely focused on in-line autocomplete quality and multi-language support, starcoder2:7b frequently edges it out on fill-in-the-middle (FIM) benchmarks.
Pulling Code Llama via Ollama
Ollama makes model management trivially simple. Open your terminal and run:
Terminal
# Lightweight 7B instruct variant — best starting point
ollama pull codellama:7b-instruct
# 7B code-specialized variant — better raw completion, less chat
ollama pull codellama:7b-code
# If you have the hardware headroom (13B is noticeably sharper)
ollama pull codellama:13b-instruct
The instruct variant understands natural language instructions and is ideal for the chat panel inside Continue.dev. The code variant is stripped down and tuned purely for completion tasks — think of it as the "autocomplete engine" variant.
After pulling, verify the model is available:
Terminal
ollama list
You should see output similar to:
Terminal
NAME ID SIZE MODIFIED
codellama:7b-instruct 8fdf8f752f6e 3.8 GB 2 minutes ago
Pulling StarCoder2 via Ollama
StarCoder2 was purpose-built for fill-in-the-middle (FIM) tasks — exactly what IDE autocomplete relies on. It was trained on over 600 programming languages from The Stack v2 dataset, making it exceptionally broad.
Terminal
# 3B — runs on nearly any modern machine, including CPU-only
ollama pull starcoder2:3b
# 7B — sweet spot for quality vs. resource usage
ollama pull starcoder2:7b
# 15B — requires ~10+ GB VRAM, but delivers near-Copilot quality
ollama pull starcoder2:15b
Test the model immediately from your terminal to confirm it's responding correctly:
Terminal
ollama run starcoder2:7b "Write a Python function to flatten a nested list."
If you get coherent, syntactically correct output, your model is ready for Continue.dev to consume.
A Note on Hardware Constraints
Don't let the larger parameter counts intimidate you. Quantized models (which Ollama pulls by default — typically Q4_K_M quantization) are dramatically smaller than their theoretical sizes. A 7B model in Q4 quantization runs comfortably on:
- Apple Silicon Macs (M1/M2/M3) with 16 GB unified memory — near-native GPU acceleration via Metal
- NVIDIA GPUs with 6–8 GB VRAM (RTX 3060, 4060, etc.)
- CPU-only machines with 16+ GB RAM — slower, but entirely functional for chat; autocomplete latency will be noticeable
If you're CPU-bound, strongly prefer starcoder2:3b or codellama:7b-code — their smaller memory footprint translates directly into faster token generation and a more responsive autocomplete experience.
Once your chosen model is pulled and verified, Continue.dev will automatically discover it through the Ollama API endpoint (http://localhost:11434) you configured in Step 2 — no additional linking required.
Best Practices for Local Autocomplete and Chat
Getting the stack running is only half the battle. Squeezing maximum productivity out of a local AI setup requires deliberate configuration choices, disciplined prompt habits, and a clear understanding of where local models excel versus where they'll frustrate you.
Model Selection Strategy
Not all models are created equal, and the right tool depends on the task at hand. Run separate models for autocomplete and chat rather than forcing a single model to do both.
| Task | Recommended Model | Why |
|---|---|---|
| Tab autocomplete | starcoder2:3b | Ultra-low latency, fill-in-the-middle optimized |
| Code chat / Q&A | codellama:13b | Deeper reasoning, handles context windows well |
| Refactoring | deepseek-coder:6.7b | Strong instruction following for rewrites |
| General chat | llama3:8b | Broad knowledge, good for non-code questions |
The key insight: autocomplete is latency-sensitive, chat is quality-sensitive. A 3B model responding in 150ms feels magical for completions. That same model giving a shallow answer to a complex architectural question feels broken.
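The arithmetic behind that feeling is simple. The completion length and throughput figures below are assumptions drawn from the hardware table earlier, not benchmarks:

```shell
# Time to emit a typical 20-token inline completion at two assumed speeds.
awk 'BEGIN {
  printf "3B model at 60 tok/s:  %.0f ms\n", 20 / 60 * 1000
  printf "13B model at 30 tok/s: %.0f ms\n", 20 / 30 * 1000
}'
# → 3B model at 60 tok/s:  333 ms
# → 13B model at 30 tok/s: 667 ms
```

Sub-400 ms sits under most developers' perception threshold for "instant" ghost text; doubling it is immediately noticeable on every keystroke pause.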
Tuning Autocomplete Behavior in config.json
The default autocomplete settings in Continue.dev are conservative. Push them for a better experience:
config.json
{
"tabAutocompleteModel": {
"title": "StarCoder2 3B",
"provider": "ollama",
"model": "starcoder2:3b",
"apiBase": "http://localhost:11434"
},
"tabAutocompleteOptions": {
"useCopyBuffer": false,
"maxPromptTokens": 1024,
"prefixPercentage": 0.85,
"multilineCompletions": "always",
"debounceDelay": 300
}
}
Key options explained:
debounceDelay: 300 — Waits 300ms after your last keystroke before firing a completion request. Lower values feel more responsive but hammer your GPU. Find your threshold.
prefixPercentage: 0.85 — Allocates 85% of the context window to the code before the cursor. This is ideal since the model needs maximum prefix context to produce relevant completions.
multilineCompletions: "always" — Forces the model to attempt completing entire blocks, not just single lines. Essential for boilerplate generation.
maxPromptTokens: 1024 — Keeps requests lean. Larger values improve quality marginally but increase latency significantly on consumer hardware.
Context Management for Chat
The chat interface is where most developers underutilize Continue.dev. Use the built-in context providers aggressively:
- @file — Reference a specific file directly in your prompt
- @codebase — Triggers an embeddings search across your entire repo (requires configuring an embeddings model)
- @terminal — Pastes your most recent terminal output into context
- @problems — Injects the VS Code Problems panel content, perfect for debugging compiler errors
Example workflow for debugging:
Terminal
@problems @file src/auth/middleware.ts
The JWT validation is throwing a 401 on every request even with a valid token.
Walk me through what could cause this and show me the fix.
This single prompt gives the model the error message, the relevant source file, and a clear problem statement — drastically improving output quality.
Hardware Optimization Tips
GPU users (NVIDIA/AMD):
Ensure Ollama is actually using your GPU and not falling back to CPU:
Terminal
# Verify GPU utilization while a model is loaded
ollama ps
# Check VRAM usage
nvidia-smi # NVIDIA
rocm-smi # AMD
If ollama ps shows 100% CPU, you likely have a VRAM overflow issue. Drop to a smaller quantization:
Terminal
# Pull a smaller quantized variant
ollama pull codellama:13b-code-q4_K_M
CPU-only users:
Keep autocomplete models at 3B parameters or below. Use q4_0 quantization for the smallest memory footprint, and set Ollama's num_thread model option to match your physical core count (not hyperthreaded logical cores):
Terminal
OLLAMA_NUM_PARALLEL=1 OLLAMA_MAX_LOADED_MODELS=1 ollama serve
Running two models simultaneously on CPU will cause thrashing. Keep MAX_LOADED_MODELS=1 and accept the cold-load delay when switching between autocomplete and chat models.
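On Linux, those environment variables can be persisted for the systemd service with a standard drop-in file rather than setting them on every invocation. The path below follows systemd's drop-in conventions; adjust it if your distro packages the Ollama unit under a different name, and run `sudo systemctl daemon-reload && sudo systemctl restart ollama` afterwards. A sketch:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
```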
Prompt Discipline for Consistent Results
Local models are smaller and more sensitive to prompt quality than GPT-4. These habits will save you hours of frustration:
- Be explicit about language and framework. Don't assume the model knows you're in a Next.js 14 App Router project. State it.
- Paste the error, not just the description. Full stack traces with line numbers outperform vague descriptions by an order of magnitude.
- Request a specific output format. End prompts with "Show only the modified function, no explanation" or "Explain first, then show the code" depending on what you need.
- Use /clear between unrelated tasks. Context contamination from a previous conversation about Python can degrade responses in a new TypeScript session.
When to Fall Back to the Cloud
Local models are not a universal replacement for every use case. Be pragmatic:
| Scenario | Local is fine | Reach for cloud |
|---|---|---|
| Boilerplate generation | ✅ | |
| Single-file refactoring | ✅ | |
| Library documentation lookup | ✅ | |
| Cross-repo architectural review | | ✅ |
| Complex regex/algorithm generation | Borderline | ✅ |
| Sensitive production code | ✅ (privacy) | |
The privacy guarantee is non-negotiable for many teams — your code never leaves the machine. That alone justifies the local setup even if cloud models occasionally produce sharper answers.