Replace GitHub Copilot: Ollama + Continue.dev

macOS Sonoma • Intermediate • 8 min read
By Alex Rivera • May 14, 2024

The Cost of Cloud Autocomplete vs. Local AI

Every keystroke you type in VS Code with GitHub Copilot active is transmitted to Microsoft's Azure-hosted OpenAI infrastructure. That's not a conspiracy theory — it's the documented product architecture. Your proprietary algorithms, internal business logic, unreleased features, and architectural decisions are all leaving your machine. For individual developers working on side projects, this tradeoff is often acceptable. For teams building commercially sensitive software, it's a material security and compliance risk.

Then there's the financial dimension.

GitHub Copilot Pricing Reality

Plan                 Monthly Cost     Annual Cost      Team of 10 (annual)
Copilot Individual   $10/month        $100/year        $1,000/year
Copilot Business     $19/user/month   $228/user/year   $2,280/year
Copilot Enterprise   $39/user/month   $468/user/year   $4,680/year

For a mid-sized engineering organization of 50 developers on the Business tier, you're looking at $11,400 per year — paid indefinitely, to an external vendor, with no ownership of the underlying model or guarantee of pricing stability. Microsoft has already revised Copilot pricing upward since launch, and there is no structural reason to expect that trend to reverse.
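As a sanity check on those figures, the arithmetic is simple enough to script. A minimal sketch using the per-seat prices quoted in the table above (note that the Individual tier also offers discounted annual billing at $100/year, which this simple monthly model doesn't capture):

```python
# Back-of-envelope Copilot spend, using the per-seat monthly prices quoted above.
MONTHLY_PER_SEAT = {"individual": 10, "business": 19, "enterprise": 39}

def annual_cost(tier: str, seats: int) -> int:
    """Annual cost in dollars: 12 monthly payments per seat."""
    return MONTHLY_PER_SEAT[tier] * 12 * seats

print(annual_cost("business", 50))    # 11400 -- the figure cited in the text
print(annual_cost("enterprise", 10))  # 4680
```

The local-first alternative has no term in this equation at all: once the hardware exists, the marginal cost per seat per month is zero.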

What "Local AI" Actually Means

Running a local model via Ollama changes the equation entirely. The compute cost is your existing hardware. The privacy model is absolute — no data leaves your machine. The latency is often lower than cloud-based completion because you eliminate the round-trip network overhead to remote inference servers.

Terminal
Cloud Copilot flow:
Keypress → VS Code → HTTPS → Azure OpenAI → Inference → Response → VS Code

Local Ollama flow:
Keypress → VS Code → localhost:11434 → Inference → Response → VS Code

The localhost inference path removes an entire category of failure modes: API outages, rate limiting, corporate proxy interference, and geographic latency variance. Developers behind strict enterprise firewalls who previously couldn't use Copilot at all can now run a fully capable code model without touching the network perimeter.
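Because everything rides on that localhost endpoint, the only health check you ever need is a simple HTTP probe. A minimal sketch (the helper function name is ours, not part of any tool; it hits Ollama's real /api/tags endpoint):

```python
import urllib.request
import urllib.error

def ollama_reachable(base_url: str = "http://localhost:11434",
                     timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at the Ollama endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False

print(ollama_reachable())  # True once `ollama serve` is answering locally
```

No DNS, no TLS handshake, no proxy traversal: if this returns True, your completion pipeline's entire transport layer is working.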

The Hidden Costs of Cloud Dependency

Beyond licensing fees, cloud autocomplete introduces operational coupling that is easy to underestimate:

  • Vendor lock-in: Your team's workflow becomes dependent on a third-party service with unilateral control over pricing, availability, and terms of service.
  • Compliance overhead: Many regulated industries (healthcare, finance, defense) require formal data processing agreements and security reviews for any tool that transmits code externally. Copilot's approval process is non-trivial in these environments.
  • Context window leakage: Cloud models receive not just your current file but surrounding context buffers — adjacent files, recently opened tabs, and repository metadata depending on configuration.
  • Audit trail gaps: When code suggestions come from an external black box, attributing IP provenance becomes complicated in litigation or licensing disputes.

Where Local Models Are Competitive Today

Modern quantized models like Code Llama 13B and StarCoder2 7B running on consumer hardware achieve completion quality that is genuinely competitive with Copilot for the most common autocomplete scenarios: boilerplate generation, function signature completion, test scaffolding, and common algorithm implementations. The gap narrows further when the model is operating on code in languages and patterns it was specifically trained on.

The calculus is straightforward: zero marginal cost, complete data sovereignty, and no external dependencies versus a growing SaaS bill and an implicit agreement to share your codebase with a cloud inference provider. For teams serious about long-term toolchain ownership, the local-first approach is not a compromise — it is the more defensible architectural choice.

Prerequisites: VS Code and Ollama

Before diving into the configuration, you need two foundational components locked in and verified. Skipping proper prerequisite validation is the single most common reason developers waste hours debugging what should be a straightforward local AI setup. Get these right, and everything downstream becomes mechanical.


Visual Studio Code

You need VS Code 1.80 or later. The Continue.dev extension depends on modern VS Code APIs for inline ghost-text completions, and older versions will silently fail or produce degraded behavior.

Verify your version from the command line:

Terminal
code --version

Expected output (yours may be newer):

Terminal
1.89.1
e170252f762678dec6ca2cc69aba1864a9a1f8ad
x64

If you're running a fork like VSCodium, Continue.dev is fully compatible — just ensure you're pulling the extension from the Open VSX Registry rather than the Microsoft Marketplace, since VSCodium ships without access to Microsoft's proprietary extension marketplace by default.


Ollama

Ollama is the local model runtime that handles model downloads and quantization management and exposes a simple REST API on localhost:11434 (including an OpenAI-compatible /v1 endpoint). This clean API surface is precisely why Continue.dev integrates with it so easily — no custom plugin architecture required.

Installation by platform:

Platform   Command / Method
macOS      brew install ollama, or download from ollama.com
Linux      curl -fsSL https://ollama.com/install.sh | sh
Windows    Native installer from ollama.com/download

After installation, start the Ollama daemon:

Terminal
ollama serve

On macOS, Ollama runs as a menu bar application automatically after installation. On Linux, you may want it as a systemd service:

Terminal
sudo systemctl enable ollama
sudo systemctl start ollama

Verify the API is live:

Terminal
curl http://localhost:11434/api/tags

A successful response returns a JSON object listing your locally installed models. An empty list is fine at this stage — model downloads come in a later step.

Terminal
{
  "models": []
}
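Once you pull models in a later step, this same endpoint is the quickest way to enumerate them programmatically. A sketch that parses the response shape shown above (fed a canned payload here rather than a live call; the function name is ours):

```python
import json

def installed_models(tags_json: str) -> list[str]:
    """Extract model names from an Ollama /api/tags response body."""
    return [m["name"] for m in json.loads(tags_json).get("models", [])]

# Example payload in the shape /api/tags returns after two pulls
sample = '{"models": [{"name": "codellama:7b-instruct"}, {"name": "starcoder2:3b"}]}'
print(installed_models(sample))  # ['codellama:7b-instruct', 'starcoder2:3b']
```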

Hardware Considerations

This is where most tutorials leave you hanging. Your hardware directly determines which models are viable and at what quantization level.

VRAM / Unified Memory Recommended Tier Expected Latency
8 GB 7B models at Q4_K_M Acceptable (40–80 tok/s)
16 GB 13B models at Q4_K_M Good (30–60 tok/s)
24 GB+ 34B models at Q4_K_M Excellent (20–40 tok/s)
CPU only 3B–7B at Q4 max Slow but functional
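The table above boils down to a simple lookup. As a rough rule of thumb, encoded directly from the table's thresholds (this is our own sizing heuristic, not anything Ollama exposes):

```python
def recommended_tier(vram_gb: float, gpu: bool = True) -> str:
    """Map available VRAM / unified memory to the model tier from the table."""
    if not gpu:
        return "3B-7B at Q4 max (CPU only)"
    if vram_gb >= 24:
        return "34B models at Q4_K_M"
    if vram_gb >= 16:
        return "13B models at Q4_K_M"
    if vram_gb >= 8:
        return "7B models at Q4_K_M"
    return "3B models or smaller"

print(recommended_tier(16))  # 13B models at Q4_K_M
```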

Ollama automatically offloads layers to your GPU via Metal (Apple Silicon), CUDA (NVIDIA), or ROCm (AMD). No manual configuration is required — Ollama detects your hardware and optimizes layer offloading transparently.

Critical note for Apple Silicon users: Unified memory is shared between CPU and GPU. A MacBook Pro M3 with 16 GB can run a 13B model comfortably because there is no discrete VRAM ceiling. This is one of the strongest arguments for Apple Silicon in local AI workflows.


Confirming Your Environment

Run this quick sanity check before proceeding:

Terminal
# Confirm VS Code version
code --version

# Confirm Ollama is running and responsive
curl -s http://localhost:11434/api/tags | python3 -m json.tool

# Confirm GPU acceleration (Ollama logs)
ollama run llama3.2:1b "say ok"

Watch the Ollama terminal output during model inference. You should see layer offload counts referencing your GPU. If everything runs on CPU, revisit your driver installation before pulling larger code models.

With both prerequisites confirmed, you're ready to install Continue.dev and wire everything together.

Step 1: Installing the Continue.dev Extension

Continue.dev is the cornerstone of this entire setup — an open-source AI code assistant that acts as a drop-in replacement for GitHub Copilot, but with one critical architectural difference: it routes your requests wherever you tell it to, including a locally running Ollama instance. No telemetry. No cloud egress. No subscription.

Installing from the VS Code Marketplace

The simplest installation path goes through the VS Code Extension Marketplace directly inside the editor:

  1. Open VS Code
  2. Press Ctrl+Shift+X (Windows/Linux) or Cmd+Shift+X (macOS) to open the Extensions panel
  3. Search for "Continue"
  4. Locate the extension published by Continue (the icon is a distinct purple gradient logo)
  5. Click Install

Alternatively, install it from the command line using the VS Code CLI:

Terminal
code --install-extension Continue.continue

Or if you're using VSCodium (the telemetry-free fork), the extension is available on the Open VSX Registry:

Terminal
codium --install-extension Continue.continue

Note: Continue.dev also supports JetBrains IDEs (IntelliJ, PyCharm, GoLand, etc.) via a separate plugin, but this guide focuses exclusively on the VS Code integration.


Verifying the Installation

Once installed, you should see these immediate changes in your VS Code environment:

UI Element / Shortcut                 Location                    Purpose
Continue sidebar icon                 Activity Bar (left panel)   Opens the chat interface
Inline ghost text                     Code editor                 Autocomplete suggestions
Keyboard shortcut Ctrl+L (Cmd+L on macOS)   Global                Focus the Continue chat panel
Keyboard shortcut Tab                 Editor                      Accept autocomplete suggestion

Click the Continue icon in the Activity Bar to open the side panel. On first launch, Continue will present an onboarding flow that prompts you to connect a provider. You can skip or dismiss this — we'll wire up Ollama manually in the next step by editing config.json directly, which gives us far more control than the GUI wizard.


Understanding the Extension's Architecture

Before moving forward, it's worth understanding what Continue actually installs and how it communicates:

Terminal
VS Code Extension (Continue.dev)
        │
        ▼
  config.json  ──────────────────────────────────────────────────┐
        │                                                          │
        ▼                                                          ▼
  Chat Provider (LLM)                               Autocomplete Provider (LLM)
  e.g., Ollama → llama3                             e.g., Ollama → starcoder2
        │                                                          │
        └────────────────────────┬─────────────────────────────────┘
                                 │
                                 ▼
                    http://localhost:11434  (Ollama REST API)

Continue separates chat and autocomplete into two independently configurable providers. This is a powerful distinction — you can run a large, instruction-tuned model like llama3 or deepseek-coder for conversational queries while using a lightweight, fill-in-the-middle model like starcoder2:3b for low-latency inline completions. Few mainstream Copilot alternatives expose this level of provider granularity.


Locating the config.json File

The entire behavior of Continue is controlled by a single JSON file. Know where it lives:

Operating System Path
macOS ~/.continue/config.json
Linux ~/.continue/config.json
Windows %USERPROFILE%\.continue\config.json

You can open it directly from within VS Code using the Continue panel's settings gear icon, or navigate to it manually in your terminal:

Terminal
# macOS / Linux
cat ~/.continue/config.json

# Windows (PowerShell)
Get-Content "$env:USERPROFILE\.continue\config.json"

With the extension installed and the config file located, you're ready to point Continue at your local Ollama instance.

Step 2: Configuring config.json for Ollama

Once Continue.dev is installed, the entire behavior of the extension is governed by a single file: config.json. This is where you wire Continue to your local Ollama instance, define which models handle chat versus autocomplete, and tune performance parameters. Getting this file right is the difference between a sluggish, unreliable setup and one that rivals GitHub Copilot in responsiveness.

Locating config.json

As covered in the previous step, Continue stores its configuration at ~/.continue/config.json (macOS/Linux) or %USERPROFILE%\.continue\config.json (Windows). You can also open it directly from VS Code using the Continue sidebar — click the gear icon at the bottom of the Continue panel, and it will open config.json in the editor.


The Minimal Ollama Configuration

Below is a production-ready baseline config.json that connects Continue to a locally running Ollama instance for both chat and tab autocomplete:

Terminal
{
  "models": [
    {
      "title": "CodeLlama 13B (Chat)",
      "provider": "ollama",
      "model": "codellama:13b",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "StarCoder2 3B (Autocomplete)",
    "provider": "ollama",
    "model": "starcoder2:3b",
    "apiBase": "http://localhost:11434"
  },
  "tabAutocompleteOptions": {
    "useCopyBuffer": false,
    "maxPromptTokens": 1024,
    "prefixPercentage": 0.85
  },
  "allowAnonymousTelemetry": false
}

Why two separate models? Chat models are optimized for instruction-following and multi-turn dialogue. Autocomplete models — particularly those trained with Fill-in-the-Middle (FIM) — are specifically tuned to predict code given a prefix and suffix context window. Conflating the two degrades both experiences.
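A malformed config.json tends to fail silently rather than loudly. Before restarting VS Code, a quick structural check can catch the common mistakes; this is a sketch of our own (the validation rules mirror the fields described below, and the function is not part of Continue):

```python
import json

REQUIRED_MODEL_KEYS = {"provider", "model"}

def config_problems(raw: str) -> list[str]:
    """Return a list of structural problems found in a Continue config.json string."""
    try:
        cfg = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    problems = []
    if not cfg.get("models"):
        problems.append("no chat models defined in 'models'")
    for i, m in enumerate(cfg.get("models", [])):
        missing = REQUIRED_MODEL_KEYS - m.keys()
        if missing:
            problems.append(f"models[{i}] missing {sorted(missing)}")
    if "tabAutocompleteModel" not in cfg:
        problems.append("no 'tabAutocompleteModel' entry (inline completion may fall back to defaults)")
    return problems

broken = '{"models": [{"title": "Chat"}]}'
print(config_problems(broken))
```

Point it at the real file with `config_problems(open(os.path.expanduser("~/.continue/config.json")).read())` and an empty list means the structure is sound.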


Breaking Down Each Field

models array — This defines the models available in the Continue chat panel. You can add multiple entries and switch between them at runtime. Each object requires:

  • provider: set to "ollama" for local inference
  • model: the exact tag as it appears in ollama list
  • apiBase: Ollama's default REST endpoint; only change this if you've remapped the port

tabAutocompleteModel — A dedicated entry for inline autocomplete. This model fires on every keystroke pause, so smaller and faster is better here. A 3B–7B parameter model with FIM support is the sweet spot.

tabAutocompleteOptions — Fine-grained control over the autocomplete engine:

Option Value Purpose
useCopyBuffer false Prevents clipboard content from leaking into suggestions
maxPromptTokens 1024 Caps the context window sent per request — critical for latency
prefixPercentage 0.85 Allocates 85% of the token budget to code before the cursor

allowAnonymousTelemetry — Set this to false. You're running local AI specifically to keep your code off third-party servers; there is no reason to send telemetry to Continue's analytics pipeline.


Verifying the Connection

After saving config.json, confirm Ollama is reachable from Continue by opening the VS Code Command Palette (Cmd+Shift+P / Ctrl+Shift+P) and running:

Terminal
Continue: Open Debug Panel

You should see a green status indicator and a successful model ping. If you encounter a connection refused error, verify Ollama is running with:

Terminal
ollama serve
# or check its status
curl http://localhost:11434/api/tags

A valid JSON response listing your pulled models confirms the API is live and Continue can proceed to serve completions.
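The most common "it connects but never completes" failure is a tag mismatch: config.json names a model that was never pulled. Given the two JSON documents you already have (your config and the /api/tags response), the cross-check is one set operation. A sketch, with a hypothetical helper name:

```python
import json

def missing_models(config_json: str, tags_json: str) -> set[str]:
    """Model tags referenced in a Continue config but absent from Ollama's tag list."""
    cfg = json.loads(config_json)
    wanted = {m["model"] for m in cfg.get("models", [])}
    if "tabAutocompleteModel" in cfg:
        wanted.add(cfg["tabAutocompleteModel"]["model"])
    pulled = {m["name"] for m in json.loads(tags_json).get("models", [])}
    # Note: pulling without an explicit tag installs "name:latest", which
    # would also surface here as a mismatch against a bare "name" in config.
    return wanted - pulled

cfg = '{"models": [{"provider": "ollama", "model": "codellama:13b"}]}'
tags = '{"models": [{"name": "starcoder2:3b"}]}'
print(missing_models(cfg, tags))  # {'codellama:13b'}
```

An empty set means every model Continue will ask for actually exists locally.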

Step 3: Downloading Code Llama or StarCoder2

With Continue.dev installed and your config.json wired up, the next critical decision is which model you actually pull down to your machine. For local code completion and chat, two models dominate the conversation: Code Llama (Meta) and StarCoder2 (BigCode/Hugging Face). Both run excellently through Ollama, but they have meaningfully different strengths. Let's break them down, then walk through the exact pull commands.


Model Comparison at a Glance

Feature Code Llama StarCoder2
Developer Meta AI BigCode (HuggingFace)
Parameter sizes 7B, 13B, 34B, 70B 3B, 7B, 15B
License Llama 2 Community BigCode OpenRAIL-M
Strengths General coding + chat, Python Multi-language fill-in-middle
Context window 16K tokens 16K tokens
Best use case Chat + completion hybrid Pure autocomplete / FIM
VRAM (7B variant) ~4–5 GB ~4–5 GB

Recommendation: If you have 8 GB of VRAM or RAM to spare, start with codellama:7b-instruct for a balanced chat-plus-completion experience. If you're purely focused on inline autocomplete quality and multi-language support, starcoder2:7b frequently edges it out on fill-in-the-middle (FIM) benchmarks.


Pulling Code Llama via Ollama

Ollama makes model management trivially simple. Open your terminal and run:

Terminal
# Lightweight 7B instruct variant — best starting point
ollama pull codellama:7b-instruct

# 7B code-specialized variant — better raw completion, less chat
ollama pull codellama:7b-code

# If you have the hardware headroom (13B is noticeably sharper)
ollama pull codellama:13b-instruct

The instruct variant understands natural language instructions and is ideal for the chat panel inside Continue.dev. The code variant is stripped down and tuned purely for completion tasks — think of it as the "autocomplete engine" variant.

After pulling, verify the model is available:

Terminal
ollama list

You should see output similar to:

Terminal
NAME                    ID              SIZE    MODIFIED
codellama:7b-instruct   8fdf8f752f6e    3.8 GB  2 minutes ago

Pulling StarCoder2 via Ollama

StarCoder2 was purpose-built for fill-in-the-middle (FIM) tasks — exactly what IDE autocomplete relies on. It was trained on over 600 programming languages from The Stack v2 dataset, making it exceptionally broad.

Terminal
# 3B — runs on nearly any modern machine, including CPU-only
ollama pull starcoder2:3b

# 7B — sweet spot for quality vs. resource usage
ollama pull starcoder2:7b

# 15B — requires ~10+ GB VRAM, but delivers near-Copilot quality
ollama pull starcoder2:15b

Test the model immediately from your terminal. StarCoder2 is a base completion model, not an instruction-tuned chat model, so feed it code to continue rather than a natural-language request:

Terminal
ollama run starcoder2:7b "def flatten(nested_list):"

If the model produces coherent, syntactically correct output, it's ready for Continue.dev to consume.
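Fill-in-the-middle is also visible at the prompt level. StarCoder-family models are trained with sentinel tokens (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`); Continue assembles prompts of this shape for you, but building one by hand shows what autocomplete actually sends (a sketch, assuming the StarCoder sentinel-token format):

```python
def fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a StarCoder-style fill-in-the-middle prompt.

    The model is asked to generate the text that belongs between
    prefix and suffix, which is exactly what inline autocomplete needs.
    """
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

before_cursor = "def flatten(nested):\n    out = []\n    for item in nested:\n"
after_cursor = "    return out\n"
print(fim_prompt(before_cursor, after_cursor))
```

This is why FIM-trained models feel so much better for tab completion than plain left-to-right models: they can condition on the code after your cursor, not just before it.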


A Note on Hardware Constraints

Don't let the larger parameter counts intimidate you. Quantized models (which Ollama pulls by default — typically Q4_K_M quantization) are dramatically smaller than their full-precision FP16 counterparts. A 7B model in Q4 quantization runs comfortably on:

  • Apple Silicon Macs (M1/M2/M3) with 16 GB unified memory — near-native GPU acceleration via Metal
  • NVIDIA GPUs with 6–8 GB VRAM (RTX 3060, 4060, etc.)
  • CPU-only machines with 16+ GB RAM — slower, but entirely functional for chat; autocomplete latency will be noticeable

If you're CPU-bound, strongly prefer starcoder2:3b or codellama:7b-code — their smaller memory footprint translates directly into faster token generation and a more responsive autocomplete experience.

Once your chosen model is pulled and verified, Continue.dev will automatically discover it through the Ollama API endpoint (http://localhost:11434) you configured in Step 2 — no additional linking required.

Best Practices for Local Autocomplete and Chat

Getting the stack running is only half the battle. Squeezing maximum productivity out of a local AI setup requires deliberate configuration choices, disciplined prompt habits, and a clear understanding of where local models excel versus where they'll frustrate you.


Model Selection Strategy

Not all models are created equal, and the right tool depends on the task at hand. Run separate models for autocomplete and chat rather than forcing a single model to do both.

Task Recommended Model Why
Tab autocomplete starcoder2:3b Ultra-low latency, fill-in-the-middle optimized
Code chat / Q&A codellama:13b Deeper reasoning, handles context windows well
Refactoring deepseek-coder:6.7b Strong instruction following for rewrites
General chat llama3:8b Broad knowledge, good for non-code questions

The key insight: autocomplete is latency-sensitive, chat is quality-sensitive. A 3B model responding in 150ms feels magical for completions. That same model giving a shallow answer to a complex architectural question feels broken.
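That latency/quality split is easy to quantify. A suggestion's perceived delay is roughly the debounce window plus token generation time, so (a rough model of our own, ignoring prompt-processing overhead):

```python
def perceived_delay_ms(completion_tokens: int, tok_per_s: float,
                       debounce_ms: int = 300) -> float:
    """Rough time-to-full-suggestion: debounce window + token generation."""
    return debounce_ms + completion_tokens * 1000.0 / tok_per_s

# A 25-token suggestion from a 3B model at ~80 tok/s vs a 13B at ~30 tok/s:
print(perceived_delay_ms(25, 80))  # 612.5
print(perceived_delay_ms(25, 30))
```

Roughly half a second versus over a full second per suggestion, which is the difference between completions that feel instant and completions you stop waiting for.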


Tuning Autocomplete Behavior in config.json

The default autocomplete settings in Continue.dev are conservative. Push them for a better experience:

Terminal
{
  "tabAutocompleteModel": {
    "title": "StarCoder2 3B",
    "provider": "ollama",
    "model": "starcoder2:3b",
    "apiBase": "http://localhost:11434"
  },
  "tabAutocompleteOptions": {
    "useCopyBuffer": false,
    "maxPromptTokens": 1024,
    "prefixPercentage": 0.85,
    "multilineCompletions": "always",
    "debounceDelay": 300
  }
}

Key options explained:

  • debounceDelay: 300 — Waits 300ms after your last keystroke before firing a completion request. Lower values feel more responsive but hammer your GPU. Find your threshold.
  • prefixPercentage: 0.85 — Allocates 85% of the context window to the code before the cursor. This is ideal since the model needs maximum prefix context to produce relevant completions.
  • multilineCompletions: "always" — Forces the model to attempt completing entire blocks, not just single lines. Essential for boilerplate generation.
  • maxPromptTokens: 1024 — Keeps requests lean. Larger values improve quality marginally but increase latency significantly on consumer hardware.
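To make prefixPercentage concrete, here is how the 1024-token budget from the settings above splits between code before and after the cursor (mirroring the arithmetic, not Continue's internal implementation):

```python
def token_budget(max_prompt_tokens: int = 1024,
                 prefix_percentage: float = 0.85) -> tuple[int, int]:
    """Split the autocomplete prompt budget into (prefix, suffix) token counts."""
    prefix = int(max_prompt_tokens * prefix_percentage)
    return prefix, max_prompt_tokens - prefix

print(token_budget())  # (870, 154)
```

870 tokens of code before the cursor and 154 after: enough suffix for the model to see where the completion must land, with the bulk of the context spent where it matters most.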

Context Management for Chat

The chat interface is where most developers underutilize Continue.dev. Use the built-in context providers aggressively:

  • @file — Reference a specific file directly in your prompt
  • @codebase — Triggers an embeddings search across your entire repo (requires configuring an embeddings model)
  • @terminal — Pastes your most recent terminal output into context
  • @problems — Injects the VS Code Problems panel content, perfect for debugging compiler errors

Example workflow for debugging:

Terminal
@problems @file src/auth/middleware.ts

The JWT validation is throwing a 401 on every request even with a valid token.
Walk me through what could cause this and show me the fix.

This single prompt gives the model the error message, the relevant source file, and a clear problem statement — drastically improving output quality.


Hardware Optimization Tips

GPU users (NVIDIA/AMD): Ensure Ollama is actually using your GPU and not falling back to CPU:

Terminal
# Verify GPU utilization while a model is loaded
ollama ps

# Check VRAM usage
nvidia-smi  # NVIDIA
rocm-smi    # AMD

If ollama ps shows 100% CPU, you likely have a VRAM overflow issue. Drop to a smaller quantization:

Terminal
# Pull a smaller quantized variant
ollama pull codellama:13b-code-q4_K_M

CPU-only users: Keep autocomplete models at 3B parameters or below. Use q4_0 quantization for the smallest memory footprint, and set the num_thread model option (via the Ollama API or a Modelfile) to match your physical core count, not the hyperthreaded count:

Terminal
OLLAMA_NUM_PARALLEL=1 OLLAMA_MAX_LOADED_MODELS=1 ollama serve

Running two models simultaneously on CPU will cause thrashing. Keep OLLAMA_MAX_LOADED_MODELS=1 and accept the cold-load delay when switching between the autocomplete and chat models.


Prompt Discipline for Consistent Results

Local models are smaller and more sensitive to prompt quality than GPT-4. These habits will save you hours of frustration:

  1. Be explicit about language and framework. Don't assume the model knows you're in a Next.js 14 App Router project. State it.
  2. Paste the error, not just the description. Full stack traces with line numbers outperform vague descriptions by an order of magnitude.
  3. Request a specific output format. End prompts with "Show only the modified function, no explanation" or "Explain first, then show the code" depending on what you need.
  4. Use /clear between unrelated tasks. Context contamination from a previous conversation about Python can degrade responses in a new TypeScript session.

When to Fall Back to the Cloud

Local models are not a universal replacement for every use case. Be pragmatic:

  • Local is fine: boilerplate generation, single-file refactoring, and library documentation lookups; also sensitive production code, where the privacy requirement makes local the only acceptable option.
  • Reach for the cloud: cross-repo architectural review, where larger context windows and stronger reasoning still matter.
  • Borderline: complex regex and algorithm generation; try local first and escalate if the output disappoints.

The privacy guarantee is non-negotiable for many teams — your code never leaves the machine. That alone justifies the local setup even if cloud models occasionally produce sharper answers.