Run Ollama on Windows Natively

By Alex Rivera • May 14, 2024

No more WSL headaches. Ollama now runs natively on Windows as a standalone application. It automatically detects your NVIDIA or AMD graphics card and accelerates your local AI inference right out of the box.

Introduction

In the past, running local LLMs on Windows required installing the Windows Subsystem for Linux (WSL) and wrestling with driver passthrough. Today, Ollama provides a native Windows .exe that talks directly to your GPU drivers — CUDA for NVIDIA cards and ROCm for AMD.

Step 1 Installation

  1. Go to ollama.com/download.
  2. Click Windows and download the .exe installer.
  3. Double-click the installer to run it.

Ollama will install itself and place an icon in your system tray (bottom right corner of your taskbar).

Step 2 Pulling Your First Model

Open a new PowerShell or Command Prompt window. Let's pull Llama 3, Meta's capable 8-billion-parameter model.

Terminal
ollama run llama3

What happens next?

- Ollama connects to the registry.
- It downloads the ~4.7GB model weights to your local C:\Users\<YourUser>\.ollama folder.
- It drops you into an interactive chat prompt.

You can now type a prompt such as "Write a Python script to scrape a website" and watch your PC generate code locally.

Hardware Limits

Windows PCs typically rely on discrete GPUs (VRAM) rather than Unified Memory like Macs. To run an AI model fast, it must fit entirely inside your VRAM.

Your VRAM             Max Model Size           Recommended Models
6GB to 8GB            ~7B to 8B parameters     Llama 3 (8B), Mistral (7B), Gemma (2B)
12GB to 16GB          ~13B to 14B parameters   Qwen 2.5 (14B), Command R
24GB (RTX 3090/4090)  ~30B parameters          Mixtral (8x7B)

If a model exceeds your VRAM, Ollama will automatically offload the remaining layers to your much slower system RAM (CPU).
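As a rough fit check, a quantized model needs about (parameters × bits-per-weight ÷ 8) bytes of VRAM, plus headroom for the KV cache and runtime overhead. The sketch below encodes that rule of thumb; the 4.5 bits/weight default (typical of 4-bit quantization) and the 20% headroom factor are my assumptions, not official Ollama figures.

```python
# Back-of-the-envelope VRAM estimate for a quantized model.
# bits_per_weight=4.5 approximates common 4-bit quantization;
# the 20% headroom for KV cache/overhead is a rough assumption.

def estimated_vram_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Estimate VRAM needed for a quantized model, in GB."""
    weights_gb = params_billion * bits_per_weight / 8  # weight tensors alone
    return round(weights_gb * 1.2, 1)                  # +20% headroom (assumption)

def fits(params_billion: float, vram_gb: float) -> bool:
    """True if the model should fit entirely in the given VRAM."""
    return estimated_vram_gb(params_billion) <= vram_gb

print(estimated_vram_gb(8))     # Llama 3 8B: roughly the ~4.7GB download plus headroom
print(fits(8, 8), fits(14, 8))  # an 8B model fits in 8GB of VRAM; a 14B model does not
```

This matches the table above: an 8B model lands comfortably in an 8GB card, while a 14B model spills over and gets offloaded to system RAM.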

Step 3 GPU Acceleration

Ollama automatically detects your hardware:

- If you have an NVIDIA card, it uses CUDA.
- If you have an AMD card, it uses ROCm.

To verify GPU usage, open the Task Manager (Ctrl + Shift + Esc), go to the Performance tab, and select your GPU. Send a large prompt to Ollama and watch your "Dedicated GPU Memory" fill up and a compute graph (shown as "Cuda" or "3D", depending on your driver) spike.
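You can also ask Ollama itself how a loaded model is split between GPU and CPU: `ollama ps` (backed by the `/api/ps` endpoint) reports each model's total footprint (`size`) and the portion resident in VRAM (`size_vram`). Here's a small sketch that computes the GPU share from one such entry — the sample values are made up for illustration, not real output.

```python
def gpu_share(ps_model: dict) -> float:
    """Fraction of a loaded model held in VRAM, from one /api/ps entry.

    `size` is the model's total memory footprint and `size_vram` is the
    portion resident on the GPU (both in bytes).
    """
    return ps_model["size_vram"] / ps_model["size"]

# Illustrative entry shaped like one item of the /api/ps "models" list;
# the numbers here are invented.
sample = {"name": "llama3:latest", "size": 5_000_000_000, "size_vram": 5_000_000_000}
print(f"{gpu_share(sample):.0%} GPU")  # 100% means no layers were offloaded to CPU
```

Anything under 100% means some layers spilled into system RAM, which is where inference slows down sharply.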

Step 4 The Local API

Ollama runs a local API server in the background automatically. You can plug this endpoint into VS Code extensions or Python scripts.

Terminal
curl.exe http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Note: use curl.exe rather than curl in Windows PowerShell, where curl is an alias for Invoke-WebRequest.
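The same request works from Python with only the standard library. This is a minimal sketch mirroring the curl example above — a one-shot, non-streaming call whose JSON answer carries the generated text in the "response" field.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> bytes:
    """JSON body matching the curl example: one-shot, non-streaming."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama server and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]  # the generated text

# Usage (with the Ollama server running):
#   print(generate("llama3", "Why is the sky blue?"))
```

Because the server listens on localhost:11434 by default, the same function works from VS Code extensions, notebooks, or any script on the machine.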

Your Windows PC is now a fully functional, private AI server!