Run Ollama on Windows Natively

By Alex Rivera • May 14, 2024

No more WSL headaches. Ollama now runs natively on Windows as a standalone application. It automatically detects your NVIDIA or AMD graphics card and accelerates your local AI inference right out of the box.

Introduction

In the past, running local LLMs on Windows required installing the Windows Subsystem for Linux (WSL) and wrestling with driver passthrough. Today, Ollama provides a native Windows .exe that talks directly to your GPU drivers — CUDA for NVIDIA cards and ROCm for AMD.

Step 1 Installation

  1. Go to ollama.com/download.
  2. Click Windows and download the .exe installer.
  3. Double-click the installer to run it.

Ollama will install itself and place an icon in your system tray (bottom right corner of your taskbar).

Step 2 Pulling Your First Model

Open a new PowerShell or Command Prompt window. Let's pull Llama 3, Meta's capable 8-billion-parameter model.

Terminal
ollama run llama3

What happens next?

- Ollama connects to the registry.
- It downloads the ~4.7GB model weights to your local C:\Users\<YourUser>\.ollama folder.
- It drops you into an interactive chat prompt.

You can now type a prompt such as "Write a Python script to scrape a website" and watch your PC generate code locally.

Hardware Limits

Windows PCs typically rely on discrete GPUs (VRAM) rather than Unified Memory like Macs. To run an AI model fast, it must fit entirely inside your VRAM.

Your VRAM             Max Model Size           Recommended Models
6GB to 8GB            ~7B to 8B parameters     Llama 3 (8B), Mistral (7B), Gemma (2B)
12GB to 16GB          ~13B to 14B parameters   Qwen 2.5 (14B), Command R
24GB (RTX 3090/4090)  ~30B parameters          Mixtral (8x7B)

If a model exceeds your VRAM, Ollama will automatically offload the remaining layers to your much slower system RAM (CPU).
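As a rough fit check, a quantized model needs about (parameters × bits-per-weight ÷ 8) bytes of VRAM, plus headroom for the KV cache and runtime overhead. The sketch below encodes that rule of thumb; the 4.5 bits/weight default (typical of 4-bit quantization) and the 20% headroom factor are my assumptions, not official Ollama figures.

```python
# Back-of-the-envelope VRAM estimate for a quantized model.
# bits_per_weight=4.5 approximates common 4-bit quantization;
# the 20% headroom for KV cache/overhead is a rough assumption.

def estimated_vram_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Estimate VRAM needed for a quantized model, in GB."""
    weights_gb = params_billion * bits_per_weight / 8  # weight tensors alone
    return round(weights_gb * 1.2, 1)                  # +20% headroom (assumption)

def fits(params_billion: float, vram_gb: float) -> bool:
    """True if the model should fit entirely in the given VRAM."""
    return estimated_vram_gb(params_billion) <= vram_gb

print(estimated_vram_gb(8))     # Llama 3 8B: roughly the ~4.7GB download plus headroom
print(fits(8, 8), fits(14, 8))  # an 8B model fits in 8GB of VRAM; a 14B model does not
```

This matches the table above: an 8B model lands comfortably in an 8GB card, while a 14B model spills over and gets offloaded to system RAM.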

Step 3 GPU Acceleration

Ollama automatically detects your hardware:

- If you have an NVIDIA card, it uses CUDA.
- If you have an AMD card, it uses ROCm.

To verify GPU usage, open the Task Manager (Ctrl + Shift + Esc), go to the Performance tab, and select your GPU. Send a large prompt to Ollama and watch your "Dedicated GPU Memory" fill up and a compute graph (shown as "Cuda" or "3D", depending on your driver) spike.
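You can also ask Ollama itself how a loaded model is split between GPU and CPU: `ollama ps` (backed by the `/api/ps` endpoint) reports each model's total footprint (`size`) and the portion resident in VRAM (`size_vram`). Here's a small sketch that computes the GPU share from one such entry — the sample values are made up for illustration, not real output.

```python
def gpu_share(ps_model: dict) -> float:
    """Fraction of a loaded model held in VRAM, from one /api/ps entry.

    `size` is the model's total memory footprint and `size_vram` is the
    portion resident on the GPU (both in bytes).
    """
    return ps_model["size_vram"] / ps_model["size"]

# Illustrative entry shaped like one item of the /api/ps "models" list;
# the numbers here are invented.
sample = {"name": "llama3:latest", "size": 5_000_000_000, "size_vram": 5_000_000_000}
print(f"{gpu_share(sample):.0%} GPU")  # 100% means no layers were offloaded to CPU
```

Anything under 100% means some layers spilled into system RAM, which is where inference slows down sharply.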

Step 4 The Local API

Ollama runs a local API server in the background automatically. You can plug this endpoint into VS Code extensions or Python scripts.

Terminal
curl.exe http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Note: use curl.exe rather than curl in Windows PowerShell, where curl is an alias for Invoke-WebRequest.
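The same request works from Python with only the standard library. This is a minimal sketch mirroring the curl example above — a one-shot, non-streaming call whose JSON answer carries the generated text in the "response" field.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> bytes:
    """JSON body matching the curl example: one-shot, non-streaming."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama server and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]  # the generated text

# Usage (with the Ollama server running):
#   print(generate("llama3", "Why is the sky blue?"))
```

Because the server listens on localhost:11434 by default, the same function works from VS Code extensions, notebooks, or any script on the machine.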

Your Windows PC is now a fully functional, private AI server!