The Ultimate Guide: Run Ollama on Mac M3

macOS Sonoma • Intermediate • 8 min read
By Alex Rivera • May 14, 2024

You already own one of the most powerful local AI machines on the planet. Whether you're running a base MacBook Pro M3, a MacBook Pro M3 Max, or a Mac Studio — this guide will teach you how to unlock its full potential with Ollama. No cloud. No API bills. Just raw, private intelligence.


Introduction

Ollama requires macOS 11 Big Sur or later. However, for the best Apple Silicon GPU acceleration via Apple's Metal framework, you should be running macOS 14 Sonoma or later.

We will use the macOS Terminal. Press Cmd + Space and type "Terminal", or use a modern alternative like iTerm2 or Warp.


Step 1 Installing Ollama

You have two choices for installing Ollama on your Mac: the official macOS GUI installer or Homebrew (the package manager for macOS). We highly recommend Homebrew because it makes updating incredibly simple.

If you already have Homebrew installed, open your terminal and run:

Terminal
brew install ollama

Once installed, start the Ollama background service so it can listen for commands:

Terminal
ollama serve

(Note: keep this terminal window open, or run brew services start ollama to have it run silently in the background and restart automatically at login.)


Step 2 Pulling Your First Model

Ollama makes downloading a Large Language Model (LLM) as easy as pulling a Docker container.

We will start with Meta's Llama 3 (8B parameters). It is fast, highly capable, and fits perfectly in the memory of any M3 Mac. Open a new terminal window and run:

Terminal
ollama run llama3

What happens next?

  - Ollama connects to the registry.
  - It downloads the 4.7GB model weights to your local drive.
  - It drops you into an interactive chat prompt.

You can now type a prompt like "Write a Python script to scrape a website" and watch your local Mac generate code instantly, completely offline.


Hardware and RAM Limits

Why are Apple Silicon Macs so good at AI? Unified Memory.

On a PC, you have separate System RAM and Graphics RAM (VRAM on the GPU). To run an AI model fast, it must fit entirely inside the VRAM. On an M3 Mac, the CPU and the GPU share the same pool of memory. If you have a Mac with 36GB of Unified Memory, your GPU can address most of that pool (macOS reserves a slice of it for the system).

Here is exactly what you can run based on your Mac's RAM:

Your Mac's RAM | Max Model Size      | Recommended Models                     | Notes
8GB (base M3)  | ~7B-8B parameters   | Llama 3 (8B), Mistral (7B), Gemma (2B) | Close other apps to avoid memory swapping.
16GB / 18GB    | ~13B-14B parameters | Qwen 2.5 (14B), Phi-3 (14B)            | The sweet spot; Llama 3 (8B) runs blazing fast.
36GB / 64GB    | ~30B-70B parameters | Mixtral (8x7B), Llama 3 (70B at Q2)    | Desktop-class AI, natively.
128GB+         | ~120B+ parameters   | Llama 3 (70B at Q8), Command R+        | A personal supercomputer.
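The tiers above follow from simple arithmetic: a model's footprint is roughly parameter count × bytes per weight, plus overhead for the KV cache and runtime buffers. The ~20% overhead factor below is a rough working assumption, not an official Ollama figure:

```python
def estimate_model_gb(params_billion: float, bits_per_weight: int = 4,
                      overhead: float = 1.2) -> float:
    """Rough memory estimate for a quantized model.

    params_billion: parameter count in billions (e.g. 8 for Llama 3 8B)
    bits_per_weight: quantization level (4 for Q4, 8 for Q8, 16 for FP16)
    overhead: fudge factor for KV cache and runtime buffers (assumption)
    """
    return params_billion * bits_per_weight / 8 * overhead

# Llama 3 8B at Q4 lands around 4.8 GB -- close to the 4.7GB download.
print(round(estimate_model_gb(8, 4), 1))   # 4.8
# A 70B model at Q4 needs roughly 42 GB, hence the 36GB/64GB tier.
print(round(estimate_model_gb(70, 4), 1))  # 42.0
```

Dropping to a more aggressive quantization (Q2) roughly halves the footprint again, which is why Llama 3 (70B at Q2) squeezes into the 36GB tier.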

Step 3 Optimizing for Performance

How do you know Ollama is actually using your M3's GPU and not falling back to the slower CPU? Let's verify it.

  1. Open Activity Monitor on your Mac (Cmd + Space -> "Activity Monitor").
  2. Press Cmd + 4 to open the GPU History window.
  3. Keep that window visible, and go back to your terminal running ollama run llama3.
  4. Give it a massive prompt: Write a 1000 word essay about the history of artificial intelligence.

Watch the GPU History graph. You should see a massive, sustained spike pegging your GPU to 90-100% utilization. If you see this, Apple's Metal acceleration is working perfectly!


Step 4 Exposing the Local API

The terminal is great, but what if you want to use a beautiful web interface or integrate your local model into an app you're coding?

Ollama runs a local API server by default. Open a browser and go to http://localhost:11434 — you should see the message "Ollama is running".

You can now hit this API via curl or Python, much like a cloud provider's API (Ollama also exposes an OpenAI-compatible endpoint under /v1):

Terminal
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
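The same call works from Python using only the standard library — a minimal sketch, with no third-party client required:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> bytes:
    """Build the JSON body for a non-streaming /api/generate call."""
    return json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")

def generate(model: str, prompt: str) -> str:
    """POST to the local Ollama server and return the generated text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    try:
        # Requires `ollama serve` running and the llama3 model pulled.
        print(generate("llama3", "Why is the sky blue?"))
    except OSError:
        print("Ollama server not reachable on localhost:11434")
```

With stream set to True instead, the server returns a sequence of JSON objects, one per token chunk, which is what powers the live typing effect in chat UIs.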

You've successfully turned your Mac M3 into a private, offline AI server. Your data never leaves your machine, and you pay zero API fees.