macOS Sonoma
Intermediate
8 min read
by Alex Rivera • May 14, 2024
You already own one of the most powerful local AI machines on the planet. Whether you're running a base MacBook Pro M3, a MacBook Pro M3 Max, or a Mac Studio — this guide will teach you how to unlock its full potential with Ollama. No cloud. No API bills. Just raw, private intelligence.
Introduction
Ollama requires macOS 11 Big Sur or later. For the best Apple Silicon GPU acceleration via Metal, however, you should be running macOS 14 Sonoma or later.
We will use the macOS Terminal. Press Cmd + Space and type "Terminal", or use a modern alternative like iTerm2 or Warp.
Step 1 Installing Ollama
You have two choices for installing Ollama on your Mac: the official macOS GUI installer or Homebrew (the package manager for macOS). We highly recommend Homebrew because it makes updating incredibly simple.
If you already have Homebrew installed, open your terminal and run:
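Terminal

```shell
brew install ollama
```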
Once installed, start the Ollama background service so it can listen for commands:
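Terminal

```shell
ollama serve
```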
(Note: Keep this terminal window open, or run brew services start ollama to have it run silently in the background on boot).
Step 2 Pulling Your First Model
Ollama makes downloading a Large Language Model (LLM) as easy as pulling a Docker container.
We will start with Meta's Llama 3 (8B parameters). It is fast, highly capable, and fits perfectly in the memory of any M3 Mac. Open a new terminal window and run:
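Terminal

```shell
ollama run llama3
```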
What happens next?
- Ollama connects to the registry.
- It downloads the 4.7GB model weights to your local drive.
- It drops you into an interactive chat prompt.
You can now type: Write a Python script to scrape a website, and watch your local Mac generate code instantly, completely offline.
Hardware and RAM Limits
Why are Apple Silicon Macs so good at AI? Unified Memory.
On a PC, you have System RAM and Graphics RAM (VRAM on the GPU). To run an AI model fast, it must fit entirely inside the VRAM. But on an M3 Mac, the CPU and the GPU share the same pool of memory. If you have a Mac with 36GB of Unified Memory, your GPU can address nearly all of it (macOS reserves a small slice for the system).
Here is exactly what you can run based on your Mac's RAM:
| Your Mac's RAM | Max Model Size | Recommended Models | Notes |
|---|---|---|---|
| 8GB (Base M3) | ~7B to 8B parameters | Llama 3 (8B), Mistral (7B), Gemma (2B) | Close other apps to avoid swapping memory. |
| 16GB / 18GB | ~13B to 14B parameters | Qwen 2.5 (14B), Command R | The sweet spot. Run Llama 3 (8B) blazing fast. |
| 36GB / 64GB | ~30B to 70B parameters | Mixtral (8x7B), Llama 3 (70B at Q2) | Desktop-class AI natively. |
| 128GB+ | ~120B+ parameters | Llama 3 (70B Q8), Command R+ | You own a personal supercomputer. |
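These tiers follow from simple arithmetic: a model's weight footprint is roughly its parameter count times the bits per weight of its quantization. A quick sketch (4 bits approximates common Q4 quantizations; real downloads run slightly larger because of mixed-precision layers and metadata):

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight footprint of a quantized model, in GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# Llama 3 8B at 4 bits/weight: ~4 GB of weights, close to the 4.7GB download
print(model_size_gb(8, 4))   # 4.0
# Llama 3 70B even at 2 bits/weight needs ~17.5 GB, hence the 36GB+ tier
print(model_size_gb(70, 2))  # 17.5
```

Leave headroom on top of this figure for the KV cache and the rest of macOS.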
Step 3 Verifying GPU Acceleration
How do you know Ollama is actually using your M3's GPU and not falling back to the slow CPU? Let's verify it empirically.
- Open Activity Monitor on your Mac (Cmd + Space -> "Activity Monitor").
- Press Cmd + 4 to open the GPU History window.
- Keep that window visible, and go back to your terminal running ollama run llama3.
- Give it a massive prompt: Write a 1000 word essay about the history of artificial intelligence.
Watch the GPU History graph. You should see a massive, sustained spike pegging your GPU to 90-100% utilization. If you see this, Apple's Metal acceleration is working perfectly!
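If you prefer the terminal, macOS also ships a built-in powermetrics tool that samples GPU activity directly (requires sudo; the exact output format varies by macOS version). This example takes five one-second samples:

```shell
sudo powermetrics --samplers gpu_power -i 1000 -n 5
```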
Step 4 Exposing the Local API
The terminal is great, but what if you want to use a beautiful web interface or integrate your local model into an app you're coding?
Ollama runs a local API server by default. Open a browser and go to:
http://localhost:11434
You can now hit this API via curl or Python, much as you would a hosted API:
Terminal
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Why is the sky blue?",
"stream": false
}'
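The same call works from Python using only the standard library. This is a minimal sketch of a client for the endpoint above; the model name and prompt are just examples:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> bytes:
    # Mirrors the curl example above; stream=False returns one JSON object
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The non-streaming response carries the full completion in "response"
        return json.loads(resp.read())["response"]

# Usage, with Ollama running locally:
# print(generate("llama3", "Why is the sky blue?"))
```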
You've successfully turned your M3 Mac into a private, offline AI server. Your data never leaves your machine, and you pay zero API fees.