Llama.cpp on Mac: The Power User's Guide

macOS Sonoma · Intermediate · 8 min read
By Alex Rivera • May 14, 2024

If you want maximum performance, absolute control, and zero bloatware, compiling llama.cpp directly from source is the only way to fly. Here is exactly how to do it on Apple Silicon.

Introduction

llama.cpp is the C++ inference engine that powers many popular local AI tools, including Ollama and LM Studio. Running it natively from the terminal strips away the UI overhead and gives you total control over performance flags.

Prerequisites

You need Apple's Xcode Command Line Tools and Homebrew installed to compile the C++ code.

Terminal
xcode-select --install
brew install cmake python3

Step 1 Compilation

We will clone the repository and compile it with the LLAMA_METAL=1 flag so the build leverages Apple's GPU. (Recent builds enable Metal by default on Apple Silicon, but setting the flag makes the intent explicit.)

Terminal
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j LLAMA_METAL=1

Step 2 Downloading Weights

We need a model in the .gguf format. We will use the huggingface-cli tool to download a 4-bit (Q4_K_M) quantization of Llama 3 8B.

Terminal
pip3 install huggingface-hub
huggingface-cli download bartowski/Meta-Llama-3-8B-Instruct-GGUF Meta-Llama-3-8B-Instruct-Q4_K_M.gguf --local-dir ./models
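As a rough sanity check before downloading, you can estimate how large a quantized model will be. The sketch below assumes Q4_K_M averages about 4.8 bits per weight (an approximate figure; the actual file is around 4.9 GB because of metadata and mixed-precision tensors):

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough on-disk size of a quantized GGUF model, ignoring metadata overhead."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9  # bits -> bytes -> GB

# Llama 3 8B at ~4.8 bits/weight (assumed average for Q4_K_M)
print(f"{quantized_size_gb(8.0, 4.8):.1f} GB")  # close to the ~4.9 GB download
```

The same arithmetic explains why an 8B model at Q4_K_M fits comfortably in 16 GB of unified memory while the full-precision weights would not.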

Step 3 Running Inference

Now, let's chat with the model. The -ngl 99 flag tells the engine to offload up to 99 layers to your Mac's GPU — more than the 33 offloadable layers in Llama 3 8B, so the entire model runs on the GPU. The -e flag makes the engine interpret the \n escapes in the prompt as real newlines.

Terminal
./llama-cli -m ./models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
  -n 512 \
  -ngl 99 \
  --color \
  -e \
  -i -r "User:" \
  -p "You are a helpful AI assistant.\n\nUser: Hello!\nAI:"

Step 4 Local Server

If you want to host an OpenAI-compatible API endpoint directly from the terminal, use the llama-server binary:

Terminal
./llama-server -m ./models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -ngl 99 --port 8080

Your high-performance backend is now listening at http://127.0.0.1:8080, with an OpenAI-compatible chat endpoint at /v1/chat/completions.
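Any OpenAI-style client can talk to this endpoint. As a minimal stdlib-only sketch (the "model" field is a placeholder — llama-server serves whichever model you loaded, so the name is arbitrary):

```python
import json
import urllib.request

def build_chat_request(prompt: str, base_url: str = "http://127.0.0.1:8080") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the local llama-server."""
    payload = {
        "model": "local",  # placeholder; llama-server uses the model it was started with
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With llama-server running (Step 4), send it and read the reply:
#   with urllib.request.urlopen(build_chat_request("Hello!")) as resp:
#       print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Because the request shape matches OpenAI's chat completions API, you can also point existing SDKs at the local base URL instead of hand-rolling HTTP.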