# Llama.cpp on Mac: The Power User's Guide

macOS Sonoma · Intermediate · 8 min read
by Alex Rivera · May 14, 2024

If you want maximum performance, absolute control, and zero bloatware, compiling llama.cpp directly from source is the only way to fly. Here is exactly how to do it on Apple Silicon.

## Introduction

llama.cpp is the C++ inference engine that powers most local AI tools, including Ollama and LM Studio. Running it natively from the terminal strips away the UI overhead and gives you direct control over every performance flag.

## Prerequisites

You need Apple's Xcode Command Line Tools and Homebrew installed to compile the C++ code:

```shell
xcode-select --install
brew install cmake python3
```

## Step 1: Compilation

Clone the repository and compile it with the `LLAMA_METAL=1` flag so the build targets Apple's GPU. (On Apple Silicon, recent versions of llama.cpp enable Metal by default, so the flag mainly makes the intent explicit.)

```shell
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j LLAMA_METAL=1
```

## Step 2: Downloading Weights

llama.cpp loads models in the `.gguf` format. We will use `huggingface-cli` to download a 4-bit quantization of Llama 3 8B Instruct:

```shell
pip3 install huggingface-hub
huggingface-cli download bartowski/Meta-Llama-3-8B-Instruct-GGUF \
  Meta-Llama-3-8B-Instruct-Q4_K_M.gguf --local-dir ./models
```

## Step 3: Running Inference

Now, let's chat with the model. The `-ngl 99` flag tells the engine to offload all layers to your Mac's GPU.
```shell
./llama-cli -m ./models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
  -n 512 \
  -ngl 99 \
  --color \
  -i -r "User:" \
  -p "You are a helpful AI assistant.\n\nUser: Hello!\nAI:"
```

## Step 4: Local Server

If you want to host an OpenAI-compatible API endpoint directly from the terminal, use the `llama-server` binary:

```shell
./llama-server -m ./models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -ngl 99 --port 8080
```

Your high-performance backend is now listening at http://127.0.0.1:8080.
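To confirm the server works, you can hit its OpenAI-compatible `/v1/chat/completions` route from any HTTP client. Below is a minimal sketch using only Python's standard library; it assumes the server from Step 4 is running on port 8080, and the `build_chat_request` and `chat` helpers are illustrative names, not part of llama.cpp itself.

```python
import json
import urllib.request

def build_chat_request(messages, max_tokens=64):
    """Build the JSON body for a POST to /v1/chat/completions."""
    return {"messages": messages, "max_tokens": max_tokens}

def chat(messages, host="http://127.0.0.1:8080", max_tokens=64):
    """Send a chat request to a local llama-server and return the reply text."""
    body = json.dumps(build_chat_request(messages, max_tokens)).encode()
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    # The response follows the OpenAI schema: choices -> message -> content
    return data["choices"][0]["message"]["content"]

# Example usage (requires the server from Step 4 to be running):
#   reply = chat([{"role": "user", "content": "Hello!"}])
#   print(reply)
```

Because the endpoint speaks the OpenAI wire format, the official `openai` Python client (pointed at `base_url="http://127.0.0.1:8080/v1"`) should work here as well.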