Intermediate
8 min read
by Alex Rivera • May 14, 2024
If you want maximum performance, absolute control, and zero bloatware, compiling llama.cpp directly from the source code using the NVIDIA CUDA Toolkit is the only way to fly. Here is exactly how to do it on Windows.
Introduction
llama.cpp is the underlying C++ inference engine that powers many popular local AI tools, including Ollama and LM Studio. By compiling and running it natively from the Windows terminal, you strip away UI overhead and gain direct control over GPU offload and VRAM usage through command-line flags.
Prerequisites
You need to install the build tools required to compile C++ code on Windows with CUDA support.
- Install Git for Windows.
- Install CMake (ensure it's added to your PATH).
- Install Visual Studio Build Tools 2022 (select "Desktop development with C++").
- Install the NVIDIA CUDA Toolkit (required for GPU acceleration).
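Before compiling, it's worth confirming that each tool is actually reachable from the terminal. Run these in the Developer Command Prompt you'll use for the build; each should print a version string rather than a "not recognized" error:

```shell
git --version
cmake --version
nvcc --version
where cl
```

If nvcc is missing, re-run the CUDA Toolkit installer; if cl isn't found, you are likely in a plain terminal instead of the Developer Command Prompt.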
Step 1 Compilation
Open a Developer Command Prompt for VS 2022 (search for this in your Windows Start menu).
Clone the repository and configure the build with the -DLLAMA_CUDA=ON flag so CMake compiles the CUDA backend and the binaries can leverage your NVIDIA GPU.
Terminal
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cd build
cmake .. -DLLAMA_CUDA=ON
cmake --build . --config Release
Once compiled, the executable files will be located in build\bin\Release\.
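As a quick sanity check (assuming the build finished without errors), list the output folder and print the main binary's help text. From the llama.cpp root folder:

```shell
dir build\bin\Release
.\build\bin\Release\llama-cli.exe --help
```

Note that the exact executable names can vary between llama.cpp versions, so check the directory listing if llama-cli.exe isn't there.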
Step 2 Downloading Weights
We need to download a model in the GGUF format. We will use huggingface-cli in a standard PowerShell window, run from the llama.cpp root folder so the paths in the following steps line up.
Terminal
pip install -U huggingface_hub
huggingface-cli download bartowski/Meta-Llama-3-8B-Instruct-GGUF Meta-Llama-3-8B-Instruct-Q4_K_M.gguf --local-dir ./models
Step 3 Running Inference
Now, let's chat with the model. The -ngl 99 flag tells the engine to offload up to 99 layers (all of them, for an 8B model) to your NVIDIA GPU's VRAM.
Terminal
.\build\bin\Release\llama-cli.exe -m .\models\Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -n 512 -ngl 99 --color -i -r "User:" -p "You are a helpful AI assistant.
User: Hello!
AI:"
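If the full model doesn't fit in your VRAM (for example, on an 8 GB card), you can offload only part of it: a lower -ngl value keeps the remaining layers on the CPU. The right number depends on your GPU, so treat the value below as a starting point to tune:

```shell
.\build\bin\Release\llama-cli.exe -m .\models\Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -n 512 -ngl 20 --color -i -r "User:" -p "You are a helpful AI assistant.
User: Hello!
AI:"
```

Watch the startup log: it reports how many layers were actually offloaded to the GPU.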
Step 4 Local Server
If you want to host an OpenAI-compatible API endpoint directly from the terminal, use the llama-server executable:
Terminal
.\build\bin\Release\llama-server.exe -m .\models\Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -ngl 99 --port 8080
Your high-performance CUDA backend is now listening at http://127.0.0.1:8080.
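To verify the endpoint, you can send an OpenAI-style chat completion request from a second terminal. The JSON body below follows the standard /v1/chat/completions schema; llama-server serves whichever model it was started with, so the model field is just a placeholder:

```shell
curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"local\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}], \"max_tokens\": 64}"
```

The reply comes back as JSON, with the generated text under choices[0].message.content.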