Intermediate
8 min read
by Alex Rivera • May 14, 2024
If you want maximum performance, absolute control, and zero bloatware, compiling llama.cpp directly from the source code using the NVIDIA CUDA Toolkit is the only way to fly. Here is exactly how to do it on Windows.
Introduction
llama.cpp is the underlying C++ inference engine that powers many popular local AI tools, including Ollama and LM Studio. By compiling and running it natively from the Windows terminal, you strip away UI overhead and gain direct control over GPU offload and VRAM usage through command-line flags.
Prerequisites
You need to install the build tools required to compile C++ code on Windows with CUDA support.
- Install Git for Windows.
- Install CMake (ensure it's added to your PATH).
- Install Visual Studio Build Tools 2022 (select "Desktop development with C++").
- Install the NVIDIA CUDA Toolkit (required for GPU acceleration).
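Before compiling, it's worth confirming that each tool is actually reachable from the terminal. Run these in the Developer Command Prompt you'll use for the build; each should print a version string rather than a "not recognized" error:

```shell
git --version
cmake --version
nvcc --version
where cl
```

If nvcc is missing, re-run the CUDA Toolkit installer; if cl isn't found, you are likely in a plain terminal instead of the Developer Command Prompt.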
Step 1 Compilation
Open a Developer Command Prompt for VS 2022 (search for this in your Windows Start menu).
Clone the repository and configure the build with the -DLLAMA_CUDA=ON flag so CMake compiles the CUDA backend and the binaries can leverage your NVIDIA GPU.
Terminal
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cd build
cmake .. -DLLAMA_CUDA=ON
cmake --build . --config Release
Once compiled, the executable files will be located in build\bin\Release\.
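As a quick sanity check (assuming the build finished without errors), list the output folder and print the main binary's help text. From the llama.cpp root folder:

```shell
dir build\bin\Release
.\build\bin\Release\llama-cli.exe --help
```

Note that the exact executable names can vary between llama.cpp versions, so check the directory listing if llama-cli.exe isn't there.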
Step 2 Downloading Weights
We need to download a model in the GGUF format. We will use huggingface-cli in a standard PowerShell window, run from the llama.cpp root folder so the paths in the following steps line up.
Terminal
pip install -U huggingface_hub
huggingface-cli download bartowski/Meta-Llama-3-8B-Instruct-GGUF Meta-Llama-3-8B-Instruct-Q4_K_M.gguf --local-dir ./models
Step 3 Running Inference
Now, let's chat with the model. The -ngl 99 flag tells the engine to offload up to 99 layers (all of them, for an 8B model) to your NVIDIA GPU's VRAM.
Terminal
.\build\bin\Release\llama-cli.exe -m .\models\Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -n 512 -ngl 99 --color -i -r "User:" -p "You are a helpful AI assistant.
User: Hello!
AI:"
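If the full model doesn't fit in your VRAM (for example, on an 8 GB card), you can offload only part of it: a lower -ngl value keeps the remaining layers on the CPU. The right number depends on your GPU, so treat the value below as a starting point to tune:

```shell
.\build\bin\Release\llama-cli.exe -m .\models\Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -n 512 -ngl 20 --color -i -r "User:" -p "You are a helpful AI assistant.
User: Hello!
AI:"
```

Watch the startup log: it reports how many layers were actually offloaded to the GPU.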
Step 4 Local Server
If you want to host an OpenAI-compatible API endpoint directly from the terminal, use the llama-server executable:
Terminal
.\build\bin\Release\llama-server.exe -m .\models\Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -ngl 99 --port 8080
Your high-performance CUDA backend is now listening at http://127.0.0.1:8080.
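To verify the endpoint, you can send an OpenAI-style chat completion request from a second terminal. The JSON body below follows the standard /v1/chat/completions schema; llama-server serves whichever model it was started with, so the model field is just a placeholder:

```shell
curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"local\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}], \"max_tokens\": 64}"
```

The reply comes back as JSON, with the generated text under choices[0].message.content.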