Intermediate
8 min read
by Alex Rivera • May 14, 2024
Running Llama 3 on Linux gives you maximum control and performance. We'll use llama.cpp — a highly optimized C++ inference engine that natively supports NVIDIA CUDA, AMD ROCm, and CPU-only inference.
Introduction
While wrappers like Ollama and LM Studio are great for getting started quickly, they abstract away the underlying engine. If you want to squeeze every last drop of performance out of your Linux hardware, compiling llama.cpp directly from source is the gold standard.
This guide will walk you through compiling the engine with NVIDIA CUDA support, downloading the Llama 3 8B model, and running it directly from your terminal.
Prerequisites
First, we need to install the core build tools and the NVIDIA CUDA toolkit. Open your terminal and run:
Terminal
sudo apt update
sudo apt install build-essential git python3-pip
sudo apt install nvidia-cuda-toolkit
Verify that your NVIDIA drivers and the CUDA compiler are working properly:
Terminal
nvidia-smi
nvcc --version
If both commands print driver and version information, you are ready to build.
Step 1 Compilation
We will clone the llama.cpp repository from GitHub and compile it. The LLAMA_CUDA=1 flag tells make to build the binary with NVIDIA GPU acceleration, and -j$(nproc) parallelizes the build across all of your CPU cores.
Terminal
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j$(nproc) LLAMA_CUDA=1
Once the compilation finishes, you will have several executable files in the root directory, including llama-cli and llama-server.
Step 2 Downloading Llama 3 Weights
llama.cpp uses the GGUF (.gguf) file format. We can use the Hugging Face CLI to download a pre-quantized GGUF build of Meta's Llama 3 8B Instruct model directly to our server.
First, install the CLI:
Terminal
pip3 install -U "huggingface_hub[cli]"
Now, download the Q4_K_M quantized version of the model. This 4-bit quantization offers a good balance of speed and output quality, and at roughly 5 GB the file fits comfortably within 8GB of VRAM.
Terminal
huggingface-cli download bartowski/Meta-Llama-3-8B-Instruct-GGUF Meta-Llama-3-8B-Instruct-Q4_K_M.gguf --local-dir ./models
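As a rough sanity check on that size claim, you can estimate a quantized model's weight footprint from its parameter count and average bits per weight. The sketch below is a back-of-the-envelope calculation, not an exact formula; the ~4.8 bits/weight figure for Q4_K_M is an approximation, since K-quants mix 4-bit and 6-bit blocks plus per-block scale factors.

```python
# Back-of-the-envelope estimate of a quantized model's weight size.
# Assumptions: ~8.0B parameters for Llama 3 8B, ~4.8 average bits
# per weight for Q4_K_M (an approximation, not an exact figure).

def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk / in-VRAM size of the weights in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

llama3_8b_params = 8.0e9   # approximate parameter count
q4_k_m_bpw = 4.8           # assumed average bits per weight

size = quantized_size_gb(llama3_8b_params, q4_k_m_bpw)
print(f"Estimated weights: {size:.1f} GB")  # ~4.8 GB
```

The VRAM left over on an 8GB card then holds the KV cache and compute buffers, which is why Q4_K_M fits while the 16-bit original (roughly 16 GB of weights alone) does not.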
Step 3 Running Inference
It is time to chat with Llama 3. We will use the llama-cli tool.
The -ngl 99 flag is the most important part of this command: it sets the number of layers to offload to your NVIDIA GPU's VRAM. Llama 3 8B has only 32 transformer layers, so 99 simply means "offload everything."
Terminal
./llama-cli -m ./models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
-n 512 \
-ngl 99 \
--color \
-i -r "User:" \
-p "You are a helpful AI assistant.\n\nUser: Hello!\nAI:"
You should see the engine initialize, load the model into VRAM, and present you with an interactive prompt.
Step 4 The Local API Server
If you are hosting this on a headless Ubuntu server and want to access the model remotely (or via a UI like Open WebUI), you can run the built-in HTTP server.
Terminal
./llama-server -m ./models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -ngl 99 --host 0.0.0.0 --port 8080
Your high-performance, CUDA-accelerated backend is now listening at http://YOUR_SERVER_IP:8080. It also exposes OpenAI-compatible endpoints such as /v1/chat/completions, so it can act as a drop-in replacement for the OpenAI API.
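For example, any OpenAI-style client can talk to the server by posting a standard chat-completions payload. The sketch below uses only the Python standard library; the base URL and model name are placeholders for your own setup (llama-server serves whichever model it was launched with, regardless of the name in the request).

```python
import json
import urllib.request

def build_chat_request(prompt: str,
                       system: str = "You are a helpful AI assistant.") -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": "llama-3-8b-instruct",  # placeholder; the server uses its loaded model
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.7,
    }

def chat(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """POST to the server's OpenAI-compatible endpoint and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# Usage (requires a running llama-server):
#   reply = chat("Hello!")
```

Because the request shape matches the OpenAI API, front-ends like Open WebUI only need the base URL pointed at your server.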