Intermediate
8 min read
by Alex Rivera • May 14, 2024
Running Llama 3 on Linux gives you maximum control and performance. We'll use llama.cpp — a highly optimized C++ inference engine that natively supports NVIDIA CUDA, AMD ROCm, and CPU-only inference.
Introduction
While wrappers like Ollama and LM Studio are great for getting started quickly, they abstract away the underlying engine. If you want to squeeze every last drop of performance out of your Linux hardware, compiling llama.cpp directly from source is the gold standard.
This guide will walk you through compiling the engine with NVIDIA CUDA support, downloading the Llama 3 8B model, and running it directly from your terminal.
Prerequisites
First, we need to install the core build tools and the NVIDIA CUDA toolkit. Open your terminal and run:
Terminal
sudo apt update
sudo apt install build-essential git python3-pip
sudo apt install nvidia-cuda-toolkit
Verify that your NVIDIA drivers and the CUDA compiler are working properly:
Terminal
nvidia-smi
nvcc --version
If both commands print driver and version information, you are ready to build.
Step 1 Compilation
We will clone the llama.cpp repository from GitHub and compile it. The LLAMA_CUDA=1 flag tells make to build the binary with NVIDIA GPU acceleration, and -j$(nproc) parallelizes the build across all of your CPU cores.
Terminal
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j$(nproc) LLAMA_CUDA=1
Once the compilation finishes, you will have several executable files in the root directory, including llama-cli and llama-server.
Step 2 Downloading Llama 3 Weights
llama.cpp uses the GGUF (.gguf) file format. We can use the Hugging Face CLI to download a pre-quantized GGUF build of Meta's Llama 3 8B Instruct model directly to our server.
First, install the CLI:
Terminal
pip3 install -U "huggingface_hub[cli]"
Now, download the Q4_K_M quantized version of the model. This 4-bit quantization offers a good balance of speed and output quality, and at roughly 5 GB the file fits comfortably within 8GB of VRAM.
Terminal
huggingface-cli download bartowski/Meta-Llama-3-8B-Instruct-GGUF Meta-Llama-3-8B-Instruct-Q4_K_M.gguf --local-dir ./models
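As a rough sanity check on that size claim, you can estimate a quantized model's weight footprint from its parameter count and average bits per weight. The sketch below is a back-of-the-envelope calculation, not an exact formula; the ~4.8 bits/weight figure for Q4_K_M is an approximation, since K-quants mix 4-bit and 6-bit blocks plus per-block scale factors.

```python
# Back-of-the-envelope estimate of a quantized model's weight size.
# Assumptions: ~8.0B parameters for Llama 3 8B, ~4.8 average bits
# per weight for Q4_K_M (an approximation, not an exact figure).

def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk / in-VRAM size of the weights in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

llama3_8b_params = 8.0e9   # approximate parameter count
q4_k_m_bpw = 4.8           # assumed average bits per weight

size = quantized_size_gb(llama3_8b_params, q4_k_m_bpw)
print(f"Estimated weights: {size:.1f} GB")  # ~4.8 GB
```

The VRAM left over on an 8GB card then holds the KV cache and compute buffers, which is why Q4_K_M fits while the 16-bit original (roughly 16 GB of weights alone) does not.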
Step 3 Running Inference
It is time to chat with Llama 3. We will use the llama-cli tool.
The -ngl 99 flag is the most important part of this command: it sets the number of layers to offload to your NVIDIA GPU's VRAM. Llama 3 8B has only 32 transformer layers, so 99 simply means "offload everything."
Terminal
./llama-cli -m ./models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
-n 512 \
-ngl 99 \
--color \
-i -r "User:" \
-p "You are a helpful AI assistant.\n\nUser: Hello!\nAI:"
You should see the engine initialize, load the model into VRAM, and present you with an interactive prompt.
Step 4 The Local API Server
If you are hosting this on a headless Ubuntu server and want to access the model remotely (or via a UI like Open WebUI), you can run the built-in HTTP server.
Terminal
./llama-server -m ./models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -ngl 99 --host 0.0.0.0 --port 8080
Your high-performance, CUDA-accelerated backend is now listening at http://YOUR_SERVER_IP:8080. It also exposes OpenAI-compatible endpoints such as /v1/chat/completions, so it can act as a drop-in replacement for the OpenAI API.
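For example, any OpenAI-style client can talk to the server by posting a standard chat-completions payload. The sketch below uses only the Python standard library; the base URL and model name are placeholders for your own setup (llama-server serves whichever model it was launched with, regardless of the name in the request).

```python
import json
import urllib.request

def build_chat_request(prompt: str,
                       system: str = "You are a helpful AI assistant.") -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": "llama-3-8b-instruct",  # placeholder; the server uses its loaded model
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.7,
    }

def chat(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """POST to the server's OpenAI-compatible endpoint and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# Usage (requires a running llama-server):
#   reply = chat("Hello!")
```

Because the request shape matches the OpenAI API, front-ends like Open WebUI only need the base URL pointed at your server.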