Intermediate
8 min read
by Alex Rivera • May 14, 2024
Linux is the native home of machine learning. Running Ollama on Ubuntu or Debian gives you low latency and first-class driver integration for NVIDIA and AMD GPUs.
Introduction
Ollama provides a one-line install script for Linux that not only downloads the binary but also configures a systemd background service. This means your local AI API will start automatically when you boot your server or desktop.
Prerequisites
Before installing Ollama, ensure your GPU drivers are correctly installed.
For NVIDIA GPUs:
Terminal
# Install proprietary NVIDIA drivers and CUDA toolkit
sudo apt update
sudo apt install nvidia-driver-535 nvidia-cuda-toolkit
nvidia-smi # Verify drivers are working
For AMD GPUs:
Ollama supports AMD graphics cards via the ROCm platform. Ensure you have the latest amdgpu drivers installed for your specific distribution.
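If your Radeon card is not on ROCm's officially supported list, a commonly used workaround is to override the GPU architecture that ROCm detects via an environment variable on the Ollama service (set through a systemd drop-in). A sketch, assuming an RDNA 2 card; the 10.3.0 value varies by GPU generation, so verify the gfx target for your specific card before applying it:

```ini
[Service]
# HSA_OVERRIDE_GFX_VERSION tells ROCm to treat the card as a supported
# gfx target; 10.3.0 corresponds to RDNA 2 (assumption: adjust per GPU)
Environment="HSA_OVERRIDE_GFX_VERSION=10.3.0"
```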
Step 1 Installation
The official installation script handles everything for you. Run this in your terminal:
Terminal
curl -fsSL https://ollama.com/install.sh | sh
During installation, the script will automatically detect your NVIDIA or AMD GPU and download the appropriate acceleration libraries.
Step 2 Managing the Service
Ollama runs as a daemon. You can manage it using standard systemd commands:
Terminal
# Check if Ollama is running
sudo systemctl status ollama
# Restart the service (useful after pulling large models or updating drivers)
sudo systemctl restart ollama
# View live server logs
journalctl -u ollama -f
Step 3 Pulling and Running Models
Once the service is active, you can pull your first model and drop into the chat interface. Let's use Meta's Llama 3:
Terminal
ollama run llama3
To exit the interactive prompt, type /bye or press Ctrl + D.
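Beyond the interactive CLI, the systemd service exposes a REST API on port 11434. A minimal Python sketch using only the standard library, assuming Llama 3 is already pulled and the service is running locally:

```python
import json
import urllib.request

def build_payload(prompt, model="llama3"):
    # Ollama's /api/generate expects a JSON body; stream=False returns
    # the whole completion in a single response object.
    return json.dumps({"model": model, "prompt": prompt, "stream": False})

def generate(prompt, model="llama3", host="http://127.0.0.1:11434"):
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=build_payload(prompt, model).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama service):
# print(generate("Why is the sky blue?"))
```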
Hardware Limits
Because Linux has very low OS overhead, you can squeeze larger models into your VRAM compared to Windows.
| Your VRAM | Max Model Size | Recommended Models |
| --- | --- | --- |
| 8GB | ~8B parameters | Llama 3 (8B), Mistral (7B) |
| 16GB | ~14B parameters | Qwen 2.5 (14B), Command R |
| 24GB | ~30B parameters | Mixtral (8x7B) |
If you exceed your VRAM, Ollama will gracefully offload the remaining layers to your system RAM, though generation speed will drop significantly.
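As a rough rule of thumb behind the table above: a 4-bit quantized model needs about half a gigabyte of VRAM per billion parameters, plus headroom for the KV cache and runtime buffers. A back-of-the-envelope sketch; the overhead figure is an assumption, not an Ollama constant, and grows with context length:

```python
def estimate_vram_gb(params_billion, bits_per_weight=4, overhead_gb=1.5):
    # Weights: params * bits / 8 bytes. The overhead term is a rough
    # stand-in for the KV cache and runtime buffers (assumption).
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

# Llama 3 8B at 4-bit: ~5.5 GB, comfortably inside an 8GB card
print(estimate_vram_gb(8))
```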
Step 4 Network Access
By default, Ollama only listens on 127.0.0.1 (localhost). If you are running Linux on a headless server and want to reach the API from your MacBook or Windows PC, you need to bind it to all network interfaces.
Edit the systemd service:
Terminal
sudo systemctl edit ollama
Add the following lines:
Terminal
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Restart the service:
Terminal
sudo systemctl restart ollama
Your Linux AI server is now accessible from anywhere on your local network!
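To sanity-check the new binding from another machine, you can query the /api/tags endpoint, which lists the models installed on the server. A small Python sketch; the IP address is a placeholder for your server's actual LAN address:

```python
import json
import urllib.request

def api_url(base, path):
    # Join the server's base URL and an API path without doubling slashes
    return base.rstrip("/") + path

def list_models(base="http://192.168.1.50:11434"):  # placeholder LAN IP
    with urllib.request.urlopen(api_url(base, "/api/tags")) as resp:
        return [m["name"] for m in json.loads(resp.read())["models"]]

# Example (requires network access to the server):
# print(list_models())
```

If the request times out, check that your server's firewall (for example ufw) allows inbound traffic on port 11434.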