High-Throughput Serving with vLLM on Ubuntu

Intermediate • 8 min read
By Alex Rivera • May 14, 2024

If you are building an application with multiple concurrent users, Ollama and LM Studio won't cut it. You need an enterprise-grade inference engine. You need vLLM.

Introduction

vLLM is a high-throughput and memory-efficient LLM serving engine. It uses a technique called PagedAttention to effectively manage attention keys and values, yielding 2-4x higher throughput than standard HuggingFace Transformers.
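To build intuition for why paging helps, here is a toy sketch (an illustration only, not vLLM's actual implementation): instead of reserving one contiguous KV-cache region per sequence sized for the maximum possible length, tokens are stored in fixed-size blocks allocated on demand. The block size of 16 below is an assumption for illustration.

```python
# Toy sketch of paged KV-cache allocation (illustration only, not vLLM's code).
BLOCK_SIZE = 16  # tokens per block (assumed value for this example)

def blocks_needed(num_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of KV-cache blocks a sequence of num_tokens occupies."""
    return -(-num_tokens // block_size)  # ceiling division

# Contiguous pre-allocation must reserve space for the maximum length up
# front; paged allocation grows with the actual sequence length.
max_len = 2048
actual_len = 100

contiguous_tokens = max_len                            # reserved per sequence
paged_tokens = blocks_needed(actual_len) * BLOCK_SIZE  # allocated on demand

print(contiguous_tokens, paged_tokens)  # 2048 vs 112
```

The unused slack per sequence shrinks from thousands of token slots to at most one partially filled block, which is what lets vLLM batch far more sequences into the same VRAM.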

This guide assumes you are running a Linux server with at least one high-end NVIDIA GPU (e.g., RTX 3090, 4090, or A100).

Prerequisites

vLLM requires Python and CUDA. We highly recommend using a virtual environment like conda or standard venv.

Terminal
sudo apt update
sudo apt install python3-pip python3-venv

# Create and activate environment
python3 -m venv vllm-env
source vllm-env/bin/activate
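Before installing anything, it's worth confirming the NVIDIA driver is actually visible. A small Python sketch (the parsing helper is split out so it works with or without a GPU present):

```python
# Sanity check: is an NVIDIA GPU visible before we install vLLM?
import shutil
import subprocess

def parse_gpu_query(csv_text: str) -> list[dict]:
    """Parse `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader` output."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, mem = (field.strip() for field in line.split(","))
        gpus.append({"name": name, "memory": mem})
    return gpus

if shutil.which("nvidia-smi"):
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    print(parse_gpu_query(out))
else:
    print("nvidia-smi not found -- install the NVIDIA driver first.")
```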

Step 1 Installing vLLM

Install vLLM directly from PyPI. This will automatically pull down PyTorch and the necessary CUDA dependencies.

Terminal
pip install vllm

Step 2 Running an OpenAI-Compatible Server

vLLM natively supports the OpenAI API protocol. You can serve a model directly from the command line. We'll use Llama 3 8B as an example.

Terminal
python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --dtype auto \
  --api-key your-secret-key \
  --port 8000

Key Parameters Explained:

- --model: The Hugging Face repository ID. vLLM will download it automatically.
- --dtype auto: Automatically picks bfloat16 or float16 for maximum performance.
- --api-key: Secures your endpoint so random internet users can't spam your GPU.
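Once the server is up, a quick smoke test is to list the available models. A minimal sketch using only the standard library (the base URL and key match the launch command above):

```python
# Smoke test: list models from the OpenAI-compatible /v1/models endpoint.
import json
import urllib.request

def build_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build an authenticated GET request for the model list."""
    return urllib.request.Request(
        f"{base_url}/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )

if __name__ == "__main__":
    req = build_request("http://localhost:8000", "your-secret-key")
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["data"][0]["id"])
```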

Step 3 Connecting to the API

You can now hit this server using the standard OpenAI Python SDK, just by swapping the base_url.

Python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-secret-key"
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Explain PagedAttention."}
    ],
)

print(response.choices[0].message.content)
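For interactive applications you will usually want streaming, so tokens appear as they are generated rather than after the full response is ready. A sketch using the same SDK, with the helper written to accept any client object:

```python
# Streaming sketch: print tokens as the server generates them.
def stream_chat(client, model: str, prompt: str) -> str:
    """Stream a chat completion and return the concatenated text."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            parts.append(delta)
    return "".join(parts)

if __name__ == "__main__":
    from openai import OpenAI
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-secret-key")
    stream_chat(client, "meta-llama/Meta-Llama-3-8B-Instruct", "Explain PagedAttention.")
```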

Step 4 Optimizing for Production

If your model is too large for a single GPU, you can split it across several using tensor parallelism. Keep in mind that a 70B model in fp16 needs roughly 140 GB for the weights alone, so it calls for something like two 80 GB A100s (or a quantized variant on smaller cards). Just add the --tensor-parallel-size flag:

Terminal
python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 2
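A quick back-of-the-envelope check tells you whether a model's weights even fit across your GPUs. This sketch assumes fp16/bf16 (2 bytes per parameter) and ignores KV cache and activation overhead, so treat the result as a lower bound:

```python
# Back-of-the-envelope VRAM check (weights only; KV cache and activations
# add substantial overhead on top of this).
def weight_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Model weight footprint in GiB, assuming fp16/bf16 by default."""
    return num_params * bytes_per_param / 1024**3

def fits(num_params: float, num_gpus: int, gib_per_gpu: float) -> bool:
    """True if the weights alone fit under the per-GPU budget."""
    return weight_gib(num_params) / num_gpus <= gib_per_gpu

print(round(weight_gib(70e9), 1))  # ~130.4 GiB for a 70B model in fp16
print(fits(70e9, 2, 24))           # False: two 24 GiB cards can't hold it
print(fits(70e9, 2, 80))           # True: two 80 GiB A100s can
print(fits(8e9, 1, 24))            # True: 8B fits on one 24 GiB card
```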

With vLLM running, your Linux server can now handle dozens of concurrent requests, with continuous batching keeping throughput high and per-request latency degrading gracefully under load.