Intermediate
8 min read
by Alex Rivera • May 14, 2024
If you are building an application with multiple concurrent users, Ollama and LM Studio won't cut it. You need an enterprise-grade inference engine. You need vLLM.
Introduction
vLLM is a high-throughput and memory-efficient LLM serving engine. It uses a technique called PagedAttention to effectively manage attention keys and values, yielding 2-4x higher throughput than standard HuggingFace Transformers.
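The core idea is easier to see in miniature. Here is a toy Python sketch (not vLLM's actual implementation, which manages GPU tensors): the KV cache is carved into fixed-size blocks, and each sequence keeps a block table mapping its logical token positions to physical blocks, so memory is allocated on demand instead of reserved up front.

```python
# Toy sketch of PagedAttention's memory model: fixed-size KV blocks plus a
# per-sequence block table. Real vLLM does this with GPU tensors.
BLOCK_SIZE = 4  # tokens per block (illustrative value)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # pool of physical blocks
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> tokens cached so far

    def append_token(self, seq_id, kv):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:
            # Current block is full (or sequence is new): grab a fresh block.
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        # Real vLLM would write `kv` into the last block on the GPU here.

cache = PagedKVCache(num_blocks=8)
for _ in range(6):               # sequence A: 6 tokens -> 2 blocks
    cache.append_token("A", kv=None)
for _ in range(3):               # sequence B: 3 tokens -> 1 block
    cache.append_token("B", kv=None)

print(len(cache.block_tables["A"]))  # 2
print(len(cache.block_tables["B"]))  # 1
```

Because blocks are allocated lazily, short sequences never pin down memory sized for the longest possible context, which is where the throughput gains come from.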
This guide assumes you are running a Linux server with at least one high-end NVIDIA GPU (e.g., RTX 3090, 4090, or A100).
Prerequisites
vLLM requires Python and CUDA. We highly recommend using a virtual environment like conda or standard venv.
Terminal
sudo apt update
sudo apt install python3-pip python3-venv
# Create and activate environment
python3 -m venv vllm-env
source vllm-env/bin/activate
Step 1 Installing vLLM
Install vLLM directly from PyPI. This will automatically pull down PyTorch and the necessary CUDA dependencies.
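With the environment from the Prerequisites step activated, a single pip command is all it takes:

Terminal

```shell
# Inside the activated vllm-env; pulls vLLM plus a CUDA-enabled PyTorch wheel
pip install vllm
```

The download is large (several GB including PyTorch), so expect the first install to take a few minutes.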
Step 2 Running an OpenAI-Compatible Server
vLLM natively supports the OpenAI API protocol. You can serve a model directly from the command line. We'll use Llama 3 8B as an example.
Terminal
python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct --dtype auto --api-key your-secret-key --port 8000
Key Parameters Explained:
- --model: The HuggingFace repository ID. vLLM will download it automatically.
- --dtype auto: Automatically uses bfloat16 or float16 for maximum performance.
- --api-key: Requires clients to present this key as a Bearer token, so anyone who finds your endpoint can't run queries on your GPU.
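Before wiring up any client code, you can sanity-check that the server is up with a quick curl against the models endpoint (assuming the server from the command above is running locally on port 8000):

Terminal

```shell
# List the served models; the key must match the --api-key you launched with
curl http://localhost:8000/v1/models \
    -H "Authorization: Bearer your-secret-key"
```

A JSON response listing meta-llama/Meta-Llama-3-8B-Instruct means you're good to go.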
Step 3 Connecting to the API
You can now hit this server using the standard OpenAI Python SDK, just by swapping the base_url.
Python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-secret-key",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Explain PagedAttention."}
    ],
)

print(response.choices[0].message.content)
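Streaming works the same way as with the hosted OpenAI API: pass stream=True and iterate over the chunks as they arrive. This sketch assumes the Step 2 server is still running locally, so it won't produce output on its own.

Python

```python
# Stream tokens from the local vLLM server as they are generated.
# Requires the server from Step 2 to be running on localhost:8000.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-secret-key",
)

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```

Streaming is worth enabling for anything user-facing: the first token arrives quickly even when the full completion takes seconds.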
Step 4 Optimizing for Production
If you have multiple GPUs, you can split a model too large for one card across them using tensor parallelism by adding the --tensor-parallel-size flag. Note the math, though: a 70B model in 16-bit precision needs roughly 140 GB for the weights alone, so two 24 GB RTX 3090s won't hold it unquantized — pair this flag with a quantized checkpoint (e.g., AWQ) or with larger cards such as two 80 GB A100s:
Terminal
python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B-Instruct --tensor-parallel-size 2
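Two more launch flags are worth knowing when tuning for production. The flag names below come from vLLM's CLI; the specific values are illustrative assumptions for a tighter VRAM budget, not recommendations:

Terminal

```shell
# --gpu-memory-utilization caps the fraction of VRAM vLLM reserves
# (weights + KV cache); --max-model-len caps the context window, which
# shrinks the KV cache each request can claim. Values here are examples.
python3 -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --gpu-memory-utilization 0.85 \
    --max-model-len 8192
```

Lowering --max-model-len is the easiest lever when you hit out-of-memory errors at startup, since most workloads never need the model's full context window.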
With vLLM running, your Linux server can handle dozens of concurrent requests: continuous batching keeps the GPU saturated, so latency degrades far more gracefully under load than with a naive one-request-at-a-time server.