How to Deploy a vLLM Endpoint in Just Minutes
- Jul 24, 2024
- 7 min read
In the rapidly evolving world of machine learning, vLLM has emerged as a game-changer for serving large language models. The library exposes an OpenAI-compatible API server and uses techniques such as efficient memory management, continuous batching, and quantization to maximize throughput. With its ability to harness GPU acceleration and CUDA drivers, vLLM offers a robust solution for deploying high-performance language models.
This guide walks readers through the process of deploying a vLLM endpoint in just minutes. It covers setting up the environment, creating a custom Dockerfile, and deploying vLLM on Koyeb. The article also delves into essential aspects such as TLS encryption, API authentication, and using the Hugging Face CLI for login. By the end, readers will have the knowledge to test and use their own vLLM endpoint, opening up new possibilities for integrating advanced language models into their projects.

Setting Up Your Environment
Before deploying vLLM, it is crucial to set up the environment with the necessary prerequisites. This involves installing the required software dependencies, configuring GPU support, and obtaining a Hugging Face API token for authentication.
Installing Prerequisites
To get started, create a virtual environment using Python 3.x. Then install the dependencies listed in the requirements.txt file by running pip install -r requirements.txt. This ensures that all the packages vLLM needs are available.
Configuring GPU Support
vLLM leverages GPU acceleration to optimize performance, and an NVIDIA GPU with CUDA support is recommended. The GPU's compute capability should be 7.0 or higher, which includes models such as the V100, T4, RTX 20xx series, A100, L4, and H100.
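On a machine with a GPU, torch.cuda.get_device_capability() returns the (major, minor) compute capability tuple; the 7.0 requirement above can be expressed as a small helper (the helper name is our own, for illustration):

```python
def meets_vllm_gpu_requirement(capability):
    """Return True if a CUDA compute capability (major, minor) tuple is >= 7.0."""
    major, minor = capability
    return (major, minor) >= (7, 0)

# Capabilities for some of the GPUs listed above (per NVIDIA's published specs):
print(meets_vllm_gpu_requirement((7, 0)))  # V100 -> True
print(meets_vllm_gpu_requirement((7, 5)))  # T4   -> True
print(meets_vllm_gpu_requirement((6, 0)))  # P100 -> False (too old for vLLM)
```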
To configure GPU support, install a recent NVIDIA driver and a CUDA toolkit compatible with your PyTorch build, then verify the setup by running nvidia-smi and confirming that your GPU is listed.
Obtaining a Hugging Face API Token
To access the Hugging Face Hub and utilize its resources, an API token is required. Here's how to obtain one:
Sign up for a Hugging Face account at https://huggingface.co/join.
Navigate to the settings page and generate a new API token with the desired permissions (read or write).
Copy the generated token and keep it secure.
With the API token, you can authenticate and access the necessary resources from the Hugging Face Hub. To set the token in your environment, run the huggingface-cli login command and provide the token when prompted. Alternatively, set the HUGGING_FACE_HUB_TOKEN environment variable to the token value.
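The environment-variable route can also be exercised from a script; the token below is a placeholder, not a real credential (Hugging Face tokens start with the hf_ prefix):

```python
import os

# Placeholder token; substitute the real read-only token from your settings page.
os.environ["HUGGING_FACE_HUB_TOKEN"] = "hf_xxxxxxxxxxxxxxxxxxxx"

# Code that runs later in the same process (including vLLM) can read it back.
token = os.environ["HUGGING_FACE_HUB_TOKEN"]
print(token.startswith("hf_"))  # True
```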
By completing these setup steps, you will have a fully configured environment ready for deploying vLLM. The installed prerequisites, GPU support, and Hugging Face API token will enable seamless integration and optimal performance of the vLLM endpoint.
Creating a Custom Dockerfile
To create a custom Dockerfile for deploying vLLM, start by selecting an appropriate base image. The official PyTorch Docker image is recommended, as it comes with torch and the CUDA drivers pre-installed. Here's a sample Dockerfile that uses the PyTorch base image:
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-devel

WORKDIR /srv

RUN pip install vllm==0.3.3 --no-cache-dir

ENTRYPOINT ["python", "-m", "vllm.entrypoints.openai.api_server", \
    "--host", "0.0.0.0", "--port", "80", \
    "--model", "google/gemma-2b-it", \
    "--dtype=half"]
When installing vLLM in the official PyTorch Docker image, make sure the image's PyTorch version matches the one vLLM expects by checking the corresponding pyproject.toml file in the vLLM repository.
Base Image Selection
The vLLM project directory includes the following Dockerfile resources:
Dockerfile: Contains the main vLLM library build context with support for NVIDIA GPU systems.
Dockerfile.cpu: Contains the vLLM build context for CPU systems.
Dockerfile.rocm: Contains the build context for AMD GPU systems.
Choose the appropriate Dockerfile based on your target system (CPU or GPU).
Configuring Environment Variables
vLLM uses various environment variables to configure the system. All environment variables used by vLLM are prefixed with VLLM_. Some important ones include:
VLLM_TARGET_DEVICE: Target device of vLLM (default: "cuda").
VLLM_USE_PRECOMPILED: If set, vLLM will use precompiled binaries.
VLLM_INSTALL_PUNICA_KERNELS: If set, vLLM will install Punica kernels (for LoRA).
VLLM_USE_MODELSCOPE: If true, will load models from ModelScope instead of Hugging Face Hub.
VLLM_API_KEY: API key for VLLM API server.
Set the desired environment variables in your Dockerfile using the ENV instruction.
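For instance, a hypothetical pair of ENV lines pinning the target device and requiring an API key might look like this (the values are illustrative, not recommendations):

```dockerfile
# Illustrative values only; set these to match your deployment.
ENV VLLM_TARGET_DEVICE=cuda
ENV VLLM_API_KEY=change-me-before-deploying
```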
Adding the Entrypoint Command
The entrypoint command specifies the command to run when the container starts. For the vLLM API server, use the following entrypoint:
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
You can pass additional arguments to customize the server, such as the host, port, model, and data type. For example:
ENTRYPOINT ["python", "-m", "vllm.entrypoints.openai.api_server", \
    "--host", "0.0.0.0", "--port", "80", \
    "--model", "google/gemma-2b-it", \
    "--dtype=half"]
By creating a custom Dockerfile with the appropriate base image, environment variables, and entrypoint command, you can easily build and deploy a vLLM container image tailored to your needs.
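Assuming the Dockerfile above sits in the current directory, the image can be built and run locally as follows (the image tag is our own, and running the container requires the NVIDIA Container Toolkit):

```shell
# Build the image (tag name is arbitrary).
docker build -t vllm-endpoint .

# Run it with GPU access; the token value is a placeholder.
docker run --rm --gpus all -p 80:80 \
    -e HUGGING_FACE_HUB_TOKEN="hf_xxxxxxxxxxxxxxxxxxxx" \
    vllm-endpoint
```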
Deploying vLLM
Deploying vLLM is a straightforward process that can be accomplished in just a few steps.
To get started, ensure that you have a Koyeb account and access to GPU instances on the platform. Additionally, a Hugging Face account with a read-only API token is required to fetch the models that vLLM will run.
Pushing to GitHub
If you prefer to customize and enhance the application, you can fork the vllm repository and link your service to the forked repository. This allows you to push changes and have more control over the deployment.
Configuring the Deployment
When deploying vLLM, several environment variables are used for configuration. The following variables should be set with appropriate values:
HF_TOKEN: An API token to authenticate to Hugging Face. This app only requires a read-only API token, which is used to verify that you have accepted the model's usage license.
VLLM_API_KEY (Optional): An API key you can set to limit access to the server. When an API key is set, every request must provide it as an authorization bearer token.
VLLM_DO_NOT_TRACK (Optional): Set to "1" to disable sending usage statistics to the vLLM project.
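When VLLM_API_KEY is set, clients must send the key in the standard Authorization header. A minimal sketch of building that header (the key value here is made up):

```python
# Placeholder key; use the same value you set in VLLM_API_KEY on the server.
api_key = "my-secret-key"

# Standard bearer-token form expected by the server.
headers = {"Authorization": f"Bearer {api_key}"}
print(headers["Authorization"])  # Bearer my-secret-key
```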
In the "GPU Instance" section of the configuration page, choose "A100" or any GPU type that fits your model; as a rough rule of thumb for half precision, the available VRAM in gigabytes should be at least twice the model's parameter count in billions. vLLM requires a GPU instance to run, so this selection is essential.
Setting Environment Variables
When deploying vLLM on Koyeb, it is crucial to set the required environment variables with appropriate values. These variables include:
HF_TOKEN: Set this to your Hugging Face read-only API token.
MODEL_NAME: Set this to the name of the model you wish to use, as given on the Hugging Face site. You can check what models vLLM supports to find out more. If not provided, the google/gemma-2b-it model will be deployed.
REVISION: Set this to the model revision you wish to use. You can find available revisions in a drop-down menu on the "Files and versions" tab of the Hugging Face model page. If not provided, the default revision for the given model will be deployed.
VLLM_API_KEY: This defines an authorization token that must be provided when querying the API. If not provided, unauthenticated queries will be accepted by the API.
In the "Health checks" section, set the "Grace period" to 300 seconds to allow time for vLLM to fetch the model. This is especially important for large models that may take some time to download and initialize.
By following these steps and configuring the necessary environment variables, you can successfully deploy vLLM on Koyeb and leverage the power of cloud infrastructure to scale your language model endpoints effortlessly.
Testing and Using Your vLLM Endpoint
Once the vLLM API server is up and running, it can be queried using the same format as the OpenAI API. This allows vLLM to serve as a drop-in replacement for applications that use the OpenAI API.
To get started, ensure that the server is running at the specified host and port (the default is http://localhost:8000). The server hosts one model at a time and implements the list models, create chat completion, and create completion endpoints.
Querying Available Models
To retrieve a list of available models, make a request to the /v1/models endpoint:
curl http://localhost:8000/v1/models
This will return a JSON response listing the models served by the vLLM server.
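The response follows the OpenAI list format; assuming that standard shape, the model IDs can be pulled out like this (the payload below is an illustrative example, not actual server output):

```python
import json

# Illustrative response body in the OpenAI-compatible list format.
body = '{"object": "list", "data": [{"id": "facebook/opt-125m", "object": "model"}]}'
payload = json.loads(body)

# Each entry in "data" describes one served model.
model_ids = [m["id"] for m in payload["data"]]
print(model_ids)  # ['facebook/opt-125m']
```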
Making Completion Requests
To generate completions using the hosted model, send a POST request to the /v1/completions endpoint with the appropriate parameters:
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
The request should include the model name, input prompt, and any additional generation parameters such as max_tokens and temperature.
Since the vLLM server is compatible with the OpenAI API, it can be used as a drop-in replacement in applications that utilize the OpenAI API. For example, the openai Python package can be used to interact with the vLLM server:
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

completion = client.completions.create(model="facebook/opt-125m", prompt="San Francisco is a")
print("Completion result:", completion)
Using the Chat Interface
The vLLM server also supports the OpenAI Chat API, allowing for dynamic conversations with the model. To engage in a chat-like interaction, use the /v1/chat/completions endpoint:
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me a joke."}
        ]
    }'
The request should include the model name and a list of messages, where each message has a role (e.g., "system" or "user") and content.
Using the openai Python package, the chat interface can be utilized as follows:
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="facebook/opt-125m",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ],
)
print("Chat response:", chat_response)
By leveraging the OpenAI-compatible API endpoints provided by the vLLM server, developers can seamlessly integrate and test their vLLM endpoints in existing applications, enabling efficient scaling and deployment of large language models.
FAQs
How can I implement vLLM for use in a production environment?
To effectively use vLLM in production, adjust memory management settings and optimize your batching strategy. This includes setting an appropriate batch size and sequence length based on your available hardware resources. Ensure that your pre-trained large language model (LLM) is loaded into the vLLM environment, properly initialized, and ready to perform inference.
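One way to apply such settings is through the API server's command-line flags; the values below are illustrative starting points to tune for your hardware, not recommendations:

```shell
# Cap GPU memory use, bound context length, and limit concurrent sequences.
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-2b-it \
    --gpu-memory-utilization 0.90 \
    --max-model-len 4096 \
    --max-num-seqs 64
```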
What steps are involved in hosting a vLLM?
To host vLLM, begin by starting the vLLM server. Note that for certain models, such as google/gemma-2b, you must first accept the model's license. This requires creating a Hugging Face account, accepting the license for the model, and then generating an access token. Once you have the token, you can proceed to start the server.