Overview
After the Poolside platform is initialized, deploy the inference models by installing the `inference-stack` chart provided in the bundle. The chart deploys the inference models, the Envoy proxy, and the external processor. It requires GPU nodes in the cluster and model checkpoints in S3.
Prerequisites
- Images: Ensure that the inference image, such as `atlas`, the Envoy proxy image `envoyproxy/envoy`, and the Envoy gateway image `envoyproxy/gateway` are available in your registry.
- S3 credentials: Ensure that a secret with AWS credentials, such as `aws-credentials`, exists in the `poolside-models` namespace.
- Model checkpoints: Upload checkpoints to the S3 bucket before you install the inference chart. See Install on Kubernetes → Step 4: Upload model checkpoints.
- API key authentication (optional): To require an API key for the vLLM inference servers, create a secret containing the key in `poolside-models`:
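A minimal sketch of creating that secret; the secret name (`vllm-api-key`) and key (`api-key`) are assumptions, so match them to whatever your inference values file references:

```bash
# Sketch only: the secret name and key are assumptions; use the names your
# inference values file expects.
kubectl create secret generic vllm-api-key \
  --namespace poolside-models \
  --from-literal=api-key='<your-api-key>'
```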
Configure the inference values file
Edit `inference_values.yaml` (created during the platform install) and set the fields that apply to your environment:
- To deploy without the Envoy proxy and external processor, set `tags.proxy: false`.
- For the INT4 Malibu model on 2x RTX 6000 Ada Pro GPUs with 48 GB each, limit the context length and batch size (see the sketch after this list).
- Keep in mind that the `emptyDir` volume is wiped on each restart.
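The exact values-file schema is defined by the `inference-stack` chart; the snippet below is only a sketch, with a hypothetical `models.malibu.extraArgs` field and illustrative numbers, of passing vLLM's `--max-model-len` and `--max-num-seqs` flags to cap context length and batch size:

```yaml
# Sketch only: "models.malibu.extraArgs" is an assumed structure; adapt it to the
# chart's schema. The flags map to vLLM's context-length and batch-size limits.
models:
  malibu:
    extraArgs:
      - "--max-model-len=16384"   # maximum context length in tokens (illustrative)
      - "--max-num-seqs=8"        # maximum concurrent sequences per batch (illustrative)
```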
Install the inference stack
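A minimal sketch of the install command, assuming the chart directory is `./inference-stack`, the release name is `inference-stack`, and the release goes into the `poolside-models` namespace; adjust these to match your bundle:

```bash
# Sketch only: adjust the release name, chart path, and namespace to your bundle.
helm upgrade --install inference-stack ./inference-stack \
  --namespace poolside-models \
  --values inference_values.yaml

# Wait for the inference pods to become Ready before registering models.
kubectl get pods -n poolside-models -w
```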
Register the models
After the inference server is running, register each model in the Poolside Console. For step-by-step instructions, see Connect a model. Use the following deployment-specific values when you fill out the form:
- Model Name: the served inference model name. Retrieve it from the running server (see the sketch after this list).
- Base URL: the in-cluster inference service endpoint. With the Envoy proxy stack, use the Envoy service:

  ```text
  http://inference-envoy-internal.poolside-models.svc.cluster.local/v0/models/inference-laguna/v1
  http://inference-envoy-internal.poolside-models.svc.cluster.local/v0/models/inference-point/v1
  ```

  Without the Envoy proxy stack, use the inference services directly:

  ```text
  http://inference-laguna.poolside-models.svc.cluster.local/v1
  http://inference-point.poolside-models.svc.cluster.local/v1
  ```

  The `/v1` suffix is required because the platform appends `/chat/completions` or `/completions` to the base URL, and vLLM serves these endpoints under the `/v1/` prefix.
- If you enabled API key authentication, add the `Authorization: Bearer <vllm-api-key>` custom header to the model so the platform sends the Authorization header with inference requests.
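One way to retrieve the served model name, sketched here as an in-cluster request against vLLM's `/v1/models` endpoint; the service hostname is an example, and the Authorization header is only needed if you enabled API key authentication:

```bash
# Sketch only: substitute the service for the model you deployed; drop the
# Authorization header if API key authentication is not enabled.
kubectl run -n poolside-models model-name-check --rm -it --restart=Never \
  --image=curlimages/curl --command -- \
  curl -s -H "Authorization: Bearer <vllm-api-key>" \
  http://inference-laguna.poolside-models.svc.cluster.local/v1/models
```

The served model name is the `id` field in the JSON response.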