Overview
After the Poolside platform is initialized, deploy the AI models by installing the inference-stack chart provided in the bundle. The chart deploys both inference models, the Envoy proxy, and the external processor.
Prerequisites
- GPU software stack: Install NVIDIA GPU Operator 26.3.0 in the cluster, with the following component versions:
  - NVIDIA driver 580.126.20
  - NVIDIA Container Toolkit 1.19.0
- Images: Ensure that the inference image (for example, atlas), the Envoy proxy image envoyproxy/envoy, and the Envoy gateway image envoyproxy/gateway are available in your registry.
- S3 credentials: Ensure that a secret containing AWS credentials (for example, one named aws-credentials) exists in the poolside-models namespace.
- Model checkpoints: Upload checkpoints to the S3 bucket before you install the inference chart. See Install on OpenShift → Step 4: Upload model checkpoints.
- API key authentication (optional): To require an API key for the vLLM inference servers, create a secret containing the key in the poolside-models namespace:
  oc create secret generic vllm-auth \
    --from-literal=VLLM_API_KEY=<vllm-api-key> \
    -n poolside-models
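Before you continue, it can help to confirm that the secrets named in the prerequisites exist. The following is a minimal sketch, not part of the bundle: `check_secret` is a hypothetical helper name, and the `NAMESPACE` default assumes the poolside-models namespace used on this page.

```shell
# Namespace used throughout this page; override by exporting NAMESPACE.
NAMESPACE="${NAMESPACE:-poolside-models}"

# check_secret is a hypothetical helper (not shipped with the bundle).
# It reports whether the named secret exists in the namespace and
# returns non-zero when the secret is missing.
check_secret() {
  if oc get secret "$1" -n "$NAMESPACE" >/dev/null 2>&1; then
    echo "found secret/$1"
  else
    echo "missing secret/$1"
    return 1
  fi
}
```

For example, run `check_secret aws-credentials`, and `check_secret vllm-auth` if you created the optional API key secret.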
Edit inference_values.yaml (created during the platform install) and set the fields that apply to your environment:
global:
  inference:
    image:
      # -- Container image registry (shared by all inference subcharts)
      registry: "<registry-host>"
      # -- Container image name (shared by all inference subcharts)
      name: "atlas"
      # -- Container image tag (shared by all inference subcharts)
      tag: "202604-rc1" # extracted from the file name of the atlas container at ./containers/atlas tar file
    # -- Name of the image pull secret for private registries
    imagePullSecret: "poolside-registry-secret"
    podSecurityContext:
      # -- Require non-root user
      runAsNonRoot: true
      seccompProfile:
        # -- Seccomp profile type
        type: RuntimeDefault
    s3:
      # -- Name of secret containing AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
      secretName: "aws-credentials"
      # -- Custom CA certificate bundle for S3
      caBundle: ""
    # -- Additional AWS CLI configuration applied via `aws configure set` before model download.
    # Omit to use defaults (max_concurrent_requests=200, max_queue_size=10000, multipart_chunksize=64MB).
    # Set to `{}` or `null` to disable all settings.
    # awsCliConfig: {}
    authentication:
      # -- Name of secret containing VLLM_API_KEY for vLLM server authentication
      secretName: ""
tags:
  # -- Deploy the proxy stack (inference-envoy + inference-extproc)
  proxy: true
# --- Inference models ---
# Add or remove entries under `models` to deploy arbitrary models. Each key becomes a separate
# Deployment/Service named `inference-<key>`. Shared knobs (image, s3, auth, service, ingress,
# route, tolerations, extraEnv) are set at the `inference` level and apply to every model;
# only fields that legitimately vary per model live under each entry.
inference:
  enabled: true
  fullnameOverride: inference
  models:
    laguna:
      model: s3://<bucket-name>/checkpoints/laguna
      modelName: Laguna
      modelType: agent
      gpus: 2
    point:
      model: s3://<bucket-name>/checkpoints/point
      modelName: Point
      modelType: completion
      gpus: 1
The checkpoint paths must exactly match the S3 locations where you uploaded the checkpoints, and the inference image name and tag must exactly match the image you pushed to your registry from the installation bundle.
To deploy only the inference models without the Envoy proxy stack, set tags.proxy: false.
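Because a mismatched checkpoint path only surfaces when the model download fails, you may want to check the paths before installing. The sketch below is one way to do that under stated assumptions: `list_model_paths` and `check_model_paths` are hypothetical helpers, the AWS CLI is assumed to be installed and configured with the same credentials as the cluster secret, and any extra flags (such as `--endpoint-url` for a non-AWS S3 backend) are passed through unchanged.

```shell
# Hypothetical helper: print every distinct s3:// path found in a values file.
list_model_paths() {
  grep -Eo 's3://[^[:space:]"]+' "$1" | sort -u
}

# Hypothetical helper: for each path, ask the AWS CLI whether anything is
# stored under that prefix. `aws s3 ls` exits non-zero when nothing matches.
# Extra arguments after the file name are forwarded to `aws s3 ls`.
check_model_paths() {
  values_file=$1
  shift
  list_model_paths "$values_file" | while read -r path; do
    if aws s3 ls "$path/" "$@" </dev/null >/dev/null 2>&1; then
      echo "ok $path"
    else
      echo "missing $path"
    fi
  done
}
```

For example, `check_model_paths ./inference_values.yaml` prints one `ok` or `missing` line per checkpoint path.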
For the INT4 Malibu model on 2x RTX 6000 Ada Pro GPUs with 48 GB each, limit the context length and batch size:
inference:
  models:
    malibu:
      modelExtraArgs:
        max-model-len: 65536
        max-num-batched-tokens: 8192
When you use SeaweedFS as the S3 backend, set the AWS CLI to the classic transfer client. The default high concurrency and multipart chunk size settings are incompatible with SeaweedFS and can cause download failures:
global:
  inference:
    awsCliConfig:
      default.s3.preferred_transfer_client: "classic"
When you use NooBaa or another S3 backend with limited concurrency, throttle downloads. Without throttling, the init container can fail after downloading 1-2 GiB and restart in an infinite loop because the emptyDir volume is wiped on each restart:
global:
  inference:
    awsCliConfig:
      default.s3.max_concurrent_requests: "2"
      default.s3.max_queue_size: "1000"
      default.s3.multipart_chunksize: "64MB"
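The values file states that awsCliConfig entries are applied with `aws configure set` before the model download. To try the same settings from a workstation against your S3 backend, a sketch such as the following may help. `apply_aws_cli_config` is a hypothetical helper, and the exact commands the init container runs are an assumption.

```shell
# Hypothetical helper: apply key=value pairs with `aws configure set`,
# mirroring how the chart is described as applying awsCliConfig entries.
apply_aws_cli_config() {
  for pair in "$@"; do
    key=${pair%%=*}    # text before the first "="
    value=${pair#*=}   # text after the first "="
    aws configure set "$key" "$value"
  done
}
```

For example, `apply_aws_cli_config default.s3.max_concurrent_requests=2 default.s3.max_queue_size=1000` writes both settings to the local AWS CLI configuration, after which you can time a test download with `aws s3 cp`.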
Install the inference stack
helm install inference-stack ./charts/inference-stack \
  --namespace poolside-models \
  -f ./inference_values.yaml
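After the install, you can wait for the model Deployments to become ready before you register them. This is a minimal sketch: `wait_for_models` is a hypothetical helper, the Deployment names follow the `inference-<key>` convention described in the values file, and the 30-minute timeout is an assumed value that you should adjust to your checkpoint size and download speed.

```shell
# Namespace used throughout this page; override by exporting NAMESPACE.
NAMESPACE="${NAMESPACE:-poolside-models}"

# Hypothetical helper: wait for each model Deployment to finish rolling out.
# Arguments are the model keys from inference_values.yaml (e.g. laguna point).
wait_for_models() {
  for key in "$@"; do
    # The timeout is an assumption; large checkpoints can take longer.
    oc rollout status "deployment/inference-$key" -n "$NAMESPACE" --timeout=30m || return 1
  done
}
```

For example, `wait_for_models laguna point` returns non-zero as soon as one rollout fails or times out.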
Register the model
After the inference server is running, register each model in the Poolside Console. For step-by-step instructions, see Connect a model.
Use the following deployment-specific values when you fill out the form:
- Model Name: The served inference model name. Retrieve it with:
  oc describe pod -n poolside-models | grep -A1 served-model
- Base URL: The in-cluster inference service endpoint. With the Envoy proxy stack, use the Envoy service:
http://inference-envoy-internal.poolside-models.svc.cluster.local/v0/models/inference-laguna/v1
http://inference-envoy-internal.poolside-models.svc.cluster.local/v0/models/inference-point/v1
Without the proxy stack, use the individual inference service URLs:
http://inference-laguna.poolside-models.svc.cluster.local/v1
http://inference-point.poolside-models.svc.cluster.local/v1
The /v1 suffix is required because the platform appends /chat/completions or /completions to the base URL, and vLLM serves these under the /v1/ prefix.
If you deploy without the Envoy proxy and your inference servers require API key authentication, add an Authorization: Bearer <vllm-api-key> custom header to the model so the platform sends the Authorization header with inference requests.
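The Base URL patterns above can be generated and smoke-tested from a pod inside the cluster. The following sketch makes several assumptions: `base_url` and `smoke_test` are hypothetical helpers, `VLLM_API_KEY` is an environment variable you set yourself when authentication is enabled, and the check relies on vLLM's OpenAI-compatible `/v1/models` endpoint. Whether the Envoy proxy forwards that endpoint is also an assumption.

```shell
# Namespace used throughout this page; override by exporting NAMESPACE.
NAMESPACE="${NAMESPACE:-poolside-models}"

# Hypothetical helper: build the Base URL for a model key.
# Second argument is "proxy" (default) or "direct".
base_url() {
  key=$1
  mode=${2:-proxy}
  if [ "$mode" = "proxy" ]; then
    echo "http://inference-envoy-internal.$NAMESPACE.svc.cluster.local/v0/models/inference-$key/v1"
  else
    echo "http://inference-$key.$NAMESPACE.svc.cluster.local/v1"
  fi
}

# Hypothetical helper: list the models served at the Base URL.
# Sends the Authorization header only when VLLM_API_KEY is set.
smoke_test() {
  url=$(base_url "$1" "${2:-proxy}")
  if [ -n "${VLLM_API_KEY:-}" ]; then
    curl -fsS -H "Authorization: Bearer $VLLM_API_KEY" "$url/models"
  else
    curl -fsS "$url/models"
  fi
}
```

For example, `base_url laguna` prints the Envoy URL for the Laguna model, and `smoke_test point direct` queries the Point service without the proxy.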