Overview
After you initialize the Poolside platform, you can serve Poolside models from GPU workloads inside your Amazon EKS cluster, or connect the platform to external OpenAI-compatible API endpoints. Most production deployments serve models locally for performance and data-locality reasons. Local inference uses the bundled inference-stack Helm chart, which deploys an Envoy proxy and one model deployment per enabled subchart. The Envoy proxy URL becomes the in-cluster base URL you register in the Poolside Console.
Prerequisites
- GPU node group: Use a GPU node group in your cluster, with p5e.48xlarge as the supported minimum instance type. The reference architecture provisions this node group and the supporting IAM/IRSA wiring.
- GPU software stack: Install NVIDIA GPU Operator 26.3.0 in the cluster, with the following component versions:
  - NVIDIA driver 580.126.20
  - NVIDIA Container Toolkit 1.19.0
- Model checkpoints bucket: Provide an S3 bucket for model checkpoints. The reference architecture creates a <deployment>-models bucket and grants the inference IRSA role read access to it.
  - For self-assembled deployments, choose any bucket name and follow the layout described in the model checkpoints guide.
- Deployment bundle: Extract the Poolside deployment bundle on the workstation that runs helm.
- API key authentication (optional): To require an API key for the vLLM inference servers, create a secret containing the key in poolside-models:
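A minimal sketch of creating such a secret with kubectl. The function name, the secret name vllm-api-key, and the data key api-key are hypothetical; use the names your chart values actually reference. Only the poolside-models namespace comes from this guide.

```shell
# Create a generic secret holding the vLLM API key.
# ASSUMPTION: "vllm-api-key" and "api-key" are placeholder names, not chart defaults.
create_vllm_api_key_secret() {
  local namespace="${1:-poolside-models}"
  local secret_name="${2:-vllm-api-key}"
  # Abort with a message if the key was not exported beforehand.
  local api_key="${VLLM_API_KEY:?export VLLM_API_KEY before calling}"

  kubectl create secret generic "$secret_name" \
    --namespace "$namespace" \
    --from-literal=api-key="$api_key"
}
```

With VLLM_API_KEY exported, calling create_vllm_api_key_secret with no arguments targets the poolside-models namespace.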
Stage model checkpoints in S3
Each enabled inference subchart loads its weights from S3 at startup. The reference architecture lays checkpoints out under the <deployment>-models bucket by default; for self-assembled deployments, use whatever bucket you provisioned. For the upload mechanics, including the streaming uploader and the bring-your-own-bucket alternative, see the model checkpoints guide in the reference architecture repository.
Poolside provides the model checkpoints with the deployment bundle or through a presigned URL. Confirm the delivery method and the destination prefix with your Poolside contact, then upload them to your bucket so the per-model model URIs in your values file match where the checkpoints live. For example:
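A minimal sketch of the upload using the AWS CLI, assuming the checkpoint is already extracted to a local directory. The function name, the local path, and the prefix are hypothetical, and aws s3 sync stands in here for the streaming uploader described in the model checkpoints guide.

```shell
# Copy a local checkpoint directory to s3://<bucket>/<prefix>/.
# ASSUMPTION: the prefix is whatever you agreed with your Poolside contact.
upload_checkpoint() {
  local src_dir="$1"
  local bucket="$2"
  local prefix="$3"

  aws s3 sync "$src_dir" "s3://${bucket}/${prefix}/"
}
```

For example, upload_checkpoint ./checkpoints/laguna-xs my-deployment-models laguna-xs would pair with a model URI of s3://my-deployment-models/laguna-xs/ in the values file.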
The inference IRSA role must have s3:GetObject and s3:ListBucket on the models bucket. The reference architecture scopes this policy automatically.
Install the inference-stack chart
The bundle ships an inference-stack chart under charts/inference-stack/. The chart is an umbrella around three subcharts:
- inference: Runs the model pods. A single deployment of this subchart serves many models from one set of operator-facing values.
- inference-envoy: The in-cluster Envoy proxy that fronts the model pods and exposes the OpenAI-compatible endpoint.
- inference-extproc: The request and response processor that augments the Envoy proxy.
The inference subchart calls AWS APIs to pull model checkpoints from S3. It is the only inference subchart that needs IRSA annotations. The inference-envoy and inference-extproc subcharts run cluster-internal only and use the namespace’s default service account or a chart-managed service account without an AWS IAM role.
You enable a model by adding an entry under inference.models in your values file. Each map key becomes a separate Kubernetes Deployment and Service named inference-<key>. The minimum per-model fields are the S3 checkpoint URI (model) and the OpenAI-compatible model name (modelName). The chart’s modelType preset fills in sensible defaults for the rest.
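The naming rule can be sketched as two small shell helpers. The helper names are hypothetical, and poolside-models is assumed as the namespace; the URL shape follows the direct service URL used when registering a model without the proxy stack.

```shell
# Print the in-cluster base URL for the model with the given map key.
inference_base_url() {
  local key="$1"
  local namespace="${2:-poolside-models}"
  printf 'http://inference-%s.%s.svc.cluster.local/v1\n' "$key" "$namespace"
}

# Show the Deployment and Service that the chart creates for one map key.
show_model_resources() {
  local key="$1"
  local namespace="${2:-poolside-models}"
  kubectl get "deployment/inference-${key}" "service/inference-${key}" \
    --namespace "$namespace"
}
```

For the map key laguna-xs, inference_base_url laguna-xs prints http://inference-laguna-xs.poolside-models.svc.cluster.local/v1.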
Configure inference values
Set the inference IRSA role and the shared image registry once under global.inference, then list the models you want to deploy under inference.models:
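A minimal sketch that writes such a values file from the shell. It uses only the per-model fields this guide documents; the registry address, checkpoint prefixes, model names, and the second model entry are hypothetical placeholders, and the IRSA role key is left out because its name is not shown here.

```shell
# Write a sketch values file. Everything in angle brackets is a placeholder.
VALUES_FILE="${VALUES_FILE:-inference-values.yaml}"

cat > "$VALUES_FILE" <<'EOF'
global:
  inference:
    image:
      # Shared image registry (hypothetical ECR address).
      registry: "<account-id>.dkr.ecr.<region>.amazonaws.com"
    # The IRSA role is also set under global.inference. Its key name is not
    # shown in this guide; take it from the chart's values.yaml.

inference:
  models:
    # Map key "laguna-xs" yields Deployment and Service "inference-laguna-xs".
    laguna-xs:
      model: "s3://<deployment>-models/<checkpoint-prefix>/"
      modelName: "<served-model-name>"
    # Hypothetical multi-GPU entry: gpus and tensor-parallel-size are set together.
    example-multi-gpu:
      model: "s3://<deployment>-models/<other-checkpoint-prefix>/"
      modelName: "<other-served-model-name>"
      modelType: agent
      gpus: 8
      replicas: 1
      modelExtraArgs:
        tensor-parallel-size: "8"
EOF
```

The first entry sets only the two required fields; the second shows how the optional fields layer on top of a modelType preset.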
Each model entry accepts the following fields; fields you omit fall back to modelDefaults.
| Field | Required | Description |
|---|---|---|
| model | Yes | S3 URI of the checkpoint directory. The init container downloads the contents into the pod at startup. |
| modelName | Yes | Model name surfaced through the OpenAI-compatible API and used as the registration value in the Poolside Console. |
| modelType | No | One of agent, agent_small, or completion. Selects a preset of inference-server CLI flags, for example max-model-len and distributed-executor-backend, tuned to the workload type. |
| modelExtraArgs | No | Map of additional inference-server CLI flags. Merged on top of the modelType preset; per-key values win. |
| gpus | No | GPUs per replica. Sets the nvidia.com/gpu resource request and limit. Defaults to 1. For multi-GPU models, also set tensor-parallel-size: "<n>" under modelExtraArgs. |
| replicas | No | Number of pods for this model. Defaults to 1. |
| ingressHost | No | External hostname when the chart’s ingress.enabled is true. |
| routeHost | No | External hostname when you run on OpenShift with route.enabled set to true. |
For other settings, including scheduling controls (affinity, topologySpreadConstraints, priorityClassName), rolling-update behavior, and pod disruption budgets, see the chart’s charts/inference-stack/charts/inference/values.yaml. Pod tolerations are configured once at the chart level and apply to every model; the default tolerates the nvidia.com/gpu node taint.
If you use the reference architecture, the poolside-values Terraform module composes these values automatically from var.inference_models. See the reference architecture repository for the Terraform variable schema and the well-known-model defaults it applies.
Configure container images for a private registry
The external processor image (forge_api) does not inherit global.inference.image.registry. Set its registry explicitly:
The Envoy subchart uses the public envoyproxy/envoy and envoyproxy/gateway images. If you mirrored them into Amazon ECR instead of allowing pulls from the public internet, point the Envoy subchart at your registry:
Install the chart
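A minimal sketch of the install with Helm, run from the workstation where you extracted the bundle. The function name, the release name inference-stack, and the 30-minute wait budget are assumptions; the chart path and the poolside-models namespace come from this guide.

```shell
# Install or upgrade the umbrella chart from the extracted bundle.
# ASSUMPTION: release name and timeout are illustrative, not prescribed values.
install_inference_stack() {
  local bundle_dir="$1"
  local values_file="$2"
  local release="${3:-inference-stack}"
  local namespace="${4:-poolside-models}"

  helm upgrade --install "$release" "${bundle_dir}/charts/inference-stack" \
    --namespace "$namespace" \
    --create-namespace \
    --values "$values_file" \
    --wait --timeout 30m
}
```

For example, install_inference_stack ./bundle inference-values.yaml. Afterwards, kubectl get pods --namespace poolside-models shows whether the model pods are ready.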
Register the model
Whether you serve models locally or connect to an external API, you register them in the Poolside Console. For the full procedure, see Connect a model. Use the following deployment-specific values:
- Model Name: The served inference model name. Retrieve it with:
- Base URL: The model API endpoint.
  - For locally hosted Poolside models with the Envoy proxy stack, use the in-cluster Envoy proxy service URL, for example:
  - Without the proxy stack, use the individual inference service URL, for example http://inference-laguna-xs.poolside-models.svc.cluster.local/v1.
  - For external API endpoints, use the provider’s public URL.
If you enabled API key authentication, add an Authorization: Bearer <vllm-api-key> custom header to the model so the platform sends the Authorization header with inference requests.
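One way to check a base URL and see which model names it serves is to query the OpenAI-compatible model listing endpoint. This is a sketch, not necessarily the command the Console procedure expects: the function name is hypothetical, and it assumes the endpoint exposes /models under the base URL, as vLLM's OpenAI-compatible server does.

```shell
# List the models served behind an OpenAI-compatible base URL (ending in /v1).
# Sends the bearer token only when VLLM_API_KEY is set in the environment.
list_served_models() {
  local base_url="$1"

  if [ -n "${VLLM_API_KEY:-}" ]; then
    curl --silent --show-error --fail \
      -H "Authorization: Bearer ${VLLM_API_KEY}" \
      "${base_url}/models"
  else
    curl --silent --show-error --fail "${base_url}/models"
  fi
}
```

For example, list_served_models http://inference-laguna-xs.poolside-models.svc.cluster.local/v1. The hostname resolves only inside the cluster, so run it from a pod or through a port-forward. In the JSON response, each entry's id field under data is a served model name.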
Connect external OpenAI-compatible models
If you do not run local inference, you can point the platform at any OpenAI-compatible model API. Skip the inference-stack install and register the external endpoint as the model’s base URL when you connect a model.