Overview

After you initialize the Poolside platform, you can serve Poolside models from GPU workloads inside your Amazon EKS cluster, or connect the platform to external OpenAI-compatible API endpoints. Most production deployments serve models locally for performance and data-locality reasons. Local inference uses the bundled inference-stack Helm chart, which deploys an Envoy proxy and one Kubernetes Deployment per model listed under inference.models. The Envoy proxy URL becomes the in-cluster base URL that you register in the Poolside Console.

Prerequisites

  • GPU node group: Use a GPU node group in your cluster, with p5e.48xlarge as the supported minimum instance type. The reference architecture provisions this node group and the supporting IAM/IRSA wiring.
  • GPU software stack: Install NVIDIA GPU Operator 26.3.0 in the cluster, with the following component versions:
    • NVIDIA driver 580.126.20
    • NVIDIA Container Toolkit 1.19.0
  • Model checkpoints bucket: Provide an S3 bucket for model checkpoints. The reference architecture creates a <deployment>-models bucket and grants the inference IRSA role read access to it.
    • For self-assembled deployments, choose any bucket name and follow the layout described in the model checkpoints guide.
  • Deployment bundle: Extract the Poolside deployment bundle on the workstation that runs helm.
  • API key authentication (optional): To require an API key for the vLLM inference servers, create a secret containing the key in poolside-models:
    kubectl create secret generic vllm-auth \
      --from-literal=VLLM_API_KEY=<vllm-api-key> \
      -n poolside-models
    

Stage model checkpoints in S3

Each model pod loads its weights from S3 at startup: an init container downloads the checkpoint directory into the pod. The reference architecture lays checkpoints out under:
s3://<your-models-bucket>/models/checkpoints/<model>-<version>/
The reference architecture names the bucket <deployment>-models by default; for self-assembled deployments, use the bucket you provisioned. For the upload mechanics, including the streaming uploader and the bring-your-own-bucket alternative, see the model checkpoints guide in the reference architecture repository. Poolside provides the model checkpoints with the deployment bundle or through a presigned URL. Confirm the delivery method and the destination prefix with your Poolside contact, then upload the checkpoints to your bucket so that each model URI in your values file matches the location of its checkpoint. For example:
aws s3 cp ./checkpoints s3://<your-models-bucket>/models/checkpoints --recursive --region <region>
The inference IRSA role must have s3:GetObject and s3:ListBucket on the models bucket. The reference architecture scopes this policy automatically.
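For self-assembled deployments, the read permissions above can be sketched as a minimal IAM policy document. The helper name models_bucket_policy and the example bucket name are hypothetical, and the ARNs assume the standard aws partition; the policy that the reference architecture generates may be scoped more tightly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a minimal read-only IAM policy for a models bucket.
# s3:ListBucket is granted on the bucket ARN; s3:GetObject on the object ARNs.
# Assumption: standard "aws" partition (other partitions use a different ARN prefix).
models_bucket_policy() {
  local bucket="$1"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::${bucket}"
    },
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::${bucket}/*"
    }
  ]
}
EOF
}

# Example bucket name is a placeholder.
models_bucket_policy "example-models"
```

Redirecting the output to a file and attaching it to the inference role, for example with aws iam put-role-policy, is one way to apply it.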

Install the inference-stack chart

The bundle ships an inference-stack chart under charts/inference-stack/. The chart is an umbrella around three subcharts:
  • inference: Runs the model pods. A single instance of this subchart serves many models from one set of operator-facing values.
  • inference-envoy: The in-cluster Envoy proxy that fronts the model pods and exposes the OpenAI-compatible endpoint.
  • inference-extproc: The request and response processor that augments the Envoy proxy.
Only the inference subchart calls AWS APIs, to pull model checkpoints from S3, so it is the only subchart that needs IRSA annotations. The inference-envoy and inference-extproc subcharts run cluster-internal only and use the namespace’s default service account or a chart-managed service account without an AWS IAM role.
You enable a model by adding an entry under inference.models in your values file. Each map key becomes a separate Kubernetes Deployment and Service named inference-<key>. The minimum per-model fields are the S3 checkpoint URI (model) and the OpenAI-compatible model name (modelName). The chart’s modelType preset fills in sensible defaults for the rest.

Configure inference values

Set the inference IRSA role and the shared image registry once under global.inference, then list the models you want to deploy under inference.models:
global:
  inference:
    image:
      registry: "<account-id>.dkr.ecr.<region>.amazonaws.com/<ecr-prefix>"
    # Set runAsUser so the inference pods pass the runAsNonRoot check.
    # Keep the full block together: the chart replaces podSecurityContext, it does not merge.
    podSecurityContext:
      runAsNonRoot: true
      runAsUser: 10003
      seccompProfile:
        type: RuntimeDefault

inference:
  enabled: true
  fullnameOverride: inference
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: "<inference-role-arn>"
  models:
    laguna-xs:
      model: "s3://<your-models-bucket>/models/checkpoints/laguna-xs-2026-04/"
      modelName: "Laguna XS"
      modelType: agent_small
      gpus: 1
    point:
      model: "s3://<your-models-bucket>/models/checkpoints/point-2026-04/"
      modelName: "Point"
      modelType: completion
      gpus: 1
Each per-model entry supports the following fields. The table marks which fields are required; optional fields that you omit fall back to the chart’s modelDefaults.
| Field | Required | Description |
|---|---|---|
| model | Yes | S3 URI of the checkpoint directory. The init container downloads the contents into the pod at startup. |
| modelName | Yes | Model name surfaced through the OpenAI-compatible API and used as the registration value in the Poolside Console. |
| modelType | No | One of agent, agent_small, or completion. Selects a preset of inference-server CLI flags, for example max-model-len and distributed-executor-backend, tuned to the workload type. |
| modelExtraArgs | No | Map of additional inference-server CLI flags. Merged on top of the modelType preset; per-key values win. |
| gpus | No | GPUs per replica. Sets the nvidia.com/gpu resource request and limit. Defaults to 1. For multi-GPU models, also set tensor-parallel-size: "<n>" under modelExtraArgs. |
| replicas | No | Number of pods for this model. Defaults to 1. |
| ingressHost | No | External hostname when the chart’s ingress.enabled is true. |
| routeHost | No | External hostname when you run on OpenShift with route.enabled set to true. |
For the full list of per-model fields, including pod placement (affinity, topologySpreadConstraints, priorityClassName), rolling-update behavior, and pod disruption budgets, see the chart’s charts/inference-stack/charts/inference/values.yaml.
Pod tolerations are configured once at the chart level and apply to every model; the default tolerates the nvidia.com/gpu node taint.
If you use the reference architecture, the poolside-values Terraform module composes these values automatically from var.inference_models. See the reference architecture repository for the Terraform variable schema and the well-known-model defaults it applies.
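The gpus and modelExtraArgs fields combine when a model spans several GPUs. The following is a minimal sketch that writes a separate override values file for such a model. The model key example-large, the model name, the file name, and the GPU count of 4 are all hypothetical, and it assumes that tensor-parallel-size should equal the per-replica GPU count.

```shell
#!/usr/bin/env bash
# Sketch: write an override values file for one multi-GPU model.
# Hypothetical values: the model key, modelName, file name, and GPU count
# are placeholders, not values shipped with the bundle.
GPUS=4
OUT="values-multi-gpu.yaml"

cat > "$OUT" <<EOF
inference:
  models:
    example-large:
      model: "s3://<your-models-bucket>/models/checkpoints/<model>-<version>/"
      modelName: "Example Large"
      modelType: agent
      gpus: ${GPUS}
      modelExtraArgs:
        # Assumption: tensor-parallel-size matches the per-replica GPU count.
        tensor-parallel-size: "${GPUS}"
EOF

echo "wrote ${OUT}"
```

Helm accepts multiple -f flags and gives precedence to the rightmost file, so an override like this can be passed after your main values file instead of editing it in place.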

Configure container images for a private registry

The external processor image (forge_api) does not inherit global.inference.image.registry. Set its registry explicitly:
inference-extproc:
  image:
    registry: "<account-id>.dkr.ecr.<region>.amazonaws.com/<ecr-prefix>"
The Envoy proxy and gateway images default to the public envoyproxy/envoy and envoyproxy/gateway repositories. If you mirrored them into Amazon ECR instead of allowing pulls from the public internet, point the Envoy subchart at your registry:
inference-envoy:
  image:
    repository: "<account-id>.dkr.ecr.<region>.amazonaws.com/<ecr-prefix>/envoy"
  shutdownManagerImage:
    repository: "<account-id>.dkr.ecr.<region>.amazonaws.com/<ecr-prefix>/gateway"

Install the chart

helm install inference-stack ./charts/inference-stack \
  --namespace poolside-models \
  -f ./charts/inference-stack/values.yaml
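
After the install, one way to confirm that the model pods came up is to wait for each model Deployment to finish rolling out. This is a sketch: the function name wait_for_models is hypothetical, the Deployment names assume the inference-<key> naming with the laguna-xs and point keys from the example values, and the 30-minute timeout is an arbitrary allowance for checkpoint downloads.

```shell
#!/usr/bin/env bash
# Sketch: wait for each model Deployment to finish rolling out.
# Assumes Deployments are named inference-<key> in the poolside-models namespace.
NAMESPACE="poolside-models"

wait_for_models() {
  local key
  for key in "$@"; do
    # The timeout is an assumption; large checkpoints can take a while to download.
    kubectl rollout status "deployment/inference-${key}" \
      --namespace "$NAMESPACE" --timeout=30m || return 1
  done
}

# Usage (requires cluster access), with the model keys from your values file:
#   wait_for_models laguna-xs point
```

The function stops at the first Deployment that fails to become ready, which makes it easier to see which model needs attention.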

Register the model

Whether you serve models locally or connect to an external API, you register them in the Poolside Console. For the full procedure, see Connect a model. Use the following deployment-specific values:
  • Model Name: The served inference model name. Retrieve it with:
    kubectl describe pod -n poolside-models | grep -A1 served-model
    
  • Base URL: The model API endpoint.
    • For locally hosted Poolside models with the Envoy proxy stack, use the in-cluster Envoy proxy service URL, for example:
      http://inference-envoy-internal.poolside-models.svc.cluster.local/v0/models/inference-laguna-xs/v1
      
    • Without the proxy stack, use the individual inference service URL, for example http://inference-laguna-xs.poolside-models.svc.cluster.local/v1.
    • For external API endpoints, use the provider’s public URL.
If you deploy without the Envoy proxy and your inference servers require API key authentication, add a custom header, Authorization: Bearer <vllm-api-key>, to the model so that the platform sends it with every inference request.
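The two in-cluster base URL forms above follow directly from the model key, which the sketch below makes explicit. The function names are hypothetical, the hostnames and paths mirror the examples in this section, and list_models assumes that the endpoint exposes the OpenAI-compatible /models route and that you run it from a pod inside the cluster.

```shell
#!/usr/bin/env bash
# Sketch: derive the in-cluster base URLs from a model key.
# Function names are hypothetical; hostnames and paths mirror the examples
# in this section and assume the poolside-models namespace.
NAMESPACE="poolside-models"

# Base URL through the Envoy proxy stack.
envoy_base_url() {
  echo "http://inference-envoy-internal.${NAMESPACE}.svc.cluster.local/v0/models/inference-$1/v1"
}

# Base URL of the individual inference service, without the proxy stack.
direct_base_url() {
  echo "http://inference-$1.${NAMESPACE}.svc.cluster.local/v1"
}

# List the models served at a base URL. Run from inside the cluster.
# Assumption: the endpoint exposes the OpenAI-compatible /models route.
# Pass the API key as the second argument only if the server requires one.
list_models() {
  local base_url="$1" api_key="${2:-}"
  if [ -n "$api_key" ]; then
    curl -fsS -H "Authorization: Bearer ${api_key}" "${base_url}/models"
  else
    curl -fsS "${base_url}/models"
  fi
}

envoy_base_url laguna-xs
direct_base_url laguna-xs
```

A call such as list_models "$(direct_base_url laguna-xs)" "<vllm-api-key>" exercises the same Authorization header that the platform would send.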

Connect external OpenAI-compatible models

If you do not run local inference, you can point the platform at any OpenAI-compatible model API. Skip the inference-stack install and register the external endpoint as the model’s base URL when you connect a model.