Overview
After the Poolside platform is initialized, deploy the inference models by installing the `inference-stack` chart provided in the bundle. The chart deploys the inference models, the Envoy proxy, and the external processor. It requires GPU nodes in the cluster and model checkpoints in S3.
Prerequisites
- Images: Ensure that the inference image, such as `atlas`, the Envoy proxy image `envoyproxy/envoy`, and the Envoy gateway image `envoyproxy/gateway` are available in your registry.
- S3 credentials: Ensure that a secret with AWS credentials, such as `aws-credentials`, exists in the `poolside-models` namespace.
- Model checkpoints: Upload checkpoints to the S3 bucket before you install the inference chart. See Install on Kubernetes → Step 4: Upload model checkpoints.
- API key authentication (optional): To require an API key for the vLLM inference servers, create a secret containing the key in `poolside-models`:
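A minimal sketch of creating that secret; the secret name (`vllm-api-key`) and key (`api-key`) are assumptions, so match them to whatever your inference values file references:

```bash
# Sketch only: the secret name and key are assumptions; use the names your
# inference values file expects.
kubectl create secret generic vllm-api-key \
  --namespace poolside-models \
  --from-literal=api-key='<your-api-key>'
```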
Configure the inference values file
Edit `inference_values.yaml` (created during the platform install) and set the fields that apply to your environment:
- To deploy without the Envoy proxy and external processor, set `tags.proxy: false`.
- For the INT4 Malibu model on 2x RTX 6000 Ada Pro GPUs with 48 GB each, limit the context length and batch size (see the sketch after this list).
- Keep in mind that the `emptyDir` volume is wiped on each restart.
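The exact values-file schema is defined by the `inference-stack` chart; the snippet below is only a sketch, with a hypothetical `models.malibu.extraArgs` field and illustrative numbers, of passing vLLM's `--max-model-len` and `--max-num-seqs` flags to cap context length and batch size:

```yaml
# Sketch only: "models.malibu.extraArgs" is an assumed structure; adapt it to the
# chart's schema. The flags map to vLLM's context-length and batch-size limits.
models:
  malibu:
    extraArgs:
      - "--max-model-len=16384"   # maximum context length in tokens (illustrative)
      - "--max-num-seqs=8"        # maximum concurrent sequences per batch (illustrative)
```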
Install the inference stack
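A minimal sketch of the install command, assuming the chart directory is `./inference-stack`, the release name is `inference-stack`, and the release goes into the `poolside-models` namespace; adjust these to match your bundle:

```bash
# Sketch only: adjust the release name, chart path, and namespace to your bundle.
helm upgrade --install inference-stack ./inference-stack \
  --namespace poolside-models \
  --values inference_values.yaml

# Wait for the inference pods to become Ready before registering models.
kubectl get pods -n poolside-models -w
```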
Register the models
After the inference server is running, register each model in the Poolside Console. For step-by-step instructions, see Connect a model. Use the following deployment-specific values when you fill out the form:
- Model Name: the served inference model name. Retrieve it from the running server (see the sketch after this list).
- Base URL: the in-cluster inference service endpoint. With the Envoy proxy stack, use the Envoy service:

  ```text
  http://inference-envoy-internal.poolside-models.svc.cluster.local/v0/models/inference-laguna/v1
  http://inference-envoy-internal.poolside-models.svc.cluster.local/v0/models/inference-point/v1
  ```

  Without the Envoy proxy stack, use the inference services directly:

  ```text
  http://inference-laguna.poolside-models.svc.cluster.local/v1
  http://inference-point.poolside-models.svc.cluster.local/v1
  ```

  The `/v1` suffix is required because the platform appends `/chat/completions` or `/completions` to the base URL, and vLLM serves these endpoints under the `/v1/` prefix.
- If you enabled API key authentication, add the `Authorization: Bearer <vllm-api-key>` custom header to the model so the platform sends the Authorization header with inference requests.
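One way to retrieve the served model name, sketched here as an in-cluster request against vLLM's `/v1/models` endpoint; the service hostname is an example, and the Authorization header is only needed if you enabled API key authentication:

```bash
# Sketch only: substitute the service for the model you deployed; drop the
# Authorization header if API key authentication is not enabled.
kubectl run -n poolside-models model-name-check --rm -it --restart=Never \
  --image=curlimages/curl --command -- \
  curl -s -H "Authorization: Bearer <vllm-api-key>" \
  http://inference-laguna.poolside-models.svc.cluster.local/v1/models
```

The served model name is the `id` field in the JSON response.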