Model inference on Amazon EKS

Overview

After you initialize the Poolside platform, you can serve Poolside models from GPU workloads inside your Amazon EKS cluster, or connect the platform to external OpenAI-compatible API endpoints. Most production deployments serve models locally for performance and data-locality reasons. Local inference uses the bundled inference-stack Helm chart, which deploys an Envoy proxy and one model deployment per enabled subchart. The Envoy proxy URL becomes the in-cluster base URL you register in the Poolside Console.

Deploy local Poolside models

Prerequisites

- A GPU node group in your cluster. The supported minimum instance type is p5e.48xlarge. The reference architecture provisions this node group, the NVIDIA GPU Operator, and the supporting IAM/IRSA wiring.
- An S3 bucket for model checkpoints. The reference architecture creates a <deployment>-models bucket and grants the inference IRSA role read access to it. For self-assembled deployments, choose any bucket name and follow the layout described in the model checkpoints guide.
- The Poolside deployment bundle, extracted on the workstation that runs helm.

Stage model checkpoints in S3

Each enabled inference subchart loads its weights from S3 at startup. The reference architecture lays checkpoints out under:

s3://<your-models-bucket>/models/checkpoints/<model>-<version>/

The reference architecture’s bucket is named <deployment>-models by default; for self-assembled deployments, use whatever bucket you provisioned. For the upload mechanics, including the streaming uploader and the bring-your-own-bucket alternative, see the model checkpoints guide in the reference architecture repository.
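As a quick sanity check before uploading, the layout above can be expressed as a small helper that builds the expected prefix. This is only a sketch: the function name and the example bucket, model, and version values are hypothetical, and the model checkpoints guide remains the authority on the layout.

```python
def checkpoint_prefix(bucket: str, model: str, version: str) -> str:
    """Build the S3 prefix for one checkpoint, following the documented layout.

    Layout: s3://<your-models-bucket>/models/checkpoints/<model>-<version>/
    """
    for name, value in (("bucket", bucket), ("model", model), ("version", version)):
        # Reject empty parts and embedded slashes, which would change the layout.
        if not value or "/" in value:
            raise ValueError(f"{name} must be a non-empty string without '/'")
    return f"s3://{bucket}/models/checkpoints/{model}-{version}/"


# Hypothetical values, for illustration only.
print(checkpoint_prefix("example-models", "example-model", "v1"))
```

Deriving the prefix in one place helps the upload step and the chart values agree on where a checkpoint lives.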

The inference IRSA role needs s3:ListBucket on the models bucket and s3:GetObject on the objects in it. The reference architecture scopes this policy automatically.
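For self-assembled deployments, a read-only policy for the models bucket might look like the sketch below. In IAM, s3:ListBucket is granted on the bucket ARN and s3:GetObject on the object ARNs, so the two actions go in separate statements. The function name, the statement IDs, and the example bucket name are assumptions for illustration; the policy that the reference architecture generates may be scoped differently (for example, to a prefix).

```python
import json


def models_bucket_read_policy(bucket: str) -> dict:
    """Return an IAM policy document granting read-only access to one bucket.

    Assumes the standard "aws" partition; other partitions use a different
    ARN prefix.
    """
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it targets the bucket ARN.
                "Sid": "ListModelsBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Reading is an object-level action, so it targets the objects.
                "Sid": "ReadModelObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


# Hypothetical bucket name, for illustration only.
print(json.dumps(models_bucket_read_policy("example-models"), indent=2))
```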

Install the inference-stack chart

The bundle ships an inference-stack chart under charts/inference-stack/. Each model is a subchart (for example, inference-malibu, inference-point) and is enabled per-deployment in your values file. Refer to the chart’s values.yaml and the reference architecture customization guide for the full set of supported overrides, including per-model GPU counts and model identifiers. The reference architecture’s poolside-values module composes these values automatically from the Terraform outputs.
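If you assemble the values file yourself, a small script can generate the per-subchart toggles. The sketch below assumes each subchart is switched on with an `enabled: true` key under the subchart's name, which is a common Helm convention but is not confirmed here for this chart; check the chart's values.yaml for the real keys. The function name is hypothetical.

```python
import json


def subchart_overrides(enabled_subcharts: list[str]) -> dict:
    """Return a values fragment that enables the named subcharts.

    ASSUMPTION: the chart gates each subchart on "<subchart>.enabled".
    Verify against charts/inference-stack/values.yaml before using.
    """
    return {name: {"enabled": True} for name in enabled_subcharts}


# Subchart name taken from the example in the text above.
overrides = subchart_overrides(["inference-malibu"])

# Helm reads values files as YAML, and this JSON output is also valid YAML.
print(json.dumps(overrides, indent=2))
```

Per-model GPU counts and model identifiers are left out on purpose, because their key names are defined by the chart and are not shown in this guide.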

Register the model in the Poolside Console

Whether you serve models locally or connect to an external API, you register them in the Poolside Console. For the full procedure, see Connect a model. Use the following deployment-specific values:

Base URL: The model API endpoint.
- For locally hosted Poolside models, use the in-cluster Envoy proxy service URL, for example:
  http://inference-envoy-internal.poolside-models.svc.cluster.local/v0/models/inference-laguna/v1
- For external API endpoints, use the provider’s public URL.
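The in-cluster URL in the example follows the usual Kubernetes service DNS form, `<service>.<namespace>.svc.cluster.local`, followed by a per-model path. A minimal sketch that rebuilds it is shown below; it assumes the path segment after `/v0/models/` is the subchart name, which is generalized from the single example above, and the function name is hypothetical.

```python
def local_model_base_url(
    subchart: str,
    service: str = "inference-envoy-internal",
    namespace: str = "poolside-models",
) -> str:
    """Build the in-cluster base URL for a locally hosted model.

    The defaults mirror the example URL in this guide. ASSUMPTION: the path
    segment after /v0/models/ is the subchart name.
    """
    host = f"{service}.{namespace}.svc.cluster.local"
    return f"http://{host}/v0/models/{subchart}/v1"


print(local_model_base_url("inference-laguna"))
```

If your deployment uses a different namespace or service name, pass them explicitly instead of relying on the defaults.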

Connect external OpenAI-compatible models

If you do not run local inference, you can point the platform at any OpenAI-compatible model API. Skip the inference-stack install and register the external endpoint as the model’s base URL when you connect a model.
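Before registering an external endpoint, it can help to confirm that it answers like an OpenAI-compatible API. The sketch below builds a request for the model listing route, `GET <base URL>/models`, with a bearer token. Whether a given provider exposes this route and uses bearer authentication is an assumption to verify against that provider's documentation; the function name, URL, and key are placeholders.

```python
import urllib.request


def build_models_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the endpoint's model list.

    ASSUMPTION: the provider serves GET <base_url>/models and accepts an
    "Authorization: Bearer <key>" header, as OpenAI's API does.
    """
    url = base_url.rstrip("/") + "/models"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


# Placeholder URL and key, for illustration only.
req = build_models_request("https://api.example.com/v1", "sk-placeholder")
print(req.get_method(), req.full_url)

# To send it from a host that can reach the provider:
#   with urllib.request.urlopen(req, timeout=10) as resp:
#       print(resp.status)
```

Run the check from inside the cluster network where possible, since the platform needs to reach the endpoint from there.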
