Overview

Use this page to size a Poolside deployment for on-premises and cloud environments. It explains what affects capacity, lists Poolside’s measured concurrent-agent capacity for each supported hardware tier, and shows how to translate those numbers into a developer seat count. For hardware specifications and supported configurations, see Supported configurations.

What affects capacity

Real-world capacity depends on factors that vary by team and workload:
  • Average step latency for the model and hardware combination
  • Number of steps each agent task takes to complete
  • Average task complexity
  • Average context window utilization per request
  • Mix of agent, chat, and completion workloads
  • Time-of-day concurrency patterns and burst behavior
Use the figures on this page as a conservative starting point. As your team builds usage history, replace the default planning assumptions with your own observed values.

How Poolside measures capacity

Poolside publishes capacity numbers measured under deliberately conservative conditions:
  • Step time threshold: Average step time stays under five seconds across the measured agent population.
  • Quantization: Laguna numbers reflect FP8 model weights with an FP8 KV cache. Malibu 2.2 INT4 numbers reflect INT4 weights.
  • Concurrency unit: Each concurrent agent is an active agent task occupying a model-serving slot, not a logged-in developer.
These thresholds keep latency predictable. Numbers are intentionally conservative because under-provisioning has a more disruptive impact on end users than over-provisioning.

Concurrent-agent capacity by hardware

The following table lists the maximum number of concurrent agents each hardware tier supports while staying under the step time threshold. Plain values are measured; values prefixed with ~ are extrapolated from measured numbers on similar configurations.

| Hardware tier | Total GPU memory | Laguna M.1 | Laguna XS.2 | Malibu 2.2 INT4 |
| --- | --- | --- | --- | --- |
| 8× H200 (HGX rack or BYO) | 1128 GB | 41 | 80 | 38 |
| 4× H200 (BYO minimum) | 564 GB | ~20 | ~40 | Untested |
| 8× RTX 6000 Blackwell (rack) | 768 GB | ~32 | ~112 | ~12 |
| 4× RTX 6000 Blackwell (tower) | 384 GB | 16 | 56 | 6 |
Tail latency varies across configurations. On 4× RTX 6000 Blackwell, Laguna M.1 has a p99 step time of around 21 seconds, higher than other supported combinations. For latency-sensitive interactive workloads, prefer Laguna XS.2 on this tier or move to an H200 configuration.
DGX Spark is under active development and is not yet a supported deployment tier. It targets individual-developer evaluation rather than team deployment, and capacity numbers may change as development continues.

Translate concurrent agents into developer seats

Concurrent-agent capacity is not the same as the number of developers a deployment supports. A developer running an agent task occupies a slot for the duration of that task. Outside of an active task, the developer does not consume capacity. To estimate supported seats, divide concurrent-agent capacity by the fraction of seats actively running an agent at peak:
seats = concurrent-agent capacity / active-concurrency ratio
Use a planning range of 25 to 40 percent, with 40 percent as the conservative default for initial sizing. Laguna agent tasks typically take two to three minutes to complete. Each active agent occupies a slot for that full duration, so the instantaneous concurrency ratio for agent workloads runs higher than for chat-style models. Without real-world telemetry from your deployment, plan against the higher end of the range. Pick the lower end (25 percent) when:
  • Your deployment has telemetry showing light concurrency
  • The workload mixes agents with chat and completion in roughly equal share
  • You are sizing a pilot or limited rollout
Pick the higher end (40 percent) when:
  • This is a first-time sizing without observed concurrency
  • The team works in an agent-first culture
  • Latency degradation is particularly disruptive in your environment
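The sizing rule above can be sketched in a few lines of Python. This is an illustrative helper, not part of any Poolside tooling; the function name is our own:

```python
def seats_supported(concurrent_agents: int, active_ratio: float) -> int:
    """Estimate developer seats from concurrent-agent capacity.

    active_ratio is the fraction of seats actively running an agent
    task at peak: 0.40 is the conservative default for initial sizing,
    0.25 the lower bound for lighter-concurrency workloads.
    """
    if not 0.0 < active_ratio <= 1.0:
        raise ValueError("active_ratio must be in (0, 1]")
    # Round to the nearest whole seat; treat the result as an estimate.
    return round(concurrent_agents / active_ratio)

# 8x H200 running Laguna XS.2 supports 80 concurrent agents.
print(seats_supported(80, 0.40))  # 200 seats at the conservative default
print(seats_supported(80, 0.25))  # 320 seats at the lower bound
```

Sizing against the 40 percent default and checking what the 25 percent ratio yields gives a quick sense of how much headroom the same hardware offers if observed concurrency turns out lighter than assumed.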

Worked examples

The following table shows supported seat counts at both the 40 percent conservative default and the 25 percent lower bound. Actual capacity depends on your observed concurrency.
| Hardware tier | Model | Concurrent agents | Seats at 40% | Seats at 25% |
| --- | --- | --- | --- | --- |
| 8× H200 | Laguna M.1 | 41 | ~100 | ~165 |
| 8× H200 | Laguna XS.2 | 80 | ~200 | ~320 |
| 4× H200 | Laguna XS.2 | ~40 | ~100 | ~160 |
| 8× RTX 6000 Blackwell | Laguna XS.2 | ~112 | ~280 | ~450 |
| 4× RTX 6000 Blackwell | Laguna M.1 | 16 | ~40 | ~65 |
| 4× RTX 6000 Blackwell | Laguna XS.2 | 56 | ~140 | ~225 |
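The seat counts above can be reproduced with a short script. Rounding to the nearest multiple of five is our assumption to match the table's granularity, not a published Poolside convention:

```python
def approx_seats(concurrent_agents: float, active_ratio: float) -> int:
    """Divide capacity by the peak active-concurrency ratio, then round
    to the nearest multiple of five to match the table's granularity."""
    return 5 * round(concurrent_agents / active_ratio / 5)

# Reproduce a few rows of the worked-examples table.
for tier, agents in [("8x H200 / Laguna XS.2", 80),
                     ("8x RTX 6000 Blackwell / Laguna XS.2", 112),
                     ("4x RTX 6000 Blackwell / Laguna M.1", 16)]:
    print(f"{tier}: ~{approx_seats(agents, 0.40)} seats at 40%, "
          f"~{approx_seats(agents, 0.25)} at 25%")
```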

Choose a model for your deployment

For full model details, see Supported models.
| Model | When to choose it |
| --- | --- |
| Laguna XS.2 | Moderate-complexity use cases where concurrent-agent throughput is the priority, or where high-performance GPU availability is constrained. Strong fit for agent workloads with acceptable tail latency. |
| Laguna M.1 | When agent quality matters more than raw throughput. Best fit on 8× H200 hardware. Viable on 4× RTX 6000 Blackwell for small teams, with reduced concurrency and higher tail latency. |
| Malibu 2.2 | Existing deployments, dense-model preferences, or when an INT4 quantization path is required. New deployments on RTX 6000 Blackwell hardware should consider Laguna XS.2 first. |
| Point | Editor completion alongside an agent model. |

Scale beyond a single node

Both on-premises and cloud deployments support multi-node GPU inference within a single Kubernetes cluster. Multi-node configurations distribute independent inference replicas across nodes to add throughput. Cross-node tensor parallelism is not supported, and multi-node configurations do not provide high availability against node failures. For platform-specific configuration, see Supported configurations for on-premises, or the model-inference page for your cloud platform: Amazon EKS, OpenShift, or upstream Kubernetes.

Continuous improvement

The capacity numbers on this page reflect the current Poolside inference stack. Poolside continues to improve performance through better quantization formats, KV cache handling, request scheduling, and inference-engine optimizations. Poolside updates this page as new measurements land.