Overview

Use this page to size a Poolside deployment for on-premises and cloud environments. It explains what affects capacity, lists Poolside’s measured concurrent-agent capacity for each supported hardware tier, and shows how to translate those numbers into a developer seat count. For hardware specifications and supported configurations, see Supported configurations.

What affects capacity

Real-world capacity depends on factors that vary by team and workload:
  • Average step latency for the model and hardware combination
  • Number of steps each agent task takes to complete
  • Average task complexity
  • Average context window utilization per request
  • Mix of agent, chat, and completion workloads
  • Time-of-day concurrency patterns and burst behavior
Use the figures on this page as a conservative starting point. As your team builds usage history, replace the default planning assumptions with your own observed values.

How Poolside measures capacity

Poolside publishes capacity numbers measured under deliberately conservative conditions:
  • Step time threshold: Average step time stays under five seconds across the measured agent population.
  • Quantization: Laguna numbers reflect FP8 model weights with an FP8 KV cache. Malibu 2.2 INT4 numbers reflect INT4 weights.
  • Concurrency unit: Each concurrent agent is an active agent task occupying a model-serving slot, not a logged-in developer.
These thresholds keep latency predictable. Numbers are intentionally conservative because under-provisioning has a more disruptive impact on end users than over-provisioning.

Concurrent-agent capacity by hardware

The following table lists the maximum number of concurrent agents each hardware tier supports while staying under the step time threshold. Plain values are measured; values prefixed with ~ are extrapolated from measured numbers on similar configurations.

| Hardware tier | Total GPU memory | Laguna M.1 | Laguna XS.2 | Malibu 2.2 INT4 |
| --- | --- | --- | --- | --- |
| 8× H200 (HGX rack or BYO) | 1128 GB | 41 | 80 | 38 |
| 4× H200 (BYO minimum) | 564 GB | ~20 | ~40 | Untested |
| 8× RTX 6000 Blackwell (rack) | 768 GB | ~32 | ~112 | ~12 |
| 4× RTX 6000 Blackwell (tower) | 384 GB | 16 | 56 | 6 |
Tail latency varies across configurations. On 4× RTX 6000 Blackwell, Laguna M.1 has a p99 step time of around 21 seconds, higher than other supported combinations. For latency-sensitive interactive workloads, prefer Laguna XS.2 on this tier or move to an H200 configuration.
DGX Spark is under active development and is not yet a supported deployment tier. It targets individual-developer evaluation rather than team deployment, and capacity numbers may change as development continues.

Translate concurrent agents into developer seats

Concurrent-agent capacity is not the same as the number of developers a deployment supports. A developer running an agent task occupies a slot for the duration of that task. Outside of an active task, the developer does not consume capacity. To estimate supported seats, divide concurrent-agent capacity by the fraction of seats actively running an agent at peak:
seats = concurrent-agent capacity / active-concurrency ratio
Use a planning range of 25 to 40 percent, with 40 percent as the conservative default for initial sizing. Laguna agent tasks typically take two to three minutes to complete. Each active agent occupies a slot for that full duration, so the instantaneous concurrency ratio for agent workloads runs higher than for chat-style models. Without real-world telemetry from your deployment, plan against the higher end of the range. Pick the lower end (25 percent) when:
  • Your deployment has telemetry showing light concurrency
  • The workload mixes agents with chat and completion in roughly equal share
  • You are sizing a pilot or limited rollout
Pick the higher end (40 percent) when:
  • This is a first-time sizing without observed concurrency
  • The team works in an agent-first culture
  • Latency degradation is particularly disruptive in your environment
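The sizing rule above can be sketched in a few lines of Python. This is an illustrative helper, not part of any Poolside tooling; the function name is our own:

```python
def seats_supported(concurrent_agents: int, active_ratio: float) -> int:
    """Estimate developer seats from concurrent-agent capacity.

    active_ratio is the fraction of seats actively running an agent
    task at peak: 0.40 is the conservative default for initial sizing,
    0.25 the lower bound for lighter-concurrency workloads.
    """
    if not 0.0 < active_ratio <= 1.0:
        raise ValueError("active_ratio must be in (0, 1]")
    # Round to the nearest whole seat; treat the result as an estimate.
    return round(concurrent_agents / active_ratio)

# 8x H200 running Laguna XS.2 supports 80 concurrent agents.
print(seats_supported(80, 0.40))  # 200 seats at the conservative default
print(seats_supported(80, 0.25))  # 320 seats at the lower bound
```

Sizing against the 40 percent default and checking what the 25 percent ratio yields gives a quick sense of how much headroom the same hardware offers if observed concurrency turns out lighter than assumed.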

Worked examples

The following table shows supported seat counts at both the 40 percent conservative default and the 25 percent lower bound. Actual capacity depends on your observed concurrency.
| Hardware tier | Model | Concurrent agents | Seats at 40% | Seats at 25% |
| --- | --- | --- | --- | --- |
| 8× H200 | Laguna M.1 | 41 | ~100 | ~165 |
| 8× H200 | Laguna XS.2 | 80 | ~200 | ~320 |
| 4× H200 | Laguna XS.2 | ~40 | ~100 | ~160 |
| 8× RTX 6000 Blackwell | Laguna XS.2 | ~112 | ~280 | ~450 |
| 4× RTX 6000 Blackwell | Laguna M.1 | 16 | ~40 | ~65 |
| 4× RTX 6000 Blackwell | Laguna XS.2 | 56 | ~140 | ~225 |
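The seat counts above can be reproduced with a short script. Rounding to the nearest multiple of five is our assumption to match the table's granularity, not a published Poolside convention:

```python
def approx_seats(concurrent_agents: float, active_ratio: float) -> int:
    """Divide capacity by the peak active-concurrency ratio, then round
    to the nearest multiple of five to match the table's granularity."""
    return 5 * round(concurrent_agents / active_ratio / 5)

# Reproduce a few rows of the worked-examples table.
for tier, agents in [("8x H200 / Laguna XS.2", 80),
                     ("8x RTX 6000 Blackwell / Laguna XS.2", 112),
                     ("4x RTX 6000 Blackwell / Laguna M.1", 16)]:
    print(f"{tier}: ~{approx_seats(agents, 0.40)} seats at 40%, "
          f"~{approx_seats(agents, 0.25)} at 25%")
```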

Choose a model for your deployment

For full model details, see Supported models.
| Model | When to choose it |
| --- | --- |
| Laguna XS.2 | Moderate-complexity use cases where concurrent-agent throughput is the priority, or where high-performance GPU availability is constrained. Strong fit for agent workloads with acceptable tail latency. |
| Laguna M.1 | When agent quality matters more than raw throughput. Best fit on 8× H200 hardware. Viable on 4× RTX 6000 Blackwell for small teams, with reduced concurrency and higher tail latency. |
| Malibu 2.2 | Existing deployments, dense-model preferences, or when an INT4 quantization path is required. New deployments on RTX 6000 Blackwell hardware should consider Laguna XS.2 first. |
| Point | Editor completion alongside an agent model. |

Scale beyond a single node

Both on-premises and cloud deployments support multi-node GPU inference within a single Kubernetes cluster. Multi-node configurations distribute independent inference replicas across nodes to add throughput. Cross-node tensor parallelism is not supported, and multi-node configurations do not provide high availability against node failures. For platform-specific configuration, see Supported configurations for on-premises, or the model-inference page for your cloud platform: Amazon EKS, OpenShift, or upstream Kubernetes.

Continuous improvement

The capacity numbers on this page reflect the current Poolside inference stack. Poolside continues to improve performance through better quantization formats, KV cache handling, request scheduling, and inference-engine optimizations. Poolside updates this page as new measurements land.