Overview

Use this page to estimate how many concurrent agents a Poolside deployment can support and how that capacity translates into developer seats. The planner models Laguna deployments across hardware, model, context, and latency assumptions. For supported deployment paths and minimum hardware requirements, see Supported configurations. The planner can model configurations that are useful for comparison, but it does not make an unsupported configuration supported.
Capacity estimates are planning inputs, not guarantees. Validate final sizing with your Poolside account team before you commit to production hardware or a large rollout.

Estimate capacity

Use the planner to estimate the maximum number of active agent tasks your deployment can sustain under the selected assumptions. Set the inputs to match the deployment you are planning:
  • Hardware: Select the GPU type you want to model.
  • Model: Select the Laguna model you plan to deploy.
  • Number of GPUs: Select the number of GPUs assigned to the model-serving node.
  • Average context per task: Select how large the agent’s context window grows by the end of a typical task. Use a higher value for longer tasks, larger codebases, or workflows that read many files. Use a lower value for short, focused tasks.
  • Step-latency SLO: Select the p50 latency target per agent turn. A stricter SLO lowers the number of concurrent agents the deployment can serve.
The planner reports:
  • Concurrent agents: The estimated number of active agent tasks that can occupy model-serving slots at the same time.
  • Seats at 40%: A conservative developer-seat estimate for first-time sizing or agent-heavy usage.
  • Seats at 25%: A lighter-concurrency estimate for pilots, mixed workloads, or deployments with telemetry that shows lower peak activity.
If the planner reports that a configuration cannot serve a single agent, increase the GPU count, choose a smaller model, reduce the average context size, or relax the step-latency SLO.

Interpret the estimate

Concurrent-agent capacity is not the same as the number of developers a deployment supports. A developer consumes a model-serving slot only while an agent task is actively running. Outside of an active task, the developer does not consume agent capacity. To estimate supported seats, divide concurrent-agent capacity by the fraction of seats actively running an agent at peak:
Seat estimate
seats = concurrent-agent capacity / active-concurrency ratio
Use a planning range of 25 to 40 percent, with 40 percent as the conservative default for initial sizing. Laguna agent tasks typically take two to three minutes to complete. Each active agent occupies a slot for that full duration, so the instantaneous concurrency ratio for agent workloads runs higher than for chat-style models. Without real-world telemetry from your deployment, plan against the higher end of the range.
Use 25 percent when:
  • Your deployment has telemetry showing light concurrency
  • Your workload mixes agent, chat, and completion usage
  • You are sizing a pilot or limited rollout
Use 40 percent when:
  • This is a first-time sizing without observed concurrency
  • The team works in an agent-first culture
  • Latency degradation is particularly disruptive in your environment
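The seat formula and the two planning ratios can be checked with a few lines of Python. This is a minimal sketch, not the planner's implementation. The function names `estimate_seats` and `required_concurrent_agents` are hypothetical, the capacity of 20 concurrent agents and the team of 120 developers are illustrative inputs rather than measured results, and the rounding direction (down for seats, up for required capacity) is an assumption chosen to keep both numbers conservative.

```python
def estimate_seats(concurrent_agents: int, active_percent: int) -> int:
    """seats = concurrent-agent capacity / active-concurrency ratio.

    The ratio is passed as a whole-number percentage so the division stays
    in integer arithmetic. Rounding down is an assumption of this sketch.
    """
    if concurrent_agents < 0:
        raise ValueError("concurrent_agents must not be negative")
    if not 0 < active_percent <= 100:
        raise ValueError("active_percent must be between 1 and 100")
    return concurrent_agents * 100 // active_percent


def required_concurrent_agents(seats: int, active_percent: int) -> int:
    """Inverse of estimate_seats: capacity needed for a given seat count.

    Rounds up so the requirement is never understated (an assumption).
    """
    if seats < 0:
        raise ValueError("seats must not be negative")
    if not 0 < active_percent <= 100:
        raise ValueError("active_percent must be between 1 and 100")
    return (seats * active_percent + 99) // 100  # ceiling division


if __name__ == "__main__":
    capacity = 20  # illustrative planner output, not a benchmark result
    print("Seats at 40%:", estimate_seats(capacity, 40))
    print("Seats at 25%:", estimate_seats(capacity, 25))
    print("Agents needed for 120 seats at 40%:",
          required_concurrent_agents(120, 40))
```

With these illustrative inputs, the same capacity yields 50 seats at 40 percent and 80 seats at 25 percent, which shows how much the choice of ratio moves the seat estimate.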

Understand calibration confidence

The planner uses an analytical inference simulator calibrated against measured Poolside benchmarks. The confidence badge in the planner indicates how closely the selected configuration matches measured data:
  • Calibrated: Direct measurement exists for the selected model, GPU, GPU count, and precision.
  • Same arch: Measurement exists for the same model on the same GPU architecture.
  • Factorized, partial signal, or arch median: The estimate depends more heavily on extrapolation.
Use extrapolated estimates for comparison and early planning. For production sizing, validate the selected configuration with Poolside against your expected workload.

What affects capacity

Real-world capacity depends on your deployment shape and workload:
  • Model choice
  • GPU type and GPU count
  • Weight and key-value cache precision
  • Average context size per trajectory
  • Step-latency target
  • Number of steps each agent task takes
  • Mix of agent, chat, and completion workloads
  • Peak-time concurrency and burst behavior
Use the planner output as a conservative starting point. As your team builds usage history, replace the default planning assumptions with your own observed values.
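One way to replace the default 25 to 40 percent assumption is to derive the active-concurrency ratio from your own usage history. The following is a minimal sketch, assuming you can export a start and end time for each agent task. It sweeps those intervals to find the peak number of simultaneous tasks and divides by the seat count. The record format, function names, and sample numbers are hypothetical and do not describe a Poolside export format. The result matches the "fraction of seats actively running an agent" only if each developer runs at most one task at a time.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds) of one agent task


def peak_concurrency(tasks: Iterable[Interval]) -> int:
    """Return the largest number of tasks running at the same instant."""
    events: List[Tuple[float, int]] = []
    for start, end in tasks:
        if end <= start:
            raise ValueError("each task must have a positive duration")
        events.append((start, 1))   # task begins: one more slot in use
        events.append((end, -1))    # task ends: the slot is released
    # Tuples sort by time first, then by delta, so at equal timestamps an
    # end (-1) is processed before a start (+1) and back-to-back tasks
    # are not counted as overlapping.
    events.sort()
    running = 0
    peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


def observed_active_ratio(tasks: Iterable[Interval], seats: int) -> float:
    """Peak simultaneous agent tasks divided by the number of seats."""
    if seats <= 0:
        raise ValueError("seats must be positive")
    return peak_concurrency(tasks) / seats


# Hypothetical sample: five tasks of two to three minutes across 10 seats.
sample_tasks = [(0, 150), (30, 180), (60, 200), (170, 320), (400, 520)]
sample_ratio = observed_active_ratio(sample_tasks, seats=10)
print(f"Observed active-concurrency ratio: {sample_ratio:.0%}")
```

Measure the ratio over a window that includes your busiest periods, because the seat estimate is sensitive to peak behavior rather than the average.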

Choose a model for your deployment

For full model details, see Supported models.
Model | When to choose it
Laguna XS.2 | Use when concurrent-agent throughput is the priority, you have limited GPU availability, or you need a strong default for most agent workloads.
Laguna M.1 | Use when agent quality matters more than raw throughput. It is the best fit on 8× H200 hardware and can serve smaller teams on RTX 6000 Blackwell when your concurrency needs are lower.
Point | Use for editor completion alongside an agent model.
Malibu 2.2 remains available for existing deployments and dense-model preferences, but the planner focuses on Laguna capacity. For Malibu sizing, contact your Poolside account team.

Scale beyond a single node

Both on-premises and cloud deployments support multi-node GPU inference within a single Kubernetes cluster. Multi-node configurations distribute independent inference replicas across nodes to add throughput. Cross-node tensor parallelism is not supported, and multi-node configurations do not provide high availability against node failures. For platform-specific configuration, see Supported configurations for on-premises, or the model-inference page for your cloud platform: Amazon EKS, OpenShift, or upstream Kubernetes.
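Because multi-node configurations add independent replicas rather than splitting one model across nodes, a rough first approximation is to multiply per-node capacity by the number of nodes. The sketch below assumes one replica per node and linear scaling, which this page does not guarantee. The function names and numbers are hypothetical. It reserves no headroom for node failures, since multi-node configurations do not provide high availability.

```python
def cluster_capacity(per_node_agents: int, nodes: int) -> int:
    """Approximate total concurrent agents, assuming linear scaling."""
    if per_node_agents < 0 or nodes < 0:
        raise ValueError("inputs must not be negative")
    return per_node_agents * nodes


def nodes_needed(target_agents: int, per_node_agents: int) -> int:
    """Smallest node count whose combined capacity meets the target."""
    if target_agents < 0:
        raise ValueError("target_agents must not be negative")
    if per_node_agents <= 0:
        # A node that cannot serve a single agent cannot be fixed by
        # adding replicas; resize the node configuration instead.
        raise ValueError("per_node_agents must be positive")
    return (target_agents + per_node_agents - 1) // per_node_agents


# Illustrative inputs: 20 agents per node, 48 concurrent agents required.
node_count = nodes_needed(target_agents=48, per_node_agents=20)
print("Nodes needed:", node_count)
print("Approximate cluster capacity:", cluster_capacity(20, node_count))
```

Treat the result as a starting point for discussion with your Poolside account team rather than a final node count.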

Keep estimates current

Capacity changes as Poolside improves quantization formats, key-value cache handling, request scheduling, and inference-engine performance. Revisit this page when you change model versions, hardware, workload mix, or latency targets.