Introduction

The Poolside inference stack’s vLLM model server exposes Prometheus metrics on its /metrics endpoint. These are the primary signals for inference health and performance. Use this page to find the signals you care about and to point your own monitoring stack at the right endpoint. This page is for operators running a self-hosted deployment. The model server pods that expose these metrics run in the poolside-models namespace.
The deployment does not bundle a metrics stack. No Prometheus, Grafana, or OpenTelemetry collector runs by default, and no ServiceMonitor or PodMonitor CRDs are installed. The endpoints below are exposed, but nothing scrapes them until you wire up your own monitoring. See Collect the metrics.

Model server metrics

The vLLM model server exposes native vLLM metrics on :8080/metrics, under the vllm:*, http_*, python_*, and process_* prefixes. There is one Kubernetes Deployment per model, with pods named inference-<uuid>. The examples below resolve the target pod by Helm label at runtime, so they keep working as pods are renamed on redeploy. Core vLLM metrics are present on an idle pod, with counters and histograms starting at 0.

What to watch

| To find out | Watch | Healthy |
| --- | --- | --- |
| Is the server overloaded? | vllm:num_requests_waiting, vllm:kv_cache_usage_perc | Queue near 0, usage below ~0.9 |
| Are responses slow to start? | vllm:time_to_first_token_seconds | Stays low and steady; a rising trend means slower first tokens |
| Is streaming smooth after streaming starts? | vllm:inter_token_latency_seconds | Stays low and steady; a rising trend means choppier streaming |
| Are requests rejected or preempted? | vllm:num_dropped, vllm:num_preemptions | 0 |
| Is the prefix cache helping? | vllm:prefix_cache_hits_total over vllm:prefix_cache_queries_total | Higher is better; no fixed target, but watch for drops |
| Is throughput holding up? | vllm:generation_tokens_total (rate), vllm:num_requests_running | Token rate steady under load; a drop while requests are running signals a stall |
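
As a rough illustration of the overload check in the table above, the sketch below reads exposition text and flags a non-empty queue or KV-cache usage at or above 0.9. The check_load function name, the sample values, and the exact thresholds are assumptions made for this example, not part of the deployment; tune them to your own workload.

```shell
# Illustrative scrape excerpt; the values are made up for this sketch.
sample='vllm:num_requests_running{engine="0",model_name="agent-small"} 8
vllm:num_requests_waiting{engine="0",model_name="agent-small"} 2
vllm:kv_cache_usage_perc{engine="0",model_name="agent-small"} 0.93'

# check_load (hypothetical helper) reads exposition text on stdin and prints
# one line per signal outside the assumed healthy range:
# any queued request, or KV-cache usage >= 0.9.
check_load() {
  awk '
    /^vllm:num_requests_waiting[{ ]/ && $NF > 0    { print "queue: " $NF " waiting" }
    /^vllm:kv_cache_usage_perc[{ ]/  && $NF >= 0.9 { print "kv_cache: " $NF " in use" }
  '
}

report=$(printf '%s\n' "$sample" | check_load)
echo "$report"
```

Against a live pod, the same function can read from the port-forwarded endpoint instead of the sample, for example `curl -s http://localhost:18080/metrics | check_load`.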

Latency and throughput

The user-facing performance signals. They are histograms, so alert on high percentiles rather than averages.

| Metric | Type | Description |
| --- | --- | --- |
| vllm:time_to_first_token_seconds | histogram | Time to first token (TTFT), the clearest measure of perceived responsiveness |
| vllm:inter_token_latency_seconds | histogram | Gap between successive output tokens (ITL); sets streaming speed after a response starts |
| vllm:e2e_request_latency_seconds | histogram | End-to-end request latency, from queue to final token |
| vllm:request_time_per_output_token_seconds | histogram | Decode cost normalized per output token |
| vllm:request_queue_time_seconds | histogram | Time spent waiting in the queue |
| vllm:request_prefill_time_seconds | histogram | Time spent in the prefill phase |
| vllm:request_decode_time_seconds | histogram | Time spent in the decode phase |
| vllm:request_inference_time_seconds | histogram | Time spent in the running (inference) phase |
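
To make "alert on high percentiles" concrete, here is a minimal sketch that estimates a quantile from the cumulative _bucket lines of one histogram, using linear interpolation inside the matching bucket. The quantile function name, the bucket boundaries, and the counts are assumptions for illustration. The sketch also assumes a single label set and buckets in ascending le order, and it expects a quantile between 0 (exclusive) and 1.

```shell
# Illustrative TTFT buckets; boundaries and counts are made up for this sketch.
sample='vllm:time_to_first_token_seconds_bucket{engine="0",model_name="agent-small",le="0.1"} 50
vllm:time_to_first_token_seconds_bucket{engine="0",model_name="agent-small",le="0.25"} 80
vllm:time_to_first_token_seconds_bucket{engine="0",model_name="agent-small",le="0.5"} 95
vllm:time_to_first_token_seconds_bucket{engine="0",model_name="agent-small",le="1.0"} 100
vllm:time_to_first_token_seconds_bucket{engine="0",model_name="agent-small",le="+Inf"} 100'

# quantile Q (hypothetical helper): estimate quantile Q from cumulative
# bucket lines on stdin by interpolating within the first bucket whose
# cumulative count reaches the target rank.
quantile() {
  awk -v q="$1" '
    /_bucket[{]/ {
      match($0, /[{,]le="[^"]*"/)
      le[++n] = substr($0, RSTART + 5, RLENGTH - 6)   # upper bound of bucket
      cum[n] = $NF + 0                                # cumulative count
    }
    END {
      if (n == 0 || cum[n] == 0) exit 1               # no observations yet
      rank = q * cum[n]
      prev_le = 0; prev_cum = 0
      for (i = 1; i <= n; i++) {
        if (cum[i] >= rank) {
          # The +Inf bucket has no upper bound, so report its lower bound.
          if (le[i] == "+Inf") { printf "%.4f\n", prev_le; exit }
          printf "%.4f\n", prev_le + (le[i] - prev_le) * (rank - prev_cum) / (cum[i] - prev_cum)
          exit
        }
        prev_le = le[i]; prev_cum = cum[i]
      }
    }
  '
}

p90=$(printf '%s\n' "$sample" | quantile 0.9)
echo "p90 TTFT: ${p90}s"
```

A single scrape covers every observation since the pod started, so this is a lifetime estimate. For alerting on recent behavior, compute the quantile in your monitoring stack over a time window instead.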

Load and queue

How saturated the engine is. The first place to look when latency climbs or when deciding whether to scale.

| Metric | Type | Description |
| --- | --- | --- |
| vllm:num_requests_waiting | gauge | Requests queued for admission; the strongest signal you need more capacity |
| vllm:num_requests_running | gauge | Requests in the active execution batch; near batch capacity means fully loaded |
| vllm:num_preemptions | gauge | Requests preempted by the engine, usually under KV-cache pressure |
| vllm:num_dropped | gauge | Requests dropped because the queue exceeded its maximum size |
| vllm:iteration_tokens_total | histogram | Tokens processed per engine step, a measure of batching efficiency |

KV cache and prefix caching

KV-cache pressure and prefix-cache hit rate largely determine throughput and latency under load.

| Metric | Type | Description |
| --- | --- | --- |
| vllm:kv_cache_usage_perc | gauge | Fraction of the KV cache in use, where 1.0 is 100%; a key saturation signal |
| vllm:prefix_cache_queries_total | counter | Prefix-cache lookups, in tokens |
| vllm:prefix_cache_hits_total | counter | Prefix-cache hits, in tokens; hits over queries is the local hit rate |
| vllm:external_prefix_cache_queries_total | counter | Cross-instance prefix-cache lookups, in tokens |
| vllm:external_prefix_cache_hits_total | counter | Cross-instance prefix-cache hits, in tokens |
| vllm:prompt_tokens_cached_total | counter | Prompt tokens served from cache, local plus external |
| vllm:prompt_tokens_recomputed_total | counter | Cached tokens that had to be recomputed; rising values indicate cache thrashing |
| vllm:request_prefill_kv_computed_tokens | histogram | New KV tokens computed during prefill, excluding cached tokens |
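
The local hit rate described above is hits divided by queries. A minimal sketch of that arithmetic on raw scrape text follows; the hit_rate function name and the sample values are assumptions for illustration.

```shell
# Illustrative counter values; made up for this sketch.
sample='vllm:prefix_cache_queries_total{engine="0",model_name="agent-small"} 2000
vllm:prefix_cache_hits_total{engine="0",model_name="agent-small"} 500'

# hit_rate (hypothetical helper) sums both counters across label sets and
# prints hits / queries, or "n/a" when no lookups have happened yet.
# The ^vllm:prefix_ anchor keeps the external_prefix_cache_* series out.
hit_rate() {
  awk '
    /^vllm:prefix_cache_queries_total[{ ]/ { queries += $NF }
    /^vllm:prefix_cache_hits_total[{ ]/    { hits += $NF }
    END { if (queries > 0) printf "%.4f\n", hits / queries; else print "n/a" }
  '
}

rate=$(printf '%s\n' "$sample" | hit_rate)
echo "local prefix-cache hit rate: $rate"
```

Because both values are cumulative counters, this is the hit rate since the pod started. To watch for the drops mentioned earlier, compare the change in each counter between two scrapes rather than the lifetime totals.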

Token counts

Use these for cost tracking, capacity planning, and understanding workload shape.

| Metric | Type | Description |
| --- | --- | --- |
| vllm:prompt_tokens_total | counter | Cumulative prompt (prefill) tokens processed |
| vllm:generation_tokens_total | counter | Cumulative generated tokens; basis for tokens-per-second throughput |
| vllm:prompt_tokens_by_source_total | counter | Prompt tokens broken down by source |
| vllm:request_prompt_tokens | histogram | Per-request prompt token count |
| vllm:request_generation_tokens | histogram | Per-request generated token count |
| vllm:request_max_num_generation_tokens | histogram | Per-request maximum requested generation tokens |
| vllm:request_params_max_tokens | histogram | Distribution of the max_tokens request parameter |
| vllm:request_params_n | histogram | Distribution of the n request parameter |
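
Tokens-per-second throughput comes from the change in vllm:generation_tokens_total between two scrapes divided by the time between them. The sketch below shows that calculation with two made-up readings taken 30 seconds apart; the read_counter function name, the interval, and the values are assumptions for illustration.

```shell
# Two illustrative readings of the counter, assumed to be 30 seconds apart.
first='vllm:generation_tokens_total{engine="0",model_name="agent-small"} 120000'
second='vllm:generation_tokens_total{engine="0",model_name="agent-small"} 126000'
interval=30

# read_counter (hypothetical helper) sums the counter across all label
# sets found in the scrape text on stdin.
read_counter() {
  awk '/^vllm:generation_tokens_total[{ ]/ { sum += $NF } END { printf "%.0f\n", sum }'
}

a=$(printf '%s\n' "$first" | read_counter)
b=$(printf '%s\n' "$second" | read_counter)

# A counter that went down means the pod restarted between scrapes. In that
# case this sketch counts from zero, a simplification of reset handling.
tokens_per_sec=$(awk -v a="$a" -v b="$b" -v dt="$interval" 'BEGIN {
  delta = (b >= a) ? b - a : b
  printf "%.1f\n", delta / dt
}')
echo "$tokens_per_sec tokens/s"
```

In a Prometheus-compatible stack the same figure normally comes from a rate query over the counter, which also handles resets for you.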

Speculative decoding

Present only when speculative decoding is enabled. Accepted over draft tokens is the acceptance rate; a low rate means speculation is wasting compute.

| Metric | Type | Description |
| --- | --- | --- |
| vllm:spec_decode_num_drafts_total | counter | Speculative-decoding drafts proposed |
| vllm:spec_decode_num_draft_tokens_total | counter | Draft tokens proposed |
| vllm:spec_decode_num_accepted_tokens_total | counter | Accepted draft tokens |
| vllm:spec_decode_num_accepted_tokens_per_pos_total | counter | Accepted draft tokens by draft position |
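
As a sketch of the acceptance-rate arithmetic described above, the example below divides accepted tokens by proposed draft tokens, and also reports accepted tokens per draft as a secondary figure. The sample values are assumptions for illustration, and the second ratio is an extra convenience rather than something this page defines.

```shell
# Illustrative counter values; made up for this sketch.
sample='vllm:spec_decode_num_drafts_total{engine="0",model_name="agent-small"} 1000
vllm:spec_decode_num_draft_tokens_total{engine="0",model_name="agent-small"} 4000
vllm:spec_decode_num_accepted_tokens_total{engine="0",model_name="agent-small"} 2600'

# Sum each counter across label sets, then print both ratios.
# The trailing [{ ] keeps the *_per_pos_total series out of the accepted sum.
acceptance=$(printf '%s\n' "$sample" | awk '
  /^vllm:spec_decode_num_drafts_total[{ ]/          { drafts += $NF }
  /^vllm:spec_decode_num_draft_tokens_total[{ ]/    { proposed += $NF }
  /^vllm:spec_decode_num_accepted_tokens_total[{ ]/ { accepted += $NF }
  END {
    if (proposed == 0 || drafts == 0) { print "n/a"; exit }
    printf "acceptance_rate=%.2f accepted_per_draft=%.2f\n", accepted / proposed, accepted / drafts
  }')
echo "$acceptance"
```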

Model FLOPs utilization

Use these to estimate model FLOPs utilization (MFU) and tell whether a workload is compute-bound or memory-bound.

| Metric | Type | Description |
| --- | --- | --- |
| vllm:estimated_flops_per_gpu_total | counter | Estimated floating-point operations per GPU |
| vllm:estimated_read_bytes_per_gpu_total | counter | Estimated bytes read from memory per GPU |
| vllm:estimated_write_bytes_per_gpu_total | counter | Estimated bytes written to memory per GPU |
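
One common way to judge compute-bound versus memory-bound is a roofline-style comparison: divide FLOPs by bytes moved to get arithmetic intensity, then compare it with the hardware's ratio of peak FLOP/s to peak memory bandwidth. The sketch below applies that idea to these counters. The sample values and both hardware peaks are hypothetical, the sketch assumes a single series per metric, and nothing here is confirmed as the method the model server itself uses.

```shell
# Illustrative counter values; made up for this sketch.
sample='vllm:estimated_flops_per_gpu_total{engine="0",model_name="agent-small"} 4e+15
vllm:estimated_read_bytes_per_gpu_total{engine="0",model_name="agent-small"} 1.5e+13
vllm:estimated_write_bytes_per_gpu_total{engine="0",model_name="agent-small"} 5e+12'

# Hypothetical hardware peaks; replace with your GPU's datasheet numbers.
peak_flops=1e15        # FLOP/s
peak_bandwidth=2e12    # bytes/s

verdict=$(printf '%s\n' "$sample" | awk -v pf="$peak_flops" -v pb="$peak_bandwidth" '
  /^vllm:estimated_flops_per_gpu_total[{ ]/       { flops = $NF }
  /^vllm:estimated_read_bytes_per_gpu_total[{ ]/  { bytes += $NF }
  /^vllm:estimated_write_bytes_per_gpu_total[{ ]/ { bytes += $NF }
  END {
    if (bytes == 0) { print "n/a"; exit }
    intensity = flops / bytes      # FLOPs per byte moved
    ridge = pf / pb                # intensity at which the hardware balances
    printf "intensity=%.1f ridge=%.1f %s\n", intensity, ridge, (intensity < ridge ? "memory-bound" : "compute-bound")
  }')
echo "$verdict"
```

The counters are cumulative, so this describes the workload since the pod started. For MFU over a recent window, divide the change in the FLOPs counter between two scrapes by the elapsed seconds and by the peak FLOP/s figure.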

Runtime and miscellaneous

| Metric | Type | Description |
| --- | --- | --- |
| vllm:request_success_total | counter | Successfully completed requests |
| vllm:cache_config_info | gauge | Static cache configuration |
| vllm:engine_sleep_state | gauge | Whether the engine is awake or sleeping |
| vllm:mm_cache_queries_total, vllm:mm_cache_hits_total | counter | Multi-modal input cache, for multi-modal models only |
| http_requests_total, http_request_duration_seconds | counter, histogram | FastAPI HTTP layer in front of the engine |
| python_*, process_* | various | Python runtime and process metrics |

This list reflects the metrics exposed by the deployed model server. The metric surface can change between releases, so confirm the exact set against a live scrape of your deployment using the commands in Scrape the endpoint.

Collect the metrics

The deployment does not ship a monitoring stack, so the metrics listed earlier are exposed but unscraped until you set up collection:
  • No prometheus.io/scrape annotations are present, and no ServiceMonitor or PodMonitor CRDs are installed.
  • The deployment does not bundle Prometheus, Grafana, VictoriaMetrics, Thanos, or an OpenTelemetry collector.
To collect these metrics, point your own Prometheus-compatible scraper at the endpoint listed earlier, or add ServiceMonitor or PodMonitor resources if you run the Prometheus Operator. For one-off checks, use the port-forward commands in the next section.
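
For a plain Prometheus running inside the same cluster, a scrape job along the following lines is one possible starting point. It is a sketch under assumptions: the job name is a placeholder, Prometheus must have RBAC permission to list pods in poolside-models, and the port filter only works if the model server container declares port 8080 in its pod spec. The shell wrapper only prints the YAML so you can review it before merging it under scrape_configs in your own configuration.

```shell
# Hold the scrape job in a variable and print it for review.
scrape_job=$(cat <<'EOF'
- job_name: poolside-vllm            # hypothetical job name
  metrics_path: /metrics
  kubernetes_sd_configs:
    - role: pod
      namespaces:
        names: [poolside-models]
  relabel_configs:
    # Keep only the model server pods, matched by their Helm label.
    - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
      regex: inference
      action: keep
    # Keep only the metrics port (assumes the container declares 8080).
    - source_labels: [__meta_kubernetes_pod_container_port_number]
      regex: "8080"
      action: keep
EOF
)
printf '%s\n' "$scrape_job"
```

If you run the Prometheus Operator instead, the same namespace, label selector, port, and path map onto a PodMonitor resource.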

Scrape the endpoint

Run these commands from a host with kubectl access to the deployment’s cluster. The flow is: resolve the target pod by its Helm label, port-forward its metrics port, curl the local port, then stop the forward. Resolving by label keeps the commands working as pods are renamed on redeploy. In one terminal, resolve the vLLM model server pod and start the port-forward, then leave it running:
model_pod=$(kubectl -n poolside-models get pods -l app.kubernetes.io/component=inference -o jsonpath='{.items[0].metadata.name}')
kubectl -n poolside-models port-forward "pod/$model_pod" 18080:8080
In a second terminal, scrape the local port. For example, the command below reads the current load and cache values. On a server handling traffic, it returns something like the commented lines that follow it:
curl -s http://localhost:18080/metrics | grep -E '^vllm:(num_requests_(running|waiting)|kv_cache_usage_perc)'

# vllm:num_requests_running{engine="0",model_name="agent-small"} 8
# vllm:num_requests_waiting{engine="0",model_name="agent-small"} 2
# vllm:kv_cache_usage_perc{engine="0",model_name="agent-small"} 0.43
To list every metric name instead, pipe the same scrape through grep and sort: curl -s http://localhost:18080/metrics | grep '^# HELP' | sort. When you are done, stop the port-forward with Ctrl+C in the first terminal. The endpoint details are:

| Endpoint | Namespace | Label selector | Port | Path |
| --- | --- | --- | --- | --- |
| Model server (vLLM) | poolside-models | app.kubernetes.io/component=inference | 8080 | /metrics |

When more than one model is deployed, the app.kubernetes.io/component=inference selector matches every model’s pods and the jsonpath expression picks the first. To target a specific model, append its model label, for example app.kubernetes.io/component=inference,app.kubernetes.io/model=<model>.
Counters that have not been incremented yet report 0 rather than being absent, so an idle pod still exposes the full set of metrics described above. If a scrape returns nothing, check the port-forward terminal for connection errors.