## Choose a start or stop method
Use this guide to stop, start, reboot, or fully shut down a Poolside node in an on-premises RKE2 deployment.
| Goal | Use this method |
|---|---|
| Stop and start Poolside workloads without stopping RKE2 | Stop and start Poolside workloads without stopping RKE2 |
| Reboot or shut down the node for planned maintenance or hardware servicing | Reboot or shut down the node |
| Stop all Poolside and RKE2 processes without rebooting | Stop Poolside and RKE2 without rebooting |
| Preview script actions or use a custom timeout | Run the scripts directly |
## Timing expectations
Stopping and starting Poolside services can take several minutes.
| Action | Expected time | What happens |
|---|---|---|
| Stop Poolside services | 1 to 2 minutes | Poolside workloads stop. `rke2-server` remains running. |
| Start Poolside services | 3 to 10 minutes | Poolside workloads start namespace by namespace in a specific order. Inference workloads start after the GPU operator reports `nvidia.com/gpu` as allocatable. |
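To see whether that last condition is met, you can query the node's allocatable resources directly. This is a manual check that mirrors what the startup sequence waits for; the exact mechanism the script uses is not shown here.

```shell
# Check whether the GPU operator has made GPUs allocatable on this node.
# Inference workloads can only start once this prints a nonzero count.
# (The dot in nvidia.com/gpu must be escaped in the jsonpath expression.)
kubectl get node "$(hostname)" \
  -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```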
## Stop and start Poolside workloads without stopping RKE2
Use this method when you want to drain or restart Poolside workloads without stopping RKE2. For example, use this method for routine debugging or workload maintenance.
### Stop Poolside workloads

```shell
sudo systemctl stop poolside-services
```

This command does not stop `rke2-server`. It cordons the node, scales Deployments and StatefulSets to 0, and drains pods.
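Those three actions can be sketched as the following manual steps. This is illustrative only: the namespace is a placeholder, the real `poolside-shutdown.sh` may differ in order, flags, and error handling, and it also records replica counts (in the `shutdown-poolside/replicas` annotation) so they can be restored on start.

```shell
# Illustrative approximation of the stop sequence; not the actual script.
NODE="$(hostname)"
NAMESPACE="example-namespace"   # placeholder

# 1. Mark the node unschedulable so no new pods land on it.
kubectl cordon "$NODE"

# 2. Scale Deployments and StatefulSets to zero replicas.
kubectl scale deploy,sts --all --replicas=0 -n "$NAMESPACE"

# 3. Evict the remaining pods.
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
```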
### Start Poolside workloads

```shell
sudo systemctl start poolside-services
```

This command starts `rke2-server` if it is not already active, then scales Poolside workloads back up.
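Conceptually, the start path reverses the stop steps. Again a sketch with placeholder names, not the actual script, which restores each workload's replica count from the saved `shutdown-poolside/replicas` annotation:

```shell
# Illustrative approximation of the start sequence; not the actual script.
sudo systemctl start rke2-server   # only if not already active
NODE="$(hostname)"

# Allow scheduling on the node again.
kubectl uncordon "$NODE"

# Restore each workload to its saved replica count. The real script reads
# the count from the shutdown-poolside/replicas annotation; the workload
# name, namespace, and count here are placeholders.
kubectl scale deploy/example-workload --replicas=2 -n example-namespace
```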
The start and stop commands are idempotent:

- Running `start` on an already running cluster exits with `Cluster is already running and node is schedulable. Nothing to do.`
- Running `stop` on an already stopped cluster returns immediately.
To stop the full stack, see Stop Poolside and RKE2 without rebooting.
### Check the current status

```shell
sudo systemctl status poolside-services
```

### View live logs

```shell
sudo journalctl -t poolside-shutdown -f
sudo journalctl -t poolside-startup -f
```
## Reboot or shut down the node
Use this method for planned maintenance or hardware servicing, such as OS patching, kernel updates, or GPU replacement.
The cleanest way to fully stop a Poolside node is to reboot or shut it down. During system shutdown, Poolside drains workloads through the unit's `Before=shutdown.target` ordering. Then `rke2-server` stops, and RKE2 cleans up its container shims.
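That drain-on-shutdown behavior relies on ordinary systemd ordering. The following is a minimal sketch of the directives involved, not the actual `poolside-services` unit file; `TimeoutStopSec=600` comes from the troubleshooting notes in this guide, and the remaining directives are illustrative.

```ini
[Unit]
Description=Poolside workload lifecycle (sketch, not the real unit)
# Stop this unit (running its ExecStop drain) before the system reaches
# shutdown.target, and before rke2-server stops (stop order is the
# reverse of start order).
Before=shutdown.target
Conflicts=shutdown.target
After=rke2-server.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/bin/poolside-startup.sh
ExecStop=/usr/local/bin/poolside-shutdown.sh
TimeoutStopSec=600
```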
### Reboot the node
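The original command for this step is not shown; on a typical Linux host, a reboot that applies the shutdown ordering described above is triggered with:

```shell
sudo reboot
```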
### Shut down the node
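The original command for this step is not shown; on a typical Linux host, the node can be powered off with:

```shell
sudo shutdown -h now
```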
## Stop Poolside and RKE2 without rebooting
Use this method when you need to fully stop Poolside and RKE2, but rebooting is not an option.
Stopping `rke2-server` alone is not sufficient. RKE2 leaves DaemonSet pods and the static control-plane pods (`kube-apiserver`, `etcd`, `kube-scheduler`, `kube-controller-manager`) running under orphan `containerd-shim` processes parented to PID 1. The Kubernetes API stays partially reachable until `rke2-killall.sh` reaps them.
### Stop RKE2

```shell
sudo systemctl stop rke2-server
```

This command stops `rke2-server`. To fully stop the remaining RKE2-managed processes, run `rke2-killall.sh` next.
### Clean up remaining RKE2 processes

```shell
sudo /usr/local/bin/rke2-killall.sh
```

This command stops the DaemonSet pods and static control-plane pods that `rke2-server` leaves behind as orphan `containerd-shim` processes.
### Verify that everything stopped

```shell
sudo systemctl is-active rke2-server poolside-services
```

Expected result: both services report `inactive`.
Check for remaining `containerd-shim` processes:

```shell
ps -ef | grep containerd-shim | grep -v grep | wc -l
```

Expected result: `0`.
## Run the scripts directly

Use this method when you want to preview actions with `--dry-run` or set a custom timeout.
### Preview shutdown or startup actions

```shell
sudo /usr/local/bin/poolside-shutdown.sh --dry-run
sudo /usr/local/bin/poolside-startup.sh --dry-run
```
### Run with a custom timeout

Specify the timeout in seconds:

```shell
sudo /usr/local/bin/poolside-shutdown.sh --timeout 120
sudo /usr/local/bin/poolside-startup.sh --timeout 120
```
### Show script help

```shell
/usr/local/bin/poolside-shutdown.sh --help
/usr/local/bin/poolside-startup.sh --help
```
When you run `poolside-shutdown.sh` directly, the script stops `rke2-server` as its last step. However, it does not call `rke2-killall.sh`. The same orphan `containerd-shim` process caveat from Stop Poolside and RKE2 without rebooting applies.
## Troubleshooting
| Symptom | Likely cause | Action |
|---|---|---|
| `systemctl stop poolside-services` returns immediately with no log output | The unit is not currently active. | Check the unit status with `systemctl is-active poolside-services`. If the result is `inactive`, activate the unit. For more information, see Step 3 of the install guide. |
| `systemctl stop poolside-services` hangs | Pods did not stop before `TimeoutStopSec=600`. | Run `sudo /usr/local/bin/poolside-shutdown.sh --timeout 120` manually, then investigate the stuck pods with `kubectl describe pod`. |
| Startup finishes but pods remain `Pending` | GPUs are not yet allocatable. | Check `kubectl get nodes -o yaml \| grep nvidia.com/gpu` and the `nvidia-device-plugin` DaemonSet in the `gpu-operator` namespace. |
| Annotation is still present after startup | The script stopped before completing annotation cleanup. | Remove the annotation manually: `kubectl annotate deploy,sts --all -n <namespace> shutdown-poolside/replicas-`. |
| API is reachable, but workloads are at 0 replicas and the node is cordoned | `rke2-server` was started directly via `systemctl start rke2-server` instead of through `poolside-services`, so the startup script that uncordons the node and restores replicas from saved annotations did not run. | Run `sudo systemctl start poolside-services`. The script is idempotent: it skips the `rke2-server` start, then uncordons the node and restores replicas. |
| `containerd-shim` processes remain after `rke2-killall.sh` | Some containers were not reaped on the first pass. | Re-run `sudo /usr/local/bin/rke2-killall.sh`. If shims persist, list them with `ps -ef \| grep containerd-shim \| grep -v grep` to see which containers are still running, then run `sudo systemctl restart containerd`. |