Choose a start or stop method

Use this guide to stop, start, reboot, or fully shut down a Poolside node in an on-premises RKE2 deployment.
  • To stop and start Poolside workloads without stopping RKE2, see Stop and start Poolside workloads without stopping RKE2.
  • To reboot or shut down the node for planned maintenance or hardware servicing, see Reboot or shut down the node.
  • To stop all Poolside and RKE2 processes without rebooting, see Stop Poolside and RKE2 without rebooting.
  • To preview script actions or use a custom timeout, see Run the scripts directly.

Timing expectations

Stopping and starting Poolside services can take several minutes.
  • Stop Poolside services: 1 to 2 minutes. Poolside workloads stop; rke2-server remains running.
  • Start Poolside services: 3 to 10 minutes. Poolside workloads start namespace by namespace in a specific order. Inference workloads start after the GPU operator reports nvidia.com/gpu as allocatable.

Stop and start Poolside workloads without stopping RKE2

Use this method when you want to drain or restart Poolside workloads without stopping RKE2. For example, use this method for routine debugging or workload maintenance.

Stop Poolside workloads

sudo systemctl stop poolside-services
This command does not stop rke2-server. It cordons the node, scales deployments and StatefulSets to 0, and drains pods.
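To confirm the stop completed, check that the node is cordoned and that the workloads have drained. A minimal verification sketch, assuming kubectl access; it lists all namespaces because the Poolside namespace names are deployment-specific:

```shell
# The node should report SchedulingDisabled after the cordon.
kubectl get nodes

# Deployments and StatefulSets should show 0 desired replicas.
kubectl get deploy,sts --all-namespaces

# No Poolside pods should remain Running once the drain finishes.
kubectl get pods --all-namespaces --field-selector=status.phase=Running
```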

Start Poolside workloads

sudo systemctl start poolside-services
This command starts rke2-server if it is not already active, then scales Poolside workloads back up. The start and stop commands are idempotent:
  • Running start on an already running cluster exits with Cluster is already running and node is schedulable. Nothing to do.
  • Running stop on an already stopped cluster returns immediately.
To stop the full stack, see Stop Poolside and RKE2 without rebooting.

Check the current status

sudo systemctl status poolside-services

View live logs

sudo journalctl -t poolside-shutdown -f
sudo journalctl -t poolside-startup -f
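To review a run that has already finished rather than follow it live, the same journal tags can be queried with a time window. The one-hour window is illustrative; adjust --since as needed:

```shell
# Show the most recent shutdown and startup runs without following.
sudo journalctl -t poolside-shutdown --since "-1h" --no-pager
sudo journalctl -t poolside-startup --since "-1h" --no-pager
```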

Reboot or shut down the node

Use this method for planned maintenance or hardware servicing, such as OS patching, kernel updates, or GPU replacement. The cleanest way to fully stop a Poolside node is to reboot or shut it down. During system shutdown, Poolside drains workloads through the unit’s Before=shutdown.target ordering. Then rke2-server stops, and RKE2 cleans up its container shims.

Reboot the node

sudo reboot

Shut down the node

sudo shutdown -h now

Stop Poolside and RKE2 without rebooting

Use this method when you need to fully stop Poolside and RKE2, but rebooting is not an option.
Stopping rke2-server alone is not sufficient. RKE2 leaves DaemonSet pods and the static control-plane pods (kube-apiserver, etcd, kube-scheduler, kube-controller-manager) running under orphan containerd-shim processes parented to PID 1. The Kubernetes API stays partially reachable until rke2-killall.sh reaps them.
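You can see these leftover shims directly: after stopping rke2-server alone, any containerd-shim processes re-parented to PID 1 show up with a one-liner like this (a diagnostic sketch; the output depends on what was running):

```shell
# List containerd-shim processes whose parent is PID 1 (orphaned shims).
ps -eo pid,ppid,cmd | awk '$2 == 1 && /containerd-shim/ {print}'
```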

Stop RKE2

sudo systemctl stop rke2-server
This command stops rke2-server. To fully stop remaining RKE2-managed processes, run rke2-killall.sh next.

Clean up remaining RKE2 processes

sudo /usr/local/bin/rke2-killall.sh
This command stops DaemonSet pods and the static control-plane pods that rke2-server leaves behind as orphan containerd-shim processes.

Verify that everything stopped

sudo systemctl is-active rke2-server poolside-services
Expected result: both services are inactive. Check for remaining containerd-shim processes:
ps -ef | grep containerd-shim | grep -v grep | wc -l
Expected result: 0.

Run the scripts directly

Use this method when you want to preview actions with --dry-run or set a custom timeout.

Preview shutdown or startup actions

sudo /usr/local/bin/poolside-shutdown.sh --dry-run
sudo /usr/local/bin/poolside-startup.sh --dry-run

Run with a custom timeout

Specify the timeout in seconds:
sudo /usr/local/bin/poolside-shutdown.sh --timeout 120
sudo /usr/local/bin/poolside-startup.sh --timeout 120

Show script help

/usr/local/bin/poolside-shutdown.sh --help
/usr/local/bin/poolside-startup.sh --help
When you run poolside-shutdown.sh directly, the script stops rke2-server as its last step. However, it does not call rke2-killall.sh. The same orphan containerd-shim process caveat from Stop Poolside and RKE2 without rebooting applies.

Troubleshooting

Symptom: systemctl stop poolside-services returns immediately with no log output
Likely cause: The unit is not currently active.
Action: Check the unit status with systemctl is-active poolside-services. If the result is inactive, activate the unit. For more information, see Step 3 of the install guide.

Symptom: systemctl stop poolside-services hangs
Likely cause: Pods did not stop before TimeoutStopSec=600.
Action: Run sudo /usr/local/bin/poolside-shutdown.sh --timeout 120 manually, then investigate the stuck pods with kubectl describe pod.

Symptom: Startup finishes but pods remain Pending
Likely cause: GPUs are not yet allocatable.
Action: Check kubectl get nodes -o yaml | grep nvidia.com/gpu and the nvidia-device-plugin DaemonSet in the gpu-operator namespace.

Symptom: Annotation is still present after startup
Likely cause: The script stopped before completing annotation cleanup.
Action: Remove the annotation manually: kubectl annotate deploy,sts --all -n <namespace> shutdown-poolside/replicas-.

Symptom: API is reachable, but workloads are at 0 replicas and the node is cordoned
Likely cause: rke2-server was started directly with systemctl start rke2-server instead of through poolside-services, so the startup script that uncordons the node and restores replicas from saved annotations did not run.
Action: Run sudo systemctl start poolside-services. The script is idempotent: it skips the rke2-server start, then uncordons the node and restores replicas.

Symptom: containerd-shim processes remain after rke2-killall.sh
Likely cause: Some containers were not reaped on the first pass.
Action: Re-run sudo /usr/local/bin/rke2-killall.sh. If shims persist, list them with ps -ef | grep containerd-shim | grep -v grep to see which containers are still running, then run sudo systemctl restart containerd.
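For the leftover-annotation case, you can sweep the shutdown-poolside/replicas annotation across every namespace instead of removing it one namespace at a time. A sketch assuming kubectl access; the trailing dash on the key removes the annotation:

```shell
# Remove the shutdown-poolside/replicas annotation from all Deployments
# and StatefulSets in every namespace, ignoring namespaces with none.
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  kubectl annotate deploy,sts --all -n "$ns" \
    shutdown-poolside/replicas- >/dev/null 2>&1 || true
done
```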