## Choose a start or stop method
Use this guide to stop, start, reboot, or fully shut down a Poolside node in an on-premises RKE2 deployment.
| Goal | Use this method |
|---|---|
| Stop and start Poolside workloads without stopping RKE2 | Stop and start Poolside workloads without stopping RKE2 |
| Reboot or shut down the node for planned maintenance or hardware servicing | Reboot or shut down the node |
| Stop all Poolside and RKE2 processes without rebooting | Stop Poolside and RKE2 without rebooting |
| Preview script actions or use a custom timeout | Run the scripts directly |
## Timing expectations
Stopping and starting Poolside services can take several minutes.
| Action | Expected time | What happens |
|---|---|---|
| Stop Poolside services | 1 to 2 minutes | Poolside workloads stop. `rke2-server` remains running. |
| Start Poolside services | 3 to 10 minutes | Poolside workloads start namespace by namespace in a specific order. Inference workloads start after the GPU operator reports `nvidia.com/gpu` as allocatable. |
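To see whether that last condition is met, you can query the node's allocatable resources directly. This is a manual check that mirrors what the startup sequence waits for; the exact mechanism the script uses is not shown here.

```shell
# Check whether the GPU operator has made GPUs allocatable on this node.
# Inference workloads can only start once this prints a nonzero count.
# (The dot in nvidia.com/gpu must be escaped in the jsonpath expression.)
kubectl get node "$(hostname)" \
  -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```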
## Stop and start Poolside workloads without stopping RKE2
Use this method when you want to drain or restart Poolside workloads without stopping RKE2. For example, use this method for routine debugging or workload maintenance.
### Stop Poolside workloads

```shell
sudo systemctl stop poolside-services
```

This command does not stop `rke2-server`. It cordons the node, scales Deployments and StatefulSets to 0, and drains pods.
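Those three actions can be sketched as the following manual steps. This is illustrative only: the namespace is a placeholder, the real `poolside-shutdown.sh` may differ in order, flags, and error handling, and it also records replica counts (in the `shutdown-poolside/replicas` annotation) so they can be restored on start.

```shell
# Illustrative approximation of the stop sequence; not the actual script.
NODE="$(hostname)"
NAMESPACE="example-namespace"   # placeholder

# 1. Mark the node unschedulable so no new pods land on it.
kubectl cordon "$NODE"

# 2. Scale Deployments and StatefulSets to zero replicas.
kubectl scale deploy,sts --all --replicas=0 -n "$NAMESPACE"

# 3. Evict the remaining pods.
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
```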
### Start Poolside workloads

```shell
sudo systemctl start poolside-services
```

This command starts `rke2-server` if it is not already active, then scales Poolside workloads back up.
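Conceptually, the start path reverses the stop steps. Again a sketch with placeholder names, not the actual script, which restores each workload's replica count from the saved `shutdown-poolside/replicas` annotation:

```shell
# Illustrative approximation of the start sequence; not the actual script.
sudo systemctl start rke2-server   # only if not already active
NODE="$(hostname)"

# Allow scheduling on the node again.
kubectl uncordon "$NODE"

# Restore each workload to its saved replica count. The real script reads
# the count from the shutdown-poolside/replicas annotation; the workload
# name, namespace, and count here are placeholders.
kubectl scale deploy/example-workload --replicas=2 -n example-namespace
```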
The start and stop commands are idempotent:

- Running `start` on an already running cluster exits with `Cluster is already running and node is schedulable. Nothing to do.`
- Running `stop` on an already stopped cluster returns immediately.
To stop the full stack, see Stop Poolside and RKE2 without rebooting.
### Check the current status

```shell
sudo systemctl status poolside-services
```

### View live logs

```shell
sudo journalctl -t poolside-shutdown -f
sudo journalctl -t poolside-startup -f
```
## Reboot or shut down the node
Use this method for planned maintenance or hardware servicing, such as OS patching, kernel updates, or GPU replacement.
The cleanest way to fully stop a Poolside node is to reboot or shut it down. During system shutdown, Poolside drains workloads through the unit's `Before=shutdown.target` ordering. Then `rke2-server` stops, and RKE2 cleans up its container shims.
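That drain-on-shutdown behavior relies on ordinary systemd ordering. The following is a minimal sketch of the directives involved, not the actual `poolside-services` unit file; `TimeoutStopSec=600` comes from the troubleshooting notes in this guide, and the remaining directives are illustrative.

```ini
[Unit]
Description=Poolside workload lifecycle (sketch, not the real unit)
# Stop this unit (running its ExecStop drain) before the system reaches
# shutdown.target, and before rke2-server stops (stop order is the
# reverse of start order).
Before=shutdown.target
Conflicts=shutdown.target
After=rke2-server.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/bin/poolside-startup.sh
ExecStop=/usr/local/bin/poolside-shutdown.sh
TimeoutStopSec=600
```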
### Reboot the node
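The original command for this step is not shown; on a typical Linux host, a reboot that applies the shutdown ordering described above is triggered with:

```shell
sudo reboot
```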
### Shut down the node
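The original command for this step is not shown; on a typical Linux host, the node can be powered off with:

```shell
sudo shutdown -h now
```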
## Stop Poolside and RKE2 without rebooting
Use this method when you need to fully stop Poolside and RKE2, but rebooting is not an option.
Stopping `rke2-server` alone is not sufficient. RKE2 leaves DaemonSet pods and the static control-plane pods (`kube-apiserver`, `etcd`, `kube-scheduler`, `kube-controller-manager`) running under orphan `containerd-shim` processes parented to PID 1. The Kubernetes API stays partially reachable until `rke2-killall.sh` reaps them.
### Stop RKE2

```shell
sudo systemctl stop rke2-server
```

This command stops `rke2-server`. To fully stop the remaining RKE2-managed processes, run `rke2-killall.sh` next.
### Clean up remaining RKE2 processes

```shell
sudo /usr/local/bin/rke2-killall.sh
```

This command stops the DaemonSet pods and static control-plane pods that `rke2-server` leaves behind as orphan `containerd-shim` processes.
### Verify that everything stopped

```shell
sudo systemctl is-active rke2-server poolside-services
```

Expected result: both services report `inactive`.
Check for remaining `containerd-shim` processes:

```shell
ps -ef | grep containerd-shim | grep -v grep | wc -l
```

Expected result: `0`.
## Run the scripts directly

Use this method when you want to preview actions with `--dry-run` or set a custom timeout.
### Preview shutdown or startup actions

```shell
sudo /usr/local/bin/poolside-shutdown.sh --dry-run
sudo /usr/local/bin/poolside-startup.sh --dry-run
```
### Run with a custom timeout

Specify the timeout in seconds:

```shell
sudo /usr/local/bin/poolside-shutdown.sh --timeout 120
sudo /usr/local/bin/poolside-startup.sh --timeout 120
```
### Show script help

```shell
/usr/local/bin/poolside-shutdown.sh --help
/usr/local/bin/poolside-startup.sh --help
```
When you run `poolside-shutdown.sh` directly, the script stops `rke2-server` as its last step. However, it does not call `rke2-killall.sh`. The same orphan `containerd-shim` process caveat from Stop Poolside and RKE2 without rebooting applies.
## Troubleshooting
| Symptom | Likely cause | Action |
|---|---|---|
| `systemctl stop poolside-services` returns immediately with no log output | The unit is not currently active. | Check the unit status with `systemctl is-active poolside-services`. If the result is `inactive`, activate the unit. For more information, see Step 3 of the install guide. |
| `systemctl stop poolside-services` hangs | Pods did not stop before `TimeoutStopSec=600`. | Run `sudo /usr/local/bin/poolside-shutdown.sh --timeout 120` manually, then investigate the stuck pods with `kubectl describe pod`. |
| Startup finishes but pods remain `Pending` | GPUs are not yet allocatable. | Check `kubectl get nodes -o yaml \| grep nvidia.com/gpu` and the `nvidia-device-plugin` DaemonSet in the `gpu-operator` namespace. |
| Annotation is still present after startup | The script stopped before completing annotation cleanup. | Remove the annotation manually: `kubectl annotate deploy,sts --all -n <namespace> shutdown-poolside/replicas-`. |
| API is reachable, but workloads are at 0 replicas and the node is cordoned | `rke2-server` was started directly via `systemctl start rke2-server` instead of through `poolside-services`, so the startup script that uncordons the node and restores replicas from saved annotations did not run. | Run `sudo systemctl start poolside-services`. The script is idempotent: it skips the `rke2-server` start, then uncordons the node and restores replicas. |
| `containerd-shim` processes remain after `rke2-killall.sh` | Some containers were not reaped on the first pass. | Re-run `sudo /usr/local/bin/rke2-killall.sh`. If shims persist, list them with `ps -ef \| grep containerd-shim \| grep -v grep` to see which containers are still running, then run `sudo systemctl restart containerd`. |