# Server and service maintenance for an on-premises Poolside node

> Methods to stop and start Poolside workloads, reboot or shut down the node, or fully stop Poolside and RKE2 in an on-premises RKE2 deployment.

## Choose a start or stop method

Use this guide to stop, start, reboot, or fully shut down a Poolside node in an on-premises RKE2 deployment.

| Goal                                                                       | Use this method                                                                                                     |
| -------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------- |
| Stop and start Poolside workloads without stopping RKE2                    | [Stop and start Poolside workloads without stopping RKE2](#stop-and-start-poolside-workloads-without-stopping-rke2) |
| Reboot or shut down the node for planned maintenance or hardware servicing | [Reboot or shut down the node](#reboot-or-shut-down-the-node)                                                       |
| Stop all Poolside and RKE2 processes without rebooting                     | [Stop Poolside and RKE2 without rebooting](#stop-poolside-and-rke2-without-rebooting)                               |
| Preview script actions or use a custom timeout                             | [Run the scripts directly](#run-the-scripts-directly)                                                               |

## Timing expectations

Stopping and starting Poolside services can take several minutes.

| Action                  | Expected time   | What happens                                                                                                                                                   |
| ----------------------- | --------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Stop Poolside services  | 1 to 2 minutes  | Poolside workloads stop. `rke2-server` remains running.                                                                                                        |
| Start Poolside services | 3 to 10 minutes | Poolside workloads start namespace by namespace in a specific order. Inference workloads start after the GPU operator reports `nvidia.com/gpu` as allocatable. |
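To see whether the GPU gate in the last row has been met, you can read each node's allocatable `nvidia.com/gpu` count. The following is a minimal sketch. It assumes `kubectl` is configured for the cluster (on RKE2 the admin kubeconfig is typically `/etc/rancher/rke2/rke2.yaml`). `gpu_allocatable_by_node` and `all_nodes_have_gpus` are hypothetical helper names and are not part of the Poolside scripts.

```bash
#!/usr/bin/env bash
# Sketch: check the GPU gate that inference workloads wait on at startup.
# Assumes kubectl can reach the cluster, for example with
# KUBECONFIG=/etc/rancher/rke2/rke2.yaml. Function names are hypothetical.

# Prints "<node><TAB><allocatable nvidia.com/gpu>" for each node. The count
# is empty until the GPU operator reports the resource.
gpu_allocatable_by_node() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
}

# Reads that output on stdin and fails if any node has no allocatable GPUs.
all_nodes_have_gpus() {
  awk 'NF < 2 || $2 + 0 <= 0 { print "no allocatable GPUs: " $1; missing = 1 }
       END { exit (missing ? 1 : 0) }'
}

# Example:
#   gpu_allocatable_by_node | all_nodes_have_gpus && echo "GPUs are allocatable"
```

An empty count means the GPU operator has not yet advertised the resource, so inference workloads are still expected to wait.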

## Stop and start Poolside workloads without stopping RKE2

Use this method to drain or restart Poolside workloads while RKE2 keeps running, for example during routine debugging or workload maintenance.

### Stop Poolside workloads

```bash theme={null}
sudo systemctl stop poolside-services
```

This command cordons the node, scales Deployments and StatefulSets to `0` (saving their original replica counts in the `shutdown-poolside/replicas` annotation), and drains pods. It does not stop `rke2-server`.
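To confirm that the stop took effect, you can check that the node is cordoned and that the workloads in a Poolside namespace are at `0` replicas. This sketch assumes `kubectl` access to the cluster, which still works because `rke2-server` keeps running. The helper names are hypothetical, and `<node-name>` and `<namespace>` are placeholders for your own values.

```bash
#!/usr/bin/env bash
# Sketch: confirm the effects of `systemctl stop poolside-services`.
# Assumes kubectl can reach the cluster (rke2-server keeps running).
# Function names are hypothetical; pass your own node and namespace names.

# Succeeds if the node is cordoned (spec.unschedulable is "true").
node_is_cordoned() {
  [ "$(kubectl get node "$1" -o jsonpath='{.spec.unschedulable}')" = "true" ]
}

# Prints each Deployment or StatefulSet in the namespace whose desired
# replica count is not 0 (an unset count shows as <none>) and fails if any exist.
workloads_scaled_down() {
  kubectl get deploy,sts -n "$1" --no-headers \
    -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.replicas |
    awk '$2 != "0" { print "not scaled down: " $1 " (replicas=" $2 ")"; bad = 1 }
         END { exit (bad ? 1 : 0) }'
}

# Example:
#   node_is_cordoned "<node-name>" && workloads_scaled_down "<namespace>" &&
#     echo "Poolside workloads in <namespace> are stopped."
```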

### Start Poolside workloads

```bash theme={null}
sudo systemctl start poolside-services
```

This command starts `rke2-server` if it is not already active, uncordons the node, and scales Poolside workloads back up to the replica counts saved in the `shutdown-poolside/replicas` annotation.
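Because startup can take 3 to 10 minutes, a script may need to wait for the node to become schedulable again. The following is a minimal polling sketch that assumes `kubectl` access. The function name, the 10-second interval, and the 600-second default are illustrative. A schedulable node shows that startup has progressed, but it does not by itself mean that inference workloads are ready.

```bash
#!/usr/bin/env bash
# Sketch: wait for the node to become schedulable after
# `systemctl start poolside-services`. Assumes kubectl access; the function
# name, 10-second poll interval, and 600-second default are illustrative.

wait_for_schedulable() {
  local node="$1" timeout="${2:-600}" waited=0 state
  while [ "$waited" -lt "$timeout" ]; do
    # An unreachable API (rke2-server still starting) counts as "not yet".
    state=$(kubectl get node "$node" -o jsonpath='{.spec.unschedulable}' 2>/dev/null) ||
      state="unreachable"
    # After an uncordon, spec.unschedulable is empty or "false".
    if [ -z "$state" ] || [ "$state" = "false" ]; then
      echo "$node is schedulable"
      return 0
    fi
    sleep 10
    waited=$((waited + 10))
  done
  echo "$node is still not schedulable after ${timeout}s" >&2
  return 1
}

# Example: wait_for_schedulable "<node-name>" 600
```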

The start and stop commands are idempotent:

* Running `start` on an already running cluster exits with `Cluster is already running and node is schedulable. Nothing to do.`
* Running `stop` on an already stopped cluster returns immediately.

To stop the full stack, see [Stop Poolside and RKE2 without rebooting](#stop-poolside-and-rke2-without-rebooting).

### Check the current status

```bash theme={null}
sudo systemctl status poolside-services
```

### View live logs

```bash theme={null}
sudo journalctl -t poolside-shutdown -f
sudo journalctl -t poolside-startup -f
```

## Reboot or shut down the node

Use this method for planned maintenance or hardware servicing, such as OS patching, kernel updates, or GPU replacement.

The cleanest way to fully stop a Poolside node is to reboot or shut it down. During system shutdown, systemd stops the `poolside-services` unit, which drains Poolside workloads. The unit's `Before=shutdown.target` ordering ensures that this happens before shutdown completes. `rke2-server` then stops, and RKE2 cleans up its container shims.
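The drain at shutdown only runs if `poolside-services` is active, because systemd does not run a stop step for a unit that is already inactive. Before you run `sudo reboot` or `sudo shutdown -h now`, you can check the unit with the following sketch. It reads standard `systemctl show` properties. The function name is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: before rebooting, confirm that systemd will drain Poolside on
# shutdown. Checks ActiveState and the Before=shutdown.target ordering
# described on this page. The function name is hypothetical.

ready_for_reboot() {
  local state before timeout
  state=$(systemctl show -p ActiveState --value poolside-services)
  before=$(systemctl show -p Before --value poolside-services)
  timeout=$(systemctl show -p TimeoutStopUSec --value poolside-services)

  if [ "$state" != "active" ]; then
    echo "poolside-services is $state; its stop step will not run at shutdown" >&2
    return 1
  fi
  # Before= is a space-separated list of unit names.
  case " $before " in
    *" shutdown.target "*) ;;
    *) echo "poolside-services is not ordered Before=shutdown.target" >&2; return 1 ;;
  esac
  echo "poolside-services is active, ordered before shutdown.target (stop timeout: $timeout)"
}

# Example: ready_for_reboot && sudo reboot
```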

### Reboot the node

```bash theme={null}
sudo reboot
```

### Shut down the node

```bash theme={null}
sudo shutdown -h now
```
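After the node comes back up, you can confirm that the drain ran during shutdown by reading the previous boot's journal. This sketch assumes a persistent journal (for example, a `/var/log/journal` directory), because a volatile journal does not keep logs from earlier boots. It also assumes that `poolside-services` starts at boot, so that the current boot contains startup logs. The function name is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: show Poolside shutdown and startup log lines for one boot.
# Assumes a persistent journal. The function name is hypothetical.

# OFFSET follows `journalctl -b`: 0 is the current boot, -1 the previous one.
poolside_boot_logs() {
  local offset="${1:-0}"
  journalctl -b "$offset" -t poolside-shutdown -t poolside-startup --no-pager -o short-iso
}

# Example after a reboot:
#   sudo bash -c "$(declare -f poolside_boot_logs); poolside_boot_logs -1"  # drain during shutdown
#   sudo bash -c "$(declare -f poolside_boot_logs); poolside_boot_logs 0"   # startup after boot
```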

## Stop Poolside and RKE2 without rebooting

Use this method when you need to fully stop Poolside and RKE2, but rebooting is not an option.

<Warning>
  Stopping `rke2-server` alone is not sufficient. RKE2 leaves DaemonSet pods and the static control-plane pods (`kube-apiserver`, `etcd`, `kube-scheduler`, `kube-controller-manager`) running under orphan `containerd-shim` processes that are reparented to PID 1. The Kubernetes API stays partially reachable until `rke2-killall.sh` reaps them.
</Warning>

### Stop RKE2

```bash theme={null}
sudo systemctl stop rke2-server
```

This command stops only the `rke2-server` service. Run `rke2-killall.sh` next to stop the RKE2-managed processes that remain.

### Clean up remaining RKE2 processes

```bash theme={null}
sudo /usr/local/bin/rke2-killall.sh
```

This command stops DaemonSet pods and the static control-plane pods that `rke2-server` leaves behind as orphan `containerd-shim` processes.

### Verify that everything stopped

```bash theme={null}
sudo systemctl is-active rke2-server poolside-services
```

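Expected result: both services are `inactive`. The command prints one state per line and exits with a non-zero status, which is expected when neither unit is active.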

Check for remaining `containerd-shim` processes:

```bash theme={null}
ps -ef | grep containerd-shim | grep -v grep | wc -l
```

Expected result: `0`.
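For scripted maintenance, you can combine the two checks into a single pass/fail function. The following is a sketch with a hypothetical function name. It treats any state other than `inactive` as not stopped, which matches the expected result for `systemctl is-active`.

```bash
#!/usr/bin/env bash
# Sketch: one pass/fail check to run after `rke2-killall.sh`.
# The function name is hypothetical.

verify_full_stop() {
  local states shims
  # is-active exits non-zero when the units are inactive, which is the
  # expected case here, so keep its output and ignore the exit code.
  states=$(systemctl is-active rke2-server poolside-services || true)
  # Same count as the ps pipeline on this page; `|| true` keeps grep's
  # "no match" status from aborting scripts that use `set -e -o pipefail`.
  shims=$(ps -ef | grep containerd-shim | grep -v grep | wc -l || true)

  if printf '%s\n' "$states" | grep -qvx inactive; then
    echo "units not fully stopped: ${states//$'\n'/, }" >&2
    return 1
  fi
  if [ "$shims" -ne 0 ]; then
    echo "$shims containerd-shim process(es) still running" >&2
    return 1
  fi
  echo "rke2-server and poolside-services are inactive; no containerd-shim processes remain"
}

# Example: verify_full_stop || echo "Re-run rke2-killall.sh and check again."
```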

## Run the scripts directly

Use this method when you want to preview actions with `--dry-run` or set a custom timeout.

### Preview shutdown or startup actions

```bash theme={null}
sudo /usr/local/bin/poolside-shutdown.sh --dry-run
sudo /usr/local/bin/poolside-startup.sh --dry-run
```

### Run with a custom timeout

Specify the timeout in seconds:

```bash theme={null}
sudo /usr/local/bin/poolside-shutdown.sh --timeout 120
sudo /usr/local/bin/poolside-startup.sh --timeout 120
```

### Show script help

```bash theme={null}
/usr/local/bin/poolside-shutdown.sh --help
/usr/local/bin/poolside-startup.sh --help
```

When you run `poolside-shutdown.sh` directly, the script stops `rke2-server` as its last step but does not call `rke2-killall.sh`. The same orphan `containerd-shim` process caveat from [Stop Poolside and RKE2 without rebooting](#stop-poolside-and-rke2-without-rebooting) applies, so run `sudo /usr/local/bin/rke2-killall.sh` afterward if you need a full stop.
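If you use the scripts for a full stop, for example to pass a custom timeout, follow `poolside-shutdown.sh` with `rke2-killall.sh`. The following is a minimal sketch that assumes the script locations shown on this page and a user with `sudo` rights. The function name and the 120-second default are illustrative.

```bash
#!/usr/bin/env bash
# Sketch: full stop through the scripts. poolside-shutdown.sh stops
# rke2-server as its last step; rke2-killall.sh then reaps orphaned shims.
# The function name and the 120-second default are illustrative.

stop_with_scripts() {
  local timeout="${1:-120}"
  # Stop at the first failing step and return its exit status.
  sudo /usr/local/bin/poolside-shutdown.sh --timeout "$timeout" || return
  sudo /usr/local/bin/rke2-killall.sh || return

  if [ "$(systemctl is-active rke2-server)" != "inactive" ]; then
    echo "rke2-server did not stop" >&2
    return 1
  fi
  echo "poolside-shutdown.sh and rke2-killall.sh completed; rke2-server is inactive"
}

# Example: stop_with_scripts 300
```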

## Troubleshooting

| Symptom                                                                      | Likely cause                                                                                                                                                                                                  | Action                                                                                                                                                                                                                                          |
| ---------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `systemctl stop poolside-services` returns immediately with no log output    | The unit is not currently active.                                                                                                                                                                             | Check the unit status with `systemctl is-active poolside-services`. If the result is `inactive`, activate the unit. For more information, see [Step 3 of the install guide](/deployment/on-prem/install#step-3-enable-graceful-start-and-stop). |
| `systemctl stop poolside-services` hangs | Pods are not terminating, so the unit keeps waiting until `TimeoutStopSec=600` (10 minutes) expires. | Run `sudo /usr/local/bin/poolside-shutdown.sh --timeout 120` manually, then investigate the stuck pods with `kubectl describe pod`. |
| Startup finishes but pods remain `Pending`                                   | GPUs are not yet allocatable.                                                                                                                                                                                 | Check `kubectl get nodes -o yaml \| grep nvidia.com/gpu` and the `nvidia-device-plugin` DaemonSet in the `gpu-operator` namespace.                                                                                                              |
| The `shutdown-poolside/replicas` annotation is still present after startup | The startup script stopped before it finished removing the annotation. | Remove the annotation manually: `kubectl annotate deploy,sts --all -n <namespace> shutdown-poolside/replicas-`. |
| API is reachable, but workloads are at `0` replicas and the node is cordoned | `rke2-server` started directly via `systemctl start rke2-server` instead of through `poolside-services`. The startup script that uncordons the node and restores replicas from saved annotations did not run. | Run `sudo systemctl start poolside-services`. The script is idempotent. It skips the `rke2-server` start, then uncordons the node and restores replicas.                                                                                        |
| `containerd-shim` processes remain after `rke2-killall.sh`                   | Some containers were not reaped on the first pass.                                                                                                                                                            | Re-run `sudo /usr/local/bin/rke2-killall.sh`. If shims persist, list them with `ps -ef \| grep containerd-shim \| grep -v grep` to see which containers are still running, then run `sudo systemctl restart containerd`.                        |
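For the hang and `Pending` rows, it helps to list the affected pods before you run `kubectl describe pod`. A pod that is stuck terminating usually still reports its previous phase but has `metadata.deletionTimestamp` set, so this sketch checks both fields. It assumes `kubectl` access. The function name is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: list pods that are Pending or stuck terminating.
# Assumes kubectl access. The function name is hypothetical.

# Prints "<namespace> <pod> <reason>" for each stuck pod. custom-columns
# shows <none> when deletionTimestamp is unset.
list_stuck_pods() {
  kubectl get pods -A --no-headers \
    -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase,DELETING:.metadata.deletionTimestamp |
    awk '$3 == "Pending" { print $1, $2, "Pending"; next }
         $4 != "<none>"  { print $1, $2, "Terminating since " $4 }'
}

# Example: describe each stuck pod.
#   list_stuck_pods | while read -r ns name _; do
#     kubectl describe pod -n "$ns" "$name"
#   done
```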
