
Introduction

This guide covers the physical network relocation of an on-premise Poolside deployment. It is not a version upgrade or migration (for example, V1 → V2 or r20250403 → r20250527). RKE2 cluster nodes are expected to keep static IP addresses, so when moving the system from one network to another you must stop all services, reset the cluster, and bring everything back up in the correct order. Ref: https://github.com/rancher/rke2/discussions/4107

Shutdown

Scale down Poolside models

# Scale all poolside-models deployments to 0. This will take a minute to complete. 
kubectl scale deployment inference-<MALIBU> -n poolside-models --replicas=0
kubectl scale deployment inference-<POINT> -n poolside-models --replicas=0
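You can confirm the scale-down has finished before moving on. A sketch, run against the live cluster (the 300s timeout is an arbitrary choice):

```shell
# Block until all inference pods have terminated, then confirm the namespace is empty.
kubectl wait pods --all -n poolside-models --for=delete --timeout=300s
kubectl get pods -n poolside-models   # expected: "No resources found"
```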

Scale down Poolside deployments

If your deployment includes additional services, list them first and scale as needed:
kubectl get deployments,statefulsets -n poolside-services
# Scale poolside deployments to 0
kubectl scale deployment core-api -n poolside --replicas=0
kubectl scale deployment web-assistant -n poolside --replicas=0


# Scale poolside-services to 0 
kubectl scale deployment keycloak -n poolside-services --replicas=0
kubectl scale statefulset seaweedfs-admin -n poolside-services --replicas=0
kubectl scale statefulset seaweedfs-filer -n poolside-services --replicas=0
kubectl scale statefulset seaweedfs-master -n poolside-services --replicas=0
kubectl scale statefulset seaweedfs-volume -n poolside-services --replicas=0
# If the SeaweedFS S3 endpoint is enabled, scale its deployment
kubectl scale deployment seaweedfs-s3 -n poolside-services --replicas=0
kubectl scale statefulset postgres -n poolside-services --replicas=0
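Before stopping RKE2, it is worth verifying that nothing is still running in the application namespaces. A quick sketch:

```shell
# Each namespace should show no pods (or only Terminating pods) before RKE2 is stopped.
for ns in poolside poolside-services poolside-models; do
  echo "== ${ns} =="
  kubectl get pods -n "${ns}"
done
```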

Shut down and disable RKE2

sudo systemctl stop rke2-server
sudo systemctl disable rke2-server

Shut down the system

sudo shutdown now

Startup

Update /etc/hosts

This guide assumes your primary IP address has changed and the node IP must be updated.
Update /etc/hosts, replacing the old IP in the <OLD IP> <node-hostname> entry with the new IP address.
This ensures the RKE2 master node resolves the hostname to the new IP.
The default hostname in current on-premise installations is poolside-server. The hostname is also used as the Kubernetes node name, and local PVs are tied to that node name. Avoid changing the hostname during relocation unless you also update local PV node affinity.
# Example of old entry - ${IP} ${node-hostname} ${ingress}
# 192.168.1.30 poolside-server poolside.poolside.local

# After updating with new IP
192.168.1.40 poolside-server poolside.poolside.local
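The substitution can be scripted. A sketch that edits a copy first so you can review before writing /etc/hosts; the OLD_IP and NEW_IP values are illustrative, substitute your actual addresses:

```shell
OLD_IP="192.168.1.30"
NEW_IP="192.168.1.40"
# Work on a copy; review it, then move it into place with sudo.
cp /etc/hosts /tmp/hosts.new
sed -i "s/^${OLD_IP}\b/${NEW_IP}/" /tmp/hosts.new
grep -n "${NEW_IP}" /tmp/hosts.new || echo "entry not found - check OLD_IP"
# sudo cp /tmp/hosts.new /etc/hosts   # apply once verified
```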

Reset master node IP

Refer to https://github.com/rancher/rke2/discussions/4107 for additional details.
# The installer disables UFW for RKE2, but if it is still enabled, disable it (Ubuntu only)
sudo systemctl disable --now ufw

# Disable app armor (Ubuntu only)
sudo systemctl disable --now apparmor.service

# On RHEL, check firewalld instead
sudo systemctl disable --now firewalld

# Stop all RKE2 services - the script is installed at /usr/local/bin/rke2-killall.sh
sudo rke2-killall.sh

# Reset the cluster configuration to update Master IP
sudo rke2 server --cluster-reset

# Start the service
sudo systemctl start rke2-server 

# Confirm the service is in a running state
sudo systemctl status rke2-server

# Confirm all Nodes Ready
kubectl get nodes
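The node may report NotReady for a short time after the reset. A sketch that blocks until it is Ready instead of polling by hand (the 300s timeout is an arbitrary choice):

```shell
# Block until all nodes report the Ready condition.
kubectl wait node --all --for=condition=Ready --timeout=300s
kubectl get nodes -o wide   # confirm the new INTERNAL-IP is shown
```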

Update CoreDNS for Keycloak

After the external interface / ingress IP has changed, rerun the Terraform step that manages supporting services (step 3, 03-infra-services).
Terraform updates the CoreDNS entries for Keycloak as part of that phase. The CoreDNS configmap change is applied automatically and does not require a manual restart.

Scale up Poolside deployments

# Scale poolside-services to 1
kubectl scale deployment keycloak -n poolside-services --replicas=1
kubectl scale statefulset seaweedfs-admin -n poolside-services --replicas=1
kubectl scale statefulset seaweedfs-filer -n poolside-services --replicas=1
kubectl scale statefulset seaweedfs-master -n poolside-services --replicas=1
kubectl scale statefulset seaweedfs-volume -n poolside-services --replicas=1
# If the SeaweedFS S3 endpoint is enabled, scale its deployment
kubectl scale deployment seaweedfs-s3 -n poolside-services --replicas=1
kubectl scale statefulset postgres -n poolside-services --replicas=1

# Scale poolside deployments to 3
kubectl scale deployment core-api -n poolside --replicas=3
kubectl scale deployment web-assistant -n poolside --replicas=3

# Scale all poolside-models deployments to 1.
# Scale each deployment separately, and wait for it to become healthy, before scaling in the next.
kubectl scale deployment inference-<MALIBU> -n poolside-models --replicas=1
kubectl scale deployment inference-<POINT> -n poolside-models --replicas=1
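Rather than polling pod status by hand, kubectl rollout status blocks until a deployment is healthy; run it after each scale command, before scaling the next model. A sketch using the placeholder name from above (the 15m timeout is an arbitrary choice for large model startup):

```shell
# Blocks until the deployment reports all replicas available, or times out.
kubectl rollout status deployment inference-<MALIBU> -n poolside-models --timeout=15m
```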

Validation

Validate that the model deployments are running and Ready (1/1). Once the models are running, log in to https://poolside.poolside.local and test chat completion, then point your Poolside extension API endpoint at https://poolside.poolside.local to confirm code completion.
kubectl get deployments -n poolside-models
kubectl get pods -n poolside-models
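A quick command-line check that the ingress answers on the new IP before testing in the browser; this assumes the hostname above resolves from where you run it:

```shell
# -k tolerates a self-signed certificate; a non-000 code means the ingress responded.
curl -sk -o /dev/null -w "%{http_code}\n" https://poolside.poolside.local/
```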

Common errors

You have created a model, but it is stuck in an Error or CrashLoopBackOff state, and the logs show:
    raise RuntimeError("Failed to infer device type")
RuntimeError: Failed to infer device type
This error occurs when the inference pod cannot detect GPUs. It typically appears when the model was created on custom hardware without a preset and was not assigned the "none" preset. It can also surface when the underlying hardware has GPU issues. Confirm the model was created with preset: none.
This can be checked via splash: splash models edit <model name>
Additionally, check the gpu-operator namespace and confirm all pods are healthy, i.e. Running or Completed.
# A valid, running configuration
$ kubectl get pods -n gpu-operator
NAME                                                          READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-mkhbf                                   1/1     Running     0          2d4h
gpu-operator-666bbffcd-hhqfc                                  1/1     Running     0          2d4h
gpu-operator-node-feature-discovery-gc-7c7f68d5f4-sz64r       1/1     Running     0          2d4h
gpu-operator-node-feature-discovery-master-58588c6967-gm9gt   1/1     Running     0          2d4h
gpu-operator-node-feature-discovery-worker-5hqq6              1/1     Running     0          2d4h
nvidia-container-toolkit-daemonset-44rsg                      1/1     Running     0          2d4h
nvidia-cuda-validator-pvv2n                                   0/1     Completed   0          2d4h
nvidia-dcgm-exporter-mm2vc                                    1/1     Running     0          2d4h
nvidia-device-plugin-daemonset-tj4x2                          1/1     Running     0          2d4h
nvidia-mig-manager-cbsv2                                      1/1     Running     0          2d4h
nvidia-operator-validator-qg27q                               1/1     Running     0          2d4h


# An invalid, problematic deployment
$ kubectl get pods -n gpu-operator
NAME                                               READY   STATUS      RESTARTS      AGE
gpu-feature-discovery-7frhq                        0/1     Init:0/1    5             16h
gpu-operator-node-feature-discovery-worker-rdvc5   1/1     Running     6 (15h ago)   16h
nvidia-container-toolkit-daemonset-zsdt6           0/1     Init:0/1    5             16h
nvidia-cuda-validator-s2k95                        0/1     Completed   0             4d3h
nvidia-dcgm-exporter-bhmrn                         0/1     Init:0/1    5             16h
nvidia-device-plugin-daemonset-mzsvb               0/1     Init:0/1    5             16h
nvidia-operator-validator-5shw6                    0/1     Init:0/4    5             16h
Poolside logins return a 500 error. Ensure the CoreDNS ConfigMap has been updated with the new IP address and that the CoreDNS deployment has been restarted.
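A sketch of that check and restart. The deployment name assumes RKE2's default CoreDNS chart (rke2-coredns-rke2-coredns in kube-system); confirm the actual name in your cluster first:

```shell
# Look for the stale IP in CoreDNS configuration (adjust the grep pattern to your old subnet).
kubectl get configmap -n kube-system -o yaml | grep -n "192.168.1.30"
# Restart CoreDNS so the updated ConfigMap is picked up.
kubectl rollout restart deployment rke2-coredns-rke2-coredns -n kube-system
```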