November 2, 2022

OpenShift 4.6 Automation and Integration: Recovering Failed Worker Nodes

Node Status

$ oc get nodes <NODE>

$ oc adm top node <NODE>

$ oc describe node <NODE> | grep -i taint
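
When a cluster has many nodes, it can help to reduce the node list to the ones that are not Ready. The following is a minimal sketch, not part of the original notes: the function name `not_ready_nodes` is hypothetical, and it assumes the default column layout of `oc get nodes --no-headers` (node name in column 1, status in column 2).

```shell
# Print the names of nodes whose STATUS column does not start with "Ready".
# Reads `oc get nodes --no-headers` output on stdin.
# Assumption: default columns, i.e. NAME is field 1 and STATUS is field 2.
not_ready_nodes() {
  awk 'NF >= 2 && $2 !~ /^Ready/ { print $1 }'
}

# Usage (requires a logged-in oc session):
#   oc get nodes --no-headers | not_ready_nodes
```

A status such as `Ready,SchedulingDisabled` is treated as Ready here, because a cordoned node is still healthy from the kubelet's point of view.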

OpenShift Taint Effects

3.6.1. Understanding taints and tolerations
https://access.redhat.com/documentation/en-us/openshift_container_platform/4.6/html-single/nodes/index#nodes-scheduler-taints-tolerations-about_nodes-scheduler-taints-tolerations

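  • PreferNoSchedule: the scheduler tries to avoid placing pods that do not tolerate the taint on the node, but this is not guaranteed.
  • NoSchedule: new pods that do not tolerate the taint are not scheduled onto the node; pods already running there are unaffected.
  • NoExecute: new pods that do not tolerate the taint are not scheduled onto the node, and running pods without a matching toleration are evicted.

apiVersion: v1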
kind: Node
metadata:
  annotations:
    machine.openshift.io/machine: openshift-machine-api/ci-ln-62s7gtb-f76d1-v8jxv-master-0
    machineconfiguration.openshift.io/currentConfig: rendered-master-cdc1ab7da414629332cc4c3926e6e59c
...
spec:
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master

Worker Node Not Ready

$ oc describe node/worker01
...output omitted...
Taints:             node.kubernetes.io/not-ready:NoExecute
                    node.kubernetes.io/not-ready:NoSchedule
...
Ready       False   ...     KubeletNotReady        [container runtime is down...

$ ssh core@worker01 "sudo systemctl is-active crio"

$ ssh core@worker01 "sudo systemctl start crio"

$ oc describe node/worker01 | grep -i taints
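
Note that `grep -i taints` matches only the line that contains the word "Taints"; when a node carries several taints, as in the output above, the remaining ones sit on indented continuation lines and are not shown. A minimal sketch of a filter that prints every taint, assuming the `oc describe node` layout shown above (the function name `list_taints` is hypothetical):

```shell
# Print every taint listed in `oc describe node/<NODE>` output (read from stdin),
# one "key:effect" entry per line, including indented continuation lines.
# Assumption: the Taints block is laid out as in the output shown above.
list_taints() {
  awk '
    /^Taints:/ {
      in_taints = 1
      sub(/^Taints:[ \t]*/, "")
      sub(/[ \t]+$/, "")
      if ($0 != "" && $0 != "<none>") print
      next
    }
    in_taints && /^[ \t]+[^ \t]/ {
      sub(/^[ \t]+/, "")
      sub(/[ \t]+$/, "")
      print
      next
    }
    { in_taints = 0 }   # any other line ends the Taints block
  '
}

# Usage (requires a logged-in oc session):
#   oc describe node/worker01 | list_taints
```

Once the container runtime is running again and the node reports Ready, this should print nothing for the not-ready taints.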

Worker Node Storage Exhaustion

3.6.1. Understanding taints and tolerations
https://access.redhat.com/documentation/en-us/openshift_container_platform/4.6/html-single/nodes/index#nodes-scheduler-taints-tolerations-about_nodes-scheduler-taints-tolerations

node.kubernetes.io/disk-pressure: The node has disk pressure issues. This corresponds to the node condition DiskPressure=True.

$ oc describe node/worker01 
...
Taints:             disk-pressure:NoSchedule 
                    disk-pressure:NoExecute 
...

Worker Node Capacity

$ oc get pod -o wide
NAME             READY   STATUS    ...  NODE      ...
diskuser-4cfdd   0/1     Pending   ...  <none>    ...
diskuser-ck4df   0/1     Evicted   ...  worker02  ...
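
To pick out the affected pods from a longer listing, the output can be filtered by the STATUS column. This is a sketch under stated assumptions: the function name `pods_with_status` is hypothetical, and it assumes the default `-o wide` column order (NAME, READY, STATUS, RESTARTS, AGE, IP, NODE, ...) with single-token values in each column.

```shell
# Print "NAME NODE" for every pod whose STATUS matches the first argument.
# Reads `oc get pod -o wide --no-headers` output on stdin.
# Assumption: STATUS is field 3 and NODE is field 7 (default wide layout).
pods_with_status() {
  awk -v want="$1" '$3 == want { print $1, $7 }'
}

# Usage (requires a logged-in oc session):
#   oc get pod -o wide --no-headers | pods_with_status Evicted
#   oc get pod -o wide --no-headers | pods_with_status Pending
```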

$ oc describe node/worker01
...output omitted...
Taints:             node.kubernetes.io/not-ready:NoSchedule
...
Conditions:
  Type             Status  ...   Reason                       ...
  ----             ------  ...   ------                       ...
  DiskPressure     True    ...   KubeletHasDiskPressure       ...
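
The Conditions table can also be checked from a script. The following sketch extracts the Status of a single condition from `oc describe node` output; the function name `condition_status` is hypothetical, and the parsing assumes the table layout shown above (condition type in the first column, status in the second).

```shell
# Print the Status (True/False/Unknown) of one node condition, e.g. DiskPressure.
# Reads `oc describe node/<NODE>` output on stdin.
# Assumption: the Conditions table is laid out as in the output shown above.
condition_status() {
  awk -v type="$1" '
    /^Conditions:/ { in_cond = 1; next }
    in_cond && /^[^ \t]/ { in_cond = 0 }            # next top-level section ends the table
    in_cond && $1 == type { print $2; in_cond = 0 } # first match wins
  '
}

# Usage (requires a logged-in oc session):
#   oc describe node/worker01 | condition_status DiskPressure
```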

Worker Node Unreachable

3.6.1. Understanding taints and tolerations
https://access.redhat.com/documentation/en-us/openshift_container_platform/4.6/html-single/nodes/index#nodes-scheduler-taints-tolerations-about_nodes-scheduler-taints-tolerations

node.kubernetes.io/unreachable: The node is unreachable from the node controller. This corresponds to the node condition Ready=Unknown.

$ ssh core@worker02 "sudo systemctl is-active kubelet" 

$ ssh core@worker02 "sudo systemctl start kubelet" 
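
The check-then-start pattern used here for kubelet, and earlier for crio, can be wrapped in a small helper. This is a minimal sketch, not a definitive recovery procedure: the function name `ensure_unit_active` is hypothetical, and it assumes SSH access to the node as the `core` user, as in the commands above. It relies on `systemctl is-active` returning a non-zero exit status when the unit is not active.

```shell
# Start a systemd unit on a node over SSH only if it is not already active.
# Arguments: $1 = node host name, $2 = systemd unit (e.g. kubelet or crio).
# Assumption: the node is reachable over SSH as the core user.
ensure_unit_active() {
  if ssh "core@$1" "sudo systemctl is-active --quiet $2"; then
    echo "$2 is already active on $1"
  else
    echo "$2 is not active on $1; starting it"
    ssh "core@$1" "sudo systemctl start $2"
  fi
}

# Usage:
#   ensure_unit_active worker02 kubelet
#   ensure_unit_active worker01 crio
```

After the service starts, the node can take a short while to report Ready again, so recheck its status and taints with `oc get nodes` and `oc describe node`.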
