K8S-Troubleshooting: NodeNotReady
Four Node Statuses
In Kubernetes, nodes are the worker machines that run your applications. Each node has a specific status, which helps administrators monitor the cluster’s health and performance. The four primary statuses of a Kubernetes node are:
Ready – The node is fully operational and can run pods. This means the node is healthy and all necessary services (like kubelet, networking, and storage) are functioning correctly.
NotReady – The node is recognized by the cluster but is not in a state to run pods. This could be due to missing components, network failures, or system crashes.
SchedulingDisabled – Also known as "Cordoned," this status indicates that the node is temporarily marked unschedulable. Administrators use this status when performing maintenance or updates to prevent new pods from being scheduled on the node.
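Unknown – The node has lost communication with the control plane, usually because of network issues or a node failure, so Kubernetes cannot determine its exact condition. The control plane then sets the node's Ready condition to Unknown; in the STATUS column of kubectl get nodes, such a node is typically still displayed as NotReady.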
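As a quick way to spot nodes in any of the non-Ready states above, the STATUS column can be filtered. The following is a minimal sketch, not a definitive tool: the function name unhealthy_nodes is made up for illustration, and it assumes the default column layout of kubectl get nodes --no-headers (NAME first, STATUS second).

```shell
# unhealthy_nodes (hypothetical helper): reads `kubectl get nodes --no-headers`
# output on stdin and prints "NAME STATUS" for every node whose STATUS is not
# exactly "Ready" (e.g. NotReady, Ready,SchedulingDisabled).
# Assumes the default column order: NAME STATUS ROLES AGE VERSION.
unhealthy_nodes() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Usage on a machine with cluster access:
#   kubectl get nodes --no-headers | unhealthy_nodes
```

Cordoned nodes are listed too, because their status reads Ready,SchedulingDisabled rather than plain Ready.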
Reasons for NotReady Error
1. Node-Level Issues
Kubelet Failure – The kubelet process on the node has crashed or is not running.
Out of Resources – The node has run out of CPU, memory, or disk space.
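Disk Pressure – If the node is running low on disk space, the kubelet reports the DiskPressure condition and Kubernetes taints the node (node.kubernetes.io/disk-pressure) so that new pods are not scheduled on it.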
Network Disconnection – The node has lost network connectivity and cannot communicate with the control plane.
Time Synchronization Issues – Clock drift between nodes and the control plane may cause problems.
2. Control Plane Communication Issues
API Server Unreachable – If the node cannot communicate with the Kubernetes API server, it may be marked as NotReady.
Firewall or Security Group Restrictions – Network policies or firewall rules might be blocking communication.
DNS Resolution Issues – Nodes rely on DNS for service discovery; a DNS failure can cause nodes to become unreachable.
3. Component Failures
Kube-Proxy Failure – If kube-proxy crashes or is misconfigured, network issues can occur.
Container Runtime Issues – Problems with Docker, containerd, or another container runtime may prevent pod execution.
CNI Plugin Issues – Network plugin failures (Calico, Flannel, Cilium, etc.) can disrupt pod networking.
4. Node Under Maintenance
Manual Cordon or Drain – The node has been deliberately marked as unschedulable (e.g., kubectl cordon or kubectl drain).
Pending Reboot – Kernel updates, OS patches, or kubelet updates require a reboot.
Common Troubleshooting Methods
This section lists some basic commands for a few common circumstances; covering every possibility mentioned above is not practical.
Basic Commands
Normally, running the following commands and focusing on the relevant parts of their output will quickly reveal why the error occurs.
Check node status:
kubectl get nodes -o wide
Describe the problematic node:
kubectl describe node <node-name>
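The Conditions block of the describe output is usually the part worth focusing on. A rough sketch of pulling out just those conditions and flagging the suspicious ones follows; the helper names node_conditions and flag_conditions are illustrative, not part of kubectl, and the rule used (Ready should be True, every other condition should be False) is an assumption that matches the standard node conditions such as MemoryPressure, DiskPressure, PIDPressure and NetworkUnavailable.

```shell
# node_conditions NODE (hypothetical helper): print each node condition as
# TYPE=STATUS, one per line. Needs kubectl and access to the cluster.
node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# flag_conditions (hypothetical helper): read TYPE=STATUS lines on stdin and
# print only the ones that look unhealthy. Assumption: Ready should be True,
# and all other conditions should be False on a healthy node.
flag_conditions() {
  awk -F= 'NF >= 2 && (($1 == "Ready" && $2 != "True") || ($1 != "Ready" && $2 != "False")) { print }'
}

# Usage on a machine with cluster access:
#   node_conditions <node-name> | flag_conditions
```

An empty result suggests the conditions look healthy, in which case the Events section of the describe output is the next place to look.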
kubelet Problem
Check the kubelet status and logs on the node:
systemctl status kubelet
journalctl -u kubelet -f
# If necessary, restart the kubelet service
systemctl restart kubelet
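Kubelet logs can be noisy, so it may help to keep only the error lines. The sketch below assumes the kubelet writes its default klog text format, where each line starts with a severity letter (I, W, E or F) followed by the date as four digits; if the kubelet is configured for JSON logging this filter will not match. The function name kubelet_errors is made up for illustration.

```shell
# kubelet_errors (hypothetical helper): keep only klog error (E) and fatal (F)
# lines from stdin. Assumes klog text format, e.g.
#   E0102 12:00:01.000002    1234 kubelet.go:200] "message"
# `|| true` keeps the exit status at 0 when nothing matches.
kubelet_errors() {
  grep -E '^[EF][0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}' || true
}

# Usage on the node (-o cat prints only the message, without the journal prefix):
#   journalctl -u kubelet --since "15 min ago" --no-pager -o cat | kubelet_errors
```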
kube-proxy Problem
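# Describe the kube-proxy pods. Pod names carry a generated suffix, so select
# them by label (k8s-app=kube-proxy is the label used by kubeadm-style clusters)
kubectl describe pod -l k8s-app=kube-proxy -n kube-system
# See logs of the kube-proxy pods
kubectl logs -l k8s-app=kube-proxy -n kube-system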
# Describe DaemonSet
kubectl describe daemonset kube-proxy -n kube-system
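When reading the DaemonSet output, the useful comparison is between the number of pods that should be scheduled and the number that are ready. A small sketch of that comparison is shown here; ds_ready_summary is a hypothetical helper, and the two numbers are assumed to come from the DaemonSet status fields desiredNumberScheduled and numberReady.

```shell
# ds_ready_summary DESIRED READY (hypothetical helper): compare the desired
# and ready pod counts of the kube-proxy DaemonSet.
# Prints a one-line summary; returns 1 if fewer pods are ready than desired.
ds_ready_summary() {
  if [ "$2" -lt "$1" ]; then
    echo "kube-proxy: $2/$1 pods ready, $(($1 - $2)) missing"
    return 1
  fi
  echo "kube-proxy: all $1 pods ready"
}

# Usage on a machine with cluster access:
#   counts="$(kubectl get daemonset kube-proxy -n kube-system \
#     -o jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}')"
#   ds_ready_summary $counts
```

If pods are missing, kubectl get pods -n kube-system -o wide shows which nodes they run on, which helps tie the gap back to the NotReady node.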

