After we started in-place upgrades from Kubernetes 1.10 to 1.12, everything seemed fine. Eight hours of sleep and a few pod OOMKills later, we noticed that some of the AWS ELBs fronting our Kubernetes services were reporting a few unhealthy target nodes.
While one of them was marking every node as unhealthy, effectively taking the service down, most of the affected ELBs were only missing a few nodes, and it was the same set of nodes across all services. Fortunately, troubleshooting Kubernetes services and ingresses is a well-known process, as tenants commonly misconfigure them. The first step is to verify that Kubernetes lists endpoints for the service in question, which reveals the most common configuration issues: a missing port declaration in the container specification, a missing network policy, or a typo in a pod label or port name.

As for the service that was completely unhealthy, it turned out that a network policy was indeed missing. Since unauthorized traffic is blocked by default on our clusters, the fact that this service had been working until now is nothing short of scary. From our later investigation, it appears that the version of the network policy controller we were running had left behind stale rules that authorized ELB traffic to the pods after one of our developers experimented with ALBs. Updating and bouncing the network policy controller reconciled the state and resolved the failure.
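To make these first checks concrete, here is roughly what they look like with kubectl; the service and namespace names below are hypothetical:

```
# Verify that Kubernetes populated endpoints for the service; an empty
# list usually points to a selector/label mismatch or a missing port
# declaration on the pod spec.
kubectl -n my-namespace get endpoints my-service -o wide

# Compare the service selector with the labels actually set on the pods.
kubectl -n my-namespace describe service my-service
kubectl -n my-namespace get pods --show-labels

# Check that a network policy actually allows the incoming traffic.
kubectl -n my-namespace get networkpolicy
```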
When it came to the other degraded ELBs, the Kubernetes services were properly configured, routing worked fine from several nodes, and the set of unhealthy nodes seemed uncorrelated with the nodes the pods were scheduled on. Time to pull up the mental diagram of the data flow for Kubernetes Services. Given the facts gathered so far, the issue was most likely a misconfiguration of either the instances' dynamic security groups or the service routing machinery (i.e., kube-proxy, IPVS, iptables). Network policies were pretty much out of the picture: the ipset/iptables rules they generate filter traffic right between the bridge and the containers, which means a policy issue would have taken the pods out on every node at once.
Simplified diagram of the data flow for a Kubernetes Service of Type=LoadBalancer
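To see roughly what the ELB health check sees, it helps to probe the service's NodePort directly on a node reported healthy and on one reported unhealthy; the port and node addresses below are made up for illustration:

```
# The ELB health check typically targets the service's NodePort on each instance.
kubectl -n my-namespace get service my-service \
  -o jsonpath='{.spec.ports[0].nodePort}'

# Probe that port on a node the ELB reports as healthy, then on one it
# reports as unhealthy, and compare the behavior.
curl -sv --connect-timeout 2 http://10.0.1.23:31380/
curl -sv --connect-timeout 2 http://10.0.2.42:31380/
```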
On those clusters, kube-proxy runs in IPVS mode, which presents four major benefits compared to the (still) default iptables mode:
- O(1) rule insertion: the sync that happens when adding new pods or services, which can take minutes with iptables once there are a few thousand services [1],
- O(1) rule evaluation: determining which rule should route a packet, which adds significant latency to packets with iptables once there are a few thousand services [1],
- The ability to change the scheduling algorithm, perform server health checking, retry connections, etc.,
- Straightforward rule debugging with ipvsadm -ln (illustrated below).
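For instance, listing the IPVS virtual servers and their real servers (the pod IPs) looks roughly like this; the addresses and counters are purely illustrative:

```
$ ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.3.0.42:80 rr
  -> 10.2.1.17:8080               Masq    1      3          12
  -> 10.2.3.24:8080               Masq    1      2          9
```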
With the above command, it quickly turned out that the IPVS rules were outdated on the unhealthy nodes: they were pointing to old pod IPs. Since kube-proxy is responsible for updating those rules, the next step was to look at the kube-proxy logs. The logs were filled with messages about attempts to remove those old IPs from the rule sets, attempts that apparently never succeeded, since kube-proxy kept retrying them for an extended period of time. Restarting kube-proxy on a node would fix that node from the ELB's perspective.
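Assuming kube-proxy runs as a DaemonSet in kube-system, as it does on kubeadm-style deployments (the pod, service, and namespace names below are hypothetical), the inspection and the workaround boil down to:

```
# Look at kube-proxy on one of the unhealthy nodes.
kubectl -n kube-system logs kube-proxy-x7z2q

# Compare the pod IPs listed by ipvsadm with the current endpoints.
kubectl -n my-namespace get endpoints my-service

# Bouncing kube-proxy forces a full re-sync of the IPVS rules;
# the DaemonSet recreates the pod immediately.
kubectl -n kube-system delete pod kube-proxy-x7z2q
```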
A GitHub search later, we found a report of kube-proxy hanging. The issue reveals that the bug was introduced by the implementation of graceful connection termination, which had been cherry-picked from the Kubernetes 1.13 branch into both the 1.11.5 and 1.12.2 releases. We therefore decided to test and roll back kube-proxy to 1.12.1 immediately in order to cope with the problem. Soon after, Laurent Bernaille took the lead on the problem, quickly putting a PR together and cherry-picking it across release branches for the community. As of December 17th, 2018, Kubernetes 1.11.6, 1.12.4, and 1.13.1 contain the fix.
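As a rough sketch of such a rollback, assuming kube-proxy is deployed as a DaemonSet named kube-proxy in kube-system and uses the upstream image (both assumptions, not necessarily how every cluster is set up):

```
# Pin kube-proxy back to v1.12.1, which predates the faulty cherry-pick.
kubectl -n kube-system set image daemonset/kube-proxy \
  kube-proxy=k8s.gcr.io/kube-proxy:v1.12.1

# Watch the new version roll out across the nodes.
kubectl -n kube-system rollout status daemonset/kube-proxy
```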
Kubernetes is not quite OpenStack/Neutron, yet with all add-ons considered, there are still quite a few lines of code involved. Between Kubernetes v1.10.0 and v1.11.0, in the main repository alone, there were 3,733 commits from 54 contributors, generating 596 non-blank lines in the CHANGELOG. Running conformance tests before and after updates is good practice; however, catching this particular bug would have required executing the tests repeatedly over the course of a few days. Running staging clusters and updating them first is another obvious practice, as the workloads that run there are less critical. Additionally, not upgrading to X.Y.0 versions is likely a great idea (although it would not have helped in this case).
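For what it's worth, repeated conformance runs can be scripted with a tool such as Sonobuoy; the loop below is only a sketch, not what we actually run:

```
# Run the conformance suite in a loop and keep each result tarball
# so runs can be compared over the course of a few days.
while true; do
  sonobuoy run --wait
  tarball=$(sonobuoy retrieve ./conformance-results)
  sonobuoy results "$tarball"
  sonobuoy delete --wait
done
```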
As Kris Nova puts it in her KubeCon NA 2018 talk:
Kubernetes is complex,
Complexity is chaos,
Chaos can cause problems,
Conquering chaos is freedom,
Conquering Kubernetes is worth it.
You can’t have a clusterbleep without a cluster.