Back in April, we noticed that several of our applications, but not all of them, were quite frequently timing out while querying internal or external services, regardless of port or protocol. Reproducing the issue was as simple as running cURL from any of our containers to any destination: the majority of the requests would stall for durations close to multiples of five seconds. Five seconds, you say? That is generally the red flag for DNS issues. Let’s figure it out…
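Something along these lines, run from any container, was enough to observe it; the destination is merely an example:

```sh
# Most iterations resolve instantly; the unlucky ones stall for roughly 5, 10
# or 15 seconds in the name-resolution phase.
for _ in $(seq 1 20); do
  curl -s -o /dev/null -w '%{time_namelookup}s\n' https://example.com
done
```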
An initial look at the problem
With our Kubernetes stacks using recent and fairly trusted components (AWS, Kubernetes 1.10.2, Weave, CoreDNS), and given the experience I had acquired while architecting and developing Tectonic at CoreOS, I was pretty confident our clusters were well configured overall. Still, Kubernetes has various moving pieces, and subtle issues may be introduced at any of those layers. Where might that be this time? Unlike in the OpenStack world, swapping one component for another in Kubernetes is fairly easy, thanks to the amazingly simple interfaces the system is built on (e.g. CNI) and the incredible implementation diversity the community offers. I therefore decided to invest an hour using iptables rather than IPVS, then replacing Weave with Calico, and CoreDNS with the older KubeDNS, in order to rule them out. No luck: the issue was still there. I did notice, however, that the issue disappeared when running Weave without fastdp, but running Weave without fastdp is not really an option.
I then remembered various networking issues we had had at CoreOS while working on Tectonic for Azure, caused by a TX checksum offloading malfunction in their hypervisor / load-balancer implementation, which was unlikely to be relevant here. The clusters and their nodes were mostly idle, and the ARP tables looked totally fine: no stale entries (as had happened with kube-proxy before), and nowhere near full. Back to square one.
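For what it is worth, ruling those out boils down to a couple of checks on the nodes; the interface name is just an example:

```sh
ethtool --show-offload eth0 | grep -i checksum    # TX/RX checksum offloading state
ip -s neigh show | wc -l                          # rough size of the ARP/neighbour table
cat /proc/sys/net/ipv4/neigh/default/gc_thresh3   # hard limit before the table is considered full
```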
Going deeper, and discovering a time-saving feature of CoreDNS
Looking at the facts again: the issue occurred with some applications but not all of them, most of the time but not always. The base image of the containers did not seem to have any effect on the numbers. A few issues had been opened here and there over the past few months about DNS latency, some of them totally unrelated (e.g. scalability, misconfiguration, ARP tables being full).
What was the difference between an affected application, like cURL, and an application that worked totally fine? I opened a few tcpdump sessions on the different nodes and containers in cURL’s path, in an attempt to answer this question and understand the problem better.
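Nothing fancy was needed: captures along these lines, inside the client container and on the node hosting the DNS replicas, are enough to see the pattern described next (pod and interface names are examples):

```sh
# Inside the client container (assuming tcpdump is available in its image):
kubectl exec -it client-pod -- tcpdump -ni eth0 'udp port 53'

# On the node running the DNS replicas, where pod traffic is still plain UDP/53:
tcpdump -ni weave 'udp port 53'
```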
Reading the container’s tcpdump capture, two lookups were made by libc, spaced by only a few CPU cycles: one for an A record and one for a AAAA record. While the responses to the A queries were coming back quickly, the AAAA queries did not seem to be answered in a timely manner and were repeated after five seconds. IPv6 is disabled everywhere across our clusters, at the kernel level, and our network interfaces do not even have link-local addresses. Why applications or libc would make AAAA lookups at all got me somewhat confused, but I could imagine potential use cases, so moving on. Reading the DNS server’s capture, IPv6 turned out to be irrelevant: the server would not even receive most of the packets containing the AAAA queries, which are transported over UDP and IPv4. When it did receive them, it would query the upstream server in a similar fashion. So the only IPv6 thing about those AAAA lookups is that they ask for IPv6 records, nothing else.
Because the resolv.conf file of Kubernetes’ containers has numerous search domains and ndots:5, libc generally has to look up several composed names before getting a positive result, unless the requested domain is fully qualified and has a trailing dot, which most applications do not use. For example, to resolve google.com, the names google.com.kube-system.svc.cluster.local., google.com.svc.cluster.local., google.com.cluster.local., google.com.ec2.internal. and finally google.com. must be looked up, for both A and AAAA records. That’s a lot of round trips, especially when most of the AAAA requests time out after five seconds and must be retried. I discovered that CoreDNS can actually limit the number of round trips required, thanks to its autopath feature: it automatically detects queries made with a known Kubernetes suffix, iterates server-side through the usual search domains, and leverages its own knowledge/cache of the available Kubernetes services to find a valid one (or falls back to querying the upstream server), finally returning both a CNAME pointing to the domain name that was found to have a valid record, and an A/AAAA response with the actual IP address for that name (or NXDOMAIN if the record does not exist, obviously). I was baffled to see how smart and convenient that is, such an easy win.
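To make the whole mechanism concrete, here is the kind of resolv.conf a pod in the kube-system namespace gets in this setup, followed by a minimal Corefile sketch enabling autopath; the exact Corefile depends on your CoreDNS version and deployment, and autopath requires the kubernetes plugin to run with pods verified:

```
# 10.96.0.10 is the cluster DNS service VIP (example value)
nameserver 10.96.0.10
search kube-system.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5
```

```
.:53 {
    errors
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods verified                     # needed by autopath to map source IPs to namespaces
        fallthrough in-addr.arpa ip6.arpa
    }
    autopath @kubernetes                  # walk the search path server-side, answer with a CNAME
    proxy . /etc/resolv.conf              # forward everything else upstream
    cache 30
}
```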
An initial workaround
This did not solve the root cause though, as we were still seeing AAAA lookups taking up to five seconds.
After a bit of digging, I read in the resolv.conf(5) man page that two options relevant to the parallel lookup mechanism used by glibc are available: single-request and single-request-reopen, which both make the lookups sequential. After specifying either of those options, using the relatively new dnsConfig configuration block (alpha in Kubernetes 1.9), I finally saw nothing but sub-second queries, and got immediately excited at the thought that I would simply be able to add this to our templates and call it a day. I applied the changes and happily went home, too late anyway.
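For reference, the change amounts to something like this in the pod templates (container name and image are placeholders); the option simply ends up appended to the containers’ /etc/resolv.conf:

```yaml
spec:
  dnsConfig:
    options:
      - name: single-request-reopen   # glibc-only, as the next section shows
  containers:
    - name: app
      image: example/app:latest
```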
Setback & netfilter race conditions
That was until I discovered that the workaround had no effect on Alpine containers. It was at this moment that I knew musl was going to give me a hard time again; I should have known better. Its resolver only supports ndots, attempts and timeout. Awesome. I went to talk to Rich Felker on #musl, only to learn that no change would be made, as sequential lookups go against their architecture and because, according to other users on the IRC channel, Kubernetes’ use of ndots is a heresy anyway. Wherever the actual issue lies (be it in the general concept of Kubernetes’ networking), that is where it should be fixed.
Sequential queries work; parallel ones fail sometimes, but not always. That’s got to be a race condition: with the amount of networking trickery Kubernetes performs to get packets from one end to the other, it would not be too surprising after all. After some additional research, I found some existing literature about netfilter race conditions, such as this one or that one. Looking at conntrack -S, we had thousands of insert_failed: this was it. It turns out that a few engineers had noticed the issue and gone through the troubleshooting process as well, identifying a SNAT race condition that is, ironically, briefly documented in netfilter’s code. The solution would be to add --random-fully to all masquerading rules, which are set by several components in Kubernetes: kubelet, kube-proxy, Weave and Docker itself. There is only one little problem here… this is an early feature, available neither on Container Linux, nor in Alpine’s iptables package, nor in the Go wrapper for iptables. Regardless, it seems generally accepted that this would be the solution to the issue, and some developers are now implementing the missing flag support, but behold, it does not stop here.
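Checking for the symptom, and what the proposed fix would look like once the flag is available, is straightforward; the masquerading rule below is only an illustration, not one of the exact rules installed by those components:

```sh
# The tell-tale sign: insert_failed counters increasing on the nodes.
conntrack -S | grep -o 'insert_failed=[0-9]*'

# A fully-randomized masquerading rule (10.32.0.0/12 is Weave's default pod range);
# this requires iptables and kernel support for --random-fully.
iptables -t nat -A POSTROUTING -s 10.32.0.0/12 ! -o weave -j MASQUERADE --random-fully
```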
Based on various traces, Martynas Pumputis discovered that there was also a race with DNAT, as the DNS server is reached via a virtual IP. Because UDP is a connectionless protocol, connect(2) does not send any packet, and therefore no entry is created in the conntrack hash table. During the translation, the following netfilter hooks are called in order: nf_conntrack_in (creates the conntrack hash object and adds it to the list of unconfirmed entries), nf_nat_ipv4_fn (performs the translation and updates the conntrack tuple), and nf_conntrack_confirm (confirms the entry and adds it to the hash table). The two parallel UDP requests race for the entry confirmation and end up being translated to different DNS endpoints, as there are multiple DNS server replicas available. Therefore, insert_failed is incremented and the request is dropped. This means that adding --random-fully does not mitigate the packet loss, as that flag would only help mitigate the SNAT race! The only reliable fix would be to patch netfilter directly, which Martynas Pumputis is currently attempting to do.
A short and efficient workaround
Getting a patch into the kernel, and having it released, is not something that happens overnight. I therefore started writing my own workaround, based on all the knowledge gathered while troubleshooting the issue. Fortunately, I had learned how to use tc(8) back when I was administrating a large container infrastructure for my startup, Harmony Hosting, in order to provide bandwidth guarantees to our customers and help mitigate DDoS attacks. Coping with such a race condition requires nothing more than introducing a small amount of artificial latency on every AAAA packet. Using iptables, we can mark UDP traffic destined for the port exposing our DNS server that has the DNS query bits set (an inexpensive check) and that contains at least one question with QTYPE=AAAA. We need to be cautious because of existing marks, and use a proper mask. With tc, we can route the marked traffic, through a two-band priomap, to a netem qdisc that introduces a few milliseconds worth of latency, and everything else to a standard fq_codel. Additionally, we need to do our DPI and traffic shaping on the right interface, as Weave encapsulates and encrypts traffic using IPsec (ESP), obfuscating everything past that point. The good news is that the Weave interface is virtual and therefore set to noqueue by default: we won’t need to worry about mq or about grafting qdiscs onto specific TX/RX queues or CPU cores, which makes the script extremely simple.
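Here is a rough sketch of what such a script can look like. The interface name, the fwmark bit and the delay value are assumptions to adapt to your own cluster, and the matching expressions are cheap heuristics rather than a full DNS parser:

```sh
#!/bin/sh
set -eu

IFACE="weave"          # interface carrying un-encapsulated pod traffic (assumption)
MARK="0x00080000"      # fwmark bit assumed to be unused by kube-proxy/Weave/Docker
DELAY="4ms"            # a few milliseconds is enough to de-interleave A and AAAA

# Mark UDP packets to port 53 that are standard DNS queries (QR bit and opcode
# zero, assuming a 20-byte IPv4 header) and that contain at least one AAAA
# question: |00 00 1c 00 01| is "root label, QTYPE=AAAA, QCLASS=IN".
iptables -t mangle -A POSTROUTING -p udp --dport 53 \
  -m u32 --u32 "28 & 0xF800 = 0" \
  -m string --algo bm --from 40 --hex-string '|00001c0001|' \
  -j MARK --set-xmark "${MARK}/${MARK}"

# Two-band prio qdisc: everything defaults to band 2 (fq_codel), while the
# marked AAAA queries are steered into band 1, where netem adds a small delay.
tc qdisc replace dev "${IFACE}" root handle 1: prio bands 2 \
  priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
tc qdisc replace dev "${IFACE}" parent 1:1 handle 10: netem delay "${DELAY}"
tc qdisc replace dev "${IFACE}" parent 1:2 handle 20: fq_codel
tc filter add dev "${IFACE}" parent 1:0 protocol ip prio 1 \
  handle "${MARK}/${MARK}" fw flowid 1:1
```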
Finally, we can build a very simple container image containing only the iproute2 package, and run it alongside Weave’s containers in its DaemonSet.
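As a sketch, the extra container added to the weave-net DaemonSet could look like the following; the image name and script path are placeholders, the pod already runs with hostNetwork, and the NET_ADMIN capability is enough for tc (and for iptables, if the marking rule is installed from here as well):

```yaml
      - name: aaaa-delay
        image: registry.example.com/dns-tc-workaround:latest  # iproute2 (+ iptables if needed)
        command: ["/bin/sh", "-c", "/qdisc-setup.sh && while true; do sleep 3600; done"]
        securityContext:
          capabilities:
            add: ["NET_ADMIN"]
```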
Conclusion
All in all, given the current adoption of Kubernetes, it is quite surprising that only a few Kubernetes engineers have noticed this omnipresent and highly disruptive issue. It may be that networking conditions are not as favorable to the race everywhere, or it may be a symptom of a general lack of monitoring.
However, I am thrilled that we ended up with a workaround that consists of 10 lines of bash and 10 lines of YAML, that does not require maintaining patches anywhere or pushing any changes down to our users, and that reduces the likelihood of the races happening to far less than a percent. And along the way, we also picked up a change that dramatically cuts down the number of DNS round trips!
Edit: As mentioned by Duffie Cooley, it would also be possible to run the DNS server on every node using a DaemonSet, and to specify the node’s IP as the clusterDNS in kubelet’s configuration. This solution is unfortunately unusable for us, as containers with cluster-wide permissions (even read-only) cannot run on our worker nodes, and as containers do not have direct network access to any of our nodes, for security reasons.
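For completeness, that alternative boils down to something like this in a KubeletConfiguration file, with the value templated per node (the IP below is a placeholder):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
clusterDNS:
  - "10.0.42.10"   # this node's own IP, where the local DNS pod listens via hostNetwork
```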