The value proposition of Container Linux by CoreOS (now Red Hat / IBM) is its ability to perform automated operating system updates, thanks to its read-only active/passive /usr mount point, the update_engine, and the Omaha protocol. This philosophy (“secure the Internet”) lets system administrators stop worrying about low-level security patches and helps define a clear separation of concerns between operations and applications.
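For reference, the state of this machinery can be inspected directly on a node; the commands below are illustrative (the device name is an example, and output fields may vary between releases).
# Inspecting the update machinery on a node (illustrative)
update_engine_client -status   # e.g. UPDATE_STATUS_UPDATED_NEED_REBOOT once a new image is staged on the passive /usr
cgpt show /dev/xvda            # GPT priorities of the USR-A / USR-B partitions decide which /usr boots next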
In practice, the update_engine periodically pings the configured CoreUpdate server and, upon finding a new available version of Container Linux, downloads and installs it to the passive partition, which is then marked as the active partition for the next boot. Note that CoreUpdate does not push updates to all online nodes at once, but spreads their rollout over several hours or days. From there, a few strategies exist to apply the update, the most common being locksmith's etcd-lock, which reboots at most n servers at a time. The second most frequently used mechanism, which ships by default in Tectonic, is the container-linux-update-operator: it orchestrates the necessary reboots at the Kubernetes layer, draining the workloads present on each node, one by one.
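Both the update server and the reboot strategy are configured through /etc/coreos/update.conf; the snippet below is a sketch of an etcd-lock setup with placeholder values, not our actual configuration.
# /etc/coreos/update.conf -- sketch with placeholder values
GROUP=stable                  # release channel followed by the update_engine
SERVER=https://public.update.core-os.net/v1/update/   # Omaha-speaking CoreUpdate endpoint
REBOOT_STRATEGY=etcd-lock     # locksmith takes a shared etcd lock before rebooting a node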
Because quite a few of our applications do not play well with restarts (e.g. slow warm-ups, costly re-initialization of data streams, manual algorithm restarts), we had disabled locksmith and never installed the container-linux-update-operator. However, we didn't want to refuse occasional updates entirely, for instance when spinning up new instances or performing rare manual reboots, and therefore kept the update_engine enabled (what could go wrong?).
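Concretely, disabling the reboot coordination while keeping the download machinery alive boils down to something like the following sketch (in our case this was baked into the provisioning templates).
# Disable coordinated reboots, keep automatic downloads (illustrative)
sudo systemctl disable --now locksmithd.service
echo 'REBOOT_STRATEGY=off' | sudo tee -a /etc/coreos/update.conf
systemctl is-enabled update-engine.service   # the update_engine stays on and keeps staging new releases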
On October 24th, CoreOS released a new version (1855.5.0) on their Stable channel, which our nodes downloaded and installed to the passive partition, waiting for manual reboots to switch over. As we were working on other projects at the time and were not making any active changes to our Kubernetes infrastructure, we hadn’t started any new clusters/instances for a little while. A few days later, one of our engineers started the process of upgrading to Kubernetes 1.12, and noticed that some of the test nodes were rebooting unexpectedly from time to time – weird, but okay; it might be related to some other changes. The following night, another of our engineers got paged because one of our production nodes started to flip between the Ready / NotReady states. Just a single node out of the entire pool – we can simply take it out for now and investigate during working hours.
We’ve had instances restart on us in the past, either because the underlying cloud provider’s host died, or because of some irritating panics related to the VM’s network stack implementation. But this time, the node was continuously restarting. Of course, the logs didn’t show anything, and the pstore was empty. We provisioned a new server: same symptoms. We then noticed that this was the very first node running that particular Container Linux version, and started a server on 1855.4.0 instead – problem “solved”. The CHANGELOG looked pretty slim: just a git update and a kernel bump. We don’t use git, and the restarts sure looked like kernel panics.
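For reference, confirming which release and kernel a node actually runs only takes a couple of commands (illustrative).
# Which release is this node actually running?
grep -E 'VERSION|PRETTY_NAME' /etc/os-release   # e.g. VERSION=1855.5.0
uname -r                                        # e.g. 4.14.74-coreos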
# Kernel Panic
[ 22.106179] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
[ 22.110683] IP: netif_skb_features+0x39/0x280
[ 22.113125] PGD 0 P4D 0
[ 22.114579] Oops: 0000 [#1] SMP PTI
[ 22.116668] Modules linked in: xt_u32 ts_bm xt_string cls_fw sch_netem sch_prio xt_esp drbg seqiv esp4 xfrm4_mode_transport xt_policy iptable_mangle xt_NFLOG vport_vxlan vxlan ip6_udp_tunnel udp_tunnel openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_defrag_ipv6 ip_set_hash_ipportnet ip_set_bitmap_port ip_set_hash_ipport ip_set_hash_ipportip dummy ip_set_hash_ip ip_vs_sh ip_vs_wrr ip_vs_rr xt_set nfnetlink_log ip_set ip_vs veth xt_comment xt_mark ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter xt_conntrack br_netfilter bridge stp llc xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c overlay nls_ascii nls_cp437 vfat fat mousedev i2c_piix4 psmouse sb_edac edac_core evdev i2c_core button xenfs
[ 22.156650] xen_privcmd sch_fq_codel ext4 crc32c_generic crc16 mbcache jbd2 fscrypto dm_verity dm_bufio crc32c_intel xen_blkfront ata_piix aesni_intel libata aes_x86_64 crypto_simd cryptd glue_helper scsi_mod ixgbevf dm_mirror dm_region_hash dm_log dm_mod dax
[ 22.169694] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.14.74-coreos #1
[ 22.173623] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[ 22.177359] task: ffff9ef7abb01e00 task.stack: ffffbe81c31a4000
[ 22.180928] RIP: 0010:netif_skb_features+0x39/0x280
[ 22.183859] RSP: 0018:ffff9ef7af303e40 EFLAGS: 00010246
[ 22.186956] RAX: 00000000000000c0 RBX: ffff9ef7a6765c00 RCX: ffff9ef7a5c2dc00
[ 22.191274] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9ef7a6765c00
[ 22.195618] RBP: ffff9ef7a6765c00 R08: ffff9ef7a5d8109c R09: 0000000000000001
[ 22.200029] R10: 0000000000000000 R11: 0000000000000040 R12: ffff9ef7a6765c00
[ 22.204238] R13: ffff9ef7a4864000 R14: ffff9ef7a5d81000 R15: 00000000ffffff82
[ 22.208295] FS: 0000000000000000(0000) GS:ffff9ef7af300000(0000) knlGS:0000000000000000
[ 22.212863] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 22.216144] CR2: 00000000000000d0 CR3: 00000000c020a004 CR4: 00000000001606e0
[ 22.220189] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 22.224320] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 22.228294] Call Trace:
[ 22.229868] <IRQ>
[ 22.231236] ? clockevents_program_event+0xcf/0x100
[ 22.234222] validate_xmit_skb+0x1c/0x270
[ 22.236576] validate_xmit_skb_list+0x3d/0x60
[ 22.239226] sch_direct_xmit+0x12b/0x1b0
[ 22.241453] __qdisc_run+0x10c/0x300
[ 22.243454] net_tx_action+0xda/0x1d0
[ 22.245658] __do_softirq+0xe8/0x2cf
[ 22.247807] irq_exit+0xc2/0xd0
[ 22.249777] xen_evtchn_do_upcall+0x2c/0x40
[ 22.252278] xen_hvm_callback_vector+0x7d/0x90
[ 22.254913] </IRQ>
[ 22.256212] RIP: 0010:native_safe_halt+0x2/0x10
[ 22.258783] RSP: 0018:ffffbe81c31a7ed0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff0c
[ 22.263015] RAX: ffffffff886215f0 RBX: ffffffff89319420 RCX: 0000000000000000
[ 22.267188] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 22.271498] RBP: 0000000000000004 R08: 0000000000000002 R09: 0000000000000001
[ 22.275733] R10: 0000000000000234 R11: 0000000000000000 R12: 0000000000000000
[ 22.279925] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 22.284066] ? __cpuidle_text_start+0x8/0x8
[ 22.286545] default_idle+0x1a/0x100
[ 22.288562] do_idle+0x15f/0x1d0
[ 22.290525] cpu_startup_entry+0x6f/0x80
[ 22.292844] start_secondary+0x1ae/0x200
[ 22.295176] secondary_startup_64+0xa5/0xb0
[ 22.297662] Code: 48 8b 8f d0 00 00 00 0f b6 97 91 00 00 00 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 8b 87 cc 00 00 00 48 8b 77 10 83 e2 20 <48> 8b 9e d0 00 00 00 48 01 c8 66 83 78 04 00 74 4d 0f b7 78 06
[ 22.308295] RIP: netif_skb_features+0x39/0x280 RSP: ffff9ef7af303e40
[ 22.312018] CR2: 00000000000000d0
[ 22.313930] ---[ end trace 6e3f796914ee2bd9 ]---
[ 22.316655] Kernel panic - not syncing: Fatal exception in interrupt
[ 22.321014] Kernel Offset: 0x7000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 22.436485] Code: 48 8b 8f d0 00 00 00 0f b6 97 91 00 00 00 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 8b 87 cc 00 00 00 48 8b 77 10 83 e2 20 <48> 8b 9e d0 00 00 00 48 01 c8 66 83 78 04 00 74 4d 0f b7 78 06
[ 22.445223] RIP: netif_skb_features+0x39/0x280 RSP: ffff90292f2c3e40
[ 22.448069] CR2: 00000000000000d0
[ 22.449633] ---[ end trace 9311e23e3029cde7 ]---
[ 22.451740] Kernel panic - not syncing: Fatal exception in interrupt
[ 22.455385] Kernel Offset: 0xe000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
After a little bit of digging, we got our hands on a panic dump. From there, a basic Google search answered our questions: “Crash with latest v4.14.73 in netif_skb_features”. Of course, our Kubernetes nodes use sch_netem in some instances, for very specific packets. We only seldom noticed the issue on our in-development 1.12 cluster, as those specific packets rarely flow there. In production, however, those packets are transmitted as soon as Kubernetes starts our workloads, causing the nodes to panic almost immediately. This would seem to indicate that sch_netem may not be tested as part of the Linux kernel release process. Since our misadventure, a kernel patch has been submitted upstream, but it has yet to be released in Container Linux.
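For context, sch_netem is the kernel module behind tc’s netem network-emulation qdisc; something as simple as the hypothetical qdisc below (ours differs) sends egress traffic through the code path that panicked on the affected 4.14.x kernels.
# Hypothetical netem qdisc (ours differs) -- enough to exercise sch_netem
sudo tc qdisc add dev eth0 root netem delay 50ms 10ms
lsmod | grep sch_netem   # check whether the module is loaded on a node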
Pinning new nodes to Container Linux 1855.4.0, disabling the update_engine everywhere, and forcing all existing nodes to keep booting on their currently active partitions prevented the issue from spreading any further, as shown below. A good way to catch potential issues early at this layer would be to run different node pools subscribed to different channels (Stable, Beta, Alpha).
# Short-circuiting any further Container Linux update
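# For every node: stop the update_engine, then bump the GPT priority of the
# currently mounted /usr partition so the next boot ignores the staged update.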
kubectl get nodes | awk '{ if (NR!=1) { print $1 }}' | xargs -I'{}' ssh core@{} 'sudo systemctl mask --now update-engine; sudo cgpt prioritize $(rootdev -s /usr)'
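To the last point, subscribing a canary node pool to a faster-moving channel is just a matter of the GROUP setting mentioned earlier; a sketch of what such a pool’s configuration might look like (values illustrative):
# /etc/coreos/update.conf on a canary node pool (illustrative)
GROUP=beta            # this pool receives releases ahead of the Stable fleet
REBOOT_STRATEGY=off   # still no automatic reboots; canaries are rebooted manually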
Now, here is the thing: had we been using locksmith as the update coordinator (like we would in the etcd-cloud-operator if it were enabled), the update would almost certainly have been marked as successful, since the failure only appeared about a minute after the node started, and locksmith would have proceeded to reboot the other nodes, catastrophically leading to cluster-wide panic hell in no time. By the time we diagnosed the root cause, every single node had the new kernel installed and ready to go, pending a reboot. The issue may not have spread as far with the container-linux-update-operator, though, as the local update agent container has to run for at least a few seconds in order to report a successful update.
This is not the first time unexpected issues have come out of Container Linux upgrades. As minimal as Container Linux is, as well tested as the Linux kernel is, and as exciting as automated updates are, an untested change to a single line of code can cause serious havoc. Tectonic, CoreOS’ distribution of Kubernetes, shares the same philosophy, but with a much more complex system to update – and I was once at CoreOS, responsible for shipping downstream the updates developed by dozens of very talented engineers. Red Hat / IBM’s successors to those products will likely follow the same principle, which is awesome, but it requires great care from everyone – nothing is magic.