  • Nvidia & Docker – Failed to initialize NVML

    I haven’t written in a while and thought this might be helpful to many out there. After containerizing the GPU workloads in my home lab, I noticed that my GPU-enabled containers would, seemingly at random, start throwing the following error (e.g. when running nvidia-smi): *Failed to initialize NVML: Unknown Error*

  • Our breakup with Weave Net

    In 2017, when BitMEX started using Kubernetes, we picked Weave Net as our overlay network for its obvious simplicity (150 lines of YAML, one DaemonSet, no CRDs) and its transparent encryption via IPsec ESP. As our clusters grew bigger, with more and more tenants running real-time financial applications in production, that delusion faded.

  • Kube-proxy IPVS – simple, efficient, unstable

    After we started updating Kubernetes in place from 1.10 to 1.12, everything seemed fine. Eight hours of sleep and a few pod OOMKills later, we noticed that some of the AWS ELBs that front our Kubernetes services were reporting a few unhealthy target nodes.

  • The moment Container Linux almost broke our fleet

    The value proposition of Container Linux by CoreOS (Red Hat / IBM) is its ability to perform automated operating system updates thanks to its read-only active/passive /usr mount point, the update_engine, and Omaha. This philosophy (“secure the Internet”) allows system administrators to stop worrying about low-level security patches and helps define a clear separation of concerns between operations and applications. That's the theory anyway...

  • 5 – 15s DNS lookups on Kubernetes?

    Back in April, we noticed that several of our applications, but not all, were frequently timing out when querying either internal or external services, regardless of the ports or protocols. Reproducing the issue was as simple as running cURL in any of our containers, against any destination: the majority of the queries would stall for durations close to multiples of five seconds. Five seconds, you say? That is generally the red flag for DNS issues. Let’s figure out...
