Can Kubernetes Navigate Disaster?

Image credit: iStockphoto/ConstantinosZ

Once you’ve got Kubernetes in production, those predictable business continuity and disaster recovery (DR) exercises get a lot more interesting — and not necessarily in a good way. That’s why I’m focusing on the challenges of Kubernetes disaster recovery and business continuity in my recently published research.

As Kubernetes makes its way into stateful applications — that is, apps that save data to, or read from, persistent disk storage — infrastructure and operations (I&O) leaders will have to figure out whether apps running on that cloud-native infrastructure can meet their DR goals. And if they’re relying only on out-of-the-box Kubernetes, the answer is likely to be no. As Annette Clewett of Red Hat asked back in 2019 at KubeCon + CloudNativeCon, “How can you have a serious platform if you have no backup and recovery?”

In the two years since, the Cloud Native Computing Foundation (CNCF) community has worked to build out the Kubernetes ecosystem to provide enterprise-grade storage for Kubernetes — for example, with the frequent upgrades of Rook, which orchestrates storage operators used in Kubernetes, including Ceph, Cassandra, and NFS. But it is still largely up to the user to figure out how to make it all work — from basic storage to full DR — either on their own or with the support of various vendors. Notably, the CNCF-certified Kubernetes distributions don’t claim to have built-in disaster recovery capabilities, except for Kublr.

Some users might look to the hyperscale cloud service providers to offer a Kubernetes DR solution to go along with their managed Kubernetes services. If so, they will likely be disappointed. Microsoft, for example, provides a set of best practices for business continuity and DR on Azure Kubernetes Service (AKS) but directs users to “common storage solutions [that] provide their own guidance about disaster recovery and replication.” AWS documentation for Amazon Elastic Kubernetes Service (EKS) describes the resiliency of the Kubernetes control plane but is silent on what this means for providing DR for the apps running on EKS.

Google takes a different tack, incorporating a more detailed discussion of disaster recovery for Google Kubernetes Engine, including storage as part of the “disaster recovery building blocks” for Google Cloud Platform. Those documents may be a fine starting point for engineers already steeped in Kubernetes, but Google’s guides require expertise that may be beyond systems operations teams who are still sorting out their transition to site reliability engineers. If Kubernetes is going to move into mainstream enterprise IT, basic DR will have to become more straightforward. Failover for high-availability applications on Kubernetes is an even bigger challenge, of course.

So if you need DR for apps running on Kubernetes, you’d better shop around. Several storage vendors and Kubernetes-based platforms do provide DR or provide support for other vendors that do. However, operations teams should expect some awkward moments in the regular DR exercises, even with those tools.

Some DR procedures won’t change. Project leaders will still convene a meeting to set DR goals, and application teams will still report additions and changes under development. I&O teams will still update runbooks with recovery point objectives (RPO), the amount of data loss or data reentry that can be tolerated, and recovery time objectives (RTO), the acceptable time that systems will be unavailable. However, application development teams and business users accustomed to aggressive RTOs may not be aware of the complexities involved in hitting those same numbers with applications running on Kubernetes.
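To make the two runbook numbers concrete, here is a minimal Python sketch of the kind of check a team might run before a DR exercise. It is an illustration only: the `DrTarget` structure, the `meets_targets` function, and the sample figures are hypothetical, and it assumes a simple periodic-backup setup in which worst-case data loss equals the interval between backups and downtime equals the estimated restore time. Real Kubernetes recoveries involve more moving parts than this.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class DrTarget:
    """Recovery objectives as recorded in a runbook (hypothetical structure)."""
    rpo: timedelta  # recovery point objective: tolerable data loss
    rto: timedelta  # recovery time objective: tolerable downtime


def meets_targets(target, backup_interval, estimated_restore_time):
    """Return (rpo_ok, rto_ok) for a simple periodic-backup setup.

    Assumption: worst-case data loss equals the backup interval, and
    downtime equals the estimated restore time.
    """
    rpo_ok = backup_interval <= target.rpo
    rto_ok = estimated_restore_time <= target.rto
    return rpo_ok, rto_ok


# Illustrative numbers only, not taken from any real system.
target = DrTarget(rpo=timedelta(minutes=15), rto=timedelta(hours=1))
result = meets_targets(
    target,
    backup_interval=timedelta(hours=1),
    estimated_restore_time=timedelta(minutes=45),
)
print(result)  # (False, True): hourly backups miss a 15-minute RPO
```

In this example the restore time fits within the RTO, but hourly backups cannot satisfy a 15-minute RPO, which is the sort of gap a DR exercise is meant to surface.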

The Kubernetes DR picture should become clearer in the coming months. As my colleagues Brent Ellis and Andras Cser and I observed at this year’s KubeCon + CloudNativeCon Europe, the CNCF community and various vendors are finally assembling the technologies and tools to ease Kubernetes adoption in the enterprise. But today, DR with Kubernetes remains a hurdle.

The original article by Lee Sustar, principal analyst at Forrester, is here.

The views and opinions expressed in this article are those of the author and do not necessarily reflect those of CDOTrends.