Cluster API is a Kubernetes (K8s) sub-project that provides declarative APIs and tooling to simplify provisioning, upgrading, and operating multiple Kubernetes clusters from a management Kubernetes cluster. It supports most public and private clouds, as well as bare-metal environments.
Public clouds have all the standard services needed to fully automate provisioning the infrastructure for a K8s cluster: a metadata server for VM bootstrapping with cloud-init, a load balancer for apiserver high availability, IP management with DHCP, and direct internet access. On-prem environments, however, whether private clouds or bare metal, differ so much from one another that there is no single standard way to provide the required functionality.
This blog focuses on vSphere support in Cluster API through CAPV (cluster-api-provider-vsphere): how it achieves full automation with a user experience similar to that of public clouds, and how it supports production enterprise K8s cluster requirements such as high availability, IPAM (IP Address Management), and HTTP proxy support.
Full Automation with Cloud-Init
Cluster API follows an immutable infrastructure philosophy. The base OS image used to launch a virtual machine has a fixed version of kubelet/kubeadm baked into it. To upgrade the cluster to a new version, new VMs are launched from a new base OS image containing the new kubelet, and the old VMs are then terminated.
Cloud-init bootstraps the VM on its first boot. Cluster API controllers generate user data that includes K8s certificates, static pod manifests, and kubeadm init/join commands, then pass that user data into the VM, where cloud-init picks it up and does all the heavy lifting to bootstrap the node.
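As a rough illustration (simplified, not the exact output of the kubeadm bootstrap provider), the generated user data is ordinary cloud-init configuration along these lines, with certificates and kubeadm configuration elided:

```yaml
#cloud-config
write_files:
# Simplified placeholders; the real user data carries full certificates,
# static pod manifests, and a complete kubeadm configuration.
- path: /etc/kubernetes/pki/ca.crt
  permissions: "0640"
  content: |
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
- path: /run/kubeadm/kubeadm.yaml
  content: |
    apiVersion: kubeadm.k8s.io/v1beta2
    kind: ClusterConfiguration
    ...
runcmd:
# On worker nodes this would be a 'kubeadm join' with a bootstrap token instead.
- kubeadm init --config /run/kubeadm/kubeadm.yaml
```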
Unlike public clouds, vSphere does not have an instance metadata server that can serve user data to a VM. One option is the OVF datasource, which exposes a set of predefined key-value VM properties to the guest through a read-only ISO disk that is attached and mounted.
CAPV instead uses a guestinfo-based datasource that has not yet been merged into the official cloud-init distribution. The datasource takes two pieces of data: guestinfo.metadata carries the vSphere-specific settings such as the hostname and the VM network (including static IP configuration), and guestinfo.userdata carries the data generated by the bootstrap provider, which contains the kubeadm-related files and commands.
With this datasource, CAPV can directly consume the user data generated by the KubeadmBootstrap controller, which is common across all clouds.
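Concretely, both payloads reach the VM as vSphere extraConfig keys (guestinfo.metadata / guestinfo.userdata, plus matching .encoding keys, typically base64 or gzip+base64). A guestinfo.metadata payload for a node with a static IP might look roughly like the following; the schema is defined by the guestinfo datasource, and all names and addresses here are placeholders:

```yaml
# guestinfo.metadata (decoded); hostnames, NIC names, and addresses are examples only.
instance-id: cluster-a-controlplane-0
local-hostname: cluster-a-controlplane-0
network:
  version: 2
  ethernets:
    eth0:
      dhcp4: false
      addresses:
      - 10.0.0.10/24
      gateway4: 10.0.0.1
      nameservers:
        addresses:
        - 10.0.0.2
```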
High Availability
A typical solution for K8s HA is to have a load balancer sitting on top of the multiple apiservers, with either a stacked etcd cluster or an external one.
For on-prem vSphere setups, there is no native load balancer service like ELB. The CAPV v1alpha3 API uses the HAProxyLoadBalancer CRD to create a separate VM running HAProxy with the Data Plane API. When control plane nodes are added or removed, the HAProxy configuration is updated through the Data Plane API.
The critical issue with this approach is that the HAProxy VM is a single point of failure. One solution is to launch multiple HAProxy VMs configured with Virtual IP (VIP) failover, at the cost of two additional VMs and one additional auto-failover component. That is still acceptable in a virtualized environment, but it would be very costly to dedicate two additional bare-metal machines just to load balancing. Can we make this simpler for both virtualized environments and bare metal?
This is what led us to kube-vip. The kube-vip project is designed to provide both a highly available networking endpoint and load-balancing functionality for underlying networking services. Instead of using separate load balancer VMs, kube-vip runs as either a static pod or a DaemonSet on the control plane nodes. Instead of relying on an additional component such as keepalived for VIP failover, kube-vip uses native K8s leader election. Kube-vip can also maintain dynamic DNS for the control plane endpoint when no static IP is available for the VIP.
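When deployed as a static pod, kube-vip is dropped into /etc/kubernetes/manifests on each control plane node. A manifest looks roughly like the sketch below; the image tag, network interface, VIP address, and environment variables vary across kube-vip versions and are placeholders here:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-vip
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-vip
    image: ghcr.io/kube-vip/kube-vip:v0.3.8   # example tag
    args:
    - manager
    env:
    - name: vip_arp              # advertise the VIP via ARP
      value: "true"
    - name: vip_interface        # interface that carries the VIP
      value: eth0
    - name: cp_enable            # enable control plane load balancing
      value: "true"
    - name: vip_leaderelection   # use K8s leader election for failover
      value: "true"
    - name: address              # the control plane VIP
      value: 192.168.10.100
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
        - NET_RAW
    volumeMounts:
    - mountPath: /etc/kubernetes/admin.conf
      name: kubeconfig
  volumes:
  - name: kubeconfig
    hostPath:
      path: /etc/kubernetes/admin.conf
```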
In the latest CAPV release, kube-vip is the default HA solution, and HAProxyLoadBalancer has been deprecated.
Failure Domains
In K8s, failure domains are mainly consumed by the scheduler to decide where to place pods. Nodes and volumes carry labels with the region/zone they reside in. The scheduler ensures that pods are spread across multiple availability zones and that volumes are attached to pods that match the affinity policies.
In Cluster API, the same term “failure domain” is used to control where nodes are placed. For an HA cluster with three control plane nodes, we can distribute the nodes across three different availability zones so that the cluster survives the failure of a single zone.
There are some specific challenges to overcome in vSphere that are a little different from public clouds:
- K8s understands failure domains in terms of region/zone, but vSphere only has topology constructs such as datacenter and compute cluster, not region/zone.
- Public clouds can place VMs with exactly the same configuration into different failure domains (availability zones). In vSphere, different failure domains come with different placement configurations: compute clusters, resource pools, datastores, networks, and folders.
For the first challenge, vSphere introduced region/zone tags for the cloud provider (CPI) and the container storage (CSI) driver. A vSphere admin adds region/zone tags to the infrastructure; when K8s nodes are launched within a compute cluster or host carrying a zone tag, CPI labels the node with that zone information, and the same holds for persistent volumes. With tags as a loose coupling mechanism, a vSphere admin can create different topologies based on specific requirements; the official documentation walks through several example topology configurations.
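As an illustration, the region/zone tag categories are wired into the vSphere cloud provider through its configuration file. A minimal sketch of the YAML form of that config is shown below; field names depend on the CPI version, and the vCenter address, datacenter, and tag category names (k8s-region, k8s-zone) are assumptions for the example:

```yaml
# vsphere.conf (CPI configuration), YAML form; values are placeholders.
global:
  port: 443
  secretName: cpi-global-secret
  secretNamespace: kube-system
vcenter:
  my-vcenter:
    server: vcenter.example.com
    datacenters:
    - dc-1
labels:
  region: k8s-region   # tag category applied to datacenters
  zone: k8s-zone       # tag category applied to compute clusters or hosts
```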
For the placement challenge, CAPV recently added VSphereFailureDomain and VSphereDeploymentZone to better separate failure domains from placement constraints. VSphereFailureDomain defines the region and zone of a physical infrastructure fault domain configured by the vSphere admin. VSphereDeploymentZone references a VSphereFailureDomain and adds the placement constraints for that failure domain. These CRDs are defined outside of any specific K8s cluster; they can be pre-populated by the admin or managed by a separate, higher-privileged controller that performs auto-discovery, validation, and configuration.
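A minimal sketch of the pair is shown below; the API version and placement values are assumptions for the example and will differ per CAPV release and environment:

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4   # use the version your CAPV release ships
kind: VSphereFailureDomain
metadata:
  name: zone-a
spec:
  region:
    name: us-west            # matches the k8s-region tag on the datacenter
    type: Datacenter
    tagCategory: k8s-region
  zone:
    name: zone-a             # matches the k8s-zone tag on the compute cluster
    type: ComputeCluster
    tagCategory: k8s-zone
  topology:
    datacenter: dc-1
    computeCluster: cluster-a
    datastore: datastore-a
    networks:
    - vm-network-a
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4
kind: VSphereDeploymentZone
metadata:
  name: zone-a
spec:
  server: vcenter.example.com
  failureDomain: zone-a       # references the VSphereFailureDomain above
  controlPlane: true          # allow control plane nodes in this zone
  placementConstraint:
    resourcePool: rp-zone-a
    folder: folder-zone-a
```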
IPAM
Many enterprise environments use a dedicated IPAM (IP Address Management) service instead of DHCP to allocate IP addresses for vSphere VMs. To support all the different commercial IPAM solutions, CAPV does not provide a native IPAM service but instead provides the hooks to integrate with out-of-band IPAM plugins.
Our solution works as follows. The IPAM service is developed by the cluster-api-provider-metal3 project and provides a generic way to allocate and deallocate IP addresses: it is initialized with an IPPool provided by the admin, takes IPClaim requests, and returns allocated IP addresses. The CAPV-IPAM integrator watches for CAPV machines waiting for a static IP, asks the IPAM service to allocate one, and hands it back to CAPV, which then creates the VM in vSphere. The integrator also reclaims the IP once the VM is deleted.
For now, the IPAM service supports static IP pools; there are plans to integrate directly with enterprise IPAM solutions such as Infoblox.
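A static pool is described with the IPPool resource from the metal3 IP address manager. The sketch below shows roughly what such a pool looks like; field names follow the metal3 ipam.metal3.io/v1alpha1 API as I understand it, and the address ranges are placeholders:

```yaml
apiVersion: ipam.metal3.io/v1alpha1
kind: IPPool
metadata:
  name: cluster-a-static-pool
  namespace: default
spec:
  namePrefix: cluster-a        # prefix used for generated IPAddress objects
  pools:
  - start: 10.0.0.100          # first address handed out
    end: 10.0.0.150            # last address handed out
    subnet: 10.0.0.0/24
  prefix: 24                   # default prefix length for allocations
  gateway: 10.0.0.1            # default gateway for allocations
```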
HTTP Proxy
Most enterprise vSphere environments sit behind a firewall and need an HTTP proxy to reach the internet. There are three different levels of proxy support in a K8s cluster: the host OS, the container runtime, and the applications running within the cluster. For the host OS and the container runtime, we can inject the http_proxy, https_proxy, and no_proxy configuration using cloud-init. For applications, we need a way to automatically inject the proxy configuration into pods, due to the dynamic nature of IPs and DNS names within K8s.
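For the first two levels, a cloud-init snippet along the following lines can be appended to the node user data; the proxy endpoint and the no_proxy list are placeholders and would be adjusted per environment:

```yaml
#cloud-config
write_files:
# Proxy settings for the host OS
- path: /etc/environment
  append: true
  content: |
    http_proxy=http://proxy.example.com:3128
    https_proxy=http://proxy.example.com:3128
    no_proxy=localhost,127.0.0.1,10.96.0.0/12,.svc,.cluster.local
# Proxy settings for the container runtime (containerd, via a systemd drop-in)
- path: /etc/systemd/system/containerd.service.d/http-proxy.conf
  content: |
    [Service]
    Environment="HTTP_PROXY=http://proxy.example.com:3128"
    Environment="HTTPS_PROXY=http://proxy.example.com:3128"
    Environment="NO_PROXY=localhost,127.0.0.1,10.96.0.0/12,.svc,.cluster.local"
runcmd:
- systemctl daemon-reload
- systemctl restart containerd
```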
Kubernetes versions earlier than 1.20 had an alpha API, PodPreset, which could define environment variables, secrets, or volume mounts to be automatically injected into pods with a matching label. However, the alpha API never graduated to beta and was removed in the 1.20 release.
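For reference, injecting proxy settings with that alpha API looked roughly like this (the label selector and proxy values are placeholders):

```yaml
apiVersion: settings.k8s.io/v1alpha1   # alpha API, removed in Kubernetes 1.20
kind: PodPreset
metadata:
  name: proxy-settings
  namespace: default
spec:
  selector:
    matchLabels:
      inject-proxy: "true"   # applied to pods carrying this label
  env:
  - name: HTTP_PROXY
    value: http://proxy.example.com:3128
  - name: HTTPS_PROXY
    value: http://proxy.example.com:3128
  - name: NO_PROXY
    value: localhost,127.0.0.1,.svc,.cluster.local
```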
We built an open source project, reach, to port this feature into a CRD, with improvements specifically targeting proxy support use cases. For a given K8s cluster, the http_proxy and https_proxy values are very likely the same for pods in all namespaces. For no_proxy, however, in addition to standard domains and IPs such as .svc and .cluster.local, there may be application-specific entries; for example, when services talk to each other using service names such as ‘mysql’, the name ‘mysql’ needs to be added to no_proxy.
The reach project takes a hybrid approach, with a cluster-scoped CRD, ClusterPodPreset, and a namespaced one, PodPreset. ClusterPodPreset holds the http_proxy, https_proxy, and cluster-level no_proxy values. The namespaced PodPreset can carry a no_proxy value specific to that namespace. Metadata attached to each environment variable indicates whether its value should be merged with the cluster-level one (as for no_proxy) or replace it (for example, when a specific namespace needs a different http_proxy).
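To make the hybrid model concrete, here is a purely hypothetical sketch; the API group, version, and field names below are invented for illustration and are not the reach project's actual schema:

```yaml
# Hypothetical illustration only; consult the reach project for the real CRD schema.
apiVersion: reach.example.com/v1alpha1   # placeholder group/version
kind: ClusterPodPreset
metadata:
  name: cluster-proxy
spec:
  env:
  - name: HTTP_PROXY
    value: http://proxy.example.com:3128
    strategy: replace        # hypothetical per-variable metadata
  - name: NO_PROXY
    value: localhost,127.0.0.1,.svc,.cluster.local
    strategy: merge
---
apiVersion: reach.example.com/v1alpha1
kind: PodPreset
metadata:
  name: team-a-proxy
  namespace: team-a
spec:
  env:
  - name: NO_PROXY
    value: mysql             # merged into the cluster-level NO_PROXY
    strategy: merge
```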
Summary
Using Kubernetes to manage Kubernetes itself, Cluster API has quickly become the de facto way of provisioning and managing Kubernetes clusters. With cluster-api-provider-vsphere and the various open source projects around it, we can achieve a public-cloud-like experience while still providing on-premises-specific functionality. The project is moving fast toward the next major v1beta1 release with lots of new features and enhancements. Stay tuned!