How do you handle dependencies in complex deployments?
In today’s world of interconnected cloud services, deploying application infrastructure can get pretty complicated. For example, your Kubernetes app in EKS may need several pods to share storage, requiring you to set up Amazon EFS for your cluster. Your IT department may require you to use RFC1918 IP address conservation for any EKS clusters you deploy in their main VPC. Another app might be deployed through Flux and require retrieving a SOPS key from your company’s secret store first and adding it to the cluster as a secret. Automating all the steps of actually getting a Kubernetes application into production is not easy.
Terraform has helped many companies address all or part of this problem. But because Terraform is a declarative, desired-state language, it can be difficult to deal with complex interdependencies between resources. The language is designed to deploy everything in a single run, using resource dependencies to create resources in the right order. The more complex the infrastructure gets, however, the harder it becomes to avoid circular dependencies in your TF code that make deployment awkward, or sometimes outright impossible.
A real example — and a workaround you can use today
I personally ran into this problem while working with a customer that uses the RFC1918 IP address conservation approach for EKS. This approach requires customizing the AWS VPC CNI so that it attaches pods to dedicated pod networking subnets in a secondary CIDR (typically 100.64.0.0/16, drawn from the RFC 6598 shared address space rather than RFC1918), so that pods don't consume the VPC's scarce routable addresses.
Due to limitations at the time, deploying the CNI with a custom configuration wasn't working through CAPA (the Cluster API provider for AWS), so I needed to deploy the cluster with a vanilla CNI configuration first, then redeploy the AWS VPC CNI Helm chart over it with a custom configuration. Making this work in a single Terraform run was simply impossible, so I had to figure out a solution.
I figured that if I could just run Terraform multiple times, with slightly different configurations for each run, I could deliver a working solution for the customer.
Terraform would still be able to manage the final state of the infrastructure it deployed, and correctly deprovision it on a TF destroy run. So the question became: how can I make Terraform apply different configurations depending on which step of the deployment process it is on? And how can I maintain the state of which deployment step a cluster is in?
Let’s take this step by step…
To start with the easy part, maintaining state: this can be accomplished with a tag on the cluster that stores the current step. I chose a simple step tag containing a number that indicates which step of the multi-step process the cluster is on. A fresh cluster starts at step 0, increments by 1 on each run, and completes at an arbitrary final step.
The next challenge was determining the correct value of the step tag dynamically during a TF run. I achieved this through an external data source in Terraform:
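The snippet below is a representative sketch rather than verbatim code: the data source and script names follow the ones used in this post, while the query key and var.cluster_name are assumptions.

```hcl
# Sketch of the external data source. The hashicorp/external provider runs the
# script, passes the query to it as JSON on stdin, and parses the JSON it prints.
data "external" "waitforclusterstate" {
  program = ["bash", "${path.module}/waitforcluster_state.sh"]

  query = {
    # Illustrative input: the script needs to know which cluster to look up.
    cluster_name = var.cluster_name
  }
}
```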
This executes the waitforcluster_state.sh bash script, which queries the Spectro Cloud Palette API to determine whether the cluster exists and which step it is on. It also verifies that all pending changes on the cluster have been fully rolled out, looping until that is the case. This is helpful in scenarios where, for example, the new VPC CNI config has been applied but the cluster takes a while to fully implement the changes.
This is the script I use for this external resource:
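The version below is a sketch under assumptions: the Palette API endpoint, authentication header, and JSON field names are placeholders (consult the Palette API documentation for the exact calls). The flow is the important part: return -1 if the cluster doesn't exist, otherwise wait until the cluster has settled and return its step tag.

```bash
#!/usr/bin/env bash
# waitforcluster_state.sh - sketch only; API paths, headers and JSON fields are placeholders.
set -euo pipefail

# The external data source passes its query as JSON on stdin.
eval "$(jq -r '@sh "CLUSTER_NAME=\(.cluster_name)"')"

API="https://api.spectrocloud.com"          # placeholder endpoint
AUTH_HEADER="ApiKey: ${PALETTE_API_KEY}"    # assumes an API key in the environment

lookup_cluster() {
  curl -sf -H "${AUTH_HEADER}" \
    "${API}/v1/spectroclusters?name=${CLUSTER_NAME}" | jq '.items[0] // empty'
}

CLUSTER=$(lookup_cluster)

# Cluster doesn't exist yet: report step -1 (external data sources require string values).
if [ -z "${CLUSTER}" ]; then
  echo '{"step": "-1"}'
  exit 0
fi

# Loop until all pending changes have been rolled out (placeholder status field).
while [ "$(echo "${CLUSTER}" | jq -r '.status.state')" != "Running" ]; do
  sleep 30
  CLUSTER=$(lookup_cluster)
done

# Return the current value of the step tag (placeholder tag location).
STEP=$(echo "${CLUSTER}" | jq -r '.metadata.labels.step // "-1"')
echo "{\"step\": \"${STEP}\"}"
```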
The script returns -1 if the cluster doesn't exist yet; otherwise it returns the value of the step tag on the cluster. We can then use the current step in our code as data.external.waitforclusterstate.result.step. I use the following code to generate two local variables that make it easier to use the cluster state elsewhere in the code:
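A minimal version of those locals could look like this; current_step is just a convenience helper added here, and last_step is defined alongside the state map further below.

```hcl
locals {
  # Step reported by the script: -1 means the cluster does not exist yet.
  current_step = tonumber(data.external.waitforclusterstate.result.step)

  # True once the base cluster has been deployed.
  cluster_exists = local.current_step >= 0

  # Step this run should apply: 0 for a fresh cluster, otherwise advance by one,
  # capped at the final step of the process (last_step is defined with the state map).
  step = local.cluster_exists ? min(local.current_step + 1, local.last_step) : 0
}
```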
The cluster_exists variable is very useful for other blocks in Terraform, where you only want to define a resource if the base cluster has been deployed. For example, I needed to look up an AWS security group that gets auto-created when the EKS cluster is deployed. So, I used this variable to gate the TF data source like this:
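A sketch of that lookup, assuming the auto-created security group carries EKS's aws:eks:cluster-name tag and that the cluster name is available as var.cluster_name:

```hcl
# Only look up the auto-created cluster security group once the cluster exists;
# with count = 0 on the first run, Terraform skips the lookup entirely.
data "aws_security_group" "eks_cluster_sg" {
  count = local.cluster_exists ? 1 : 0

  filter {
    name   = "tag:aws:eks:cluster-name"
    values = [var.cluster_name]
  }
}

# Elsewhere in the code, reference it as data.aws_security_group.eks_cluster_sg[0].id
# (only valid once local.cluster_exists is true).
```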
The step variable combines several pieces of logic:
- If the cluster does not exist, return 0
- If the cluster exists and its current step value is not the last step in the process, increment the step value by 1 and return that value
- If the cluster exists and its current step value is the last step in the process, return the step value as-is.
The last step: the state map
To determine which step constitutes the last step in the process, we come to the final piece of the puzzle: the state map. This is a local variable that provides the ability to define a unique configuration for every individual step. The basic structure looks like this:
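In sketch form, assuming steps are numbered contiguously from 0:

```hcl
locals {
  state_map = {
    "0" = {
      # configuration values for step 0 (fresh cluster)
    }
    "1" = {
      # configuration values for step 1
    }
    # ...one entry per step in the process
  }

  # With contiguous numbering from 0, the last step is simply the map size minus one.
  last_step = length(local.state_map) - 1
}
```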
I found the most powerful way to leverage the state map is through dynamic blocks. For example, I defined dynamic blocks for the spectrocloud_cluster_eks resource, so that I can dynamically set the cluster configuration based on the current step:
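The snippet below illustrates the pattern. The cluster_profile/pack block structure is my reading of the spectrocloud provider schema, and the remaining required arguments are trimmed, so treat it as a sketch rather than a drop-in resource definition; the per-step packs list it iterates over is shown next.

```hcl
resource "spectrocloud_cluster_eks" "cluster" {
  name             = var.cluster_name
  cloud_account_id = var.cloud_account_id
  # ...cloud_config, machine_pool and other required arguments omitted for brevity

  cluster_profile {
    id = var.cluster_profile_id

    # One pack block per pack defined for the current step in the state map
    # (tostring because map keys are strings).
    dynamic "pack" {
      for_each = local.state_map[tostring(local.step)].packs
      content {
        name   = pack.value.name
        tag    = pack.value.tag
        values = pack.value.values
      }
    }
  }
}
```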
Which then allows me to set the desired state per step in the state map like so:
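For example (the pack names, tags and values files are illustrative; the point is that each step carries its own CNI configuration):

```hcl
locals {
  state_map = {
    "0" = {
      # Step 0: bring the cluster up with the vanilla AWS VPC CNI.
      packs = [
        {
          name   = "aws-vpc-cni"
          tag    = "1.11.x"
          values = file("${path.module}/values/cni-default.yaml")
        }
      ]
    }
    "1" = {
      # Step 1: re-apply the VPC CNI chart with custom pod networking enabled.
      packs = [
        {
          name   = "aws-vpc-cni"
          tag    = "1.11.x"
          values = file("${path.module}/values/cni-custom-networking.yaml")
        }
      ]
    }
  }
}
```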
Finally, we need to tie it all together and make Terraform output some useful information so that we can use a simple script to loop TF runs until the last step is reached. First, we define some useful outputs:
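For instance, exposing the step that was just applied and the final step of the process:

```hcl
# The step this run applied.
output "step" {
  value = local.step
}

# The final step of the deployment process.
output "last_step" {
  value = local.last_step
}
```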
We then use the last_step output to determine if we need to perform another TF run, in the following script:
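A simple wrapper along these lines does the job; the output names match the outputs defined above.

```bash
#!/usr/bin/env bash
# Re-run terraform apply until the applied step reaches the last step.
set -euo pipefail

while true; do
  terraform apply -auto-approve

  step=$(terraform output -raw step)
  last_step=$(terraform output -raw last_step)

  if [ "${step}" -ge "${last_step}" ]; then
    echo "Final step ${last_step} applied: deployment complete."
    break
  fi

  echo "Applied step ${step} of ${last_step}, starting next run..."
done
```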
And with this, our Terraform state machine is complete. I hope this state machine walkthrough is useful for tackling your more complex Terraform infrastructure deployment challenges. It certainly helped me get more out of Terraform than I was able to before.
New CAPA developments ahead!
While my solution works, it was always meant as a temporary workaround until the technology existed to deploy this infrastructure without multiple runs.
So, while this workaround is still a useful tool to keep in your back pocket, I’m pleased to say that we have also contributed significant improvements to CAPA to get the custom VPC CNI functionality working out of the box. This capability is new in Palette 3.0 — and it means I’ll be able to move my customer’s automation back to a single Terraform run. To find out more about Palette 3.0, check out the release notes.