It's been a while since my last post, mainly because I've been super busy with work and additional college assignments.
Introduction
I've been using Terraform in production recently, and it gave me a whole new perspective on how infrastructure should be managed. A lot of the learning was done with the help of "Terraform in Action" by Scott Winkler. Here are a few things I've picked up about Terraform, along with a couple of opinionated tips 👀️
Keep an eye on the number of resources
With Terraform workspaces, it can be easy to set one up and ignore it for the lifetime of the project. However, as your workspace expands, the number of resources can noticeably affect your plan/apply times. A workspace should ideally contain only resources that depend heavily on each other.
For example, if you have a Kubernetes cluster with 5 namespaces, each with its own Deployments that depend largely on other workloads in the same namespace, it would be ideal to split each namespace into its own workspace. If you need to share data like IP addresses, ports, etc., you can do so by using the origin workspace as a data source. Note, however, that a change in the origin workspace's outputs does not trigger a refresh in the target workspace; the target workspace needs an additional plan and apply to act on the new outputs. To deal with this, you can build weak dependencies into your CI system that automatically run a plan on dependent workspaces.
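With Terraform Cloud, consuming another workspace's outputs can be sketched with the `terraform_remote_state` data source; the organization, workspace, and output names below are placeholders:

```hcl
# Read outputs exposed by the origin workspace (names are hypothetical).
data "terraform_remote_state" "network" {
  backend = "remote"

  config = {
    organization = "my-org"
    workspaces = {
      name = "network-workspace"
    }
  }
}

# Any value declared as an `output` in the origin workspace
# is available under .outputs in the target workspace.
resource "kubernetes_config_map" "app" {
  metadata {
    name = "app-config"
  }

  data = {
    db_ip   = data.terraform_remote_state.network.outputs.db_ip
    db_port = data.terraform_remote_state.network.outputs.db_port
  }
}
```

Remember that the target workspace only sees these values at its own plan time, hence the need for the CI-driven re-plan described above.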
Make use of modules
Initially, while setting up a few Kubernetes Deployments with Terraform, I didn't mind manually writing the secrets, config, etc. that each Deployment required. As the project grew, however, it became unsustainable. The labels were janky, and declaring Kubernetes resources in Terraform is a pure nightmare (if you thought YAML was bad).
I decided that most of the boilerplate could be abstracted away into a Kubernetes Deployment module, which could just take in the necessary arguments: resource limits, configs, secrets, dependent databases, etc. This cut the LoC down by A LOT and made maintaining our entire set of deployments super easy.
Our logging service has a known bug: it cannot figure out the container name from the logging process. The workaround was to inject an environment variable called CONTAINER_NAME into each container. If I had to update dozens of YAML manifests with this, I'd have had to write a script to do it. With Terraform, all I had to do was edit the Deployment module, and all the Deployments were updated! Note that this is not a Terraform-specific feature; a similar feat can be achieved with Kustomize or Helm.
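A trimmed sketch of what the module's internals might look like (resource and variable names are hypothetical, and fields like replicas are omitted for brevity):

```hcl
# Hypothetical Deployment module: the CONTAINER_NAME workaround
# lives in one place, and every Deployment built from the module
# picks it up automatically.
resource "kubernetes_deployment" "this" {
  metadata {
    name   = var.name
    labels = var.labels
  }

  spec {
    selector {
      match_labels = var.labels
    }

    template {
      metadata {
        labels = var.labels
      }

      spec {
        container {
          name  = var.name
          image = var.image

          env {
            name  = "CONTAINER_NAME"
            value = var.name
          }
        }
      }
    }
  }
}
```

One edit here, one apply, and dozens of Deployments get the fix.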
However, Kustomize falls short in one important area: it assumes the deployment is unchanged since the last time it was applied. This is where Terraform shines, since before any change to infrastructure it detects drift and suggests a course of action, i.e. updating the resource to match the spec defined in code.
Kustomize 0 | Terraform 1
Your CD system will change
While adopting Terraform into our system, significant changes had to be made to our CI/CD workflows. Instead of using a tool like gke-deploy to apply manifests, we had to use the Terraform API. Luckily, Terraform has excellent documentation, which really moved the pipeline update along.
Using remote state management is VERY important. Even for a small project, if you're using Terraform in your pipeline, you need remote state. We use Terraform Cloud, which handles all our needs adequately.
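Wiring a configuration to Terraform Cloud is a small amount of config; a minimal sketch, assuming placeholder organization and workspace names:

```hcl
# Store state remotely in Terraform Cloud instead of on disk,
# so CI runs and teammates all share one source of truth.
terraform {
  backend "remote" {
    organization = "my-org"

    workspaces {
      name = "production"
    }
  }
}
```

Remote state also gives you state locking, which matters the moment more than one pipeline run can touch the same workspace.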
There are, however, a few noticeable drawbacks. You can either trigger a new apply every time an image is built, or trigger one only after all images related to a commit hash are built. With the former, you can rack up a huge queue of pending plans in your workspace, and if you don't set appropriate timeouts, the pipeline will fail. With the latter, atomicity is lost across image builds. For example, if 5 images are built and 1 has a bad deploy, the entire apply will fail, and reconciling remote state is a cumbersome process (it takes time).
The first option is ideal if you split your workspaces into tight units, since that frees up the locks on unrelated or weakly related resources! Life. Hacked.
Version your modules!
It is important to keep your actual infrastructure and your Terraform modules in separate repositories; having separate git histories for each is cleaner. Whenever a module is declared in the infrastructure repo, make sure it refers to a fixed ref in the module repo. That way, even if the module is updated, it doesn't break the resources that haven't been updated along with it! Think of it just like package management.
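Pinning a module to a fixed ref can be done directly in the module's source address; the repository URL, subdirectory, and tag below are placeholders:

```hcl
# Pin the module to a specific ref; a tag is shown here,
# but a full commit SHA works in ?ref= as well.
module "k8s_deployment" {
  source = "git::https://github.com/my-org/terraform-modules.git//k8s-deployment?ref=v1.4.0"

  name  = "api"
  image = "gcr.io/my-project/api:v1.2.3"
}
```

Upgrading a consumer is then an explicit, reviewable diff: bump the ref, read the plan, apply.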
Versioning modules is definitely something that should be done from Day 0. After all, you don't want to update a module a year later and then notice your services don't talk to each other anymore! (Looking at you, menaces who don't read the Terraform plan output 😝️)
Sit back and relax on the weekends
For those SREs who have to work weekends because of a misconfig, rest easy. With Terraform's robust framework, there is a very low chance of you working the weekend! You are aware of every change a single line of code will make to your infrastructure.
Pro-Tips:
- If you need to populate a map with default values, but don't want to use the experimental defaults() function (experimental at the time of writing) -
locals {
  map_raw = {
    service_a = {
      process_name = "bar"
      ip_cidr      = "/27"
    }
  }

  map_filled = {
    for name, overrides in local.map_raw : name => merge(
      {
        resource_limit = "LOW"
        ip_cidr        = "/24"
        location       = "us-central1-c"
      },
      overrides,
    )
  }
}
- If you have a chain of functions applied to a string (like upper, replace, split, etc.), abstract it away into a module. From:
main.tf
locals {
  super_cool_string = upper(replace(split(".", var.value)[0], "-", "_"))
}
To:
Module transform_super_cool_string
variable "value" {
  type = string
}

locals {
  return_val = upper(replace(split(".", var.value)[0], "-", "_"))
}

output "value" {
  value = local.return_val
}
main.tf
module "transformed_string" {
  source = "./modules/transform_super_cool_string"

  value = var.value
}

# Now use module.transformed_string.value
Clean, right?
Conclusion
It was a blast working with Terraform, especially since their team provides near-instant support over email. It's hard to believe that only v1 of this system is already so refined! There are definitely a few places it needs improvement, but for the ordinary user, it should hum along just fine.