Effective Terraform Practices

2020-07-20 tech programming terraform

Getting started with Terraform is easy. As your deployments grow to multiple environments and multiple state files, maintained by multiple team members, things can become harder to wrangle. This article touches on some practices that I’ve found effective for using Terraform in those settings.

Be thoughtful about repository structure
Maintain a custom build in a Docker image
Run in CI whenever possible
Be thoughtful about state file boundaries
Be practical about module boundaries
How to name things

This is part two in a series about Terraform. Part One: Terraform Pain Points.

Be thoughtful about repository structure

Every Terraform state file comes from a combination of what I’ll call “definitions” (the HCL code) and “configuration” (the tfvars used as inputs). This terminology differs from the Terraform docs (sorry), where they refer to the HCL code as “configuration”. I prefer my terms because they seem to align better with industry standards (e.g. Twelve-Factor App defines config as “everything that is likely to vary between deploys”).

Terraform, the tool, expects the definitions and config to be all in one place. That isn’t a natural organization of things, because we often have a set of Terraform definitions that get deployed multiple times with different config, e.g. once in staging, once in production. Instead, it makes sense to have the config organized by environment, and have the definitions stand on their own. Then a couple of questions:

When it’s time to run Terraform, do we bring the config to the definitions, or the definitions to the config? Either way works. terraform init -from-module, and Terragrunt, both run alongside the config and pull the definitions to it. That’s the conventional choice.
Do we maintain a tree of config organized by environment/project (staging/project1/config.tfvars) or does every project bring its own list of environments (project1/staging/config.tfvars)?

That second question is more open-ended. The Gruntwork recommendation, and what we recommend here, is the first option: tree of config organized by environment/project. There are definitely situations where the second (transposed) arrangement makes sense, too, especially if you have few environments. I was surprised to find that Pulumi, as I understand it, only supports the second arrangement (Pulumi.staging.yaml in the project repo).

Here’s a recommended repository structure:

Config repository

Let’s call this infrastructure (Gruntwork calls it infrastructure-live, but -live doesn’t add much IMO). This repository is an index of all of our Terraform states, and holds the config (tfvars) for each.

This means it’s a tree of our environments, and all of the projects deployed in each. In addition, we can have “special” one-off Terraform states that aren’t part of our application, like a Jenkins deployment, or an “ops” AWS account containing all of our IAM users (following Segment’s multi-account guide).

Whenever we run Terraform, it runs in this repository. We can use init -from-module or Terragrunt to pull in the Terraform definitions and colocate them next to the config. More on Terragrunt later. For the one-off states, which don’t have any meat to them beyond their Terraform definitions, we can put the Terraform definitions directly in the infrastructure repository.

Modules repository

In the terraform-modules (or infrastructure-modules) repo, we’ll put small Terraform modules that we want to reuse in multiple projects. Some examples: an ECS service with a load balancer, an AWS Lambda function with CloudWatch alarms, a CloudFront distribution to front an S3 bucket.

There’s a tradeoff here around versioning. One easy approach is to version at the repo level, tagging it with a new version after every change. Then the version becomes less of a useful signal about whether things have changed. module1@1.2.3 and module1@1.2.9 might be the same thing, because those intermediate changes were to other modules. Still, this is usually good enough. If the alternatives are to come up with a more intricate scheme for versioning individual modules within the repo, or separating every module into its own repo, it’s not obvious that the complexity of either of those solutions is worth it.

Project repositories

If a project needs its own set of Terraform definitions, then a reasonable place to put those is in the project repo itself. That module could live in the terraform-modules repository, but it’s not meant for reuse, and there will be an awkward, unnecessary disconnect between the version of the module and the version of the project. If the project’s Terraform definitions are brought into the project repo, then the infrastructure code is versioned the same as the project code, which reduces cognitive overhead and improves the general experience for the developers.

Another alternative is to corral multiple projects’ Terraform definitions into a big module, but that leads to several problems. More details on that under “State file boundaries” below.

It can also be useful to have some bigger modules (let’s call them “stacks”) living in their own repositories. The main value in these is that building the environment out of a handful of logical building blocks, rather than many tiny modules, can make it easier to reason about (e.g. harder for things to fall through the cracks). Having them in a separate repository (rather than terraform-modules) is not essential but it’s one way to give them meaningful versions.

The existence of these stacks seems to diverge from the Gruntwork guidance, which says

Large modules are slow, insecure, hard to update, hard to code review, hard to test, and brittle

There’s truth to this, especially when “large” is really “large”, but it isn’t black-and-white. Having medium-sized stacks, which plan in under a minute, can be a pragmatic compromise.

Sidenote: Terragrunt

Terragrunt is a wrapper around Terraform that does many things, but the most useful (IMO) is how it encourages/complements this repository structure (not surprising, since Terragrunt is a Gruntwork product). It expects to be run alongside your config, and then it manages a staging area: it sets up a clean directory, copies the config (*.tfvars) into it, copies the definitions (*.tf) into it, and then runs there.
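
To make the staging-area idea concrete, here is a minimal Python sketch of it. This illustrates the concept only and is not how Terragrunt is implemented; the function name stage_workdir and the two directory arguments are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def stage_workdir(config_dir, definitions_dir):
    """Copy *.tfvars (config) and *.tf (definitions) into a fresh directory.

    Returns the path of the new directory, where Terraform would then be run.
    """
    workdir = Path(tempfile.mkdtemp(prefix="tf-stage-"))
    for tfvars in sorted(Path(config_dir).glob("*.tfvars")):
        shutil.copy2(tfvars, workdir / tfvars.name)
    for tf in sorted(Path(definitions_dir).glob("*.tf")):
        shutil.copy2(tf, workdir / tf.name)
    return workdir
```

Only top-level files are copied here; a real staging step would also need to handle nested modules and other supporting files.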

Here are some features that I don’t use, and the reasons why. Other people, in other situations, may get tremendous value out of them. This definitely should not be read as a strong recommendation to avoid them!

Auto-init: If we run Terraform in a Docker container (which we should), we can be intentional/explicit about the runtime context. In particular, the init step of the lifecycle may require GitHub credentials (to fetch modules from private git repos) but plan/apply/etc. don’t. So while we pass the container extra credentials to run terragrunt init, the principle of least privilege suggests we should omit them when we run our other commands. Auto-init will prepend any command with an init, when it hasn’t been done yet, so we’d always have to run with the special credentials.
apply-all: I prefer to run several small, targeted, applies: one for each piece of infrastructure that needs to be deployed. There are pros and cons of doing it either way.
Auto-creation of state bucket/DynamoDB lock table: It’s nice to Terraform our AWS organization, and if we do that, then we may as well also include the S3/DDB resources used for each account’s Terraform state. The Terragrunt feature is useful for bootstrapping that first account, though (the AWS organization’s master account).
path_relative_to_include(), etc.: Be careful storing state in S3 at locations that mirror the structure of your infrastructure repo. It works nicely at the start, requiring minimal thought or setup. Then, if we want to rename a directory in the config repo, we need to do a dance with copying state files around in S3 (making sure we don’t operate on the old one in the meantime). It’s a lot of work. Better to consider the paths in S3 as disconnected from the paths in the config repo, even if they coincide at first.
Source-controlled versions in terragrunt.hcl: There are a few issues here.
- HCL isn’t conveniently machine-readable or machine-writable, so this is not automation friendly.
- The commit history is cluttered if every deployment results in a version bump commit.
- The versions listed are aspirational, and do not necessarily reflect reality. Maybe this got applied after the commit, maybe it didn’t. Maybe it failed to apply. Maybe it was applied from a different branch, with a different version. This is not a real source of truth, but it does look like one, and that can be misleading.
This is not to say that we should leave the version unpinned. One pattern is to have one simple and generic terragrunt.hcl, which fills in its source from an environment variable which we supply at init time, reflecting something like “we’re trying to deploy projectA@v1.2.3”. The deployed version isn’t reflected in the infrastructure repo, but that can be okay. It can be tracked elsewhere.
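
One way to keep S3 state paths disconnected from config repo paths is an explicit lookup table. The sketch below assumes a hypothetical registry that maps each config directory to its state key; the directory names and keys are made up for illustration.

```python
# Hypothetical registry: config directory -> S3 key of its state file.
STATE_KEYS = {
    "staging/project1": "states/project1-staging.tfstate",
    "production/project1": "states/project1-production.tfstate",
}


def state_key_for(registry, config_dir):
    """Look up the S3 state key for a config directory."""
    if config_dir not in registry:
        raise KeyError(f"no state key registered for {config_dir!r}")
    return registry[config_dir]


def rename_config_dir(registry, old, new):
    """Return a copy of the registry with a config directory renamed.

    The S3 key is left untouched, so no state file has to move.
    """
    if new in registry:
        raise ValueError(f"{new!r} is already registered")
    updated = dict(registry)
    updated[new] = updated.pop(old)
    return updated
```

With this arrangement, renaming staging/project1 in the config repo is a one-line change to the registry, and the state file stays where it is.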

Maintain a custom build in a Docker image

I’ve found it valuable to run Terraform exclusively in a Docker image. This solves a few important problems:

It keeps the entire team (and CI) on the same version of Terraform. Terraform won’t let you operate on a state file with a version older than the one that last wrote it: if the version recorded in the state file is greater than the version you’re using, Terraform will error out when you try to run terraform init.
It lets us run custom builds of Terraform and providers, with whatever patches we find useful. This is particularly great for the AWS provider, where there are many bugfixes and feature additions which can take months or years to be merged. We’re also free to pull in any patches that might be company-specific, or may not be welcomed upstream for whatever reason. Here are some examples of useful patches I’ve incorporated in the past:
- Terraform: Add deepmerge() #25032
- AWS provider: add aws_kinesis_stream_consumer resource #10487
- AWS provider: fix spurious diffs in IAM policies #13813
- AWS provider: custom patch adding an option missing_okay to ECS data sources, so we could solve this issue, gracefully sharing control of an ECS task definition between Terraform and a deployment pipeline.
The ability to fix things in Terraform by ourselves, rather than being at the mercy of the maintainers, is so liberating. It’s impossible to overstate how useful this pattern is.
It lets us easily distribute custom builds of Terraform and its providers. If team members are running whatever versions they’ve installed locally, there will be problems when Terraform definitions rely on any patched behavior.
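
The version constraint from the first point can be checked ahead of time, because the state file records the version that last wrote it in its terraform_version field. Here is a minimal sketch of such a check; the function names are hypothetical, and it only handles plain x.y.z versions (no pre-release suffixes).

```python
import json


def parse_version(text):
    """Turn a plain version string like '0.12.29' into (0, 12, 29)."""
    return tuple(int(part) for part in text.split("."))


def state_is_newer(state_json, running_version):
    """True if the state was written by a newer Terraform than the one in use."""
    state = json.loads(state_json)
    return parse_version(state["terraform_version"]) > parse_version(running_version)
```

A check like this could run as a pre-flight step in the Docker image, to give a friendlier message than Terraform’s own error.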

Distributing a custom provider is tricky in 0.12 (and older versions), requiring team members to download it and put it in the right place on disk. It gets a bit better in 0.13 (we can point TF at a remote registry that we maintain). But bundling it all up in a Docker image solves this problem neatly and more generally, so it’s likely that this will still be a better approach than maintaining a private registry.

Run in CI whenever possible

It’s a good idea to preferentially run Terraform on a CI server. I worked in a system built on Jenkins, where we had a “Terraform plan/apply” pipeline which could be invoked as part of automation or run manually in an ad-hoc way. The inputs to this pipeline were the environment name, project name, and the version to deploy. It would grab the config (from the infrastructure repo) and the TF definitions (from the project repo), compute a plan, wait for approval, and then apply it.
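
The shape of that pipeline can be sketched in a few lines of Python. This is an illustration under assumptions, not the Jenkins pipeline itself: run_pipeline, run, and approve are hypothetical names, and the command runner and approval step are injected so that the flow is easy to test.

```python
def run_pipeline(definitions_source, var_file, run, approve):
    """Init, plan, wait for approval, then apply the saved plan.

    run:     callable that executes one command (a list of strings)
    approve: callable that returns True once a human has approved the plan
    Returns True if the plan was applied, False if it was rejected.
    """
    run(["terraform", "init", "-input=false", f"-from-module={definitions_source}"])
    run(["terraform", "plan", "-input=false", f"-var-file={var_file}", "-out=tfplan"])
    if not approve():
        return False
    # Applying the saved plan file means we apply exactly what was approved.
    run(["terraform", "apply", "-input=false", "tfplan"])
    return True
```

In a real setup, run might be something like lambda cmd: subprocess.run(cmd, check=True), and definitions_source would encode the project and the version to deploy.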

I recommend a workflow which routes through CI whenever possible. The benefits in making our Terraform workflow “CI-first” are:

Auditability/transparency: the CI server will have a build log, though those logs get rotated. It’s also a good idea to dump artifacts from every build somewhere (maybe S3), as a sort of permanent log. In my experience, it’s worthwhile to include all the build parameters/metadata (e.g. Terraform version, git commit hashes of the config and the definitions, environment variables used to run TF, etc.), copies of the tfvars config, the plan, and the outputs (as JSON).

Stashing these artifacts (particularly TF outputs) in a predictable location in S3 can be a gamechanger for working with infrastructure programmatically. For example, a deployment script can refer to those outputs to get the name of a bucket, the id of an EC2 instance, etc.
Easy communal debugging: We get a shareable link when a build fails to plan or apply.
Uniformity of runtime: No “works on my machine”. Jenkins works the same for everyone (and a Dockerized workflow can solve this for running Terraform locally).
Automation: Having a “run Terraform” pipeline with the right API is a boon for automation and composition. We found ourselves constantly invoking this pipeline as a step in other pipelines.
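
As a rough sketch of the bookkeeping described under auditability, the helpers below build predictable S3 keys and a metadata record for each build. The key layout, field names, and function names are all assumptions for illustration; the actual upload (e.g. with an S3 client) is left out.

```python
import json


def artifact_key(environment, project, build_number, filename):
    """Key for one artifact of one build (the permanent log)."""
    return f"terraform-builds/{environment}/{project}/{build_number}/{filename}"


def latest_outputs_key(environment, project):
    """Stable key where the most recent outputs for a state are published."""
    return f"terraform-outputs/{environment}/{project}/outputs.json"


def build_metadata(terraform_version, config_commit, definitions_commit, parameters):
    """Serialize the build parameters/metadata as JSON."""
    record = {
        "terraform_version": terraform_version,
        "config_commit": config_commit,
        "definitions_commit": definitions_commit,
        "parameters": parameters,
    }
    return json.dumps(record, indent=2, sort_keys=True)
```

The stable outputs key is what lets a deployment script find, say, a bucket name without knowing which build produced it.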

Of course, running Terraform locally is still important. It’s useful for initial development of a feature, debugging/quickly iterating, or for executing commands like state mv which are largely at odds with automated processes. Ideally, this is a Dockerized workflow, which works equivalently in CI and on your workstation. However, the lack of auditability and the lack of automated bookkeeping (e.g. stashing the outputs in S3) make running Terraform locally a questionable proposition in a team setting.

Be thoughtful about state file boundaries

It’s common practice to have a separate state per environment, in order to limit the blast radius of changes. By the same token, it’s also worth considering having separate state per project per environment. For clarity: “project” here refers both to things like our Terraform stacks (which cover base infrastructure like VPCs, databases in RDS, etc.) as well as to the services which run our application code.

At a previous job, we started out with a handful of TF states in each environment, one for each “stack”. For example, one of those stacks covered some base infrastructure (e.g. VPC, shared S3 buckets). Another was a big stack where we defined most of the infrastructure directly powering our services (e.g. Lambda functions and ECS task definitions). Over time, it got to the point where that stack had over 4000 resources and took 5 minutes to plan. This was extremely disruptive and was only going to get worse over time.

We migrated to a state-per-project model. As always, this is a tradeoff. There are several benefits but it comes at the cost of operational complexity:

Benefits of having more granular state files:

Limited blast radius
Finer control over deployments
Faster plans because each state contains fewer resources
Infrastructure definitions can live closer to the project (i.e. in the project repo), which can reduce the complexity of deployment (number of steps, number of moving parts).
- Versioning the infrastructure and project code together tends to make sense

Downsides of having more granular state files:

Requires additional orchestration. What used to be one plan/apply to update the infrastructure across all projects is now 15 plans/applies.
There are new dependencies which need to be managed outside of Terraform’s resource graph (or at least considered, if not managed). For example:

If we’re deploying A and B, A needs to be deployed before B

and

If we’re deploying A, then X, Y, and Z should be deployed afterwards because they depend on A’s outputs
It’s harder to get a bird’s-eye view of all resources
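
The two dependency rules above are a small graph problem, and the orchestration tooling can treat them as one. Here is a minimal sketch using Python’s standard graphlib module (Python 3.9+); the function names and the shape of the depends_on mapping are assumptions for illustration.

```python
from graphlib import TopologicalSorter


def deploy_order(depends_on):
    """Order projects so that each one comes after everything it depends on.

    depends_on maps a project to the set of projects it depends on.
    """
    return list(TopologicalSorter(depends_on).static_order())


def downstream_of(project, depends_on):
    """All projects that depend on `project`, directly or indirectly."""
    found = set()
    frontier = [project]
    while frontier:
        current = frontier.pop()
        for candidate, deps in depends_on.items():
            if current in deps and candidate not in found:
                found.add(candidate)
                frontier.append(candidate)
    return found
```

deploy_order answers “A needs to be deployed before B”, and downstream_of answers “if we’re deploying A, what should be deployed afterwards”.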

After building the necessary tooling, it was clear that moving to state-per-project was unequivocally the right decision. The speed of terraform plan and the improvement to developer experience (having infrastructure code in the project repo) were tremendous improvements.

The process we used to extract a subset of resources from one state into another state was totally bespoke. It was definitely off-label usage of Terraform (which is unfortunate, since this seems like a reasonable, maybe even inevitable, way for a big Terraform deployment to evolve). Since state files are just JSON, we wrote some Python scripts that would interact with the state buckets in S3 and make the necessary changes.
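
To give a flavor of what such a script does, here is a heavily simplified sketch. It is not the original tooling. It assumes the version 4 state format (a top-level resources list and a serial counter), operates on already-parsed JSON, and ignores things a real migration has to handle, such as address conflicts, outputs, locking, and backups. The function name and the should_move predicate are hypothetical.

```python
import copy


def extract_resources(source_state, dest_state, should_move):
    """Move resources matching should_move from one parsed state to another.

    Returns new (source, dest) state dicts; the inputs are not modified.
    """
    src = copy.deepcopy(source_state)
    dst = copy.deepcopy(dest_state)
    kept, moved = [], []
    for resource in src.get("resources", []):
        (moved if should_move(resource) else kept).append(resource)
    src["resources"] = kept
    dst["resources"] = dst.get("resources", []) + moved
    # Bump the serial on both states so each is seen as a newer revision.
    src["serial"] += 1
    dst["serial"] += 1
    return src, dst
```

A predicate might select by module, e.g. lambda r: r.get("module", "").startswith("module.project1"). After rewriting both states, a terraform plan against each should show no changes if the definitions were moved consistently.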

Be practical about module boundaries

When some Terraform definitions can be factored into a useful, reusable unit, it’s usually the right call to put them in a module. When we have a module that’s useful for multiple projects, we can throw it in our terraform-modules repository and reference it via git (pinning the module version). When the module is only useful within a single project, we can simply make it a module within the repo (i.e. a subdirectory).
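
For reference, a pinned git module source follows Terraform’s git::URL//subdirectory?ref=VERSION form. The small helper below builds such a string; the helper name, repository URL, and module name are hypothetical.

```python
def git_module_source(repo_url, subdir, ref):
    """Build a Terraform git module source pinned to a tag, branch, or commit."""
    if not ref:
        raise ValueError("refusing to build an unpinned module source")
    return f"git::{repo_url}//{subdir}?ref={ref}"


source = git_module_source(
    "https://example.com/acme/terraform-modules.git",  # hypothetical repo
    "ecs-service",                                     # hypothetical module
    "v1.2.3",
)
```

Here source comes out as git::https://example.com/acme/terraform-modules.git//ecs-service?ref=v1.2.3, which is the value that would go in a module block’s source argument.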

The difficulty is that renaming, moving, or otherwise refactoring Terraform resources is always a headache. If you started without a module, but now need to make a second copy of those resources, introducing a module will require either (1) moving resources in state, or (2) destroying and recreating the resources. Currently, Terraform has no facilities for doing state movement except as an out-of-band process that has to be managed separately as special steps in the deployment process (also: mentally scale up the amount of manual work by the number of environments that need to get the deployment).

As a result, it’s sometimes not worth the effort to add or remove module structure. No matter how thoughtful we are, module (and state) boundaries will change over time. Keeping a set of Terraform definitions “optimally” factored has a high maintenance cost and is not always the right choice.

Note that moving a module into or out of your repo can be done seamlessly. Terraform’s resource addresses (e.g. module.abc.module.def.aws_s3_bucket.bucket) don’t care about the source of the module. As long as the structure remains the same, you can move that module from ./some-internally-useful-module to git@github.com.../terraform-modules.git//some-generally-useful-module without any extra work.

How to name things

Resource names are largely irrelevant, but bear in mind that it’s a hassle to rename them (see comments about state movement, above), so try to name things such that you’ll never need to rename them (good luck!). If the name is too generic, like aws_s3_bucket.bucket, then the next time you’re adding a bucket, it will bug you. If it’s too specific, like aws_s3_bucket.python_docs_bucket, then when you want to store something else in that bucket, it will bug you. Happily, even if the name bugs you, it still only impacts development experience within that set of Terraform definitions.

Sometimes, especially in a small module, there’s no reasonable name for a resource. If we’re working on a module that wraps a bucket with a certain kind of policy, what do we call the main bucket resource? aws_s3_bucket.main? default? bucket? None of these are obviously better than the others. There are recommendations on the internet to always include the resource type in the resource name, which is comically redundant. And yet, sometimes we do it anyway. This is completely inconsequential and a great topic for bikeshedding. Just do whatever feels right at the time.

Variables and outputs should have specific names, making it clear what they refer to. Bad: var.memory, better: var.important_lambda_memory_mb. This is often not easy, and it can get worse over time just due to entropy (maybe we introduce a second “important” Lambda). Renaming a variable isn’t a very big deal, though, thankfully. Renaming outputs is a little harder, since it’s the contract published by your module. It’s still doable, though, especially when the outputs have verbose, specific names, and when you can search across all your codebases (GitHub search scoped to organization is often good enough).

Whenever appropriate, put units in names, like alarm_threshold_milliseconds, so the “user” doesn’t have to dig all the way down to AWS API documentation to figure out the unit. Do that legwork once.

As mentioned earlier, it’s useful to publish each Terraform state’s outputs as JSON somewhere easy to access, like S3. We can then refer to those from deployment pipelines and all sorts of scripts. It would be unreasonable for each consumer of TF outputs to know which state was responsible for its outputs, but one thing we can do instead is to download all the outputs, from all the states, and merge them. Because of that, it’s a good idea for output names to be “globally” unique (among our projects). Usually, that means prepending each project’s outputs with the name of the project.
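
Here is a minimal sketch of that merge, which also shows why unique names matter. It assumes each state’s outputs were saved in the shape produced by terraform output -json (a mapping from output name to an object with a value field); the function name and the state names are hypothetical.

```python
def merge_outputs(outputs_by_state):
    """Merge outputs from many states into one flat name -> value mapping.

    outputs_by_state maps a state name to its parsed `terraform output -json`.
    Raises ValueError if two states publish an output with the same name.
    """
    merged = {}
    owner = {}
    for state_name, outputs in outputs_by_state.items():
        for name, entry in outputs.items():
            if name in merged:
                raise ValueError(
                    f"output {name!r} is defined by both "
                    f"{owner[name]!r} and {state_name!r}"
                )
            merged[name] = entry["value"]
            owner[name] = state_name
    return merged
```

Failing loudly on a collision is a design choice: silently letting one state’s output win would make the merged view depend on iteration order.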

Conclusion

Managing a sprawling Terraform deployment can be difficult and daunting, and it’s a different problem from defining infrastructure and writing HCL. This article touched on some practices that can help keep a big Terraform deployment running smoothly in an auditable, maintainable, scalable way. Key among those are running Terraform in a Docker image, running preferentially in CI, and being deliberate about module, repository, and state file organization. I hope this article gives you some useful ideas for your work with Terraform.