Terraform Pain Points

2020-07-08 tech programming terraform

I love using Terraform. At my previous job, we managed our infrastructure entirely with Terraform: tens of thousands of resources spread across several cloud providers. The benefits of infrastructure-as-code, and of Terraform in particular, are massive but well known. While I still consider Terraform the best tool of its kind, this article describes some pain points that my team and I encountered as power users. I hope it can lead to some discussion about ways to improve.

All of these issues are still present as of Terraform v0.13.0-beta3 (July 2020).

This is part one in a series about Terraform. Part Two: Effective Terraform Practices.

Refactoring is difficult

Terraform code is unwieldy to refactor. Even giving a resource a new internal name is a hassle. Here’s our simple Terraform definition:

resource "aws_s3_bucket" "bucket" {
  bucket = "images.bigco.com"
}

Now it’s a month later, and we’re adding our second bucket, so let’s change our Terraform code to use a more specific name:

resource "aws_s3_bucket" "images_bucket" {
  bucket = "images.bigco.com"
}

If we naively make this innocuous-looking change, Terraform will want to delete and recreate the bucket. We probably all understand why that’s the case, and it helps us appreciate how wonderful the concept of terraform plan is, but it’s ludicrous to have no serious facility for doing this smoothly. terraform state mv exists (here, terraform state mv aws_s3_bucket.bucket aws_s3_bucket.images_bucket), but you have to run it separately, outside the plan/apply lifecycle. If you need to do this across ten environments, it’s a lot of manual work.

And that’s the easy case. Moving resources across module boundaries is harder, especially if you want to move them from a module into the root of the state (spoiler: state mv can’t do it). Moving across state boundaries is harder still: the documentation mentions moving to a different state file, but there’s no support for hooking it up to an already-existing remote state in S3 (for example). The tool is not at all user-friendly or convenient.

The silver lining is that Terraform state is a simple JSON file, so it’s easy to write your own tooling around it. My team had occasion to do several refactors where we pulled individual projects’ resources out of a monolithic state and into their own states, once for each of our environments. Trying to orchestrate that with state mv would have been an awkward mess, but writing a simple Python script to pull the state from S3, modify it, and push it back was not too bad (if you do this, you’ll also want to remove the state checksum from the DynamoDB lock table).

You’re never going to nail the module and state boundaries correctly on your first pass (or ever?). Refactoring is inescapable, so it needs to be more convenient. It would be great if there were some way to signal to Terraform “hey, this resource used to have a different address”. Something like this seems reasonable:

resource "aws_s3_bucket" "images_bucket" {
  bucket = "images.bigco.com"

  lifecycle {
    old_addresses = [
      "aws_s3_bucket.bucket",
      "module.images_bucket.aws_s3_bucket.bucket",
    ]
  }
}

Code reuse is limited

Terraform’s main tool for code reuse (i.e. reusing a chunk of resource definitions with different inputs) is the module. (Symlinks may also be useful in some situations, but I haven’t used them for this.) Modules are limited in some ways.

It’s awkward to pin module versions

You can’t do interpolation in a module’s source parameter. So if a dozen module blocks should all point at the same revision of your modules git repository, there’s no clean way to update all those references in one place. My team had a Makefile target, run manually, that used find and sed to update all the references.
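
For reference, a pinned git source looks something like this (the repository URL, tag, and module inputs here are hypothetical):

module "images_bucket" {
  # The ?ref= suffix pins the module to a tag or commit. It has to be a
  # literal string, so every module block repeats the revision.
  source = "git::https://github.com/bigco/terraform-modules.git//s3_bucket?ref=v1.4.0"

  bucket = "images.bigco.com"
}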

Leaving module sources unpinned is not an option I’d be comfortable with: it’s an easily avoided vector for un-source-controlled drift.

Can’t partially apply modules

The biggest problem we’ve faced with the module system is the inability to do partial application (in the computer science sense). In essence, it would be nice to simplify a module’s interface by binding a bunch of common parameters. Here’s an example:

module "service_a" {
  source = "..."

  name = "service_a"

  environment_variables = merge(
    local.environment_variable_defaults,
    {
      THING = "1",
    },
  )

  vpc_id     = local.vpc_id
  subnet_ids = local.subnet_ids
  ...
}

module "service_b" {
  source = "..."

  name = "service_b"

  environment_variables = merge(
    local.environment_variable_defaults,
    {
      THING = "2",
    },
  )

  vpc_id     = local.vpc_id
  subnet_ids = local.subnet_ids
  ...
}

These services might have fifteen parameters being passed in, yet differ in only one or two. There should be some more expressive way of writing them so that only the unique values are prominent. Instead, you’re stuck copy/pasting a pile of boilerplate and editing in the unique values. That’s error-prone and a maintenance burden.

There’s an analogy between Terraform definitions and a conventional programming language: a set of Terraform definitions (a module) is like a function, with TF variables as its inputs, TF locals as its local variables, and TF outputs as its return values. The extension of that analogy to this use case is partial application: you give a module some of its inputs, it binds those values, and you get back a module that takes only the remaining inputs. Terraform doesn’t support it.

Ideally, we’d be able to define some sort of “submodule” like this:

submodule "service" {
  source = "..."

  environment_variables = merge(
    local.environment_variable_defaults,
    ?
  )

  vpc_id     = local.vpc_id
  subnet_ids = local.subnet_ids
}

module "service_a" {
  source = submodule.service

  name = "service_a"

  environment_variable_overrides = {
    THING = "1",
  }
}

We can’t use a proper module for this, because it doesn’t have access to its parent’s locals (which seems right). Notice the unsolved problem here of how to refer to those environment variable overrides in the submodule. This isn’t a fully-formed proposal.

Here are some ways we’ve dealt with this.

Code reuse workaround 1: big map of config

You can pass a big map of config as input to the module, rather than individual variables. That map can be defined as a local, and can include all the common values. Each service can merge on top of it to provide its overrides.
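
A minimal sketch of the pattern (the module interface and names are hypothetical):

locals {
  service_defaults = {
    vpc_id                = local.vpc_id
    subnet_ids            = local.subnet_ids
    environment_variables = local.environment_variable_defaults
  }
}

module "service_a" {
  source = "..."

  config = merge(
    local.service_defaults,
    {
      name = "service_a"
      # merge() is shallow, so nested maps still have to be merged explicitly.
      environment_variables = merge(
        local.environment_variable_defaults,
        { THING = "1" },
      )
    },
  )
}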

It’s a hack, but it works to reduce boilerplate. There are a couple of serious drawbacks, though:

  • Terraform’s merge() only performs a shallow merge. This is surprising behavior, and it can lead to subtle bugs. You can work around it if you know about it, but the workarounds are often awkward (a concrete illustration follows this list). There’s an open PR adding a deepmerge() function.

  • When anything in the map is “not known until after apply” (e.g. an attribute of a resource that hasn’t been created yet), the entire map is considered “not known until after apply”. For example, if our config map looks like

    config = {
      vpc_id = aws_vpc.vpc.id,
      thing_enabled = true,
    }
    

    and the module does something like

    resource ... {
      count = var.config.thing_enabled ? 1 : 0
    
      ...
    }
    

    then it will fail to plan whenever the VPC doesn’t already exist, because var.config contains something (the VPC id) that isn’t known yet, and so var.config.thing_enabled is also not known until after apply. This is subtle and unexpected, and the error messages it produces are cryptic. It’s very easy to introduce by accident once some resources already exist; the next time you try to bootstrap a fresh state (e.g. a new environment), you’ll find that it won’t plan.
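
To make the first drawback concrete, here’s a minimal (hypothetical) illustration of the shallow merge:

locals {
  defaults = {
    tags = {
      team = "platform"
      env  = "dev"
    }
  }

  # merge() replaces the entire nested "tags" value rather than merging into
  # it, so "team" is silently lost.
  config = merge(
    local.defaults,
    {
      tags = {
        env = "prod"
      }
    },
  )

  # local.config.tags is now { env = "prod" }, not { team = "platform", env = "prod" }.
}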

Code reuse workaround 2: generate Terraform definitions from templates

In some instances, it can make sense to generate Terraform code from templates. On my team, there were a few places where we did this, and we checked the generated files into git as regular *.tf files (their names started with generated. to make that obvious). This can work well, but it adds process overhead: pre-processing, knowing which files not to edit, and CI validation that the generated files haven’t been edited by hand.

There are also some TF preprocessors, like terraformpy, but I haven’t tried any.

Type system is too rigid

In Terraform before 0.12, everything was a string, and that was ugly (count = "${var.enabled ? 1 : 0}"). Terraform 0.12 added proper booleans, numbers, and even some data structures like sets and maps. That was an improvement. However:

  • When defining a module’s contract (i.e. specifying the types for its input variables), it’s not currently practical to use map or object.

    The map type requires all values in the map to have the same type. That can be useful in some cases (environment variable values are always strings), but not very often, in my experience.

    The object type is a map without that restriction on value types, but if you’re going to say a variable’s type is object, you need to specify all the names of the keys and the types of their values. Okay… that doesn’t sound so bad. And all of the keys are mandatory. What!? This makes object useless as a variable type, where you’ll often want to pass in just one value as an override and fall back to defaults for the rest.

    If you put these types on your variables, things will fail to plan in all sorts of surprising ways. You’ll try to pass alarm_config = { enabled = true, threshold_seconds = 30 } to your map variable, and it will fail because the value types aren’t uniform. So you’ll change it to an object, then realize that you can’t omit the period_seconds parameter, which was supposed to optionally merge on top of a default. It’s an uphill battle. These types, in this context, are so rigid that they cause a lot of hassle but bring no tangible benefit.

    Luckily, you can sidestep the issues by putting type = any or type = map(any) on the variable (a sketch of this follows the list).

    As an aside, it’s not clear why map and object are even distinct concepts in Terraform. That feels like a leaking abstraction.

  • null was a welcome addition, but it has some surprising behavior. It’s supposed to represent the absence of a value, but if you pass null as a variable’s value, it overrides the default. There is no way to both supply a value for the variable and say “leave it as the default” (good example of why that’s useful).
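
As a sketch of the type = any escape hatch (the variable and key names are hypothetical), a module can accept loosely-typed overrides and merge them onto its own defaults:

variable "alarm_config" {
  # "any" sidesteps both map()'s uniform-value-type requirement and
  # object()'s mandatory keys.
  type    = any
  default = {}
}

locals {
  alarm_defaults = {
    enabled           = true
    threshold_seconds = 30
    period_seconds    = 60
  }

  # Callers pass only the keys they want to override. Note that this merge
  # is shallow, per the earlier section.
  alarm_config = merge(local.alarm_defaults, var.alarm_config)
}

The null caveat from the second bullet applies here too: a caller that passes threshold_seconds = null ends up with null, not with the default of 30.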

The type system was clearly an improvement, and it paved the way for other major improvements to HCL, but the current user experience is far from perfect. When the type system makes its presence known, it’s more often getting in the way than preventing issues. In some cases, the only reasonable way to work with it is to bypass it (type = any).

Upstream development: frustrating priorities

In the past few years, it’s become far easier to write effective Terraform code: list and map comprehensions, resource for_each, module for_each, rich outputs, etc. But, to be brutally honest, there’s no innovation in any of that. It’s just plugging functionality gaps in HCL; paying debt service on the massive debt incurred by choosing to develop a domain-specific language.

Had Terraform used an established programming language instead of HCL, maybe this time would have been spent on pushing the infrastructure-as-code ecosystem forward. As it is, Terraform’s core is developed slowly and there don’t seem to be any meaningful innovations on the horizon.

The AWS provider has a rapid pace of development, with a release roughly once a week. However, there are many long-standing PRs, fixing important bugs and adding important features, that languish for months with no attention from maintainers (example, example, example, example, example). It’s a good project, but apparently not a particularly well-managed one.

On my previous team, we found it necessary to run custom builds of Terraform and the AWS provider so that we could incorporate patches that we wrote or pulled from community PRs. This took some work to set up, but it was valuable.

Conclusion

I really do like Terraform! Despite these complaints, I still consider Terraform the best-in-class tool for infrastructure management. None of these problems are so critical that they stop Terraform from being an extremely well-made, useful tool.

These issues almost all trace back to HCL. From an outsider’s perspective, continuing to invest in HCL seems like a mistake. We see from Pulumi that generating a resource graph using a real programming language is no less declarative or easy to reason about. The value in these tools comes from the use of state and the plan/apply lifecycle. Terraform would be a better tool if it did not use HCL.

Regarding Pulumi: my team took some time in June 2020 to evaluate Pulumi as a replacement for Terraform. It fixes or sidesteps basically all of the issues described above. However, it’s not as polished as Terraform, and using a real programming language fixes many things but is not a panacea (surprise!). Even starting from scratch, without the question of migrating existing resources, I’d still choose Terraform, because it’s simply a better piece of software, HCL and all. I hope to discuss this further in a future article.
