The Custom Terraform Build

2020-10-21 tech programming terraform

As discussed in Terraform Pain Points, one of the most frustrating things about working with Terraform is the pace and direction of upstream development. Useful features and bug fixes often languish for months or years because the maintainers lack bandwidth to review them or don’t see the utility. Check out all of the (justified) angst in the comments of terraform-provider-aws#8268, as hundreds of people waited eight months for it to be merged. This can feel like an impassable barrier.

It’s not an impassable barrier, though. In fact, we can sidestep this entirely if we run a custom build of Terraform and/or its providers. Here are some reasons you might want to do this:

  • Pull in new features or bug fixes
    • Add deepmerge (Terraform #25032), once you realize that Terraform’s merge only goes one layer deep.
    • Actually delete Route53 records when the plan says they’re going to be deleted (AWS provider #11335)
  • Pull in features that may never be accepted upstream
    • Add support for bulk imports (Terraform #22227)
    • I wrote a patch which added a missing_okay flag to ECS data sources, to work around the bootstrapping issues discussed here. I didn’t bother submitting it upstream because optional/gracefully-failing data sources are unwelcome.
  • Incorporate patches specific to your company’s resources or workflows
    • At my previous job, we managed some ALB routing rules dynamically, outside of Terraform. This made Terraform unable to delete the ALB target groups (which it did manage), because they still had downstream resources. I added a flag to the target group resource which, when enabled, would delete the downstream rules before trying to delete the target group itself.
  • Enforce a consistent Terraform environment (version pinning and, optionally, a set of blessed patches).

This article walks through a practical approach to doing exactly that. It’s surprisingly liberating, and IMO one of the most impactful ways to improve the Terraform experience.

What to build

It’s important that your team runs Terraform in a consistent way. For one thing, Terraform is touchy about its version: it will only operate on state generated by the same or an older version, so everybody has to upgrade in sync. And once we start relying on patched behavior in Terraform or its providers, staying in sync becomes even more critical so that everyone uses a consistent set of patches. (Side note: “everybody” includes CI. A mature Terraform workflow should route primarily through an automation/CI platform.)

Packaging Terraform in a Docker image is a great way to encapsulate everything. We can ship an image with the patched Terraform binary and whatever patched providers we need, at the appropriate locations on the file system. Everything we care about is pinned. We only need to build for one target architecture. Distribution is a cinch (docker pull ...). Running it is less of a cinch, but will be discussed below.

There’s also terraform-bundle. I haven’t used it, but it’s worth being aware of. It seems more complicated than a Docker image (e.g. to distribute), but I could see it being a good fit for some workflows (especially if you’re using Terraform Enterprise).

How to build it

If you want to skip the explanation, a demo implementation of these ideas may be found here.

There are many ways to approach maintaining and building this Docker image. Here I’ll describe one that aims for a declarative paradigm, with automation and ease of maintenance as the focus. We’ll specify a base reference (e.g. a tag or commit hash in the hashicorp/terraform repo) and a list of patches to apply, and a shell script will apply those patches before compiling the code.

This declarative approach doesn’t involve maintaining a fork, and it has some advantages over one: it’s easy to tell which base version and patches a given version of the image contains, and upgrading the base reference is usually a one-line code change. However, it’s awkward for anything requiring manual intervention (e.g. merge conflict resolution); a manually maintained fork handles that more easily because its workflow is less automated to begin with. If another approach works better in your situation, run with that. The pattern described here has been powerful and easy to use in my experience.

To keep things simple, we can list our patches as files on disk. Here’s the proposed file structure:

├── Dockerfile
├── apply-patches
└── patches
    ├── terraform
    │   └── PR123-some-great-feature
    │       ├── remote
    │       ├── reference
    │       └── explanation
    └── terraform-provider-aws
        ├── PR987-adding-a-resource
        │   ├── remote
        │   ├── reference
        │   └── explanation
        └── some-custom-patch
            ├── apply
            └── explanation

If we’re pulling in the contents of a public changeset (e.g. a PR), our patch directory will contain the files remote and reference. The contents of remote will be a git remote (e.g. https://github.com/somebody/terraform). The contents of reference will be a git reference that can be checked out in that repository, like a branch name (feature/some-great-feature) or a long or short commit hash (0abcdef). The explanation file is just free-form text explaining what the patch is for; it can be named anything.

If we have some other sort of patch, e.g. a .patch file, then we can bring an executable shell script named apply which does whatever is necessary.
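
As a purely illustrative sketch, an apply script for a hand-rolled patch could be as small as this; changes.patch is a hypothetical diff file kept in the same directory, and the apply-patches script described below invokes it from the repository root with the patch name as its first argument:

#!/bin/sh
set -ex

# $1 is the name of the patch directory (passed in by apply-patches).
patch_name=$1

# Apply a checked-in diff and stage the result so the caller can commit it.
git apply patches/"$patch_name"/changes.patch
git add -A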

The apply-patches script (detailed below) will be called from the Dockerfile, which we break up into stages like this:

Dockerfile part 1 (building Terraform)
FROM golang:1.15-buster AS builder-terraform
ARG TERRAFORM_BASE_REFERENCE
WORKDIR /code/hashicorp/terraform
RUN git clone https://github.com/hashicorp/terraform . && git checkout $TERRAFORM_BASE_REFERENCE

COPY apply-patches .
COPY patches/terraform/ patches/
RUN ./apply-patches

# Stripping debug symbols for smaller binary:
# 	`-ldflags '-s -w'`
# Output lives at /go/bin/terraform
RUN go install -a -ldflags '-s -w' .

Dockerfile part 2 (building a provider)
FROM golang:1.15-buster AS builder-terraform-provider-aws
ARG AWS_PROVIDER_BASE_VERSION
WORKDIR /code/terraform-provider-aws
RUN git clone https://github.com/terraform-providers/terraform-provider-aws . && git checkout v$AWS_PROVIDER_BASE_VERSION

COPY apply-patches .
COPY patches/terraform-provider-aws/ patches/
RUN ./apply-patches

# Stripping debug symbols for smaller binary:
# 	`-ldflags '-s -w'`
# Output lives at /go/bin/terraform-provider-aws
RUN go install -ldflags '-s -w'

This takes a base version of the provider as a build argument and assumes a tag named v... exists; other providers may not follow this pattern. It would be preferable to accept an arbitrary git reference, but it’s done this way because Terraform 0.13+ is very particular about where it wants providers installed, and that location requires a semver-like version string (a bit more detail in the comments of the next Dockerfile segment). It’s possible to hardcode that version string to something like 0.0.1 (for some reason, 0.0.0 doesn’t work) and then use any git reference here, but it could make the output of terraform init confusing.
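
If you do want the hardcoded-version route, here’s a minimal sketch; AWS_PROVIDER_BASE_REFERENCE is a hypothetical build argument introduced only for this illustration:

# In the provider build stage: check out an arbitrary reference instead of a version tag.
ARG AWS_PROVIDER_BASE_REFERENCE
RUN git clone https://github.com/terraform-providers/terraform-provider-aws . && git checkout $AWS_PROVIDER_BASE_REFERENCE

# In the final stage: the plugin path version no longer tracks the source, so hardcode it.
ENV AWS_PROVIDER_PATH=/usr/share/terraform/plugins/registry.terraform.io/hashicorp/aws/0.0.1/linux_amd64/terraform-provider-aws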

Dockerfile part 3 (constructing the final image)
FROM debian:buster-20201012-slim
ARG AWS_PROVIDER_BASE_VERSION

# For pulling Terraform modules from git repositories
RUN apt-get update && apt-get install -y git && apt-get clean

COPY --from=builder-terraform /go/bin/terraform /bin/terraform
ENTRYPOINT ["terraform"]

# Note! This is only correct for TF >=0.13.
#
# Put the bundled provider where Terraform will look for it, following
# 	https://gist.github.com/mildwonkey/85df0f0605880a0f08b6f05c15092bd7
#
# Note that there are some restrictions on the provider version used in the path. A `v` prefix is not allowed
# (e.g. `/v2.70.0/`) and neither is a custom suffix (e.g. `/2.70.0-custom/`). In both cases, Terraform will
# ignore our provider and try to install from the public registry.
ENV AWS_PROVIDER_PATH=/usr/share/terraform/plugins/registry.terraform.io/hashicorp/aws/$AWS_PROVIDER_BASE_VERSION/linux_amd64/terraform-provider-aws
COPY --from=builder-terraform-provider-aws /go/bin/terraform-provider-aws $AWS_PROVIDER_PATH

If you’re using Terragrunt or other supporting tools, you would also download them in this final layer.
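
For example, a Terragrunt install might look something like this; the version pin and the use of curl are illustrative, not prescriptive:

# Illustrative Terragrunt version; pin whatever your team has standardized on.
ARG TERRAGRUNT_VERSION=0.25.5
RUN apt-get update && apt-get install -y curl && apt-get clean && \
    curl -fsSL -o /bin/terragrunt \
        https://github.com/gruntwork-io/terragrunt/releases/download/v$TERRAGRUNT_VERSION/terragrunt_linux_amd64 && \
    chmod +x /bin/terragrunt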

Aside: There was a big change in Terraform 0.13 regarding where Terraform searches for provider binaries. If you’re using Terraform 0.12 or earlier, the provider binary can simply be put next to Terraform, like this:

# Note! This is only correct for TF <0.13 (i.e. 0.12.* or earlier)
#
# Put the bundled AWS provider alongside the Terraform binary (/bin/terraform), which is the second
# place that Terraform checks for plugins.
# https://www.terraform.io/docs/extend/how-terraform-works.html#discovery
COPY --from=builder-terraform-provider-aws /go/bin/terraform-provider-aws /bin/terraform-provider-aws

Applying patches

Here’s the apply-patches shell script, which applies the patches stored on disk as described above:

#!/bin/sh

set -ex

# This script is run as part of `docker build`.
#
# We should be in the directory where the base git reference has been checked out.
# Now, we loop over the patches in the patches/ directory and apply them.
#
# Usage: ./apply-patches

# git doesn't let us commit unless we configure author details
git config user.name terraform-image-automation
git config user.email nobody@nowhere.com

apply_from_remote_and_reference() {
  patch_name=$1
  remote=$2
  reference=$3
  git remote add "$patch_name" "$remote"
  git fetch "$patch_name" "$reference"
  # Get the exact revision so we can log it for audit purposes (what, exactly, went into this
  # image?). If we're working with a branch or tag, resolve that. Otherwise, use the reference,
  # which is assumed to be a commit hash. Fallback logic cf. https://stackoverflow.com/a/62338364
  revision=$(git rev-parse -q --verify "$patch_name"/"$reference" || echo "$reference")
  echo Revision: "$revision"
  git merge --squash "$revision"
}

apply_from_script() {
  patch_name=$1
  ./patches/"$patch_name"/apply "$patch_name"
}

for patch_path in patches/*
do
  # If there were no glob matches, don't loop.
  if [ ! -e "$patch_path" ]; then
    break
  fi

  patch_name=$(basename "$patch_path")
  echo Applying patch "$patch_name"

  if [ -f patches/"$patch_name"/remote ] && [ -f patches/"$patch_name"/reference ]; then
    # A typical patch specifies a remote (e.g. https://github.com/somebody/terraform-provider-aws)
    # and a reference (a branch name or commit hash) and doesn't provide an `apply` script.
    apply_from_remote_and_reference "$patch_name" "$(cat patches/"$patch_name"/remote)" "$(cat patches/"$patch_name"/reference)"
  elif [ -f "$patch_path"/apply ]; then
    # Patch bringing its own apply script
    apply_from_script "$patch_name"
  else
    echo Patch "$patch_name" should have either \`remote\` and \`reference\` files or an \`apply\` script.
    exit 1
  fi

  # Can't `merge --squash` more than once in a row without committing in between.
  git commit -m "Applying patch $patch_name"
done

Now docker build takes care of the rest. You can set DOCKER_BUILDKIT=1 to use BuildKit, which will parallelize the independent stages.
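
The build command just needs a tag and the two build arguments; the base versions here are illustrative:

DOCKER_BUILDKIT=1 docker build \
    --build-arg TERRAFORM_BASE_REFERENCE=v0.13.4 \
    --build-arg AWS_PROVIDER_BASE_VERSION=3.11.0 \
    -t custom-terraform:ourtag .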

A full demo of this implementation may be found here. That repository uses another tool I wrote, Dockerfiler, which is useful for automated “build me an image based on this Dockerfile” processes like this.

How to use it

I’ve previously written about how great it is to run things in Docker. If it’s unfamiliar to you, here’s a useful mental model in the form of a shell alias:

alias terraform='docker run -it --rm -v "$(pwd)":"$(pwd)" -w "$(pwd)" -u $(id -u):$(id -g) custom-terraform:ourtag'

Now when we run terraform init or terraform plan, it routes through the container. That docker run command (well, stub of a command) is functionally equivalent to invoking a locally-installed Terraform binary. It’s better in many ways, though: an explicit choice of version via the image tag, easy cross-platform support, and no environmental context (AWS credentials, file system access) leaking in by accident. That article goes into more depth on the benefits (and drawbacks).

In practice, an alias like this isn’t the best way to do it: it’s not source controlled, it’s not easy to share with a team, and some workflow steps need different credentials passed to the container (e.g. init needs access to download modules, while plan and apply do not).

One thing I’ve found useful here is a Makefile like this:

image_reference = custom-terraform:ourtag

# This would volume in credentials necessary for init (i.e. for downloading modules, accessing remote state)
# and volume in the `terraform-known-hosts` and `terraform-etc-passwd` files.
tf_init_docker_options = ...

terraform = docker pull $(image_reference) && \
	docker run --rm -u $$(id -u):$$(id -g) -v "$$(pwd)":"$$(pwd)" -w "$$(pwd)" $(2) $(image_reference) $(1)

# Specific to if we're downloading modules from GitHub (or another git remote)
terraform-known-hosts:
	ssh-keyscan github.com > terraform-known-hosts

# OpenSSH is finicky about running as a user that doesn't exist in /etc/passwd, so create a basic
# /etc/passwd to get around that (and use /tmp as $HOME so it's writable).
terraform-etc-passwd:
	echo "terraform:x:$$(id -u):$$(id -g):terraform,,,:/tmp:/bin/sh" > terraform-etc-passwd

tf-init: terraform-known-hosts terraform-etc-passwd
	$(call terraform, init $(args), $(tf_init_docker_options) $(docker_args))

tf:
	$(call terraform, $(args), $(docker_args))

The workflow is then make tf-init followed by make tf args='plan', etc. There are a lot of ways to make this more ergonomic, but the basic functionality is all here.
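
A typical local session then looks something like this (the TF_LOG example just shows how arbitrary docker run flags can be threaded through):

make tf-init
make tf args='plan -out=tf.plan'
make tf args='apply tf.plan'

# Extra docker run options go through docker_args, e.g. verbose Terraform logging:
make tf args='plan' docker_args='-e TF_LOG=DEBUG'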

Some gotchas when running in a container

There are some subtleties to making this work seamlessly. I’ll call them out here, but I’ll be light on details (which could fill another article).

  • Sourcing cloud provider credentials will probably look different when you run locally and when you run in automation. For instance, if we’re using AWS, the local credentials might be provided to the container by voluming in ~/.aws but our automation might use an assumed role/the EC2 instance metadata service. To handle discrepancies like that, it’s useful to distinguish between automated/non-automated sessions. Something like this can work:

    # Identify interactive session by presence of stdin, cf. https://stackoverflow.com/a/4251643
    is_interactive := $(shell [ -t 0 ] && echo 1)
    
    ifdef is_interactive
        special_docker_run_options = -it -v ~/.aws:/tmp/.aws -e HOME=/tmp -e AWS_PROFILE
    else
        special_docker_run_options = -e TF_IN_AUTOMATION=1 -e TF_INPUT=0
    endif
    

    Then we’d add $(special_docker_run_options) to the docker run template; a sketch of the combined template appears after this list. More about /tmp as a home directory below.

  • The Dockerfile doesn’t create a non-root user. It’s a good practice to run as a non-root user, and actually to run as the same uid/gid as the person invoking Docker, but that can’t be baked into the image because the user id is only known at runtime. Containers are usually okay running as whatever user, even if it’s unknown, but anything that tries to write to the home directory will fail unless we set HOME to something writable. HOME=/tmp is a simple thing that works in those situations.

    If we’re downloading Terraform modules via git, then we run into an issue where OpenSSH errors out if the user isn’t known to the OS. To get around that, we have this trick in the Makefile to create a dummy /etc/passwd with the same uid/gid as our current user.

  • Specific to downloading Terraform modules over git/SSH:

    • We need to pre-fetch the remote’s host keys with ssh-keyscan. Otherwise, we get the dreaded Are you sure you want to continue connecting (yes/no)? interactive prompt from OpenSSH.

    • It’s usually necessary to volume in an SSH agent. This has been possible on a Mac since late 2019, but requires running as root in the container. The following should work:

      ifeq ($(shell uname), Darwin)
      # /run/host-services/ssh-auth.sock is a special socket exposed in the Docker-for-Mac Linux VM since
      # Docker for Mac Edge release 2.1.4.0 (2019-10-15) https://docs.docker.com/docker-for-mac/edge-release-notes/
      # The socket can be volumed into a container to use the host's SSH agent in the container. This only
      # works when running as root in the container, but that's okay in D4M because it only ever writes to
      # the host file system as the user running Docker (e.g. doesn't write files as root).
          tf_init_docker_options = -v /run/host-services/ssh-auth.sock:/ssh-agent -e SSH_AUTH_SOCK=/ssh-agent \
              -u 0:0 -v "$$(pwd)"/terraform-known-hosts:/tmp/.ssh/known_hosts
      else
          tf_init_docker_options = -v $$SSH_AUTH_SOCK:/ssh-agent -e SSH_AUTH_SOCK=/ssh-agent \
              -v "$$(pwd)"/terraform-etc-passwd:/etc/passwd \
              -v "$$(pwd)"/terraform-known-hosts:/tmp/.ssh/known_hosts
      endif
      
  • On a Mac, watch out for terrible I/O performance when voluming in the working directory. It may be possible to work around that, though I haven’t.
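
As promised above, here’s a sketch of the docker run template from the earlier Makefile with $(special_docker_run_options) folded in; everything else is unchanged from that example:

terraform = docker pull $(image_reference) && \
	docker run --rm -u $$(id -u):$$(id -g) -v "$$(pwd)":"$$(pwd)" -w "$$(pwd)" \
		$(special_docker_run_options) $(2) $(image_reference) $(1)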

Conclusion

Setting up a system to produce custom builds takes some work, but I hope this article makes that path a little easier for you. Once you start running a custom build of Terraform, you’re going to wonder how you ever got along without it.
