ECS + Terraform + Continuous Delivery
One of my medium-term goals at my job is to define my department’s AWS footprint in Terraform. We run many of our applications in ECS, and the way that ECS configuration is structured presents an interesting challenge for organizations that want to manage infrastructure and deployment separately. When infrastructure is managed by Terraform, and deployment is handled by a continuous delivery tool, how do you get them to play nice with each other?
There’s a long-running discussion on GitHub about this challenge, in which folks in the Terraform community discuss the many different solutions that have been developed to address it, or at least work around it. I ended up coming up with my own solution, and wanted to write it out in detail in case it’s helpful to others.
Background
Most of my team’s applications run within Docker containers. Our continuous integration and delivery system, Semaphore CI, builds and publishes a container image for each version we want to deploy. Those images are then deployed to ECS by a shell script run by Semaphore CI, thus achieving continuous delivery, or “CD”.
My goals for this project were as follows:
- Use Terraform to manage all (or almost all) infrastructure & application config
- Use CD tooling run by Semaphore CI to deploy changes automatically (for dev) or semi-automatically (upon user request, for prod)
- Allow speedy rollback to the last “known good” application version if a deploy fails
Glossary
ECS has a number of different components, and the terminology can be confusing, so here’s a quick overview of the different moving parts:
- ECS Cluster: Logical grouping of ECS Services
- ECS Service: Set of one or more ECS Tasks that make up a running application
- ECS Task: Instantiation of an ECS Task Definition within an ECS Service/Cluster
- ECS Task Definition: Set of parameters that describe the infrastructure required to run a container in ECS (e.g. network, volumes, IAM roles, etc.). Note: Task Definitions are immutable, so each update to a Task Definition generates a new version.
- ECS Container Definition: Subset of the ECS Task Definition that describes the container’s runtime configuration (e.g. Docker image version, log config, environment variables, secrets, etc.)
In short: An ECS Service is configured using one or more ECS Task Definitions, each of which is an immutable, versioned artifact containing one or more ECS Container Definitions.
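To make the nesting concrete, here’s a minimal Terraform sketch (all names and values are hypothetical, and this assumes a cluster declared elsewhere). Note that the Container Definitions live inside the Task Definition as a JSON-encoded attribute rather than as their own resource, which will matter shortly:

```hcl
resource "aws_ecs_task_definition" "app" {
  family = "my-app"

  # The ECS Container Definition: a JSON document embedded in the
  # Task Definition, not a standalone Terraform resource.
  container_definitions = jsonencode([
    {
      name         = "my-app"
      image        = "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:1.2.3"
      cpu          = 256
      memory       = 512
      essential    = true
      portMappings = [{ containerPort = 8080, hostPort = 8080 }]
    }
  ])
}

resource "aws_ecs_service" "app" {
  name            = "my-app"
  cluster         = aws_ecs_cluster.main.id # assumes a cluster defined elsewhere
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 2
}
```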
Problems
- The way AWS structures ECS configuration blurs the lines between infrastructure config, application/container config, and declaring which version of the application to run. This lack of separation is true even at the level of the ECS Container Definition, which contains:
- The container image (application version)
- The container port, CPU, and memory (container config) [Note: These are also declared at the ECS Task Definition level, which may or may not be redundant depending on how many containers are being run]
- Environment variables that could vary by app environment, e.g. dev/prod (application config)
- ARNs of Secrets Manager secrets that could (and should!) vary by app environment (application config)
- If we want to use an explicit Docker tag for each application version, then deploying that application version means having to update the ECS Container Definition, which in turn means having to create a new version of the ECS Task Definition
- Because the ECS Container Definition is a subset of the ECS Task Definition, it can’t be treated as a first-class resource in Terraform, so we can’t use a lifecycle hook to instruct Terraform to ignore changes on only one aspect of the ECS Container Definition without effectively excluding the entire ECS Task Definition from Terraform (see the sketch after this list)
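Here’s a hedged sketch of that last limitation. Terraform’s `ignore_changes` can target the `container_definitions` attribute as a whole, but because the container definitions are a single JSON string, there is no way to target one field inside them:

```hcl
resource "aws_ecs_task_definition" "app" {
  family = "my-app"
  container_definitions = jsonencode([
    { name = "my-app", image = "my-app:1.2.3", memory = 512, essential = true }
  ])

  lifecycle {
    # This is valid: ignore *all* changes to the container definitions...
    ignore_changes = [container_definitions]

    # ...but "ignore only the image field" cannot be expressed; something
    # like container_definitions[0].image is not valid here, because the
    # attribute is an opaque JSON string.
  }
}
```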
Sources of Truth
In order to separate concerns between the different systems that automate application infrastructure provisioning and deployment, it helps to designate a source of truth for each:
| Configuration Type | Declared Within | Managed By |
| --- | --- | --- |
| Infrastructure Configuration | ECS Task Definition | Terraform |
| Application/Container Configuration | ECS Container Definition | Terraform |
| Application Artifact Version | ECS Container Definition | Semaphore CI |
Represented this way, the problem becomes clear: The ECS Container Definition has multiple sources of truth, and we need to figure out how to resolve that.
Approaches From The Community
In the GitHub issue thread, many different approaches are discussed, which I’ve attempted to condense into four options. Each of them comes with some advantages and drawbacks. They’re presented in descending order of personal preference.
Approach 1: Manage ECS Task Definition & ECS Container Definition with Terraform, deploy via persistent Docker tag
Create a persistent Docker tag that corresponds to the environment name (e.g. `dev`, `prod`). During deployment, CD tooling updates the tag to point to the desired Docker image, and then force-updates the ECS service. The CD tooling then monitors the ECS deployment to confirm that new tasks are started successfully and old tasks are torn down. No changes to the ECS Task Definition or Container Definition are required.
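A minimal sketch of what that deploy step might look like, assuming a hypothetical ECR repository and placeholder cluster/service names:

```bash
#!/usr/bin/env bash
set -euo pipefail

REPO="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app" # hypothetical
BUILD_TAG="$1"   # e.g. the git SHA of the build to deploy
ENV_TAG="dev"    # persistent tag referenced by the Task Definition

# Point the persistent environment tag at the desired build...
docker pull "$REPO:$BUILD_TAG"
docker tag "$REPO:$BUILD_TAG" "$REPO:$ENV_TAG"
docker push "$REPO:$ENV_TAG"

# ...then force the service to re-pull the tag and wait for stability.
aws ecs update-service --cluster my-cluster --service my-app --force-new-deployment
aws ecs wait services-stable --cluster my-cluster --services my-app
```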
Advantages
- All infrastructure configuration is defined in Terraform
- No Terraform changes are required in order to deploy a new version
Drawbacks
- Cannot declare an explicit application artifact version to deploy
- Rollbacks require re-updating the image tag before deploying
- ECS Autoscaling becomes inconsistent if the Docker tag changes separately from deployment (e.g. a scale-out event could launch tasks running a newer image than the rest of the service)
Possible Mitigations
- Somehow enforce only updating the tag immediately before deploying
- Monitor deployments for failure and roll back by triggering a new deployment
Approach 2: Ignore changes to the ECS Task Definition & Container Definition in Terraform
As noted previously, given that the ECS Container Definition is part of the ECS Task Definition and not a separate Terraform resource, it’s not possible to ignore changes on the `image` field alone. That means you can’t simply update the application artifact version without updating (and thus creating a new version of) the ECS Task Definition. Any changes to the ECS Task Definition that are made outside of Terraform will be flagged by Terraform as drift.
To avoid creating drift, you could instruct Terraform to ignore changes on the entire ECS Task Definition, but then you’re effectively no longer managing the ECS Task Definition via Terraform at all. If you need to change the infrastructure configuration at a later date, doing so would potentially overwrite all changes related to deployments.
To mitigate this risk, you could have Terraform call out to a script—or to AWS directly—for the current configuration before proposing changes. However, this would create a circular dependency on the ECS Task Definition configuration that would need to be addressed when provisioning a new ECS Service.
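For completeness, here’s roughly what that kludge looks like: a common pattern from the GitHub thread uses the `aws_ecs_task_definition` data source plus `max()` over revisions, so the service always points at whichever revision is newest, whether it came from Terraform or from CD tooling (names here are hypothetical):

```hcl
# Read back whatever revision currently exists in AWS (it may have been
# registered by CD tooling, outside of Terraform).
# Note: this lookup fails on the very first provision, before any
# revision exists -- the circular dependency mentioned above.
data "aws_ecs_task_definition" "app" {
  task_definition = aws_ecs_task_definition.app.family
}

resource "aws_ecs_service" "app" {
  name    = "my-app"
  cluster = aws_ecs_cluster.main.id

  # Use whichever revision is newer: Terraform's or the deployed one.
  task_definition = "${aws_ecs_task_definition.app.family}:${max(
    aws_ecs_task_definition.app.revision,
    data.aws_ecs_task_definition.app.revision,
  )}"
}
```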
Advantages
- Allows for declaring an explicit application artifact version for each deployment
Disadvantages
- Breaks Terraform management of the ECS Task Definition
- Infrastructure configuration changes could overwrite declaration of application artifact version
Possible Mitigations
- Instruct Terraform to read ECS Task Definition configuration before proposing changes (kludgy)
Approach 3: Store ECS Container Definition in application code
Another approach to resolve the drift problem described in Approach 2 is to remove the ECS Container Definition from Terraform entirely, and instead store it alongside the application. During deployment, CD tooling injects the application artifact version into the ECS Container Definition, then creates a new ECS Task Definition version, then updates the ECS Service.
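As a hedged sketch (file names and identifiers are hypothetical), that CD step might look something like this:

```bash
#!/usr/bin/env bash
set -euo pipefail

IMAGE="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:$1" # build to deploy

# Inject the image into the container definitions stored in the app repo...
jq --arg image "$IMAGE" '.[0].image = $image' \
  container-definitions.json > rendered.json

# ...register it as a new Task Definition revision...
TASK_DEF_ARN=$(aws ecs register-task-definition \
  --family my-app \
  --container-definitions file://rendered.json \
  --query 'taskDefinition.taskDefinitionArn' --output text)

# ...and point the service at the new revision.
aws ecs update-service --cluster my-cluster --service my-app \
  --task-definition "$TASK_DEF_ARN"
aws ecs wait services-stable --cluster my-cluster --services my-app
```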
Advantages
- Easy to update image tag & push new version
- Preserves rollback via ECS console
Disadvantages
- Infrastructure configuration would be stored outside Terraform (i.e. CPU/memory values, Secrets Manager ARNs)
- ECS Container Definition must be managed by hand or via some other tool
- Terraform must either use a “dummy” task definition when creating the ECS Service resource, or have access to load the real one at creation time.
Approach 4: Use Terraform as part of CD process
This is essentially the opposite of the previous two approaches: Instead of removing some of the ECS configuration bits from Terraform, you could use Terraform to handle all changes, including deployment. In this approach, the CD system is empowered to make Terraform changes. Deployment is handled by adjusting the Terraform config, running `terraform apply`, and committing/pushing the changes back to the repo.
This approach could be handled two different ways, depending on where you want to store the Terraform configuration:
Approach 4a: Keep Terraform config in central infrastructure repo
Approach 4b: Manage ECS service via separate Terraform config in application repo
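In either variant, the deploy job boils down to something like the following sketch (the `image_tag` variable and file name are assumptions, not anything prescribed by the approach):

```bash
#!/usr/bin/env bash
set -euo pipefail

NEW_TAG="$1" # application artifact version to deploy

# Record the new version in the Terraform config (assumes the config
# declares an image_tag variable that feeds the container definition)...
echo "image_tag = \"$NEW_TAG\"" > image_tag.auto.tfvars

# ...apply it, and push the change back so the repo stays the source of truth.
terraform apply -auto-approve
git add image_tag.auto.tfvars
git commit -m "Deploy $NEW_TAG"
git push
```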
Advantages
- All configuration is managed by Terraform
- Deploys are still handled by CD tooling
Disadvantages
- Requires allowing CD tooling to apply Terraform changes
- Requires allowing CD tooling to push git commits
- Increases complexity of both provisioning and deployment
This seems to be a popular approach among the community, but our Terraform setup isn’t mature enough to support it yet. Additionally, I’m not enthusiastic about granting Semaphore CI permission to make and commit Terraform changes.
My Approach
The alternate solution I came up with was to have Terraform create and manage a “template” Task Definition. The template Task Definition declares all of the desired configuration for the container, but with `latest` as a placeholder container image version (to ensure validity during provisioning). The ECS Service configuration uses the template upon creation, but has a lifecycle hook set to `ignore_changes` on the `task_definition` attribute.
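A minimal sketch of the Terraform side, with hypothetical names:

```hcl
# The "template" Task Definition: the source of truth for container
# config, with `latest` standing in for the real image version.
resource "aws_ecs_task_definition" "template" {
  family = "my-app-template"

  container_definitions = jsonencode([
    {
      name      = "my-app"
      image     = "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest"
      memory    = 512
      essential = true
    }
  ])
}

resource "aws_ecs_service" "app" {
  name          = "my-app"
  cluster       = aws_ecs_cluster.main.id
  desired_count = 2

  # Used only at creation time; deploys register their own Task Definitions.
  task_definition = aws_ecs_task_definition.template.arn

  lifecycle {
    # Let CD tooling repoint the service without creating drift.
    ignore_changes = [task_definition]
  }
}
```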
On deployment, our CI tooling reads the configuration from this “template” Task Definition, and uses it to create a separate Task Definition for use by the ECS Service. The desired image tag is injected into the new Task Definition at deploy time, and the new task definition is not managed by Terraform.
With this approach, any changes made by Terraform will apply only to the “template” Task Definition, and won’t try to manipulate the running tasks at all. Such changes will be picked up by the service on the next deployment triggered by CI. By updating the template, Terraform can remain the source of truth for container configuration, and by creating a separate Task Definition from that template, Semaphore can still serve as the source of truth for the artifact version to deploy.
Advantages
- All configuration is managed by Terraform
- Deploys are still handled by CD tooling
Disadvantages
- Changes to the Task Definition and Container Definition must be applied to the template by Terraform prior to deploy (but in our case this is by design)
Execution
I had hoped to use ecs-deploy (Python) or ecs-deploy (Shell) to handle the deployment part, but neither seems to have working support for reading configuration from a separate task definition (see here and here, respectively). We ended up rolling our own method using the `aws` CLI and `jq` for now. At some point I’d like to contribute updates to one or both tools to facilitate this approach, but I haven’t had a chance yet. Feel free to beat me to it. :)
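For the curious, here’s a rough sketch of what that looks like. Names are placeholders, and this elides error handling; the `jq` filter strips the read-only fields that `describe-task-definition` returns but `register-task-definition` rejects:

```bash
#!/usr/bin/env bash
set -euo pipefail

NEW_IMAGE="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:$1"

# Read the container config from the Terraform-managed template,
# swap in the real image, and drop the read-only response fields.
aws ecs describe-task-definition --task-definition my-app-template \
  --query 'taskDefinition' |
  jq --arg image "$NEW_IMAGE" \
    '.containerDefinitions[0].image = $image
     | del(.taskDefinitionArn, .revision, .status, .requiresAttributes,
           .compatibilities, .registeredAt, .registeredBy)
     | .family = "my-app"' > task-def.json

# Register the deploy-time Task Definition (not managed by Terraform)...
TASK_DEF_ARN=$(aws ecs register-task-definition \
  --cli-input-json file://task-def.json \
  --query 'taskDefinition.taskDefinitionArn' --output text)

# ...and point the service at it.
aws ecs update-service --cluster my-cluster --service my-app \
  --task-definition "$TASK_DEF_ARN"
aws ecs wait services-stable --cluster my-cluster --services my-app
```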
A Note About Rollbacks
All of the approaches described above still require manual intervention to resolve failed deployments. This is essentially an ECS limitation. The good news is that AWS recently announced a preview of a deployment circuit breaker feature that can “automatically roll back unhealthy service deployments without the need for manual intervention”. This is an exciting development, and while it’s not yet supported by Terraform as of this writing, it’s something I’m looking forward to trying out.
Thanks for reading! If you have any questions, comments, or corrections, feel free to hit me up on Twitter.