Data-driven Terraform, with Hiera
Yay for Terraform
Terraform is a great tool for managing infrastructure as code. Several factors make it easy to learn and use:
- It’s a DSL, so the language is stripped down to the basics
- It’s declarative, so you describe the desired state rather than the steps to get there
- It’s modular, so you can deploy a whole stack of infrastructure by calling a module
The registry is amazing
The documentation on Terraform’s provider registry is so easy to use that I reference it instead of the cloud provider documentation most of the time. Resources and data sources are easy to find and understand. The documentation is terse, yet covers all of the possible arguments, and comes with examples.
The basics are easy
When you’re learning how to use Terraform, you’ll find some good resources that help you deploy a few simple resources, or compose a module out of some resources and call that module from your Terraform “root” module. There’s not much disagreement on the basics, so this advice is echoed all over the place.
Advanced concepts are hard
As your Terraform codebase grows in size and complexity, you’ll find yourself having to perform refactors. Some of these are easy enough, for example, breaking resources out into modules. However, things get more difficult when you have to deploy additional copies of your whole infrastructure into multiple environments and regions without ending up with a lot of code duplication. On the one hand, you want your code to be DRY (don’t repeat yourself) but on the other hand you need to configure it differently depending on where it is and what it’s used for. Raw Terraform doesn’t try to solve this problem for you. If you look for advice, the most common approach might lead you to a code structure that looks like this:
```
terraform/
├── environments
│   ├── dev_eu2
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   ├── prod_eu1
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   ├── prod_us1
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   └── prod_us2
│       ├── main.tf
│       ├── outputs.tf
│       └── variables.tf
└── datacenter
    ├── main.tf
    ├── variables.tf
    ├── outputs.tf
    └── modules
```
Each of the subdirectories underneath ./environments represents a distinct Terraform “root directory” with its own state backend, its own variables and its own outputs. Each one calls the datacenter module with parameters for that environment; these root directories should be kept as simple as possible. That doesn’t seem too bad! We can try to minimize the amount of “code” in these separate environments, along the lines of the sketch below. Right?
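To make that concrete, here’s a minimal sketch of what one of those per-environment root modules might contain (the module’s variable names and values here are illustrative, not from a real project):

```hcl
# environments/prod_eu1/main.tf
module "datacenter" {
  source      = "../../datacenter"
  environment = "prod"
  region      = "eu-west-1"
  vpc_cidr    = "10.10.0.0/16" # hypothetical per-environment value
}
```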
Well, I have seen this strategy turn into a huge mess. As the infrastructure gets more complex, those `main.tf` files get broken up into service-specific files like `database.tf`, `sqs.tf`, `storage.tf` and `eks.tf`. If this turns into a sprawl, then these might be further refactored into modules within the environments. No really, I’ve seen this play out, and a relatively normal-sized infrastructure ended up spawning 35,000 lines of Terraform code.
Avoiding duplication
Once you get to this problem, there are no simple answers. None of the usual Terraform learning resources cover this, in much the same way that a Python course that tells you about the language doesn’t explain how your large software project’s class inheritance should work. Much of the advice that will be offered to you should be taken with caution: “Pay for $THIS_SERVICE!” “Wrap Terraform in $THIS_CODE_GENERATOR!” “Just switch to $OTHER_IAC_LANGUAGE which doesn’t suck!”
“Beware of one-way doors!”
A one-way door is a decision that is difficult to revert. If you decide to pay for a Terraformy SaaS service, you may end up writing code that only works with that service. If you later run into an issue, you might find yourself painted into a corner. How much code refactoring is necessary? Is it so bad that it’s easier to just… get another job somewhere else?
Whiera? (‘Why hiera?’ …shut up, it’s funny)
While I was trying to solve this problem at a startup a few years back, I strongly considered Terragrunt. But I kept getting… bad smells. I couldn’t stop myself from wondering why there wasn’t a Hiera equivalent for Terraform. I had used Hiera heavily when I worked as an SRE at Qualtrics, where we used Puppet to deploy and manage many thousands of machines.
What is a hierarchical data lookup?
Hiera lets you define a hierarchy of data sources in order of importance. When you perform a key lookup, Hiera searches every level of the hierarchy and applies the results in reverse order of priority, starting with low-priority values and overriding them with higher-priority values.
An example hierarchy
Two very common hierarchy levels are region and environment, because they typically require separate infrastructure with different configurations. So here’s an example hiera.yaml that uses these two levels along with a common level that can be used as a fallback:

```yaml
hierarchy:
  - name: Region
    path: region/%{region}.yaml
  - name: Environment
    path: environment/%{environment}.yaml
  - name: Common
    glob: common/*.yaml
```
Some hierarchical data files…
Now we’ll create a data structure based on the example hierarchy:
```
├── common
│   ├── database.yaml
│   └── tags.yaml
├── environment
│   ├── development.yaml
│   ├── staging.yaml
│   └── production.yaml
├── region
│   ├── eu-west-1.yaml
│   └── us-west-2.yaml
├── hiera.yaml
└── README.md
```
Hiera asks, “Where am I?”
How does Hiera know which region or environment it’s in? If you’re using Hiera with Puppet, then this information is usually sourced by the puppet agent on the remote device, which can check its local environment. With Terraform, we need to pass this information to the hiera provider when we initialize it:
provider "hiera5" {
scope = {
environment = var.environment
region = var.region
}
}
You could pass these into Terraform using command-line arguments, .tfvars files etc. My own preference is to use Terraform workspace names that encode this information: for example, you could name your workspace `prod_us-west-1` and then parse the variables by splitting the `terraform.workspace` value.
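If you go the workspace-name route, here’s a minimal sketch of that parsing, assuming workspace names always follow the environment_region pattern:

```hcl
locals {
  # e.g. "prod_us-west-1" -> ["prod", "us-west-1"]
  workspace_parts = split("_", terraform.workspace)
  environment     = local.workspace_parts[0]
  region          = local.workspace_parts[1]
}
```

You can then initialize the hiera5 provider scope from `local.environment` and `local.region`.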
When we’re deploying our database cluster, we want to keep costs low in staging, but we want production to be bulletproof.

./hiera/environment/staging.yaml

```yaml
database:
  replicas: 2
  instance_type: m6a.small
  backup_retention: 2
```

./hiera/environment/production.yaml

```yaml
database:
  replicas: 3
  instance_type: m6a.large
  backup_retention: 7
```

./hiera/common/database.yaml

```yaml
database:
  replicas: 1
  instance_type: t3a.medium
  backup_retention: 0
```
So if we assume that we are deploying a production environment, and we are using “deep merge” behavior, then the Hiera lookup for the key `database` will return the following:
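```yaml
# Every key in common/database.yaml is overridden by environment/production.yaml
database:
  replicas: 3
  instance_type: m6a.large
  backup_retention: 7
```

On the Terraform side, consuming that lookup might look something like this sketch. I’m assuming the hiera5 provider’s hash data source here, so treat the exact data source name and attributes as assumptions and check the provider’s documentation:

```hcl
data "hiera5_hash" "database" {
  key = "database"
}

module "database" {
  source        = "./modules/database" # hypothetical module
  replicas      = data.hiera5_hash.database.value["replicas"]
  instance_type = data.hiera5_hash.database.value["instance_type"]
}
```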
How about some examples?
A short explanation of a Hiera lookup goes something like: “for each key, Hiera starts at the least specific source, and allows more specific sources to override previous results”. That’s not really enough to nail the explanation. There are two key things that Hiera needs to know before it can perform a lookup.

First, the hierarchy of data sources, from least to most specific:

```yaml
hierarchy:
  - Common
  - Environment
  - Region
```

Second, the known variables. Let’s say I tell Hiera that my environment is “production” and my region is “eu-west-1”:

```yaml
environment: production
region: eu-west-1
```

The lookup
When I search for a key like “db_replica_count”, Hiera first checks in Common and finds the value 1. Then it checks the production environment source; if it finds a value there (say, 3), that overrides the 1. Finally it checks the eu-west-1 region source, which can override the result again. Whatever survives the last, most specific source is the answer.
Don’t expect official guidance to save you here. The Terraform exam does make sure that you understand how tfstate and remote state work, but once you get into the really tricky stuff, the official recommendation is to pay for Terraform Cloud or Terraform Enterprise. You might find some suggestions for handling separation of environments, but pretty quickly people start recommending other tools (like Terragrunt) to work around Terraform’s gaps. You won’t find much advice on how to abstract your modules, or how many layers of modules you need. All of that is left as an exercise for you to figure out. Ka-ching.
Scaling issues
How do you:
- Deploy your platform in different regions or environments without duplicating code?
- Reduce the blast radius of changes in a large codebase?
As an alternative to this approach, some people might recommend that you pay for Terraform Cloud. Others might tell you that Terragrunt is amazing. For a number of my own reasons, I disagree:
- Terraform Cloud is a paid service
- Terragrunt is used to generate Terraform code
The problem
When you start building small, simple infrastructure, Terraform is a godsend. As your needs grow and your codebase increases in size and complexity, you’ll come to a few forks in the road, and at each one you have choices. Each of these choices is a “one-way door”: a choice that will be hard to undo without a major code refactor. So it’s important to make the right choices.
1. Keep your code DRY
No matter what language you’re coding in, you should keep your code DRY; in other words, “Don’t Repeat Yourself”. Repeating chunks of code is a bad idea: it increases the number of lines of code, and it means that changes and fixes need to be made in multiple places. Keeping your code DRY reduces the size of the codebase and limits the number of places that code must be updated when something changes. Depending on the language and the complexity of the codebase, this might lead you to prefer composition over inheritance. But at a very basic level it requires that you write reusable, modular code.
Terraform challenges
Terraform modules let you define reusable blocks of infrastructure. When we use a module, we can pass parameters to the module’s variables. This lets us configure each instance of the module differently. Here, for example, we deploy two VPCs with different names and CIDRs:
module "production" {
source = "./modules/vpc"
name = "production"
cidr = 10.0.1.3/24
}
module "staging" {
source = "./modules/vpc"
name = "staging"
cidr = 10.5.1.3/24
}
We already see some repetition above. This type of repetition is hard to avoid. But this is OK, nothing is hidden from us, and it’s easy to read. The problem gets worse when we need multiple copies of a whole collection of infrastructure. For example, imagine we need a second data center that is a copy of the original data center. The only differences are the AWS region, CIDR addresses and some globally-unique names (DNS, S3 etc). Imagine that our codebase is structured as follows:
```
├── main.tf
├── outputs.tf
├── providers.tf
└── modules
    ├── eks
    │   ├── cluster.tf
    │   ├── irsa.tf
    │   ├── nodes.tf
    │   ├── outputs.tf
    │   └── variables.tf
    ├── iam
    │   ├── outputs.tf
    │   ├── policies.tf
    │   ├── roles.tf
    │   └── variables.tf
    ├── network
    │   ├── outputs.tf
    │   ├── peering.tf
    │   ├── subnets.tf
    │   └── vpc.tf
    └── s3
        ├── bucket.tf
        ├── encryption.tf
        ├── outputs.tf
        ├── replication.tf
        └── variables.tf
```
How do we handle the second or third data center without copying a lot of code?
Multiple data centers
In the `main.tf` we deploy everything that our data center requires by using the modules. Our data center has a VPC with 3 subnets, 15 S3 buckets, 2 EKS clusters and 6 SQS queues. If we need to add one more S3 bucket, we can add a new definition to the `main.tf` that uses the s3 module one more time:
module "backup_bucket" {
source = "./modules/s3"
name = "backups"
}
These resources and the relationships between them are defined in the `main.tf` in the root of the project. If we need extra SQS queues, we can invoke the module again from our root module. Does this approach scale? Somewhat. You see, it’s very likely that you will need to deploy multiple copies of your entire infrastructure. Imagine your company needs to deploy a data center in the US and another in Europe. Both data centers should be largely similar - perhaps sized differently, or with different security configuration. How do you deploy two data centers without code duplication?
When you first start writing Terraform code, to deploy just a virtual machine and an object store bucket, you could use a single file. If you’re deploying five or six different resources and your single file gets too large, you can split it into many files (their names don’t matter). Puppet grows in much the same way. Let’s imagine that you want to use the built-in Puppet type `package` to install 30 packages on all of your servers. You could write something repetitive like this:
```puppet
package { 'openssh-client':
  ensure => installed,
}

package { 'nfs-client':
  ensure => installed,
}

package { 'chrony':
  ensure => installed,
}
```
Which is tedious. But then you realise that your front-end hosts need `nginx`, and your file servers need `nfs-server`, `samba` and `clamav`. More tedious lists of packages strewn across your Puppet code.
Why keep config/data and code separate?
There are many good reasons to keep configuration data out of your code.
Code with data in it is not DRY
Over time, a codebase that has configuration data mixed through the logic will balloon in size, as new copies of the code are required to create a new instance of a resource, create a new environment, or deploy the product into a new region.
Data tends to change frequently
Once you have written a Terraform module, you can use it many times, passing each instance a different configuration. The module itself stays the same while the configurations will change.
Changes to data are easy to review
It’s pretty easy to eyeball a change to YAML data and understand it. You can clearly see when data has been added, removed or modified. Code is more difficult to audit because it is more verbose, which leads to increased cognitive load during code reviews.
High-change repositories have more problems
When you work on a large repository that changes a lot, you are more likely to run into problems outside your “job”. These include extra delays, merge conflicts, test failures and code reviews.
Since it’s far easier to review a change to a list than it is to review a logic change in a 200-line Terraform file, keeping the configuration data separate simplifies reviews and takes pressure off developers. Data changes happen more frequently than code changes, and they should be simple (add an item to a list); mixing data in with your logic turns every small data change into a full development cycle: branch, merge request, merge conflicts, reviews, merges. Mixing the two also results in a large, sprawling codebase that’s difficult to work with.
If you’re using Hiera, a file-based YAML data store that supports hierarchical lookup, then you can do things in a much better way.
```puppet
$package_list = hiera('packages', {})
create_resources('package', $package_list)
```
All that’s left to do is define those packages in Hiera!
```yaml
packages:
  openssh-client:
    ensure: installed
```
Or imagine you’ve written a Puppet module that installs OS packages from an online repository. You can invoke the module multiple times, providing a package name each time, along with a value to control whether it is installed or not.
But if you’re using Hiera (a file-based YAML data store) to keep your data separate from your Puppet code, then your YAML might look like this:
```yaml
packages:
  nodejs: enabled
  python3: enabled
  vscode: enabled
  chromium: disabled
  firefox: enabled
```
The Puppet code you could use to invoke this would iterate over the packages hash and pass each name and state to the module, much like the create_resources example above.
DSLs are great
They’re great!
DSLs sound great, what’s the catch?
When you abstract away the nitty-gritty details behind a simple set of commands, you simplify 80% of the operations that need to be done at the expense of making 20% of the tasks more difficult or downright impossible. (Yes, I made those percentages up using the foolproof 80/20 rule)
The type of problem you run into with your DSL depends on what your DSL is, and what you’re trying to do with it (that it doesn’t make easy). But I’ll give you some types of problems I have encountered with Puppet and Terraform.
Puppet problems
Puppet’s strict, declarative if/then/else logic can make otherwise simple conditional tasks awkward.
Terraform problems
`for_each` has sharp edges.
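To illustrate the kind of `for_each` friction I mean (this example is mine, not the author’s): `for_each` keys must be known at plan time, so you can’t drive it directly from another resource’s computed attributes.

```hcl
# Fine: the set of keys is known at plan time.
resource "aws_s3_bucket" "static" {
  for_each = toset(["logs", "assets", "backups"])
  bucket   = "mycompany-${each.key}" # hypothetical naming scheme
}

# Not fine: these keys are only known after apply, so Terraform
# rejects the plan with "Invalid for_each argument".
# resource "aws_s3_bucket" "per_vpc" {
#   for_each = toset([aws_vpc.main.id])
#   bucket   = "mycompany-${each.key}"
# }
```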
If you’re a software engineer, you may not have many painful experiences with DSLs. You do most of your work in an imperative language like C, Rust, Java, Python or JavaScript. When you work with a DSL, it may be embedded in, generated by, or completely abstracted by an imperative language. Think about React, or SQL (which can be abstracted with an ORM).
If you’re some other type of computer engineer (DBA, systems) then you might have butted heads more directly with DSLs. I spent a large part of the 2000s fighting with Puppet.
DSLs in general
Pros
The extremely focused nature of a DSL means that, as long as you’re familiar with the domain it operates within, it is:
- Easy to learn
- Low on ‘boilerplate’
Cons
Think of a DSL as being like an oven with preset cooking modes. What happens when you want to cook something that doesn’t have a suitable mode, like a slow-cooked lamb casserole? Or what happens if you want to cook 5 pizzas, so you need the timer to sound an alarm to let you know when each pair of pizzas is cooked?
Terraform’s problems (specific)
Terraform’s biggest issue is that it gets in the way of DRY (don’t repeat yourself) code. Every large Terraform codebase I’ve seen has had a lot of repetition.
Deploying a data center
Let’s imagine that our start-up wants to deploy a simple “developer” data center in the AWS cloud. It has a VPC and an EKS cluster.
Good practice dictates that you write a module for each type of infrastructure and call them as required from the root of your Terraform project, which is known as the Terraform “root module”. The root module is where you run `terraform init` and `terraform apply`. It might only have a `main.tf` that looks like this:
provider "aws" {
region = eu-west-2
}
module "vpc" {
source = "./modules/vpc"
name = "dev"
cidr = "10.1.1.2/24"
azs = 2
}
module "eks_cluster" {
source = "./modules/eks_cluster"
cluster_name = "dev"
cluster_version = 1.24
}
We can see that there are values hard-coded into the `main.tf`. Maybe we should break those out into a .tfvars file?
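A minimal sketch of that refactor (the variable names here are my own invention):

```hcl
# variables.tf
variable "vpc_cidr" {
  type = string
}

variable "eks_version" {
  type = string
}
```

```hcl
# dev.tfvars
vpc_cidr    = "10.1.1.0/24"
eks_version = "1.24"
```

You’d then run `terraform apply -var-file=dev.tfvars`, and the hard-coded values move out of `main.tf`. But notice that every new environment now needs its own .tfvars file.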
Deploying another data center
You followed sage advice and wrote modular Terraform code. The complex details of each piece of infrastructure are hidden away in reusable modules. But the `main.tf` in your root module has hard-coded values for the developer data center. The “root” of your main Terraform project invokes those modules with environment-specific parameters:
provider "aws" {
region = eu-west-2
}
module "eks_cluster" {
source = "./modules/eks_cluster"
cluster_name = "staging"
cluster_version = 1.24
}
If your next data center is in the US and is a production data center, you still have to write another “root” Terraform module where you can run `terraform init` and `terraform apply`, and that new root module ends up repeating a lot of the code from the first one. Part of the issue is that Terraform is a DSL designed to be used directly: it’s the primary language that you use to deploy infrastructure. Some people choose to generate Terraform code with a “wrapper” tool like Terragrunt or Dhall, but those tools appeared afterwards, to solve some of the inherent problems that you sometimes run into with DSLs.