Testing Infrastructure as Code – 3 Lessons Learnt

In the last few weeks, I took a deep dive into Infrastructure as Code. I chose AWS and Terraform to write my provisioning scripts.

It came naturally to me, as a software engineer, to write Terraform code. A lot of software design principles (like KISS, DRY or even SOLID to some extent) can be adapted to write quality IaC. The intention is to end up with small, decoupled modules, used as building blocks for provisioning environments. Still, I felt a bit uncomfortable working without TDD, or indeed any automated tests at all. Are automated IaC tests useful (besides improving my well-being)?

Developers, DevOps engineers, software engineers: we always verify that our code works as expected, even if not in an automated manner. In the case of Terraform, running terraform plan (a command that creates an execution plan but doesn't deploy anything) checks the syntax and verifies that all resources are correctly defined.

Manual functional testing involves logging into the management console and verifying all properties of the deployed resources: access-control lists, connectivity, and so on. This is time-consuming and cumbersome, but necessary.

Operability practices aim to support frequent deployments, which also means constant changes to the underlying infrastructure. In this case, manual testing of the IaC is inefficient and may not add as much value as expected. For this reason, I decided to take some time and evaluate the automated testing tools for IaC. Below are three valuable lessons I learned.

1. Testing modules with Terratest isn’t even close to unit testing.

The tool of choice for automated Terraform tests is Gruntwork's Terratest, a Go framework that is actively developed and gaining popularity.

In the beginning, it was tempting to think about module tests as if they were unit tests. When you unit-test a function used in your application, you don't need to run the application. The tests are short, simple, and examine a particular piece of code in isolation (including input values that trigger errors). Correctness means that the output of the function under test is as expected, and we care about broad test coverage.

Module testing in Terratest is different. You write example infrastructure code using the module you want to verify. Terratest deploys the code and runs your tests against it. The tests should answer the question: “does the infrastructure actually work?” For example, if your module deploys a server with a running application, you could send some traffic to it to verify that it responds as expected.

Examining a resource's properties (loaded from an API) is rarely practiced with Terratest. It can be useful when a false-negative test result would introduce high risk. As a result, module testing with Terratest looks almost like end-to-end testing.

2. There are other tools to complement Terratest.

Sometimes, end-to-end tests are not enough. For example, suppose your private network accidentally has a route to the gateway. To confirm that the private network is really private, it is convenient to check that no route in its routing table lets public traffic in.

You could also picture a situation where the operations team lets development teams create their own resources. You may need to ensure that the implemented code follows security standards and compliance obligations: for example, all resources should be correctly tagged, all storage should be encrypted, some ports should never be open, and so on.

In addition to Terratest, several other testing tools are more convenient for testing specific resource properties. One of them is terraform-compliance. You can encode all your policies in “spec” files, similar to Cucumber specs, and run them against the output of terraform plan:

Feature: Define AWS Security Groups
  In order to improve security
  As engineers
  We will use AWS Security Groups to control inbound traffic

  Scenario: Policy Structure
    Given I have AWS Security Group defined
    Then it must contain ingress

  Scenario: Only selected ports should be publicly open
    Given I have AWS Security Group defined
    When it contains ingress
    Then it must only have tcp protocol and port 22,443 for 0.0.0.0/0

This spec would yield an error if any of your security groups allowed public inbound traffic on a port other than 22 or 443.
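The policy that spec encodes boils down to a simple predicate over ingress rules. To make the logic concrete, here is a plain-Go rendering (the IngressRule struct is a hypothetical, simplified view of a security group rule, not part of any tool's API):

```go
package main

import "fmt"

// IngressRule is a hypothetical, simplified view of a security group rule.
type IngressRule struct {
	Protocol string
	Port     int
	CIDR     string
}

// compliant reports whether every rule open to the whole internet
// (0.0.0.0/0) uses TCP on one of the allowed ports -- the same policy
// the terraform-compliance spec above expresses.
func compliant(rules []IngressRule, allowedPorts map[int]bool) bool {
	for _, r := range rules {
		if r.CIDR != "0.0.0.0/0" {
			continue // not publicly open; the policy doesn't apply
		}
		if r.Protocol != "tcp" || !allowedPorts[r.Port] {
			return false
		}
	}
	return true
}

func main() {
	allowed := map[int]bool{22: true, 443: true}
	ok := []IngressRule{{"tcp", 443, "0.0.0.0/0"}, {"tcp", 5432, "10.0.0.0/16"}}
	bad := []IngressRule{{"tcp", 8080, "0.0.0.0/0"}}
	fmt.Println(compliant(ok, allowed))  // true
	fmt.Println(compliant(bad, allowed)) // false
}
```

The advantage of terraform-compliance is that such rules are evaluated against the plan before anything is deployed, so a violation is caught at review time rather than in a running environment.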

If you feel more comfortable testing deployed resources, and you work with AWS, you could try AWSSpec. AWSSpec is built on top of Ruby's RSpec testing framework. The tests are BDD-style specs; the difference is that you run them against the real infrastructure. As with Terratest, to test modules you need to deploy example code first. You could automate the deployment and verification using Test-Kitchen (along with the Kitchen-Terraform plugin). For example, testing a private subnet may look like this:

require 'spec_helper'
describe subnet('Default Private Subnet') do
  it { should exist }
  its(:cidr_block) { should eq '10.10.2.0/24' }
  its(:state) { should eq 'available' }
end
describe route_table('Private Subnet Route Table') do
  it { should exist }
  it { should have_subnet('Default Private Subnet') }
  it { should have_route('10.10.0.0/16').target(gateway: 'local') }
  it { should_not have_route('0.0.0.0/0') }
end

Executing the tests then produces a standard RSpec report of passing and failing expectations.

3. Automated IaC tests are expensive.

The cost of IaC testing doesn't only include the charges for the resources deployed for testing. Writing automated IaC tests requires good programming skills that may span more than one programming language. (Terratest uses Golang, terraform-compliance uses Python, AWSSpec uses Ruby, etc.)

Writing Terraform tests is time-consuming. The cloud APIs aren't convenient to use, and helper libraries may lack important functions. In the case of Terratest and AWSSpec, a lot of additional infrastructure code is needed for module testing.

Many tools, although quite useful, aren’t yet mature. There is always a danger that they will cease to work with newer versions of Terraform or just be discontinued.

Summary

Should I recommend investing time and money into automated IaC testing? That depends. First of all, your team should focus on using Terraform the right way. This means no direct, manual changes to the infrastructure.

Once delivering IaC works well, your team may consider adding automated tests.

If a change may introduce any kind of risk that can’t be accepted, then it’s a good candidate for an automated test. Another factor to consider is the team topology. If the IaC ownership is decentralised then automated tests may help to ensure code consistency, compliance, and quality.

Is it OK to give up automated IaC testing for now? If you can't introduce automated IaC testing, you can rely on other confirmation techniques, such as blue/green deployments with comprehensive monitoring. Although they are not substitutes for each other, both can help verify the correctness of the infrastructure.

 

We’re often asked to help our clients build out their capability in the cloud or to help them reduce infrastructure costs. Being able to do both at the same time is the icing on the cake!

Our work with one of our clients enabled it to reduce its monthly AWS spend by 66% within 8 months. More satisfying still, we were able to do so while improving its cloud capability, steadily increasing availability and ramping up deployments fivefold, from an average of 3 to 15 per month.

Equal Experts was initially engaged by the client to add features to its existing AWS-based customer rewards platform. However, given the ambitious scope of the work, it became clear that the existing infrastructure would not scale to meet increasing performance requirements. Soon the engagement grew to include a complete rewrite, both of the client's platform and of its companion Business Analytics reporting platform.

Put through its paces

The new platform has already been substantially road-tested by ticket sales for performances by major artists, and didn't break a sweat as we registered and enrolled 230% more new customers per second. As well as being much cheaper to run, it has also proved much simpler to manage and better performing, as every part of the new platform is fully automated and simple to rebuild completely from scratch.

New environments

The client's cost reduction was achieved by performing a managed cutover to a set of completely new environments. The new environments leverage Docker in Swarm clusters and refactor the old code into microservices. These microservices can be scaled up and down independently of each other; an nginx ingress tier and standard service-discovery-based load-balancing ensure that customer traffic is balanced across all available copies of a microservice. This also means that a container can be removed from the pool with no impact, making our deployments zero-downtime. The net result is much more efficient use of memory and compute in AWS. As both back-end and front-end microservices for the platform were moved to Docker across web, iOS and Android, this also made the platform more consistent and significantly reduced the complexity around change.

Decommissioning

Another serious source of cost-saving was the decommissioning of an extensive Hadoop cluster used for business analytics and daily reports. The reports being run were fairly small and did not really require the overhead of MapReduce processing. Our Tech Lead produced a radically simplified solution using a single microservice to do the data processing and the platform's MongoDB replica set for storage.

In addition to huge cost savings (we decommissioned 10 extra-large Hadoop data nodes and 2 name nodes as well as about 25 other instances), the migration moved the client onto open-source, widely-used service-discovery and load-balancing solutions, away from proprietary end-of-life software packages. The Business Analytics solution, which used to have a feed failure on average twice a week, has had no transfer failures since it went live in November last year.

Trusted team

The client asked us to use this platform to roadshow a couple of other improvements it was working on internally; it's using us to test its new architecture patterns initiative and has designated us its only ‘trusted team’ currently, meaning that we don't need to give 24 hours' notice in order to make production changes. This is based on our high availability and very low mean time to fix, as well as the success rate of our deployments. This won't be the last client team to enjoy huge cost reduction with improved stability and agility.

Overall this is a great example of how we can add value to our clients in a short time, whilst in parallel laying the groundwork for better infrastructure in the future that’s consistent to manage, simple to maintain and can easily adapt to business opportunities.