Save money by shutting down idle AWS EC2 instances (part 2)

Sample Instance

For simplicity, I'm going to show how to automatically shut down a single AWS instance named "wollaston" (the name comes from a popular Boston city beach). But the method could easily be extended to any number of instances, such as those tagged as dev or test.

Monitoring

To get started, I'm using Prometheus to collect load data for our test system every thirty seconds, which is the system-wide default scrape interval.
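
As a point of reference, a minimal scrape configuration for this kind of setup might look like the sketch below. It assumes the instance exposes metrics through node_exporter on port 9100; the job name and target are placeholders, not my exact config.

```yaml
# prometheus.yml -- a minimal sketch; job name and target are assumptions
global:
  scrape_interval: 30s              # the system-wide default mentioned above

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["wollaston:9100"] # node_exporter on the sample instance
```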

From there, I'm arbitrarily defining a load of less than 0.10 as idle (line 173). Your threshold may be higher or lower depending on your systems' baseline. You could just as easily choose other metrics, like memory usage or how many packets are moving across the network interface, to determine whether a system is busy.

To get the load, I'm calculating an average over time rather than using a single data point. An average smooths out spikes or aberrations in the data that could be misleading. For demonstration, the window is quite small at just two minutes (line 173) so that alerts fire more quickly. A larger window averages things more reliably over a longer time period.

With the sliding average in place, the next step is to define a duration: how long the average has to stay below the 0.10 threshold before the alert fires. Again for demonstration, I'm using five minutes, but a production value would be much larger (line 174).

To summarize, Prometheus will alert when the instance's two-minute average load has been below 0.10 for five minutes or more. Note that the alert config also includes a custom label of "status" with a value of "idle". This will be necessary for the notification step.
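
Putting those pieces together, an alert rule along these lines would do the job. This is an illustrative sketch rather than my exact rule file: the metric name (node_load1 from node_exporter), the instance label, and the alert name are assumptions.

```yaml
# rules/idle.yml -- illustrative sketch, not the file the line numbers above refer to
groups:
  - name: idle-instances
    rules:
      - alert: InstanceIdle
        # two-minute sliding average of the one-minute load average
        expr: avg_over_time(node_load1{instance="wollaston:9100"}[2m]) < 0.10
        for: 5m               # the average must stay below the threshold this long
        labels:
          status: idle        # custom label used to route the alert in the next step
        annotations:
          summary: "wollaston has been idle for at least 5 minutes"
```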

Alerting

Once the threshold has been met, Prometheus will trigger an alert, so we have to tell Prometheus's Alertmanager what to do with it.

In this bit of config, we're looking for alerts that have a "status" label with a value of "idle". If that matches, the alert is sent to the defined ec2_webhook receiver.
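
The routing section of that config might look roughly like this. It's a sketch that assumes other alerts fall through to a default receiver; only the status="idle" match and the ec2_webhook receiver name come from the setup described here.

```yaml
# alertmanager.yml (routing) -- a sketch; the default receiver is an assumption
route:
  receiver: default           # everything else goes to the normal receiver
  routes:
    - matchers:
        - status = "idle"     # only alerts carrying the custom idle label
      receiver: ec2_webhook   # hand them to the webhook receiver defined below
```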

This part of the Alertmanager config defines the webhook receiver, which sends an HTTP POST request. The destination is defined in the url: field; here the value is obscured as <secret> for security. The actual URL points at our GitLab server and is used to trigger a CI/CD pipeline.
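
A sketch of that receiver definition is below. The real URL is kept secret, so the value shown here is just a placeholder standing in for the GitLab trigger endpoint.

```yaml
# alertmanager.yml (receivers) -- sketch; the url value is a placeholder
receivers:
  - name: ec2_webhook
    webhook_configs:
      - url: "https://gitlab.example.com/path/to/pipeline/trigger"
        send_resolved: false  # only act when the alert fires, not when it resolves
```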

GitLab Pipeline

The pipeline being run is referred to as a "trigger" pipeline, meaning it runs when explicitly triggered rather than via a code commit or branch merge. In this case the trigger is an HTTP request in the specific format GitLab expects for this repo.

The pipeline runs the following steps: it sets up a Python virtual environment and then runs a script called stop-instance.py.
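
As a sketch, the job definition could look something like the following .gitlab-ci.yml. The job name, image, and rules clause are assumptions; the only details taken from the pipeline described here are the virtual environment and the stop-instance.py script.

```yaml
# .gitlab-ci.yml -- a minimal sketch; job name and image are assumptions
stop-idle-instance:
  image: python:3.12-slim
  rules:
    - if: '$CI_PIPELINE_SOURCE == "trigger"'  # run only when started by the trigger
  script:
    - python -m venv venv
    - source venv/bin/activate
    - pip install boto3
    - python stop-instance.py
```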

stop-instance.py

This section shows the portion of the script running on a GitLab server that performs the shutdown. It's using the boto3 module to interact with the AWS API directly.

The script connects to AWS as a regular user and then assumes a role that has permission to call the stop() method on line 72. It then returns the result to the caller.
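
A sketch of that portion is below, assuming the script uses STS to assume the role and the boto3 EC2 resource API to issue the stop. The role ARN, instance ID, and region are placeholders, and the line numbering will not match the original script.

```python
# stop-instance.py (shutdown portion) -- a sketch, not the exact script
import boto3

ROLE_ARN = "arn:aws:iam::123456789012:role/ec2-stopper"  # placeholder role ARN
INSTANCE_ID = "i-0123456789abcdef0"                      # placeholder instance ID
REGION = "us-east-1"                                     # placeholder region


def stop_instance():
    # Connect as the regular IAM user (credentials come from the environment),
    # then assume the role that is allowed to stop the instance.
    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn=ROLE_ARN, RoleSessionName="stop-idle-instance"
    )["Credentials"]

    # Build an EC2 resource handle using the temporary credentials from the role.
    ec2 = boto3.resource(
        "ec2",
        region_name=REGION,
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

    # Call stop() on the instance and return the API response to the caller.
    return ec2.Instance(INSTANCE_ID).stop()


if __name__ == "__main__":
    print(stop_instance())
```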

AWS Permissions

Configuring AWS permissions is somewhat beyond the scope of this post, but I wanted to include a sample configuration just for completeness. Obviously if you're using a different cloud provider, this section won't apply directly.

I'm only showing the resulting JSON objects; I used the web GUI to create them for the purposes of this article, but you could also use the CLI, a script, a CloudFormation template, etc.

We're following the principle of least privilege here, which means granting only the permissions that are actually needed. We do this by letting a user assume a role that has permission to stop the instance. Because the role's credentials are created temporarily at run time, this approach is more secure than granting blanket permissions directly to the user.

We begin by creating an IAM user "admin1". It has no inline policies.

Then we define a set of permissions in the form of an IAM policy. This policy allows read access on EC2 instances so they can be described. It also very strictly limits the StopInstances action to a single resource. If you needed to stop additional instances, you would add them in the Resource section.
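
Here is what such a policy could look like; the account ID, region, and instance ID below are placeholders rather than the real values.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowDescribe",
      "Effect": "Allow",
      "Action": "ec2:DescribeInstances",
      "Resource": "*"
    },
    {
      "Sid": "AllowStopSingleInstance",
      "Effect": "Allow",
      "Action": "ec2:StopInstances",
      "Resource": "arn:aws:ec2:us-east-1:123456789012:instance/i-0123456789abcdef0"
    }
  ]
}
```

Note that ec2:DescribeInstances doesn't support resource-level restrictions, which is why its Resource is a wildcard while StopInstances is pinned to a single instance ARN.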

Then we create a role that will use these permissions and allow our admin1 user to assume it.
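
The role's trust policy is what allows admin1 to assume it; a sketch, again with a placeholder account ID, looks like this.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:user/admin1"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```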

The last step is to attach the policy to the role (not shown).

With the permissions configured, we can run the script via the pipeline.

Pipeline Output

This snippet of pipeline output shows the script running and successfully stopping the instance. And because it runs in a pipeline, we have a complete record of how, when, and why the instance was stopped.

Conclusion

Hopefully you now have some ideas about how to automatically shut down EC2 instances that aren't doing anything other than spending money. This fundamental idea of taking automated action based on monitoring alerts can be extended in all kinds of different ways. Granted, it takes some time to get all the pieces configured and working together, but once complete you'll have automation in place that can save money and time while responding to events in real time.
