Save thousands on AWS: A step-by-step guide to cleaning up idle VPC NAT gateways
The two most compelling problems facing the IP Internet are IP address depletion and scaling in routing. Long-term and short-term solutions to these problems are being developed... This memo proposes another short-term solution, address reuse, that complements CIDR or even makes it unnecessary. The address reuse solution is to place Network Address Translators (NAT) at the borders of stub domains.
-- RFC 1631, Kjell Egevang and Paul Francis, May 1994
Let’s turn back the clock nearly 30 years. The internet was no longer in its infancy, but firmly in the toddler phase. And, like many toddlers, it was a bit wobbly on its feet.
One of the early internet’s growing pains was the realization that the IP address specification only allowed for about 4.3 billion IP addresses (2^32). At the time, computer manufacturers like Apple, IBM, and Packard Bell were advertising the idea that everyone could (and should!) have a personal, connected computer. The only problem: the 1994 World Population Profile report estimated the world population at 5.6 billion. Clearly, the math didn’t add up.
To address this looming issue, members of the Internet Engineering Task Force came up with RFC 1631, the design document for a short-term solution to this problem: Network Address Translators (NAT). NAT allows for one computer, the NAT gateway, to make “proxy” requests on behalf of a private network of computers. The NAT gateway is the go-between, the only computer connected to both networks. NAT was supposed to be a placeholder until a more comprehensive strategy came about... but as we engineers are fond of saying, “There is nothing more permanent than a temporary fix.”
Fast forward to today, and NAT is a key part of network design. AWS offers NAT gateways as part of its Virtual Private Cloud, called VPC NAT Gateways. Knowing how to use VPC NAT gateways is an important tool in your AWS toolbelt.
Just as important: knowing when and how to get rid of VPC NAT gateways that are no longer in use. As with idle VPC endpoints, elastic load balancers, Elastic IP addresses and more, eliminating these seemingly small components can add up to significant AWS savings. Let’s dive in.
Table of Contents
- The what, how, and why of VPC NAT gateways
- Why idle VPC NAT gateways are costing you thousands of dollars
- Three steps to manually deleting idle VPC NAT gateways
- Eliminate idle VPC NAT gateways automatically with CloudFix
The what, how, and why of VPC NAT gateways
Let’s start with a closer look at how VPC NAT gateways work. In the diagram below, you can see that the internal instances (which all happen to be Graviton3s, one of our favorite instance types) have internal IP addresses of the form 192.168.0.XXX. The VPC NAT gateway has two network interfaces, attached to two different networks.
When one of the EC2 instances makes a request to an external network, the request is routed over the NAT gateway. The NAT gateway uses its external interface, makes the request to the external network, and then sends the results back to the right instance. Note that this layout only makes sense for requests originating on the private side of the network. If a request comes from the external network to the NAT gateway, it will not be forwarded to any of the internal nodes.
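To make the request flow above concrete, here is a minimal Python sketch of the bookkeeping a NAT device does. It is a toy model, not how AWS implements NAT gateways: the ToyNat class, the starting port of 40000, and the addresses are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Endpoint = Tuple[str, int]  # (ip, port)


@dataclass
class ToyNat:
    """Toy port-translating NAT: many private hosts share one public IP."""
    public_ip: str
    next_port: int = 40000  # arbitrary starting port for this sketch
    # public port -> private (ip, port) that opened the connection
    table: Dict[int, Endpoint] = field(default_factory=dict)

    def outbound(self, private_src: Endpoint) -> Endpoint:
        """Rewrite the source of an outgoing request and remember the mapping."""
        public_port = self.next_port
        self.next_port += 1
        self.table[public_port] = private_src
        return (self.public_ip, public_port)

    def inbound(self, public_port: int) -> Optional[Endpoint]:
        """Forward a reply to the instance that asked; drop anything unsolicited."""
        return self.table.get(public_port)


nat = ToyNat(public_ip="203.0.113.25")
src = nat.outbound(("192.168.0.10", 51515))
print(src)                  # the address the external server sees
print(nat.inbound(src[1]))  # the reply is routed back to 192.168.0.10
print(nat.inbound(12345))   # unsolicited traffic has no mapping, so None
```

The last call is the important one: with no table entry, there is nowhere to forward the packet, which is why requests that originate on the external network never reach the internal instances.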
VPC NAT gateways have a number of handy use cases, such as:
- Securely accessing the internet from private subnets. If you have EC2 instances that need outbound access but not inbound access, like for downloading security updates, NAT gateways can make it happen. They’re also useful for allowing access to services such as S3, DynamoDB, or CloudWatch.
- Reducing data transfer charges. By routing AWS service-bound egress traffic from your private subnets through a NAT gateway into an interface VPC endpoint, you can avoid unnecessary data transfer charges.
- Leveraging Elastic IP addresses. With NAT, you can have multiple compute resources associated with a single IP address. This is useful if the IP address is known/trusted with other entities. For example, if you are operating a fleet of web crawlers to create a search engine and you want them all to operate under one trusted IP address, using NAT would be the way to go. With NAT, websites that want their content indexed by your search engine can add your known IP address to an allowlist.
These are all excellent reasons to use NAT gateways and illustrate why they’re a great resource across the AWS ecosystem. But what happens when you stop using NAT gateways? Drumroll please... nothing. They continue to exist, and you continue to pay for them, even though they’re no longer needed. Now let’s do something about it.
Useful aside:
If NAT was the short-term solution, then IPv6 is the long-term solution. With IPv6, there are about 3.4 × 10^38 (2^128) possible IP addresses, so there are zero worries of running out! When IPv6 is widely deployed, we probably won’t need NAT anymore, but that day’s a long way off. After all, 3.5" floppy disks are still in widespread use, especially in industrial machines.
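If you want to check the address-space arithmetic yourself, Python’s standard ipaddress module can count the addresses in each protocol’s full range. This is just a quick sanity check, not part of the cleanup process:

```python
import ipaddress

# The "everything" network for each protocol covers its whole address space.
ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses  # 2**32
ipv6_total = ipaddress.ip_network("::/0").num_addresses       # 2**128

print(f"IPv4 addresses: {ipv4_total:,}")
print(f"IPv6 addresses: {ipv6_total:.2e}")
print(f"IPv6 addresses per IPv4 address: {ipv6_total // ipv4_total:.2e}")
```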
Why idle VPC NAT gateways are costing you thousands of dollars
How do we end up with a stack of idle VPC NAT gateways?
Readers of our other fixer blogs will find the answers familiar. Just like with idle VPC endpoints, the reasons include:
- They’re left over from the development and testing process. During the development and testing phase of a project, it’s common to create ad-hoc VPC NAT gateways to facilitate secure communication between various virtual network components. Once the design has stabilized and the infrastructure is properly managed via IaC, the artifacts are no longer necessary. However, it’s easy to forget about them and simply move on to the next project. This can lead to orphaned infrastructure components that are neither managed nor tagged, making it difficult to remember their purpose over time.
- They were associated with retired services. Services become deprecated over time. We usually remember to delete most of the standing resources associated with these services, like EC2 instances and RDS databases, that take up the majority of the costs. We often forget, however, to delete the smaller components that go with the services. If the infrastructure contains VPC NAT gateways, they should be deleted too.
What are these idle VPC NAT gateways costing us? More than you would think.
NAT gateways are priced on an hourly basis. As of May 2023 in us-east-1, the hourly charge is $0.045/hr, which adds up to just under $400 per year ($0.045 × 8,760 hours ≈ $394). That doesn’t sound too bad at first, but like many of these smaller charges, it becomes significant at scale. A typical NAT gateway configuration includes at least two per region (one public, one private). If you’re operating in 10 regions, with 20 NAT gateways, it amounts to nearly $8,000 every year. And that’s not even the full scope: don’t forget about extraneous gateways that get created during the dev and test process. Suddenly, we’re talking thousands of dollars in potential AWS savings.
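The arithmetic is simple enough to script if you want to plug in your own numbers. A minimal sketch, assuming the May 2023 us-east-1 rate quoted above and gateways that run all year without processing any data (the function name is ours, not an AWS API):

```python
HOURLY_RATE_USD = 0.045    # us-east-1 hourly charge quoted above (May 2023)
HOURS_PER_YEAR = 24 * 365  # 8,760


def annual_idle_cost(gateway_count: int, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Fixed hourly charges for gateways that run all year and move no data."""
    return round(gateway_count * hourly_rate * HOURS_PER_YEAR, 2)


print(annual_idle_cost(1))   # 394.2
print(annual_idle_cost(20))  # 7884.0 (10 regions x 2 gateways each)
```

Rates differ by region and change over time, so check current pricing before relying on the result.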
Three steps to manually deleting idle VPC NAT gateways
We’ve seen why VPC NAT gateways are useful, how we end up with idle NAT gateways, and how much those idle NAT gateways cost. Next step: let’s get rid of them. Manually identifying and deleting idle NAT gateways involves three steps:
- List all of the VPC NAT gateways
- Determine which of these NAT Gateways are idle and eligible for deletion
- Delete the idle NAT gateways
1. List all of the VPC NAT gateways
We can find idle VPC NAT gateways by using our trusty friend the Cost and Usage Report (CUR). The query would look like:
SELECT line_item_resource_id,
       line_item_usage_start_date,
       line_item_usage_end_date,
       line_item_usage_type,
       -- CUR stores the cost as line_item_unblended_cost; aliased to keep the output short
       line_item_unblended_cost AS line_item_cost
FROM "your_aws_schema"."your_aws_cur_table"
WHERE line_item_line_item_type = 'Usage'
  AND line_item_usage_start_date >= date_trunc('day', current_date - interval '31' DAY)
  AND line_item_usage_start_date < date_trunc('day', current_date - interval '1' DAY)
  AND line_item_resource_id LIKE '%natgateway%'
  AND (line_item_usage_type LIKE '%NatGateway-Hours%'
       OR line_item_usage_type LIKE '%NatGateway-Bytes%');
Notice a few key facts about this query:
- We’re looking for rows where line_item_line_item_type is Usage. The data in these rows represent usage-based consumption of AWS resources. Other possible values of LineItemType are Fee, RIFee, Tax, etc. See the Line Item columns document for more information.
- We’re filtering for about one month’s worth of data (to be precise, the 30 days that start 31 days ago).
- The resource IDs are the NAT gateway identifiers.
- If we aggregate the data over the entire window, grouping by line_item_resource_id and line_item_usage_type, we can identify all the NAT gateways and determine which are in use and which are not.
| line_item_resource_id | line_item_usage_start_date | line_item_usage_end_date | line_item_usage_type | line_item_cost |
| --- | --- | --- | --- | --- |
| natgateway-xyz0987def0125rst999 | 2021-09-01 00:00:00 | 2021-09-01 01:00:00 | NatGateway-Bytes | 3.142 |
| natgateway-xyz0987def0125rst999 | 2021-09-01 00:00:00 | 2021-09-01 01:00:00 | NatGateway-Hours | 0.045 |
| natgateway-0x33338abp9898r023r3 | 2021-09-01 00:00:00 | 2021-09-01 01:00:00 | NatGateway-Bytes | 0.000 |
| natgateway-0x33338abp9898r023r3 | 2021-09-01 00:00:00 | 2021-09-01 01:00:00 | NatGateway-Hours | 0.045 |
2. Determine which of these NAT Gateways are idle and eligible for deletion
NAT gateways have a fixed hourly charge and a metered data charge. We can see both rows in the output from the CUR. We want to find NAT gateways that have the NatGateway-Hours charge, but no NatGateway-Bytes charges for a defined amount of time, like 31 days. You can do this in SQL using sum and join, or use a script in Python. Use whatever tool you’re comfortable with, but pick one that can easily use the AWS APIs. I find Python, using the boto3 library, to be the easiest way.
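As one possible take on the Python route, here is a small sketch that aggregates the query results and flags gateways with hourly charges but no data charges. It assumes each row is a dict keyed by the column names in the CUR output (for example, rows read with csv.DictReader from an exported result); the function name and the sample rows are ours. It uses cost as a stand-in for traffic, since that is the column the query returns.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_idle_nat_gateways(rows: Iterable[dict]) -> List[str]:
    """Return gateways billed for hours but with zero data-processing cost."""
    hours_cost: Dict[str, float] = defaultdict(float)
    bytes_cost: Dict[str, float] = defaultdict(float)

    for row in rows:
        gateway_id = row["line_item_resource_id"]
        cost = float(row["line_item_cost"])
        usage_type = row["line_item_usage_type"]
        # Usage types can carry a region prefix, so match on the suffix.
        if usage_type.endswith("NatGateway-Hours"):
            hours_cost[gateway_id] += cost
        elif usage_type.endswith("NatGateway-Bytes"):
            bytes_cost[gateway_id] += cost

    return sorted(
        gateway_id
        for gateway_id, cost in hours_cost.items()
        if cost > 0 and bytes_cost.get(gateway_id, 0.0) == 0
    )


# The four sample rows from the CUR output in step 1.
rows = [
    {"line_item_resource_id": "natgateway-xyz0987def0125rst999",
     "line_item_usage_type": "NatGateway-Bytes", "line_item_cost": "3.142"},
    {"line_item_resource_id": "natgateway-xyz0987def0125rst999",
     "line_item_usage_type": "NatGateway-Hours", "line_item_cost": "0.045"},
    {"line_item_resource_id": "natgateway-0x33338abp9898r023r3",
     "line_item_usage_type": "NatGateway-Bytes", "line_item_cost": "0.000"},
    {"line_item_resource_id": "natgateway-0x33338abp9898r023r3",
     "line_item_usage_type": "NatGateway-Hours", "line_item_cost": "0.045"},
]
idle = find_idle_nat_gateways(rows)
print(idle)  # ['natgateway-0x33338abp9898r023r3']
```

A gateway that was created only a few days ago will also show no traffic, so it is worth checking creation times before treating everything on this list as safe to delete.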
3. Delete the idle NAT gateways
Now that we have a list of idle NAT gateways, it’s time to delete them. This is where the savings come in! It’s a pretty straightforward process:
- Use the DescribeNatGateways API to ensure that the NAT gateway still exists, see whether it is public or private, and determine what its NatGatewayAddresses are.
- Use the DeleteNatGateway API to delete the NAT gateway.
To complete the first step, we can use the AWS CLI:
aws ec2 describe-nat-gateways --nat-gateway-ids natgateway-0x33338abp9898r023r3
With the CLI’s JSON output format, this returns a structure containing a list of NatGateway objects. Some sample output would look like:
{
    "NatGateways": [
        {
            "CreateTime": "2021-09-01T12:30:00.000Z",
            "NatGatewayAddresses": [
                {
                    "AllocationId": "eipalloc-0123456789abcdef0",
                    "NetworkInterfaceId": "eni-abcdefghijkl123456",
                    "PrivateIp": "10.0.0.1",
                    "PublicIp": "203.0.113.25"
                }
            ],
            "NatGatewayId": "natgateway-0x33338abp9898r023r3",
            "State": "available",
            "SubnetId": "subnet-06a692ed4ef8c4d38",
            "VpcId": "vpc-0a4aa1e4bfd3c84e57"
        }
    ]
}
Note that the AllocationId inside of the NatGatewayAddresses list refers to the Elastic IP address associated with the NAT gateway. It’s a good practice to save this response somewhere, so you can decide what to do with the Elastic IP address.
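If you save the response as JSON, pulling out the allocation IDs later takes only a few lines. A minimal sketch, assuming the response has the shape shown above (the helper name is ours):

```python
import json
from typing import Dict, List


def elastic_ip_allocations(describe_response: dict) -> Dict[str, List[str]]:
    """Map each NAT gateway ID to the Elastic IP allocation IDs attached to it."""
    allocations: Dict[str, List[str]] = {}
    for gateway in describe_response.get("NatGateways", []):
        allocations[gateway["NatGatewayId"]] = [
            address["AllocationId"]
            for address in gateway.get("NatGatewayAddresses", [])
            if "AllocationId" in address  # private NAT gateways have no Elastic IP
        ]
    return allocations


# Trimmed copy of the sample response shown above.
sample = json.loads("""
{"NatGateways": [{"NatGatewayId": "natgateway-0x33338abp9898r023r3",
  "NatGatewayAddresses": [{"AllocationId": "eipalloc-0123456789abcdef0",
                           "PublicIp": "203.0.113.25"}]}]}
""")
print(elastic_ip_allocations(sample))
```

Keeping this mapping around matters because an Elastic IP that is allocated but no longer attached to anything can keep costing money after the gateway itself is gone.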
If DescribeNatGateways returns the NAT gateway in its response, this validates that the NAT gateway still exists. Now, we can use this command to delete it:
aws ec2 delete-nat-gateway --nat-gateway-id natgateway-0x33338abp9898r023r3
This will return the following output:
{
    "NatGatewayId": "natgateway-0x33338abp9898r023r3"
}
To confirm that the NAT gateway is deleted, we can use the describe-nat-gateways command again, filtering on a particular NAT gateway.
aws ec2 describe-nat-gateways --filter Name=nat-gateway-id,Values=natgateway-0x33338abp9898r023r3
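If it’s still in the process of being deleted, we will get the same response as the first describe call, with State set to deleting. A deleted gateway may also remain visible for a while with State set to deleted. Once it has disappeared completely, you will see an empty response: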
{
    "NatGateways": []
}
That’s one fewer NAT gateway and roughly $400 per year in annualized savings. Do this a few times, and the savings really add up.
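For anyone who prefers to script the describe-then-delete sequence rather than run the CLI by hand, here is a hedged boto3 sketch. The function name and return shape are ours. It assumes ec2 is a boto3 EC2 client and relies on two client methods, describe_nat_gateways (with its Filter parameter) and delete_nat_gateway. Treat it as a starting point and try it in a sandbox account first.

```python
from typing import Any, Dict, List


def delete_nat_gateway_if_present(ec2: Any, nat_gateway_id: str) -> Dict[str, Any]:
    """Describe a NAT gateway, note its Elastic IP allocations, then delete it.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    # Filtering (rather than passing NatGatewayIds) returns an empty list
    # for an unknown ID, which keeps the "does it still exist?" check simple.
    response = ec2.describe_nat_gateways(
        Filter=[{"Name": "nat-gateway-id", "Values": [nat_gateway_id]}]
    )
    gateways = [
        gateway for gateway in response["NatGateways"]
        if gateway.get("State") not in ("deleting", "deleted")
    ]
    if not gateways:
        return {"deleted": False, "allocation_ids": []}

    # Keep the Elastic IP allocation IDs so they can be reviewed afterwards.
    allocation_ids: List[str] = [
        address["AllocationId"]
        for address in gateways[0].get("NatGatewayAddresses", [])
        if "AllocationId" in address
    ]
    ec2.delete_nat_gateway(NatGatewayId=nat_gateway_id)
    return {"deleted": True, "allocation_ids": allocation_ids}


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   result = delete_nat_gateway_if_present(
#       boto3.client("ec2", region_name="us-east-1"),
#       "natgateway-0x33338abp9898r023r3",
#   )
```

The sketch deliberately leaves the Elastic IP alone and only reports its allocation ID, since whether to release or reuse the address is a separate decision.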
Eliminate idle VPC NAT gateways automatically with CloudFix
This fix, like many of the others that we’ve covered, is certainly possible to implement manually. The tricky part isn’t doing it; it’s deciding whether or not it’s a good use of your time. Is a relatively small optimization worth the engineering hours required to ensure the process runs without errors, especially when it’s not related to your core functionality? For most teams, the answer is no.
Enter CloudFix. CloudFix’s automation has been tried, tested, and proven across thousands of AWS accounts. With CloudFix, you don’t have to choose between saving money and investing engineering time. It automates fixes like removing idle VPC NAT gateways so they can be executed quickly and consistently with just a few clicks. That means more time to spend on business-critical projects, and more budget to fund them. We think the Internet Engineering Task Force would approve.