For years, the CloudFix team has managed and maintained 120+ AWS-hosted SaaS products across hundreds of AWS accounts. Although this model follows established AWS best practices, the team's scope introduced operational challenges. The team needed a way to identify cost-saving opportunities across their applications without making architectural compromises or introducing service disruption.
The team responded to the challenge by developing CloudFix. CloudFix proved that it could meet their architectural requirements while cutting costs by 10-20%. For example, CloudFix helped with:
- Identifying General Purpose SSD (gp2) Amazon EBS volumes and migrating them to EBS gp3 volumes, with additional IOPS provisioning if necessary.
- Identifying underutilized Amazon EC2 Convertible Reserved Instances and exchanging them for a different instance family.
- Identifying Amazon Simple Storage Service (S3) buckets without lifecycle policies and enabling S3 Intelligent-Tiering.
Automated Opportunity Finding
As the CloudFix team explored possible solutions, their first realization was that identifying cost-saving opportunities needed to be automated. Although converting all gp2 volumes to gp3 could save up to 20% in costs, they had thousands of these volumes spread across their accounts. Each volume needed an analysis of its past performance to determine whether a plain conversion was the right path or whether provisioned IOPS would be necessary to support the workload. At this scope, the data gathering and opportunity identification were only feasible if automated.
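To make the per-volume analysis concrete, here is a minimal sketch of how such a check could look. It is not CloudFix's actual logic: the `VolumeUsage` record, the `peak_iops` field, and the 20% headroom are assumptions made for illustration. The only AWS facts it relies on are the gp3 baseline of 3,000 IOPS and the gp3 maximum of 16,000 IOPS. Throughput and other volume limits are ignored here.

```python
import math
from dataclasses import dataclass

GP3_BASELINE_IOPS = 3000   # included with every gp3 volume
GP3_MAX_IOPS = 16000       # upper limit for a gp3 volume


@dataclass
class VolumeUsage:
    """Hypothetical record combining inventory and usage data for one volume."""
    volume_id: str
    size_gib: int
    peak_iops: int  # assumed: highest observed read+write operations per second


def recommend_gp3_iops(usage: VolumeUsage, headroom_pct: int = 20) -> dict:
    """Suggest gp3 IOPS that cover the observed peak plus some headroom."""
    needed = math.ceil(usage.peak_iops * (100 + headroom_pct) / 100)
    # Never go below the free baseline, never above the gp3 limit.
    iops = min(max(needed, GP3_BASELINE_IOPS), GP3_MAX_IOPS)
    return {
        "volume_id": usage.volume_id,
        "target_type": "gp3",
        "iops": iops,
        "extra_iops": iops - GP3_BASELINE_IOPS,  # IOPS provisioned beyond baseline
    }


if __name__ == "__main__":
    print(recommend_gp3_iops(VolumeUsage("vol-example", 500, 5000)))
```

A volume whose observed peak stays under the baseline needs no extra provisioning, which is the simple case where the conversion is a pure cost saving.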
AWS Config turned out to be a great tool to gather inventory of all of the AWS resources across hundreds of accounts. They enabled Config recording and created configuration snapshots in response to configuration updates generated when resources were added, removed, or reconfigured. The snapshots included resource IDs and configuration metadata.
In addition to resource configuration metadata and IDs, usage metrics were necessary to generate quality analysis and recommendations. The team leveraged cross-account functionality in Amazon CloudWatch metrics to collect this data.
Once resource configs and metrics were all gathered and stored in a centralized data store, CloudFix ran finders against the data to create a list of recommended optimization actions.
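A finder can be thought of as a filter over the centralized inventory. The sketch below illustrates the idea for the S3 lifecycle case mentioned earlier. The record layout (`account_id`, `resource_id`, `resource_type`, `lifecycle_rules`) and the action name are hypothetical and do not reflect the real CloudFix data store; `AWS::S3::Bucket` is the resource type AWS Config uses for buckets.

```python
from typing import Iterable


def find_buckets_without_lifecycle(resources: Iterable[dict]) -> list:
    """Return one recommendation per S3 bucket that has no lifecycle rules."""
    recommendations = []
    for resource in resources:
        if resource.get("resource_type") != "AWS::S3::Bucket":
            continue  # this finder only looks at buckets
        if resource.get("lifecycle_rules"):
            continue  # bucket already has a lifecycle policy
        recommendations.append({
            "account_id": resource["account_id"],
            "resource_id": resource["resource_id"],
            "action": "enable-intelligent-tiering",  # hypothetical action name
        })
    return recommendations


if __name__ == "__main__":
    inventory = [
        {"account_id": "111111111111", "resource_id": "logs-bucket",
         "resource_type": "AWS::S3::Bucket", "lifecycle_rules": []},
        {"account_id": "111111111111", "resource_id": "archive-bucket",
         "resource_type": "AWS::S3::Bucket", "lifecycle_rules": [{"ID": "expire"}]},
    ]
    for rec in find_buckets_without_lifecycle(inventory):
        print(rec)
```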
Automated Opportunity Fixing
The natural next step after identifying cost-saving opportunities was to execute on the recommendations. It was also important to the operations team to automate this process, because even a simple change needs operational guardrails to make sure that it is implemented in a safe, secure, and repeatable manner.
For example, there are multiple considerations required when converting a gp2 volume to gp3. With each conversion, the team would create an EBS snapshot, create a snapshot lifecycle policy to make sure that the backup does not accrue unnecessary costs, initiate the volume conversion, and monitor the state of the change. Doing this manually for thousands of recommendations would be both error-prone and tedious.
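The conversion steps above can be sketched with the boto3 EC2 client as follows. This is an illustrative outline under stated assumptions, not the CloudFix fixer itself: the function name, polling interval, and return shape are made up for the example, and the snapshot lifecycle policy step is left out for brevity. The client is passed in as a parameter (for example `boto3.client("ec2")`), so the function can also be exercised with a stub.

```python
import time


def convert_gp2_to_gp3(ec2, volume_id, iops=3000, poll_seconds=15, max_polls=240):
    """Back up a volume, convert it to gp3, and wait for the change to take effect.

    `ec2` is expected to behave like a boto3 EC2 client.
    """
    # 1. Create a backup snapshot before touching the volume.
    snapshot = ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"Backup of {volume_id} before gp3 conversion",
    )
    snapshot_id = snapshot["SnapshotId"]
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # 2. Start the volume conversion.
    ec2.modify_volume(VolumeId=volume_id, VolumeType="gp3", Iops=iops)

    # 3. Monitor the modification until the new settings are in effect.
    for _ in range(max_polls):
        response = ec2.describe_volumes_modifications(VolumeIds=[volume_id])
        state = response["VolumesModifications"][0]["ModificationState"]
        if state in ("optimizing", "completed"):
            return {"volume_id": volume_id, "snapshot_id": snapshot_id, "state": state}
        if state == "failed":
            raise RuntimeError(f"Modification of {volume_id} failed")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Modification of {volume_id} did not finish in time")
```

Even in this reduced form, the fixer involves several API calls and a polling loop per volume, which illustrates why repeating it by hand across thousands of volumes is impractical.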
Even with automated finders and fixers, rolling out changes at scale turned out to be a significant operations workflow challenge. The following requirements had to be considered:
- Account and resource owners needed to be informed of all changes.
- Account and resource owners needed to be able to easily review the AWS Systems Manager document corresponding to the change.
- For fixers identified as low-risk, such as converting an EBS volume from gp2 to gp3, operations staff should be able to deploy the change without waiting for approval.
- For fixers identified as higher-risk and requiring review, the fixer should only be executed if explicitly approved. Any rejected changes must also be recorded for analysis.
- For all identified fixers, the operations team needed to keep track of changes performed and monitor their impact.
- Account and resource owners needed to be able to pull up all governance-related changes made to their resources.
All of these requirements needed to be met and had to work with tens of thousands of resource changes spread across hundreds of AWS accounts, each with different owners.
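The approval rules in these requirements amount to a small decision procedure. The sketch below expresses it in plain Python to show the intent; the fixer names, the `ChangeLog` structure, and the string results are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical set of fixers that may run without waiting for approval.
LOW_RISK_FIXERS = {"ebs-gp2-to-gp3"}


@dataclass
class ChangeLog:
    """Keeps a record of executed and rejected changes for later analysis."""
    executed: list = field(default_factory=list)
    rejected: list = field(default_factory=list)


def process_change(fixer: str, resource_id: str,
                   approved: Optional[bool], log: ChangeLog) -> str:
    """Decide what happens to a proposed change.

    `approved` is True or False once an owner has decided, or None if
    no decision has been made yet.
    """
    if fixer in LOW_RISK_FIXERS or approved is True:
        log.executed.append((fixer, resource_id))
        return "execute"
    if approved is False:
        log.rejected.append((fixer, resource_id))  # rejections are kept too
        return "rejected"
    return "pending"  # higher-risk change still waiting for review


if __name__ == "__main__":
    log = ChangeLog()
    print(process_change("ebs-gp2-to-gp3", "vol-1", None, log))
    print(process_change("ri-exchange", "ri-1", None, log))
```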
AWS Systems Manager Change Manager to the Rescue
- CloudFix creates change templates describing each new type of finder/fixer. Account owners have the opportunity to review and approve the templates before any new change request can be made in Change Manager.
- For each new set of resources to be fixed, a change request is created. CloudFix itself only has permission to create a change request based on an approved change template.
- A change request is automatically executed after the account owner or designated approver approves the change. Designated approvers can be AWS Identity and Access Management (IAM) users, roles, or AWS Single Sign-On (SSO) users or groups. The delegates only need permission to approve change requests, not permission for all of the operations executed during the request.
- Changes that have no performance impact or risk can be auto-approved. The approver still receives notifications when changes are executed.
- CloudFix lets changes be either auto-approved or rejected after a timeout.
- Complex workflows with multiple stages of approval, multiple approvers, or a group of approvers are supported.
- Approved changes can be set to execute at specific times.
- A change request can be tracked and aggregated centrally for analytics and reporting.
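As an illustration of how a change request might be created from an approved template, the sketch below builds the arguments for the Systems Manager `StartChangeRequestExecution` API, exposed in boto3 as `start_change_request_execution`. The template name, runbook name, and the `VolumeIds` runbook parameter are hypothetical placeholders; the actual documents that CloudFix uses are not shown in this post.

```python
from datetime import datetime
from typing import Optional, Sequence


def build_change_request(template_name: str, runbook_name: str,
                         volume_ids: Sequence[str],
                         auto_approve: bool = False,
                         scheduled_time: Optional[datetime] = None) -> dict:
    """Build keyword arguments for ssm.start_change_request_execution."""
    request = {
        "DocumentName": template_name,  # the approved change template
        "ChangeRequestName": f"{runbook_name}-{len(volume_ids)}-volumes",
        "Runbooks": [{
            "DocumentName": runbook_name,  # Automation runbook that applies the fix
            # Hypothetical runbook parameter; values are lists of strings.
            "Parameters": {"VolumeIds": list(volume_ids)},
        }],
        # Only honored if the change template allows auto-approval.
        "AutoApprove": auto_approve,
    }
    if scheduled_time is not None:
        request["ScheduledTime"] = scheduled_time  # run at a specific time
    return request


if __name__ == "__main__":
    kwargs = build_change_request(
        "Example-Gp3-ChangeTemplate", "Example-ConvertToGp3", ["vol-1", "vol-2"],
        auto_approve=True,
    )
    print(kwargs)
    # With AWS credentials configured, the request could then be submitted:
    #   import boto3
    #   boto3.client("ssm").start_change_request_execution(**kwargs)
```

Keeping the request construction separate from the API call makes it easy to inspect or log what will be submitted before anything is sent to Change Manager.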
AWS Systems Manager Change Manager provided the features to transform the CloudFix product into a scalable, multi-account, cost-savings tool. It completed their infrastructure-as-code transition by delivering operational changes as pull requests. All of this was achievable by integrating native AWS services.
You can learn more about Change Manager and its available feature set by reviewing the AWS Systems Manager Change Manager User Guide.
CloudFix is available in the AWS Marketplace. With one selection, you can install the AWS CloudFormation templates needed to start getting change requests with cost-reducing recommendations.
This post was written in collaboration with Badri Varadarajan – Executive VP, Technical Product Management (DevFactory) and Ravi Duddukuru – Chief Product Officer (DevGraph).