Season 2: Episode #2

How to Avoid the Pitfalls of IaC

IaC has been around for over a decade but has only recently become common practice, thanks to the growth and adoption of the cloud itself. With this growth, developers looked for tools to tame their sprawling infrastructure and returned to what they were most comfortable with: code. But code comes with its own problems, including an ever-growing variety of languages that span different paradigms and different levels of abstraction. When it comes to IaC, people just aren’t talking about all the pitfalls and the best practices like they do with code.

Guest

Kyle Brown

The CTO for the CIO at IBM and published author

Transcript

Kyle Brown: The question that you’re really getting to then is what’s the source of truth? Ah, okay. Have I got you on this one?

Hilary Doyle: If our listeners could hear your eyebrow movements right now, I think they would be delighted.

Kyle Brown: You’ve all heard the cattle versus pets analogy, right?

Rahul Subramaniam: Yeah.

Hilary Doyle: Yes. But I have a problem with that because especially for Rahul, who is currently living in India, cattle are holy.

This is AWS Insiders, an original podcast from CloudFix, bringing you what you need to know about AWS through the people and the companies that know it best. CloudFix is the nonstop automated way to find and fix AWS recommended savings opportunities. It never stops. I’m Hilary Doyle. I’m the co-founder of Wealthy Works Daily.

Rahul Subramaniam: And I’m Rahul Subramaniam. I’m the founder of CloudFix.

Hilary Doyle: Rahul, we are talking about infrastructure as code today. I’m extremely excited, so don’t get me wrong. I am looking forward to the show, but I’m also wondering if this might be our shortest episode ever.

Rahul Subramaniam: That would be an absolute shock to me, Hilary. But again, I like efficiency. So what have you got?

Hilary Doyle: Okay. Well, thank you for letting me continue. The thing about IaC seems to be, from my layperson’s perspective, that it’s a pretty complicated DevOps approach that sort of masquerades as an easy fix. I say this because I have read a lot of not so subtle cries for help from engineers who are saying, “Everybody else seems to be having a really easy time with this. So what am I doing wrong?”

Listen, from my perspective, it just seems like IaC is complicated because code gets complicated, particularly when configurations are dynamic over time. And I just-

Rahul Subramaniam: Hilary, Hilary, hold on, hold on. Hold on one second. One second.

Hilary Doyle: Uh-huh?

Rahul Subramaniam: It sounds like you have questions that are going to take another 30 minutes to articulate.

Hilary Doyle: Oh, I’m sure.

Rahul Subramaniam: So what are you driving at? Let’s get right to it.

Hilary Doyle: All right. What I’m saying is, I mean, we could debate the pros and cons of IaC. We could try to solve for the structural realities, but also we’re having a pretty good moment with AI. And so, I’m curious about whether we may finally have reached a place where we could just leave it to AI to code and automate and sort out our DevOps while we go and enjoy the spring.

Rahul Subramaniam: Well, we are not there yet, Hilary, but hopefully soon. I’m really hopeful.

Hilary Doyle: Okay. Well, hope isn’t going to write this episode. So for now, I guess we will take on IaC. Let’s get to it.

Infrastructure as code has become the de facto industry standard, but that has not led to an agreement about how it should be implemented or how best to run Agile or really even what IaC is. AWS CloudFormation made IaC an essential DevOps practice almost overnight. But is anyone actually having an easy time with this? And Rahul, why is it still such a hot button issue?

Rahul Subramaniam: Okay, Hilary, while IaC has been around for over a decade, it’s only recently become common practice. I mean, thanks to the growth and adoption of the cloud itself. With this growth, developers looked for tools to tame their sprawling infrastructure and returned to what they were most comfortable with, which is code.

But code comes with its own problems. I mean, including an ever-growing variety of languages that span different paradigms and – of course – different levels of abstraction. When it comes to IaC people just aren’t talking about all the pitfalls and the best practices like we’ve learned to do with code. We just don’t talk about things like technical debt when we talk about IaC.

Hilary Doyle: Got it. We are going to dig into all of this and more with our guest, Kyle Brown. He comes to us from IBM. But first as always, your AWS news headlines.

Okay. Some big AI news out this week for that tiny company we know as AWS. Amazon CodeWhisperer is now generally available. Thanks AWS for making me an overnight coding sensation. My arrogance has really grown over the last 12 hours.

CodeWhisperer is, of course, the AI coding companion that creates real time single line or even full function code suggestions in your integrated development environment. How big is this Rahul, and genuinely, how well is it actually working?

Rahul Subramaniam: Hilary, this is absolutely big. We have spoken about how the AWS landscape is just so massive spanning over what? 200,000 APIs? I mean, staying on top of this is probably one of the greatest challenges facing developers. Knowing how to use them right and efficiently is just a constant uphill battle.

The thing I like about CodeWhisperer is that it is also trained on millions of lines of code that were written specifically for AWS services and for building AWS applications. So the recommendations that come from it are opinionated by all the experts that have written AWS specific code. While it is not perfect, it is awesome to get you started with all the scaffolding and the framework stuff that just takes ages and writes up well-directed functions really well. Don’t forget, like every other AWS product, it just keeps getting better every week.

Hilary Doyle: You said it’s not perfect. Give me a scale of zero to a hundred. Is it a B? Is it a C? Is it a solid A-?

Rahul Subramaniam: You’re mixing numbers and alphabets, Hilary, but-

Hilary Doyle: I can’t be boxed in.

Rahul Subramaniam: … but I’m going to say it’s at about a 70 at this point.

Hilary Doyle: That’s like a C, a low C. Okay. All right. You will be hearing about AI and AWS a lot on this show this season because we are the future.

Meanwhile, moving on to CloudFix’s favorite topic, cost savings, you can now receive cost data for Amazon ECS tasks and AWS Batch jobs in your AWS Cost and Usage Reports. That means you’ll be able to analyze, optimize, and charge back cost and usage for your containerized applications. Chargebacks? What? Seriously? This sounds amazing. Tell us about this news, Rahul.

Rahul Subramaniam: Okay, so this is another big story. As you know, one of the pillars of FinOps practice is showback and chargeback. That means that you really need to be able to allocate costs accurately so that you can charge the different businesses appropriately for whatever they spend. It just brings more accountability and responsibility to them.

Now, when it comes to ECS, we are talking about a centralized cluster of instances that are shared across multiple end users and jobs. Now, unless you know exactly how many seconds each specific job took on the cluster and how much memory it consumed, and of course other resources like networking and storage as well – it is almost impossible to be able to do a chargeback or a showback, right? So from a FinOps practitioner standpoint, this is huge and closes this huge gap in the costs that really couldn’t be accurately accounted for so far.

Hilary Doyle: Huh. Okay. Well finally, Rahul, you were the first person to tell me about the AWS Think Big Spaces in schools. I love this program.

Rahul Subramaniam: Yeah, I actually follow this program in India. It’s a pretty neat program that they run.

Hilary Doyle: Mm-hmm. As a quick recap or cap – I don’t think we’ve talked about this before. AWS sponsors high-tech classrooms all over the world where kids can explore STEAM subjects. The idea is just to encourage this kind of learning in schools that might not otherwise have access to this kind of programming.

AWS does this where they have local presence with data centers and other facilities. So tell us about what they’re doing in your part of the world, Rahul.

Rahul Subramaniam: Okay. So well in Navi Mumbai, AWS has a project-based learning lab where they drive around to different schools with it. This gets kids out of the classroom and into real world environments to develop their problem solving skills, and it’s just absolutely fantastic.

Hilary Doyle: It sounds fantastic. Also, I like the idea of building the next generation of little RaRa’s. And back to the show.

Rahul Subramaniam: Joining us now is Kyle Brown, and for more than three decades, his focus has been design and implementation of large scale enterprise systems. Kyle is also the CTO for the CIO at IBM.

Hilary Doyle: IBM doesn’t have a C-suite. They have more of a C-hotel. Anyway, Kyle is also the author of The Cloud Adoption Playbook. Welcome to the show, Kyle.

Rahul Subramaniam: Tell us a little bit about yourself and about your experience with the cloud and infrastructure as code in particular.

Kyle Brown: So I have been doing things on the cloud since before there was officially a cloud. With IBM, we started doing things that you would now consider cloudy many years before actually AWS was even formed. And then of course, once AWS came around, we began using AWS and then all the rest of the clouds as well.

So now, we’re truly a multi-cloud kind of organization, and it makes the whole question of infrastructure as code, and especially configuration as code, even more important because no one just does one thing. Everyone does lots and lots of things and they all have to work together to make our applications function.

Rahul Subramaniam: So I completely agree with it, and it looks like the multi-cloud use case seems to be what has driven a large part of the IaC revolution, and now it’s pretty much become standard for all deployments.

But here’s the question, it seemed to come out of the need to handle an immense amount of complexity in systems, and we’ve ended up building a wrapper on top of that complexity. But more so with code, we’ve taken the approach of, “Hey, we need more powerful expressions to even describe the infrastructure that we have. Let’s just go back to programming languages that you’re most familiar with or somewhere near those generations of languages.”

The one thing that nobody talks about is the downsides of all of that: the technical debt, the decay, all the trade-offs that we have with code that we’ve learned from over the last 50 years or 60 years of writing code. No one seems to be talking about that. Any thoughts? You guys have a lot of experience.

Kyle Brown: Exactly. That’s part of the reason why I think there’s been an evolution in the IaC world, similar to what we’ve seen over the last 15 to 20 years in the programming language world. Let me do some personal examples on this. Back at the beginning of my career, way longer ago than I want to think about, one of the things that I had to end up doing was I was the system admin for a small cluster of Unix workstations that I was part of a development team that was using. I was given the job of being the sysadmin for these six or seven Unix workstations that we had. This is, by the way, pre-SPARC. Okay? These are really, really old Unix workstations.

Rahul Subramaniam: Are we talking AIX?

Kyle Brown: Yeah, the old RS/6000s, if you remember those.

Rahul Subramaniam: I remember those days.

Kyle Brown: One of the very first things I ended up doing as part of being a sysadmin for this is – like every other programmer I’ve ever met, I am lazy. The very first thing I started doing as the sysadmin was figuring out ways of automating my work. We had a source code control system worked out and we had what we would now start to call build automation, although it was using Make and things like that back in the day. We had all that lined out and we were doing that as part of our project.

But on the other hand, all of the Bash scripts that I was writing to do the automation for my sysadmin work – heck, I was just throwing that in a folder somewhere. It turned out to be the way that sysadmins, which evolved into the kinds of automation engineers that we had after that, which then evolved past that into site reliability engineers, that was how they operated. They always operated as this special thing that’s not really programming. It’s special, it’s different, it’s unique.

Rahul Subramaniam: But is it really?

Kyle Brown: Well, it took us years to figure out that it’s not.

Rahul Subramaniam: Exactly.

Kyle Brown: That it is the same thing and you have all the same issues like technical debt.

That’s where I wanted to draw this parallel to something that I’ve seen happen over the last several years that we’ve really leaned into in the job that I’m in at IBM. That is, if you go back 15 years in programming, the big thing that changed Java programming for all of us was the introduction of Spring.

Well, what we’ve been doing in the first generation of infrastructure-as-code was – we were going back to those bad old days that we used to have in Java before Spring. We were just writing code to do everything to tie it together. You’re right, that ends up with technical debt.

But what happened is essentially that same change happened on the infrastructure-as-code side. We started realizing that you could do a lot of this, in fact, most of this, if not all of it, through configuration. The second generation of these IaC tools, and I’m particularly thinking of things like Ansible, but also Terraform — I love Terraform and I love Ansible, and we do most of our work now with those two tools — are much more at the level of just specifying the configuration of the way you want things to be, and then the system itself has to figure things out.

The difference between those two ways of looking at IaC really came down to the Kubernetes revolution because I think Kubernetes is what changed people’s mind between the two and got people thinking about the possibility of: Can I just write things as a configuration file and let the system figure it out? It turns out that, for the most part, you can. That results in systems that are much more easily maintainable and have much less technical debt because you’re not putting all of those dependencies directly into your code. You’re letting the system work it out on its own as the engine runs, updates, and gets configured that way.
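
To make the "write the configuration and let the system figure it out" idea concrete, here is a minimal Python sketch of a reconciliation step. It is not how Kubernetes, Terraform, or Ansible are implemented; the resource names, the settings, and the `plan` function are all hypothetical and exist only to show the shape of the approach: the author states the desired end state, and an engine derives the steps.

```python
from typing import Dict, List, Tuple

# Hypothetical resource model: resource name -> settings dictionary.
Resources = Dict[str, Dict[str, object]]


def plan(desired: Resources, actual: Resources) -> List[Tuple[str, str]]:
    """Return the (action, resource name) steps that move `actual` to `desired`."""
    steps: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in actual:
            steps.append(("create", name))
        elif actual[name] != desired[name]:
            steps.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            steps.append(("delete", name))
    return steps


# The declared configuration: only the end state, no ordering or wiring logic.
desired = {
    "web": {"replicas": 3, "image": "shop:1.4"},
    "cache": {"replicas": 1, "image": "redis:7"},
}
# What is (hypothetically) running right now.
actual = {
    "web": {"replicas": 2, "image": "shop:1.4"},
    "queue": {"replicas": 1, "image": "rabbitmq:3"},
}

steps = plan(desired, actual)
print(steps)
```

The person writing `desired` never says "create the cache, then resize web, then remove the queue"; that sequencing lives in the engine, which is where the reduction in hand-written glue code comes from.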

Rahul Subramaniam: Having been in a YAML hell with Kubernetes, I can probably say I would disagree with things having gotten better, but we will leave that for a later question. Hilary, you’re up.

Hilary Doyle: Thank you. I mean, I’m going to come at you, Kyle, with more of a layperson’s perspective, although I really appreciate this heavy nerd love between the two of you. While I was learning about IaC, Kyle, I noticed you’ve referenced the Gartner hype cycle for tech innovations before. We’ve got the Innovation Trigger followed by the Peak of Inflated Expectations, the Trough of Disillusionment, the Slope of Enlightenment, and finally this famed Plateau of Productivity. IaC looks like it’s in the plateau because of this mass adoption.

Kyle Brown: Oh, it is.

Hilary Doyle: Where is the endpoint here and where are we now? Can you just set the stage for the evolution here so we understand it?

Kyle Brown: We are in the plateau. We are well into the plateau. What I think is next though, where we have the next layer of the hype curve coming up, is getting to Rahul’s previous point, YAML hell. Yeah, I’ve been in YAML hell, too, okay, but it’s a different level of hell. I mean, if you’re going to Dante, it’s not as far down. It’s a little bit higher up and you’re not feeling as bad with it.

What it comes down to is if you can have tools that can help you with the syntactic issues and can help you with the semantic issues of, “Well, what does this actually mean?” If I’m putting in a configuration term in my JSON, YAML, whatever my tool is using, what’s the semantic meaning of that? If you can have tools that can help you to do that, that’s the way out of the YAML hell.

The good news is we’re almost there now. If I’m talking about generative AI tools like Copilot and other things like that, they’re really now to the point where you can start to look at helping you get out of that semantic mess and make things better for just being able to say, “I want to do something like this. Oh, here’s the semantic pieces that correspond to that.” That’s really cool.

Hilary Doyle: I think I’m using code to simplify my deployments. But now, I need to hire a team that will literally never leave me because they’re sitting on all of the code knowledge. I need them to scenario plan, then optimize, and manage a web of coded developments that are siloed, but need to work together and are also going to need upgrades to their code bases over time, not to mention occasional debugging. I just need you to tell me why IaC might not be the nightmare I currently think it is.

Kyle Brown: What you just described to me, Hilary, is software development. Right? I mean, all of the things you just talked about. Well, yeah, you’re siloed. The team has to be around forever. You need domain knowledge about how things are done. All right. This is the issue we’ve faced with software development forever that led to us adopting Agile methods to being able to say that we need teams that can stay together forever, where you bring the work to the team instead of the team to the work.

Hilary Doyle: A lot of relationship work in developing systems. Yeah.

Kyle Brown: It’s looking at it in that way of being able to say, “I really do have a software development effort that I’m on.” Instead of configuring it separately as this special, one-time configuration or infrastructure effort, it’s realizing you just have an ongoing team that you will care and feed for forever, the same way you do your application teams. That’s the thing is don’t look at it as being any different.

Rahul Subramaniam: But, Kyle, I’m going to argue that we’ve just made our lives a lot more complicated because it’s not just the code deploying something. We’ve now got the additional problem of state of the system, which now exists in multiple places. You’ve got the state in the actual deployment, you’ve got state as expressed in the code, and then you have state files that could be all over the place, like in the case of Terraform, which actually store some kind of snapshot of what the state was at some point of time. And now, you have this big jugglery going on. You’ve basically turned your software developers into jugglers trying to figure out and guess what the state is, and stay on top of all these different systems that are up in the air in different states.
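
The three places state can live (the code, a recorded state file, and the running deployment) can be illustrated with a small Python sketch. This is only an illustration under assumed names: `find_drift`, the setting keys, and the values are hypothetical, and real tools track far more than a flat dictionary.

```python
from typing import Dict

OUTSIDE = "live system changed outside the tool"
PENDING = "code changed but not yet applied"


def find_drift(
    declared: Dict[str, object],
    recorded: Dict[str, object],
    live: Dict[str, object],
) -> Dict[str, str]:
    """Compare code, state snapshot, and running system, setting by setting."""
    report: Dict[str, str] = {}
    for key in sorted(set(declared) | set(recorded) | set(live)):
        d, r, l = declared.get(key), recorded.get(key), live.get(key)
        if d == r == l:
            continue  # all three sources agree
        if r != l:
            report[key] = OUTSIDE  # the snapshot no longer matches reality
        else:
            report[key] = PENDING  # snapshot matches reality, code has moved on
    return report


declared = {"instance_type": "m5.large", "port": 443, "min_size": 2}  # in the repo
recorded = {"instance_type": "m5.large", "port": 443, "min_size": 1}  # state file
live = {"instance_type": "m5.xlarge", "port": 443, "min_size": 1}     # deployment

report = find_drift(declared, recorded, live)
for key, reason in report.items():
    print(f"{key}: {reason}")
```

Even in this toy version, every setting needs a three-way comparison before anyone can say what the "real" value is, which is the juggling act being described.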

Hilary Doyle: Kyle’s leaning in. I like it. Yeah.

Kyle Brown: The question that you’re really getting to then is, what’s the source of truth? How do you know what the real source of truth is and what the real state of the system needs to be? Ah, okay. Have I got you on this one?

Rahul Subramaniam: Not only that, when people do-

Hilary Doyle: If our listeners could hear your eyebrow movements right now, I think they would be delighted. Just a lot of up and down. I’m very excited about your answer.

Kyle Brown: Okay, so here’s the issue, and this goes back to the early days of what we wanted out of infrastructure as code. You’ve all heard the cattle versus pets analogy, right? You familiar with that one?

Rahul Subramaniam: Yeah.

Kyle Brown: Okay.

Hilary Doyle: Yes, but I have a problem with that because especially for Rahul who is currently living in India, cattle are holy. And so I think maybe this-

Kyle Brown: But if you think about what the analogy was trying to get to, that your runtimes should be disposable. They should be entirely replaceable, and more importantly, they should be generatable at any moment from the information that is held outside of the runtime system. That’s the key to the cattle versus pets analogy. Okay?

Now, what we’ve had to do is, over the years, most teams have ended up compromising and they’re saying, “Well, I know that this is what we’re supposed to do. But darn it, it’s just so hard that we’re going to keep a few of these systems around forever, and we’re not going to change them out, and we’re going to end up treating them more pet-like.”

But that is the problem in its heart, is you can’t say that I’ve got one particular cow that’s different from the rest of the herd and treat it like a pet, because that is going to break the entire process. You have to make sure you only have one repository of truth. And for me, that repository of truth is GitHub. Period. End of story.

Hilary Doyle: Let’s just take a break for a moment, because if you are loving this deep dive into all things technical about AWS, you are going to really appreciate this other thing that Rahul has going on.

Rahul Subramaniam: That’s right, Hilary. Every week I host a livestream with fellow AWS enthusiast Stephen Barr. We break down the latest news from AWS and offer our learnings and insights about them, along with some amazing guests from AWS who are with us both in the virtual studio as well as in the audience.

Hilary Doyle: It’s called AWS Made Easy, and you can find out more about it at cloudfix.com/livestream. They stream live on LinkedIn, YouTube, Twitch, Facebook, even the soon to be deceased Twitter. You can ask your questions live on the show, and this team will magically answer them for you. It’s the kind of free advice that you wish your parents gave to you. It’s the kind of free advice you actually want in your life. Do not hesitate, be on this livestream.

Rahul Subramaniam: Okay, now back to Kyle.

Hilary Doyle: Back to Kyle.

Would you talk to us a little bit about your thoughts or your preference between mutability and immutability?

Kyle Brown: So the difference is that if you have a mutable system, now that could be a VM, or it could be a container or just about anything else you have, you always have the possibility that it will get out of sync with the way you want it to be. And that’s the basic problem that we face in infrastructure as code, is that wherever you have this possibility, that possibility will occur. It’s like Murphy’s Law, if it can happen in a bad way, it will. And in fact, that happens every single time in every single application that I’ve ever been a part of managing at runtime.

Rahul Subramaniam: Kyle, do you have a Terraform branch, which basically has deleted the Terraform update command from it?

Kyle Brown: We haven’t gone that far, but we’ve talked about it. We really have, and it comes down to that point I was talking about earlier is that sometimes the answer is to go back to some of the older ways of doing things and have a little more infrastructure sometimes than you need.

Rahul Subramaniam: Interesting.

Kyle Brown: While you’re making sure that you have smooth transitions from one to the other. But the way you have to do that is you have to absolutely commit to immutability. You have to absolutely commit to – the only way to build something is from the source, from the code in your GitHub or the configuration in your GitHub, and what you’re doing is creating something brand new. You can’t let a human into the system to make modifications at any point.

Patching is gone. Now, that’s an unnatural discipline for most development teams, because they want to be able to get into the systems, and look around, and be able to make changes on the fly, but that’s the path to Hell that’s paved with good intentions.

Rahul Subramaniam: Let me ask you this question, Kyle, in the context of the current economic scenario where everyone is super conscious about cost and efficiency, and all these multiple states, multiple systems, it feels like an additional overhead, really. How does this immutability concept, even though I like this purist form to be honest, how does that impact the cost and efficiency of systems that we deploy?

Kyle Brown: So what you have to do is you have to think about the lifetimes of the systems. Okay, so your AWS cloud cost, or your Azure cloud cost, or your IBM cloud cost or your GCP cloud cost, whatever, it’s all based on time, right?

Rahul Subramaniam: Yep.

Kyle Brown: I’m getting charged per service, or VM, or per container, per unit of time. So the less time you have, it almost doesn’t matter if you’ve got one or two, if you’re controlling how much time it takes to be able to manage that. And that’s where the ability to rebuild things through configuration as code and infrastructure as code becomes so important.

If you can keep that time window very short, so that rebuilding a system is something that you can do in, let’s say an hour or two, for the most part, then all you need is the time it takes to do that, the time it takes to very importantly, test that it came up correctly, and then the time it takes to do the switchover. So even though I’m saying, “You need more infrastructure than you might anticipate you need in this method,” you don’t need it for the entire period of time.
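
A quick back-of-the-envelope calculation shows why a short overlap window matters. The numbers below are invented purely for illustration (a rate of 10 cents per hour, four rebuilds a month, three hours of overlap covering rebuild, test, and switchover); they are not actual prices from any cloud provider.

```python
HOURS_PER_MONTH = 730  # rough average month, used only for illustration


def monthly_cost_cents(
    hourly_rate_cents: int, rebuilds_per_month: int, overlap_hours: int
) -> int:
    """Cost of one always-on system plus a second copy during each rebuild window."""
    steady = hourly_rate_cents * HOURS_PER_MONTH
    overlap = hourly_rate_cents * rebuilds_per_month * overlap_hours
    return steady + overlap


rate = 10  # hypothetical price in cents per hour
with_rebuilds = monthly_cost_cents(rate, rebuilds_per_month=4, overlap_hours=3)
always_doubled = 2 * rate * HOURS_PER_MONTH  # keeping two copies up all month

print(with_rebuilds)   # 7420 cents: 7300 steady + 120 for the overlap windows
print(always_doubled)  # 14600 cents
```

Under these assumed figures the duplicate infrastructure adds 120 cents to a 7300-cent baseline, under 2 percent, rather than doubling the bill.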

Hilary Doyle: Okay. I mean, in the past, Kyle, you’ve said that it’s not the result of your work that’s important, it’s the process that you went through to get the result, and I’m hearing shades of that with your explanation, but could you explain this a little further? I mean, I like end points, especially when it comes to tech and development, so help me understand exactly what you mean.

Kyle Brown: Absolutely. Okay, so let’s think about what it takes to build a system, and then to keep a system running, and then specifically what it takes to make a change to a system, okay? So simple change would be, I’m pushing a new version of code out to my system that I’ve got running on AWS, or Azure or wherever, and it’s actually making a change now to a Kubernetes container that I’ve got running on one of those Kubernetes extensions.

Well, the process I have to follow has to be that I don’t go in and modify an existing running container. It has to be that I am creating a brand new specification of the way that the container has to be built from scratch, and then I let that go through the process first, probably on my local machine, then on my build system, and then finally into the production system. But the thing is that what I have in the GitHub, the representation of what that final system looks like is the only source of truth. So I have to have a common pipeline.

That’s the other key that we haven’t talked about here; I have to have a common pipeline that functions the same, regardless of whether I’m generating a system that I’m testing locally, or generating something that’s going to be deployed into my test system, or something that’s in my production system.
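
One way to picture a common pipeline is a single function that runs the same stages for every environment, with only a small table of per-environment settings varying. The Python below is a sketch under that assumption; the stage names, the `ENV_SETTINGS` table, and the spec fields are hypothetical and not taken from any particular CI system.

```python
from typing import Dict

# Hypothetical per-environment settings; everything else comes from the spec.
ENV_SETTINGS: Dict[str, Dict[str, int]] = {
    "local": {"replicas": 1},
    "test": {"replicas": 2},
    "production": {"replicas": 6},
}


def render(spec: Dict[str, object], environment: str) -> Dict[str, object]:
    """Merge the versioned spec with the settings for one environment."""
    return {**spec, **ENV_SETTINGS[environment], "environment": environment}


def validate(rendered: Dict[str, object]) -> Dict[str, object]:
    """Reject a rendered spec that could not run; the same rules apply everywhere."""
    if int(rendered["replicas"]) < 1:
        raise ValueError("at least one replica is required")
    if ":" not in str(rendered["image"]):
        raise ValueError("image must be pinned to a tag")
    return rendered


def run_pipeline(spec: Dict[str, object], environment: str) -> Dict[str, object]:
    """Run the identical stages for any environment."""
    return validate(render(spec, environment))


spec = {"name": "shop", "image": "shop:1.4"}  # what would live in the repository
results = {env: run_pipeline(spec, env) for env in ENV_SETTINGS}
for env, rendered in results.items():
    print(env, rendered)
```

Because the local, test, and production results all come out of `run_pipeline`, a problem caught locally is a problem with the same code path that production will take.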

Rahul Subramaniam: Isn’t there a problem with the fact that if you consider GitHub to be the source of truth, then the feedback cycle that you need from a running system, which might end up in a different state, for reasons that are beyond your control – whether it be a change that one of the cloud service providers made in the environment, or the fact that they deployed an IP address that is dynamic data, or they put in a load balancer. There are n number of factors that could basically change the state of the end system, and that is the feedback that you’re getting back, right? It’s like then you would basically be ending… You can’t take that feedback and put it back in the code because that just makes no sense. So how do you build a closed loop cycle?

Kyle Brown: But why can’t you go back and put it in the code? That’s what I’m getting to, is that, okay, anytime you’re getting a change in your infrastructure, and-

Rahul Subramaniam: No, because here’s the thing, if you’re looking at the code, your GitHub as your source of truth for your insights, and not at the running system, then it doesn’t make sense because you wouldn’t get any new feedback from there, right?

Kyle Brown: No. Okay. So again, I think we’re not communicating on what the meaning of code is in this case. Okay?

Rahul Subramaniam: Okay.

Kyle Brown: Okay. The code is in the end the way in which APIs are being called on the cloud provider to represent a particular configuration of services on the cloud provider, right?

Rahul Subramaniam: Right.

Kyle Brown: I mean, there’s only one way to do things in any cloud provider. It’s got to be I’m calling a particular API, I’m feeding it a particular sequence of parameters, and it’s going to result in a particular service configuration as a result.

Rahul Subramaniam: Correct.

Kyle Brown: Okay? Now that’s what goes in the GitHub. It all has to happen at the GitHub to make the process repeatable in the multiple environments that you will always end up having to have in a cloud environment like this.

Hilary Doyle: I appreciate that the two of you are figuring out how to solve for all of these use cases. But what I also want to point out is that hell has come up no less than three times over the course of this interview. Everything from-

Rahul Subramaniam: You’re keeping count, are you?

Hilary Doyle: … Dante. I mean, I keep track of the hell and I don’t want to have to live in it. So as the CTO to the CIO, Kyle, I want you to help us see the future. What, if anything, is coming to address the issues that plague IaC, or are we looking forward to a totally different paradigm taking its place with generative AI?

Kyle Brown: No, it’s not a totally different paradigm. What it’s going to do with generative AI is it’s going to make some of the things that have been painful, slightly less painful, but it’s going to put an emphasis on the parts of the process that we’ve not paid as much attention to. Okay. Let me draw an analogy.

Hilary Doyle: Yeah. Because slightly less painful is like level two in Dante’s Inferno, which also came up earlier. So yeah, help us climb out.

Kyle Brown: When we go back to the very early days of software engineering, I’m thinking of things like Brooks’ The Mythical Man-Month. If you’ve ever gone back and you read that book, one of the things that Fred described in that book was he said, “Oh, here’s the perfect software development team.” And then you read through his description of what the roles in that software development team were. One of them, believe it or not, was typist.

Hilary Doyle: Finally, a role for me!

Kyle Brown: When was the last time any of us came across a software development team where we needed a typist? Okay. No. Technology took that away from us and it made that, through technology, everybody’s job. Okay. Well, the way I look at what generative AI is doing is it’s doing some similar things. It puts the emphasis back on the process that we’ve been able to essentially avoid because it used to take us so long to actually write and test the code.

That means that these process issues of, where is the source of truth, how do I do my builds, how do I do my tests, those become much more in focus, and that’s how things are going to change. I think one of the things that we’re going to see coming out of generative AI is a lot more emphasis put on testing because you can’t hide behind coding anymore. And what we’re going to find out is a lot of developers are really bad at testing. And it’s especially true of infrastructure as code developers because they’ve never had to think about what a test of infrastructure as code actually means, and it’s going to change the way we work.

Rahul Subramaniam: So Kyle, I’m going to jump in here and ask you this straight up. We’ve discussed so many different things. Let’s try and wrap it up into three simple things for developers to watch out for or take care of as they’re getting into IaC. IaC, as we’ve discussed, is the default. So let’s just assume that they’re going to be doing IaC. What are the three pieces of advice you’d give out of all of your experience?

Kyle Brown: Absolutely. All right. So the first piece of advice I’m going to say is don’t rely on just one tool. It’s the same as being a programmer. I need to be able to switch between Terraform and Ansible and maybe something else based on the particular infrastructure as code problem I’m trying to solve. And that’s going to be key: you need to have multiple tools in your toolbox and you need to keep those tools refreshed.

The second piece of advice I want to give is if this is software development, you have to treat it as software development. That means that you have to implement all of the best practices around software development that maybe we’re not used to doing in infrastructure as code. That includes things like testing that we just talked about.

You need to think about what a unit test would look like for your IaC. You need to think about what an integration test would look like for your IaC, and those are different beasts, and that means you’ve got to do things differently through that.

And then the final thing is you need to be able to make sure that you have an SDLC, a software development life cycle that encompasses all the different parts of what you’re doing. So for instance, let me ask this question of your SREs that you have that maybe we’re giving advice to. Are you guys using user stories to define the way that your infrastructure should be laid out? Huh? What? No. We just go into our drawing tool and we lay out our infrastructure. That’s the way we used to do architecture for software too.

But then we learned it’s much better if you actually start with your product owner and you ask them, “What are the things you’re trying to accomplish? What are those functional and, importantly, non-functional things you want to do?” And you write up user stories and you write up epics and you start thinking through things that way. And maybe you even apply design to the problem of infrastructure as code. How is this going to be consumed? How is this going to be used by people who are not necessarily the same people that are writing the infrastructure as code?

So you see what I’m really saying here is I’m saying that an IaC developer needs to be a full stack developer, and they need to be part of a development team that is implementing the kinds of development and agile best practices that we’ve all read about and all followed if we’re doing software development. Because Nicole Forsgren told us about all these things in Accelerate and said, “This is what makes your team an elite team.” You need to make sure that your SRE teams and your IaC teams are also doing those same practices. And that’s, I think, the key advice I would give.
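To make the point about testing a little more concrete, here is a minimal sketch in Python of what a unit test for IaC might look like. Nothing in it comes from the episode: the resource schema and the names `render_bucket` and `check_bucket_policy` are hypothetical, chosen only for illustration. It treats an infrastructure definition as plain data and checks it against policy rules without deploying anything. An integration test, by contrast, would deploy the definition to a real or sandbox environment and verify the running result.

```python
"""Hypothetical example: unit-testing an infrastructure definition as plain data."""
from typing import Any


def render_bucket(name: str, environment: str) -> dict[str, Any]:
    """Build a storage-bucket definition (a made-up schema, not a real provider's)."""
    return {
        "type": "storage_bucket",
        "name": f"{environment}-{name}",
        "encrypted": True,
        "public_access": False,
        "tags": {"environment": environment, "managed_by": "iac"},
    }


def check_bucket_policy(resource: dict[str, Any]) -> list[str]:
    """Return a list of policy violations; an empty list means the definition passes."""
    problems = []
    if not resource.get("encrypted"):
        problems.append("bucket must be encrypted")
    if resource.get("public_access"):
        problems.append("bucket must not be public")
    if "environment" not in resource.get("tags", {}):
        problems.append("bucket must carry an environment tag")
    return problems


def test_rendered_bucket_passes_policy() -> None:
    # Unit test: no cloud calls, only the rendered definition is inspected.
    bucket = render_bucket("logs", "staging")
    assert bucket["name"] == "staging-logs"
    assert check_bucket_policy(bucket) == []


def test_public_bucket_is_rejected() -> None:
    # A deliberately broken definition should be caught by the policy check.
    bucket = render_bucket("logs", "staging")
    bucket["public_access"] = True
    assert check_bucket_policy(bucket) == ["bucket must not be public"]


if __name__ == "__main__":
    test_rendered_bucket_passes_policy()
    test_public_bucket_is_rejected()
    print("all checks passed")
```

Because the checks run against data rather than live resources, they are fast enough to run on every commit, which is one way to bring IaC into the same development life cycle as application code.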

Hilary Doyle: Kyle, thank you for bringing up design and user stories. You’ve made hell sound almost manageable. We really appreciate your being with us.

Kyle Brown: Happy to do it.

Rahul Subramaniam: Yeah. Thank you so much for being here, Kyle.

Kyle Brown: Thanks everyone. This has been fun.

Rahul Subramaniam: It’s been an absolute pleasure.

Hilary Doyle: Okay. Rahul, hot takes. Did talking to Kyle loosen your mind about IaC or just further cement what you already knew?

Rahul Subramaniam: Okay. So Kyle brought up some very interesting things that I hadn’t thought about in a while. I mean, the first, the purist view of making sure that the system is immutable sounds really appealing, but I don’t think it’s realistic.

Hilary Doyle: Why not?

Rahul Subramaniam: So that’s literally not the way we do things today. Today, we deploy infrastructure once and then keep updating it over and over again. That’s the way developer workflows work today. Changing that is probably going to be the hardest thing on this planet.

Hilary Doyle: Is it harder than staying mutable?

Rahul Subramaniam: Very much so.

Hilary Doyle: Okay.

Rahul Subramaniam: The second is on his advice about having a plethora of tools. I’ll be honest and say that I was cringing at the thought of having to manage multiple code bases in different languages just to deploy the infrastructure. I mean, I can say from experience that it’s going to be a recipe for disaster.

Hilary Doyle: Yeah. This is supposed to make things easier, not more difficult.

Rahul Subramaniam: Exactly. And lastly, I agree with his emphasis on needing an SDLC for IaC. I mean, hopefully that’ll bring light to aspects of IaC that we just ignore today. These things are really, really important for the long run. But let me also just give you my take on IaC.

Hilary Doyle: I’d love to hear it.

Rahul Subramaniam: Okay. I don’t like the fact that we are making it more flexible for developers with these programming languages. I personally think that the language to describe infrastructure should be stricter and more opinionated. Okay.

Hilary Doyle: Interesting.

Rahul Subramaniam: We are already struggling with just keeping up with the list of services and best practices that the CSPs are coming up with and evolving literally on a day by day basis. With expressive programming languages, there will be an infinite number of ways in which that infrastructure can be expressed as code, and there are a million other ways that things can actually go wrong. So like I said, stricter and more opinionated is really the way I would go about this.

Hilary Doyle: Listen, whatever it takes to keep us out of the inferno. We would love to hear what you think about IaC and everything we’ve talked about with Kyle. You can reach us at [email protected]

Rahul Subramaniam: And please leave us a review and don’t forget to follow the show to get the new episodes as soon as they are released.

Hilary Doyle: AWS Insiders is brought to you by CloudFix, an AWS cost optimization tool. You can learn more about them and you should at cloudfix.com.

Rahul Subramaniam: Thanks for listening. Bye-bye.

Meet your hosts

Rahul Subramaniam

Host

Rahul is the Founder and CEO of CloudFix. In the last 13 years, Rahul has acquired and transformed 140+ software products. More recently, he has launched revolutionary products such as CloudFix and DevFlows, which transform how users build, manage, and optimize in the public cloud.

Hilary Doyle

Host

Hilary Doyle is the co-founder of Wealthie Works Daily, an investment platform and financial literacy-based media company for kids and families launching in 2022/23. She is a former print journalist, business broadcaster, and television writer and series developer working with CBC, BNN, CTV, CTV NewsChannel, CBC Radio, W Network, Sportsnet, TVA, and ESPN. Hilary is also a former Second City actor, and founder of CANADA’S CAMPFIRE, a national storytelling initiative.
