Season 2: Episode #7

Behind the Scenes at Prime Day with AWS

Amazon invented a holiday that’s now one of the biggest international shopping days in the world: Prime Day. What powers it and how do they use AWS? We take you behind the scenes with AWS’s own Vice President and Chief Evangelist Jeff Barr to talk millions of requests per second, billions of products sold, and demands on AWS that keep getting bigger each year.

Guest

Jeff Barr

Vice President & Chief Evangelist for Amazon Web Services

Transcript

Hilary Doyle: We are just a few days away from one of the biggest online retail events of the year, assuming you’re listening to this episode right when it comes out, which, huh, you should be.

Rahul Subramaniam: More than 300 million items will be sold worldwide. That’s more than 100,000 items a minute. We are talking, of course, about Prime Day.

Hilary Doyle: Yes, consumerism, it’s safe on our watch, and we are going to take you behind the scenes to the organized chaos that is Prime Day – with the one and only, Jeff Barr. He is the Chief Evangelist at AWS. And all of these sales, all of these events, they’re all made possible by AWS.

Rahul Subramaniam: The resources needed for an event like this are staggering. How do you prepare? How do you test? What if something goes wrong? And what happens when Prime Day goes live?

Hilary Doyle: This is AWS Insiders, an original podcast from CloudFix, bringing you what you need to know about AWS through the people and the companies that know it best. CloudFix is the non-stop, automated way to find and fix AWS recommended savings opportunities. It never stops. I’m Hilary Doyle; I’m the co-founder of Wealthie Works Daily.

Rahul Subramaniam: And I’m Rahul Subramaniam, I’m the founder and CEO at CloudFix.

Hilary Doyle: In 2015, Amazon launched Prime Day – their own retail holiday to encourage a boost in sales on the platform and in a phrase, it worked really well. In the last eight years, sales have grown from about 900 million in that first year, nothing to sneeze at, to over 12 billion in 2022.

Rahul Subramaniam: But to make that happen, the most crucial ingredient is AWS. We are talking about an event that needs systems to handle millions of requests per second for a sustained period of time. There are event buses handling hundreds of billions of messages, databases with petabytes and billions of transactions, and CDNs delivering billions of product images and video streams. I think I just said billion too many times. Anyway-

Hilary Doyle: You can never.

Rahul Subramaniam: It just gets bigger and I can’t wait to hear about what new numbers this Prime Day will astonish us with.

Hilary Doyle: Listen, we will ask Jeff Barr to answer those questions, see the future, take us behind the scenes at Prime Day. But before we get to that, as always, we’ve got the latest from AWS. Here are your news headlines.

Let’s open with some AWS open source news. AWS CloudFormation recently announced the general availability of AWS CloudFormation Guard 3.0. This is an open source, domain-specific language and command-line interface that will help validate that your cloud infrastructure complies with your company policy guidelines. Sounds important, Rahul, what will this mean for developers?

Rahul Subramaniam: Okay, so controlling what resources get launched and how they get launched in AWS is like playing a game of whack-a-mole with about 200,000 holes, one for each API.

Hilary Doyle: Sounds amazing.

Rahul Subramaniam: It’s more frustrating than amazing. But a large part of that leakage actually comes from the fact that the vast majority of deployments are initiated by infrastructure as code or CloudFormation scripts, and these were always outside the purview of any rule-based configuration tool in the AWS catalog. But now, with Guard 3.0, we can finally enforce some of those constraints on the CloudFormation scripts and make sure that these resources comply with these organizational policies.
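To make the idea concrete, here is a minimal Python sketch of a policy check run against a template before anything is deployed. It is not Guard's rule language or its API. The policy (every S3 bucket must declare encryption), the function name `unencrypted_buckets`, and the sample template are all assumptions made for illustration.

```python
def unencrypted_buckets(template: dict) -> list[str]:
    """Return logical IDs of S3 buckets that declare no BucketEncryption."""
    violations = []
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") != "AWS::S3::Bucket":
            continue  # this hypothetical policy only covers S3 buckets
        properties = resource.get("Properties", {})
        if "BucketEncryption" not in properties:
            violations.append(logical_id)
    return violations


# Hypothetical template, already parsed into a dictionary.
sample_template = {
    "Resources": {
        "LogsBucket": {"Type": "AWS::S3::Bucket", "Properties": {}},
        "DataBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {"BucketEncryption": {}},
        },
        "Worker": {"Type": "AWS::Lambda::Function", "Properties": {}},
    }
}

print(unencrypted_buckets(sample_template))
```

A check like this would typically run in a build pipeline, so a non-compliant template is rejected before any resource is launched.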

Hilary Doyle: Those organizational policies sound like keep our costs low, and this is the way to do it. Okay, so moving on to our next story. Unfortunately, it has nothing at all to do with fashion. Don’t be fooled. AWS just announced the general availability of AWS AppFabric. It is a no-code service that connects SaaS applications with security tools, so it’ll connect to Jira Suite, your Google Workspace, or Microsoft 365 with Netskope, Rapid7, or Splunk. Rahul, what impact is this service poised to have?

Rahul Subramaniam: Okay, so now this is one of the biggest news items this quarter, and it is not just about security. Hilary, how often do you find that your work involves navigating at least three or five different systems?

Hilary Doyle: Oh, all the time, and five is modest.

Rahul Subramaniam: Exactly. Now, you might start with an email that just came in for a request to schedule a meeting and follow-ups with a customer. Now that causes you to look into your calendar, log into Jira, Zendesk, a bunch of other systems before you can actually respond back or close the task. Now, AppFabric allows you to bring all of that data across all these different systems together in one place. And the best part is that you can start using generative AI tools on this unified dataset to gather insights and significantly improve productivity, maybe even have the bot do the work for you.

Hilary Doyle: One dashboard to rule them all. I’ve been waiting for this for a long time. Finally, a little throwback and throw ahead. Rahul, registration is now open for AWS re:Invent. It’s November 27th to December 1st in Las Vegas. Last year, you will recall, we broadcast from the roof of the Nobu Hotel. We had-

Rahul Subramaniam: The terrace. The terrace of a suite, Hilary, not the roof.

Hilary Doyle: Well, it wasn’t a roof? It felt like a roof. I mean, I’m sorry, did that sound less impressive? I think, well, you can see we basically stared out across the entire Las Vegas strip. We recorded five shows while we were in Vegas. It’s worth scrolling back in our feed to listen to those shows. The Werner Vogels episode alone is worth listening to again-

Rahul Subramaniam: Absolutely.

Hilary Doyle: … I have to say. Yeah, I listen to it once a month. What are you most looking forward to for this year, Rahul?

Rahul Subramaniam: Okay. I’m just so excited about this year’s re:Invent, and CloudFix is going to have an even bigger presence this year.

Hilary Doyle: Whoa.

Rahul Subramaniam: And my instinct tells me that re:Invent will be all about two topics. First one is going to be cost optimization, a continuation from last year’s asks, and of course, generative AI. Those are the dominant conversations in the cloud world right now.

Hilary Doyle: It’s going to be an amazing year at re:Invent.

Rahul Subramaniam: Epic.

Hilary Doyle: Epic. Epic. But really now all I can think about is where are you going to set up your HQ? And I hope it’s on the set of Cirque du Soleil’s O or something equally extravagant. It’ll be Rahul just dangling from a trapeze. That’s it for your AWS news headlines.

As the Chief Evangelist for Amazon Web Services, Jeff Barr shares the AWS story with audiences all over the world. He’s worked in the software industry since 1976. He started out working part-time at a computer shop while still in high school – start early, kids – and since then he has worked at a variety of large and small organizations, served as VP of Engineering, and worked at tech giants such as Microsoft and Amazon. Welcome to the show, Jeff.

Jeff Barr: Really happy to be here today and thanks for the great intro.

Hilary Doyle: Well listen, Prime Day is around the corner. So we want to know what AWS is like right now on the ground. I mean are people just running around? Is hair on fire? Tell us what the week before Prime Day is like every year.

Jeff Barr: So let’s see. I mostly work from home, so if people’s hair is on fire, I haven’t actually seen it.

Hilary Doyle: You’re safe from the flames.

Jeff Barr: Yeah. I don’t think it’s actually a panic. My sense is that we’ve been through this a lot of times before and it’s a very, very long preparatory period where we do a lot of planning, we do forecasting, we do testing, we do even some simulations to make sure we understand how everything is going to behave well under extreme load with the expectation that the day is probably going to run pretty smoothly. And if it doesn’t run as smoothly- Sometimes it runs a lot more smoothly as seen from the outside than from the inside, right? We’re always working behind the scenes to keep everything running really well.

Hilary Doyle: I had never thought about Prime Day as the ultimate experience in QA testing, so I’m curious, on the AWS side, obviously you don’t want things to break, but do you get more excited about things failing during Prime Day as a way to learn forward?

Jeff Barr: We certainly always want to learn, but we wouldn’t always want to learn in a way that’s visible to our customers. So we do a lot of capacity planning to make sure that we understand how much of every different resource we’re going to need, make sure we have enough of those resources, make sure that various limits and quotas are set high enough that we’re not going to bounce into them in a surprising way. We do something that we call resilience engineering, where we’re going to validate all this planning and preparation.

The brand name for this is Game Day where we’re going to actually run at scale. We’re going to probably introduce some simulated failures. We’re going to make sure that the things we believe are going to fail over will actually fail over, that they’re going to recover properly. We’re going to push each system really hard and make sure that it scales in the way that we actually think it’s going to scale. And then, like any complex system, there’s always a bottleneck. But you never know what that bottleneck is until you reach it. So we want to make sure we know what those are ahead of time.
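As a rough illustration of the game day idea, the Python sketch below injects a simulated failure into a primary endpoint and checks that requests fail over to a standby and then recover. The `Endpoint` class, the `call_with_failover` helper, and the two-replica setup are hypothetical stand-ins, not a description of Amazon's actual tooling.

```python
class Endpoint:
    """A stand-in for one replica of a service."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def call_with_failover(endpoints, request):
    """Try each endpoint in order and return the first successful response."""
    for endpoint in endpoints:
        try:
            return endpoint.handle(request)
        except ConnectionError:
            continue  # move on to the next replica
    raise RuntimeError("no healthy endpoint available")


primary = Endpoint("primary")
standby = Endpoint("standby")

# Game day step: inject a simulated failure, then check that traffic fails over.
primary.healthy = False
during_failure = call_with_failover([primary, standby], "checkout")

# Recovery step: restore the primary and check that it takes traffic again.
primary.healthy = True
after_recovery = call_with_failover([primary, standby], "checkout")

print(during_failure)
print(after_recovery)
```

The point of the exercise is the pair of checks: the failure path is exercised on purpose, before a real outage forces the question.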

Rahul Subramaniam: While it might not be an ultimate exercise in QA because it seems very systematic, it seems very organized, it does seem like an ultimate exercise in getting to high availability. And there’s always this debate about high availability versus DR, or disaster recovery. I don’t think DR applies here because you just need the high availability. That’s all you’re going for. Again, what can other customers take away from that exercise in high availability?

Jeff Barr: Sure. So we are always looking to improve the availability and the durability of our systems and the scale we run at, where billions and trillions are routine numbers, even a fraction of a percent of a failure turns out into a big number when you multiply it out. So any time you look at things like DynamoDB and the fact that last year we did 105 million requests per second…

Hilary Doyle: Oh my God.

Jeff Barr: So if even some small fraction of those aren’t processed properly, that’s an issue. So anything we can do to make any individual component of the system more reliable is going to have a positive effect on everything else.
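The arithmetic behind that point is easy to work through. The sketch below takes the 105 million requests per second quoted above and multiplies it by a few failure rates. The rates themselves are hypothetical, chosen only to show how quickly a small fraction becomes a large absolute number.

```python
# Figure quoted in the conversation: DynamoDB requests per second at peak.
PEAK_REQUESTS_PER_SECOND = 105_000_000


def failed_per_second(failure_fraction, requests_per_second=PEAK_REQUESTS_PER_SECOND):
    """Requests per second that fail at the given failure fraction."""
    return requests_per_second * failure_fraction


# Hypothetical failure rates, expressed as percentages.
for percent in (0.1, 0.01, 0.001):
    failures = failed_per_second(percent / 100)
    print(f"{percent}% failing -> about {failures:,.0f} requests per second")
```

Even at one thousandth of a percent, roughly a thousand requests would fail every second at that peak.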

Hilary Doyle: I’m curious about what the scale down is like within AWS the day after Prime Day. Is there a process?

Jeff Barr: I think a lot of this is going to ultimately be driven by auto scaling. So auto scaling is going to be driven by various kinds of business metrics and service metrics. So we’re going to dial in additional storage and servers and bandwidth on an as needed basis. And while auto scaling is great both on the uphill side and the downhill side, I do think while we’re talking about it, yes, it’s kind of magic in that it’s going to bring more resources into play as needed. But you also need to make sure that you know all the bottlenecks, just because you have more processors and more servers, doesn’t always mean you can use them or you can use them efficiently. So you can’t simply think of auto scaling as this magic wand that you paint everything with auto scaling and then everything is going to happen perfectly. You have to make sure that yes, we can get the resources, but can we use them efficiently and properly?
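A small Python sketch can show why extra servers do not always translate into extra throughput. It assumes a simple proportional scaling rule and a single shared downstream limit (for example, a database). Both are illustrative assumptions, not the actual behavior of AWS Auto Scaling, and all the numbers are made up.

```python
import math


def desired_servers(current_servers, observed_load, target_load):
    """Proportional rule: add servers until per-server load is near the target."""
    return max(1, math.ceil(current_servers * observed_load / target_load))


def effective_throughput(servers, per_server_rps, downstream_limit_rps):
    """Total throughput is capped by the slowest shared dependency."""
    return min(servers * per_server_rps, downstream_limit_rps)


# Hypothetical fleet: 10 servers at 90% CPU with a 60% target.
scaled = desired_servers(current_servers=10, observed_load=90, target_load=60)
print("servers after scaling:", scaled)

# Hypothetical shared dependency that tops out at 12,000 requests per second.
for servers in (10, scaled, 2 * scaled):
    rps = effective_throughput(servers, per_server_rps=1000, downstream_limit_rps=12000)
    print(f"{servers} servers -> {rps} requests per second")
```

Once the shared limit is reached, doubling the fleet changes the bill but not the throughput, which is why the bottlenecks have to be found before the event.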

Hilary Doyle: I’m just going to stop this interview here for a second because even though Jeff is seriously putting the insider in AWS Insiders, Rahul hosts another show that is even more insider-y than this one.

Rahul Subramaniam: That’s right. Every week I’m joined by fellow CloudFix AWS enthusiast, Stephen Barr for a live stream where we break down the latest AWS news, we share insights, not just our own, but also have some amazing guests from AWS.

Hilary Doyle: Yes. So line up your electric edge of the seat questions, get them answered Live on AWS Made Easy. You can find out more at cloudfix.com/livestream. And in the meantime, we’ll keep this insider chat going. Let’s get back to our talk with Jeff Barr.

Rahul Subramaniam: Jeff, to go back to the very first Prime Day, how did it come about? Was it something that Bezos came up with? And what was the process at AWS to take on something that big for the very first time?

Jeff Barr: Interesting. So I’ve got a book on my shelf, I can’t reach it from here, but apparently one day Jeff just decided, “Hey, it would be awesome to invent, I guess kind of a new special day or a holiday,” effectively and-

Rahul Subramaniam: Holiday?

Jeff Barr: I’m not sure of the exact vocabulary that he used. He called one of my colleagues, Sarah Spillman, and Sarah happened to be on vacation and she’s like, “Okay, I’ll get right on this.” And then with a lot of prep and a lot of hard work from a lot of people, they managed to make that first one happen. And like every one of these, you do your best to make the current one a success. And you also say, what do we learn from the first one to use on the second and subsequent ones? But one thing I love is that learning isn’t just limited to Prime Day or to use inside of Amazon. Everything that we are going to learn, we’re hopefully going to be able to document and share with customers in some way, shape, or form.

Rahul Subramaniam: Okay. So I had a follow-up question on that, which is, obviously I mean there’s one part of testing all of this stuff, but in the real world scenario when Prime Day is ongoing, I imagine that the complexity of operating Prime Day means you have hundreds of services that are running all to make sure that Prime Day goes through successfully. There are probably tons of monitors, there are probably tons of alarms that are going off, there are probably tons of metrics that you’re watching very, very closely. In that complexity, how do you simplify things? What’s the simplifying model that allows you to execute and not be overwhelmed by tens of millions of alarms that are going off because you have billions of events that are happening pretty much every minute?

Jeff Barr: Sure. So this is, I think one of the differences between the monoliths and the microservices is I think we’ve all seen those pictures of these rooms, the last generation colos and the telecom providers, they have this big room with gigantic screens and a bunch of people sitting at consoles looking up at the gigantic screens trying to somehow get a handle on the whole system. It’s very different with a distributed system that’s all these microservices where each individual team, their main job is, keep this one service running as well as possible.

Now, there is one ultimate metric that we watch, and this is actually pretty interesting. This is something I learned pretty early in my time at Amazon. We very carefully watch the metric of number of successful checkouts per second. So if people can search through the catalog, put things in the shopping cart, go through the entire checkout procedure, place their order, that’s an incredibly good end metric that says everything is working properly.

And so we have a lot of historical data that tells us, at a given time of day, a given day of the year, what is the range of what that number should be? And then we’re able to set alarms that say, it’s looking a little bit lower than expected. And generally what I’ve seen for the alarms on this is the alarm is not saying checkouts have ceased, because that basically says everything has broken. It’s more, checkouts are headed towards zero. So it’s actually catching things on that trend is a really important aspect of doing this properly.
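One way to picture that kind of alarm is the Python sketch below. It compares recent checkouts-per-second samples against an expected floor taken from historical data and, separately, flags a downward trend before the floor is reached. The function name, the window size, the 20 percent drop threshold, and the sample values are all assumptions for illustration, not Amazon's alarm logic.

```python
from statistics import mean


def checkout_status(samples, expected_low, window=3, max_drop=0.2):
    """Classify recent checkouts-per-second samples (oldest first).

    Returns "trending_down" when the latest window averages more than
    max_drop below the window before it, "below_expected" when the latest
    sample is under the historical floor, and "ok" otherwise.
    """
    if len(samples) >= 2 * window:
        earlier = mean(samples[-2 * window:-window])
        latest = mean(samples[-window:])
        if latest < earlier * (1 - max_drop):
            return "trending_down"
    if samples[-1] < expected_low:
        return "below_expected"
    return "ok"


# Hypothetical samples: still above the floor of 500, but clearly falling.
print(checkout_status([1000, 990, 1005, 800, 700, 600], expected_low=500))
```

The trend check fires while the absolute number still looks healthy, which matches the goal of catching a slide toward zero rather than waiting for zero.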

My understanding is any big distributed system, you’ve got various kinds of synchronous and asynchronous processing. If it’s async, you’ve probably got some message queues and some buffers. When things start to go wrong, occasionally some entire service can fail over. That happens. Very infrequently that happens, but more likely something gets slow, buffers start to fill up, message queues start to get bigger and bigger. So you don’t get this sudden, it’s working, it’s not working. You start to go, well, things are getting behind. We’re not keeping up with reality. And so being able to monitor and alert on those, that’s one of those keys that says we can understand things before they become really critical and visible to customers, and we can remedy them. We can put more consumers on the message queue, for example, that’s going to happen routinely, maybe even automatically without any customer ever noticing that.
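The queue behavior described here comes down to simple rates. The Python sketch below computes how fast a backlog grows when consumers fall behind, and how many consumers it would take to keep up and drain the backlog within a chosen time. The function names and every number in the example are hypothetical.

```python
import math


def backlog_growth(arrival_rate, consumers, per_consumer_rate):
    """Messages per second the queue grows (positive) or drains (negative)."""
    return arrival_rate - consumers * per_consumer_rate


def consumers_needed(arrival_rate, per_consumer_rate, backlog, drain_seconds):
    """Consumers required to match arrivals and clear the backlog in time."""
    required_rate = arrival_rate + backlog / drain_seconds
    return math.ceil(required_rate / per_consumer_rate)


# Hypothetical queue: 5,000 messages/s arriving, 8 consumers at 500 messages/s each.
growth = backlog_growth(arrival_rate=5000, consumers=8, per_consumer_rate=500)
print("backlog grows by", growth, "messages per second")

# After a minute of falling behind, size the fleet to catch up within 60 seconds.
needed = consumers_needed(
    arrival_rate=5000, per_consumer_rate=500, backlog=60 * growth, drain_seconds=60
)
print("consumers needed:", needed)
```

Alerting on the growth rate, rather than on a full queue, is what gives time to add consumers before customers notice anything.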

Hilary Doyle: We hear a lot about customer obsession, obviously with AWS, but I don’t have much of a sense of what things are like on the ground between the teams, what competition is like between these teams. Can you just give us some insight into how these teams do or do not work together? How they level up against one another? How something like Prime Day really galvanizes a spirit within the company?

Jeff Barr: Sure. So one of the things I love about the way our dev teams work is that there’s a massive amount of learning that takes place. So I’m on an internal list where we share what we call operational wins, where we’ve got some service, it’s running, we look at some metrics and we say, “We can make that faster. We can make it more cost effective. We can reduce latency. We can make it more scalable,” whatever. So we post stories to ops wins that basically say, “Here’s where we were. We noticed something really interesting. We did a deep dive into the technology, into the system. We figured out a way to make it better. We changed it.” And then you always show the before, the during, the after metrics. And then these are shared very, very widely within the organization with the idea being that once we learn something that makes one service better, we are going to share that as broadly as possible so that others can look and say, “I’m doing something very, very similar in my service. Maybe I can benefit from their learning.”

So there’s this culture of being very proud to share your accomplishments. And then if something does break, we have this internal process we call the correction of errors, or COE. And the thinking with a COE is that let’s say something does go wrong and a service breaks in some publicly visible or even internal way. The first job, of course, is get the service back online and keep the customers happy. But afterward, the owner of that service has to do a very, very deep dive into all the logs, all the notes, all of the things that happened, and write up an actual story that says, “Okay, here we were. We started and 50 milliseconds in, the following started to happen. 100 milliseconds in, this next thing started to happen. Two minutes in, the buffers were overflowing. By five minutes, all the red lights are flashing.”

So we tell that story, but we also say, “Okay, here’s what happened. Here’s what we did to fix it.” And then we say, “What do we put in place to make sure this particular failure can never ever happen again?” And so those learnings are collected and also shared very, very widely. The funny thing is they’re actually numbered, and if you talk to our senior engineers or principal engineers, these COEs are so rich with helpful information, they’ll just refer to them between each other by number and they’ll say, “Oh, when you’re building systems, don’t forget about number 523.” Like, “Oh yeah, 523. That’s an amazing one.” And I don’t remember-

Hilary Doyle: Vintage.

Jeff Barr: Exactly. We have this deep culture of learning and sharing.

Rahul Subramaniam: Talking about AWS services, they’ve proliferated into a really vast number today. How does a newcomer at AWS consume this enormous amount of information? It feels almost like a barrier to entry to just figure out where to get started. So what advice do you give to a newcomer?

Jeff Barr: Well, interestingly enough, so I build Lego both creatively just with a bunch of parts, but also the sets with the instructions. It’s very, very reassuring when you open up that box and there’s thousands of pieces, but you know, if I start in step one and I do that properly and I do two, three, and four, if I do them all properly, I get the intended result. That’s such a different paradigm than what we generally have to do in our jobs, where there’s no roadmap, there’s no steps, and you have to trust your intuition sometimes to get you in the right direction. So that building Lego step by step is almost the antidote for having to be creative all day on the job.

As far as customers, I always say, “Yes, we have a lot of services, and I’m sure as developers, you are just so eager to dive in and use every last one of them in some way.” But let’s just pick a few. Build your first app with maybe some messaging, some logging, and some Lambda functions or some servers or some containers, whatever your compute paradigm is. Get really good at those few things and make sure that you feel very, very comfortable with them. And then, and only then, maybe for every weekly sprint of your service, maybe you’re adding one service every couple sprints, let’s say. But don’t look at this and say, “I must come up with the most amazing architecture and be sure to use every possible service in some way, shape, or form.” I’m sure we have some customer that uses every conceivable service. It had to have taken them years and years to get to that point.

Rahul Subramaniam: Jeff, what is the relationship in terms of this interaction between the Amazon retail business and AWS? Do you treat them just like any other customer? Or is there a special relationship where they get the special privileges of being part of the group?

Jeff Barr: So my understanding is that they are just any other customer and yes, we know each other, we’re in the same phone book. We can email within the same email domain and so forth, but when they need something from us, they’re going to request through their account manager, they need to pay their bills, they need to negotiate terms and conditions and prices and so forth. And they’re going to have solution architects, they’re going to have all the usual support you get when you are an AWS customer.

Rahul Subramaniam: Got it. I was very surprised to know that Amazon itself has an account manager at AWS.

Jeff Barr: I would guess probably more than one, given what I would imagine to be the scale and the scope of the relationship between Amazon retail and AWS.

Rahul Subramaniam: So it feels like a lot of the learning that AWS has had over the years, watching it since 2007, I feel like it’s found its way into this super comprehensive Well-Architected Framework that answers most questions around scale or managing events or how to build highly available and reliable applications. Do you have some fundamental takeaways? Again, that’s incredibly comprehensive when you go through all of it, but are there some basic takeaways that an average customer, who’s probably never going to have an event like the Amazon Prime Day, can take away from all of this learning?

Jeff Barr: So one of the processes that we do as part of preparation is something that we call auditing. And so there’s a set of internal checklists and every team has to go through that checklist on behalf of the services that they own and run. And one of our key leadership principles at Amazon is ownership. And one of the ways that that maps from people to technology is that individual teams will own individual services and microservices.

So, a team needs to respond on behalf of their services they own to a very detailed checklist. So it might be things like, how long does it take for you to recover if you have a database failure? What is the time to live or TTL you have configured for your various DNS entries like your CNAMEs? And what are the schedules for all the people? What are their on calls? What are the points of contacts? Who owns everything? So making sure that we’ve done that ahead of time so there’s no scrambling. If you need to do a failover, well you know that DNS is set up to fail over to a new IP address very, very quickly. If you need to call somebody, you know who to call, you know how to get in touch with them. So having that done ahead of time means that it should be a very smooth experience versus some kind of a panic.
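A checklist like this lends itself to an automated check. The Python sketch below models a few of the items mentioned (database recovery time, DNS TTL, on-call contact) and reports anything that misses a threshold. The field names and the thresholds of 300 and 60 seconds are assumptions for illustration. The actual internal checklist is not described in this conversation.

```python
from dataclasses import dataclass


@dataclass
class ServiceReadiness:
    """Answers one team gives for one service it owns (hypothetical fields)."""
    name: str
    db_recovery_seconds: int
    dns_ttl_seconds: int
    on_call_contact: str


def audit(service, max_recovery_seconds=300, max_dns_ttl_seconds=60):
    """Return a list of findings; an empty list means the service passes."""
    findings = []
    if service.db_recovery_seconds > max_recovery_seconds:
        findings.append("database recovery is slower than the target")
    if service.dns_ttl_seconds > max_dns_ttl_seconds:
        findings.append("DNS TTL is too long for a fast failover")
    if not service.on_call_contact.strip():
        findings.append("no on-call contact is listed")
    return findings


# Hypothetical service with a long DNS TTL and no on-call contact.
cart = ServiceReadiness("cart", db_recovery_seconds=120, dns_ttl_seconds=3600, on_call_contact="")
for finding in audit(cart):
    print(f"{cart.name}: {finding}")
```

Running a check like this well ahead of the event turns "who do we call?" into a question that was answered weeks earlier.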

Hilary Doyle: Well, the only panic here is time because we’re out of it. But thank you. I know you have grandkids who are expecting a lot out of Prime Day, so we’ll let you get to it. All the best this week, and thank you so much for being with us.

Jeff Barr: My pleasure.

Rahul Subramaniam: Thank you once again, Jeff.

Jeff Barr: All right. Good to see you both.

Rahul Subramaniam: Absolute pleasure having you here.

Jeff Barr: All right-y. Bye-bye.

Hilary Doyle: We’ve been looking forward to speaking with Jeff. He really is a cultural leader in the tech world, so I particularly appreciated his overview of the culture at AWS, how it scales from every two-pizza team right up through the whole organization. I mean failure has been a sort of cult for Silicon Valley and the startup ecosystem over the years – this celebration of failure, fail fast, fail often. It’s a really expensive way to build. So I really appreciated the way that Jeff outlined an AWS culture, not of failing fast, but of ownership. Solve immediately and locally for whatever goes wrong, and then disseminate those learnings broadly.

Rahul Subramaniam: And even though Prime Day is such a unique and massive exercise in infrastructure utilization, at the core of it, there are foundational principles that apply to anyone building solutions in the cloud. The first is build systems that are designed to be elastic and can scale as needed. The second one is build with a distributed and event-driven architecture. The third is assume that everything will fail and build and plan for resilience.

I mean the cloud is a different paradigm. I mean in a data center, you pour out huge sums of money to buy hardware that promises a low rate of hardware failures, but at the end of the day, they all fail. With a cloud, you have an infinite supply of cheap and disposable compute that you just have to assume is unreliable. The design and architecture for such systems is fundamentally different. And the distributed nature of these systems also means that not everything comes to a grinding halt just because one component of your system fails. So while not everyone is going to have an event like Prime Day in their business, the lessons that Amazon applies to their architecture are exactly the same ones that any other AWS customer should take away.

And we want to know what you think about Prime Day and all things AWS. Send us your thoughts to podcast@cloudfix.com.

Hilary Doyle: And send us the photos of everything you buy, then leave us a review of five stars. We don’t make the rules, we just announce them. And don’t forget to follow the show to listen to new episodes as soon as they’re released. AWS Insiders is brought to you by CloudFix. They are an AWS cost optimization tool and you can learn more about them at cloudfix.com.

Rahul Subramaniam: Thanks everyone for listening. Bye-bye.

Meet your hosts

Rahul Subramaniam

Host

Rahul is the Founder and CEO of CloudFix. Over the course of his career, Rahul has acquired and transformed 140+ software products in the last 13 years. More recently, he has launched revolutionary products such as CloudFix and DevFlows, which transform how users build, manage, and optimize in the public cloud.

Hilary Doyle

Host

Hilary Doyle is the co-founder of Wealthie Works Daily, an investment platform and financial literacy-based media company for kids and families launching in 2022/23. She is a former print journalist, business broadcaster, and television writer and series developer working with CBC, BNN, CTV, CTV NewsChannel, CBC Radio, W Network, Sportsnet, TVA, and ESPN. Hilary is also a former Second City actor, and founder of CANADA’S CAMPFIRE, a national storytelling initiative.
