Cloud Coffee Talk

Cloud Coffee Talk is a podcast by cloud professionals for cloud professionals. These are relatable, deep-dive, unscripted discussions, where technical talk is mixed with the real-world challenges of people, process and technology. Each episode features a different domain discussion with one or more guests who are passionate about cloud and technology.

The Cost of the Cloud

June 21, 2021 • Darren Weiner • Season 1 • Episode 4

No matter what you do in the cloud, you are paying for it! Discussions of overspend, optimization ideas, successes and failures.

[0:00:14] Darren: So welcome once again to Cloud Coffee Talk, AWS edition, sponsored by CloudButton. These are real-world problems, solutions and thoughtful discussions about working in the cloud. This podcast is meant for cloud professionals at all levels of the organization, from the executive team to those with their hands on the keyboard, putting out fires and making the world a better place. It really is meant to be unscripted, it's for those that are passionate about cloud technologies and it's not meant to be cloud 101. There's a lot of great content out there already for that. We're trying to do something that's hopefully a little bit different. I am Darren Weiner, owner of CloudButton, an independent consulting company focused on AWS cloud. And with me, once again, I have Erik DeRoin. Erik, why don't you introduce yourself?

[0:00:58] Erik: Hi everybody. I'm Erik DeRoin, I'm a site reliability engineer for a company called TrainingPeaks that does coaching software for endurance athletes and has a large AWS presence. I just want to follow up on that: the history of our podcast really was that Darren and I and a group of our other friends would get together all the time, and Darren and I would end up kind of talking shop with each other off on the side. We really enjoyed that experience and found that other people would engage with us on that, and we wanted to share that and see if there's interest from other cloud professionals in engaging with that kind of discussion, talking about our experience and how we have used and done things in our careers.

[0:01:41] Darren: And this week's topic, by the way, is cost. The cost of the cloud, which actually I'm looking forward to. I think it's going to be a really fun conversation, which is a funny thing to say about cost, but in AWS it gets pretty, pretty interesting pretty quickly. But before we get into it a little bit: How's your week going, Erik? Any dominant theme of your week?

[0:02:05] Erik: Yeah, our biggest thing is we're spinning up a new service, a brand new feature that we're trying to spin up in ECS, and I was really struggling with some of the containers and trying to understand some weird permissions issues, and ended up getting to play around with ECS Exec, their new feature, and that worked out really well, pretty impressed with that. Otherwise, it's been a week of firefighting. It's been much more of an alerts-and-responding-to-stuff SRE week than a sit-down, heads-down, focus-on-things week.

[0:02:33] Darren: Well, I hope you get the opportunity to apply some operational excellence with all those lessons that you learn from these issues that are coming up.

[0:02:42] Erik: Yeah, I think a big part is spreading organizational awareness and getting buy-in from various stakeholders and doing some of that work has been a part of it recently.

[0:02:53] Darren: For me this week, it's funny. I've actually been spending a lot of time with containers as well. And for me it was capacity providers with ECS for EC2-based containers and just diving really deep because I'm about to launch some production workloads and really want to test out the scaling profiles when capacity providers are dialed in, so really starting to design all those dials and then start turning them and seeing how things work.

So I've been in the weeds on that. And the other piece, which goes back to the first podcast we did on infrastructure as code, talking about the challenge of modernizing mature AWS environments, which are usually EC2 heavy, but having to manage those and maintain those and spending a lot of time in that department to make sure that things stay compliant. That's been definitely a big theme for me this week as well.

But let's move on to cost. The way I look at it is: Amazon is one of the largest companies in the world. And so it's no surprise they've created one of the most complex pricing structures on the planet, right? It's funny because early on, with all the cloud vendors, but certainly with AWS, they would market the cloud as: go to the cloud to save money. And there's certainly a huge total cost of ownership conversation to be had, especially if you're a large enterprise company that's getting rid of their datacenter, trying to spin down the datacenter and all the costs associated with that. And there's a huge TCO argument for that. But for me, I tend to focus more on small and medium sized businesses, and I think there are much more compelling arguments. Costs can be a factor. But really, it's all the discussions around capabilities and scalability and reliability and, maybe most importantly for the startups, innovation - to move to the cloud for these innovative platforms. For me, those are much more interesting arguments. But at the same time, you can do a lot of things with total cost of ownership, but it's not cut and dry. You're not going to just save money by moving to the cloud.

[0:05:08] Erik: Yeah, I agree with all of those points, I would say my first experience with AWS and cost and pricing was: we worked together at a former company and we were in AWS at that point and I wanted to learn more. So, I spun up my own account and was messing around with that and quickly saw and kind of learned that like actually AWS, for a guy just playing around on his own once you get beyond the free tier is not that cheap. And so if you're running like a single server workload and website and this, that or the other, you can find much cheaper hosting out there at other places. It's when you start to get into complicated applications that have various logic layers, various tiers. And you start to build out things beyond maybe just like an EC2 server that you find the real savings and you find the real strength and benefits of something like AWS or other cloud providers.

[0:05:58] Darren: So, as an SRE, which you are, how does cost management or cost containment come up in your world?

[0:06:05] Erik: Yeah, actually that was one of the first big, big wins that we had as a team coming into it. We really had an SRE title in name only. We really kind of regrouped and revamped on that. One of the things that we did was to look at our costs, look at our spend and find where that was going and find really where the waste was. We had a fair amount of low hanging fruit. We also had a fair amount of stuff that was harder to move. But that actually became a big motivator early on to start to clean things up, to drive better practices, to start to take this stuff seriously: the SRE mantras and missions, falling in line with some of the Google book recommendations, and kind of showing the benefits of the investment that they put into us. So we were actually able to drop our bill rather dramatically in our first six months to a year.

[0:06:56] Darren: So talk about some of that low hanging fruit, let's dive deeper into that.

[0:07:00] Erik: So, I know you and I have talked a number of times about this, and I hope our listeners haven't had this experience, but EBS volume snapshots were a good one. That's a good way to waste a lot of money. There also ended up being a lot of S3 storage costs where we had way more database backup stuff than we ever needed. So, we actually sat down and did a disaster recovery plan and did a little bit more deliberate practice around that kind of stuff and understanding what that would look like and where we want our SLOs, SLAs to be around that. And then tweaked things. We got rid of a bunch of database backups. We got rid of a bunch of EBS volume snapshots. We were able to move some pretty heavy and costly workloads out of some EC2 worker instances and eventually move them into Lambda, and so pay for what we use, and that was a decent amount of savings.

[0:07:49] Darren: When you say - it's really important that the whole idea of operational excellence, you can certainly go in there and just clean up a bunch of data. But did you also sit down and map out lifecycle policies?

[0:08:03] Erik: Yeah, absolutely. We originally had lifecycle policies; we had some pretty sharp people working for us. We weren't just total novices not knowing what we were doing, but it was really ad hoc. It was just one guy deciding on his own: I guess this is safe enough that we'll have it for six months and then we'll rotate daily database snapshots, we'll rotate them into Glacier, six months down the road and we're going to keep those for two years. So there were just huge, huge amounts of data sitting in there that we weren't really using. So we sat down, we talked: what are the cases where we need to restore something that's this old - what does that look like? If it's more than a couple of days, then we're looking at some sort of data loss scenario, looking at some sort of disaster recovery, something bad. Something really bad happened. At that point, if we have to spend 24 hours to get it out of Glacier after it's been a week or two or whatever it might be - Glacier is 12 hours or less - that's fine. Not to mention that we still have to do all the restoration work after that. So, if we have to wait a little bit of time, that's something we're willing to do and that will save us X number of dollars.
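
As a rough illustration of the kind of policy Erik describes here (daily backups moved to Glacier after about six months, then kept for a fixed period), the following is a minimal sketch that builds a lifecycle configuration in the dictionary shape that boto3's `put_bucket_lifecycle_configuration` accepts. The `db-backups/` prefix, the rule ID, and the 180-day and 730-day values are assumptions for illustration, not TrainingPeaks' actual settings.

```python
def backup_lifecycle_rule(prefix, glacier_after_days=180, expire_after_days=730):
    """Build one S3 lifecycle rule: move objects to Glacier, then expire them.

    The day counts are illustrative defaults (roughly six months and two
    years, counted from object creation), not values from the episode.
    """
    if expire_after_days <= glacier_after_days:
        raise ValueError("expiration must come after the Glacier transition")
    return {
        "ID": f"archive-{prefix.strip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": expire_after_days},
    }


# Hypothetical prefix holding the daily database backups.
lifecycle_configuration = {"Rules": [backup_lifecycle_rule("db-backups/")]}
```

The resulting dictionary could be passed as `LifecycleConfiguration`, together with a bucket name, to `put_bucket_lifecycle_configuration`. Writing the retention numbers down as code makes them a reviewable decision tied to a recovery plan rather than one person's ad hoc choice.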

[0:09:13] Darren: The part for me with lifecycle policies is when you accidentally rotate something out. So when you have some older buckets that might have years' worth of data and you think: We won't need that and then, oops, turns out you do. Obviously, backups are an easier conversation. Customer data and what you do with that, obviously you have Intelligent-Tiering, which helps with that sort of thing a little bit as long as your file sizes aren't too small. But they've made it easy to address those issues as well.

[0:09:45] Erik: Yeah, we had some developers actually go through some of that work. So one of the things we do is - those devices you wear when you run and you do all those things, they basically generate files and then when you sync them to your phone, they send the files over, and Garmin or whoever else will send us those files that we then parse to get all your data as far as what your workout looked like and your heart rate and your power and your distance and GPS coordinates and all that kind of stuff. So they do send us files, and we have to store all those files somewhere. Figuring out the best way to store not only the original files but then the parsed data off of those files, which are basically giant JSON blobs. How to store those efficiently and effectively in something like S3, which is usually what we end up going with. That was something that we went through and were able to find some cost savings in there as well.

[0:10:33] Darren: On the EBS side, I assume this is a lot, because again your company has a mature AWS presence and probably didn't have, or doesn't have, fully immutable infrastructure, and you're storing a lot of data on EBS. So, what were some of your solutions there?

[0:10:55] Erik: We actually just went to fully immutable infrastructure, which was my sort of drum to beat and the thing that I really focused on, we had definitely some old servers sitting around and SRE was originally run by people who thought of things a little bit more traditional IT-kind of roles and were doing sort of some of the things in the traditional way. So it was coming out with a different lens, kind of getting some fresh perspective that allowed us to kind of push that envelope and push things forward and say: I think we can actually do this, I think this is actually not as hard as we think it's going to be and prove that out and make that happen.

[0:11:30] Darren: So, moving files either to EFS or S3?

[0:11:34] Erik: Yeah exactly. We moved them to S3.

[0:11:36] Darren: Very cool. Yeah, that's such a great opportunity, especially with EBS bills; whether it's volumes or snapshots, that adds up very, very quickly. I've had multiple clients with tens of thousands of dollars a month in volumes that they didn't immediately remove for good business reasons, because there might have been data recovery that was needed for elastic tiers, where if things don't downscale properly maybe that data would be needed - and it was used periodically. It was a pretty significant issue. And the snapshots just add up if you're dealing with snowflake systems that might have some important things on them - there may be some IT business operation type systems, those sorts of things. You always want to move off of that, but it can be tricky to move off of, again, some of these older systems. Anything new is really easy.

So when I start working with a new client, one of my favorite things to do is to get to the billing tab and get into Cost Explorer. For me it's like reading a novel. It has suspense, it has drama. I can trend it and I can see the history of what's been going on. Literally, you can look at their cloud maturity from just looking at how they've been evolving their bill in terms of the different services they've been adopting. And then of course it helps identify all the, as you called it, low hanging fruit, all the areas where there could be some quick wins. I just love it. They've done such a good job with Cost Explorer. You can dive so deep. There are so many different ways to filter things. I really get a lot out of that. So it's the absolute first place that I go when I look in an account.

[0:13:29] Erik: Do you go to Cost Explorer, like mess around with all the cool pretty graphs, or do you actually go into their line-by-line bill?

[0:13:35] Darren: I'll start with the bill; I'll start with the bill and then dive deeper. By the way, as part of preparation for this episode, I actually went into the account of one of my clients who has a relatively small monthly bill with AWS, and I went through and anywhere there was actually a cost for that month, I just expanded the tabs - and this is a small client and I was at least at 500 lines. Obviously, there are a few key areas at any organization that are going to represent a significant amount of the cost. But nevertheless, it brings up the point of how tricky it can be to manage spend in the cloud. So, I start with the bill and then I dive into Cost Explorer pretty quickly and start looking at breaking out things based on services. And I'll go into usage types to dive a little bit deeper and really figure out what's going on there.

[0:14:27] Erik: Yeah, I think that's fair. I occasionally enjoy expanding our bill and running through it. It's amazing how many times I'm like, wait, what is that? Because we're mostly in us-east-1 for the most part, we're not really multi region at this point and we'll see something in a different region, like wait, what is going on here?

[0:14:42] Darren: CloudFront

[0:14:43] Erik: Yeah, CloudFront, or we were backing up some files across S3 regions and so there's some stuff in there and I was like, do we still need to do this? Who set this up? Why are we doing this? So, exploring some of that kind of stuff is always a fun adventure. For me the thing that I always fail to grasp when I'm setting something up - I'll go look at the pricing tab of EC2 instances or whatever service we're using - the thing that's really hard for me to understand and predict or know, outside of a scientific wild-ass guess, is the networking stuff. Seeing how much traffic and how much is actually going on and keeping an eye on some of that kind of stuff is very interesting. There's a big cost difference depending on where your network traffic may or may not be coming from, and so understanding how data flows through your system and the amount of data that flows through your system is very important.

[0:15:36] Darren: And then how implementing some of the networking resources, things like VPC Endpoints, might change that. Networking can often represent 10-20% of a monthly bill. That's pretty significant. And you're totally right on. It's really hard. You can track things down to a certain extent. But then what do you do about it? So when you look at your CloudFront bill and how you might be able to clean that up by modifying some of your caching settings. Well, that's not a trivial thing. There are some significant implications associated with that. So almost every time you get to the point where you're trying to reduce your networking bill, it's risky. It can be very, very risky. There are very few places where it's not, when you want to reduce. You can easily increase your bill in good ways. Think about things like NAT gateways and trying to create some resiliency there: it's going to increase your bill but it's nominal. It doesn't add up to be that much. But to reduce the networking component of the bill is very hard.

[0:16:48] Erik: Yeah. Well there's a developer who - she was moving S3 files between accounts or regions or was just streaming out and then reading some data off and then storing it in DynamoDB. She was doing something like that. It was very important work, stuff we needed to migrate. Some things that we needed to do, there's no way around it, but they kind of thought it was just going to be like fire and forget, you know, it's moving storage from one to the other. Like it shouldn't be much of a cost increase. Then we pull up Cost Explorer and see this giant spike: What is going on here? And so we reached out and talked to her: "We're doing this migration of files". Yeah, that has a network cost related to it - you just doubled this cost for the month.

[0:17:29] Darren: One of the nice things about those kinds of experiments or short-term projects is, you generally get feedback pretty quickly, which is nice, but I have gotten hit by that many times. One of my favorite OMG stories of: Wait a minute, what just showed up on the bill? - It was a personal story at re:Invent when I did a SageMaker workshop and they gave me a CloudFormation template to spin up and of course, you know how things never quite go as planned when you're spinning things up, and at the end of the workshop, you break it down. Well, I did break it down. But with SageMaker, I can't remember the details. This is going back a couple of years, I don't remember if there was a notebook instance or something else that, even though the CloudFormation template broke down, it left an artifact. And I got a $300 bill. I did manage, through support, I explained what was going on, it was at re:Invent, and they waived it, but that was fun.

I was working on a project last year with a client where they had to do a bunch of spatial processing and I was trying to figure out how to move this out of the database and out of EC2 and to try to do some of the sort of decoupled microservice patterns to do this, because I knew it was a lot of processing, but it was very small chunks of files, basically using the Uber - Uber developed this H3 spatial library to process a lot of spatial things as hexagons and it's really efficient mathematically and all that. But I was dealing with hundreds of millions of little files and so I'm trying to think - I think the first time I did it, I'm like: I'm going to do this using DynamoDB and I'm just going to use Dynamo and I think I might have had a container, some containers that were basically just pulling things off - maybe there was an SQS queue involved - I don't remember all the details but all I remember is that I thought oh this is really lightweight, it's really super horizontally scalable, it's working, it's really fast. And then I got like a $2000 bill because DynamoDB - I can't remember which dimension - but it was basically the amount of read units or write units. I can't remember the details, again it was a while ago. And then I moved it over, I tried a different pattern using S3 because it's just really small files and I'll just put it on S3, and in S3 it wasn't as big a bill but it was like a $300-400 bill just around the puts and the gets because I was dealing in the hundreds of millions of files, order of magnitude.
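
To make the arithmetic behind a bill like that concrete, here is a minimal sketch of how per-request charges add up when the object count is very large. The per-1,000-request prices and the file count below are assumptions for illustration only; actual prices vary by region and storage class and should be checked against the current S3 pricing page.

```python
def request_cost(num_requests, price_per_thousand):
    """Cost in dollars of num_requests billed per 1,000 requests."""
    return num_requests / 1000 * price_per_thousand


# Assumed illustrative prices in USD per 1,000 requests, not quoted figures.
ASSUMED_PUT_PRICE_PER_1000 = 0.005
ASSUMED_GET_PRICE_PER_1000 = 0.0004

# Hypothetical workload: every small file is written once and read once.
file_count = 100_000_000

put_cost = request_cost(file_count, ASSUMED_PUT_PRICE_PER_1000)
get_cost = request_cost(file_count, ASSUMED_GET_PRICE_PER_1000)
print(f"PUT requests: ${put_cost:,.2f}")
print(f"GET requests: ${get_cost:,.2f}")
```

Storage for tiny files can be close to free while the requests are not, so batching many small records into fewer, larger objects is one way a pattern like this might be made cheaper.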

[0:20:08] Erik: That's what's kind of fun about it. There are two or three facets to every project that you end up doing in AWS. There's what can I do to make it work, what can I do to make it work efficiently and scalably, and what can I do to make it cost efficient? It's really easy to spin up a lot of infrastructure very quickly that can have a ton of CPU and just process the crap out of something, but it may not be your best decision.

[0:20:33] Darren: But again, this is the great thing about working in cloud. This is why we think so many people love it. You get to practice and explore different patterns and get quick feedback. You're getting it within a month at most. If you're really thinking about it from the perspective of, we're developing a pattern that we're thinking about deploying long term, you can start looking at it daily. And you can implement cost tagging and you can run reports on it, and so you get that quick feedback on a daily basis and you can then forecast that out. That's the tradeoff. You might get hit with a bill, but you didn't just invest in a whole server farm and you're not committed for a year. This is a really simple, quick feedback loop, which is very nice.

[0:21:26] Erik: Yeah. Well what's interesting again, if you go and you break down the bill, you can actually kind of break it down by: what's my actual compute spend. So you could actually trend that then over time and say: our traffic is growing this much and our compute is growing at this rate. You can actually start to do some interesting things to break down and understand how efficiently, at a high level, your code is running. If you have very efficient code that's pretty scalable, your compute should not necessarily scale linearly with your traffic, which is something that we saw happening at TrainingPeaks. And we were able to say: that doesn't look right, and go in and optimize and change and tweak some things. We started to see like: cool, our CPU scaling is starting to flatten out and our traffic growth continues on. That's really what you want to see in software engineering. It's a really great way to visualize that, that I did not see as much when we were, not at TrainingPeaks, but companies when we were in traditional...
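
One way to sketch the comparison Erik describes is to divide monthly compute spend by monthly traffic and watch whether that unit cost falls as traffic grows. The figures below are made-up placeholders, not TrainingPeaks data, and the function names are hypothetical.

```python
def cost_per_million_requests(monthly_spend, monthly_requests):
    """Compute spend per million requests for each month in the series."""
    return [
        spend / (requests / 1_000_000)
        for spend, requests in zip(monthly_spend, monthly_requests)
    ]


def total_growth(series):
    """Fractional growth from the first to the last value in a series."""
    return series[-1] / series[0] - 1


# Hypothetical three months of compute spend (USD) and request counts.
compute_spend = [10_000, 10_500, 10_800]
requests = [50_000_000, 60_000_000, 72_000_000]

unit_costs = cost_per_million_requests(compute_spend, requests)
print("Cost per million requests:", unit_costs)
print("Traffic growth:", total_growth(requests))
print("Compute spend growth:", total_growth(compute_spend))
```

If traffic growth outpaces compute spend growth, the unit cost drops month over month, which is the flattening curve described above. If the two grow in lockstep, that is a hint that something in the application scales linearly and may be worth a closer look.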

[0:22:21] Darren: That's fantastic to be taking that kind of lens into the application performance. That's impressive.

[0:22:29] Erik: I recommend you all do that. And now that you can do like compute savings plans...

[0:22:35] Darren: So, do you deal much with the reservations or savings plans?

[0:22:39] Erik: We do a lot of reservations. We started doing some of the S3 storage savings plans. We haven't been able to do the compute savings plans because our organization is not just TrainingPeaks, but a few other sister companies that we have, and it has to be done, at least previously and I don't know if they've updated this since, but we already did our reservations kind of for the year, that you have to do at an organization level. And so, we had a hard time committing to what these other companies were going to be doing as well as ourselves and to get them to commit to something.

[0:23:13] Darren: There are ways. Yes, do it at the org level - you actually have options when you're dealing with multiple accounts. AWS is pushing the savings plans, both the EC2 and the compute-based savings plans, very, very hard. In fact, I predict they're going to end-of-life reservations for EC2 and compute in favor of savings plans. It's going to be interesting because there's a lot of savings that can be gleaned there. But what's really nice about reservations is that you can look at your EC2 farm and you can fairly well predict, given that there's flexibility in the reservations in terms of what they do within an instance class, you can really say: I want to make reservations for 60% of my baseline load, or 80% - whatever that number is for your organization. You did a thumbs up, which listeners can't see, but you're trying to keep that pretty high because, on EC2, you generally have things fairly well dialed in, you're a very mature organization. You guys spend a lot of time on this and so you know: we're not going to be going down from this. This is our sort of foundational baseline for EC2. With savings plans, because savings plans include EC2, Lambda and Fargate, it's truly a black box algorithm where you're trusting AWS a whole lot more with what you're deciding to commit to. I mean it's a massively confusing algorithm and it really comes down to: Trust Us. You go and they make recommendations and then you decide what to do about that. There's nothing wrong with that.

I mean, I do trust AWS a fair amount. I make my living working with AWS. At the same time, it gets a little tricky when you're giving up that much control. You can't really see it in the same way, and so what I've found to be an effective strategy with the savings plans is: They'll make recommendations. I will start working some numbers in terms of what I'm seeing with Lambda, with Fargate, with EC2, and will commit usually a little bit less than what the recommendations are and then watch it over time. So instead of reservations, where you generally tend to make a fairly large reservation commitment at certain points in time, on an annual basis, with savings plans I'm finding that I'm stepping up the commitment as I go to make sure that we don't overcommit, because once things are overcommitted, you're kind of hosed. So you need to find that right balance, especially with compute that is much more variable, because things can change with Fargate. You can turn a lot of dials and make a lot of changes. And obviously with Lambda - honestly, with Lambda, who notices the Lambda part of the bill? I mean, that's one of the most beautiful things about working with Lambda: you did all these things with Lambda over the course of the month and then you look at the bill and you're like: it's a rounding error. It doesn't even make a dent.

[0:26:35] Erik: Yeah, we went from a farm of servers handling file processing to putting it in Lambda. It's like: Nothing. It's a tiny percent of what we're spending on EC2. It's so nice.

[0:26:50] Darren: It's one of my, one of my favorite things. What about the other side? What's the place where you just kind of looked down on the floor and you're like: Oh, I just dropped $5, I just dropped $10 or the things that you're just leaking out every month: Is it worth my time because it's so little? Have you run into any of those?

[0:27:11] Erik: I'd say that's actually kind of a problem we have within our organization. You know, I don't think we're wasteful by any means. We are conscious and mindful of our bill, but we have no problems spending when there's spending to be done. We have no problem letting somebody go in and explore and play around or something.

[0:27:28] Darren: Yeah. But if you drop $5 on the ground, you'd pick it up, right?

[0:27:32] Erik: I would, yeah. So we did a bad job setting up some of our playground account where we let people go and build whatever they want. They have much more lenient permissions, they can go explore. It's isolated. We didn't really have good alerting around some of the stuff. We didn't really pay close attention to it because it was new. And we kind of just got focused on some other stuff and went in there one time and were like: wait what is all this...somebody has spun up a bunch of SageMaker stuff, somebody spun up a bunch of other things and they just left them up and they were just sitting there. Okay, part of the agreement was that you guys were supposed to delete this stuff when you're done. Bad on us to not put those guidelines more hard coded in there or put the buoys in place. But yeah there's been some stuff like that. I certainly think it's a big one. Trying to think of other places where I know like we could save money there but it's hard to just...

[0:28:27] Darren: One of the things that always annoys me would be Elastic IPs. That's something that I've gotten hit by a few times where, for whatever reason, I spun up some Elastic IPs and then ended up not using them. You only pay for them if you don't use them, because of the scarcity associated with IPv4. So, if you don't attach an Elastic IP to something, they charge - it's like a prorated five bucks a month. So, if you have three of them, it's 15 bucks a month and you just have to go in and clean it up. It's no big deal. And again, it's usually with more mature accounts where that might have been hanging around for one reason or another. It wasn't maybe as code or whatever else. Little things like that are always annoying.
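
A cleanup pass like the one Darren mentions can be sketched as a filter over the address records that the EC2 `describe_addresses` call returns, where an address that is attached to something carries an `AssociationId`. The sample records and the flat five-dollar monthly figure below are assumptions for illustration (the price simply echoes the rough number quoted above).

```python
def unattached_addresses(addresses):
    """Return the Elastic IP records that have no AssociationId."""
    return [address for address in addresses if not address.get("AssociationId")]


def estimated_monthly_charge(addresses, assumed_monthly_price=5.0):
    """Rough monthly charge for idle addresses at an assumed flat price."""
    return len(unattached_addresses(addresses)) * assumed_monthly_price


# Hypothetical records in the shape of describe_addresses()["Addresses"].
sample_addresses = [
    {"PublicIp": "203.0.113.10", "AllocationId": "eipalloc-aaa", "AssociationId": "eipassoc-111"},
    {"PublicIp": "203.0.113.11", "AllocationId": "eipalloc-bbb"},
    {"PublicIp": "203.0.113.12", "AllocationId": "eipalloc-ccc"},
    {"PublicIp": "203.0.113.13", "AllocationId": "eipalloc-ddd"},
]

for address in unattached_addresses(sample_addresses):
    print("Idle Elastic IP:", address["PublicIp"], address["AllocationId"])
print("Estimated monthly charge:", estimated_monthly_charge(sample_addresses))
```

Against a real account, the input list would come from something like `boto3.client("ec2").describe_addresses()["Addresses"]`, and the idle allocations could then be reviewed before anything is released.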

[0:29:09] Erik: We had some instances sitting in the stopped state for a long time, but we were never going to start them again. We were never going to use them. Just nobody was willing to be the person who killed this old thing until finally, I was like, I'll do it, happy to do that. The other weird one that we had - well, this may not fall into that. This is more of, like, out of spite. I got annoyed and decided to spend more money instead of doing things more...well, I decided to save $5 when it cost us $50 to save that $5. We had an AWS Elasticsearch cluster and we were using their managed service, and ours got into this weird state where we basically more than doubled the number of nodes that we had in there and I could not get it to go back down, and I reached out to a couple of people and didn't hear anything. And so we were basically double paying, you know, for the size of the Elasticsearch cluster that we needed, and I couldn't get the controls to downscale it to where it needed to be. I didn't understand why it got stuck in that state. So, I just spun up an Elasticsearch cluster on EC2 and migrated all the data over there and deleted that thing. I was like: OK, we're just going to run it ourselves on EC2. This is after I spun up this whole thing myself, this brand-new infrastructure, new CloudFormation pipeline, blah blah blah. I was so mad. And that still runs in our environment currently. So I still have to manage that.

[0:30:32] Darren: You know, to your credit. You were solving a problem in a way that, again it goes back to the capability, it’s really easy to do these things if you need to.

[0:30:44] Erik: Yeah. And that was actually in the infancy of SRE, when we were taking on the world. I'd say I'd be a little more pragmatic now and be a little more patient with the cluster, be willing to push AWS a little harder to get that figured out. I would like to go back to more of their managed service for some of the clusters and stuff that we're managing ourselves, as we were full of piss and vinegar and thought we could do all these things and realized quickly that we'd stretched ourselves too thin.

[0:31:11] Darren: We've talked about it before: the tuning of any database, but certainly the distributed NoSQL ones, is work, and a lot of management is needed over time.

[0:31:22] Erik: Yeah, and I've worked on them before with you at our previous company. I actually really like Elasticsearch. I think it's a really great tool. There's a lot of interesting things going on with that company and AWS at the moment, which would be a good podcast conversation.

[0:31:34] Darren: I think it would end in fisticuffs.

[0:31:38] Erik: I don't think so. I think we're probably on the same side.

[0:31:41] Darren: Not you and I.

[0:31:33] Erik: Yeah, we’d probably get a cease and desist letter from one of them.

[0:31:45] Darren: Let's not do that. The podcast is still young. We don't want to...

[0:31:49] Erik: Yeah...get it released first. Yeah, it's a technology that we like, I enjoy working on that, but I also realize that it has its limitations. It can be difficult; it can be a bear.

[0:32:01] Darren: So, coming back to cost,

[0:32:43] Erik: Sorry.

[0:32:45] Darren: It's fine...good coffee talk. The whole idea with cost at AWS is you only pay for what you use, but of course you use everything and there's certain cost categories that really bug me. Either it costs too much or, really, should you be charging me for that? Shouldn’t that be something you just kind of give me? So, are there any categories that you can think of where it's too much for what you get?

[0:32:31] Erik: What do you mean by cost categories?

[0:32:33] Darren: Some of the things that bug me are around...CloudWatch alarms. I think CloudWatch alarms are just too expensive because I want to put...I don't want costs to be a factor when I'm trying to put more observability on my systems. I'm already paying for all the data flowing back and forth, per gigabyte. And I just think alarms should be a little bit less expensive, and Synthetics as well. I love the canaries except that they're just too darn expensive when you want to put canaries on HTTP endpoints, and you're going to have a lot of HTTP endpoints if you're doing anything in the cloud. You look at the bill on that and it's like, no, I'm just gonna build a scheduled Lambda function to do it for me.
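
The scheduled Lambda approach Darren describes could look roughly like the Python sketch below. It is only an illustration: the event shape (a `urls` list), the function names, and the choice to fail the invocation on any unhealthy endpoint are all assumptions, not something described in the episode. A schedule (for example an EventBridge rule) would invoke the handler at whatever interval is wanted.

```python
import urllib.request


def check_endpoint(url, timeout=5, opener=urllib.request.urlopen):
    """Return (healthy, detail) for a single HTTP endpoint.

    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except OSError as exc:
        # URLError, HTTPError (4xx/5xx) and socket timeouts all derive from OSError.
        return False, str(exc)
    return 200 <= status < 300, f"HTTP {status}"


def lambda_handler(event, context):
    # Hypothetical event shape: {"urls": ["https://example.com/health", ...]}
    urls = event.get("urls", [])
    failures = {}
    for url in urls:
        healthy, detail = check_endpoint(url)
        if not healthy:
            failures[url] = detail
    if failures:
        # Raising marks the invocation as failed, so it shows up as a Lambda error.
        raise RuntimeError(f"Endpoint checks failed: {failures}")
    return {"checked": len(urls)}
```

This trades the richer features of a canary (screenshots, multi-step scripts) for a plain status-code check, which is often enough for simple health endpoints.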

[0:33:27] Erik: Yeah, I think that makes sense. Something like the CloudWatch alarms, like at least now we have a lot of incentive right? You already are going to an SNS topic, you're going to be paying through whatever it's sending through to our alerting system. We're already paying for it beyond that. So, I think that would make sense. Trying to think if there's ones that really bother me.

[0:33:46] Darren: Do you use the VPN client at all?

[0:33:49] Erik: We used it for a hot minute. We were looking to migrate off of an actual server VPN that we had and into a cloud one and we're using AWS' VPN client for a while and we basically just wired it up so it shut people off after a period of time, kicked them off after a number of hours.

[0:34:07] Darren: Because those client connection hours add up quickly, and now in the era of COVID when everyone's working remotely...One of my clients is shutting down an office and they're going to save some money on the site-to-site, but they're going to make up for it with all the client connection hours they're paying.

[0:34:28] Erik: We were trying to do something like that, and we really didn't want to manage a VPN. So, we're just trying to look for a managed solution that works pretty well. If AWS' client VPN worked on mobile devices, they would have had us, but unfortunately it was a deal breaker for us as we got into it.

[0:34:44] Darren: So, thinking about some more complex ways to save money, have you worked with Spot at all?

[0:34:54] Erik: Yes. We have at times played with some of the Spot instances. The biggest effort we made towards those was around our build servers. So we run TeamCity...

[0:35:05] Darren: That's a nice application for it.

[0:35:07] Erik: Yeah, so it runs a bunch of agents and we attempted to use Spot instances and we just found it to be a little spotty. Developers got a little irritated sometimes. Build servers would disappear, or they'd just take a little while to spin up and find a reservation, and we just didn't spend enough time, energy and effort really tuning our price points and making sure that we had the availability that we needed and that they were going to stick around, and we just kind of ran into a few hiccups along the way. That kind of led us to abandon it for the time being, unfortunately.

[0:35:39] Darren: I do think it's bandied about as a cost saving solution a little too loosely, I think it's great when you find the right applications and you have those workloads that can handle interruptions, that are idempotent, so it could pick up where it left off and you're not going to lose anything. Batch processing is, I think, a really good one that works out really well. I've done a fair amount of that on ECS with spot, but really the workloads have to be carefully considered.

[0:36:10] Erik: Yeah, absolutely. Some of our build processes run on Windows machines, and some of the spin-up time just takes a little bit longer. Some of them are running automated test suites that are a little bit larger, so they just take a little bit longer. You just have to make sure that their availability is there through all of that. And it's something you have to be really careful with, and understand the idempotency of the compute that you're using in that case. It needs to be able to run again and again and again and pick up where it left off, however you want to manage that. But you need to be able to handle those scenarios and those cases really well.
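
To make "pick up where it left off" concrete, here is a minimal Python sketch of a checkpointed batch loop. Everything in it is hypothetical: the function names, the JSON checkpoint format, and the use of a local file. On a real Spot instance the checkpoint would need to live somewhere that survives the interruption (shared or object storage, a database), and `handle` itself still has to be idempotent, because an interruption can land between doing the work and recording it.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the set of item IDs already completed (empty on first run)."""
    try:
        with open(path) as f:
            return set(json.load(f))
    except FileNotFoundError:
        return set()


def save_checkpoint(path, done):
    """Write the checkpoint to a temp file, then rename it into place,
    so an interruption mid-write cannot leave a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(sorted(done), f)
    os.replace(tmp_path, path)


def run_batch(items, handle, checkpoint_path):
    """Process each item once across restarts; return the items handled in this run."""
    done = load_checkpoint(checkpoint_path)
    processed = []
    for item in items:
        if item in done:
            continue  # finished before a previous interruption
        handle(item)
        done.add(item)
        save_checkpoint(checkpoint_path, done)
        processed.append(item)
    return processed
```

Re-running `run_batch` with the same item list after an interruption skips whatever was recorded as done and continues with the rest.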

[0:36:46] Darren: And then you need to possibly have some additional tooling because I know for instance now you can choose to hibernate the spot instances instead of just...

[0:36:50] Erik: I saw that. Yeah.

[0:36:51] Darren: But then what do you do with that? You have to build some tooling to kind of address that if it's needed. I couldn't figure out a good use case for it in my world.

[0:37:10] Erik: So just spin up a normal EC2 and since it just has a cron job that runs on it and just...I like to think of really inefficient ways to do things in the cloud.

[0:37:25] Darren: Rube Goldberg would be really impressed with a lot of what we do in the cloud these days.

[0:37:28] Erik: I think that at its core that is what cloud computing is turning into, and I'm here for it. I'm all about it. Really efficient Rube Goldberg machines, with lots of redundancy.

[0:37:42] Darren: Lots and lots. So, I'm trying to think if there are any other topics we can talk about now.

[0:37:48] Erik: The only thing that popped into mind when you were talking about cost categories: I was thinking about the Cost Explorer, and it kind of tied into one of my biggest complaints about CloudWatch that has gotten better, which is your ability to break things out by tags, which they've gotten way better about in Amazon as well. So, like, how do you break down what pieces cost what? If you're in an organization, you can take tax credits for R&D work. So do you tag things 'R&D', so billing down the road can see: was this an R&D cost, was this a production cost? Was this cost of goods sold? Was this whatever? We actually got to start doing some of that kind of stuff. We got to go break things apart and understand what our prod servers were running or what our app, versus our worker nodes, versus our web tier were costing. I wish it was baked in a little bit better, in the same way that I wish some of that stuff was baked into CloudWatch a little bit better. Once you start to build that stuff out you can really get great breakdowns of where your spend is from.
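
A tag breakdown like the one Erik describes can also be pulled programmatically. The Python sketch below is one possible shape, assuming boto3 is available, the tag has been activated as a cost allocation tag, and a hypothetical tag key such as `CostCategory` is in use. It groups Cost Explorer results by that tag and sums them per value; it ignores pagination (`NextPageToken`) for brevity.

```python
from collections import defaultdict
from decimal import Decimal


def cost_by_tag(response, tag_key):
    """Sum UnblendedCost per tag value from a get_cost_and_usage response.

    Tag group keys come back as "<tag_key>$<value>"; an empty value means untagged.
    """
    totals = defaultdict(Decimal)
    prefix = f"{tag_key}$"
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            key = group["Keys"][0]
            value = key[len(prefix):] if key.startswith(prefix) else key
            amount = Decimal(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[value or "(untagged)"] += amount
    return dict(totals)


def fetch_cost_by_tag(tag_key, start, end):
    """Query Cost Explorer for [start, end) dates in YYYY-MM-DD form."""
    import boto3  # imported here so cost_by_tag can be used without boto3 installed

    ce = boto3.client("ce")
    response = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": tag_key}],
    )
    return cost_by_tag(response, tag_key)
```

Calling something like `fetch_cost_by_tag("CostCategory", "2021-05-01", "2021-06-01")` would then return a mapping such as R&D versus production spend, with untagged spend called out separately.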

[0:38:51] Darren: Yeah, I do a lot of cost tagging by environment. Same idea. I think it's improved significantly in the last year or so. So, for example, one of the challenges, because you use a lot of Fargate, and I believe it wasn't until recently where you can actually have the tags from the service propagate to the running tasks. So there was no way in the Fargate world to separate your workloads through cost tagging. Now you can, although you can't do it on existing services, which is very annoying. You actually have to re-spin the service, not just update the service, you have to actually re-spin the service entirely and/or have some sort of Lambda function to occasionally check the service and the associated tasks and tag the running tasks with that. So it's a little bit clunky. Hopefully they'll continue to move on that. But now at least, if you spin up new services, you can propagate the tags down to the running tasks and that shows up in your report.
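
The "Lambda function to occasionally check the service and tag the running tasks" workaround Darren mentions might be sketched in Python along these lines. This is an assumption-laden illustration, not his implementation: the function names and the event fields (`cluster`, `service_name`, `service_arn`) are made up, and the ECS client is passed in so the logic can be exercised without AWS. The boto3 calls used are `list_tags_for_resource`, `list_tasks`, and `tag_resource`.

```python
def propagate_service_tags(ecs, cluster, service_name, service_arn):
    """Copy a service's user-defined tags onto its running tasks.

    Returns the number of tasks tagged. Tags prefixed "aws:" are reserved
    and cannot be set by users, so they are skipped.
    """
    service_tags = [
        tag
        for tag in ecs.list_tags_for_resource(resourceArn=service_arn).get("tags", [])
        if not tag["key"].startswith("aws:")
    ]
    if not service_tags:
        return 0

    tagged = 0
    kwargs = {"cluster": cluster, "serviceName": service_name}
    while True:
        page = ecs.list_tasks(**kwargs)
        for task_arn in page.get("taskArns", []):
            ecs.tag_resource(resourceArn=task_arn, tags=service_tags)
            tagged += 1
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token  # keep paging until all tasks are seen
    return tagged


def lambda_handler(event, context):
    import boto3  # imported here so the function above has no hard dependency

    ecs = boto3.client("ecs")
    count = propagate_service_tags(
        ecs, event["cluster"], event["service_name"], event["service_arn"]
    )
    return {"tagged": count}
```

Because tasks are replaced over time, a function like this would have to run on a schedule, which is part of why the approach feels clunky compared with setting tag propagation when the service is created.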

[0:39:57] Erik: It's a great opportunity to do some chaos engineering and really test out your error budget.

[0:40:02] Darren: Yeah, I just ran into that recently actually. It was that exact scenario, with the existing services, I'm trying to think about the best way to solve that problem without potentially causing some interruptions to my clients.

[0:40:20] Erik: Yeah, that would be an interesting - I keep talking about our next interesting podcast ideas - talking about downtime, talking about error budgets, talking about availability, reliability, however you want to talk about those kind of components in the cloud because zero is not a goal. You know, you're going to have downtime, you're going to have to deal with some of these hiccups. How do you manage those effectively?

[0:40:41] Darren: Indeed, instead of spending that significant additional effort to get to zero, focus on meeting your service level objectives and take that time and expense and use it to release features, improve your overall experience, which is what most users of the software want. Okay, I think we're wrapping up. Erik, thanks for being part of another episode of Cloud Coffee Talk.

[0:41:06] Erik: Thanks for having me, Darren. Yeah, the cost subject is a great topic for anybody who is interested in the cloud. I think for most organizations, their cloud spend is probably going to be their second highest after people and maybe health insurance. So, it's worth keeping an eye on, it's worth paying attention to, and it's a really big indicator of how your organization is doing.

[0:41:24] Darren: Really, whatever workloads you're running in the cloud, you're all paying for it. So, something we all have in common. So, thanks to all the listeners out there. You can find us on Twitter at @cloudcoffeetalk. We welcome all your feedback, as well as suggestions for future episodes. Until next time, have fun in the cloud.