Amos: Welcome to Elixir Outlaws, the hallway track of the Elixir community.
Chris: Oh man, Amos is here. (sound effect of a sad trombone)
Amos: (laughing) Oh no, you got a soundboard!
Amos: Perfect! This is going to be the worst episode ever.
Chris: The worst episode since the last episode. (both laughing)
Amos: Nice. So, do you only have sad trombone or-
Chris: Yeah, no, that’s it. Just the sad trombone.
Chris: Sorry, fam.
Amos: I thought maybe you had set up a whole soundboard.
Chris: Mm. No. Definitely not. Oh no- what’s that outside? (screaming sound effect)
Amos: Alright, last question on the soundboard, before we just have to go through the whole thing.
Chris: No, we’re not doing that.
Amos: Do you have any sound bites from NBA jam?
Chris: No, no it’s all stock. I’m running stock currently. Yeah, I’m gonna have to trick this bad boy out. Get this baby up to 88 miles an hour.
Amos: Is it from the same company that makes Audio Hijack?
Chris: It is from the same company that makes Audio Hijack. And Loopback. All of this brought to you by the fabulous folks at Rogue Amoeba.
Amos: Rogue Amoeba. I buy all their things. I should just buy the package.
Chris: I, I spent some serious coin to pick up basically the entire package of stuff. And it’s all really good, it works really, really well. It is highly intuitive. Even an idiot like me can get it all to work. It’s very nice.
Amos: It’s the only way that we can get live streaming to Twitch from the Mac, pretty sure-
Chris: Yeah, OBS is still broken on the Mac, turns out.
Amos: Yup, yup. I used Airfoil. Airfoil by Rogue Amoeba.
Chris: Ooh, nice. I use Loopback to get my, to get my OBS to work. And now I’m prepared to stream. I’m gonna be calling out code, I think. I’m gonna type some code and the whole time I’m just going to be like “Oh yeah, thanks for the 720, 360 no-scope blaze it. Thanks for the three month resub dude. Appreciate that!” (Amos laughing) “Oh yeah, thanks for the bits!”
Amos: I never thought that I would get to the day where people would sit at home and pay- we’re so lazy that we pay other people to play the video games while we watch.
Chris: You pay people to play sports. I don’t know like, people-
Amos: I don’t watch a lot of sports either.
Chris: Well, I’m just sayin’. Listen. I don’t understand Twitch. I’m too old to understand, to truly understand Twitch.
Amos: (laughing) Well then, I’m in trouble.
Chris: Yeah no, you’re ancient. You barely understand the Internet. I’m too old to really get Twitch, like at a fundamental level, but I do acknowledge that it’s basically no different than watching people play anything else. It’s like not really any different than watching people- it’s just entertainment right? It’s like, I dunno. People find entertainment in whatever.
Amos: Right, I mean- but, I used to go to my friends’ houses and watch them play video games and I enjoyed that.
Amos: But I did it for free.
Chris: You can still do it for free.
Chris: That’s still possible.
Amos: So, that’s why I don’t pay people on Twitch.
Chris: Oh! That’s why. That’s why you don’t pay people on Twitch.
Amos: Yeah, I’m like, “Why would I pay you for what I can get for free?”
Chris: Why are you- (in a comical voice) “Why buy the cow when you can get the milk for free?”
Amos: That saying is used in way too many terrible places.
Chris: Well- yes. Absolutely.
Amos: Ooh (stretching). So, how’s your Elixir life going?
Chris: My Elixir life? Uh, it’s good. It’s good.
Amos: Good deal.
Chris: I’m doing- what am I doing?
Amos: Anything exciting?
Chris: No, not exciting necessarily. What am I doing? Mostly I’ve spent a lot of time working on resiliency. That’s, that’s really where, where all of my effort is going right now. And it’s going into a lot of things that I suspect the vast majority of Elixir users won’t need or use, but it’s very fun to see the outcomes of at work.
Amos: What area of resiliency are you looking for, like keeping a single system up or dealing with external systems going down?
Chris: Right. Yeah, so that’s well, all of it- yeah. So, when I talk about- right now what we’re mostly focused on is a lot of plumbing. We’re just building plumbing that is smart plumbing. And all that plumbing is able to do the things that we actually kind of need to do for all the calls that we’re making to all these different services. So the thing is- Okay, so all services- how do we, how do we motivate this conversation? Service resiliency is about continuing to serve requests to meet demand even when everything else around you is on fire. And it’s important to note that the amount of requests you actually need to service, or that you need to provide, that you need to satisfy, is based on some sort of actual target. Your target might be 10 requests a second or it might be a thousand requests a second or it might be whatever. But you have to pick a target. You have to say, “This is the amount of requests we can actually service, and this is what we’re going to say- is going to be what we’re going to count as success.” Most companies just never need to do that because they only, you know, most companies would be thrilled to death to have 100 requests a second.
Chris: And at that point you can more or less just do it. You know what I mean? Like it just doesn’t matter. Like, there’s, you know, you’re tuning database queries maybe or something like that, but it’s just, like, the system is simple enough that that is possible just to satisfy. And you can solve it with some pretty rudimentary- I would say- that sounds, that sounds, like, dismissive. You can solve it with the classic things of, “Let’s just scale horizontally” or “Let’s just scale vertically. Let’s just give it more power.” And we could probably just do that to satisfy those 100 requests a second. And most companies would be, would be thrilled to death to have 100 requests a second, like steady 100 requests a second.
Chris: So once you get beyond that sort of threshold, though, I think once you enter into the next sort of order of magnitude, you experience a whole different set of problems. And that set of problems cannot be satisfied by just having more instances most of the time.
Chris: And you have to get more selective about how you- about what requests you’re going to satisfy. Which ones matter, which ones don’t, which ones can be discarded, which ones can get a degraded request. And you just have to do that, like you have to do that math. Because at the end of the day, that’s, that’s what will get, will allow you to hit your targets. Which means the targets have to be defined at all. Which means you have to define the targets in the first place. And that’s the part that I just don’t think most people either need to do or do.
Amos: Is it kind of random targets?
Chris: Yeah ’cause it kind of only matters at a certain level.
Chris: But it’s something that we have to do because we’re at a point where there’s too many external things that any service depends on. That’s part of it. So, there’s too many things that you can’t control. And the overload scenarios, which is really- why don’t we talk about that? A lot of the failures that you’ll experience, a lot of what causes services to go down and for requests to fail is overload scenarios. An overload tends to happen because an external thing goes down, an external thing doesn’t have enough compute power to keep up with your request counts, they have some sort of degraded experience, you’re running on an EC2 instance with a thousand other containers and those containers are all stealing your CPU, you’ve got noisy neighbors, or whatever. You know. So many things can just go wrong and then you get overloaded. And as soon as you start to get overloaded everything goes to hell and that’s where, that’s where you will find your inability to meet the goals that you set for yourself.
Amos: Well and I, I see a lot of people with a, a quick solution is well, we’ll just turn on auto scaling.
Chris: Yes. Yes.
Amos: And that’s, in my experience, that is no good. It might be a band-aid for a little period of time. It can get really expensive too, especially if you, like, start to steadily need, need more. And then also you get downstream things of those services starting to be overloaded because now, instead of the one thing sending them requests, you have two or ten, you know? So you’re exponentially increasing the problem.
Amos: Pushing it down further in the stack.
Chris: Right! And so, my opinion is that auto scaling is not really a part of your (sigh)- Auto scaling in the sense of, “I’ve got a minimum of five and a maximum of ten instances” or whatever it is. “I’ve got a minimum five and maximum of forty instances that I’ll deploy.” That’s not resiliency, and it’s not- because it doesn’t equate to resiliency. Because it doesn’t talk at all about how you’re going to handle failure. And it also- the idea that you can gain some amount of resiliency by auto scaling is predicated on the idea that the thing that your service is dependent on- be it a database, another service, whatever, -has an infinite well of compute to draw from.
Chris: If you’ve got- let’s just make it simple, right? If you’ve got an app. You’ve got some sort of service and you run some number of instances of that thing, between one and n. And it talks to a database. The only way that horizontally scaling works is if the database has more capacity for requests than your application servers do.
Chris: And, and that is true forever by the way. You know, I mean, that is, like, if you look at sort of a maximal, maximal view of the world. Auto scaling to infinity only helps if the compute power of the database is infinity. Which, of course, isn’t true. It just so happens to be that the database tends to be higher throughput, have more capacity than most of our application servers do.
Chris: But that is less true in certain run times. It is more true in run times that can do one thing at a time.
Chris: Because then you really do gain a lot more capacity just by having more app instances, so- But even in that scenario, right? Let’s say that you can, let’s say your database does become the problem. Well how does that manifest? Or you know you’ve got some, you’re running some really long query or locking tables. You’ve added some sort of inefficiency or maybe just something’s misconfigured, right? Something bad happens to the database but you still need to service your hundred requests a second. Now the rules of that are based on your business, right? Maybe you can serve a degraded request, maybe you can serve stuff out of cache, maybe stuff can be stale, right? Whatever the case may be.
Chris: How do you continue to serve 100 requests a second though if the database has a problem? Because the thing is, if the database has a problem and what you’re auto scaling on is, let’s say, CPU measurements in the application server, well how does the database being slow manifest in your application? It manifests by having a queue of requests that need to be satisfied.
Chris: And each of those requests start taking longer. Well, guess what happens then? All those processes that need to get out of your Beam start increasing the CPU utilization of the Beam. Your run queue starts getting backed up and you start trying to like, jump between all these different things, especially if you’ve given the Beam a very small number of cores and not a lot of CPU overhead, right? It now all of a sudden can’t do all the work it needs to do. So guess what happens? Your CPU auto scaler thing looks at that CPU signal and says “Oh we need more of those. It’s getting outta line. It’s rising. We need to increase.” So, you increase. Guess what you just did? You just made your problem so much worse because now you’re overloading the database even more.
Amos: Well, and and sometimes it’ll- it can appear to work too. Like, maybe you basically increase the size of the queue, right? Like one Beam can only have so many, so many requests in the queue before it says, “I can’t handle any more requests” and then, so you spin up another one, and all you’re doing is increasing your queue size.
Chris: Right, exactly.
Amos: And then we get Little’s Law.
Chris: And it’s just- all you’ve done- you have not solved the problem. And you’ve not hit your targets, either. That’s the other big important thing is you’ve not hit your target. And if you don’t hit your target you’ve failed. That’s, that’s the long and short of it. That’s an outage at that point.
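As a rough illustration of the Little's Law point (not something worked through on the show): the average number of requests in flight equals the arrival rate times the average time each request spends in the system. The traffic and latency numbers below are made up for the example.

```python
def in_flight(arrival_rate_per_s: float, latency_ms: float) -> float:
    """Little's Law: L = lambda * W.

    Average requests in the system = arrival rate * average time in system.
    Latency is taken in milliseconds and converted to seconds.
    """
    return arrival_rate_per_s * latency_ms / 1000


# Hypothetical numbers: the same 100 req/s before and after the database slows down.
healthy = in_flight(100, 50)      # 50 ms per request  -> 5.0 in flight
degraded = in_flight(100, 2000)   # 2 s per request    -> 200.0 in flight

print(healthy, degraded)
```

Adding instances raises how many requests can wait, but it does nothing to the latency term, so the backlog just gets a bigger place to sit.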
Chris: So you have to be able to hit your targets and which, which- again predicated on the idea that you have targets in the first place, which most people just don’t. It’s, “I want to handle whatever amount of traffic that I can receive.” And that works at low orders of magnitude.
Amos: And, yeah. And, and sometimes we have to deal with people who don’t really know what their target should be or are. And we don’t know.
Chris: Right! And it doesn’t feel good to say, “Actually after 100 requests we’re okay yeeting all the rest of them into the-into the ether.”
Amos: Yeah. (laughing)
Chris: No one wants to say that. Like that doesn’t feel good and your product manager is not cool with you saying you’re gonna yeet requests into the ether. That’s not- that’s not- that’s not rad. That’s not math. For them.
Amos: So, so then, what do you? What do you do to handle this? In my experience we get into the harvest and yield discussion.
Chris: Yeah, it’s- precisely. And the discussion I tend to have with people- and at this point I don’t really have to have the discussion, ‘cause I’ve just had it enough that people get it. I think most everybody gets it, although occasionally you gotta take them back to the wheel and continue to break them and be like, “You can’t do this”. But the choice is, “Do you wanna- do you wanna throw away 10 requests or do you wanna throw all hundred away?”
Chris: Do you wanna throw away all of them or do you want to throw away ten? And that’s the choice you have to make. You can just throw, you can throw away all of them if you want to.
Amos: Or you can send it a few times- or degraded requests. Can I give them half the data?
Amos: Or some stale data out of the cache?
Chris: Right. And that’s the thing is, you have multiple layers, right? You have multiple layers of ways that you can fail. And so the first thing that we’ll start doing is we’ll start, we start limiting the amount of requests that any given service can make to any downstream thing. Those limits are, are discovered dynamically. We have a tool that I wrote called Regulator that is essentially like an adaptive circuit breaker.
Chris: It just shuts off traffic or it only allows certain amounts of traffic through. But it probes the system and looks at latencies to determine that dynamically. And so if it determines -which is actually, by the way, just side note, mind-blowingly cool. The other day I watched it start dropping traffic and I looked at the- I was looking at like the highly down-sampled metrics and graphs that are in Datadog. And when I actually went and looked at like the P99’s, I realized that there was database latency in the downstream service like 2 hops away. And our front door API was able to figure that out and stop sending traffic to it ‘cause it was detecting these big changes in latencies and just immediately started backing off.
Amos: That’s pretty awesome.
Chris: It’s pure magic to watch that happen. And it happens like instantly. It happened within, you know, like 10 seconds.
Amos: So, is Regulator internal?
Chris: No, no, it’s, that’s open source.
Amos: It is?
Chris: It’s on my GitHub, yeah, it’s just nobody uses it because nobody understands why they need it. And honestly, most people probably don’t need it. And honestly, it also makes like highly specific tradeoffs, for our purposes. Like it’s built to service the specific use case that we needed at work.
Chris: It’s all based on the idea of congestion- congestion control, like traffic control in TCP networks, like all that research is from, you know, the 80’s. And congestion avoidance and all that sort of stuff, so it’s all based on that same math. And there’s a bunch of other implementations of similar ideas that are out there in the Erlang world that I found. The most popular, and kind of the Swiss Army knife of this stuff, is called Jobs, that’s a really good library and really, really robust, but you definitely need to, one, know what you’re trying to accomplish. You need to know the math, you need to like understand what it is you’re even trying to do to understand Jobs. And also you really need to learn Jobs. Like it’s a complicated library because it, it’s a total Swiss Army knife. It’s not even a Swiss Army knife, it’s like a Swiss Army knife that you need to build yourself, you know?
Chris: But it’s incredibly robust and really, really good, like it’s a good library. So people should look at that. The other one that I know of that does similar things is called safetyvalve, which is a jlouis joint, and it’s, um, it’s also good. So those are good, worth looking at as well. Regulator is built for very minimal configuration and it’s built to service our very, you know it’s- it’s built to service- so I would say it is very much built to service RPCs and kind of services calling other services and stuff like that.
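To make the congestion-control idea concrete, here is a minimal sketch of additive-increase/multiplicative-decrease (AIMD), the classic scheme from TCP congestion avoidance, applied to a concurrency limit. This is not Regulator's actual algorithm or API; the class name, the fixed latency threshold, and the default numbers are all assumptions made for illustration.

```python
class AimdLimit:
    """Additive-increase / multiplicative-decrease concurrency limit (illustrative only)."""

    def __init__(self, initial=20, min_limit=1, max_limit=200,
                 backoff=0.5, latency_threshold_ms=100.0):
        self.limit = initial
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.backoff = backoff                            # hypothetical decrease factor
        self.latency_threshold_ms = latency_threshold_ms  # hypothetical "too slow" cutoff

    def record(self, latency_ms, dropped=False):
        """Update the limit from one observed call and return the new limit."""
        if dropped or latency_ms > self.latency_threshold_ms:
            # Downstream looks unhealthy: cut the limit sharply.
            self.limit = max(self.min_limit, int(self.limit * self.backoff))
        else:
            # Downstream looks healthy: probe for more capacity one slot at a time.
            self.limit = min(self.max_limit, self.limit + 1)
        return self.limit


limit = AimdLimit(initial=20)
print(limit.record(latency_ms=30))    # fast call: 20 -> 21
print(limit.record(latency_ms=500))   # slow call: 21 -> 10
print(limit.record(latency_ms=30))    # recovering: 10 -> 11
```

The asymmetry is the point of the design: backing off quickly protects the struggling dependency, while growing slowly avoids immediately overloading it again.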
Chris: But in any case, so that’s, that’s the first step for us, is we immediately start backing off. We start dropping traffic and we just don’t allow you to call it ’cause it’s like, okay, we already have, you know, every- every service has a capacity, has some sort of limit and something that- we refer to that as the concurrency limit. And the concurrency limit might be “This thing can handle 10 requests at the same time” or “It can handle 100 requests at the same time.” Or “It can handle 10,000”, right? Some number. The thing is, is that number is never actually static. That number is dynamic based on the system and the things that it’s depending on, and the things in the rest of the ecosystem, right. The world that you live in.
Chris: Your noisy neighbors and all that stuff. Which is why we attempt to discover that, that limit dynamically, and that’s what we refer to as the concurrency limit. And we do that both on the server side, so the server keeps track of its own latencies, and will start to drop load at the door. We just have a plug that immediately says “Nope, can’t handle traffic.” So, we’re just going to drop you. We’re not even gonna try and service this.
Amos: So what happens to people that are utilizing your service?
Chris: Uh, they just get like a like- what is the status code? – they get like a, “We don’t have capacity for this.” They just immediately get dropped, right? So they get, you just toss it back over the wire to them. We also listen for those errors and we listen on- for the latency change on the client side and so the client starts doing preemptive dropping as well, if it believes that this downstream service is also unhealthy. It just won’t bother calling it, because why go over the wire for a thing that’s not going to be satisfied?
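A minimal sketch of "dropping load at the door": count in-flight requests and reject immediately once the concurrency limit is reached. The names are hypothetical, and the 503 status is an assumption for the example; the conversation does not say which status code the real service returns.

```python
import threading


class LoadShedder:
    """Tracks in-flight requests against a concurrency limit (illustrative only)."""

    def __init__(self, concurrency_limit):
        self.concurrency_limit = concurrency_limit
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Claim a slot if one is free; return False when at the limit."""
        with self._lock:
            if self._in_flight >= self.concurrency_limit:
                return False
            self._in_flight += 1
            return True

    def release(self):
        """Give the slot back once the request has finished."""
        with self._lock:
            self._in_flight -= 1


def handle(shedder, do_work):
    """Reject at the door when over the limit; otherwise run the request."""
    if not shedder.try_acquire():
        return 503, "no capacity"   # assumed status code, see note above
    try:
        return 200, do_work()
    finally:
        shedder.release()


shedder = LoadShedder(concurrency_limit=1)
print(handle(shedder, lambda: "ok"))   # (200, 'ok')
```

A client could apply the same check before it ever goes over the wire, which is the preemptive dropping described above.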
Chris: So it stops sending traffic as well and then we make a choice based on use case. Is this something that we can then serve out of a cache? Is this something that we can just degrade? Can I return a partial result for this, and, or, do I have to fail the entire thing? ‘Cause one, one request to the front door might be 5 API calls to downstream services. So if one of those fails, is that like a deal breaker? Do we just have to fail the entire request? Or can we serve something degraded? Can it be stale? Can- whatever, you start making those determinations right? And that’s like your first layer, you start thinking through like how you’re going to start failing. And at some point that’s not good enough, ’cause the thing is, degrading requests can still take time.
Chris: At some point that stops working and then you start figuring out like, “Can I serve this from a static- can I just serve this from like a static content page?” Can I just serve anything, right? Can I get traffic off of this altogether? And you start falling through all these different bulkheads, right? And then eventually you’re like, “Oh, now I’m down.” Now things are just bad. Now I can’t recover at all and I just, like, shut off traffic for the front door. Or shut off some amount of traffic to the front door, right. Start just degrading, start dropping traffic before it even gets into the system. And there’s a threshold of that, right? I want to drop 5% of traffic, and then 10, 20, 30, 40, 100, you know, you, like, start falling through all these gates. And thinking about failure in that way, thinking about, “How am I going to allow this to fail?” How am I going to build failure into the system at a fundamental level and then control that failure such that when stuff breaks it breaks under my auspices. It breaks under my conditions. I get to decide how this is going to fail now. That is, that’s, that’s actually what system design and resiliency is all about. And thinking about failure as a first-class thing. And so that’s, that’s- that’s what we end up building. That’s what I’m building plumbing and tooling around right now. It’s like how do you do that?
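The layered failure described here can be sketched as an ordered list of fallbacks plus a percentage gate at the front door. Everything below is hypothetical (function names, the modulo-based gate, the layer names); it only illustrates the shape of "falling through bulkheads", not how the real system is built.

```python
def serve_with_fallbacks(sources):
    """Try each (name, fetch) pair in order and return the first that succeeds.

    Catching every exception is a simplification for the sketch; real code
    would be pickier about which failures count as "fall through".
    """
    for name, fetch in sources:
        try:
            return name, fetch()
        except Exception:
            continue
    return "failed", None


def should_drop(request_id, drop_percent):
    """Deterministically drop roughly drop_percent% of requests at the door."""
    return request_id % 100 < drop_percent


def live():
    raise TimeoutError("downstream service is overloaded")


def cached():
    return {"items": ["stale-but-usable"]}


def static_page():
    return {"items": []}


layers = [("live", live), ("cache", cached), ("static", static_page)]
print(serve_with_fallbacks(layers))                      # ('cache', {'items': ['stale-but-usable']})
print(sum(should_drop(i, 5) for i in range(100)))        # 5
```

Raising `drop_percent` in steps (5, 10, 20, and so on) gives the operator the "I get to decide how this fails" control described above.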
Amos: And I think you can, you can sell the idea to use the time to figure this out by also, I mean what you said there, “You get to decide how failure looks,” is also like, a good selling point to non-technical people too, like, “Hey, we get to decide what this looks like when things start to fall apart.”
Chris: Right. Cause they’re going to at some point.
Amos: Best laid plans, right?
Chris: Right, exactly, yeah. And you start getting creative. So one of the things we do is we have some of our requests are personalized, for instance. Like, we know that you are user X and we want to show you stuff that’s relevant to user X, to you as a user. But it turns out that RPC or that query is really expensive ’cause that’s a big database call, or that needs to call some other data science thing or whatever, you know, which is written in Python and can only do one thing at a time. It’s like those are really expensive calls to make. So what we do in our front door, for some of our operations, is we run a, run a GenServer that periodically refreshes a generic cache of certain content things, and stuffs it into a little cache library that I wrote called Mentat. And if we start degrading personalized requests we have a fall back to a generic thing. It’s like, “Okay well this is like the top stuff that most people see.” Or this is like, you know, this is something approximating what’s important but we can’t actually give you your content so we’re going to get you something that’s close enough. So at least we’re still serving something.
Chris: So we just run that, and every front door just has one of those processes that keeps like a cache warmed up, and, you know. And then if it’s there we serve that, and if not, we don’t. And if it’s not there, then that’s where we start falling through, like, these layers of failure, right?
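The pattern here is a background process that keeps a generic result warm, with the personalized call falling back to it. The real version is a GenServer writing into a cache library; the Python below is only a sketch of the same idea with a thread, and every name in it is hypothetical.

```python
import threading


class GenericCache:
    """Keeps a generic (non-personalized) result warm in the background."""

    def __init__(self, load_generic, interval_s=30.0):
        self._load_generic = load_generic
        self._interval_s = interval_s
        self._value = None
        self._stop = threading.Event()

    def refresh(self):
        """Reload the generic content; keep the last good value if loading fails."""
        try:
            self._value = self._load_generic()
        except Exception:
            pass

    def start(self):
        """Refresh once right away, then every interval_s seconds until stopped."""
        def loop():
            self.refresh()
            while not self._stop.wait(self._interval_s):
                self.refresh()
        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

    def stop(self):
        self._stop.set()

    def get(self):
        """Return the cached value, or None if the cache has never been filled."""
        return self._value


def personalized_or_generic(fetch_personalized, cache):
    """Prefer the personalized result; fall back to the warm generic one."""
    try:
        return fetch_personalized()
    except Exception:
        return cache.get()


def failing_personalized_call():
    raise TimeoutError("personalization service is shedding load")


cache = GenericCache(lambda: ["top story 1", "top story 2"])
cache.refresh()  # in a real service, start() would keep this warm
print(personalized_or_generic(failing_personalized_call, cache))
```

If the cache has never been filled, `get()` returns `None`, which is the cue to fall through to the next layer of failure.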
Amos: So most of what we talked about on how to handle failure here is with serving data. So somebody requests data from us. How-what have you done whenever you want to handle incoming data? Like, people trying to send you data and you have to write it somewhere, and that starts having failure? ‘Cause that scenario to me is a lot different.
Chris: Right. So my opinion on that stuff is that all of the writes, all of the ingestion that you do, and actually also to some degree all the reads you do- although I think it’s less crucial for our specific purposes. And I think for a lot of the apps I’ve worked on, reads kind of don’t- reads can be stale. Because that’s just the nature of the internet. Like, data is stale the second you load the page.
Chris: And then you hit refresh. Or whatever. It’s like, yeah, I mean, like, everybody just knows that you’re getting a snapshot in time. So it’s kind of like, less important for reads, in my experience, although not universally. And, you know, definitely not universally for all problems, and there are specific problems where that timeliness really matters.
Amos: And another thing that I find that runs into it, it doesn’t seem to be, like, part of the- I guess it is part of the problem- but, is, actually, the age of your audience.
Amos: People who are younger, younger than me even, have a tendency to deal with latency and data like, “I wrote this but I can’t see it quite yet,” a whole lot better than, than an older generation. Like I feel like I’m right on the cusp of both at my age.
Chris: (in a comical old person voice) I went to reload, this page, but then the-the-data just is, it’s not there anymore. When I click the button it’s just not what I expected it to do. I went to check out my Medicare, and I couldn’t see what-
Amos: (laughing) I’m not trying to call anybody out, but I mean it’s just like uh-
Chris: (still in old person voice) And with my rheumatism I’m not able to-
Amos: A natural thing, if you grew up with latency-
Chris: I cannot click this, this mouse.
Amos: Versus, when I write it on the piece of paper, it’s on the piece of paper.
Chris: (still in comical voice) Didn’t there used to be a very helpful paper clip on this application? I’d love, I would love- he was so useful for me.
Amos: (laughing) You’re fired.
Chris: But now my eyesight’s not what it used to be.
Amos: I think it’s the soundboard time (laughing). So…writing. How do we deal with writing?
Chris: Right, so yeah, okay, so going back to that, I think all of the different RPCs that you’re going to make- in my experience that’s more important for writes, although, I’m going to get in trouble for saying that. People have comments about that. But, all of your RPCs really have a priority. And there really is a sliding scale of priority for every request that gets made. Some of them are more important. And that is the truth. We don’t, we don’t talk about that a lot mostly because it’s hard to assign priorities. In the same way that, you know, if you’re doing, if you’re in the class of Agile people who put points on things, you know, (Amos laughs), that’s, like, that’s like pointing, right? Like, it’s like Whose Line? you know? Who knows?
Amos: You’re gonna get me all worked up you start talking about points. There’s only three correct points-
Chris: And also, let’s-
Amos: One. Too big. And no clue. Those are the only three possible.
Chris: And also, and also, points are based on complexity not time.
Amos: Correct. It’s hard for people to think in complexity, though.
Chris: Yeah. Well, in any case, it’s not days. Points aren’t days.
Amos: No. Or hours.
Chris: We’re gonna come back, let’s circle back to that.
Amos: Different episode. Different episode.
Chris: I was in a meeting where somebody was like, somebody legitimately asked, “How can we have 35 points in a sprint but there’s only 30 days?” And I about lost it. But anyway. (clears throat)
Chris: So every RPC you have has a priority, and you really need to sort of assign them. And the thing is, the other thing is that the caller is the one who gets to assign them. The server doesn’t actually get to assign priority. The caller assigns the priority. Because if you’ve built an API that is utilized by more than one caller, that necessarily means that you no longer understand all of the ways in which your API is being used. And it means that for certain callers that thing, that RPC succeeding, is a, is like make or break for that- for their ability to do work. Versus someone else who’s saying, “Actually it’s fine, it’s actually OK if this doesn’t pan out for this request today, right now.” So those RPCs have a criticality to them. They have a priority to them. And like I said the caller gets to determine that. It, shockingly- or not actually shockingly, but the thing that I don’t think a lot of people think about a lot is that the caller gets to determine most of the semantics about a call, about any given call. It gets to determine what failure looks like, it gets to determine what criticality looks like, it gets to determine all these things. So, it gets to determine the latency, you know, it- the caller matters a lot in these systems and that’s a thing that I think we discount.
Chris: But all that to say, you need to assign a priority and then your, your plumbing, really needs to get good at, when it’s shedding load, being smart enough to start shed- start shedding less- lower priority requests. And you just have to get comfortable saying “Yeah, like for us right now, this isn’t that, this- for us at this call site this RPC isn’t that critical to our ability to do work.” We can do something else if that doesn’t pan out and we just allow it to be shed, right? And you just have to allow certain things to be, like critical, more critical than others, and you need to be, you know, you need to be conscientious of that.
Amos: What kind of things do you use for determining priority? Like, do you have prioritized endpoints, where these endpoints versus these endpoints are prioritized differently? Or this customer versus this customer? Or like combinations? Or are there other things out there?
Chris: No, I mean, so I guess in a sense it’s, it’s endpoint by endpoint based. So when I say RPC I just mean like you’re making a request, right? You’re making some sort of, you’re making some sort of call to an external service and saying, “I want you to fulfill this.” So maybe that’s a certain call to a certain endpoint or whatever. But each call itself needs to carry along its priority. So if I’m making a request to the users endpoint, that will have a criticality to it. It will have, you know, this is either critical, or highly critical, or not critical, or highly not critical, right? Some sort of, some sort of scale.
Amos: In the request itself?
Chris: Yeah, or in the header. I mean, I mean, it just needs to be conveyed in the request.
Amos: I mean, yeah, a header’s part of the request as far as-
Chris: Yeah, it just needs to be, it doesn’t have to be part of the payload but it needs to be conveyed in some way to the end, to the server. And then the server can make a determination like, “Hey, I’m shedding load right now because things are going bad for me.” I don’t know, you know, does- it doesn’t matter why, things are not good. So anything that’s lower than- so if I have a queue of requests to satisfy I’m going to start dropping immediately all the ones that are low, low, low criticality.
Chris: So throw those away.
Amos: So you have to have some level of trust in the person. At that point, if they’re sending their level of critical, you have to have trust in them to not just say, “Well everything is critical and everything is high.”
Chris: Or you do the Google thing and you just default, you’re, you just -you just default the, the priority to critical, yeah. That’s exactly –
Amos: That’s not just Google.
Chris: That’s literally in the SRE book is, is they just default to critical. And then there’s one above that and then there’s two below that but everything starts in critical. And so at some level you have to just have an engineering organization that is aware of the fact that you’re gonna drop traffic at some point. Every service is going to drop traffic. Every service is going to throw away requests, and you have to be comfortable saying that and you have to get comfortable with the ramifications of that. Meaning that every service needs to be defensive. Every service has to take into account the fact that they could get overloaded, and if a downstream thing is- they have to be a good neighbor. That’s just like, has to be built into the system.
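A sketch of caller-assigned criticality with a default of critical, one level above and two below. The level names follow the ones used in the Google SRE book's overload chapter; the header name and the function names are made up for this example and are not from any real service.

```python
from enum import IntEnum


class Criticality(IntEnum):
    """Four levels: two below the default, one above it."""
    SHEDDABLE = 0
    SHEDDABLE_PLUS = 1
    CRITICAL = 2        # the default when the caller says nothing
    CRITICAL_PLUS = 3


HEADER = "x-request-criticality"  # hypothetical header name


def criticality_of(headers):
    """Read the caller-assigned criticality; missing or unknown values default to CRITICAL."""
    raw = headers.get(HEADER, "")
    try:
        return Criticality[raw.upper()]
    except KeyError:
        return Criticality.CRITICAL


def shed(queue, keep_at_or_above):
    """Drop every queued request whose criticality is below the current cutoff."""
    return [req for req in queue if criticality_of(req) >= keep_at_or_above]


queue = [
    {HEADER: "sheddable"},
    {},                          # no header: treated as CRITICAL
    {HEADER: "critical_plus"},
    {HEADER: "sheddable_plus"},
]

# Under pressure, keep only CRITICAL and above.
kept = shed(queue, Criticality.CRITICAL)
print(len(kept))   # 2
```

Defaulting to critical means callers have to opt in to being shed, which sidesteps the trust question only partly: nothing here stops a caller from marking everything `critical_plus`.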
Chris: Which, you know, is possible to do. But again, it only really matters at the point where you can no longer hold the entire system in your head, you know. It only matters if you’ve got, – I don’t want to say it only matters- it matters a lot more if you’ve got all of these dependencies. If you have an app server talking to a database-
Chris: You can put fairly rudimentary safeguards around that. You know, you could put a regulator around all your database calls. You can put a circuit breaker around your database calls and if your database is down you don’t bother making the call but then you can start falling back through some sort of failure. And it’s a lot easier to manage. Like not everybody has to care about criticality at that point because you’re just, you know. Maybe your one app server does or whatever. And that’s, that’s a lot easier to manage obviously.
Amos: And when you start dealing with these kind of things, one thing that I wasn’t cognizant about when I started is, actually, that sometimes checking, like, “Should I be serving a request or not? Should I be serving out of the cache?” That actually is going to add overhead to your request, right?
Chris: Oh, yeah!
Amos: So, like if I have to say, “What is my database latency?” before I determine whether I’m going to give them a static version or not.
Amos: That is overhead on every single request where you’re still able to serve real time.
Amos: So you might be adding, you know, 20 milliseconds to every request in order to make sure that you can gracefully degrade and keep your service running instead of crashing.
Chris: Right, yeah, your capacity can most definitely be affected by these sorts of tools. Regulator specifically, the way we try to reduce the amount of overhead is that essentially everything is determined by the caller process. So we look at counters in ETS to determine if we’re above or below the concurrency limit, or if we have available concurrency left. There’s still coordination that happens there, right, you can have locking contention on an ETS table so-
Chris: You still have a contention there. Which is, you know, a thing you try to reduce. And so you’re totally right, like you can have contention on all these sorts of checks. And sometimes, you know, the real thing is like it can be just as fast to serve -if all your requests are coming out of memory, like it can be negligible difference to serve all the server requests out of an ETS table as it is just to reject it.
Chris: So, you know, if all you’re doing is one more ETS look up and you don’t do any major serialization work on that- you just, like, grab it out of ETS and drop it and send it back over the wire- it can take, it’s like you’re maybe adding negligible latency-
Chris: As opposed to just rejecting the request wholesale. So you have to sort of reject it as early as possible on the server side, which is, which is why you have to reject it on the client, like you just stop calling the thing.
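A caller-side concurrency check against an ETS counter, in the spirit of what Chris describes, might look something like the sketch below. This is not Regulator's implementation; the module, table, and key names are hypothetical, and the limit is passed in by hand rather than adjusted dynamically.

```elixir
defmodule LimiterSketch do
  # Hypothetical table and key names, used only for this illustration.
  @table :limiter_sketch
  @key :in_flight

  # Create the shared counter. Call once before using run/2.
  def init do
    :ets.new(@table, [:named_table, :public, write_concurrency: true])
    :ets.insert(@table, {@key, 0})
    :ok
  end

  # Runs `fun` in the calling process if a slot is free; otherwise rejects
  # the request without doing the work.
  def run(limit, fun) do
    # Atomically bump the in-flight counter and read the new value.
    count = :ets.update_counter(@table, @key, {2, 1})

    if count <= limit do
      try do
        {:ok, fun.()}
      after
        # Release the slot whether or not `fun` raised.
        :ets.update_counter(@table, @key, {2, -1})
      end
    else
      # Over the limit: undo our increment and shed the request.
      :ets.update_counter(@table, @key, {2, -1})
      {:error, :overloaded}
    end
  end
end

LimiterSketch.init()
IO.inspect(LimiterSketch.run(1, fn -> :served end))
IO.inspect(LimiterSketch.run(0, fn -> :served end))
```

Every request pays for at least two counter updates here, which is the kind of per-request overhead and table contention the conversation is pointing at.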
Amos: So you said something that I think might get hidden in there and that’s the -now I blanked out exactly what you said- but if you don’t have to munge the data after it pulls out of ETS it can be negligible, you didn’t use munge, whatever.
Chris: Yeah, yeah right.
Amos: So, so that’s, that’s one thing that I found really important is if you can cache a full response and it’s already ready to go out the door level-
Amos: That can save you a ton of time.
Amos: And actually allow you to become-get back to more real time serving quicker. Because you have thrown all that load away and you’ve gotten people off your back so you can continue to move forward.
Chris: Right, precisely. And yeah, if you’re serving stuff out of – if you’re pulling the data out of ETS and then converting it into JSON, boy is that a bad time.
Amos: Right. (laughing)
Chris: Like that’s a terrible time because you’re wasting time in that serialization step. And JSON is slow as hell. I mean, JSON encoding in general is slow as hell, right, not JSON the library but just converting to JSON is so slow.
Amos: So, so store the JSON in your cache.
Chris: Yeah, if you can, right? If you can do that. And you might not be able to. That’s the thing is like, if it is a highly personalized result you may not be able to. If the liveness of it means it’s not possible to store it that way, you can’t do it that way.
Amos: Right. Right, right.
Chris: So that’s, that’s where, that’s where those decision-making things have to start coming in. Or if you need to take that thing and then put it into another thing, you know what I mean? There’s all these other rules, there’s all these other caveats, but-
Amos: And sometimes think about how you can make that decision without necessarily having the data, and then you can store those decisions under different keys.
Amos: I’ve actually had decisions of “Just build the key,” and then go get the right data.
Chris: But I think, you know, all this kind of failure stuff- OH, the other big one’s deadlines, and this is something we’ve started doing now as we’ve kind of aligned on a unified RPC framework for all of our internal calls.
Amos: Is deadline just “I have to respond to you in a certain amount of time”?
Chris: But we’ll just say you have two seconds. If the first call takes 1.7 seconds, and you need to make 2 calls, and they’re serial, and the next call- so the first call would have a deadline of 1.7 or well 2 seconds, or, whatever you just propagate that- the next call now has 300 milliseconds. That’s what you have left to satisfy. And you send that, you send that amount of time along with the request. And it just, it just propagates to the next thing. And so in the next service, the next service reads that value, in milliseconds, is how we send it, and then it turns it into an actual deadline that it could- there’s technical things that- this is also a library that I have on my GitHub called Deadline which people can go look at, it uses process dictionary so people have feelings about that- but what it actually is-
Amos: It’s fast!
Chris: What it actually does is it looks at system monotonic time and it adds that actual amount, like milliseconds, that you’ve specified. And so if you go over that it just starts saying, “Now you’ve exceeded the time” or whatever.
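The mechanism described here (take a millisecond budget, add it to monotonic time, keep it in the process dictionary, and report when it has passed) can be sketched as below. This illustrates the idea only and is not the Deadline library's API; the module, function, and key names are assumptions.

```elixir
defmodule DeadlineSketch do
  # Hypothetical process-dictionary key for this illustration.
  @key :deadline_sketch

  # Store an absolute deadline: monotonic "now" plus the budget in ms.
  def set(budget_ms) do
    Process.put(@key, System.monotonic_time(:millisecond) + budget_ms)
    :ok
  end

  # Milliseconds left before the deadline, never negative.
  # Returns :infinity when no deadline was set in this process.
  def time_remaining do
    case Process.get(@key) do
      nil -> :infinity
      deadline -> max(deadline - System.monotonic_time(:millisecond), 0)
    end
  end

  # True once the stored deadline has passed.
  def reached?, do: time_remaining() == 0
end

# The front door grants two seconds for the whole request.
DeadlineSketch.set(2_000)

# ... the first serial call runs here and uses part of the budget ...

# Whatever is left is what gets sent along with the next call.
remaining = DeadlineSketch.time_remaining()
IO.puts("budget left for the next call: #{remaining} ms")
```

Sending the remaining duration rather than an absolute timestamp means the receiving service can rebuild the deadline against its own monotonic clock, so the two machines do not need synchronized clocks.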
Amos: Do you use that only at a top service level? Or do you use it at, like, internally if you have a GenServer that you’re passing the message to, you send them lower, a shorter time out?
Chris: So that’s the thing, is, now you can start to use it for everything, or you can use it, well or you can use it for everything you don’t have to. You can use it for things that need to be timely. You can use it for things that have to happen in the critical path of that request. So for instance if your GenServer call has to happen in that, in the critical path of that request, then you could propagate, you could use that as the, as the deadline for your GenServer call. And that can be the default time, timeout now. So now you’re not- and the point of it is, now you’re not wasting compute resources to satisfy a request that the front door, three steps away, three network hops away, has already given up on.
Amos: Right, right.
Chris: So if I send that you’re wasting your time.
Amos: 50 milliseconds.
Chris: Yeah, if I send a 50 millisecond deadline to a downstream service, what that means is that my upstream service is going to wait 50 milliseconds. And then it might retry, right? It might try again if it still has available time to do it in. It might reissue those with a higher criticality at that time, or whatever. And it does that up to a certain, you know, some amount of retries, to make all those sorts of decisions. Anyway, so if I’m waiting 50 milliseconds and then issuing a retry, I don’t want to waste compute power in a downstream service, or in a downstream service from the downstream service, on a request that, like, has no hope of being fulfilled. So instead, we rip that deadline out of a header, and we use that to either cancel the request, stop doing computations, you know. Or you shove it in the Ecto timeout, right? You know, you shove it in your database timeout. You just say, “Hey if you can’t fulfill this in the next 5 milliseconds, I’m not doing it, it doesn’t matter.” And then you just raise. You’re like “Eh, get me outta here ’cause I can’t fill this anyway and the upstream thing’s already given up so just throw an exception.” Log it, log it to the APM thing, we’ll look at it in a minute, and then you just cancel everything. And that allows you to really utilize, to better utilize I would say, your compute power. Some GenServer calls, you know, you just expect that they might be long running. You’ve got some GenServer that sits there, and you don’t care. It happens outside the critical path for requests and you just say, “If that takes 5 seconds it takes 5 seconds, I’m not even going to be around anymore.” At which point maybe you just cast to it.
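Using the remaining budget as the timeout of a GenServer call on the critical path could look roughly like this. `SlowServer` and `BudgetedCall` are hypothetical names, and the remaining time is passed in as a plain integer here instead of being read from a request header.

```elixir
defmodule SlowServer do
  use GenServer

  # Hypothetical server that takes `delay_ms` to answer each call.
  def start_link(delay_ms), do: GenServer.start_link(__MODULE__, delay_ms)

  @impl true
  def init(delay_ms), do: {:ok, delay_ms}

  @impl true
  def handle_call(:work, _from, delay_ms) do
    Process.sleep(delay_ms)
    {:reply, :done, delay_ms}
  end
end

defmodule BudgetedCall do
  # Calls the server with the remaining budget as the call timeout and
  # turns a timeout exit into an error tuple for the caller to handle.
  def call(server, request, remaining_ms) when remaining_ms > 0 do
    try do
      {:ok, GenServer.call(server, request, remaining_ms)}
    catch
      :exit, {:timeout, _} -> {:error, :deadline_exceeded}
    end
  end

  # No budget left: do not even make the call.
  def call(_server, _request, _remaining_ms), do: {:error, :deadline_exceeded}
end

{:ok, fast} = SlowServer.start_link(0)
{:ok, slow} = SlowServer.start_link(200)

IO.inspect(BudgetedCall.call(fast, :work, 100))
IO.inspect(BudgetedCall.call(slow, :work, 20))
IO.inspect(BudgetedCall.call(fast, :work, 0))
```

Note that the timeout only stops the caller from waiting. The slow server still finishes its work, so a server that wants to avoid wasted compute would also need to check the deadline itself before starting expensive steps.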
Amos: Yeah. So, when you’re dealing with the cast, like asynchronous stuff, like the cast type thing, I have seen one time, only one time in the wild, I’ve heard people talk about it a few times. I think it would be incredibly complicated to implement-
Amos: But have you ever dealt with, actually, you said the word “cancel.” Have you ever dealt with being able to send, send a request and then later send a cancellation before the request is complete and have it halt the system?
Chris: Have I ever, like, successfully done that?
Amos: Yeah, have you ever had to do that?
Chris: I have experienced that in the past. There’s kind of, I mean, my feeling is like there’s kind of not good patterns for all that stuff. Or not, not patterns, there are patterns for that that you can deploy that do that. Essentially what you’re recreating is a transaction when you do that.
Chris: When you can say, like, “Cancel this.” It’s, you’re reinventing transactions at that point.
Amos: The only time I saw it in the wild there was, it was not writing any data.
Amos: It was, it was a report system that had these reports that would sometimes take hours.
Chris: Oh, so you could say can- just don’t do this anymore.
Amos: Yeah, “Stop!” And it would. I know that the engineers had a really hard time because a lot of the report stuff, you know, they were trying to do things in parallel, trying to speed up these massive reports. I mean, these are reporting on millions and millions of data points and so they would have it spread out, but then they had to, like, tell all these different job type things “Hey, never mind.”
Chris: Right, yeah.
Amos: And if you missed one, or you would get a cancel like when you’re 99% done-
Amos: And you’re like, “You know, those 12 reports behind you could have been done by now.”
Amos: You had the big report, you canceled it last second.
Chris: So then, that, yeah at that point, you’re, what you’re basically reinventing is, is like a saga, right?
Chris: If people are familiar with sagas.
Amos: Okay, so I’ve seen saga in two different places. The only ones I’m extremely familiar with is like you have a pipeline of work to do and you can inject in there in the pipeline say, “Don’t, so, stop this at some point.”
Chris: Yeah, so, uh, a saga is a way of doing distributed transactions. It’s actually, like, fairly nuanced but the metaphor that I think I’ve heard a lot is, like, you’re going to go on a trip, and in order to go on a trip you want to book a plane, you’re going to rent a car, and you’re going to get a hotel room. But those three things all happen in different services. And they can all succeed or fail independently of each other, and you can’t really control that. Like, they’re owned by different things and you need to be able to account for that. And so what you can use is this idea of a saga. And the idea is that you would write to something durable and say, “I want this to happen. I want to go on a trip. And these are all the parameters of my trip. I’m going to book a ticket from here to here,” etc. And then what the system does is attempts to fulfill those things. So it attempts to get, you know, it goes and it books you a hotel room. And then it goes and it books you a flight. Then it goes and books you a car. And if any of those things fail, it goes back to the beginning of the system and says, “Okay, this failed. What do I do?” Well, maybe you just keep retrying for some amount of retries, but, if that’s not possible what you’ll end up doing is you go back to the beginning and you say, “Okay, cancel this.” And it actually issues highly specific “Cancel my hotel room.” Like I booked this hotel room, now cancel it. I booked my flight, now cancel it. I booked my car, now cancel it, right?
Chris: And it actually knows how to issue each of those. The other important thing is that it has to keep a running log of all the things that it succeeded or failed at in order to know what to go back in and undo in the future. So that’s- when you have to cancel long running stuff like that, maybe it’s ephemeral, like if it’s like a reporting thing you just, like, kill the process. Or you look up the process in your cluster and you just kill it.
Chris: If it’s something like that where it could have side effects out in the world that matter, then you need to start worrying about how do you issue cancellations to all these other things? And then you need a way to talk about that concretely, which, that’s where the saga pattern stuff comes in. And like ordering and all these other things.
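The trip example can be sketched as a list of steps, each paired with the action that undoes it, plus a running log of what has succeeded. The names below are hypothetical, and the log lives in memory purely for illustration; as Chris says, a real saga would record its progress somewhere durable, and this sketch also leaves out retries.

```elixir
defmodule SagaSketch do
  # Each step is {name, action, compensation}. An action returns
  # {:ok, result} or {:error, reason}; a compensation undoes its result.
  def run(steps), do: run(steps, [])

  # All steps succeeded: report their names in execution order.
  defp run([], log) do
    {:ok, log |> Enum.map(fn {name, _result, _comp} -> name end) |> Enum.reverse()}
  end

  defp run([{name, action, compensate} | rest], log) do
    case action.() do
      {:ok, result} ->
        # Record what succeeded so it can be undone later.
        run(rest, [{name, result, compensate} | log])

      {:error, reason} ->
        # Undo the completed steps, most recent first.
        undone =
          Enum.map(log, fn {done, result, comp} ->
            comp.(result)
            done
          end)

        {:error, name, reason, undone}
    end
  end
end

steps = [
  {:hotel, fn -> {:ok, "hotel-1"} end, fn id -> IO.puts("cancel #{id}") end},
  {:flight, fn -> {:ok, "flight-7"} end, fn id -> IO.puts("cancel #{id}") end},
  {:car, fn -> {:error, :no_cars} end, fn _ -> :ok end}
]

IO.inspect(SagaSketch.run(steps))
```

Because the car rental fails, the flight and then the hotel are cancelled with their specific booking identifiers, which is the "highly specific" cancellation the log makes possible.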
Amos: That sounds like a whole new conversation.
Chris: It’s, yeah, it’s big. That’s a big conversation. (laughing)
Amos: We should do that sometime.
Chris: Yeah, sagas are cool and are very useful in the sort of, the current landscape.
Amos: Of tech?
Chris: No, I just mean like, if you’re working in a big system that has a lot of disparate stuff with lots of services, like, sagas can be a really useful pattern. Transactions in a single database are also very useful it turns out.
Amos: Well, yeah. (laughing)
Chris: So, you know. They’re sort of a replacement for that.
Amos: In a distributed system?
Chris: Yes, yes. So, yeah, so but I think the, the interesting stuff really, I mean all the technical bits and all this plumbing are all in service of the, I mean, the important thing to remember is that all those things are in service of the idea that you’re going to sit and think about failure and allow for failure as a first-class thing. And that’s actually the much more interesting stuff because that, I’ve been thinking about that a lot. And how we allow our tools to account for failure in a graceful way. And how we allow our systems to account for that in a graceful way. Cause I think, I think this is where all of the academic-y kind of research-y stuff, I think this is where types, I think this is where, like, people who think they can prove things, I think this is where all that dies.
Amos: Right ’cause it’s harder, it’s hard to prove timing pulps and-
Chris: Well, yeah, I mean so much of- here’s the thing is, so much of what people want to prove right, it’s like, it’s a mindset thing. So much of what people want to prove in these things- when they talk about like formalisms about computer science, right, is, “I want to prove that there are no failure states. I want to prove that there are no error states.” Like you just, like, essentially remove those. And so the mentality is, like, “I can just think about the system and design it such that it is like a pristine perfect application.”
Amos: Right. Well, you have to, there’s a lot of assumptions that go into that. Like I know every single use case, right? And even using like TLA+-
Chris: Right, and anything I didn’t account for is wrong.
Amos: Right. (laughing)
Chris: Also. That’s another important thing. But I think even beyond that, like, if you look at like, railway-oriented programming, right? Nothing about railway-oriented programming makes sense when you look at it through the view of resiliency and reliability. It all falls apart when you look at it from, from that lens.
Amos: So you have to understand the failure?
Chris: Right. Well, and because the failures matter. In the railway-oriented programming thing, so much of it pushes failure way too far away from the call site. That’s like literally the point. You’re on the rail, you’re on the railroad and then failure’s like this other thing that can happen and you just, like remove that failure from, like, your, your equation right? Like you just, all you see is signal now, right? Except that’s not actually how it works when you look at it from the point of view of a resiliency.
Amos: Right. Cause the failure still gets to the end of the railway, and then if you are going to handle that failure you’re really far from where it possibly happened-
Amos: And so yeah, you can read the happy path of your code in railway-oriented programming really easily but it’s really hard to look at and comprehend the error path.
Chris: And there is a way you can- you could use railway-oriented programming with these, like, branching error conditions, in theory, right? Like, you could still make all that work but it is not the mentality, right? You have to totally shift your mentality to look at that differently and it just becomes really cumbersome. Like I, you know, I’ve attempted to do that multiple times and I find that it falls apart really rapidly. You just can’t fall through enough with clauses, like that just doesn’t work out in the end. Like there’s too many withs, there’s too many external things, there’s too many fallback scenarios, like all that becomes really complicated.
Amos: When your else in your with clause has like 5 different matches and it’s doing all kinds of things, it gets pretty nasty.
Chris: Well, and it gets back to the design aspect of it, which is that you can’t put a cache- Okay, so let’s say I’ve got service A and I’m talking to service B. And service A defines a client for service B, right? It’s got all that, it’s got tuned stuff. It’s got all my HTTP calls and I have good functions to call that mask the underlying HTTP calls or whatever it is, right, to service B. Here’s the thing, your fallbacks cannot be designed into service B. You can’t encapsulate your fallback into the service B client. And it gets back to the- because of what we talked about earlier. Because the call site determines what you do with that. And if you’re calling service B a lot in service A, then each time you call it you need to make a determination about how critical that call is, what you’re going to do if it fails. You can’t make a choice generically. Or, if you make a choice generically, it will be wrong for a certain class of things, and you’ll have to work around that. So the encapsulation becomes really hard. Like the encapsulation has to move. The encapsulation layer for what you’re going to do in fallback scenarios has to move a layer up above the HTTP client layer, ’cause you can’t add a cache there. You have to add a cache above it at the actual call site where it matters, where you’re handling that RPC. And that becomes really hard. That’s also part of why the railway-oriented programming thing starts to fall apart, because there isn’t a generic way to talk about all of the error conditions. Your call site has to care about all the error conditions, and using the tools that are at our disposal in Elixir, like with and all that sort of stuff, to do the railway-oriented thing you end up with way too many else clauses. ’Cause then you’ve got else clauses that call into more withs. And then, you know, it’s like, if this fails in this specific case, so you just say, you know, you end up with a proliferation of functions that need to be able to support all these different use cases.
Amos: Right. Or you shuf- or if it’s not right inside of your with and you’re shuffling it off somewhere else.
Chris: Right, yeah. And that becomes- and then you would have one-off functions that you use one time for the simple purpose of, like, I wanted to use with, and that only works up to a point. And so I think when you look at it from the lens of “I’m going to make this thing reliable, like really truly reliable, like it’s going to fail in these ways, I’m going to allow it to fail in these ways, it’s going to fall through the system in this way.” It becomes tricky to do that in a truly generic way and have maximal control over all that stuff. And so I think we don’t quite have, or at least I don’t have, good patterns for that right now. It’s something I think about a lot, is, how to incorporate that sort of stuff in a way that is, well, what I would define as simple, meaning you’ve got functions that do a lot of things that are deep, that, like, provide value, but it’s also still understandable and still has all this stuff built into it. Which is just, I think that’s really hard. It’s a hard pattern that I haven’t figured out yet. Right now, my handlers for all my RPC calls, it just, you know, it just does a lot of stuff.
Amos: Giant, like one function? Like it might call out other stuff but it’s big itself?
Chris: Yeah it, yeah, I mean they’re just big because they need to, you know, the errors matter.
Amos: I completely understand that.
Chris: Like what if I don’t like to follow that controller thing, right?
Chris: ‘Cause the errors matter. And the errors matter at the call site. They’re not generic, they’re never generic.
Chris: And the more you make them generic, the more you push them farther away, the more complicated your system gets and the less reliable it gets.
Amos: But if you look at just implementations of with, frequently people will tack on some atom and return a two-tuple at each line of the with, so that when there is a failure you know where it came from and so how to handle it.
Amos: So now you have your error condition pushed off somewhere else with a special atom that may only be pertinent in that one location, like you said, and so, like, you see people trying to build little things to be able to use things like with or railway-oriented programming but still capture that context. And to me that means that all of that context should be captured in one place instead of spread apart.
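The tagging idiom Amos describes, where each line of a with is wrapped in a two-tuple so the else block can tell which step failed, looks roughly like this. The module, the functions, and the fallback choices are hypothetical and only serve to show the shape.

```elixir
defmodule WithSketch do
  # Hypothetical steps; each returns {:ok, value} or {:error, reason}.
  def fetch_user(1), do: {:ok, %{id: 1, name: "amos"}}
  def fetch_user(_), do: {:error, :not_found}

  def fetch_feed(%{id: 1}, :up), do: {:ok, ["post"]}
  def fetch_feed(_user, :down), do: {:error, :timeout}

  def handle(user_id, feed_status) do
    with {:user, {:ok, user}} <- {:user, fetch_user(user_id)},
         {:feed, {:ok, feed}} <- {:feed, fetch_feed(user, feed_status)} do
      {:ok, %{user: user.name, feed: feed}}
    else
      # The tag is the only way the else block knows which step failed.
      {:user, {:error, :not_found}} ->
        {:error, :unauthorized}

      # Call-site-specific fallback: serve an empty, degraded feed.
      # `user` is not bound here, even though that step succeeded.
      {:feed, {:error, :timeout}} ->
        {:ok, %{feed: [], degraded: true}}
    end
  end
end

IO.inspect(WithSketch.handle(1, :up))
IO.inspect(WithSketch.handle(2, :up))
IO.inspect(WithSketch.handle(1, :down))
```

Each new step adds another tag and another else clause, and values bound by earlier steps are not available in the else block, which is part of why the error handling drifts away from the call that failed.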
Chris: Right. And the way we do it right now, I don’t think it’s scale- ope, when I say we I mean literally me and other people at B/R- the way we write our RPC handlers now, ’cause we’re sort of moving, it’s not like widely talked about right now but we’re not really using Phoenix controllers and the rest anymore.
Chris: I mean obviously we still have all that stuff, and we still support it, like our legacy systems, but all of our new stuff is not that. We actually, the stuff we’re using is much simpler, it’s much, like, there’s a higher signal there, we don’t worry about the plumbing as we use an actual RPC framework now. It’s not gRPC, I’ll just say that.
Amos: (laughing) Is it internal?
Chris: ’Cause in gRPC the G stands for garbage. (laughing)
Amos: That’s been my experience, too.
Chris: No, it’s not, it’s not internal. Sorry about that, it’s not relevant right now. But like, what we end up with, or what I’ve been doing a lot of, is like your RPC handlers use with to call these different functions. And the functions do these sort of one-off things, but as you add more handlers you need more one-off functions to make your withs actually look like real with, railway-oriented stuff. And so it’s not scalable in a way that I’ve- from a code standpoint yet. And contexts don’t save you. Contexts are right out. Contexts are not the right answer because they don’t provide enough encapsulation. Again, ’cause you can’t encapsulate it that way. The call site has to encapsulate it. So yeah, so it’s really interesting right now, and the way I’m sort of moving to solve it is by making the handlers themselves independent modules and then that module can just do whatever it needs to do for that RPC. But that’s not really scalable either, like, it definitely decreases reuse so there’s- really there’s a lot of interesting discussion around this. And I think that’s a really interesting design discussion. Like how do you make failure a part of the system? How do you make the ability to create (failure) as part of the system and how do you make it scalable from a code standpoint? Like, that’s really hard.
Amos: Yeah, I think we’re going to have to have a whole discussion on that alone at some point.
Chris: Oh, yeah! Absolutely.
Amos: Yeah, we are already at an hour and I know that we both have things to do today. I could sit and talk about this all day. I got like a whole page of stuff that I wrote down. I do think that it would be good to leave some people with some resources to start thinking about the stuff, so I’m going to throw one out there, ’cause we mentioned it a few times, is harvest and yield. Harvest, Yield, and Scalable Tolerant Systems, it’s a white paper by Fox and Brewer. I think it’s a great place to start thinking about, it’s a simple way to handle failure. And so, I think that white paper’s a great place for people to start.
Chris: Yeah, I like that paper a lot. I’ll, we can add- I’ll add links to the other repos that I talked about so people can check those out, if they’re interested in them. I’m trying to think if there’s other, I mean a lot of this stuff that we’re talking about is also in the SRE book, which is- The middle part’s really good, of the SRE book. You should probably read the middle part of the SRE book.
Amos: The middle part?
Chris: Well, yeah, there’s a lot of chapters in the middle that are really worth reading and there’s a lot of stuff on the, on either end, that are like, either platitudes or just not relevant to most companies.
Amos: Cool. Well, yeah, I’ll have to take a look, I haven’t read that.
Chris: Yeah, it’s free on the internets.
Amos: Sweet! I like free!
Chris: (sighs) Alright.
Amos: Well, Keathley-
Chris: It’s been fun.
Amos: Yep. Have a great day. Thanks for the adventure today.
Amos: And the soundboard. (Bicycle bell rings) I’m loving it.
Chris: You’re welcome.
Amos: (laughing) Perfect. Alright, take it easy.