Amos King

 Chris Keathley

 

The Elixir Outlaws now have a Patreon. If you’re enjoying the show then please consider throwing a few bucks our way to help us pay for the costs for the show.

Episode Transcript

 

Amos: Welcome to Elixir Outlaws, the hallway track of the Elixir community.

Chris: Why does this always make you laugh? It always-

Amos: I don't know. Sometimes I wish, sometimes I wish that we were a videocast. I think it would be hilarious.

Chris: I think what you mean is sometimes I wish we took this seriously.

Amos: No, no, I'm really glad we don't. I'm having too much fun to take it seriously.

Chris: I have to keep this super bright and tight.

Amos: Let's keep it bright and tight. All right.

Chris: Toi. Toi.

Amos: It- okay. So I have things that I could talk about, but if you have things that you want to talk about.

Chris: No, no, please it's, as they say, it's your show.

Amos: It's my show. I don’t think so.

Chris: So you take the wheel.

Amos: Jesus take the wheel. Oh, wait. Wrong song. Um, so I'm, I'm uh, currently wrapping a web service.

Chris: Okay.

Amos: Um, and I always struggle with this. Um, on where to handle errors. So I'm taking a little bit from Quinn, right. And I have a protocol that you, you pass different things into. And the protocol really, all it does is it uses Finch.

Chris: Right.

Amos: Thank you.

Chris: You're welcome.

Amos: Grabs grabs, grabs a response. And for everybody else that worked on Finch too, it's not just Chris. Uh, but he'll take the credit. Um,

Chris: No I really won't. I was just going to pass on that credit to, to our dear friend, Mitch, but yes, I understand.

Amos: Uh, so, um, I get the response back and, and that's actually what I pass back out of the, um, out of the service and into a GenServer and the GenServer, I have it parse, um, the body and deal with stuff. So I'm like trying to figure out where to, to deal with errors, to like, and I know that sometimes it's going to depend, but like my, I should, my, should my protocol implementation do the parsing of the JSON and return that error? Should I deal with the error in the GenServer? Like where do you like to deal with errors? Like at what level? And then I also have a web front end that's talking to that, right? Like a Phoenix app that talks to that GenServer that then talks to the service because I'm trying to be responsible and make sure like, it returns in. I need it to like, kind of always work. So if the service is down, I need to return a string. I want to wrap deadline stuff around it. So on the front, the front end, like I always want a string.

Chris: Right. So like when a user, if anything, that you, anything that interacts with a user, you want that to return a real human error, not a stack trace.

Amos: Right? Right. Well, I don't know. I just want it to, I actually want it to, I either need the service to return me a name. Ultimately, I need a name out of it. Or if anything doesn't work, I just want it to return an empty string. I'm, I'm basically using it as in like you type in one text field and auto fill the next one.

Chris: Okay.

Amos: Like if you type in a zip code, fill in a state.

Chris: Oh, sure, sure, sure. Okay. Okay. Okay. Okay. So you want to do, okay, but the point is you want to do something, you need to do something.

Amos: Right.

Chris: Okay. So I'm trying to keep a picture in my head of exactly what you got going on over there. But.

Amos: It's not easy.

Chris: No, no, no, no. I think I understand, but, and then obviously correct me if I get any of this wrong, but it seems like to me that in general, in general, when we talk about errors, there are, we need to, we need to make it very clear if we're talking about errors in the sense of these are expected things that could go wrong, or if we're talking about exceptional cases where we want to either allow something to crash or to like, like, like how many things do you want to actually handle, right? This is, this is really the crux of, of the let it crash thing for me: what can be, what can be handled generically and what can be hand-, what needs to be handled? What errors have to be accounted for? Uh, so if the service that you're talking to is down, what do you do in that scenario? And also, are you going to handle all the different ways that some, that it could be down or are you going to like, have a blanket sort of, oh, well, I'll just let this, I'll let this fail this way, this one time, you know what I'm saying? Like, like, how do you isolate? How do you isolate errors from each other and what errors are important? And also are you going to handle every single possible bad thing that could happen?

Amos: You know, that there are, I think, so many ways that bad things can happen. And this is also where I get stuck is like planning. Like, I can take any little story. Like, this is not, not a big thing. Type in a zip code. It gives you a state or whatever. That's just my example. There are so many things that could go wrong that I could turn like a, something that I think would take a day to turn around and make it take weeks, you know, like exponential backoff. Or if the service is down, maybe I return from a cache and I wait a little bit and I let it come back. Uh, if, if I can't authorize with the service, maybe I say, hey, we, we just need to let it die. This needs to end because I'm not going to be able to auth in 10 seconds either, more than likely, like, like if, like, if there's a key change or something.

Chris: Right. Right, right right.

Amos: So I think every, every error is different, but I think that's where I get into, like, where do I handle it? Part of me just says, handle it in the GenServer, because that's the thing that decides whether it crashes or not, but maybe not all errors go there. And that's where I'm like, well, do I parse the JSON in the GenServer? Or do I leave that down at the level? And if they change their JSON, do I just say unparsable and try again? I guess that depends. Like, cause it's probably not going to work if I can't parse the JSON, but I may not know why until I physically go look at it, right.

Chris: Right. Yeah. I mean, I think there's two classes.

Amos: I just threw out a bunch of different error types, but-

Chris: Well, so, so that's, uh, the, the, let's talk about the JSON one, for example, because I think that's an interesting, there's an interesting sort of discussion to have there. So there are two types of this JSON is not parsable. Um, I mean, I guess there's like three different ways that that can manifest, but the two that are interesting are you request an endpoint and you believe it to be returning JSON, like you've had a discussion with them or they're, they've, they've given you a contract of some sort, like you have some sort of relationship such that you expect there to be JSON coming back. And if you call them and let's say they like 404 or something, and as part of their 404 they return you HTML. Well, now you can't render, you can't parse JSON anymore. Cause it's not JSON, it's HTML. Right. So that's the scenario-

Amos: Not JSON.

Chris: You have non JSON. You have, you have the fact that it's just not JSON at all. You have the secondary thing when, and that's more like a, the service is wrong, you know, or, or a lot of web servers will do that. You, you know, they'll default back to HTML cause they just don't know what to do in that scenario or whatever. They don't know how to render a JSON payload that says 404, you know, it's just like, that's a, that's a, that's a common problem. So that could be an issue. The other thing though, is that they could just stop returning JSON too. Now that's like very unlikely, but it's a possibility. Do you want to write code around that? I don't know, but it's possible that they just, all of a sudden decide MessagePack is better one day.

Amos: But the, the one that I've seen frequently in my past is, um, I'm going to use Chris Keathley's favorite word, somebody refactors and they change the public interface. Uh, and maybe it said state before and now there's state name and state abbreviation and they dropped state entirely.

Chris: Right. And so that's, that's the, that's the third thing, right? That's the third one that we actually are really worried about. Um, and people break APIs like that all the time. And then they claim that it's for the betterment of everybody and really what they've done is just, you know, democratize all of, they've just arbitraged all your time. They've pushed all the costs back onto you, especially, it's really fun when you pay for that, for a service that does that too. It's really good. That's a real Google, a real Google approach to APIs. We know that you pay for this, but, uh, actually we believe that you should use the new undocumented API because we're really good engineers. Thanks, Google.

Amos: And we always return two hundreds.

Chris: Yeah. Yeah. So, but I think it is interesting that like, okay, so the first class of problems, the first class is just like, things are wrong. In that case, you probably want to retry it. Cause if you're getting a four, a 404, or have some remediation, cause you, you know, that they're, you know, it's a 404, it's a 500 or something and they just didn't return you the right stuff. It's probably safe to retry in that case. And if you get the error-

Amos: What if the status is wrong?

Chris: Well, or let's say, if it's a 404, maybe you don't retry, but if it's a 500, you do. And you just got like the wrong payload back or something like that. Maybe it's safe to retry that. And the second class, when they just change stuff, hopefully your tests are catching that. Hopefully you have a test that, or some sort of contract that is in place that you can say like, Hey, this doesn't, this doesn't look the way I, I expect it to look.
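
The 404-versus-500 rule of thumb above can be written down as a small decision function. This is only a sketch: the module name `RetryPolicy`, the function `decide/1`, and the exact status groupings are assumptions made for illustration, not something stated in the episode.

```elixir
defmodule RetryPolicy do
  # Hypothetical helper (assumption): maps an HTTP status code to an action.
  # 2xx is fine, 5xx is treated as "the service is wrong, try again",
  # and everything else (such as a 404) is treated as not worth retrying.
  def decide(status) when status in 200..299, do: :ok
  def decide(status) when status in 500..599, do: :retry
  def decide(_status), do: :give_up
end
```

With this, `RetryPolicy.decide(500)` returns `:retry` and `RetryPolicy.decide(404)` returns `:give_up`. A real policy would probably also cap the number of retries and treat rate limiting separately.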

Amos: So do you have automated tests for external services or do you just have a notification in your app?

Chris: Sometimes. Sometimes.

Amos: Okay.

Chris: But I also lean a lot more towards, these days, towards put that junk into production and have really good monitoring and observability, which is really the same thing as testing, if you can get to that point. Like testing and observability live on the same spectrum in a lot of ways, because of, you know, the more you lean, you start to, the more you start to realize that tests and large scale tests, especially in services become a really huge bottleneck on delivering anything. If you seek to retain your agility in like producing new code, what will inevitably happen is that you'll realize that the testing that you need to do needs to be much more sort of continuous and happening all the time, at which point, why aren't you just running it against your production instances and, and how do we learn that stuff's gone wrong? Well, we're going to build really good tools to monitor and observe the behavior. And then, yeah, you've just built, you know, you, you know you're testing in production. Because you just give up on the idea that you can test in a sanitized way. But back to your point, I think having contracts in place is really good for that kind of stuff. And often services that you know, that you pay money for one, they're, they're not likely unless they're Google and they're bad at engineering, they're not likely to, they're not likely to change their APIs, you know, like Stripe famously, it's like you can use any version of their API that's ever been back 10 years ago or something like that.

Amos: Yeah, it’s pretty wild.

Chris: It's, you know, they've just cared a lot about that. Um, which I think is what we ought to all be striving for in a lot of ways for services like that, like that are really crucial. So. But in any case they'll, but they'll often have like sandboxes and stuff like that. So you can hit a real request in your CI suite. You don't have to do it like locally, but in CI you hit a real API and you're like, hey, did you know occasionally, or you validate a contract in some way, even down to like pull their docs and compare, like, I've done that before too. Like, you know, you, you don't want to make that arduous, but, uh, to some degree too, you kind of have to trust them. Like, I don't think you can verify the world. And so you kind of just trust that, like you get close. So that's the second class of things. And then the third class of things is, yes, you've got a really smart engineer who's like, this will be so much cleaner if I do it this way. And then they like change the keys inside the values and break everybody because they thought they liked it better that way. Or it's more optimized for them or whatever. Any host of any number of bad reasons that are all justifiable from like an engineering purity perspective, but are actually like really net bad. So yeah. And then they change the keys and then you break. Right? And so in that case, we can't do anything about that. So you just blow up, right? Like you can't solve that problem. And in fact, your user can't solve that problem either. They need to know that something went wrong, but they can't solve the issue on their own. They can't do anything about it. So they're going to have to figure it out too. So, uh, in that regard, the question becomes where do you, the developer, want to handle all those things? And along with that, do you want to handle them at all? Do you want to handle all of those use cases or is there a generic way that you can handle all of them? À la let it crash.
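
One lightweight way to "validate a contract" in a CI run against a sandbox is to check that the decoded response still has the keys the code depends on. The sketch below assumes the response has already been decoded into a map with string keys; the module name `ContractCheck` and the expected key `"state"` are hypothetical, borrowed from the zip-code example earlier in the conversation.

```elixir
defmodule ContractCheck do
  # Hypothetical contract (assumption): the keys this app relies on.
  @expected_keys ["state"]

  # Returns the expected keys that are absent from the decoded body.
  # An empty list means the response still matches the contract.
  def missing_keys(decoded) when is_map(decoded) do
    Enum.reject(@expected_keys, &Map.has_key?(decoded, &1))
  end
end
```

If the provider renamed `"state"` to `"state_name"`, `ContractCheck.missing_keys(%{"state_name" => "Missouri"})` would return `["state"]`, which a CI job could turn into a failure before users ever see it.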

Amos: Right. Yeah. I mean, I guess it, it depends on each one because I can see like services down. I might want to bubble that up to the user and say, hey, sorry, we couldn't auto complete. You can still type in, we have a back, like a backup plan, kind of thing. If I can't parse the JSON, do I, I may want to tell the user the same thing.

Chris: Does it matter to the user? The service is down, right?

Amos: Not at that point.

Chris: So you maybe, maybe what you do is you isolate that entire call. Maybe not all of it, right? But just that call, you're making a call to some service somewhere and then parsing JSON, maybe you isolate all of that in a process somewhere. Be it a task, GenServer, whatever. But you isolate that. And then if that process, and then you just raise, then you write that code, that code's trivial, right? All of a sudden that code becomes five lines of code. That task has five lines of code inside of it because you just use bang functions everywhere and you raise and you don't care. Now you as an engineer and a developer need to care. Your monitoring and your observability need to care to some degree, but like you can, that's on you to figure that out. And then you raise the, you raise an exception, you have SASL logs or something like that, that pick up errors where you're like this crashed, and you can do something with that. I don't know, whatever you want to do about, to, to monitor that process crashing a whole bunch of times, but then you just let the call crash. And then you either are trapping exits in your calling process or you're awaiting, uh, the like, or you're awaiting a response and you don't get it back or you, you know, whatever the case may be. And you can run that other, you can run your HTTP call and JSON stuff inside of a process somewhere, uh, underneath its own supervisor. You don't have to link it to your calling process necessarily, although it's, there are benefits to doing so, and then you can take whatever remediation you want or respond to the user and say, hey, it's dead. And you know, service is wrong. I don't know what to do about this.
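
A minimal sketch of that isolation, under some assumptions: `StateLookup` and `state_for/4` are hypothetical names, `fetch_fun` stands in for the "five lines" that make the HTTP call and decode the JSON with bang functions, and `sup` is a `Task.Supervisor` that a real application would start in its supervision tree. The task is started with `Task.Supervisor.async_nolink/2`, so a crash inside it does not take the caller down, and the caller gets an empty string on a crash or a missed deadline.

```elixir
defmodule StateLookup do
  # Hypothetical wrapper (assumption): runs `fetch_fun.(zip)` in an
  # isolated task and always returns a string to the caller.
  def state_for(sup, zip, fetch_fun, timeout \\ 1_000) do
    # async_nolink: the task is monitored but not linked to the caller.
    task = Task.Supervisor.async_nolink(sup, fn -> fetch_fun.(zip) end)

    # Wait up to `timeout` ms; if nothing came back, shut the task down.
    case Task.yield(task, timeout) || Task.shutdown(task) do
      {:ok, state} when is_binary(state) -> state
      # Crash ({:exit, reason}), timeout (nil), or a non-string result.
      _other -> ""
    end
  end
end

{:ok, sup} = Task.Supervisor.start_link()

# The happy path returns the name from the (stubbed) service.
StateLookup.state_for(sup, "63101", fn _zip -> "MO" end)

# A raise inside the task is logged, and the caller still gets a string.
StateLookup.state_for(sup, "63101", fn _zip -> raise "service down" end)
```

The crash still shows up in the logs, which is where monitoring can pick it up; the calling code never has to enumerate the ways the call can fail.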

Amos: Why do you link the service to your calling process? Let's say that you're, you're a Phoenix app, right? And you have something. Why would you link that service call process to, to that Phoenix front end instead of just leaving it as a separate process tree, maybe in your, your backend processes?

Chris: For a couple reasons, um, one is their life cycles are inextricably linked together. That's why it's called a link. You know, the, the, the life cycle of that web request depends upon the life cycle of the other thing. And so, and vice versa, the life cycle of the other thing is dependent upon the life cycle of the web request in, in general, in theory, right? Like maybe it's not, maybe you want to go full async mode, but if you're getting a web request and you're going to respond to the user in that web request, when you go to make an HTTP call, let's say, you start making that HTTP call, and now you're in async land and your web request could be doing anything. What if your web request gets shut down? What if your web request process dies? And it dies because it was trying to do some other work and that crashed or some other process it's linked to died, right? Now, you don't want to be spending time doing a bunch of work to call a service that no one's going to ever see. So just crash the process and move on and clean it up and get it all out of there. Like that's the benefit of the links, right? Is that you can start to break all this. Like you start to take all these things down and it's cheap to do so. That's the benefit, or in theory, that's the benefit, right? You have to just kind of design your system to support this, but when two things are dependent on each other, you want to link them.
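
The cleanup behaviour of links can be seen in a small experiment. In this sketch a plain spawned process stands in for the web request process (an assumption; no Phoenix is involved), and `LinkDemo` is a hypothetical module name. `Task.async/1` links the task to the process that calls it, so killing the "request" takes the in-flight task down as well.

```elixir
defmodule LinkDemo do
  # Kills a stand-in "request" process and reports whether the task it
  # started with Task.async/1 (which links) went down along with it.
  def run do
    observer = self()

    request =
      spawn(fn ->
        # Stand-in for a slow HTTP call that nobody will wait for.
        task = Task.async(fn -> Process.sleep(:infinity) end)
        send(observer, {:task_pid, task.pid})
        Process.sleep(:infinity)
      end)

    task_pid =
      receive do
        {:task_pid, pid} -> pid
      end

    # Monitor (not link) the task so this process can watch it die.
    ref = Process.monitor(task_pid)

    # Simulate the web request process dying mid-flight.
    Process.exit(request, :kill)

    receive do
      {:DOWN, ^ref, :process, ^task_pid, _reason} -> :task_went_down
    after
      1_000 -> :task_still_running
    end
  end
end
```

Because the task does not trap exits, the exit signal from the dead request process stops it, so no work is spent on a response nobody will read.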

Amos: So you do, you spin up one of these, um, I'm going to call it a service process, the process that goes out and hits some service for you. Do you spin that up, uh, at the front end, like in, inside of handling the Phoenix request? Or do you have one sitting in the tree already?

Chris: It depends on how much concurrency you want. So if you're comfortable, well, and even then your concurrency is going to be somewhat limited, but, um, cause you've got a, uh, a HTTP connection pool in there somewhere if you're making HTTP calls. So if you want, I'm trying to think of an example that doesn't, that wouldn't already involve some other pool somewhere. The problem, the only downside to calling, well, not the only downside, one of the downsides to spawning a process that is like a task, let's say, inside of a web request that then makes an HTTP call is that your web requests, uh, depending on your web server and depending on how you've configured stuff, are kind of like theoretically unbounded, right? It's an unbounded queue problem. So you've got a bunch of web requests. You could get a hundred, one hundred users right now. You could get a thousand users right now, or ten thousand, right? You don't really get to control that. Now there are, there are ways to control it at sort of like your load balancer level and that sort of stuff. But if you imagine just your web server servicing web requests, it just has to spawn as many processes as it can when it gets web traffic. So that's an unbounded queue. Now-

Amos: I guess, I guess if you do that too, you don't, you don't end up with a single process that's sitting there trying to handle all those and blocking.

Chris: Right. So you want a pool of those processes, which your HTTP client already gives you. But the downside to doing, just to spawning a task in this case, is just that you're going to spawn N number of tasks for N number of web requests. So any request that hits that end point is spawning a task, which may be, um, also, which is also unbounded, right? Like if we assume that your web requests are unbounded, which is like loosely a thing that we could, you know, that's a little bit of like draw the rest of the owl, uh, like we're just drawing circles and pretending that these are, you know, or it's like, it's like, uh, how fast is the horse moving? We'll assume the horse is a sphere and there's no wind resistance. That's a little bit of what we're doing right now. We're sort of like making some broad assumptions, but you're going to have an unbounded amount of tasks being created, um, which then are going to bottleneck on a pool because any HTTP client that you use is going to have a pooling solution behind it, just about, any, any, any realistic HTTP client that you're using is going to have a pool.
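
The episode leaves the bound to the HTTP client's pool. As a separate illustration, and not something proposed in the conversation, the number of in-flight tasks can also be capped at the supervisor with the `:max_children` option of `Task.Supervisor`. The module name `BoundedTasks`, the function `try_start/2`, and the limit of 2 are assumptions for the sketch.

```elixir
defmodule BoundedTasks do
  # Hypothetical helper (assumption): starts `fun` as a supervised task,
  # or reports overload when the supervisor is already at its limit.
  def try_start(sup, fun) do
    case Task.Supervisor.start_child(sup, fun) do
      {:ok, _pid} -> :started
      {:error, :max_children} -> :overloaded
    end
  end
end

# At most two tasks may be alive under this supervisor at any time.
{:ok, sup} = Task.Supervisor.start_link(max_children: 2)

slow_call = fn -> Process.sleep(:infinity) end

BoundedTasks.try_start(sup, slow_call)
BoundedTasks.try_start(sup, slow_call)

# The third attempt is refused instead of queuing up without limit.
BoundedTasks.try_start(sup, slow_call)
```

Refusing work at the edge turns an unbounded queue into an explicit "overloaded" answer that the caller can map to a fallback.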

Amos: Right. Well, or your service is going to bound you too, probably.

Chris: Uh, yeah. So, so that's the trick though, right? Is that your, in this case, you're spawning a task to isolate errors cause you just don't want to handle them. Like you just don't want to deal with all the different errors that you're going to get. And why should you do that, you know? That would, that's, you know, there's a ton of complexity that goes into what kind of error was this, is this the kind of error that blows a circuit? Is this the kind of error that does this, that, or the other thing, like some of that you want to handle? Cause it's important. Some of this is like we got a 404, so don't retry, but this was a 500. So go ahead and retry. Some of that you want to build in, some of that, you know, uh, it's like if you know that you're getting rate limited, maybe that doesn't count towards, you know, maybe you take some sort of remediation about that. If you know you're violating deadlines, maybe you take some sort of remediation about that. And maybe some of those errors matter, but like the JSON thing? Yeah, just blow up. Like at that point now you can safely blow up and that's really cool.

Amos: So do you, do you use tasks a lot then?

Chris: Oh yeah. I love tasks. Tasks are great.

Amos: I don't use them.

Chris: I don't know about enough, but like I use tasks all the time.

Amos: I barely use them.

Chris: I use tasks to isolate stuff all the time and that's really the benefit I think, of, of that approach is that you're isolating errors and that's the let it crash thing. When people talk about let it crash, it's not like you just let it all blow up. It's that you isolate errors such that it can blow up. Like the benefit is that it is that you get to let it blow up.

Amos: Okay. So where, where I start to worry about this, right, is, you know, that, that unbounded queue, if I spin up a task to go hit that service and somewhere, I feel like I need to keep track of like how many times it's failed in a certain amount of time, because I may just say we shouldn't spin up any more of these tasks. So where do you, where do you put that? Do you, do you store it in like global memory or how do you, how do you deal with that in the, in the task oriented I spin up a task inside of my, my Phoenix process?

Chris: The way-

Amos: In that world.

Chris: Yeah, for sure. I'm trying to think of, I don't know if I've done this in Phoenix specifically, but we did this. I, when I wrote our Kafka library, I, I, this is how I wrote it. And the way it worked is you would spawn a pr-, you would spawn a task or a process somewhere, typically a task, um, underneath the supervisor. Um, and then you don't link. You have a separate process that doesn't link to that task, but does monitor it.

Amos: Okay.

Chris: And you, and you monitor it. What you're monitoring it for is like, that's essentially your work coordinator. You have a work coordinator that monitors these tasks, but doesn't actually do the work. It disseminates work to either a pool of processes or other tasks somewhere. And the work coordinator is keeping track of things like how often has this one failed, uh, what PID is associated with this work ID, that sort of stuff. So you can, you know, every work process, every piece of work that comes in, you tag with like a ref ID, and then you disseminate it out to a task or some process. You monitor that process to see, does it crash or does it return to me and say, hey, I'm done. And if it crashes, then you can restart it either on a timer. So you send yourself, you send the work coordinator can send itself a message that is like, oh, hey, this crashed, so I'm going to back off by this amount, cause I know it's crashed this many times and then I'm going to like fire it off again. And in that way, you allow yourself to build, you know, a more robust set of workers because you don't worry about the work being done inside of them crashing. You allow more people to write- And this was the benefit of the thing I worked on with Kafka is like, I wanted other people to be able to write this stuff. And without worrying about crashing the, the Kafka consumer, cause you've crashed the Kafka consumer and then it takes a long time to come back up and you have problems. So I wanted to isolate all the various types of errors that could happen and allow other people to write workers without having to worry about all that kind of junk and worry about backoff and worry about retries. So the work coordinator just does all that for you, right? So you can ask the work coordinator to run a task. Yeah. I mean, the API was written in such a way that like, you never knew that that was what was going on. You just said do stuff and it figured it out. Cause I don't believe in like exposing all that to the end user. Like you just want an API that's like do the work, or you want a behavior that you hook into that, you know, has rules about it. But that's, that's, that's how I would approach it. And if I really, really was worried about it, that's how I would approach it for web requests as well. But I also think that there are certain web stuff where, so, you know, there, there are certain types of HTTP calls that are going to crash and then you just kind of don't care. Like you'd rather just sort of crash as well. There's a class of things that you would just rather crash for.
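
A rough sketch of a work coordinator along those lines follows. It is not the Kafka library's code: the module name `WorkCoordinator`, its `run/2` and `failures/2` functions, and the `:backoff` and `:max_attempts` options are all assumptions made for illustration. The coordinator tags each piece of work with a ref, runs it in a monitored (not linked) task via `Task.Supervisor.async_nolink/2`, counts crashes per work id, and sends itself a delayed `:retry` message with exponential backoff.

```elixir
defmodule WorkCoordinator do
  use GenServer

  # Hypothetical API (assumption): names and options are illustrative.
  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)

  # Hands a zero-arity function to the coordinator; returns the work id.
  def run(coordinator, fun) when is_function(fun, 0) do
    GenServer.call(coordinator, {:run, fun})
  end

  # How many times the given piece of work has crashed so far.
  def failures(coordinator, work_id) do
    GenServer.call(coordinator, {:failures, work_id})
  end

  @impl true
  def init(opts) do
    # A real app would start this supervisor in its supervision tree.
    {:ok, sup} = Task.Supervisor.start_link()

    {:ok,
     %{
       sup: sup,
       backoff: Keyword.get(opts, :backoff, 100),
       max_attempts: Keyword.get(opts, :max_attempts, 5),
       # task monitor ref => job
       running: %{},
       # work id => crash count
       failures: %{}
     }}
  end

  @impl true
  def handle_call({:run, fun}, _from, state) do
    job = %{id: make_ref(), fun: fun}
    {:reply, job.id, start_job(state, job)}
  end

  def handle_call({:failures, id}, _from, state) do
    {:reply, Map.get(state.failures, id, 0), state}
  end

  @impl true
  # The task replied with a result: the work is done, stop tracking it.
  def handle_info({ref, _result}, state) when is_reference(ref) do
    Process.demonitor(ref, [:flush])
    {:noreply, %{state | running: Map.delete(state.running, ref)}}
  end

  # The task crashed: count the failure and schedule a retry with backoff.
  def handle_info({:DOWN, ref, :process, _pid, _reason}, state) do
    case Map.pop(state.running, ref) do
      {nil, _running} ->
        {:noreply, state}

      {job, running} ->
        count = Map.get(state.failures, job.id, 0) + 1
        failures = Map.put(state.failures, job.id, count)

        if count < state.max_attempts do
          # Delay doubles with every crash: backoff, 2x, 4x, ...
          delay = state.backoff * Integer.pow(2, count - 1)
          Process.send_after(self(), {:retry, job}, delay)
        end

        {:noreply, %{state | running: running, failures: failures}}
    end
  end

  def handle_info({:retry, job}, state), do: {:noreply, start_job(state, job)}

  defp start_job(state, job) do
    # async_nolink monitors the task without linking it to the coordinator.
    task = Task.Supervisor.async_nolink(state.sup, job.fun)
    %{state | running: Map.put(state.running, task.ref, job)}
  end
end
```

A caller only writes `WorkCoordinator.run(coordinator, fn -> do_the_work() end)`, where `do_the_work/0` is whatever function is allowed to raise; retries and backoff stay hidden behind that one call, which matches the "just say do stuff" shape described above.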

Amos: Right. Yeah. If I can't, if I can't serve anything to the user.

Chris: Yeah.

Amos: I just let the whole stack go up.

Chris: Yeah. Just 500.

Amos: Yep.

Chris: So that's my general approach to trying to isolate errors. If I can isolate them such that I can remove a lot of overhead from the user so that they're not writing, uh, copious amounts of error logic, then that's good. That's net good.

Amos: Nice. That's all I had to talk about today.

Chris: Cool. Well that's good. Cause I gotta run.

Amos: I know you said you had to keep it short and tight.

Chris: Yeah. So this is good. I hope that that's, that's a useful, uh, you know, those sorts of concepts, the idea of isolating errors, the idea of, of not handling the stuff that you don't have to handle, that you just don't want to handle or that doesn't need to be handled. It's gonna, one, eliminate a ton of code. You just don't have to write any, that code anymore, which is great. And you free others up to do that as well. And if you can solve this holistically, then you, then you really get away with like building a system out of a lot less code, which is always a thing to strive for, I think. So. Yeah. That's my, that's my thoughts on it.

Amos: Awesome. Thanks. I want to point out to everybody that we talked about a lot of Elixir today.

Chris: Yeah. In a very short amount of time.

Amos: Yeah. Enjoy. Don't get used to it.

Chris: All right. I got to go.

Amos: Alright, have a good one Keathley.

Chris: Bye.