Episode 96: Stuck Between Two ETS Tables and a GenServer

Amos King

Chris Keathley

The Elixir Outlaws now have a Patreon. If you’re enjoying the show then please consider throwing a few bucks our way to help us pay for the costs for the show.

Episode Transcript

Amos: Welcome to Elixir Outlaws, the hallway track of the Elixir community.

Amos: How's it going?

Chris: Oh, good. I'm good. I'm good.

Amos: Ever open-ended question. That actually means nothing other than I actually think you're a fun person.

Chris: I have work on my mind because I was in the middle of, just, I was in the middle of at least three GenServers and an ETS table. And normally they call that Friday night, but right now it's work. So it takes some of the fun out of it. But, I uh,

Amos: Three GenServers. How many Gens did you order? It takes three, three servers to bring you your gen?

Chris: And an ETS table. Three GenServers and an ETS table.

Amos: How many gins can you fit in an ETS table? Anyway, that's terrible. So that's, like, you’re just saying, having programming on your mind, like that is the bane of my existence. I feel for every, every spouse, partner, significant other of any software developer, because I don't know very many of us that can just shut it off and put it away.

Chris: I know I have a hard time with it. I'm having a hard time with it literally right now, um, where I'm just thinking about what it was, what it is that I needed to have been fixing and working on. But it's all good. Getting there.

Amos: That's all right. We just need you to keep talking about ridiculous stuff and, and you'll move on. Maybe.

Chris: Yes.

Amos: Or you'll solve it and be like, we gotta cut this recording.

Chris: I dunno, sometimes I go in waves, right? Sometimes it's like, ah, I just can't- the problem takes hold and you can't let it go. And you're, that's all you're thinking about. That's, that's currently where I'm at.

Amos: I seriously, I think the best thing that ever happened to probably, probably my career and my marriage, it was also really hard, was driving 100 miles each way to work because I was able to just have that thinking time, like I could leave a problem at work. And by the time I got home, I was, I had already thought through it so much that I didn't have to, I didn't have to like wind down to get away from it. I already had, I don't know. I solved a lot of problems on the road.

Chris: I have to go take a walk a lot, like after work, you know?

Amos: Oh, yes.

Chris: I'll look at Andrea and I'm like, my, I am, “My brain is like that of a newborn baby, currently.” Like, I'm squishy. Like, I need to go. I need to go wander around a little bit and just decompress from that. Come back to earth.

Amos: I think that's a fantastic description. Like, I know when my wife tries to talk to me after I've had like one of those sessions like that, I'll talk to her for 20 minutes and barely know anything that I've been saying. And everyone like sometimes she'll notice and sh- she goes,"You're programming, aren't you?" Yes, yes, yes, I am. I'm really sorry. Uh, this is why I think that, uh, hourly wages for developers is ridiculous. Like, telling your developers, “You have to be in the seat 40 hours a week,” because I promise you they're, they're coding when they're not in their seat.

Chris: Yeah. I, yes. I tend to agree. Tend to agree.

Amos: Alright, so GenServers, and ETS table.

Chris: Yes.

Amos: We we've, we've talked about this over and over. Well, you've brought it up over and over. I don't know that we've actually had much of a discussion on it,

Chris: GenServers and ETS tables? Have we gotten onto almost a hundred, a hundred episodes, and we haven't talked about a fundamental key part of Elixir? It's probably because we spend too much time talking about what our kids ate for breakfast.

Chris: You mean, you mean GenServers? Yes. And ETS tables.

Amos: Yeah. And ETS tables. Alright. So, so, so fill me in, fill me in on the magic. Uh, I've heard you multiple times bring up how ETS tables is something that is, is key and that we should be using them more, uh, and thinking about using them more. So, so fill me in on the magic of ETS and why, where do you use it? Why do you use it?

Chris: Okay. Yeah, sure. So ETS tables feel like, uh, I don't know. It's, it's so interesting. My read on the situation is ETS tables are like, I dunno, I feel a little bit like maybe I reach for them too much, but a lot of the work I end up doing, uh, relies on ETS tables, uh, and the guarantees that they provide. And I feel like they are an under-, I won't say they're underused, like we should be using them more. I think they are under-, like, like they're not understood well by the greater community and by people who are just sort of arriving to Elixir. I feel like people don't realize what ETS tables do for you and what they don't and how to use them and when to use them. Um, and yeah, so, so we can certainly talk about it. So ETS tables are a, uh, we'll say a primitive provided by OTP, uh, really by the BEAM, which I guess are somewhat the same thing. I dunno what we're calling it all these days, all the different terms that we're, that we're breaking down, uh, let's just call it Erlang at this point.

Amos: Well, I think BEAM is the virtual machine. OTP is a set of, uh, abstractions and things on top of some of the primitive functions like receive and send, right?

Chris: No, I mean, I get the difference, but like, they're provided with Erlang. That's the takeaway, right? They're provided with Erlang.

Amos: Fair, fair.

Chris: As much as, at this point, GenServers are basically provided with Erlang, like, I mean, that's not ac-, like, this is not technically accurate. Uh, but it's basically accurate.

Amos: They're the standard library. They're part of the standard library.

Chris: Uh, any, in any case. So one of the things that we love about Erlang and about the, the runtime that we work in, about the actor model and how it's implemented with this immutable data and these actors, is that it eliminates an entire class of race conditions. Now, of course, you can have concurrency bugs in an Erlang system. You can have race conditions. You can deadlock in an Erlang system. That is all possible. You can make it happen, but there is an entire class of bugs that is just eliminated by the fact that you cannot share data. In order to share data, you copy it and you send it to someone else. There is no mutable data. Except that there are these escape hatches because Erlang is a deeply pragmatic language, a deeply, deeply pragmatic language. It has escape hatches everywhere because at the end of the day, you need to do real work. You gotta do real work.

Amos: If you can't munge data, you can't actually do anything.

Chris: Right. I wasn't around when ETS tables were added to the language, but for me, the ways that I use them, they are an amazing escape hatch for when you need out of the idea that you can't share anything, uh, in the sense that.

Amos: Are you talking about sharing between multiple processes?

Chris: Yeah, because if you're going across a process boundary, the only way to get data from inside of a process is to send a message to it. That's the only way you can get access to data. You have to send a message, that message has to go into the mailbox. It has to be processed. It has to be returned. Uh, and, and all of that, that data now needs to be copied across the process boundary into your process, into the caller because everything's processes, right? Like all like all the way down. If you're in an IEX shell, you're in a process somewhere. You know, if you're in a web request, you're in a process somewhere. And if you want to get access to some sort of like shared data, for instance, some stuff that ought to be shared across multiple, uh, processes, you gotta, you, you have only a couple options. And one of them is you call a GenServer and you, or, you know, you call an agent or you call into a process of some sort. And that process, that that message is going in a mailbox. It's going to get processed. It's going to come back to you eventually. And all that data's going to get copied and copying is expensive. And, uh, and that GenServer becomes a bottleneck. And if you're only running one of them, for instance, you know, this like single global process thing, which then ends up on a cluster somewhere, uh, and then you're really, really doing it poorly. So, and then you got problems.

Chris: So. There's often times where you have, let's say a shared set of data that is, uh, maybe cached like cached values, right. Or they are, uh, largely immutable in the sense that they do not change over the course of time, right? You write them into there once and you don't have multiple writers and you do have multiple readers. Like everybody needs access to it. So for instance, uh, one, a good example of this would be if, uh, when I was at Bleacher Report, we needed to, uh, turn off traffic to downstream systems occasionally, like, cause they were unhealthy or they couldn't keep up and they were getting overloaded and we would shut them off. And we would need to return cached values that we had been caching the whole time, that all, when we were making good requests, we were putting in a cache somewhere. And now we want to return those cache values because we want the user to see something. If we had to call into a GenServer to do that, that GenServer becomes a bottleneck, especially if it is, if everyone needs to call the same GenServer, right. If everybody needs to get to the same process, it becomes a bottleneck.

Chris: So how do you solve this problem? Well, ETS tables are a great way to solve that problem. And the way an ETS table works is it is an Erlang term storage table and it creates an in-memory table. Uh, they have different types that you can use. Um, by default, it is a set, which basically just means that you have keys and values and you can put keys and values into this table. And then you can read from it. And the really cool thing is that all of the reads can happen concurrently. They do not bottleneck on a single process. There is no single process for an ETS table when you're trying to read data out of it. Now you still do pay the cost of copying. You still have to copy the data out of the table, into your process when you go and get it. And that can still be expensive, but you no longer have this central, effectively, mutex around who can get access to the data. And so if you have stuff that needs to be widely read, for instance, you can go and, and everybody can read from that ETS table all at once, and it will handle that concurrency for you. And now you no longer have that bottleneck on it. And it massively speeds up, uh, your independent look-ups. So now no, none of your processes are all like queuing up. They're all able to run. They're all able to be isolated from each other. And that becomes a really, really useful pattern for when you need to, when you're willing to go outside of the rules of the game, so to speak. It's really useful when you want to kind of cheat or get out of the, get out of the business of serializing all of your calls through a single process somewhere. It's really, really useful for that.
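
A minimal sketch of the pattern Chris describes here: one process owns and writes the table, and many other processes read from it at the same time without sending that owner a message. The table name `:demo_cache` and the cached value are made up for illustration.

```elixir
# Hypothetical cache table. :protected means only the owner (this process)
# may write, while any process may read.
table = :ets.new(:demo_cache, [:set, :protected, :named_table, read_concurrency: true])

# The single writer puts a value in.
true = :ets.insert(table, {"article:1", %{title: "Hello"}})

# Many readers look the key up concurrently. Each lookup runs in its own
# task process and copies the row out of the table into that process.
results =
  1..100
  |> Task.async_stream(fn _ -> :ets.lookup(:demo_cache, "article:1") end)
  |> Enum.map(fn {:ok, rows} -> rows end)
```

Each reader still pays for copying the row into its own heap, which matches the caveat above; what goes away is the queue in front of a single process.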

Amos: So can, can ETS, you know, I know that you said you can concurrently read and, you know, people like Amazon want you to think that you can have an infinite, infinite amount of scalability too. So what, what are the limits of ETS? Like, like, is there a point where it becomes a bottleneck?

Chris: Um, I mean, ETS, ETS, uh, for certain operations, isn't the fastest thing out there. ETS can do a lot, let's be clear. ETS, like I said, it has multiple types of tables that you can have. Um, so you can have bags which have like, which allow you to have multiple items inside of a single key and that sort of thing. And you can do ordering and, um, there's different performance characteristics for all these things. You can also, uh, do what's called, um, read optimizing or write optimizing your ETS tables. So you can actually mark an ETS table that says, “Hey, I'm willing to sacrifice read performance because writing is way more important and I'm willi-” or the inverse of that, “I'm willing to sacrifice write performance because reads are much more important.” And so you can, there's a bunch of knobs you can tune. You can also do things like atomic counters. So you can say, “I want you to atomically increment a counter and return me the value to it”, right? Um, and that way you can have multiple writers, all updating, incrementing counters on an ETS table. And, uh, that is like ordered correctly. So.
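
A small sketch of two of the knobs mentioned above, using only the standard `:ets` module: a `:bag` table that holds several rows under one key, and an atomic counter bumped by many writers at once. The table names, keys, and the choice of `write_concurrency: true` are assumptions for the example.

```elixir
# A bag allows several rows under the same key.
bag = :ets.new(:demo_bag, [:bag])
:ets.insert(bag, {:tags, "elixir"})
:ets.insert(bag, {:tags, "erlang"})
tags = :ets.lookup(bag, :tags)

# A public table tuned for concurrent writers.
counters = :ets.new(:demo_counters, [:set, :public, write_concurrency: true])

# 1,000 tasks increment the same counter. update_counter/4 is atomic and
# inserts the default row {:hits, 0} if the key does not exist yet.
1..1_000
|> Task.async_stream(fn _ -> :ets.update_counter(counters, :hits, {2, 1}, {:hits, 0}) end)
|> Stream.run()

[{:hits, total}] = :ets.lookup(counters, :hits)
```

No increments are lost here even though no process serializes the writes, so `total` ends up at 1,000.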

Amos: Right, it- performance wise, too. I was reading, I remember reading an article a year, year and a half ago where, um, somebody tested performance of pulling things out of a map versus pulling out of ETS and ETS was faster than the map at, at a certain point.

Chris: Yeah. Yeah. For sure. For sure. Well, given a certain amount of keys, right. Um, with enough keys ETS is, ETS is definitely gonna be faster and it's way faster than trying to share that map, right? Everywhere.

Amos: Yeah.

Chris: Yeah. Or going, you know, through a process somewhere to go get access to that.

Amos: Well, and you don't have to copy that map if you're sending stuff around again, right? Like they can pull out what keys they need without you passing the whole map or pulling out the keys for them and passing them that portion.

Chris: And you can do some other fancy stuff with ETS tables, where you have, there's a whole query language. And it is a little, I will say it's a, the query language you need, you use for ETS is the match spec stuff is a little bit opaque. You have to learn how it works for sure. But overall it's not too bad. And it is just working with Erlang data. That feels really cool because you're just using Erlang data to go get access to the stuff that you want. Um, you're using your normal Elixir bits to go and get things and filter and select the different keys and values you want out of, out of ETS.
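
For readers who have not seen one, here is a minimal match spec written by hand, as a sketch. The table and the sensor-style readings are invented for the example; the spec itself is plain Erlang/Elixir data passed to `:ets.select/2`.

```elixir
# Hypothetical table of {room, temperature} readings.
table = :ets.new(:demo_readings, [:set])
:ets.insert(table, [{:kitchen, 21.5}, {:garage, 9.0}, {:attic, 30.2}])

# A match spec is a list of {head, guards, result} tuples.
# :"$1" and :"$2" bind the key and the value of each row,
# the guard keeps rows whose value is above 20.0,
# and the result returns just the key.
spec = [{{:"$1", :"$2"}, [{:>, :"$2", 20.0}], [:"$1"]}]

# Set tables have no defined traversal order, so sort for a stable result.
warm = table |> :ets.select(spec) |> Enum.sort()
```

Here `warm` is `[:attic, :kitchen]`. The filtering happens inside ETS, so only the matching keys are copied into the calling process.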

Amos: So, do you use the match specs directly, or do you use like the Ex2ms?

Chris: I, I I'm real bare bones with all that stuff. I just use, I just either write the match specs directly, or I figure out what I actually need and then just write them into the function. Like I write the query into the function that I'm where I'm using it.

Amos: Yeah. I find that, I find the match specs, they were, they were semi cryptic at first, when I was first looking at 'em, I'm like, what the heck is that? But they're not, they're not really that hard to read, uh, after you, after you understand what's going on. But I, I have found with working with other people, like I have, I haven't used a lot of ETS. I'm sure that you've, you've done a lot more with ETS than I have, but I did it for like storing sensor data and stuff like that, that I needed kind of readily available, but didn't need it to go to a database. And, um, I needed it across a lot of places, but I found that whenever people wanted to search, the Ex2ms stuff was a lot simpler to, to get my team on board with. So it's like a nice halfway point, I think.

Chris: Yeah. Yeah, for sure. And I mean, all of that stuff takes education, right. And all of the, and the ways in which you can use ETS takes education and it takes like a little bit of learning, you know, there's, there's definitely some, uh, you got to kind of get used to it. Definitely. And it's one of those things too. Once you discover it, people then overuse it in the same way that they've sort of overused GenServers with like, by putting like their domain model or whatever they call it, their domain objects, quote, unquote, into like GenServers. It's like, well, you didn't want to do that either. For sure. So people can overuse ETS tables, uh, and there's definitely costs associated with, let's say spinning up an ETS table. And you know, that's not necessarily a fast operation, uh, relatively speaking. Right. Um, there's memory involved in that there used to be limits on how many tables you could have, but that's been that's, that's not really the case anymore. Uh, and, and, you know, you also have to be careful with like, how many things do you allow to go in the ETS table because the ETS table is not bounded. Like you have to do that, right. Like you have to build, it's not, you know, they don't know how you're going to use these tables. So you do the bounding operations,

Amos: So, you can blow your memory out.

Chris: Yeah, exactly. Yeah. And you can, and there are other, there are other options, like you can put it in there compressed, um, and stuff like that. Uh, so you, you, you do get those benefits. Um, but yeah, I mean, it's a super, super useful tool. Um, it was invaluable at Bleacher Report for a whole just, uh, a very wide range of problems, like a very, very wide range of problems. It was, it was invaluable. Um, a lot of it was caching. Um, we relied really heavily on, you know, we relied really heavily on, uh, caching good requests that were high cardinality. So we could return those, like in case, like one of the services started to go down, uh, and we use it's, it's the thing that is behind. I mean, it's, it's behind all of the caching libraries that are out there it's, it's. And that includes Mentat, uh, which is my caching library, which is the one that we used at Bleacher Report. Um, because it was just dead simple. And like, didn't have as many knobs to turn as a lot of the others cause we didn't need any of those features. And so it was just a, it was a lot more efficient for us just to use Mentat at Bleacher Report.
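
To make the earlier point about bounding concrete: ETS will not cap a table for you, so the code that writes to it has to. The sketch below is one naive way to do that, with the `:compressed` option switched on as well. The module name, the limit of 3, and the "evict whatever key comes first" policy are all assumptions for illustration, not how Mentat or any other library does it.

```elixir
defmodule BoundedSketch do
  # Hypothetical helper; the limit and eviction policy are assumptions.
  @limit 3

  # :compressed stores rows in a more compact form at some CPU cost.
  def new, do: :ets.new(:bounded_sketch, [:set, :public, :compressed])

  # Insert a row, evicting an arbitrary existing key once the table is full.
  # The size check and the insert are separate steps, so this is only safe
  # with a single writer.
  def put(table, key, value) do
    if :ets.info(table, :size) >= @limit and not :ets.member(table, key) do
      :ets.delete(table, :ets.first(table))
    end

    :ets.insert(table, {key, value})
  end
end

table = BoundedSketch.new()
for n <- 1..10, do: BoundedSketch.put(table, n, "value #{n}")
size = :ets.info(table, :size)
```

After ten inserts the table still holds three rows. A real cache would more likely evict by age or TTL than by whichever key `:ets.first/1` returns.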

Amos: At Bleacher Report, you were a, I mean, just from knowing what Bleacher Report does, it seems like you were probably serving a lot more data than you were writing. So caching becomes probably really, really important.

Chris: Yeah. And fail over is important. Cause we got to a point where we were shedding labor, shedding traffic constantly because that's the model like, so, so we ended up moving towards Regulator. Like, I built Regulator, and Regulator is a way to do adaptive concurrency. And one of the tricks to adaptive concurrency is you need to basically be living on the razor's edge all the time. Like you need to be on the edge of acceptable all the time. And thus you are constantly finding the actual concurrency limit all the time and by, and when you go over it, when Regulator allows you to exceed that concurrency limit, you're going to start shedding traffic, which just means you're going to drop it instantly. You're not going to make the downstream call or you're not going to do whatever. And when that happens, you have choices to make. You can choose to return errors. Uh, you can choose to return an unhydrated payload, or you could choose to return some stale data. And so, for our use case and our user experience, returning stale data was totally acceptable and so-

Amos: -Better than nothing.

Chris: Exactly. And so we were always in this position of like, just about dropping traffic all the time, that's kind of where we wanted to live towards the end, towards my, the end of my tenure. I mean, obviously stuff may have changed, but that's kinda what I was tuning it towards, uh, at the front door, was just constantly like trying to get closer and closer to like almost always dropping traffic just a little bit and then serving and finding ways to serve cached values and finding ways to keep those caches more alive and keep those caches more hydrated and, and being better about that. And a lot we used, um, Regulator for that, which is, er, and Mentat, which is basically just ETS. Um, in some cases we just, I wrote ETS tables powered, you know, that got hydrated by GenServers and did the whole normal thing that you do, right. You know, sometimes you just reach for and grab a Gen- or a, an ETS table directly. You don't need a caching library for that.

Amos: So when, when you were at Bleacher Report, I'm curious, did you, um, just spend like most of your time just working on getting a little more performance out constantly or, and rarely new features or?

Chris: Towards the end of my time, uh, that is predominantly what I was doing, what I, what I was personally working on, partially because that was the job that I wanted and carved out for myself while I was there. Partially because, uh, I had, uh, uh, some, uh, some amount of experience doing, having done that and knew, had done a bunch of research into it and sort of knew patterns that I wanted to use. So, yeah. And also partially because I was like just sort of bullish enough to kind of push on it constantly. So, um, that's, that's yeah. So that's mostly what I was doing towards the end was like, not even performance, I would say, but just reliability, resiliency, like making sure that stuff stayed alive a lot longer. And so at the end we had a whole, a whole suite of tools. Um, most of which are open-sourced on my GitHub, that we ended up relying on. So we used a lot of, uh, we used Regulator basically everywhere. We had, uh, we used Mentat in a lot of places, um, which, I mean, all of these things started as like internal ideas. And then either they got open-sourced or I open-sourced some ahead of time. And then we brought, we brought them in to, to the organization, uh, trying to think what others. Oh, we used, um, for all of our, our calls, all of our downstream calls, uh, to other services, we always included deadlines. So at the front door, when, uh, when a web request came in, we assigned it a deadline. So how long it ought to take to fulfill it. And then we propagated that through all the other calls, and we had automatic tooling to be able to do that and propagated across like process boundaries and all that sort of stuff.

Amos: So do you give it like a, like how, how are you doing that? Cause I know you're going across a lot of processes and you know, sometimes I don't know how much you cared. Did, were you relying on a system clock and just saying, “Hey, this is the dead time. Whenever it hits this time period, we throw it out”, or were you like, counting?

Chris: So the way it worked internally is as soon as it came in, uh, you assigned a deadline to it based on the RPC you were trying to make, or the web request, you were, you know, whatever requests you were trying to fulfill. And, uh, the way it works is, what's cool- So, the way deadline propagation works is you get, let's say you get some call and you're like, this should take no more than one second. Uh, in order to fulfill this call at the front door, I need to call the content API. I need to call their, you know, the social API and get back like comments and, and the number of likes this thing has, I need to call the user API to get usernames. I need to call this other thing to see if like any of these comments have been muted or like, you know, uh, like I dunno, filtered in some way, right? I have to do all of these things and some of those can happen concurrently and some of them can happen serially. And some of them will trigger other downstream calls. Uh, not often at Bleacher Report, cause we tried to like keep that, we try to keep the hierarchy really flat in terms of our service calls, um, just via design, but like, sometimes you did, sometimes it ended up being multiple other calls. So you assign that deadline of one second. And then we had, I, I say, I mean, I did a lot of it. Other people also worked on it obviously, but we had a bunch of different ways to support that. One was all of our HTTP client tooling supported retries and deadlines. And, and so what it would do is it sends it- internally what it's doing is it's tracking the timestamp, the System.monotonic_time that you get when the process comes in, when you assign the deadline, you say, okay, well, this is, this is going to take one second. So it takes, it goes, “Okay, well that is this, uh, monotonic time right now, plus 1000 milliseconds or whatever.” And then like holds onto that. And it stores it in the process dictionary.
And then we had our own wrappers around like tasks and GenServer calls that automatically propagated that deadline into the next call. So tasks are really easy. You just like, wrap the task module and then like shove that new deadline into the, into the, um, into the task. Whenever you, whenever you start the task. And then what that would do is once it gets to the HTTP client, it took the deadline and took the delta, like how much time was left in, in finite fixed time and put it in a header. And then it propagated like how much time they had left via the delta to the next service. And so-
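
A rough sketch of the mechanism as described: an absolute deadline computed from `System.monotonic_time/1`, kept in the process dictionary, copied into child tasks by a thin wrapper, and turned back into a "time left" delta for an outgoing header. The module name, the process-dictionary key, and the header name are all hypothetical; this is not the API of the library Chris is talking about.

```elixir
defmodule DeadlineSketch do
  # Hypothetical module; names and header are assumptions for illustration.
  @key :deadline_sketch

  # Store an absolute deadline (monotonic milliseconds) in the process dictionary.
  def set(timeout_ms) do
    Process.put(@key, System.monotonic_time(:millisecond) + timeout_ms)
    :ok
  end

  # Milliseconds left before the deadline, or :infinity when none was set.
  def remaining do
    case Process.get(@key) do
      nil -> :infinity
      deadline -> max(deadline - System.monotonic_time(:millisecond), 0)
    end
  end

  # Wrap Task.async so the child process inherits the caller's deadline.
  def async(fun) do
    deadline = Process.get(@key)

    Task.async(fn ->
      if deadline, do: Process.put(@key, deadline)
      fun.()
    end)
  end

  # The delta an HTTP client wrapper might attach for the next service.
  def header do
    case remaining() do
      :infinity -> []
      ms -> [{"x-deadline-remaining-ms", Integer.to_string(ms)}]
    end
  end
end

# Usage: a one-second deadline set at the "front door" is visible in a task.
DeadlineSketch.set(1_000)
left = DeadlineSketch.async(fn -> DeadlineSketch.remaining() end) |> Task.await()
headers = DeadlineSketch.header()
```

Sending the remaining delta rather than the absolute timestamp matters across services, because monotonic time is only meaningful inside one running VM.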

Amos: -Did any of the services make decisions on that time to like, say, “Hey, just render out of the cache?”

Chris: Uh, yeah. So for instance, they could totally do that. The other big thing that deadline would do is you can say, “Hey, um, terminate my, terminate me if I exceed my deadline.” And so all of the Phoenix processes, like we had a plug that the, one of the first things it did is it said, “Hey, uh, here's the deadline that I got from the header, terminate me if I exceed that.” The idea being, if I'm at the front door and I have a deadline of one second and the downstream call takes two seconds, there is no reason for the downstream service to keep working if the deadline has been exceeded.

Amos: Right.

Chris: If the upstream thing has given up, for instance, the upstream thing, let's say you have a call that you're making to a downstream service and you set the timeout in Hackney to 50 milliseconds. Well, what we would do is we detect that and we take the min of either your actual deadline, how much time you have left, and how much time you've specified, right? So if you've got 30 milliseconds left, based on the deadline that you set, but you specified 50, we would send 30.
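
The arithmetic in that example is small enough to write down. A sketch, with a hypothetical helper name:

```elixir
# Hypothetical helper: the timeout actually sent downstream is the smaller of
# the time left on the deadline and the timeout the caller asked for.
effective_timeout = fn remaining_ms, requested_ms -> min(remaining_ms, requested_ms) end

# 30 ms left on the deadline beats the 50 ms the caller specified.
with_deadline = effective_timeout.(30, 50)

# With no deadline (:infinity), the requested timeout is used, because in
# Erlang term ordering every number sorts before every atom.
without_deadline = effective_timeout.(:infinity, 50)
```

So `with_deadline` is 30 and `without_deadline` is 50.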

Amos: Okay. Just like, like a, like a manager, right?

Chris: Yeah, exactly.

Amos: You, you have six months to do that, Hey, you guys have three months to do that, hey, hey, developer, I need this next week.

Chris: Exactly. And so we would send, we would propagate the deadline, the, either the real one that they had set or the actual minimum that was left. And we would say, “Hey, you have 20, 20 milliseconds to finish this 30 milliseconds to finish this, finish this, 50 milliseconds to finish it.” And if at any point it started to exceed that because like Redis was taking too long or Ecto took too long or just got bottlenecked on something or whatever, we just killed it. Because if Hackney is going to give up, up here at the, at, at, at your front door, why bother? Why should the downstream service bother working, wasting CPU, trying to fulfill a request that it literally can't, it can't fulfill anyway.

Amos: So did your plug just create a task and kick off a task with a timeout or, like how did it-

Chris: No, it, um, the way it worked is we had a pool of GenServers and it picked one at random and then said, monitor me. And then that thing set timers, and then would, would, would like kill processes for you.

Amos: But just off the process ID. Yeah. Yeah.

Chris: It would just say monitor it and then do it.

Amos: Well, that's, that's smart. That's way better than what I was thinking.

Chris: So we had that set, set up. I mean, we just had a bunch of tooling like that, right? We had a bunch of stuff that helped keep all the services kind of alive. And, um, it was, it was, it was really useful and it was all tied into our, like, it was all tied into our monitoring tooling, which we had built out. So you just automatically got trace propagations and you could see what the deadlines were when you called like that thing and why it got ended. And we had specific status codes that we used for when a deadline was exceeded and that sort of stuff. Like we, we just spent a lot of time building all that out.

Amos: It's nice though. Like, really.

Chris: Yeah, once you have it all, it's really nice. Like, it's just like, oh, this is, this is great. You know, like I have all this stuff now. And I think a really important observation was we spent some time to do it, but it wasn't like, it wasn't a huge, massive time expenditure. We just took the time to do it, cause it was important to us. And then, because it was, because we had not relegated it to some sort of ops YAML configuration, we actually were able to build much more useful, like, solutions. Like there's a lot of people who really are really into like the Istio thing, right. And one of my main problems with like Istio is not Istio necessarily, it's that- And not the fact that you're going through yet another fake network and not the fact that it violates the end to end principle and not the fact, you know, like not the 20 extra milliseconds of time that you'll spend waiting on your Istio to do work.

Amos: And 20 milliseconds is an eternity when you're trying to serve a request lately.

Chris: That's what I'm seeing currently. I don't know if that's universal, but that's what I'm seeing at the moment. Uh, but anyway, it's like, you could delegate Istio to be your circuit breaker, which I mean, yeah, good luck with that. Istio circuit breakers are terrible. But like you could, you could do that, right. But you're going to be building like by, by removing- it's the end to end principle thing. By removing your, by moving that into middleware and to something in the network, right, and not the end, you're providing a, like, markedly worse experience for your user. And if you take control of this stuff, guess what? You can now build a much, much, much better experience for your end user because you control it all. And you can be the arbiter of like what you're going to return, how you're going to cache values, how you're going to like service stuff, how you're going to propagate changes across the cluster, how you're going to fail over to certain, certain things like you, the programmer, get to be in charge of that. And you are much closer to the end user than your Istio network layer that is in between all your services and like some ops person who's trying to generically modify all this stuff. It's like, that's not, that's not tenable. If what you want to do is deliver the best user experience that you can.

Amos: Yeah. Yeah. Um, I wanted to piggyback on that because I can't agree with you more, but I don't, I don't think there's anything to add to it. Like, just, like, building the software that your end users need takes a lot more control. This is, this is a part of the reason that I think like all the serverless stuff is just like BS, but maybe I don't know enough about it. Or maybe I'm, I'm exactly on board. But a lot of that stuff feels like exactly like Istio, where you're, you're trying to give all that control off to somebody else. But if you want to build a real user experience, you can't.

Chris: Well, and Istio is what you do when you have literally unbounded complexity. You know what I mean? Istio is the thing that you have to start to reach for when you have so many services calling each other with so many junior people that you can't manage them all, and they can't be bothered to build all this stuff, to support all the things that you need. And like that is, that's what you end up with, right? Istio is like the last bastion. It's the, like the last thing that you, that you should ever reach for, if what you want to do is, is build like a system that, I mean, I don't know, it's, it's, you're solving such, you're solving a symptom of a much larger problem at that point. And I'm not saying it's not useful for somebody, but it's useful for somebody in the situation where there is no other alternative. Like it's, it's not where it's not tenable to build a bunch of services that are actually reliable and actually resilient. Like, because it's just too much effort and not worth it, or it's this, or you're in a situation where no one has bothered to do that. And now has built such a mess, that the only way out of it is just to like, get the network involved.

Amos: No, I, I, I, well, and you get that time, talent, people problem, right? So, um, that's where I think I, I see a lot of developers who would love to solve those really hard problems, but the business maybe doesn't have the time for the developers to sit back and figure it out. So I can see that it gets, it's easy to get into a place like that. Although, you know, now, today you don't, you don't have to do any performance work, you just install OTP 24 and the JIT, and you're fine, right? I mean, I ha- I have,

Chris: If you can manage to get it installed. I haven't managed to get it installed yet.

Amos: I haven't managed to get it installed yet, but I have, I have a quote of the day. You ready for this?

Chris: Yes.

Amos: Friend of the show, BR engineer, Jason Stewart.

Chris: Okay.

Amos: Jason, we love you, but I'm going to quote you, even though you told me not to. Uh, the JIT in Elixir is what he was talking about. And he said "Faster than C." Just saying. That's what he said.

Chris: Ok, well there you go.

Amos: So he wrote a ray tracer in Elixir. It's pretty awesome. You should check it out if you haven't seen it. Um, he, he, uh, also wrote a ray tracer in C. And originally the ray tracer, it's one of the tests that he had in, um, in Elixir, took- Here. I'm gonna- I don't want to lie about it. So I'm going to do this cool radio thing where I go dig while we're on air and look for this. So he took an eight-core machine. In OTP 23, it took 212 seconds to render; in OTP 24, it took 144 seconds to render. And so it's eight cores. And the C that he had is only going to, it only uses a single core, and he could go write it threaded, but that's a lot of work in C, so it's running faster than the C.

Amos: Cool. So, so "Now faster than C" is the quote.

Chris: There you go. That's the takeaway. That's the takeaway.

Amos: That's the takeaway (laughing). Yeah. I have not been able to get it to compile or-

Chris: I haven't been able to get it working yet, I haven't really tried that hard, but I haven't gotten that working yet.

Amos: I tried to compile two different times and then went online and started looking and found a giant thing about compiling it on OSX.

Chris: It's all the cryptos.

Amos: Yeah.

Chris: It’s all the cryptos. The crypto stuff. It's different now. And, uh, yeah, I just haven't. I tried it. I ran asdf install erlang 24.0, and that didn't work. And then I walked away and I'm like, “Well, I’ll look at this later.”

Amos: I blew away all my OpenSSL and re-installed all of it. And then tried.

Chris: And it didn't work?

Amos: No. So-

Chris: Aw, whatever. So I’ll just do it later. In the meantime-

Amos: -Thought I'd try.

Chris: The takeaway here, the real takeaway here is ETS is really cool. And you can do a lot of stuff with it. Caching is a really obvious one. If you want to see an example of, like, kind of how it's used, you can go look at Mentat, which is that caching library that I worked on, but I would also say, like, we do the same thing. Regulator makes very heavy use of ETS tables, much more heavy use of ETS tables, actually. Like there are many, many ETS tables inside of every regulator. Every regulator I think you get, um, I mean, well, I, there, the number is indeterminate, but you're going to get at least as many schedulers as you have plus one, number of ETS tables per regulator.
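
As a rough illustration of the caching use case Chris mentions, here is a minimal sketch of an ETS-backed cache with a time-to-live. The module name `CacheSketch`, the table name, and the function names are all hypothetical; this is not Mentat's API, only the general shape of the idea.

```elixir
defmodule CacheSketch do
  # Hypothetical module for illustration only; not Mentat's API.
  @table :cache_sketch

  # Create a named, public table. The calling process becomes the owner.
  def start do
    :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
    :ok
  end

  # Store a value together with an absolute expiry time in milliseconds.
  def put(key, value, ttl_ms) do
    expires_at = System.monotonic_time(:millisecond) + ttl_ms
    :ets.insert(@table, {key, value, expires_at})
    :ok
  end

  # Return {:ok, value} for an entry that has not expired, :miss otherwise.
  def get(key) do
    now = System.monotonic_time(:millisecond)

    case :ets.lookup(@table, key) do
      [{^key, value, expires_at}] when expires_at > now -> {:ok, value}
      _ -> :miss
    end
  end
end
```

Expired entries are only ignored on read here; a real cache would also need something to delete them.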

Amos: So do you, do you think that Regulator's, uh, a good project to go look at, if you want to see how ETS tables are being used?

Chris: Sure. Yeah. You can totally go look at that. And actually there's some optimizations that can be made in there with Atomics these days. I didn't add it at first because we were running, um, an old, an old enough version of Erlang on some services that they didn't have Atomics, but, uh, yeah, adding Atomics in, in it would also be good. But yes, it's, I would say there's, uh, it's, um, it's pretty useful, if you want to see, like, a very different way to use, uh, ETS tables. In this case, we actually use them in a write-optimized way. And we, we, the reason you get so many ETS tables is because the goal of Regulator is to add as little overhead to the calling process as possible. And so what we ended up doing is each process keeps track of its own latency, like how long it takes to, to issue its calls and stuff like that. And then it writes those values into an ETS table. And when it writes the value into the ETS table, it actually looks up what scheduler it's on and then picks the ETS table that corresponds to that scheduler based on a name and writes it into a specific ETS table. And doing it that way means that there is even less write contention on any individual table and in aggregate. So there's only, there should, in theory, this doesn't work out in practice, but in theory, there should only be one person writing to an ETS table at any given time. It should really be one process writing to an ETS table at any given time, which, which massively speeds up writes, over- And then the way we gather up the data is we just do MapReduce.

Amos: Oh, nice!

Chris: We take all the tables and we MapReduce over them. So that's another use case of ETS tables where, like, you're taking a ton of, uh, you're doing all this complex stuff behind the scenes, all in the service of, like, I need to send, I need to collect, uh, I have 10,000 requests a second, and I need to collect, like, the durations of every single one of those requests. I need them all to write all of their values somewhere so that I can process them and determine what the average is. Like, that's the total of what you're trying to do in Regulator, basically. And, uh, yeah, like that would not be possible to do calling a single GenServer.

Amos: Or even a single ETS table, right? You would add-

Chris: Which is so much overhead to it, right. And if you're going to go, if you're going to go to the trouble of trying to make this really efficient, you might as well make it really efficient.

Amos: Is writing to an ETS table going through a GenServer?

Chris: No.

Amos: Is it straight C or-

Chris: -It’s, yeah. I mean, all those are built-in functions, right? And so they're, they're going, uh, it's doing its own locking and stuff though, to try to ensure that, uh, writes to the ETS table, uh, are ordered based on some guarantees.

Amos: Okay.

Chris: And so it has, it does have to do some amount of locking to do that. It's just that it's way faster than, you know- and especially too, if you're reading from it at the same time, right. It's, it's trying to give you an, uh, ordered view of the world. Like, it's trying to provide those sorts of guarantees, um, at least within some, some loose-

Amos: -Some bounds. Yeah.

Chris: Yeah. And so it's trying to do that. And so if you can remove contention from the tables, then that is very, very useful for speeding up for, for optimizing all of your other processes.

Amos: Very cool. So I want to get this picture right. So you have, you have one table that keeps track of which scheduler something is on, and then that tells it what other table to write to-

Chris: -Uh, no, you, we have a table that stores the actual concurrency limits.

Amos: Okay.

Chris: And, and then we have, uh, so, and that's read optimized.

Amos: Okay.

Chris: Um, and then we have a bunch of what we call the, a buffer, which is all of the statistics that it's collecting. And that is, uh, it's, it's N number of ETS tables. Where N is the number of schedulers.

Amos: Okay.

Chris: In the, in, in Erlang. And so we write all of the statistics into those and we just name them, like, you know, uh, the atom is, like, your regulator name dash scheduler ID.

Amos: Okay.

Chris: And then, so what happens is, it's just a name, it's just a, it's just a unique name that we give to the regulator.

Amos: Right.

Chris: And then, uh, when you want to write into it, you say, you know, whatever the call is, it's like Erlang dot system info, scheduler ID or something like that. Like, I don't remember the, the name of the function, but you get the scheduler ID. And then that becomes the table, that part of the table name that you are going to then write into.

Amos: So when you, when you read from those though you're reading across, all of them because like something else that comes along and tries to read might be on a different scheduler at that point. So they can't just read from the scheduler ID.

Chris: Right. Right. Exactly.

Amos: So they read across all of them. And that's where your MapReduce- okay.

Chris: Yeah, it reads across all of them all at once. Uh, well, it can do it, but it can do it in a MapReduce fashion, so it can read them all in parallel and then gather them all up to do the reduce step.
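
The per-scheduler buffer idea described above can be sketched roughly as follows. This is not Regulator's actual code: `BufferSketch`, the `name-id` table naming, and the `:latency` tuple layout are assumptions made for illustration. The only real calls relied on are `:erlang.system_info(:scheduler_id)`, `System.schedulers/0`, the `:ets` functions, and `Task.async_stream/2`.

```elixir
defmodule BufferSketch do
  # Hypothetical names throughout; this is not Regulator's actual code.

  # Create one public, write-optimized table per scheduler.
  def new(name) do
    for id <- 1..System.schedulers() do
      opts = [:named_table, :public, :duplicate_bag, write_concurrency: true]
      :ets.new(table_name(name, id), opts)
    end

    :ok
  end

  # Record a latency sample in the table that belongs to the scheduler
  # the calling process is currently running on.
  def record(name, latency_ms) do
    id = :erlang.system_info(:scheduler_id)
    :ets.insert(table_name(name, id), {:latency, latency_ms})
    :ok
  end

  # Map: read every table in parallel and compute a partial {sum, count}.
  # Reduce: combine the partial results into one average.
  def average(name) do
    {sum, count} =
      1..System.schedulers()
      |> Task.async_stream(fn id ->
        samples = :ets.tab2list(table_name(name, id))
        {Enum.reduce(samples, 0, fn {_, v}, acc -> acc + v end), length(samples)}
      end)
      |> Enum.reduce({0, 0}, fn {:ok, {s, c}}, {sum, count} -> {sum + s, count + c} end)

    if count == 0, do: nil, else: sum / count
  end

  # Build the table name from the buffer name and the scheduler ID.
  defp table_name(name, id), do: :"#{name}-#{id}"
end
```

Because a reader may be running on a different scheduler than any given writer, `average/1` has to visit every table rather than only the one matching its own scheduler ID.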

Amos: Okay. Yeah, cool.

Chris: So, yeah, so that's how it ends up working out and, uh, and it's pretty speedy. It's pretty nice. So it, it has worked fairly well, but like I say, there's so many different use cases for ETS. Um, I think it really is kind of a superpower once you figure it out and, and learn kind of what it can do and use it appropriately, right? Like you can definitely overuse it. You know, like we've talked about this before. Data is key to your application, you know, don't overuse it. And, and also, like, don't use it to the degree that you're going to re-introduce all the gross, like, mutable global state bugs that, like, we walked away from by doing Elixir in the first place.

Amos: Right.

Chris: Like, don't go back to that life. Most of the time you can probably get away with using protected tables, which are, like, you have a GenServer somewhere that starts the table; that GenServer can write into the table, but then anybody can read from the table. And that's like, that's like a really good stock pattern and you should do that. So, like, you don't call the GenServer to read, just read directly from the table in the calling process.
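
A minimal sketch of that stock pattern, assuming a hypothetical `LimitsSketch` module: the GenServer owns a `:protected` table and is the only writer, while `get/1` reads straight from ETS in the caller's process.

```elixir
defmodule LimitsSketch do
  # Hypothetical module illustrating the protected-table pattern.
  use GenServer

  @table :limits_sketch

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  # Writes go through the owning GenServer.
  def put(key, value), do: GenServer.call(__MODULE__, {:put, key, value})

  # Reads happen directly in the calling process, with no message passing.
  def get(key) do
    case :ets.lookup(@table, key) do
      [{^key, value}] -> {:ok, value}
      [] -> :error
    end
  end

  @impl true
  def init(_opts) do
    # :protected (the ETS default) means only the owner may write,
    # but any process may read.
    table = :ets.new(@table, [:named_table, :set, :protected, read_concurrency: true])
    {:ok, table}
  end

  @impl true
  def handle_call({:put, key, value}, _from, table) do
    :ets.insert(table, {key, value})
    {:reply, :ok, table}
  end
end
```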

Amos: But if that GenServer that started that table dies, that table is gone. And that's important.

Chris: And then, and then you have to understand that if that GenServer dies, that table's gone unless you do the whole song and dance of, like, inheriting the table, like marking a different process as the actual owner. And then there's ways to keep those tables alive. For sure.

Amos: I've seen a lot of places where they'll put the table on the supervisor itself.

Chris: That's generally what I end up doing. Um, that's not really the best way to do it if we're being honest, but that is often what I do. That's what I do probably, like, 90-something percent of the time if multiple people need to be able to write to it and read from it, and I want that data persisted after a crash of a GenServer somewhere; that data needs to live longer than the GenServer could possibly live. If the life cycles of those are different, then I will almost always just put it on the scheduler -or not on a scheduler- on the supervisor, and mark the table as public, and then write to it from the GenServer and read from it from everywhere else and just not worry about it. That's typically what I will do. Um, because the, the, the reason being when you start an ETS table, whoever starts, whatever process starts it, becomes its quote unquote owner. And if that owner dies, the ETS table is garbage collected.
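
One way to sketch the supervisor-owned table, with hypothetical module names (`StatsSketch.Supervisor`, `StatsSketch.Writer`): the table is created in the supervisor's `init/1`, which runs inside the supervisor process, so the supervisor becomes the owner and the table outlives a crash of the child that writes to it.

```elixir
defmodule StatsSketch.Writer do
  # Hypothetical child process that writes into a table it does not own.
  use GenServer

  def start_link(table), do: GenServer.start_link(__MODULE__, table, name: __MODULE__)
  def put(key, value), do: GenServer.call(__MODULE__, {:put, key, value})

  @impl true
  def init(table), do: {:ok, table}

  @impl true
  def handle_call({:put, key, value}, _from, table) do
    :ets.insert(table, {key, value})
    {:reply, :ok, table}
  end
end

defmodule StatsSketch.Supervisor do
  # Hypothetical supervisor that owns the table.
  use Supervisor

  @table :stats_sketch

  def start_link(opts \\ []), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # init/1 runs in the supervisor process, so the supervisor owns the table.
    # :public lets the child write even though it is not the owner.
    :ets.new(@table, [:named_table, :set, :public])
    children = [{StatsSketch.Writer, @table}]
    Supervisor.init(children, strategy: :one_for_one)
  end
end
```

If `StatsSketch.Writer` crashes and is restarted, the data is still in `:stats_sketch`; it is only lost when the supervisor itself goes down.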

Amos: Right.

Chris: So if you, if you start an ETS table in a GenServer and the GenServer crashes, for whatever reason, all the data in the ETS table is now gone. Sometimes that's on purpose. Sometimes you want that because you're like, I, you know, you're, you're storing stuff that is attached to that GenServer's lifecycle, but sometimes you want it to persist beyond crashes. And so you can either do, uh, ETS tables, have a notion of being able to have a successor, right? You can mark a different process as, okay, well, ought to be the owner now. And if I die, it goes back to this one and you can kind of, you can fiddle with that. And that's really the best way to do it if you want, if you want to maximize, uh, both the safety of the table and also meaning like who can write to it and who can read from it, if you want that, that allows you to still use protected tables.

Amos: So then it, it swaps owners automagically, if you set that up.

Chris: Yeah. Yeah. Basically, if the, if the table dies, it jumps back over to this one, and then you've got to bring up the new one. It's got to ask for the table, it's got to get the table again and blah, blah, blah. It's got to do a whole thing.
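
The "song and dance" can be sketched with the `:heir` table option and `:ets.give_away/3`, both of which are part of the `:ets` module. Everything else here (`HeirSketch`, the table name, the `:inherited` and `:handed_back` terms, the `:stop` message) is hypothetical and only meant to show the flow, not a production design.

```elixir
defmodule HeirSketch do
  # Hypothetical module illustrating ETS heirs; names are for illustration only.
  @table :heir_sketch

  # Spawn an owner that creates the table and names `heir` as its successor.
  # When the owner terminates, the heir receives
  # {:"ETS-TRANSFER", table, owner_pid, :inherited} and becomes the owner.
  def start_owner(heir) do
    spawn(fn ->
      :ets.new(@table, [:named_table, :set, :protected, {:heir, heir, :inherited}])
      :ets.insert(@table, {:key, :value})
      wait()
    end)
  end

  # Spawn a replacement owner that waits to be given the existing table.
  def start_replacement do
    spawn(fn ->
      receive do
        {:"ETS-TRANSFER", _table, _from, :handed_back} -> wait()
      end
    end)
  end

  # Called by the current owner (the heir, after a crash) to pass the table on.
  def hand_over(new_owner), do: :ets.give_away(@table, new_owner, :handed_back)

  # Keep the process alive until it is told to stop.
  defp wait do
    receive do
      :stop -> :ok
    end
  end
end
```

Because the table stays `:protected` throughout, only whichever process currently owns it can write, which is the safety benefit Chris describes over making the table public.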

Amos: Right.

Chris: Um, it is just a lot, a lot easier to start the tables inside the supervisor. And obviously at that point, if the supervisor dies, then obviously then-

Amos: Yeah, you're in the same place.

Chris: Most of the time, if it's, if the supervisor dies, if you construct your, if you construct your supervision tree well, if the supervisor, if the supervisor dies, I mean, you're like, that probably indicates that you're, you're cool with resetting all that data.

Amos: Right. Cause it could be that data that's causing you your issue.

Chris: Right.

Amos: Maybe it, maybe you have changed something in it in a bad way. So that's cool. I've got to get out of here and get back to some other stuff.

Chris: Yeah, I do too.

Amos: And I know that you were deep into programming. I hope I got you, uh, onto something else for a little bit, allowed your brain to do some, uh-

Chris: I'm just thinking about ETS tables now.

Amos: -Diffuse mode thinking. Yeah. There you go. All right, sir. Well-

Chris: Cool.

Amos: Maybe enjoy some lunch, go and walk, and uh, I'll see you next time.

Chris: Later.
