Scaling with LND
Speakers: Alex Bosworth
Date: November 5, 2021
Transcript By: sahil-tgs via review.btctranscripts.com
Category: Conference
Media: https://www.youtube.com/watch?v=W-Ev_MZAdgA
Intro
I wanted to talk about scaling Lightning using LND. I started working at Lightning Labs three years ago and we’ve come a really long way. I started working with LND before that when I worked at BitGo. That was the barest, almost didn’t work, it was crashing every day type of scenario. Now, three years later, we’re trying to scale this to millions of people. We’re trying to scale this to millions of payments. I wanted to cover how do we do that? How do we do that with LND software? A lot of people throw out scaling as how do we scale LND? A lot of the answer to that depends on how do you even define what scaling is? I want to scale maybe the amount that I put on my node. Sometimes people come up to ask me, I have my Raspberry Pi and I put $1,000 on there, but now I’m comfortable with it. I want to go up to $10,000 or $100,000, but how do I scale that volume? I’m afraid something’s going to happen to that Raspberry Pi. Then there’s availability. The same thing with the Raspberry Pi nodes. Maybe my internet cuts out, I have a storm or something. How do I scale to make sure that I have super high uptime? Then there’s latency. How quickly can I make a payment across the network? Then there’s decentralization. There are all sorts of different aspects to scale, is what I’m saying.
Routing Nodes
My philosophy on how to scale Lightning might not be the same as everybody else’s, but the perspective that LND has and that I think is a good perspective is that you don’t necessarily need to be a routing node yourself. That’s always been our design idea, is that a routing node is a specialist operation that we help people at all levels partake in, but it requires somebody who actually wants to do something like that. That means that it wouldn’t be the mobile wallet user, it wouldn’t be the merchant, and it wouldn’t be the exchange. It would be somebody, although it could be all those people, we don’t want to make it so that that’s a necessity, that’s a requirement for them. We want to have specialization. We want to say that a routing node is somebody who has one job. Their job is kind of like the ISP. Their job is to connect you with other people, and if they don’t do that, they’re not going to get paid. But despite this, a lot of people who are running on the Lightning network and scaling up, they’re basically trying to run routing nodes and not doing a good job of it. That applies also even if you’re not a major player in the Lightning network and you want to figure out who to connect to, it’s not necessarily a great idea to connect to somebody who is a well-known merchant or a well-known exchange, because in all likelihood, they’re spending their time managing their exchange or managing their store. They’re not spending their time managing their node specifically. So, there’s a huge variance and there’s plenty of big nodes that do a terrible job of being a routing node. And the interesting thing about routing nodes is that they need other routing nodes. So, what we want to have is we want to have like this big, connected network where I can connect to a routing node, and that’s going to multiply the utility of my funds. Because if I just had my funds in a peer-to-peer channel with the person that I’m paying, the utility of those funds is just very limited. 
It’s like a gift card at one shop that you can only pay for at that shop. What we want is for the funds that you have on Lightning to be as good as cash. You can use them in any store. And for that to happen, you need to connect to somebody who’s connected to other people who are connected to other people in all one giant mass of people who are all connected to each other. You need a routing node in order to connect your payments to other people, and routing nodes themselves need connectivity outside of that. So that’s kind of like a big concept of what a routing node is. Okay, like I said, I’m telling people don’t do routing nodes.
How to run a routing node
You’re a merchant. Find routing nodes that can be useful to you. Or you’re an exchange. Find routing nodes. Or you’re just buying stuff. Find other people to do routing nodes. People don’t like to listen to that. They like to run their own routing nodes. So, I think that’s cool, if you want to try it. So, I think the way to think about how to run the routing node is kind of specific to itself. It’s a whole game. It’s a whole strategy system into itself. One thing that I think a lot of people make as a mistake is that they create the routing node, and they look at other nodes that have a million, billion connections, and they’re like those are the best nodes. Or they think about their own node, and they’re like, okay, I have one Bitcoin, and the minimum channel size is 20,000 Satoshis, so I’ll just make as many channels as I possibly can to everybody, and that way I’ll have the best chance of connecting other people together. And then the graph algorithm will think that I’m amazing because I’m connecting a bajillion people. And I think that’s a bad strategy, because the main limitation in Lightning is capital. You can’t move capital through the channels unless you’ve committed it. So, these 20,000 Satoshi channels are kind of worthless. I kind of think of a good number as, there’s this theory of the Dunbar number, of how many people can you even keep in your head of who you even know, who you’re interacting with. And I think that also applies to the six degrees of separation concept of Lightning. So Lightning, the idea of it is, if you just have the people that you know, and you want to connect with some random person anywhere on the earth, you can ask one of the people that you know, do they know a person that they know that knows a person that they know? And because we have exponential growth, you’ll be able to connect with them even if people just have normal numbers of people that they know. So, I think the same thing applies to Lightning. 
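The six-degrees argument above can be checked with a little arithmetic. The following is a minimal sketch, not anything from LND: the function names are made up for illustration, and it assumes an idealized network in which every node has the same number of peers and no two nodes share a peer, so the result is an upper bound on reach rather than a measurement.

```python
def max_reachable(peers_per_node: int, hops: int) -> int:
    """Upper bound on nodes reachable within `hops` hops.

    Idealized assumption: every node has `peers_per_node` peers and
    no contacts overlap, so each extra hop multiplies the frontier.
    """
    if peers_per_node < 1:
        raise ValueError("peers_per_node must be at least 1")
    return sum(peers_per_node ** h for h in range(1, hops + 1))


def hops_needed(peers_per_node: int, network_size: int) -> int:
    """Smallest hop count whose idealized reach covers `network_size` nodes."""
    if peers_per_node < 1:
        raise ValueError("peers_per_node must be at least 1")
    hops = 0
    reached = 0
    while reached < network_size:
        hops += 1
        reached += peers_per_node ** hops
    return hops
```

Under these assumptions, ten peers per node is already enough to cover a million nodes within six hops, which is why a modest, well-chosen set of channels can go a long way.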
No matter how big the network grows, if you just have a normal set of people that you’re kind of connected to, you’ll be able to pay to anybody in the world. The other thing I’d say about the routing node of how to run it is that you need to be continuously improving. I think, I kind of think about it like you’re an animal. An animal, if you just dropped it in the forest or something, it’s going to have reaction to stimuli. That’s how you know it’s alive. And sometimes people like put up a node and then no matter what happens, it doesn’t matter. They’ve forgotten about managing it. They’re not looking at it. And what happens is it basically just dies. It’s not earning any routing fees because it’s not reacting to its environment. But what you should be doing is you should be saying money’s flowing over here, so I’m going to move capital over here. You should be saying nothing is happening over here. I’m going to remove capital from here. If you’re doing rebalancing, you should be saying, okay, rebalancing needs to happen over here because when I do it, something good happens. Or if you’re doing fee changes, you could be saying like if I increase the fee, then my traffic goes away and that’s bad because I want traffic to happen. So that’s kind of like the job in a nutshell of what running, managing the routing node means. The other thing about running a routing node is that there’s a lot of different access like patterns of how traffic is flowing. And it’s important to know that. And that’s also why I get back to the Dunbar number. You can only really keep a certain number, especially if you’re doing like manual management of your routing node. You’re only going to be able to keep a certain number of nodes like identities in your mind at one time. So, one thing I do is I do like categorization of my nodes. I’m like this node is this type of node and this type of node is another type of node. And then I can kind of know what type of actions to deal with them. 
Another thing that I notice is that even people who are experts on routing and are telling people like here’s how routing should work or here’s some algorithm, they’re not paying attention to one of the most basic ideas of routing, which is that in order to forward a payment, you need to have local balance on your channel. And if you don’t, what will happen is the person will go to try to pay. And they won’t know that you don’t have local balance. And you will return to them temporary channel failure. And you’ll say, I don’t have balance right now. So, if you tried to go and make payments arbitrarily in the network right now, you would go to all sorts of different nodes. And the nodes wouldn’t have local balance on their channels. And they would all start to return errors that say temporary channel failure. And that will be very exacerbated if you try to send a big payment because people aren’t keeping big balances on their channels corresponding to where the money wants to go. So that’s part of the job of running the routing node. The other thing I think people don’t really recognize is that running a routing node actually consumes disk space. What I notice is that people who run it, they just ignore that they have this disk that’s filling up. They ignore that this database is getting bigger and bigger and bigger. And that’s something that you also have as a job, especially in the present day with the tooling we have right now. And especially if you install automation. Every time you actually make a payment, even if the payment fails, you are accumulating state in your database. In LND, there’s a lot we can do to even reduce that state. But at the core, every time you make a new channel state with your peer, you’re adding more data that you have to keep around. So, if you automate something, you might be tempted to say, I’ll just make a script. And every 30 seconds, I’m going to try to make a payment. But actually, that does have a cost to you. 
And you might not even realize it, but you’ll wake up one day and you’ll say, well, my database is full because I tried this every 30-second thing, and it wasn’t doing anything. I just filled up my whole channel state with garbage. I think what we have today, if we’re here three or four years in, I think three or four years in the future, it will be totally different. And one thing I think it will be more like is there will be more virtualization of what a node is. Because a node, it’s kind of two things. It’s kind of like a database. And also, like a set of keys. And you could run that on many different servers. They could all kind of work together with each other. But currently, LND isn’t set up for that. But it’s something that we can expand on in the future. And you could say, actually, my node is lots of different servers. It’s not a one-to-one mapping.
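The local balance rule described earlier in this section can be illustrated with a toy model. This is a minimal sketch under simplifying assumptions: the `Channel` class and `try_forward` function are hypothetical names, not LND code, and fees, channel reserves and in-flight HTLC limits are all ignored.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    peer: str
    local_balance_sat: int   # funds on our side, usable for outgoing forwards
    remote_balance_sat: int  # funds on the peer's side


def try_forward(outgoing: Channel, amount_sat: int) -> str:
    """Forward `amount_sat` over `outgoing` if local balance allows it.

    The sender cannot see our balance, so the only signal they get
    when it is too low is a temporary channel failure.
    """
    if amount_sat > outgoing.local_balance_sat:
        return "temporary_channel_failure"
    # Forwarding moves funds from our side of the channel to the peer's side.
    outgoing.local_balance_sat -= amount_sat
    outgoing.remote_balance_sat += amount_sat
    return "forwarded"
```

Each successful forward drains the outgoing side, so a channel that keeps carrying traffic in one direction eventually starts returning failures unless capital is moved back to it.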
Scaling with invoices
So, that was a lot of info on routing nodes, because I like to run routing nodes. I think it’s interesting. But there’s also people who come to me and they’re like, we want to receive tons and tons of payments on Lightning. And we’ve done tests where we go through the whole flow of creating invoices. And creating an invoice is actually kind of pegging the database, because every invoice is creating this new record in the database. It has to lock it. When we have a flood of customers who come in and they want to create a million invoices, then we have a queue, we have a wait time, we have errors. And this is especially apparent when your routing node is the same as your invoicing node. So, my basic recommendation for people who are trying to scale the easy way is horizontally scale with multiple private nodes. You don’t really have to implement anything super special. You’re running the same thing that everybody else is running. And if you need to make more invoices, just make more little private nodes that have connections to routing nodes. And it’s very simple. And you can cycle them out. Because invoices aren’t long lived. So, you can say, okay, I’m gonna make invoices on this node. And there’s a bunch of new customers. They’ve come in and they’re flooding me with new invoice requests. So, I’ll just spin up a bunch of my private nodes and have them all take jobs off of that invoice queue. And then if the invoice burst goes away, I can spin down those nodes. That’s something you can’t do if you run the routing node. If you run a routing node, you can’t just say, oh, I’m gonna create a new routing node out of thin air. So, it’s just easier and more self-contained if you have these private nodes. The other thing I think, as far as value scaling, if you’re a merchant and you’re generating tons and tons of inbound revenue, you’re creating a problem for yourself that you might not even realize. 
Which is when you receive a payment, the payer is using up the easiest, cheapest liquidity that they can get to you. And so, if you have payer one, they’re gonna use the easiest, cheapest path to get to you. And then payer two, they maybe don’t have access to that path. So, they’ll try the first path, and they’ll get the temporary channel failure maybe. And then they’ll have to go to path two. And they’ll say, okay, I found it after path one. Didn’t have to wait too long. Path two. But if you have millions of customers and they’re all coming in, payer one million is now having to go to path one, two, three, four, all through a million to figure out how to get to you. And it’s becoming progressively harder to pay you. Progressively more expensive to pay you. And you might not even realize it because you don’t see the payer having these struggles. All you see is the invoice is not getting paid. Why is it not getting paid? And a lot of times people are scaling and they’re hitting this problem and they’re not even seeing it. So, the solution is that if you are very heavily inbound, you need to be pushing the funds out back to the network. So, number one way is pay your suppliers with Lightning. So, pay your employees with Lightning. Pay anybody you can with Lightning. Because every time, if you’re receiving a lot, every time you make a payment, you’re actually making room for a future payment back to you. But we also, at Lightning Labs, we have a service called Lightning Loop where you can pay to us off the chain and then we will pay you back on chain in a noncustodial swap. So that’s a pretty popular service. And it also has the advantage that if you’re a merchant and you’re receiving millions of dollars and you aren’t making payments, you don’t need to keep those on your hot wallet. 
You can remove them: you can loop them out, push them off to cold storage, and then you don’t have to worry about, like, okay, I’m securing too much money or what if something gets hacked. You can kind of keep your security under control by maybe every day or every week pushing out the funds that you’ve received back to the chain, back to an exchange, that kind of thing. So, if I were to get more complicated about how to scale LND invoicing, I would actually go back to what an invoice even means. And what an invoice is, is two things. One is it’s a secret. It’s a pre-image of a Lightning payment. And the other thing is it’s database metadata about what the payment is for, like how long until it expires, or what is it associated with. So usually, the scaling problem when it comes to creating a lot of invoices is the metadata actually because you have to be inserting all these rows. The secret generation part is very simple because all you’re doing is just creating a cryptographic random 32-byte number. So, if you wanted to kind of end-game scale it, one way would be to separate these two processes out, and actually you wouldn’t even use LND to create your invoice. Instead, what you’d do is you’d use LND’s API called the HTLC interceptor which allows you to interact with the forwards that your node is getting. And so, what you would do is you would say maybe I have a bunch of virtual channels that I just made up channel IDs for and I pretend that they’re private channels that actually exist, but they don’t exist. And instead, what they are, they’re just representing like a shared secret that I have between all my different nodes. And whenever any forward that I see comes in and it matches the hash of a pre-image that I’ve already created, and I created bajillions of them in memory somewhere, then I can fast settle it. I can pretend like the end node received it even though it didn’t. 
And because I’ve created like these virtual channels, I didn’t even need to make channels and I didn’t even need to receive it on the node that it came in. I can say like I have an array of five different nodes, whichever node that the payer pays into, I have the pre-image and I can take it right there. And the database, you can use any type of database you want because all you’re doing is kind of checking the metadata like is the expiry up, that kind of thing. So, with that kind of architecture, you’d be able to accept payments at the rate which would be limited by the forwarding. So as fast as you could forward you would be able to receive payments. And that would be the natural limit also even without your control because other people are having to forward to you so they would also be the bottleneck on you, like the greater network would be.
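The split between the secret and the metadata can be sketched in a few lines. This is a minimal illustration of the data model only, under stated assumptions: `PreimageStore`, `InvoiceMeta` and their fields are hypothetical names, the store is a plain in-memory dictionary standing in for "any type of database", and nothing here calls LND or its HTLC interceptor. The one protocol fact it relies on is that a payment hash is the SHA-256 of a 32-byte pre-image.

```python
import hashlib
import secrets
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class InvoiceMeta:
    amount_sat: int
    expires_at: float
    memo: str


class PreimageStore:
    """In-memory stand-in for a store shared by all receiving nodes."""

    def __init__(self) -> None:
        self._preimages: Dict[bytes, bytes] = {}
        self._meta: Dict[bytes, InvoiceMeta] = {}

    def create(self, amount_sat: int, memo: str,
               ttl_seconds: float = 3600, now: Optional[float] = None) -> bytes:
        """Generate a random 32-byte pre-image and return its payment hash."""
        now = time.time() if now is None else now
        preimage = secrets.token_bytes(32)
        payment_hash = hashlib.sha256(preimage).digest()
        self._preimages[payment_hash] = preimage
        self._meta[payment_hash] = InvoiceMeta(amount_sat, now + ttl_seconds, memo)
        return payment_hash

    def settle(self, payment_hash: bytes, amount_sat: int,
               now: Optional[float] = None) -> Optional[bytes]:
        """Return the pre-image if an incoming HTLC matches a live invoice.

        Returns None for unknown hashes, expired invoices and underpayments.
        A settled invoice is removed so it cannot be settled twice.
        """
        now = time.time() if now is None else now
        meta = self._meta.get(payment_hash)
        if meta is None or now >= meta.expires_at or amount_sat < meta.amount_sat:
            return None
        del self._meta[payment_hash]
        return self._preimages.pop(payment_hash)
```

In this sketch, any node that sees a forward with a matching hash could call `settle` and release the pre-image, which is the property that lets an array of nodes accept the same invoice.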
Scaling with payments
Okay, receiving payments is actually a lot easier to scale than making payments. So sometimes we have customers who come to us and they’re like, I want to get ready to be able to pay in super rapid succession, a thousand invoices or maybe I have a big website and all the users are going to be wanting to stream payments and listen to podcasts and stuff, all at the same time. And we’ve done some calculations on the current payment speed of LND or we’re doing some tests and we’re like, there’s a limit of how much that one node is going to be able to support in terms of payments. So, like LND currently, just to calculate one route on my node, it takes probably one or two seconds. So that means if I had a bunch of users, I wouldn’t be able to make very many payments at all. And that’s just pathfinding. That’s not actually even making the payment. It can take minutes to make a payment. And you are maybe using some liquidity while you make a payment, while you search for it, you’re locking some capital along the routes. So, it can be tricky to scale up payments, but I would recommend even using the same concept that I recommended before, where you would make many different little private nodes and that you would connect them independently. And then as you needed to make more payments, you would add more private nodes. And then in LND 0.14, like our next release, we’re going to also be moving the payment graph into memory, which will have a big speed up in terms of the pathfinding time. So, I would say there’d be at least a 10x speed up in terms of one node can handle 10 times as many payments just in terms of calculations. And then if you needed to scale, it’d be still pretty simple. You would be spinning up nodes and each one of those nodes would be kind of independently paying and you could scale it up or down as you wanted to. And you wouldn’t want to do that with the routing node because you would run into that limitation. 
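The capacity reasoning above comes down to simple arithmetic. The following is a rough sketch, assuming that pathfinding is the only bottleneck and that each node computes one route at a time; the function name is made up, and the figures in the usage note are the estimates quoted in the talk, not benchmarks.

```python
def private_nodes_needed(target_payments_per_sec: int, route_time_ms: int) -> int:
    """Number of independent paying nodes needed to sustain a payment rate.

    Assumes each node computes one route at a time and that route
    calculation is the only cost, which understates real payment time.
    """
    if target_payments_per_sec < 0 or route_time_ms <= 0:
        raise ValueError("rate must be >= 0 and route time must be > 0")
    # Integer ceiling division avoids floating-point rounding surprises.
    return (target_payments_per_sec * route_time_ms + 999) // 1000
```

At roughly 1.5 seconds per route, sustaining 100 payments per second would take about 150 nodes under these assumptions; with a 10x pathfinding speedup the same load would need about 15.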
The payments also have the database problem, but way worse than invoices in LND. Because in LND currently, when you make a payment, every single attempt that you make, even if it fails, is stored in the database. So, the database kind of has this log of all the different information that you’ve learned when you made the payment. And it’s a major cause of the nodes to slow down. So if you’re running a lot of payments on your routing node and you notice that the performance is slipping, probably what’s happening is your database is filling up with all this payment data. That’s also something that we’re working on in LND 0.14. We’re going to allow you to remove the payment data that failed, that wasn’t interesting to you. And currently, the way that I scale now without that feature is I actually delete all my payments every week or so from my node. That’s kind of a common scaling practice, is that people are going to their node and removing every single payment record because that’s the API that’s currently available to LND as far as how you can manipulate your payment history data. And then like I said, every time you’re updating the state of a channel, every time you’re doing a forward, you’re creating disk records. So that’s another way to reduce the size of your database in relation to payments, is that you can take a look at the number of updates on a channel, the number of past states that existed, and if it gets too high and the mempool is pretty low, you can close the channel and then reopen the channel and you can kind of reclaim some of the disk space because it doesn’t need to keep around old states that don’t exist anymore. The thing that people say might not work or are worried about working is that they think that like the pathfinding won’t succeed. But actually, I think that that’s usually related to people’s routing settings. 
And that’s something that we have also been improving in LND, but it’s something that people maybe don’t know how to tune on their nodes. LND has these routerrpc settings in lnd.conf. And pathfinding is a tradeoff between different variables. It’s not like pathfinding can just improve by itself. So, like the most naive pathfinding algorithm would be I’ll calculate every single possible route that I can make through the network, and then I’ll sort all the routes by which one is the cheapest route. And then I’ll try cheap one, cheap two, cheap three, cheap four, and eventually I’ll arrive at the cheapest route that works, and I’ll be happy that my payment didn’t take too much money. But if you actually run that algorithm, and if other people are running the algorithm, you’re going to be exhausting all the cheapest liquidity and you’re probably going to spend a lot of time before your payment gets through. So, the adjustment to that algorithm is to say, well, I’m going to still start with the cheapest, but then I’m going to start to take the knowledge that I’m getting as I’m making the payments. This node isn’t working well for me. This channel isn’t working well for me. And I’m going to start to ignore possible pathways because I think it’s just not really worth it to me to try it. This node already returned to me 30 failures. Am I going to give them the 31st chance? Maybe not. And that’s something that you can configure on your own node based on your own time preference. So, if you have a high time preference and you want payments to go through quickly, then what you want to do is you want to change the number of times that you’re going to trust a node to not fail, to be lower. You want to say, if this node fails five times, I’m going to ignore it for a day. It’s just not working. But of course, it might have been that the 31st time did work. 
So that’s not something that you can just say, okay, I’m always going to do it, that there’s a perfect setting. It’s relative to you on how you want to make payments. And also, this is a two-sided system. So, if other people are paying more for reliability, that’s actually like incentivizing the routing nodes to do a better job. And so, it will become a better choice to pay more because you’ll be using the nodes that are responding to that economic incentive.
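The tradeoff between cost and patience can be made concrete with a toy payment loop. This is a simplified sketch, not LND's pathfinding: LND's router keeps its own record of past failures and estimates success probabilities, whereas this example uses a plain per-node failure counter. The `Route` class, the `pay` function and the `attempt` callback are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Route:
    hops: Tuple[str, ...]  # node identifiers along the route
    fee_sat: int


def pay(routes: List[Route],
        attempt: Callable[[Route], Optional[str]],
        max_failures: int,
        failures: Optional[Dict[str, int]] = None) -> Tuple[Optional[Route], int]:
    """Try routes cheapest first, skipping nodes that failed too often.

    `attempt` returns None on success or the name of the failing node.
    Returns the successful route (or None) and the number of attempts made.
    A lower `max_failures` means fewer wasted attempts but possibly a
    more expensive route, because cheap nodes get written off sooner.
    """
    failures = {} if failures is None else failures
    attempts = 0
    for route in sorted(routes, key=lambda r: r.fee_sat):
        if any(failures.get(node, 0) >= max_failures for node in route.hops):
            continue  # not worth another chance at this threshold
        attempts += 1
        failed_node = attempt(route)
        if failed_node is None:
            return route, attempts
        failures[failed_node] = failures.get(failed_node, 0) + 1
    return None, attempts
```

With a threshold of one failure the loop abandons an unreliable node immediately, while a threshold of five keeps retrying every cheap route through it before moving on.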
Scaling with Lightning
A lot of times people are even questioning, like, Lightning itself, is it going to scale? Or what are the problems of scaling in Lightning by itself? Did we solve scaling? Once we released Lightning, scaling over, we won? Is there more work to do? I’m of the opinion that there’s still a lot more work to do. And what we achieved with Lightning is we’ve created a scalable design for one part of the system, which is that we created the HTLC locking mechanism that by itself is not using broadcast. It’s not telling everybody about everything and forcing everybody to become in sync. So, the larger that it grows, the behavior still remains the same. I’m still forwarding through six hops, whether the network is 1,000 people or 10,000 people or a million people, a billion people. That part is very scalable and doesn’t really need so much work. The part that does need work is the graph resources. So, part of how Lightning works is how do I know how to figure out which hops to use when I’m making the payment? And the way that that currently works is we have this thing called the channel graph. And every node who wants to be a routing node, or maybe they don’t want to be, but they just left the default settings, they publish information about their channels to everybody else. And actually, that’s using a gossip system, just like the way that the blockchain works. That system will not scale the same way the blockchain doesn’t scale. It doesn’t make sense that everybody can tell everybody about everything all the time. So I’m going to update my fees every minute, and you’re going to update your fees every minute, and all the millions of people are going to update their fees every minute. We’re going to hit a bottleneck there. And then we also need to minimize the bytes that we use on the blockchain. 
If everybody were making huge channels, or huge block-sized channels, where they weren’t batching, they were using fancy scripts to make the channels, we could make far fewer channels than if we batch channel opens together. And if we do things like swaps to collapse lots of different liquidity changes together into small amounts of vbytes. And so that’s something that we’re working on, especially with respect to minimizing the blockchain vbytes. That’s something that I work on with Lightning Loop, and we have a lot of possibilities with the taproot activation to go to huge extremes on how much we can collapse, because Schnorr signature collapse is unlimited. Even if you had a thousand different people with a thousand different signatures and public keys, you can collapse all those people into one public key and one signature. So that’s something we really want to leverage going forward into next year.
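The contrast drawn in this section between forwarding and gossip can be put in numbers. This is a back-of-the-envelope sketch, assuming every node publishes updates at the same rate and every update reaches every other node exactly once; the function name is made up, and real gossip involves deduplication and rate limiting that this ignores.

```python
from typing import Tuple


def gossip_load_per_minute(nodes: int,
                           updates_per_node_per_minute: int = 1) -> Tuple[int, int]:
    """Return (updates each node must process, updates delivered network-wide).

    Broadcast cost grows with network size: per-node load is linear in
    the number of nodes and total deliveries are quadratic. Forwarding
    an HTLC, by contrast, touches only the hops on its route no matter
    how large the network is.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    received_per_node = (nodes - 1) * updates_per_node_per_minute
    network_total = nodes * received_per_node
    return received_per_node, network_total
```

Going from a thousand nodes to a million multiplies what each node must process by about a thousand and the network-wide total by about a million, while a six-hop payment still involves the same handful of nodes.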
Questions
Alex Bosworth: That’s my high-level coverage of scaling LND and Lightning. I think we have a couple minutes for questions.
Audience: So, for the routing nodes, what are the real challenges that are preventing running multiple nodes under a single routing node identity?
Alex Bosworth: Running multiple nodes under the same identity poses no technical issues. It depends on your interpretation. Does it refer to the node as the database? And can I run the database on multiple disks or on multiple machines? And that’s something that we will have improvement for in LND 0.14 when we add Postgres support, because we can kind of give that problem to Postgres, which has already solved this “how do I run the database on multiple machines?” problem. But it’s more about building the infrastructure around kind of like, even if I run it on multiple machines, does that mean it’s faster? In the first implementation, it won’t be faster. What it will mean is it’s more reliable. One of them goes down. Postgres will automatically recover. And we don’t even have to do anything on the Lightning side for that. We just get that for free, because database people already figured that out.
Alex Bosworth: Last question.
Audience: You mentioned in an earlier slide not to leave your local balance empty. Having tried running a Lightning node myself, sometimes I end up in a situation where it’s going to cost me enough Satoshis to rebalance that, based on my experience of what I can charge for routing, it doesn’t feel like it’s sensible for me to rebalance a channel. So, what are my options? I can leave it empty. I can close the channel. I can pay through the nodes to rebalance it. Any thoughts on how a pleb should approach this?
Alex Bosworth: That’s a good question. So, if you run out of balance, one thing I would say is you paid for the information that it’s going to run out of balance. Because if you just opened up a bunch of channels with a bunch of different nodes, not all of them would run out. So now you actually didn’t lose money. You spent it to get information about money is moving in this direction. Instead of rebalancing, add a new channel. You can say, the channel size that I chose in the beginning wasn’t big enough, I need a bigger one. And then I would use the information to say the fee was probably incorrect. I was pricing this too cheaply. And so somebody else, you know, I was a low bid for that route. Somebody else used me. I need to open a bigger channel and charge more. But it’s also possible that people could be competing below cost. That’s something that I faced a lot, especially as an early routing node. People who are approaching this from an altruistic perspective are actually distorting the market. They’re saying, okay, I’ll open up a bunch of channels and I’ll pay a bunch of fees. And I know of big nodes who have spent multiple, you know, more than a few bitcoins just on fees and they never made any money back. And so, if you’re just trying to be an economic actor competing with them, you have to find places where they’re not spending that money. So sometimes you do have to walk away.
Alex Bosworth: Okay, thank you.