At merkle, we serve hundreds of millions of RPC requests every month, and we need to make sure our infrastructure can handle the load. We use a custom load balancer, built in-house, to distribute traffic across our RPC nodes and achieve high availability while keeping costs low.
- RPC services
- Building our own RPC nodes
- Picking a cloud
- Picking a load balancer
- Building a load balancer for RPC nodes
- The architecture
- Supporting eth_subscribe
- Keeping track of the head
- Quality of life improvements
When merkle started, we used Alchemy. However, we quickly realized this wouldn't scale, after receiving a $1,600 bill for just 2 days of usage:
Extrapolated, this would have cost us $24,000 per month, which is way too much for a startup. We decided to build our own RPC nodes, and we've been using them ever since.
Building our own RPC nodes
Our goals were simple:
- High availability: we need to be able to handle billions of requests per month, and we can't afford to have downtime.
- Low cost: we're a startup, and we need to keep costs under $1,500 per month.
- Low latency: we need to be able to handle requests in a timely manner.
- Easy to scale: we need to be able to scale up and down easily.
- Low maintenance: we don't want to spend a lot of time maintaining our infrastructure.
- Multi-chain: we need to be able to support multiple chains / add new chains quickly with zero downtime.
Picking a cloud
We decided to use OVH, a French cloud provider, because they offer a lot of flexibility and low prices for beefy machines. We also use AWS for some services, but we prefer OVH for our RPC nodes.
Specifically, we use ADVANCE-2 servers, which have 16 cores, 32GB of RAM, and 2x 1.92TB NVMe SSDs. They cost $200 (less with commitments) per month, which is a great deal.
For Polygon and BSC, we use the same server with higher disk capacity (2x 3.84TB NVMe SSDs) for $250 per month.
But the real value in OVH servers is the unlimited outgoing/incoming bandwidth.
We run at least 2 nodes per chain, 6 nodes in total, for a monthly cost of ~$1,000 (thanks to long-term commitments and discounts from OVH).
Picking a load balancer
Nginx is a great general-purpose load balancer, but RPC nodes have different needs, which is why we decided to build our own load balancer in Go. merkle products are mostly Rust, but we use Go for some high-traffic services, and it's a great fit for this use case.
Building a load balancer for RPC nodes
We needed a high-throughput, low-latency load balancer that could handle hundreds of millions of requests per month, and we needed to build it quickly.
The architecture
To keep track of all upstream nodes, the load balancer connects to them over multiple WebSockets (we don't use HTTP). At all times, the load balancer holds between 5 and 10 WebSocket connections to every upstream server.
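A minimal sketch of such a connection pool, assuming a dial function is injected so the pool can be exercised without a live node (the `Conn` interface and all names here are hypothetical, not the actual implementation, which would wrap a real WebSocket client):

```go
package main

import (
	"fmt"
	"sync"
)

// Conn is a stand-in for a WebSocket connection (hypothetical
// interface; the real pool would wrap an actual WebSocket client).
type Conn interface {
	Close() error
}

// Pool keeps a fixed set of connections open to one upstream node
// and hands them out round-robin.
type Pool struct {
	mu    sync.Mutex
	conns []Conn
	next  int
}

// NewPool dials `size` connections up front (between 5 and 10 in
// our setup) so no request ever waits on a fresh handshake.
func NewPool(size int, dial func() (Conn, error)) (*Pool, error) {
	p := &Pool{}
	for i := 0; i < size; i++ {
		c, err := dial()
		if err != nil {
			return nil, err
		}
		p.conns = append(p.conns, c)
	}
	return p, nil
}

// Get returns the next connection, round-robin.
func (p *Pool) Get() Conn {
	p.mu.Lock()
	defer p.mu.Unlock()
	c := p.conns[p.next%len(p.conns)]
	p.next++
	return c
}

// fakeConn lets us exercise the pool without a network.
type fakeConn struct{ id int }

func (f fakeConn) Close() error { return nil }

func main() {
	n := 0
	pool, _ := NewPool(5, func() (Conn, error) {
		n++
		return fakeConn{id: n}, nil
	})
	// Consecutive Gets rotate across the pooled connections.
	fmt.Println(pool.Get().(fakeConn).id, pool.Get().(fakeConn).id)
}
```

Pre-dialing and round-robin keep per-connection load even and avoid handshake latency on the hot path.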
Keeping track of the head
State consistency is already an issue with normal web services, but with RPC nodes it's a totally different problem. We want to make sure we never route requests to a server that is lagging behind the head of the network.
For example, suppose we have servers A and B. When A hears of a new block, it quickly processes it and updates its state, but B might not have received the block yet. You then have two nodes with different state.
To solve this, we track the head of the network as seen by each node, and we only route requests to nodes that are at the latest head. However, we wait until a majority of nodes have synced to the new head before advertising it to clients; otherwise all requests would be routed to a single server for a short period, putting a lot of load on it.
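The majority rule above can be sketched as follows, assuming each node reports the latest block number it has seen (function and variable names are ours, for illustration only):

```go
package main

import (
	"fmt"
	"sort"
)

// headForQuorum returns the highest block number that at least
// `quorum` nodes have reached. We only advertise (and route to)
// this head once a majority is there, so traffic never piles onto
// the single fastest node.
func headForQuorum(heads []uint64, quorum int) uint64 {
	sorted := append([]uint64(nil), heads...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	return sorted[quorum-1]
}

// eligible returns the indices of nodes at or past the given head;
// only these nodes receive traffic.
func eligible(heads []uint64, head uint64) []int {
	var out []int
	for i, h := range heads {
		if h >= head {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	heads := []uint64{101, 101, 100} // node 2 is one block behind
	head := headForQuorum(heads, 2)
	fmt.Println(head, eligible(heads, head)) // 101 [0 1]
}
```

With a quorum of 2 out of 3, the advertised head stays at 100 until a second node reaches 101, at which point requests spread across both up-to-date nodes.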
Supporting eth_subscribe
eth_subscribe is the fastest way to get notified of new blocks and new pending transactions, but we don't want to just proxy the request and attach a stream to a node, because we want to make sure we never miss an event. And if a node goes down, the client should never notice, and should keep receiving new blocks.
Thankfully, we already track every new-block event to route requests. Therefore, an eth_subscribe never actually needs to be forwarded to a node: we can just keep track of the subscription on the load balancer and forward the events to the client.
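A sketch of how the load balancer can own those subscriptions itself, fanning new-head events out to clients over channels (the `Hub` structure and names are hypothetical, not the real code):

```go
package main

import (
	"fmt"
	"sync"
)

// Hub owns all newHeads subscriptions. Clients register with the
// load balancer rather than with any single node, so a node going
// down never interrupts their stream.
type Hub struct {
	mu   sync.Mutex
	subs map[int]chan string
	next int
}

func NewHub() *Hub { return &Hub{subs: make(map[int]chan string)} }

// Subscribe registers a client and returns its id and event channel.
func (h *Hub) Subscribe() (int, <-chan string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	ch := make(chan string, 16)
	id := h.next
	h.next++
	h.subs[id] = ch
	return id, ch
}

// Unsubscribe removes a client.
func (h *Hub) Unsubscribe(id int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	delete(h.subs, id)
}

// Publish fans one new-head event out to every subscriber. It is
// fed by the balancer's own head tracker, not by any proxied stream.
func (h *Hub) Publish(blockHash string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, ch := range h.subs {
		select {
		case ch <- blockHash:
		default: // drop rather than block on a slow client
		}
	}
}

func main() {
	hub := NewHub()
	_, a := hub.Subscribe()
	_, b := hub.Subscribe()
	hub.Publish("0xabc")
	fmt.Println(<-a, <-b) // 0xabc 0xabc
}
```

Because the hub is fed by the balancer's head tracker rather than by a stream pinned to one node, any upstream can fail without subscribers seeing a gap.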
New pending transactions:
Under the hood, the load balancer connects to our Transaction stream to seamlessly advertise pending transactions as fast as possible.
We know from experience that as soon as a new block is advertised, the load balancer gets flooded with eth_getBlockByHash requests. That's why we cache the responses before advertising a new block, leading to 40-80% cache hits.
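A sketch of that pre-caching step, assuming the block's JSON-RPC response is fetched once and stored before the newHeads event goes out (all names here are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// blockCache holds pre-serialized eth_getBlockByHash responses.
type blockCache struct {
	mu      sync.RWMutex
	entries map[string][]byte
}

func newBlockCache() *blockCache {
	return &blockCache{entries: make(map[string][]byte)}
}

// warm stores the response for a block hash BEFORE the block is
// advertised to clients, so the flood of eth_getBlockByHash that
// follows every newHeads event is served from memory.
func (c *blockCache) warm(hash string, response []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[hash] = response
}

// get returns the cached response, if any.
func (c *blockCache) get(hash string) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	r, ok := c.entries[hash]
	return r, ok
}

func main() {
	cache := newBlockCache()
	// 1. A node tells us about a new block; fetch it once ourselves.
	cache.warm("0xabc", []byte(`{"result":{"hash":"0xabc"}}`))
	// 2. Only now advertise the block; follow-up lookups hit memory.
	if r, ok := cache.get("0xabc"); ok {
		fmt.Println(string(r))
	}
}
```

Ordering matters: warming the cache before the advertisement is what turns the predictable request flood into cache hits.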
Quality of life improvements
Our engineers used to always ask: "What is the RPC URL for <x> chain?" So we put our load balancer behind rpc.merkle.net (on our internal network). Now they can just use https://rpc.merkle.net/<chain> for any chain we support.
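That routing can be sketched with a handler that reads the chain from the first path segment (the handler and the chain list are illustrative, not our actual code):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// chainFromPath maps /<chain>/... to a chain name, or "" if unknown.
func chainFromPath(path string, known map[string]bool) string {
	chain := strings.Trim(path, "/")
	if i := strings.Index(chain, "/"); i >= 0 {
		chain = chain[:i]
	}
	if known[chain] {
		return chain
	}
	return ""
}

func main() {
	known := map[string]bool{"ethereum": true, "polygon": true, "bsc": true}
	h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		chain := chainFromPath(r.URL.Path, known)
		if chain == "" {
			http.Error(w, "unknown chain", http.StatusNotFound)
			return
		}
		// The real balancer would proxy to that chain's node pool here.
		fmt.Fprintf(w, "routed to %s", chain)
	})

	srv := httptest.NewServer(h)
	defer srv.Close()
	if resp, err := http.Get(srv.URL + "/polygon"); err == nil {
		fmt.Println(resp.StatusCode) // 200
	}
}
```

One hostname plus a path per chain means adding a new chain is a config change, with no new DNS entries for anyone to memorize.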
We've been using this load balancer for over 3 months now. It's been working great, has scaled very well, and has saved us over $250,000 in the process.