We saved $250,000 by running our own RPC nodes
Author: merkle (@merkle_mev)
At merkle, we consume hundreds of millions of RPC requests every month, and we need to make sure that our infrastructure is capable of handling the load. We use a custom load balancer that we built in-house to distribute the load across our RPC nodes and achieve high availability while keeping costs low.
- RPC services
- Building our own RPC nodes
- Picking a cloud
- Picking a load balancer
- Building a load balancer for RPC nodes
- The architecture
- Supporting eth_subscribe
- Keep track of the head
- eth_subscribe
- Caching
- Quality of life improvements
- Conclusion
RPC services
When merkle started, we used Alchemy. However, we quickly realized this wouldn't scale after we received a $1,600 bill for just two days of usage.
Extrapolated, this would have cost us $24,000 per month, which is way too much for a startup. We decided to build our own RPC nodes, and we've been using them ever since.
Building our own RPC nodes
Our goals were simple:
- High availability: we need to be able to handle billions of requests per month, and we can't afford to have downtime.
- Low cost: we're a startup, and we need to keep our costs low, less than $1,500 per month.
- Low latency: we need to be able to handle requests in a timely manner.
- Easy to scale: we need to be able to scale up and down easily.
- Low maintenance: we don't want to spend a lot of time maintaining our infrastructure.
- Multi-chain: we need to be able to support multiple chains / add new chains quickly with zero downtime.
Picking a cloud
We decided to use OVH, a French cloud provider, because they offer a lot of flexibility and low prices for beefy machines. We also use AWS for some services, but we prefer OVH for our RPC nodes.
Specifically, we use ADVANCE-2 servers, which have 16 cores, 32GB of RAM, and 2x 1.92TB NVMe SSDs. They cost $200 (less with commitments) per month, which is a great deal.
For Polygon and BSC, we use the same server with higher disk capacity (2x 3.84TB NVMe SSDs) for $250 per month.
But the real value in OVH servers is the unlimited outgoing/incoming bandwidth.
We run a minimum of two nodes per chain, adding up to six nodes in total and a monthly cost of ~$1,000 (thanks to long-term commitments and discounts from OVH).
Picking a load balancer
Nginx is a great general-purpose load balancer, but RPC nodes have different requirements, which is why we decided to build our own load balancer in Go. merkle products are mostly Rust, but we use Go for some high-traffic services, and it's a great fit for this use case.
Building a load balancer for RPC nodes
We needed a high-throughput, low-latency load balancer that could handle hundreds of millions of requests per month, and we needed to build it quickly.
The architecture
In order to keep track of all upstream nodes, the load balancer connects to them over multiple WebSockets (we don't use HTTP).
At all times, the load balancer maintains between 5 and 10 WebSocket connections to every upstream server.
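To make the pool concrete, here's a minimal sketch in Go using gorilla/websocket. The upstream type, pool size, and node addresses are illustrative assumptions, not our production code:

```go
package main

import (
	"log"
	"sync"
	"time"

	"github.com/gorilla/websocket"
)

// upstream represents one RPC node; field and type names are illustrative.
type upstream struct {
	url   string
	mu    sync.Mutex
	conns []*websocket.Conn
}

// maintainPool keeps roughly `size` live WebSocket connections open to the
// node, re-dialing whenever the pool shrinks (dead connections are assumed to
// be pruned by a read loop, omitted here).
func (u *upstream) maintainPool(size int) {
	for {
		u.mu.Lock()
		missing := size - len(u.conns)
		u.mu.Unlock()

		for i := 0; i < missing; i++ {
			conn, _, err := websocket.DefaultDialer.Dial(u.url, nil)
			if err != nil {
				log.Printf("dial %s: %v", u.url, err)
				break
			}
			u.mu.Lock()
			u.conns = append(u.conns, conn)
			u.mu.Unlock()
		}
		time.Sleep(time.Second)
	}
}

func main() {
	nodes := []*upstream{
		{url: "ws://10.0.0.1:8546"}, // hypothetical internal node addresses
		{url: "ws://10.0.0.2:8546"},
	}
	for _, n := range nodes {
		go n.maintainPool(5) // 5-10 connections per upstream
	}
	select {} // block forever; real code would serve client traffic here
}
```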
Supporting eth_subscribe
Keep track of the head
State consistency is an issue with normal web services, but when it comes to RPC nodes, it's a totally different problem. We want to make sure we never route requests to a server that is lagging behind the head of the network.
For example, suppose that we have servers A and B. When A hears of a new block, it'll quickly process it and update its state, but B might not have received the new block yet. You then have two nodes with different state.
In order to solve this problem, we keep track of the head of the network for each node, and we only route requests to nodes that are at the latest head. However, we need to wait until the majority of nodes have synced to the new head before advertising it to clients; otherwise all requests would be routed to one server for a short period, which would put a lot of load on that single server.
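Here's a simplified sketch of that head-tracking idea; the headTracker type and quorum field are hypothetical names, and real code would also need to handle reorgs and nodes dropping out:

```go
package balancer

import "sync"

// headTracker records the latest block seen by each upstream and only
// advertises a new head once enough nodes have reached it. This is a
// simplified sketch of the idea, not merkle's actual code.
type headTracker struct {
	mu     sync.Mutex
	heads  map[string]uint64 // upstream name -> latest block it has processed
	quorum int               // e.g. a majority of the configured upstreams
	head   uint64            // head currently advertised to clients
}

// onNewBlock is called whenever an upstream reports a new block.
func (t *headTracker) onNewBlock(upstream string, block uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()

	t.heads[upstream] = block

	// Count how many upstreams have reached this block.
	n := 0
	for _, h := range t.heads {
		if h >= block {
			n++
		}
	}
	// Only move the advertised head forward once a quorum is synced, so
	// traffic isn't funneled onto the first node to see the block.
	if n >= t.quorum && block > t.head {
		t.head = block
	}
}

// eligible returns the upstreams that are at the advertised head and can
// therefore serve requests.
func (t *headTracker) eligible() []string {
	t.mu.Lock()
	defer t.mu.Unlock()

	var out []string
	for name, h := range t.heads {
		if h >= t.head {
			out = append(out, name)
		}
	}
	return out
}
```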
eth_subscribe
New blocks:
eth_subscribe is the fastest way to get notified of new blocks and new pending transactions, but we don't want to just proxy the request and attach a stream to a node, because we want to make sure that we don't miss any events. And in case a node goes down, we want to make sure that the client never notices and keeps receiving new blocks.
Thankfully, we already track every new block event to route requests. Therefore, an eth_subscribe never actually needs to be forwarded to a node: we can just keep track of the subscription on the load balancer and forward the events to the client.
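Conceptually, the fan-out looks something like the sketch below. The subscriber and fanout types are hypothetical and heavily simplified (slow clients simply get notifications dropped):

```go
package balancer

import (
	"encoding/json"
	"sync"
)

// subscriber is one client's eth_subscribe("newHeads") subscription, kept on
// the load balancer itself rather than on any upstream node.
type subscriber struct {
	id   string                 // subscription id returned by eth_subscribe
	send chan<- json.RawMessage // channel feeding the client's WebSocket writer
}

type fanout struct {
	mu   sync.Mutex
	subs map[string]subscriber
}

// subscribe registers a client; nothing is ever forwarded to an upstream.
func (f *fanout) subscribe(id string, send chan<- json.RawMessage) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.subs[id] = subscriber{id: id, send: send}
}

// unsubscribe drops a client, e.g. on eth_unsubscribe or disconnect.
func (f *fanout) unsubscribe(id string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	delete(f.subs, id)
}

// broadcastNewHead is called once per new block, which the balancer already
// tracks for routing, and fans the header out to every subscriber.
func (f *fanout) broadcastNewHead(header json.RawMessage) {
	f.mu.Lock()
	defer f.mu.Unlock()
	for _, s := range f.subs {
		notif, _ := json.Marshal(map[string]any{
			"jsonrpc": "2.0",
			"method":  "eth_subscription",
			"params": map[string]any{
				"subscription": s.id,
				"result":       header,
			},
		})
		select {
		case s.send <- notif:
		default: // drop the notification rather than block on a slow client
		}
	}
}
```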
New pending transactions:
Under the hood, the load balancer connects to our Transaction stream to seamlessly advertise pending transactions as fast as possible.
Caching
We know from experience that as soon as a new block is advertised, the load balancer gets flooded with eth_getTransactionReceipt, eth_getTransactionByHash, eth_getBlockByNumber and eth_getBlockByHash calls. That's why we cache all the responses before advertising a new block, leading to 40-80% cache hits.
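A rough sketch of that pre-warming step, assuming hypothetical cacheSet and fetch helpers in place of the real cache and upstream JSON-RPC client:

```go
package balancer

import (
	"context"
	"fmt"
	"time"
)

const warmTTL = 15 * time.Second // assumed TTL; anything around a block time works

// cacheKey identifies a cached JSON-RPC response by method and params.
func cacheKey(method string, params ...any) string {
	return fmt.Sprintf("%s:%v", method, params)
}

// warmCache fetches the responses clients will ask for the moment the block is
// advertised, and stores them before the block is announced.
func warmCache(
	ctx context.Context,
	blockHash string,
	blockNumber string, // hex-encoded, e.g. "0x10d4f2a"
	txHashes []string,
	cacheSet func(key string, value []byte, ttl time.Duration),
	fetch func(ctx context.Context, method string, params ...any) ([]byte, error),
) {
	// The block itself, under both of its common lookups.
	blockCalls := []struct {
		method string
		params []any
	}{
		{"eth_getBlockByHash", []any{blockHash, true}},
		{"eth_getBlockByNumber", []any{blockNumber, true}},
	}
	for _, c := range blockCalls {
		if resp, err := fetch(ctx, c.method, c.params...); err == nil {
			cacheSet(cacheKey(c.method, c.params...), resp, warmTTL)
		}
	}

	// Per-transaction responses, which clients request immediately after
	// seeing the block.
	for _, h := range txHashes {
		for _, method := range []string{"eth_getTransactionReceipt", "eth_getTransactionByHash"} {
			if resp, err := fetch(ctx, method, h); err == nil {
				cacheSet(cacheKey(method, h), resp, warmTTL)
			}
		}
	}
}
```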
Quality of life improvements
Our engineers used to always ask "What is the RPC URL for <x> chain?". So we put our load balancer behind rpc.merkle.net (on our internal network). Now, they can just use https://rpc.merkle.net/<chain> for any chain that we support.
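The routing itself is just a path lookup; a minimal sketch, assuming a hypothetical per-chain map of handlers:

```go
package balancer

import (
	"net/http"
	"strings"
)

// chainRouter maps the first path segment to a per-chain handler, so a request
// to https://rpc.merkle.net/<chain> lands on that chain's node pool. The pools
// map and its handlers are assumptions for illustration.
func chainRouter(pools map[string]http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		chain := strings.Trim(r.URL.Path, "/") // "/ethereum" -> "ethereum"
		pool, ok := pools[chain]
		if !ok {
			http.Error(w, "unsupported chain", http.StatusNotFound)
			return
		}
		pool.ServeHTTP(w, r)
	})
}
```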
Conclusion
We've been using this load balancer for over 3 months now. It's been working great, has scaled very well, and we were able to save over $250,000 in the process.