Scaling the merkle Private Mempool to 25M tx/day
Author: merkle (@merkle_mev)
It's been a busy year for us at merkle. We've been working on a lot of things, but one of the most interesting is our private mempool and how we managed to scale it to billions of requests per month with "random" traffic spikes, especially over the last 3 months. The mempool is a critical piece of infrastructure, where even a p99 > 1s is highly noticeable, hurts user trust, and can lead to lost revenue on the order of millions.
We are hiring for multiple positions across our engineering teams, so come join us!
merkle is now powering major wallets, RPC providers, and even Trump's World Liberty Finance frontend through our public Ethereum RPC service! We have cumulatively paid out millions of dollars in revenue to our customers in 2024 and serve almost 250M requests per day.
Transactions received per day over the last 3 months.
Challenges
Unpredictable traffic
The main challenge with scaling anything related to trading or markets is the unpredictability of the traffic. You cannot predict when Ethereum is going to drop 20% in one day, or when a DeFi protocol is going to get hacked. However, some events are predictable, such as Trump winning the election.
Trump won the election, and our traffic spiked 4x in a few minutes.
Data production
The second challenge is the amount of data we produce. We have to store every transaction sent to our servers, as well as the state of the mempool at any given time. That adds up quickly, and our storage has to scale to handle it. Thankfully, we do not need to keep much historical data in our production database. As of today, we produce terabytes of data per month.
Database scaling
Our main datastore is PostgreSQL. We use it to store the state of the mempool and the transactions that are sent to our servers. However, scaling off-the-shelf PostgreSQL can be a brutal task.
Tackling challenges one by one
Unpredictable traffic
In order to absorb traffic spikes, we had to optimize database queries, optimize hot paths, and become queue-driven.
Optimizing database queries
We noticed that our database queries would grow exponentially with traffic. This was due to our queues receiving data in the following format; take our broadcast queue, for example:
{
// uuid of the transaction
"transactionId": "0001-2345-4953-1029"
}
This meant that before broadcasting the transaction, the queue processor had to do a lookup in the database to fetch it. We refactored the queue processors around a more efficient data structure and one rule: "all the data needed to process the message should be in the message".
{
"transaction": "0x02....",
"blockNumber": 17000000,
"options": {
// broadcast options
"maxTimestamp": 17000000,
// etc ...
}
}
This allowed the queue to keep doing its work even if the database was down. It also reduced the number of queries many times over.
We applied this principle across all of our queues (20+), which reduced database load by about 70%.
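To make this concrete, here is a minimal TypeScript sketch of a broadcast consumer built around that rule. The message shape, endpoint URL, and handler name are assumptions for illustration, not our production code; the point is that the handler never looks anything up in the database.

// Hypothetical sketch: a broadcast consumer that relies only on the message
// payload, so it keeps working even when the database is unavailable.
interface BroadcastMessage {
  transaction: string;      // raw signed transaction, hex-encoded
  blockNumber: number;      // block height when the transaction was received
  options: {
    maxTimestamp: number;   // stop broadcasting after this unix timestamp
  };
}

async function handleBroadcast(msg: BroadcastMessage): Promise<void> {
  // Drop expired work instead of querying the database for state.
  if (Date.now() / 1000 > msg.options.maxTimestamp) return;

  // Everything needed to broadcast is already in the message.
  await fetch("https://rpc.example.com", {  // placeholder RPC endpoint
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({
      jsonrpc: "2.0",
      id: 1,
      method: "eth_sendRawTransaction",
      params: [msg.transaction],
    }),
  });
}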
Hot paths
We optimized our hot paths to be queue-driven instead of database-driven. This allows us to continue partial operations (receiving and broadcasting transactions) even if the database is down or at max capacity. We achieved this by passing all required data around in the queue messages and having a single queue responsible for writing to the database.
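A minimal sketch of that single-writer pattern, assuming the node-postgres (pg) client and hypothetical table and queue names (an in-memory array stands in for the real message queue):

// Hot-path handlers only enqueue; one consumer performs the database writes.
import { Pool } from "pg";

type DbWrite = { transactionId: string; rawTx: string; receivedAt: number };

const queue: DbWrite[] = [];   // stand-in for a real message queue
const pool = new Pool();       // reads PG* environment variables

// Hot path: receiving a transaction never touches the database directly.
export function enqueueWrite(write: DbWrite): void {
  queue.push(write);
}

// Single writer: drains the queue, so the hot path stays up even when the
// database is slow or briefly unavailable.
export async function drainWrites(): Promise<void> {
  while (queue.length > 0) {
    const w = queue.shift()!;
    await pool.query(
      "INSERT INTO transactions (id, raw_tx, received_at) VALUES ($1, $2, to_timestamp($3)) ON CONFLICT (id) DO NOTHING",
      [w.transactionId, w.rawTx, w.receivedAt],
    );
  }
}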
Data production
We are fortunate that our data is temporary: once a transaction is mined or expired, it is no longer needed. This means we can use a lot of tricks to optimize our database. We automatically ETL our production data into Snowflake and delete transactions older than 7 days from our production database. This helps keep indexes light and inserts fast.
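As an illustration, a retention job for that 7-day cleanup could look like the sketch below (TypeScript with the pg client; table and column names are assumptions, and the ETL into Snowflake is assumed to run before rows are deleted):

// Hypothetical retention job: delete in bounded batches so it never holds
// long locks while pruning expired transactions.
import { Pool } from "pg";

const pool = new Pool();

export async function pruneOldTransactions(batchSize = 10_000): Promise<void> {
  let deleted: number;
  do {
    const res = await pool.query(
      `DELETE FROM transactions
       WHERE id IN (
         SELECT id FROM transactions
         WHERE received_at < now() - interval '7 days'
         LIMIT $1
       )`,
      [batchSize],
    );
    deleted = res.rowCount ?? 0;
  } while (deleted > 0);
}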
Thanks to Snowflake, we can easily scale our data storage and compute to handle the data production and queries for our reports & dashboards.
Database scaling
PostgreSQL is a great database, but it's not the best at scaling, at least not out of the box. While PostgreSQL might be faster at smaller scale, you really want consistency over speed for a database that is used by thousands of users.
Fortunately, distributed SQL was invented about 10 years ago and is now mature enough to run in production. We are using CockroachDB to scale our database, and it's been a game changer.
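Because CockroachDB speaks the PostgreSQL wire protocol, existing Postgres clients largely keep working. A minimal connection sketch in TypeScript, assuming the pg client; the connection string and table are placeholders, not our real cluster:

// Hypothetical CockroachDB connection over the Postgres wire protocol
// (26257 is CockroachDB's default port).
import { Pool } from "pg";

const pool = new Pool({
  connectionString:
    "postgresql://app_user@my-cluster.example.com:26257/mempool?sslmode=verify-full",
  max: 20, // cap client-side connections; the cluster spreads load across nodes
});

export async function countRecentTransactions(): Promise<number> {
  const res = await pool.query(
    "SELECT count(*) AS n FROM transactions WHERE received_at > now() - interval '1 day'",
  );
  return Number(res.rows[0].n);
}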