Elasticsearch Cluster Sizing and Performance Tuning
How many nodes should my cluster have? How many replicas should I create? What is the optimal average shard size for the best search performance? The list goes on. All these questions have only one answer: ‘Nobody knows, but you can find out!’ Nobody else knows your data, your query structure, the hardware you work with, or the throughput you have. Unfortunately, there is no mathematical formula or theoretical calculation that gives a definitive answer. I’m sorry if that disappoints you. But don’t worry, I will help guide you in finding a way.
1) Size Your Data
Before sizing the data, it’s important to be familiar with two terms: ‘shards’ and ‘replicas’. In distributed systems, data is divided into smaller pieces called shards, which are distributed across different nodes to account for hardware constraints and potential failures. A replica is an exact copy of a shard that takes over if the node holding the original (primary) shard fails. In Elasticsearch, replicas can also serve search requests.
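To make the two terms concrete, here is a minimal sketch of how shard and replica counts combine. It assumes the standard Elasticsearch index settings `number_of_shards` and `number_of_replicas`; the helper names `index_settings` and `total_shard_copies` and the values 3 and 1 are made up for illustration and are not a sizing recommendation.

```python
def index_settings(primary_shards: int, replicas_per_primary: int) -> dict:
    """Build a create-index request body with shard and replica counts.

    Hypothetical helper: only the two setting names come from Elasticsearch.
    """
    if primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    if replicas_per_primary < 0:
        raise ValueError("replica count cannot be negative")
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas_per_primary,
        }
    }


def total_shard_copies(primary_shards: int, replicas_per_primary: int) -> int:
    """Each primary has `replicas_per_primary` copies, so the counts multiply."""
    return primary_shards * (1 + replicas_per_primary)


# Illustrative values only: 3 primaries, 1 replica of each.
body = index_settings(primary_shards=3, replicas_per_primary=1)
print(body)
print(total_shard_copies(3, 1))  # 3 primaries + 3 replicas = 6
```

The point of the arithmetic is that every replica is a full copy of its shard, so raising the replica count multiplies both the number of shards the cluster has to host and the disk space the index takes.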
What is the optimal number of shards?
Each request is processed in a single thread per shard. This means that with multiple shards, a query runs on them in parallel, which can reduce execution time compared to having fewer shards. That sounds good. However, adding shards improves performance only up to a certain point, and beyond that it may cause performance issues. This is because a…