in elasticsearch cluster ~ read.

Building an ElasticSearch cluster

Today I want to share my recent experience of setting up a few terabytes ElasticSearch cluster. Strictly speaking, we had to build a 0,5TB cluster. That is not a big amount of data (nowadays you can easily find a 3TB SSD-drive available for sale). But when you start thinking about capacity for the future data, fault tolerance through clustering/replication and all that stuff - your 0,5TB becomes 5TB, which is a bit more interesting task :)


Analysis

The main questions you must answer on before making any decisions are:

  • what kind of data are you going to store?
  • how will you use it?

Answers on these two questions determine which technology/solution you have to use and how you have to configure it. Our case was about full-text search. Taking into account that we already familiar with ElasticSearch and have an experience with it, the choice of technology was obvious.

Partitioning

As for data, we had to deal with a list of user operations. A big list of user operations, actually :) Hopefully, each operation has its own date and time, so that our data can be easily partitioned. To be honest, once your data is bigger than X (depends on cpu/mem), you have to split it into partitions (by date, user id, etc), which can be calculated on the client side. The smaller your index is, the faster search against it is. Don't confuse partitions with ES shards - last ones have a fixed amount and can't be addressed directly.

Moreover, our use-cases assume that:

  • a user must specify a date range to retrieve the data;
  • in 99% cases users require data for the last week (in 95% - for the last day);

Given that, we can say that normally we have to deal with last 7 indices (each of them weight ~300mb). That is how requirements analysis can narrow a 0,5TB issue to a 2,1GB task :)

ElasticSearch assumes that all the partitioning logic must be implemented on the client side, so there is nothing more to add here. Just don't forget to set up “ignoreUnavailable” options on a client to avoid problems with nonexistent partitions:

searchRequestBuilder.setIndicesOptions(IndicesOptions.fromOptions(true, true, false, false));  

Indices

By default ES indexes all the fields of your objects (1, 2). That's a good "out of box" solution when dealing with a small amount of data. But since we play in a serious game with big numbers we have to think about each byte.

Let's say that your value-object already contains only necessary fields. Now we have to do the following:

  • enable _source field, since we are going to store an original document with all its fields;
  • disable _all field to avoid indices we don't need;
  • use dynamic_templates to disable indexing and storing for all the fields (we don't need to store data into the fields since we have an original object in _source field, indexing will be configured explicitly below);
  • disable norms, if you don't need them;

Now it's time to enable indexing only for those fields which are used in search filters/queries. Plus, you should store/index values for fields which are used in aggregations (if you have ones, of course).

Keeping in mind all the things that we find out about partitioning, here is an example of our index template:

{
    "entities": {

        "template": "entities*",
        "settings": {
            "index.number_of_shards": 3,
            "index.number_of_replicas": 0,
        },
        "mappings": {
            "entity": {

                "_source": {
                    "enabled": true
                },
                "_all": {
                    "enabled": false
                },

                "dynamic_templates": [{
                    "common": {
                        "match": "*",
                        "mapping": {
                            "type": "{dynamic_type}",
                            "store": false,
                            "index": "no"
                        }
                    }
                }],
                "properties": {

                    "timestamp": {
                        "type": "date",
                        "store": true,
                        "index": "not_analyzed",
                        "format": "yyyy-MM-dd HH:mm:ss"
                    },

                    "amount": {
                        "type": "long",
                        "store": true,
                        "index": "not_analyzed"
                    },


                    "comment": {
                        "type": "string",
                        "store": false,
                        "index": "analyzed",
                        "analyzer": "russian"

                    }

                }
            }
        }
    }
}

Notice, that amount type here is long, not double. Storing money in currency's minor units prevents you from a lot of troubles.


Cluster configuration

There is an official guide that you have to read if you want to set up your cluster properly. In short:

  • set -Xmx and -Xms to the same values (not more than 30Gb);
  • turn of swap (sysctl -w vm.swappiness=1);
  • increase amount of open files (sysctl -w vm.max_map_count=262144);
  • enable mlockall (man mlockall);

Amount of nodes

That was the simplest solution - our blades have 2TB SSD drives. Hence, to achieve 5TB capacity we need 3 nodes.

Also our experiments and load testing showed us that we have to extract a separate master mode. Without it cluster might be unavailable when nodes experiences high load. Plus, that is what ES guys recommend.

Shards and Replicas

The next stop is the amount of shards and replicas that you need. I would recommend you to start from reading this article.

ES is not a golden source of data in our architecture. And since we have date-based partitioning we always can refill missed or corrupted data. So, we decided to avoid using replicas at all.

As for the amount of shards, we set it equal to amount of data nodes. In this case we'll be able to run our search queries with a higher level of parallelism.


Initial data uploading

And again, before saying anything about data uploading I would recommend you to read the official documentation :)

Disabling indexing

By default ES indexes data every second. It means that documents become available to search right after you put them into an index. The point is you pay for this by performance.

If you want to decrease upload time you have to disable indexing (for all of your indices) by setting index.refresh_interval to -1. After the initial uploading is complete, run indexing manually through the REST API /index-2015*/_refresh or update index settings.

Bulk operations

According to our experience, that is the most valuable step. Using bulk operations on the client side allowed us to increase the initial upload speed by 10 times. After a set of experiments we found that optimal bulk size for our case is 15mb.

As for amount of clients, after 2 threads we stuck with network performance (100m/bit). Technically, the amount of bulk operations threads equals to amount of available processors. So, you can try to play with the amount clients, but don't forget to monitor the bulk queue size (thread.bulk.queue_size) on your cluster. By default, it's unbounded which may cause a memory problems in case if the threads won't be able to process data in time.

Use a node client

Using a node client is more preferable way. Just because your client becomes a part of a cluster, it knows where to route your requests (which nodes it should be executed on). That saves some time, so it's highly recommended to use it at least for initial upload.

But being a part of a cluster means that host and network interface of your client must be accessible for all the other nodes. If for some reason that isn't so, you can use a transport client. It takes one more "hop" per operation, but solves the problem above.

Throttling

Finally, the official documentation recommends us to disable index merge throttling entirely (during the time of the initial uploading). As far as I understand, it doesn't affect on your speed if refresh_interval is turned off. But it definitely will, when you turn it back for initial indexing.


Monitoring

Well, that is the key to success :) All your experiments with settings, all your test/production activity must be observed and analyzed. It's the only way to find a problem and fix it.

I would recommend you to start from reading about Cluster API. It's very useful to understand which kind of information ES can share with you.

The next step is installing a special plugin, which can gather all the info from the API in one place with a pretty interface. We started with Marvel, but... well, it's far from being perfect from both UI and functional points of view. At the moment we are using ElasticHQ and BigDesk.

Finally, when you want to understand what is really going on on the OS level, you have nothing but htop, iftop and iotop. And yes - ES can tell you about its threads, but network and disk subsystems can be properly analyzed only by *top utils.


Interesting facts

Before we finish I want to describe two interesting moments that occurred with us during our journey.

The UnderDesk cluster

Procurement is a big pain in every company. Waiting for hardware that we need for this project might take a lot of time. Hopefully, a lot of people in our company has Linux and Docker installed on their computers (quite powerful computers)! Few Ansible lines, one Docker container... and 1,5TB cluster for development and early time testing is ready. In 2 hours. Effective, isn't it?

OOM

In the very beginning we wasn't able to pass the load testing. Our cluster just collapsed after 30 minutes of uploading process. Logs has a lot of OutOfMemoryError: Java heap space errors and variety of their consequences. The reason was very simple - our test client was writing to indexes randomly. So that every new record required loading a new index from the disk. Seems that ES wasn't able to unload data in time and was failing. That is an invalid case for our system, since we import the data incrementally every day and no updates or deletes supposed. Problem was solved by fixing the test client.


Conclusion

You might think - omg, too many settings, why!? Well, first of all that is ok :) ES provides quite a good out-of-box configuration which allows to solve a lot of common problems without editing its config file. Here we spoke about a bit non standard problem which requires a bit more complicated solution. And ES provides us all the necessary settings to implement it. There is no "Silver Bullets", everything comes with price.

comments powered by Disqus