
ELK as a software monitoring tool

Development is only the first step of an application's lifecycle. Once it is in production (or even in testing) you have to monitor it: watch for errors, memory leaks and how fast its components work. Fortunately, there are a lot of solutions which allow you to do that. Today, I want to briefly describe my experience with one of them - ELK.

ELK is a set of three products created by the Elasticsearch company:

  • Elasticsearch - a Lucene-based distributed search engine;
  • Logstash - a utility for extracting and parsing data (logs);
  • Kibana - a visualization tool.

All these components are well described on their respective pages, so I don't see any reason to repeat that here. Instead, let's consider how this technology can be used for your application.


Architecture

Imagine that we have a distributed application running on N hosts. Each host contains a few components of the application, which write their logs to the local file system (gathering logs on an NFS-like system is not a good idea, right?).

How can we organize the monitoring process in this case? Well... it's very simple! First of all, we need to find a separate host and install Elasticsearch there. Yes, it can be used in distributed mode... but for now let's stick with the simplest solution. On the same host we can install Kibana, since it's a fairly lightweight HTML5 application that runs almost entirely on the client side.

The final step is installing Logstash on all N hosts and configuring it to parse certain logs and send them to the Elasticsearch instance (through its REST API).

Data for monitoring

The first question we should answer is: what kind of data/metrics do we want to monitor? I think we can distinguish at least three areas here:

  • application errors;
  • application metrics: the number of processed tasks, time spent on major operations, etc.;
  • system metrics: CPU usage, heap size, GC pauses (through MBeans and Sigar, or a simple external bash script).

The first option seems obvious - if your application says it has an exception/error, you have to fix it. We can easily grep this kind of information from the logs using a simple pattern like "[ERROR]". But strictly speaking, we also need the date when the error occurred... and it's a good idea to extract the error's description from the message. Fortunately, Logstash knows how to deal with Log4j logs. But what if we want to deal with more complicated, structured data, such as our metrics?
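For example, with a typical PatternLayout (something like "%d [%t] %-5p %c %m%n" - the exact pattern is up to you), a hypothetical error entry would look like this:

2014-01-27 19:52:35,334 [processor-14] ERROR com.example.TaskProcessor Failed to process task: connection timed out

All the parts we care about - timestamp, thread, level, class and message - are right there, and the grok pattern in the "Configuration" section below is built around exactly these pieces.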

For this case there is an old but very effective idea - "if you want your logs to be used not only by humans, write them in some parsable format: CSV, XML or JSON". Sounds great, doesn't it? :)
Of course, Logstash can deal with all these formats. So, all we need is a separate appender for your metrics, with a JSON layout (see the examples below).

After that you'll be able to have all your logs in JSON format! But from my point of view, it is better to keep storing the "classic" logs (including errors) in the non-formatted way (which is more readable for humans) and use JSON only for metrics (which are mostly meant for machine analysis).

{
  "@version":1,
  "@timestamp":"2014-01-27T19:52:35.334Z",
  "host":"app1",
  "component":"WEB",
  "type": "SYS",
  "metric": "CPU_USAGE",
  "value": 0.85
}
{
  "@version":1,
  "@timestamp":"2014-01-27T19:52:35.738Z",
  "host":"app1",
  "component":"BACKEND",
  "thread_name":"processor-14",
  "type": "APP",
  "metric": "BACK-SCENARIO1",
  "value": 13212
}

Hint: in a real log, each JSON object must be written on a single line.

Taking into account that we have Logstash installed on each node and that each component has its own log file, some fields in the example above are redundant and can be moved into the Logstash configuration (e.g. "host"). Also keep in mind that Elasticsearch builds indexes for all the fields you have... so if you are short on free space, do not add data you aren't going to use (or adjust Elasticsearch's behavior).
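To make this a bit more concrete, here is a minimal sketch of such a metrics appender (the logger name and the file path are assumptions on my side; the JSON layout here is the LogstashJsonLayout shown at the end of this article):

# A dedicated logger for metrics, detached from the root logger
log4j.logger.METRICS=INFO, metrics
log4j.additivity.METRICS=false

# A plain file appender with a JSON layout: one JSON object per line
log4j.appender.metrics=org.apache.log4j.FileAppender
log4j.appender.metrics.File=/var/log/your_app/metrics.log.json
log4j.appender.metrics.layout=name.krestjaninoff.log.json.LogstashJsonLayout

The application then simply logs a Map, and the layout turns every entry into a JSON field:

// somewhere in the application code (org.apache.log4j.Logger)
Map<String, Object> metric = new LinkedHashMap<>();
metric.put("component", "BACKEND");
metric.put("type", "APP");
metric.put("metric", "BACK-SCENARIO1");
metric.put("value", 13212);
Logger.getLogger("METRICS").info(metric);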


Installation

You can download tarballs of all the components via the links above, but I would recommend using the package-based approach. That way you won't have to create all the startup scripts, service wrappers and other stuff yourself.


Configuration

Since we aren't going to build an Elasticsearch cluster, its configuration is quite simple:

  • update /etc/elasticsearch/elasticsearch.yml to use the actual IP address (see the snippet below);
  • set up index templates (optional);
  • configure index cleanup by adding Curator to cron.
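For a single-node setup, the Elasticsearch part boils down to a couple of lines (the cluster name and the address below are just placeholders; I won't hard-code the Curator cron entry here, since its exact command line depends on the Curator version you use):

# /etc/elasticsearch/elasticsearch.yml
cluster.name: your-app-monitoring
network.host: 192.168.0.10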

Initial Kibana configuration isn't required at all (taking into account that we installed it on the same host as the Elasticsearch server). But if you want to get real value from ELK, you will have to spend a few hours configuring histograms for your metrics (see the "Results" section).

So, the most interesting part is Logstash. It has a very wide range of capabilities and settings. Let's consider a configuration (/etc/logstash/conf.d/01-your-app.conf) which works with both the "classic" and the "formatted" logs:

input {  
        file {
                start_position => "beginning"
                path => [ "/var/log/your_app/*.log" ]
                sincedb_path => "/var/data/logstash/sincedb"
                type => "file_log4j"
        }
        file {
                start_position => "beginning"
                path => [ "/var/log/your_app/*.log.json" ]
                sincedb_path => "/var/data/logstash/sincedb_json"
                codec => json
                type => "file_json"
        }
}

filter {  
        if [type] == "file_log4j" {

                # Parse Log4j entries
                grok {
                        match => ["message", "%{TIMESTAMP_ISO8601:timestamp}%{SPACE}\[%{DATA:thread}\]%{SPACE}%{LOGLEVEL:level}%{SPACE}%{DATA:classname}%{SPACE}%{GREEDYDATA:message}"]
                        overwrite => [ "message" ]
                }

                # Filter out non-ERRORs
                if [level] != "ERROR" {
                        drop {}
                }

                # Update entry's date
                date {
                        match => [ "timestamp", "yyyy-MM-dd HH:mm:ss,SSS" ]
                        remove_field => [ "timestamp" ]
                }
        }
}
output {  
        elasticsearch {
                host => "elasticsearch.host"
        }
}

Now that our system is configured, it's time to take a look at the results!


Results

Here is an example of a real monitoring system based on ELK. It's too big to paste a whole screenshot of it into this article, so I'll go through it in parts. Also, I slightly GIMPed the screenshots... just... in case :)

First of all, there is a set of filters which are used in the diagrams below. Under ordinary circumstances they are folded. You can also always specify extra filtering criteria (e.g. a time interval).

Below you can find a table with a list of the latest errors. Keep in mind that the data is updated in real time (according to a predefined refresh interval - 5 s).

The last stop is a set of histograms which show us different metrics for different components (how many tasks are in the queue, how long they were stored in the database, the number of user requests to the web interface, etc.).

Going further

As you can see from the text above, storing metrics on the file system is a redundant step - we don't need them there. Instead, we can send them directly to Logstash. This leads us to a new architecture:

Here, we can have only one Logstash instance (on the same host as Elasticsearch) listening on a UDP socket. In this case we avoid writing data to the local FS and configure our Log4j appender to write directly to Logstash. This approach is slightly reminiscent of how syslog works.

Log4J config:

# Root logger option
log4j.rootLogger=INFO, logstash

# Direct log messages to Logstash
log4j.appender.logstash=org.apache.log4j.receivers.net.UDPAppender  
log4j.appender.logstash.remoteHost=logstash.host  
log4j.appender.logstash.port=6666  
log4j.appender.logstash.application=YourProject  
log4j.appender.logstash.encoding=UTF-8  
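# Custom JSON layout - its source code is shown at the end of this article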
log4j.appender.logstash.layout=name.krestjaninoff.log.json.LogstashJsonLayout  



Logstash config:

input {  
        udp {
                port => 6666
                codec => json
                type => "udp_json"
        }
}

...

output {  
        elasticsearch {
                host => "localhost"
        }
}



LogstashJsonLayout code (just a proof of concept):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TimeZone;

import org.apache.log4j.Layout;
import org.apache.log4j.spi.LoggingEvent;

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;

/**
 * JSON layout for Logstash
 */
public class LogstashJsonLayout extends Layout {

    private final Gson gson = new GsonBuilder().create();
    private final String hostname = getHostname().toLowerCase();
    private final String username = System.getProperty("user.name").toLowerCase();

    @Override
    public String format(LoggingEvent le) {
        Map<String, Object> message = new LinkedHashMap<>();

        message.put("@timestamp", le.timeStamp);
        message.put("hostname", hostname);
        message.put("username", username);
        message.put("level", le.getLevel().toString());
        message.put("thread", le.getThreadName());

        if (le.getMessage() instanceof Map) {
            // Metrics are logged as a Map (see above), so merge its entries into the document
            @SuppressWarnings("unchecked")
            Map<String, Object> fields = (Map<String, Object>) le.getMessage();
            message.putAll(fields);

        } else {
            message.put("message", null != le.getMessage() ? le.getMessage().toString() : null);
        }

        return gson.toJson(message) + "\n";
    }

    private static String getHostname() {
        String hostname;
        try {
            hostname = java.net.InetAddress.getLocalHost().getHostName();
        } catch (Exception e) {
            hostname = "Unknown, " + e.getMessage();
        }
        return hostname;
    }

    @Override
    public boolean ignoresThrowable() {
        return false;
    }

    @Override
    public void activateOptions() {
    }
}