The key problem with every monitoring system I've ever seen is that you need a person to watch it. But what if you don't have one, or you don't want them to spend all their time staring at logs?
Let's say you've just set up ELK-based monitoring and expect all of your problems to be gone. Unfortunately, the truth is: no, they aren't :( In the real world nobody on your team wants to spend all their time analyzing Kibana output. And even if you have a dedicated support person on the team, having them do nothing but that looks like a waste of resources.
So we need an effective notification mechanism, one that gets the attention of the people who care when an error occurs. The usual options fall short:
- emails or chat messages may be ignored (e.g. I read them only three times a day due to a personal time management policy);
- SMS is a bit more difficult to implement and... much more annoying :)
The solution I want to talk about here is visualization. And by "visualization" I mean a simple monitor connected to the most useless computer in your office. Once you install it, you can display the Kibana console with error messages from your application. But let's be honest: these messages will be too small and won't encourage people to start fixing a problem (I'm not even sure a new error would be noticed at all).
In other words, we need a more effective way to highlight the app's problems. That's where Dashing comes into play. Just click here and switch your browser into full-screen mode (F11 for Chrome). Awesome, isn't it? Now imagine that every part of your application (i.e. a service / Docker container) can be represented as a separate widget which is green (if it's OK), yellow/red (if a warning/error occurred) or black (if no logs are available).
Something like this:
This is a custom Dashing application with a set of special widgets. Each of them displays information about a specific service. The information is retrieved through the Elasticsearch REST API. Nice, isn't it?
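To make the idea more concrete, here is a minimal Ruby sketch of how a widget could pick its color from log counts. It is not the code from the project, just an illustration: the field names `service`, `level` and `@timestamp`, the level names `ERROR` and `WARN`, the 5-minute window and all three helper names are my assumptions.

```ruby
# Builds an Elasticsearch _search body that counts log entries per level
# for one service over the last few minutes.
# Assumption: documents carry "service", "level" and "@timestamp" fields,
# and "service"/"level" are indexed as keyword fields.
def level_counts_query(service, minutes = 5)
  {
    size: 0, # we only need the aggregation, not the log lines themselves
    query: {
      bool: {
        filter: [
          { term: { service: service } },
          { range: { '@timestamp' => { gte: "now-#{minutes}m" } } }
        ]
      }
    },
    aggs: { levels: { terms: { field: 'level' } } }
  }
end

# Turns the parsed JSON response into a hash like { "INFO" => 12, "ERROR" => 1 }.
def counts_from_response(response)
  buckets = response.dig('aggregations', 'levels', 'buckets') || []
  buckets.to_h { |bucket| [bucket['key'], bucket['doc_count']] }
end

# Maps per-level counts to a widget color:
# black = no logs at all, red = at least one error,
# yellow = at least one warning, green = everything else.
def widget_color(counts)
  return 'black' if counts.values.sum.zero?
  return 'red' if counts.fetch('ERROR', 0) > 0
  return 'yellow' if counts.fetch('WARN', 0) > 0

  'green'
end
```

The query body would be sent as JSON to the `_search` endpoint of the index that holds your logs, and the resulting color would then be pushed to the widget. Checking "black" first matters: a service that logs nothing should never show up as green.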
The original idea was to use the Docker API to retrieve all the necessary metrics. In one respect that is more convenient, because you can easily tell whether your application is running or not. But that is the only advantage :) Parsing raw logs and gathering info from different hosts makes this approach quite painful. So, after a week of trying it out, we moved to the ES API. That requires your services to write "I'm alive" messages periodically, but you get all the logs in one place (with indices and an awesome API for retrieving them).
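The "I'm alive" check boils down to comparing the timestamp of the newest log entry with the current time. A rough sketch, assuming the timestamp arrives as an ISO 8601 string (as in the usual `@timestamp` field); the helper name `silent?` and the 120-second threshold are made up for this example:

```ruby
require 'time'

# Returns true when a service has been silent for too long, i.e. its newest
# log entry is older than max_silence seconds (or there is no entry at all).
# Assumption: last_seen is an ISO 8601 string such as "2016-01-01T12:00:00Z".
def silent?(last_seen, now: Time.now, max_silence: 120)
  return true if last_seen.nil?

  now - Time.iso8601(last_seen) > max_silence
end
```

A widget for which `silent?` returns true would go black. The threshold should be a bit larger than the interval at which the service writes its heartbeat, otherwise the widget will flicker.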
The next problem was about colors and the reaction of the people around us. I spent a lot of time explaining to non-developers from nearby teams that RED doesn't mean a service is down and we are all going to die: it just tells us that an error occurred. But people are stubborn :)
That's why you have to be really careful about what you consider an error. Remember, an input form constraint violation by a user isn't an ERROR, it's not even a WARNing! So clean your code of false-positive error messages before turning such monitoring on :)
Also keep in mind that "Error occurred" isn't the best message to put into your logs :)
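Here is a small sketch of what that can look like with Ruby's standard `Logger`. The `ValidationError` class, the `handle_request` helper and the message texts are hypothetical, they only illustrate the two rules above: user mistakes are logged below WARN, and real errors say what failed and why.

```ruby
require 'logger'

# Hypothetical exception raised when user input violates a form constraint.
class ValidationError < StandardError; end

# Runs the given block and logs failures at a level that matches their nature.
def handle_request(logger, order_id)
  yield
rescue ValidationError => e
  # The user's mistake, not ours: INFO keeps the dashboard green.
  logger.info("order #{order_id} rejected: #{e.message}")
rescue StandardError => e
  # A real problem: say what failed and why, not just "Error occurred".
  logger.error("order #{order_id} failed while saving: #{e.class}: #{e.message}")
end
```

With a logger created as `Logger.new($stdout)`, a call like `handle_request(logger, 42) { raise ValidationError, 'quantity must be positive' }` produces an INFO line, while any other exception raised inside the block ends up as an ERROR with the order id and the exception class in the message.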
Plus, you have to carefully manage all the errors you find (log them in Jira before marking them as known, track them, etc.), otherwise everyone gets used to the RED screen... and no one will care.
And the main part: your team must be interested in solving the problems, otherwise all this is just a waste of time. You can't solve all the problems alone (especially in the beginning).
In conclusion, remember that this isn't a standalone application. Its one and only mission is to shout about a problem. For further analysis you have to use another tool: Kibana.
For those who are interested, here is the link to the GitHub project. My Ruby is far from perfect, but that is not the point of this article :)