Skyline does a lot of work, and because of this you can experience performance issues.
Generally the key indicators with Skyline are CPU usage,
CPU iowait, CPU steal, disk I/O,
skyline.<host>.analyzer.runtime and some key Redis metrics.
Redis performance is one of the best indicators as it is the cornerstone of speed in the Skyline pipeline. Redis cannot be slow. If Redis is running slow there is a problem.
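One way to put a number on Redis speed is the per-command usec_per_call figure that Redis reports in its INFO commandstats section. The following is a minimal sketch that parses that text and lists the commands above a threshold. The function names, the sample input and the 100 microsecond threshold are illustrative assumptions, not Skyline code or measured Skyline values; in practice the text would come from something like `redis-cli INFO commandstats`.

```python
def parse_commandstats(info_text):
    """Map command name -> usec_per_call from INFO commandstats text."""
    stats = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line.startswith("cmdstat_"):
            continue  # skip the section header and blank lines
        name, _, fields = line.partition(":")
        for field in fields.split(","):
            key, _, value = field.partition("=")
            if key == "usec_per_call":
                stats[name[len("cmdstat_"):]] = float(value)
    return stats


def slow_commands(stats, threshold_usec):
    """Return (command, usec_per_call) pairs above the threshold, slowest first."""
    slow = [(cmd, usec) for cmd, usec in stats.items() if usec > threshold_usec]
    return sorted(slow, key=lambda item: item[1], reverse=True)


# Made-up sample input for illustration only, not measured Skyline values.
SAMPLE = """# Commandstats
cmdstat_get:calls=1000,usec=2500,usec_per_call=2.50
cmdstat_sadd:calls=200,usec=60000,usec_per_call=300.00
"""

if __name__ == "__main__":
    # The threshold here is a hypothetical value chosen for the example.
    for cmd, usec in slow_commands(parse_commandstats(SAMPLE), threshold_usec=100.0):
        print(f"{cmd}: {usec:.2f} usec per call")
```

A sensible threshold depends on your own baseline, so it is probably more useful to compare against values recorded while the instance was healthy than against any fixed number.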
If you are running one or more moderately loaded Skyline instances in a cloud environment, over time you will probably experience some performance degradation. These incidents are generally (and in almost all cases) related to:
A noisy neighbour on the host machine on which the Skyline VM instance runs, which introduces iowait and resource limiting on host resources that are divided and distributed between all the users of those resources.
An actual issue on the host: disk I/O, memory allocation failure, swap, etc.
Or you are just trying to do too much on the host on which you are running Skyline.
Hopefully you are monitoring the machine/VM on which you run Skyline with Skyline. If not, install telegraf and start monitoring it!
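For a quick look at iowait and steal before proper monitoring is in place, the aggregate CPU counters in /proc/stat can be sampled twice and compared. This is a minimal sketch for Linux only; the function names and the 5 second default interval are assumptions made for illustration and it is not a replacement for telegraf.

```python
import time

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
# Guest time is already included in user time, so it is left out here.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate cpu line found")


def cpu_percentages(before, after):
    """Percentage of CPU time spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in deltas}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def sample_cpu(interval=5.0, path="/proc/stat"):
    """Read /proc/stat twice, `interval` seconds apart, and return percentages."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return cpu_percentages(before, after)
```

Calling `sample_cpu()` on the Skyline host returns a dict whose `iowait` and `steal` entries are the ones to watch. Sustained steal suggests contention from other tenants on the hypervisor, while sustained iowait points at storage.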
Key Skyline performance metrics and things to check if you are experiencing issues are:
CPU user, CPU iowait, CPU steal
Using swap (although swap should always be disabled)
The roomba runtime in
Check that transparent huge pages is disabled and has not been accidentally re-enabled on your kernel. Check with
cat /sys/kernel/mm/transparent_hugepage/enabled
and it should be set to
always madvise [never]
If it is not, use
echo never > /sys/kernel/mm/transparent_hugepage/enabled
(as root) to disable it, then restart Redis.
Redis usec_per_call stats. In the case of Skyline the following samples are normal times for an instance handling between 1500 and 3000 metrics (with everything running on the Skyline VM: Redis, MariaDB, Graphite, etc.)
If you are experiencing very high usec_per_call times on the above metrics (much higher than the examples above) you probably have an I/O problem.
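The swap and transparent huge pages checks from the list above can also be scripted. The sketch below reads the same kernel file as the cat command, plus /proc/meminfo for swap, and returns a list of warnings. It is Linux only, and the function names and warning wording are assumptions made for illustration, not part of Skyline.

```python
def active_thp_mode(enabled_text):
    """Return the bracketed (active) mode from the THP 'enabled' file content."""
    for word in enabled_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError(f"no active mode found in {enabled_text!r}")


def swap_used_kb(meminfo_text):
    """Return the amount of swap in use, in kB, from /proc/meminfo content."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("SwapTotal", "SwapFree"):
            values[key] = int(rest.split()[0])
    return values["SwapTotal"] - values["SwapFree"]


def check_host(thp_path="/sys/kernel/mm/transparent_hugepage/enabled",
               meminfo_path="/proc/meminfo"):
    """Return a list of warning strings for the THP and swap checks."""
    warnings = []
    with open(thp_path) as f:
        mode = active_thp_mode(f.read())
    if mode != "never":
        warnings.append(f"transparent huge pages mode is '{mode}', expected 'never'")
    with open(meminfo_path) as f:
        used = swap_used_kb(f.read())
    if used > 0:
        warnings.append(f"{used} kB of swap in use")
    return warnings
```

Running `check_host()` on the Skyline host and getting an empty list back means both checks passed. The parsing functions take plain text so they can be exercised without touching the real files.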
The good news is that Skyline is very robust and runs OK in a degraded state. Analyzer may take 3 minutes to get through a single analysis run, but it still works fine. It may fail over into waterfall alerting and it will probably be much more chatty about its own metrics, but it will carry on running. Don’t panic, just methodically try to determine where and what the bad actor is.
Often just opening an issue with your service provider will get your issue resolved, unless you are just doing too much on the machine, in which case just scale Skyline.