Troubleshooting

Skyline does a lot of work and because of this you can experience issues.

Generally the key indicators with Skyline are CPU usage, CPU iowait, CPU steal, disk I/O (specifically write_time), skyline.<host>.analyzer.runtime and some key Redis metrics.

Redis performance is one of the best indicators as it is the cornerstone of speed in the Skyline pipeline. Redis cannot be slow. If Redis is running slowly, there is a problem.
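If in doubt, a quick latency check against Redis will tell you whether it is responding promptly. The following is a minimal sketch using redis-py; the host, port and iteration count are illustrative only, adjust them to how your Skyline instance connects to Redis (e.g. pass unix_socket_path if you use a UNIX socket).

# Minimal sketch: time 100 Redis PINGs and report the average and max.
# Connection details are assumptions - adjust to your Redis configuration.
import time
import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379)
timings = []
for _ in range(100):
    start = time.time()
    r.ping()
    timings.append((time.time() - start) * 1000)
print('PING avg: %.3f ms, max: %.3f ms' % (sum(timings) / len(timings), max(timings)))
# On a healthy instance PING should average well under a millisecond; if it
# does not, Redis (or the host it is running on) has a problem.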

If you are running a moderately loaded Skyline instance (or instances) in a cloud environment, over time you will probably experience some performance degradation. These incidents are generally (and in almost all cases) related to:

  • A noisy neighbour on the host machine on which the Skyline VM instance runs, which introduces iowait and resource contention on the host resources that are divided and distributed between all the users of those resources.
  • An actual issue on the host: disk I/O, memory allocation failure, swapping, etc.
  • Or you are simply trying to do too much on the host on which you are running Skyline.

Hopefully you are monitoring the machine/VM on which you run Skyline with Skyline itself; if not, install telegraf and start monitoring it!

Key Skyline performance metrics and things to check if you are experiencing issues are:

  • skyline.<host>.analyzer.runtime
  • CPU user, CPU iowait, CPU steal
  • disk io write_time
  • Swap usage (although swap should always be disabled)
  • The roomba runtime in /var/log/skyline/horizon.log
  • Check that transparent huge pages is disabled and has not been accidentally re-enabled on your kernel. Check with cat /sys/kernel/mm/transparent_hugepage/enabled and it should report always madvise [never] (the setting in square brackets is the active one). If it is not, use echo never > /sys/kernel/mm/transparent_hugepage/enabled to disable it and restart Redis. See the sketch after the commandstats samples below.
  • Redis INFO, specifically rdb_last_bgsave_time_sec
  • Redis INFO commandstats, specifically the usec_per_call stats. In the case of Skyline, the following samples are normal times for instances handling between 1500 and 3000 metrics (with everything, Redis, MariaDB, Graphite, etc, running on the Skyline VM):
# Commandstats
cmdstat_mget:calls=3879,usec=66793268,usec_per_call=17219.20
cmdstat_smembers:calls=747026,usec=175273317,usec_per_call=234.63
cmdstat_sunionstore:calls=687,usec=855735,usec_per_call=1245.61
cmdstat_hgetall:calls=9518,usec=5108903,usec_per_call=536.76
# Commandstats
cmdstat_mget:calls=238831,usec=16884528653,usec_per_call=70696.55
cmdstat_smembers:calls=75653989,usec=24095897878,usec_per_call=318.50
cmdstat_sunionstore:calls=118861,usec=242901100,usec_per_call=2043.57
cmdstat_hgetall:calls=1324879,usec=2020767640,usec_per_call=1525.25
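To make a couple of the checks above repeatable, here is a short sketch that verifies transparent huge pages is disabled and reports rdb_last_bgsave_time_sec from Redis INFO. It has to run on the host Redis runs on, and the connection details are assumptions to adjust to your own configuration.

# Sketch: check transparent huge pages and the last Redis RDB bgsave time.
import redis

with open('/sys/kernel/mm/transparent_hugepage/enabled') as f:
    thp = f.read().strip()
# The active setting is the one in square brackets, e.g. "always madvise [never]"
if '[never]' not in thp:
    print('WARNING: transparent_hugepage is not disabled: %s' % thp)

r = redis.StrictRedis(host='127.0.0.1', port=6379)
info = r.info()
print('rdb_last_bgsave_time_sec: %s' % str(info.get('rdb_last_bgsave_time_sec')))
# A bgsave that starts taking noticeably longer than it normally does on your
# instance is a good indication of disk I/O pressure.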

If you are experiencing very high usec_per_call times on the above commands (much higher than the examples above), you probably have an I/O problem.
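The usec_per_call values can be pulled and compared programmatically. The sketch below reads INFO commandstats via redis-py and flags the commands shown above when they exceed a threshold; the thresholds and connection details are illustrative assumptions, baseline them against your own instance when it is healthy.

# Sketch: flag unusually slow usec_per_call values from INFO commandstats.
import redis

# Illustrative thresholds (microseconds per call) - baseline your own.
THRESHOLDS_USEC_PER_CALL = {
    'cmdstat_mget': 100000,
    'cmdstat_smembers': 1000,
    'cmdstat_sunionstore': 5000,
    'cmdstat_hgetall': 5000,
}

r = redis.StrictRedis(host='127.0.0.1', port=6379)
commandstats = r.info('commandstats')
for cmd, threshold in THRESHOLDS_USEC_PER_CALL.items():
    stats = commandstats.get(cmd)
    if not stats:
        continue
    usec_per_call = stats['usec_per_call']
    flag = ' <-- slow, check disk I/O, iowait and steal' if usec_per_call > threshold else ''
    print('%s usec_per_call: %s%s' % (cmd, usec_per_call, flag))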

The good news is that Skyline is very robust and copes with running in a degraded state. Analyzer may take 3 minutes to get through a single analysis run, it may fail over into waterfall alerting, and it will probably be much more chatty about its own metrics, but it will carry on running. Don't panic; just methodically try to determine where and what the bad actor is.

Often simply opening an issue with your service provider will get the problem resolved, unless you are just doing too much on the machine, in which case scale Skyline.