Tuning tips
Okay, so you’ve got everything all set up, data is flowing through, and…what? You can’t consume everything on time? Allow me to help:
1. Try increasing settings.CHUNK_SIZE
- this increases the size of a chunk of
metrics that gets added onto the queue. Bigger chunks == smaller network traffic.
2. Try increasing settings.WORKER_PROCESSES
- this will add more workers to
consume metrics off the queue and insert them into Redis.
3. Try decreasing settings.ANALYZER_PROCESSES
- this all runs on one box (for
now), so share the resources!
4. Still can’t fix the performance? Try reducing your settings.FULL_DURATION. If this is set to be too long, Redis will buckle under the pressure.
5. Is your analyzer taking too long? Maybe you need to make your algorithms faster, or use fewer algorithms in your ensemble.
6. Reduce your metrics! If you’re using StatsD, it will spit out lots of variations for each metric (sum, median, lower, upper, etc). These are largely identical, so it might be worth it to put them in settings.SKIP_LIST.
7. Disable Oculus - if you set settings.OCULUS_HOST
to ‘’, Skyline will not
write metrics into the mini. namespace - this should result in dramatic speed
improvements. At Etsy, they had a flow of about 5k metrics coming in every
second on average (with 250k distinct metrics). They used a 32 core Sandy Bridge
box, with 64 gb of memory. We experience bursts of up to 70k TPS on Redis. Here
are their relevant settings:
CHUNK_SIZE: 7000 WORKER_PROCESSES: 2 ANALYZER_PROCESSES: 25 FULL_DURATION: 86400
Smaller deployments
Skyline runs OK on much less. It can handle ~45000 metrics per minute on a 4 vCore, 4GB RAM cloud SSD server where the metric resolution is 1 datapoint per 60 seconds, it will run loaded, but OK.
Do take note of the notes in settings.py related to the settings.ANALYZER_PROCESSES
and settings.ANALYZER_OPTIMUM_RUN_DURATION
if you are only processing a few
1000 metrics with a datapoint every minute then the optimum settings will most
likely be something similar to:
ANALYZER_PROCESSES = 1
ANALYZER_OPTIMUM_RUN_DURATION = 60
Reliability
Skyline has been verified running in production against 6.5 million request per minute peak internet advertising infrastructure, since 20131016.
However it should be noted that something should monitor the Skyline processes. Over a long enough timeline Python or Redis or some I/O issue is going to lock up and the Python process is just going to hang. This means that it is not really sufficient to just monitor the process with something like monit.
Skyline has made some progress in monitoring its own process threads and keeping itself running sanely, however not every possible issue can be mitigated against, therefore some another external to Python monitoring can help.