Horizon¶
The Horizon service is responsible for collecting, cleaning, and formatting incoming data. It consists of Listeners, Workers, and Roombas. Horizon can be started, stopped, and restarted with the bin/horizon.d script.
Listeners¶
Listeners are responsible for listening to incoming data. There are currently two types: a TCP-pickle listener on port 2024, and a UDP-messagepack listener on port 2025. The Listeners are easily extendible. Once they read a metric from their respective sockets, they put it on a shared queue that the Workers read from. For more on the Listeners, see [Getting Data Into Skyline](getting-data-into-skyline.html).
Workers¶
The workers are responsible for processing metrics off the queue and inserting them into Redis. They work by popping metrics off of the queue, encoding them into Messagepack, and appending them onto the respective Redis key of the metric.
Roombas¶
The Roombas are responsible for trimming and cleaning the data in Redis. You will only have a finite amount of memory on your server, and so if you just let the Workers append all day long, you would run out of memory. The Roomba cycles through each metric in Redis and cuts it down so it is as long as settings.FULL_DURATION. It also dedupes and purges old metrics.
HORIZON_SHARDS¶
This is an advanced feature related to running multiple, distributed Skyline servers (and Graphite instances) in a replicated fashion. This allows for all the Skyline servers to receiver all metrics but only analyze those metrics that are assigned to the specific server (shard). This enables all Skyline servers that are running Horizon to receive the entire metric population stream from multiple Graphite carbon-relays and have Horizon drop (not submit to their Redis instance) any metrics that do not belong to their shard.
In order to achieve this replicated configuration, Graphite also needs to be configured and run appropriately to allow the metrics to be forwarded to the remote Graphite servers. To achieve this an additional carbon-relay-b instance needs to be run on each Graphite server. This additional carbon-relay-b receives metrics from all the other Graphite servers and only relays them to the Horizon shard listen process (default port 2026) and the local Graphite carbon-cache. The primary Graphite carbon-relay process on each Graphite server relays metrics to the normal horizon listen process (default port 2024), to the local carbon-cache and to each remote carbon-relay-b instance.
The carbon-relay-b only relays metrics locally.
This Graphite/Horizon mesh configuration achieves:
- all metrics on all Graphite servers
- each Skyline server receiving all metrics
- each Skyline server Horizon only submitting the metrics assigned to its shard for analysis
This allows the cluster to be reconfigured to should any server or service fail.
If a Skyline server/shard is lost/fails, it can be removed from the
settings.HORIZON_SHARDS
and the metric population will be resharded to
only the remaining online shards. Although the Skyline shard server will not
have the metrics from the other shard in Redis, with the
settings.MIRAGE_AUTOFILL_TOOSHORT
feature enabled, the Skyline Analyzer
process will automatically initiate population of the FULL_DURATION data into
its Redis datastore by adding the newly assigned metrics to mirage.populate_redis
Redis set (queue). Mirage will then populate Redis with the FULL_DURATION data
from Graphite.
When the shards reassignment is in process, the Skyline cluster should be
considered to be running in a degraded state until the failed shard is brought
back online and all shard reassignment is complete. For this configuration to
work in times of failure each Skyline server in the cluster must be able to
handle (number_of_metrics_per_shard + (100/number_of_horizon_shards)) to handle
a failure and shard reassignment. Using other appropriate load management
Skyline configuration options like settings.ANALYZER_DYNAMICALLY_ANALYZE_LOW_PRIORITY_METRICS
and possibly adjusting the namespaces in settings.SKYLINE_FEEDBACK_NAMESPACES
,
etc allows Skyline to manage the load in a best effort manage until the cluster
has been restored to normal configuration.
There are many other different configuration strategies that can be employed run achieve HA and clustering with Skyline and Graphite which can augment the management of load and failover such as haproxy, MariaDB Galera clustering, etc. However, the Skyline/Graphite clustering design outlined here is aimed at ensuring that each Skyline/Graphite server has all the data and can handle any shard assignment.
HORIZON_SHARDS related settings.
HORIZON_SHARDS = {
'skyline-server-1': 0,
'skyline-server-2': 1,
'skyline-server-3': 2,
}
HORIZON_SHARD_PICKLE_PORT = 2026
Shards are 0 indexed.
If enabled the metric population is divided between the number_of_horizon_shards
declared in settings.HORIZON_SHARDS
by calculating the zlib.alder32
hash value (an integer) and applying a modulo to it to determine if the metric
belongs to the Horizon shard number assigned to that server.
Although zlib.alder32
is not a cryptographic hash it is fast and
computes the same value across all Python version and platforms so it does the
job. Further the distribution of metrics is not an exact equal division of the
metrics, but it is good enough.