skyline.analyzer package

Submodules

skyline.analyzer.agent module

skyline.analyzer.agent_batch module

class AnalyzerBatchAgent[source]

Bases: object

The AnalyzerBatchAgent class does the follow:

ensures that the required OS resources as defined by the various settings are available for the app.

run()[source]

run()[source]

Check that all the ALGORITHMS can be run.

Start the AnalyzerBatchAgent.

Start the logger.

skyline.analyzer.alerters module

LOCAL_DEBUG = False

Create any alerter you want here. The function will be invoked from trigger_alert.

Three arguments will be passed, two of them tuples: alert and metric, and context

alert: the tuple specified in your settings:

alert[0]: The matched substring of the anomalous metric

alert[1]: the name of the strategy being used to alert

alert[2]: The timeout of the alert that was triggered

alert[3]: The type [optional for http_alerter only] (dict)

The snab_details [optional for SNAB and slack only] (list)

Alert testing (dict) e.g. {‘test alert’: True}

alert[4]: The snab_details [optional for SNAB and slack only] (list)

The anomaly_id [optional for http_alerter only] (list)

metric: information about the anomaly itself

metric[0]: the anomalous value

metric[1]: The full name of the anomalous metric

metric[2]: anomaly timestamp

context: app name

alert_smtp(alert, metric, context)[source]: Called by trigger_alert() and sends an alert via smtp to the recipients that are configured for the metric.

alert_pagerduty(alert, metric, context)[source]: Called by trigger_alert() and sends an alert via PagerDuty

alert_hipchat(alert, metric, context)[source]: Called by trigger_alert() and sends an alert the hipchat room that is configured in settings.py.

alert_syslog(alert, metric, context)[source]: Called by trigger_alert() and logs anomalies to syslog.

alert_stale_digest(alert, metric, context)[source]: Called by trigger_alert() and sends a digest alert via smtp of the stale metrics to the default recipient

alert_slack(alert, metric, context)[source]

alert_http(alert, metric, context)[source]: Called by trigger_alert() and sends and resend anomalies to a http endpoint.

alert_sms(alert, metric, context)[source]: Called by trigger_alert() and sends anomalies to a SMS endpoint.

trigger_alert(alert, metric, context)[source]

Called by skyline.analyzer.Analyzer.spawn_alerter_process to trigger an alert.

Analyzer passes three arguments, two of them tuples. The alerting strategy is determined and the approriate alert def is then called and passed the tuples.

Parameters:

alert –
The alert tuple specified in settings.py e.g. (‘stats.*’, ‘smtp’, 3600, 168)

alert[0]: The matched substring of the anomalous metric (str)

alert[1]: the name of the strategy being used to alert (str)

alert[2]: The timeout of the alert that was triggered (int)

alert[3]: The type [optional for http_alerter only] (dict)

The snab_details [optional for SNAB and slack only] (list)

Alert testing (dict) e.g. {‘test alert’: True}

alert[4]: The snab_details [optional for SNAB and slack only] (list)

The anomaly_id [optional for http_alerter only] (list)
metric –
The metric tuple e.g. (2.345, ‘server-1.cpu.user’, 1462172400)

metric[0]: the anomalous value (float)

metric[1]: The base_name of the anomalous metric (str)

metric[2]: anomaly timestamp (float or int)
context (str) – app name

skyline.analyzer.algorithms module

USE_NUMBA = True

This is no man’s land. Do anything you want in here, as long as you return a boolean that determines whether the input timeseries is anomalous or not.

The key here is to return a True or False boolean.

You should use the pythonic except mechanism to ensure any excpetions do not cause things to halt and the record_algorithm_error utility can be used to sample any algorithm errors to log.

To add an algorithm, define it here, and add its name to settings.ALGORITHMS.

tail_avg(timeseries, series)[source]: This is a utility function used to calculate the average of the last three datapoints in the series as a measure, instead of just the last datapoint. It reduces noise, but it also reduces sensitivity and increases the delay to detection.

numba_median_absolute_deviation(y_np_array)[source]: This is a numba implementation of median_absolute_deviation, it speeds up the computation on a 1000 timeseries from 1.593684 seconds to 0.029051 seconds.

median_absolute_deviation(timeseries, series)[source]: A timeseries is anomalous if the deviation of its latest datapoint with respect to the median is X times larger than the median of deviations.

grubbs(timeseries, series)[source]: A timeseries is anomalous if the Z score is greater than the Grubb’s score.

first_hour_average(timeseries, series)[source]: Calcuate the simple average over one hour, FULL_DURATION seconds ago. A timeseries is anomalous if the average of the last three datapoints are outside of three standard deviations of this value.

numba_stddev_from_average(series, t)[source]

stddev_from_average(timeseries, series)[source]: A timeseries is anomalous if the absolute value of the average of the latest three datapoint minus the moving average is greater than three standard deviations of the average. This does not exponentially weight the MA and so is better for detecting anomalies with respect to the entire series.

stddev_from_moving_average(timeseries, series)[source]: A timeseries is anomalous if the absolute value of the average of the latest three datapoint minus the moving average is greater than three standard deviations of the moving average. This is better for finding anomalies with respect to the short term trends.

numba_mean_subtraction_cumulation(y_np_array)[source]: This is a numba implementation of mean_subtraction_cumulation, it speeds up the computation on a 1000 timeseries from 7.042794 seconds to 0.041275 seconds.

mean_subtraction_cumulation(timeseries, series)[source]: A timeseries is anomalous if the value of the next datapoint in the series is farther than three standard deviations out in cumulative terms after subtracting the mean from each data point.

numba_projected_errors(x, y, m, c)[source]: This is a numba implementation of the original calculation of the errors loop. It speeds up the loop on a 1000 timeseries from 8.727679 seconds to 2.785311 seconds.

least_squares(timeseries, series)[source]: A timeseries is anomalous if the average of the last three datapoints on a projected least squares model is greater than three sigma.

get_bin_edges(a, bins)[source]: https://numba.pydata.org/numba-examples/examples/density_estimation/histogram/results.html

compute_bin(x, bin_edges)[source]: https://numba.pydata.org/numba-examples/examples/density_estimation/histogram/results.html

numba_histogram(a, bins)[source]: https://numba.pydata.org/numba-examples/examples/density_estimation/histogram/results.html

numba_histogram_bins(t, y_np_array)[source]: Pass the tail average and the y np array

histogram_bins(timeseries, series)[source]

A timeseries is anomalous if the average of the last three datapoints falls into a histogram bin with less than 20 other datapoints (you’ll need to tweak that number depending on your data)

Returns: the size of the bin which contains the tail_avg. Smaller bin size means more anomalous.

ks_test(timeseries, series)[source]: A timeseries is anomalous if 2 sample Kolmogorov-Smirnov test indicates that data distribution for last 10 minutes is different from last hour. It produces false positives on non-stationary series so Augmented Dickey-Fuller test applied to check for stationarity.

get_function_name()[source]: This is a utility function is used to determine what algorithm is reporting an algorithm error when the record_algorithm_error is used.

record_algorithm_error(algorithm_name, traceback_format_exc_string)[source]

This utility function is used to facilitate the traceback from any algorithm errors. The algorithm functions themselves we want to run super fast and without fail in terms of stopping the function returning and not reporting anything to the log, so the pythonic except is used to “sample” any algorithm errors to a tmp file and report once per run rather than spewing tons of errors into the log.

Note

algorithm errors tmp file clean up: the algorithm error tmp files are handled and cleaned up in Analyzer after all the spawned processes are completed.

Parameters:

algorithm_name (str) – the algoritm function name
traceback_format_exc_string (str) – the traceback_format_exc string

Returns:

True the error string was written to the algorithm_error_file
False the error string was not written to the algorithm_error_file

Return type:

boolean

determine_median(timeseries)[source]: Determine the median of the values in the timeseries

determine_array_median(array)[source]: Determine the median of the values in an array

identify_airgaps(metric_name, timeseries, airgapped_metrics, airgapped_metrics_filled)[source]

Identify air gaps in metrics to populate the analyzer.airgapped_metrics Redis set with the air gaps if the specific air gap it is not present in the set. If there is a start_airgap timestamp and no end_airgap is set then the metric will be in a current air gap state and/or it will become stale. If the netric starts sending data again, it will have the end_airgap set and be added to the analyzer.airgapped_metrics Redis set. Also Identify if a time series is unordered.

Parameters:

metric_name (str) – the FULL_NAMESPACE metric name
timeseries (list) – the metric time series
airgapped_metrics (list) – the air gapped metrics list generated from the analyzer.airgapped_metrics Redis set

Returns:

list of air gapped metrics and a boolean as to whether the time series is unordered

Return type:

list, boolean

negatives_present(timeseries, series)[source]: Determine if there are negative number present in a time series

is_anomalously_anomalous(metric_name, ensemble, datapoint)[source]: This method runs a meta-analysis on the metric to determine whether the metric has a past history of triggering. TODO: weight intervals based on datapoint

run_selected_algorithm(timeseries, metric_name, airgapped_metrics, airgapped_metrics_filled, run_negatives_present, check_for_airgaps_only, custom_stale_metrics_dict)[source]

Run selected algorithm if not Stale, Boring or TooShort

Parameters:

timeseries (list) – the time series data
metric_name (str) – the full Redis metric name
airgapped_metrics (list) – a list of airgapped metrics
airgapped_metrics_filled (list) – a list of filled airgapped metrics
run_negatives_present (boolean) – whether to determine if there are negative values in the time series
check_for_airgaps_only – whether to only check for airgaps in the time series and NOT do analysis
custom_stale_metrics_dict (dict) – the dictionary containing the CUSTOM_STALE_PERIOD to metrics with a custom stale period defined

Returns:

anomalous, ensemble, datapoint, negatives_found, algorithms_run

Return type:

(boolean, list, float, boolean, list)

skyline.analyzer.algorithms_batch module

LOCAL_DEBUG = False

This is no man’s land. Do anything you want in here, as long as you return a boolean that determines whether the input timeseries is anomalous or not.

The key here is to return a True or False boolean.

You should use the pythonic except mechanism to ensure any excpetions do not cause things to halt and the record_algorithm_error utility can be used to sample any algorithm errors to log.

To add an algorithm, define it here, and add its name to settings.ALGORITHMS.

tail_avg(timeseries, use_full_duration)[source]: This is a utility function used to calculate the average of the last three datapoints in the series as a measure, instead of just the last datapoint. It reduces noise, but it also reduces sensitivity and increases the delay to detection.

median_absolute_deviation(timeseries, use_full_duration)[source]: A timeseries is anomalous if the deviation of its latest datapoint with respect to the median is X times larger than the median of deviations.

grubbs(timeseries, use_full_duration)[source]: A timeseries is anomalous if the Z score is greater than the Grubb’s score.

first_hour_average(timeseries, use_full_duration)[source]: Calcuate the simple average over one hour, use_full_duration seconds ago. A timeseries is anomalous if the average of the last three datapoints are outside of three standard deviations of this value.

stddev_from_average(timeseries, use_full_duration)[source]: A timeseries is anomalous if the absolute value of the average of the latest three datapoint minus the moving average is greater than three standard deviations of the average. This does not exponentially weight the MA and so is better for detecting anomalies with respect to the entire series.

stddev_from_moving_average(timeseries, use_full_duration)[source]: A timeseries is anomalous if the absolute value of the average of the latest three datapoint minus the moving average is greater than three standard deviations of the moving average. This is better for finding anomalies with respect to the short term trends.

mean_subtraction_cumulation(timeseries, use_full_duration)[source]: A timeseries is anomalous if the value of the next datapoint in the series is farther than three standard deviations out in cumulative terms after subtracting the mean from each data point.

least_squares(timeseries, use_full_duration)[source]: A timeseries is anomalous if the average of the last three datapoints on a projected least squares model is greater than three sigma.

histogram_bins(timeseries, use_full_duration)[source]

A timeseries is anomalous if the average of the last three datapoints falls into a histogram bin with less than 20 other datapoints (you’ll need to tweak that number depending on your data)

Returns: the size of the bin which contains the tail_avg. Smaller bin size means more anomalous.

ks_test(timeseries, use_full_duration)[source]: A timeseries is anomalous if 2 sample Kolmogorov-Smirnov test indicates that data distribution for last 10 minutes is different from last hour. It produces false positives on non-stationary series so Augmented Dickey-Fuller test applied to check for stationarity.

get_function_name()[source]: This is a utility function is used to determine what algorithm is reporting an algorithm error when the record_algorithm_error is used.

record_algorithm_error(algorithm_name, traceback_format_exc_string)[source]

This utility function is used to facilitate the traceback from any algorithm errors. The algorithm functions themselves we want to run super fast and without fail in terms of stopping the function returning and not reporting anything to the log, so the pythonic except is used to “sample” any algorithm errors to a tmp file and report once per run rather than spewing tons of errors into the log.

Note

algorithm errors tmp file clean up: the algorithm error tmp files are handled and cleaned up in Analyzer after all the spawned processes are completed.

Parameters:

algorithm_name (str) – the algoritm function name
traceback_format_exc_string (str) – the traceback_format_exc string

Returns:

True the error string was written to the algorithm_error_file
False the error string was not written to the algorithm_error_file

Return type:

boolean

determine_median(timeseries, use_full_duration)[source]: Determine the median of the values in the timeseries

determine_array_median(array)[source]: Determine the median of the values in an array

negatives_present(timeseries, use_full_duration)[source]: Determine if there are negative number present in a time series

run_selected_batch_algorithm(timeseries, metric_name, run_negatives_present)[source]: Filter timeseries and run selected algorithm.

skyline.analyzer.analyzer module

analyzer.py

class Analyzer(parent_pid)[source]

Bases: Thread

The Analyzer class which controls the analyzer thread and spawned processes.

check_if_parent_is_alive()[source]: Self explanatory

uniq_datapoints(timeseries)[source]

spawn_alerter_process(alert, metric, context)[source]

Spawn a process to trigger an alert.

This is used by smtp alerters so that matplotlib objects are cleared down and the alerter cannot create a memory leak in this manner and plt.savefig keeps the object in memory until the process terminates. Seeing as data is being surfaced and processed in the alert_smtp context, multiprocessing the alert creation and handling prevents any memory leaks in the parent.

Added 20160814 relating to:

Bug #1558: Memory leak in Analyzer
Issue #21 Memory leak in Analyzer see https://github.com/earthgecko/skyline/issues/21

Parameters as per skyline.analyzer.alerters.trigger_alert

spin_process(i_process, unique_metrics)[source]

Assign a bunch of metrics for a process to analyze.

Multiple get the assigned_metrics to the process from Redis.

For each metric:

unpack the raw_timeseries for the metric.
Analyse each timeseries against ALGORITHMS to determine if it is anomalous.
If anomalous add it to the Redis set analyzer.anomalous_metrics
Add what algorithms triggered to the self.anomaly_breakdown_q queue
If settings.ENABLE_CRUCIBLE is True:
- Add a crucible data file with the details about the timeseries and anomaly.
- Write the timeseries to a json file for crucible.

Add keys and values to the queue so the parent process can collate for:

self.anomaly_breakdown_q
self.exceptions_q

run()[source]

Called when the process intializes.
Determine if Redis is up and discover the number of unique metrics.
Divide the unique_metrics between the number of ANALYZER_PROCESSES and assign each process a set of metrics to analyse for anomalies.
Wait for the processes to finish.
Determine whether if any anomalous metrics require:
- Alerting on (and set EXPIRATION_TIME key in Redis for alert).
- Feed to another module e.g. mirage.
- Alert to syslog.
Populate the webapp json with the anomalous_metrics details.
Log the details about the run to the skyline analyzer log.
Send skyline.analyzer metrics to GRAPHITE_HOST

skyline.analyzer.analyzer_batch module

analyzer_batch.py

class AnalyzerBatch(parent_pid)[source]

Bases: Thread

The AnalyzerBatch class which controls the analyzer.batch thread and spawned processes.

check_if_parent_is_alive()[source]: Self explanatory

spin_batch_process(i, run_timestamp, metric_name, last_analyzed_timestamp, batch=[])[source]

Assign a metric and last_analyzed_timestamp for a process to analyze.

Parameters:

i – python process id
run_timestamp – the epoch timestamp at which this process was called
metric_name – the FULL_NAMESPACE metric name as keyed in Redis
last_analyzed_timestamp – the last analysed timestamp as recorded in the Redis key last_timestamp.basename key.

Returns:

returns True

run()[source]

Called when the process intializes.
Determine if Redis is up and discover the number of unique metrics.
Divide the unique_metrics between the number of ANALYZER_PROCESSES and assign each process a set of metrics to analyse for anomalies.
Wait for the processes to finish.
Determine whether if any anomalous metrics require:
- Alerting on (and set EXPIRATION_TIME key in Redis for alert).
- Feed to another module e.g. mirage.
- Alert to syslog.
Populate the webapp json with the anomalous_metrics details.
Log the details about the run to the skyline analyzer log.
Send skyline.analyzer metrics to GRAPHITE_HOST

skyline.analyzer.analyzer_labelled_metrics module

analyzer_labelled_metrics.py

class AnalyzerLabelledMetrics(parent_pid)[source]

Bases: Thread

The AnalyzerLabelledMetrics class which controls the analyzer_labelled_metrics thread and spawned processes.

check_if_parent_is_alive()[source]: Self explanatory

spawn_alerter_process(alert, metric, context)[source]

Spawn a process to trigger an alert.

This is used by smtp alerters so that matplotlib objects are cleared down and the alerter cannot create a memory leak in this manner and plt.savefig keeps the object in memory until the process terminates. Seeing as data is being surfaced and processed in the alert_smtp context, multiprocessing the alert creation and handling prevents any memory leaks in the parent.

Added 20160814 relating to:

Bug #1558: Memory leak in Analyzer
Issue #21 Memory leak in Analyzer see https://github.com/earthgecko/skyline/issues/21

Parameters as per skyline.analyzer.alerters.trigger_alert

metric_name_labels_parser(metric)[source]: Given a Prometheus metric string return a dict of the metric name and labels. :param metric: the prometheus metric :type metric: str :return: metric_dict :rtype: dict

labelled_metrics_spin_process(i_process, assigned_metrics_dict, filters=None)[source]

Assign a bunch of metrics for a process to analyze.

Multiple get the assigned_metrics to the process from Redis.

For each metric:

unpack the raw_timeseries for the metric.
Analyse each timeseries against ALGORITHMS to determine if it is anomalous.
If anomalous add it to the Redis set analyzer.anomalous_metrics
Add what algorithms triggered to the self.anomaly_breakdown_q queue
If settings.ENABLE_CRUCIBLE is True:
- Add a crucible data file with the details about the timeseries and anomaly.
- Write the timeseries to a json file for crucible.

Add keys and values to the queue so the parent process can collate for:

self.anomaly_breakdown_q
self.exceptions_q

run()[source]

Called when the process intializes.
Determine if Redis is up and discover the number of unique metrics.
Divide the unique_labelled_metrics between the number of ANALYZER_LABELLED_METRICS_PROCESSES and assign each process a set of metrics to analyse for anomalies.
Wait for the processes to finish.
Determine whether if any anomalous metrics require:
- Alerting on (and set EXPIRATION_TIME key in Redis for alert).
- Feed to another module e.g. mirage.
- Alert to syslog.
Populate the webapp json with the anomalous_metrics details.
Log the details about the run to the skyline analyzer log.
Send skyline.analyzer metrics to GRAPHITE_HOST

skyline.analyzer.metrics_manager module

metrics_manager.py

class Metrics_Manager(parent_pid)[source]

Bases: Thread

The Metrics_Manager class which controls the metrics_manager thread and spawned processes.

All of this functionality was previously done in the Analyzer thread itself however with 10s of 1000s of metrics, this process can take longer than a minute to achieve, which would make Analyzer lag. All the original commits and references from the Analyzer code have been maintained here, although the logical order has been changed and the blocks ordered in a different, but more appropriate and efficient manner than they were laid out in Analyzer. Further some blocks from Analyzer were removed with the new consolidated methods using sets, they were no longer required.

check_if_parent_is_alive()[source]: Self explanatory

assigned_to_shard(metric_name)[source]: Determine which shard a metric is assigned to.

get_remote_data(remote_skyline_instance, data_required, endpoint, save_file=False)[source]

sync_cluster_files(i)[source]: Fetch any missing training_data and features_profiles directories and files from REMOTE_SKYLINE_INSTANCES

metric_management_process(i)[source]: Create and manage the required lists and Redis sets

run()[source]

Called when the process intializes.
Determine if Redis is up
Spawn a process to manage metrics lists and Redis sets
Wait for the process to finish.
Log the details about the run to the skyline analyzer log.
Send skyline.analyzer.metrics_manager metrics to GRAPHITE_HOST

skyline.analyzer package

Submodules

skyline.analyzer.agent module

skyline.analyzer.agent_batch module

skyline.analyzer.alerters module

skyline.analyzer.algorithms module

skyline.analyzer.algorithms_batch module

skyline.analyzer.analyzer module

skyline.analyzer.analyzer_batch module

skyline.analyzer.analyzer_labelled_metrics module

skyline.analyzer.metrics_manager module

Module contents