Welcome to the StackLight Collector plugin for Fuel documentation!¶
Overview¶
Introduction¶
The StackLight Collector Plugin for Fuel is used to install and configure several software components that are used to collect and process all the data that is relevant to provide deep operational insights about your OpenStack environment. These finely integrated components are collectively referred to as the StackLight Collector, or just the Collector.
Note
The Collector has evolved over time, so the term collector is a little bit of a misnomer since it is more of a smart monitoring agent than a mere data collector.
The Collector is the key component of the so-called Logging, Monitoring, and Alerting toolchain of Mirantis OpenStack, also known as StackLight.
The Collector is installed on every node of your OpenStack environment. Each Collector is individually responsible for supporting all the monitoring functions of your OpenStack environment for both the operating system and the services running on the node. The Collector running on the primary controller (the controller which owns the management VIP) is called the Aggregator since it performs additional aggregation and correlation functions. The Aggregator is the central point of convergence for all the faults and anomalies detected at the node level. The fundamental role of the Aggregator is to issue an opinion about the health status of your OpenStack environment at the cluster level. As such, the Collector may be viewed as a monitoring agent for cloud infrastructure clusters.
The main building blocks of the Collector are as follows:
- The collectd daemon, which comes bundled with a collection of monitoring plugins. Some of them are standard collectd plugins while others are purpose-built plugins written in Python to perform various OpenStack services checks.
- Heka, a multifunctional data-processing tool written in Go by Mozilla. Heka supports a number of standard input and output plugins that allow it to ingest data from a variety of sources, including collectd, log files, and RabbitMQ, and to persist the operational data to external back-end servers such as Elasticsearch, InfluxDB, and Nagios for search and further processing.
- A collection of Heka plugins written in Lua, which perform the actual data processing such as running metrics transformations, running alarms, and logs parsing.
Note
An important function of the Collector is to normalize the operational data into an internal Heka message structure representation that can be ingested into Heka's stream-processing pipeline. The stream-processing pipeline uses matching policies to route the Heka messages to the Lua plugins that perform the actual data-computation functions.
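As an illustration, a metric received from collectd and normalized into Heka's standard message schema could look roughly like the sketch below (shown in YAML for readability; the Type, Logger, and Fields names are illustrative and may differ from the exact names used by StackLight):

Uuid: 8e38f435-...              # unique message identifier assigned by Heka
Timestamp: 1465557820000000000  # nanoseconds since the Unix epoch
Type: metric                    # illustrative message type
Logger: collectd                # illustrative name of the input that produced the message
Severity: 6
Hostname: node-1
Payload: ""
Fields:
  name: cpu_idle                # illustrative metric name and value
  value: 97.5
  hostname: node-1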
The following Lua plugins were developed for the Collector:
- decoder plugins sanitize and normalize the ingested data.
- filter plugins process the data.
- encoder plugins serialize the data that is sent to the back-end servers.
The following are the types of data sent by the Collector (and the Aggregator) to the back-end servers:
- The logs and the notifications, which are referred to as events, sent to Elasticsearch for indexing.
- The metrics time series sent to InfluxDB.
- The annotations sent to InfluxDB.
- The OpenStack environment clusters health status sent as passive checks to Nagios.
Note
The annotations are like notification messages that are exposed in Grafana. They contain information about the anomalies and faults that have been detected by the Collector. Annotations basically contain the same information as the passive checks sent to Nagios. In addition, they may contain hints on what can be the root cause of a problem.
Requirements¶
The StackLight Collector plugin 1.0.0 has the following requirements:
Requirement | Version/Comment |
---|---|
Mirantis OpenStack | 8.0, 9.x |
A running Elasticsearch server (for log analytics) | 1.7.4 or higher, the RESTful API must be enabled over port 9200 |
A running InfluxDB server (for metric analytics) | 0.10.0 or higher, the RESTful API must be enabled over port 8086 |
A running Nagios server (for infrastructure alerting) | 3.5 or higher, the command CGI must be enabled |
Prerequisites¶
Prior to installing the StackLight Collector plugin for Fuel, you may want to install the back-end services the collector uses to store the data. These back-end services include the following:
- Elasticsearch
- InfluxDB
- Nagios
There are two installation options:
- Install the back-end services automatically within a Fuel environment using the following Fuel plugins:
  - The StackLight Elasticsearch-Kibana plugin
  - The StackLight InfluxDB-Grafana plugin
  - The StackLight Infrastructure Alerting plugin (Nagios)
- Install the back-end services manually outside of a Fuel environment. In this case, the installation must comply with the requirements of the StackLight Collector plugin.
Limitations¶
The StackLight Collector plugin 1.0.0 has the following limitations:
- The plugin is not compatible with an OpenStack environment deployed with nova-network.
- When you re-execute tasks on deployed nodes using the Fuel CLI, the collectd processes will be restarted on these nodes during the post-deployment phase. See bug #1570850.
Release notes¶
Version 1.0.0¶
The StackLight Collector plugin 1.0.0 for Fuel contains the following updates:
New alarms:
- Monitor RabbitMQ from the Pacemaker point of view
- Monitor all partitions and OSD disk(s)
- Horizon HTTP 5xx errors
- Keystone slow response times
- HDD errors
- SWAP percent usage
- Network packet drops
- Local OpenStack API checks
- Local checks for services: Apache, Memcached, MySQL, RabbitMQ, Pacemaker
- Monitor Nova resource utilization per aggregate (virtual CPUs, memory and disk)
Alarm enhancements:
- Added the group by attribute support for alarm rules
- Added support for pattern matching to filter metric dimensions
Bug fixes:
- Fixed the concurrent execution of logrotate. See #1455104.
- Implemented the capability for the Elasticsearch bulk size to increase when required. See #1617211.
- Implemented the capability to use RabbitMQ management API in place of the rabbitmqctl command.
- Enforced the timezone setting in log processing. See #1633074.
- Improved the resilience of the log_collector. See #1643280.
- Added support for Oslo messaging v2 notifications. See #1648479.
Version 0.10.0¶
In addition to bug fixes, the StackLight Collector plugin 0.10.0 for Fuel contains the following updates:
Separated the processing pipeline for logs and metrics.
Prior to StackLight version 0.10.0, a single instance of the hekad process handled both the logs and the metrics. Starting with StackLight version 0.10.0, the processing of the logs and notifications is separated from the processing of the metrics in two different hekad instances. This allows for better performance and finer control of the flow when the maximum buffer size on disk is reached. For the hekad instance processing the metrics, the buffering policy mandates dropping the metrics when the maximum buffer size is reached. For the hekad instance processing the logs, the buffering policy mandates blocking the entire processing pipeline. This helps avoid losing logs (and notifications) when the Elasticsearch server is inaccessible for a long period of time. As a result, the StackLight Collector now runs two processes on each node:
- One for the log_collector service
- One for the metric_collector service
The metrics derived from logs are now aggregated by the log_collector service.
To avoid flooding the metric_collector with bursts of metrics derived from logs, the log_collector service sends metrics in bulk to the metric_collector service. An example of an aggregated metric derived from logs is openstack_<service>_http_response_time_stats.
Added a diagnostic tool.
A diagnostic tool is now available to help diagnose issues. It checks that StackLight is properly installed and configured across the entire LMA toolchain. For more information, see Diagnostic tool.
Version 0.9.0¶
The StackLight Collector plugin 0.9.0 for Fuel contains the following updates:
- Upgraded to Heka 0.10.0.
- Added the capability to collect libvirt metrics on compute nodes.
- Added the capability to detect spikes of errors in the OpenStack services logs.
- Added the capability to report OpenStack workers status per node.
- Added support for multi-environment deployments.
- Added support for Sahara logs and notifications.
- Bug fixes:
- Added the capability to reconnect to the local RabbitMQ instance if the connection has been lost. See #1503251.
- Enabled buffering for Elasticsearch, InfluxDB, Nagios and TCP outputs to reduce congestion in the Heka pipeline. See #1488717, #1557388.
- Fixed the status for Nova when Midonet is used. See #1531541.
- Fixed the status for Neutron when Contrail is used. See #1546017.
- Increased the maximum number of file descriptors. See #1543289.
- The spawning of several hekad processes is now avoided. See #1561109.
- Removed the monitoring of individual queues of RabbitMQ. See #1549721.
- Added the capability to rotate hekad logs every 30 minutes if necessary. See #1561603.
Version 0.8.0¶
The StackLight Collector plugin 0.8.0 for Fuel contains the following updates:
- Added support for alerting in two different modes:
- Email notifications
- Integration with Nagios
- Upgraded to InfluxDB 0.9.5.
- Upgraded to Grafana 2.5.
- Management of the LMA collector service by Pacemaker on the controller nodes for improved reliability.
- Monitoring of the LMA toolchain components (self-monitoring).
- Added support for configurable alarm rules in the Collector.
Version 0.7.0¶
The initial release of the StackLight Collector plugin. This is a beta version.
Licenses¶
Third-party components¶
Name | Project website | License |
---|---|---|
Heka | https://github.com/mozilla-services/heka | Mozilla Public License |
collectd | http://collectd.org/ | GPLv2 |
Collectd::CPU | http://collectd.org/ | GPLv2 |
Collectd::Disk | http://collectd.org/ | GPLv2 |
Collectd::Df | http://collectd.org/ | GPLv2 |
Collectd::Interface | http://collectd.org/ | GPLv2 |
Collectd::Load | http://collectd.org/ | GPLv2 |
Collectd::Memory | http://collectd.org/ | GPLv2 |
Collectd::Processes | http://collectd.org/ | GPLv2 or later |
Collectd::Swap | http://collectd.org/ | GPLv2 |
Collectd::User | http://collectd.org/ | GPLv2 |
Collectd::LogFile | http://collectd.org/ | GPLv2 |
Collectd::WriteHttp | http://collectd.org/ | GPLv2 |
Collectd::MySQL | http://collectd.org/ | GPLv2 |
Collectd::DBI | http://collectd.org/ | GPLv2 |
Collectd::Apache | http://collectd.org/ | GPLv2 |
Collectd::Python | http://collectd.org/ | MIT |
Collectd::Python::RabbitMQ | http://collectd.org/ | Apache v2 |
Collectd::Python::HAProxy | http://collectd.org/ | Permissive |
Puppet modules¶
Name | Project website | License |
---|---|---|
puppet-collectd | https://github.com/puppet-community/puppet-collectd | Apache v2 |
puppetlabs-apache | https://github.com/puppetlabs/puppetlabs-apache | Apache v2 |
puppetlabs-stdlib | https://github.com/puppetlabs/puppetlabs-stdlib | Apache v2 |
puppetlabs-inifile | https://github.com/puppetlabs/puppetlabs-inifile | Apache v2 |
puppetlabs-concat | https://github.com/puppetlabs/puppetlabs-concat | Apache v2 |
puppetlabs-firewall | https://github.com/puppetlabs/puppetlabs-firewall | Apache v2 |
openstack-cinder | https://github.com/openstack/puppet-cinder | Apache v2 |
openstack-glance | https://github.com/openstack/puppet-glance | Apache v2 |
openstack-heat | https://github.com/openstack/puppet-heat | Apache v2 |
openstack-keystone | https://github.com/openstack/puppet-keystone | Apache v2 |
openstack-neutron | https://github.com/openstack/puppet-neutron | Apache v2 |
openstack-nova | https://github.com/openstack/puppet-nova | Apache v2 |
openstack-openstacklib | https://github.com/openstack/puppet-openstacklib | Apache v2 |
References¶
- The StackLight Collector plugin project at GitHub
- The StackLight Elasticsearch-Kibana plugin project at GitHub
- The StackLight InfluxDB-Grafana plugin project at GitHub
- The StackLight Infrastructure Alerting plugin project at GitHub
- The official Kibana documentation
- The official Elasticsearch documentation
- The official InfluxDB documentation
- The official Grafana documentation
- The official Nagios documentation
Installing StackLight Collector plugin for Fuel¶
Install using the RPM file of the Fuel plugins catalog¶
To install the StackLight Collector Fuel plugin using the RPM file of the Fuel plugins catalog:
Go to the Fuel plugins catalog.
From the Filter drop-down menu, select the Mirantis OpenStack version you are using and the Monitoring category.
Download the RPM file.
Copy the RPM file to the Fuel Master node:
[root@home ~]# scp lma_collector-1.0.0-1.0.0-1.noarch.rpm \
root@<Fuel Master node IP address>:
Install the plugin using the Fuel Plugins CLI:
[root@fuel ~]# fuel plugins --install lma_collector-1.0.0-1.0.0-1.noarch.rpm
Verify that the plugin is installed correctly:
[root@fuel ~]# fuel plugins --list
id | name          | version | package_version
---|---------------|---------|----------------
1  | lma_collector | 1.0.0   | 4.0.0
Install from source¶
Alternatively, you may want to build the plugin RPM file from source if, for example, you want to test the latest features of the master branch or customize the plugin.
Note
Running a Fuel plugin that you built yourself is at your own risk and will not be supported.
To install the StackLight Collector Plugin from source, first prepare an environment to build the RPM file. The recommended approach is to build the RPM file directly onto the Fuel Master node so that you will not have to copy that file later on.
To prepare an environment and build the plugin:
Install the standard Linux development tools:
[root@home ~] yum install createrepo rpm rpm-build dpkg-devel
Install the Fuel Plugin Builder. To do that, you should first get pip:
[root@home ~] easy_install pip
Then install the Fuel Plugin Builder (the fpb command line) with pip:
[root@home ~] pip install fuel-plugin-builder
Note
You may also need to build the Fuel Plugin Builder if the package version of the plugin is higher than the package version supported by the Fuel Plugin Builder you get from pypi. For instructions on how to build the Fuel Plugin Builder, see the Install Fuel Plugin Builder section of the Fuel Plugin SDK Guide.
Clone the plugin repository:
[root@home ~] git clone https://github.com/openstack/fuel-plugin-lma-collector.git
Verify that the plugin is valid:
[root@home ~] fpb --check ./fuel-plugin-lma-collector
Build the plugin:
[root@home ~] fpb --build ./fuel-plugin-lma-collector
To install the plugin:
Once you have created the RPM file, install the plugin:
[root@fuel ~] fuel plugins --install ./fuel-plugin-lma-collector/*.noarch.rpm
Verify that the plugin is installed correctly:
[root@fuel ~]# fuel plugins --list
id | name          | version | package_version
---|---------------|---------|----------------
1  | lma_collector | 1.0.0   | 4.0.0
Configuring StackLight Collector plugin for Fuel¶
Plugin configuration¶
To configure the StackLight Collector plugin:
Create a new environment as described in Create a new OpenStack environment.
In the Fuel web UI, click the Settings tab and select the Other category.
Scroll down through the settings until you find the StackLight Collector Plugin section.
Select The Logging, Monitoring and Alerting (LMA) Collector Plugin and fill in the required fields as indicated below.
- Optional. Provide an Environment Label of your choice to tag your data.
- In the Events Analytics section, select Local node if you plan to use the Elasticsearch-Kibana Plugin in the environment. Otherwise, select Remote server and specify the fully qualified name or the IP address of an external Elasticsearch server.
- In the Metrics Analytics section, select Local node if you plan to use the InfluxDB-Grafana Plugin in the environment. Otherwise, select Remote server and specify the fully qualified name or the IP address of an external InfluxDB server. Then, specify the InfluxDB database name you want to use, a username and password that have read and write access permissions.
- In the Alerting section, select Alerts sent by email if you want to receive alerts sent by email from the Collector. Otherwise, select Alerts sent to a local cluster if you plan to use the Infrastructure Alerting Plugin (Nagios) in the environment. Alternatively, select Alerts sent to a remote Nagios server.
- For Alerts sent by email, specify the SMTP authentication method you want to use. Then, specify the SMTP server fully qualified name or IP address, and an SMTP username and password with permission to send emails.
Configure your environment as described in Configure your environment.
Note
By default, StackLight is configured to use the management network of the so-called default node network group created by Fuel. While this default setup may be appropriate for small deployments or evaluation purposes, it is recommended that you not use the default management network for StackLight. Instead, create a dedicated network when configuring your environment. This will improve the overall performance of both OpenStack and StackLight and facilitate the access to the Kibana and Grafana analytics.
Deploy your environment as described in Deploy an OpenStack environment.
Note
The StackLight Collector Plugin is a hot-pluggable plugin. Therefore, it is possible to install and deploy the collector in an environment that is already deployed. After the installation of the StackLight Collector Plugin, define the settings of the plugin and run the commands shown below from the Fuel master node for every node of your deployment starting with the controller node(s):
[root@nailgun ~]# fuel nodes --env <env_id> --node <node_id> --tasks hiera install-ocf-script
Once the task has completed for the node, run the following command:
[root@nailgun ~]# fuel nodes --env <env_id> --node <node_id> --start post_deployment_start
Plugin verification¶
Once the OpenStack environment is ready, verify that both the collectd and hekad processes are running on the OpenStack nodes:
[root@node-1 ~]# pidof hekad
5568
5569
[root@node-1 ~]# pidof collectd
5684
Note
Starting with StackLight version 0.10, there are two hekad processes running instead of one. One is used to collect and process the logs and the notifications, the other one is used to process the metrics.
Troubleshooting¶
If you see no data in the Kibana and/or Grafana dashboards, follow the instructions below to troubleshoot the issue:
Verify that the collector services are up and running:
On the controller nodes:
[root@node-1 ~]# crm resource status metric_collector
[root@node-1 ~]# crm resource status log_collector
On non-controller nodes:
[root@node-2 ~]# status log_collector
[root@node-2 ~]# status metric_collector
If a collector is down, restart it:
On the controller nodes:
[root@node-1 ~]# crm resource start log_collector
[root@node-1 ~]# crm resource start metric_collector
On non-controller nodes:
[root@node-2 ~]# start log_collector
[root@node-2 ~]# start metric_collector
Look for errors in the log files of the collectors located at /var/log/log_collector.log and /var/log/metric_collector.log.
Look for errors in the log file of collectd located at /var/log/collectd.log.
Verify that the nodes are able to connect to the Elasticsearch server on port 9200.
Verify that the nodes are able to connect to the InfluxDB server on port 8086.
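As a quick check (the addresses below are placeholders for your back-end servers), you can query the APIs directly from a node; Elasticsearch answers on its root URL, and InfluxDB typically answers its /ping endpoint with an HTTP 204 status:

[root@node-1 ~]# curl -sS http://<Elasticsearch server address>:9200/
[root@node-1 ~]# curl -sS -o /dev/null -w '%{http_code}\n' http://<InfluxDB server address>:8086/ping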
Diagnostic tool¶
The StackLight Collector Plugin installs a global diagnostic tool on the Fuel Master node. The global diagnostic tool checks that StackLight is configured and running properly across the entire LMA toolchain for all the nodes that are ready in your OpenStack environment:
[root@nailgun ~]# /var/www/nailgun/plugins/lma_collector-<version>/contrib/tools/diagnostic.sh
Running lma_diagnostic tool on all available nodes (this can take several minutes)
The diagnostic archive is here: /var/lma_diagnostics.2016-06-10_11-23-1465557820.tgz
Note
A global diagnostic can take several minutes.
All the results are consolidated in the /var/lma_diagnostics.[date +%Y-%m-%d_%H-%M-%s].tgz archive.
Instead of running a global diagnostic, you may want to run the diagnostic on individual nodes. Based on the role of the node, the tool determines what checks should be executed. For example:
root@node-3:~# hiera roles
["controller"]
root@node-3:~# lma_diagnostics
2016-06-10-11-08-04 INFO node-3.test.domain.local role ["controller"]
2016-06-10-11-08-04 INFO ** LMA Collector
2016-06-10-11-08-04 INFO 2 process(es) 'hekad -config' found
2016-06-10-11-08-04 INFO 1 process(es) hekad is/are listening on port 4352
2016-06-10-11-08-04 INFO 1 process(es) hekad is/are listening on port 8325
2016-06-10-11-08-05 INFO 1 process(es) hekad is/are listening on port 5567
2016-06-10-11-08-05 INFO 1 process(es) hekad is/are listening on port 4353
[...]
In the example above, the diagnostic tool reports that two hekad processes are running on node-3, which is the expected outcome. If one of the hekad processes is not running, the diagnostic tool reports an error. For example:
root@node-3:~# lma_diagnostics
2016-06-10-11-11-48 INFO node-3.test.domain.local role ["controller"]
2016-06-10-11-11-48 INFO ** LMA Collector
2016-06-10-11-11-48 ERROR 1 'hekad -config' processes found, 2 expected!
2016-06-10-11-11-48 ERROR 'hekad' process does not LISTEN on port: 4352
[...]
In the example above, the diagnostic tool reported two errors:
- There is only one hekad process running instead of two.
- No hekad process is listening on port 4352.
These examples describe only one type of check performed by the diagnostic tool, but there are many others.
On the OpenStack nodes, the diagnostic results are stored in /var/lma_diagnostics/diagnostics.log.
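For example, a quick way to scan the diagnostic log for problems on a node is to grep it (a simple illustration, not a feature of the tool itself):

root@node-3:~# grep -iE 'error|warn' /var/lma_diagnostics/diagnostics.log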
Note
A successful LMA toolchain diagnostic should be free of errors.
Advanced configuration¶
Due to a current limitation in Fuel, when a node is removed from an OpenStack environment through the Fuel web UI or CLI, the services that were running on that node are not automatically removed from the database. Therefore, StackLight reports these services as failed. To resolve this issue, remove these services manually.
To reconfigure the StackLight Collector after removing a node:
From a controller node, list the services that are reported failed. In the example below, it is node-7.

root@node-6:~# source ./openrc
root@node-6:~# neutron agent-list
+--------------+-------------------+-------------------+-------------------+-------+
| id           | agent_type        | host              | availability_zone | alive |
+--------------+-------------------+-------------------+-------------------+-------+
| 08a69bad-... | Metadata agent    | node-8.domain.tld |                   | :-)   |
| 11b6dca6-... | Metadata agent    | node-7.domain.tld |                   | xxx   |
| 22ea82e3-... | DHCP agent        | node-6.domain.tld | nova              | :-)   |
| 2d82849e-... | L3 agent          | node-6.domain.tld | nova              | :-)   |
| 3221ec18-... | Open vSwitch agent| node-6.domain.tld |                   | :-)   |
| 84bfd240-... | Open vSwitch agent| node-7.domain.tld |                   | xxx   |
| 9452e8f0-... | Open vSwitch agent| node-9.domain.tld |                   | :-)   |
| 97136b09-... | Open vSwitch agent| node-8.domain.tld |                   | :-)   |
| c198bc94-... | DHCP agent        | node-7.domain.tld | nova              | xxx   |
| c76c4ed4-... | L3 agent          | node-7.domain.tld | nova              | xxx   |
| d0fd8bb5-... | L3 agent          | node-8.domain.tld | nova              | :-)   |
| d21f9cea-... | DHCP agent        | node-8.domain.tld | nova              | :-)   |
| f6f871b7-... | Metadata agent    | node-6.domain.tld |                   | :-)   |
+--------------+-------------------+-------------------+-------------------+-------+
root@node-6:~# nova service-list
+--+----------------+-----------------+---------+--------+-------+-----------------+
|Id|Binary          |Host             | Zone    | Status | State | Updated_at      |
+--+----------------+-----------------+---------+--------+-------+-----------------+
|1 |nova-consoleauth|node-6.domain.tld| internal| enabled| up    | 2016-07-19T11:43|
|4 |nova-scheduler  |node-6.domain.tld| internal| enabled| up    | 2016-07-19T11:43|
|7 |nova-cert       |node-6.domain.tld| internal| enabled| up    | 2016-07-19T11:43|
|10|nova-conductor  |node-6.domain.tld| internal| enabled| up    | 2016-07-19T11:42|
|22|nova-cert       |node-7.domain.tld| internal| enabled| down  | 2016-07-19T11:43|
|25|nova-consoleauth|node-7.domain.tld| internal| enabled| down  | 2016-07-19T11:43|
|28|nova-scheduler  |node-7.domain.tld| internal| enabled| down  | 2016-07-19T11:43|
|31|nova-cert       |node-8.domain.tld| internal| enabled| up    | 2016-07-19T11:43|
|34|nova-consoleauth|node-8.domain.tld| internal| enabled| up    | 2016-07-19T11:43|
|37|nova-conductor  |node-7.domain.tld| internal| enabled| down  | 2016-07-19T11:42|
|43|nova-scheduler  |node-8.domain.tld| internal| enabled| up    | 2016-07-19T11:43|
|49|nova-conductor  |node-8.domain.tld| internal| enabled| up    | 2016-07-19T11:42|
|64|nova-compute    |node-9.domain.tld| nova    | enabled| up    | 2016-07-19T11:42|
+--+----------------+-----------------+---------+--------+-------+-----------------+
root@node-6:~# cinder service-list
+----------------+-----------------------------+----+-------+-----+----------------+
| Binary         | Host                        |Zone| Status|State| Updated_at     |
+----------------+-----------------------------+----+-------+-----+----------------+
|cinder-backup   | node-9.domain.tld           |nova|enabled|up   |2016-07-19T11:44|
|cinder-scheduler| node-6.domain.tld           |nova|enabled|up   |2016-07-19T11:43|
|cinder-scheduler| node-7.domain.tld           |nova|enabled|down |2016-07-19T11:43|
|cinder-scheduler| node-8.domain.tld           |nova|enabled|up   |2016-07-19T11:44|
|cinder-volume   |node-9.domain.tld@LVM-backend|nova|enabled|up   |2016-07-19T11:44|
+----------------+-----------------------------+----+-------+-----+----------------+
Remove the services and/or agents that are reported failed on that node:
root@node-6:~# nova service-delete <id of service to delete>
root@node-6:~# cinder service-disable <hostname> <binary>
root@node-6:~# neutron agent-delete <id of agent to delete>
Restart the Collector on all the controller nodes:
[root@node-1 ~]# crm resource restart log_collector
[root@node-1 ~]# crm resource restart metric_collector
Configuring alarms¶
Overview¶
The process of running alarms in StackLight is not centralized, as is often the case in more conventional monitoring systems, but distributed across all the StackLight Collectors.
Each Collector is individually responsible for monitoring the resources and services that are deployed on the node and for reporting any anomaly or fault it has detected to the Aggregator.
The anomaly and fault detection logic in StackLight is designed more like an expert system in that the Collector and the Aggregator use artifacts we can refer to as facts and rules.
The facts are the operational data ingested into StackLight's stream-processing pipeline. The rules are either alarm rules or aggregation rules. They are declaratively defined in YAML files that can be modified. Those rules are turned into a collection of Lua plugins that are executed by the Collector and the Aggregator. They are generated dynamically using the Puppet modules of the StackLight Collector Plugin.
The following are the two types of Lua plugins related to the processing of alarms:
- The AFD plugin – Anomaly and Fault Detection plugin
- The GSE plugin – Global Status Evaluation plugin
These plugins create special types of metrics, as follows:
- The AFD metric, which contains information about the health status of a node or service in the OpenStack environment. The AFD metrics are sent on a regular basis to the Aggregator where they are further processed by the GSE plugins.
- The GSE metric, which contains information about the health status of a cluster in the OpenStack environment. A cluster is a logical grouping of nodes or services. We call them node clusters and service clusters hereafter. A service cluster can be anything like a cluster of API endpoints or a cluster of workers. A cluster of nodes is a grouping of nodes that have the same role. For example, compute or storage.
Note
The AFD and GSE metrics are new types of metrics introduced in StackLight version 0.8. They contain detailed information about the fault and anomalies detected by StackLight. For more information about the message structure of these metrics, refer to Metrics section of the Developer Guide.
The following figure shows the StackLight stream-processing pipeline workflow:
The AFD and GSE plugins¶
The current version of StackLight contains the following three types of GSE plugins:
- The Service Cluster GSE Plugin, which receives AFD metrics for services from the AFD plugins.
- The Node Cluster GSE Plugin, which receives AFD metrics for nodes from the AFD plugins.
- The Global Cluster GSE Plugin, which receives GSE metrics from the GSE plugins above. It aggregates and correlates the GSE metrics to issue a global health status for the top-level clusters like Nova, MySQL, and others.
The health status exposed in the GSE metrics is as follows:
- Down: One or several primary functions of a cluster have failed or are failing. For example, the API service for Nova or Cinder is not accessible.
- Critical: One or several primary functions of a cluster are severely degraded. The quality of service delivered to the end user is severely impacted.
- Warning: One or several primary functions of the cluster are slightly degraded. The quality of service delivered to the end user is slightly impacted.
- Unknown: There is not enough data to infer the actual health status of the cluster.
- Okay: None of the above was found to be true.
The AFD and GSE persisters¶
The AFD and GSE metrics are also consumed by other types of Lua plugins called persisters:
- The InfluxDB persister transforms the GSE metrics into InfluxDB data points and Grafana annotations. They are used in Grafana to graph the health status of the OpenStack clusters.
- The Elasticsearch persister transforms the AFD metrics into events that are indexed in Elasticsearch. Using Kibana, these events can be searched to display a fault or an anomaly that occurred in the environment (not yet implemented).
- The Nagios persister transforms the GSE and AFD metrics into passive checks that are sent to Nagios for alerting and escalation.
New persisters can be easily created to feed other systems with the operational insight contained in the AFD and GSE metrics.
Alarms configuration¶
StackLight comes with a predefined set of alarm rules. We have tried to make
these rules as comprehensive and relevant as possible, but your mileage may
vary depending on the specifics of your OpenStack environment and monitoring
requirements. Therefore, it is possible to modify those predefined rules and
create new ones. To do so, modify the /etc/hiera/override/alarming.yaml
file and apply the Puppet manifest that will dynamically
generate Lua plugins, known as the AFD Plugins, which are the actuators of the
alarm rules. But before you proceed, verify that you understand the structure of
that file.
Alarm structure¶
An alarm rule is defined declaratively using the YAML syntax. For example:
name: 'fs-warning'
description: 'Filesystem free space is low'
severity: 'warning'
enabled: 'true'
trigger:
  rules:
    - metric: fs_space_percent_free
      group_by: ['fs']
      relational_operator: '<'
      fields:
        fs: "=~ ceph/%d+$"
      threshold: 5
      window: 60
      periods: 0
      function: min
In this example, the alarm applies to the fs_space_percent_free metric, restricted by the fields pattern to the file systems whose mount point matches the Lua pattern 'ceph/%d+$'.

The 'roc' (rate of change) function returns 'true' if the difference exceeds the standard deviation multiplied by the 'threshold' value. This function uses the rate of change algorithm already available in the anomaly detection module of Heka. It can only be applied to normal distributions. With an alarm rule using the 'roc' function, the 'window' parameter specifies the duration in seconds of the current window, and the 'periods' parameter specifies the number of windows used for the historical data. You need at least one period, and the 'periods' parameter must not be zero. If you choose a period of 'p', the function computes the rate of change using a historical data window of ('p' * window) seconds.
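For example, a rule using the 'roc' function could look like the following sketch (the metric, threshold, window, and periods values here are purely illustrative):

- metric: if_errors_rx
  relational_operator: '>'
  threshold: 1.5
  window: 60
  periods: 3
  function: roc

With this sketch, the rate of change observed over the current 60-second window is compared against the previous 3 windows, that is, 180 seconds of historical data.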
Dimension pattern matching¶
The alarming framework allows specifying alarms against metrics with a filtering mechanism called dimension pattern matching. For details, see the specification.
The pattern matching is evaluated against the field/dimension specified by the alarm rule:
rules:
  - metric: foo_metric
    fields:
      my_dimension: <PATTERN MATCHING EXPRESSION>
where the pattern matching expression has the following format:
EXP ::= "<relational operator> string"
Expressions can be combined with logical operator(s):
EXP (<logical_operator> EXP, ..)
Where:
- Logical operators:
  - OR: ||
  - AND: &&
- Relational operators:
  - Strings and numbers:
    - Equality: ==
    - Not equal: !=
  - Strings only (for syntax, see Lua pattern matching):
    - Match: =~
    - Negated match: !~
  - Numbers only:
    - Greater than: >
    - Greater than or equal: >=
    - Less than: <
    - Less than or equal: <=
Example:
Value | Pattern | Matching |
---|---|---|
10 | 10 | True |
10 | == 10 | True |
10 | != 42 | True |
10 | > 42 | False |
10 | >= 10 | True |
foo | == foo | True |
foo | == bar | False |
/var/log | =~ /log$ | True |
/data | !~ ^/data$ | False |
/var/log/data | !~ /data$ | False |
/var/log/data | !~ ^/data$ | True |
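For example, a rule could combine two expressions with the logical AND operator to match the file systems mounted under /var except the log file system (the metric, threshold, and patterns below are purely illustrative):

rules:
  - metric: fs_space_percent_free
    fields:
      fs: "=~ ^/var && !~ /log$"
    relational_operator: '<'
    threshold: 10
    window: 60
    periods: 0
    function: min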
Modify or create an alarm¶
To modify or create an alarm, edit the /etc/hiera/override/alarming.yaml file. This file has the following sections:
The alarms section contains a global list of alarms that are executed by the Collectors. These alarms are global to the LMA toolchain and should be kept identical on all nodes of the OpenStack environment. The following is another example of the definition of an alarm:

alarms:
  - name: 'cpu-critical-controller'
    description: 'CPU critical on controller'
    severity: 'critical'
    enabled: 'true'
    trigger:
      logical_operator: 'or'
      rules:
        - metric: cpu_idle
          relational_operator: '<='
          threshold: 5
          window: 120
          periods: 0
          function: avg
        - metric: cpu_wait
          relational_operator: '>='
          threshold: 35
          window: 120
          periods: 0
          function: avg
This alarm is called 'cpu-critical-controller'. It says that CPU activity is critical (severity: 'critical') if any of the rules in the alarm definition evaluates to true.
The first rule says that the alarm evaluates to 'true' if the value of the cpu_idle metric has been on average (function: avg) below or equal (relational_operator: '<=') to 5 for the last 2 minutes (window: 120),
OR (logical_operator: 'or')
if the value of the cpu_wait metric has been on average (function: avg) greater than or equal (relational_operator: '>=') to 35 for the last 2 minutes (window: 120).
Note that these metrics are expressed as percentages.
What alarms are executed on which node depends on the mapping between the alarm definition and the definition of a cluster as described in the following sections.
The node_cluster_roles section defines the mapping between the internal definition of a cluster of nodes and one or several Fuel roles. For example:

node_cluster_roles:
  controller: ['primary-controller', 'controller']
  compute: ['compute']
  storage: ['cinder', 'ceph-osd']
  [ ... ]
Creates a mapping between the 'primary-controller' and 'controller' Fuel roles, and the internal definition of a cluster of nodes called 'controller'. Likewise, the internal definition of a cluster of nodes called 'storage' is mapped to the 'cinder' and 'ceph-osd' Fuel roles. The internal definition of a cluster of nodes is used to assign the alarms to the relevant category of nodes. This mapping is also used to configure the passive checks in Nagios. Therefore, it is critically important to keep exactly the same copy of /etc/hiera/override/alarming.yaml across all nodes of the OpenStack environment, including the node(s) where Nagios is installed.

The service_cluster_roles section defines the mapping between the internal definition of a cluster of services and one or several Fuel roles. For example:

service_cluster_roles:
  rabbitmq: ['primary-controller', 'controller']
  nova-api: ['primary-controller', 'controller']
  elasticsearch: ['primary-elasticsearch_kibana', 'elasticsearch_kibana']
  [ ... ]
Creates a mapping between the ‘primary-controller’ and ‘controller’ Fuel roles, and the internal definition of a cluster of services called ‘rabbitmq’. Likewise, the internal definition of a cluster of services called ‘elasticsearch’ is mapped to the ‘primary-elasticsearch_kibana’ and ‘elasticsearch_kibana’ Fuel roles. As for the clusters of nodes, the internal definition of a cluster of services is used to assign the alarms to the relevant category of services.
The node_cluster_alarms section defines the mapping between the internal definition of a cluster of nodes and the alarms that are assigned to that category of nodes. For example:

node_cluster_alarms:
  controller-nodes:
    apply_to_node: controller
    alerting: enabled
    members:
      cpu:
        alarms: ['cpu-critical-controller', 'cpu-warning-controller']
      root-fs:
        alarms: ['root-fs-critical', 'root-fs-warning']
      log-fs:
        alarms: ['log-fs-critical', 'log-fs-warning']
      hdd-errors:
        alerting: enabled_with_notification
        alarms: ['hdd-errors-critical']
Creates four alarm groups for the cluster of controller nodes:
- The cpu alarm group is mapped to two alarms defined in the alarms section known as the 'cpu-critical-controller' and 'cpu-warning-controller' alarms. These alarms monitor the CPU on the controller nodes. The order matters here since the first alarm that evaluates to 'true' stops the evaluation. Therefore, it is important to start the list with the most critical alarms.
- The root-fs alarm group is mapped to two alarms defined in the alarms section known as the 'root-fs-critical' and 'root-fs-warning' alarms. These alarms monitor the root file system on the controller nodes.
- The log-fs alarm group is mapped to two alarms defined in the alarms section known as the 'log-fs-critical' and 'log-fs-warning' alarms. These alarms monitor the file system where the logs are created on the controller nodes.
- The hdd-errors alarm group is mapped to the 'hdd-errors-critical' alarm defined in the alarms section. This alarm monitors the kern.log log entries containing critical IO errors detected by the kernel. The hdd-errors alarm group has the enabled_with_notification alerting attribute, meaning that the operator is notified if any of the controller nodes encounters a disk failure. Other alarms do not trigger a notification per node but at an aggregated cluster level.
Note
An alarm group is a mere implementation artifact (although it has functional value) that is primarily used to distribute the alarms evaluation workload across several Lua plugins. Since the Lua plugins runtime is sandboxed within Heka, it is preferable to run smaller sets of alarms in different plugins rather than a large set of alarms in a single plugin. This is to avoid having alarms evaluation plugins shut down by Heka. Furthermore, the alarm groups are used to identify what is called a source. A source is a tuple in which we associate a cluster with an alarm group. For example, the tuple [‘controller’, ‘cpu’] is a source. It associates a ‘controller’ cluster with the ‘cpu’ alarm group. The tuple [‘controller’, ‘root-fs’] is another source example. The source is used by the GSE Plugins to remember the AFD metrics it has received. If a GSE Plugin stops receiving AFD metrics it used to get, then the GSE Plugin infers that the health status of the cluster associated with the source is Unknown.
This is evaluated every ticker-interval. By default, the ticker interval for the GSE Plugins is set to 10 seconds.
Aggregation and correlation configuration¶
StackLight comes with a predefined set of aggregation rules and correlation
policies. However, you can create new aggregation rules and correlation
policies or modify the existing ones. To do so, modify the /etc/hiera/override/gse_filters.yaml
file and apply the
Puppet manifest that will generate Lua plugins known as
the GSE Plugins, which are the actuators of these aggregation rules and
correlation policies. But before you proceed, verify that you understand the
structure of that file.
Note
As for /etc/hiera/override/alarming.yaml, it is critically important to keep exactly the same copy of /etc/hiera/override/gse_filters.yaml across all the nodes of the OpenStack environment, including the node(s) where Nagios is installed.
The aggregation rules and correlation policies are defined in the /etc/hiera/override/gse_filters.yaml configuration file.
This file has the following sections:
- The gse_policies section contains the health status correlation policies that apply to the node clusters and service clusters.
- The gse_cluster_service section contains the aggregation rules for the service clusters. These aggregation rules are actuated by the Service Cluster GSE Plugin that runs on the Aggregator.
- The gse_cluster_node section contains the aggregation rules for the node clusters. These aggregation rules are actuated by the Node Cluster GSE Plugin that runs on the Aggregator.
- The gse_cluster_global section contains the aggregation rules for the so-called top-level clusters. A global cluster is a kind of logical construct of node clusters and service clusters. These aggregation rules are actuated by the Global Cluster GSE Plugin that runs on the Aggregator.
Health status policies¶
The correlation logic implemented by the GSE plugins is policy-based. The policies define how the GSE plugins infer the health status of a cluster.
By default, there are two policies:
- The highest_severity policy defines that the cluster’s status depends on the member with the highest severity, typically used for a cluster of services.
- The majority_of_members policy defines that the cluster is healthy as long as (N+1)/2 members of the cluster are healthy. This is typically used for clusters managed by Pacemaker.
A policy consists of a list of rules that are evaluated against the current status of the cluster’s members. When one of the rules matches, the cluster’s status gets the value associated with the rule and the evaluation stops. The last rule of the list is usually a catch-all rule that defines the default status if none of the previous rules matches.
The following example shows the policy rule definition:
# The following rule definition reads as: "the cluster's status is critical
# if more than 50% of its members are either down or critical"
- status: critical
  trigger:
    logical_operator: or
    rules:
      - function: percent
        arguments: [ down, critical ]
        relational_operator: '>'
        threshold: 50
Consider the policy called highest_severity:
gse_policies:
  highest_severity:
    - status: down
      trigger:
        logical_operator: or
        rules:
          - function: count
            arguments: [ down ]
            relational_operator: '>'
            threshold: 0
    - status: critical
      trigger:
        logical_operator: or
        rules:
          - function: count
            arguments: [ critical ]
            relational_operator: '>'
            threshold: 0
    - status: warning
      trigger:
        logical_operator: or
        rules:
          - function: count
            arguments: [ warning ]
            relational_operator: '>'
            threshold: 0
    - status: okay
      trigger:
        logical_operator: or
        rules:
          - function: count
            arguments: [ okay ]
            relational_operator: '>'
            threshold: 0
    - status: unknown
- status: unknown
The policy definition reads as follows:
- The status of the cluster is Down if the status of at least one cluster's member is Down.
- Otherwise, the status of the cluster is Critical if the status of at least one cluster's member is Critical.
- Otherwise, the status of the cluster is Warning if the status of at least one cluster's member is Warning.
- Otherwise, the status of the cluster is Okay if the status of at least one cluster's member is Okay.
- Otherwise, the status of the cluster is Unknown.
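By comparison, a majority-based policy can be expressed with the percent function. The sketch below is purely illustrative; the majority_of_members rules shipped with StackLight may differ:

gse_policies:
  majority_of_members:
    - status: down
      trigger:
        logical_operator: or
        rules:
          - function: percent
            arguments: [ down ]
            relational_operator: '>'
            threshold: 50
    - status: critical
      trigger:
        logical_operator: or
        rules:
          - function: percent
            arguments: [ down, critical ]
            relational_operator: '>'
            threshold: 50
    - status: warning
      trigger:
        logical_operator: or
        rules:
          - function: percent
            arguments: [ down, critical, warning ]
            relational_operator: '>'
            threshold: 50
    - status: okay
      trigger:
        logical_operator: or
        rules:
          - function: percent
            arguments: [ okay ]
            relational_operator: '>='
            threshold: 50
    - status: unknown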
Service cluster aggregation rules¶
The service cluster aggregation rules are used to designate the members of a service cluster along with the AFD metrics that must be taken into account to derive a health status for the service cluster. The following is an example of the service cluster aggregation rules:
gse_cluster_service:
  input_message_types:
    - afd_service_metric
  aggregator_flag: true
  cluster_field: service
  member_field: source
  output_message_type: gse_service_cluster_metric
  output_metric_name: cluster_service_status
  interval: 10
  warm_up_period: 20
  alerting: enabled_with_notification
  clusters:
    nova-api:
      policy: highest_severity
      group_by: member
      members:
        - backends
        - endpoint
        - http_errors
Service cluster definition¶
The following example shows the service clusters definition:
gse_cluster_service:
  [...]
  clusters:
    nova-api:
      members:
        - backends
        - endpoint
        - http_errors
      group_by: member
      policy: highest_severity
The cluster members are the AFD metrics whose cluster_field value is equal to the cluster name and whose member_field value is in the members list.

A closer look at the example above shows that the Service Cluster GSE plugin resulting from those rules will emit a gse_service_cluster_metric message every 10 seconds to report the current status of the nova-api cluster. This status is computed using the afd_service_metric metric for which Fields[service] is 'nova-api' and Fields[source] is one of 'backends', 'endpoint', or 'http_errors'. The 'nova-api' cluster's status is computed using the 'highest_severity' policy, which means that it will be equal to the 'worst' status across all members.
Node cluster aggregation rules¶
The node cluster aggregation rules are used to designate the members of a node cluster along with the AFD metrics that must be taken into account to derive a health status for the node cluster. The following is an example of the node cluster aggregation rules:
gse_cluster_node:
  input_message_types:
    - afd_node_metric
  aggregator_flag: true
  # the field in the input messages to identify the cluster
  cluster_field: node_role
  # the field in the input messages to identify the cluster's member
  member_field: source
  output_message_type: gse_node_cluster_metric
  output_metric_name: cluster_node_status
  interval: 10
  warm_up_period: 80
  alerting: enabled_with_notification
  clusters:
    controller:
      policy: majority_of_members
      group_by: hostname
      members:
        - cpu
        - root-fs
        - log-fs
Node cluster definition¶
The following example shows the node clusters definition:
gse_cluster_node:
  [...]
  clusters:
    controller:
      policy: majority_of_members
      group_by: hostname
      members:
        - cpu
        - root-fs
        - log-fs
The cluster members are the AFD metrics whose cluster_field value is equal to the cluster name and whose member_field value is in the members list.

A closer look at the example above shows that the Node Cluster GSE plugin resulting from those rules will emit a gse_node_cluster_metric message every 10 seconds to report the current status of the controller cluster. This status is computed using the afd_node_metric metric for which Fields[node_role] is 'controller' and Fields[source] is one of 'cpu', 'root-fs', or 'log-fs'. The 'controller' cluster's status is computed using the 'majority_of_members' policy, which means that it will be equal to the 'majority' status across all members.
Top-level cluster aggregation rules¶
The top-level aggregation rules aggregate GSE metrics from the Service Cluster GSE Plugin and the Node Cluster GSE Plugin. This is the last aggregation stage that issues health status for the top-level clusters. A top-level cluster is a logical construct of service and node clustering. By default, we define that the health status of Nova, as a top-level cluster, depends on the health status of several service clusters related to Nova and the health status of the 'controller' and 'compute' node clusters. But it can be anything. For example, you can define a 'control-plane' top-level cluster that would exclude the health status of the 'compute' node cluster if required.

The top-level cluster aggregation rules are used to designate the node clusters and service clusters members of a top-level cluster along with the GSE metrics that must be taken into account to derive a health status for the top-level cluster. The following is an example of top-level cluster aggregation rules:
gse_cluster_global:
  input_message_types:
    - gse_service_cluster_metric
    - gse_node_cluster_metric
  aggregator_flag: false
  # the field in the input messages to identify the cluster's member
  member_field: cluster_name
  output_message_type: gse_cluster_metric
  output_metric_name: cluster_status
  interval: 10
  warm_up_period: 30
  clusters:
    nova:
      policy: highest_severity
      group_by: member
      members:
        - nova-logs
        - nova-api
        - nova-metadata-api
        - nova-scheduler
        - nova-compute
        - nova-conductor
        - nova-cert
        - nova-consoleauth
        - nova-novncproxy-websocket
        - controller
        - compute
      hints:
        - cinder
        - glance
        - keystone
        - neutron
        - mysql
        - rabbitmq
Top-level cluster definition¶
The following example shows the top-level clusters definition:
gse_cluster_global:
  [...]
  clusters:
    nova:
      policy: highest_severity
      group_by: member
      members:
        - nova-logs
        - nova-api
        - nova-metadata-api
        - nova-scheduler
        - nova-compute
        - nova-conductor
        - nova-cert
        - nova-consoleauth
        - nova-novncproxy-websocket
        - controller
        - compute
      hints:
        - cinder
        - glance
        - keystone
        - neutron
        - mysql
        - rabbitmq
The cluster members are the GSE metrics whose member_field value (cluster_name) is on the members list.

Apply your configuration changes¶
Once you have edited and saved your changes in /etc/hiera/override/alarming.yaml and/or /etc/hiera/override/gse_filters.yaml, apply the following Puppet manifest on all the nodes of your OpenStack environment, including the node(s) where Nagios is installed, for the changes to take effect:
# puppet apply --modulepath=/etc/fuel/plugins/lma_collector-<version>/puppet/modules:\
/etc/puppet/modules \
/etc/fuel/plugins/lma_collector-<version>/puppet/manifests/configure_afd_filters.pp
If you have also deployed the lma_infrastructure_alerting plugin, Nagios must be reconfigured as well by running the following commands on all the nodes with the lma_infrastructure_alerting role:
# rm -f /etc/nagios3/conf.d/lma_* && puppet apply \
--modulepath=/etc/fuel/plugins/lma_infrastructure_alerting-<version>/puppet/modules:\
/etc/puppet/modules \
/etc/fuel/plugins/lma_infrastructure_alerting-<version>/puppet/manifests/nagios.pp
Appendix¶
List of metrics¶
The following is a list of metrics that are emitted by the StackLight Collector. The metrics are listed by category, then by metric name.
System¶
CPU¶
Metrics have a cpu_number field that contains the CPU number to which the metric applies.
- cpu_idle, the percentage of CPU time spent in the idle task.
- cpu_interrupt, the percentage of CPU time spent servicing interrupts.
- cpu_nice, the percentage of CPU time spent in user mode with low priority (nice).
- cpu_softirq, the percentage of CPU time spent servicing soft interrupts.
- cpu_steal, the percentage of CPU time spent in other operating systems.
- cpu_system, the percentage of CPU time spent in system mode.
- cpu_user, the percentage of CPU time spent in user mode.
- cpu_wait, the percentage of CPU time spent waiting for I/O operations to complete.
Disk¶
Metrics have a device field that contains the disk device name the metric applies to. For example, 'sda', 'sdb', and others.
- disk_merged_read, the number of read operations per second that could be merged with already queued operations.
- disk_merged_write, the number of write operations per second that could be merged with already queued operations.
- disk_octets_read, the number of octets (bytes) read per second.
- disk_octets_write, the number of octets (bytes) written per second.
- disk_ops_read, the number of read operations per second.
- disk_ops_write, the number of write operations per second.
- disk_time_read, the average time for a read operation to complete in the last interval.
- disk_time_write, the average time for a write operation to complete in the last interval.
File system¶
Metrics have a fs field that contains the partition's mount point to which the metric applies. For example, '/', '/var/lib', and others.
- fs_inodes_free, the number of free inodes on the file system.
- fs_inodes_percent_free, the percentage of free inodes on the file system.
- fs_inodes_percent_reserved, the percentage of reserved inodes.
- fs_inodes_percent_used, the percentage of used inodes.
- fs_inodes_reserved, the number of reserved inodes.
- fs_inodes_used, the number of used inodes.
- fs_space_free, the number of free bytes.
- fs_space_percent_free, the percentage of free bytes.
- fs_space_percent_reserved, the percentage of reserved bytes.
- fs_space_percent_used, the percentage of used bytes.
- fs_space_reserved, the number of reserved bytes.
- fs_space_used, the number of used bytes.
System load¶
- load_longterm, the system load average over the last 15 minutes.
- load_midterm, the system load average over the last 5 minutes.
- load_shortterm, the system load average over the last minute.
Memory¶
- memory_buffered, the amount of buffered memory in bytes.
- memory_cached, the amount of cached memory in bytes.
- memory_free, the amount of free memory in bytes.
- memory_used, the amount of used memory in bytes.
Network¶
Metrics have an interface field that contains the interface name the metric applies to. For example, 'eth0', 'eth1', and others.
- if_collisions, the number of collisions per second per interface.
- if_dropped_rx, the number of dropped packets per second when receiving from the interface.
- if_dropped_tx, the number of dropped packets per second when transmitting from the interface.
- if_errors_rx, the number of errors per second detected when receiving from the interface.
- if_errors_rx_crc, the number of received frames with wrong CRC (cyclic redundancy check) per second.
- if_errors_rx_fifo, the number of received frames dropped per second due to FIFO buffer overflows.
- if_errors_rx_frame, the number of received frames with invalid frame checksum (FCS).
- if_errors_rx_length, the number of received frames with a length that does not comply with the Ethernet specification.
- if_errors_rx_missed, the number of missed packets when receiving from the interface.
- if_errors_rx_over, the number of received frames per second that were dropped due to a hardware port receive buffer overflow.
- if_errors_tx, the number of errors per second detected when transmitting from the interface.
- if_errors_tx_aborted, the number of aborted frames per second when transmitting from the interface.
- if_errors_tx_carrier, the number of times per second the interface has lost its link connection to the switch.
- if_errors_tx_fifo, the number of transmitted frames per second dropped due to FIFO buffer overflows.
- if_errors_tx_heartbeat, the number of heartbeat errors per second.
- if_errors_tx_window, the number of late collisions per second when transmitting from the interface.
- if_multicast, the number of multicast packets per second per interface.
- if_octets_rx, the number of octets (bytes) received per second by the interface.
- if_octets_tx, the number of octets (bytes) transmitted per second by the interface.
- if_packets_rx, the number of packets received per second by the interface.
- if_packets_tx, the number of packets transmitted per second by the interface.
Processes¶
- processes_count, the number of processes in a given state. The metric has a state field (one of 'blocked', 'paging', 'running', 'sleeping', 'stopped', or 'zombies').
- processes_fork_rate, the number of processes forked per second.
Swap¶
- swap_cached, the amount of cached memory (in bytes) that is in the swap.
- swap_free, the amount of free memory (in bytes) that is in the swap.
- swap_io_in, the number of swap bytes written per second.
- swap_io_out, the number of swap bytes read per second.
- swap_used, the amount of used memory (in bytes) that is in the swap.
- swap_percent_used, the amount of used memory (in percent) that is in the swap.
Users¶
- logged_users, the number of users currently logged in.
Apache¶
- apache_bytes, the number of bytes per second transmitted by the server.
- apache_connections, the current number of active connections.
- apache_idle_workers, the current number of idle workers.
- apache_requests, the number of requests processed per second.
- apache_workers_closing, the number of workers in closing state.
- apache_workers_dnslookup, the number of workers in DNS lookup state.
- apache_workers_finishing, the number of workers in finishing state.
- apache_workers_idle_cleanup, the number of workers in idle cleanup state.
- apache_workers_keepalive, the number of workers in keepalive state.
- apache_workers_logging, the number of workers in logging state.
- apache_workers_open, the number of workers in open state.
- apache_workers_reading, the number of workers in reading state.
- apache_workers_sending, the number of workers in sending state.
- apache_workers_starting, the number of workers in starting state.
- apache_workers_waiting, the number of workers in waiting state.
MySQL¶
Commands¶
mysql_commands, the number of times per second a given statement has been executed. The metric has a statement field that contains the statement to which it applies. The values can be as follows:
- change_db for the USE statement.
- commit for the COMMIT statement.
- flush for the FLUSH statement.
- insert for the INSERT statement.
- rollback for the ROLLBACK statement.
- select for the SELECT statement.
- set_option for the SET statement.
- show_collations for the SHOW COLLATION statement.
- show_databases for the SHOW DATABASES statement.
- show_fields for the SHOW FIELDS statement.
- show_master_status for the SHOW MASTER STATUS statement.
- show_status for the SHOW STATUS statement.
- show_tables for the SHOW TABLES statement.
- show_variables for the SHOW VARIABLES statement.
- show_warnings for the SHOW WARNINGS statement.
- update for the UPDATE statement.
Handlers¶
mysql_handler, the number of times per second a given handler has been executed. The metric has a handler field that contains the handler it applies to. The values can be as follows:
- commit for the internal COMMIT statements.
- delete for the internal DELETE statements.
- external_lock for the external locks.
- read_first for the requests that read the first entry in an index.
- read_key for the requests that read a row based on a key.
- read_next for the requests that read the next row in key order.
- read_prev for the requests that read the previous row in key order.
- read_rnd for the requests that read a row based on a fixed position.
- read_rnd_next for the requests that read the next row in the data file.
- rollback for the requests that perform the rollback operation.
- update for the requests that update a row in a table.
- write for the requests that insert a row in a table.
Locks¶
- mysql_locks_immediate, the number of times per second the requests for table locks could be granted immediately.
- mysql_locks_waited, the number of times per second the requests for table locks had to wait.
Network¶
- mysql_octets_rx, the number of bytes per second received by the server.
- mysql_octets_tx, the number of bytes per second sent by the server.
Threads¶
- mysql_threads_cached, the number of threads in the thread cache.
- mysql_threads_connected, the number of currently open connections.
- mysql_threads_created, the number of threads created per second to handle connections.
- mysql_threads_running, the number of threads that are not sleeping.
Cluster¶
The following metrics are collected with the SHOW STATUS statement. For details, see the Percona documentation. A sketch of how these values can be read follows the list.
- mysql_cluster_connected, 1 when the node is connected to the cluster, if not, then 0.
- mysql_cluster_local_cert_failures, the number of write sets that failed the certification test.
- mysql_cluster_local_commits, the number of write sets committed on the node.
- mysql_cluster_local_recv_queue, the number of write sets waiting to be applied.
- mysql_cluster_local_send_queue, the number of write sets waiting to be sent.
- mysql_cluster_ready, 1 when the node is ready to accept queries, if not, then 0.
- mysql_cluster_received, the total number of write sets received from other nodes.
- mysql_cluster_received_bytes, the total size in bytes of write sets received from other nodes.
- mysql_cluster_replicated, the total number of write sets sent to other nodes.
- mysql_cluster_replicated_bytes, the total size in bytes of write sets sent to other nodes.
- mysql_cluster_size, the current number of nodes in the cluster.
- mysql_cluster_status, 1 when the node is ‘Primary’, 2 if ‘Non-Primary’, and 3 if ‘Disconnected’.
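The following is a minimal Python sketch of how these wsrep values can be obtained with SHOW STATUS. The use of the mysql command-line client and of a local credentials file are assumptions; the actual collection is performed by a collectd Python plugin:

import subprocess

QUERY = "SHOW STATUS WHERE Variable_name LIKE 'wsrep_%'"

def galera_status():
    # Assumes the mysql client can authenticate (for example through a
    # local ~/.my.cnf); --batch output is tab-separated "name<TAB>value".
    out = subprocess.check_output(
        ['mysql', '--batch', '--skip-column-names', '-e', QUERY], text=True)
    status = dict(line.split('\t', 1) for line in out.splitlines() if '\t' in line)
    return {
        'mysql_cluster_connected': 1 if status.get('wsrep_connected') == 'ON' else 0,
        'mysql_cluster_ready': 1 if status.get('wsrep_ready') == 'ON' else 0,
        'mysql_cluster_size': int(status.get('wsrep_cluster_size', 0)),
        'mysql_cluster_local_recv_queue': int(status.get('wsrep_local_recv_queue', 0)),
    }

if __name__ == '__main__':
    print(galera_status())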
Slow queries¶
The following metric is collected with the statement SHOW STATUS WHERE Variable_name = ‘Slow_queries’.
- mysql_slow_queries, the number of queries that have taken more than X seconds, depending on the MySQL configuration parameter ‘long_query_time’ (10s by default).
RabbitMQ¶
Cluster¶
- rabbitmq_connections, the total number of connections.
- rabbitmq_consumers, the total number of consumers.
- rabbitmq_channels, the total number of channels.
- rabbitmq_exchanges, the total number of exchanges.
- rabbitmq_messages, the total number of messages which are ready to be consumed or not yet acknowledged.
- rabbitmq_queues, the total number of queues.
- rabbitmq_running_nodes, the total number of running nodes in the cluster.
- rabbitmq_disk_free, the free disk space.
- rabbitmq_disk_free_limit, the minimum amount of free disk space for RabbitMQ. When rabbitmq_disk_free drops below this value, all producers are blocked.
- rabbitmq_remaining_disk, the difference between rabbitmq_disk_free and rabbitmq_disk_free_limit.
- rabbitmq_used_memory, bytes of memory used by the whole RabbitMQ process.
- rabbitmq_vm_memory_limit, the maximum amount of memory allocated for RabbitMQ. When rabbitmq_used_memory uses more than this value, all producers are blocked.
- rabbitmq_remaining_memory, the difference between rabbitmq_vm_memory_limit and rabbitmq_used_memory (see the sketch after this list).
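rabbitmq_remaining_disk and rabbitmq_remaining_memory are plain differences between the limits and the current usage reported per node. A minimal Python sketch, assuming the RabbitMQ management API is reachable at http://127.0.0.1:15672 with the default guest credentials (both assumptions):

import json
from base64 import b64encode
from urllib.request import Request, urlopen

def rabbitmq_remaining(url='http://127.0.0.1:15672/api/nodes',
                       user='guest', password='guest'):
    req = Request(url)
    token = b64encode('{}:{}'.format(user, password).encode()).decode()
    req.add_header('Authorization', 'Basic ' + token)
    node = json.load(urlopen(req, timeout=5))[0]   # first node of the cluster
    return {
        'rabbitmq_remaining_disk': node['disk_free'] - node['disk_free_limit'],
        'rabbitmq_remaining_memory': node['mem_limit'] - node['mem_used'],
    }

if __name__ == '__main__':
    print(rabbitmq_remaining())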
HAProxy¶
The frontend and backend field values can be as follows:
- cinder-api
- glance-api
- glance-registry-api
- heat-api
- heat-cfn-api
- heat-cloudwatch-api
- horizon-web (when Horizon is deployed without TLS)
- horizon-https (when Horizon is deployed with TLS)
- keystone-public-api
- keystone-admin-api
- mysqld-tcp
- murano-api
- neutron-api
- nova-api
- nova-metadata-api
- nova-novncproxy-websocket
- sahara-api
- swift-api
Server¶
- haproxy_connections, the number of current connections.
- haproxy_pipes_free, the number of free pipes.
- haproxy_pipes_used, the number of used pipes.
- haproxy_run_queue, the number of connections waiting in the queue.
- haproxy_ssl_connections, the number of current SSL connections.
- haproxy_tasks, the number of tasks.
- haproxy_uptime, the HAProxy server uptime in seconds.
Frontends¶
The following metrics have a frontend
field that contains the name of the
front-end server:
- haproxy_frontend_bytes_in, the number of bytes received by the frontend.
- haproxy_frontend_bytes_out, the number of bytes transmitted by the frontend.
- haproxy_frontend_denied_requests, the number of denied requests.
- haproxy_frontend_denied_responses, the number of denied responses.
- haproxy_frontend_error_requests, the number of error requests.
- haproxy_frontend_response_1xx, the number of HTTP responses with 1xx code.
- haproxy_frontend_response_2xx, the number of HTTP responses with 2xx code.
- haproxy_frontend_response_3xx, the number of HTTP responses with 3xx code.
- haproxy_frontend_response_4xx, the number of HTTP responses with 4xx code.
- haproxy_frontend_response_5xx, the number of HTTP responses with 5xx code.
- haproxy_frontend_response_other, the number of HTTP responses with other code.
- haproxy_frontend_session_current, the number of current sessions.
- haproxy_frontend_session_total, the cumulative number of sessions.
Backends¶
The following metrics have a backend
field that contains the name of the
back-end server:
- haproxy_backend_bytes_in, the number of bytes received by the back end.
- haproxy_backend_bytes_out, the number of bytes transmitted by the back end.
- haproxy_backend_denied_requests, the number of denied requests.
- haproxy_backend_denied_responses, the number of denied responses.
- haproxy_backend_downtime, the total downtime in seconds.
- haproxy_backend_error_connection, the number of error connections.
- haproxy_backend_error_responses, the number of error responses.
- haproxy_backend_queue_current, the number of requests in queue.
- haproxy_backend_redistributed, the number of times a request was redispatched to another server.
- haproxy_backend_response_1xx, the number of HTTP responses with 1xx code.
- haproxy_backend_response_2xx, the number of HTTP responses with 2xx code.
- haproxy_backend_response_3xx, the number of HTTP responses with 3xx code.
- haproxy_backend_response_4xx, the number of HTTP responses with 4xx code.
- haproxy_backend_response_5xx, the number of HTTP responses with 5xx code.
- haproxy_backend_response_other, the number of HTTP responses with other code.
- haproxy_backend_retries, the number of times a connection to a server was retried.
- haproxy_backend_server, the status of the back-end server where values 0 and 1 represent, respectively, DOWN and UP. This metric has two additional fields: a state field that contains the state of the back end (either ‘down’ or ‘up’) and a server field that contains the hostname of the back-end server.
- haproxy_backend_servers, the count of servers grouped by state. This metric has an additional state field that contains the state of the back ends (either ‘down’ or ‘up’).
- haproxy_backend_servers_percent, the percentage of servers by state. This metric has an additional state field that contains the state of the back ends (either ‘down’ or ‘up’).
- haproxy_backend_session_current, the number of current sessions.
- haproxy_backend_session_total, the cumulative number of sessions.
- haproxy_backend_status, the global back-end status where values 0 and 1 represent, respectively, DOWN (all back ends are down) and UP (at least one back end is up).
Memcached¶
- memcached_command_flush, the cumulative number of flush requests.
- memcached_command_get, the cumulative number of retrieval requests.
- memcached_command_set, the cumulative number of storage requests.
- memcached_command_touch, the cumulative number of touch requests.
- memcached_connections_current, the number of open connections.
- memcached_df_cache_free, the current number of free bytes to store items.
- memcached_df_cache_used, the current number of bytes used to store items.
- memcached_items_current, the current number of items stored.
- memcached_octets_rx, the total number of bytes read by this server from the network.
- memcached_octets_tx, the total number of bytes sent by this server to the network.
- memcached_ops_decr_hits, the number of successful decr requests.
- memcached_ops_decr_misses, the number of decr requests against missing keys.
- memcached_ops_evictions, the number of valid items removed from the cache to free memory for new items.
- memcached_ops_hits, the number of keys that have been requested and found.
- memcached_ops_incr_hits, the number of successful incr requests.
- memcached_ops_incr_misses, the number of incr requests against missing keys.
- memcached_ops_misses, the number of items that have been requested and not found.
- memcached_percent_hitratio, the percentage of get command hits (in cache), see the sketch below.
- memcached_ps_cputime_syst, the percentage of CPU time spent in system mode by memcached. It can be greater than 100% when the node has more than one CPU.
- memcached_ps_cputime_user, the percentage of CPU time spent in user mode by memcached. It can be greater than 100% when the node has more than one CPU.
For details, see the Memcached documentation.
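The following is a minimal Python sketch of where these values come from: the plain memcached stats command, with memcached_percent_hitratio derived from get_hits and get_misses. Host and port are assumptions:

import socket

def memcached_stats(host='127.0.0.1', port=11211):
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b'stats\r\n')
        raw = b''
        while not raw.endswith(b'END\r\n'):
            chunk = sock.recv(4096)
            if not chunk:
                break
            raw += chunk
    stats = {}
    for line in raw.decode().splitlines():
        if line.startswith('STAT '):
            _, key, value = line.split(' ', 2)
            stats[key] = value
    hits, misses = int(stats['get_hits']), int(stats['get_misses'])
    return {
        'memcached_items_current': int(stats['curr_items']),
        'memcached_connections_current': int(stats['curr_connections']),
        'memcached_percent_hitratio': 100.0 * hits / max(hits + misses, 1),
    }

if __name__ == '__main__':
    print(memcached_stats())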
Libvirt¶
Every metric contains an instance_id
field, which is the UUID of the
instance for the Nova service.
CPU¶
- virt_cpu_time, the average amount of CPU time (in nanoseconds) allocated to the virtual instance in a second (see the sketch after this list).
- virt_vcpu_time, the average amount of CPU time (in nanoseconds) allocated to the virtual CPU in a second. The metric contains a vcpu_number field which is the virtual CPU number.
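The CPU time counter exposed by libvirt is cumulative, so a per-second value such as virt_cpu_time is the difference between two samples divided by the sampling interval. A minimal Python sketch using the libvirt binding; the qemu:///system URI and the sampling interval are assumptions:

import time
import libvirt

def virt_cpu_time(interval=10):
    conn = libvirt.openReadOnly('qemu:///system')
    try:
        domains = conn.listAllDomains()
        # dom.info()[4] is the cumulative CPU time of the domain in nanoseconds.
        first = {dom.UUIDString(): dom.info()[4] for dom in domains}
        time.sleep(interval)
        metrics = {}
        for dom in domains:
            instance_id = dom.UUIDString()
            delta = dom.info()[4] - first[instance_id]
            metrics[instance_id] = delta / float(interval)   # ns of CPU per second
        return metrics
    finally:
        conn.close()

if __name__ == '__main__':
    print(virt_cpu_time(interval=1))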
Disk¶
Metrics have a device
field that contains the virtual disk device to which
the metric applies. For example, ‘vda’, ‘vdb’, and others.
- virt_disk_octets_read, the number of octets (bytes) read per second.
- virt_disk_octets_write, the number of octets (bytes) written per second.
- virt_disk_ops_read, the number of read operations per second.
- virt_disk_ops_write, the number of write operations per second.
Memory¶
- virt_memory_total, the total amount of memory (in bytes) allocated to the virtual instance.
Network¶
Metrics have an interface
field that contains the interface name to which
the metric applies. For example, ‘tap0dc043a6-dd’, ‘tap769b123a-2e’, and others.
- virt_if_dropped_rx, the number of dropped packets per second when receiving from the interface.
- virt_if_dropped_tx, the number of dropped packets per second when transmitting from the interface.
- virt_if_errors_rx, the number of errors per second detected when receiving from the interface.
- virt_if_errors_tx, the number of errors per second detected when transmitting from the interface.
- virt_if_octets_rx, the number of octets (bytes) received per second by the interface.
- virt_if_octets_tx, the number of octets (bytes) transmitted per second by the interface.
- virt_if_packets_rx, the number of packets received per second by the interface.
- virt_if_packets_tx, the number of packets transmitted per second by the interface.
OpenStack¶
Service API checks¶
- openstack_check_api, the API status of the service checked through the load balancer VIP: 1 if it is responsive, if not, then 0. The metric contains a service field that identifies the OpenStack service being checked.
- openstack_check_local_api, the API status of the service checked locally: 1 if it is responsive, if not, then 0. The metric contains a service field that identifies the OpenStack service being checked.
<service> is one of the following values with their respective resource checks:
- ‘ceilometer-api’: ‘/v2/capabilities’
- ‘cinder-api’: ‘/’
- ‘cinder-v2-api’: ‘/’
- ‘glance-api’: ‘/’
- ‘heat-api’: ‘/’
- ‘heat-cfn-api’: ‘/’
- ‘keystone-public-api’: ‘/’
- ‘neutron-api’: ‘/’
- ‘nova-api’: ‘/’
- ‘swift-api’: ‘/healthcheck’
- ‘swift-s3-api’: ‘/healthcheck’
Note
All checks except for Ceilometer are performed without authentication.
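A minimal Python sketch of what these checks express: an HTTP GET against the service resource, reported as 1 when the service answers (even with an HTTP error status) and 0 when it does not answer at all. The endpoint URL below is an assumption:

from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def check_api(url='http://192.168.0.2:8774/', timeout=5):
    try:
        urlopen(url, timeout=timeout)
        return 1
    except HTTPError:
        return 1      # the service answered, even if with an error status
    except (URLError, OSError):
        return 0      # connection refused, timeout, DNS failure, and so on

if __name__ == '__main__':
    print({'openstack_check_api': check_api(), 'service': 'nova-api'})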
Compute¶
The following metrics are emitted per compute node:
- openstack_nova_free_disk, the disk space in GB available for new instances.
- openstack_nova_free_ram, the memory in MB available for new instances.
- openstack_nova_free_vcpus, the number of virtual CPUs available for new instances.
- openstack_nova_instance_creation_time, the time in seconds it took to launch a new instance.
- openstack_nova_instance_state, the number of instances which entered a given state (the value is always 1). The metric contains a state field.
- openstack_nova_running_instances, the number of running instances.
- openstack_nova_running_tasks, the number of tasks currently executed.
- openstack_nova_used_disk, the disk space in GB used by the instances.
- openstack_nova_used_ram, the memory in MB used by the instances.
- openstack_nova_used_vcpus, the number of virtual CPUs used by the instances.
If Nova aggregates are defined, the following metrics are emitted per aggregate. These metrics contain an aggregate field containing the aggregate name and an aggregate_id field containing the ID (integer) of the aggregate.
- openstack_nova_aggregate_free_disk, the total amount of disk space in GB available in the given aggregate for new instances.
- openstack_nova_aggregate_free_ram, the total amount of memory in MB available in the given aggregate for new instances.
- openstack_nova_aggregate_free_vcpus, the total number of virtual CPUs available in the given aggregate for new instances.
- openstack_nova_aggregate_running_instances, the total number of running instances in the given aggregate.
- openstack_nova_aggregate_running_tasks, the total number of tasks currently executed in the given aggregate.
- openstack_nova_aggregate_used_disk, the total amount of disk space in GB used by the instances in the given aggregate.
- openstack_nova_aggregate_used_ram, the total amount of memory in MB used by the instances in the given aggregate.
- openstack_nova_aggregate_used_vcpus, the total number of virtual CPUs used by the instances in the given aggregate.
The following metrics are retrieved from the Nova API and represent the aggregated values across all compute nodes.
- openstack_nova_total_free_disk, the total amount of disk space in GB available for new instances.
- openstack_nova_total_free_ram, the total amount of memory in MB available for new instances.
- openstack_nova_total_free_vcpus, the total number of virtual CPUs available for new instances.
- openstack_nova_total_running_instances, the total number of running instances.
- openstack_nova_total_running_tasks, the total number of tasks currently executed.
- openstack_nova_total_used_disk, the total amount of disk space in GB used by the instances.
- openstack_nova_total_used_ram, the total amount of memory in MB used by the instances.
- openstack_nova_total_used_vcpus, the total number of virtual CPUs used by the instances.
The following metrics are retrieved from the Nova API:
- openstack_nova_instances, the total count of instances in a given state. The metric contains a state field which is one of ‘active’, ‘deleted’, ‘error’, ‘paused’, ‘resumed’, ‘rescued’, ‘resized’, ‘shelved_offloaded’ or ‘suspended’.
- openstack_nova_service, the Nova service state (either 0 for ‘up’, 1 for ‘down’ or 2 for ‘disabled’). The metric contains a service field (one of ‘compute’, ‘conductor’, ‘scheduler’, ‘cert’ or ‘consoleauth’) and a state field (one of ‘up’, ‘down’ or ‘disabled’).
- openstack_nova_services, the total count of Nova services by state. The metric contains a service field (one of ‘compute’, ‘conductor’, ‘scheduler’, ‘cert’ or ‘consoleauth’) and a state field (one of ‘up’, ‘down’, or ‘disabled’).
- openstack_nova_services_percent, the percentage of Nova services by state. The metric contains a service field (one of ‘compute’, ‘conductor’, ‘scheduler’, ‘cert’ or ‘consoleauth’) and a state field (one of ‘up’, ‘down’, or ‘disabled’).
Identity¶
The following metrics are retrieved from the Keystone API:
- openstack_keystone_roles, the total number of roles.
- openstack_keystone_tenants, the number of tenants by state. The metric contains a state field (either ‘enabled’ or ‘disabled’).
- openstack_keystone_users, the number of users by state. The metric contains a state field (either ‘enabled’ or ‘disabled’).
Volume¶
The following metrics are emitted per volume node:
- openstack_cinder_volume_attachement_time, the time in seconds it took to attach a volume to an instance.
- openstack_cinder_volume_creation_time, the time in seconds it took to create a new volume.
Note
When using Ceph as the back-end storage for volumes, the hostname value is always set to rbd.
The following metrics are retrieved from the Cinder API:
- openstack_cinder_snapshots, the number of snapshots by state. The metric contains a state field.
- openstack_cinder_snapshots_size, the total size (in bytes) of snapshots by state. The metric contains a state field.
- openstack_cinder_volumes, the number of volumes by state. The metric contains a state field.
- openstack_cinder_volumes_size, the total size (in bytes) of volumes by state. The metric contains a state field.
state is one of ‘available’, ‘creating’, ‘attaching’, ‘in-use’, ‘deleting’, ‘backing-up’, ‘restoring-backup’, ‘error’, ‘error_deleting’, ‘error_restoring’, or ‘error_extending’.
- openstack_cinder_service, the Cinder service state (either 0 for ‘up’, 1 for ‘down’, or 2 for ‘disabled’). The metric contains a service field (one of ‘volume’, ‘backup’, ‘scheduler’) and a state field (one of ‘up’, ‘down’ or ‘disabled’).
- openstack_cinder_services, the total count of Cinder services by state. The metric contains a service field (one of ‘volume’, ‘backup’, ‘scheduler’) and a state field (one of ‘up’, ‘down’ or ‘disabled’).
- openstack_cinder_services_percent, the percentage of Cinder services by state. The metric contains a service field (one of ‘volume’, ‘backup’, ‘scheduler’) and a state field (one of ‘up’, ‘down’, or ‘disabled’).
Image¶
The following metrics are retrieved from the Glance API:
- openstack_glance_images, the number of images by state and visibility. The metric contains state and visibility fields.
- openstack_glance_images_size, the total size (in bytes) of images by state and visibility. The metric contains state and visibility fields.
- openstack_glance_snapshots, the number of snapshot images by state and visibility. The metric contains state and visibility fields.
- openstack_glance_snapshots_size, the total size (in bytes) of snapshots by state and visibility. The metric contains state and visibility fields.
state is one of ‘queued’, ‘saving’, ‘active’, ‘killed’, ‘deleted’, ‘pending_delete’. visibility is either ‘public’ or ‘private’.
Network¶
The following metrics are retrieved from the Neutron API:
- openstack_neutron_floatingips, the total number of floating IP addresses.
- openstack_neutron_networks, the number of virtual networks by state. The metric contains a state field.
- openstack_neutron_ports, the number of virtual ports by owner and state. The metric contains owner and state fields.
- openstack_neutron_routers, the number of virtual routers by state. The metric contains a state field.
- openstack_neutron_subnets, the number of virtual subnets.
<state> is one of ‘active’, ‘build’, ‘down’ or ‘error’.
<owner> is one of ‘compute’, ‘dhcp’, ‘floatingip’, ‘floatingip_agent_gateway’, ‘router_interface’, ‘router_gateway’, ‘router_ha_interface’, ‘router_interface_distributed’, or ‘router_centralized_snat’.
Note
These metrics are not collected when the Contrail plugin is deployed.
- openstack_neutron_agent, the Neutron agent state (either 0 for ‘up’, 1 for ‘down’, or 2 for ‘disabled’). The metric contains a service field (one of ‘dhcp’, ‘l3’, ‘metadata’, or ‘openvswitch’), and a state field (one of ‘up’, ‘down’ or ‘disabled’).
- openstack_neutron_agents, the total number of Neutron agents by service and state. The metric contains service (one of ‘dhcp’, ‘l3’, ‘metadata’ or ‘openvswitch’) and state (one of ‘up’, ‘down’ or ‘disabled’) fields.
- openstack_neutron_agents_percent, the percentage of Neutron agents by state. The metric contains a service field (one of ‘dhcp’, ‘l3’, ‘metadata’ or ‘openvswitch’) and a state field (one of ‘up’, ‘down’, or ‘disabled’).
API response times¶
- openstack_<service>_http_response_times, HTTP response time statistics. The statistics are min, max, sum, count and upper_90 (90th percentile) over 10 seconds. The metric contains an http_method field, for example, ‘GET’, ‘POST’, and others, and an http_status field, for example, ‘2xx’, ‘4xx’, and others.
<service> is one of ‘cinder’, ‘glance’, ‘heat’, ‘keystone’, ‘neutron’ or ‘nova’.
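The following is a minimal Python sketch of the reduction applied to each 10-second window of response times. The exact percentile method used by the Lua filter may differ; this is only an illustration:

def response_time_stats(samples):
    """Reduce one window of response times (in seconds) to the reported statistics."""
    ordered = sorted(samples)
    count = len(ordered)
    idx = max(int(round(0.9 * count)) - 1, 0)
    return {
        'min': ordered[0],
        'max': ordered[-1],
        'sum': sum(ordered),
        'count': count,
        'upper_90': ordered[idx],   # 90th percentile
    }

if __name__ == '__main__':
    window = [0.05, 0.07, 0.06, 0.31, 0.04, 0.09, 0.05, 0.08, 0.06, 0.05]
    print(response_time_stats(window))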
Logs¶
- log_messages, the number of log messages per second for the given service and severity level. The metric contains service and level (one of ‘debug’, ‘info’, and others) fields.
Ceph¶
All Ceph metrics have a cluster field containing the name of the Ceph cluster (ceph by default).
For details, see Cluster monitoring and RADOS monitoring.
Cluster¶
- ceph_health, the health status of the entire cluster where values 1, 2, 3 represent OK, WARNING and ERROR, respectively (see the sketch after this list).
- ceph_monitor_count, the number of ceph-mon processes.
- ceph_quorum_count, the number of ceph-mon processes participating in the quorum.
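A minimal Python sketch of the ceph_health mapping, assuming the ceph command line is available on the node (the Collector obtains the same information through a collectd plugin):

import subprocess

HEALTH_MAP = {'HEALTH_OK': 1, 'HEALTH_WARN': 2, 'HEALTH_ERR': 3}

def ceph_health(cluster='ceph'):
    out = subprocess.check_output(['ceph', '--cluster', cluster, 'health'], text=True)
    status = out.split()[0]             # for example "HEALTH_WARN ..."
    return HEALTH_MAP.get(status, 3)    # treat anything unexpected as ERROR

if __name__ == '__main__':
    print({'ceph_health': ceph_health(), 'cluster': 'ceph'})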
Pools¶
- ceph_pool_total_avail_bytes, the total available size in bytes for all pools.
- ceph_pool_total_bytes, the total number of bytes for all pools.
- ceph_pool_total_number, the total number of pools.
- ceph_pool_total_used_bytes, the total used size in bytes by all pools.
The following metrics have a pool
field that contains the name of the
Ceph pool.
- ceph_pool_bytes_used, the amount of data in bytes used by the pool.
- ceph_pool_max_avail, the available size in bytes for the pool.
- ceph_pool_objects, the number of objects in the pool.
- ceph_pool_op_per_sec, the number of operations per second for the pool.
- ceph_pool_pg_num, the number of placement groups for the pool.
- ceph_pool_read_bytes_sec, the number of bytes read per second for the pool.
- ceph_pool_size, the number of data replications for the pool.
- ceph_pool_write_bytes_sec, the number of bytes written per second for the pool.
Placement Groups¶
- ceph_pg_bytes_avail, the available size in bytes.
- ceph_pg_bytes_total, the cluster total size in bytes.
- ceph_pg_bytes_used, the data stored size in bytes.
- ceph_pg_data_bytes, the stored data size in bytes before it is replicated, cloned or snapshotted.
- ceph_pg_state, the number of placement groups in a given state. The metric contains a state field whose value is a combination, separated by +, of 2 or more states from this list: creating, active, clean, down, replay, splitting, scrubbing, degraded, inconsistent, peering, repair, recovering, recovery_wait, backfill, backfill-wait, backfill_toofull, incomplete, stale, remapped.
- ceph_pg_total, the total number of placement groups.
OSD Daemons¶
- ceph_osd_down, the number of OSD daemons DOWN.
- ceph_osd_in, the number of OSD daemons IN.
- ceph_osd_out, the number of OSD daemons OUT.
- ceph_osd_up, the number of OSD daemons UP.
The following metrics have an osd
field that contains the OSD identifier:
- ceph_osd_apply_latency, the apply latency in ms for the given OSD.
- ceph_osd_commit_latency, the commit latency in ms for the given OSD.
- ceph_osd_total, the total size in bytes for the given OSD.
- ceph_osd_used, the data stored size in bytes for the given OSD.
OSD Performance¶
All the following metrics are retrieved per OSD daemon from the corresponding
/var/run/ceph/ceph-osd.<ID>.asok
socket by issuing the perf dump
command.
All metrics have an osd
field that contains the OSD identifier.
Note
These metrics are not collected when a node has both the ceph-osd and controller roles.
For details, see OSD performance counters.
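A minimal Python sketch of that collection for OSD 0; only a few of the counters from the osd section of the perf dump output are shown, and the OSD id is just an example:

import json
import subprocess

def osd_perf(osd_id=0):
    socket_path = '/var/run/ceph/ceph-osd.{}.asok'.format(osd_id)
    out = subprocess.check_output(
        ['ceph', '--admin-daemon', socket_path, 'perf', 'dump'], text=True)
    osd = json.loads(out)['osd']
    return {
        'ceph_perf_osd_op': osd['op'],
        'ceph_perf_osd_op_in_bytes': osd['op_in_bytes'],
        'ceph_perf_osd_op_r': osd['op_r'],
        'ceph_perf_osd_op_w': osd['op_w'],
    }

if __name__ == '__main__':
    print(osd_perf(0))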
- ceph_perf_osd_op, the number of client operations.
- ceph_perf_osd_op_in_bytes, the number of bytes received from clients for write operations.
- ceph_perf_osd_op_latency, the average latency in ms for client operations (including queue time).
- ceph_perf_osd_op_out_bytes, the number of bytes sent to clients for read operations.
- ceph_perf_osd_op_process_latency, the average latency in ms for client operations (excluding queue time).
- ceph_perf_osd_op_r, the number of client read operations.
- ceph_perf_osd_op_r_latency, the average latency in ms for read operations (including queue time).
- ceph_perf_osd_op_r_out_bytes, the number of bytes sent to clients for read operations.
- ceph_perf_osd_op_r_process_latency, the average latency in ms for read operations (excluding queue time).
- ceph_perf_osd_op_rw, the number of client read-modify-write operations.
- ceph_perf_osd_op_rw_in_bytes, the number of bytes per second received from clients for read-modify-write operations.
- ceph_perf_osd_op_rw_latency, the average latency in ms for read-modify-write operations (including queue time).
- ceph_perf_osd_op_rw_out_bytes, the number of bytes per second sent to clients for read-modify-write operations.
- ceph_perf_osd_op_rw_process_latency, the average latency in ms for read-modify-write operations (excluding queue time).
- ceph_perf_osd_op_rw_rlat, the average latency in ms for read-modify-write operations with readable/applied.
- ceph_perf_osd_op_w, the number of client write operations.
- ceph_perf_osd_op_wip, the number of replication operations currently being processed (primary).
- ceph_perf_osd_op_w_in_bytes, the number of bytes received from clients for write operations.
- ceph_perf_osd_op_w_latency, the average latency in ms for write operations (including queue time).
- ceph_perf_osd_op_w_process_latency, the average latency in ms for write operations (excluding queue time).
- ceph_perf_osd_op_w_rlat, the average latency in ms for write operations with readable/applied.
- ceph_perf_osd_recovery_ops, the number of recovery operations in progress.
Pacemaker¶
Cluster¶
- pacemaker_local_dc_active, 1 when the Designated Controller (DC) is the local host, if not, then 0.
- pacemaker_dc [1], 1 when the Designated Controller (DC) is present, if not, then 0.
- pacemaker_quorum_status [1], 1 when the cluster’s quorum is reached, if not, then 0.
- pacemaker_configured_nodes [1], the number of configured nodes in the cluster.
- pacemaker_configured_resources [1], the number of configured resources in the cluster.
[1] This metric is only emitted from the node that is the Designated Controller (DC) of the Pacemaker cluster.
Node¶
The following metrics are only emitted from the node that is the Designated
Controller (DC) of the Pacemaker cluster. They have a status
field which is
one of ‘offline’, ‘maintenance’, or ‘online’:
- pacemaker_node_status, the status of the node: 0 when offline, 1 when in maintenance or 2 when online.
- pacemaker_node_count, the total number of nodes with the given status.
- pacemaker_node_percent, the percentage of nodes with the given status.
Resource¶
- pacemaker_local_resource_active, 1 when the resource is located on the host reporting the metric, if not, then 0. The metric contains a resource field which is one of ‘vip__public’, ‘vip__management’, ‘vip__vrouter_pub’, or ‘vip__vrouter’.
- pacemaker_resource_failures [2], the total number of failures that Pacemaker detected for the resource. The counter is reset every time the collector restarts. The metric contains a resource field which is one of ‘vip__management’, ‘vip__public’, ‘vip__vrouter_pub’, ‘vip__vrouter’, ‘rabbitmq’, ‘mysqld’ or ‘haproxy’.
- pacemaker_resource_operations [2], the total number of operations that Pacemaker applied to the resource. The counter is reset every time the collector restarts. The metric contains a resource field which is one of ‘vip__management’, ‘vip__public’, ‘vip__vrouter_pub’, ‘vip__vrouter’, ‘rabbitmq’, ‘mysqld’ or ‘haproxy’.
The following metrics have resource and status fields. status is one of ‘offline’, ‘maintenance’, or ‘online’. resource is one of ‘vip__management’, ‘vip__public’, ‘vip__vrouter_pub’, ‘vip__vrouter’, ‘rabbitmq’, ‘mysqld’ or ‘haproxy’.
- pacemaker_resource_count [2], the total number of instances for the given status and resource.
- pacemaker_resource_percent [2], the percentage of instances for the given status and resource.
[2] This metric is only emitted from the node that is the Designated Controller (DC) of the Pacemaker cluster.
Clusters¶
The cluster metrics are emitted by the GSE plugins. For details, see Configuring alarms.
- cluster_node_status, the status of the node cluster. The metric contains a cluster_name field that identifies the node cluster.
- cluster_service_status, the status of the service cluster. The metric contains a cluster_name field that identifies the service cluster.
- cluster_status, the status of the global cluster. The metric contains a cluster_name field that identifies the global cluster.
The supported values for these metrics are:
- 0 for the Okay status.
- 1 for the Warning status.
- 2 for the Unknown status.
- 3 for the Critical status.
- 4 for the Down status.
Self-monitoring¶
System¶
The metrics have a service field with the name of the service they apply to. The values can be: hekad, collectd, influxd, grafana-server or elasticsearch.
- lma_components_count_processes, the number of processes currently running.
- lma_components_count_threads, the number of threads currently running.
- lma_components_cputime_syst, the percentage of CPU time spent in system mode by the service. It can be greater than 100% when the node has more than one CPU.
- lma_components_cputime_user, the percentage of CPU time spent in user mode by the service. It can be greater than 100% when the node has more than one CPU (see the sketch after this list).
- lma_components_disk_bytes_read, the number of bytes read from disk(s) per second.
- lma_components_disk_bytes_write, the number of bytes written to disk(s) per second.
- lma_components_disk_ops_read, the number of read operations from disk(s) per second.
- lma_components_disk_ops_write, the number of write operations to disk(s) per second.
- lma_components_memory_code, the physical memory devoted to executable code in bytes.
- lma_components_memory_data, the physical memory devoted to other than executable code in bytes.
- lma_components_memory_rss, the non-swapped physical memory used in bytes.
- lma_components_memory_vm, the virtual memory size in bytes.
- lma_components_pagefaults_majflt, major page faults per second.
- lma_components_pagefaults_minflt, minor page faults per second.
- lma_components_stacksize, the absolute value of the start address (the bottom) of the stack minus the address of the current stack pointer.
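A minimal Python sketch of how a per-process CPU percentage such as lma_components_cputime_user can be computed from /proc: the utime counter is sampled twice and the delta, converted from clock ticks to seconds, is divided by the interval. The interval and the use of the current process are just for illustration:

import os
import time

CLOCK_TICKS = os.sysconf('SC_CLK_TCK')

def proc_utime(pid):
    with open('/proc/{}/stat'.format(pid)) as stat_file:
        raw = stat_file.read()
    # Split after the closing parenthesis of the command name so that names
    # containing spaces do not shift the fields; utime is the 14th field of
    # the file, that is the 12th after the parenthesis.
    fields = raw.rsplit(')', 1)[1].split()
    return int(fields[11])

def cputime_user_percent(pid, interval=10):
    before = proc_utime(pid)
    time.sleep(interval)
    delta = proc_utime(pid) - before
    return 100.0 * (delta / CLOCK_TICKS) / interval

if __name__ == '__main__':
    print(cputime_user_percent(os.getpid(), interval=1))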
Heka pipeline¶
The metrics have two fields: name, which contains the name of the decoder or filter as defined by Heka, and type, which is either decoder or filter.
The metrics for both types are as follows:
- hekad_memory, the total memory in bytes used by the Sandbox.
- hekad_msg_avg_duration, the average time in nanoseconds for processing the message.
- hekad_msg_count, the total number of messages processed by the decoder. This resets to 0 when the process is restarted.
Additional metrics for the filter type:
- hekad_timer_event_avg_duration, the average time in nanoseconds for executing the timer_event function.
- hekad_timer_event_count, the total number of executions of the timer_event function. This resets to 0 when the process is restarted.
Back-end checks¶
- http_check, the API status of the back end: 1 if it is responsive, if not, then 0. The metric contains a service field that identifies the LMA back-end service being checked.
<service> is one of the following values, depending on which Fuel plugins are deployed in the environment:
- ‘influxdb’
Elasticsearch¶
The following metrics represent the simple status on the health of the cluster. For details, see Cluster health.
- elasticsearch_cluster_active_primary_shards, the number of active primary shards.
- elasticsearch_cluster_active_shards, the number of active shards.
- elasticsearch_cluster_health, the health status of the entire cluster where values 1, 2, 3 represent green, yellow and red, respectively. The red status may also be reported when the Elasticsearch API returns an unexpected result, for example, on a network failure.
- elasticsearch_cluster_initializing_shards, the number of initializing shards.
- elasticsearch_cluster_number_of_nodes, the number of nodes in the cluster.
- elasticsearch_cluster_number_of_pending_tasks, the number of pending tasks.
- elasticsearch_cluster_relocating_shards, the number of relocating shards.
- elasticsearch_cluster_unassigned_shards, the number of unassigned shards.
InfluxDB¶
The following metrics are extracted from the output of the show stats command. The values are reset to zero when InfluxDB is restarted.
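A minimal Python sketch of that extraction through the InfluxDB HTTP API; the URL and the absence of authentication are assumptions about the local setup:

import json
from urllib.parse import urlencode
from urllib.request import urlopen

def influxdb_show_stats(base_url='http://127.0.0.1:8086'):
    url = base_url + '/query?' + urlencode({'q': 'SHOW STATS'})
    payload = json.load(urlopen(url, timeout=5))
    metrics = {}
    for series in payload['results'][0].get('series', []):
        # Each series (httpd, write, runtime, ...) carries one row of values.
        for column, value in zip(series['columns'], series['values'][0]):
            metrics['influxdb_{}_{}'.format(series['name'], column)] = value
    return metrics

if __name__ == '__main__':
    for name, value in sorted(influxdb_show_stats().items()):
        print(name, value)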
cluster¶
The following metrics are only available if there is more than one node in the cluster:
- influxdb_cluster_write_shard_points_requests, the number of requests for writing time series points to a shard.
- influxdb_cluster_write_shard_requests, the number of requests for writing to a shard.
httpd¶
- influxdb_httpd_failed_auths, the number of failed authentications.
- influxdb_httpd_ping_requests, the number of ping requests.
- influxdb_httpd_query_requests, the number of query requests received.
- influxdb_httpd_query_response_bytes, the number of bytes returned to the client.
- influxdb_httpd_requests, the number of requests received.
- influxdb_httpd_write_points_ok, the number of points successfully written.
- influxdb_httpd_write_request_bytes, the number of bytes received for write requests.
- influxdb_httpd_write_requests, the number of write requests received.
write¶
- influxdb_write_local_point_requests, the number of write points requests from the local data node.
- influxdb_write_ok, the number of writes that succeeded at the requested consistency level.
- influxdb_write_point_requests, the number of write points requests across all data nodes.
- influxdb_write_remote_point_requests, the number of write points requests to remote data nodes.
- influxdb_write_requests, the number of write requests across all data nodes.
- influxdb_write_sub_ok, the number of points successfully sent to subscriptions.
runtime¶
- influxdb_garbage_collections, the number of garbage collections.
- influxdb_go_routines, the number of Golang routines.
- influxdb_heap_idle, the number of bytes in idle spans.
- influxdb_heap_in_use, the number of bytes in non-idle spans.
- influxdb_heap_objects, the total number of allocated objects.
- influxdb_heap_released, the number of bytes released to the operating system.
- influxdb_heap_system, the number of bytes obtained from the system.
- influxdb_memory_alloc, the number of bytes allocated and not yet freed.
- influxdb_memory_frees, the number of free operations.
- influxdb_memory_lookups, the number of pointer lookups.
- influxdb_memory_mallocs, the number of malloc operations.
- influxdb_memory_system, the number of bytes obtained from the system.
- influxdb_memory_total_alloc, the number of bytes allocated (even if freed).
Checks¶
The check metrics are emitted to express the success or the failure of the
metric collections for the local services.
The value is 1 when successful and 0 if it fails.
- apache_check, for Apache.
- ceph_mon_check, for the Ceph monitor.
- ceph_osd_check, for Ceph OSD.
- elasticsearch_check, for Elasticsearch.
- haproxy_check, for HAProxy.
- influxdb_check, for InfluxDB.
- libvirt_check, for Libvirt.
- mysql_check, for MySQL.
- pacemaker_check, for Pacemaker.
- rabbitmq_check, for RabbitMQ.
List of built-in alarms¶
The following is a list of StackLight built-in alarms:
alarms:
- name: 'cpu-critical-controller'
description: 'The CPU usage is too high (controller node)'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_idle
relational_operator: '<='
threshold: 5
window: 120
periods: 0
function: avg
- metric: cpu_wait
relational_operator: '>='
threshold: 35
window: 120
periods: 0
function: avg
- name: 'cpu-warning-controller'
description: 'The CPU usage is high (controller node)'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_idle
relational_operator: '<='
threshold: 15
window: 120
periods: 0
function: avg
- metric: cpu_wait
relational_operator: '>='
threshold: 25
window: 120
periods: 0
function: avg
- name: 'swap-usage-critical'
description: 'There is no more swap free space'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: swap_free
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: max
- name: 'swap-activity-warning'
description: 'The swap activity is high'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: swap_io_in
relational_operator: '>='
threshold: 1048576 # 1 Mb/s
window: 120
periods: 0
function: avg
- metric: swap_io_out
relational_operator: '>='
threshold: 1048576 # 1 Mb/s
window: 120
periods: 0
function: avg
- name: 'swap-usage-warning'
description: 'The swap free space is low'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: swap_percent_used
relational_operator: '>='
threshold: 0.8
window: 60
periods: 0
function: avg
- name: 'cpu-critical-compute'
description: 'The CPU usage is too high (compute node)'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_wait
relational_operator: '>='
threshold: 30
window: 120
periods: 0
function: avg
- name: 'cpu-warning-compute'
description: 'The CPU usage is high (compute node)'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_wait
relational_operator: '>='
threshold: 20
window: 120
periods: 0
function: avg
- name: 'cpu-critical-rabbitmq'
description: 'The CPU usage is too high (RabbitMQ node)'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_idle
relational_operator: '<='
threshold: 5
window: 120
periods: 0
function: avg
- name: 'cpu-warning-rabbitmq'
description: 'The CPU usage is high (RabbitMQ node)'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_idle
relational_operator: '<='
threshold: 15
window: 120
periods: 0
function: avg
- name: 'cpu-critical-mysql'
description: 'The CPU usage is too high (MySQL node)'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_idle
relational_operator: '<='
threshold: 5
window: 120
periods: 0
function: avg
- name: 'cpu-warning-mysql'
description: 'The CPU usage is high (MySQL node)'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_idle
relational_operator: '<='
threshold: 15
window: 120
periods: 0
function: avg
- name: 'cpu-critical-storage'
description: 'The CPU usage is too high (storage node)'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_wait
relational_operator: '>='
threshold: 40
window: 120
periods: 0
function: avg
- metric: cpu_idle
relational_operator: '<='
threshold: 5
window: 120
periods: 0
function: avg
- name: 'cpu-warning-storage'
description: 'The CPU usage is high (storage node)'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_wait
relational_operator: '>='
threshold: 30
window: 120
periods: 0
function: avg
- metric: cpu_idle
relational_operator: '<='
threshold: 15
window: 120
periods: 0
function: avg
- name: 'cpu-critical-default'
description: 'The CPU usage is too high'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_wait
relational_operator: '>='
threshold: 35
window: 120
periods: 0
function: avg
- metric: cpu_idle
relational_operator: '<='
threshold: 5
window: 120
periods: 0
function: avg
- name: 'rabbitmq-disk-limit-critical'
description: 'RabbitMQ has reached the free disk threshold. All producers are blocked'
severity: 'critical'
# If the local RabbitMQ instance is down, it will be caught by the
# rabbitmq-check alarm
no_data_policy: 'okay'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: rabbitmq_remaining_disk
relational_operator: '<='
threshold: 0
window: 20
periods: 0
function: min
- name: 'rabbitmq-disk-limit-warning'
description: 'RabbitMQ is getting close to the free disk threshold'
severity: 'warning'
# If the local RabbitMQ instance is down, it will be caught by the
# rabbitmq-check alarm
no_data_policy: 'okay'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: rabbitmq_remaining_disk
relational_operator: '<='
threshold: 104857600 # 100MB
window: 20
periods: 0
function: min
- name: 'rabbitmq-memory-limit-critical'
description: 'RabbitMQ has reached the memory threshold. All producers are blocked'
severity: 'critical'
# If the local RabbitMQ instance is down, it will be caught by the
# rabbitmq-check alarm
no_data_policy: 'okay'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: rabbitmq_remaining_memory
relational_operator: '<='
threshold: 0
window: 20
periods: 0
function: min
- name: 'rabbitmq-memory-limit-warning'
description: 'RabbitMQ is getting close to the memory threshold'
severity: 'warning'
# If the local RabbitMQ instance is down, it will be caught by the
# rabbitmq-check alarm
no_data_policy: 'okay'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: rabbitmq_remaining_memory
relational_operator: '<='
threshold: 104857600 # 100MB
window: 20
periods: 0
function: min
- name: 'rabbitmq-queue-warning'
description: 'The number of outstanding messages is too high'
severity: 'warning'
# If the local RabbitMQ instance is down, it will be caught by the
# rabbitmq-check alarm
no_data_policy: 'okay'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: rabbitmq_messages
relational_operator: '>='
threshold: 200
window: 120
periods: 0
function: avg
- name: 'rabbitmq-pacemaker-down'
description: 'The RabbitMQ cluster is down'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
logical_operator: 'and'
rules:
- metric: pacemaker_resource_percent
fields:
resource: rabbitmq
status: up
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'rabbitmq-pacemaker-critical'
description: 'The RabbitMQ cluster is critical because less than half of the nodes are up'
severity: 'critical'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
logical_operator: 'and'
rules:
- metric: pacemaker_resource_percent
fields:
resource: rabbitmq
status: up
relational_operator: '<'
threshold: 50
window: 60
periods: 0
function: last
- name: 'rabbitmq-pacemaker-warning'
description: 'The RabbitMQ cluster is degraded because some RabbitMQ nodes are missing'
severity: 'warning'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
logical_operator: 'and'
rules:
- metric: pacemaker_resource_percent
fields:
resource: rabbitmq
status: up
relational_operator: '<'
threshold: 100
window: 60
periods: 0
function: last
- name: 'apache-warning'
description: 'There is no Apache idle workers available'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: apache_idle_workers
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: min
- name: 'apache-check'
description: 'Apache cannot be checked'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: apache_check
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'log-fs-warning'
description: "The log filesystem's free space is low"
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/log'
relational_operator: '<'
threshold: 10
window: 60
periods: 0
function: min
- name: 'log-fs-critical'
description: "The log filesystem's free space is too low"
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/log'
relational_operator: '<'
threshold: 5
window: 60
periods: 0
function: min
- name: 'root-fs-warning'
description: "The root filesystem's free space is low"
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/'
relational_operator: '<'
threshold: 10
window: 60
periods: 0
function: min
- name: 'root-fs-critical'
description: "The root filesystem's free space is too low"
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/'
relational_operator: '<'
threshold: 5
window: 60
periods: 0
function: min
- name: 'mysql-fs-warning'
description: "The MySQL filesystem's free space is low"
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/lib/mysql'
relational_operator: '<'
threshold: 10
window: 60
periods: 0
function: min
- name: 'mysql-fs-critical'
description: "The MySQL filesystem's free space is too low"
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/lib/mysql'
relational_operator: '<'
threshold: 5
window: 60
periods: 0
function: min
- name: 'nova-fs-warning'
description: "The filesystem's free space is low (compute node)"
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/lib/nova'
relational_operator: '<'
threshold: 10
window: 60
periods: 0
function: min
- name: 'nova-fs-critical'
description: "The filesystem's free space is too low (compute node)"
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/lib/nova'
relational_operator: '<'
threshold: 5
window: 60
periods: 0
function: min
- name: 'other-fs-warning'
description: "The filesystem's free space is low"
severity: 'warning'
enabled: 'true'
no_data_policy: 'okay'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '!= /var/lib/nova && != /var/log && != /var/lib/mysql && != / && !~ ceph%-%d+$'
group_by: [fs]
relational_operator: '<'
threshold: 10
window: 60
periods: 0
function: min
- name: 'other-fs-critical'
description: "The filesystem's free space is too low"
severity: 'critical'
enabled: 'true'
no_data_policy: 'okay'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '!= /var/lib/nova && != /var/log && != /var/lib/mysql && != / && !~ ceph%-%d+$'
group_by: [fs]
relational_operator: '<'
threshold: 5
window: 60
periods: 0
function: min
- name: 'osd-disk-critical'
description: "The filesystem's free space is too low (OSD disk)"
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
# Real FS is /var/lib/ceph/osd/ceph-0 but Collectd substituted '/' by '-'
fs: '=~ ceph/%d+$'
group_by: [fs]
relational_operator: '<'
threshold: 5
window: 60
periods: 0
function: min
- name: 'nova-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on nova-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'nova-api'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'nova-logs-error'
description: 'Too many errors have been detected in Nova logs'
severity: 'warning'
no_data_policy: 'okay'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: log_messages
fields:
service: 'nova'
level: 'error'
relational_operator: '>'
threshold: 0.1
window: 70
periods: 0
function: max
- name: 'heat-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on heat-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'heat-api'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'heat-logs-error'
description: 'Too many errors have been detected in Heat logs'
severity: 'warning'
no_data_policy: 'okay'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: log_messages
fields:
service: 'heat'
level: 'error'
relational_operator: '>'
threshold: 0.1
window: 70
periods: 0
function: max
- name: 'swift-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on swift-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'swift-api || object-storage'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'swift-logs-error'
description: 'Too many errors have been detected in Swift logs'
severity: 'warning'
no_data_policy: 'okay'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: log_messages
fields:
service: 'swift'
level: 'error'
relational_operator: '>'
threshold: 0.1
window: 70
periods: 0
function: max
- name: 'cinder-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on cinder-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'cinder-api'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'cinder-logs-error'
description: 'Too many errors have been detected in Cinder logs'
severity: 'warning'
no_data_policy: 'okay'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: log_messages
fields:
service: 'cinder'
level: 'error'
relational_operator: '>'
threshold: 0.1
window: 70
periods: 0
function: max
- name: 'glance-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on glance-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'glance-api'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'glance-logs-error'
description: 'Too many errors have been detected in Glance logs'
severity: 'warning'
no_data_policy: 'okay'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: log_messages
fields:
service: 'glance'
level: 'error'
relational_operator: '>'
threshold: 0.1
window: 70
periods: 0
function: max
- name: 'neutron-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on neutron-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'neutron-api'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'neutron-logs-error'
description: 'Too many errors have been detected in Neutron logs'
severity: 'warning'
no_data_policy: 'okay'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: log_messages
fields:
service: 'neutron'
level: 'error'
relational_operator: '>'
threshold: 0.1
window: 70
periods: 0
function: max
- name: 'keystone-response-time-duration'
description: 'Keystone API is too slow'
severity: 'warning'
no_data_policy: 'okay'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: openstack_keystone_http_response_times
fields:
http_method: '== GET || == POST'
http_status: '!= 5xx'
relational_operator: '>'
threshold: 0.3
window: 60
periods: 0
value: upper_90
function: max
- name: 'keystone-public-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on keystone-public-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'keystone-public-api'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'keystone-admin-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on keystone-admin-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'keystone-admin-api'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'horizon-web-http-errors'
description: 'Too many 5xx HTTP errors have been detected on horizon'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'horizon-web || horizon-https'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'keystone-logs-error'
description: 'Too many errors have been detected in Keystone logs'
severity: 'warning'
no_data_policy: 'okay'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: log_messages
fields:
service: 'keystone'
level: 'error'
relational_operator: '>'
threshold: 0.1
window: 70
periods: 0
function: max
- name: 'mysql-node-connected'
description: 'The MySQL service has lost connectivity with the other nodes'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: mysql_cluster_connected
relational_operator: '=='
threshold: 0
window: 30
periods: 1
function: min
- name: 'mysql-node-ready'
description: "The MySQL service isn't ready to serve queries"
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: mysql_cluster_ready
relational_operator: '=='
threshold: 0
window: 30
periods: 1
function: min
- name: 'ceph-health-critical'
description: 'Ceph health is critical'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: ceph_health
relational_operator: '=='
threshold: 3 # HEALTH_ERR
window: 60
function: max
- name: 'ceph-health-warning'
description: 'Ceph health is warning'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: ceph_health
relational_operator: '=='
threshold: 2 # HEALTH_WARN
window: 60
function: max
- name: 'ceph-capacity-critical'
description: 'Ceph free capacity is too low'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: ceph_pool_total_percent_free
relational_operator: '<'
threshold: 2
window: 60
function: max
- name: 'ceph-capacity-warning'
description: 'Ceph free capacity is low'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: ceph_pool_total_percent_free
relational_operator: '<'
threshold: 5
window: 60
function: max
- name: 'elasticsearch-health-critical'
description: 'Elasticsearch cluster health is critical'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: elasticsearch_cluster_health
relational_operator: '=='
threshold: 3 # red
window: 60
function: min
- name: 'elasticsearch-health-warning'
description: 'Elasticsearch health is warning'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: elasticsearch_cluster_health
relational_operator: '=='
threshold: 2 # yellow
window: 60
function: min
- name: 'elasticsearch-fs-warning'
description: "The filesystem's free space is low (Elasticsearch node)"
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/opt/es/data' # Real FS is /opt/es-data but Collectd substituted '/' by '-'
relational_operator: '<'
threshold: 20 # The low watermark for disk usage is 85% by default
window: 60
periods: 0
function: min
- name: 'elasticsearch-fs-critical'
description: "The filesystem's free space is too low (Elasticsearch node)"
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/opt/es/data' # Real FS is /opt/es-data but Collectd substituted '/' by '-'
relational_operator: '<'
threshold: 15 # The high watermark for disk usage is 90% by default
window: 60
periods: 0
function: min
- name: 'influxdb-fs-warning'
description: "The filesystem's free space is low (InfluxDB node)"
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/lib/influxdb'
relational_operator: '<'
threshold: 10
window: 60
periods: 0
function: min
- name: 'influxdb-fs-critical'
description: "The filesystem's free space is too low (InfluxDB node)"
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/lib/influxdb'
relational_operator: '<'
threshold: 5
window: 60
periods: 0
function: min
- name: 'haproxy-check'
description: "HAProxy cannot be checked"
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: haproxy_check
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'rabbitmq-check'
description: "RabbitMQ cannot be checked"
# This alarm's severity is warning because the effective status of the
# RabbitMQ cluster is computed by rabbitmq-pacemaker-* alarms.
# This alarm is still useful because it will report the node(s) on which
# RabbitMQ isn't running.
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: rabbitmq_check
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'ceph-mon-check'
description: "Ceph monitor cannot be checked"
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: ceph_mon_check
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'ceph-osd-check'
description: "Ceph OSD cannot be checked"
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: ceph_osd_check
relational_operator: '=='
threshold: 0
window: 80 # The metric collection interval is 60s
periods: 0
function: last
- name: 'pacemaker-check'
description: "Pacemaker cannot be checked"
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: pacemaker_check
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'elasticsearch-check'
description: "Elasticsearch cannot be checked"
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: elasticsearch_check
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'influxdb-check'
description: "InfluxDB cannot be checked"
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: influxdb_check
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'libvirt-check'
description: "Libvirt cannot be checked"
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: libvirt_check
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'memcached-check'
description: "memcached cannot be checked"
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: memcached_check
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'mysql-check'
description: "MySQL cannot be checked"
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: mysql_check
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'network-warning-dropped-rx'
description: "Some received packets have been dropped"
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: if_dropped_rx
relational_operator: '>'
threshold: 100
window: 60
periods: 0
function: avg
- name: 'network-critical-dropped-rx'
description: "Too many received packets have been dropped"
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: if_dropped_rx
relational_operator: '>'
threshold: 1000
window: 60
periods: 0
function: avg
- name: 'network-warning-dropped-tx'
description: "Some transmitted packets have been dropped"
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: if_dropped_tx
relational_operator: '>'
threshold: 100
window: 60
periods: 0
function: avg
- name: 'network-critical-dropped-tx'
description: "Too many transmitted packets have been dropped"
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: if_dropped_tx
relational_operator: '>'
threshold: 1000
function: avg
window: 60
- name: 'instance-creation-time-warning'
description: "Instance creation takes too much time"
severity: 'warning'
no_data_policy: 'okay' # This is a sporadic metric
enabled: 'true'
trigger:
rules:
- metric: openstack_nova_instance_creation_time
relational_operator: '>'
threshold: 20
window: 600
periods: 0
function: avg
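# Note on 'no_data_policy': as the inline comments in this file indicate,
# 'okay' treats the absence of datapoints as a healthy state (suitable for
# sporadic metrics such as openstack_nova_instance_creation_time above), while
# 'skip' simply ignores the missing data (suitable for metrics collected only
# on a subset of nodes, such as the node holding the management VIP).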
- name: 'hdd-errors-critical'
description: 'Errors on hard drive(s) have been detected'
severity: 'critical'
enabled: 'true'
no_data_policy: okay
trigger:
rules:
- metric: hdd_errors_rate
group_by: ['device']
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: max
- name: 'total-nova-free-vcpu-warning'
description: 'There is no VCPU available for new instances'
severity: 'warning'
enabled: 'true'
no_data_policy: skip # the metric is only collected from the aggregator node
trigger:
rules:
- metric: openstack_nova_total_free_vcpus
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: max
- name: 'total-nova-free-memory-warning'
description: 'There is no memory available for new instances'
severity: 'warning'
enabled: 'true'
no_data_policy: skip # the metric is only collected from the aggregator node
trigger:
rules:
- metric: openstack_nova_total_free_ram
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: max
- name: 'nova-aggregates-free-memory-warning'
description: "The nova aggregates free memory percent is low"
severity: 'warning'
enabled: 'true'
no_data_policy: skip # the metric is only collected from the aggregator node
trigger:
rules:
- metric: openstack_nova_aggregate_free_ram_percent
group_by: [aggregate]
relational_operator: '<'
threshold: 10.0
window: 60
periods: 0
function: min
- name: 'nova-aggregates-free-memory-critical'
description: "The nova aggregates free memory percent is too low"
severity: 'critical'
enabled: 'true'
no_data_policy: skip # the metric is only collected from the aggregator node
trigger:
rules:
- metric: openstack_nova_aggregate_free_ram_percent
group_by: [aggregate]
relational_operator: '<'
threshold: 1.0
window: 60
periods: 0
function: min
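# Illustrative sketch only, not part of the shipped defaults: a custom alarm
# can be declared by copying the structure used above. The alarm name and the
# metric 'my_custom_metric' below are hypothetical placeholders.
# - name: 'my-custom-alarm'
#   description: 'Example of a user-defined alarm'
#   severity: 'warning'
#   enabled: 'true'
#   trigger:
#     logical_operator: 'and'
#     rules:
#       - metric: my_custom_metric
#         relational_operator: '>'
#         threshold: 90
#         window: 120
#         periods: 0
#         function: avg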
# Alarms based on the local checks of the OpenStack service endpoints
- name: 'cinder-api-local-endpoint'
description: 'Cinder API is locally down'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: openstack_check_local_api
fields:
service: 'cinder-api'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'glance-api-local-endpoint'
description: 'Glance API is locally down'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: openstack_check_local_api
fields:
service: 'glance-api'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'heat-api-local-endpoint'
description: 'Heat API is locally down'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: openstack_check_local_api
fields:
service: 'heat-api'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'heat-cfn-api-local-endpoint'
description: 'Heat CFN API is locally down'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: openstack_check_local_api
fields:
service: 'heat-cfn-api'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'keystone-public-api-local-endpoint'
description: 'Keystone public API is locally down'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: openstack_check_local_api
fields:
service: 'keystone-public-api'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'neutron-api-local-endpoint'
description: 'Neutron API is locally down'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: openstack_check_local_api
fields:
service: 'neutron-api'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'nova-api-local-endpoint'
description: 'Nova API is locally down'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: openstack_check_local_api
fields:
service: 'nova-api'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'swift-api-local-endpoint'
description: 'Swift API is locally down'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: openstack_check_local_api
fields:
service: 'swift-api'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
# The following alarms check the API endpoints of the OpenStack services
# as well as the InfluxDB API
- name: 'influxdb-api-check-failed'
description: 'Endpoint check for InfluxDB has failed'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the controller running the management VIP
enabled: 'true'
trigger:
rules:
- metric: http_check
fields:
service: 'influxdb-cluster'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'nova-api-check-failed'
description: 'Endpoint check for nova-api has failed'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the controller running the management VIP
enabled: 'true'
trigger:
rules:
- metric: openstack_check_api
fields:
service: 'nova-api'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'neutron-api-check-failed'
description: 'Endpoint check for neutron-api has failed'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the controller running the management VIP
enabled: 'true'
trigger:
rules:
- metric: openstack_check_api
fields:
service: 'neutron-api'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'cinder-api-check-failed'
description: 'Endpoint check for cinder-api has failed'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the controller running the management VIP
enabled: 'true'
trigger:
rules:
- metric: openstack_check_api
fields:
service: 'cinder-api'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'cinder-v2-api-check-failed'
description: 'Endpoint check for cinder-v2-api has failed'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the controller running the management VIP
enabled: 'true'
trigger:
rules:
- metric: openstack_check_api
fields:
service: 'cinder-v2-api'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'glance-api-check-failed'
description: 'Endpoint check for glance-api has failed'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the controller running the management VIP
enabled: 'true'
trigger:
rules:
- metric: openstack_check_api
fields:
service: 'glance-api'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'heat-api-check-failed'
description: 'Endpoint check for heat-api has failed'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the controller running the management VIP
enabled: 'true'
trigger:
rules:
- metric: openstack_check_api
fields:
service: 'heat-api'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'heat-cfn-api-check-failed'
description: 'Endpoint check for heat-cfn-api has failed'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the controller running the management VIP
enabled: 'true'
trigger:
rules:
- metric: openstack_check_api
fields:
service: 'heat-cfn-api'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'swift-api-check-failed'
description: 'Endpoint check for swift-api has failed'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the controller running the management VIP
enabled: 'true'
trigger:
rules:
- metric: openstack_check_api
fields:
service: 'swift-api'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'swift-s3-api-check-failed'
description: 'Endpoint check for swift-s3-api has failed'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the controller running the management VIP
enabled: 'true'
trigger:
rules:
- metric: openstack_check_api
fields:
service: 'swift-s3-api'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'keystone-public-api-check-failed'
description: 'Endpoint check for keystone-public-api has failed'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the controller running the management VIP
enabled: 'true'
trigger:
rules:
- metric: openstack_check_api
fields:
service: 'keystone-public-api'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'ceilometer-api-check-failed'
description: 'Endpoint check for ceilometer-api has failed'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the controller running the management VIP
enabled: 'true'
trigger:
rules:
- metric: openstack_check_api
fields:
service: 'ceilometer-api'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
# The following AFDs are generated to check the API backends
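# Note: these alarms come in three tiers per backend: '*-all-down' (severity
# 'down') fires when haproxy_backend_servers reports no server in the 'up'
# state, '*-one-down' (severity 'warning') when at least one server is in the
# 'down' state, and '*-majority-down' (severity 'critical') when
# haproxy_backend_servers_percent for the 'up' state drops to 50% or below.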
# All backends are down
- name: 'elasticsearch-api-backends-all-down'
description: 'All Elasticsearch backends are down'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'elasticsearch-rest'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'kibana-api-backends-all-down'
description: 'All API backends are down for Kibana'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'kibana'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'influxdb-api-backends-all-down'
description: 'All API backends are down for InfluxDB'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'influxdb'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'grafana-api-backends-all-down'
description: 'All API backends are down for Grafana'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'grafana'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'glance-registry-api-backends-all-down'
description: 'All API backends are down for glance-registry-api'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'glance-registry-api'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'nova-api-backends-all-down'
description: 'All API backends are down for nova-api'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'nova-api'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'cinder-api-backends-all-down'
description: 'All API backends are down for cinder-api'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'cinder-api'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'object-storage-api-backends-all-down'
description: 'All API backends are down for object-storage'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'object-storage'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'heat-cfn-api-backends-all-down'
description: 'All API backends are down for heat-cfn-api'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'heat-cfn-api'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'horizon-web-api-backends-all-down'
description: 'All API backends are down for horizon-web'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'horizon-web || horizon-https'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'nova-novncproxy-websocket-api-backends-all-down'
description: 'All API backends are down for nova-novncproxy-websocket'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'nova-novncproxy-websocket'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'heat-api-backends-all-down'
description: 'All API backends are down for heat-api'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'heat-api'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'keystone-public-api-backends-all-down'
description: 'All API backends are down for keystone-public-api'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'keystone-public-api'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'heat-cloudwatch-api-backends-all-down'
description: 'All API backends are down for heat-cloudwatch-api'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'heat-cloudwatch-api'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'nova-metadata-api-backends-all-down'
description: 'All API backends are down for nova-metadata-api'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'nova-metadata-api'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'mysqld-tcp-api-backends-all-down'
description: 'All API backends are down for mysqld-tcp'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'mysqld-tcp'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'keystone-admin-api-backends-all-down'
description: 'All API backends are down for keystone-admin-api'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'keystone-admin-api'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'glance-api-backends-all-down'
description: 'All API backends are down for glance-api'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'glance-api'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'neutron-api-backends-all-down'
description: 'All API backends are down for neutron-api'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'neutron-api'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'swift-api-backends-all-down'
description: 'All API backends are down for swift-api'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'swift-api || object-storage'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'ceilometer-api-backends-all-down'
description: 'All API backends are down for ceilometer-api'
severity: 'down'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'ceilometer-api'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
# At least one backend is down
- name: 'elasticsearch-api-backends-one-down'
description: 'At least one API backend is down for elasticsearch'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'elasticsearch-rest'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'kibana-api-backends-one-down'
description: 'At least one API backend is down for kibana'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'kibana'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'influxdb-api-backends-one-down'
description: 'At least one API backend is down for influxdb'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'influxdb'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'grafana-api-backends-one-down'
description: 'At least one API backend is down for grafana'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'grafana'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'glance-registry-api-backends-one-down'
description: 'At least one API backend is down for glance-registry-api'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'glance-registry-api'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'nova-api-backends-one-down'
description: 'At least one API backend is down for nova-api'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'nova-api'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'cinder-api-backends-one-down'
description: 'At least one API backend is down for cinder-api'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'cinder-api'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'object-storage-api-backends-one-down'
description: 'At least one API backend is down for object-storage'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'object-storage'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'heat-cfn-api-backends-one-down'
description: 'At least one API backend is down for heat-cfn-api'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'heat-cfn-api'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'horizon-web-api-backends-one-down'
description: 'At least one API backend is down for horizon-web'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'horizon-web || horizon-https'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'nova-novncproxy-websocket-api-backends-one-down'
description: 'At least one API backend is down for nova-novncproxy-websocket'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'nova-novncproxy-websocket'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'heat-api-backends-one-down'
description: 'At least one API backend is down for heat-api'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'heat-api'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'keystone-public-api-backends-one-down'
description: 'At least one API backend is down for keystone-public-api'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'keystone-public-api'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'heat-cloudwatch-api-backends-one-down'
description: 'At least one API backend is down for heat-cloudwatch-api'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'heat-cloudwatch-api'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'nova-metadata-api-backends-one-down'
description: 'At least one API backend is down for nova-metadata-api'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'nova-metadata-api'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'mysqld-tcp-api-backends-one-down'
description: 'At least one API backend is down for mysqld-tcp'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'mysqld-tcp'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'keystone-admin-api-backends-one-down'
description: 'At least one API backend is down for keystone-admin-api'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'keystone-admin-api'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'glance-api-backends-one-down'
description: 'At least one API backend is down for glance-api'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'glance-api'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'neutron-api-backends-one-down'
description: 'At least one API backend is down for neutron-api'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'neutron-api'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'swift-api-backends-one-down'
description: 'At least one API backend is down for swift-api'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'swift-api || object-storage'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'ceilometer-api-backends-one-down'
description: 'At least one API backend is down for ceilometer-api'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers
fields:
backend: 'ceilometer-api'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
# Less than 50% of backends are up
- name: 'elasticsearch-api-backends-majority-down'
description: 'Less than 50% of backends are up for elasticsearch'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers_percent
fields:
backend: 'elasticsearch-rest'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'kibana-api-backends-majority-down'
description: 'Less than 50% of backends are up for kibana'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers_percent
fields:
backend: 'kibana'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'influxdb-api-backends-majority-down'
description: 'Less than 50% of backends are up for influxdb'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers_percent
fields:
backend: 'influxdb'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'grafana-api-backends-majority-down'
description: 'Less than 50% of backends are up for grafana'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers_percent
fields:
backend: 'grafana'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'glance-registry-api-backends-majority-down'
description: 'Less than 50% of backends are up for glance-registry-api'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers_percent
fields:
backend: 'glance-registry-api'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'nova-api-backends-majority-down'
description: 'Less than 50% of backends are up for nova-api'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers_percent
fields:
backend: 'nova-api'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'cinder-api-backends-majority-down'
description: 'Less than 50% of backends are up for cinder-api'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers_percent
fields:
backend: 'cinder-api'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'object-storage-api-backends-majority-down'
description: 'Less than 50% of backends are up for object-storage'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers_percent
fields:
backend: 'object-storage'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'heat-cfn-api-backends-majority-down'
description: 'Less than 50% of backends are up for heat-cfn-api'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers_percent
fields:
backend: 'heat-cfn-api'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'horizon-web-api-backends-majority-down'
description: 'Less than 50% of backends are up for horizon-web'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers_percent
fields:
backend: 'horizon-web || horizon-https'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'nova-novncproxy-websocket-api-backends-majority-down'
description: 'Less than 50% of backends are up for nova-novncproxy-websocket'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers_percent
fields:
backend: 'nova-novncproxy-websocket'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'heat-api-backends-majority-down'
description: 'Less than 50% of backends are up for heat-api'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers_percent
fields:
backend: 'heat-api'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'keystone-public-api-backends-majority-down'
description: 'Less than 50% of backends are up for keystone-public-api'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers_percent
fields:
backend: 'keystone-public-api'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'heat-cloudwatch-api-backends-majority-down'
description: 'Less than 50% of backends are up for heat-cloudwatch-api'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers_percent
fields:
backend: 'heat-cloudwatch-api'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'nova-metadata-api-backends-majority-down'
description: 'Less than 50% of backends are up for nova-metadata-api'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers_percent
fields:
backend: 'nova-metadata-api'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'mysqld-tcp-api-backends-majority-down'
description: 'Less than 50% of backends are up for mysqld-tcp'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers_percent
fields:
backend: 'mysqld-tcp'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'keystone-admin-api-backends-majority-down'
description: 'Less than 50% of backends are up for keystone-admin-api'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers_percent
fields:
backend: 'keystone-admin-api'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'glance-api-backends-majority-down'
description: 'Less than 50% of backends are up for glance-api'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers_percent
fields:
backend: 'glance-api'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'neutron-api-backends-majority-down'
description: 'Less than 50% of backends are up for neutron-api'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers_percent
fields:
backend: 'neutron-api'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'swift-api-backends-majority-down'
description: 'Less than 50% of backends are up for swift-api'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers_percent
fields:
backend: 'swift-api || object-storage'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'ceilometer-api-backends-majority-down'
description: 'Less than 50% of backends are up for ceilometer-api'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: haproxy_backend_servers_percent
fields:
backend: 'ceilometer-api'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
# The following AFDs are generated to check the workers
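# Note: the worker and agent alarms below follow the same three-tier pattern
# (all down, at least one down, 50% or less up), but are based on the
# openstack_nova_services, openstack_cinder_services and
# openstack_neutron_agents metrics and their *_percent counterparts instead
# of the HAProxy backend metrics.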
# All workers are down
- name: 'nova-scheduler-all-down'
description: 'All Nova schedulers are down'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
rules:
- metric: openstack_nova_services
fields:
service: 'scheduler'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'nova-cert-all-down'
description: 'All Nova certs are down'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
rules:
- metric: openstack_nova_services
fields:
service: 'cert'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'nova-consoleauth-all-down'
description: 'All Nova consoleauths are down'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
rules:
- metric: openstack_nova_services
fields:
service: 'consoleauth'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'nova-compute-all-down'
description: 'All Nova computes are down'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
rules:
- metric: openstack_nova_services
fields:
service: 'compute'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'nova-conductor-all-down'
description: 'All Nova conductors are down'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
rules:
- metric: openstack_nova_services
fields:
service: 'conductor'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'cinder-scheduler-all-down'
description: 'All Cinder schedulers are down'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
rules:
- metric: openstack_cinder_services
fields:
service: 'scheduler'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'cinder-volume-all-down'
description: 'All Cinder volumes are down'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
rules:
- metric: openstack_cinder_services
fields:
service: 'volume'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'neutron-l3-all-down'
description: 'All Neutron L3 agents are down'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
rules:
- metric: openstack_neutron_agents
fields:
service: 'l3'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'neutron-dhcp-all-down'
description: 'All Neutron DHCP agents are down'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
rules:
- metric: openstack_neutron_agents
fields:
service: 'dhcp'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'neutron-metadata-all-down'
description: 'All Neutron metadata agents are down'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
rules:
- metric: openstack_neutron_agents
fields:
service: 'metadata'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
- name: 'neutron-openvswitch-all-down'
description: 'All Neutron openvswitch agents are down'
severity: 'down'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
rules:
- metric: openstack_neutron_agents
fields:
service: 'openvswitch'
state: 'up'
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: last
# At least one worker is down
- name: 'nova-scheduler-one-down'
description: 'At least one Nova scheduler is down'
severity: 'warning'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
rules:
- metric: openstack_nova_services
fields:
service: 'scheduler'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'nova-cert-one-down'
description: 'At least one Nova cert is down'
severity: 'warning'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
rules:
- metric: openstack_nova_services
fields:
service: 'cert'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'nova-consoleauth-one-down'
description: 'At least one Nova consoleauth is down'
severity: 'warning'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
rules:
- metric: openstack_nova_services
fields:
service: 'consoleauth'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'nova-compute-one-down'
description: 'At least one Nova compute is down'
severity: 'warning'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
rules:
- metric: openstack_nova_services
fields:
service: 'compute'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'nova-conductor-one-down'
description: 'At least one Nova conductor is down'
severity: 'warning'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
rules:
- metric: openstack_nova_services
fields:
service: 'conductor'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'cinder-scheduler-one-down'
description: 'At least one Cinder scheduler is down'
severity: 'warning'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
rules:
- metric: openstack_cinder_services
fields:
service: 'scheduler'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'cinder-volume-one-down'
description: 'At least one Cinder volume is down'
severity: 'warning'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
rules:
- metric: openstack_cinder_services
fields:
service: 'volume'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'neutron-l3-one-down'
description: 'At least one L3 agent is down'
severity: 'warning'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
rules:
- metric: openstack_neutron_agents
fields:
service: 'l3'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'neutron-dhcp-one-down'
description: 'At least one DHCP agent is down'
severity: 'warning'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
rules:
- metric: openstack_neutron_agents
fields:
service: 'dhcp'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'neutron-metadata-one-down'
description: 'At least one metadata agent is down'
severity: 'warning'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
rules:
- metric: openstack_neutron_agents
fields:
service: 'metadata'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
- name: 'neutron-openvswitch-one-down'
description: 'At least one openvswitch agent is down'
severity: 'warning'
no_data_policy: 'skip' # the metric is only collected from the DC node
enabled: 'true'
trigger:
rules:
- metric: openstack_neutron_agents
fields:
service: 'openvswitch'
state: 'down'
relational_operator: '>'
threshold: 0
window: 60
periods: 0
function: last
# Less than 50% of the services are up (relative to the total of up and down)
- name: 'nova-scheduler-majority-down'
description: 'Less than 50% of Nova schedulers are up'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: openstack_nova_services_percent
fields:
service: 'scheduler'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'nova-cert-majority-down'
description: 'Less than 50% of Nova certs are up'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: openstack_nova_services_percent
fields:
service: 'cert'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'nova-consoleauth-majority-down'
description: 'Less than 50% of Nova consoleauths are up'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: openstack_nova_services_percent
fields:
service: 'consoleauth'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'nova-compute-majority-down'
description: 'Less than 50% of Nova computes are up'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: openstack_nova_services_percent
fields:
service: 'compute'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'nova-conductor-majority-down'
description: 'Less than 50% of Nova conductors are up'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: openstack_nova_services_percent
fields:
service: 'conductor'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'cinder-scheduler-majority-down'
description: 'Less than 50% of Cinder schedulers are up'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: openstack_cinder_services_percent
fields:
service: 'scheduler'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'cinder-volume-majority-down'
description: 'Less than 50% of Cinder volumes are up'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: openstack_cinder_services_percent
fields:
service: 'volume'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'neutron-l3-majority-down'
description: 'Less than 50% of Neutron L3 agents are up'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: openstack_neutron_agents_percent
fields:
service: 'l3'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'neutron-dhcp-majority-down'
description: 'Less than 50% of Neutron DHCP agents are up'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: openstack_neutron_agents_percent
fields:
service: 'dhcp'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'neutron-metadata-majority-down'
description: 'Less than 50% of Neutron metadata agents are up'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: openstack_neutron_agents_percent
fields:
service: 'metadata'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last
- name: 'neutron-openvswitch-majority-down'
description: 'Less than 50% of Neutron openvswitch agents are up'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: openstack_neutron_agents_percent
fields:
service: 'openvswitch'
state: 'up'
relational_operator: '<='
threshold: 50
window: 60
periods: 0
function: last