ZMON Docs¶
Introduction¶
ZMON is a flexible and extensible open-source platform monitoring tool developed at Zalando that has been in production use since early 2014. It offers proven scaling with its distributed nature and fast storage with KairosDB on top of Cassandra. ZMON splits checking (data acquisition) from the alerting responsibilities and uses abstract entities to describe what’s being monitored. Its checks and alerts rely on Python expressions, giving the user a lot of power and connectivity. Besides the UI, it provides RESTful APIs to manage and configure most properties automatically.
Anyone can use ZMON, but it offers particular advantages for technical organizations with many autonomous teams. Its front end (see Demo / Bootstrap / Kubernetes / Vagrant) comes with Grafana3 “built-in,” enabling teams to create and manage their own data-driven dashboards alongside ZMON’s own team/personal dashboards for alerts and custom widgets. Being able to inherit and clone alerts makes it easier for teams to reuse and share code. Alerts can trigger HipChat, Slack, and E-Mail notifications. iOS and Android clients are works in progress, but push notifications are already implemented.
ZMON also enables painless integration with CMDBs and deployment tools. It also supports service discovery via custom adapters or its built-in entity service’s REST API. For an example, see zmon-aws-agent to learn how we connect AWS service discovery with our monitoring in the cloud.
Feel free to contact us via slack.zmon.io.
ZMON Components¶
A minimum ZMON setup requires these four components:
- zmon-controller: UI/Grafana/Oauth2 Login/Github Login
- zmon-scheduler: Scheduling check/alert evaluation
- zmon-worker: Doing the heavy lifting
- zmon-eventlog-service: History for state changes and modifications
Plus the storage covered in the Requirements section.
The following components are optional:
- zmon-cli: A command line client for managing entities/checks/alerts if needed
- zmon-aws-agent: Works with the AWS API to retrieve “known” applications
- zmon-data-service: API for multi DC federation: receiver for remote workers primarily
- zmon-metric-cache: Small scale special purpose metric store for API metrics in ZMON’s cloud UI
- zmon-notification-service: Provides mobile API and push notification support for GCM to Android/iOS app
- zmon-android: An Android client for ZMON monitoring
- zmon-ios: An iOS client for ZMON monitoring
ZMON Origins¶
ZMON was born in late 2013 during Zalando’s annual Hack Week, when a group of Zalando engineers aimed to develop a replacement for ICINGA. Scalability, manageability and flexibility were all critical, as Zalando’s small teams needed to be able to monitor their services independent of each other. In early 2014, Zalando teams began migrating all checks to ZMON, which continues to serve Zalando Tech.
Entities¶
ZMON uses entities to describe your infrastructure or platform, and to bind check variables to fixed values.
{
"type":"host",
"id":"cassandra01",
"host":"cassandra01",
"role":"cassandra-host",
"ip":"192.168.1.17",
"dc":"data-center-1"
}
Or more abstract objects:
{
"type":"postgresql-cluster",
"id":"article-cluster",
"name":"article-cluster",
"shards": {
"shard1":"articledb01:5432/shard1",
"shard2":"articledb02:5432/shard2"
}
}
Entity properties are not defined in any schema, so you can add properties as you see fit. This enables finer-grained filtering or selection of entities later on. As an example, host entities can include a physical model to later select the proper hardware checks.
Below you see an example of the entity view with alerts per entity.

Checks¶
A check describes how data is acquired. Its key properties are: a command to execute and an entity filter. The filter selects a subset of entities by requiring an overlap on specified properties. An example:
{
"type":"postgresql-cluster", "name":"article-cluster"
}
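The overlap rule can be sketched in a few lines of Python. This is only an illustration, assuming a filter matches when every property it names has the same value on the entity; the function name `matches_filter` is hypothetical and not part of ZMON.

```python
def matches_filter(entity, entity_filter):
    """Return True if every property in the filter has the same value on the entity."""
    return all(entity.get(key) == expected for key, expected in entity_filter.items())


entities = [
    {"type": "host", "id": "cassandra01", "role": "cassandra-host"},
    {"type": "postgresql-cluster", "id": "article-cluster", "name": "article-cluster"},
]
entity_filter = {"type": "postgresql-cluster", "name": "article-cluster"}

# Only the PostgreSQL cluster entity overlaps with the filter on both properties.
selected = [e["id"] for e in entities if matches_filter(e, entity_filter)]
print(selected)
```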
The check command itself is an executable Python expression. ZMON provides many custom wrappers that bind to the selected entity. The following example uses a PostgreSQL wrapper to execute a query on every shard defined above:
# sql() in this context is aware of the "shards" property
sql().execute('SELECT count(1) FROM articles "total"').result()
A check command always returns a value to the alert. This can be of any Python type.
Not familiar with Python’s functional expressions? No worries: ZMON allows you to define a top-level function and define your command in an easier, less functional way:
def check():
    # sql() binds to the entity used and thus knows the connection URLs
    return sql().execute('SELECT count(1) FROM articles "total"').result()
Alerts¶
A basic alert consists of an alert condition, an entity filter, and a team. An alert has only two states: up or down. An alert is up if it yields anything but False; this also includes exceptions thrown during evaluation of the check or alert, e.g. in the event of connection problems. ZMON does not support levels of criticality, or something like “unknown”, but you have a color option to customize sort and style on your dashboard (red, orange, yellow).
Let’s revisit the PostgreSQL check above. The alert below would pop up either if no articles are found or if we get an exception connecting to the PostgreSQL database.
team: database
entities:
  - type: postgresql-cluster
alert_condition: |
  value <= 0
Alerts raised by exceptions are marked in the dashboard with a “!”.
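The two-state rule can be illustrated with a small Python sketch. It assumes, as described above, that any result other than False counts as up and that an exception during evaluation also counts as up; `evaluate_alert` and `no_articles` are hypothetical names used only for this example.

```python
def evaluate_alert(condition, value):
    """Return (is_up, raised_exception) for an alert condition applied to a check value."""
    try:
        result = condition(value)
    except Exception:
        # Up because of an exception; this is the case marked with "!" on the dashboard.
        return True, True
    return result is not False, False


def no_articles(value):
    return value <= 0


print(evaluate_alert(no_articles, 0))     # no articles found
print(evaluate_alert(no_articles, 42))    # articles present
print(evaluate_alert(no_articles, None))  # comparison fails, e.g. after a connection error
```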
Via ZMON’s UI, alerts support parameters to the alert condition. This makes it easy for teams/users to implement different thresholds, and — with the priority field defining the dashboard color — render their dashboards to reflect their priorities.
Dashboards¶
Dashboards include a widget area where you can render important data with charts, gauges, or plain text. Another section features rendering of all active alerts for the team filter, defined at the dashboard level. Using the team filter, select the alerts you want your dashboard to include. Specify multiple teams, if necessary. TAGs are supported to subselect topics.

REST API and CLI¶
To make your life easier, ZMON’s REST API manages all the essential moving parts to support your daily work — creating and updating entities to allow for sync-up with your existing infrastructure. When you create and modify checks and alerts, the scheduler will quickly pick up these changes so you won’t have to restart or deploy anything.
And ZMON’s command line client - a slim wrapper around the REST API - also adds usability by making it simpler to work with YAML files or push collections of entities.
Development Status¶
The team behind ZMON continues to improve performance and functionality. Please let us know via GitHub’s issues tracker if you find any bugs or issues.
Getting Started¶
To quickly get started with ZMON, use the preconfigured Vagrant box featured on the main ZMON repository. Make sure you’ve installed Vagrant (at least 1.7.4) and a Vagrant provider like VirtualBox on your machine. Clone the repository with Git:
$ git clone https://github.com/zalando/zmon.git
$ cd zmon/
From within the cloned repository, run:
$ vagrant up
Bootstrapping the image for the first time will take a bit of time. You might want to grab some coffee while you wait. :)
When it’s finally up, Vagrant will report on how to reach the ZMON web interface:
==> default: ZMON installation is done!
==> default: Goto: https://localhost:8443
==> default: Login with your GitHub credentials
Creating Your First Alert¶
Log In¶
Open your web browser and navigate to the URL reported by Vagrant: e.g. https://localhost:8443/. Click on Sign In. This will redirect you to Github where you sign in and authorize the ZMON app. Then it takes you back and you are logged in.
Note
For your own deployment create your own app in Github with your redirect URL. In ZMON you can then limit users allowed access to your Github organization.
Checks and Alerts¶
An alert shown on ZMON’s dashboard typically consists of two parts: the check-definition, which is responsible for fetching the underlying data; and the alert-definition, which defines the condition under which the alert will trigger. Multiple alerts with different alert conditions can operate on the same check, fetching data only once.
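As a rough sketch of this split, the following Python snippet runs a stand-in check once and evaluates several alert conditions against the same value. The helper `run_check_with_alerts` and the fake check command are assumptions made for illustration; they do not mirror ZMON's internal code.

```python
def run_check_with_alerts(check_command, alert_conditions):
    """Run the check once and evaluate every alert condition against the same value."""
    value = check_command()  # data is fetched a single time
    return {name: condition(value) for name, condition in alert_conditions.items()}


calls = []


def fake_http_status():
    # Stand-in for a real check command; records each invocation.
    calls.append(1)
    return 503


states = run_check_with_alerts(
    fake_http_status,
    {
        "not-ok": lambda value: value != 200,
        "server-error": lambda value: value >= 500,
        "not-found": lambda value: value == 404,
    },
)
print(states, "check executed", len(calls), "time(s)")
```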
Let’s explore this concept now by creating a simple check and defining some alerts on it.
Create a new Check¶
One way to create a new check from scratch is via the CLI (see Using the CLI). A more convenient way, however, is to use the “Trial Run” feature. It enables you to develop checks and alerts, execute them immediately, and inspect the result. Once you are happy with your check command and filter, you can save it from the Trial Run directly. Some users prefer to download the YAML definition from there to store and maintain it in Git.
Create an Alert¶
In the top navigation of ZMON’s web interface, select Check defs from the list and click on Website HTTP status. Then click “Add New Alert Definition” to create a new alert for this particular check. Fill out the form (see example values below), and hit “Save”:
Name | Oops … website is gone! |
Description | Website was not reachable. |
Priority | Priority 1 (red) |
Alert Condition | value != 200 |
Team | Team 1 |
Responsible Team | Team 1 |
Status | ACTIVE |
After you hit save, it will take a few seconds until it is picked up and executed.
View Dashboard¶
If the alert’s condition evaluates to anything but False, the alert will appear on the dashboard. This applies not only to True, but also to exceptions raised during evaluation, e.g. due to timeouts or a failure to connect.
Currently there’s only one dashboard, and it is configured to show all present alerts.
To view the dashboard, select Dashboards from the main menu and click on Example Dashboard.
To see the alert, you must simulate the error condition: modify the alert condition, or change the check definition so that it returns an error code. To do the latter, set the URL in the check command to http://httpstat.us/500. (The number in the URL is the HTTP status code you will get.)
To see the actual error code in the alert, you might want to create/modify it like this:
Name | Website gone with status {code} |
Description | Website was not reachable. |
Priority | Priority 1 (red) |
Alert Condition | capture(code=value)!=200 |
Team | Team 1 |
Responsible Team | Team 1 |
Status | ACTIVE |
Using the CLI¶
The ZMON Vagrant box comes preinstalled with zmon-cli. To use the CLI, log in to the running Vagrant box with:
$ vagrant ssh
The Vagrant box also contains some sample yaml files for creating entities, checks and alerts. You can find these in /vagrant/examples.
As an example of using ZMON’s CLI, let’s create a check to verify that google.com is reachable. cd to /vagrant/examples/check-definitions and, using zmon-cli, create a new check-definition:
$ cd /vagrant/examples/check-definitions
$ zmon check-definitions init website-availability.yaml
$ vim website-availability.yaml
Edit the newly created website-availability.yaml to contain the following code. (type i for insert-mode)
name: "Website HTTP status"
owning_team: "Team 1"
command: http("http://httpstat.us/200", timeout=5).code()
description: "Returns current http status code for Website"
interval: 60
entities:
- type: GLOBAL
status: ACTIVE
Type ESC :wq RETURN to save the file.
To push the updated check definition to ZMON, run:
$ zmon check-definitions update website-availability.yaml
Updating check definition... http://localhost:8080/#/check-definitions/view/2
Find more detailed information here: Command Line Client.
Entities¶
Entities describe what you want to monitor in your infrastructure. This can be as basic as a host, with its attributes hostname and IP; or something more complex, like a PostgreSQL sharded cluster with its identifier and set of connection strings.
ZMON gives you two options for automation in/integration with your platform: storing entities via zmon-controller’s entity service, or discovering them via the adapters in zmon-scheduler. At Zalando we use both, connecting ZMON to tools like our CMDB but also pushing entities via REST API.
ZMON’s entity service describes entities with a single JSON document.
- Any entity must contain an ID that is unique within your ZMON deployment. We often use a pattern like <hostname>(:<port>) to create uniqueness at the host and application levels, but this is up to you.
- Any entity must contain a type which describes the kind of entity, like an object class.
At check execution, we bind entity properties as default values to the functions executed, e.g. the IP gets used for relative http() requests.
Format¶
Generally, a ZMON entity is a set of properties that can be represented as a multi-level dictionary. For example:
{
"id":"arbitrary_entity_id",
"type":"some_type",
"oneMoreProperty":"foo",
"nestedProperty": {
"subProperty1": "foo",
"subProperty2": "bar"
}
}
Two notes to keep in mind:
- The id and type properties are mandatory.
- ZMON filtering (e.g. in the ZMON UI) does not support nested properties.
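A minimal Python sketch of these two rules follows. The helper names `validate_entity` and `top_level_filterable` are hypothetical; the sketch only assumes that id and type must be present and that nested dictionaries are not usable for filtering.

```python
def validate_entity(entity):
    """Raise ValueError unless the mandatory id and type properties are present."""
    missing = [key for key in ("id", "type") if not entity.get(key)]
    if missing:
        raise ValueError("entity is missing mandatory properties: " + ", ".join(missing))
    return entity


def top_level_filterable(entity):
    """Return only the properties that are not nested dictionaries."""
    return {key: val for key, val in entity.items() if not isinstance(val, dict)}


entity = validate_entity({
    "id": "arbitrary_entity_id",
    "type": "some_type",
    "oneMoreProperty": "foo",
    "nestedProperty": {"subProperty1": "foo", "subProperty2": "bar"},
})
print(sorted(top_level_filterable(entity)))
```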
Examples¶
In working with the Vagrant Box, you can use the scheduler instance entity like this:
{
"id":"localhost:3421",
"type":"instance",
"host":"localhost",
"project":"zmon-scheduler-ng",
"ports": {"3421":3421}
}
Here, you can use the “ports” dictionary to also describe additional open ports. As with Spring Boot, a second port is usually added, exposing management features.
Now let’s look at an example of the PostgreSQL instance:
{
"id":"localhost:5432",
"type":"database",
"name":"zmon-cluster",
"shards": {"zmon":"localhost:5432/local_zmon_db"}
}
Usage of the property “shards” is given by how ZMON’s worker exposes PostgreSQL clusters to the sql() function.
View more examples here.
If you’d like to create an entity yourself, see the ZMON CLI tool.
Check Definitions¶
Checks are ZMON’s way of gathering data from arbitrary entities, e.g. databases, microservices, hosts and more. Create them as described below using either the UI or the CLI.
Key properties¶
Command¶
The command is executed by the worker and is the data-gathering part of a check. It runs once per selected entity, and its result is made available to all attached alerts. Different wrappers are at hand, and the entity variable is also available for access.
Entity Filter¶
Select the entities you want the check to execute against. Often only a type filter is applied; sometimes the filter is more specific. The alert lets you do more fine-grained filtering, which makes checks easy to reuse.
Interval¶
Specify the interval in seconds at which you want the check to be executed.
Owning team¶
This is the team that originally created the check. Right now this has little effect.
Alert Definitions¶
Alert definitions specify when (condition, time period) and who (team) to notify for a desired monitoring event. Alert definitions can be defined in the ZMON web frontend and via the ZMON CLI.
The following fields exist for alert definitions:
- name
- The alert’s display name on the dashboard. This field can contain curly-brace variables like {mycapture} that are replaced by the capture’s value when the alert is triggered. It’s also possible to format decimal precision (e.g. “My alert {mycapture:.2f}” would show as “My alert 123.46” if mycapture is 123.456789). To include a comma-separated list of entities as part of the alert’s name, use the special placeholder {entities}.
- description
- Meaningful text for people trying to handle the alert, e.g. incident support.
- priority
- The alert’s dashboard priority. This defines color and sort order on the dashboard.
- condition
- Valid Python expression to return true when alert should be triggered.
- parameters
- You may apply parameters to your alert condition using variables. See Alert Definition Parameters for details.
- entities filter
- Additional filter to apply the alert definition only to a subset of entities.
- notifications
- List of notification commands, e.g. to send out emails.
- time_period
- Notification time period.
- team
- Team dashboard to show alert on.
- responsible_team
- Additional team field to allow delegating alert monitoring to other teams. The responsible team’s name will be shown on the dashboard.
- status
- Alerts will only be triggered if status is “ACTIVE”.
- template
- A template is an alert definition that is not evaluated and can only be used for extension. See Alert Definition Inheritance for details.
Condition¶
Simple expressions can start directly with an operator. To trigger an alert if the check result value is larger than zero:
> 0
You can use the value variable to create more complex conditions:
value >= 10 and value <= 100
Some more examples of valid conditions:
== 'OK'
!= False
value in ('banana', 'apple')
If the value already is a dictionary (hash map), we can apply all the Python magic to it:
['mykey'] > 100 # check a specific dict value
'error-message' in value # trigger alert if key is present
not empty([ k for k, v in value.items() if v > 100 ]) # trigger alert if some dict value is > 100
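How ZMON expands the operator shorthand internally is not described here. The following Python sketch shows one plausible reading, assuming that a condition starting with a comparison operator or an opening bracket is simply prefixed with the `value` variable. The names `expand_condition` and `is_triggered` are hypothetical.

```python
OPERATOR_PREFIXES = ("==", "!=", ">=", "<=", ">", "<", "[")


def expand_condition(condition):
    """Prefix shorthand conditions such as '> 0' with the 'value' variable."""
    stripped = condition.strip()
    if stripped.startswith(OPERATOR_PREFIXES):
        return "value " + stripped
    return stripped


def is_triggered(condition, value):
    # eval() is used only for illustration; never evaluate untrusted input this way.
    return eval(expand_condition(condition), {"__builtins__": {}}, {"value": value})


print(expand_condition("> 0"))
print(is_triggered("> 0", 5))
print(is_triggered("== 'OK'", "OK"))
print(is_triggered("['mykey'] > 100", {"mykey": 150}))
print(is_triggered("value >= 10 and value <= 100", 5))
```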
Captures¶
You can capture intermediate results in alert conditions by using the capture function. This allows easier debugging of complex alert conditions.
capture(value["a"]/value["b"]) > 0
capture(myval=value["a"]/value["b"]) > 0
any([capture(foo=FOO) > 10, capture(bar=BAR) > 10])
Please refer to Recipes section in Python Tutorial for some Python tricks you may use.
Named captures can be used to customize the alert display on the dashboard by using template substitution in the alert name.
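To make the interplay between captures and the alert name concrete, here is a small Python sketch. It is not ZMON's implementation: the `capture` function below is a simplified stand-in that stores values in a module-level dictionary, and the default key `capture` for unnamed captures is an assumption. The name substitution uses Python's standard `str.format`.

```python
captures = {}


def capture(value=None, **named):
    """Record a capture and return the captured value unchanged.

    A positional call stores the value under the hypothetical default name
    'capture'; a keyword call stores it under the given name.
    """
    if named:
        name, value = next(iter(named.items()))
        captures[name] = value
    else:
        captures["capture"] = value
    return value


value = {"a": 246.913578, "b": 2}

# The capture returns its value, so it can sit inside a larger expression.
triggered = capture(mycapture=value["a"] / value["b"]) > 0

# Named captures are then available for substitution in the alert name.
alert_name = "My alert {mycapture:.2f}".format(**captures)
print(triggered, alert_name)
```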
If you name your capture dashboard, its value is shown on the dashboard next to the entity name instead of the check value. For example, if you have a host-based alert that fails on z-host1 and z-host2, you would normally see something like this:
ALERT TITLE (N) z-host1 (value1), z-host2 (value2)
Once you introduce a capture called dashboard, you will get something like
ALERT TITLE (N) z-host1 (capturevalue1), z-host2 (capturevalue2)
where capturevalue1 is the value of the “dashboard” capture evaluated against z-host1.
Example alert condition (based on PF/System check for diskspace)
"ERROR" not in value
and
capture(dashboard=(lambda d: '{}:{}'.format(d.keys()[0], d[d.keys()[0]]['percentage_space_used']) if d else d)(dict((k, v) for k,v in value.iteritems() if v.get('percentage_space_used', 0) >= 90)))
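The one-line expression is dense, so here is a more readable equivalent written as a plain Python 3 function, purely for illustration. The shape of the check value (a dictionary keyed by mount point) is an assumption based on the expression itself, and `dashboard_capture` is a hypothetical name. Unlike the original, which relies on Python 2 dictionary ordering, this version picks the first matching entry in insertion order.

```python
def dashboard_capture(value, threshold=90):
    """Return 'key:percentage' for one entry at or above the threshold.

    Returns an empty dict when nothing reaches the threshold, mirroring the
    'if d else d' branch of the one-line expression.
    """
    full = {k: v for k, v in value.items() if v.get("percentage_space_used", 0) >= threshold}
    if not full:
        return full
    key = next(iter(full))
    return "{}:{}".format(key, full[key]["percentage_space_used"])


sample = {
    "/": {"percentage_space_used": 42},
    "/data": {"percentage_space_used": 95},
}
print(dashboard_capture(sample))
```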
Entity (Exclude) Filter¶
The check definition already defines on which entities the checks should run. Usually the check definition’s entities are broader than you want. A diskspace check might be defined for all hosts, but you want to trigger alerts only for hosts you are interested in. The alert definition’s entities field allows you to filter entities by their attributes.
See Entities for details on supported entities and their attributes.
Note: The entity name can be included in the alert message by using the special placeholder {entities} in the alert name.
Notifications¶
ZMON notifications let you know when you have a new alert without checking the web UI. This section explains how to use the different options available to notify about changes in alert states. We support E-Mail, HipChat, Slack and one SMS provider that we have been using.
The notifications field is a list of function calls (see below for examples), calling one of the following methods of notification:
- send_email(email*[, subject, message, repeat])
- send_sms(number*[, message, repeat])
- send_push([message, repeat, url, key])
- send_slack([channel, message, repeat, token])
- send_hipchat([room, message, color='red', repeat, token, message_format='html', notify=False])
If the alert has the top priority and should be handled immediately, you can specify the repeat interval for each notification. In this case, you will be notified periodically, according to the specified interval, while the alert persists. The interval is specified in seconds.
To receive push notifications, you need one of the ZMON mobile apps (configured for your deployment) and must subscribe to the alert IDs you are interested in.
In addition, you may use notification-groups to configure groups of people with associated emails and/or phone numbers and use these groups in notifications like this:
Example JSON email and SMS configuration using groups:
[
"send_sms('active:2nd-database')",
"send_email('group:2nd-database')"
]
In the above example, an SMS is sent to the active member of the 2nd-database group and an email is sent to all members of the group.
Example JSON email configuration:
[
"send_mail('a@example.org', 'b@example.org')",
"send_mail('a@example.com', 'b@example.com', subject='Critical Alert please do something!')",
"send_mail('c@example.com', repeat=60)"
]
Example JSON Slack configuration:
[
"send_slack()",
"send_slack(channel='#incidents')",
"send_slack(channel='#incidents', token='your-token')"
]
Example JSON HipChat configuration:
[
"send_hipchat()",
"send_hipchat(room='#incidents', color='red')",
"send_hipchat(room='#incidents', token='your-token')",
"send_hipchat(room='#incidents', token='your-token', notify=True)",
"send_hipchat(room='#incidents', token='your-token', notify=True, message='@here Plz check it', message_format='text')"
]
Example JSON Push configuration:
[ "send_push()" ]
Example JSON SMS configuration:
[
"send_sms('0049123555555', '0123111111')",
"send_sms('0049123555555', '0123111111', message='Critical Alert please do something!')",
"send_sms('0029123555556', repeat=300)"
]
Example email:
From: ZMON <zmon@example.com>
Date: 2014-05-28 18:37 GMT+01:00
Subject: NEW ALERT: Low Orders/m: 84.9% of last weeks on GLOBAL
To: Undisclosed Recipients <zmon@example.com>
New alert on GLOBAL: Low Orders/m: {percentage_wow:.1f}% of last weeks
Current value: {'2w_ago': 188.8, 'now': 180.8, '1w_ago': 186.6, '3w_ago': 196.4, '4w_ago': 208.8}
Captures:
percentage_wow: 184.9185496584
last_weeks_avg: 195.15
Alert Definition
Name (ID): Low Orders/m: {percentage_wow:.1f}% of last weeks (ID: 190)
Priority: 1
Check ID: 203
Condition capture(percentage_wow=100. * value['now']/capture(last_weeks_avg=(value['1w_ago'] + value['2w_ago'] + value['3w_ago'] + value['4w_ago'])/4. )) < 85
Team: Platform/Software
Resp. Team: Platform/Software
Notifications: [u"send_mail('example@example.com')"]
Entity
id: GLOBAL
type: GLOBAL
percentage_wow: 184.9185496584
last_weeks_avg: 195.15
Example SMS:
Message details:
Type: Text Message
From: zmon2
Message text:
NEW ALERT: DB instances test alert on all shards on customer-integration-master
Time periods¶
ZMON 2.0 allows specifying time periods (in UTC) in alert definitions. When a time period is specified, the user is notified about the alert only when it occurs during that period. The examples below cover the most common use cases.
To specify a time period from Monday through Friday, 9:00 to 17:00, use a period such as
wd {Mon-Fri} hr {9-16}
When specifying a range by using -, it is best to think of - as meaning through. It is 9:00 through 16:00, which is just before 17:00 (16:59:59).
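The "through" semantics can be checked with plain Python. The sketch below does not use the timeperiod module; it approximates the period `wd {Mon-Fri} hr {9-16}` with standard datetime arithmetic, and the function name `in_business_hours` is hypothetical.

```python
from datetime import datetime


def in_business_hours(moment):
    """Approximate 'wd {Mon-Fri} hr {9-16}' with plain datetime arithmetic."""
    is_weekday = moment.weekday() <= 4   # Monday is 0, Friday is 4
    in_hours = 9 <= moment.hour <= 16    # hour 16 covers 16:00:00 through 16:59:59
    return is_weekday and in_hours


print(in_business_hours(datetime(2018, 1, 8, 16, 59, 59)))  # Monday, last second of the period
print(in_business_hours(datetime(2018, 1, 8, 17, 0, 0)))    # Monday, just after the period
print(in_business_hours(datetime(2018, 1, 6, 10, 0, 0)))    # Saturday
```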
To specify a time period from Monday through Friday, 9:00 to 17:00 on Monday, Wednesday, and Friday, and 9:00 to 15:00 on Tuesday and Thursday, use a period such as
wd {Mon Wed Fri} hr {9-16}, wd{Tue Thu} hr {9-14}
To specify a time period that extends Mon-Fri 9-16, but alternates weeks in a month, use a period such as
wk {1 3 5} wd {Mon Wed Fri} hr {9-16}
A period that specifies winter in the northern hemisphere:
mo {Nov-Feb}
This is equivalent to the previous example:
mo {Jan-Feb Nov-Dec}
As is
mo {jan feb nov dec}
And this is too:
mo {Jan Feb}, mo {Nov Dec}
To specify a period that describes every other half-hour, use something like:
minute { 0-29 }
To specify the morning, use
hour { 0-11 }
Remember, 11 is not 11:00:00, but rather 11:00:00 - 11:59:59.
5 second blocks:
sec {0-4 10-14 20-24 30-34 40-44 50-54}
To specify every first half-hour on alternating week days, and the second half-hour the rest of the week, use the period
wd {1 3 5 7} min {0-29}, wd {2 4 6} min {30-59}
For more examples and a syntax reference, please refer to this documentation. Note that suffixes like am or pm for hours are not supported, only integers between 0 and 23. If in doubt, try calling in_period from Python with your period definition:
from timeperiod import in_period
in_period('hr { 0 - 23 }')
This should not throw an exception. The timeperiod module in use is timeperiod2. The in_period function accepts a second parameter which is a datetime like
from datetime import datetime
from timeperiod import in_period
in_period('hr { 7 - 23 }', datetime(2018, 1, 8, 2, 15)) # check 2018-01-08 02:15:00
Alert Definition Inheritance¶
Alert definition inheritance allows one to create an alert definition based on another alert whereby a child reuses attributes from the parent.
Each alert definition can only inherit from a single alert definition (single inheritance).
Template¶
A Template is basically an alert definition with a subset of attributes that is not evaluated and can only be used for extension.
To create a template:
- Select the check definition
- click Add New Alert Definition
- Set the attributes to reuse and activate the template checkbox
Extending¶
In general, one can inherit from any alert definition or template. Open the alert definition details and click inherit in the top right corner.
To override a field, just type in a new value. An icon should appear on the left side, meaning that the field will be overridden.
To roll back the change and keep the value defined on the parent, click the override icon.
Overriding¶
- By default the child alert retains all attributes of the parent alert with the exception of the following mandatory attributes:
- team
- responsible team
- status
These attributes are used for authorization (see permissions for details); therefore, they cannot be reused. If these attributes change on the parent alert definition, child alerts are not affected and you don’t lose access rights.
All the remaining attributes can be overridden, replacing the parent alert definition with its own values.
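The inheritance rules can be sketched as a dictionary merge. This is a simplified illustration, not ZMON's implementation; the attribute names follow the field list in this document, and `resolve_alert` is a hypothetical helper.

```python
NOT_INHERITED = ("team", "responsible_team", "status")


def resolve_alert(parent, child):
    """Merge a child alert definition with its parent.

    The child keeps its own team, responsible_team and status; every other
    attribute comes from the parent unless the child overrides it.
    """
    resolved = {k: v for k, v in parent.items() if k not in NOT_INHERITED}
    resolved.update(child)
    return resolved


parent = {
    "name": "Disk space low",
    "condition": "value > 90",
    "priority": 2,
    "team": "Platform",
    "responsible_team": "Platform",
    "status": "ACTIVE",
}
child = {"priority": 1, "team": "Database", "responsible_team": "Database", "status": "ACTIVE"}

resolved = resolve_alert(parent, child)
print(resolved)
```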
Alert Definition Parameters¶
Alert definition parameters allow one to decouple the alert condition from the constants used inside it.
Use Case: Technical alert condition¶
If your alert condition is highly technical with a lot of Python code in it, it often makes sense to split the actual calculation from threshold values and move such constant values into parameters.
The same may apply in certain cases to alert definitions created by technical staff, which later need to be adjusted by non-technical people - if you split calculation from variable definition, you may let non-technical people just change values without touching calculation logic.
Use Case: Same alert, different priorities¶
Another use case where we recommend using parameters is when you need the same alert to come up with a different priority depending on threshold values.
In such a case, refer to alert inheritance for configuring inherited alerts.
Proposed structure would look like:
- Base alert “A” with alert condition and parameters, check template box
- Alert “B1” inherits from “A” specifying priority RED and associated parameter values
- Alert “B2” inherits from “A” specifying priority YELLOW and associated parameter values
An example: Setting a simple parameter in trial run¶
In the zmon2 web interface click on the trial run button.
In the Check Command text box enter:
normalvariate(50, 20)
normalvariate(50, 20) draws a float from a normal distribution with mean 50 and standard deviation 20, so the result is above 50.0 about half of the time, which makes it handy for testing.
In the Alert Condition enter:
value>capture(threshold=threshold) + len(capture(params=params))
In the Parameters selector enter two values (by clicking the plus sign):
Name | Value | Type |
threshold | 50.0 | Float |
anything | Kartoffel | String |
In the Entity Filter text box enter:
[ { "type": "GLOBAL" } ]
In the Interval enter: 10
If you run this Trial you can get an Alert or an ‘OK’, but the interesting thing will be in the Captures column. See how the parameters that you entered are evaluated in the alert condition with the value that you provided. Notice also that there is a special parameter called params that holds a dict with all the parameters that you entered, this is done so the user can iterate over all the parameters and take conditional decisions, providing a kind of introspection capability, but this is only for advanced users.
Last but not least: most of the time you don’t need to capture the parameter values. We did it here so you can see that the parameters are evaluated. You can run exactly the same check with this Alert Condition:
value>threshold + len(params)
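The effect of parameters can be reproduced outside ZMON with a short Python sketch. It assumes that each parameter is bound as a variable of the same name and that `params` is a dictionary of all parameters, as described above. `evaluate_with_parameters` is a hypothetical helper, and `eval` is used only for illustration.

```python
def evaluate_with_parameters(condition, value, parameters):
    """Evaluate a condition with each parameter bound as a variable, plus 'params'."""
    names = dict(parameters)
    names["params"] = dict(parameters)   # all parameters, for introspection
    names["value"] = value
    names["len"] = len                   # expose only what the condition needs
    # eval() is for illustration only; do not use it on untrusted input.
    return eval(condition, {"__builtins__": {}}, names)


parameters = {"threshold": 50.0, "anything": "Kartoffel"}
condition = "value > threshold + len(params)"

print(evaluate_with_parameters(condition, 52.5, parameters))  # compares 52.5 with 50.0 + 2
print(evaluate_with_parameters(condition, 51.0, parameters))  # compares 51.0 with 50.0 + 2
```

With two parameters defined, the effective threshold in this example is 52.0, so changing a parameter value changes the alert's behavior without touching the condition.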
Downtimes¶
This functionality allows the user to acknowledge an existing alert or create a downtime schedule for an anticipated service interruption. When acknowledging an existing alert, the user has to provide the predicted duration; when creating a scheduled downtime, the user provides a start and end date. If the downtime is currently active, meaning an alert occurred within the downtime period, the alert notification won’t be shown in the dashboard and it’ll be greyed out in the alert details page. Please note that the downtime will not be evaluated immediately after creation, meaning that the alert might appear as active until it’s evaluated again by the worker. E.g. if the user defined a downtime for an alert which is evaluated every minute and the last evaluation was 5 seconds ago, it would take approximately one more minute for the alert to appear in “downtime state”.
To acknowledge an alert or to schedule a new downtime, the user has to go to the specific alert details page and click on a downtime button next to the desired alert.
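The difference between the two kinds of downtime can be sketched in Python. This is only an illustration under the assumption that an acknowledgement is a window starting now with the predicted duration, while a scheduled downtime has an explicit start and end; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def in_downtime(moment, start, end):
    """Return True if the moment lies within the downtime window (inclusive)."""
    return start <= moment <= end


def acknowledge(now, duration_minutes):
    """Acknowledging an alert: the window starts now and lasts the predicted duration."""
    return now, now + timedelta(minutes=duration_minutes)


now = datetime(2018, 1, 8, 12, 0)
start, end = acknowledge(now, 30)
print(in_downtime(datetime(2018, 1, 8, 12, 15), start, end))
print(in_downtime(datetime(2018, 1, 8, 12, 45), start, end))
```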
Alert Comments¶
Comments are useful in providing additional information to other members of your team (or other teams) about your alerts. Those with ADMIN and USER roles can add comments to an alert, but VIEWERs cannot. ADMINs can delete either their own or other people’s comments. USERs can delete only their own comments.
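The permission rules above can be written down as two small Python functions. The role names come from this section; the function names are hypothetical and the sketch says nothing about how ZMON enforces these rules internally.

```python
def can_add_comment(role):
    """ADMIN and USER roles may add comments; other roles may not."""
    return role in ("ADMIN", "USER")


def can_delete_comment(role, is_own_comment):
    """ADMINs may delete any comment; USERs only their own."""
    if role == "ADMIN":
        return True
    return role == "USER" and is_own_comment


for role in ("ADMIN", "USER", "VIEWER"):
    print(role, can_add_comment(role), can_delete_comment(role, is_own_comment=False))
```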
Adding Comments¶
Follow these steps:
- Open the alert definition where you want to add your comment.
- Either click on the top-right link Comments to add a general comment (for all entities), or click on the balloon on the left side of the entity name to add a comment on a specific entity.
- In the comments window, type your comment. Use as many lines as you need.
- Click the Post comment button and save your comment. Done!
Seeing Existing Comments¶
It’s easy: Just open the alert definition, then click on Comments (top-right link).
Deleting Comments¶
Deleting is also easy: Open the alert definition, click on the top-right link Comments, click on the cross above the comment, and delete.
Dashboards¶

ZMON’s customizable dashboards enable you to configure widgets and choose which alerts to show. Dashboards have the following fields:
- name
- The dashboard’s name. This is mainly used to identify the dashboard.
- default view
The dashboard default view. Here you can specify the default rendering behavior when you open the dashboard. There are two options available:
- Full: Provides detailed information about the alert. Useful when using big screens.
- Compact: Only displays the alert message. Useful for small screens.
Note: You can toggle the view in the dashboard by clicking on the top right button of the alert container.
- edit mode
Here you can specify who can modify your dashboard. There are three options available:
- Private: Only you (and the admin) can edit the dashboard
- Team: All members of your team(s) can edit the dashboard
- Public: Everyone can edit the dashboard
- widget configuration
The widget configuration defines the different widgets that the current dashboard has. An example of a valid widget configuration is the following:
[
  {
    "checkDefinitionId": 1,
    "entityId": "GLOBAL",
    "type": "gauge",
    "title": "Order Failure %",
    "options": { "max": 35 }
  },
  {
    "checkDefinitionId": 4,
    "entityId": "GLOBAL",
    "type": "gauge",
    "title": "Random",
    "options": { "max": 100 }
  },
  {
    "checkDefinitionId": 5,
    "entityId": "my_db_name-live",
    "type": "value",
    "title": "My database value"
  }
]
Supported widget types are:
- gauge
- chart
- value
- networkmap
- iframe
In order to edit a specific dashboard, go to the dashboard tab, and click the edit button. To set it as active, just click on its name.
To create or edit a dashboard, you must be logged in. Unless you have the admin role, you will only be able to edit the dashboards you created.
Widgets automatically spread out across the whole width; for example, if you define two widgets, each takes about 50% of the screen width.
- alert teams
Here you can specify a list of patterns to filter alerts by team or responsible team you want to display (wildcards using * are allowed)
Example: All incident alerts (including sub-teams)
[ "Incident*" ]
value, gauge, chart, trend¶
The value widget will show the check value with a big font. The gauge will show a gauge from “min” to “max”. The chart will show the history of check values. The trend will show a trend arrow (going up or down).
These widgets expect “checkDefinitionId”, “entityId” and “title” properties:
- “checkDefinitionId” - the ID of the check definition; the data shown in the widget is based on this check’s results.
- “entityId” - if your check is based on GLOBAL, leave “GLOBAL”; otherwise specify the name of the entity (as it appears in the alert details) to take the data from, because the check returns one result per entity.
- “title” - text displayed in the top part of the widget.
For chart widgets, instead of using “checkDefinitionId” + “entityId”, you can also define the data to be shown using a KairosDB query.
Widgets share the full screen width unless you set the “width” property, which ranges from 12 (full width, calculated in “columns”, see Bootstrap) down to 2 (smallest meaningful) or even 1.
Configuration options can be defined inside an “options” property. Each widget accepts a different set of options.
Value widgets accept “fontSize”, “color” and “format” properties. Additionally you can set a specific JSON value of the check result to be displayed by using the “jsonPath” property, in case the result is a JSON object instead of a string / number.
A font size can be specified with the “fontSize” property, with numbers (in pixels) for the desired size.
A color for the font can be specified with the “color” property.
A format string can also be specified for Python-like string interpolation and floating-point rounding by defining a “format” property in the options object. The syntax of the format string is mostly the same as in Python.
Example of value widget options, together with “jsonPath” to select which value from the check result is displayed:
"options": {
"fontSize": 120, # set font size to 120px,
"color": "red", # set color to red (also accepts #FF0000).
"format": ".3f" # show value with 3 places of floating point precision
},
"jsonPath": ".cpu.load1"
Check the documentation of JSONPath for more info on how to use the jsonPath property. Please note that you don’t need to use the $ symbol, as it’s prepended automatically on parsing.
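As a rough illustration of what “jsonPath” and “format” do together, the following Python sketch resolves a simple dot path against a check result and applies a Python format specification. It is not the widget's actual implementation: `select_value` and `render` are hypothetical helpers, only plain dictionary keys are handled (real JSONPath supports much more), and it assumes the format string behaves like Python's own format specification.

```python
def select_value(result, json_path):
    """Resolve a simple dot path such as ".cpu.load1" against a check result.

    Only plain dictionary keys are handled; real JSONPath supports much more.
    """
    value = result
    for part in json_path.strip(".").split("."):
        if part:
            value = value[part]
    return value


def render(result, json_path=None, fmt=None):
    """Return the string a value widget might display (illustrative only)."""
    value = select_value(result, json_path) if json_path else result
    return format(value, fmt) if fmt else str(value)


check_result = {"cpu": {"load1": 0.123456, "load5": 0.2}}
print(render(check_result, json_path=".cpu.load1", fmt=".3f"))
```

With the sample result above, the selected value is 0.123456 and the “.3f” format rounds it to three decimal places.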
Charts can be configured by defining an “options” property. All options available to Flot charts can be overridden here, plus some extra options like stacked mode. The following shows an example of a stacked area chart with customized colors.
Data series can be filtered so that charts show only the data you want to see. To specify which data series should be visible, define the “series” property as an array of data series names, as shown below.
{
"type": "chart",
"title": "Orders+Failures/m",
"checkDefinitionId": 131,
"entityId": "GLOBAL",
"options": {
"series": {
"stack": true
},
"colors": [
"#ff3333",
"#33ff33"
],
"series": [ "Mean", "Peak" ]
}
}
See the Flot documentation for more details.
Data from KairosDB-queries¶
As detailed in the Grafana3 and KairosDB section, all ZMON check data is saved into KairosDB and can be queried from there. For chart widgets, you can directly use a KairosDB query in the options section of a widget to specify the data series to be used.
The query consists of the key metrics (which indicates the data series to use) and a time specifier, for our purposes usually start_relative. In addition you can use cache_time (in seconds) to indicate that a previous result can be reused.
Here is an example which shows the values of check 1 for just three of its entities.
{
"options": {
"lines": {},
"legend": {
"backgroundOpacity": 0.1,
"show": true,
"position": "ne"
},
"series": {
"stack": false
},
"start_relative": {
"unit": "minutes",
"value": "30"
},
"metrics": [
{
"tags": {
"entity": [
"website-zalando.de",
"website-zalando.ch",
"website-zalando.at"
],
"key": []
},
"name": "zmon.check.1",
"group_by": [
{
"name": "tag",
"tags": [
"entity",
"key"
]
}
]
}
],
"cache_time": 0,
"colors": [
"#F00",
"#0F0",
"#00F"
]
},
"type": "chart",
"title": "Response time (just de/at/ch)"
}
An easy way to compose the KairosDB queries (especially the value for metrics) is to create a new Grafana dashboard in the built-in Grafana and then copy the query from the requests sent by the browser (Developer Tools → Network in Chromium).
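If you maintain several such widgets, generating them can be less error-prone than editing JSON by hand. The Python sketch below builds a chart widget shaped like the example above; `kairosdb_chart_widget` is a hypothetical helper, not part of ZMON, and it only covers the keys shown in that example.

```python
import json


def kairosdb_chart_widget(title, check_id, entities, minutes=30, cache_time=0):
    """Build a chart widget dict whose data comes from a KairosDB query."""
    return {
        "type": "chart",
        "title": title,
        "options": {
            # Time specifier: look back `minutes` minutes from now.
            "start_relative": {"unit": "minutes", "value": str(minutes)},
            "cache_time": cache_time,
            "metrics": [
                {
                    # All data of one check lives in "zmon.check.<checkid>".
                    "name": "zmon.check.{}".format(check_id),
                    "tags": {"entity": list(entities), "key": []},
                    "group_by": [{"name": "tag", "tags": ["entity", "key"]}],
                }
            ],
        },
    }


widget = kairosdb_chart_widget(
    "Response time (just de/at/ch)",
    1,
    ["website-zalando.de", "website-zalando.ch", "website-zalando.at"],
)
# The dashboard's widget configuration is a JSON list of widgets.
print(json.dumps([widget], indent=2))
```

The printed JSON can be pasted into the dashboard's widget configuration field.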
IFRAME¶
The iframe widget is a simple widget that allows you to embed a third-party page in a widget container.
For browser security reasons, only same-domain source URLs can be used.
The “style” property sets the scale and size of the iframe inside the widget container. Typically, widths and heights larger than 100% are used, and scales around 0.5 are also common.
To reload the iframe after a given number of milliseconds, set the “refresh” property.
Sample iframe widget:
{
  "type": "iframe",
  "src": "http://example.com",
  "style": {
    "width": "180%", // Width to be occupied by iframe (px or %).
    "height": "180%", // Height to be occupied by iframe (px or %).
    "scale": 0.54 // Scaling ratio
  },
  "refresh": 60000 // Time in milliseconds after which the iframe content will be reloaded.
}
Alert Age¶
In the rightmost column of each alert block on the dashboard, the age of that alert is shown. An entry of “28m”, for example, indicates that the alert is 28 minutes old.
If an alert is raised for multiple entities, the alert age is based on the entity for which the alert has been raised first. Entities in downtime are ignored for determining alert age, but when an entity leaves downtime, the length of time it spent in downtime is taken into account.
An example:
time  | event                              | entity A                   | entity B      | alert age
------|------------------------------------|----------------------------|---------------|-------------------
00:00 | alert is raised for entity A       | raised for 0h              | not raised    | 0h
01:00 | alert is raised for entity B       | raised for 1h              | raised for 0h | 1h (from entity A)
02:00 | alert enters downtime for entity A | raised for 2h, in downtime | raised for 1h | 1h (from entity B)
03:00 | alert leaves downtime for entity A | raised for 3h              | raised for 2h | 3h (from entity A)
04:00 | alert is cleared for entity A      | not raised                 | raised for 3h | 3h (from entity B)
05:00 | alert enters downtime for entity A | not raised, in downtime    | raised for 4h | 4h (from entity B)
06:00 | alert is raised for entity A       | raised for 0h, in downtime | raised for 5h | 5h (from entity B)
07:00 | alert leaves downtime for entity A | raised for 1h              | raised for 6h | 6h (from entity B)
08:00 | alert is cleared for entity B      | raised for 2h              | not raised    | 2h (from entity A)
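The rule behind the example above can be summarized as: take the longest "raised for" duration among entities that are currently raised and not in downtime. The Python sketch below expresses that reading; `alert_age` and its input shape are hypothetical and only model the documented behaviour, not ZMON's actual code.

```python
def alert_age(entities):
    """Return the alert age in hours, or None if no entity counts.

    `entities` maps an entity name to a (raised_hours, in_downtime) tuple,
    where raised_hours is None when the alert is not raised for that entity.
    """
    ages = [
        hours
        for hours, in_downtime in entities.values()
        if hours is not None and not in_downtime
    ]
    return max(ages) if ages else None


# Rows taken from the example above.
print(alert_age({"A": (2, True), "B": (1, False)}))      # 02:00, A is in downtime
print(alert_age({"A": (3, False), "B": (2, False)}))     # 03:00, A left downtime
print(alert_age({"A": (2, False), "B": (None, False)}))  # 08:00, B was cleared
```

The three calls reproduce the alert ages of 1h, 3h and 2h listed for 02:00, 03:00 and 08:00.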
Widgets styling and effects based on active alerts¶
You can change the styling or add a blinking effect to widgets in case one or more alerts are active at the moment. This is done by using the “alertStyles” option, like the sample below:
{
"type": "gauge",
// Some widget configuration here...
"alertStyles": {
"blink": [1, 4, 20],
"red": [9]
}
}
In the sample above, the gauge widget will blink if alert 1, 4 or 20 is active, and its background will turn red if alert 9 is active. At the moment the following effects are defined:
- blink: will blink the whole widget (opacity 0 to 100%, 1 second interval)
- shake: will start shaking the widget
- red: set the background to red
- orange: set the background to orange
- yellow: set the background to yellow
- green: set the background to green
- blue: set the background to blue
Please note that you can mix different styles and alerts, as shown on the previous sample. If alerts 1 and 9 are active, it will blink with a red background. If you define different styles with the same alert ID it will always give priority to the last one.
Grafana3 and KairosDB¶
Grafana is a powerful open-source tool for creating dashboards to visualize metric data. ZMON deploys Grafana 3.x along with the new KairosDB plugin to read metric data from KairosDB. Grafana is served directly from the ZMON controller. Read requests are proxied through the controller so as not to expose the write/delete API from KairosDB. Dashboards are also saved via the controller, so there’s no need for any additional data store.
Check data¶
Workers will send all their data to KairosDB. Depending on the KairosDB setting, data is stored forever or you may set a TTL in KairosDB. ZMON will not clean up or roll up any data.
Serialization¶
In the simplest case you would have a check producing a single numeric value. In Zalando’s experience this is very rare.
ZMON also supports arbitrarily nested dictionaries of numeric values. Anything that is not a dictionary or a number will be silently dropped. The value is flattened into a single-level dictionary so that the elements can be stored in KairosDB (key-value storage). For example:
{
"load": {"1min":1,"5min":3,"15min":2},
"memory_free": 16000
}
Will be flattened to an equivalent of
{
"load.1min": 1,
"load.5min": 3,
"load.15min": 2,
"memory_free": 16000
}
You might also want to output a list. The simple workaround is to generate a dictionary whose keys are some identifier extracted from the elements.
e.g. transform this list:
{
  "partitions": [
    {"count": 2254839, "partition": "0", "stream_id": "55491eb8-3ccc-40c5-b7c6-69bf38df3e16"},
    {"count": 2029956, "partition": "1", "stream_id": "aa938451-d115-4e90-a5da-1ac4b435a4e9"}
  ]
}
into the following dictionary:
{
  "partitions": {
    "0": {"count": 2254839, "partition": "0", "stream_id": "55491eb8-3ccc-40c5-b7c6-69bf38df3e16"},
    "1": {"count": 2029956, "partition": "1", "stream_id": "aa938451-d115-4e90-a5da-1ac4b435a4e9"}
  }
}
This will be stored the same way as the value above (remember that strings are dropped):
{
"partitions.0.count": 2254839,
"partitions.1.count": 2029956
}
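The flattening and the list workaround described in this section can be sketched in a few lines of Python. This is an illustration of the documented behaviour, not the worker's actual code: `flatten` and `index_by` are hypothetical names, only dictionary check values are handled, and dropping booleans is an assumption of this sketch.

```python
import numbers


def flatten(value, prefix=""):
    """Flatten nested dicts of numbers into a single-level dict.

    Keys are joined with "."; anything that is neither a dict nor a
    number (strings, lists, None, ...) is dropped.
    """
    flat = {}
    for key, item in value.items():
        path = "{}.{}".format(prefix, key) if prefix else str(key)
        if isinstance(item, dict):
            flat.update(flatten(item, path))
        # Assumption: booleans are not treated as numbers in this sketch.
        elif isinstance(item, numbers.Number) and not isinstance(item, bool):
            flat[path] = item
    return flat


def index_by(items, field):
    """Turn a list of dicts into a dict keyed by one of their fields."""
    return {str(item[field]): item for item in items}


partitions = [
    {"count": 2254839, "partition": "0", "stream_id": "55491eb8-3ccc-40c5-b7c6-69bf38df3e16"},
    {"count": 2029956, "partition": "1", "stream_id": "aa938451-d115-4e90-a5da-1ac4b435a4e9"},
]
print(flatten({"partitions": index_by(partitions, "partition")}))
```

In a check command you would apply something like `index_by` before returning the value, so that the per-partition counts survive serialization.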
Tagging¶
KairosDB creates time series with a name and allows us to tag data points with additional (tagname, tagvalue) pairs.
ZMON stores all data for a single check in a time series named “zmon.check.<checkid>”.
Single data points are then tagged as follows to describe their contents:
- entity: containing the entity id (some character replace rules are applied)
- key: containing the dict key after serialization of check value (see above)
- metric: contains the last segment of “key” split by “.” (making selection easier in tooling)
- hg: the host group (hg) contains a substring of the entity id, in an attempt to group e.g. cassandra01 and cassandra02 into hg=cassandra
For a certain set of metrics, additional tags may be added (REST metrics/actuator):
- sc: HTTP status code
- sg: first digit of HTTP status code
Some of the tagging may seem strange, but because KairosDB does not allow real operations on tags, these tags are precomputed to allow easier filtering in tools and charts. This is also fine from a storage and performance point of view during writes: KairosDB’s Cassandra implementation creates a new row for each unique tuple (time series name, set of tags), so each tag set is stored only once.
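To make the naming scheme concrete, the Python sketch below shows how one flattened key could map to a time series name and to the “key” and “metric” tags. `data_point` is a hypothetical helper and the returned dict is only illustrative; the entity character replacement rules and the “hg” derivation are deliberately left out because they are not specified here.

```python
def data_point(check_id, entity_id, key, value):
    """Describe one data point for a flattened check value (illustrative)."""
    return {
        "name": "zmon.check.{}".format(check_id),
        "value": value,
        "tags": {
            "entity": entity_id,  # character replacement rules omitted
            "key": key,
            "metric": key.rsplit(".", 1)[-1],  # last segment of the key
        },
    }


print(data_point(1, "cassandra01", "load.1min", 1))
```

For the key “load.1min” the “metric” tag is “1min”; for a key without dots, such as “memory_free”, “key” and “metric” are identical.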
“Read Only” Display Login¶
The ZMON front end requires users to log in. However, a very common way of deploying dashboards is on TV screens across office spaces, e.g. to render Grafana or ZMON dashboards. For this, ZMON provides a way to log in a read-only authenticated user via one-time tokens.
Those tokens can be created by any real user, either by logging in first and switching to TV mode or via the ZMON CLI.
How does it work¶
The first time a valid one-time token is used to log in, we associate a random UUID and the device IP with it. Both are registered within ZMON to create a persisted session, so the session continues to work after the frontend is redeployed.
Tokens can’t be reused: once a token has been used, you need to create a new one. You’ll need a different token for each additional device or location. One-time token sessions last up to 365 days.
Using the ZMON CLI¶
You can also generate one time tokens using the command line tool. The tool also allows you to list which tokens you already generated.
Getting a token¶
zmon onetime-token get
Retrieving new one-time token ...
https://zmon.example.org/tv/AocciOWf/
OK
Login with token¶
Use the URL in the target browser to login directly. This will create a read-only session.
https://<your zmon url>/tv/<your token>
Note
Please make sure you access the generated URL in order to login. Appending the <token> to any other ZMON device or location won’t work.
Listing existing tokens¶
zmon onetime-token list
- bound_at: 2008-05-08 12:16:21.696000
bound_expires: 1234567800000
bound_ip: ''
created: 2008-05-08 12:16:20.533000
token: 1234abCD
Check Command Reference¶
To give an overview of available commands, we divided them into several categories.
AppDynamics¶
Enable AppDynamics Healthrule violations check and optionally query underlying Elasticsearch cluster raw logs.
appdynamics(url=None, username=None, password=None, es_url=None, index_prefix='')
Initialize AppDynamics wrapper.
Parameters:
Note
If username and password are not supplied, then OAUTH2 will be used.
If appdynamics() is initialized with no args, then plugin configuration values will be used.
Methods of AppDynamics¶
healthrule_violations(application, time_range_type=BEFORE_NOW, duration_in_mins=5, start_time=None, end_time=None, severity=None)¶
Return Healthrule violations for AppDynamics application.
Parameters:
- application (str) – Application name or ID
- time_range_type (str) – Valid time range type. Valid range types are BEFORE_NOW, BEFORE_TIME, AFTER_TIME and BETWEEN_TIMES. Default is BEFORE_NOW.
- duration_in_mins (int) – Time duration in mins. Required for BEFORE_NOW, AFTER_TIME, BEFORE_TIME range types. Default is 5 mins.
- start_time (int) – Start time (in milliseconds) from which the metric data is returned. Default is 5 mins ago.
- end_time (int) – End time (in milliseconds) until which the metric data is returned. Default is now.
- severity (str) – Filter results based on severity. Valid values are CRITICAL or WARNING.
Returns: List of healthrule violations
Return type: list
Example query:
appdynamics('https://appdynamics/controller/rest').healthrule_violations('49', time_range_type='BEFORE_NOW', duration_in_mins=5)
[
  {
    affectedEntityDefinition: {
      entityId: 408,
      entityType: "BUSINESS_TRANSACTION",
      name: "/error"
    },
    detectedTimeInMillis: 0,
    endTimeInMillis: 0,
    id: 39637,
    incidentStatus: "OPEN",
    name: "Backend errrors (percentage)",
    severity: "CRITICAL",
    startTimeInMillis: 1462244635000,
  }
]
metric_data(application, metric_path, time_range_type=BEFORE_NOW, duration_in_mins=5, start_time=None, end_time=None, rollup=True)¶
AppDynamics’s metric-data API
Parameters:
- application (str) – Application name or ID
- metric_path (str) – The path to the metric in the metric hierarchy
- time_range_type (str) – Valid time range type. Valid range types are BEFORE_NOW, BEFORE_TIME, AFTER_TIME and BETWEEN_TIMES. Default is BEFORE_NOW.
- duration_in_mins (int) – Time duration in mins. Required for BEFORE_NOW, AFTER_TIME, BEFORE_TIME range types.
- start_time (int) – Start time (in milliseconds) from which the metric data is returned. Default is 5 mins ago.
- end_time (int) – End time (in milliseconds) until which the metric data is returned. Default is now.
- rollup (bool) – By default, the values of the returned metrics are rolled up into a single data point (rollup=True). To get separate results for all values within the time range, set the rollup parameter to False.
Returns: metric values for a metric
Return type: list
query_logs(q='', body=None, size=100, source_type=SOURCE_TYPE_APPLICATION_LOG, duration_in_mins=5)¶
Perform search query on AppDynamics ES logs.
Parameters:
- q (str) – Query string used in search.
- body (dict) – Dict holding an ES query DSL.
- size (int) – Number of hits to return. Default is 100.
- source_type (str) – sourceType field filtering. Defaults to application-log, and will be part of q.
- duration_in_mins (int) – Duration in mins before current time. Default is 5 mins.
Returns: ES query result hits.
Return type: list
count_logs(q='', body=None, source_type=SOURCE_TYPE_APPLICATION_LOG, duration_in_mins=5)¶
Perform count query on AppDynamics ES logs.
Parameters:
- q (str) – Query string used in search. Will be ignored if body is not None.
- body (dict) – Dict holding an ES query DSL.
- source_type (str) – sourceType field filtering. Defaults to application-log, and will be part of q.
- duration_in_mins (int) – Duration in mins before current time. Default is 5 mins. Will be ignored if body is not None.
Returns: Query match count.
Return type:
Note
When passing an ES query DSL in body, all filter parameters should be explicitly added in the query body (e.g. eventTimestamp, application_id, sourceType).
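The list returned by healthrule_violations() usually needs a little post-processing before it is useful in an alert condition. The Python sketch below filters a list shaped like the example response shown for that method; `open_violations` is a hypothetical helper, and the sample data is copied from that example rather than from a live controller.

```python
def open_violations(violations, severity="CRITICAL"):
    """Return names of violations that are still open with the given severity."""
    return [
        v["name"]
        for v in violations
        if v.get("incidentStatus") == "OPEN" and v.get("severity") == severity
    ]


# Sample shaped like the example response of healthrule_violations().
sample = [
    {
        "affectedEntityDefinition": {
            "entityId": 408,
            "entityType": "BUSINESS_TRANSACTION",
            "name": "/error",
        },
        "id": 39637,
        "incidentStatus": "OPEN",
        "name": "Backend errrors (percentage)",
        "severity": "CRITICAL",
        "startTimeInMillis": 1462244635000,
    }
]
print(open_violations(sample))
```

In a check command, the sample list would be replaced by the result of the healthrule_violations() call, and an alert could then trigger whenever the returned list is not empty.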
Cassandra¶
Provides access to a Cassandra cluster via cassandra() wrapper object.
cassandra(node, keyspace, username=None, password=None, port=9042, connect_timeout=1, protocol_version=3)
Initialize cassandra wrapper.
Parameters:
- node (str) – Cassandra host.
- keyspace (str) – Cassandra keyspace used during the session.
- username (str) – Username used in connection. It is recommended to use unprivileged user for cassandra checks.
- password (str) – Password used in connection.
- port (int) – Cassandra host port. Default is 9042.
- connect_timeout (int) – Connection timeout.
- protocol_version (str) – Protocol version used in connection. Default is 3.
Note
You should always use an unprivileged user to access your databases. Use plugin.cassandra.user and plugin.cassandra.pass to configure credentials for the zmon-worker.
CloudWatch¶
If running on AWS you can use cloudwatch() to access AWS metrics easily.
cloudwatch(region=None, assume_role_arn=None)
Initialize CloudWatch wrapper.
Parameters:
Methods of Cloudwatch¶
query_one(dimensions, metric_name, statistics, namespace, period=60, minutes=5, start=None, end=None, extended_statistics=None)¶
Query a single AWS CloudWatch metric and return a single scalar value (float). Metric will be aggregated over the last five minutes using the provided aggregation type.
This method is a more low-level variant of the query method: all parameters, including all dimensions, need to be known.
Parameters:
- dimensions (dict) – Cloudwatch dimensions. Example {'LoadBalancerName': 'my-elb-name'}
- metric_name (str) – Cloudwatch metric. Example 'Latency'.
- statistics (list) – Cloudwatch metric statistics. Example 'Sum'
- namespace (str) – Cloudwatch namespace. Example 'AWS/ELB'
- period (int) – Cloudwatch statistics granularity in seconds. Default is 60.
- minutes (int) – Used to determine start time of the Cloudwatch query. Default is 5. Ignored if start is supplied.
- start (int) – Cloudwatch start timestamp. Default is None.
- end (int) – Cloudwatch end timestamp. Default is None. If not supplied, then end time is now.
- extended_statistics (list) – Cloudwatch ExtendedStatistics for percentiles query. Example ['p95', 'p99'].
Returns: Return a float if single value, dict otherwise.
Return type:
Example query with percentiles for AWS ALB:
cloudwatch().query_one({'LoadBalancer': 'app/my-alb/1234'}, 'TargetResponseTime', 'Average', 'AWS/ApplicationELB', extended_statistics=['p95', 'p99', 'p99.45'])
{
  'Average': 0.224,
  'p95': 0.245,
  'p99': 0.300,
  'p99.45': 0.500
}
Note
In very rare cases, e.g. for ELB metrics, you may see only 1/2 or 1-2/3 of the value in ZMON because of a race condition over which data is already present in CloudWatch. To fix this, click “evaluate” on the alert; this will trigger the check and move its execution time to a new start time.
query(dimensions, metric_name, statistics='Sum', namespace=None, period=60, minutes=5)¶
Query AWS CloudWatch for metrics. Metrics will be aggregated over the last five minutes using the provided aggregation type (default “Sum”).
dimensions is a dictionary to filter the metrics to query. See the list_metrics boto documentation. You can provide the special value “NOT_SET” for a dimension to only query metrics where the given key is not set. This makes sense e.g. for ELB metrics as they are available both per AZ (“AvailabilityZone” has a value) and aggregated over all AZs (“AvailabilityZone” not set). Additionally you can include the special “*” character in a dimension value to do fuzzy (shell globbing) matching.
metric_name is the name of the metric to filter against (e.g. “RequestCount”).
namespace is an optional namespace filter (e.g. “AWS/EC2”).
To query an ELB for requests per second:
# both using special "NOT_SET" and "*" in dimensions here:
val = cloudwatch().query({'AvailabilityZone': 'NOT_SET', 'LoadBalancerName': 'pierone-*'}, 'RequestCount', 'Sum')['RequestCount']
requests_per_second = val / 60
You can find existing metrics with the AWS CLI tools:
$ aws cloudwatch list-metrics --namespace "AWS/EC2"
Use the “dimensions” argument to select on what dimension(s) to aggregate over:
$ aws cloudwatch list-metrics --namespace "AWS/EC2" --dimensions Name=AutoScalingGroupName,Value=my-asg-FEYBCZF
The desired metric can now be queried in ZMON:
cloudwatch().query({'AutoScalingGroupName': 'my-asg-*'}, 'DiskReadBytes', 'Sum')
alarms(alarm_names=None, alarm_name_prefix=None, state_value=STATE_ALARM, action_prefix=None, max_records=50)¶
Retrieve cloudwatch alarms filtered by state value.
See describe_alarms boto documentation for more details.
Parameters:
- alarm_names (list) – List of alarm names.
- alarm_name_prefix (str) – Prefix of alarms. Cannot be specified if alarm_names is specified.
- state_value (str) – State value used in alarm filtering. Available values are OK, ALARM (default) and INSUFFICIENT_DATA.
- action_prefix (str) – Action name prefix. Example arn:aws:autoscaling: to filter results for all autoscaling related alarms.
- max_records (int) – Maximum records to be returned. Default is 50.
Returns: List of MetricAlarms.
Return type: list
cloudwatch().alarms(state_value='ALARM')[0]
{
'ActionsEnabled': True,
'AlarmActions': ['arn:aws:autoscaling:...'],
'AlarmArn': 'arn:aws:cloudwatch:...',
'AlarmConfigurationUpdatedTimestamp': datetime.datetime(2016, 5, 12, 10, 44, 15, 707000, tzinfo=tzutc()),
'AlarmDescription': 'Scale-down if CPU < 50% for 10.0 minutes (Average)',
'AlarmName': 'metric-alarm-for-service-x',
'ComparisonOperator': 'LessThanThreshold',
'Dimensions': [
{
'Name': 'AutoScalingGroupName',
'Value': 'service-x-asg'
}
],
'EvaluationPeriods': 2,
'InsufficientDataActions': [],
'MetricName': 'CPUUtilization',
'Namespace': 'AWS/EC2',
'OKActions': [],
'Period': 300,
'StateReason': 'Threshold Crossed: 1 datapoint (36.1) was less than the threshold (50.0).',
'StateReasonData': '{...}',
'StateUpdatedTimestamp': datetime.datetime(2016, 5, 12, 10, 44, 16, 294000, tzinfo=tzutc()),
'StateValue': 'ALARM',
'Statistic': 'Average',
'Threshold': 50.0
}
Counter¶
The counter() function allows you to get increment rates of increasing counter values. The main use case for counter() is to get rates per second of JMX counter beans (e.g. “Tomcat Requests”). The counter function requires one parameter, key, to identify the counter.
per_second(value)¶
counter('requests').per_second(get_total_requests())
Returns the value’s increment rate per second. Value must be a float or integer.
per_minute(value)¶
counter('requests').per_minute(get_total_requests())
Convenience method to return the value’s increment rate per minute (same as the result of per_second() multiplied by 60).
Internally counter values and timestamps are stored in Redis.
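The idea behind a per-second rate can be sketched as follows: remember the previous value and timestamp for each key, then divide the increment by the elapsed time. This is only an illustration, not the worker's implementation. A plain dict stands in for Redis, `per_second` here is a hypothetical free function, and returning 0.0 for the first sample or for a non-positive time difference is an assumption.

```python
def per_second(store, key, value, now):
    """Return the increment rate per second for `key`, updating `store`."""
    previous = store.get(key)
    store[key] = (value, now)
    if previous is None:
        return 0.0  # assumption: no rate can be computed from the first sample
    last_value, last_time = previous
    elapsed = now - last_time
    if elapsed <= 0:
        return 0.0  # assumption: avoid dividing by zero
    return (value - last_value) / elapsed


store = {}  # stands in for Redis in this sketch
per_second(store, "requests", 1000, now=0.0)
print(per_second(store, "requests", 1600, now=60.0))
```

Here 600 additional requests were counted over 60 seconds, which gives a rate of 10 requests per second.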
Data Pipeline¶
If running on AWS you can use datapipeline() to access AWS Data Pipelines’ health easily.
datapipeline(region=None)
Initialize Data Pipeline wrapper.
Parameters: region (str) – AWS region for Data Pipeline queries, e.g. “eu-west-1”. Defaults to the region in which the check is being executed. Note that Data Pipeline is not available in “eu-central-1” at time of writing.
Methods of Data Pipeline¶
get_details(pipeline_ids)¶
Query AWS Data Pipeline IDs supplied as a string (single) or list of strings (multiple). Return a dict of ID(s) and status dicts as described in describe_pipelines boto documentation.
Parameters: pipeline_ids (Union[str, list]) – Data Pipeline IDs. Example df-0123456789ABCDEFGHI
Return type: dict
Example query with single Data Pipeline ID supplied in a list:
datapipeline().get_details(pipeline_ids=['df-exampleA'])
{
  "df-exampleA": {
    "@lastActivationTime": "2018-01-30T14:23:52",
    "pipelineCreator": "ABCDEF:auser",
    "@scheduledPeriod": "24 hours",
    "@accountId": "0123456789",
    "name": "exampleA",
    "@latestRunTime": "2018-01-04T03:00:00",
    "@id": "df-0441325MB6VYFI6MUU1",
    "@healthStatusUpdatedTime": "2018-01-01T10:00:00",
    "@creationTime": "2018-01-01T10:00:00",
    "@userId": "0123456789",
    "@sphere": "PIPELINE",
    "@nextRunTime": "2018-01-05T03:00:00",
    "@scheduledStartTime": "2018-01-02T03:00:00",
    "@healthStatus": "HEALTHY",
    "uniqueId": "exampleA",
    "*tags": "[{\"key\":\"DataPipelineName\",\"value\":\"exampleA\"},{\"key\":\"DataPipelineId\",\"value\":\"df-exampleA\"}]",
    "@version": "2",
    "@firstActivationTime": "2018-01-01T10:00:00",
    "@pipelineState": "SCHEDULED"
  }
}
EBS¶
Allows you to describe EBS objects (currently, only Snapshots are supported).
ebs()
Methods of EBS¶
list_snapshots(account_id, max_items)¶
List the EBS Snapshots owned by the given account_id. By default, listing is possible for up to 1000 items, so we use pagination internally to overcome this.
Parameters:
- account_id – AWS account id number (as a string). Defaults to the AWS account id where the check is running.
- max_items – the maximum number of snapshots to list. Defaults to 100.
Returns: an EBSSnapshotsList object
class EBSSnapshotsList¶
items()¶
Returns a list of dicts like
{
  "id": "snap-12345",
  "description": "Snapshot description...",
  "size": 123,
  "start_time": "2017-07-16T01:01:21Z",
  "state": "completed"
}
Example usage:
ebs().list_snapshots().items()
snapshots = ebs().list_snapshots(max_items=1000).items() # for listing more than the default of 100 snapshots
start_time = snapshots[0]["start_time"].isoformat() # returns a string that can be passed to time()
age = time() - time(start_time)
Elasticsearch¶
Provides search queries and health check against an Elasticsearch cluster.
elasticsearch(url=None, timeout=10, oauth2=False)
Note
If url is None, then the plugin will use the default Elasticsearch cluster set in worker configuration.
Methods of Elasticsearch¶
search(indices=None, q='', body=None, source=True, size=DEFAULT_SIZE)¶
Search ES cluster using URI or Request body search. If body is None then GET request will be used.
Parameters:
- indices (list) – List of indices to search. Limited to only 10 indices. ['_all'] will search all available indices, which effectively leads to same results as None. Indices can accept wildcard form.
- q (str) – Search query string. Will be ignored if body is not None.
- body (dict) – Dict holding an ES query DSL.
- source (bool) – Whether to include _source field in query response.
- size (int) – Number of hits to return. Maximum value is 1000. Set to 0 if interested in hits count only.
Returns: ES query result.
Return type:
Example query:
elasticsearch('http://es-cluster').search(indices=['logstash-*'], q='client:192.168.20.* AND http_status:500', size=0, source=False)
{
  "_shards": {
    "failed": 0,
    "successful": 5,
    "total": 5
  },
  "hits": {
    "hits": [],
    "max_score": 0.0,
    "total": 1
  },
  "timed_out": false,
  "took": 2
}
count(indices=None, q='', body=None)¶
Return ES count of matching query.
Parameters:
- indices (list) – List of indices to search. Limited to only 10 indices. ['_all'] will search all available indices, which effectively leads to same results as None. Indices can accept wildcard form.
- q (str) – Search query string. Will be ignored if body is not None.
- body (dict) – Dict holding an ES query DSL.
Returns: ES query result.
Return type:
Example query:
elasticsearch('http://es-cluster').count(indices=['logstash-*'], q='client:192.168.20.* AND http_status:500')
{
  "_shards": {
    "failed": 0,
    "successful": 16,
    "total": 16
  },
  "count": 12
}
health()¶
Return ES cluster health.
Returns: Cluster health result.
Return type: dict
elasticsearch('http://es-cluster').health()
{
  "active_primary_shards": 11,
  "active_shards": 11,
  "active_shards_percent_as_number": 50.0,
  "cluster_name": "big-logs-cluster",
  "delayed_unassigned_shards": 0,
  "initializing_shards": 0,
  "number_of_data_nodes": 1,
  "number_of_in_flight_fetch": 0,
  "number_of_nodes": 1,
  "number_of_pending_tasks": 0,
  "relocating_shards": 0,
  "status": "yellow",
  "task_max_waiting_in_queue_millis": 0,
  "timed_out": false,
  "unassigned_shards": 11
}
Entities¶
Provides access to ZMON entities.
entities(service_url, infrastructure_account, verify=True, oauth2=False)
Initialize entities wrapper.
Parameters:
Note
If service_url or infrastructure_account are not supplied, their corresponding values in worker plugin config will be used.
Methods of Entities¶
search_local(**kwargs)¶
Search entities in the local infrastructure account. If infrastructure_account is not supplied in kwargs, the search covers entities “local” to your filtered entities by using the same infrastructure_account as a default filter.
Parameters: kwargs (str) – Filtering kwargs
Returns: Entities
Return type: list
Example searching all instance entities in local account:
entities().search_local(type='instance')
search_all(**kwargs)¶
Search all entities.
Parameters: kwargs (str) – Filtering kwargs
Returns: Entities
Return type: list
alert_coverage(**kwargs)¶
Return alert coverage for infrastructure_account.
Parameters: kwargs (str) – Filtering kwargs
Returns: Alert coverage result.
Return type: list
entities().alert_coverage(type='instance', infrastructure_account='1052643')
[
  {
    'alerts': [],
    'entities': [
      {'id': 'app-1-instance', 'type': 'instance'}
    ]
  }
]
EventLog¶
The eventlog() function allows you to conveniently count EventLog events by type and time.
count(event_type_ids, time_from[, time_to=None][, group_by=None])
Return event counts for given parameters.
event_type_ids is either a single integer (use hex notation, e.g. 0x96001) or a list of integers.
time_from is a string time specification ('-5m' means 5 minutes ago, '-1h' means 1 hour ago).
time_to is a string time specification and defaults to now if not given.
group_by can specify an EventLog field name to group counts by.
eventlog().count(0x96001, time_from='-1m') # returns a single number
eventlog().count([0x96001, 0x63005], time_from='-1m') # returns dict {'96001': 123, '63005': 456}
eventlog().count(0x96001, time_from='-1m', group_by='appDomainId') # returns dict {'1': 123, '5': 456, ..}
The count() method internally requests the EventLog Viewer’s “count” JSON endpoint.
History¶
Wrapper for KairosDB to access history data about checks.
history(url=None, check_id='', entities=None, oauth2=False)
Methods of History¶
-
result
(time_from=ONE_WEEK_AND_5MIN, time_to=ONE_WEEK)¶ Return query result.
Parameters: - time_from – Relative time from in seconds. Default is
ONE_WEEK_AND_5MIN
. - time_to – Relative time to in seconds. Default is
ONE_WEEK
.
Returns: Json result
Return type: - time_from – Relative time from in seconds. Default is
-
get_one
(time_from=ONE_WEEK_AND_5MIN, time_to=ONE_WEEK)¶ Return first result values.
Parameters: - time_from – Relative time from in seconds. Default is
ONE_WEEK_AND_5MIN
. - time_to – Relative time to in seconds. Default is
ONE_WEEK
.
Returns: List of values
Return type: list
-
get_aggregated
(key, aggregator, time_from=ONE_WEEK_AND_5MIN, time_to=ONE_WEEK)¶ Return first result values. If no
key
filtering matches, an empty list is returned.
Returns: List of values
Return type: list
-
get_avg
(key, time_from=ONE_WEEK_AND_5MIN, time_to=ONE_WEEK)¶ Return aggregated average.
Parameters: - key (str) – Tag key used in filtering the results.
- time_from – Relative time from in seconds. Default is
ONE_WEEK_AND_5MIN
. - time_to – Relative time to in seconds. Default is
ONE_WEEK
.
Returns: List of values
Return type: list
-
get_std_dev
(key, time_from=ONE_WEEK_AND_5MIN, time_to=ONE_WEEK)¶ Return aggregated standard deviation.
Parameters: - key (str) – Tag key used in filtering the results.
- time_from – Relative time from in seconds. Default is
ONE_WEEK_AND_5MIN
. - time_to – Relative time to in seconds. Default is
ONE_WEEK
.
Returns: List of values
Return type: list
-
distance
(weeks=4, snap_to_bin=True, bin_size='1h', dict_extractor_path='')¶ For detailed docs on the distance function, please see History distance functionality.
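The default arguments of result(), get_one(), get_avg() and get_std_dev() describe a five-minute window that ended one week ago. The sketch below shows that arithmetic. The numeric values of ONE_WEEK and ONE_WEEK_AND_5MIN are an assumption derived only from the constant names, and default_window is a hypothetical helper, not part of the History wrapper.

```python
# Assumed values, derived only from the constant names used above.
ONE_WEEK = 7 * 24 * 60 * 60
ONE_WEEK_AND_5MIN = ONE_WEEK + 5 * 60


def default_window(now):
    """Return (start, end) in epoch seconds for the default history window."""
    return now - ONE_WEEK_AND_5MIN, now - ONE_WEEK


start, end = default_window(now=1500000000)
print(end - start)  # width of the window in seconds
```

Comparing the current check value against this window is a common way to detect week-over-week deviations.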
HTTP¶
Access to HTTP (and HTTPS) endpoints is provided by the http()
function.
-
http
(url[, method='GET'][, timeout=10][, max_retries=0][, verify=True][, oauth2=False][, allow_redirects=None][, headers=None]) Parameters: - url (str) – The URL that is to be queried. See below for details.
- method (str) – The HTTP request method. Allowed values are GET or HEAD.
- timeout (float) – The timeout for the HTTP request, in seconds. Defaults to 10.
- max_retries (int) – The number of times the HTTP request should be retried if it fails. Defaults to 0.
- verify (bool) – Can be set to False to disable SSL certificate verification.
- oauth2 (bool) – Can be set to True to inject an OAuth 2 Bearer access token into the outgoing request.
- oauth2_token_name (str) – The name of the OAuth 2 token. Default is uid.
- allow_redirects (bool) – Follow request redirects. If None, it is set to True for GET requests and to False for HEAD requests.
- headers (dict) – The headers to be used in the HTTP request.
Returns: An object encapsulating the response from the server. See below.
For checks on entities that define the attributes url or host, the given URL may be relative. In that case, the URL http://<value><url> is queried, where <value> is the value of that attribute, and <url> is the URL passed to this function. If an entity defines both url and host, the former is used.
This function cannot query URLs using a scheme other than HTTP or HTTPS; URLs that do not start with http:// or https:// are considered to be relative.
Example:
http('http://www.example.org/data?fetch=json').json()
# avoid raising an error when the response has an error status (e.g. 500 or 503)
# but you are still interested in the response JSON
http('http://www.example.org/data?fetch=json').json(raise_error=False)
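The rule for relative URLs can be sketched as a small function. This is an illustration of the behaviour described above, not the worker's actual code: resolve_url is a hypothetical name, the entity is modelled as a plain dict, and raising ValueError when neither attribute exists is an assumption.

```python
def resolve_url(entity, url):
    """Resolve the URL passed to http() against an entity, as described above."""
    # Anything starting with http:// or https:// is absolute and used as-is.
    if url.startswith(("http://", "https://")):
        return url
    # Otherwise the URL is relative: prefer the entity's 'url' attribute,
    # then fall back to 'host'.
    for attribute in ("url", "host"):
        value = entity.get(attribute)
        if value:
            return "http://" + value + url
    # Assumption: the behaviour without 'url' or 'host' is not documented.
    raise ValueError("relative URL needs an entity with 'url' or 'host'")


entity = {"host": "app01.example.org", "url": "app.example.org:8080"}
print(resolve_url(entity, "/heartbeat.jsp"))
print(resolve_url(entity, "https://www.example.org/data"))
```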
HTTP Responses¶
The object returned by the http()
function provides methods: json()
, text()
, headers()
, cookies()
, content_size()
, time()
and code()
.
-
json
(raise_error=True)¶ This method returns an object representing the content of the JSON response from the queried endpoint. Usually, this will be a map (represented by a Python
dict
), but, depending on the endpoint, it may also be a list, string, set, integer, floating-point number, or Boolean.
-
text
(raise_error=True)¶ Returns the text response from queried endpoint:
http("/heartbeat.jsp", timeout=5).text().strip()=='OK: JVM is running'
Since we’re using a relative url, this check has to be defined for specific entities (e.g. type=zomcat will run it on all zomcat instances). The strip function removes all leading and trailing whitespace.
-
headers
(raise_error=True)¶ Returns the response headers in a case-insensitive dict-like object:
http("/api/json", timeout=5).headers()['content-type']=='application/json'
-
cookies
()¶ Returns the response cookies in a dict-like object:
http("/heartbeat.jsp", timeout=5).cookies()['my_custom_cookie'] == 'custom_cookie_value'
-
content_size
(raise_error=True)¶ Returns the length of the response content:
http("/heartbeat.jsp", timeout=5).content_size() > 1024
-
time
(raise_error=True)¶ Returns the elapsed time in seconds until response was received:
http("/heartbeat.jsp", timeout=5).time() > 1.5
-
code
()¶ Returns the HTTP status code from the queried endpoint:
http("/heartbeat.jsp", timeout=5).code()
-
actuator_metrics
(prefix='zmon.response.', raise_error=True)¶ Parses the JSON result of a metrics endpoint into a map ep->method->status->metric:
http("/metrics", timeout=5).actuator_metrics()
-
prometheus
()¶ Parses the text result according to the Prometheus specs using their prometheus_client:
http("/metrics", timeout=5).prometheus()
-
prometheus_flat
()¶ Parses the text result according to the Prometheus specs using their prometheus_client and flattens the outcome:
http("/metrics", timeout=5).prometheus_flat()
-
jolokia
(read_requests, raise_error=False)¶ Sends a POST request to the endpoint given in the wrapper, after validating the endpoint and setting the request to be read-only.
Parameters: - read_requests (list) – see https://jolokia.org/reference/html/protocol.html#post-request
- raise_error – bool
Returns: Jolokia response
Example:
requests = [
    {'mbean': 'org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency'},
    {'mbean': 'org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency'},
]
results = http('http://{}:8778/jolokia/'.format(entity['ip']), timeout=15).jolokia(requests)
JMX¶
To use JMXQuery, run “jmxquery” (this is not yet released)
Queries beans’ attributes on hosts specified in entities filter:
jmx().query('java.lang:type=Memory', 'HeapMemoryUsage', 'NonHeapMemoryUsage').results()
Another example:
jmx().query('java.lang:type=Threading', 'ThreadCount', 'DaemonThreadCount', 'PeakThreadCount').results()
This would return a dict like:
{
"DaemonThreadCount": 524,
"PeakThreadCount": 583,
"ThreadCount": 575
}
KairosDB¶
Provides read access to the target KairosDB
-
kairosdb
(url, oauth2=False)
Methods of KairosDB¶
-
query
(name, group_by=None, tags=None, start=5, end=0, time_unit='minutes', aggregators=None, start_absolute=None, end_absolute=None) Query KairosDB.
Parameters: - name (str) – Metric name.
- group_by (list) – List of fields to group by.
- tags (dict) –
Filtering tags. Example of tags object:
{ "key": ["max"] }
- start (int) – Relative start time. Default is 5. Should be greater than or equal to 1.
- end (int) – End time. Default is 0. If not 0, then it should be greater than or equal to 1.
- time_unit (str) – Time unit (‘seconds’, ‘minutes’, ‘hours’). Default is ‘minutes’.
- aggregators (list) –
List of aggregators. Aggregator is an object that looks like
{ "name": "max", "sampling": { "value": "1", "unit": "minutes" }, "align_sampling": true }
- start_absolute (long) – Absolute start time in milliseconds, overrides the start parameter which is relative
- end_absolute (long) – Absolute end time in milliseconds, overrides the end parameter which is relative
Returns: Result queries.
Return type:
-
query_batch
(metrics, start=5, end=0, time_unit='minutes', start_absolute=None, end_absolute=None)¶ Query KairosDB for several checks at once.
Parameters: - metrics (list) –
List of KairosDB metric queries, one query per metric name, e.g.
[
    {
        'name': 'metric_name',    # name of the metric
        'group_by': ['foo'],      # list of fields to group by
        'aggregators': [          # list of aggregator objects
            {                     # structure of a single aggregator
                'name': 'max',
                'sampling': {
                    'value': '1',
                    'unit': 'minutes'
                },
                'align_sampling': True
            }
        ],
        'tags': {                 # dict with filtering tags
            'key': ['max']        # a key is a tag name, list of values is used to filter
                                  # all the records with given tag and given values
        }
    }
]
- start (int) – Relative start time. Default is 5.
- end (int) – End time. Default is 0.
- time_unit (str) – Time unit (‘seconds’, ‘minutes’, ‘hours’). Default is ‘minutes’.
- start_absolute (long) – Absolute start time in milliseconds, overrides the start parameter which is relative
Returns: Array of results for each queried metric
Return type: list
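Because query_batch() expects one query dict per metric, it can help to build those dicts with a small helper. The sketch below only assembles the structure shown above; metric_query is a hypothetical helper name, and the metric names in the usage lines are made up for illustration.

```python
def metric_query(name, tags=None, group_by=None, aggregator="max",
                 sampling_value="1", sampling_unit="minutes"):
    """Build one entry of the 'metrics' list in the documented shape."""
    query = {
        "name": name,
        "aggregators": [{
            "name": aggregator,
            "sampling": {"value": sampling_value, "unit": sampling_unit},
            "align_sampling": True,
        }],
    }
    # Optional parts are only added when given, to keep the query small.
    if group_by:
        query["group_by"] = list(group_by)
    if tags:
        query["tags"] = dict(tags)
    return query


# Hypothetical metric names, for illustration only.
metrics = [
    metric_query("zmon.check.1", tags={"key": ["max"]}),
    metric_query("zmon.check.2", group_by=["entity"], aggregator="avg"),
]
print(len(metrics))
```

Inside a check, such a list would then be passed as the metrics argument of query_batch().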
Kubernetes¶
Provides a wrapper for querying Kubernetes cluster resources.
-
kubernetes
(namespace='default') If namespace is None, all namespaces will be queried. This, however, will increase the number of calls to the Kubernetes API server.
Note
- Kubernetes wrapper will authenticate using service account, which assumes the worker is running in a Kubernetes cluster.
- All Kubernetes wrapper calls are scoped to the Kubernetes cluster hosting the worker. It is not intended to be used in querying multiple clusters.
Label Selectors¶
Kubernetes API provides a way to filter resources using labelSelector. Kubernetes wrapper provides a friendly syntax for filtering.
The following examples show different usage of the Kubernetes wrapper utilizing label filtering:
# Get all pods with label ``application`` equal to ``zmon-worker``
kubernetes().pods(application='zmon-worker')
kubernetes().pods(application__eq='zmon-worker')
# Get all pods with label ``application`` **not equal to** ``zmon-worker``
kubernetes().pods(application__neq='zmon-worker')
# Get all pods with label ``application`` **any of** ``zmon-worker`` or ``zmon-agent``
kubernetes().pods(application__in=['zmon-worker', 'zmon-agent'])
# Get all pods with label ``application`` **not any of** ``zmon-worker`` or ``zmon-agent``
kubernetes().pods(application__notin=['zmon-worker', 'zmon-agent'])
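The keyword syntax above maps naturally onto Kubernetes labelSelector expressions. The following sketch shows one way such a translation could look. It is an assumption about the mapping, not the wrapper's actual implementation, and build_label_selector is a hypothetical name.

```python
def build_label_selector(**kwargs):
    """Translate label keyword filters into a Kubernetes labelSelector string."""
    equality = {"eq": "=", "neq": "!="}
    parts = []
    for key in sorted(kwargs):  # sorted only to make the output deterministic
        value = kwargs[key]
        label, _, operator = key.partition("__")
        operator = operator or "eq"  # a bare label name means equality
        if operator in equality:
            parts.append("{}{}{}".format(label, equality[operator], value))
        elif operator in ("in", "notin"):
            parts.append("{} {} ({})".format(label, operator, ",".join(value)))
        else:
            raise ValueError("unknown operator: {}".format(operator))
    # Comma-separated requirements are ANDed together by Kubernetes.
    return ",".join(parts)


print(build_label_selector(application="zmon-worker"))
print(build_label_selector(application__in=["zmon-worker", "zmon-agent"]))
```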
Methods of Kubernetes¶
-
pods
(name=None, phase=None, ready=None, **kwargs)¶ Return list of Pods.
Parameters: - name (str) – Pod name.
- phase (str) – Pod status phase. Valid values are: Pending, Running, Failed, Succeeded or Unknown.
- ready (bool) – Pod readiness status. If
None
then all pods are returned. - kwargs (dict) – Pod Label Selectors filters.
Returns: List of pods. Typical pod has “metadata”, “status” and “spec” fields.
Return type: list
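As a rough illustration of working with the returned list, the sketch below finds running pods that are not ready. It assumes the returned items are plain dicts mirroring the Kubernetes Pod object (status.phase and status.conditions); the sample data and the helper names is_ready and not_ready_pod_names are made up for this example.

```python
def is_ready(pod):
    """True if the pod has a 'Ready' condition with status 'True'."""
    conditions = pod.get("status", {}).get("conditions", [])
    return any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in conditions)


def not_ready_pod_names(pods):
    """Names of pods that are in phase Running but not ready."""
    return [p["metadata"]["name"] for p in pods
            if p.get("status", {}).get("phase") == "Running" and not is_ready(p)]


# Sample data standing in for the result of kubernetes().pods(...).
sample_pods = [
    {"metadata": {"name": "worker-1"},
     "status": {"phase": "Running",
                "conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "worker-2"},
     "status": {"phase": "Running",
                "conditions": [{"type": "Ready", "status": "False"}]}},
    {"metadata": {"name": "worker-3"}, "status": {"phase": "Pending"}},
]
print(not_ready_pod_names(sample_pods))
```

In a real check the same result can often be obtained directly with the phase and ready parameters of pods().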
-
nodes
(name=None, **kwargs)¶ Return list of Nodes. Namespace does not apply.
Parameters: - name (str) – Node name.
- kwargs (dict) – Node Label Selectors filters.
Returns: List of nodes. Typical node has “metadata”, “status” and “spec” fields.
Return type: list
-
services
(name=None, **kwargs)¶ Return list of Services.
Parameters: - name (str) – Service name.
- kwargs (dict) – Service Label Selectors filters.
Returns: List of services. Typical service has “metadata”, “status” and “spec” fields.
Return type: list
-
endpoints
(name=None, **kwargs)¶ Return list of Endpoints.
Parameters: - name (str) – Endpoint name.
- kwargs (dict) – Endpoint Label Selectors filters.
Returns: List of Endpoints. Typical Endpoint has “metadata”, and “subsets” fields.
Return type: list
-
ingresses
(name=None, **kwargs)¶ Return list of Ingresses.
Parameters: - name (str) – Ingress name.
- kwargs (dict) – Ingress Label Selectors filters.
Returns: List of Ingresses. Typical Ingress has “metadata”, “spec” and “status” fields.
Return type: list
-
statefulsets
(name=None, replicas=None, **kwargs)¶ Return list of Statefulsets.
Parameters: - name (str) – Statefulset name.
- replicas (int) – Statefulset replicas.
- kwargs (dict) – Statefulset Label Selectors filters.
Returns: List of Statefulsets. Typical Statefulset has “metadata”, “status” and “spec” fields.
Return type: list
-
daemonsets
(name=None, **kwargs)¶ Return list of Daemonsets.
Parameters: - name (str) – Daemonset name.
- kwargs (dict) – Daemonset Label Selectors filters.
Returns: List of Daemonsets. Typical Daemonset has “metadata”, “status” and “spec” fields.
Return type: list
-
replicasets
(name=None, replicas=None, **kwargs)¶ Return list of ReplicaSets.
Parameters: - name (str) – ReplicaSet name.
- replicas (int) – ReplicaSet replicas.
- kwargs (dict) – ReplicaSet Label Selectors filters.
Returns: List of ReplicaSets. Typical ReplicaSet has “metadata”, “status” and “spec” fields.
Return type: list
-
deployments
(name=None, replicas=None, ready=None, **kwargs)¶ Return list of Deployments.
Parameters: - name (str) – Deployment name.
- replicas (int) – Deployment replicas.
- ready (bool) – Deployment readiness status.
- kwargs (dict) – Deployment Label Selectors filters.
Returns: List of Deployments. Typical Deployment has “metadata”, “status” and “spec” fields.
Return type: list
-
configmaps
(name=None, **kwargs)¶ Return list of ConfigMaps.
Parameters: - name (str) – ConfigMap name.
- kwargs (dict) – ConfigMap Label Selectors filters.
Returns: List of ConfigMaps. Typical ConfigMap has “metadata” and “data”.
Return type: list
-
persistentvolumeclaims
(name=None, phase=None, **kwargs)¶ Return list of PersistentVolumeClaims.
Parameters: - name (str) – PersistentVolumeClaim name.
- phase (str) – Volume phase.
- kwargs (dict) – PersistentVolumeClaim Label Selectors filters.
Returns: List of PersistentVolumeClaims. Typical PersistentVolumeClaim has “metadata”, “status” and “spec” fields.
Return type: list
-
persistentvolumes
(name=None, phase=None, **kwargs)¶ Return list of PersistentVolumes.
Parameters: - name (str) – PersistentVolume name.
- phase (str) – Volume phase.
- kwargs (dict) – PersistentVolume Label Selectors filters.
Returns: List of PersistentVolumes. Typical PersistentVolume has “metadata”, “status” and “spec” fields.
Return type: list
-
jobs
(name=None, **kwargs)¶ Return list of Jobs.
Parameters: - name (str) – Job name.
- kwargs (dict) – Job Label Selectors filters.
Returns: List of Jobs. Typical Job has “metadata”, “status” and “spec”.
Return type: list
LDAP¶
Retrieve OpenLDAP statistics (needs “cn=Monitor” database installed in LDAP server).
ldap().statistics()
This would return a dict like:
{
"connections_current": 77,
"connections_per_sec": 27.86,
"entries": 359369,
"max_file_descriptors": 65536,
"operations_add_per_sec": 0.0,
"operations_bind_per_sec": 27.99,
"operations_delete_per_sec": 0.0,
"operations_extended_per_sec": 0.23,
"operations_modify_per_sec": 0.09,
"operations_search_per_sec": 24.09,
"operations_unbind_per_sec": 27.82,
"waiters_read": 76,
"waiters_write": 0
}
All information is based on the cn=Monitor OpenLDAP tree. You can get more information in the OpenLDAP Administrator’s Guide. The meaning of the different fields is as follows:
connections_current
- Number of currently established TCP connections.
connections_per_sec
- Increase of connections per second.
entries
- Number of LDAP records.
operations_*_per_sec
- Number of operations per second per operation type (add, bind, search, ..).
waiters_read
- Number of waiters for read (the OpenLDAP documentation does not explain this value any further).
Memcached¶
Read-only access to memcached servers is provided by the memcached()
function.
-
memcached
([host=some.host][, port=11211]) Returns a connection to the Memcached server at <host>:<port>, where <host> is the value of the current entity’s host attribute, and <port> is the given port (default 11211). See below for a list of methods provided by the returned connection object.
Methods of the Memcached Connection¶
The object returned by the memcached()
function provides the following methods:
-
get
(key)¶ Returns the string stored at key. If key does not exist an error is raised.
memcached().get("example_memcached_key")
-
json
(key) Returns the data of the key as deserialized JSON data, i.e. you can store a JSON object as the value of the key and get a dict back:
memcached().json("example_memcached_key")
-
stats
([extra_keys=[STR, STR]])¶ Returns a dict with general Memcached statistics such as memory usage and operations/s. All values are extracted using the Memcached STATS command.
Additional keys returned by the memcached server’s stats command, e.g. version or uptime, can be requested via extra_keys.
Example result:
{
"incr_hits_per_sec": 0,
"incr_misses_per_sec": 0,
"touch_misses_per_sec": 0,
"decr_misses_per_sec": 0,
"touch_hits_per_sec": 0,
"get_expired_per_sec": 0,
"get_hits_per_sec": 100.01,
"cmd_get_per_sec": 119.98,
"cas_hits_per_sec": 0,
"cas_badval_per_sec": 0,
"delete_misses_per_sec": 0,
"bytes_read_per_sec": 6571.76,
"auth_errors_per_sec": 0,
"cmd_set_per_sec": 19.97,
"bytes_written_per_sec": 6309.17,
"get_flushed_per_sec": 0,
"delete_hits_per_sec": 0,
"cmd_flush_per_sec": 0,
"curr_items": 37217768,
"decr_hits_per_sec": 0,
"connections_per_sec": 0.02,
"cas_misses_per_sec": 0,
"cmd_touch_per_sec": 0,
"bytes": 3902170728,
"evictions_per_sec": 0,
"auth_cmds_per_sec": 0,
"get_misses_per_sec": 19.97
}
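A typical use of these statistics is to derive a cache hit ratio from the per-second rates. The sketch below does this on a plain dict with the same keys as the example result; hit_ratio is a hypothetical helper, and returning None when there are no get commands is a design choice made for this example.

```python
def hit_ratio(stats):
    """Fraction of get commands that were hits, or None without traffic."""
    gets = stats.get("cmd_get_per_sec", 0)
    if not gets:
        return None  # avoid dividing by zero on an idle server
    return stats.get("get_hits_per_sec", 0) / float(gets)


# Values taken from the example result above.
sample_stats = {
    "get_hits_per_sec": 100.01,
    "get_misses_per_sec": 19.97,
    "cmd_get_per_sec": 119.98,
}
print(round(hit_ratio(sample_stats), 3))
```

In a check command the dict would come from memcached().stats() instead of the sample data.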
Nagios¶
This function provides a wrapper for Nagios plugins.
-
check_load
()¶ nagios().nrpe('check_load')
Example check result as JSON:
{ "load1": 2.86, "load15": 3.13, "load5": 3.23 }
-
check_list_timeout
()¶ nagios().nrpe('check_list_timeout', path="/data/production/", timeout=10)
This command will run “timeout 10 ls /data/production/” on the target host via nrpe.
Example check result as JSON:
{ "exit":0, "timeout":0 }
exit is the exit code from NRPE: 0 for OK, 2 for ERROR. timeout should not be used yet.
-
check_diff_reverse
()¶ nagios().nrpe('check_diff_reverse')
Example check result as JSON:
{ "CommitLimit-Committed_AS": 16022524 }
-
check_mailq_postfix
()¶ nagios().nrpe('check_mailq_postfix')
Example check result as JSON:
{ "unsent": 0 }
-
check_memcachestatus
()¶ nagios().nrpe('check_memcachestatus', port=11211)
Example check result as JSON:
{ "curr_connections": 0.0, "cmd_get": 3569.09, "bytes_written": 66552.9, "get_hits": 1593.9, "cmd_set": 0.04, "curr_items": 0.0, "get_misses": 1975.19, "bytes_read": 83077.28 }
-
check_findfiles
()¶ Find-file analyzer plugin for Nagios. This plugin checks for newer files within a directory and checks their access time, modification time and count.
nagios().nrpe('check_findfiles', directory='/data/example/error/', epoch=1)
Example check result as JSON:
{ "ftotal": 0, "faccess": 0, "fmodify": 0 }
-
check_findolderfiles
()¶ Find-file analyzer plugin for Nagios. This plugin checks for files within a directory older than 2 given times in minutes.
nagios().nrpe('check_findolderfiles', directory='/data/stuff,/mnt/other', time01=480, time02=600)
Example check result as JSON:
{ "total files": 831, "files older than time01": 112, "files older than time02": 0 }
-
check_findfiles_names
()¶ Find-file analyzer plugin for Nagios. This plugin checks for newer files within a directory, optionally matching a filename pattern, and checks their access time, modification time and count.
nagios().nrpe('check_findfiles_names', directory='/mnt/storage/error/', epoch=1, name='app*')
Example check result as JSON:
{ "ftotal": 0, "faccess": 0, "fmodify": 0 }
-
check_findfiles_names_exclude
()¶ Find-file analyzer plugin for Nagios. This plugin checks for newer files within a directory, optionally matching a filename pattern (in this command, the matching files are excluded), and checks their access time, modification time and count.
nagios().nrpe('check_findfiles_names_exclude', directory='/mnt/storage/error/', epoch=1, name='app*')
Example check result as JSON:
{ "ftotal": 0, "faccess": 0, "fmodify": 0 }
-
check_logwatch
()¶ nagios().nrpe('check_logwatch', logfile='/var/logs/example/p{}/catalina.out'.format(entity['instance']), pattern='Full.GC')
Example check result as JSON:
{ "last": 0, "total": 0 }
-
check_ntp_time
()¶ nagios().nrpe('check_ntp_time')
Example check result as JSON:
{ "offset": 0.003063 }
-
check_iostat
()¶ nagios().nrpe('check_iostat', disk='sda')
Example check result as JSON:
{ "tps": 944.7, "iowrite": 6858.4, "ioread": 6268.4 }
-
check_hpacucli
()¶ nagios().nrpe('check_hpacucli')
Example check result as JSON:
{ "logicaldrive_1": "OK", "logicaldrive_2": "OK", "logicaldrive_3": "OK", "physicaldrive_2I:1:6": "OK", "physicaldrive_2I:1:5": "OK", "physicaldrive_1I:1:3": "OK", "physicaldrive_1I:1:2": "OK", "physicaldrive_1I:1:1": "OK", "physicaldrive_1I:1:4": "OK" }
-
check_hpasm_fix_power_supply
()¶ nagios().nrpe('check_hpasm_fix_power_supply')
Example check result as JSON:
{ "status": "OK", "message": "System: 'proliant dl360 g6', S/N: 'CZJ947016M', ROM: 'P64 05/05/2011', hardware working fine, da: 3 logical drives, 6 physical drives cpu_0=ok cpu_1=ok ps_2=ok fan_1=46% fan_2=46% fan_3=46% fan_4=46% temp_1=21 temp_2=40 temp_3=40 temp_4=36 temp_5=35 temp_6=37 temp_7=32 temp_8=36 temp_9=32 temp_10=36 temp_11=32 temp_12=33 temp_13=48 temp_14=29 temp_15=32 temp_16=30 temp_17=29 temp_18=39 temp_19=37 temp_20=38 temp_21=45 temp_22=42 temp_23=39 temp_24=48 temp_25=35 temp_26=46 temp_27=35 temp_28=71 | fan_1=46%;0;0 fan_2=46%;0;0 fan_3=46%;0;0 fan_4=46%;0;0 'temp_1_ambient'=21;42;42 'temp_2_cpu#1'=40;82;82 'temp_3_cpu#2'=40;82;82 'temp_4_memory_bd'=36;87;87 'temp_5_memory_bd'=35;78;78 'temp_6_memory_bd'=37;87;87 'temp_7_memory_bd'=32;78;78 'temp_8_memory_bd'=36;87;87 'temp_9_memory_bd'=32;78;78 'temp_10_memory_bd'=36;87;87 'temp_11_memory_bd'=32;78;78 'temp_12_power_supply_bay'=33;59;59 'temp_13_power_supply_bay'=48;73;73 'temp_14_memory_bd'=29;60;60 'temp_15_processor_zone'=32;60;60 'temp_16_processor_zone'=3" }
-
check_hpasm_gen8
()¶ nagios().nrpe('check_hpasm_gen8')
Example check result as JSON:
{ "status": "OK", "message": "ignoring 16 dimms with status 'n/a' , System: 'proliant dl360p gen8', S/N: 'CZJ2340R6C', ROM: 'P71 08/20/2012', hardware working fine, da: 1 logical drives, 4 physical drives" }
-
check_openmanage
()¶ nagios().nrpe('check_openmanage')
Example check result as JSON:
{ "status": "OK", "message": "System: 'PowerEdge R720', SN: 'GN2J8X1', 256 GB ram (16 dimms), 5 logical drives, 10 physical drives|T0_System_Board_Inlet=21C;42;47 T1_System_Board_Exhaust=36C;70;75 T2_CPU1=59C;95;100 T3_CPU2=52C;95;100 W2_System_Board_Pwr_Consumption=168W;896;980 A0_PS1_Current_1=0.8A;0;0 A1_PS2_Current_2=0.2A;0;0 V25_PS1_Voltage_1=230V;0;0 V26_PS2_Voltage_2=232V;0;0 F0_System_Board_Fan1=1680rpm;0;0 F1_System_Board_Fan2=1800rpm;0;0 F2_System_Board_Fan3=1680rpm;0;0 F3_System_Board_Fan4=2280rpm;0;0 F4_System_Board_Fan5=2400rpm;0;0 F5_System_Board_Fan6=2400rpm;0;0" }
-
check_ping
()¶ nagios().local('check_ping')
Example check result as JSON:
{ "rta": 1.899, "pl": 0.0 }
-
check_apachestatus_uri
()¶ nagios().nrpe('check_apachestatus_uri', url='http://127.0.0.1/server-status?auto')
or
nagios().nrpe('check_apachestatus_uri', url='http://127.0.0.1:10083/server-status?auto')
Example check result as JSON:
{ "idle": 60.0, "busy": 15.0, "hits": 24.256, "kBytes": 379.692 }
-
check_command_procs
()¶ nagios().nrpe('check_command_procs', process='httpd')
Example check result as JSON:
{ "procs": 33 }
-
check_http_expect_port_header
()¶ nagios().nrpe('check_http_expect_port_header', ip='localhost', url= '/', redirect='warning', size='9000:90000', expect='200', port='88', hostname='www.example.com')
Example check result as JSON:
{ "size": 33633.0, "time": 0.080755 }
NOTE: If the NRPE check returns an unexpected result (the return code is not the expected one), the check returns a NagiosError.
-
check_mysql_processes
()¶ nagios().nrpe('check_mysql_processes', host='localhost', port='/var/lib/mysql/mysql.sock', user='myuser', password='mypas')
Example check result as JSON:
{ "avg": 0, "threads": 1 }
-
check_mysqlperformance
()¶ nagios().nrpe('check_mysqlperformance', host='localhost', port='/var/lib/mysql/mysql.sock', user='myuser', password='mypass')
Example check result as JSON:
{ "Com_select": 15.27, "Table_locks_waited": 0.01, "Select_scan": 2.25, "Com_change_db": 0.0, "Com_insert": 382.26, "Com_replace": 8.09, "Com_update": 335.7, "Com_delete": 0.02, "Qcache_hits": 16.57, "Questions": 768.14, "Qcache_not_cached": 1.8, "Created_tmp_tables": 2.43, "Created_tmp_disk_tables": 2.25, "Aborted_clients": 0.3 }
-
check_mysql_slave
()¶ nagios().nrpe('check_mysql_slave', host='localhost', port='/var/lib/mysql/mysql.sock', database='mydb', user='myusr', password='mypwd')
Example check result as JSON:
{ "Uptime": 6215760.0, "Open tables": 3953.0, "Slave IO": "Yes", "Queries per second avg": 967.106, "Slow queries": 1047406.0, "Seconds Behind Master": 0.0, "Threads": 1262.0, "Questions": 6011300666.0, "Slave SQL": "Yes", "Flush tables": 1.0, "Opens": 59466.0 }
-
check_ssl_cert
()¶ nagios().nrpe('check_ssl_cert', host_ip='91.240.34.73', domain_name='www.example.com')
or
nagios().local('check_ssl_cert', host_ip='91.240.34.73', domain_name='www.example.com')
Example check result as JSON:
{ "days": 506 }
NRPE checks for Windows Hosts¶
Checks are based on nsclient++ v.0.4.1. For more information, see http://docs.nsclient.org/
-
CheckCounter
()¶ Returns performance counters for a process (usedMemory/WorkingSet)
nagios().win('CheckCounter', process='eo_server')
Example check result as JSON:
used memory in bytes
{ "ProcUsedMem": 811024384 }
-
CheckCPU
()¶ nagios().win('CheckCPU')
Example check result as JSON:
{ "1": 4, "10": 8, "5": 14 }
-
CheckDriveSize
()¶ nagios().win('CheckDriveSize')
Example check result as JSON:
Used space in MByte
{ "C:\\ %": 61.0, "C:\\": 63328.469 }
-
CheckEventLog
()¶ nagios().win('CheckEventLog', log='application', query='generated gt -7d AND type=\'error\'')
‘generated gt -7d’ means in the last 7 days
Example check result as JSON:
{ "eventlog": 20 }
-
CheckFiles
()¶ nagios().win('CheckFiles', path='C:\\Import\\Exchange2Clearing', pattern='*.*', query='creation lt -1h')
‘creation lt -1h’ means older than 1 hour
Example check result as JSON:
{ "found files": 22 }
-
CheckLogFile
()¶ nagios().win('CheckLogFile', logfile='c:\\Temp\\log\\maxflow_portal.log', seperator=' ', query='column4 = \'ERROR\' OR column4 = \'FATAL\'')
Example check result as JSON:
{ "count": 4 }
-
CheckMEM
()¶ nagios().win('CheckMEM')
Example check result as JSON:
used memory in MBytes
{ "page file %": 16.0, "page file": 5534.105, "physical memory": 3331.109, "virtual memory": 268.777, "virtual memory %": 0.0, "physical memory %": 20.0 }
-
CheckProcState
()¶ nagios().win('CheckProcState', process='check_mk_agent.exe')
Example check result as JSON:
{ "status": "OK", "message": "check_mk_agent.exe: running" }
-
CheckServiceState
()¶ nagios().win('CheckServiceState', service='ENAIO_server')
Example check result as JSON:
{ "status": "OK", "message": "ENAIO_server: started" }
-
CheckUpTime
()¶ nagios().win('CheckUpTime')
Example check result as JSON:
uptime in ms
{ "uptime": 412488000 }
Ping¶
Simple ICMP ping function which returns True
if the ping command returned without error and False
otherwise.
-
ping
(timeout=1) ping()
The timeout argument specifies the timeout in seconds. Internally it just runs the following system command:
ping -c 1 -w <TIMEOUT> <HOST>
Redis¶
Read-only access to Redis servers is provided by the redis()
function.
-
redis
([port=6379][, db=0]) Returns a connection to the Redis server at <host>:<port>, where <host> is the value of the current entity’s host attribute, and <port> is the given port (default 6379). See below for a list of methods provided by the returned connection object.
Note
If password
param is not supplied, then plugin configuration values will be used.
You can use plugin.redis.password
to configure redis password authentication for zmon-worker.
Please also have a look at the Redis documentation.
Methods of the Redis Connection¶
The object returned by the redis()
function provides the following methods:
-
llen
(key)¶ Returns the length of the list stored at key. If key does not exist, its value is treated as if it were an empty list, and 0 is returned. If key exists but is not a list, an error is raised.
redis().llen("prod_eventlog_queue")
-
lrange
(key, start, stop)¶ Returns the elements of the list stored at key in the range [start, stop]. If key does not exist, its value is treated as if it were an empty list. If key exists but is not a list, an error is raised.
The parameters start and stop are zero-based indexes. Negative numbers are converted to indexes by adding the length of the list, so that -1 is the last element of the list, -2 the second-to-last element of the list, and so on.
Indexes outside the range of the list are not an error: If both start and stop are less than 0 or greater than or equal to the length of the list, an empty list is returned. Otherwise, if start is less than 0, it is treated as if it were 0, and if stop is greater than or equal to the length of the list, it is treated as if it were equal to the length of the list minus 1. If start is greater than stop, an empty list is returned.
Note that this method is subtly different from Python’s list slicing syntax, where list[start:stop] returns elements in the range [start, stop).
redis().lrange("prod_eventlog_queue", 0, 9)   # Returns *ten* elements!
redis().lrange("prod_eventlog_queue", 0, -1)  # Returns the entire list.
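The inclusive range semantics can be reproduced on a plain Python list, which makes the difference to slicing easy to see. The function below is a sketch that follows the rules described above; lrange_like is a hypothetical name and the code does not talk to Redis.

```python
def lrange_like(items, start, stop):
    """Emulate the documented lrange() index rules on a Python list."""
    length = len(items)
    # Negative numbers are converted to indexes by adding the list length.
    if start < 0:
        start += length
    if stop < 0:
        stop += length
    # Out-of-range indexes are clamped instead of raising an error.
    start = max(start, 0)
    stop = min(stop, length - 1)
    if start > stop:
        return []
    # Both ends are inclusive, hence stop + 1 in the slice.
    return items[start:stop + 1]


queue = list(range(20))
print(len(lrange_like(queue, 0, 9)))   # inclusive range
print(len(queue[0:9]))                 # Python slice, end excluded
print(lrange_like(queue, 0, -1) == queue)
```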
-
get
(key) Returns the string stored at key. If key does not exist, returns None. If key exists but is not a string, an error is raised.
redis().get("example_redis_key")
-
keys
(pattern)¶ Returns list of keys from Redis matching pattern.
redis().keys("*downtime*")
-
hget
(key, field)¶ Returns the value of the field field of the hash key. If key does not exist or does not have a field named field, returns None. If key exists but is not a hash, an error is raised.
redis().hget("example_hash_key", "example_field_name")
-
hgetall
(key)¶ Returns a dict of all fields of the hash key. If key does not exist, returns an empty dict. If key exists but is not a hash, an error is raised.
redis().hgetall("example_hash_key")
-
scan
(cursor[, match=None][, count=None])¶ Returns a set with the next cursor and the results from this scan. Please see the Redis documentation on how to use this function exactly: http://redis.io/commands/scan
redis().scan(0, 'prefix*', 10)
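Redis SCAN is cursor based: iteration starts with cursor 0 and is complete when the server returns cursor 0 again. The sketch below shows that loop against an in-memory stand-in. It assumes the result of scan() can be unpacked into (next_cursor, results); fake_scan and scan_all are hypothetical names and no Redis server is involved.

```python
KEYS = ["prefix:a", "prefix:b", "prefix:c", "prefix:d", "prefix:e"]


def fake_scan(keys, cursor, count=2):
    """Stand-in for a scan call: returns (next_cursor, batch of keys)."""
    batch = keys[cursor:cursor + count]
    next_cursor = cursor + count
    if next_cursor >= len(keys):
        next_cursor = 0  # cursor 0 signals that the iteration is complete
    return next_cursor, batch


def scan_all(scan):
    """Call scan(cursor) repeatedly until the returned cursor is 0."""
    cursor, found = 0, []
    while True:
        cursor, batch = scan(cursor)
        found.extend(batch)
        if int(cursor) == 0:
            return found


print(scan_all(lambda cursor: fake_scan(KEYS, cursor)))
```

A real server may return the same key more than once during a scan, so results are usually deduplicated.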
-
smembers
(key)¶ Returns members of set key in Redis.
redis().smembers("zmon:alert:1")
-
ttl
(key)¶ Return the time to live of an expiring key.
redis().ttl('lock')
-
scard
(key)¶ Return the number of elements in set
key
redis().scard("example_hash_key")
-
zcard
(key)¶ Return the number of elements in the sorted set
key
redis().zcard("example_sorted_set_key")
-
statistics
()¶ Returns a dict with general Redis statistics such as memory usage and operations/s. All values are extracted using the Redis INFO command.
Example result:
{ "blocked_clients": 2, "commands_processed_per_sec": 15946.48, "connected_clients": 162, "connected_slaves": 0, "connections_received_per_sec": 0.5, "dbsize": 27351, "evicted_keys_per_sec": 0.0, "expired_keys_per_sec": 0.0, "instantaneous_ops_per_sec": 29626, "keyspace_hits_per_sec": 1195.43, "keyspace_misses_per_sec": 1237.99, "used_memory": 50781216, "used_memory_rss": 63475712 }
Please note that the values for both used_memory and used_memory_rss are in Bytes.
S3¶
Allows data to be pulled from S3 Objects.
-
s3
()
Methods of S3¶
-
get_object_metadata
(bucket_name, key)¶ Get the metadata associated with the given bucket_name and key. The metadata allows you to check for the existence of the key within the bucket and to check how large the object is without reading the whole object into memory.
Parameters: - bucket_name – the name of the S3 Bucket
- key – the key that identifies the S3 Object within the S3 Bucket
Returns: an S3ObjectMetadata object
-
class
S3ObjectMetadata
¶ -
exists
()¶ Will return True if the object exists.
-
size
()¶ Returns the size in bytes for the object. Will return -1 for objects that do not exist.
-
Example usage:
s3().get_object_metadata('my bucket', 'mykeypart1/mykeypart2').exists()
s3().get_object_metadata('my bucket', 'mykeypart1/mykeypart2').size()
-
get_object
(bucket_name, key)¶ Get the S3 Object associated with the given bucket_name and key. This method will cause the object to be read into memory.
Parameters: - bucket_name – the name of the S3 Bucket
- key – the key that identifies the S3 Object within the S3 Bucket
Returns: an S3Object object
-
class
S3Object
¶ -
text
()¶ Get the S3 Object data
-
json
()¶ If the object exists, parse the object as JSON.
Returns: a dict containing the parsed JSON or None if the object does not exist.
-
exists
()¶ Will return True if the object exists.
-
size
()¶ Returns the size in bytes for the object. Will return -1 for objects that do not exist.
-
Example usage:
s3().get_object('my bucket', 'mykeypart1/my_text_doc.txt').text()
s3().get_object('my bucket', 'mykeypart1/my_json_doc.json').json()
-
list_bucket
(bucket_name, prefix, max_items=100, recursive=True)¶ List the S3 Objects associated with the given bucket_name, matching prefix. A single S3 listing request returns at most 1000 keys, so pagination is used internally to overcome this limit.
Parameters: - bucket_name – the name of the S3 Bucket
- prefix – the prefix to search under
- max_items – the maximum number of objects to list. Defaults to 100.
- recursive – if the listing should contain deeply nested keys. Defaults to True.
Returns: an
S3FileList
object-
class
S3FileList
¶ -
files
()¶ Returns a list of dicts like
{ "file_name": "foo", "size": 12345, "last_modified": "2017-07-17T01:01:21Z" }
-
Example usage:
s3().list_bucket('my bucket', 'some_prefix').files()
files = s3().list_bucket('my bucket', 'some_prefix', 10000).files()  # for listing a lot of keys
last_modified = files[0]["last_modified"].isoformat()  # returns a string that can be passed to time()
age = time() - time(last_modified)
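To illustrate working with the list returned by files(), here is a minimal sketch that summarizes a listing. It assumes each entry is a dict with file_name and size keys, as in the example above; summarize_listing is a hypothetical helper and not part of ZMON.

```python
def summarize_listing(files):
    """Summarize a listing shaped like the result of S3FileList.files().

    Hypothetical helper: assumes every entry has "file_name" and "size" keys.
    """
    total_size = sum(f["size"] for f in files)
    # Name of the largest file, or None for an empty listing.
    largest = max(files, key=lambda f: f["size"])["file_name"] if files else None
    return {"count": len(files), "total_size": total_size, "largest": largest}


# Sample data in the documented shape (values are made up for illustration).
sample = [
    {"file_name": "foo", "size": 12345, "last_modified": "2017-07-17T01:01:21Z"},
    {"file_name": "bar", "size": 100, "last_modified": "2017-07-18T01:01:21Z"},
]
summary = summarize_listing(sample)
```

In a check command, the argument would be the result of s3().list_bucket(...).files().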
Scalyr¶
Wrapper¶
The scalyr() wrapper enables querying Scalyr from your AWS worker if the credentials have been specified for the worker instance(s).
For a more detailed description of each type of query, please refer to https://www.scalyr.com/help/api .
Default parameters:
- minutes specifies the start time of the query. For example, “5” means 5 minutes ago.
- end specifies the end time of the query. For example, “2” means until 2 minutes ago. If set to None, the end is set to 24h after minutes. The default “0” means now.
-
count
(query, minutes=5, end=0) Runs a count query against Scalyr. Depending on the number of queries, you may run into the rate limit.
scalyr().count(' ERROR ')
-
timeseries
(query, minutes=30, end=0)¶ Runs a timeseries query against Scalyr with more generous rate limits. (New time series are created on the fly by Scalyr)
-
facets
(filter, field, max_count=5, minutes=30, end=0)¶ This method is used to retrieve the most common values for a field.
-
logs
(query, max_count=100, minutes=5, continuation_token=None, columns=None, end=0)¶ Runs a query against Scalyr and returns logs that match the query. At most max_count log lines will be returned. More can be fetched with the same query by passing the continuation_token from the last response back into the logs method.
Specific columns can be returned (as defined in the Scalyr parser) using the columns array, e.g. columns=['severity','threadName','timestamp']. If this is unspecified, only the message column will be returned.
An example logs result as JSON:
{ "messages": [ "message line 1", "message line 2" ], "continuation_token": "a token" }
-
power_query
(query, minutes=5, end=0)¶ Runs a power query against Scalyr and returns the results as the response. You can also create and test power queries via the UI: https://eu.scalyr.com/query . More information on power queries can be found here: https://eu.scalyr.com/help/power-queries
An example response as JSON:
{ "columns": [ { "name": "cluster" }, { "name": "application" }, { "name": "volume" } ], "warnings": [], "values": [ [ "cluster-1-eu-central-1:kube-1", "application-2", 9481810.0 ], [ "cluster-2-eu-central-1:kube-1", "application-1", 8109726.0 ] ], "matchingEvents": 8123.0, "status": "success", "omittedEvents": 0.0 }
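Because the response separates column names from row values, it is often convenient to turn each row into a dict. The following is a minimal sketch based on the response shape shown above; power_query_rows is a hypothetical helper name.

```python
def power_query_rows(response):
    """Turn the "columns"/"values" pair of a power query response into dicts."""
    names = [column["name"] for column in response["columns"]]
    return [dict(zip(names, row)) for row in response["values"]]


# Shortened version of the example response above.
response = {
    "columns": [{"name": "cluster"}, {"name": "application"}, {"name": "volume"}],
    "values": [
        ["cluster-1-eu-central-1:kube-1", "application-2", 9481810.0],
        ["cluster-2-eu-central-1:kube-1", "application-1", 8109726.0],
    ],
    "status": "success",
}
rows = power_query_rows(response)
```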
Custom Scalyr Region¶
By default the Scalyr wrapper uses https://www.scalyr.com/ as its region. To use their Europe environment, https://eu.scalyr.com/, override the region using scalyr(scalyr_region='eu'):
scalyr(scalyr_region='eu').count(' ERROR ')
SNMP¶
Provides a wrapper for the SNMP functions listed below. SNMP checks require specifying hosts in the entities filter. The partial object snmp() accepts a timeout=seconds parameter; the default timeout is 5 seconds. NOTE: this timeout applies per answer, so multiple answers add up and may block the whole check.
-
memory
()¶ snmp().memory()
Returns host’s memory usage statistics. All values are in KiB (1024 Bytes).
Example check result as JSON:
{ "ram_buffer": 359404, "ram_cache": 6478944, "ram_free": 20963524, "ram_shared": 0, "ram_total": 37066332, "ram_total_free": 22963392, "swap_free": 1999868, "swap_min": 16000, "swap_total": 1999868 }
-
load
()¶ snmp().load()
Returns host’s CPU load average (1 minute, 5 minute and 15 minute averages).
Example check result as JSON:
{"load1": 0.95, "load5": 0.69, "load15": 0.72}
-
cpu
()¶ snmp().cpu()
Returns host’s CPU usage in percent.
Example check result as JSON:
{"cpu_system": 0, "cpu_user": 17, "cpu_idle": 81}
-
df
()¶ snmp().df()
Example check result as JSON:
{ "/data/postgres-wal-nfs-example": { "available_space": 524287840, "device": "example0-2-stp-123:/vol/example_pgwal", "percentage_inodes_used": 0, "percentage_space_used": 0, "total_size": 524288000, "used_space": 160 } }
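A result of this shape can be reduced to the mount points that are running full. The sketch below assumes the keys shown in the example result; full_filesystems and the 90 percent threshold are hypothetical choices for illustration.

```python
def full_filesystems(df_result, threshold=90):
    """Return the mount points whose space usage is at or above threshold percent."""
    return sorted(
        mount
        for mount, stats in df_result.items()
        if stats["percentage_space_used"] >= threshold
    )


# Made-up sample in the documented shape.
sample_df = {
    "/data/a": {"percentage_space_used": 95, "percentage_inodes_used": 3},
    "/data/b": {"percentage_space_used": 12, "percentage_inodes_used": 1},
}
nearly_full = full_filesystems(sample_df)
```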
-
logmatch
()¶ snmp().logmatch()
-
interfaces
()¶ snmp().interfaces()
Example check result as JSON:
{ "lo": { "in_octets": 63481918397415, "in_discards": 11, "adStatus": 1, "out_octets": 63481918397415, "opStatus": 1, "out_discards": 0, "speed": "10", "in_error": 0, "out_error": 0 }, "eth1": { "in_octets": 55238870608924, "in_discards": 8344, "adStatus": 1, "out_octets": 6801703429894, "opStatus": 1, "out_discards": 0, "speed": "10000", "in_error": 0, "out_error": 0 }, "eth0": { "in_octets": 3538944286327, "in_discards": 1130, "adStatus": 1, "out_octets": 16706789573119, "opStatus": 1, "out_discards": 0, "speed": "10000", "in_error": 0, "out_error": 0 } }
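As an illustration of how such a result might be evaluated, the sketch below lists the interfaces that report any errors or discards. It relies only on the key names visible in the example result; interfaces_with_problems is a hypothetical helper, and whether non-zero counters are actually a problem depends on your environment.

```python
def interfaces_with_problems(interfaces):
    """Return the sorted names of interfaces with non-zero error or discard counters."""
    counters = ("in_error", "out_error", "in_discards", "out_discards")
    return sorted(
        name
        for name, stats in interfaces.items()
        if any(stats.get(counter, 0) > 0 for counter in counters)
    )


# Made-up sample in the documented shape.
sample_interfaces = {
    "lo": {"in_discards": 11, "out_discards": 0, "in_error": 0, "out_error": 0},
    "eth0": {"in_discards": 0, "out_discards": 0, "in_error": 0, "out_error": 0},
}
problem_interfaces = interfaces_with_problems(sample_interfaces)
```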
-
get
() snmp().get('iso.3.6.1.4.1.42253.1.2.3.1.4.7.47.98.105.110.47.115.104', 'stunnel', int)
Example check result as JSON:
{ "stunnel": 0 }
SQL¶
-
sql
([shard]) Provides a wrapper for connecting to a PostgreSQL database and executing queries. All queries are executed in read-only transactions. The connection wrapper requires one parameter: a list of shard connections. The shard connections must come from the entity definition (see database-entities). Example query for a log database which returns a primitive long value:
sql().execute("SELECT count(*) FROM zl_data.log WHERE log_created > now() - '1 hour'::interval").result()
Example query which will return a single dict with keys a and b:
sql().execute('SELECT 1 AS a, 2 AS b').result()
The SQL wrapper will automatically sum up values over all shards:
sql().execute('SELECT count(1) FROM zc_data.customer').result() # will return a single integer value (sum over all shards)
It’s also possible to query a single shard by providing its name:
sql(shard='customer1').execute('SELECT COUNT(1) AS c FROM zc_data.customer').results() # returns list of values from a single shard
It’s also possible to query another database on the same server overwriting the shards information:
sql(shards={'customer_db' : entity['host'] + ':' + str(entity['port']) + '/another_db'}).execute('SELECT COUNT(1) AS c FROM my_table').results()
To execute a SQL statement on all LIVE customer shards, for example, use the following entity filter:
[ { "type": "database", "name": "customer", "environment": "live", "role": "master" } ]
The check command will have the form
>>> sql().execute('SELECT 1 AS a').result()
8
# Returns a single value: the sum over the result from all shards
>>> sql().execute('SELECT 1 AS a').results()
[{'a': 1}, {'a': 1}, {'a': 1}, {'a': 1}, {'a': 1}, {'a': 1}, {'a': 1}, {'a': 1}]
# Returns a list of the results from all shards
>>> sql(shard='customer1').execute('SELECT 1 AS a').results()
[{'a': 1}]
# Returns the result from the specified shard in a list of length one
>>> sql().execute('SELECT 1 AS a, 2 AS b').result()
{'a': 8, 'b': 16}
# Returns a dict of the two values, which are each the sum over the result from all shards
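The relationship between results() and result() in the session above can be sketched as a column-wise sum over the per-shard rows. This is only an illustration of the documented behaviour for eight shards, not ZMON's actual implementation; sum_over_shards is a hypothetical name.

```python
def sum_over_shards(shard_rows):
    """Sum every column across a list of per-shard result dicts."""
    totals = {}
    for row in shard_rows:
        for column, value in row.items():
            totals[column] = totals.get(column, 0) + value
    return totals


# Eight shards, each returning the row for 'SELECT 1 AS a, 2 AS b'.
shard_rows = [{"a": 1, "b": 2} for _ in range(8)]
totals = sum_over_shards(shard_rows)
```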
The results() function has several additional parameters:
sql().execute('SELECT 1 AS ONE, 2 AS TWO FROM dual').results([max_results=100], [raise_if_limit_exceeded=True])
max_results
- The maximum number of rows you expect to get from the call. If not specified, defaults to 100. You cannot have an unlimited number of rows. There is an absolute maximum of 1,000,000 results that cannot be overridden. Note: If you need to process larger datasets, it is recommended to revisit the architecture of your monitoring subsystem and possibly move the calculation logic into an external web service callable by ZMON 2.0.
raise_if_limit_exceeded
- Raises an exception if the limit of rows would have been exceeded by the issued query.
-
orasql
()¶ Provides a wrapper for connecting to an Oracle database and executing queries. All queries are executed in read-only transactions. The connection wrapper requires three parameters: host, port and sid, which must come from the entity definition (see database-entities). One idiosyncrasy to be aware of: when your query produces more than one value and you get a dict keyed by the column names or aliases used in your query, those keys are always returned in uppercase. For clarity, we recommend writing all aliases and column names in uppercase to avoid confusion due to case changes. Example of the simplest query, which returns a single value:
orasql().execute("SELECT 'OK' from dual").result()
Example query which will return a single dict with keys ONE and TWO:
orasql().execute('SELECT 1 AS ONE, 2 AS TWO from dual').result()
To execute a SQL statement on a LIVE server, tagged with the name business_intelligence, for example, use the following entity filter:
[ { "type": "oracledb", "name": "business_intelligence", "environment": "live", "role": "master" } ]
-
exacrm
()¶ Provides a wrapper for connection to the CRM Exasol database executing queries. The connection wrapper requires one parameter: the query.
Example query:
exacrm().execute("SELECT 'OK';").result()
To execute a SQL statement on the itr-crmexa* servers use the following entity filter:
[ { "type": "host", "host_role_id": "117" } ]
-
mysql
([shard])¶ Provides a wrapper for connecting to a MySQL database and executing queries. The connection wrapper requires one parameter: a list of shard connections. The shard connections must come from the entity definition (see database-entities). Example of the simplest query, which returns a single value:
mysql().execute("SELECT count(*) FROM mysql.user").result()
Example query which will return a single dict with keys h and u:
mysql().execute('SELECT host AS h, user AS u FROM mysql.user').result()
The SQL wrapper will automatically sum up values over all shards:
mysql().execute('SELECT count(1) FROM zc_data.customer').result() # will return a single integer value (sum over all shards)
It’s also possible to query a single shard by providing its name:
mysql(shard='customer1').execute('SELECT COUNT(1) AS c FROM zc_data.customer').results() # returns list of values from a single shard
To execute a SQL statement on all LIVE customer shards, for example, use the following entity filter:
[ { "type": "mysqldb", "name": "lounge", "environment": "live", "role": "master" } ]
TCP¶
This function opens a TCP connection to a host on a given port. If the connection succeeds, it returns ‘OK’. The host can be provided directly for global checks or resolved from entities filter. Assuming that we have an entity filter type=host, the example below will try to connect to every host on port 22:
tcp().open(22)
Zomcat¶
Retrieve zomcat instance status (memory, CPU, threads).
zomcat().health()
This would return a dict like:
{
"cpu_percentage": 5.44,
"gc_percentage": 0.11,
"gcs_per_sec": 0.25,
"heap_memory_percentage": 6.52,
"heartbeat_enabled": true,
"http_errors_per_sec": 0.0,
"jobs_enabled": true,
"nonheap_memory_percentage": 20.01,
"requests_per_sec": 1.09,
"threads": 128,
"time_per_request": 42.58
}
Most of the values are retrieved via JMX:
cpu_percentage
- CPU usage in percent (retrieved from JMX).
gc_percentage
- Percentage of time spent in garbage collection runs.
gcs_per_sec
- Garbage collections per second.
heap_memory_percentage
- Percentage of heap memory used.
nonheap_memory_percentage
- Percentage of non-heap memory (e.g. permanent generation) used.
heartbeat_enabled
- Boolean indicating whether heartbeat.jsp is enabled (true) or not (false). If /heartbeat.jsp cannot be retrieved, the value is null.
http_errors_per_sec
- Number of Tomcat HTTP errors per second (all 4xx and 5xx HTTP status codes).
jobs_enabled
- Boolean indicating whether jobs are enabled (true) or not (false). If /jobs.monitor cannot be retrieved, the value is null.
requests_per_sec
- Number of HTTP/AJP requests per second.
threads
- Total number of threads.
time_per_request
- Average time in milliseconds per HTTP/AJP request.
Helper Functions¶
The following general-purpose functions are available in check commands:
-
abs
(number)¶ Returns the absolute value of the argument. Does not have overflow issues.
>>> abs(-1) 1 >>> abs(1) 1 >>> abs(-2147483648) 2147483648
-
all
(iterable)¶ Returns
True
if none of the elements of iterable are falsy.>>> all([4, 2, 8, 0, 3]) False >>> all([]) True
-
any
(iterable)¶ Returns
True
if at least one element of iterable is truthy.>>> any([None, [], '', {}, 0, 0.0, False]) False >>> any([]) False
-
avg
(results)¶ Returns the arithmetic mean of the values in results. Returns
None
if there are no values. results must not be an iterator.>>> avg([1, 2, 3]) 2.0 >>> avg([]) None
-
basestring
()¶ Superclass of
str
andunicode
useful for checking whether a value is a string of some sort.>>> isinstance('foo', basestring) True >>> isinstance(u'ˈ', basestring) True
-
bin
(n)¶ Returns a string of the given integer in binary representation.
>>> bin(1000) '0b1111101000'
-
bool
(x)¶ Returns
True
if x is truthy, andFalse
otherwise. Does not parse strings. Also usable to check whether a value is Boolean.>>> bool(None) False >>> bool('False') True >>> isinstance(False, bool) True
-
chain
(*iterables)¶ Returns an iterator that yields the elements of the first iterable, followed by the elements of the second iterable, and so on.
>>> list(chain([1, 2, 3], 'abc')) [1, 2, 3, 'a', 'b', 'c'] >>> list(chain()) []
-
chr
(n)¶ Returns the character for the given ASCII code.
>>> chr(65) 'A'
-
class
Counter
([iterable-or-mapping])¶ Creates a specialized
dict
for counting things. See the official Python documentation for details.
-
dict
([mapping][, **kwargs])¶ Creates a new dict. Usually, using a literal will be simpler, but the function may be useful to copy dicts, to convert a list of key/value pairs to a dict, or to check whether some object is a dict.
>>> dict(a=1, b=2, c=3) {'a': 1, 'c': 3, 'b': 2} >>> dict({'a': 1, 'b': 2, 'c': 3}) {'a': 1, 'c': 3, 'b': 2} # This is a copy of the original dict. >>> dict([['a', 1], ['b', 2], ['c', 3]]) {'a': 1, 'c': 3, 'b': 2} >>> isinstance({}, dict) True
-
divmod(x, y):
Performs integer division and modulo as a single operation.
>>> divmod(23, 5) (4, 3)
-
empty
(v)¶ Indicates whether v is falsy. Equivalent to
not v
.>>> empty([]) True >>> empty([0]) False
-
enumerate
(iterable[, start=0])¶ Generates tuples
(start + 0, iterable[0]), (start + 1, iterable[1]), ...
. Useful to have access to the index in a loop.>>> list(enumerate(['a', 'b', 'c'], start=1)) [(1, 'a'), (2, 'b'), (3, 'c')]
-
filter
(function, iterable)¶ Returns a list of all objects in iterable for which function returns a truthy value. If function is
None
, the returned list contains all truthy objects in iterable.>>> filter(lambda n: n % 3, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) [1, 2, 4, 5, 7, 8, 10] >>> filter(None, [False, None, 0, 0.0, '', [], {}, 1000]) [1000]
-
float
(x)¶ Returns x as a floating-point number. Parses strings.
>>> float('2.5') 2.5 >>> float('-inf') -inf >>> float(2) 2.0
This is useful to force proper division:
>>> 2 / 5 0 >>> float(2) / 5 0.4
Also usable to check whether a value is a floating-point number:
>>> isinstance(2.5, float) True >>> isinstance(2, float) False
-
groupby
(iterable[, key])¶ A somewhat obscure function for grouping consecutive equal elements in an iterable. See the official Python documentation for more details.
>>> [(k, list(v)) for k, v in groupby('abba')] [('a', ['a']), ('b', ['b', 'b']), ('a', ['a'])]
-
hex
(n)¶ Returns a string of the given integer in hexadecimal representation.
>>> hex(1000) '0x3e8'
-
int
(x[, base])¶ Returns x as an integer. Truncates floating-point numbers and parses strings. Also usable to check whether a value is an integer.
>>> int(2.5) 2 >>> int(-2.5) -2 >>> int('2') 2 >>> int('abba', 16) 43962 >>> isinstance(2, int) True
-
isinstance
(object, classinfo)¶ Indicates whether object is an instance of the given class or classes.
>>> isinstance(2, int) True >>> isinstance(2, (int, float)) True >>> isinstance('2', int) False
-
json
(s) Converts the given JSON string to a Python object.
>>> json('{"list": [1, 2, 3, 4]}') {u'list': [1, 2, 3, 4]}
-
jsonpath_flat_filter
(obj, path)¶ Executes json path expression using jsonpath_rw and returns a flat dict of (full_path, value).
>>> data = {"timers":{"/api/v1/":{"m1.rate": 12, "99th": "3ms"}}} >>> jsonpath_flat_filter(data, "timers.*.*.'m1.rate'") {"timers./api/v1/.m1.rate": 12}
-
jsonpath_parse
(path)¶ Creates a json path parse object from the jsonpath_rw to be used in your check command.
-
len
(s)¶ Returns the length of the given collection.
>>> len('foo') 3 >>> len([0, 1, 2]) 3 >>> len({'a': 1, 'b': 2, 'c': 3}) 3
-
list
(iterable)¶ Creates a new list. Usually, using a literal will be simpler, but the function may be useful to copy lists, to convert some other iterable to a list, or to check whether some object is a list.
>>> list({'a': 1, 'b': 2, 'c': 3}) ['a', 'c', 'b'] >>> list(chain([1, 2, 3], 'abc')) [1, 2, 3, 'a', 'b', 'c'] # Without the list call, this would be a chain object. >>> isinstance([1, 2, 3], list) True
-
long
(x[, base])¶ Converts a number or string to a long integer.
>>> long(2.5) 2L >>> long(-2.5) -2L >>> long('2') 2L >>> long('abba', 16) 43962L
-
map
(function, iterable)¶ Calls function on each element of iterable and returns the results as a list.
>>> map(lambda n: n**2, [0, 1, 2, 3, 4, 5]) [0, 1, 4, 9, 16, 25]
-
max
(iterable)¶ Returns the greatest element of iterable. With two or more arguments, returns the greatest argument instead.
>>> max([2, 4, 1, 3]) 4 >>> max(2, 4, 1, 3) 4
-
min
(iterable)¶ Returns the smallest element of iterable. With two or more arguments, returns the smallest argument instead.
>>> min([2, 4, 1, 3]) 1 >>> min(2, 4, 1, 3) 1
-
normalvariate
(mu, sigma)¶ Returns a normally distributed random variable with the given mean and standard deviation.
>>> normalvariate(0, 1) -0.1711153439880709
-
oct
(n)¶ Returns a string of the given integer in octal representation.
>>> oct(1000) '01750'
-
ord
(n)¶ Returns the ASCII code of the given character.
>>> ord('A') 65
-
parse_cert
(pem[, decode_base64])¶ Parses a certificate and returns a Certificate object. The first argument pem is the PEM-encoded certificate as a string. The optional argument decode_base64 is used to decode Base64 before parsing the string.
-
pow
(x, y[, z])¶ Computes x to the power of y. The result is modulo z, if z is given, and the function is much, much faster than
(x ** y) % z
in that case.>>> pow(56876845793546543243783543735425734536873, 12425445412439354354394354397364398364378, 10) 9L
-
range
([start, ]stop[, step])¶ Returns a list of integers
[start, start + step * 1, start + step * 2, ...]
where all integers are less than stop, or greater than stop if step is negative.>>> range(10) [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] >>> range(1, 11) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] >>> range(1, 1) [] >>> range(11, 1) [] >>> range(0, 10, 3) [0, 3, 6, 9] >>> range(10, -1, -1) [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
-
reduce
(function, iterable[, initializer])¶ Calls function(r, e) for each element e in iterable, where r is the result of the last such call, or initializer for the first such call. If iterable has no elements, returns initializer.
If initializer is omitted, the first element of iterable is removed and used in place of initializer. In that case, an error is raised if iterable has no elements.
>>> reduce(lambda a, b: a * b, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 1) 3628800 # 10!
Note: Because of a Python bug,
reduce
used to be unreliable. This issue should now be fixed.
-
reversed
(iterable)¶ Returns an iterator that iterates over the elements in iterable in reverse order.
>>> list(reversed([1, 2, 3])) [3, 2, 1]
-
round
(n[, digits=0])¶ Rounds the given number to the given number of digits, rounding half away from zero.
>>> round(23.4) 23.0 >>> round(23.5) 24.0 >>> round(-23.4) -23.0 >>> round(-23.5) -24.0 >>> round(0.123456789, 3) 0.123 >>> round(987654321, -3) 987654000.0
-
set
(iterable)¶ Returns a set built from the elements of iterable. Useful to remove duplicates from some collection.
>>> set([1, 2, 1, 4, 3, 2, 2, 3, 4, 1]) set([1, 2, 3, 4])
-
sorted
(iterable[, reverse])¶ Returns a sorted list containing the elements of iterable.
>>> sorted([2, 4, 1, 3]) [1, 2, 3, 4] >>> sorted([2, 4, 1, 3], reverse=True) [4, 3, 2, 1]
-
str
(object)¶ Returns the string representation of object. Also usable to check whether a value is a string. If the result would contain Unicode characters, the
unicode()
function must be used instead.>>> str(2) '2' >>> str({'a': 1, 'b': 2, 'c': 3}) "{'a': 1, 'c': 3, 'b': 2}" >>> isinstance('foo', str) True
-
sum
(iterable)¶ Returns the sum of the elements of iterable, or
0
if iterable is empty.>>> sum([1, 2, 3, 4]) 10 >>> sum([]) 0
-
time
([spec][, utc]) Given a time specification such as '-10m' for “ten minutes ago” or '+3h' for “in three hours”, returns an object representing that timestamp. Valid units are s for seconds, m for minutes, h for hours, and d for days.
The time specification spec can also be a Unix epoch/timestamp or a valid ISO timestamp in one of the following formats: YYYY-MM-DD HH:MM:SS.mmmmm, YYYY-MM-DD HH:MM:SS, YYYY-MM-DD HH:MM or YYYY-MM-DD.
If spec is omitted, the current time is used. If utc is True the timestamp uses UTC, otherwise it uses local time.
The returned object has two methods:
-
isoformat
([sep])¶ Returns the timestamp as a string of the form YYYY-MM-DD HH:MM:SS.mmmmmm. The default behavior is to omit the T between date and time. This can be overridden by passing the optional sep parameter to the method.
>>> time('+4d').isoformat() '2014-03-29 18:05:50.098919' >>> time(1396112750).isoformat() '2014-03-29 18:05:50' >>> time('+4d').isoformat('T') '2014-03-29T18:05:50.098919'
-
format
(fmt)¶ Returns the timestamp as a string formatted according to the given format. See the official Python documentation for an incomplete list of supported format directives.
Additionally, the subtraction operator is overloaded and returns the time difference in seconds:
>>> time('2014-01-01 01:13') - time('2014-01-01 01:01') 12
-
-
timestamp
()¶ Returns Unix time stamp. This wraps time.time()
-
tuple
(iterable)¶ Returns the given iterable as a tuple (an immutable list, basically). Also usable to check whether a value is a tuple.
>>> tuple([1, 2, 3]) (1, 2, 3) >>> isinstance((1, 2, 3), tuple) True
-
unicode
(object)¶ Returns the string representation of object as a Unicode string. Also usable to check whether a value is a Unicode string.
>>> unicode({u'α': 1, u'β': 2, u'γ': 3}) u"{u'\\u03b1': 1, u'\\u03b3': 3, u'\\u03b2': 2}" >>> isinstance(u'ˈ', unicode) True
-
unichr
(n)¶ Returns the unicode character with the given code point. Might be limited to code points less than 0x10000.
>>> unichr(0x2a13) # LINE INTEGRATION WITH SEMICIRCULAR PATH AROUND POLE u'⨓'
-
zip
(*iterables)¶ Returns a list of tuples where the i-th tuple contains the i-th element from each of the given iterables. Uses the lowest length if the iterables have different lengths.
>>> zip(['a', 'b', 'c'], [1, 2, 3]) [('a', 1), ('b', 2), ('c', 3)] >>> zip(['A', 'B', 'C'], ['a', 'b', 'c'], [1, 2, 3]) [('A', 'a', 1), ('B', 'b', 2), ('C', 'c', 3)] >>> zip([], [1, 2, 3]) []
-
re
()¶ Python regex
re
module for all regex operations.>>> re.match(r'^ab.*', 'a123b') != None False >>> re.match(r'^ab.*', 'ab123') != None True
-
math
()¶ Python
math
module for all math operations.>>> math.log(4, 2) 2.0
Alert Functions Reference¶
Time Specifications¶
Whenever one of these functions takes an argument named time_spec, that argument is a string of the form <magnitude><unit>, where <magnitude> is a positive integer, and <unit> is one of s (for seconds), m (for minutes), h (for hours), and d (for days).
Therefore, a value of 5m would indicate that all values gathered in the last five minutes should be taken into account.
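The time_spec format can be made concrete with a small parser. This is a minimal sketch of the format as described here, not the parser ZMON itself uses; time_spec_to_seconds is a hypothetical name.

```python
import re

# Seconds per unit, following the units listed above.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def time_spec_to_seconds(time_spec):
    """Convert a string of the form <magnitude><unit> into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", time_spec)
    if not match:
        raise ValueError("invalid time_spec: %r" % (time_spec,))
    magnitude = int(match.group(1))
    if magnitude <= 0:
        raise ValueError("magnitude must be a positive integer: %r" % (time_spec,))
    return magnitude * _UNIT_SECONDS[match.group(2)]


five_minutes = time_spec_to_seconds("5m")
```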
Note
Trial Run doesn’t provide any previous values. Please check how functions depending on check values behave in case values were not available.
Timeseries functions¶
All of the timeseries_* functions below additionally accept a named parameter key=func which can be used to extract the wanted value from a dict or an array. To get the value of the key my-key from a dict, you can use e.g.
res = timeseries_sum('5m', key=lambda x: x.get('my-key', 0))
Note
The values for the timeseries_* functions are retrieved from the local Redis instance. By default, only the last 20 check results are kept in this instance. Time ranges that exceed 20 times the check interval will lead to unexpected results.
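The effect of the key parameter and of the 20-result limit can be illustrated with a short sketch. Both functions below are hypothetical stand-ins that operate on a plain list of stored values; they are not ZMON's implementation.

```python
def timeseries_sum_sketch(values, key=None):
    """Sum stored check values, optionally extracting a number from each one first."""
    if key is not None:
        values = [key(value) for value in values]
    return sum(values)


def max_reliable_window_seconds(check_interval_seconds, kept_results=20):
    """Longest time range still covered when only kept_results values are stored."""
    return check_interval_seconds * kept_results


# Made-up stored values; the last one lacks the key and counts as 0.
stored = [{"my-key": 3}, {"my-key": 4}, {"other": 1}]
total = timeseries_sum_sketch(stored, key=lambda x: x.get("my-key", 0))
window = max_reliable_window_seconds(60)
```

With a check interval of 60 seconds, the sketch gives a window of 1200 seconds, so a time_spec of 20m would be the longest range covered by the default of 20 stored results.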
Previous Check results¶
The data source for the alert_series and value_series functions is the same as for the timeseries_* functions. Both functions return up to the requested number of results, depending on how much data is available. By default the maximum is 20 (see the above note for the timeseries functions).
Alert condition functions¶
The following functions are available in the alert condition expression:
-
alert_series
(f[, n=1])¶ Returns True if function f either raises an exception or returns True for each of the last n check values for the given entity. Use this function to build an alert that is raised only if the condition holds for the last n check intervals. This helps with alerts that flap due to transient technical issues.
# check that the value is bigger than 5 the last 3 runs alert_series(lambda v: v > 5, 3)
Note
If the number of check values is less than n, then f will be evaluated for those values and alerts could be raised accordingly.
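The semantics described for alert_series can be sketched on a plain list of values. This is an illustration under stated assumptions, not ZMON's implementation: values is assumed to be ordered oldest first, any truthy return value of f counts as True, and the sketch returns False when there are no values at all.

```python
def alert_series_sketch(f, values, n=1):
    """Return True if f matches (or raises) for each of the last n values."""
    def matches(value):
        try:
            return bool(f(value))
        except Exception:
            return True  # per the description, an exception counts as a match

    recent = values[-n:]  # fewer than n values: evaluate the ones available
    if not recent:
        return False  # assumption: no values means no alert
    return all(matches(value) for value in recent)


# The value was bigger than 5 in the last 3 runs.
flagged = alert_series_sketch(lambda v: v > 5, [1, 6, 7, 8], 3)
```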
-
capture
(value)¶ -
capture
(name=value) Saves the given value as a capture, and returns it unaltered. In the first form, the capture receives a generated name (capture_N). In the second form, the specified name is used as the name of the capture.
Example: capture(foo=1) saves the value 1 in a capture named foo and returns 1.
-
entity_results
()¶ Returns a list containing, for every entity, a dict with the following keys: value (the most recent value for the alert’s check on that entity), ts (the time when the check evaluation was started, in seconds since the epoch, as a floating-point number), and td (the check’s duration, in seconds, as a floating-point number). Works regardless of the type of value. DOES NOT WORK in Trial Run right now!
-
entity_values
()¶ Returns a list containing, for each entity, the most recent value for the alert’s check on that entity. Works regardless of the type of value. DOES NOT WORK in Trial Run right now!
-
monotonic
([count=2, increasing=True, strictly=False, data=None])¶ Returns True if the values in data are (strictly) monotonically increasing or decreasing. When data is not given, uses the result of value_series(count) as data (only works for checks returning a single value).
# check that the value of `some_key` is monotonic increasing for the last 5 checks (including this one)
monotonic(data=[v.get('some_key', 0) for v in value_series(5)])
Note
The order of the data is expected to have the latest value first and the oldest last.
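To make the ordering requirement concrete, here is a minimal sketch of a monotonicity test that takes its input latest value first, as noted above. It is an illustration of the described behaviour, not ZMON's implementation; monotonic_sketch is a hypothetical name.

```python
def monotonic_sketch(data, increasing=True, strictly=False):
    """Check monotonicity of data given latest value first, oldest value last."""
    chronological = list(reversed(data))  # oldest first
    pairs = list(zip(chronological, chronological[1:]))
    if increasing:
        return all((a < b) if strictly else (a <= b) for a, b in pairs)
    return all((a > b) if strictly else (a >= b) for a, b in pairs)


# Latest value 5, before that 4, before that 3: increasing over time.
rising = monotonic_sketch([5, 4, 3])
```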
-
timeseries_avg
(time_spec)¶ The arithmetic mean of the check values gathered in the specified time period. Returns
None
if there are no values. Only works for numeric values.Example: The check has gathered the values 5, 12, 14, 13, and 6 over the last five minutes.
timeseries_avg('5m')
is (5 + 12 + 14 + 13 + 6) / 5 = 10.
-
timeseries_median
(time_spec)¶ The median of the check values gathered in the specified time period. If the number of such values is even, the arithmetic mean of the two middle values is returned. Returns None if there are no values. Equivalent to timeseries_percentile(time_spec, 0.5). Only works for numeric values.
Example 1: The check has gathered the values 5, 12, 14, 13, and 6 over the last five minutes. Sorting these values gives 5, 6, 12, 13, 14. The middle value is 12. Therefore, timeseries_median('5m') is 12.
Example 2: The check has gathered the values 12, 14, 13, and 6 over the last four minutes. Sorting these values gives 6, 12, 13, 14. The two middle values are 12 and 13. Therefore, timeseries_median('4m') is (12 + 13) / 2 = 12.5.
-
timeseries_percentile
(time_spec, percent)¶ The P-th percentile of the values gathered in the specified time period, where P = percent × 100, using linear interpolation. Only works for numeric values.
The P-th percentile of N values is V(⌊K⌋) + (V(⌈K⌉) − V(⌊K⌋)) × (K − ⌊K⌋), where K = (N − 1) × P / 100 and V(I) for I in [0, N) is the I-th element of the list of values sorted in ascending order. Returns None if there are no values.
Example 1: The check has gathered the values 5, 12, 14, 13, and 6 over the last five minutes. Sorting these values gives 5, 6, 12, 13, 14. Let P = 30. There are N = 5 values, and K = (N − 1) × P / 100 = (5 − 1) × 30 / 100 = 1.2. The value at index ⌊1.2⌋ = 1 is 6, and the value at index ⌈1.2⌉ = 2 is 12. Therefore, timeseries_percentile('5m', 0.3) is 6 + (12 − 6) × (1.2 − ⌊1.2⌋) = 7.2.
Example 2: The check has gathered the values 5, 12, 14, 13, and 6 over the last five minutes. Sorting these values gives 5, 6, 12, 13, 14. Let P = 25. There are N = 5 values, and K = (N − 1) × P / 100 = (5 − 1) × 25 / 100 = 1. ⌊1⌋ = ⌈1⌉ = 1. The value at index 1 is 6. Therefore, timeseries_percentile('5m', 0.25) is 6 + (6 − 6) × (1 − ⌊1⌋) = 6.
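The interpolation formula can be written out directly. The sketch below applies it to a plain list of values; percentile_sketch is a hypothetical name, and percent is taken in the range 0 to 1 as in timeseries_percentile, so K = (N − 1) × percent.

```python
import math


def percentile_sketch(values, percent):
    """Percentile with linear interpolation, following the formula above."""
    if not values:
        return None
    ordered = sorted(values)
    k = (len(ordered) - 1) * percent  # same as (N - 1) * P / 100 with P = percent * 100
    lower = int(math.floor(k))
    upper = int(math.ceil(k))
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (k - lower)


values = [5, 12, 14, 13, 6]
p30 = percentile_sketch(values, 0.3)   # approximately 7.2, as in Example 1
p25 = percentile_sketch(values, 0.25)  # 6, as in Example 2
```

Because of floating-point rounding, results such as the one from Example 1 should be compared with a small tolerance rather than for exact equality.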
-
timeseries_first
(time_spec)¶ The oldest value among the values gathered in the specified time period. Returns
None
if there are no values. Works regardless of the type of value.Example: The check has gathered the values 5, 12, 14, 13, and 6 over the last five minutes. The oldest value is 5. Therefore,
timeseries_first('5m')
is 5.
-
timeseries_delta
(time_spec)¶ The newest value among the values gathered in the specified time period minus the oldest one. Returns
0
if there are no values. Only works for numeric values.Example 1: The check has gathered the values 5, 12, 14, 13, and 6 over the last five minutes. The newest value is 6 and the oldest value is 5. Therefore,
timeseries_delta('5m')
is 6 − 5 = 1.Example 2: The check has gathered the values 12, 14, 13, and 6 over the last four minutes. The newest value is 6 and the oldest value is 12. Therefore,
timeseries_delta('4m')
is 6 − 12 = −6 (not 6).
-
timeseries_min
(time_spec)¶ The smallest value among the values gathered in the specified time period. Returns
None
if there are no values. Works regardless of the type of value, but is unlikely to be particularly useful for non-numeric values.Example: The check has gathered the values 5, 12, 14, 13, and 6 over the last five minutes. The smallest value is 5. Therefore,
timeseries_min('5m')
is 5.
-
timeseries_max
(time_spec)¶ The largest value among the values gathered in the specified time period. Returns
None
if there are no values. Works regardless of the type of value, but is unlikely to be particularly useful for non-numeric values.Example: The check has gathered the values 5, 12, 14, 13, and 6 over the last five minutes. The largest value is 14. Therefore,
timeseries_max('5m')
is 14.
-
timeseries_sum
(time_spec)¶ The sum of the values gathered in the specified time period. Returns
0
if there are no values. Only works for numeric values.Example: The check has gathered the values 5, 12, 14, 13, and 6 over the last five minutes. Therefore,
timeseries_sum('5m')
is 5 + 12 + 14 + 13 + 6 = 50.
-
value_series
([n=1])¶ Returns the last n values for the underlying checks and the current entity. Returns
[]
if there are no values.
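The timeseries examples above can be reproduced with plain Python. This is only a sketch of the documented semantics, not ZMON's implementation: the helper names are hypothetical, and the list stands in for the values a check gathered within the time period, oldest first.

```python
# Values gathered over the last five minutes, oldest first (from the examples above).
values = [5, 12, 14, 13, 6]


def first(vals):
    """Oldest value, or None when nothing was gathered (like timeseries_first)."""
    return vals[0] if vals else None


def delta(vals):
    """Newest minus oldest value, or 0 when nothing was gathered (like timeseries_delta)."""
    return vals[-1] - vals[0] if vals else 0


def smallest(vals):
    """Smallest value, or None when nothing was gathered (like timeseries_min)."""
    return min(vals) if vals else None


def largest(vals):
    """Largest value, or None when nothing was gathered (like timeseries_max)."""
    return max(vals) if vals else None


def total(vals):
    """Sum of all values; an empty list sums to 0 (like timeseries_sum)."""
    return sum(vals)


print(first(values), delta(values), smallest(values), largest(values), total(values))
# Dropping the oldest value reproduces Example 2 of timeseries_delta.
print(delta(values[1:]))
```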
History distance functionality¶
The history distance functionality currently works only for numeric values, not for structured values or arrays. To obtain a DistanceWrapper object, call:
history().distance([weeks=4], [bin_size='1h'], [snap_to_bin = True], [dict_extractor_path=''])
You can call the functions described below on the returned object. The default parameters should be good for most cases, but you can change them if needed:
weeks
- Sets how many weeks to look into the past. Averaging over more than one week is a good idea: if something unusual happened a week ago, you probably still want to be warned when something similar happens this week.
bin_size
- Defines the size of the bins used to aggregate the history. It is a time_spec and defaults to 1h. See the next parameter for an explanation of the bins.
snap_to_bin
- Determines whether you’d like to have sliding bins or fixed bin start points. Consider the following example: You run your check on Monday at 10:30 AM. If snap_to_bin is True, you gather data from the past 4 weeks, every Monday from 10 AM to 11 AM, and then calculate the mean and standard deviation to use in the functions below. If snap_to_bin is False, you gather data from every Monday, 9:30 AM to 10:30 AM.
Setting the value to True allows for some internal caching of already-calculated values for a bin, since the mean and standard deviation don’t change for about an hour, so you don’t stress the network and servers as much as with False. Attention: Caching optimizations for snap_to_bin are not yet implemented. Please use it nevertheless, so that you can benefit from optimizations in the future.
dict_extractor_path
- Takes a string that is used for accessing the value if it is not a scalar value, but a dict. Normally, the history functionality only works for scalar values. Using this access string, you can use structured values, too. The dict_extractor_path is of the form ‘a.b.c’ for a dict with the structure {‘a’:{‘b’:{‘c’:5}}} to extract the value 5. Effectively, you use the dict_extractor_path to boil a structured check value down to a scalar value. The dict_extractor_path is applied to the historic values and to the parameters of the sigma() and absolute() functions.
Example: Your check gives you a map of data instead of a single value:
{"CREDITCARD": 25, "PAYPAL": 10, "MAK": 10, "PTF": 30}
which contains the number of requests for the payment methods CREDITCARD, PAYPAL, MAKSUTURVA and PRZELEWY24 over the last few minutes. If you want to check the history of PayPal orders, use this:
history().distance(dict_extractor_path = 'PAYPAL').sigma(value) < 2.0
which will take a look at the history of Paypal orders only and warn you if there is something unusual (too low number of requests). An even better query would be:
capture(suspect_payment_methods= { k: value[k] for k,v in { payment_method: history().distance(dict_extractor_path = payment_method).sigma(value) for payment_method in value.keys() }.items() if v < -2.0 } )
which takes a look at the history of every payment method and then tells you in a capture which payment methods are suspect and should be looked at manually.
Attention: Some structured values are not written to the history (when they are too complex). If you have trouble, try to change your check to return less complex values. Lists are currently not supported.
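To make the dict_extractor_path idea concrete, here is a minimal sketch of how a dotted path can be resolved against a structured check value. The extract helper and the sigmas dict are hypothetical illustrations, not ZMON's internal code, and the sigma numbers are made up.

```python
def extract(value, path=''):
    """Follow a dotted path such as 'a.b.c' into nested dicts; an empty path returns the value unchanged."""
    if not path:
        return value
    for key in path.split('.'):
        value = value[key]
    return value


print(extract({'a': {'b': {'c': 5}}}, 'a.b.c'))

# The check value from the example above.
value = {"CREDITCARD": 25, "PAYPAL": 10, "MAK": 10, "PTF": 30}
print(extract(value, 'PAYPAL'))

# Hypothetical per-method sigma distances, standing in for
# history().distance(dict_extractor_path=method).sigma(value).
sigmas = {"CREDITCARD": 0.4, "PAYPAL": -2.7, "MAK": -0.1, "PTF": 1.2}

# Same filter as in the capture expression: keep methods far below their usual level.
suspect = {k: value[k] for k, v in sigmas.items() if v < -2.0}
print(suspect)
```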
-
absolute
(value)¶ Returns the absolute distance of the actual value to the history of the check that is linked to this function. The absolute distance is just the difference of the value provided and the mean of the history values.
Example: You can use it e.g. to warn when you get 5 more exceptions than you would get on average:
history().distance().absolute(value) < 5
The distance is directed, which means that you will not get warned if you get “too few” exceptions. You can use abs() to get an undirected value.
-
sigma
(value)¶ Returns the distance of the actual value to the history of the check, normalized by the standard deviation.
Example: You can use it e.g. to get warned when you get more exceptions than usual:
history().distance().sigma(value) < 2.0
This check warns you in 4% of all cases on average. You will not be warned if there are some small spikes in the exception count, but you will be warned if there are spikes that are twice as far away from the mean as what is usual.
The distance is directed, which means that you will not get warned if you get “too few” exceptions. You can use abs() to get an undirected value.
-
bin_mean
()¶ Returns the mean of the bins that were aggregated.
-
bin_standard_deviation
()¶ Returns the standard deviation of the bins that were aggregated.
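The two distance measures can be sketched with the standard library. This is an illustration under assumptions, not ZMON's code: it uses the population standard deviation (ZMON may compute it differently), the history list is made up, and returning 0.0 for a flat history is a choice made here to avoid dividing by zero.

```python
from statistics import mean, pstdev


def absolute(value, history):
    """Directed difference between the current value and the mean of the history."""
    return value - mean(history)


def sigma(value, history):
    """Directed difference normalized by the standard deviation of the history."""
    spread = pstdev(history)
    if spread == 0:
        # Assumption made for this sketch: a perfectly flat history yields distance 0.
        return 0.0
    return (value - mean(history)) / spread


# Hypothetical bin values from the past four weeks.
history = [10, 12, 8, 10]

print(absolute(15, history))  # five more than the historic mean of 10
print(sigma(15, history))     # positive: above the historic mean
print(sigma(5, history))      # negative: the distance is directed
```

Wrapping the result in abs() gives the undirected distance mentioned above.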
Additional helper functions¶
You can also use some additional functions that are used in check commands.
Notifications Reference¶
ZMON provides several means of notification in case of alerts. Notifications are triggered when the alert status changes. Please refer to Notification options for the different worker configuration options.
Google Hangouts Chat¶
Notify Google Hangouts Chat room with alert status.
-
send_google_hangouts_chat
(webhook_link=None, message=None, color='red')¶ Send Google Hangouts Chat notification.
Parameters:
- webhook_link (str) – Webhook link of the Google Hangouts Chat room. Create a Google Hangouts Chat webhook and copy the link here.
- multiline (bool) – Whether the text in the notification should span multiple lines. Default is True.
- message (str) – Message to be sent. If None, then a message constructed from the alert will be sent.
- color (str) – Message color. Default is red if alert is raised.
Note
Message color will be determined based on alert status. If alert has ended, then color will be green, otherwise color argument will be used.
Hipchat¶
Notify Hipchat room with alert status.
-
send_hipchat
(room=None, message=None, token=None, message_format='html', notify=False, color='red', link=False, link_text='go to alert')¶ Send Hipchat notification to specified room.
Parameters:
- room (str) – Room to be notified.
- message (str) – Message to be sent. If None, then a message constructed from the alert will be sent.
- token (str) – Hipchat API token.
- message_format (str) – Message format: html (default) or text (which correctly treats @mentions).
- notify (bool) – Hipchat notify flag. Default is False.
- color (str) – Message color. Default is red if alert is raised.
- link (bool) – Add link to Hipchat message. Default is False.
- link_text (str) – If the link param is True, this will be displayed as a link in the Hipchat message. Default is go to alert.
Note
Message color will be determined based on alert status. If alert has ended, then color will be green, otherwise color argument will be used.
Example message - using html format (default):
{
"message": "NEW ALERT: Requests failing with status 500 on host-production-1-entity",
"color": "red",
"notify": true
}
Example message - using text format with @mention:
{
"message": "@here NEW ALERT: Requests failing with status 500 on host-production-1-entity",
"color": "red",
"notify": true,
"message_format": "text"
}
HTTP¶
Sends a notification by calling an HTTP endpoint. The HTTP notification uses the POST method for the call.
-
notify_http
(url=None, body=None, params=None, headers=None, timeout=5, oauth2=False, include_alert=True)¶ Send HTTP notification to specified endpoint.
Parameters:
- url (str) – HTTP endpoint URL. If not passed, the default URL from the worker configuration will be used.
- body (dict) – Request body.
- params (dict) – Request URL params.
- headers (dict) – HTTP headers.
- timeout (int) – Request timeout. Default is 5 seconds.
- oauth2 (bool) – Add OAUTH2 authentication headers. Default is False.
- include_alert (bool) – Include alert data in request body. Default is True.
Example:
notify_http('https://some-notification-service/alert', body={'zmon': True}, headers={'X-TOKEN': 1234})
Note
If include_alert is True, then the request body will include alert data. This is usually useful, since it provides valuable info like is_alert and changed, which can indicate whether the alert has started or ended.
{
"body": null,
"alert": {
"is_alert": true,
"changed": true,
"duration": 2.33,
"captures": {},
"entity": {"type": "GLOBAL", "id": "GLOBAL"},
"worker": "plocal.zmon",
"value": {"td": 0.00037, "worker": "plocal.zmon", "ts": 1472032348.665247, "value": 51.67797677979191},
"alert_def": {
"name": "Random Example Alert", "parameters": null, "check_id": 4, "entities_map": [], "responsible_team": "ZMON", "period": "", "priority": 1,
"notifications": ["notify_http()"], "team": "ZMON", "id": 3, "condition": ">40"
}
}
}
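A receiving service could interpret such a payload as sketched below. The describe function is hypothetical, and reading changed together with is_alert as "started" or "ended" is an assumption based on the note above; the sample is a shortened version of the payload shown.

```python
import json


def describe(payload):
    """Summarize a notify_http payload as '<alert name> on <entity id>: <state>'."""
    alert = payload.get('alert') or {}
    name = alert.get('alert_def', {}).get('name', 'unknown alert')
    entity = alert.get('entity', {}).get('id', 'unknown entity')
    if not alert.get('changed'):
        state = 'unchanged'
    elif alert.get('is_alert'):
        state = 'started'
    else:
        state = 'ended'
    return '{} on {}: {}'.format(name, entity, state)


# Shortened version of the example payload above.
raw = """
{
  "body": null,
  "alert": {
    "is_alert": true,
    "changed": true,
    "entity": {"type": "GLOBAL", "id": "GLOBAL"},
    "alert_def": {"name": "Random Example Alert", "id": 3, "condition": ">40"}
  }
}
"""

print(describe(json.loads(raw)))  # Random Example Alert on GLOBAL: started
```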
Hubot¶
Send Hubot notification.
Mail¶
Send email notifications.
-
send_mail
(subject=None, cc=None, html=False, hide_recipients=True, include_value=True, include_definition=True, include_captures=True, include_entity=True, per_entity=True)¶ Send email notification.
Parameters:
- subject (str or unicode or None) – Email subject. You must use a unicode string (e.g. u’äöüß’) if you have non-ASCII characters in there. If None, the alert name will be used.
- cc (list) – List of CC recipients.
- html (bool) – HTML email.
- hide_recipients (bool) – Hide recipients. Will be sent as BCC.
- include_value (bool) – Include alert value in notification message.
- include_definition (bool) – Include alert definition details in notification message.
- include_captures (bool) – Include alert captures in message.
- include_entity (bool) – Include affected entities in notification message.
- per_entity (bool) – Send new email notification per entity. Default is True.
Note
send_email is an alias for this notification function.
Opsgenie¶
Notify Opsgenie of a new alert status. If alert is active, then a new opsgenie alert will be created. If alert is inactive then the alert will be closed.
-
notify_opsgenie
(message='', teams=None, per_entity=False, priority=None, include_alert=True, description='', **kwargs)¶ Send notifications to Opsgenie.
Parameters:
- message (str) – Alert message. If empty, then a message will be generated from the alert data.
- teams (str | list) – Opsgenie teams to be notified. Value can be a single team or a list of teams.
- per_entity (bool) – Send new alert per entity. This affects the alias value and impacts how de-duplication is handled in Opsgenie. Default is False.
- priority (str) – Set Opsgenie priority for this notification. Valid values are P1, P2, P3, P4 or P5.
- include_alert (bool) – Include alert data in alert body details. Default is True.
- include_captures (bool) – Include captures data in alert body details. Default is False.
- description (str) – An optional description. If present, this is inserted into the Opsgenie alert description field.
Example:
notify_opsgenie(teams=['zmon', 'ops'], message='Number of failed requests is too high!', include_alert=True)
Note
If priority is not set, then ZMON will set the priority according to the alert priority.
Pagerduty¶
Notify Pagerduty of a new alert status. If alert is active, then a new Pagerduty incident with type trigger will be sent. If alert is inactive, then the incident type will be updated to resolve.
Note
Pagerduty notification plugin uses API v2.
-
notify_pagerduty
(message='', per_entity=False, include_alert=True, routing_key=None, alert_class=None, alert_group=None, **kwargs)¶ Send notifications to Pagerduty.
Parameters:
- message (str) – Incident message. If empty, then a message will be generated from the alert data.
- per_entity (bool) – Send new alert per entity. This affects the dedup_key value and impacts how de-duplication is handled in Pagerduty. Default is False.
- include_alert (bool) – Include alert data in incident payload custom_details. Default is True.
- routing_key (str) – Pagerduty service routing_key. If not specified, then the service key configured for the worker will be used.
- alert_class (str) – Set the Pagerduty incident class.
- alert_group (str) – Set the Pagerduty incident group.
Example:
notify_pagerduty(message='Number of failed requests is too high!', include_alert=True, alert_class='API health', alert_group='production')
Push¶
Send push notification via ZMON notification service.
-
send_push
(url=None, key=None, message=None)¶ Send Push notification to mobile devices.
Parameters:
Note
If message is None, then it will be generated from the alert status.
Slack¶
Notify Slack channel with alert status. A webhook is required for notifications.
-
notify_slack
(webhook=None, channel='#general', message=None)¶ Send Slack notification to specified channel.
Parameters:
Twilio¶
Use Twilio to receive phone calls when alerts are raised. This includes basic ACK and escalation. Requires an account at Twilio and a deployed notification service. The investment to get going is low, though. WORK IN PROGRESS.
-
notifiy_twilio
(numbers=[], message="ZMON Alert Up: Some Alert")¶ Make phone calls to the supplied numbers. The first number is called immediately. If there is no ACK after two minutes, that number is called again. The other numbers follow at 5-minute intervals as long as there is no ACK.
Parameters:
- message (str) – Message to be sent. If None, then a message constructed from the alert will be sent.
- numbers – Numbers to call
Note
Remember to configure your worker for this.
NOTIFICATION_SERVICE_URL
NOTIFICATION_SERVICE_KEY
Monitoring on AWS¶
This section assumes that you’re running zmon-aws-agent, which automatically discovers your EC2 instances, auto-scaling groups, ELBs, and more.
ZMON AWS agent syncs the following entities from AWS infrastructure:
- EC2 instances
- Auto-Scaling groups
- ELBs (classic and ELBv2)
- ElastiCache clusters
- RDS instances
- DynamoDB tables
- IAM/ACM certificates
Note
ZMON AWS Agent can also be deployed as a single appliance, which runs the AWS Agent, ZMON worker and ZMON scheduler.
CloudWatch Metrics¶
You can achieve most basic monitoring with AWS CloudWatch. CloudWatch EC2 metrics contain the following information:
- CPU Utilization
- Network traffic
- Disk throughput/operations per second (only for ephemeral storage; EBS volumes are not included)
ZMON allows querying arbitrary CloudWatch metrics using the cloudwatch() wrapper.
Security Groups¶
Depending on your AWS setup, you’ll probably have to open particular ports on your instances for access from ZMON. Using a limited set of ports to expose management APIs and the Prometheus node exporter will make your life easier. ZMON allows parsing of Prometheus metrics via http().prometheus().
You can deploy ZMON into each of your AWS accounts to allow cross-team monitoring and dashboards. Make sure that your security groups allow ZMON to connect to port 9100 of your monitored instances.
Missing or misconfigured security groups mainly show up as not getting the expected results at all, because packets are dropped by the EC2 instance rather than, for example, the connection being refused.
Low-Level or Basic Properties¶
EC2 Instances¶
Having enough diskspace on your instance is important; here’s a sample check. By default, you can only get space used from CloudWatch. Using Amazon’s own script, you can push free space to CloudWatch and pull this data via ZMON. Alternatively, you can run the Prometheus Node exporter to pull disk space data from the EC2 node itself via HTTP.
Similarly, you can pull CPU-related metrics from CloudWatch. The Prometheus Node exporter also exposes these metrics.
You also need enough available inodes.
Regarding memory, you can either query via CloudWatch, use Prometheus Node exporter to feed ZMON, or go with low-level snmp() [not recommended].
The following block shows part of EC2 instance entity properties:
id: a-app-1-2QBrR1[aws:123456789:eu-west-1]
type: instance
aws_id: i-87654321
created_by: agent
host: 172.33.173.201
infrastructure_account: aws:123456789
instance_type: t2.medium
ip: 172.33.173.201
ports:
'5432': 5432
'8008': 8008
region: eu-west-1
An example check using cloudwatch wrapper and entity properties would look like the following:
cloudwatch().query_one({'InstanceId': entity['aws_id']}, 'CPUUtilization', 'Average', 'AWS/EC2', period=120)
Elastic Load Balancers¶
You can query AWS CloudWatch to get ELB-specific metrics. The ZMON agent will put data into the ELB entity, allowing you to monitor instance and healthy instance count.
id: elb-a-app-1[aws:123456789:eu-west-1]
type: elb
elb_type: classic
active_members: 1
created_by: agent
dns_name: internal-a-app-1.eu-west-1.elb.amazonaws.com
host: internal-a-app-1.eu-west-1.elb.amazonaws.com
infrastructure_account: aws:123456789
members: 3
region: eu-west-1
scheme: internal
ZMON AWS agent will detect both kinds of ELBs, classic and application load balancers. Both kinds of entities are created in ZMON with type: elb. To distinguish between them in your checks, use the property elb_type, which holds either classic or application.
Since Cloudwatch metrics are different for each ELB type, please check CloudWatch ELB metrics for detailed reference. An example check using Cloudwatch wrapper and entity properties would look like the following:
# Classic ELB
lb_name = entity['name']
key = 'LoadBalancerName'
namespace = 'AWS/ELB'
# Check if Application ELBv2 entity
if entity.get('elb_type') == 'application':
    lb_name = entity['cloudwatch_name']
    key = 'LoadBalancer'
    namespace = 'AWS/ApplicationELB'
cloudwatch().query_one({key: lb_name}, 'RequestCount', 'Sum', namespace)
Note
ELB entities contain a special flag dns_traffic, which indicates whether the load balancer is actively serving traffic.
Auto-Scaling Groups¶
ZMON’s agent creates an auto-scaling group entity that provides you with the number of desired instances and the number of instances in a healthy state. This enables you to monitor whether the ASG actually works and hosts spawn into a productive state.
id: asg-proxy-1[aws:123456789:eu-central-1]
type: asg
name: proxy-1
created_by: agent
desired_capacity: 2
dns_traffic: 'true'
dns_weight: 200
infrastructure_account: aws:123456789
instances:
- aws_id: i-123456
ip: 172.33.109.201
- aws_id: i-654321
ip: 172.33.109.202
max_size: 4
min_size: 2
region: eu-central-1
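An alert on such an entity could compare the desired capacity with the instances listed. The sketch below works on a plain dict copied from the example above; missing_instances is a hypothetical helper, and treating every entry in instances as a healthy instance is an assumption.

```python
# Relevant part of the ASG entity shown above.
entity = {
    'type': 'asg',
    'name': 'proxy-1',
    'desired_capacity': 2,
    'min_size': 2,
    'max_size': 4,
    'instances': [
        {'aws_id': 'i-123456', 'ip': '172.33.109.201'},
        {'aws_id': 'i-654321', 'ip': '172.33.109.202'},
    ],
}


def missing_instances(asg):
    """Number of instances still missing to reach the desired capacity (never negative)."""
    return max(asg['desired_capacity'] - len(asg.get('instances', [])), 0)


print(missing_instances(entity))

# With one instance gone, the group is one short of its desired capacity.
degraded = dict(entity, instances=entity['instances'][:1])
print(missing_instances(degraded))
```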
RDS Instances¶
ZMON AWS agent will detect RDS instances and store them as entities with type database.
id: rds-db-1[aws:123456789]
type: database
name: db-1
created_by: agent
engine: postgres
host: db-1.rds.amazonaws.com
infrastructure_account: aws:123456789
port: 5432
region: eu-west-1
cloudwatch().query_one({'DBInstanceIdentifier': entity['name']}, 'DatabaseConnections', 'Sum', 'AWS/RDS')
ElastiCache Redis¶
ElastiCache instances are stored as entities with type elc.
id: elc-redis-1[aws:123456789:eu-central-1]
type: elc
cluster_id: all-redis-001
cluster_num_nodes: 1
created_by: agent
engine: redis
host: redis-1.cache.amazonaws.com
infrastructure_account: aws:123456789
port: 6379
region: eu-central-1
IAM/ACM Certificates¶
ZMON AWS agent will also sync IAM/ACM SSL certificates as entities with type certificate. Certificate entities can be used, for instance, to create an alert when a certificate is about to expire.
id: cert-acm-example.org[aws:123456789:eu-central-1]
type: certificate
name: '*.example.org'
status: ISSUED
arn: arn:aws:acm:eu-central-1:123456789:certificate/123456-123456-123456-123456
certificate_type: acm
created_by: agent
expiration: '2017-07-28T12:00:00+00:00'
infrastructure_account: aws:123456789
region: eu-central-1
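The expiration property can be turned into a remaining lifetime, as in the following sketch. days_until_expiry is a hypothetical helper operating on a plain dict; inside ZMON the same logic would have to be expressed in a check command or alert condition. It assumes Python 3.7+ for datetime.fromisoformat.

```python
from datetime import datetime, timezone


def days_until_expiry(entity, now=None):
    """Days left until the certificate in the entity expires (negative once expired)."""
    now = now or datetime.now(timezone.utc)
    expires = datetime.fromisoformat(entity['expiration'])
    return (expires - now).total_seconds() / 86400.0


# Relevant part of the certificate entity shown above.
cert = {'type': 'certificate', 'name': '*.example.org', 'expiration': '2017-07-28T12:00:00+00:00'}

# A fixed reference time keeps the example reproducible.
reference = datetime(2017, 7, 1, 12, 0, tzinfo=timezone.utc)
print(days_until_expiry(cert, now=reference))
```

An alert could then fire when the result drops below a threshold such as 30 days.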
Application API Monitoring¶
When monitoring an application, you’ll usually want to check the number of received requests, latency patterns, and the number of returned status codes. These data points form a pretty clear picture of what is going on with the application.
Additional metrics will help you find problems as well as opportunities for improvement. Assuming that your applications provide HTTP APIs hidden behind ELBs, you can use ZMON to gather this data from CloudWatch.
For more detailed data, ZMON offers options for different languages and frameworks.
One is zmon-actuator for Spring Boot.
ZMON gathers the data by querying a JSON endpoint /metrics adhering to the DropWizard metrics layout with some convention on the naming of timers.
Basically, there is one timer per API path and status code.
We also recommend checking out Friboo for working with Clojure, the Python/Flask framework Connexion or Markscheider for Play/Scala development.
The http(url=…).actuator_metrics() will parse the data into a Python dict that allows you to easily monitor and alert on changes in API behavior.
This also drives ZMON’s cloud UI.

Requirements¶
The requirements below are all open source technologies that need to be available for ZMON to run with all its features.
Redis¶
The Redis service is one of the core dependencies: ZMON uses Redis for its task queue and to store its current state.
PostgreSQL¶
PostgreSQL is ZMON’s data store for entities, checks, alerts, dashboards and Grafana dashboards. The entities service relies on PostgreSQL’s jsonb data type, so you need PostgreSQL 9.4 or later.
Cassandra¶
Cassandra needs to be available for KairosDB if you want to have historic data and make use of Grafana; this is highly recommended. We strongly recommend running Cassandra 3.7+ and using the TimeWindow compaction strategy for KairosDB. This will nicely split your SSTables into a single file per day (depending on your config).
KairosDB¶
KairosDB is our time series database of choice, although by now we are running our own fork. We believe the fork is not required for standard-volume scenarios. ZMON stores every metric it gathers in KairosDB so that you can access historic data directly or via Grafana. ZMON itself allows you to plot charts from KairosDB in dashboard widgets or go to check/alert-specific charts directly.
Essential ZMON Components¶
Using ZMON requires these four components: zmon-controller, zmon-scheduler, zmon-worker, and zmon-eventlog-service.
Controller¶
zmon-controller runs ZMON’s AngularJS frontend and serves as an endpoint for retrieving data and managing your ZMON deployment via REST API (with help from the command line client). It needs a connection configured to:
- PostgreSQL to store/retrieve all kind of data: entities, checks, dashboards, alerts
- Redis, to keep the state of ZMON’s alerts
- KairosDB, if you want charts/Grafana
To provide a means of authentication and authorization, you can choose between the following options:
- A basic credential file
- An OAuth2 identity provider, e.g., GitHub
Scheduler¶
zmon-scheduler is responsible for keeping track of all existing entities, checks and alerts and scheduling checks in time for applicable entities, which are then executed by the worker.
Needs connections to:
- Redis, which serves ZMON as a task queue
- Controller, to get check/alerts/entities
- Custom adapters might need connections for entity discovery in your platform
Worker¶
zmon-worker does the heavy lifting — executing tasks against entities and evaluating all alerts assigned to this check. Tasks are picked up from Redis and the resulting check value plus alert state changes are written back to Redis.
Needs connections to:
- Redis to retrieve tasks and update current state
- KairosDB if you want to have metrics
- EventLog service to store history events for alert state changes
EventLog Service¶
zmon-eventlog-service is our slim implementation of an event store, keeping track of Events related to alert state changes as well as events like alert and check modification by the user.
Needs a connection to:
- PostgreSQL to store events using jsonb
Component Configuration¶
In this section we assume that you want to use Docker as the means of deployment. The ZMON Docker images in Zalando’s Open Source registry are exactly the ones we use ourselves, injecting all configuration via environment variables.
If this does not fit your needs you can run the artifacts directly and decide to use environment variables or modify the example config files.
At this point we also assume that the required PostgreSQL, Redis and KairosDB services are available and that you have the credentials at hand. If not, see Requirements. The minimal configuration options below are taken from the Demo’s Bootstrap script!
Authentication¶
For the ZMON controller we assume that it is publicly accessible.
Thus both the UI and the REST API always require users to log in.
The REST API relies on tokens passed via the Authorization: Bearer <token> header to allow access.
For environments where you have no OAuth2 setup, you can configure pre-shared keys for API access.
Note
Feel free to look at Zalando’s Plan-B, which is a freely available OAuth2 provider we use for our platform to secure service to service communication.
You can create a preshared token as shown below and then add it to the controller configuration.
SCHEDULER_TOKEN=$(makepasswd --string=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ --chars 32)
Warning
Because of how environment variables are matched, the token must be ALL UPPERCASE.
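If makepasswd is not available, a token with the same alphabet and length can be generated in Python. This is a sketch of an alternative, not part of the ZMON tooling; make_token is a hypothetical helper built on the standard secrets module.

```python
import secrets
import string

# Same alphabet as the makepasswd call above: digits and uppercase letters only.
ALPHABET = string.digits + string.ascii_uppercase


def make_token(length=32):
    """Return a random token that only contains digits and uppercase letters."""
    return ''.join(secrets.choice(ALPHABET) for _ in range(length))


token = make_token()
print(token)
```

Because lowercase characters never occur, the result satisfies the uppercase requirement mentioned above.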
Scheduler and worker both call the controller’s REST API at times, so you need to configure tokens for them. For now we assume that the scheduler, KairosDB, eventlog-service and metric-cache (if deployed) are private. These services are accessed only by worker and controller and do not need to be public. The same is true for Redis, PostgreSQL and Cassandra. In general, however, we advise you to set up proper credentials and roles where possible.
Running Docker¶
First we need to figure out what tags to run. The bash snippet below helps you retrieve and set the latest available tags.
function get_latest () {
    name=$1
    # REST API returns tags sorted by time
    tag=$(curl --silent "https://registry.opensource.zalan.do/teams/stups/artifacts/$name/tags" | jq -r '.[].name' | tail -n 1)
    echo "$name:$tag"
}
echo "Retrieving latest versions.."
REPO=registry.opensource.zalan.do/stups
POSTGRES_IMAGE=$REPO/postgres:9.4.5-1
REDIS_IMAGE=$REPO/redis:3.2.0-alpine
CASSANDRA_IMAGE=$REPO/cassandra:2.1.5-1
ZMON_KAIROSDB_IMAGE=$REPO/$(get_latest kairosdb)
ZMON_EVENTLOG_SERVICE_IMAGE=$REPO/$(get_latest zmon-eventlog-service)
ZMON_CONTROLLER_IMAGE=$REPO/$(get_latest zmon-controller)
ZMON_SCHEDULER_IMAGE=$REPO/$(get_latest zmon-scheduler)
ZMON_WORKER_IMAGE=$REPO/$(get_latest zmon-worker)
ZMON_METRIC_CACHE=$REPO/$(get_latest zmon-metric-cache)
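The tag selection done by get_latest can be sketched in Python as follows. It assumes, as the jq expression suggests, that the registry returns a JSON list of objects with a name field, sorted by time; the sample response and the latest_image helper are hypothetical.

```python
import json


def latest_image(name, tags_json):
    """Return '<name>:<tag>' for the last entry of a tag list that is sorted by time."""
    tags = json.loads(tags_json)
    return '{}:{}'.format(name, tags[-1]['name'])


# Hypothetical registry response; real tag names will differ.
sample_response = '[{"name": "cd101"}, {"name": "cd102"}, {"name": "cd103"}]'

print(latest_image('zmon-worker', sample_response))
```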
To run the selected images use Docker’s run command together with the options explained below. We use the following wrapper for this:
function run_docker () {
    name=$1
    shift 1
    echo "Starting Docker container ${name}.."
    # ignore non-existing containers
    docker kill "$name" &> /dev/null || true
    docker rm -f "$name" &> /dev/null || true
    # quote "$@" so that options containing spaces or JSON stay intact
    docker run --restart "on-failure:10" --net zmon-demo -d --name "$name" "$@"
}
run_docker zmon-controller \
    # -e ......... \
    # -e ......... \
    $ZMON_CONTROLLER_IMAGE
Controller¶
Authentication¶
Configure your Github application
-e SPRING_PROFILES_ACTIVE=github \
-e ZMON_OAUTH2_SSO_CLIENT_ID=64210244ddd8378699d6 \
-e ZMON_OAUTH2_SSO_CLIENT_SECRET=48794a58705d1ba66ec9b0f06a3a44ecb273c048 \
Make everyone admin for now:
-e ZMON_AUTHORITIES_SIMPLE_ADMINS=* \
Logout URL¶
When switching to TV Mode, you can use this to enable the Pop-up dialog described in “Read Only” Display Login which opens the Logout URL in a new Tab to terminate the user’s session.
-e ZMON_LOGOUT_URL="https://example.com/logout"
Dependencies¶
Configure PostgreSQL access:
-e POSTGRES_URL=jdbc:postgresql://$PGHOST:5432/local_zmon_db \
-e POSTGRES_PASSWORD=$PGPASSWORD \
Setup Redis connection:
-e REDIS_HOST=zmon-redis \
-e REDIS_PORT=6379 \
Set CORS allowed origins:
-e ENDPOINTS_CORS_ALLOWED_ORIGINS=https://demo.zmon.io \
Setup URLs for other services:
-e ZMON_EVENTLOG_URL=http://zmon-eventlog-service:8081/ \
-e ZMON_KAIROSDB_URL=http://zmon-kairosdb:8083/ \
-e ZMON_METRICCACHE_URL=http://zmon-metric-cache:8086/ \
-e ZMON_SCHEDULER_URL=http://zmon-scheduler:8085/ \
And last but not least, configure a preshared token to allow the scheduler and worker to access the REST API. Remember that tokens need to be all uppercase here.
-e PRESHARED_TOKENS_${SCHEDULER_TOKEN}_UID=zmon-scheduler \
-e PRESHARED_TOKENS_${SCHEDULER_TOKEN}_EXPIRES_AT=1758021422 \
-e PRESHARED_TOKENS_${SCHEDULER_TOKEN}_AUTHORITY=user
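The naming scheme of these variables can be sketched as a small helper that builds all three entries for one token. preshared_token_env is hypothetical and the token value is made up; only the variable name pattern is taken from the options above.

```python
def preshared_token_env(token, uid, expires_at, authority='user'):
    """Build the PRESHARED_TOKENS_* environment variables for a single token."""
    if token != token.upper():
        raise ValueError('token must be all uppercase')
    prefix = 'PRESHARED_TOKENS_{}_'.format(token)
    return {
        prefix + 'UID': uid,
        prefix + 'EXPIRES_AT': str(expires_at),
        prefix + 'AUTHORITY': authority,
    }


# Made-up token; the expiry is the value used in the example above.
env = preshared_token_env('ABC123', 'zmon-scheduler', 1758021422)
for key in sorted(env):
    print('-e {}={}'.format(key, env[key]))
```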
Firebase and Webpush¶
Enable desktop push notification UI with the following options:
-e ZMON_ENABLE_FIREBASE=true \
-e ZMON_NOTIFICATIONSERVICE_URL=http://zmon-notification-service:8087/ \
-e ZMON_FIREBASE_API_KEY="AIzaSyBM1ktKS5u_d2jxWPHVU7Xk39s-PG5gy7c" \
-e ZMON_FIREBASE_AUTH_DOMAIN="zmon-demo.firebaseapp.com" \
-e ZMON_FIREBASE_DATABASE_URL="https://zmon-demo.firebaseio.com" \
-e ZMON_FIREBASE_STORAGE_BUCKET="zmon-demo.appspot.com" \
-e ZMON_FIREBASE_MESSAGING_SENDER_ID="280881042812" \
This feature requires additional config for the worker and to run the notification-service.
Scheduler¶
Specify the Redis server you want to use:
-e SCHEDULER_REDIS_HOST=zmon-redis \
-e SCHEDULER_REDIS_PORT=6379 \
Set up access to the controller and entity service (both provided by the controller). Note the reuse of the preshared key defined above!
-e SCHEDULER_OAUTH2_STATIC_TOKEN=$SCHEDULER_TOKEN \
-e SCHEDULER_URLS_WITHOUT_REST=true \
-e SCHEDULER_ENTITY_SERVICE_URL=http://zmon-controller:8080/ \
-e SCHEDULER_CONTROLLER_URL=http://zmon-controller:8080/ \
If you need different queues or different levels of parallelism, e.g. to limit the number of queries run against MySQL/PostgreSQL databases, use the following as an example:
-e SPRING_APPLICATION_JSON='{"scheduler":{"queue_property_mapping":{"zmon:queue:mysql":[{"type":"mysql"}]}}}'
This will route checks against entities of type “mysql” to another queue.
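Writing the JSON by hand gets error-prone once several queues are involved. As a sketch, the value for SPRING_APPLICATION_JSON can be generated from a Python dict; queue_mapping_json is a hypothetical helper, and the structure is copied from the example above.

```python
import json


def queue_mapping_json(mapping):
    """Serialize a queue-to-entity-filter mapping in the compact layout used above."""
    document = {'scheduler': {'queue_property_mapping': mapping}}
    return json.dumps(document, separators=(',', ':'))


value = queue_mapping_json({'zmon:queue:mysql': [{'type': 'mysql'}]})
print("-e SPRING_APPLICATION_JSON='{}'".format(value))
```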
Worker¶
The worker configuration is split into essential configuration options, such as Redis and KairosDB, and the plugin configuration, e.g. PostgreSQL credentials, …
Essential Options¶
Configure Redis Access:
-e WORKER_REDIS_SERVERS=zmon-redis:6379 \
Configure parallelism and throughput:
-e WORKER_ZMON_QUEUES=zmon:queue:default/25,zmon:queue:mysql/3
Specify the number of worker processes that are polling the queues and execute tasks. You can specify multiple queues here to listen to.
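The format of WORKER_ZMON_QUEUES can be read as a comma-separated list of queue/count pairs. The sketch below parses it that way; parse_queues is a hypothetical helper, and interpreting the number after the slash as the process count follows the description above rather than the worker's source.

```python
def parse_queues(spec):
    """Parse 'queue/count,queue/count' into a dict of queue name -> process count."""
    queues = {}
    for item in spec.split(','):
        # Split at the last slash; queue names contain colons but no slashes.
        name, _, count = item.strip().rpartition('/')
        queues[name] = int(count)
    return queues


queues = parse_queues('zmon:queue:default/25,zmon:queue:mysql/3')
print(queues)
```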
Configure KairosDB:
-e WORKER_KAIROSDB_HOST=zmon-kairosdb \
Configure EventLog service:
-e WORKER_EVENTLOG_HOST=zmon-eventlog-service \
-e WORKER_EVENTLOG_PORT=8081 \
Configure Worker token to access controller API: (relying on Python tokens library here)
-e OAUTH2_ACCESS_TOKENS=uid=$WORKER_TOKEN \
Configure Worker named tokens to access external APIs:
-e WORKER_PLUGIN_HTTP_OAUTH2_TOKENS=token_name1=scope1,scope2,scope3:token_name2=scope1,scope2
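The value above packs several named tokens into one string: entries are separated by colons, and each entry maps a token name to a comma-separated list of scopes. A sketch of a parser, with parse_token_scopes as a hypothetical helper:

```python
def parse_token_scopes(spec):
    """Parse 'name=scope,scope:name=scope' into a dict of token name -> list of scopes."""
    tokens = {}
    for entry in spec.split(':'):
        name, _, scopes = entry.partition('=')
        tokens[name] = [scope for scope in scopes.split(',') if scope]
    return tokens


tokens = parse_token_scopes('token_name1=scope1,scope2,scope3:token_name2=scope1,scope2')
print(tokens)
```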
Configure Metric Cache (optional):
-e WORKER_METRICCACHE_URL=http://zmon-metric-cache:8086/api/v1/rest-api-metrics/ \
-e WORKER_METRICCACHE_CHECK_ID=9 \
Notification Options¶
Firebase and Webpush¶
To trigger notifications for desktop web and mobile apps set the following params to point to notification service.
WORKER_NOTIFICATION_SERVICE_URL
- Notification service base url
WORKER_NOTIFICATION_SERVICE_KEY
- (optional, if not using oauth2) A shared key configured in the notification service
Hipchat¶
WORKER_NOTIFICATIONS_HIPCHAT_TOKEN
- Access token for HipChat notifications.
WORKER_NOTIFICATIONS_HIPCHAT_URL
- URL of HipChat server.
HTTP¶
This allows you to trigger HTTP POST calls to arbitrary services.
WORKER_NOTIFICATIONS_HTTP_DEFAULT_URL
- HTTP endpoint default URL.
WORKER_NOTIFICATIONS_HTTP_WHITELIST_URLS
- List of whitelist URL endpoints. If URL is not in this list, then exception will be raised.
WORKER_NOTIFICATIONS_HTTP_ALLOW_ALL
- Allow any URL to be used in HTTP notification.
WORKER_NOTIFICATIONS_HTTP_HEADERS
- Default headers to be used in HTTP requests.
Mail¶
WORKER_NOTIFICATIONS_MAIL_HOST
- SMTP host for email notifications.
WORKER_NOTIFICATIONS_MAIL_PORT
- SMTP port for email notifications.
WORKER_NOTIFICATIONS_MAIL_SENDER
- Sender address for email notifications.
WORKER_NOTIFICATIONS_MAIL_USER
- SMTP user for email notifications.
WORKER_NOTIFICATIONS_MAIL_PASSWORD
- SMTP password for email notifications.
Slack¶
WORKER_NOTIFICATIONS_SLACK_WEBHOOK
- Slack webhook for channel notifications.
Twilio¶
WORKER_NOTIFICATION_SERVICE_URL
- URL of notification service (needs to be publicly accessible)
WORKER_NOTIFICATION_SERVICE_KEY
- (optional, if not using oauth2) Preshared key to call notification service
Pagerduty¶
WORKER_NOTIFICATIONS_PAGERDUTY_SERVICEKEY
- Routing key for a Pagerduty service
Plug-In Options¶
All plug-in options have the prefix WORKER_PLUGIN_<plugin-name>_. For example, to set option “bar” of the plugin “foo” to “123” via an environment variable:
WORKER_PLUGIN_FOO_BAR=123
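The naming rule can be sketched in a few lines of Python. The helper plugin_env_var is hypothetical and only mirrors the prefix convention described above; it assumes plug-in and option names are simply upper-cased.

```python
def plugin_env_var(plugin, option):
    """Build the environment variable name for a worker plug-in option."""
    return 'WORKER_PLUGIN_{}_{}'.format(plugin.upper(), option.upper())


name = plugin_env_var('foo', 'bar')
print('{}=123'.format(name))
```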
If you plan to access your PostgreSQL cluster, specify the credentials below. We suggest using a distinct user for ZMON with limited, read-only privileges.
WORKER_PLUGIN_SQL_USER
WORKER_PLUGIN_SQL_PASS
If you need to access MySQL, specify the user credentials below. Again, we suggest using a user with limited privileges.
WORKER_PLUGIN_MYSQL_USER
WORKER_PLUGIN_MYSQL_PASS
Notification Service¶
Optional component that serves the mobile API, push notifications, and Twilio notifications.
Authentication¶
SPRING_APPLICATION_JSON
Use this to define pre-shared keys if not using OAuth2. Specify key and max validity.
{"notifications":{"shared_keys":{"<your random key>": 1504981053654}}}
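A value for SPRING_APPLICATION_JSON could be generated along these lines. This is a minimal sketch: the function shared_key_config is hypothetical, and it assumes the number attached to each key is an expiry timestamp in epoch milliseconds, which is how the example value above reads.

```python
import json
import secrets
import time


def shared_key_config(valid_days=30, now=None):
    """Return a JSON string with one random pre-shared key and its expiry."""
    now = time.time() if now is None else now
    key = secrets.token_urlsafe(32)  # random, URL-safe key
    # Assumption: validity is expressed in milliseconds since the epoch.
    expires_ms = int((now + valid_days * 86400) * 1000)
    return json.dumps({'notifications': {'shared_keys': {key: expires_ms}}})


config = shared_key_config(valid_days=30, now=1500000000)
print(config)
```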
Firebase and Web Push¶
NOTIFICATIONS_GOOGLE_PUSH_SERVICE_API_KEY
- Private Firebase messaging server key
NOTIFICATIONS_ZMON_URL
- ZMON’s base URL
Twilio options¶
NOTIFICATIONS_TWILIO_API_KEY
- Private API Key
NOTIFICATIONS_TWILIO_USER
- User
NOTIFICATIONS_TWILIO_PHONE_NUMBER
- Phone number to use
NOTIFICATIONS_DOMAIN
- Domain under which notification service is reachable
Rest API¶
Authentication & Authorization¶
You need to obtain a token to access ZMON’s REST API. For the default deployment using GitHub, rely on access tokens from GitHub; otherwise it depends on your selected provider.
Your application should always examine the HTTP status of the response. Any value other than 200 indicates a failure.
Here are some examples:
Request with invalid credentials:
HTTP/1.1 401 Unauthorized
Content-Type: application/json;charset=UTF-8
Content-Length: 29
Date: Thu, 21 Aug 2014 10:28:10 GMT
{"message":"Bad credentials"}
Request without proper authentication:
HTTP/1.1 401 Unauthorized
Content-Type: application/json;charset=UTF-8
Content-Length: 69
Date: Thu, 21 Aug 2014 10:29:14 GMT
{"message":"Full authentication is required to access this resource"}
Request without proper authorization:
HTTP/1.1 403 Forbidden
Content-Type: application/json;charset=UTF-8
Content-Length: 30
Date: Thu, 21 Aug 2014 10:31:20 GMT
{"message":"Access is denied"}
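A client might centralize this status check as sketched below. The exception class and helper are hypothetical names, and the reason texts only paraphrase the failure cases described in this REST API documentation.

```python
# Failure statuses mentioned in this REST API documentation.
FAILURE_REASONS = {
    400: 'malformed syntax or missing mandatory fields',
    401: 'missing or invalid authentication',
    403: 'authenticated, but not authorized',
    404: 'resource does not exist',
}


class ZmonApiError(Exception):
    """Hypothetical exception type for non-200 responses."""


def ensure_ok(status_code):
    """Raise ZmonApiError unless the response status is exactly 200."""
    if status_code != 200:
        reason = FAILURE_REASONS.get(status_code, 'unexpected failure')
        raise ZmonApiError('HTTP {}: {}'.format(status_code, reason))


ensure_ok(200)  # passes silently
try:
    ensure_ok(403)
except ZmonApiError as error:
    message = str(error)
    print(message)
```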
Entities¶
see CLI entities
Check Definitions¶
Dashboards¶
Downtimes¶
For more info about this feature, please check this
Scheduling a downtime¶
Resource URL: POST /api/v1/downtimes
Description
Create a new downtime, returning the id of the newly created resource. If none of the alert definition entities match this request it will succeed and return an empty list of entities/alert definitions. Any attempt to execute this method without proper authentication will result in a 401. If the user does not have enough permissions (role: api-writer) this method will return an HTTP 403. In case of malformed syntax or missing mandatory fields this method will return an HTTP 400 and the client SHOULD NOT repeat the request without modifications. In case of success this method will return HTTP 200.
Note
Alerts and checks with hard-coded entity identifiers in the check command are not covered.
Parameters:
Name | Data Type | Mandatory | Description |
---|---|---|---|
comment | String | yes | Downtime comment |
start_time | Number | no | The start time in seconds since epoch. Default: current time |
end_time | Number | yes | The end time in seconds since epoch. Precondition: end_time > start_time |
entities | Array | yes | Array of entities to set in downtime. (e.g. htt01:4420) Precondition: The array should have at least one element |
alert_definitions | Array | no | Alert definition ids. If specified, only entities belonging to these alert definitions will be set in downtime. |
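The preconditions in the parameter table can be checked on the client before a request is sent. The sketch below builds the JSON request body; downtime_body is a hypothetical helper, not part of ZMON, and it only encodes the rules listed in the table.

```python
import json
import time


def downtime_body(comment, entities, end_time, start_time=None,
                  alert_definitions=None):
    """Validate the documented preconditions and return the JSON body."""
    if not entities:
        raise ValueError('entities must contain at least one element')
    # start_time defaults to the current time on the server side.
    effective_start = time.time() if start_time is None else start_time
    if end_time <= effective_start:
        raise ValueError('end_time must be greater than start_time')
    body = {'comment': comment, 'end_time': end_time,
            'entities': list(entities)}
    if start_time is not None:
        body['start_time'] = start_time
    if alert_definitions:
        body['alert_definitions'] = list(alert_definitions)
    return json.dumps(body)


payload = downtime_body('Cities downtime', ['cd-kinshasa', 'cn-peking'],
                        end_time=1408665600, start_time=1408631162)
print(payload)
```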
Example:
curl -v --user hjacobs:test 'https://zmon.example.com/api/v1/downtimes' \
-H 'Content-Type: application/json' \
--data-binary $'{"comment":"Cities downtime","end_time":1408665600,"entities":["cd-kinshasa", "cn-peking"]}'
Request:
POST /api/v1/downtimes HTTP/1.1
Authorization: Basic aGphY29iczp0ZXN0
User-Agent: curl/7.30.0
Host: zmon.example.com
Accept: */*
Content-Type: application/json
Content-Length: 91
{"comment":"Cities downtime","end_time":1408665600,"entities":["cd-kinshasa", "cn-peking"]}
Response:
HTTP/1.1 200 OK
Content-Type: application/json;charset=UTF-8
Transfer-Encoding: chunked
Date: Thu, 21 Aug 2014 14:26:02 GMT
{"comment":"Cities downtime","start_time":1408631162,"end_time":1408665600,"created_by":"hjacobs",
"id":"cf6ada50-3eb2-4c17-8d09-4eb03dc19cf5","entities":["cn-peking","cd-kinshasa"],"alert_definitions":[704]}
Deleting a downtime¶
Resource URL: DELETE /api/v1/downtimes/{id}
Description
Attempt to delete the downtime with the specified id. If the downtime ID doesn’t exist, the request will succeed and return an empty list of entities/alert definitions. Any attempt to execute this method without proper authentication will result in a 401. If the user doesn’t have enough permissions (role: api-writer) this method will return an HTTP 403. In case of malformed syntax or missing mandatory fields this method will return an HTTP 400 and the client SHOULD NOT repeat the request without modifications. In case of success this method will return HTTP 200.
Parameters:
Name | Data Type | Mandatory | Description |
---|---|---|---|
id | String | yes | Id of the downtime to delete |
Example:
curl -v --user hjacobs:test 'https://zmon.example.com/api/v1/downtimes/cf6ada50-3eb2-4c17-8d09-4eb03dc19cf5' \
-H 'Content-Type: application/json' \
-X DELETE
Request:
DELETE /api/v1/downtimes/cf6ada50-3eb2-4c17-8d09-4eb03dc19cf5 HTTP/1.1
Authorization: Basic aGphY29iczp0ZXN0
User-Agent: curl/7.30.0
Host: zmon.example.com
Accept: */*
Content-Type: application/json
Response:
HTTP/1.1 200 OK
Content-Type: application/json;charset=UTF-8
Transfer-Encoding: chunked
Date: Thu, 21 Aug 2014 15:16:51 GMT
{"comment":"Cities downtime","start_time":1408633908,"end_time":1408665600,"created_by":"hjacobs",
"id":"0ff6ed67-9521-42a7-8132-5ab837193af9","entities":["cn-peking","cd-kinshasa"],"alert_definitions":[704]}
Alert Definitions¶
For more info about this feature, please check this
Creating a new Alert Definition¶
Resource URL: POST /api/v1/alert-definitions
Description
Create a new alert definition, returning the id of the newly created resource. Alert definitions can be created based on another alert definition whereby a child reuses attributes from the parent. Each alert definition can only inherit from a single alert definition (single inheritance).
One can also create templates. A Template is basically an alert definition with a subset of mandatory attributes that is not evaluated and is only used for extension.
Any attempt to execute this method without proper authentication will result in a 401. In case of success this method will return HTTP 200.
Parameters:
Name | Data Type | Mandatory | Inherited | Description |
---|---|---|---|---|
name | String | yes | yes | The alert’s display name on the dashboard. This field can contain curly-brace variables like {mycapture} that are replaced by the capture’s value when the alert is triggered. It’s also possible to format decimal precision (e.g. “My alert {mycapture:.2f}” would show as “My alert 123.46” if mycapture is 123.456789). To include a comma-separated list of entities as part of the alert’s name, just use the special placeholder {entities}. This field can be omitted if the new definition extends an existing one with this field defined (templates might not have all fields). |
description | String | yes | yes | Meaningful text for people trying to handle the alert. This field can be omitted if the new definition extends an existing one with this field defined. |
team | String | yes | no | Team dashboard to show the alert on. |
responsible_team | String | yes | no | Additional team field that allows one to delegate alert monitoring to other teams. The responsible team’s name will be shown on the dashboard. This team is responsible for fixing the problem in case the alert is triggered. |
entities | Array | yes | yes | Filter used to select a subset of check definition entities. If empty, the condition will be evaluated in all entities defined in the check definition. This field can be omitted if the new definition extends an existing one with this field defined. |
entities_exclude | Array | yes | yes | This filter is useful to exclude entities from the final entity set. If empty, none of the entities will be excluded. This field can be omitted if the new definition extends an existing one with this field defined. |
condition | String | yes | yes | Valid Python expression that returns true when the alert should be triggered. This field can be omitted if the new definition extends an existing one with this field defined. |
notifications | String | no | yes | List of notification commands. One could either send emails (send_mail) or sms (send_sms). |
check_definition_id | Number | yes | yes | Id of the check definition. This field can be omitted if the new definition extends an existing one with this field defined. |
status | String | yes | no | Alert definition status. Possible values are:
Alerts are only triggered if the alert definition is active. |
priority | Number | yes | yes | Alert priority. Possible values are:
|
period | String | no | yes | Notification time period. |
template | Boolean | yes | no | A template is an alert definition that is not evaluated and can only be used for extension. |
parent_id | Number | no | no | Id of the parent alert definition. All fields defined on the parent will be inherited. |
parameters | Object | no | yes | Alert definition parameters allow one to decouple the alert condition from the constants used inside it. One can reference parameters in the Python condition and specify their values in this field, e.g. {"KEY1": 1, "KEY2": "foo"} |
tags | Array | no | yes | Keywords assigned to an alert definition. This metadata helps describe an alert definition and allows it to be found by searching. |
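The placeholder syntax of the name field follows Python's format-string mini-language, so its effect can be previewed locally. The sketch below is only an approximation: render_name is a hypothetical helper, and joining entities with a comma and a space is an assumption about how {entities} is rendered.

```python
def render_name(template, captures, entities):
    """Fill an alert name template with capture values and entity IDs."""
    values = dict(captures)
    # Assumption: {entities} expands to the entity IDs joined by ', '.
    values['entities'] = ', '.join(entities)
    return template.format(**values)


rendered = render_name('My alert {mycapture:.2f}',
                       {'mycapture': 123.456789}, [])
print(rendered)  # My alert 123.46
```

Note that the .2f format specifier rounds rather than truncates, which is why 123.456789 renders as 123.46.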
Example:
curl --user hjacobs:test 'https://zmon.example.com/api/v1/alert-definitions' -H 'Content-Type: application/json' \
--data-binary $'{"name": "City Longitude >0", "description": "Test whether a city lies east or west", "team": "Platform/Software", "responsible_team": "Platform/Software", "entities": [{"type": "city"}], "entities_exclude": [], "condition": "capture(longitude=float(value)) > longitude_param", "notifications": [], "check_definition_id": 20, "status": "ACTIVE", "priority": 2, "period": "", "template": false, "parameters": {"longitude_param": {"comment": "Longitude parameter","type": "float", "value": 0}}, "tags": ["CITY"]}'
Request:
POST /api/v1/alert-definitions HTTP/1.1
Authorization: Basic aGphY29iczp0ZXN0
User-Agent: curl/7.30.0
Host: zmon.example.com
Accept: */*
Content-Type: application/json
Response:
HTTP/1.1 200 OK
Content-Type: application/json;charset=UTF-8
Transfer-Encoding: chunked
Date: Tue, 26 Aug 2014 18:02:29 GMT
{"id":788,"name":"City Longitude >0","description":"Test whether a city lies east or west",
"team":"Platform/Software","responsible_team":"Platform/Software","entities":[{"type":"city"}],
"entities_exclude":[],"condition":"capture(longitude=float(value)) > longitude_param","notifications":[],
"check_definition_id":20,"status":"ACTIVE","priority":2,"last_modified":1409076149956,"last_modified_by":"hjacobs",
"period":"","template":false,"parent_id":null,
"parameters":{"longitude_param":{"value":0,"comment":"Longitude parameter","type":"float"}},"tags":["CITY"]}
Updating an Alert Definition¶
Resource URL: PUT /api/v1/alert-definitions/{id}
Description
Updates an existing alert definition. If the alert definition doesn’t exist, this method will return a 404.
For more info about the parameters, please check how to create a new Alert Definition
Example:
curl --user hjacobs:test 'https://zmon.example.com/api/v1/alert-definitions/788' \
-H 'Content-Type: application/json' \
--data-binary $'{"name": "City Longitude >0", "description": "Checks whether a city lies east or west", "team": "Platform/Software", "responsible_team": "Platform/Software", "entities": [{"type": "city"}], "entities_exclude": [], "condition": "capture(longitude=float(value)) > longitude_param", "notifications": [], "check_definition_id": 20, "status": "ACTIVE", "priority": 2, "period": "", "template": false, "parameters": {"longitude_param": {"comment": "Longitude parameter","type": "float", "value": 0}}, "tags": ["CITY"]}' \
-X PUT
Request:
PUT /api/v1/alert-definitions/788 HTTP/1.1
Authorization: Basic aGphY29iczp0ZXN0
User-Agent: curl/7.30.0
Host: zmon.example.com
Accept: */*
Content-Type: application/json
Response:
HTTP/1.1 200 OK
Content-Type: application/json;charset=UTF-8
Transfer-Encoding: chunked
Date: Tue, 26 Aug 2014 18:47:00 GMT
{"id":788,"name":"City Longitude >0","description":"Checks whether a city lies east or west",
"team":"Platform/Software","responsible_team":"Platform/Software","entities":[{"type":"city"}],
"entities_exclude":[],"condition":"capture(longitude=float(value)) > longitude_param","notifications":[],
"check_definition_id":20,"status":"ACTIVE","priority":2,"last_modified":1409078820694,"last_modified_by":"hjacobs",
"period":"","template":false,"parent_id":null,
"parameters":{"longitude_param":{"value":0,"comment":"Longitude parameter","type":"float"}},"tags":["CITY"]}
Find an Alert Definition by ID¶
Resource URL: GET /api/v1/alert-definitions/{id}
Description
Find an existing alert definition by id. If the alert definition doesn’t exist, this method will return a 404.
Example:
curl -v --user hjacobs:test 'https://zmon.example.com/api/v1/alert-definitions/788' \
-H 'Content-Type: application/json'
Request:
GET /api/v1/alert-definitions/788 HTTP/1.1
Authorization: Basic aGphY29iczp0ZXN0
User-Agent: curl/7.30.0
Host: zmon.example.com
Accept: */*
Content-Type: application/json
Response:
HTTP/1.1 200 OK
Content-Type: application/json;charset=UTF-8
Transfer-Encoding: chunked
Date: Tue, 26 Aug 2014 18:47:00 GMT
{"id":788,"name":"City Longitude >0","description":"Checks whether a city lies east or west",
"team":"Platform/Software","responsible_team":"Platform/Software","entities":[{"type":"city"}],
"entities_exclude":[],"condition":"capture(longitude=float(value)) > longitude_param","notifications":[],
"check_definition_id":20,"status":"ACTIVE","priority":2,"last_modified":1409078820694,"last_modified_by":"hjacobs",
"period":"","template":false,"parent_id":null,
"parameters":{"longitude_param":{"value":0,"comment":"Longitude parameter","type":"float"}},"tags":["CITY"]}
Retrieving Alert Status¶
Resource URL: GET /api/v1/status/alert/{alert ids}/
Description
Returns the current status of the given alert IDs. The information comes directly from Redis and represents the results of the last alert evaluation.
The results are returned in the format shown below. For each alert and entity you get the following information:
- when the alert started (start_time)
- when the alert was last evaluated (ts)
- how long the evaluation took (td)
- whether there are any downtimes (downtimes)
- capture values, if available (captures)
- which worker processed the value (worker)
- the latest check value (value)
NOTE: This request only works if you include the trailing slash (as in the example below).
{"alert id":
{
"entity name":
{
"td":0.013866,
"downtimes":[],
"captures":{"count":1},
"start_time":1.416391418749185E9,
"worker":"p3426.itr-monitor01",
"ts":1.4164876292204E9,
"value":1
}
}
}
Any attempt to execute this method without proper authentication will result in a 401. In case of success this method will return HTTP 200.
Example:
curl --user hjacobs:test 'https://zmon.example.com/api/v1/status/alert/69,3454/'
Request:
GET https://zmon.example.com/api/v1/status/alert/69,3454/ HTTP/1.1
Authorization: Basic aGphY29iczp0ZXN0
User-Agent: curl/7.30.0
Host: zmon.example.com
Accept: */*
Response:
HTTP/1.1 200 OK
Content-Type: application/json;charset=UTF-8
Transfer-Encoding: chunked
Vary: Accept-Encoding
Date: Thu, 20 Nov 2014 12:47:37 GMT
{"69":{"itr-elsn02:5827":{"td":0.013866,"downtimes":[],"captures":{"count":1},"start_time":1.416391418749185E9,"worker":"p3426.itr-monitor01","ts":1.4164876292204E9,"value":1},"elsn03:5827":{"td":0.015576,"downtimes":[],"captures":{"count":8},"start_time":1.416391397741839E9,"worker":"p3426.monitor02","ts":1.416487629218565E9,"value":8},"elsn02:5827":{"td":0.024973,"downtimes":[],"captures":{"count":9},"start_time":1.416330457394862E9,"worker":"p3426.itr-monitor01","ts":1.416487629223615E9,"value":9},"itr-elsn03:5827":{"td":0.020491,"downtimes":[],"captures":{"count":1},"start_time":1.416255229204794E9,"worker":"p3426.itr-monitor01","ts":1.41648762923005E9,"value":1},"elsn01:5827":{"td":0.019912,"downtimes":[],"captures":{"count":8},"start_time":1.416391418966269E9,"worker":"p3426.monitor03","ts":1.416487629216758E9,"value":8},"itr-elsn01:5827":{"td":0.015741,"downtimes":[],"captures":{"count":2},"start_time":1.416391429438217E9,"worker":"p3426.itr-monitor01","ts":1.416487629224237E9,"value":2}},"3454":{"monitor02":{"td":0.027714,"downtimes":[],"captures":{},"start_time":1.414754929626809E9,"worker":"p3426.monitor02","ts":1.416487578812573E9,"value":{"load1":8.71,"load15":9.73,"load5":10.22}},"monitor03":{"td":0.028951,"downtimes":[],"captures":{},"start_time":1.41475492971822E9,"worker":"p3426.monitor02","ts":1.41648757881069E9,"value":{"load1":9.25,"load15":11.17,"load5":10.9}}}}
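A response of this shape is a nested mapping from alert ID to entity to status fields, so it can be flattened with two loops. The following sketch uses a shortened copy of the response above; summarize is a hypothetical helper name.

```python
import json

# Shortened sample in the documented response format.
sample = json.loads('''
{"69": {"elsn03:5827": {"td": 0.015576, "downtimes": [],
        "captures": {"count": 8}, "start_time": 1.416391397741839E9,
        "worker": "p3426.monitor02", "ts": 1.416487629218565E9, "value": 8}},
 "3454": {"monitor02": {"td": 0.027714, "downtimes": [], "captures": {},
          "start_time": 1.414754929626809E9, "worker": "p3426.monitor02",
          "ts": 1.416487578812573E9,
          "value": {"load1": 8.71, "load15": 9.73, "load5": 10.22}}}}
''')


def summarize(status):
    """Return (alert_id, entity, value, in_downtime) tuples, sorted."""
    rows = []
    for alert_id, entities in sorted(status.items()):
        for entity, info in sorted(entities.items()):
            rows.append((alert_id, entity, info['value'],
                         bool(info['downtimes'])))
    return rows


for row in summarize(sample):
    print(row)
```

Alert IDs are JSON object keys and therefore arrive as strings, so they sort lexicographically here.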
Command Line Client¶
The command line client makes your life easier when interacting with the REST API. The ZMON scheduler refreshes modified data (checks, alerts, entities) every 60 seconds.
Installation¶
pip3 install --upgrade zmon-cli
Authentication¶
The ZMON CLI tool must authenticate against ZMON. Internally it uses zign to obtain an access token, but you can override that behaviour by exporting the variable ZMON_TOKEN.
export ZMON_TOKEN=myfancytoken
If you are using GitHub for authentication, have an unprivileged personal access token ready.
Entities¶
Create or update¶
Pushing entities with the zmon cli is as easy as:
zmon entities push \
'{"id":"localhost:3421","type":"instance","name":"zmon-scheduler-ng","host":"localhost","ports":{"3421":3421}}'
Existing entities with the same ID will be updated.
The client also supports loading data from .json and .yaml files; both may contain a list for creating or updating many entities at once.
zmon entities push your-entities.yaml
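When many entities follow the same pattern, the file can be generated rather than written by hand. The sketch below produces a JSON list in the shape of the single-entity example above; instance_entity is a hypothetical helper and the second entity is made up for illustration.

```python
import json


def instance_entity(host, port, name):
    """Build one entity dict shaped like the push example above."""
    return {
        'id': '{}:{}'.format(host, port),
        'type': 'instance',
        'name': name,
        'host': host,
        'ports': {str(port): port},
    }


entities = [
    instance_entity('localhost', 3421, 'zmon-scheduler-ng'),
    instance_entity('localhost', 8080, 'example-service'),  # illustrative only
]
document = json.dumps(entities, indent=2)
print(document)
```

Saving the printed output to a .json file gives a list that could then be passed to zmon entities push.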
Note
Creating an entity of type GLOBAL is not allowed. GLOBAL as an entity type is reserved for ZMON’s internal use.
Tip
All commands and subcommands can be abbreviated, e.g. the following lines are equivalent:
$ zmon entities push my-data.yaml
$ zmon ent pu my-data.yaml
Search and filter¶
Show all entities:
zmon entities
Filter by type “instance”
zmon entities filter type instance
Check Definitions¶
Create and Update¶
Create or update from a file. An existing check with the same “owning_team” and “name” will be updated.
zmon check-definition update your-check.yaml
Alert Definitions¶
Similar to check definitions, you can also manage your alert definitions via the ZMON CLI.
Keep in mind that for alerts the same constraints apply as in the UI. For creating/modifying an alert you need to be a member of the team selected for “team” (unlike the responsible team).
Init¶
zmon alert-definition init your-new-alert.yaml
Create¶
zmon alert-definition create your-new-alert.yaml
Get¶
zmon alert-definition get 1999
Update¶
zmon alert-definition update host-load-5.yaml
Python Client¶
ZMON provides a Python client library that can be imported and used in your own software.
Usage¶
Using the ZMON client is straightforward.
>>> from zmon_cli.client import Zmon
>>> zmon = Zmon('https://zmon.example.org', token='123')
>>> zmon.get_entity('entity-1')
{
'id': 'entity-1',
'team': 'ZMON',
'type': 'instance',
'data': {'host': '192.168.20.16', 'port': 8080, 'name': 'entity-1-instance'}
}
>>> zmon.delete_entity('entity-102')
True
>>> check = zmon.get_check_definition(123)
>>> check['command']
"http('http://www.custom-service.example.org/health').code()"
>>> check['command'] = "http('http://localhost:9090/health').code()"
>>> zmon.update_check_definition(check)
{
'command': "http('http://localhost:9090/health').code()",
'description': 'Check service health',
'entities': [{'application_id': 'custom-service', 'type': 'instance'}],
'id': 123,
'interval': 60,
'last_modified_by': 'admin',
'name': 'Check service health',
'owning_team': 'ZMON',
'potential_analysis': None,
'potential_impact': None,
'potential_solution': None,
'source_url': None,
'status': 'ACTIVE',
'technical_details': None
}
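Code that builds on the client is easier to try out offline when it accepts the client as an argument. The sketch below relies only on get_entities(query=...) as documented for the Zmon class; entity_ids_by_type and FakeZmon are hypothetical names, and FakeZmon merely imitates the real client with canned data.

```python
def entity_ids_by_type(client, entity_type):
    """Return the sorted IDs of all entities of the given type."""
    entities = client.get_entities(query={'type': entity_type})
    return sorted(entity['id'] for entity in entities)


class FakeZmon:
    """Hypothetical stand-in for zmon_cli.client.Zmon, for offline use only."""

    def get_entities(self, query=None, **kwargs):
        data = [
            {'id': 'entity-2', 'type': 'instance'},
            {'id': 'entity-1', 'type': 'instance'},
            {'id': 'db-1', 'type': 'database'},
        ]
        if not query:
            return data
        return [e for e in data
                if all(e.get(k) == v for k, v in query.items())]


ids = entity_ids_by_type(FakeZmon(), 'instance')
print(ids)
```

With a real deployment, an instance created as in the session above would be passed in place of FakeZmon().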
Client¶
Exceptions¶
Zmon¶
- class zmon_cli.client.Zmon(url, token=None, username=None, password=None, timeout=10, verify=True, user_agent='zmon-client/1.1.61')
  ZMON client class that enables communication with ZMON backend.
  Parameters:
  - url (str) – ZMON backend base url.
  - token (str) – ZMON authentication token.
  - username (str) – ZMON authentication username. Ignored if token is used.
  - password (str) – ZMON authentication password. Ignored if token is used.
  - timeout (int) – HTTP requests timeout. Default is 10 sec.
  - verify (bool) – Verify SSL connection. Default is True.
  - user_agent (str) – ZMON user agent. Default is generated by ZMON client and includes lib version.
- add_entity(entity: dict, **kwargs) → requests.models.Response
  Create or update an entity on ZMON.
  Note
  ZMON PUT entity API doesn’t return JSON response.
  Parameters: entity (dict) – Entity dict.
  Returns: Response object.
  Return type: requests.Response
- alert_details_url(alert: dict) → str
  Return direct deeplink to alert details view on ZMON UI.
  Parameters: alert (dict) – alert dict.
  Returns: Deeplink to alert details view.
  Return type: str
- check_definition_url(check_definition: dict) → str
  Return direct deeplink to check definition view on ZMON UI.
  Parameters: check_definition (dict) – check_definition dict.
  Returns: Deeplink to check definition view.
  Return type: str
- create_alert_definition(alert_definition: dict, **kwargs) → dict
  Create new alert definition.
  Attributes last_modified_by and check_definition_id are required. If status is not set, then it will be set to ACTIVE.
  Parameters: alert_definition (dict) – ZMON alert definition dict.
  Returns: Alert definition dict.
  Return type: dict
- create_downtime(downtime: dict, **kwargs) → dict
  Create a downtime for specific entities.
  Attributes entities list, start_time and end_time timestamps are required.
  Parameters: downtime (dict) – Downtime dict.
  Returns: Downtime dict.
  Return type: dict
  Example downtime:
  {
    "entities": ["entity-id-1", "entity-id-2"],
    "comment": "Planned maintenance",
    "start_time": 1473337437.312921,
    "end_time": 1473341037.312921,
  }
- dashboard_url(dashboard_id: int) → str
  Return direct deeplink to ZMON dashboard.
  Parameters: dashboard_id (int) – ZMON Dashboard ID.
  Returns: Deeplink to dashboard.
  Return type: str
- delete_alert_definition(alert_definition_id: int, **kwargs) → dict
  Delete existing alert definition.
  Parameters: alert_definition_id (int) – ZMON alert definition ID.
  Returns: Alert definition dict.
  Return type: dict
- delete_check_definition(check_definition_id: int, **kwargs) → requests.models.Response
  Delete existing check definition.
  Parameters: check_definition_id (int) – ZMON check definition ID.
  Returns: HTTP response.
  Return type: requests.Response
- delete_entity(entity_id: str, **kwargs) → bool
  Delete entity from ZMON.
  Note
  ZMON DELETE entity API doesn’t return JSON response.
  Parameters: entity_id (str) – Entity ID.
  Returns: True if succeeded, False otherwise.
  Return type: bool
- get_alert_data(alert_id: int, **kwargs) → dict
  Retrieve alert data.
  Response is a dict with entity ID as a key, and check return value as a value.
  Parameters: alert_id (int) – ZMON alert ID.
  Returns: Alert data dict.
  Return type: dict
  Example:
  {
    "entity-id-1": 122,
    "entity-id-2": 0,
    "entity-id-3": 100
  }
- get_alert_definition(alert_id: int, **kwargs) → dict
  Retrieve alert definition.
  Parameters: alert_id (int) – Alert definition ID.
  Returns: Alert definition dict.
  Return type: dict
- get_alert_definitions() → list
  Return list of all active alert definitions.
  Returns: List of alert-defs.
  Return type: list
- get_check_definition(definition_id: int, **kwargs) → dict
  Retrieve check definition.
  Parameters: definition_id (int) – Check definition id.
  Returns: Check definition dict.
  Return type: dict
- get_check_definitions() → list
  Return list of all active check definitions.
  Returns: List of check-defs.
  Return type: list
- get_dashboard(dashboard_id: str, **kwargs) → dict
  Retrieve a ZMON dashboard.
  Parameters: dashboard_id (int, str) – ZMON dashboard ID.
  Returns: Dashboard dict.
  Return type: dict
- get_entities(query=None, **kwargs) → list
  Get ZMON entities, with optional filtering.
  Parameters: query (dict) – Entity filtering query. Default is None. Example query {'type': 'instance'} to return all entities of type: instance.
  Returns: List of entities.
  Return type: list
- get_entity(entity_id: str, **kwargs) → str
  Retrieve single entity.
  Parameters: entity_id (str) – Entity ID.
  Returns: Entity dict.
  Return type: dict
- get_grafana_dashboard(grafana_dashboard_uid: str, **kwargs) → dict
  Retrieve Grafana dashboard.
  Parameters: grafana_dashboard_uid (str) – Grafana dashboard UID.
  Returns: Grafana dashboard dict.
  Return type: dict
- get_onetime_token() → str
  Retrieve new one-time token.
  You can use zmon_cli.client.Zmon.token_login_url() to return a deeplink to one-time login.
  Returns: One-time token.
  Return type: str
- grafana_dashboard_url(dashboard: dict) → str
  Return direct deeplink to Grafana dashboard.
  Parameters: dashboard (dict) – Grafana dashboard dict.
  Returns: Deeplink to Grafana dashboard.
  Return type: str
- list_onetime_tokens() → list
  List existing one-time tokens.
  Returns: List of one-time tokens, with relevant attributes.
  Return type: list
  Example:
  - bound_at: 2016-09-08 14:00:12.645999
    bound_expires: 1503744673506
    bound_ip: 192.168.20.16
    created: 2016-08-26 12:51:13.506000
    token: 9pSzKpcO
- search(q, limit=None, teams=None, **kwargs) → dict
  Search ZMON dashboards, checks, alerts and Grafana dashboards with optional team filtering.
  Parameters:
  - q (str) – search query.
  - teams (list) – List of team IDs. Default is None.
  Returns: Search result.
  Return type: dict
  Example:
  {
    "alerts": [{"id": "123", "title": "ZMON alert", "team": "ZMON"}],
    "checks": [{"id": "123", "title": "ZMON check", "team": "ZMON"}],
    "dashboards": [{"id": "123", "title": "ZMON dashboard", "team": "ZMON"}],
    "grafana_dashboards": [{"id": "123", "title": "ZMON grafana", "team": ""}],
  }
- token_login_url(token: str) → str
  Return direct deeplink to ZMON one-time login.
  Parameters: token (str) – One-time token.
  Returns: Deeplink to ZMON one-time login.
  Return type: str
- update_alert_definition(alert_definition: dict, **kwargs) → dict
  Update existing alert definition.
  Attributes id, last_modified_by and check_definition_id are required. If status is not set, then it will be set to ACTIVE.
  Parameters: alert_definition (dict) – ZMON alert definition dict.
  Returns: Alert definition dict.
  Return type: dict
- update_check_definition(check_definition, skip_validation=False, **kwargs) → dict
  Update existing check definition.
  Attribute owning_team is required. If status is not set, then it will be set to ACTIVE.
  Returns: Check definition dict.
  Return type: dict
- update_dashboard(dashboard: dict, **kwargs) → dict
  Create or update dashboard.
  If dashboard has an id then dashboard will be updated, otherwise a new dashboard is created.
  Parameters: dashboard (dict) – ZMON dashboard dict.
  Returns: Dashboard dict.
  Return type: dict
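Because get_alert_data returns a plain dict of entity ID to check value, its result can be post-processed with ordinary Python. The sketch below works on the documented example shape; entities_above is a hypothetical helper, and skipping None values is a defensive assumption rather than documented behaviour.

```python
def entities_above(alert_data, threshold):
    """Return sorted entity IDs whose check value exceeds the threshold."""
    return sorted(
        entity for entity, value in alert_data.items()
        if value is not None and value > threshold
    )


# Same shape as the get_alert_data example.
alert_data = {"entity-id-1": 122, "entity-id-2": 0, "entity-id-3": 100}
noisy = entities_above(alert_data, 99)
print(noisy)
```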
A Short Python Tutorial¶
This tutorial explains by example how to process a dict using Python’s list comprehension facilities.
Suppose we’re interested in the total number of order failures.
First, we need to query the appropriate endpoint to get the data, and call the json() method.
http('http://www.example.com/foo/bar/data.json').json()
This endpoint returns JSON data that is structured as follows (with much of the data omitted):
{
  ...
  "itr-http04_orderfails": [1, 0],
  "itr-http05_addtocart": [0.05, 0.0875],
  "http17_addtocart": [0.075, 0.066667],
  "http27_requests": [14.666667, 12.195833],
  "http13_orderfails": [null, 2],
  ...
}
The parsed object will therefore be a dict mapping strings to lists of numbers, which may contain None values.
We need to find all entries ending in _orderfails. In Python, we can transform a dict into a list of (key, value) tuples using the items() method:
http(...).json().items()
We now need to filter this list to include only order failure information. Using a loop and an if statement, this could be accomplished like this:
result = []
for key, value in http(...).json().items():
    if key.endswith('_orderfails'):
        result.append(value)
(Note how the tuples in the list returned by items() are automatically “unpacked”, their elements being assigned to key and value, respectively.)
Since the check command needs to be a single expression, not a series of statements, this is unfortunately not an option. Fortunately, Python provides a feature called list comprehension, which allows us to express the code above as follows:
[value for key, value in http(...).json().items() if key.endswith('_orderfails')]
That is, code of the form
result = []
for ELEMENT in LIST:
    if CONDITION:
        result.append(RESULT_ELEMENT)
becomes
[RESULT_ELEMENT for ELEMENT in LIST if CONDITION]
(The if CONDITION part is optional.)
We now have a list of lists [[1, 0], [None, 2]].
In order to sum the list, we’d need to flatten it first, so that it has the form [1, 0, None, 2]. This can be accomplished with the chain() function. Given one or more iterable objects (such as lists), chain() returns a new iterable object produced by concatenating the given objects. That is,
chain([1, 0], [None, 2])
would return
[1, 0, None, 2]
Unfortunately, the lists we want to chain are themselves elements of a list, and calling
chain([[1, 0], [None, 2]])
would just concatenate the list with nothing and return it unchanged. We therefore need to tell Python to unpack the list, so that each of its elements becomes a new argument for the invocation of chain().
This can be accomplished by the * operator:
chain(*[[1, 0], [None, 2]])
That is, our expression is now
chain(*[value for key, value in http(...).json().items() if key.endswith('_orderfails')])
Now we need to remove that pesky None from the list. This could be accomplished with another list comprehension:
[value for value in chain(...) if value is not None]
For didactic reasons, we shall use the filter() function instead. filter() takes two arguments: a function that is called for each element in the filtered list and indicates whether that element should be in the resulting list, and the list that is to be filtered itself. We can create an anonymous function for this purpose using a lambda expression:
filter(lambda element: element is not None, chain(...))
In this case, we can use a somewhat obscure shortcut, though. If the function given to filter() is None, the identity function is used. Therefore, objects will be included in the resulting list if and only if they are “truthy”, which None isn’t. The integer 0 isn’t truthy either, but this isn’t a problem in this case since the presence or absence of zeros does not affect the sum. Therefore, we can use the expression
filter(None, chain(*[value for key, value in http(...).json().items() if key.endswith('_orderfails')]))
Finally, we need to sum the elements of the list. For that, we can just use the sum() function, so that the expression is now
sum(filter(None, chain(*[value for key, value in http(...).json().items() if key.endswith('_orderfails')])))
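To try the whole walkthrough outside of ZMON, here is a minimal sketch that replaces http(...).json() with a hypothetical sample dictionary; the key names and numbers are made up for illustration. In a plain Python interpreter, chain() has to be imported from itertools first.

```python
from itertools import chain

# Hypothetical stand-in for the parsed JSON returned by http(...).json()
sample = {
    'de_orderfails': [1, 0],
    'fr_orderfails': [None, 2],
    'de_orders': [10, 20],  # ignored: key does not end with '_orderfails'
}

# Step 1: keep only the values whose key ends with '_orderfails'
lists = [value for key, value in sample.items() if key.endswith('_orderfails')]

# Step 2: unpack the list of lists into chain() to flatten it
flattened = list(chain(*lists))

# Step 3: drop falsy elements (None and 0) and sum what is left
total = sum(filter(None, flattened))

print(lists, flattened, total)
```

Each intermediate name (lists, flattened, total) corresponds to one step of the expression built up above, which makes it easier to check a real response by hand.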
Python Recipes¶
-
Merging Data Into One Result
You can merge heterogeneous data into a single result object:
{
    'http_data': http(...).json()[...],
    'jmx_data': jmx().query(...).results()[...],
    'sql_data': sql().execute(...)[...],
}
-
Mapping SQL Results by ID
The SQL results() method returns a list of maps ([{'id': 1, 'data': 1000}, {'id': 2, 'data': 2000}]). You can convert this to a single map ({1: 1000, 2: 2000}) like this:
{ row['id']: row['data'] for row in sql().execute(...).results() }
-
Using Multiple Captures
If you have an alert condition such as
FOO > 10 or BAR > 10
adding captures is a bit tricky. If you use
capture(foo=FOO) > 10 or capture(bar=BAR) > 10
and both FOO and BAR are greater than 10, only foo will be captured because the or uses short-circuit evaluation (True or X is true for all X, so X doesn’t need to be evaluated). Instead, you can use
any([capture(foo=FOO) > 10, capture(bar=BAR) > 10])
which will always evaluate both comparisons and thus capture both values.
-
Defining Temporary Variables
You aren’t supposed to be able to define variables, but you can work around this restriction as follows:
(lambda x:
    # Some complex operation using x multiple times
)(
    x = sql().execute(...)  # Some complex or expensive query
)
-
Defining Functions
Since you can define variables with the trick above, you can also define functions:
(lambda f:
    # Some complex operation calling f multiple times
)(
    f = lambda a, b, c: sql().execute(...)  # Some code using the arguments a, b, and c
)
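The recipes above can be tried in a plain Python interpreter with a few stand-ins. The sketch below is only an illustration: the capture() function is a hypothetical replacement that records its keyword argument in a dictionary and returns the value, and the rows list stands in for the output of sql().execute(...).results().

```python
# Hypothetical stand-in for ZMON's capture(): remember the value, then return it
captures = {}

def capture(**kwargs):
    captures.update(kwargs)
    return next(iter(kwargs.values()))

FOO, BAR = 20, 30

# "or" short-circuits: the second capture() is never called
captures.clear()
triggered_or = capture(foo=FOO) > 10 or capture(bar=BAR) > 10
captured_with_or = dict(captures)

# any() over a list evaluates every element first, so both values are captured
captures.clear()
triggered_any = any([capture(foo=FOO) > 10, capture(bar=BAR) > 10])
captured_with_any = dict(captures)

# Made-up rows standing in for sql().execute(...).results()
rows = [{'id': 1, 'data': 1000}, {'id': 2, 'data': 2000}]

# Mapping SQL results by ID
by_id = {row['id']: row['data'] for row in rows}

# Temporary variable via an immediately invoked lambda: x is computed once
# and then used twice inside the lambda body
spread = (lambda x: max(x) - min(x))(x=[row['data'] for row in rows])

print(captured_with_or, captured_with_any, by_id, spread)
```

The immediately invoked lambda keeps everything in a single expression, which is why it works where ordinary assignments are not available.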
Tests¶
Acceptance and Unit Tests¶
These tests must be run from inside the Vagrant box:
$ vagrant ssh
vagrant@zmon:~$ cd /vagrant/vagrant/
vagrant@zmon:/vagrant/vagrant$ sudo ./test.sh
An example output of the previous command can look similar to this:
Starting Xvfb...
[13:36:12] Using gulpfile /vagrant/zmon-controller/src/main/webapp/gulpfile.js
[13:36:12] Starting 'test'...
Starting selenium standalone server...
Selenium standalone server started at http://10.0.2.15:47833/wd/hub
Testing dashboard features
should display the search form - pass
Finished in 3.24 seconds
1 test, 1 assertion, 0 failures
Shutting down selenium standalone server.
[13:36:22] Finished 'test' after 10 s
So far, only a single acceptance test and no unit tests are provided. This is still a work in progress.
Redis Data Structure¶
ZMON stores its primary working data in Redis. This page describes the Redis keys and data structures it uses.
Queues are Redis keys like zmon:queue:<NAME> of type “list”, e.g. zmon:queue:default.
New queue items are added by the ZMON Scheduler via the Redis “rpush” command.
Important Redis key patterns are:
zmon:queue:<QUEUE-NAME>
- List of worker tasks for given queue.
zmon:checks
- Set of all executed check IDs.
zmon:checks:<CHECK-ID>
- Set of entity IDs having check results.
zmon:checks:<CHECK-ID>:<ENTITY-ID>
- List of last N check results. The first list item contains the most recent check result. Each check result is a JSON object with the keys ts (result timestamp), td (check duration), value (actual result value) and worker (ID of worker having produced the check result).
zmon:alerts
- Set of all active alert IDs.
zmon:alerts:<ALERT-ID>
- Set of entity IDs in alert state.
zmon:alerts:<ALERT-ID>:entities
- Hash of entity IDs to alert captures. This hash contains all entity IDs matched by the alert, i.e. not only entities in alert state.
zmon:alerts:<ALERT-ID>:<ENTITY-ID>
- Alert detail JSON containing alert start time, captures, worker, etc.
zmon:downtimes
- Set of all alert IDs having downtimes.
zmon:downtimes:<ALERT-ID>
- Set of all entity IDs having a downtime for this alert.
zmon:downtimes:<ALERT-ID>:<ENTITY-ID>
- Hash of downtimes for this entity/alert. Each hash value is a JSON object with keys start_time, end_time and comment.
zmon:active_downtimes
- Set of currently active downtimes. Each set item has the form <ALERT-ID>:<ENTITY-ID>:<DOWNTIME-ID>.
zmon:metrics
- Set of worker and scheduler IDs with metrics.
zmon:metrics:<WORKER-OR-SCHEDULER-ID>:ts
- Timestamp of last worker or scheduler metrics update.
zmon:metrics:<WORKER-OR-SCHEDULER-ID>:check.count
- Increasing counter of executed (or scheduled) checks.
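As a rough illustration of the key patterns listed above, the following sketch builds two of the key names and decodes a check result. The helper names, the IDs and the sample JSON strings are all hypothetical; with a real Redis client the raw list would come from reading the zmon:checks:<CHECK-ID>:<ENTITY-ID> list.

```python
import json

def check_results_key(check_id, entity_id):
    """Key of the list holding the last N results of a check for one entity."""
    return 'zmon:checks:{}:{}'.format(check_id, entity_id)

def alert_detail_key(alert_id, entity_id):
    """Key of the alert detail JSON for one entity."""
    return 'zmon:alerts:{}:{}'.format(alert_id, entity_id)

# Made-up list contents; per the description above, the first item is the newest
raw_results = [
    '{"ts": 1400000060.0, "td": 0.05, "value": 3, "worker": "worker-1"}',
    '{"ts": 1400000000.0, "td": 0.04, "value": 2, "worker": "worker-2"}',
]

results_key = check_results_key(42, 'cassandra01')
detail_key = alert_detail_key(7, 'cassandra01')
latest = json.loads(raw_results[0])

print(results_key, detail_key, latest['value'], latest['worker'])
```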
Glossary¶
- alert definition
- Alert definitions define when to trigger an alert and for which entity. See Alert Definitions
- alert condition
- Python expression defining the “threshold” when to trigger an alert. See Condition.
- check command
- Python expression defining the value of a check. See Check Command Reference.
- check definition
- A check definition provides a source of data for alerts to monitor. See Check Definitions
- dashboard
- A dashboard is the main monitoring page of ZMON and consists of widgets and the list of active alerts. See Dashboards
- downtime
- In ZMON, downtime refers to a period of time during which certain alerts/entities should not be triggered. One use case for downtimes is scheduled maintenance work. See Downtimes.
- entity
- Entities are “objects” to be monitored. Entities can be hosts, Zomcat instances, but they can also be more abstract things like app domains. See Entities
- JSON
- JavaScript Object Notation. A minimal data interchange format. You probably already know it. If you don’t, there’s good documentation on its official page.
- Markdown
- A simple markup language that can mostly pass for plain text. There’s an introduction and a syntax reference on its official page.
- time period
- An alert definition’s time period can restrict its active alerting to certain time frames. This allows for alerts to be active e.g. only during work hours. See Time periods.
- YAML
- Not actually Yet Another Markup Language. A powerful but succinct data interchange format. This document should be sufficient to learn how to use YAML in ZMON. In case it isn’t, the Wikipedia entry on YAML is actually slightly more useful than the official documentation.
Note that YAML is a strict superset of JSON. That is, wherever YAML is required, JSON can be used instead.
Introduction¶
ZMON is a flexible and extensible open-source platform monitoring tool developed at Zalando that has been in production use since early 2014. It offers proven scaling with its distributed nature and fast storage with KairosDB on top of Cassandra. ZMON splits checking (data acquisition) from the alerting responsibilities and uses abstract entities to describe what’s being monitored. Its checks and alerts rely on Python expressions, giving the user a lot of power and connectivity. Besides the UI it provides RESTful APIs to manage and configure most properties automatically.
Anyone can use ZMON, but it offers particular advantages for technical organizations with many autonomous teams. Its front end (see Demo / Bootstrap / Kubernetes / Vagrant) comes with Grafana3 “built-in,” enabling teams to create and manage their own data-driven dashboards alongside ZMON’s own team/personal dashboards for alerts and custom widgets. Being able to inherit and clone alerts makes it easier for teams to reuse and share code. Alerts can trigger HipChat, Slack, and E-Mail notifications. iOS and Android clients are works in progress, but push notifications are already implemented.
ZMON also enables painless integration with CMDBs and deployment tools. It also supports service discovery via custom adapters or its built-in entity service’s REST API. For an example, see zmon-aws-agent to learn how we connect AWS service discovery with our monitoring in the cloud.
Feel free to contact us via slack.zmon.io.
ZMON Components¶
A minimum ZMON setup requires these four components:
- zmon-controller: UI/Grafana/OAuth2 Login/GitHub Login
- zmon-scheduler: Scheduling check/alert evaluation
- zmon-worker: Doing the heavy lifting
- zmon-eventlog-service: History for state changes and modifications
Plus the storage covered in the Requirements section.
The following components are optional:
- zmon-cli: A command line client for managing entities/checks/alerts if needed
- zmon-aws-agent: Works with the AWS API to retrieve “known” applications
- zmon-data-service: API for multi DC federation: receiver for remote workers primarily
- zmon-metric-cache: Small scale special purpose metric store for API metrics in ZMON’s cloud UI
- zmon-notification-service: Provides mobile API and push notification support for GCM to Android/iOS app
- zmon-android: An Android client for ZMON monitoring
- zmon-ios: An iOS client for ZMON monitoring
ZMON Origins¶
ZMON was born in late 2013 during Zalando’s annual Hack Week, when a group of Zalando engineers aimed to develop a replacement for ICINGA. Scalability, manageability and flexibility were all critical, as Zalando’s small teams needed to be able to monitor their services independent of each other. In early 2014, Zalando teams began migrating all checks to ZMON, which continues to serve Zalando Tech.
Entities¶
ZMON uses entities to describe your infrastructure or platform, and to bind check variables to fixed values.
{
"type":"host",
"id":"cassandra01",
"host":"cassandra01",
"role":"cassandra-host",
"ip":"192.168.1.17",
"dc":"data-center-1"
}
Or more abstract objects:
{
"type":"postgresql-cluster",
"id":"article-cluster",
"name":"article-cluster",
"shards": {
"shard1":"articledb01:5432/shard1",
"shard2":"articledb02:5432/shard2"
}
}
Entity properties are not defined in any schema, so you can add properties as you see fit. This enables finer-grained filtering or selection of entities later on. As an example, host entities can include a physical model to later select the proper hardware checks.
Below you see an example of the entity view with alerts per entity.

Checks¶
A check describes how data is acquired. Its key properties are: a command to execute and an entity filter. The filter selects a subset of entities by requiring an overlap on specified properties. An example:
{
"type":"postgresql-cluster", "name":"article-cluster"
}
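One way to picture the filter semantics is a small matching function. The sketch below is an assumption about how "overlap on specified properties" can be read, not ZMON's actual implementation: an entity is selected when every property named in the filter is present on the entity with the same value. The function name matches() is hypothetical.

```python
def matches(entity, entity_filter):
    """Return True if the entity carries every property/value pair of the filter."""
    return all(entity.get(key) == value for key, value in entity_filter.items())

# The two example entities from the Entities section
entities = [
    {
        'type': 'host',
        'id': 'cassandra01',
        'host': 'cassandra01',
        'role': 'cassandra-host',
        'ip': '192.168.1.17',
        'dc': 'data-center-1',
    },
    {
        'type': 'postgresql-cluster',
        'id': 'article-cluster',
        'name': 'article-cluster',
        'shards': {
            'shard1': 'articledb01:5432/shard1',
            'shard2': 'articledb02:5432/shard2',
        },
    },
]

entity_filter = {'type': 'postgresql-cluster', 'name': 'article-cluster'}

selected = [entity['id'] for entity in entities if matches(entity, entity_filter)]
print(selected)
```

Properties that the filter does not mention, such as shards here, play no role in the selection.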
The check command itself is an executable Python expression. ZMON provides many custom wrappers that bind to the selected entity. The following example uses a PostgreSQL wrapper to execute a query on every shard defined above:
# sql() in this context is aware of the "shards" property
sql().execute('SELECT count(1) FROM articles "total"').result()
A check command always returns a value to the alert. This can be of any Python type.
Not familiar with Python’s functional expressions? No worries: ZMON allows you to define a top-level function and define your command in an easier, less functional way:
def check():
    # sql() binds to the entity used and thus knows the connection URLs
    return sql().execute('SELECT count(1) FROM articles "total"').result()
Alerts¶
A basic alert consists of an alert condition, an entity filter, and a team. An alert has only two states: up or down. An alert is up if it yields anything but False; this also includes exceptions thrown during evaluation of the check or alert, e.g. in the event of connection problems. ZMON does not support levels of criticality, or something like “unknown”, but you have a color option to customize sort and style on your dashboard (red, orange, yellow).
Let’s revisit the PostgreSQL check above. The alert below would pop up either if no articles are found or if we get an exception connecting to the PostgreSQL database.
team: database
entities:
  - type: postgresql-cluster
alert_condition: |
  value <= 0
Alerts raised by exceptions are marked in the dashboard with a “!”.
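The two-state rule described above can be sketched in a few lines. This is a simplified model under stated assumptions, not ZMON's worker code: evaluate_alert() is a hypothetical helper, the alert condition is modelled as a Python callable, and an alert counts as up when the condition yields anything but False or raises an exception.

```python
def evaluate_alert(condition, value):
    """Evaluate an alert condition; exceptions count as 'up' and are flagged."""
    try:
        result = condition(value)
    except Exception:
        # Modelled after the "!" marker for alerts raised by exceptions
        return {'up': True, 'exception': True}
    return {'up': result is not False, 'exception': False}

# The condition from the example alert: value <= 0
condition = lambda value: value <= 0

articles_found = evaluate_alert(condition, 1200)   # check returned a positive count
no_articles = evaluate_alert(condition, 0)         # check returned zero
broken_check = evaluate_alert(condition, None)     # None <= 0 raises TypeError in Python 3

print(articles_found, no_articles, broken_check)
```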
Via ZMON’s UI, alerts support parameters to the alert condition. This makes it easy for teams/users to implement different thresholds, and — with the priority field defining the dashboard color — render their dashboards to reflect their priorities.
Dashboards¶
Dashboards include a widget area where you can render important data with charts, gauges, or plain text. Another section features rendering of all active alerts for the team filter, defined at the dashboard level. Using the team filter, select the alerts you want your dashboard to include. Specify multiple teams, if necessary. TAGs are supported to subselect topics.

REST API and CLI¶
To make your life easier, ZMON’s REST API manages all the essential moving parts to support your daily work — creating and updating entities to allow for sync-up with your existing infrastructure. When you create and modify checks and alerts, the scheduler will quickly pick up these changes so you won’t have to restart or deploy anything.
And ZMON’s command line client - a slim wrapper around the REST API - also adds usability by making it simpler to work with YAML files or push collections of entities.
Development Status¶
The team behind ZMON continues to improve performance and functionality. Please let us know via GitHub’s issues tracker if you find any bugs or issues.