Event Handlers
Introduction
Event handlers are optional system commands (scripts or executables) that are run whenever a host or service state change occurs.
An obvious use for event handlers is the ability for Shinken to proactively fix problems before anyone is notified. Some other uses for event handlers include:
- Restarting a failed service
- Entering a trouble ticket into a helpdesk system
- Logging event information to a database
- Cycling power on a host*
- etc.
Cycling power on a problem host with an automated script should not be implemented lightly. Consider the consequences carefully before enabling automatic reboots. :-)
When Are Event Handlers Executed?
Event handlers are executed when a service or host:
- Is in a SOFT problem state
- Initially goes into a HARD problem state
- Initially recovers from a SOFT or HARD problem state
SOFT and HARD states are described in detail here.
Event Handler Types
There are different types of optional event handlers that you can define to handle host and service state changes:
- Global host event handler
- Global service event handler
- Host-specific event handlers
- Service-specific event handlers
Global host and service event handlers are run for every host or service state change that occurs, immediately prior to any host- or service-specific event handler that may be run.
Event handlers offer functionality similar to notifications (launching a command), but they are called on each state change, soft or hard. This allows you to call a handler command and react to problems before Shinken raises a hard state and starts sending out notifications.
You can specify global event handler commands by using the global_host_event_handler and global_service_event_handler options in your main configuration file.
Individual hosts and services can have their own event handler command that should be run to handle state changes. You can specify an event handler that should be run by using the “event_handler” directive in your host and service definitions. These host- and service-specific event handlers are executed immediately after the (optional) global host or service event handler is executed.
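For example, the main configuration file entries for the global handlers might look like this (a minimal sketch; the two command names are hypothetical and must exist as command definitions):
global_host_event_handler=log-host-event
global_service_event_handler=log-service-event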
Enabling Event Handlers
Event handlers can be enabled or disabled on a program-wide basis by using the enable_event_handlers option in your main configuration file.
Host- and service-specific event handlers can be enabled or disabled by using the “event_handler_enabled” directive in your host and service definitions. Host- and service-specific event handlers will not be executed if the global enable_event_handlers option is disabled.
Event Handler Execution Order
As already mentioned, global host and service event handlers are executed immediately before host- or service-specific event handlers.
Event handlers are executed for HARD problem and recovery states immediately after notifications are sent out.
Writing Event Handler Commands
Event handler commands will likely be shell or Perl scripts, but they can be any type of executable that can run from a command prompt. At a minimum, the scripts should take the following macros as arguments:
For Services: $SERVICESTATE$, $SERVICESTATETYPE$, $SERVICEATTEMPT$
For Hosts: $HOSTSTATE$, $HOSTSTATETYPE$, $HOSTATTEMPT$
The scripts should examine the values of the arguments passed to them and take any necessary action based upon those values. The best way to understand how event handlers work is to see an example. Luckily, one is provided below.
Additional sample event handler scripts can be found in the “contrib/eventhandlers/” subdirectory of the Nagios distribution. Some of these sample scripts demonstrate the use of external commands to implement redundant and distributed monitoring environments.
Permissions For Event Handler Commands
Event handler commands will normally execute with the same permissions as the user under which Shinken is running on your machine. This can present a problem if you want to write an event handler that restarts system services, as root privileges are generally required to do these sorts of tasks.
Ideally you should evaluate the types of event handlers you will be implementing and grant just enough permissions to the Shinken user for executing the necessary system commands. You might want to try using sudo to accomplish this.
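For example, a sudoers entry like the following (a hypothetical sketch; edit it with visudo and adjust the user name and init script path to your system) would let the Shinken user restart the web server, and nothing else, without a password:
# /etc/sudoers
shinken ALL=(root) NOPASSWD: /etc/rc.d/init.d/httpd restart
The event handler script would then call “sudo /etc/rc.d/init.d/httpd restart” instead of invoking the init script directly.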
Service Event Handler Example
The example below assumes that you are monitoring the “HTTP” server on the local machine and have specified restart-httpd as the event handler command for the “HTTP” service definition. Also, I will be assuming that you have set the “max_check_attempts” option for the service to be a value of 4 or greater (i.e. the service is checked 4 times before it is considered to have a real problem). An abbreviated example service definition might look like this...
define service{
host_name somehost
service_description HTTP
max_check_attempts 4
event_handler restart-httpd
...
}
Once the service has been defined with an event handler, we must define that event handler as a command. An example command definition for restart-httpd is shown below. Notice the macros in the command line that I am passing to the event handler script - these are important!
define command{
command_name restart-httpd
command_line /usr/local/nagios/libexec/eventhandlers/restart-httpd $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
}
Now, let’s actually write the event handler script (this is the “/usr/local/nagios/libexec/eventhandlers/restart-httpd” script).
#!/bin/sh
#
# Event handler script for restarting the web server on the local machine
#
# Note: This script will only restart the web server if the service is
# retried 3 times (in a "soft" state) or if the web service somehow
# manages to fall into a "hard" error state.
#

# What state is the HTTP service in?
case "$1" in
OK)
    # The service just came back up, so don't do anything...
    ;;
WARNING)
    # We don't really care about warning states, since the service is probably still running...
    ;;
UNKNOWN)
    # We don't know what might be causing an unknown error, so don't do anything...
    ;;
CRITICAL)
    # Aha! The HTTP service appears to have a problem - perhaps we should restart the server...

    # Is this a "soft" or a "hard" state?
    case "$2" in

    # We're in a "soft" state, meaning that Nagios is in the middle of retrying the
    # check before it turns into a "hard" state and contacts get notified...
    SOFT)

        # What check attempt are we on? We don't want to restart the web server on the first
        # check, because it may just be a fluke!
        case "$3" in

        # Wait until the check has been tried 3 times before restarting the web server.
        # If the check fails on the 4th time (after we restart the web server), the state
        # type will turn to "hard" and contacts will be notified of the problem.
        # Hopefully this will restart the web server successfully, so the 4th check will
        # result in a "soft" recovery. If that happens no one gets notified because we
        # fixed the problem!
        3)
            echo -n "Restarting HTTP service (3rd soft critical state)..."
            # Call the init script to restart the HTTPD server
            /etc/rc.d/init.d/httpd restart
            ;;
        esac
        ;;

    # The HTTP service somehow managed to turn into a hard error without getting fixed.
    # It should have been restarted by the code above, but for some reason it didn't.
    # Let's give it one last try, shall we?
    # Note: Contacts have already been notified of a problem with the service at this
    # point (unless you disabled notifications for this service)
    HARD)
        echo -n "Restarting HTTP service..."
        # Call the init script to restart the HTTPD server
        /etc/rc.d/init.d/httpd restart
        ;;
    esac
    ;;
esac
exit 0
The sample script provided above will attempt to restart the web server on the local machine in two different instances:
- After the service has been rechecked for the 3rd time and is in a SOFT CRITICAL state
- After the service first goes into a HARD CRITICAL state
The script should theoretically restart the web server and fix the problem before the service goes into a HARD problem state, but we include a fallback case in the event that it doesn’t work the first time. It should be noted that the event handler will only be executed the first time that the service falls into a HARD problem state. This prevents Shinken from continuously executing the script to restart the web server if the service remains in a HARD problem state. You don’t want that. :-)
That’s all there is to it! Event handlers are pretty simple to write and implement, so give it a try and see what you can do.
- Note: you may need to:
- disable event handlers during downtimes (either by setting no_event_handlers_during_downtimes=1, or by checking $HOSTDOWNTIME$ and $SERVICEDOWNTIME$)
- make sure you want event handlers to be run even outside of the notification_period
Service and Host Freshness Checks
Introduction
Shinken supports a feature that does “freshness” checking on the results of host and service checks. The purpose of freshness checking is to ensure that host and service checks are being provided passively by external applications on a regular basis.
Freshness checking is useful when you want to ensure that passive checks are being received as frequently as you want. This can be very useful in distributed and failover monitoring environments.
How Does Freshness Checking Work?
Shinken periodically checks the freshness of the results for all hosts and services that have freshness checking enabled.
- A freshness threshold is calculated for each host or service.
- For each host/service, the age of its last check result is compared with the freshness threshold.
- If the age of the last check result is greater than the freshness threshold, the check result is considered “stale”.
- If the check result is found to be stale, Shinken will force an active check of the host or service by executing the command specified by the check_command option in the host or service definition.
An active check is executed even if active checks are disabled on a program-wide or host- or service-specific basis.
For example, if you have a freshness threshold of 60 for one of your services, Shinken will consider that service to be stale if its last check result is older than 60 seconds.
Enabling Freshness Checking
Here’s what you need to do to enable freshness checking...
- Enable freshness checking on a program-wide basis with the check_service_freshness and check_host_freshness directives.
- Use service_freshness_check_interval and host_freshness_check_interval options to tell Shinken how often it should check the freshness of service and host results.
- Enable freshness checking on a host- and service-specific basis by setting the “check_freshness” option in your host and service definitions to a value of 1.
- Configure freshness thresholds by setting the “freshness_threshold” option in your host and service definitions.
- Configure the “check_command” option in your host or service definitions to reflect a valid command that should be used to actively check the host or service when it is detected as stale.
- The “check_period” option in your host and service definitions is used when Shinken determines when a host or service can be checked for freshness, so make sure it is set to a valid timeperiod.
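Put together, the program-wide part of this setup might look like the following sketch in your main configuration file (the interval values, in seconds, are examples only):
check_service_freshness=1
service_freshness_check_interval=60
check_host_freshness=1
host_freshness_check_interval=60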
If you do not specify a host- or service-specific “freshness_threshold” value (or you set it to zero), Shinken will automatically calculate a threshold based on how often you monitor that particular host or service. I would recommend that you explicitly specify a freshness threshold rather than letting Shinken pick one for you.
Example
An example of a service that might require freshness checking is one that reports the status of your nightly backup jobs. Perhaps you have an external script that submits the results of the backup job to Shinken once the backup is completed. In this case, all of the checks/results for the service are provided by an external application using passive checks. In order to ensure that the status of the backup job gets reported every day, you may want to enable freshness checking for the service. If the external script doesn’t submit the results of the backup job, you can have Shinken fake a critical result by doing something like this...
Here’s what the definition for the service might look like (some required options are omitted)...
define service{
host_name backup-server
service_description ArcServe Backup Job
active_checks_enabled 0 ; active checks are NOT enabled
passive_checks_enabled 1 ; passive checks are enabled (this is how results are reported)
check_freshness 1
freshness_threshold 93600 ; 26 hour threshold, since backups may not always finish at the same time
check_command no-backup-report ; this command is run only if the service results are "stale"
...other options...
}
Notice that active checks are disabled for the service. This is because the results for the service are only made by an external application using passive checks. Freshness checking is enabled and the freshness threshold has been set to 26 hours. This is a bit longer than 24 hours because backup jobs sometimes run late from day to day (depending on how much data there is to back up, how much network traffic is present, etc.). The “no-backup-report” command is executed only if the results of the service are determined to be stale. The definition of the “no-backup-report” command might look like this...
define command{
command_name no-backup-report
command_line /var/lib/shinken/libexec/check_dummy 2 "CRITICAL: Results of backup job were not reported!"
}
If Shinken detects that the service results are stale, it will run the “no-backup-report” command as an active service check. This causes the check_dummy plugin to be executed, which returns a critical state to Shinken. The service will then go into a critical state (if it isn’t already there) and someone will probably get notified of the problem.
Distributed Monitoring
Introduction
Shinken can be configured to support distributed monitoring of network services and resources. Shinken was designed for distributed monitoring from the start, in contrast to the Nagios way of doing it, which is more of a “MacGyver” approach.
Goals
The goal in the distributed monitoring environment is to offload the overhead (CPU usage, etc.) of performing and receiving service checks from a “central” server onto one or more “distributed” servers. Most small to medium sized shops will not have a real need for setting up such an environment. However, when you want to start monitoring thousands of hosts (and several times that many services) using Shinken, this becomes quite important.
The global architecture
Shinken’s architecture has been designed according to the Unix Way: one tool, one task. Shinken has an architecture where each part is isolated and connects to the others via standard interfaces. Shinken is based on an HTTP backend. This makes building a highly available or distributed monitoring architecture quite easy. In contrast, the Nagios daemon does nearly everything: it loads the configuration, schedules and launches checks, and raises notifications.
- The major innovations of Shinken over Nagios are to:
- split the different roles into separate daemons
- permit the use of modules to extend and enrich the various Shinken daemons
Shinken’s core uses distributed programming, meaning a daemon will often do remote invocations of code on other daemons. To ensure maximum compatibility and stability, the core language, paths, and module versions must therefore be the same everywhere a daemon is running.
Shinken Daemon roles
- Arbiter: The arbiter daemon reads the configuration, divides it into parts (N schedulers = N parts), and distributes them to the appropriate Shinken daemons. Additionally, it manages the high availability features: if a particular daemon dies, it re-routes the configuration managed by this failed daemon to the configured spare. Finally, it receives input from users (such as external commands from nagios.cmd) or passive check results and routes them to the appropriate daemon. Passive check results are forwarded to the Scheduler responsible for the check. There can only be one active arbiter with other arbiters acting as hot standby spares in the architecture.
- Modules for data collection: NSCA, TSCA, Ws_arbiter (web service)
- Modules for configuration data storage: MongoDB
- Modules for status retention: PickleRetentionArbiter
- Modules for configuration manipulation: IP_Tag, MySQLImport, GLPI, vmware autolinking and other task specific modules
- Scheduler: The scheduler daemon manages the dispatching of checks and actions to the poller and reactionner daemons respectively. The scheduler daemon is also responsible for processing the check result queue, analyzing the results, doing correlation and following up actions accordingly (if a service is down, ask for a host check). It does not launch checks or notifications. It just keeps a queue of pending checks and notifications for other daemons of the architecture (like pollers or reactionners). This permits distributing load equally across many pollers. There can be many schedulers for load-balancing or hot standby roles. Status persistence is achieved using a retention module.
- Modules for status retention: pickle, nagios, memcache, redis and MongoDB are available.
- Poller: The poller daemon launches check plugins as requested by schedulers. When the check is finished it returns the result to the schedulers. Pollers can be tagged for specialized checks (ex. Windows versus Unix, customer A versus customer B, DMZ) There can be many pollers for load-balancing or hot standby spare roles.
- Module for data acquisition: NRPE Module
- Module for data acquisition: CommandFile (Used for check_mk integration which depends on the nagios.cmd named pipe )
- Module for data acquisition: SNMPbooster (in development)
- Reactionner: The reactionner daemon issues notifications and launches event_handlers. This centralizes communication channels with external systems in order to simplify SMTP authorizations or RSS feed sources (only one for all hosts/services). There can be many reactionners for load-balancing and spare roles.
- Module for external communications: AndroidSMS
- Broker: The broker daemon exports and manages data from schedulers. The management is done exclusively with modules. Multiple Broker modules can be enabled simultaneously.
- Module for centralizing Shinken logs: Simple-log (flat file)
- Modules for data retention: Pickle, ToNdodb_Mysql, ToNdodb_Oracle, couchdb
- Modules for exporting data: Graphite-Perfdata, NPCDMOD(PNP4Nagios) and Syslog
- Modules for the Livestatus API - status retention and history: SQLite (default), MongoDB (experimental)
- Modules for the Shinken WebUI: GRAPHITE_UI, PNP_UI. Trending and data visualization.
- Modules for compatibility: Service-Perfdata, Host-Perfdata and Status-Dat
- Receiver (optional): The receiver daemon receives passive check data and serves as a distributed passive command buffer that will be read by the arbiter daemon. There can be many receivers for load-balancing and hot standby spare roles. The receiver can also use modules to accept data from different protocols. Anyone serious about using passive check results should use a receiver to ensure that when the arbiter is not available (when updating a configuration) all check results are buffered by the receiver and forwarded when the arbiter is back on-line.
- Module for passive data collection: NSCA, TSCA, Ws_arbiter (web service)
This architecture is fully flexible and scalable: the daemons that require more performance are the poller and the schedulers. The administrator can add as many as he wants. The broker daemon should be on a well provisioned server for larger installations, as only a single broker can be active at one time. A picture is worth a thousand words:
The smart and automatic load balancing
Shinken is able to cut the user configuration into parts and dispatch it to the schedulers. The load balancing is done automatically: the administrator does not need to remember which host is linked with another one to create packs, Shinken does it for him.
The dispatch is a host-based one: that means that all services of a host will be in the same scheduler as this host. The major advantage of Shinken is the ability to create independent configurations: an element of a configuration will not have to call an element of another pack. That means that the administrator does not need to know all relations among elements like parents, hostdependencies or service dependencies: Shinken is able to look at these relations and put these related elements into the same packs.
This action is done in two parts:
- create independent packs of elements
- paste packs to create N configurations for the N schedulers
Creating independent packs
The cutting action is done by looking at two elements: hosts and services. Services are linked with their host, so they will be in the same pack. Other relations are taken into account:
- parent relationship for hosts (like a distant server and its router)
- hostdependencies
- servicedependencies
Shinken looks at all these relations and creates a graph from them. Each connected graph is a relation pack. This can be illustrated by the following picture:
In this example, we will have two packs:
- pack 1: Host-1 to Host-5 and all their services
- pack 2: Host-6 to Host-8 and all their services
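For example, a parent relationship like the following (a hypothetical sketch) is enough to force both hosts, and all of their services, into the same pack:
define host{
    host_name distant-server
    parents site-router ; the parent relation keeps both hosts in one pack
    ...
}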
Aggregating packs into scheduler configurations
When all relation packs are created, the Arbiter aggregates them into N configurations if the administrator has defined N active schedulers (no spares). Packs are aggregated into configurations (think of them as “big packs”). The dispatch looks at the weight property of the schedulers: the higher a scheduler’s weight, the more packs it will get. This is shown in the following picture:
Sending the configurations to satellites
When all configurations are created, the Arbiter sends them to the N active Schedulers. A Scheduler can start processing checks once it has received and loaded its configuration, without having to wait for all schedulers to be ready (v1.2). For larger configurations, having more than one Scheduler, even on a single server, is highly recommended, as they will load their configurations (new or updated) faster. The Arbiter also creates configurations for satellites (pollers, reactionners and brokers) with links to the Schedulers so they know where to get jobs. After sending the configurations, the Arbiter begins to watch for orders from the users and is responsible for monitoring the availability of the satellites.
The high availability
The Shinken architecture is a highly available one. Before looking at how this works, let’s review how the load balancing works, if you haven’t already.
When a node dies
Nobody is perfect. A server can crash, and so can an application. That is why administrators have spares: a spare can take over the configuration of a failed element. For the moment the only daemon that does not have a spare is the Arbiter, but this will be added in the future. The Arbiter regularly checks if everyone is available. If a scheduler or another satellite is dead, its configuration is sent to a spare node, defined by the administrator. All satellites are informed of this change so they can get their jobs from the new element and do not try to reach the dead one. If a node was lost due to a network interruption and it comes back up, the Arbiter will notice and ask the old system to drop its configuration.
The availability parameters can be modified from the default settings when using larger configurations as the Schedulers or Brokers can become busy and delay their availability responses. The timers are aggressive by default for smaller installations. See daemon configuration parameters for more information on the three timers involved.
This can be explained by the following picture:
External commands dispatching
The administrator needs to send orders to the schedulers (like a new status for passive checks). In the Shinken way of thinking, the users only need to send orders to one daemon, which will then dispatch them to all the others. In Nagios the administrator needs to know where the hosts or services are in order to send the order to the right node. In Shinken the administrator just sends the order to the Arbiter, that’s all. External commands can be divided into two types:
- commands that are global to all schedulers
- commands that are specific to one element (host/service).
For each command, Shinken knows whether it is global or not. If it is global, Shinken just sends it to all schedulers. For specific ones, it searches for the scheduler that manages the element referred to by the command (host/service) and sends the order to that scheduler. When a scheduler receives an order, it simply applies it.
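For example, submitting a passive check result through the Arbiter’s command pipe might look like this (a sketch; the pipe path depends on your installation):
#!/bin/sh
# Send a passive service check result to the Arbiter; the Arbiter routes the
# command to the scheduler that manages the webserver/HTTP service.
now=$(date +%s)
printf "[%s] PROCESS_SERVICE_CHECK_RESULT;webserver;HTTP;0;HTTP OK\n" "$now" > /var/lib/shinken/nagios.cmd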
Different types of Pollers: poller_tag
The architecture presented so far works well when every check can run on any poller. But it can be useful to have different types of pollers, like GNU/Linux ones and Windows ones. We already saw that all pollers talk to all schedulers. In fact, pollers can be “tagged” so that they will only execute certain checks.
This is useful when the user needs to have hosts in the same scheduler (like with dependencies) but needs some hosts or services to be checked by specific pollers (see usage cases below).
These checks can in fact be tagged at 3 levels:
- hosts
- services
- commands
The parameter to tag a command, host or service is “poller_tag”. If a check uses a “tagged” or “untagged” command in a tagged host/service, it takes the poller_tag of that host/service. In an “untagged” host/service, it is the command’s tag that is taken into account.
Pollers can be tagged with multiple poller_tags. If they are tagged, they will only take checks that are tagged, not the untagged ones, unless they define the tag “None”.
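A configuration sketch (the host and poller names are hypothetical):
define host{
    host_name dmz-web-01
    poller_tag DMZ
    ...
}
define poller{
    poller_name poller-dmz
    address 10.0.1.5
    port 7771
    poller_tags DMZ ; this poller only takes checks tagged DMZ
}
With this, checks for dmz-web-01 (and for its services and commands that carry no tag of their own) will only be executed by poller-dmz.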
Use cases
This capability is useful in two cases:
- GNU/Linux and Windows pollers
- DMZ
In the first case, it can be useful to have a Windows box in a domain with a poller daemon running under a domain account. If this poller launches WMI queries, the user gets easy Windows monitoring.
The second case is a classic one: when you have a DMZ network, you need a dedicated poller that sits in the DMZ and returns results to a scheduler in the LAN. With this, you can still have dependencies between DMZ hosts and LAN hosts, and still be sure that checks are done by a DMZ-only poller.
Different types of Reactionners: reactionner_tag
Like pollers, reactionners can also have “tags”. So you can tag your hosts/services or commands with “reactionner_tag”. If a notification or an event handler uses a “tagged” or “untagged” command in a tagged host/service, it takes the reactionner_tag of that host/service. In an “untagged” host/service, it is the command’s tag that is taken into account.
Reactionners can be tagged with multiple reactionner_tags. If they are tagged, they will only take notifications and event handlers that are tagged, not the untagged ones, unless they define the tag “None”.
As with pollers, this is mainly useful for DMZ/LAN or GNU/Linux/Windows cases.
Advanced architectures: Realms
Shinken’s architecture allows the administrator to have a unique point of administration with numerous schedulers, pollers, reactionners and brokers. Hosts are dispatched with their own services to schedulers and the satellites (pollers/reactionners/brokers) get jobs from them. Everyone is happy.
Or almost everyone. Think about an administrator who has a distributed architecture around the world. With the architecture described so far, the administrator can put a couple of scheduler/poller daemons in Europe and another set in Asia, but he cannot “tag” hosts in Asia to be checked by the Asian scheduler. And trying to check an Asian server with a European scheduler can be very sub-optimal, read: very sloooow. The hosts are dispatched to all schedulers and satellites, so the administrator cannot be sure that Asian hosts will be checked by the Asian monitoring servers.
The normal Shinken architecture is useful for load balancing with high availability at a single site. To manage different geographic or organizational sites, Shinken provides another mechanism. We will use a generic term for this site management: Realms.
Realms in a few words
A realm is a pool of resources (schedulers, pollers, reactionners and brokers) that hosts or hostgroups can be attached to. A host or hostgroup can be attached to only one realm. All dependencies and parents of these hosts must be in the same realm. One realm can be tagged as the “default” realm, and hosts without a realm will be put into it. In a realm, pollers, reactionners and brokers will only get jobs from schedulers of the same realm.
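A minimal configuration sketch (the realm and host names are hypothetical):
define realm{
    realm_name Europe
    default 1 ; hosts without a realm are put here
}
define realm{
    realm_name Asia
}
define host{
    host_name tokyo-srv-01
    realm Asia
    ...
}
Each scheduler, poller, reactionner and broker definition also takes a realm parameter that ties the daemon to the realm it serves.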
Sub realms
A realm can contain another realm. This does not change anything for schedulers: they are only responsible for the hosts of their own realm, not those of sub realms. The realm tree is useful for satellites like reactionners or brokers: they can get jobs from the schedulers of their realm, but also from schedulers of sub realms. Pollers can also get jobs from sub realms, but it’s less useful, so it’s disabled by default. Warning: having more than one broker per scheduler is not a good idea: the jobs for brokers can be taken by only one broker. For the Arbiter nothing changes: there is still only one Arbiter and one configuration, whatever realms you have.
Example of realm usage
Let’s take a look at two distributed environments. In the first case the administrator wants totally distinct daemons. In the second one he just wants the schedulers/pollers to be distinct, but still have one place to send notifications (reactionners) and one place for database export (brokers).
Distinct realms:
More common usage, the global realm with reactionner/broker, and sub realms with schedulers/pollers:
Satellites can be used by their own realm or by its sub realms too. It’s just a parameter in the configuration of the element.
Detection and Handling of State Flapping
Introduction
Shinken supports optional detection of hosts and services that are “flapping”. Flapping occurs when a service or host changes state too frequently, resulting in a storm of problem and recovery notifications. Flapping can be indicative of configuration problems (i.e. thresholds set too low), troublesome services, or real network problems.
How Flap Detection Works
Before I get into this, let me say that flapping detection has been a little difficult to implement. How exactly does one determine what “too frequently” means in regards to state changes for a particular host or service? When I first started thinking about implementing flap detection I tried to find some information on how flapping could/should be detected. I couldn’t find any information about what others were using (were they using any?), so I decided to settle on what seemed to me to be a reasonable solution...
Whenever Shinken checks the status of a host or service, it will check to see if it has started or stopped flapping. It does this by:
- Storing the results of the last 21 checks of the host or service
- Analyzing the historical check results to determine where state changes/transitions occur
- Using the state transitions to determine a percent state change value (a measure of change) for the host or service
- Comparing the percent state change value against low and high flapping thresholds
A host or service is determined to have started flapping when its percent state change first exceeds a high flapping threshold.
A host or service is determined to have stopped flapping when its percent state change value goes below a low flapping threshold (assuming that it was previously flapping).
Example
Let’s describe in more detail how flap detection works with services...
The image below shows a chronological history of service states from the most recent 21 service checks. OK states are shown in green, WARNING states in yellow, CRITICAL states in red, and UNKNOWN states in orange.
The historical service check results are examined to determine where state changes/transitions occur. State changes occur when an archived state is different from the archived state that immediately precedes it chronologically. Since we keep the results of the last 21 service checks in the array, there is a possibility of having at most 20 state changes. The 20 value can be changed in the main configuration file, see flap_history. In this example there are 7 state changes, indicated by blue arrows in the image above.
The flap detection logic uses the state changes to determine an overall percent state change for the service. This is a measure of volatility/change for the service. Services that never change state will have a 0% state change value, while services that change state each time they’re checked will have 100% state change. Most services will have a percent state change somewhere in between.
When calculating the percent state change for the service, the flap detection algorithm will give more weight to new state changes compared to older ones. Specifically, the flap detection routines are currently designed to make the newest possible state change carry 50% more weight than the oldest possible state change. The image below shows how recent state changes are given more weight than older state changes when calculating the overall or total percent state change for a particular service.
Using the images above, let’s do a calculation of percent state change for the service. You will notice that there are a total of 7 state changes (at t3, t4, t5, t9, t12, t16, and t19). Without any weighting of the state changes over time, this would give us a total state change of 35%:
(7 observed state changes / possible 20 state changes) * 100 = 35 %
Since the flap detection logic gives newer state changes a higher weight than older state changes, the actual calculated percent state change will be slightly less than 35% in this example. Let’s say that the weighted percent of state change turned out to be 31%...
The calculated percent state change for the service (31%) will then be compared against flapping thresholds to see what should happen:
- If the service was not previously flapping and 31% is equal to or greater than the high flap threshold, Shinken considers the service to have just started flapping.
- If the service was previously flapping and 31% is less than the low flap threshold, Shinken considers the service to have just stopped flapping.
If neither of those two conditions are met, the flap detection logic won’t do anything else with the service, since it is either not currently flapping or it is still flapping.
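To make the mechanics concrete, here is a minimal Python sketch of the logic described above. The linear weighting (0.8 for the oldest possible transition up to 1.2 for the newest, i.e. 50% more weight) and the example thresholds are illustrative assumptions, not Shinken’s exact coefficients:
def percent_state_change(states):
    # states: the last 21 check results, oldest first
    n = len(states) - 1  # 20 possible transitions
    total = 0.0
    for i in range(1, len(states)):
        if states[i] != states[i - 1]:  # a state transition occurred here
            total += 0.8 + 0.4 * (i - 1) / (n - 1)  # older changes weigh less
    return total / n * 100.0

def update_flapping(was_flapping, pct, low_threshold=25.0, high_threshold=50.0):
    if not was_flapping and pct >= high_threshold:
        return True  # just started flapping
    if was_flapping and pct < low_threshold:
        return False  # just stopped flapping
    return was_flapping  # otherwise, no change
With 7 transitions out of a possible 20 the unweighted value is 35%; the weighting pulls the result slightly lower, as described in the example above.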
Flap Detection for Services
Shinken checks to see if a service is flapping whenever the service is checked (either actively or passively).
The flap detection logic for services works as described in the example above.
Flap Detection for Hosts
Host flap detection works in a similar manner to service flap detection, with one important difference: Shinken will attempt to check to see if a host is flapping whenever:
- The host is checked (actively or passively)
- Sometimes when a service associated with that host is checked. More specifically, when at least x amount of time has passed since the flap detection was last performed, where x is equal to the average check interval of all services associated with the host.
Why is this done? With services we know that the minimum amount of time between consecutive flap detection routines is going to be equal to the service check interval. However, you might not be monitoring hosts on a regular basis, so there might not be a host check interval that can be used in the flap detection logic. Also, it makes sense that checking a service should count towards the detection of host flapping. Services are, after all, attributes of or things associated with a host. At any rate, that’s the best method I could come up with for determining how often flap detection could be performed on a host, so there you have it.
Flap Detection Thresholds
Shinken uses several variables to determine the percent state change thresholds it uses for flap detection. For both hosts and services, there are global high and low thresholds and host- or service-specific thresholds that you can configure. Shinken will use the global thresholds for flap detection if you do not specify host- or service-specific thresholds.
The global thresholds are set with the low_host_flap_threshold, high_host_flap_threshold, low_service_flap_threshold and high_service_flap_threshold directives in the main configuration file; the host- and service-specific thresholds are set with the low_flap_threshold and high_flap_threshold directives in the host and service definitions.
States Used For Flap Detection
Normally Shinken will track the results of the last 21 checks of a host or service, regardless of the check result (host/service state), for use in the flap detection logic.
You can exclude certain host or service states from use in flap detection logic by using the “flap_detection_options” directive in your host or service definitions. This directive allows you to specify what host or service states (i.e. “UP”, “DOWN”, “OK”, “CRITICAL”) you want to use for flap detection. If you don’t use this directive, all host or service states are used in flap detection.
Flap Handling
When a service or host is first detected as flapping, Shinken will:
- Log a message indicating that the service or host is flapping.
- Add a non-persistent comment to the host or service indicating that it is flapping.
- Send a “flapping start” notification for the host or service to appropriate contacts.
- Suppress other notifications for the service or host (this is one of the filters in the notification logic).
When a service or host stops flapping, Shinken will:
- Log a message indicating that the service or host has stopped flapping.
- Delete the comment that was originally added to the service or host when it started flapping.
- Send a “flapping stop” notification for the host or service to appropriate contacts.
- Remove the block on notifications for the service or host (notifications will still be bound to the normal notification logic).
Enabling Flap Detection
In order to enable the flap detection features in Shinken, you’ll need to:
- Set the enable_flap_detection directive in your main configuration file to 1.
- Set the “flap_detection_enabled” directive in your host and service definitions to 1.
If you want to disable flap detection on a global basis, set the enable_flap_detection directive to 0.
If you would like to disable flap detection for just a few hosts or services, use the “flap_detection_enabled” directive in the host and/or service definitions to do so.
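A configuration sketch of both levels (the flap_detection_options value is just an example):
# Main configuration file
enable_flap_detection=1
# Object configuration
define service{
    host_name webserver
    service_description HTTP
    flap_detection_enabled 1
    flap_detection_options o,w,c ; ignore UNKNOWN results in the flap logic
    ...
}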
Notification Escalations
Introduction
Shinken supports optional escalation of contact notifications for hosts and services. Escalation of host and service notifications is accomplished by defining host escalations and service escalations in your object configuration file(s).
The examples I provide below all make use of service escalation definitions, but host escalations work the same way. Except, of course, that they’re for hosts instead of services. :-)
When Are Notifications Escalated?
Notifications are escalated if and only if one or more escalation definitions match the current notification that is being sent out. If a host or service notification does not have any valid escalation definitions that apply to it, the contact group(s) specified in either the host or service definition will be used for the notification. Look at the example below:
define serviceescalation{
host_name webserver
service_description HTTP
first_notification 3
last_notification 5
notification_interval 90
contact_groups nt-admins,managers
}
define serviceescalation{
host_name webserver
service_description HTTP
first_notification 6
last_notification 10
notification_interval 60
contact_groups nt-admins,managers,everyone
}
Notice that there are “holes” in the notification escalation definitions. In particular, notifications 1 and 2 are not handled by the escalations, nor are any notifications beyond 10. For the first and second notification, as well as all notifications beyond the tenth one, the default contact groups specified in the service definition are used. For all the examples I’ll be using, I’ll assume that the default contact group for the service definition is called nt-admins.
Overlapping Escalation Ranges
Notification escalation definitions can have notification ranges that overlap. Take the following example:
define serviceescalation{
host_name webserver
service_description HTTP
first_notification 3
last_notification 5
notification_interval 20
contact_groups nt-admins,managers
}
define serviceescalation{
host_name webserver
service_description HTTP
first_notification 4
last_notification 0
notification_interval 30
contact_groups on-call-support
}
In the example above:
- The nt-admins and managers contact groups get notified on the third notification
- All three contact groups get notified on the fourth and fifth notifications
- Only the on-call-support contact group gets notified on the sixth (or higher) notification
Recovery Notifications
Recovery notifications are slightly different from problem notifications when it comes to escalations. Take the following example:
define serviceescalation{
host_name webserver
service_description HTTP
first_notification 3
last_notification 5
notification_interval 20
contact_groups nt-admins,managers
}
define serviceescalation{
host_name webserver
service_description HTTP
first_notification 4
last_notification 0
notification_interval 30
contact_groups on-call-support
}
If, after three problem notifications, a recovery notification is sent out for the service, who gets notified? The recovery is actually the fourth notification that gets sent out. However, the escalation code is smart enough to realize that only those people who were notified about the problem on the third notification should be notified about the recovery. In this case, the nt-admins and managers contact groups would be notified of the recovery.
Notification Intervals
You can change the frequency at which escalated notifications are sent out for a particular host or service by using the notification_interval option of the host or service escalation definition. Example:
define serviceescalation{
host_name webserver
service_description HTTP
first_notification 3
last_notification 5
notification_interval 45
contact_groups nt-admins,managers
}
define serviceescalation{
host_name webserver
service_description HTTP
first_notification 6
last_notification 0
notification_interval 60
contact_groups nt-admins,managers,everyone
}
In this example we see that the default notification interval for the services is 240 minutes (this is the value in the service definition). When the service notification is escalated on the 3rd, 4th, and 5th notifications, an interval of 45 minutes will be used between notifications. On the 6th and subsequent notifications, the notification interval will be 60 minutes, as specified in the second escalation definition.
Since it is possible to have overlapping escalation definitions for a particular host or service, Shinken has to make a decision about what to do with the notification interval when escalation definitions overlap. In any case where there are multiple valid escalation definitions for a particular notification, Shinken will choose the smallest notification interval. Take the following example:
define serviceescalation{
host_name webserver
service_description HTTP
first_notification 3
last_notification 5
notification_interval 45
contact_groups nt-admins,managers
}
define serviceescalation{
host_name webserver
service_description HTTP
first_notification 4
last_notification 0
notification_interval 60
contact_groups nt-admins,managers,everyone
}
We see that the two escalation definitions overlap on the 4th and 5th notifications. For these notifications, Shinken will use a notification interval of 45 minutes, since it is the smallest interval present in any valid escalation definitions for those notifications.
One last note about notification intervals deals with intervals of 0. An interval of 0 means that Shinken should only send a notification out for the first valid notification during that escalation definition. All subsequent notifications for the host or service will be suppressed. Take this example:
define serviceescalation{
host_name webserver
service_description HTTP
first_notification 3
last_notification 5
notification_interval 45
contact_groups nt-admins,managers
}
define serviceescalation{
host_name webserver
service_description HTTP
first_notification 4
last_notification 0
notification_interval 0
contact_groups nt-admins,managers
}
define serviceescalation{
host_name webserver
service_description HTTP
first_notification 7
last_notification 0
notification_interval 30
contact_groups nt-admins,managers
}
In the example above, the maximum number of problem notifications that could be sent out about the service would be four. This is because the notification interval of 0 in the second escalation definition indicates that only one notification should be sent out (starting with and including the 4th notification) and all subsequent notifications should be suppressed. Because of this, the third service escalation definition has no effect whatsoever, as there will never be more than four notifications.
Escalations based on time
Escalations can also be based on time instead of notification number. Setup is very easy and works like the notification-number method, but with time instead.
define escalation{
first_notification_time 60
last_notification_time 120
contact_groups nt-admins,managers
}
The values you set for first/last notification time are expressed in units of interval_length (60 seconds by default, i.e. minutes). Here, the escalation starts after the problem has lasted 1 hour and stops after 2 hours. You cannot mix time-based and notification-number-based rules in the same escalation definition, but you can of course have both time-based escalations and notification-number-based escalations applied to your hosts and services.
Escalations based on time with short escalation times
It’s also worth noting that with escalations based on time, if the notification interval is longer than the time of the next escalation, it’s the escalation time that is taken into account.
Let’s take an example where your service has:
define service{
notification_interval 1440
escalations ToLevel2,ToLevel3
}
Then with the escalations objects:
define escalation {
escalation_name ToLevel2
first_notification_time 60
last_notification_time 120
contact_groups level2
}
define escalation {
escalation_name ToLevel3
first_notification_time 120
last_notification_time 0
contact_groups level3
}
Let’s say there is a HARD problem on the service at t=0. Level 1 is notified. The next notification would normally be at t=1440 minutes, so tomorrow. That is fine for classic services (too many notifications are DANGEROUS!) but not for escalated ones.
Here, at t=60 minutes, the escalation will be raised and the level2 contact group will be notified. Then, at t=120 minutes, level3 will be notified, and from then on once a day until they solve it!
So you can keep a large notification_interval and still have quick escalation times; it’s not a problem :)
Time Period Restrictions
Under normal circumstances, escalations can be used at any time that a notification could normally be sent out for the host or service. This “notification time window” is determined by the “notification_period” directive in the host or service definition.
You can optionally restrict escalations so that they are only used during specific time periods by using the “escalation_period” directive in the host or service escalation definition. If you use the “escalation_period” directive to specify a Time Period Definition during which the escalation can be used, the escalation will only be used during that time. If you do not specify any “escalation_period” directive, the escalation can be used at any time within the “notification time window” for the host or service.
Escalated notifications are still subject to the normal time restrictions imposed by the “notification_period” directive in a host or service definition, so the timeperiod you specify in an escalation definition should be a subset of that larger “notification time window”.
State Restrictions
If you would like to restrict the escalation definition so that it is only used when the host or service is in a particular state, you can use the “escalation_options” directive in the host or service escalation definition. If you do not use the “escalation_options” directive, the escalation can be used when the host or service is in any state.
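Putting the two restrictions together, an escalation that only fires during a given timeperiod and only for CRITICAL states might look like this (a sketch; the “weeknights” timeperiod is hypothetical and must be defined elsewhere):
define serviceescalation{
    host_name webserver
    service_description HTTP
    first_notification 3
    last_notification 0
    notification_interval 30
    escalation_period weeknights ; only escalate during this timeperiod
    escalation_options c ; only escalate for CRITICAL states
    contact_groups on-call-support
}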
On-Call Rotations
Introduction
Admins often have to shoulder the burden of answering pagers, cell phone calls, etc. when they least desire them. No one likes to be woken up at 4 am to fix a problem. But it’s often better to fix the problem in the middle of the night, rather than face the wrath of an unhappy boss when you stroll in at 9 am the next morning.
For those lucky admins who have a team of gurus who can help share the responsibility of answering alerts, on-call rotations are often setup. Multiple admins will often alternate taking notifications on weekends, weeknights, holidays, etc.
I’ll show you how you can create timeperiod definitions in a way that can facilitate most on-call notification rotations. These definitions won’t handle human issues that will inevitably crop up (admins calling in sick, swapping shifts, or throwing their pagers into the river), but they will allow you to set up a basic structure that should work the majority of the time.
Scenario 1: Holidays and Weekends
Two admins - John and Bob - are responsible for responding to Shinken alerts. John receives all notifications for weekdays (and weeknights) - except for holidays - and Bob handles notifications during the weekends and holidays. Lucky Bob. Here’s how you can define this type of rotation using timeperiods...
First, define a timeperiod that contains time ranges for holidays:
define timeperiod{
name holidays
timeperiod_name holidays
january 1 00:00-24:00 ; New Year's Day
2008-03-23 00:00-24:00 ; Easter (2008)
2009-04-12 00:00-24:00 ; Easter (2009)
monday -1 may 00:00-24:00 ; Memorial Day (Last Monday in May)
july 4 00:00-24:00 ; Independence Day
monday 1 september 00:00-24:00 ; Labor Day (1st Monday in September)
thursday 4 november 00:00-24:00 ; Thanksgiving (4th Thursday in November)
december 25 00:00-24:00 ; Christmas
december 31 17:00-24:00 ; New Year's Eve (5pm onwards)
}
Next, define a timeperiod for John’s on-call times that include weekdays and weeknights, but excludes the dates/times defined in the holidays timeperiod above:
define timeperiod{
timeperiod_name john-oncall
monday 00:00-24:00
tuesday 00:00-24:00
wednesday 00:00-24:00
thursday 00:00-24:00
friday 00:00-24:00
exclude holidays ; Exclude holiday dates/times defined elsewhere
}
You can now reference this timeperiod in John’s contact definition:
define contact{
contact_name john
...
host_notification_period john-oncall
service_notification_period john-oncall
}
Define a new timeperiod for Bob’s on-call times that include weekends and the dates/times defined in the holidays timeperiod above:
define timeperiod{
timeperiod_name bob-oncall
saturday 00:00-24:00
sunday 00:00-24:00
use holidays ; Also include holiday date/times defined elsewhere
}
You can now reference this timeperiod in Bob’s contact definition:
define contact{
contact_name bob
...
host_notification_period bob-oncall
service_notification_period bob-oncall
}
Scenario 2: Alternating Days
In this scenario John and Bob alternate handling alerts every other day - regardless of whether it’s a weekend, weekday, or holiday.
Define a timeperiod for when John should receive notifications. Assuming today’s date is August 1st, 2007 and John is handling notifications starting today, the definition would look like this:
define timeperiod{
timeperiod_name john-oncall
2007-08-01 / 2 00:00-24:00 ; Every two days, starting August 1st, 2007
}
Now define a timeperiod for when Bob should receive notifications. Bob gets notifications on the days that John doesn’t, so his first on-call day starts tomorrow (August 2nd, 2007).
define timeperiod{
timeperiod_name bob-oncall
2007-08-02 / 2 00:00-24:00 ; Every two days, starting August 2nd, 2007
}
Now you need to reference these timeperiod definitions in the contact definitions for John and Bob:
define contact{
contact_name john
...
host_notification_period john-oncall
service_notification_period john-oncall
}
define contact{
contact_name bob
...
host_notification_period bob-oncall
service_notification_period bob-oncall
}
Scenario 3: Alternating Weeks
In this scenario John and Bob alternate handling alerts every other week. John handles alerts Sunday through Saturday one week, and Bob handles alerts for the following seven days. This continues in perpetuity.
Define a timeperiod for when John should receive notifications. Assuming today’s date is Sunday, July 29th, 2007 and John is handling notifications this week (starting today), the definition would look like this:
define timeperiod{
timeperiod_name john-oncall
2007-07-29 / 14 00:00-24:00 ; Every 14 days (two weeks), starting Sunday, July 29th, 2007
2007-07-30 / 14 00:00-24:00 ; Every other Monday starting July 30th, 2007
2007-07-31 / 14 00:00-24:00 ; Every other Tuesday starting July 31st, 2007
2007-08-01 / 14 00:00-24:00 ; Every other Wednesday starting August 1st, 2007
2007-08-02 / 14 00:00-24:00 ; Every other Thursday starting August 2nd, 2007
2007-08-03 / 14 00:00-24:00 ; Every other Friday starting August 3rd, 2007
2007-08-04 / 14 00:00-24:00 ; Every other Saturday starting August 4th, 2007
}
Now define a timeperiod for when Bob should receive notifications. Bob gets notifications on the weeks that John doesn’t, so his first on-call day starts next Sunday (August 5th, 2007).
define timeperiod{
timeperiod_name bob-oncall
2007-08-05 / 14 00:00-24:00 ; Every 14 days (two weeks), starting Sunday, August 5th, 2007
2007-08-06 / 14 00:00-24:00 ; Every other Monday starting August 6th, 2007
2007-08-07 / 14 00:00-24:00 ; Every other Tuesday starting August 7th, 2007
2007-08-08 / 14 00:00-24:00 ; Every other Wednesday starting August 8th, 2007
2007-08-09 / 14 00:00-24:00 ; Every other Thursday starting August 9th, 2007
2007-08-10 / 14 00:00-24:00 ; Every other Friday starting August 10th, 2007
2007-08-11 / 14 00:00-24:00 ; Every other Saturday starting August 11th, 2007
}
Now you need to reference these timeperiod definitions in the contact definitions for John and Bob:
define contact{
contact_name john
...
host_notification_period john-oncall
service_notification_period john-oncall
}
define contact{
contact_name bob
...
host_notification_period bob-oncall
service_notification_period bob-oncall
}
Scenario 4: Vacation Days
In this scenario, John handles notifications for all days except those he has off. He has several standing days off each month, as well as some planned vacations. Bob handles notifications when John is on vacation or out of the office.
First, define a timeperiod that contains time ranges for John’s vacation days and days off:
define timeperiod{
name john-out-of-office
timeperiod_name john-out-of-office
day 15 00:00-24:00 ; 15th day of each month
day -1 00:00-24:00 ; Last day of each month (28th, 29th, 30th, or 31st)
day -2 00:00-24:00 ; 2nd to last day of each month (27th, 28th, 29th, or 30th)
january 2 00:00-24:00 ; January 2nd each year
june 1 - july 5 00:00-24:00 ; Yearly camping trip (June 1st - July 5th)
2007-11-01 - 2007-11-10 00:00-24:00 ; Vacation to the US Virgin Islands (November 1st-10th, 2007)
}
Next, define a timeperiod for John’s on-call times that excludes the dates/times defined in the timeperiod above:
define timeperiod{
timeperiod_name john-oncall
monday 00:00-24:00
tuesday 00:00-24:00
wednesday 00:00-24:00
thursday 00:00-24:00
friday 00:00-24:00
exclude john-out-of-office ; Exclude dates/times John is out
}
You can now reference this timeperiod in John’s contact definition:
define contact{
contact_name john
...
host_notification_period john-oncall
service_notification_period john-oncall
}
Define a new timeperiod for Bob’s on-call times that include the dates/times that John is out of the office:
define timeperiod{
timeperiod_name bob-oncall
use john-out-of-office ; Include holiday date/times that John is out
}
You can now reference this timeperiod in Bob’s contact definition:
define contact{
contact_name bob
...
host_notification_period bob-oncall
service_notification_period bob-oncall
}
Other Scenarios
There are a lot of other on-call notification rotation scenarios that you might have. The date exception directive in timeperiod definitions is capable of handling most dates and date ranges that you might need to use, so check out the different formats that you can use. If you make a mistake when creating timeperiod definitions, always err on the side of giving someone else more on-call duty time. :-)
Host and Service Dependencies
Introduction
Service and host dependencies are an advanced feature of Shinken that allow you to control the behavior of hosts and services based on the status of one or more other hosts or services. I’ll explain how dependencies work, along with the differences between host and service dependencies.
Service Dependencies Overview
There are a few things you should know about service dependencies:
- A service can be dependent on one or more other services
- A service can be dependent on services which are not associated with the same host
- Service dependencies are not inherited (unless specifically configured to)
- Service dependencies can be used to cause service check execution and service notifications to be suppressed under different circumstances (OK, WARNING, UNKNOWN, and/or CRITICAL states)
- Service dependencies might only be valid during specific timeperiods
Defining Service Dependencies
First, the basics. You create service dependencies by adding service dependency definitions in your object config file(s). In each definition you specify the dependent service, the service you are depending on, and the criteria (if any) that cause the execution and notification dependencies to fail (these are described later).
You can create several dependencies for a given service, but you must add a separate service dependency definition for each dependency you create.
Example Service Dependencies
The image below shows an example logical layout of service notification and execution dependencies. Different services are dependent on other services for notifications and check execution.
In this example, the dependency definitions for Service F on Host C would be defined as follows:
define servicedependency{
host_name Host B
service_description Service D
dependent_host_name Host C
dependent_service_description Service F
execution_failure_criteria o
notification_failure_criteria w,u
}
define servicedependency{
host_name Host B
service_description Service E
dependent_host_name Host C
dependent_service_description Service F
execution_failure_criteria n
notification_failure_criteria w,u,c
}
define servicedependency{
host_name Host B
service_description Service C
dependent_host_name Host C
dependent_service_description Service F
execution_failure_criteria w
notification_failure_criteria c
}
The other dependency definitions shown in the image above would be defined as follows:
define servicedependency{
host_name Host A
service_description Service A
dependent_host_name Host B
dependent_service_description Service D
execution_failure_criteria u
notification_failure_criteria n
}
define servicedependency{
host_name Host A
service_description Service B
dependent_host_name Host B
dependent_service_description Service E
execution_failure_criteria w,u
notification_failure_criteria c
}
define servicedependency{
host_name Host B
service_description Service C
dependent_host_name Host B
dependent_service_description Service E
execution_failure_criteria n
notification_failure_criteria w,u,c
}
How Service Dependencies Are Tested
Before Shinken executes a service check or sends notifications out for a service, it will check to see if the service has any dependencies. If it doesn’t have any dependencies, the check is executed or the notification is sent out as it normally would be. If the service does have one or more dependencies, Shinken will check each dependency entry as follows:
- Shinken gets the current status of the service that is being depended upon.
- Shinken compares the current status of the service that is being depended upon against either the execution or notification failure options in the dependency definition (whichever one is relevant at the time).
- If the current status of the service that is being depended upon matches one of the failure options, the dependency is said to have failed and Shinken will break out of the dependency check loop.
- If the current state of the service that is being depended upon does not match any of the failure options for the dependency entry, the dependency is said to have passed and Shinken will go on and check the next dependency entry.
This cycle continues until either all dependencies for the service have been checked or until one dependency check fails.
One important thing to note is that, by default, Shinken will use the most current hard state of the service(s) being depended upon when it does the dependency checks. If you want Shinken to use the most current state of the services (regardless of whether it’s a soft or hard state), enable the soft_state_dependencies option, as shown below.
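For reference, a minimal sketch of enabling this option in the main configuration file (a value of 0, the default, means hard states are used):
# Main configuration file
# Use the most current state (soft or hard) of master services
# when testing dependencies:
soft_state_dependencies=1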
Execution Dependencies
Execution dependencies are used to restrict when active checks of a service can be performed. Passive checks are not restricted by execution dependencies.
If all of the execution dependency tests for the service passed, Shinken will execute the check of the service as it normally would. If even just one of the execution dependencies for a service fails, Shinken will temporarily prevent the execution of checks for that (dependent) service. At some point in the future the execution dependency tests for the service may all pass. If this happens, Shinken will start checking the service again as it normally would. More information on the check scheduling logic can be found here.
In the example above, Service E would have failed execution dependencies if Service B is in a WARNING or UNKNOWN state. If this were the case, the service check would not be performed and the check would be scheduled for (potential) execution at a later time.
Notification Dependencies
If all of the notification dependency tests for the service passed, Shinken will send notifications out for the service as it normally would. If even just one of the notification dependencies for a service fails, Shinken will temporarily repress notifications for that (dependent) service. At some point in the future the notification dependency tests for the service may all pass. If this happens, Shinken will start sending out notifications again as it normally would for the service. More information on the notification logic can be found here.
In the example above, Service F would have failed notification dependencies if Service C is in a CRITICAL state, and/or Service D is in a WARNING or UNKNOWN state, and/or Service E is in a WARNING, UNKNOWN, or CRITICAL state. If this were the case, notifications for the service would not be sent out.
Dependency Inheritance
As mentioned before, service dependencies are not inherited by default. In the example above you can see that Service F is dependent on Service E. However, it does not automatically inherit Service E’s dependencies on Service B and Service C. In order to make Service F dependent on Service C we had to add another service dependency definition. There is no dependency definition for Service B, so Service F is not dependent on Service B.
If you do wish to make service dependencies inheritable, you must use the inherits_parent directive in the service dependency definition. When this directive is enabled, it indicates that the dependency inherits dependencies of the service that is being depended upon (also referred to as the master service). In other words, if the master service is dependent upon other services and any one of those dependencies fails, this dependency will also fail.
In the example above, imagine that you want to add a new dependency for service F to make it dependent on service A. You could create a new dependency definition that specified service F as the dependent service and service A as being the master service (i.e. the service that is being depended on). You could alternatively modify the dependency definition for services D and F to look like this:
define servicedependency{
host_name Host B
service_description Service D
dependent_host_name Host C
dependent_service_description Service F
execution_failure_criteria o
notification_failure_criteria n
inherits_parent 1
}
Since the inherits_parent directive is enabled, the dependency between services A and D will be tested when the dependency between services F and D is being tested.
Dependencies can have multiple levels of inheritance. If the dependency definition between A and D had its inherits_parent directive enabled and service A was dependent on some other service (let’s call it service G), service F would be dependent on services D, A, and G (each with potentially different criteria).
Host Dependencies
As you’d probably expect, host dependencies work in a similar fashion to service dependencies. The difference is that they’re for hosts, not services.
Do not confuse host dependencies with parent/child host relationships. You should be using parent/child host relationships (defined with the parents directive in host definitions) for most cases, rather than host dependencies. A description of how parent/child host relationships work can be found in the documentation on network reachability, and a sketch is given below.
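For contrast, a minimal sketch of a parent/child relationship (the host names are illustrative):
define host{
host_name switch1
...
}
define host{
host_name webserver1
parents switch1 ; reachability relationship, not a dependency
...
}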
Here are the basics about host dependencies:
- A host can be dependent on one or more other hosts
- Host dependencies are not inherited (unless specifically configured to)
- Host dependencies can be used to cause host check execution and host notifications to be suppressed under different circumstances (UP, DOWN, and/or UNREACHABLE states)
- Host dependencies might only be valid during specific timeperiods
Example Host Dependencies
The image below shows an example of the logical layout of host notification dependencies. Different hosts are dependent on other hosts for notifications.
In the example above, the dependency definitions for Host C would be defined as follows:
define hostdependency{
host_name Host A
dependent_host_name Host C
notification_failure_criteria d
}
define hostdependency{
host_name Host B
dependent_host_name Host C
notification_failure_criteria d,u
}
As with service dependencies, host dependencies are not inherited. In the example image you can see that Host C does not inherit the host dependencies of Host B. In order for Host C to be dependent on Host A, a new host dependency definition must be defined.
Host notification dependencies work in a similar manner to service notification dependencies. If all of the notification dependency tests for the host pass, Shinken will send notifications out for the host as it normally would. If even just one of the notification dependencies for a host fails, Shinken will temporarily repress notifications for that (dependent) host. At some point in the future the notification dependency tests for the host may all pass. If this happens, Shinken will start sending out notifications again as it normally would for the host. More information on the notification logic can be found here.
State Stalking
Introduction
State “stalking” is a feature which is probably not going to be used by most users. When enabled, it allows you to log changes in the output of service and host checks even if the state of the host or service does not change. When stalking is enabled for a particular host or service, Shinken will watch that host or service very carefully and log any changes it sees in the output of check results. As you’ll see, it can be very helpful in later analysis of the log files.
How Does It Work?
Under normal circumstances, the result of a host or service check is only logged if the host or service has changed state since it was last checked. There are a few exceptions to this, but for the most part, that’s the rule.
If you enable stalking for one or more states of a particular host or service, Shinken will log the results of the host or service check if the output from the check differs from the output from the previous check. Take the following example of eight consecutive checks of a service:
Service Check # | Service State | Service Check Output | Logged Normally | Logged With Stalking
x | OK | RAID array optimal | - | -
x+1 | OK | RAID array optimal | - | -
x+2 | WARNING | RAID array degraded (1 drive bad, 1 hot spare rebuilding) | yes | yes
x+3 | CRITICAL | RAID array degraded (2 drives bad, 1 hot spare online, 1 hot spare rebuilding) | yes | yes
x+4 | CRITICAL | RAID array degraded (3 drives bad, 2 hot spares online) | - | yes
x+5 | CRITICAL | RAID array failed | - | yes
x+6 | CRITICAL | RAID array failed | - | -
x+7 | CRITICAL | RAID array failed | - | -
Given this sequence of checks, you would normally only see two log entries for this catastrophe. The first one would occur at service check x+2 when the service changed from an OK state to a WARNING state. The second log entry would occur at service check x+3 when the service changed from a WARNING state to a CRITICAL state.
For whatever reason, you may want the complete history of this catastrophe in your log files: perhaps to help explain to your manager how quickly the situation got out of control, perhaps just to laugh at it over a couple of drinks at the local pub...
Well, if you had enabled stalking of this service for CRITICAL states, you would have events at x+4 and x+5 logged in addition to the events at x+2 and x+3. Why is this? With state stalking enabled, Shinken would have examined the output from each service check to see if it differed from the output of the previous check. If the output differed and the state of the service didn’t change between the two checks, the result of the newer service check would get logged.
A similar example of stalking might be on a service that checks your web server. If the check_http plugin first returns a WARNING state because of a 404 error and on subsequent checks returns a WARNING state because of a particular pattern not being found, you might want to know that. If you didn’t enable state stalking for WARNING states of the service, only the first WARNING state event (the 404 error) would be logged and you wouldn’t have any idea (looking back in the archived logs) that future WARNING states were not due to a 404, but rather some text pattern that could not be found in the returned web page.
Should I Enable Stalking?
First, you must decide if you have a real need to analyze archived log data to find the exact cause of a problem. You may decide you need this feature for some hosts or services, but not for all. You may also find that you only have a need to enable stalking for some host or service states, rather than all of them. For example, you may decide to enable stalking for WARNING and CRITICAL states of a service, but not for OK and UNKNOWN states.
The decision to enable state stalking for a particular host or service will also depend on the plugin that you use to check that host or service. If the plugin always returns the same text output for a particular state, there is no reason to enable stalking for that state.
How Do I Enable Stalking?
You can enable state stalking for hosts and services by using the stalking_options directive in host and service definitions.
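For example, here is a sketch of a service definition that stalks output changes in WARNING and CRITICAL states (the host, service, and command names are illustrative):
define service{
host_name db1
service_description RAID
check_command check_raid
stalking_options w,c ; log output changes while WARNING or CRITICAL
...
}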
How Does Stalking Differ From Volatile Services?
Volatile services are similar, but will cause notifications and event handlers to run. Stalking is purely for logging purposes.
Caveats
You should be aware that there are some potential pitfalls with enabling stalking. These all relate to the reporting functions found in various CGIs (histogram, alert summary, etc.). Because state stalking will cause additional alert entries to be logged, the data produced by the reports will show evidence of inflated numbers of alerts.
As a general rule, I would suggest that you not enable stalking for hosts and services without thinking things through. Still, it’s there if you need and want it.
Performance Data
Introduction
Shinken is designed to allow plugins to return optional performance data in addition to normal status data, as well as allow you to pass that performance data to external applications for processing. A description of the different types of performance data, as well as information on how to go about processing that data is described below...
Scheduled Downtime
Introduction
Shinken allows you to schedule periods of planned downtime for the hosts and services that you’re monitoring. This is useful when you know in advance that you’re going to be taking a server down for an upgrade, etc.
Scheduling Downtime
You can schedule downtime with your favorite UI, or by submitting an external command from the command line, as sketched below.
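As a sketch, the external command for scheduling host downtime has the following format (the timestamp, host name, and values in the second line are illustrative; times are UNIX timestamps):
SCHEDULE_HOST_DOWNTIME;<host_name>;<start_time>;<end_time>;<fixed>;<trigger_id>;<duration>;<author>;<comment>
[1388950000] SCHEDULE_HOST_DOWNTIME;web1;1388952000;1388959200;1;0;7200;jdoe;Planned kernel upgrade
Setting “fixed” to 1 schedules fixed downtime; setting it to 0 schedules flexible downtime, in which case “duration” (in seconds) controls how long the downtime lasts once it starts. A non-zero “trigger_id” ties the start of this downtime to another downtime entry (see the sections below).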
Once you schedule downtime for a host or service, Shinken will add a comment to that host/service indicating that it is scheduled for downtime during the period of time you indicated. When that period of downtime passes, Shinken will automatically delete the comment that it added. Nice, huh?
Fixed vs. Flexible Downtime
When you schedule downtime for a host or service through the web interface you’ll be asked if the downtime is fixed or flexible. Here’s an explanation of how “fixed” and “flexible” downtime differs:
“Fixed” downtime starts and stops at the exact start and end times that you specify when you schedule it. Okay, that was easy enough...
“Flexible” downtime is intended for times when you know that a host or service is going to be down for X minutes (or hours), but you don’t know exactly when that’ll start. When you schedule flexible downtime, Shinken will start the scheduled downtime sometime between the start and end times you specified. The downtime will last for as long as the duration you specified when you scheduled the downtime. This assumes that the host or service for which you scheduled flexible downtime either goes down (or becomes unreachable) or goes into a non-OK state sometime between the start and end times you specified. The time at which a host or service transitions to a problem state determines the time at which Shinken actually starts the downtime. The downtime will then last for the duration you specified, even if the host or service recovers before the downtime expires. This is done for a very good reason. As we all know, you might think you’ve got a problem fixed, but then have to restart a server ten times before it actually works right. Smart, eh?
Triggered Downtime
When scheduling host or service downtime you have the option of making it “triggered” downtime. What is triggered downtime, you ask? With triggered downtime the start of the downtime is triggered by the start of some other scheduled host or service downtime. This is extremely useful if you’re scheduling downtime for a large number of hosts or services and the start time of the downtime period depends on the start time of another downtime entry. For instance, if you schedule flexible downtime for a particular host (because it’s going down for maintenance), you might want to schedule triggered downtime for all of that host’s “children”.
How Scheduled Downtime Affects Notifications
When a host or service is in a period of scheduled downtime, Shinken will not allow normal notifications to be sent out for the host or service. However, a “DOWNTIMESTART” notification will get sent out for the host or service, which will serve to put any admins on notice that they won’t receive upcoming problem alerts.
When the scheduled downtime is over, Shinken will allow normal notifications to be sent out for the host or service again. A “DOWNTIMEEND” notification will get sent out notifying admins that the scheduled downtime is over, and they will start receiving normal alerts again.
If the scheduled downtime is cancelled prematurely (before it expires), a “DOWNTIMECANCELLED” notification will get sent out to the appropriate admins.
Overlapping Scheduled Downtime
I like to refer to this as the “Oh crap, it’s not working” syndrome. You know what I’m talking about. You take a server down to perform a “routine” hardware upgrade, only to later realize that the OS drivers aren’t working, the RAID array blew up, or the drive imaging failed and left your original disks useless to the world. The moral of the story is that any routine work on a server is quite likely to take three or four times as long as you had originally planned...
Let’s take the following scenario:
- You schedule downtime for host A from 7:30pm-9:30pm on a Monday
- You bring the server down about 7:45pm Monday evening to start a hard drive upgrade
- After wasting an hour and a half battling with SCSI errors and driver incompatibilities, you finally get the machine to boot up
- At 9:15 you realize that one of your partitions is either hosed or doesn’t seem to exist anywhere on the drive
- Knowing you’re in for a long night, you go back and schedule additional downtime for host A from 9:20pm Monday evening to 1:30am Tuesday morning.
If you schedule overlapping periods of downtime for a host or service (in this case the periods were 7:30pm-9:30pm and 9:20pm-1:30am), Shinken will wait until the last period of scheduled downtime is over before it allows notifications to be sent out for that host or service. In this example notifications would be suppressed for host A until 1:30am Tuesday morning.
Predictive Dependency Checks
Note
Predictive dependency checks are not currently managed by Shinken.
Introduction
Host and service dependencies can be defined to allow you greater control over when checks are executed and when notifications are sent out. As dependencies are used to control basic aspects of the monitoring process, it is crucial to ensure that status information used in the dependency logic is as up to date as possible.
Shinken allows you to enable predictive dependency checks for hosts and services to ensure that the dependency logic will have the most up-to-date status information when it comes to making decisions about whether to send out notifications or allow active checks of a host or service.
How Do Predictive Checks Work?
The image below shows a basic diagram of hosts that are being monitored by Shinken, along with their parent/child relationships and dependencies.
The Switch2 host in this example has just changed state from an UP state to a problem state. Shinken needs to determine whether the host is DOWN or UNREACHABLE, so it will launch parallel checks of Switch2’s immediate parents (Firewall1) and children (Comp1, Comp2, and Switch3). This is a normal function of the host reachability logic.
You will also notice that Switch2 is depending on Monitor1 and File1 for either notifications or check execution (which one is unimportant in this example). If predictive host dependency checks are enabled, Shinken will launch parallel checks of Monitor1 and File1 at the same time it launches checks of Switch2’s immediate parents and children. Shinken does this because it knows that it will have to test the dependency logic in the near future (e.g. for purposes of notification) and it wants to make sure it has the most current status information for the hosts that take part in the dependency.
That’s how predictive dependency checks work. Simple, eh?
Predictive service dependency checks work in a similar manner to what is described above. Except, of course, they deal with services instead of hosts.
Enabling Predictive Checks
Predictive dependency checks involve rather little overhead, so I would recommend that you enable them. In most cases, the benefits of having accurate information for the dependency logic outweighs the extra overhead imposed by these checks.
Enabling predictive dependency checks is easy:
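In Nagios, this is done with two options in the main configuration file (per the note above, Shinken does not currently honor them):
enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1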
Cached Checks
Predictive dependency checks are on-demand checks and are therefore subject to the rules of cached checks. Cached checks can provide you with performance improvements by allowing Shinken to forgo running an actual host or service check if it can use a relatively recent check result instead. More information on cached checks can be found here.
Cached Checks
Introduction
The performance of Shinken’s monitoring logic can be significantly improved by implementing the use of cached checks. Cached checks allow Shinken to forgo executing a host or service check command if it determines a relatively recent check result will do instead.
For On-Demand Checks Only
Regularly scheduled host and service checks will not see a performance improvement with the use of cached checks. Cached checks are only useful for improving the performance of on-demand host and service checks. Scheduled checks help to ensure that host and service states are updated regularly, which increases the chance that their results can be used as cached checks in the future.
For reference, on-demand host checks occur...
And on-demand service checks occur...
Unless you make use of service dependencies, Shinken will not be able to use cached check results to improve the performance of service checks. Don’t worry about that - it’s normal. Cached host checks are where the big performance improvements lie, and everyone should see a benefit there.
How Caching Works
When Shinken needs to perform an on-demand host or service check, it will make a determination as to whether it can use a cached check result or whether it needs to perform an actual check by executing a plugin. It does this by checking to see if the last check of the host or service occurred within the last X seconds, where X is the cached host or service check horizon.
If the last check was performed within the timeframe specified by the cached check horizon variable, Shinken will use the result of the last host or service check and will not execute a new check. If the host or service has not yet been checked, or if the last check falls outside of the cached check horizon timeframe, Shinken will execute a new host or service check by running a plugin.
What This Really Means
Shinken performs on-demand checks because it needs to know the current state of a host or service at that exact moment in time. Utilizing cached checks allows you to make Shinken think that recent check results are “good enough” for determining the current state of hosts, and that it doesn’t need to go out and actually re-check the status of that host or service.
The cached check horizon tells Shinken how recent check results must be in order to reliably reflect the current state of a host or service. For example, with a cached check horizon of 30 seconds, you are telling Shinken that if a host’s state was checked sometime in the last 30 seconds, the result of that check should still be considered the current state of the host.
The number of cached check results that Shinken can use versus the number of on-demand checks it has to actually execute can be considered the cached check “hit” rate. By increasing the cached check horizon to equal the regular check interval of a host, you could theoretically achieve a cache hit rate of 100%. In that case all on-demand checks of that host would use cached check results. What a performance improvement! But is it really? Probably not.
The reliability of cached check result information decreases over time. Higher cache hit rates require that previous check results are considered “valid” for longer periods of time. Things can change quickly in any network scenario, and there’s no guarantee that a server that was functioning properly 30 seconds ago isn’t on fire right now. There’s the tradeoff - reliability versus speed. If you have a large cached check horizon, you risk having unreliable check result values being used in the monitoring logic.
Shinken will eventually determine the correct state of all hosts and services, so even when cached check results misrepresent the true state, it will only work with incorrect information for a short period of time. Even short periods of unreliable status information can prove to be a nuisance for admins, as they may receive notifications about problems which no longer exist.
There is no standard cached check horizon or cache hit rate that will be acceptable to every user. Some people will want a short horizon timeframe and a low cache hit rate, while others will want a larger horizon timeframe and a larger cache hit rate (with a lower reliability rate). Some users may even want to disable cached checks altogether to obtain 100% reliability. Testing different horizon timeframes, and their effect on the reliability of status information, is the only way that an individual user will find the “right” value for their situation. More information on this is discussed below.
Configuration Variables
The following variables determine the timeframes in which a previous host or service check result may be used as a cached host or service check result:
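These are the cached check horizon options in the main configuration file; a sketch with 15-second horizons (the value used by the example installation described below):
cached_host_check_horizon=15
cached_service_check_horizon=15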
Optimizing Cache Effectiveness
In order to make the most effective use of cached checks, you should:
- Schedule regular checks of your hosts
- Use MRTG to graph statistics for 1) on-demand checks and 2) cached checks
- Adjust cached check horizon variables to fit your needs
You can schedule regular checks of your hosts by specifying a value greater than 0 for the check_interval option in your host definitions, as sketched below.
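A sketch (with the default interval_length of 60 seconds, this schedules a regular host check every 5 minutes):
define host{
host_name somehost
check_interval 5
...
}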
A good way to determine the proper value for the cached check horizon options is to compare how many on-demand checks Shinken has to actually run versus how many it can use cached values for. The nagiostats utility can produce information on cached checks, which can then be graphed with MRTG. Example MRTG graphs that show cached vs. actual on-demand checks are shown to the right.
The monitoring installation which produced the graphs above had:
- A total of 44 hosts, all of which were checked at regular intervals
- An average (regularly scheduled) host check interval of 5 minutes
- A cached_host_check_horizon of 15 seconds
The first MRTG graph shows how many regularly scheduled host checks compared to how many cached host checks have occurred. In this example, an average of 53 host checks occur every five minutes. 9 of these (17%) are on-demand checks.
The second MRTG graph shows how many cached host checks have occurred over time. In this example an average of 2 cached host checks occurs every five minutes.
Remember, cached checks are only available for on-demand checks. Based on the 5 minute averages from the graphs, we see that Nagios is able to use cached host check results every 2 out of 9 times an on-demand check has to be run. That may not seem like much, but these graphs represent a small monitoring environment. Consider that 2 out of 9 is 22% and you can start to see how this could significantly help improve host check performance in large environments. That percentage could be higher if the cached host check horizon variable value was increased, but that would reduce the reliability of the cached host state information.
Once you’ve had a few hours or days worth of MRTG graphs, you should see how many host and service checks were done by executing plugins versus those that used cached check results. Use that information to adjust the cached check horizon variables appropriately for your situation. Continue to monitor the MRTG graphs over time to see how changing the horizon variables affected cached check statistics. Rinse and repeat as necessary.
Object Inheritance
Introduction
This documentation attempts to explain object inheritance and how it can be used in your object definitions.
If you are confused about how recursion and inheritance work after reading this, take a look at the sample object config files provided in the Shinken distribution. If that still doesn’t help, have a look at the Shinken resources for help.
Basics
There are three variables affecting recursion and inheritance that are present in all object definitions. They are shown in the generic definition below...
define someobjecttype{
object-specific variables ...
name template_name
use name_of_template_to_use
register [0/1]
}
The first variable is “name”. It’s just a “template” name that can be referenced in other object definitions so they can inherit the object’s properties/variables. Template names must be unique amongst objects of the same type, so you can’t have two or more host definitions that have “hosttemplate” as their template name.
The second variable is “use”. This is where you specify the name of the template object that you want to inherit properties/variables from. The name you specify for this variable must be defined as another object’s template name (using the name variable).
The third variable is “register”. This variable is used to indicate whether or not the object definition should be “registered” with Shinken. By default, all object definitions are registered. If you are using a partial object definition as a template, you would want to prevent it from being registered (an example of this is provided later). Values are as follows: 0 = do NOT register object definition, 1 = register object definition (this is the default). This variable is NOT inherited; every (partial) object definition used as a template must explicitly set the “register” directive to be 0. This prevents the need to override an inherited “register” directive with a value of 1 for every object that should be registered.
Local Variables vs. Inherited Variables
One important thing to understand with inheritance is that “local” object variables always take precedence over variables defined in the template object. Take a look at the following example of two host definitions (not all required variables have been supplied):
define host{
host_name bighost1
check_command check-host-alive
notification_options d,u,r
max_check_attempts 5
name hosttemplate1
}
define host{
host_name bighost2
max_check_attempts 3
use hosttemplate1
}
You’ll note that the definition for host bighost1 has been defined as having hosttemplate1 as its template name. The definition for host bighost2 is using the definition of bighost1 as its template object. Once Shinken processes this data, the resulting definition of host bighost2 would be equivalent to this definition:
define host{
host_name bighost2
check_command check-host-alive
notification_options d,u,r
max_check_attempts 3
}
You can see that the “check_command” and “notification_options” variables were inherited from the template object (where host bighost1 was defined). However, the host_name and max_check_attempts variables were not inherited from the template object because they were defined locally. Remember, locally defined variables override variables that would normally be inherited from a template object. That should be a fairly easy concept to understand.
If you would like local string variables to be appended to inherited string values, you can do so. Read more about how to accomplish this below.
Inheritance Chaining
Objects can inherit properties/variables from multiple levels of template objects. Take the following example:
define host{
host_name bighost1
check_command check-host-alive
notification_options d,u,r
max_check_attempts 5
name hosttemplate1
}
define host{
host_name bighost2
max_check_attempts 3
use hosttemplate1
name hosttemplate2
}
define host{
host_name bighost3
use hosttemplate2
}
You’ll notice that the definition of host bighost3 inherits variables from the definition of host bighost2, which in turn inherits variables from the definition of host bighost1. Once Shinken processes this configuration data, the resulting host definitions are equivalent to the following:
define host{
host_name bighost1
check_command check-host-alive
notification_options d,u,r
max_check_attempts 5
}
define host{
host_name bighost2
check_command check-host-alive
notification_options d,u,r
max_check_attempts 3
}
define host{
host_name bighost3
check_command check-host-alive
notification_options d,u,r
max_check_attempts 3
}
There is no inherent limit on how “deep” inheritance can go, but you’ll probably want to limit yourself to at most a few levels in order to maintain sanity.
Using Incomplete Object Definitions as Templates
It is possible to use incomplete object definitions as templates for use by other object definitions. By “incomplete” definition, I mean that all required variables in the object have not been supplied in the object definition. It may sound odd to use incomplete definitions as templates, but it is in fact recommended that you use them. Why? Well, they can serve as a set of defaults for use in all other object definitions. Take the following example:
define host{
check_command check-host-alive
notification_options d,u,r
max_check_attempts 5
name generichosttemplate
register 0
}
define host{
host_name bighost1
address 192.168.1.3
use generichosttemplate
}
define host{
host_name bighost2
address 192.168.1.4
use generichosttemplate
}
Notice that the first host definition is incomplete because it is missing the required “host_name” variable. We don’t need to supply a host name because we just want to use this definition as a generic host template. In order to prevent this definition from being registered with Shinken as a normal host, we set the “register” variable to 0.
The definitions of hosts bighost1 and bighost2 inherit their values from the generic host definition. The only variable we’ve chosen to override is the “address” variable. This means that both hosts will have the exact same properties, except for their “host_name” and “address” variables. Once Shinken processes the config data in the example, the resulting host definitions would be equivalent to specifying the following:
define host{
host_name bighost1
address 192.168.1.3
check_command check-host-alive
notification_options d,u,r
max_check_attempts 5
}
define host{
host_name bighost2
address 192.168.1.4
check_command check-host-alive
notification_options d,u,r
max_check_attempts 5
}
At the very least, using a template definition for default variables will save you a lot of typing. It’ll also save you a lot of headaches later if you want to change the default values of variables for a large number of hosts.
Custom Object Variables
Any custom object variables that you define in your host, service, or contact definition templates will be inherited just like other standard variables. Take the following example:
define host{
_customvar1 somevalue ; <-- Custom host variable
_snmp_community public ; <-- Custom host variable
name generichosttemplate
register 0
}
define host{
host_name bighost1
address 192.168.1.3
use generichosttemplate
}
The host bighost1 will inherit the custom host variables “_customvar1” and “_snmp_community”, as well as their respective values, from the generichosttemplate definition. The effective result is a definition for bighost1 that looks like this:
define host{
host_name bighost1
address 192.168.1.3
_customvar1 somevalue
_snmp_community public
}
Cancelling Inheritance of String Values
In some cases you may not want your host, service, or contact definitions to inherit values of string variables from the templates they reference. If this is the case, you can specify “null” (without quotes) as the value of the variable that you do not want to inherit. Take the following example:
define host{
event_handler my-event-handler-command
name generichosttemplate
register 0
}
define host{
host_name bighost1
address 192.168.1.3
event_handler null
use generichosttemplate
}
In this case, the host bighost1 will not inherit the value of the “event_handler” variable that is defined in the generichosttemplate. The resulting effective definition of bighost1 is the following:
define host{
host_name bighost1
address 192.168.1.3
}
Additive Inheritance of String Values
Shinken gives preference to local variables instead of values inherited from templates. In most cases local variable values override those that are defined in templates. In some cases it makes sense to allow Shinken to use the values of inherited and local variables together.
This “additive inheritance” can be accomplished by prepending the local variable value with a plus sign (+). This feature is only available for standard (non-custom) variables that contain string values. Take the following example:
define host{
hostgroups all-servers
name generichosttemplate
register 0
}
define host{
host_name linuxserver1
hostgroups +linux-servers,web-servers
use generichosttemplate
}
In this case, the host linuxserver1 will append the value of its local “hostgroups” variable to that from generichosttemplate. The resulting effective definition of linuxserver1 is the following:
define host{
host_name linuxserver1
hostgroups all-servers,linux-servers,web-servers
}
Important
If you use a field twice via several templates, the value of the field will be the first one found!
In the example above, field values in all-servers won’t be replaced. Be careful with overlapping fields!
Implied Inheritance
Normally you have to either explicitly specify the value of a required variable in an object definition or inherit it from a template. There are a few exceptions to this rule, where Shinken will assume that you want to use a value that instead comes from a related object. For example, the values of some service variables will be copied from the host the service is associated with if you don’t otherwise specify them.
The following table lists the object variables that will be implicitly inherited from related objects if you don’t explicitly specify their value in your object definition or inherit them from a template.
Object Type | Object Variable | Implied Source
Services | contact_groups | contact_groups in the associated host definition
Services | notification_interval | notification_interval in the associated host definition
Services | notification_period | notification_period in the associated host definition
Services | check_period | check_period in the associated host definition
Host Escalations | contact_groups | contact_groups in the associated host definition
Host Escalations | notification_interval | notification_interval in the associated host definition
Host Escalations | escalation_period | notification_period in the associated host definition
Service Escalations | contact_groups | contact_groups in the associated service definition
Service Escalations | notification_interval | notification_interval in the associated service definition
Service Escalations | escalation_period | notification_period in the associated service definition
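As an illustration (the names are made up), this service picks up its contact and notification settings from its host:
define host{
host_name mailhost
contact_groups mail-admins
notification_interval 30
notification_period 24x7
check_period 24x7
...
}
define service{
host_name mailhost
service_description SMTP
check_command check_smtp
; no contact_groups, notification_interval, notification_period, or
; check_period given: they are implied from the mailhost definition above
...
}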
Implied/Additive Inheritance in Escalations
Service and host escalation definitions can make use of a special rule that combines the features of implied and additive inheritance. If escalations 1) do not inherit the values of their “contact_groups” or “contacts” directives from another escalation template and 2) their “contact_groups” or “contacts” directives begin with a plus sign (+), then the values of their corresponding host or service definition’s “contact_groups” or “contacts” directives will be used in the additive inheritance logic.
Confused? Here’s an example:
define host{
name linux-server
contact_groups linux-admins
...
}
define hostescalation{
host_name linux-server
contact_groups +management
...
}
This is a much simpler equivalent to:
define hostescalation{
host_name linux-server
contact_groups linux-admins,management
...
}
Multiple Inheritance Sources
Thus far, all examples of inheritance have shown object definitions inheriting variables/values from just a single source. You are also able to inherit variables/values from multiple sources for more complex configurations, as shown below.
# Generic host template
define host{
name generic-host
active_checks_enabled 1
check_interval 10
register 0
}
# Development web server template
define host{
name development-server
check_interval 15
notification_options d,u,r
...
register 0
}
# Development web server
define host{
use generic-host,development-server
host_name devweb1
...
}
In the example above, devweb1 is inheriting variables/values from two sources: generic-host and development-server. You’ll notice that a check_interval variable is defined in both sources. Since generic-host was the first template specified in devweb1’s use directive, its value for the “check_interval” variable is inherited by the devweb1 host. After inheritance, the effective definition of devweb1 would be as follows:
# Development web server
define host{
host_name devweb1
active_checks_enabled 1
check_interval 10
notification_options d,u,r
...
}
Precedence With Multiple Inheritance Sources
When you use multiple inheritance sources, it is important to know how Shinken handles variables that are defined in multiple sources. In these cases Shinken will use the variable/value from the first source that is specified in the use directive. Since inheritance sources can themselves inherit variables/values from one or more other sources, it can get tricky to figure out what variable/value pairs take precedence.
Consider the following host definition that references three templates:
# Development web server
define host{
use 1, 4, 8
host_name devweb1
...
}
If some of those referenced templates themselves inherit variables/values from one or more other templates, the precedence rules are shown below. Testing, trial, and error will help you better understand exactly how things work in complex inheritance situations like this. :-)
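As a minimal sketch of the precedence rule with two sources that define the same variable (the template names are made up):
define host{
name tmpl-a
check_interval 10
register 0
}
define host{
name tmpl-b
check_interval 15
notification_options d,u,r
register 0
}
define host{
use tmpl-a,tmpl-b
host_name example1
}
# example1 ends up with check_interval 10: tmpl-a is listed first in the
# use directive, so its value wins; notification_options d,u,r still
# comes from tmpl-b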
Inheritance overriding
Inheritance is a core feature that allows you to factorize configuration. From a host or service template it is possible to build a very large set of checks with relatively few lines. The drawback of this approach is that it requires all hosts and services to be consistent. While it is easy to instantiate new hosts with their own attribute sets, it is generally more complicated with services, because the order of magnitude is larger (hosts * services per host), and because few attributes may come from the host. This is especially true for packs, which are a generalization of this inheritance usage.
If some hosts require special directives for the services they host (values that differ from those defined at the template level), it is generally necessary to define a new service.
Imagine two web server clusters, one for the frontend and the other for the backend, where the frontend servers should send notifications for any HTTP service in CRITICAL or WARNING state, and the backend servers should only notify on CRITICAL state.
To implement this configuration, we may define 2 different HTTP services with different notification options.
Example:
define service {
service_description HTTP Front
hostgroup_name web-front
notification_options c,w,r
...
}
define service {
service_description HTTP Back
hostgroup_name web-back
notification_options c,r
...
}
define host {
host_name web-front-01
hostgroups web-front
...
}
...
define host {
host_name web-back-01
hostgroups web-back
...
}
...
Another way is to inherit attributes on the service side directly from the host: some service attributes may be inherited directly from the host if they are not defined on the service template side (see Implied Inheritance), but not all. The notification_options in our example cannot be picked up from the host.
If the attribute you want to set to a custom value cannot be inherited from the host, you may use the service_overrides host directive. Its role is to enforce a service directive directly from the host. This allows you to define specific service instance attributes from a single generalized service definition.
Its syntax is:
service_overrides xxx,yyy zzz
It could be summarized as: “For the service bound to me named xxx, I want the directive yyy set to zzz rather than the inherited value.”
Example:
define service {
service_description HTTP
hostgroup_name web
notification_options c,w,r
...
}
define host {
host_name web-front-01
hostgroups web
...
}
...
define host {
host_name web-back-01
hostgroups web
service_overrides HTTP,notification_options c,r
...
}
...
In the previous example, we defined only one instance of the HTTP service, and we enforced the service notification_options for the web servers composing the backend. The final result is the same, but the second example is shorter, and does not require the second service definition.
Using packs allows an even shorter configuration.
Example:
define host {
use http
host_name web-front-01
...
}
...
define host {
use http
host_name web-back-01
service_overrides HTTP,notification_options c,r
...
}
...
In the packs example, the web server from the frontend cluster uses the value defined in the pack, and the one from the backend cluster has the notification_options directive of its HTTP service (also inherited from the http pack) overridden.
Important
The service_overrides attribute may itself be inherited from an upper host template. It is a multivalued attribute whose syntax requires that each value be set on its own line. If you add a line on a host instance, it will not be added to the ones defined at the template level; it will override them. If some of the values defined at the template level are needed, they have to be explicitly copied.
Example:
define host {
name web-front
service_overrides HTTP,notification_options c,r
...
register 0
}
...
define host {
use web-front
host_name web-back-01
hostgroups web
service_overrides HTTP,notification_options c,r
service_overrides HTTP,notification_interval 15
...
}
...
Inheritance exclusions
Packs and hostgroups allow you to factorize the configuration and greatly reduce the amount of configuration needed to describe infrastructures. The drawback is that they force hosts to be consistent, as the same configuration is applied to a possibly very large set of machines.
Imagine a web servers cluster. All machines except one should have their management interface (ILO, iDRAC) checked. In the cluster, there is one virtual server that should be checked with the exact same services as the others, except the management interface (as checking it on a virtual server has no meaning). The corresponding service comes from a pack.
In this situation, there are several ways to manage it:
- create an intermediary template at the pack level so that the management interface check is attached to an upper-level template
- redefine all the services for the specified host
- use service_overrides to set a dummy command on the corresponding service
None of these options is satisfying.
There is a last solution that consists of excluding the corresponding service from the specified host. This may be done using the service_excludes directive.
Example:
define host {
use web-front
host_name web-back-01
...
}
define host {
use web-front
host_name web-back-02 ; The virtual server
service_excludes Management interface
...
}
...
Defining advanced service dependencies
The basics of defining service dependencies, the example dependency definitions, the way dependencies are tested, and the execution and notification dependency logic are identical to what is described in the Host and Service Dependencies chapter above, so they are not repeated here. One point deserves a warning, though:
Warning
Execution dependencies will limit the load due to useless checks, but they can also defeat some correlation logic, so they should be used only if you truly need them.
Notification Dependencies
If all of the notification dependency tests for the service passed, Shinken will send notifications out for the service as it normally would. If even just one of the notification dependencies for a service fails, Shinken will temporarily repress notifications for that (dependent) service. At some point in the future the notification dependency tests for the service may all pass. If this happens, Shinken will start sending out notifications again as it normally would for the service. More information on the notification logic can be found here.
In the example above, Service F would have failed notification dependencies if Service C is in a CRITICAL state, //and/or* Service D is in a WARNING or UNKNOWN state, and/or// if **Service E* is in a WARNING, UNKNOWN, or CRITICAL state. If this were the case, notifications for the service would not be sent out.
Dependency Inheritance
As mentioned before, service dependencies are not inherited by default. In the example above you can see that Service F is dependent on Service E. However, it does not automatically inherit Service E’s dependencies on Service B and Service C. In order to make Service F dependent on Service C we had to add another service dependency definition. There is no dependency definition for Service B, so Service F is not dependent on Service B.
If you do wish to make service dependencies inheritable, you must use the inherits_parent directive in the service dependency definition. When this directive is enabled, it indicates that the dependency inherits dependencies of the service that is being depended upon (also referred to as the master service). In other words, if the master service is dependent upon other services and any one of those dependencies fail, this dependency will also fail.
In the example above, imagine that you want to add a new dependency for service F to make it dependent on service A. You could create a new dependency definition that specifies service F as the dependent service and service A as the master service (i.e. the service that is being depended upon). You could alternatively modify the dependency definition between services D and F to look like this:
define servicedependency{
    host_name                     Host B
    service_description           Service D
    dependent_host_name           Host C
    dependent_service_description Service F
    execution_failure_criteria    o
    notification_failure_criteria n
    inherits_parent               1
}
Since the inherits_parent directive is enabled, the dependency between services A and D will also be tested when the dependency between services F and D is being tested.
Dependencies can have multiple levels of inheritance. If the dependency definition between A and D had its inherits_parent directive enabled, and service A were dependent on some other service (let’s call it service G), then service F would be dependent on services D, A, and G (each with potentially different criteria).
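A sketch of such a chain: the definition below makes service D depend on service A and, because inherits_parent is enabled, also on whatever service A itself depends on (service G in this scenario). The placement of Service A on Host A and the failure criteria are illustrative assumptions:
define servicedependency{
    host_name                     Host A ; assumed host of Service A (the master service)
    service_description           Service A
    dependent_host_name           Host B ; host of Service D (the dependent service)
    dependent_service_description Service D
    execution_failure_criteria    o ; illustrative criteria
    notification_failure_criteria n
    inherits_parent               1 ; also test Service A’s own dependency on Service G
}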
Host Dependencies
As you’d probably expect, host dependencies work in a similar fashion to service dependencies. The difference is that they’re for hosts, not services.
Do not confuse host dependencies with parent/child host relationships. You should be using parent/child host relationships (defined with the parents directive in host definitions) for most cases, rather than host dependencies. A description of how parent/child host relationships work can be found in the documentation on network reachability.
Here are the basics about host dependencies:
- A host can be dependent on one or more other hosts
- Host dependencies are not inherited (unless specifically configured to)
- Host dependencies can be used to cause host check execution and host notifications to be suppressed under different circumstances (UP, DOWN, and/or UNREACHABLE states)
- Host dependencies might only be valid during specific timeperiods
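For example, a time-restricted host dependency could look like the following sketch; the dependency_period directive and the “workhours” timeperiod name are assumptions:
define hostdependency{
    host_name                     Host A
    dependent_host_name           Host B
    notification_failure_criteria d ; repress notifications if Host A is DOWN
    dependency_period             workhours ; the dependency is only valid during this timeperiod
}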
Example Host Dependencies
The image below shows an example of the logical layout of host notification dependencies. Different hosts are dependent on other hosts for notifications.
In the example above, the dependency definitions for Host C would be defined as follows:
define hostdependency{
    host_name                     Host A
    dependent_host_name           Host C
    notification_failure_criteria d
}
define hostdependency{
    host_name                     Host B
    dependent_host_name           Host C
    notification_failure_criteria d,u
}
As with service dependencies, host dependencies are not inherited. In the example image you can see that Host C does not inherit the host dependencies of Host B. In order for Host C to be dependent on Host A, a new host dependency definition must be defined.
Host notification dependencies work in a similar manner to service notification dependencies. If all of the notification dependency tests for the host pass, Shinken will send notifications out for the host as it normally would. If even just one of the notification dependencies for a host fails, Shinken will temporarily repress notifications for that (dependent) host. At some point in the future the notification dependency tests for the host may all pass. If this happens, Shinken will start sending out notifications again as it normally would for the host. More information on the notification logic can be found here.
Shinken’s distributed architecture with realms
Multi customers and/or sites: REALMS
As we saw, Shinken’s architecture allows a single point of administration and data location: the hosts are split up and dispatched to the schedulers, and the pollers take jobs from all schedulers. Everyone is happy.
Everyone? In fact, no. An administrator running an architecture distributed across continents can run into serious problems. If the architecture is shared by several customers’ networks, a scheduler belonging to customer A can be asked for jobs by a poller belonging to customer B. That is not a good solution. Even within a single distributed network, distant pollers should not ask for jobs from schedulers on another continent; it is not network efficient.
That is where site/customer management is useful. In Shinken, it is handled by realms.
A realm is a group of resources that manages hosts or hostgroups. Such a link is unique: a host cannot be in multiple realms. If you put a hostgroup in a realm, all hosts in that group will be in the realm (unless a host already has a realm set, in which case the host’s value takes precedence).
A realm consists of:
- at least one scheduler
- at least one poller
- optionally one or more reactionners
- optionally one or more brokers
Within a realm, all of the realm’s pollers take jobs from all of the realm’s schedulers.
Important
Very important: there is only ONE arbiter (plus a spare, of course) for ALL realms. The arbiter manages all realms and everything inside them.
Sub-realms
A realm can have sub-realms. This changes nothing for schedulers, but it can be useful for other satellites and spares. Reactionners and brokers are linked to a realm, but they can take jobs from all of its sub-realms too. This way you can run fewer reactionners and brokers (as we will soon see).
Whether reactionners/brokers (and in fact pollers too) take jobs from schedulers in sub-realms is controlled by the manage_sub_realms parameter. For pollers the default value is 0, but for reactionners/brokers it is 1.
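For instance, a poller that should also serve its realm’s sub-realms could be declared as in this sketch (the poller name is illustrative):
define poller{
    poller_name       poller-europe
    realm             Europe
    manage_sub_realms 1 ; also take jobs from Europe’s sub-realms (Paris, for example)
}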
An example
To make it simple: you put hosts and/or hostgroups in a realm, and the realm should be considered a resource pool. You do not need to touch the host/hostgroup definitions if you need more (or less) capacity in the realm, or if you want to add a new satellite (a new reactionner, for example).
Realms are a way to manage resources. They are the smaller clouds in your global cloud infrastructure. :)
If you do not need this feature, that is not a problem; it is optional. A default realm will be created, and everything will be put into it.
The same goes for hosts that do not have a realm configured: they will be put into the realm that has the “default” parameter set.
Picture example
Diagrams are good :)
Let’s take two examples of distributed architectures around the world. In the first case, the administrator does not want to share resources between realms: they remain distinct. In the second, the reactionners and brokers are shared by all realms (so all notifications are sent from a single place, and so is all data).
Here is the isolated one:
And a more common way of sharing reactionner/broker:
As you can see, all the shared elements are in a single realm; that is the sub-realm functionality used for the reactionner/broker.
Configuration of the realms
Here is the configuration for the shared architecture:
define realm{
    realm_name    All
    realm_members Europe,US,Asia
    default       1 ;Is the default realm. Should be unique!
}
define realm{
    realm_name    Europe
    realm_members Paris ;This realm is IN Europe
}
And now the satellites:
define scheduler{
    scheduler_name scheduler_Paris
    realm          Paris ;It will only manage Paris hosts
}
define reactionner{
    reactionner_name reactionner-master
    realm            All ;Will reach ALL schedulers
}
And in host/hostgroup definition:
define host{
    host_name server-paris
    realm     Paris ;Will be put in the Paris realm
    [...]
}
define hostgroup{
    hostgroup_name linux-servers
    alias          Linux Servers
    members        srv1,srv2
    realm          Europe ;Will be put in the Europe realm
}
Multi-level brokers
In the previous samples, even if you put several brokers into a realm, each scheduler is linked to only one broker at a time. It was also impossible to have a common broker in All and one broker in each sub-realm.
You can activate the multi-broker feature with a realm parameter: the broker_complete_links option (0 by default).
You will have to enable this option in ALL your realms! For example:
define realm{
    realm_name            Europe
    broker_complete_links 1
}
This ensures that each scheduler will be linked with every broker. This makes it possible to have dedicated brokers in the same realm (one for the WebUI and another for Graphite, for example). It also makes it possible to have a common broker in “All” and one broker in each of its sub-realms (Europe, US and Asia). Of course, each sub-broker will only see the data from its own realm and its sub-realms (such as Paris for Europe).
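A sketch of two dedicated brokers in the same realm, once broker_complete_links is enabled (the broker names and their intended uses are illustrative):
define broker{
    broker_name broker-webui
    realm       Europe ; would carry the WebUI module, for example
}
define broker{
    broker_name broker-graphite
    realm       Europe ; would carry the Graphite export module, for example
}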
Unused Nagios parameters
The parameters below are managed in Nagios but not in Shinken, because they are useless in Shinken’s architecture. If you really need one of them, please use Nagios instead, or send us a patch. :)
Note
The title is somewhat ambiguous: a not-implemented parameter is different from an unused parameter.
The distinction is made on this page; what about creating a separate not_implemented_nagios_parameters page?
External Command Check Interval (Unused)
Format: command_check_interval=<xxx>[s]
Example: command_check_interval=1
If you specify a number with an “s” appended to it (i.e. 30s), this is the number of seconds to wait between external command checks. If you leave off the “s”, this is the number of “time units” to wait between external command checks. Unless you’ve changed the Timing Interval Length value (as defined below) from the default value of 60, this number will mean minutes.
By setting this value to -1, Nagios will check for external commands as often as possible. Each time Nagios checks for external commands it will read and process all commands present in the External Command File before continuing on with its other duties. More information on external commands can be found here.
External Command Buffer Slots (Not implemented)
Format: external_command_buffer_slots=<#>
Example: external_command_buffer_slots=512
This is an advanced feature.
This option determines how many buffer slots Nagios will reserve for caching external commands that have been read from the external command file by a worker thread, but have not yet been processed by the main thread of the Nagios daemon. Each slot can hold one external command, so this option essentially determines how many commands can be buffered. For installations where you process a large number of passive checks (e.g. distributed setups), you may need to increase this number. You should consider using MRTG to graph Nagios’ usage of external command buffers.
Use Retained Scheduling Info Option (Not implemented)
Format: use_retained_scheduling_info=<0/1>
Example: use_retained_scheduling_info=1
This setting determines whether or not Nagios will retain scheduling info (next check times) for hosts and services when it restarts. If you are adding a large number (or percentage) of hosts and services, I would recommend disabling this option when you first restart Nagios, as it can adversely skew the spread of initial checks. Otherwise you will probably want to leave it enabled.
- 0 = Don’t use retained scheduling info
- 1 = Use retained scheduling info (default)
Retained Host and Service Attribute Masks (Not implemented)
Format: retained_host_attribute_mask=<number>
        retained_service_attribute_mask=<number>
Example: retained_host_attribute_mask=0
         retained_service_attribute_mask=0
This is an advanced feature. You’ll need to read the Nagios source code to use this option effectively.
These options determine which host or service attributes are NOT retained across program restarts. The values for these options are a bitwise AND of values specified by the “MODATTR_” definitions in the “include/common.h” source code file. By default, all host and service attributes are retained.
Retained Process Attribute Masks (Not implemented)
Format: retained_process_host_attribute_mask=<number>
        retained_process_service_attribute_mask=<number>
Example: retained_process_host_attribute_mask=0
         retained_process_service_attribute_mask=0
This is an advanced feature. You’ll need to read the Nagios source code to use this option effectively.
These options determine which process attributes are NOT retained across program restarts. There are two masks because there are often separate host and service process attributes that can be changed. For example, host checks can be disabled at the program level, while service checks are still enabled. The values for these options are a bitwise AND of values specified by the “MODATTR_” definitions in the “include/common.h” source code file. By default, all process attributes are retained.
Service Inter-Check Delay Method (Unused)
Format: service_inter_check_delay_method=<n/d/s/x.xx>
Example: service_inter_check_delay_method=s
This option allows you to control how service checks are initially “spread out” in the event queue. Using a “smart” delay calculation (the default) will cause Nagios to calculate an average check interval and spread initial checks of all services out over that interval, thereby helping to eliminate CPU load spikes. Using no delay is generally not recommended, as it will cause all service checks to be scheduled for execution at the same time. This means that you will generally have large CPU spikes when the services are all executed in parallel. More information on how to estimate how the inter-check delay affects service check scheduling can be found here. Values are as follows:
- n = Don’t use any delay - schedule all service checks to run immediately (i.e. at the same time!)
- d = Use a “dumb” delay of 1 second between service checks
- s = Use a “smart” delay calculation to spread service checks out evenly (default)
- x.xx = Use a user-supplied inter-check delay of x.xx seconds
Inter-Check Sleep Time (Unused)
Format: sleep_time=<seconds>
Example: sleep_time=1
This is the number of seconds that Nagios will sleep before checking to see if the next service or host check in the scheduling queue should be executed. Note that Nagios will only sleep after it “catches up” with queued service checks that have fallen behind.
Service Interleave Factor (Unused)
Format: service_interleave_factor=<s/x>
Example: service_interleave_factor=s
This variable determines how service checks are interleaved. Interleaving allows for a more even distribution of service checks, reduced load on remote hosts, and faster overall detection of host problems. Setting this value to 1 is equivalent to not interleaving the service checks (this is how versions of Nagios previous to 0.0.5 worked). Set this value to s (smart) for automatic calculation of the interleave factor unless you have a specific reason to change it. The best way to understand how interleaving works is to watch the status CGI (detailed view) when Nagios is just starting. You should see that the service check results are spread out as they begin to appear. More information on how interleaving works can be found here.
- x = A number greater than or equal to 1 that specifies the interleave factor to use. An interleave factor of 1 is equivalent to not interleaving the service checks.
- s = Use a “smart” interleave factor calculation (default)
Maximum Concurrent Service Checks (Unused)
Format: max_concurrent_checks=<max_checks>
Example: max_concurrent_checks=20
This option allows you to specify the maximum number of service checks that can be run in parallel at any given time. Specifying a value of 1 for this variable essentially prevents any service checks from being run in parallel. Specifying a value of 0 (the default) does not place any restrictions on the number of concurrent checks. You’ll have to modify this value based on the system resources you have available on the machine that runs Nagios, as it directly affects the maximum load that will be imposed on the system (processor utilization, memory, etc.). More information on how to estimate how many concurrent checks you should allow can be found here.
Check Result Reaper Frequency (Unused)
Format: check_result_reaper_frequency=<frequency_in_seconds>
Example: check_result_reaper_frequency=5
This option allows you to control the frequency, in seconds, of check result “reaper” events. “Reaper” events process the results from host and service checks that have finished executing. These events constitute the core of the monitoring logic in Nagios.
Maximum Check Result Reaper Time
Note
Is it Unused or Not Implemented?
Format: max_check_result_reaper_time=<seconds>
Example: max_check_result_reaper_time=30
This option allows you to control the maximum amount of time in seconds that host and service check result “reaper” events are allowed to run. “Reaper” events process the results from host and service checks that have finished executing. If there are a lot of results to process, reaper events may take a long time to finish, which might delay timely execution of new host and service checks. This variable allows you to limit the amount of time that an individual reaper event will run before it hands control back over to Nagios for other portions of the monitoring logic.
Check Result Path (Unused)
Format: check_result_path=<path>
Example: check_result_path=/var/spool/nagios/checkresults
This option determines which directory Nagios will use to temporarily store host and service check results before they are processed. This directory should not be used to store any other files, as Nagios will periodically clean this directory of old files (see the Max Check Result File Age option below for more information).
Make sure that only a single instance of Nagios has access to the check result path. If multiple instances of Nagios have their check result path set to the same directory, you will run into problems with check results being processed (incorrectly) by the wrong instance of Nagios!
Max Check Result File Age (Unused)
Format: max_check_result_file_age=<seconds>
Example: max_check_result_file_age=3600
This option determines the maximum age in seconds that Nagios will consider check result files found in the check_result_path directory to be valid. Check result files older than this threshold will be deleted by Nagios and the check results they contain will not be processed. By using a value of zero (0) with this option, Nagios will process all check result files - even if they’re older than your hardware. :-)
Host Inter-Check Delay Method (Unused)
Format: host_inter_check_delay_method=<n/d/s/x.xx>
Example: host_inter_check_delay_method=s
This option allows you to control how host checks that are scheduled to be checked on a regular basis are initially “spread out” in the event queue. Using a “smart” delay calculation (the default) will cause Nagios to calculate an average check interval and spread initial checks of all hosts out over that interval, thereby helping to eliminate CPU load spikes. Using no delay is generally not recommended, as it will cause all host checks to be scheduled for execution at the same time. More information on how to estimate how the inter-check delay affects host check scheduling can be found here. Values are as follows:
- n = Don’t use any delay - schedule all host checks to run immediately (i.e. at the same time!)
- d = Use a “dumb” delay of 1 second between host checks
- s = Use a “smart” delay calculation to spread host checks out evenly (default)
- x.xx = Use a user-supplied inter-check delay of x.xx seconds
Auto-Rescheduling Option (Not implemented)
Format: auto_reschedule_checks=<0/1>
Example: auto_reschedule_checks=1
This option determines whether or not Nagios will attempt to automatically reschedule active host and service checks to “smooth” them out over time. This can help to balance the load on the monitoring server, as it will attempt to keep the time between consecutive checks consistent, at the expense of executing checks on a more rigid schedule.
THIS IS AN EXPERIMENTAL FEATURE AND MAY BE REMOVED IN FUTURE VERSIONS. ENABLING THIS OPTION CAN DEGRADE PERFORMANCE - RATHER THAN INCREASE IT - IF USED IMPROPERLY!
Auto-Rescheduling Interval (Not implemented)
Format: auto_rescheduling_interval=<seconds>
Example: auto_rescheduling_interval=30
This option determines how often (in seconds) Nagios will attempt to automatically reschedule checks. It only has an effect if the Auto-Rescheduling Option is enabled. Default is 30 seconds.
THIS IS AN EXPERIMENTAL FEATURE AND MAY BE REMOVED IN FUTURE VERSIONS. ENABLING THE AUTO-RESCHEDULING OPTION CAN DEGRADE PERFORMANCE - RATHER THAN INCREASE IT - IF USED IMPROPERLY!
Auto-Rescheduling Window (Not implemented)
Format: auto_rescheduling_window=<seconds>
Example: auto_rescheduling_window=180
This option determines the “window” of time (in seconds) that Nagios will look at when automatically rescheduling checks. Only host and service checks that occur in the next X seconds (determined by this variable) will be rescheduled. This option only has an effect if the Auto-Rescheduling Option is enabled. Default is 180 seconds (3 minutes).
THIS IS AN EXPERIMENTAL FEATURE AND MAY BE REMOVED IN FUTURE VERSIONS. ENABLING THE AUTO-RESCHEDULING OPTION CAN DEGRADE PERFORMANCE - RATHER THAN INCREASE IT - IF USED IMPROPERLY!
Translate Passive Host Checks Option (Not implemented)
Format: translate_passive_host_checks=<0/1>
Example: translate_passive_host_checks=1
This option determines whether or not Nagios will translate DOWN/UNREACHABLE passive host check results to their “correct” state from the viewpoint of the local Nagios instance. This can be very useful in distributed and failover monitoring installations. More information on passive check state translation can be found here.
- 0 = Disable check translation (default)
- 1 = Enable check translation
Child Process Memory Option (Unused)
Format: free_child_process_memory=<0/1>
Example: free_child_process_memory=0
This option determines whether or not Nagios will free memory in child processes when they are fork()ed off from the main process. By default, Nagios frees memory. However, if the use_large_installation_tweaks option is enabled, it will not. By defining this option in your configuration file, you are able to override things to get the behavior you want.
- 0 = Don’t free memory
- 1 = Free memory
Child Processes Fork Twice (Unused)
Format: child_processes_fork_twice=<0/1>
Example: child_processes_fork_twice=0
This option determines whether or not Nagios will fork() child processes twice when it executes host and service checks. By default, Nagios fork()s twice. However, if the use_large_installation_tweaks option is enabled, it will only fork() once. By defining this option in your configuration file, you are able to override things to get the behavior you want.
- 0 = Fork() just once
- 1 = Fork() twice
Event Broker Options (Unused)
Format: event_broker_options=<#>
Example: event_broker_options=-1
This option controls what (if any) data gets sent to the event broker and, in turn, to any loaded event broker modules. This is an advanced option. When in doubt, either broker nothing (if not using event broker modules) or broker everything (if using event broker modules). Possible values are shown below.
- 0 = Broker nothing
- -1 = Broker everything
- # = See BROKER_* definitions in source code (“include/broker.h”) for other values that can be OR’ed together
Event Broker Modules (Unused)
Format: broker_module=<modulepath> [moduleargs]
Example: broker_module=/usr/local/nagios/bin/ndomod.o cfg_file=/usr/local/nagios/etc/ndomod.cfg
This directive is used to specify an event broker module that should be loaded by Nagios at startup. Use multiple directives if you want to load more than one module. Arguments that should be passed to the module at startup are separated from the module path by a space.
Do NOT overwrite modules while they are being used by Nagios or Nagios will crash in a fiery display of SEGFAULT glory. This is a bug/limitation either in “dlopen()”, the kernel, and/or the filesystem. And maybe Nagios...
The correct/safe way of updating a module is by using one of these methods:
- Shutdown Nagios, replace the module file, restart Nagios
- While Nagios is running... delete the original module file, move the new module file into place, restart Nagios
Debug File (Unused)
Format: debug_file=<file_name>
Example: debug_file=/usr/local/nagios/var/nagios.debug
This option determines where Nagios should write debugging information. What (if any) information is written is determined by the Debug Level and Debug Verbosity options. You can have Nagios automatically rotate the debug file when it reaches a certain size by using the Maximum Debug File Size option.
Debug Level (Unused)
Format: debug_level=<#>
Example: debug_level=24
This option determines what type of information Nagios should write to the Debug File. This value is a logical OR of the values below.
- -1 = Log everything
- 0 = Log nothing (default)
- 1 = Function enter/exit information
- 2 = Config information
- 4 = Process information
- 8 = Scheduled event information
- 16 = Host/service check information
- 32 = Notification information
- 64 = Event broker information
Debug Verbosity (Unused)
Format: debug_verbosity=<#>
Example: debug_verbosity=1
This option determines how much debugging information Nagios should write to the Debug File.
- 0 = Basic information
- 1 = More detailed information (default)
- 2 = Highly detailed information
Maximum Debug File Size (Unused)
Format: max_debug_file_size=<#>
Example: max_debug_file_size=1000000
This option determines the maximum size (in bytes) of the debug file. If the file grows larger than this size, it will be renamed with a .old extension. If a file already exists with a .old extension it will automatically be deleted. This helps ensure your disk space usage doesn’t get out of control when debugging Nagios.