Installing Nagios with Quattor
Configuring Nagios requires two sets of templates: client templates, which define the commands run on each client by the Nagios Remote Plugin Executor (NRPE), and server templates, which configure alarm contacts, monitored hosts, services (also called sensors), and so on.
Configuring the Nagios server
The Nagios server is configured through a set of standard templates in the 'monitoring/nagios3' namespace.
Repositories used
- Sensors for some of the grid plugins come from the SA1 repository: http://www.sysadmin.hep.ac.uk/rpms/grid-services/RPMS.monitoring/
- RPMs for nagios and nagios-plugins (and their dependencies) are built for each supported OS and placed in the "updates" repository on quattorsrv.lal.in2p3.fr.
Example: http://quattor.web.lal.in2p3.fr/packages/os/sl440-i386/updates/
- The "nagios-grid-plugins" plugins are packaged as noarch RPMs in the "nagios" repository.
Example: http://quattor.web.lal.in2p3.fr/packages/nagios/
See template: nagios3/plugins/config.tpl
Server Template
An example Nagios server template is here:
The Nagios server should also be a grid UI (User Interface) so that it can monitor grid services.
Who is monitored
A host is monitored only if it meets both of the following conditions:
- It belongs to a site listed in the SITES variable and is declared in config/'sitename'_nodes_properties.tpl.
Example templates for host declarations are available in LCGQWG: https://trac.lal.in2p3.fr/LCGQWG/browser/templates/trunk/sites/example/site/config
- AND it has NAGIOS_CLIENT_ENABLED set to true in its profile:
variable NAGIOS_CLIENT_ENABLED ?= true;
include { if(NAGIOS_CLIENT_ENABLED) 'monitoring/nagios3/client/config'};
To set this variable to true for all your hosts, put it in the following template:
template site/pro_site_common_config
The selection can be tuned with the variables
NAGIOS_IGNORED_NODES, NAGIOS_MONITORED_HOSTGROUPS
(see the example profile above).
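To make the selection rules explicit, here is a small Python model of them. It is a hypothetical sketch, not the logic of the Pan templates: it assumes the contents of each site's nodes_properties template are available as a dict, that the value of NAGIOS_CLIENT_ENABLED can be read per host, and that NAGIOS_IGNORED_NODES is a plain list of host names. The site and host names are made up, and NAGIOS_MONITORED_HOSTGROUPS is left out because its semantics are not described on this page.

```python
from typing import Dict, List


def monitored_hosts(
    sites: List[str],
    nodes_by_site: Dict[str, Dict[str, dict]],
    client_enabled: Dict[str, bool],
    ignored_nodes: List[str],
) -> List[str]:
    """Hosts the Nagios server would monitor (hypothetical model of the rules)."""
    selected = []
    for site in sites:  # value of SITES
        # Condition 1: host declared in config/<site>_nodes_properties.tpl
        for host in nodes_by_site.get(site, {}):
            # Condition 2: NAGIOS_CLIENT_ENABLED is true in the host profile;
            # hosts that never set it are treated as disabled in this model.
            if not client_enabled.get(host, False):
                continue
            # Tuning: NAGIOS_IGNORED_NODES removes individual hosts (assumed shape)
            if host in ignored_nodes:
                continue
            selected.append(host)
    return sorted(selected)


nodes = {
    "mysite": {"node07.org.fr": {}, "node08.org.fr": {}, "node09.org.fr": {}},
}
enabled = {"node07.org.fr": True, "node08.org.fr": True}  # node09 never sets it
print(monitored_hosts(["mysite"], nodes, enabled, ["node08.org.fr"]))
# prints ['node07.org.fr']
```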
What is monitored
Services
Services are added in the template "server/cfgfiles/services.tpl". A service can be defined like this:
variable TMP_SERVICE = nlist(
    "use", " generic-service",
    "host_name", " node07.org.fr",
    "service_description", " Workers ssh_known_hosts",
    "contact_groups", " admins",
    "check_command", " check_nrpe_long!check_ssh_known_hosts!60",
    "normal_check_interval", " 60 ; check every hour",
    "max_check_attempts", " 1",
);
If the second parameter of the nagios_add_service function is true, a dependency on the NRPE daemon is added for the defined service on all nodes matching "*,!NOQUATTOR". This mechanism still needs to be improved.
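Each key of TMP_SERVICE is a standard Nagios service directive, so the nlist maps directly onto a Nagios `define service{...}` object. The Python helper below is a hypothetical illustration of that mapping, not the code used by services.tpl; it assumes the leading space in each value is only a separator and strips it. The text after `;` in normal_check_interval ends up as a Nagios inline comment.

```python
from typing import Dict


def render_object(obj_type: str, attrs: Dict[str, str]) -> str:
    """Render a directive -> value mapping as a Nagios object definition."""
    width = max(len(key) for key in attrs) + 2  # align the values in one column
    lines = [f"define {obj_type}{{"]
    for key, value in attrs.items():
        # The values in TMP_SERVICE start with a space; strip it (assumption).
        lines.append(f"    {key.ljust(width)}{value.strip()}")
    lines.append("}")
    return "\n".join(lines)


# Same content as the TMP_SERVICE nlist above.
TMP_SERVICE = {
    "use": " generic-service",
    "host_name": " node07.org.fr",
    "service_description": " Workers ssh_known_hosts",
    "contact_groups": " admins",
    "check_command": " check_nrpe_long!check_ssh_known_hosts!60",
    "normal_check_interval": " 60 ; check every hour",
    "max_check_attempts": " 1",
}

print(render_object("service", TMP_SERVICE))
```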
A service on one host can also be made dependent on a specific service of an explicitly named host:
variable NAGIOS_USER_DEFINED_HOST_DEPENDENCIES = nagios_add_host_service_dependency(
    "node07.datagrid.cea.fr", "nrpe daemon",
    "node07.datagrid.cea.fr", "Workers ssh_known_hosts"
);
Dependencies between hostgroups are not supported (at least for now).
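In plain Nagios terms, such a dependency is a `servicedependency` object. The Python sketch below assumes that the arguments of nagios_add_host_service_dependency are, in order, the master host, the master service, the dependent host and the dependent service (which fits the example, where "Workers ssh_known_hosts" depends on "nrpe daemon"). The notification_failure_criteria value is an assumed policy, not something the templates are known to set.

```python
from typing import Dict


def service_dependency(master_host: str, master_service: str,
                       dependent_host: str, dependent_service: str) -> Dict[str, str]:
    """Attributes of a Nagios servicedependency object (sketch).

    Assumed argument order, mirroring nagios_add_host_service_dependency:
    master host, master service, dependent host, dependent service.
    """
    return {
        "host_name": master_host,
        "service_description": master_service,
        "dependent_host_name": dependent_host,
        "dependent_service_description": dependent_service,
        # Assumed policy: do not notify about the dependent service while the
        # master service is WARNING, UNKNOWN or CRITICAL.
        "notification_failure_criteria": "w,u,c",
    }


def render(attrs: Dict[str, str]) -> str:
    """Format the attributes as a Nagios object definition."""
    body = [f"    {key} {value}" for key, value in attrs.items()]
    return "\n".join(["define servicedependency{", *body, "}"])


dep = service_dependency("node07.datagrid.cea.fr", "nrpe daemon",
                         "node07.datagrid.cea.fr", "Workers ssh_known_hosts")
print(render(dep))
```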
Commands and NRPE
These Nagios configuration files do not need complex Quattor structure templates, so they are created with filecopy:
- Commands are added in:
monitoring/nagios3/server/cfgfiles/commands
- NRPE commands are added in:
monitoring/nagios3/client/cfgfiles/nrpe_commands
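Arguments in a check_command such as check_nrpe_long!check_ssh_known_hosts!60 are separated by `!` and substituted by Nagios into the $ARG1$, $ARG2$, ... macros of the matching command definition. The Python sketch below models that substitution. The command_line shown for check_nrpe_long and the $USER1$ path are hypothetical, since the actual definitions in server/cfgfiles/commands are not reproduced here; escaped `\!` inside arguments is not handled.

```python
import re
from typing import Dict

MACRO = re.compile(r"\$(\w+)\$")


def expand_check_command(check_command: str,
                         commands: Dict[str, str],
                         macros: Dict[str, str]) -> str:
    """Expand 'name!arg1!arg2' against a command_line, as Nagios does for $ARGn$."""
    name, *args = check_command.split("!")
    values = dict(macros)
    for i, arg in enumerate(args, start=1):
        values[f"ARG{i}"] = arg  # $ARG1$, $ARG2$, ...
    # Substitute known macros; unknown ones are left as they are.
    return MACRO.sub(lambda m: values.get(m.group(1), m.group(0)), commands[name])


# Hypothetical definitions, for illustration only.
commands = {"check_nrpe_long": "$USER1$/check_nrpe -H $HOSTADDRESS$ -t $ARG2$ -c $ARG1$"}
macros = {"USER1": "/usr/lib/nagios/plugins", "HOSTADDRESS": "node07.org.fr"}

print(expand_check_command("check_nrpe_long!check_ssh_known_hosts!60", commands, macros))
# prints /usr/lib/nagios/plugins/check_nrpe -H node07.org.fr -t 60 -c check_ssh_known_hosts
```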
For services that are not defined on all hosts (template services.tpl), a dependency on the NRPE daemon can be added per hostgroup:
variable NRPE_HOSTGROUPS_SPECIFIC_DEPENDENT_SERVICES = nlist(
    escape("CE,CE-MPI,!NOQUATTOR"), "black hole workers",
    escape("CE,CE-MPI,LFC,SE_DPM,SE_DISK,MON,VOBOX,WMS,!NOQUATTOR"), "host certificate",
    escape("MON"), "apel publisher",
    escape("CE,CE-MPI"), "apel parser",
    escape("WN,CE-MPI,UI,VOBOX"), "home partition freespace",
    escape("WN"), "pbs_mom transfers",
);
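The keys of NRPE_HOSTGROUPS_SPECIFIC_DEPENDENT_SERVICES are hostgroup lists in which `!` excludes a group (and, in "*,!NOQUATTOR", `*` stands for all hosts). A minimal Python sketch of how such a list could resolve to hosts, and which NRPE-dependent services each host would then get, is shown below. The hostgroup membership is made up, and the keys are written unescaped (the Pan code wraps them in escape()).

```python
from typing import Dict, List, Set


def resolve_hostgroups(expr: str, hostgroups: Dict[str, Set[str]]) -> Set[str]:
    """Hosts matched by a list like 'CE,CE-MPI,!NOQUATTOR' (sketch)."""
    all_hosts = set().union(*hostgroups.values())
    included: Set[str] = set()
    excluded: Set[str] = set()
    for token in (t.strip() for t in expr.split(",")):
        if token == "*":                  # every known host
            included |= all_hosts
        elif token.startswith("!"):       # excluded hostgroup
            excluded |= hostgroups.get(token[1:], set())
        elif token:                       # included hostgroup
            included |= hostgroups.get(token, set())
    return included - excluded


def nrpe_dependent_services(spec: Dict[str, str],
                            hostgroups: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Per host, the services that would get a dependency on the NRPE daemon."""
    result: Dict[str, List[str]] = {}
    for expr, service in spec.items():
        for host in sorted(resolve_hostgroups(expr, hostgroups)):
            result.setdefault(host, []).append(service)
    return result


# Made-up hostgroup membership and a subset of the nlist above.
hostgroups = {"CE": {"ce01"}, "CE-MPI": {"ce02"}, "WN": {"wn01", "wn02"},
              "NOQUATTOR": {"ce02"}}
spec = {"CE,CE-MPI,!NOQUATTOR": "black hole workers", "WN": "pbs_mom transfers"}
print(nrpe_dependent_services(spec, hostgroups))
```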
Proxy management
Local grid probes need a valid proxy certificate. Two mechanisms are available, renewal and retrieval; they are selected in cfg/standard/monitoring/nagios3/server/config.tpl:
include { if(NAGIOS_NCG_CONFIG && NAGIOS_MODE_PROXY_RENEW) 'monitoring/nagios3/server/vobox'};
include { if(NAGIOS_NCG_CONFIG && NAGIOS_MODE_PROXY_RETRIEVE) 'monitoring/nagios3/server/proxy_retrieval'};
Associated variables:
- Renewal: NAGIOS_MODE_PROXY_RENEW, NAGIOS_RENEW_PROXY, NAGIOS_OUTPUT_PROXY
- Retrieval: NAGIOS_MODE_PROXY_RETRIEVE, NAGIOS_MYPROXY_NAME, MYPROXY_SERVER, NAGIOS_VONAME_PROXY
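The two include statements are independent of each other. The Python sketch below is a direct transcription of their conditions, for illustration only: with NAGIOS_NCG_CONFIG set, each enabled mode pulls in its own template, and setting both mode flags would include both templates.

```python
from typing import List


def proxy_templates(ncg_config: bool, mode_renew: bool, mode_retrieve: bool) -> List[str]:
    """Templates pulled in by the two proxy includes of server/config.tpl."""
    templates = []
    # include { if(NAGIOS_NCG_CONFIG && NAGIOS_MODE_PROXY_RENEW) ... }
    if ncg_config and mode_renew:
        templates.append("monitoring/nagios3/server/vobox")
    # include { if(NAGIOS_NCG_CONFIG && NAGIOS_MODE_PROXY_RETRIEVE) ... }
    if ncg_config and mode_retrieve:
        templates.append("monitoring/nagios3/server/proxy_retrieval")
    return templates


print(proxy_templates(True, False, True))
# prints ['monitoring/nagios3/server/proxy_retrieval']
```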
Client configuration
Variables
Variable | Type | Description
variable NAGIOS_RPM_VERSION ?= "3.0.5-1" | string | Nagios server RPM version
variable NAGIOS_NODES_PROPERTIES | nlist | The site nodes properties, for instance: nlist(escape("mynode.mydomain"), nlist("type", "NFS", "monitoring", "yes", "os", "undef", "ip", "192.168.0.1", "hardware", "undef"))
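As an illustration of what the server can derive from NAGIOS_NODES_PROPERTIES, the Python sketch below turns the example nlist into Nagios `define host{...}` objects. The mapping is an assumption, not taken from the templates: only nodes with "monitoring" set to "yes" are kept, "ip" becomes the address, "type" becomes the hostgroup, and generic-host is a hypothetical host template name.

```python
from typing import Dict, List

# The NAGIOS_NODES_PROPERTIES example above, as a Python dict.
NAGIOS_NODES_PROPERTIES = {
    "mynode.mydomain": {"type": "NFS", "monitoring": "yes", "os": "undef",
                        "ip": "192.168.0.1", "hardware": "undef"},
}


def host_definitions(nodes: Dict[str, Dict[str, str]]) -> List[str]:
    """One Nagios 'define host' block per node flagged for monitoring (sketch)."""
    blocks = []
    for name, props in sorted(nodes.items()):
        if props.get("monitoring") != "yes":   # assumed meaning of the flag
            continue
        blocks.append("\n".join([
            "define host{",
            "    use        generic-host",       # hypothetical host template
            f"    host_name  {name}",
            f"    address    {props['ip']}",
            f"    hostgroups {props['type']}",   # assumed: type -> hostgroup
            "}",
        ]))
    return blocks


for block in host_definitions(NAGIOS_NODES_PROPERTIES):
    print(block)
```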
Installation example
With Quattor
Create the server profile, using the example profile above as a model:
svn add cfg/clusters/your-3.1/profiles/profile_node58.tpl
Modify your list of machines:
vi ./cfg/sites/your/site/config/your_nodes_properties.tpl
Create your hardware template:
svn cp ./cfg/sites/your/hardware/virtual_machine_3.tpl ./cfg/sites/your/hardware/virtual_machine_13.tpl
Configure your clients by adding:
variable NAGIOS_CLIENT_ENABLED = true;
include { if(NAGIOS_CLIENT_ENABLED) 'monitoring/nagios3/client/config'};
to the template site/pro_site_common_config.
Commit your changes:
svn ci -m 'adding nagios server'
On the Nagios server
- As root: check the SPMA and ncm-cdispd logs, check the status of the nagios service and start it, then add your server certificate to /etc/grid-security:
vi /var/log/spma.log
vi /var/log/ncm-cdispd.log
/etc/init.d/nagios status
/etc/init.d/nagios start
- As the nagios user: add your personal certificate and create a local proxy (this can also be done from another UI):
voms-proxy-init --voms vo.grif.fr
myproxy-init -c 336 -k xxxxx -s myproxy.grif.fr -l nagios -x -Z "/O=GRID-FR/C=FR/O=xxx/OU=xxx/CN=xxxx.xxxx.fr"
- As root: check the proxy retrieval mechanism:
# /usr/sbin/nagios-proxy-refresh
MyProxy credential retrieved.
VOMS credential retrieved.
# voms-proxy-info --all -file /etc/nagios/globus/userproxy.pem
subject   : /O=GRID-FR/C=FR/O=CEA/OU=IRFU/CN=Christine Leroy/CN=proxy/CN=proxy/CN=proxy/CN=proxy
issuer    : /O=GRID-FR/C=FR/O=CEA/OU=IRFU/CN=Christine Leroy/CN=proxy/CN=proxy/CN=proxy
identity  : /O=GRID-FR/C=FR/O=CEA/OU=IRFU/CN=Christine Leroy/CN=proxy/CN=proxy/CN=proxy
type      : proxy
strength  : 1024 bits
path      : /etc/nagios/globus/userproxy.pem
timeleft  : 11:27:56
=== VO vo.grif.fr extension information ===
VO        : vo.grif.fr
subject   : /O=GRID-FR/C=FR/O=CEA/OU=IRFU/CN=Christine Leroy
issuer    : /O=GRID-FR/C=FR/O=CNRS/OU=LAL/CN=grid12.lal.in2p3.fr
attribute : /vo.grif.fr/Role=NULL/Capability=NULL
timeleft  : 11:27:56
uri       : grid12.lal.in2p3.fr:20001
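The first timeleft line of this output gives the remaining lifetime of the retrieved proxy. A hypothetical Python check that parses it and maps it to Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) could look like the sketch below; the 6-hour and 1-hour thresholds are arbitrary assumptions, and this is not a probe shipped with the templates.

```python
import re
from typing import Optional

TIMELEFT = re.compile(r"^\s*timeleft\s*:\s*(\d+):(\d{2}):(\d{2})\s*$", re.MULTILINE)


def proxy_timeleft_seconds(voms_proxy_info_output: str) -> Optional[int]:
    """Seconds left on the proxy, from the first 'timeleft' line (None if absent)."""
    match = TIMELEFT.search(voms_proxy_info_output)
    if match is None:
        return None
    hours, minutes, seconds = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def nagios_state(timeleft: Optional[int],
                 warn: int = 6 * 3600, crit: int = 3600) -> int:
    """Nagios plugin exit code: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN."""
    if timeleft is None:
        return 3
    if timeleft < crit:
        return 2
    if timeleft < warn:
        return 1
    return 0


sample = """type      : proxy
strength  : 1024 bits
timeleft  : 11:27:56
=== VO vo.grif.fr extension information ===
timeleft  : 11:27:56
"""
left = proxy_timeleft_seconds(sample)
print(left, nagios_state(left))  # prints 41276 0
```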