Nagios regional
Latest revision as of 10:02, 9 April 2010
Installing a Nagios box for the ROC France
1) Basic installation:
a. Machine installed by the CC sysadmins: OS + VOBOX + certificate + not in NIS
b. Accessible via gsissh (port 1975)
2) Preliminary actions:
a. Request that the machine be authorized to retrieve the SAM tests: https://gus.fzk.de/ws/ticket_info.php?ticket=55132
b. Authorize the Nagios box to retrieve proxies in "retrieval" mode, and store from a UI a proxy that the Nagios box can retrieve (see Annexe0)
c. Certificate used for:
i. Access to the GOCDB PI for ROCs (GOCDB PI level 2 required)
ii. Retrieving proxies for the local probes
d. Subscribe to the mailing list regional-nagios-admins@cern.ch (very responsive)
3) Installing Nagios
Reference: https://twiki.cern.ch/twiki/bin/view/EGEE/GridMonitoringNcgYaim
Install the packages via yum, after adding the following repositories:
a. mirrors-rpmforge (rpm rpmforge-release-0.5.1-1.el5.rf.x86_64.rpm)
b. rpmforge-testing.repo
c. rpmforge.repo
d. glite-UI.repo
e. sa1-centos5-release.repo (rpm: sa1-release-2-1.el5.noarch.rpm)
f. glite-BDII.repo
Dependency problems were encountered; if needed, see Annexe1 (but I had probably forgotten to run: yum install egee-NAGIOS)
Installing subversion
Some configuration files have to be retrieved via svn from CERN:
yum install subversion
svn co http://svnweb.cern.ch/guest/sam/trunk/roc-config
cp /usr/lib/perl5/vendor_perl/5.8.5/NCG/LocalMetrics/Hash.pm /usr/lib/perl5/vendor_perl/5.8.5/NCG/LocalMetrics/Hash.pm.orig
cp roc-config/Hash.pm_55_ALL_CRIT /usr/lib/perl5/vendor_perl/5.8.5/NCG/LocalMetrics/Hash.pm
MySQL
1. Install the latest version of MySQL (server + client);
2. Set the admin password (see Annexe2a);
3. Configure the users needed by the Nagios box (see Annexe2b);
Configuration via yaim and NCG (in two steps)
1. Fill in /etc/ncg/ncg.localdb with the list of sites (see Annexe3)
2. Fill in /opt/glite/yaim/site-info.def (see Annexe4);
3. groupadd nagios
4. Change the uids/gids: /opt/glite/yaim/examples/edgusers.conf (see Annexe5)
5. Run the automatic configuration via yaim (All On One Box):
/opt/glite/yaim/bin/yaim -s /opt/glite/yaim/site-info.def -c -n glite-UI -n glite-NAGIOS
6. Run the configuration via NCG (All On One Box):
/usr/sbin/ncg.pl
Important change: see https://twiki.cern.ch/twiki/bin/view/EGEE/ValidateROCNagios, in particular the part:
- Configure /etc/ncg/ncg.conf manually (template: ncg.conf.template)
- Switch off automatic generation of ncg.conf by YAIM (NAGIOS_NCG_ENABLE_CONFIG="false")
- Take ncg.conf.template and copy it to /etc/ncg/ncg.conf
- Set $ROC_NAME$
- Set $YOUR_MYPROXY_SERVER$ (e.g. myproxy-fts.cern.ch)
- Set $MYPROXY_NAME$ (e.g. NagiosRetrieve-sam-ap-roc.cern.ch)
- Set $NAGIOS_ROLE$ (e.g. ngi, roc)
- Take the Hash.pm file that we are currently using on the CERN ROC Nagios boxes. In this file, we have removed the metrics that do not belong to the SAM_Critical profile, so only the metrics defined at ROC_SAM_critical are configured, and non-relevant services such as SRMv1 are commented out. The file is located under /usr/lib/perl5/vendor_perl/5.8.5/NCG/LocalMetrics/Hash.pm
This avoids configuring inappropriate probes.
Tuning the configuration
1. Check in /etc/sysconfig/nagios:
LD_LIBRARY_PATH=/opt/classads/lib64:/opt/glite/lib64:/opt/globus/lib:/opt/c-ares/lib:/opt/classads/lib64
2. Disable notifications in the configuration files:
a. /etc/nagios/nagios.cfg, disable notifications: enable_notifications=0 ; log_notifications=0
b. /etc/nagios/wlcg.d/host.templates.cfg
c. /etc/nagios/wlcg.d/service.templates.cfg
3. Allow only dteam/France members to view our Nagios interface: edit the file
/etc/voms2htpasswd.conf, with: vomss://voms.cern.ch:8443/voms/dteam?/dteam/france
This is no longer needed if site-info.def is defined correctly:
VO_DTEAM_VOMS_SERVERS='vomss://voms.cern.ch:8443/voms/dteam?/dteam/france'
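The notification settings from step 2 can be checked after each regeneration of the configuration. The following is a minimal sketch, assuming the two directives appear as plain key=value lines in the main configuration file; notifications_disabled is a hypothetical helper name, not a Nagios tool.

```shell
#!/bin/sh
# Sketch only: notifications_disabled is a hypothetical helper, not part of Nagios.
# It succeeds only when both directives are set to 0 in the given config file.
notifications_disabled() {
    cfg="$1"
    for key in enable_notifications log_notifications; do
        # Match "key=0", tolerating blanks around the name, the "=" and the value.
        grep -Eq "^[[:space:]]*${key}[[:space:]]*=[[:space:]]*0[[:space:]]*\$" "$cfg" || return 1
    done
    return 0
}
```

Possible usage: notifications_disabled /etc/nagios/nagios.cfg && echo "notifications are off"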
ActiveMQ
/usr/sbin/msg-to-queue --prefix /queue/grid.probe.metricOutput.EGEE.a635834332381123c8b296d02b682f8f --broker-uri stomp://prod-grid-msg.cern.ch:6163
[root@cclcgvmli03 cron.hourly]# cat check_msg-to-queue.sh
Check the messages:
/usr/libexec/grid-monitoring/plugins/nagios/recv_from_queue -v
Update perl-GridMon
The problem here is in the message handler (/usr/lib/perl5/vendor_perl/5.8.8/GridMon/MsgHandler/MetricOutput.pm). Probe on WN reports hostname localhost.localdomain and serviceURI CE hostname. In the previous version message handler first checked hostname value and then serviceURI. That is the reason why Christine is seeing results for localhost.localdomain. However, we fixed this and the latest version (1.0.34) parses messages correctly.
yum update perl-GridMon
Information system
The Nagios box must be registered in the GOCDB and published in the site BDII:
ldapsearch -x -H ldap://ccnagboxli01.in2p3.fr:2170 -b 'Mds-vo-name=resource,o=Grid'
Validation
https://twiki.cern.ch/twiki/bin/view/EGEE/ValidateROCNagios
Now you should publish your Nagios in your site BDII, and we should declare the node in the GOCDB: Regional-NAGIOS or National-Nagios, to be determined.
List of probes to validate, according to: https://twiki.cern.ch/twiki/bin/view/LCG/SAMCriticalTestsForCODs
Grid services
Node type | Validated by | Valid (Y or N)
APEL | cl | N: no metric?
BDII (site and TOP) | cl | Y
CE | cc | Problem identified with the LCG_GFAL_INFOSYS variable when it is a list (CE lpnce.in2p3.fr): LCG_GFAL_INFOSYS=topbdii.grif.fr:2170,cclcgtopbdii01.in2p3.fr:2170,lapp-bdii01.in2p3.fr:2170. The mechanism for retrieving the outputs is not fully understood, but overall we get the same results as CERN.
CREAMCE | cl | Y: same as the CERN Nagios box (but the Brokerinfo probe is still in error; known problem: https://savannah.cern.ch/bugs/?61322)
FTS | cl | Y and NO: For the check_command those options should be added: --cert /etc/nagios/globus/hostcert.pem --key /etc/nagios/globus/hostkey.pem -x $USER2$ (this should be fixed in the next release)
gRB/WMS | cl | Y (certificate lifetime only)
LFC_C | cl | Y: but perhaps the site should be asked to support the ops VO so that its LFC is tested (lfc-ls and lfc-ping)?
LFC_L | cl | Y: but perhaps the site should be asked to support the ops VO so that its LFC is tested (lfc-ls and lfc-ping)?
MPI | em | n
MyProxy | em | n
RB | nobody | N: Service Deprecated. Does NOT apply in Nagios
RGMA / MON | cl | Y (certificate lifetime only)
SRMv2 | cl | n
VOBOX | cl | Y
VOMS | em | n
Nagios box services
Service | Validated by | Valid (Y or N)
hr.srce.CAdist-Version | nl | Y: checks that the CA distribution installed on the Nagios box is up to date
hr.srce.CertLifetime | nl | Y: checks the validity of the Nagios box certificate
hr.srce.GridProxy-Get-ops | nl | Y: tries to retrieve a proxy from the available MyProxy server@cea
hr.srce.GridProxy-Valid-ops | nl | Y: checks that the proxy owned by nagios on the Nagios box is valid; this is essential, otherwise the grid probes cannot run
org.egee.ATPSync | ? | n
org.egee.CheckConfig | cc | y
org.egee.ImportGocdbDowntimes | nl | Y: imports from the GOCDB the downtimes of the ROC France sites
org.egee.MDDBSync | ? | n
org.egee.RecvFromQueue | cc | y
org.egee.SendToMetricStore | cc | Y
org.egee.SendToMsg | cc | y: the different topics still need to be identified (what fills the outgoing directory???, three topics??)
org.nagios.DiskCheck | nl | Y: checks critical disk space; warning when 80% full, critical when 95% full
org.nagios.ProcessCrond | nl | Y: checks that the crond process is running on the Nagios box
org.nagios.ProcessMsgToQueue | cc | y: tests whether the msg-to-queue process is running
org.nagios.ProcessNpcd | cc | y: tests whether the npcd process of pnp4nagios (for the performance graphs) is running
org.sam.CE-JobMonit-ops | cc | y
org.sam.CREAMCE-JobMonit-ops | cc | ?: I assume the result is correct (?)
org.sam.mpi.CE-JobMonit-ops | ? | n
hr.srce.MyProxy-ProxyLifetime-ops | cl | n
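The org.nagios.DiskCheck entry above gives thresholds of 80% (warning) and 95% (critical). The sketch below shows how such thresholds map onto the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL). It is an illustration, not the probe actually deployed: disk_status is a hypothetical name, and treating the thresholds as inclusive (>=) is an assumption.

```shell
#!/bin/sh
# Sketch only: disk_status is a hypothetical helper, not the deployed probe.
# Argument: used disk space as an integer percentage.
# Prints the state and returns the matching Nagios plugin exit code.
disk_status() {
    used="$1"
    if [ "$used" -ge 95 ]; then
        echo CRITICAL
        return 2
    elif [ "$used" -ge 80 ]; then
        # Assumption: the thresholds are inclusive.
        echo WARNING
        return 1
    fi
    echo OK
    return 0
}
```

Possible usage, taking the percentage from the POSIX df output: disk_status "$(df -P /var | awk 'NR==2 {sub("%","",$5); print $5}')"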
Updating
To change the configuration (for example, to add a new site for the ROC France):
Edit /etc/ncg/ncg.localdb accordingly.
First make a local "backup" of the Nagios configuration:
mv /etc/nagios/wlcg.d /etc/nagios/wlcg.d.old
Then run the ncg script:
/usr/sbin/ncg.pl
And restart Nagios with the new configuration:
/etc/init.d/nagios restart
Check that the configuration is OK and commit it to svn (svn ci) to save the configuration.
svn ci xxxxx
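One caveat with the backup step: if /etc/nagios/wlcg.d.old still exists from a previous update, mv places wlcg.d inside it instead of replacing it. A timestamped backup name avoids this. The following is a minimal sketch under that assumption; backup_wlcg is a hypothetical helper name and the timestamp format is an arbitrary choice.

```shell
#!/bin/sh
# Sketch only: backup_wlcg is a hypothetical helper name.
# It renames <nagios_dir>/wlcg.d to a timestamped sibling and prints the new path.
backup_wlcg() {
    nagios_dir="$1"
    backup="$nagios_dir/wlcg.d.$(date +%Y%m%d-%H%M%S)"
    if [ ! -d "$nagios_dir/wlcg.d" ]; then
        echo "no wlcg.d directory under $nagios_dir" >&2
        return 1
    fi
    if [ -e "$backup" ]; then
        # Refuse to overwrite or nest into an existing backup.
        echo "$backup already exists" >&2
        return 1
    fi
    mv "$nagios_dir/wlcg.d" "$backup" && echo "$backup"
}
```

Possible usage: backup_wlcg /etc/nagios && /usr/sbin/ncg.pl, followed by the restart described above. Before restarting, the generated configuration can be verified with the -v option of the nagios binary applied to /etc/nagios/nagios.cfg (the binary's location depends on the installation).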
Sources of information on NCG:
https://twiki.cern.ch/twiki/bin/view/EGEE/GridMonitoringNcgOverview
https://twiki.cern.ch/twiki/bin/view/EGEE/GridMonitoringNcgRecipes
Failover
If there is a hardware problem on the machine, fail over to the backup virtual machine (Nadia or Jacques). Then do an svn co of the Nagios configuration to get the latest production configuration.
Annexes
Annexe0
On the MyProxy server, the Nagios server must be listed under both "trusted_retrievers" and "authorized_retrievers":
[cleroy@grid08 ~]$ grep cclcgvmli03 /opt/glite/etc/myproxy-server.conf
trusted_retrievers /O=GRID-FR/C=FR/O=CNRS/OU=CC-LYON/CN=cclcgvmli03.in2p3.fr
authorized_retrievers /O=GRID-FR/C=FR/O=CNRS/OU=CC-LYON/CN=cclcgvmli03.in2p3.fr
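The grep above relies on reading the output by eye. As a sketch, the same check can be scripted so that it fails when either directive is missing for the host DN; retriever_allowed is a hypothetical helper name, and the matching is a plain substring test on lines starting with each directive.

```shell
#!/bin/sh
# Sketch only: retriever_allowed is a hypothetical helper name.
# Succeeds when the DN appears on both a trusted_retrievers line and an
# authorized_retrievers line of the given MyProxy server configuration file.
retriever_allowed() {
    conf="$1"
    dn="$2"
    for key in trusted_retrievers authorized_retrievers; do
        # Select the lines of this directive, then look for the DN as a fixed string.
        grep -E "^[[:space:]]*${key}[[:space:]]" "$conf" | grep -Fq -- "$dn" || return 1
    done
    return 0
}
```

Possible usage: retriever_allowed /opt/glite/etc/myproxy-server.conf "/O=GRID-FR/C=FR/O=CNRS/OU=CC-LYON/CN=cclcgvmli03.in2p3.fr"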
On a UI:
voms-proxy-init -voms ops:/ops/Role=lcgadmin
myproxy-init -c 336 -k nagios_roc_fr2-ops -s myproxy.grif.fr -l nagios -x -Z "/O=GRID-FR/C=FR/O=CNRS/OU=CC-IN2P3/CN=ccnagboxli01.in2p3.fr"
Annexe 1
rpm -ivh http://www.sysadmin.hep.ac.uk/rpms/egee-SA1/centos5/x86_64/sa1-release-2-1.el5.noarch.rpm
rpm -ivh http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.1-1.el5.rf.x86_64.rpm
yum install atp
yum install bouncycastle
yum install broker
yum install broker-cache
yum install dcache-srmclient
yum install dummy-ca-certs
yum install egee-NAGIOS
yum install egee-NAGIOS egee-NRPE
yum install egee-NRPE
yum install fetch-crl
yum install fipscheck fipscheck-lib
yum install glite-UI
yum install glite-security-voms-clients
yum install glite-wms-ui-commands
yum install glite-yaim-core
yum install glite-yaim-nagios
yum install httpd
yum install jdk
yum install lcg-CA
yum install lcg-CA egee-NAGIOS
yum install lcg_util
yum install mddb
yum install msg-publish-simple
yum install myproxy
yum install mysql-client
yum install mysql-server
yum install nagios-proxy-refresh
yum install perl-Config-Tiny
yum install perl-DBD-MySQL
yum install perl-rrdtool-1.3.8-2.el5.rf.x86_64
yum install python-yaml
yum install sun-jaf
yum install uberftp-client
yum install vdt_globus_rm_client
yum update glite-yaim-clients
yum update glite-yaim-core
yum update glite-yaim-nagios
yum update mysql-server
yum update perl-DBI
Annexe2
a) MySQL admin password: use yum to get the latest version of MySQL (MySQL-server-community); do not forget the client (no dependency pulls it in). Start mysql with --skip-grant-tables (so that no password has to be entered):
mysqld_safe --skip-grant-tables &
[root@cclcgvmli03 ~]# mysql -u root
mysql> update mysql.user set password=PASSWORD("NEW-ROOT-PASSWORD") where User='root';
mysql> flush privileges;
b) Creating the users for the regional Nagios:
[root@cclcgvmli03 ~]# mysql -u root -p
mysql> GRANT SELECT, INSERT, UPDATE, DELETE ON nagios.* TO 'ndouser'@'localhost' IDENTIFIED by 'ROCfr2009';
mysql> GRANT SELECT, INSERT, UPDATE, DELETE ON atp.* TO 'atpuser'@'localhost' IDENTIFIED by 'ROCfr2009';
Annexe 3
[root@cclcgvmli03 ~]# cat /etc/ncg/ncg.localdb
#
# Local Rules file to modify NCG configuration
#
SITE!AUVERGRID
SITE!CGG-LCG2
SITE!ESRF
SITE!GRIF
SITE!IBCP-GBIO
SITE!IN2P3-CC
SITE!IN2P3-CC-PPS
SITE!IN2P3-CC-T2
SITE!IN2P3-CPPM
SITE!IN2P3-IPNL
SITE!IN2P3-IRES
SITE!IN2P3-LAPP
SITE!IN2P3-LPC
SITE!IN2P3-LPSC
SITE!IN2P3-SUBATECH
SITE!IPSL-IPGP-LCG2
SITE!M3PEC
SITE!MSFG
SITE!MSFG-MULTI
SITE!MSFG-OPEN
SITE!OBSPM
SITE!PARIS-UREC-IPV6
SITE!SN-UCAD
SITE!ROC-FR
SITE!SOLEIL
SITE!StratusLab
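When a new site has to be added, the file can be edited by hand, or an entry can be appended with a small helper that avoids duplicates. The sketch below assumes nothing beyond the SITE!<name> line format visible in the listing above; add_site is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: add_site is a hypothetical helper name.
# Appends a "SITE!<name>" line to the given file unless that exact line exists.
add_site() {
    file="$1"
    site="$2"
    # Single quotes around SITE! keep interactive shells from history-expanding "!".
    entry='SITE!'"$site"
    if grep -Fxq -- "$entry" "$file" 2>/dev/null; then
        return 0
    fi
    printf '%s\n' "$entry" >> "$file"
}
```

Possible usage: add_site /etc/ncg/ncg.localdb IN2P3-CC, then regenerate the configuration with /usr/sbin/ncg.pl.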
Annexe 4
SITE_EMAIL=c.leroy@cea.fr
SITE_NAME=ROC-FR
RB_HOST=node04.datagrid.cea.fr
WMS_HOST=node04.datagrid.cea.fr
PX_HOST=myproxy.grif.fr
BDII_HOST=topbdii.grif.fr
SITE_BDII_HOST=bdii.grif.fr
MON_HOST=node06.datagrid.cea.fr
VOS="dteam"
DTEAM_GROUP_ENABLE="dteam"
VO_DTEAM_SW_DIR=$VO_SW_DIR/dteam
VO_DTEAM_DEFAULT_SE=$SE_HOST
VO_DTEAM_STORAGE_DIR=$CLASSIC_STORAGE_DIR/dteam
VO_DTEAM_VOMS_SERVERS='vomss://voms.cern.ch:8443/voms/dteam?/dteam/'
VO_DTEAM_VOMSES="'dteam lcg-voms.cern.ch 15004 /DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch dteam 24' 'dteam voms.cern.ch 15004 /DC=ch/DC=cern/OU=computers/CN=voms.cern.ch dteam 24'"
VO_DTEAM_VOMS_CA_DN="'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' '/DC=ch/DC=cern/CN=CERN Trusted Certification Authority'"
NAGIOS_HOST=cclcgvmli03.in2p3.fr
NAGIOS_ADMIN_DNS="/O=GRID-FR/C=FR/O=CEA/OU=IRFU/CN=Christine Leroy","/O=GRID-FR/C=FR/O=CNRS/OU=CC-LYON/CN=Nadia Lajili","/O=GRID-FR/C=FR/O=CNRS/OU=LPC/CN=Emmanuel Medernach","/O=GRID-FR/C=FR/O=CNRS/OU=CPPM/CN=Juan Carlos Carranza"
NAGIOS_NCG_ENABLE_CONFIG=true
NAGIOS_NAGIOS_ENABLE_CONFIG=true
NCG_GOCDB_ROC_NAME=France
ROC_NAME=France
NCG_PROBES_TYPE=remote,native,local
NCG_VO=dteam
NAGIOS_MYPROXY_NAME=nagios_roc_fr2
NAGIOS_MYPROXY_USER=nagios
MSG_BROKER_CACHE_NETWORK=PROD
NAGIOS_ROLE=roc
NAGIOS_HTTPD_ENABLE_CONFIG=true
NAGIOS_SUDO_ENABLE_CONFIG=true
NAGIOS_CGI_ENABLE_CONFIG=true
NCG_LDAP_FILTER=GlueSiteOtherInfo=EGEE_ROC=France
NAGIOS_DB_PASS=x
NAGIOS_NSCA_PASS=x
MYSQL_ADMIN=x
ATP_DB_PASS=x
MDDB_DB_PASS=x
MS_DB_PASS=x
MYSQL_PASSWORD=x
MYEGEE_DB_PASS=x
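Before running yaim, it can help to confirm that the Nagios-related variables are actually assigned in site-info.def. The sketch below only tests for the presence of an assignment, not for a sensible value; missing_vars is a hypothetical helper name, and which variables to require is left to the caller.

```shell
#!/bin/sh
# Sketch only: missing_vars is a hypothetical helper name.
# Usage: missing_vars <file> <VAR>...
# Prints, one per line, each variable name that has no "VAR=" assignment in the file.
missing_vars() {
    file="$1"
    shift
    for name in "$@"; do
        # An assignment is a line starting (after optional blanks) with NAME=
        grep -Eq "^[[:space:]]*${name}=" "$file" || echo "$name"
    done
}
```

Possible usage: missing_vars /opt/glite/yaim/site-info.def NAGIOS_HOST NAGIOS_ROLE ROC_NAME NAGIOS_MYPROXY_NAME (an empty output means every listed variable is assigned).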
Annexe 5
[root@cclcgvmli03 ~]# cat /opt/glite/yaim/examples/edgusers.conf
11151:${DPMMGR_USER}:11151:${DPMMGR_GROUP}:DPM user:
11152:${EDG_USER}:11152,11156:${EDG_GROUP},${INFOSYS_GROUP}:EDG user:${EDG_HOME_DIR}
11153:${EDGINFO_USER}:11153,1156:${EDGINFO_USER},${INFOSYS_GROUP}:EDG info user:${EDGINFO_HOME_DIR}
11154:${RGMA_USER}:11154,1156:${RGMA_GROUP},${INFOSYS_GROUP}:RGMA user:${INSTALL_ROOT}/glite/etc/rgma
11155:${GLITE_USER}:11155:${GLITE_GROUP}:gLite user:${GLITE_HOME_DIR}
11156:${BDII_USER}:11158:${BDII_GROUP}:BDII user:${BDII_HOME_DIR}
Annexe 6
Problem encountered on ccnagboxli01 (ypbind stopped + missing rpm)
chown -R nagios:nagios /var/spool/pnp4nagios
chown -R nagios:nagios /var/spool/msg-nagios-bridge
chown root:nagios /etc/atp/atp_db.conf
chown root:nagios /etc/mddb/databases.yml
chown nagios:apache /var/nagios/rw
chown -R nagios:apache /var/nagios/rw
chown root:nagios /etc/nagios/plugins/send_to_db.ini
chown -R nagios:nagios /var/cache/msg/config-cache
chown -R nagios:root /var/log/mddb
chown -R nagios:root /var/log/pnp4nagios
chown -R nagios:root /var/log/nagios
useradd -g 40011 -u 10011 nagios
useradd --group mysql -u 2730 mysql
useradd -g mysql -u 2730 mysql
groupadd -g 40010 nagiosmaster
groupadd -g 146 leftuser
yum install glite-security-voms-api-c-1.8.12-2.sl5.x86_64
yum update perl-Net-SSLeay
yum update perl-IO-Socket-SSL-1.01-1.fc6.noarch
yum install glite-wms-ui-commands
yum install lcg_util-1.7.6-1.sl5.x86_64
yum install GFAL-client-1.11.8-2.sl5.x86_64 GFAL-client-1.11.8-2.sl5.i386
yum install CGSI_gSOAP_2.7-1.3.3-1.sl5.x86_64
yum install LFC-interfaces-1.7.3-1sec.sl5.x86_64 LFC-client-1.7.3-1sec.sl5.x86_64 lcg-dm-common-1.7.3-1sec.sl5.x86_64