ARCHIVES/LCG-FR / SA1-FR Monitoring WG: Difference between revisions

Revision as of 18:34, 25 May 2009

Contacts (Mailing list)

Mailing list: LCG-SA1FR-MONITORING-L@IN2P3.FR. List manager: Christine Leroy (Irfu/CEA)

Group members: http://lcg.in2p3.fr/wiki/images/MembresOnly_v2.doc

Group mandate

Document available at: https://edms.in2p3.fr/file/I-013168/2/LCG-France-SA1-FR_WGMonitoring.pdf

Within a period of 6 months, the working group will:

collect the needs of the site and service managers in the region,

survey the sites' practices and the monitoring tools in use,

represent and defend the region's interests in the existing WLCG-EGEE working groups on related subjects (an EGEE group, the OAT, is currently being set up),

identify the standards to be followed and the relevant tools at every level (services, site and region), and propose to the managers of the sites, of the grid services and of the regional operation of the EGEE grid a set of tools that meets their needs,

propose, if needed, improvements to the alerting tools and procedures at site level and at regional level,

establish, if appropriate, a plan for continuing its work beyond the initial 6-month period.

All proposals and recommendations must in principle be consistent with the directions of the EGEE and WLCG projects.

The organisation and working procedures of the group will be defined by the group leader and the members themselves.

Meetings, Workshops...

Monitoring infrastructure

Message exchange infrastructure for "multi-level" monitoring: https://twiki.cern.ch/twiki/bin/view/LCG/MessagingSystemforGrid

https://twiki.cern.ch/twiki/bin/view/EGEE/MsgServerDetails

Site Monitoring

What should be monitored

Node type	Test type	Who + link URL	Validated by
WNs	NFS mounts failing - check that files can be written and read. In particular, check that the VO_[VONAME]_SW_DIR is readable.	GRIF with scripting	Not validated
WNs	ssh keys - some modes of operation require unchallenged ssh between WNs and the CE, or for MPI among the WNs. A first simple check is to verify that the WN can copy a file back to the CE.	GRIF with Nagios	Not validated
WNs	Check the processes running on each WN - that the needed processes (ntpd, pbs etc) are running, and that other things (rogue processes, stuck jobs) are not		Not validated
All service nodes	Host certificates expiring - make sure they get renewed in good time		Not validated
All nodes	CRLs expiring - this can cause failures for certificates from a single CA, which can be hard to diagnose		Not validated
All nodes	Filesystem in read-only mode	GRIF with Nagios	Not validated
All nodes	Node crashes	GRIF with Nagios	Not validated
All nodes	Disks becoming full, or nearly so. In particular, check that jobs are not filling /tmp, the home directories or other scratch space. Also check that disks don't run out of inodes	GRIF with Nagios	Not validated
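
As an illustration of the last row of the table (disks filling up and running out of inodes), here is a minimal sketch of a Nagios-style check in Python. It is not the GRIF probe, which is not shown on this page. The function names and the 80%/90% thresholds are assumptions chosen for the example; only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the usual Nagios plugin convention. It relies on `os.statvfs`, so it runs on Unix-like nodes only.

```python
import os

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def usage_percent(total, free):
    """Return the used share as a percentage; 0.0 when total is 0."""
    if total == 0:
        # Some filesystems report no inode count at all.
        return 0.0
    return 100.0 * (total - free) / total


def classify(percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a Nagios status code (thresholds are assumptions)."""
    if percent >= crit:
        return CRITICAL
    if percent >= warn:
        return WARNING
    return OK


def check_path(path, warn=80.0, crit=90.0):
    """Check both block and inode usage of the filesystem holding `path`."""
    try:
        st = os.statvfs(path)
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN - cannot stat {path}: {exc}"
    # f_bavail / f_favail count what an unprivileged job can still use.
    space = usage_percent(st.f_blocks, st.f_bavail)
    inodes = usage_percent(st.f_files, st.f_favail)
    # The worse of the two results decides the overall status.
    status = max(classify(space, warn, crit), classify(inodes, warn, crit))
    return status, f"{LABELS[status]} - {path}: {space:.1f}% space, {inodes:.1f}% inodes used"


def main(argv):
    """Print one status line per path and return the worst exit code."""
    paths = argv[1:] or ["/tmp"]
    worst = OK
    for path in paths:
        code, message = check_path(path)
        print(message)
        worst = max(worst, code)
    return worst
```

A wrapper script would end with `sys.exit(main(sys.argv))` so that Nagios receives the status as the process exit code, for example when called with `/tmp` and the home directory partition as arguments.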

Nagios

For graphs: http://nagiosgraph.sourceforge.net/

Tips and tricks

Installation

Nagios probes

Nagios and GridMonitoring

Project repository:

Fabric : http://www.sysadmin.hep.ac.uk/svn/fabric-monitoring/

Grid_services: http://www.sysadmin.hep.ac.uk/rpms/egee-SA1/sl4 & external: http://linuxsoft.cern.ch/dag/redhat/el4/

Retrieving SAM test results: http://www.gridpp.ac.uk/wiki/Nagios_sam-query_Plugin

Manual and tutorial:

https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringNcg

https://twiki.cern.ch/twiki/bin/view/EGEE/GridMonitoringNcgYaimTutorial

Messaging system:

https://twiki.cern.ch/twiki/bin/view/LCG/MessagingSystemforGrid

Web gadgets

A number of RSS feeds and widgets are available:

Example

using:

CMS

List of available widgets: [1]

Alice

List of available RSS feeds: [2]

Accounting

RSS feed: http://goc-accounting.grid-support.ac.uk/rss/YOUR-SITE-NAME_ApelSync.xml
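
To show how the accounting feed above could be read from a script, here is a small Python sketch. The URL pattern is the one given above, with YOUR-SITE-NAME replaced by the site name. Everything else is an assumption: the helper names are invented for the example, the feed is assumed to be plain RSS 2.0 (`item` elements carrying a `title`), and the sample document is made up so the parser can be tried without network access.

```python
import urllib.request
import xml.etree.ElementTree as ET

# URL pattern taken from this page; {site} stands for YOUR-SITE-NAME.
FEED_TEMPLATE = "http://goc-accounting.grid-support.ac.uk/rss/{site}_ApelSync.xml"

# Invented sample, assuming a plain RSS 2.0 layout.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>first entry</title></item>
    <item><title>second entry</title></item>
  </channel>
</rss>"""


def feed_url(site_name):
    """Build the feed URL for one site."""
    return FEED_TEMPLATE.format(site=site_name)


def item_titles(rss_text):
    """Return the title of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [item.findtext("title", default="") for item in root.iter("item")]


def fetch_titles(site_name, timeout=10):
    """Download the feed for a site and return its item titles."""
    with urllib.request.urlopen(feed_url(site_name), timeout=timeout) as response:
        return item_titles(response.read())
```

`item_titles(SAMPLE_FEED)` exercises the parser offline; `fetch_titles("YOUR-SITE-NAME")` would do the same against the live feed, provided the service is still reachable.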

Site-oriented monitoring of VO activity

This is a tool for sites that lets them follow the state of the site with respect to the activity of the VOs it supports. The idea is to gather in a single display (Gridmap style) all the significant information collected from the monitoring tools specific to the different VOs and published in a common database.
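
The aggregation idea described above can be sketched in a few lines of Python. This is purely illustrative and does not describe the actual tool or its database: the record layout `(vo, source, status)`, the three status names and the "worst status wins" rule are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical status scale, from best to worst.
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}


def site_view(records):
    """Collapse (vo, source, status) records into one worst status per VO."""
    worst = defaultdict(lambda: "ok")
    for vo, _source, status in records:
        # Reading worst[vo] also registers the VO with the default "ok".
        if SEVERITY[status] > SEVERITY[worst[vo]]:
            worst[vo] = status
    return dict(worst)


# Made-up records standing in for rows read from the common database.
records = [
    ("atlas", "dashboard", "ok"),
    ("atlas", "sam", "critical"),
    ("cms", "phedex", "warning"),
    ("lhcb", "dirac", "ok"),
]
summary = site_view(records)
```

Each VO then maps to a single status that a Gridmap-style display could colour, while the per-source records remain available for drilling down.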

Roadmap for site monitoring...providing a site view of VO activities (presentation at the WLCG Workshop, 14/11/08)

LHC VO dashboard with a site view

Grid services

FTS

CCIN2P3 RAL atlas RAL Ganglia

GridFtp

ICEPP

DPM

DPM monitoring by Gridpp

WMS

WMS monitoring by CNAF

VO services

LHC VOs

Experiment Dashboard

ALICE

* Monalisa monitoring:  http://pcalimonitor.cern.ch/ 
* Job Monitoring:  http://dashboard.cern.ch/alice/
* Daily reports:     http://dashb-alice.cern.ch/dashboard/data/
* Site efficiency :  http://dboard-gr.cern.ch/dashboard/data/summaries/

ATLAS

* Dashboard :  http://dashboard.cern.ch/atlas/
* Installation SW : https://atlas-install.roma1.infn.it/atlas_install/
* Monthly report of the number of jobs run and of the efficiency per site : http://dashb-atlas-job.cern.ch/dashboard/request.py/MonthlyReportIndex
* PanDA : http://gridinfo.triumf.ca/panglia. There is one URL per queue used by production jobs, plus one specific URL for the queues used by analysis jobs (ANALY_xxx). For production jobs

The dashboard tends to replace the other tools (except for tracking installations). It is the most complete and feature-rich. In particular, it provides the list of failed jobs with detailed information about the error, the WN involved...

CMS

* site commissioning metrics : http://lhcweb.pic.es/cms/SiteCommissioningGlobalStatus_Sites.html
* Dashboard CMS (Starting Point) : http://arda-dashboard.cern.ch/cms/
 See instructions from Facility Operation team : https://twiki.cern.ch/twiki/bin/view/CMS/SAMChecklist
* Phedex monitoring tool for transfer activities : http://cmsweb.cern.ch/phedex/
* Widget CMS: http://iglezh.web.cern.ch/iglezh/widgets/
* Job Monitoring : http://dashboard.cern.ch/cms
* CRAB JobRobot summary : http://jobrobot.web.cern.ch/JobRobot/summary_071002.html
* CMS SAM Visualization : http://lxarda16.cern.ch/dashboard/request.py/latestresultsview
* CMS Site status board
* Site Commissioning board : http://lxarda16.cern.ch/dashboard/request.py/siteview?debug=1
* Link Commissioning Status : 
 T1-T2_FR downlinks
 T2_FR-T1 uplinks

LHCb

* Site status for LHCb usage : http://lhcb-project-dirac.web.cern.ch/lhcb-project-dirac/lhcbProdnMask.html
* Dashboard : http://dashboard.cern.ch/lhcb/
* Monitoring (job LHCb) : http://lhcbweb.pic.es/DIRAC/LHCb-Production/visitor/info/general/diracOverview

Grid infrastructure

GridMap Prototype visualizing the "State" of the Grid

EGEE Monitoring Group (OAT)

https://espace.cern.ch/sa1-share/oat/Shared Documents/ : OAT documents

Old one:

WLCG Monitoring Working groups

Three groups have been created (see https://twiki.cern.ch/twiki/bin/view/LCG/LCGMonitoringWorkingGroups). The most active one is the Grid Service Monitoring group, chaired by James Casey and Ian Neilson (FC)

System management : Fabric Management, best practices, security
Grid service monitoring : See the Nagios prototype for grid services monitoring
System analysis: mainly focused on application monitoring
High Level model for WLCG Monitoring
J.Casey's Presentations@GDB 05 March 2008
- A strategy for WLCG Monitoring
- Monitoring - some worked examples

@@ Line 52: / Line 52: @@
 === What should be monitored ===
-*  Disks becoming full, or nearly so. In particular check that jobs are not filling /tmp, the home directories or other scratch space.
-* Also check that disks don't run out of inodes.
-* Node crashes and disk failures.
-*
-* Clock skew - often because the ntpd has died, or sometimes due to a problem with the clock to which ntpd is synchronised.
-*  * CRLs expiring - this can cause failures for certificates from a single CA, which can be hard to diagnose.
-* Check that you can use GridFTP from each WN to the CE and SE (although this will need a valid proxy on the WN).
-* Check the processes running on each WN - that the needed processes (ntpd, pbs etc) are running, and that other things (rogue processes, stuck jobs) are not.
-* Check log files for signs of trouble. Look for permission denied
-* Monitor the duration of jobs by WN - if all jobs to a particular WN are ending quickly it may well be faulty.
@@ Line 79: / Line 69: @@
 NFS mounts failing - check that files can be written and read. In particular checking that the VO_[VONAME]_SW_DIR is readable.
 | width="50%" |
-GRIF
+GRIF with scripting
 | width="50%" |
 Not validated
@@ Line 88: / Line 78: @@
 ssh keys - some modes of operation require unchallenged ssh between WNs and the CE, or for MPI among the WNs. First simple check is to verify that the wn can copy back a file to the ce.
 | width="50%" |
-GRIF
+GRIF with Nagios
+| width="50%" |
+Not validated
+|----
+|
+WNs
+| width="50%" |
+Check the processes running on each WN - that the needed processes (ntpd, pbs etc) are running, and that other things (rogue processes, stuck jobs) are not
+| width="50%" |
 | width="50%" |
 Not validated
@@ Line 97: / Line 96: @@
 Host certificates expiring - make sure they get renewed in good time
 | width="50%" |
-GRIF
 | width="50%" |
 Not validated
@@ Line 106: / Line 105: @@
 CRLs expiring - this can cause failures for certificates from a single CA, which can be hard to diagnose
 | width="50%" |
-GRIF
+| width="50%" |
+Not validated
+|----
+|
+All nodes
+| width="50%" |
+Filesystem in ReadOnly Mode
+| width="50%" |
+GRIF with Nagios
+| width="50%" |
+Not validated
+|----
+|
+All nodes
+| width="50%" |
+Node crashes
+| width="50%" |
+GRIF with Nagios
+| width="50%" |
+Not validated
+|----
+|
+All nodes
+| width="50%" |
+Disks becoming full, or nearly so. In particular check that jobs are not filling /tmp, the home directories or other scratch space. Also check that disks don't run out of inodes
+| width="50%" |
+GRIF with Nagios
 | width="50%" |
 Not validated

Contents

Contacts (Mailing list)

Group mandate

Meetings, Workshops...

Monitoring infrastructure

Site Monitoring

What should be monitored

Nagios

Tips and tricks

Nagios and GridMonitoring

Lemon

Cacti

Other projects

Web gadgets

CMS

Alice

Accounting

Site-oriented monitoring of VO activity

Grid services

FTS

GridFtp

DPM

WMS

VO services

LHC VOs

ALICE

ATLAS

CMS

LHCb

Grid infrastructure

GridMap Prototype visualizing the "State" of the Grid

EGEE Monitoring Group (OAT)

WLCG Monitoring Working groups

Open GGUS Tickets assigned to ROC-France
