Difference between revisions of "Atlas:Analysis Challenge-STsummary"
Revision as of 12:31, 18 December 2008
Target and metrics
- Nb of events : Few hundred up to 1000 jobs/site
- Rate (evt/s) : up to 15 Hz
- Efficiency (success/failure rate) : 80 %
- CPU utilization : CPUtime / Walltime > 50 %
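As an illustration, the targets above can be checked per site with a few lines of Python. This is only a sketch: the `SiteSummary` record and the `evaluate` function are hypothetical names, and efficiency is assumed here to mean completed / (completed + failed), which the page does not spell out.

```python
from dataclasses import dataclass


@dataclass
class SiteSummary:
    """Hypothetical per-site record; field names are assumptions."""
    completed: int        # jobs in state "c"
    failed: int           # jobs in state "f"
    event_rate_hz: float  # events per second reached by the site
    cpu_time_s: float     # summed CPU time
    wall_time_s: float    # summed wall-clock time


def evaluate(site, target_rate_hz=15.0, min_efficiency=0.80, min_cpu_ratio=0.50):
    """Compare one site against the targets listed above."""
    finished = site.completed + site.failed
    # Assumed definition: fraction of finished jobs that completed.
    efficiency = site.completed / finished if finished else 0.0
    cpu_ratio = site.cpu_time_s / site.wall_time_s if site.wall_time_s else 0.0
    return {
        "efficiency": efficiency,
        "cpu_ratio": cpu_ratio,
        "rate_ok": site.event_rate_hz >= target_rate_hz,
        "efficiency_ok": efficiency >= min_efficiency,
        "cpu_ok": cpu_ratio > min_cpu_ratio,
    }
```

For example, `evaluate(SiteSummary(597, 3, 15.0, 600.0, 1000.0))` describes a site with 3 failures out of 600 jobs; the CPU and wall times in that call are made-up illustration values.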
Monitoring
- Job status : c (completed) | f (failed) | r (running) | s (submitted)
- Log files : o (stdout) | e (stderr) | j (job wrapper output) | w (wms log)
- Job execution evolution plot is in minutes
- CPU/Walltime and events/second plots are measured with respect to the Athena execution time only. Athena setup, data preparation and data stageout are not included.
- Mean Athena Software Setup Time, Mean Prepare Inputs Time and Mean Output Storage Time are in seconds
- Mean Athena Software Setup Time : time measured from the start of the ganga worker node script until data preparation (Athena distribution kit setup, unpacking of the tarball with the pre-compiled code and setup of the pre-compiled code)
- Mean Prepare Inputs Time covers LFC bulk access, one lcg-gt call for dCache/castor or multiple lcg-gt calls for DPM for turl retrieval
- Mean Output Storage Time : call of lcg-cr
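The split between the job phases can be sketched as follows. The `JobTiming` record and its field names are hypothetical; the point is only that CPU/Walltime and events/second use the Athena execution time, while setup, input preparation and output storage are averaged separately.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JobTiming:
    """Hypothetical per-job timings in seconds; field names are assumptions."""
    setup_s: float        # Athena software setup
    prepare_s: float      # LFC bulk access and lcg-gt calls
    athena_wall_s: float  # Athena execution, wall-clock
    athena_cpu_s: float   # Athena execution, CPU
    output_s: float       # lcg-cr call
    events: int           # events processed


def athena_metrics(job):
    """CPU/Walltime and events/s relative to Athena execution time only."""
    return job.athena_cpu_s / job.athena_wall_s, job.events / job.athena_wall_s


def mean_phase_times(jobs):
    """Mean setup, input preparation and output storage times in seconds."""
    return {
        "setup": mean(j.setup_s for j in jobs),
        "prepare_inputs": mean(j.prepare_s for j in jobs),
        "output_storage": mean(j.output_s for j in jobs),
    }
```

With this split, a job that spends a long time in setup or stageout still shows a good CPU/Walltime value if Athena itself ran efficiently.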
To be tracked, open questions
- GRIF : stress test dealing with a multi-site setup
- CCIN2P3 : co-located T1 and T2
- Distribution of datasets over sites for stress test purposes
- lcg-gt timeout (jobs in running state observed at LPC & CPPM)
- Corrupted AOD files
Test 82 (Dec 15-17)
300 jobs scheduled - 10 sites involved : IN2P3-CC, LPC, GRIF-LAL, GRIF-SACLAY, GRIF-LPNHE, IN2P3-LAPP, TOKYO-LCG2, RO-02, RO-07, IN2P3-CPPM
- Summary :
28 jobs sent to IN2P3-LAPP
42 jobs sent to RO-07-NIPNE
45 jobs sent to RO-02-NIPNE
74 jobs sent to IN2P3-CPPM
237 jobs addressing GRIF-SACLAY but run at LAL
238 jobs sent to IN2P3-LPC
247 jobs sent to GRIF-LPNHE
300 jobs addressing GRIF-LAL - some run at IPNO
300 jobs sent to IN2P3-CC (T2 vs T1)
300 jobs sent to TOKYO-LCG2
- LAPP: 28 jobs sent there - no dataset available to stress the site
- LPC: half of the input files processed
  - failed jobs with error reading file from clrgpfssrv03-dpm.in2p3.fr: Connection reset by peer - Connection closed by remote end
  - 37 % of jobs remained in running state - job blocked reading file, observed at LPC (JCC) and CPPM (EK)
  - (stderr) send2nsd: NS009 - fatal configuration error: Host unknown: dpnshome.in2p3.fr
  - wrong turl pattern 'rfio:/dpm/in2p3.fr/home/atlas/atlasmcdisk/mc08..' probably due to lcg-gt timeout
- CPPM: jobs remained in running state with wrong turl pattern, probably due to lcg-gt timeout
  - job.50 failure due to the corrupted AOD AOD.027076._10514.pool.root
  - File: rfio:marjoe.in2p3.fr/../AOD.027076._10514.pool.root.1.3258341.0 badread=0
- GRIF-LPNHE: problem (user mapping misconfiguration) identified and solved - GRIF UNSCHEDULED/DOWNTIME
Test 61 & 62 (Dec 8-10)
300 jobs : IN2P3-LPC, GRIF-LPNHE, IN2P3-CPPM
600 jobs : TOKYO
Input DS Patterns:
mc08.*Wmunu*.recon.AOD.e*_s*_r5*tid*
mc08.*Zprime_mumu*.recon.AOD.e*_s*_r5*tid*
mc08.*Zmumu*.recon.AOD.e*_s*_r5*tid*
mc08.*T1_McAtNlo*.recon.AOD.e*_s*_r5*tid*
mc08.*H*zz4l*.recon.AOD.e*_s*_r5*tid*
mc08.*.recon.AOD.e*_s*_r5*tid*
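To see which dataset names a pattern selects, the patterns can be tried out with shell-style wildcard matching. This sketch assumes that `*` in the patterns behaves like a shell wildcard, which is how Python's `fnmatch` treats it; `matching_patterns` is a hypothetical helper, not part of Ganga.

```python
from fnmatch import fnmatchcase

# Input dataset patterns as listed above.
PATTERNS = [
    "mc08.*Wmunu*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*Zprime_mumu*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*Zmumu*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*T1_McAtNlo*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*H*zz4l*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*.recon.AOD.e*_s*_r5*tid*",
]


def matching_patterns(dataset_name):
    """Return the patterns (case-sensitive) that match a dataset name."""
    return [p for p in PATTERNS if fnmatchcase(dataset_name, p)]
```

The minimum-bias dataset `mc08.105001.pythia_minbias.recon.AOD.e357_s462_r541_tid027069` mentioned in the LPC summary is picked up only by the last, most general pattern.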
- Summary:
- TOKYO reached the target: 15 evt/s - CPU/Walltime > 60 %
  - only 3 job failures due to Athena crashes out of 600 jobs run
  - Corrupted input AOD files found:
    - test 61, job 170, AOD.027097._37998.pool.root
    - test 62, job 202, AOD.027579._24654.pool.root
    - test 62, job 212, AOD.027076._10514.pool.root
  - max. MySQL connections increased from the default of 100 up to 200
- LPC: 218 jobs (92.4 %) completed but not all input files could be read
  - Number of input files processed by "completed" jobs: ~29000 out of 58326 expected
  - Site events/s less than 5 Hz - CPU/Walltime < 50 %
  - Job 1.54 (DS mc08.105001.pythia_minbias.recon.AOD.e357_s462_r541_tid027069): Expected 1000, Processed 4
  - input files are available on disk
  - rfio timeout seen in corresponding stderr: rfio://clrgpfssrv03-dpm.in2p3.fr//storage/atlas1/atlas/2008-10-20/AOD.027069._03005.pool.root.1.2344130.0 can not be opened for reading (Timed out)
  - Possible work-around to be considered, as applied in the UK cloud: rfio tuning - increase of rfio bufsize on WNs. See http://northgrid-tech.blogspot.com/2008/12/rfio-tuning-for-atlas-analysis-jobs.html
- CPPM: only 69 jobs sent because input datasets were unavailable there
  - 27 failed jobs (no stdout, no stderr, did not even reach the CE) with the error message 'request expired' from WMS because of 'BrokerHelper: no compatible resources' (WMS log added to the monitoring page). This is sometimes seen by the SAM ops test
  - About 14 jobs remained in running state - seen running by site - doing nothing after 1 hour, then killed by the batch scheduler; no output/no logs = no clear diagnosis
  - CPPM is using GRIF Topbdii. Impact of this?
  - Decision taken to try to redistribute DS to Marseille to get more jobs, but unsuccessfully
- GRIF-LPNHE: 1/3 of jobs remained in running state - seen then killed by the batch scheduler - no outputs/no logs = no clear diagnosis
Test 43 (November 28)
- http://gangarobot.cern.ch/st/test_43/ 200 jobs over 12 sites
- Problems seen :
- Missing ATLAS release 14.2.20 @ CPPM
  - 02/12/08 [DONE] validate-prod of 14.2.20 at marce01.in2p3.fr (IN2P3-CPPM)
- Mapping CE-SE@GRIF-LAL and GRIF-SACLAY : publication of multiple CloseSEs by LAL and IRFU, cf. https://savannah.cern.ch/bugs/index.php?44824 - should be SOLVED by Ganga team
- LFC bulk reading...End of LFC bulk reading...ERROR: Dataset(s)...is/are empty at GRIF-SACLAY_MCDISK / LAL_MCDISK (could be related to same problem)
- ERROR@RO7-NIPNE : failed jobs with Athena errors
  - http://gangarobot.cern.ch/st/test_43/gangadir/workspace/gangarbt/LocalAMGA/11/1/output/stdout.gz
  - Core dump reading AOD file
  - error reading from file rfio:tbit00.nipne.ro//storage1/atlas/2008-11-12/AOD.026357._00299.pool.root.1.810634.0 (Timed out)
  - to be TRACKED
- NON FATAL ERRORS (completed jobs) @RO7-NIPNE, RO2-NIPNE : related to ATLASUSERDISK space token configuration
  - ERROR during execution of lcg-cr --vo atlas -s ATLASUSERDISK
  - ERROR: file not saved to RO-07-NIPNE_USERDISK in attempt number 3 ...
  - ERROR: file not saved to RO-07-NIPNE_USERDISK - using now IN2P3-CC_USERDISK
  - CORRECTED
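The non-fatal lcg-cr errors above suggest a retry-then-fallback pattern: several attempts on the local USERDISK, then a switch to another site. A minimal sketch of that pattern is shown below. It is not the Ganga implementation; `store_with_fallback` and the `store` callable are hypothetical, and the limit of 3 attempts is an assumption read off the "attempt number 3" message.

```python
def store_with_fallback(store, destinations, max_attempts=3):
    """Try each destination in order, up to max_attempts times each.

    `store` is a caller-supplied function (hypothetical) that takes a
    destination name and returns True when the file was saved.
    Returns (destination, attempt_number) for the first success.
    """
    for destination in destinations:
        for attempt in range(1, max_attempts + 1):
            if store(destination):
                return destination, attempt
    raise RuntimeError("file not saved to any destination")
```

Called with `["RO-07-NIPNE_USERDISK", "IN2P3-CC_USERDISK"]`, a `store` function that always fails on the first name and succeeds on the second would return `("IN2P3-CC_USERDISK", 1)` after three failed attempts on the local site.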