Atlas:Analysis Challenge-STsummary
Monitoring
- Job execution evolution plot is in minutes
- CPU/Walltime and events/second plots are measured with respect to the Athena execution time only. Athena setup, data preparation and data stage-out are not included.
- Mean Athena Software Setup Time, Mean Prepare Inputs Time and Mean Output Storage Time are in seconds
- Mean Athena Software Setup Time: time measured from the start of the Ganga worker node script until data preparation (Athena distribution kit setup, unpacking of the tarball with the pre-compiled code, and setup of the pre-compiled code).
- Mean Prepare Inputs Time covers LFC bulk access and TURL retrieval: one lcg-gt call for dCache/CASTOR, or multiple lcg-gt calls for DPM.
- Mean Output Storage Time: time taken by the lcg-cr call.
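The definitions above can be illustrated with a minimal sketch. The `JobTimings` record, its field names and the numbers are hypothetical (they are not taken from the monitoring code or from any test); the only point is that CPU/Walltime and events/second use the Athena execution time alone, while setup, input preparation and output storage are accounted for separately.

```python
from dataclasses import dataclass


@dataclass
class JobTimings:
    """Hypothetical per-job record; all times are in seconds."""
    setup_s: float           # Athena software setup
    prepare_inputs_s: float  # LFC bulk access + TURL retrieval
    athena_wall_s: float     # wall-clock time of the Athena execution only
    athena_cpu_s: float      # CPU time of the Athena execution only
    output_storage_s: float  # output storage (lcg-cr call)
    events: int              # events processed by Athena


def athena_metrics(job: JobTimings) -> tuple[float, float]:
    """Return (CPU/Walltime, events per second), both relative to Athena time only."""
    cpu_over_wall = job.athena_cpu_s / job.athena_wall_s
    events_per_s = job.events / job.athena_wall_s
    return cpu_over_wall, events_per_s


def total_wall_s(job: JobTimings) -> float:
    """Full job duration, including the phases excluded from the plots."""
    return (job.setup_s + job.prepare_inputs_s
            + job.athena_wall_s + job.output_storage_s)


# Illustrative numbers only.
example = JobTimings(setup_s=120.0, prepare_inputs_s=60.0,
                     athena_wall_s=1000.0, athena_cpu_s=650.0,
                     output_storage_s=30.0, events=15000)
ratio, rate = athena_metrics(example)
print(f"CPU/Walltime = {ratio:.0%}, rate = {rate:.1f} evt/s, "
      f"total = {total_wall_s(example) / 60:.1f} min")
```

A job with long setup or stage-out phases can therefore still show a good CPU/Walltime ratio in these plots.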
Test 82 (Dec 15-17)
300 jobs : IN2P3-CC, LPC, GRIF-LAL, GRIF-SACLAY, GRIF-LPNHE, IN2P3-LAPP, TOKYO-LCG2, RO-02, RO-07, IN2P3-CPPM
- Summary :
Test 61 & 62 (Dec 8-10)
300 jobs : IN2P3-LPC, GRIF-LPNHE, IN2P3-CPPM
600 jobs : TOKYO
Input DS Patterns:
mc08.*Wmunu*.recon.AOD.e*_s*_r5*tid*
mc08.*Zprime_mumu*.recon.AOD.e*_s*_r5*tid*
mc08.*Zmumu*.recon.AOD.e*_s*_r5*tid*
mc08.*T1_McAtNlo*.recon.AOD.e*_s*_r5*tid*
mc08.*H*zz4l*.recon.AOD.e*_s*_r5*tid*
mc08.*.recon.AOD.e*_s*_r5*tid*
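As a rough illustration of how these patterns select datasets, the sketch below applies them as shell-style wildcards using Python's `fnmatch`. This is an assumption about the matching semantics, not the actual Ganga/DQ2 lookup. The first dataset name appears later on this page; the second one is a made-up name used only to show a match against a specific pattern.

```python
from fnmatch import fnmatchcase

# Input dataset patterns used in tests 61 and 62.
PATTERNS = [
    "mc08.*Wmunu*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*Zprime_mumu*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*Zmumu*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*T1_McAtNlo*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*H*zz4l*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*.recon.AOD.e*_s*_r5*tid*",
]


def matching_patterns(dataset: str) -> list[str]:
    """Return every pattern that matches the dataset name (case-sensitive)."""
    return [p for p in PATTERNS if fnmatchcase(dataset, p)]


# Dataset name quoted in the LPC report on this page.
minbias = "mc08.105001.pythia_minbias.recon.AOD.e357_s462_r541_tid027069"
# Hypothetical name, for illustration only.
wmunu = "mc08.000000.hypothetical_Wmunu.recon.AOD.e1_s1_r500_tid000000"

for name in (minbias, wmunu):
    print(name)
    for pattern in matching_patterns(name):
        print("   matches", pattern)
```

Under these semantics the last pattern is a catch-all: any dataset matched by one of the first five patterns is also matched by it.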
- Summary:
- TOKYO reached the target: 15 evt/s, CPU/Walltime > 60 %
  - Only 3 job failures due to Athena crashes out of 600 jobs run
  - Corrupted input AOD files found:
    - test 61, job 170, AOD.027097._37998.pool.root
    - test 62, job 202, AOD.027579._24654.pool.root
    - test 62, job 212, AOD.027076._10514.pool.root
  - Max. MySQL connections increased from 100 (default) to 200
- LPC: 218 jobs (92.4 %) completed, but not all input files could be read
  - Number of input files processed by "completed" jobs: ~29000 out of 58326 expected
  - Site events/s is less than 5 Hz
  - Example (job 1.54, DS mc08.105001.pythia_minbias.recon.AOD.e357_s462_r541_tid027069): Expected 1000, Processed 4
  - rfio timeout seen in the corresponding stderr:
    rfio://clrgpfssrv03-dpm.in2p3.fr//storage/atlas1/atlas/2008-10-20/AOD.027069._03005.pool.root.1.2344130.0 can not be opened for reading (Timed out)
  - Possible work-around to be considered, as applied in the UK cloud: rfio tuning (increase of the rfio bufsize on the WNs). See
    http://northgrid-tech.blogspot.com/2008/12/rfio-tuning-for-atlas-analysis-jobs.html
Test 43 (November 28)
- http://gangarobot.cern.ch/st/test_43/ 200 jobs over 12 sites
- Problems seen :
- Missing ATLAS release 14.2.20 @ CPPM
  - 02/12/08 [DONE] validate-prod of 14.2.20 at marce01.in2p3.fr (IN2P3-CPPM)
- Mapping CE-SE @ GRIF-LAL and GRIF-SACLAY: publication of multiple CloseSEs by LAL and IRFU, cf. https://savannah.cern.ch/bugs/index.php?44824
  - should be SOLVED by Ganga team
- LFC bulk reading...End of LFC bulk reading...ERROR: Dataset(s)...is/are empty at GRIF-SACLAY_MCDISK / LAL_MCDISK (could be related to the same problem)
- ERROR @ RO7-NIPNE: failed jobs with Athena errors
  - http://gangarobot.cern.ch/st/test_43/gangadir/workspace/gangarbt/LocalAMGA/11/1/output/stdout.gz
  - Core dump reading AOD file:
    error reading from file rfio:tbit00.nipne.ro//storage1/atlas/2008-11-12/AOD.026357._00299.pool.root.1.810634.0 (Timed out)
  - to be TRACKED
- NON FATAL ERRORS (completed jobs) @ RO7-NIPNE, RO2-NIPNE: related to ATLASUSERDISK space token configuration
  - ERROR during execution of lcg-cr --vo atlas -s ATLASUSERDISK
  - ERROR: file not saved to RO-07-NIPNE_USERDISK in attempt number 3 ...
  - ERROR: file not saved to RO-07-NIPNE_USERDISK - using now IN2P3-CC_USERDISK
  - CORRECTED
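The non-fatal lcg-cr errors above suggest a retry-then-fallback pattern: several attempts on the local USERDISK, then a switch to another storage area. The sketch below shows one way such logic could look. It is not the actual Ganga worker node code; the function name, the `store` callable and the default of three attempts (read off the "attempt number 3" message) are assumptions.

```python
from typing import Callable


def store_with_fallback(store: Callable[[str], bool],
                        primary: str,
                        fallback: str,
                        max_attempts: int = 3) -> str:
    """Try to save the output to `primary`, then once to `fallback`.

    `store` is a hypothetical callable that attempts the copy to the given
    destination and returns True on success. Returns the destination that
    finally accepted the file.
    """
    for attempt in range(1, max_attempts + 1):
        if store(primary):
            return primary
        print(f"ERROR: file not saved to {primary} in attempt number {attempt}")
    print(f"ERROR: file not saved to {primary} - using now {fallback}")
    if store(fallback):
        return fallback
    raise RuntimeError(f"file could not be saved to {primary} or {fallback}")


# Illustration with a fake store that always fails on the primary destination.
def fake_store(destination: str) -> bool:
    return destination == "IN2P3-CC_USERDISK"


used = store_with_fallback(fake_store, "RO-07-NIPNE_USERDISK", "IN2P3-CC_USERDISK")
print("output stored at", used)
```

With such a scheme the job still completes, which is consistent with these errors being reported as non-fatal, but the output ends up at a different site than the one that ran the job.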