Atlas:Analysis Challenge-STsummary

An article from lcgwiki.
Revision as of 10:12, 18 December 2008 by Chollet (talk | contribs)

Monitoring

  • The job execution evolution plot is in minutes.
  • The CPU/Walltime and events/second plots are measured with respect to the Athena execution time only. Athena setup, data preparation and data stageout are not included.
  • Mean Athena Software Setup Time, Mean Prepare Inputs Time and Mean Output Storage Time are in seconds:
    • Mean Athena Software Setup Time: time measured from the start of the Ganga worker node script until data preparation (Athena distribution kit setup, unpacking of the tarball with the pre-compiled code and setup of the pre-compiled code).
    • Mean Prepare Inputs Time: covers LFC bulk access, then one lcg-gt call for dCache/Castor or multiple lcg-gt calls for DPM, for TURL retrieval.
    • Mean Output Storage Time: the lcg-cr call.
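
The metric definitions above can be sketched as follows. This is only an illustration of which time enters each ratio, not the Ganga or monitoring code: the `JobTimings` class, its field names and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobTimings:
    """Per-job timings in seconds (hypothetical field names)."""
    setup_s: float            # Athena software setup
    prepare_inputs_s: float   # LFC bulk access and TURL retrieval
    athena_wall_s: float      # wall-clock time of the Athena execution only
    athena_cpu_s: float       # CPU time used during the Athena execution
    output_storage_s: float   # output stageout
    events: int               # events processed by Athena


def cpu_over_walltime_percent(job: JobTimings) -> float:
    # Setup, input preparation and stageout are deliberately left out.
    return 100.0 * job.athena_cpu_s / job.athena_wall_s


def events_per_second(job: JobTimings) -> float:
    # Event rate relative to the Athena execution time only.
    return job.events / job.athena_wall_s


# Made-up numbers, for illustration only.
job = JobTimings(setup_s=120, prepare_inputs_s=60, athena_wall_s=1000,
                 athena_cpu_s=650, output_storage_s=30, events=15000)
print(cpu_over_walltime_percent(job))  # 65.0
print(events_per_second(job))          # 15.0
```

With these definitions, a slow setup or stageout does not lower CPU/Walltime or events/second; it only shows up in the three "Mean ... Time" quantities.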

Test 82 (Dec 15-17)

300 jobs : IN2P3-CC, LPC, GRIF-LAL, GRIF-SACLAY, GRIF-LPNHE, IN2P3-LAPP, TOKYO-LCG2, RO-02, RO-07, IN2P3-CPPM

  • Summary :

Test 61 & 62 (Dec 8-10)

300 jobs : IN2P3-LPC, GRIF-LPNHE, IN2P3-CPPM
600 jobs : TOKYO
Input DS Patterns:
  mc08.*Wmunu*.recon.AOD.e*_s*_r5*tid*
  mc08.*Zprime_mumu*.recon.AOD.e*_s*_r5*tid*
  mc08.*Zmumu*.recon.AOD.e*_s*_r5*tid*
  mc08.*T1_McAtNlo*.recon.AOD.e*_s*_r5*tid*
  mc08.*H*zz4l*.recon.AOD.e*_s*_r5*tid*
  mc08.*.recon.AOD.e*_s*_r5*tid*
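
As a rough illustration of how these wildcard patterns select datasets, the sketch below checks one dataset name (the minbias dataset quoted in the LPC summary of this test) against each pattern. It assumes plain shell-style globbing via Python's `fnmatch`; how the test framework itself resolves the patterns is not stated here.

```python
from fnmatch import fnmatchcase

PATTERNS = [
    "mc08.*Wmunu*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*Zprime_mumu*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*Zmumu*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*T1_McAtNlo*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*H*zz4l*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*.recon.AOD.e*_s*_r5*tid*",
]


def matching_patterns(dataset: str) -> list:
    """Return the patterns that the dataset name satisfies (case-sensitive)."""
    return [p for p in PATTERNS if fnmatchcase(dataset, p)]


dataset = "mc08.105001.pythia_minbias.recon.AOD.e357_s462_r541_tid027069"
for pattern in matching_patterns(dataset):
    print(pattern)
```

Under this assumption, only the last, catch-all pattern picks up the minbias dataset; the five physics-specific patterns are subsets of it.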

  • Summary:
 - TOKYO reached the target of 15 evt/s
   CPU/Walltime > 60 %
   Only 3 job failures, due to Athena crashes, out of 600 jobs run
   Corrupted input AOD files found:
   test 61, job 170, AOD.027097._37998.pool.root
   test 62, job 202, AOD.027579._24654.pool.root
   test 62, job 212, AOD.027076._10514.pool.root
   Max. MySQL connections increased from the default of 100 up to 200
 - LPC: 218 jobs (92.4 %) completed, but not all input files could be read
   Number of input files processed by "completed" jobs: ~29000 out of the 58326 expected
   Site event rate is below 5 Hz
   Job 1.54 (DS mc08.105001.pythia_minbias.recon.AOD.e357_s462_r541_tid027069): Expected 1000, Processed 4
   rfio timeout seen in the corresponding stderr:
   rfio://clrgpfssrv03-dpm.in2p3.fr//storage/atlas1/atlas/2008-10-20/AOD.027069._03005.pool.root.1.2344130.0 can not be opened for reading (Timed out)
   Possible work-around to be considered: rfio tuning as applied in the UK cloud (increase of the rfio bufsize on the WNs)
   See http://northgrid-tech.blogspot.com/2008/12/rfio-tuning-for-atlas-analysis-jobs.html
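
The figures quoted in this summary can be put on a common percentage scale with a few lines of arithmetic. This is only a convenience sketch using the numbers above; the ~29000 file count is approximate, so the derived fraction is approximate too.

```python
def percent(part: float, whole: float) -> float:
    """Share of `part` in `whole`, in percent."""
    return 100.0 * part / whole


tokyo_failure_rate = percent(3, 600)           # Athena crashes among TOKYO jobs
lpc_files_processed = percent(29000, 58326)    # input files read by "completed" LPC jobs
lpc_job_154_processed = percent(4, 1000)       # Job 1.54: processed vs expected

print(f"TOKYO job failures:       {tokyo_failure_rate:.1f} %")
print(f"LPC input files read:     {lpc_files_processed:.1f} %")
print(f"LPC job 1.54 processed:   {lpc_job_154_processed:.1f} %")
```

So although 92.4 % of the LPC jobs are reported as completed, they read only about half of the expected input files.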

Test 43 (November 28)

- Missing ATLAS release 14.2.20 @ CPPM 
  02/12/08 [DONE] validate-prod of 14.2.20 at marce01.in2p3.fr (IN2P3-CPPM)
- CE-SE mapping @ GRIF-LAL and GRIF-SACLAY: multiple CloseSEs published
  by LAL and IRFU, cf. https://savannah.cern.ch/bugs/index.php?44824
  Should be SOLVED by the Ganga team
- LFC bulk reading...End of LFC bulk reading...ERROR: Dataset(s)...is/are
  empty at GRIF-SACLAY_MCDISK / LAL_MCDISK
  (could be related to the same problem)
- ERROR@RO7-NIPNE : failed jobs with Athena errors
  http://gangarobot.cern.ch/st/test_43/gangadir/workspace/gangarbt/LocalAMGA/11/1/output/stdout.gz
  Core dump reading AOD file
  error reading from file rfio:tbit00.nipne.ro//storage1/atlas/2008-11-12/AOD.026357._00299.pool.root.1.810634.0 (Timed out)
  to be TRACKED
- NON FATAL ERRORS (completed jobs) @RO7-NIPNE, RO2-NIPNE : related to
  ATLASUSERDISK space token configuration
  ERROR during execution of lcg-cr --vo atlas -s ATLASUSERDISK
  ERROR: file not saved to RO-07-NIPNE_USERDISK in attempt number 3 ...
  ERROR: file not saved to RO-07-NIPNE_USERDISK - using now IN2P3-CC_USERDISK
  CORRECTED
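
The last two log lines suggest a retry-then-fallback pattern for output storage: several attempts on the site's own USERDISK, then a switch to another destination. A minimal sketch of that pattern follows. It is not the Ganga worker-node code: `store_with_fallback`, `fake_store` and the limit of 3 attempts are assumptions made for illustration, and the real copy-and-register step (the lcg-cr call) is replaced by a stand-in function.

```python
from typing import Callable


def store_with_fallback(store: Callable[[str], bool], primary: str,
                        fallback: str, max_attempts: int = 3) -> str:
    """Try `primary` up to `max_attempts` times, then `fallback` once.

    `store` is any callable that returns True when the file was saved
    to the named destination. Returns the destination that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        if store(primary):
            return primary
        print(f"ERROR: file not saved to {primary} in attempt number {attempt}")
    print(f"ERROR: file not saved to {primary} - using now {fallback}")
    if store(fallback):
        return fallback
    raise RuntimeError(f"file not saved to {primary} or {fallback}")


def fake_store(destination: str) -> bool:
    # Stand-in for the real storage call: pretends that the RO-07
    # space token is misconfigured and every other destination works.
    return destination != "RO-07-NIPNE_USERDISK"


result = store_with_fallback(fake_store, "RO-07-NIPNE_USERDISK",
                             "IN2P3-CC_USERDISK")
print(result)  # IN2P3-CC_USERDISK
```

This also shows why such errors are non-fatal: the job completes, but its output lands at a different site than intended.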