Difference between revisions of "Atlas:Analysis Challenge-STsummary"

Latest revision as of 16:41, 10 March 2009

Targets and metrics

  • Number of jobs: a few hundred up to 1000 jobs per site
  • Rate (evt/s): up to 15 Hz
  • Success rate (success vs. failure) > 80 %
  • CPU utilization: CPU time / Walltime > 50 %
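
As a rough illustration of how the two threshold targets above could be checked for one site, here is a minimal Python sketch. The `SiteResult` structure and the function names are hypothetical, the success rate is assumed to mean successful jobs divided by all jobs, and the CPU and walltime figures in the usage note are illustrative, not measured values.

```python
from dataclasses import dataclass


@dataclass
class SiteResult:
    """Per-site totals for one stress test (hypothetical structure)."""
    jobs_total: int
    jobs_succeeded: int
    cpu_time_s: float   # summed CPU time of the Athena execution
    wall_time_s: float  # summed walltime of the Athena execution


def success_rate(r: SiteResult) -> float:
    # Assumption: success rate = successful jobs / all jobs.
    return r.jobs_succeeded / r.jobs_total if r.jobs_total else 0.0


def cpu_efficiency(r: SiteResult) -> float:
    # CPU time / Walltime, as in the target list.
    return r.cpu_time_s / r.wall_time_s if r.wall_time_s else 0.0


def meets_targets(r: SiteResult) -> bool:
    # Targets from the list: success rate > 80 %, CPU/Walltime > 50 %.
    return success_rate(r) > 0.80 and cpu_efficiency(r) > 0.50
```

For example, a site with 597 successful jobs out of 600 and an assumed CPU/Walltime ratio of 0.65 passes both targets, while a site at 30 % CPU/Walltime fails regardless of its success rate.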

Monitoring

  • http://gangarobot.cern.ch/st/
  • CPU/Walltime and events/second plots are measured only with respect to the Athena execution time. Athena setup, data preparation and data stage-out are not included.
  • Mean Athena Software Setup Time, Mean Prepare Inputs Time and Mean Output Storage Time are in seconds
    • Mean Athena Software Setup Time: time measured from the start of the Ganga worker node script until data preparation (Athena distribution kit setup, unpacking of the tarball with the pre-compiled code, and setup of the pre-compiled code).
    • Mean Prepare Inputs Time: covers LFC bulk access, plus one lcg-gt call for dCache/Castor or multiple lcg-gt calls for DPM for TURL retrieval
    • Mean Output Storage Time: duration of the lcg-cr call

To be tracked, open questions

  • GRIF: dealing with a multi-site. Is it possible to stress GRIF site by site?
  • Distribution of datasets over sites for stress test purposes (is it possible to improve it?)
  • lcg-gt timeout and overload of the SRM interface (jobs stuck in running state observed at LPC and CPPM, but no clear load on the DPM head node or DPNS identified)
  • Corrupted AOD files
  • rfio tuning (read-ahead buffer size) needed at some sites?
  • 2009 plans? Many problems identified but no real stress, so no clear bottlenecks identified at sites for now
  • Comparison and understanding of job timing (Athena software setup time discrepancy)

Test 82 (Dec 15-17)

  • http://atlas-ganga-storage.cern.ch/test_82/

300 jobs scheduled, 10 sites involved: IN2P3-CC, LPC, GRIF-LAL, GRIF-SACLAY, GRIF-LPNHE, IN2P3-LAPP, TOKYO-LCG2, RO-02, RO-07, IN2P3-CPPM

  • Summary :

28 jobs sent to IN2P3-LAPP
42 jobs sent to RO-07-NIPNE
45 jobs sent to RO-02-NIPNE
74 jobs sent to IN2P3-CPPM
237 jobs addressing GRIF-SACLAY but run at LAL
238 jobs sent to IN2P3-LPC
247 jobs sent to GRIF-LPNHE
300 jobs addressing GRIF-LAL - some run at IPNO
300 jobs sent to IN2P3-CC
300 jobs sent to TOKYO-LCG2

- LAPP: 28 jobs sent there; no dataset available to stress the site
- LPC: about half of the input files processed
  Seen in the stderr of job 178 (completed): http://gangarobot.cern.ch/st/test_82/gangadir/workspace/gangarbt/LocalAMGA/5/178/output/stderr.gz
  file rfio://clrgpfssrv03-dpm.in2p3.fr//storage/atlas1/atlas/2008-10-20/AOD.026982._00003.pool.root.1.2356089.0
  can not be opened for reading (Timed out)
  Failed jobs with errors reading files from clrgpfssrv03-dpm.in2p3.fr:
  Connection reset by peer - Connection closed by remote end
  37 % of jobs remained in running state
  Jobs blocked reading a file observed at LPC (JCC) and CPPM (EK):
  (stderr) send2nsd: NS009 - fatal configuration error: Host unknown: dpnshome.in2p3.fr
  wrong TURL pattern 'rfio:/dpm/in2p3.fr/home/atlas/atlasmcdisk/mc08..'
  probably due to lcg-gt timeout
- CPPM: 1 job remained in running state with a wrong TURL pattern,
   probably due to lcg-gt timeout
   Job 50 failed due to the corrupted file AOD.027076._10514.pool.root
   File: rfio:marjoe.in2p3.fr/../AOD.027076._10514.pool.root.1.3258341.0 badread=0
- GRIF-LPNHE: problem (user mapping misconfiguration) identified
  GRIF UNSCHEDULED/DOWNTIME
- GRIF-SACLAY: 6 batch queues in GANGA JDL
  3 CEs: polgrid, grid10.lal, node07.cea
  All jobs supposed to run at SACLAY were in fact executed at LAL,
  with inputs taken from the SACLAY SE and outputs written to the LAL SE.
  Good test of the 5 Gb/s link between SACLAY and LAL, but we miss the stress test (ST)
  of GRIF-SACLAY
- GRIF-LAL: 8 batch queues in GANGA JDL
  4 CEs: polgrid, grid10.lal, node07.cea, ipnls
  Difference with SACLAY? Some jobs ran at IPNO
  Not all files were processed
  Job 8.19, DS mc08...: Expected 999, Processed 334
  Seen in the stderr of job 19 (completed): http://gangarobot.cern.ch/st/test_82/gangadir/workspace/gangarbt/LocalAMGA/8/19/output/stderr.gz
  file rfio://grid19.lal.in2p3.fr//... can not be opened for reading (Bad credentials)
- CCIN2P3: 3 CEs (2 from T2 and 1 from T1) in GANGA JDL
  Jobs ran on cclclgceli05, cclclgceli06, cclclgceli01. Why?
  The overall jobs-over-time plot shows that 140 jobs started almost immediately;
  the remaining ones were queued for 16 hours
  CC top-level BDII problem on Dec 15th at the end of the day
  + one CE failure at T2. Effect on ST?
  Job 36 failed due to the corrupted AOD file AOD.027076._10514.pool.root
  + Athena core dump
- CPU/Walltime:
   most sites at ~30 % except LAPP and Tokyo (~60 %)
   and SACLAY, because jobs ran at LAL with input at Saclay
   LPC: 2 bumps?
- rfio: problems at LPC and some other sites
  (number of processed files differs from number of expected files)
  no problem at CPPM this time
- ATHENA setup time: longer at LAL
- Storage time: duration of the lcg-cr command,
   large differences between sites to be understood
- Prepare Inputs: overload of lcg-gt. Ongoing test at Glasgow:
  convert the SURL 'srm://' to TURL 'rfio:/' and let DPM do the
  automatic translation (no real effect; the load is shifted)
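
The TURL forms quoted in this test (a host-qualified form such as 'rfio://clrgpfssrv03-dpm.in2p3.fr//storage/...' and the namespace form 'rfio:/dpm/in2p3.fr/...' reported above as the wrong pattern) can be told apart mechanically when scanning job logs. The following Python sketch is only an illustration: the regular expressions and the `classify_turl` helper are assumptions derived from the examples on this page, not a definition of valid DPM TURLs.

```python
import re

# Host-qualified form, e.g. rfio://host//storage/... or rfio:host//storage/...
# (assumption: both spellings seen on this page count as host-qualified).
HOST_TURL = re.compile(r"^rfio:(//)?[^/:]+//")

# Namespace form without a host, e.g. rfio:/dpm/in2p3.fr/home/atlas/...
NAMESPACE_TURL = re.compile(r"^rfio:/dpm/")


def classify_turl(turl: str) -> str:
    """Return 'host', 'namespace' or 'unknown' for an rfio TURL string."""
    if HOST_TURL.match(turl):
        return "host"
    if NAMESPACE_TURL.match(turl):
        return "namespace"
    return "unknown"
```

Counting the 'namespace' results per site would give a quick estimate of how many jobs received the pattern associated here with the lcg-gt timeout.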

Test 61 & 62 (Dec 8-10)

  • http://atlas-ganga-storage.cern.ch/test_61/
  • http://atlas-ganga-storage.cern.ch/test_62/

300 jobs: IN2P3-LPC, GRIF-LPNHE, IN2P3-CPPM
600 jobs: TOKYO
Input DS Patterns:
mc08.*Wmunu*.recon.AOD.e*_s*_r5*tid*
mc08.*Zprime_mumu*.recon.AOD.e*_s*_r5*tid*
mc08.*Zmumu*.recon.AOD.e*_s*_r5*tid*
mc08.*T1_McAtNlo*.recon.AOD.e*_s*_r5*tid*
mc08.*H*zz4l*.recon.AOD.e*_s*_r5*tid*
mc08.*.recon.AOD.e*_s*_r5*tid*
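
To see which of these patterns select a given dataset, they can be tried against a dataset name. The Python sketch below assumes the patterns behave like shell-style wildcards ('*' matches any run of characters); whether the stress-test framework interprets them exactly this way is not stated here. The `matching_patterns` helper is hypothetical.

```python
from fnmatch import fnmatchcase

# Input dataset patterns as listed for tests 61 and 62.
PATTERNS = [
    "mc08.*Wmunu*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*Zprime_mumu*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*Zmumu*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*T1_McAtNlo*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*H*zz4l*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*.recon.AOD.e*_s*_r5*tid*",
]


def matching_patterns(dataset: str) -> list[str]:
    """Return the patterns that match a dataset name (case-sensitive)."""
    return [p for p in PATTERNS if fnmatchcase(dataset, p)]
```

Under this assumption, the dataset mc08.105001.pythia_minbias.recon.AOD.e357_s462_r541_tid027069 quoted in the LPC results is selected only by the last, catch-all pattern.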

  • Summary:
 - TOKYO reached the target: 15 evt/s, CPU/Walltime > 60 %
   Only 3 job failures due to Athena crashes out of 600 jobs run
   Corrupted input AOD files found:
   test 61, job 170, AOD.027097._37998.pool.root
   test 62, job 202, AOD.027579._24654.pool.root
   test 62, job 212, AOD.027076._10514.pool.root
   Max. MySQL connections increased from 100 (default) to 200
 - LPC: 218 jobs (92.4 %) completed but not all input files could be read
   Number of input files processed by "completed" jobs: ~29000 out of the 58326 expected
   Site events/s less than 5 Hz, CPU/Walltime < 50 %
   Job 1.54, DS mc08.105001.pythia_minbias.recon.AOD.e357_s462_r541_tid027069: Expected 1000, Processed 4
   Input files are available on disk
   rfio timeout seen in the corresponding stderr:
   rfio://clrgpfssrv03-dpm.in2p3.fr//storage/atlas1/atlas/2008-10-20/AOD.027069._03005.pool.root.1.2344130.0
   can not be opened for reading (Timed out)
   Possible work-around to be considered (applied in UK cloud):
   rfio tuning - increase of rfio bufsize on WNs
   See also http://northgrid-tech.blogspot.com/2008/12/rfio-tuning-for-atlas-analysis-jobs.html
 - CPPM:
   Only 69 jobs sent because input datasets were unavailable there
   27 failed jobs (no stdout, no stderr, did not even reach the CE) with
   the error message 'request expired' from WMS because of 'BrokerHelper:
   no compatible resources' (WMS log added to the monitoring page).
   This is sometimes seen by the SAM ops test
   About 14 jobs remained in running state: seen running by the site, doing
   nothing after 1 hour, then killed by the batch scheduler
   No output/no logs = no clear diagnosis
   CPPM is using the GRIF top-level BDII. Impact of this?
   Decision taken to try to redistribute DS to Marseille to get more jobs,
   but unsuccessfully
 - GRIF-LPNHE: 1/3 of jobs remained in running state, then were
   killed by the batch scheduler; no outputs/no logs = no clear diagnosis
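
The "Expected N, Processed M" figures quoted in these summaries lend themselves to a quick per-job completeness check. The Python sketch below is a minimal illustration; the regular expression is modelled on the two lines quoted on this page and the `processed_fraction` helper is hypothetical, not part of the monitoring tools.

```python
import re
from typing import Optional

# Matches e.g. "... Expected 1000, Processed 4" as quoted in the summaries.
EXPECTED_PROCESSED = re.compile(r"Expected\s+(\d+),\s*Processed\s+(\d+)")


def processed_fraction(line: str) -> Optional[float]:
    """Return processed/expected for a summary line, or None if not found."""
    m = EXPECTED_PROCESSED.search(line)
    if m is None:
        return None
    expected, processed = int(m.group(1)), int(m.group(2))
    if expected == 0:
        return None  # avoid dividing by zero for empty datasets
    return processed / expected
```

Applied to the LPC job 1.54 line, this gives a fraction of 0.004, which makes the gap between a "completed" job status and the data actually read explicit.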

Test 43 (November 28)

200 jobs over 12 sites: http://gangarobot.cern.ch/st/test_43/

Problems seen:

- Missing ATLAS release 14.2.20 @ CPPM
  02/12/08 [DONE] validate-prod of 14.2.20 at marce01.in2p3.fr (IN2P3-CPPM)
- CE-SE mapping at GRIF-LAL and GRIF-SACLAY: publication of multiple CloseSEs
  by LAL and IRFU, cf. https://savannah.cern.ch/bugs/index.php?44824
  should be SOLVED by the Ganga team
- LFC bulk reading...End of LFC bulk reading...ERROR: Dataset(s)...is/are
  empty at GRIF-SACLAY_MCDISK / LAL_MCDISK
  (could be related to the same problem)
- ERROR@RO7-NIPNE: failed jobs with Athena errors
  http://gangarobot.cern.ch/st/test_43/gangadir/workspace/gangarbt/LocalAMGA/11/1/output/stdout.gz
  Core dump reading AOD file:
  error reading from file rfio:tbit00.nipne.ro//storage1/atlas/2008-11-12/AOD.026357._00299.pool.root.1.810634.0 (Timed out)
  to be TRACKED
- NON FATAL ERRORS (completed jobs) @RO7-NIPNE, RO2-NIPNE : related to
  ATLASUSERDISK space token configuration
  ERROR during execution of lcg-cr --vo atlas -s ATLASUSERDISK
  ERROR: file not saved to RO-07-NIPNE_USERDISK in attempt number 3 ...
  ERROR: file not saved to RO-07-NIPNE_USERDISK - using now IN2P3-CC_USERDISK
  CORRECTED
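
Several error messages recur across the tests above (rfio timeouts, connection resets, bad credentials, the NS009 name server error, failed lcg-cr stores). A simple way to tally them over a job's stderr or stdout is sketched below in Python. The signature table and its labels are assumptions assembled from the messages quoted on this page, and `count_signatures` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable

# Substring -> label; substrings are taken from messages quoted on this page,
# labels are illustrative names chosen for this sketch.
SIGNATURES = {
    "Timed out": "rfio timeout",
    "Connection reset by peer": "connection reset",
    "Bad credentials": "bad credentials",
    "NS009": "name server host unknown",
    "file not saved to": "output storage failure",
}


def count_signatures(lines: Iterable[str]) -> Counter:
    """Count how many lines contain each known error signature."""
    counts: Counter = Counter()
    for line in lines:
        for needle, label in SIGNATURES.items():
            if needle in line:
                counts[label] += 1
    return counts
```

Feeding it the lines of a decompressed stderr.gz (for instance via `gzip.open(path, "rt")`) would give a per-job breakdown that can then be compared between sites.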