Atlas:Analysis Challenge ST

An article from lcgwiki.

Site Stress Test

Procedure

  • Replication of target datasets across the cloud
  • Preparation of the job
  • Generation of n jobs per site (each job processes 1 dataset)
  • Bulk submission to the WMS (1 per site)
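
The generation step above can be sketched in plain Python. This only illustrates the bookkeeping and is not the Ganga-based framework itself: the site names, dataset names and the `plan_jobs` helper are hypothetical, and reusing datasets round-robin when there are more jobs than datasets is an assumption of this sketch.

```python
from itertools import cycle, islice


def plan_jobs(sites, datasets, n_per_site):
    """Return {site: [job, ...]} with n_per_site jobs for every site.

    Each job processes exactly one dataset. Datasets are reused
    round-robin when n_per_site exceeds the number of datasets
    (an assumption of this sketch, not something the page states).
    """
    if not datasets:
        raise ValueError("at least one dataset is required")
    plan = {}
    for site in sites:
        chosen = islice(cycle(datasets), n_per_site)
        plan[site] = [
            {"site": site, "dataset": ds, "index": i}
            for i, ds in enumerate(chosen)
        ]
    return plan


# Hypothetical inputs, for illustration only.
sites = ["SITE_A", "SITE_B"]
datasets = ["dataset_1", "dataset_2", "dataset_3"]
plan = plan_jobs(sites, datasets, n_per_site=5)

# One bulk submission per site: each list would be handed over in one go.
for site, jobs in plan.items():
    print(site, len(jobs), "jobs")
```

Grouping the jobs by site mirrors the "1 bulk submission per site" step; the actual submission to the WMS is left out on purpose.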

Test conditions

  • The testing framework is Ganga-based. It currently uses the LCG backend, but it will soon be possible to use the PANDA backend as well. Metrics are collected and displayed at http://gangarobot.cern.ch/st/
  • Both POSIX I/O and "copy mode" may be used, which allows the performance of the two modes to be compared.
  • ATLAS software release 14.2.20
  • Input DS Patterns used (first one is the preferred one for muon analysis):
   mc08.*Wmunu*.recon.AOD.e*_s*_r5*tid*
   mc08.*Zprime_mumu*.recon.AOD.e*_s*_r5*tid*
   mc08.*Zmumu*.recon.AOD.e*_s*_r5*tid*
   mc08.*T1_McAtNlo*.recon.AOD.e*_s*_r5*tid*
   mc08.*H*zz4l*.recon.AOD.e*_s*_r5*tid*
   mc08.*.recon.AOD.e*_s*_r5*tid* (this pattern includes all the previous ones)
  • Input datasets are read from ATLASMCDISK and outputs are stored on ATLASUSERDISK (no special requirements there). Input data access is the main issue; no problems have been seen with data output.
  • Required CPU time : GlueCEPolicyMaxCPUTime >= 1440 (in minutes, i.e. 1 day; typical job duration : 5 hours)
  • Jobs run under DN : /O=GermanGrid/OU=LMU/CN=Johannes_Elmsheuser
  • LAN saturation has been observed where the WN and the SE are connected by a 1 Gb network link.
  • It is possible for sites to limit the number of jobs sent at a time.
  • Test duration : 48 hours
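
To see how the input dataset patterns listed above select datasets, here is a small sketch using Python's standard `fnmatch` module. It assumes the patterns are shell-style wildcards in which `*` matches any run of characters. The dataset names in `examples` are made up for illustration and do not refer to real datasets; `first_matching_pattern` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Specific patterns from the list above, in order of preference.
PATTERNS = [
    "mc08.*Wmunu*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*Zprime_mumu*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*Zmumu*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*T1_McAtNlo*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*H*zz4l*.recon.AOD.e*_s*_r5*tid*",
]
# The last pattern of the list, which includes all the previous ones.
CATCH_ALL = "mc08.*.recon.AOD.e*_s*_r5*tid*"


def first_matching_pattern(name):
    """Return the first specific pattern that matches name, or None."""
    for pattern in PATTERNS:
        if fnmatchcase(name, pattern):
            return pattern
    return None


# Made-up dataset names, for illustration only.
examples = [
    "mc08.000001.ExampleWmunu.recon.AOD.e1_s2_r500_tid000001",
    "mc08.000002.ExampleZmumu.recon.AOD.e1_s2_r500_tid000002",
    "mc08.000003.ExampleOther.recon.AOD.e1_s2_r500_tid000003",
    "mc08.000004.ExampleWmunu.recon.ESD.e1_s2_r500_tid000004",
]

for name in examples:
    print(name, first_matching_pattern(name), fnmatchcase(name, CATCH_ALL))
```

The third name is only selected by the catch-all pattern, and the fourth (an ESD rather than an AOD name) is selected by none of them.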

File access modes used by Ganga Jobs

Two access modes may be used during ST tests.

  • POSIX I/O (j.inputdata.type='DQ2_LOCAL') : currently the default method for accessing files when users submit Ganga jobs with the LCG backend.
  • FileStager (j.inputdata.type='FILE_STAGER') : an alternative access mode used by the ST tests. It copies the input files from the local SE to the worker node tmp area with lcg-cp, in a background thread of the athena event loop. This mode still needs some improvement to be fully stable, but it has demonstrated good performance during ST at some (but not all) sites.
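
The idea behind the copy mode, fetching the next input file in the background while the current one is being processed, can be sketched as follows. This is not the FileStager code: it is a minimal stand-alone illustration in which `shutil.copy` stands in for lcg-cp, local temporary directories stand in for the SE and the worker node tmp area, and `process_with_prefetch` is a hypothetical helper. It also assumes the input files have distinct base names.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def process_with_prefetch(sources, scratch_dir, process):
    """Process files in order while the next one is copied in the background."""
    results = []
    if not sources:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start copying the first file, then always stay one copy ahead.
        pending = pool.submit(shutil.copy, sources[0], scratch_dir)
        for nxt in list(sources[1:]) + [None]:
            # Wait for the current file; a failed copy raises here.
            local = Path(pending.result())
            if nxt is not None:
                pending = pool.submit(shutil.copy, nxt, scratch_dir)
            # Processing overlaps with the background copy of nxt.
            results.append(process(local))
            local.unlink()  # free scratch space once the file is done
    return results


def count_lines(path):
    """Toy 'analysis': count the lines of a text file."""
    return len(path.read_text().splitlines())


# Demo with temporary directories standing in for the SE and the WN.
with tempfile.TemporaryDirectory() as src_dir, \
        tempfile.TemporaryDirectory() as scratch:
    sources = []
    for i in range(3):
        path = Path(src_dir) / f"input_{i}.txt"
        path.write_text("event\n" * (i + 1))
        sources.append(path)
    counts = process_with_prefetch(sources, scratch, count_lines)
    leftover = list(Path(scratch).iterdir())

print(counts, leftover)
```

At most two files sit in the scratch area at any time (the one being processed and the one being copied), which keeps the disk footprint on the worker node small.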

Target and metrics

  • Number of jobs : a few hundred up to 1000 jobs/site
  • Rate (evt/s) : up to 15 Hz
  • Success rate (successful jobs / all jobs) : > 80 %
  • CPU utilization : CPUtime / Walltime > 50 %
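
As an illustration of how these targets could be checked from per-job records, here is a minimal sketch. The record layout (`ok`, `events`, `cpu_s`, `wall_s`), the `summarize` helper and the sample numbers are all hypothetical. Taking the success rate as successful jobs over all jobs and the event rate as a per-job average are assumptions of this sketch; the metrics published by the robot may be defined differently.

```python
def summarize(jobs):
    """Compute success rate, CPU utilization and mean event rate.

    jobs is a list of dicts with the (hypothetical) keys:
      ok      True if the job succeeded
      events  number of events processed
      cpu_s   CPU time in seconds
      wall_s  wall-clock time in seconds
    """
    if not jobs:
        raise ValueError("no jobs to summarize")
    done = [j for j in jobs if j["ok"]]
    cpu = sum(j["cpu_s"] for j in done)
    wall = sum(j["wall_s"] for j in done)
    rates = [j["events"] / j["wall_s"] for j in done if j["wall_s"] > 0]
    return {
        "success_rate": len(done) / len(jobs),
        "cpu_utilization": cpu / wall if wall > 0 else 0.0,
        "mean_rate_hz": sum(rates) / len(rates) if rates else 0.0,
    }


def meets_targets(summary):
    """Apply the thresholds quoted above: > 80 % success, > 50 % CPU/Walltime."""
    return summary["success_rate"] > 0.80 and summary["cpu_utilization"] > 0.50


# Made-up sample: nine successful jobs and one failure.
good = {"ok": True, "events": 18000, "cpu_s": 900, "wall_s": 1500}
bad = {"ok": False, "events": 0, "cpu_s": 10, "wall_s": 600}
summary = summarize([dict(good) for _ in range(9)] + [bad])

print(summary, meets_targets(summary))
```

The event rate is only reported, not tested, since the page quotes it as "up to 15 Hz" rather than as a threshold.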

Results and Monitoring

More information

  • Latest news from the ADC development meeting : http://indico.cern./conferenceDisplay.py?confId=48239
    • Using a prestager (copying the needed files to the WN in the background while processing) improves the CPU/Walltime ratio at most sites (up to > 90 %) and the event rate (> 20 Hz). However, lcg-cp is used for the copy...
    • Some tests were performed on the whole LCG cloud, such as this one:

http://gangarobot.cern.ch/st/test_105/