Atlas:Analysis HC beyond STEP09


--Chollet 15:56, 19 October 2009 (CEST)

Distributed Analysis Stress Tests - HammerCloud beyond STEP09

Lessons learnt from STEP09

  • Sites may identify a reasonable amount of analysis they can sustain and set hard limits on the number of running analysis jobs.
  • Balancing data across many disk servers is essential.
  • Analysis requires very high I/O (5 MB/s per job). Sites should review their LAN architecture to avoid bottlenecks.
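As a rough illustration of the I/O lesson, the aggregate LAN bandwidth scales linearly with the number of running analysis jobs. The sketch below only multiplies the 5 MB/s per-job figure quoted above; the function names and the job counts are hypothetical, and real jobs will not all read at the same rate.

```python
PER_JOB_MB_S = 5.0  # per-job I/O figure quoted in the STEP09 lessons


def required_bandwidth_mb_s(running_jobs, per_job_mb_s=PER_JOB_MB_S):
    """Aggregate bandwidth (MB/s) if every running job reads at per_job_mb_s."""
    if running_jobs < 0:
        raise ValueError("running_jobs must be non-negative")
    return running_jobs * per_job_mb_s


def max_jobs_for_bandwidth(available_mb_s, per_job_mb_s=PER_JOB_MB_S):
    """Largest number of jobs that fits in the available bandwidth (MB/s)."""
    if available_mb_s < 0 or per_job_mb_s <= 0:
        raise ValueError("bandwidth values must be positive")
    return int(available_mb_s // per_job_mb_s)


if __name__ == "__main__":
    # Hypothetical site: 200 analysis job slots.
    print(required_bandwidth_mb_s(200), "MB/s needed for 200 jobs")
    # A 10 Gb/s link carries at most about 1250 MB/s (10e9 bits / 8).
    print(max_jobs_for_bandwidth(1250.0), "jobs fit in 1250 MB/s")
```

A site could use the second function to derive a hard limit on running analysis jobs from the bandwidth it can actually deliver between disk servers and worker nodes.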

Results

HC web interface

ATLAS STEP09 summary

Some conclusions about DB access HC tests

ATLAS Criteria for T2s to be considered

(from K. Bos, Nov. 09) T2s are for user analysis but should be considered if :

  • they have at least 100 TB of storage space
  • they have passed the HC validation : 90 % efficiency, 150 Mev/day
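The criteria above can be written as a simple check. This is a minimal sketch, not an official ATLAS tool: the `SiteSummary` type and its field names are hypothetical, the thresholds are read as minimum values, and the rate is kept in the "Mev/day" unit used above (taken here, as an assumption, to mean million events per day).

```python
from dataclasses import dataclass


@dataclass
class SiteSummary:
    """Hypothetical per-site summary; field names are illustrative only."""
    name: str
    storage_tb: float           # available storage space in TB
    hc_efficiency: float        # fraction of successful HC jobs, 0.0 to 1.0
    hc_rate_mev_per_day: float  # HC rate in the "Mev/day" unit quoted above


def meets_t2_criteria(site, min_storage_tb=100.0,
                      min_efficiency=0.90, min_rate_mev_per_day=150.0):
    """True if the site reaches all three thresholds (assumed to be minima)."""
    return (site.storage_tb >= min_storage_tb
            and site.hc_efficiency >= min_efficiency
            and site.hc_rate_mev_per_day >= min_rate_mev_per_day)


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    example = SiteSummary("EXAMPLE-T2", storage_tb=120.0,
                          hc_efficiency=0.93, hc_rate_mev_per_day=160.0)
    print(example.name, meets_t2_criteria(example))
```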

ATLAS Info & Contacts

  • ATLAS HammerCloud wiki pages : https://twiki.cern.ch/twiki/bin/view/Atlas/HammerCloud
  • Information via mailing list ATLAS-LCG-OP-L@in2p3.fr
  • LPC : Nabil Ghodbane AT cern.ch
  • LAL : Nicolas Makovec
  • LAPP : Stéphane Jézéquel
  • CPPM : Emmanuel Le Guirriec
  • LPSC : Sabine Crepe
  • LPNHE : Tristan Beau
  • CC-T2 : Catherine, Ghita
  • IRFU : Nathalie Besson

HC Tests

ATLAS-HC-small.jpg

Objectives

  • Improve Cloud readiness by following site and ATLAS problems week by week (SL5 migration, site upgrades)
  • Identify the best data access method per site by comparing the event rate and CPU/Walltime

https://twiki.cern.ch/twiki/bin/view/Atlas/HammerCloudDataAccess#FR_cloud

  • Exercise Analysis with Conditions DB access (see where squid caching is needed) and Tag analysis

Data Access methods

Multiple data access methods are exercised

  • via Panda : A copy-to-WN access mode using rfcp is used (xrootd in ANALY-LYON)
  • via gLite WMS : 2 data access modes available
    • DQ2_LOCAL mode is a direct access mode using rfio or dcap
    • FILE_STAGER mode : data staged in by a dedicated thread running in parallel with Athena
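The comparison of access methods by event rate and CPU/Walltime, named in the objectives, could be sketched as follows. This is an illustration under assumptions: the job record layout (`method`, `events`, `cpu_s`, `wall_s`) is hypothetical, event rate is taken as events per wall-clock second, and the numbers are made up. HammerCloud's own definitions may differ.

```python
def summarize_by_method(jobs):
    """Aggregate per access method: event rate (events/s of walltime)
    and CPU/Walltime ratio. `jobs` is a list of dicts (hypothetical layout)."""
    totals = {}
    for job in jobs:
        t = totals.setdefault(job["method"],
                              {"events": 0, "cpu_s": 0.0, "wall_s": 0.0})
        t["events"] += job["events"]
        t["cpu_s"] += job["cpu_s"]
        t["wall_s"] += job["wall_s"]

    summary = {}
    for method, t in totals.items():
        if t["wall_s"] <= 0:
            continue  # skip methods with no usable walltime
        summary[method] = {
            "event_rate": t["events"] / t["wall_s"],
            "cpu_over_wall": t["cpu_s"] / t["wall_s"],
        }
    return summary


def best_method(summary):
    """Pick the method with the highest event rate; ties are broken
    by the higher CPU/Walltime ratio."""
    return max(summary, key=lambda m: (summary[m]["event_rate"],
                                       summary[m]["cpu_over_wall"]))


if __name__ == "__main__":
    # Made-up job records for one site.
    jobs = [
        {"method": "DQ2_LOCAL", "events": 1000, "cpu_s": 80.0, "wall_s": 100.0},
        {"method": "DQ2_LOCAL", "events": 1000, "cpu_s": 80.0, "wall_s": 100.0},
        {"method": "FILE_STAGER", "events": 1000, "cpu_s": 45.0, "wall_s": 50.0},
    ]
    summary = summarize_by_method(jobs)
    print(summary)
    print("best:", best_method(summary))
```

Ranking on event rate first is a design choice for this sketch; a site could equally rank on CPU/Walltime if CPU efficiency matters more than throughput.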

Panda errors - Monitoring

Week 40

29/09/09 Test 649

  Bad efficiency - all sites affected
  Failed jobs with error : exit code 1137
  Put error: Error in copying the file from job workdir to localSE
  due to LFC ACL problem : write permissions in /grid/atlas/users/pathena
  for pilot jobs /atlas/Role=pilot and /atlas/fr/Role=pilot (newly activated)

30/09/09 Test 652, 653, 656, 657

http://lcg.in2p3.fr/wiki/images/ATLAS-HC300909.gif

Week 41

08/10/09 Test 663

  • Cosmics DPD Analysis (Release 15.5.0)
  • Input DS - DATADISK : data09_cos.*.DPD*
  • Cond DB access to Oracle in Lyon T1
  • via Panda (copy-to-WN mode using ddcp/rfcp - xrootd in ANALY-LYON)
  • http://gangarobot.cern.ch/hc/663/test/
  • Sites problems or downtime :
    • LAL : downtime
    • RO : DS unavailable
    • LYON (T2) : release 15.5.0 unavailable. No performance comparison possible with other T2s
    • TOKYO, BEIJING : Poor performance for foreign sites compared to French sites. Squid installation in progress. Tests to be performed afterwards
    • IRFU, LAPP, LPNHE : two-peak structure observed in Nb events/Athena(s)

http://lcg.in2p3.fr/wiki/images/HC663-081009-GRIF-Irfu-CPU.png http://lcg.in2p3.fr/wiki/images/HC663-081009-GRIF-Irfu-rate.png http://lcg.in2p3.fr/wiki/images/HC663-081009-GRIF-Irfu-EventsAthena.png
http://lcg.in2p3.fr/wiki/images/HC663-081009-Tokyo-CPU.png http://lcg.in2p3.fr/wiki/images/HC663-081009-Tokyo-rate.png http://lcg.in2p3.fr/wiki/images/HC663-081009-Tokyo-EventsAthena.png

Week 42

15/10/09 Test 682

  • Muon Analysis (Release 15.3.1)
  • Input DS (STEP09) : mc08.*merge.AOD.e*_s*_r6*tid*
  • via Panda (copy-to-WN mode using ddcp/rfcp - xrootd in ANALY-LYON)
  • http://gangarobot.cern.ch/hc/682/test/
  • Sites problems or results to be followed-up :
    • GRIF-IRFU : failures due to release 15.3.1 installation (ran fine last week but needed to be patched)
  gcc version 4.1.2 used instead of gcc 3.4...
  See GGUS ticket 52483
    • LYON (T2) : both queues ANALY_LYON (xrootd) and ANALY_LYON_DCACHE exercised at the same time
  limitation due to BQS resource (u_xrootd_lhc)  
  ANALY_LYON_DCACHE : limited number of jobs but good performance (effect of dcache upgrade ?)
  ANALY_LYON : many failures - problems followed by J.Y Nief (root version used by ATLAS ?)

Week 43

21/10/09 Test 720

http://lcg.in2p3.fr/wiki/images/HC720-211009-Tokyo-CPU.png http://lcg.in2p3.fr/wiki/images/HC720-211009-Tokyo-rate.png http://lcg.in2p3.fr/wiki/images/HC720-211009-Tokyo-EventsAthena.png


Recent talks & Documents