Difference between revisions of "Atlas:Analysis HC beyond STEP09"

Un article de lcgwiki.
Jump to: navigation, search
(Results)
(ATLAS Info & Contacts)
 
(54 intermediate revisions by the same user not shown)
Ligne 2: Ligne 2:
 
   
 
   
 
'''Distributed Analysis Stress Tests - HammerCloud beyond STEP09'''
 
'''Distributed Analysis Stress Tests - HammerCloud beyond STEP09'''
[[Image:ATLAS-HC-small.jpg]]
 
  
 
== Lessons learnt from STEP09 ==
 
== Lessons learnt from STEP09 ==
Ligne 10: Ligne 9:
  
 
== Results  ==
 
== Results  ==
 +
==== HC web interface ====
 
* ATLAS HC Tests results : http://gangarobot.cern.ch/hc/ '''(New web interface!)'''
 
* ATLAS HC Tests results : http://gangarobot.cern.ch/hc/ '''(New web interface!)'''
* ATLAS STEP09 summary : http://gangarobot.cern.ch/st/summary.html
+
==== ATLAS STEP09 summary ====
* [http://lcg.in2p3.fr/wiki/images/STEP09-July09-ATLAS.png ATLAS STEP09 Cloud results]: Dan van der Ster & Johannes Elmsheuser (1st July 2009)
+
* STEP09 summary page : http://gangarobot.cern.ch/st/step09summary.html
**http://lcg.in2p3.fr/wiki/images/STEP09-July09-ATLAS.png
+
==== Some conclusions about DB access HC tests ====
 +
* Acces to E.Lançon document : http://lcg.in2p3.fr/wiki/images/0910-ELancon-frontier-test.pdf (28/10/09)  
 +
 
 +
== ATLAS Criteriafor T2s  to be considered ==
 +
(from K.Bos Nov.09)
 +
T2s are for user analysis but should be considered if
 +
- They have at least 100 TB of storage space
 +
- they have passed the HC validation : 90 % efficiency, 150 Mev/day
  
 
== ATLAS Info & Contacts ==
 
== ATLAS Info & Contacts ==
 +
* ATLAS HammerCloud wiki pages : https://twiki.cern.ch/twiki/bin/view/Atlas/HammerCloud
 
* Information via mailing list ATLAS-LCG-OP-L@in2p3.fr
 
* Information via mailing list ATLAS-LCG-OP-L@in2p3.fr
* LPC : Nabil Ghodbane - Nabil.Ghodbane@cern.ch
+
* LPC : Nabil Ghodbane AT cern.ch
 
* LAL : Nicolas Makovec  
 
* LAL : Nicolas Makovec  
 
* LAPP : Stéphane Jézéquel  
 
* LAPP : Stéphane Jézéquel  
Ligne 27: Ligne 35:
  
 
== HC Tests ==
 
== HC Tests ==
 +
[[Image:ATLAS-HC-small.jpg]]
 
===Objectives ===
 
===Objectives ===
 
*Improve Cloud readiness by following site&ATLAS problems week by week (SL5 migration, site upgrades)
 
*Improve Cloud readiness by following site&ATLAS problems week by week (SL5 migration, site upgrades)
 
*Identify best data access method per site by comparing the event rate and CPU/Walltime  
 
*Identify best data access method per site by comparing the event rate and CPU/Walltime  
https://twiki.cern.ch/twiki/bin/view/Atla/HammerCloudDataAccess#FR_cloud  
+
https://twiki.cern.ch/twiki/bin/view/Atlas/HammerCloudDataAccess#FR_cloud  
 
*Exercise Analysis with Conditions DB access (see where squid caching is needed) and Tag analysis
 
*Exercise Analysis with Conditions DB access (see where squid caching is needed) and Tag analysis
  
Ligne 38: Ligne 47:
 
* via gLite WMS : 2 data access modes available
 
* via gLite WMS : 2 data access modes available
 
**DQ2_LOCAL mode is a direct access mode using rfio or dcap  
 
**DQ2_LOCAL mode is a direct access mode using rfio or dcap  
**FILE_STAGER mode  
+
**FILE_STAGER mode : data staged in by a dedicated thread running in // with Athena
 +
 
 +
=== Panda errors - Monitoring ===
 +
* Error codes : https://twiki.cern.ch/twiki/bin/view/Atlas/PandaErrorCodes
 +
* Jobs run under DN : /O=GermanGrid/OU=LMU/CN=Johannes_Elmsheuser so Panda Monitoring helps to track Jonannes Elmsheuser's job. To find error messages and access log files, you may query for analysis job run by Johannes Elmsheuser on your site from Panda monitoring.
 +
* See http://panda.cern.ch:25980/server/pandamon/query?ui=user&name=Johannes Elmsheuser
 +
 
 
=== Week 40 ===
 
=== Week 40 ===
 +
 +
==== 29/09/09 ''<span style="color:#FF0000;">Test 649'' ====
 +
* Muon Analysis (Release 15.3.1) 
 +
* Input DS  (STEP09) : mc08.*merge.AOD.e*_s*_r6*tid*
 +
* via Panda (mode copy-to-WN using ddcp/rfcp - xrootd in ANALY-LYON)
 +
* http://gangarobot.cern.ch/hc/649/test/
 +
  Bad efficiency - all sites affected all sites
 +
  Failed jobs with error : exit code 1137
 +
  Put error: Error in copying the file from job workdir to localSE
 +
  due to LFC ACL problem : write permissions in /grid/atlas/users/pathena
 +
  for pilot jobs /atlas/Role=pilot and /atlas/fr/Role=pilot (newly activated)
 +
==== 30/09/09 ''<span style="color:#00FF00;">Test 652, 653, 656, 657'' ====
 
* Muon Analysis (Release 15.3.1)   
 
* Muon Analysis (Release 15.3.1)   
 
* Input DS  (STEP09) : mc08.*merge.AOD.e*_s*_r6*tid*
 
* Input DS  (STEP09) : mc08.*merge.AOD.e*_s*_r6*tid*
* 3 HC tests of 24 hrs each :   
+
* via WMS
Test 649 via Panda (mode copy-to-WN using ddcp/rfcp - xrootd in ANALY-LYON)<br>
+
* DQ2_LOCAL mode or direct access dcap/rfio : http://gangarobot.cern.ch/hc/652/test/
Test 652/656 via WMS (DQ2_LOCAL mode or direct access dcap/rfio)<br>
+
* DQ2_LOCAL mode or direct access dcap/rfio : http://gangarobot.cern.ch/hc/656/test/
Test 653/657 via WMS (FILE_STAGER mode)<br>
+
* FILE_STAGER mode : http://gangarobot.cern.ch/hc/653/test/
 +
* FILE_STAGER mode : http://gangarobot.cern.ch/hc/657/test/  
 +
http://lcg.in2p3.fr/wiki/images/ATLAS-HC300909.gif
 +
 
 +
=== Week 41 ===
 +
 
 +
==== 08/10/09 ''<span style="color:#00FF00;">Test 663'' ====
 +
* Cosmics DPD Analysis (Release 15.5.0)
 +
* Input DS - DATADISK : data09_cos.*.DPD*
 +
* '''Cond DB access to Oracle in Lyon T1 '''
 +
* via Panda (mode copy-to-WN using ddcp/rfcp - xrootd in ANALY-LYON)
 +
* http://gangarobot.cern.ch/hc/663/test/
 +
* Sites problems or downtime :
 +
** LAL : downtime
 +
** RO : DS unavailable
 +
** LYON (T2) : release 15.5.0 unavalaible. No performances comparison possible with other T2s
 +
** TOKYO, BEIJING : Poor performance for foreign sites compared to french sites. Squid installation in progress. Tests to be performed afterwards
 +
** IRFU, LAPP, LPNHE : 2 peak structure observed on Nb events/Athena(s)
 +
http://lcg.in2p3.fr/wiki/images/HC663-081009-GRIF-Irfu-CPU.png
 +
http://lcg.in2p3.fr/wiki/images/HC663-081009-GRIF-Irfu-rate.png
 +
http://lcg.in2p3.fr/wiki/images/HC663-081009-GRIF-Irfu-EventsAthena.png<br>
 +
http://lcg.in2p3.fr/wiki/images/HC663-081009-Tokyo-CPU.png
 +
http://lcg.in2p3.fr/wiki/images/HC663-081009-Tokyo-rate.png
 +
http://lcg.in2p3.fr/wiki/images/HC663-081009-Tokyo-EventsAthena.png
 +
 
 +
=== Week 42 ===
 +
 
 +
==== 15/10/09 ''<span style="color:#00FF00;">Test 682'' ====
 +
* Muon Analysis (Release 15.3.1) 
 +
* Input DS  (STEP09) : mc08.*merge.AOD.e*_s*_r6*tid*
 +
* via Panda (mode copy-to-WN using ddcp/rfcp - xrootd in ANALY-LYON)
 +
* http://gangarobot.cern.ch/hc/682/test/
 +
* Sites problems or results to be followed-up :
 +
 
 +
** GRIF-IRFU : failures due to release 15.3.1 installation (has run fine last week but needed to be patched)
 +
  gcc version 4.1.2 used instead of gcc 3.4...
 +
  See [https://gus.fzk.de/ws/ticket_info.php?ticket=52483 GGUS ticket 52483]
 +
** LYON (T2) :  both queues ANALY_LYON (xrootd) and ANALY_LYON_DCACHE exercised at the same time
 +
  limitation due to BQS resource (u_xrootd_lhc) 
 +
  ANALY_LYON_DCACHE : limited number of jobs but good performances (effect of dcache upgrade ?)
 +
  ANALY_LYON : many failures - problems followed by J.Y Nief (root version used by ATLAS ?
 +
 
 +
=== Week 43 ===
 +
==== 21/10/09 ''<span style="color:#00FF00;">Test 720'' ====
 +
* Cosmic DPDs Analysis (Release 15.5.0)
 +
* Input DS - DATADISK : data09_cos.*.DPD*
 +
* '''Dedicated test to Tokyo and Beijing with access via Squid'''
 +
* via Panda (mode copy-to-WN using ddcp/rfcp)
 +
* http://gangarobot.cern.ch/hc/720/test/
 +
* [http://lcg.in2p3.fr/wiki/images/0910-ELancon-frontier-test.pdf Some conclusions] by E.Lançon (28/10/09)  
 +
http://lcg.in2p3.fr/wiki/images/HC720-211009-Tokyo-CPU.png
 +
http://lcg.in2p3.fr/wiki/images/HC720-211009-Tokyo-rate.png
 +
http://lcg.in2p3.fr/wiki/images/HC720-211009-Tokyo-EventsAthena.png
 +
 
  
== Recent talks ==
+
== Recent talks & Documents ==
 +
* [http://lcg.in2p3.fr/wiki/index.php/Image:0910-ELancon-frontier-test.pdf Some conclusions about DB access HC tests (663 and 720)] by Eric Lançon (28/10/09)
 
* [http://indico.in2p3.fr/getFile.py/access?contribId=6&sessionId=30&resId=0&materialId=slides&confId=2110 ATLAS : from STEP09 towards first beams] Graeme Stewart's talk@Journées Grille France (16 October 2009)
 
* [http://indico.in2p3.fr/getFile.py/access?contribId=6&sessionId=30&resId=0&materialId=slides&confId=2110 ATLAS : from STEP09 towards first beams] Graeme Stewart's talk@Journées Grille France (16 October 2009)
 
* [http://indico.cern.ch/getFile.py/access?contribId=8&sessionId=2&resId=0&materialId=slides&confId=66012 Summary of HammerCloud Tests since STEP09] Dan van der Ster's talk@ATLAS Jamboree T1/T2/T3 (13 October 2009)
 
* [http://indico.cern.ch/getFile.py/access?contribId=8&sessionId=2&resId=0&materialId=slides&confId=66012 Summary of HammerCloud Tests since STEP09] Dan van der Ster's talk@ATLAS Jamboree T1/T2/T3 (13 October 2009)
 
* [http://indico.cern.ch/getFile.py/access?contribId=31&sessionId=16&resId=0&materialId=slides&confId=50976 HammerCloud Plans] Johannes Elmsheuser's talk@ATLAS S&C Week (2 September 2009)
 
* [http://indico.cern.ch/getFile.py/access?contribId=31&sessionId=16&resId=0&materialId=slides&confId=50976 HammerCloud Plans] Johannes Elmsheuser's talk@ATLAS S&C Week (2 September 2009)

Latest revision as of 17:09, 16 décembre 2009

--Chollet 15:56, 19 octobre 2009 (CEST)

Distributed Analysis Stress Tests - HammerCloud beyond STEP09

Lessons learnt from STEP09

  • Sites may identify reasonable amount of analysis they can assume and set hard limits on number of analysis running jobs
  • Balancing data across many disk servers is essential.
  • Very high i/o required by analysis (5 MB/s per job). Sites should review LAN architecture to avoid bottlenecks.

Results

HC web interface

ATLAS STEP09 summary

Some conclusions about DB access HC tests

ATLAS Criteriafor T2s to be considered

(from K.Bos Nov.09) T2s are for user analysis but should be considered if - They have at least 100 TB of storage space - they have passed the HC validation : 90 % efficiency, 150 Mev/day

ATLAS Info & Contacts

  • ATLAS HammerCloud wiki pages : https://twiki.cern.ch/twiki/bin/view/Atlas/HammerCloud
  • Information via mailing list ATLAS-LCG-OP-L@in2p3.fr
  • LPC : Nabil Ghodbane AT cern.ch
  • LAL : Nicolas Makovec
  • LAPP : Stéphane Jézéquel
  • CPPM : Emmanuel Le Guirriec
  • LPSC : Sabine Crepe
  • LPNHE : Tristan Beau
  • CC-T2 : Catherine, Ghita
  • IRFU : Nathalie Besson

HC Tests

ATLAS-HC-small.jpg

Objectives

  • Improve Cloud readiness by following site&ATLAS problems week by week (SL5 migration, site upgrades)
  • Identify best data access method per site by comparing the event rate and CPU/Walltime

https://twiki.cern.ch/twiki/bin/view/Atlas/HammerCloudDataAccess#FR_cloud

  • Exercise Analysis with Conditions DB access (see where squid caching is needed) and Tag analysis

Data Access methods

Multiple data access methods are exercised

  • via Panda : A copy-to-WN access mode using rfcp is used (xrootd in ANALY-LYON)
  • via gLite WMS : 2 data access modes available
    • DQ2_LOCAL mode is a direct access mode using rfio or dcap
    • FILE_STAGER mode : data staged in by a dedicated thread running in // with Athena

Panda errors - Monitoring

Week 40

29/09/09 Test 649

  Bad efficiency - all sites affected all sites 
  Failed jobs with error : exit code 1137
  Put error: Error in copying the file from job workdir to localSE
  due to LFC ACL problem : write permissions in /grid/atlas/users/pathena
  for pilot jobs /atlas/Role=pilot and /atlas/fr/Role=pilot (newly activated)

30/09/09 Test 652, 653, 656, 657

http://lcg.in2p3.fr/wiki/images/ATLAS-HC300909.gif

Week 41

08/10/09 Test 663

  • Cosmics DPD Analysis (Release 15.5.0)
  • Input DS - DATADISK : data09_cos.*.DPD*
  • Cond DB access to Oracle in Lyon T1
  • via Panda (mode copy-to-WN using ddcp/rfcp - xrootd in ANALY-LYON)
  • http://gangarobot.cern.ch/hc/663/test/
  • Sites problems or downtime :
    • LAL : downtime
    • RO : DS unavailable
    • LYON (T2) : release 15.5.0 unavalaible. No performances comparison possible with other T2s
    • TOKYO, BEIJING : Poor performance for foreign sites compared to french sites. Squid installation in progress. Tests to be performed afterwards
    • IRFU, LAPP, LPNHE : 2 peak structure observed on Nb events/Athena(s)

http://lcg.in2p3.fr/wiki/images/HC663-081009-GRIF-Irfu-CPU.png http://lcg.in2p3.fr/wiki/images/HC663-081009-GRIF-Irfu-rate.png http://lcg.in2p3.fr/wiki/images/HC663-081009-GRIF-Irfu-EventsAthena.png
http://lcg.in2p3.fr/wiki/images/HC663-081009-Tokyo-CPU.png http://lcg.in2p3.fr/wiki/images/HC663-081009-Tokyo-rate.png http://lcg.in2p3.fr/wiki/images/HC663-081009-Tokyo-EventsAthena.png

Week 42

15/10/09 Test 682

  • Muon Analysis (Release 15.3.1)
  • Input DS (STEP09) : mc08.*merge.AOD.e*_s*_r6*tid*
  • via Panda (mode copy-to-WN using ddcp/rfcp - xrootd in ANALY-LYON)
  • http://gangarobot.cern.ch/hc/682/test/
  • Sites problems or results to be followed-up :
    • GRIF-IRFU : failures due to release 15.3.1 installation (has run fine last week but needed to be patched)
  gcc version 4.1.2 used instead of gcc 3.4...
  See GGUS ticket 52483
    • LYON (T2) : both queues ANALY_LYON (xrootd) and ANALY_LYON_DCACHE exercised at the same time
  limitation due to BQS resource (u_xrootd_lhc)  
  ANALY_LYON_DCACHE : limited number of jobs but good performances (effect of dcache upgrade ?)
  ANALY_LYON : many failures - problems followed by J.Y Nief (root version used by ATLAS ?

Week 43

21/10/09 Test 720

http://lcg.in2p3.fr/wiki/images/HC720-211009-Tokyo-CPU.png http://lcg.in2p3.fr/wiki/images/HC720-211009-Tokyo-rate.png http://lcg.in2p3.fr/wiki/images/HC720-211009-Tokyo-EventsAthena.png


Recent talks & Documents