<b> Site Stress Test </b>
 
--[[User:Chollet|Chollet]] 12:49, 29 January 2009 (CET)
 
=== Procedure ===
 
* Replication of target datasets across the cloud
* Preparation of the jobs
* Generation of n jobs per site (each job processes 1 dataset)
* Bulk submission to the WMS (1 per site); a minimal Ganga sketch follows this list.
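
For concreteness, here is a minimal Ganga sketch of the job generation and bulk submission steps. It is a sketch only: it assumes a ganga session with the GangaAtlas plugins loaded, the dataset name and the job-options file are made-up placeholders, and the attribute names follow the GangaAtlas conventions of this period, so they should be checked against the installed Ganga version.

 # Minimal GangaAtlas sketch: one job per target dataset, split into subjobs
 # and bulk-submitted to the WMS through the LCG backend. Run inside a ganga
 # session, where Job, Athena, DQ2Dataset, DQ2OutputDataset, DQ2JobSplitter
 # and LCG are already in scope.
 datasets = [
     # one concrete DQ2 dataset name per job, matching the patterns listed
     # under "Test conditions" (resolve them with dq2-ls); this entry is a
     # made-up placeholder
     'mc08.105145.PythiaZmumu.recon.AOD.e123_s456_r500_tid012345',
 ]
 for ds in datasets:
     j = Job()
     j.application = Athena()
     j.application.atlas_release = '14.2.20'    # release used by the ST
     j.application.option_file = ['AnalysisSkeleton_topOptions.py']  # placeholder job options
     j.inputdata = DQ2Dataset()
     j.inputdata.dataset = ds
     j.inputdata.type = 'DQ2_LOCAL'             # POSIX I/O (see access modes below)
     j.outputdata = DQ2OutputDataset()          # stored on ATLASUSERDISK
     j.splitter = DQ2JobSplitter()
     j.splitter.numsubjobs = 200                # a few hundred jobs per site
     j.backend = LCG()
     j.submit()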
 
=== Test conditions ===

* The testing framework is Ganga-based. It currently uses the LCG backend, and it will soon be possible to use the PANDA backend as well. Metrics are collected and displayed at http://gangarobot.cern.ch/st/
* Both POSIX I/O and "copy mode" may be used, allowing a performance comparison of the two modes.
* ATLAS software release 14.2.20
* Input DS patterns used (the first one is the preferred one for muon analysis):
    mc08.*Wmunu*.recon.AOD.e*_s*_r5*tid*
    mc08.*Zprime_mumu*.recon.AOD.e*_s*_r5*tid*
    mc08.*Zmumu*.recon.AOD.e*_s*_r5*tid*
    mc08.*T1_McAtNlo*.recon.AOD.e*_s*_r5*tid*
    mc08.*H*zz4l*.recon.AOD.e*_s*_r5*tid*
    mc08.*.recon.AOD.e*_s*_r5*tid* (this pattern includes all the previous ones)
* Input datasets are read from ATLASMCDISK and outputs are stored on ATLASUSERDISK (no special requirements there). Input data access is the main issue; there is no problem with data output.
* Required CPU time: GlueCEPolicyMaxCPUTime >= 1440 (1 day; typical job duration: 5 hours); a sample BDII query is sketched after this list.
* Jobs run under the DN: /O=GermanGrid/OU=LMU/CN=Johannes_Elmsheuser
* LAN saturation has been observed with a 1 Gb network connection between the WNs and the SE; a back-of-envelope estimate also follows the list.
* Sites may limit the number of jobs sent at a time.
* Test duration: 48 hours
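
Sites can verify the GlueCEPolicyMaxCPUTime value they publish before the test. Below is a sketch of such a check using the python-ldap module; the BDII endpoint is a placeholder, not a real host.

 # Sketch: query a BDII for the published GlueCEPolicyMaxCPUTime (in minutes)
 # of each CE and compare it with the required 1440.
 import ldap
 conn = ldap.initialize('ldap://bdii.example.org:2170')   # placeholder BDII host
 entries = conn.search_s('o=grid', ldap.SCOPE_SUBTREE,
                         '(objectClass=GlueCE)',
                         ['GlueCEUniqueID', 'GlueCEPolicyMaxCPUTime'])
 for dn, attrs in entries:
     ce = attrs['GlueCEUniqueID'][0].decode()
     maxcpu = int(attrs['GlueCEPolicyMaxCPUTime'][0])
     print('%s : MaxCPUTime=%d -> %s'
           % (ce, maxcpu, 'OK' if maxcpu >= 1440 else 'too short'))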
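
The saturation point can also be reproduced with a back-of-envelope estimate. In the sketch below the average AOD event size is an assumed, illustrative value, not a measured number.

 # Back-of-envelope LAN load for POSIX I/O reading from the SE.
 link_mb_s = 1000 / 8.0            # 1 Gb/s link ~ 125 MB/s at best
 event_size_mb = 0.2               # assumed average AOD event size (illustrative)
 n_jobs, rate_hz = 100, 15         # concurrent jobs x per-job event rate
 demand_mb_s = n_jobs * rate_hz * event_size_mb
 print('demand %.0f MB/s vs link %.0f MB/s -> %s'
       % (demand_mb_s, link_mb_s,
          'saturated' if demand_mb_s > link_mb_s else 'OK'))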
  
=== File access modes used by Ganga Jobs ===

Two access modes may be used during the ST tests; a per-job selection sketch follows the list.

* '''(j.inputdata.type='DQ2_LOCAL')''' : currently, when users submit Ganga jobs with the LCG backend, POSIX I/O is the default method used to access input files.
* '''(j.inputdata.type='FILE_STAGER')''' : an alternative access mode used by the ST tests is the FileStager mode, which copies the input files from the local SE to the worker node tmp area with lcg-cp, in a background thread of the Athena event loop. This mode still needs some improvement to reach full stability, but it has demonstrated good performance during the ST at some (but not all) sites.
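
The mode is selected per job through the inputdata.type attribute quoted above. A minimal sketch for submitting the same prepared job once in each mode, where base_job is assumed to be an existing, fully configured GangaAtlas job (see the Procedure sketch):

 # Submit one copy of a prepared ST job per access mode, to compare the
 # performance of the two modes at a site.
 for mode in ('DQ2_LOCAL', 'FILE_STAGER'):
     j = base_job.copy()
     j.inputdata.type = mode
     j.submit()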
=== Target and metrics ===
 
* Nb of jobs: a few hundred, up to 1000 jobs/site
* Rate (evt/s): up to 15 Hz
* Success rate: > 80 %
* CPU utilization: CPUtime / Walltime > 50 % (a toy metric calculation follows this list)
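
For illustration, the targets can be checked from per-job accounting records as sketched below; the record fields are made-up placeholders, not an existing monitoring schema.

 # Toy check of the ST targets from per-job records.
 jobs = [
     {'ok': True,  'events': 250000, 'cpu_s': 11000.0, 'wall_s': 18000.0},
     {'ok': False, 'events': 0,      'cpu_s': 120.0,   'wall_s': 600.0},
 ]
 done = [j for j in jobs if j['ok']]
 success = float(len(done)) / len(jobs)              # target: > 0.80
 wall = sum(j['wall_s'] for j in done)
 rate_hz = sum(j['events'] for j in done) / wall     # target: up to 15 Hz per job
 cpu_util = sum(j['cpu_s'] for j in done) / wall     # target: > 0.50
 print('success %.0f%%, rate %.1f Hz, CPU/wall %.0f%%'
       % (100 * success, rate_hz, 100 * cpu_util))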
  
=== Results and Monitoring ===

* See http://gangarobot.cern.ch/st/
* ATLAS Twiki page: https://twiki.cern.ch/twiki/bin/view/Main/GangaSiteTests
* Results of the analysis challenge performed on the [http://indico.cern.ch/getFile.py/access?contribId=128&sessionId=8&resId=0&materialId=slides&confId=22137 IT Cloud]
* Results for the [http://indico.cern.ch/getFile.py/access?contribId=129&sessionId=8&resId=0&materialId=slides&confId=22137 DE Cloud]


=== [http://lcg.in2p3.fr/wiki/index.php/Atlas:Analysis_Challenge-STsummary ST summary] ===
