Difference between revisions of "Atlas:Analysis Challenge"

(Phase 1 : Site stress test organized by ATLAS and run centrally in a controlled manner (2 days))
(Phase 1 : Site stress test organized by ATLAS and run centrally in a controlled manner (2 days))
 
(34 intermediate revisions by 2 users not shown)
Line 1: Line 1:
02/12/08 : ''E.Lançon, F.Chollet (Thanks to Cédric Serfon)''
+
21/01/2009 : ''E.Lançon, F.Chollet <br>
 +
Thanks to Johannes Elmsheuser, Dan van der Ster, Cédric Serfon''  
  
 
====Information & Contact ====
 
====Information & Contact ====
 
Mailing list ATLAS-LCG-OP-L@in2p3.fr
 
Mailing list ATLAS-LCG-OP-L@in2p3.fr
 +
 
==== Goals ====
 
==== Goals ====
 
* measure "real" analysis job efficiency and turnaround on several sites of a given cloud
 
* measure "real" analysis job efficiency and turnaround on several sites of a given cloud
Line 9: Line 11:
 
* check load balancing between analysis and MC production
 
* check load balancing between analysis and MC production
  
==== Required services @ T1 ====
+
==== Required services @ T1 and GRIF====
 
* LFC catalog :  lfc-prod.in2p3.fr
 
* LFC catalog :  lfc-prod.in2p3.fr
 
* ATLAS Disk space : ATLASUSERDISK on T1 SE (fail-over for outputs in case of problems with T2 disk storage)
 
* ATLAS Disk space : ATLASUSERDISK on T1 SE (fail-over for outputs in case of problems with T2 disk storage)
 +
* GRIF and CC-IN2P3 TOP BDII : should be available as they are used remotely by some sites
 +
 +
==== Target and metrics ====
 +
* Nb of events : Few hundred up to 1000 jobs/site
 +
* Rate (evt/s) : up to 15 Hz
 +
* Efficiency (success/failure rate) : 80 %
 +
* CPU utilization :  CPUtime / Walltime > 50 %
 +
 +
==== FR Cloud ST (2009 plans) ====
 +
* See http://lcg.in2p3.fr/wiki/index.php/Atlas:Analysis_ST_2009
 +
 +
====Test report sent to ATLAS (D.van der Ster and J. Elmsheuser) Jan.2009 ====
 +
[http://lcg.in2p3.fr/wiki/images/FR-ST82-Feedback.pdf http://lcg.in2p3.fr/wiki/skins/common/images/icons/fileicon-pdf.png]
 +
 +
==== First exercise on the FR Cloud (December 2008) ====
  
==== First exercise on the FR Cloud (>= December 8th ) ====
+
=====  Phase 1 : Site stress test organized by ATLAS and run centrally in a controlled manner (2 days) =====
=====  Phase 1 : Site stress test oraganized by ATLAS and run centrally in a controlled manner (2 days) =====
 
 
DA challenges were performed on the IT and DE clouds in October 2008.
 
DA challenges were performed on the IT and DE clouds in October 2008.
 
A proposal has been made to extend this cloud-by-cloud challenge to the FR Cloud.
 
A proposal has been made to extend this cloud-by-cloud challenge to the FR Cloud.
Line 21: Line 37:
 
The first exercise will help to identify breaking points and bottlenecks. <b>It is limited in time (a few days) and requires careful attention from site administrators during that period, in particular to network (internal & external), disk and CPU monitoring.</b>
 
The first exercise will help to identify breaking points and bottlenecks. <b>It is limited in time (a few days) and requires careful attention from site administrators during that period, in particular to network (internal & external), disk and CPU monitoring.</b>
 
This first try (stress tests) can be run centrally in a controlled manner. The testing framework is Ganga-based.
 
This first try (stress tests) can be run centrally in a controlled manner. The testing framework is Ganga-based.
 
+
* Any site in the Tiers_of_ATLAS list can participate.
<b>Participation is required at cloud and site level. Any site in the Tiers_of_ATLAS list can participate. </b>ATLAS coordination (Dan van der Ster and Johannes Elmsheuser) needs to know which sites are to be tested and when.
+
* ATLAS coordination : Dan van der Ster and Johannes Elmsheuser  
 
 
It is possible for sites to limit the number of jobs sent at a time.
 
DA team is ready to take into account site constraints.
 
DA team is open to any metrics
 
 
 
 
* [http://lcg.in2p3.fr/wiki/index.php/Atlas:Analysis_Challenge_ST  Details of Site Stress test] : procedure, test conditions and targets
 
* [http://lcg.in2p3.fr/wiki/index.php/Atlas:Analysis_Challenge_ST  Details of Site Stress test] : procedure, test conditions and targets
GlueCEPolicyMaxCPUTime >= 1440 ( 1 day)
+
* GlueCEPolicyMaxCPUTime >= 1440 (1 day , typical duration : 5 hours)
* <b>Results </b>
+
* Jobs run under DN : /O=GermanGrid/OU=LMU/CN=Johannes_Elmsheuser
* Results available here : http://gangarobot.cern.ch/st/
+
* <b>Results </b> available here : http://gangarobot.cern.ch/st/
 
+
* <b>Test 43 - Nov. 28</b> http://atlas-ganga-storage.cern.ch/test_43/
* <b>Validation </b>
+
* <b>Test 61 - December 8-10</b> http://atlas-ganga-storage.cern.ch/test_61/
* Nov. 28 submission (200 jobs in total on 12 sites): http://gangarobot.cern.ch/st/test_43/
+
  Sites: IN2P3-LPC, GRIF-LPNHE, TOKYO-LCG2, IN2P3-CPPM
Start Time: 2008-11-28 13:00:00<br>
 
End Time: 2008-11-28 21:00:00 (test interrupted after 8 hours). Some jobs remain in "running" state.<br>
 
Job status : c (Completed) / r (Running)  / f (Failed) <br>
 
o|e : links to stdout (o) and stderr (e) <br>
 
  Problems seen :  
 
- Missing ATLAS release 14.2.20 @ CPPM
 
  <span style="color:green;">02/12/08 [DONE] validate-prod of 14.2.20 at marce01.in2p3.fr (IN2P3-CPPM)</span>
 
- Mapping CE-SE@GRIF-LAL and GRIF-SACLAY : Publication of multiple CloseSEs
 
  by LAL and IRFU cf https://savannah.cern.ch/bugs/index.php?44824
 
- LFC bulk reading...End of LFC bulk reading...ERROR: Dataset(s)...is/are
 
  empty at GRIF-SACLAY_MCDISK / LAL_MCDISK
 
  (could be related to same problem)
 
  <span style="color:orange;">Problem tracked</span>
 
- ERROR@RO7-NIPNE : failed jobs with Athena errors
 
  NON FATAL ERRORS (completed jobs) @RO7-NIPNE, RO2-NIPNE : related to
 
  ATLASUSERDISK space token configuration
 
  ERROR during execution of lcg-cr --vo atlas -s ATLASUSERDISK
 
  ERROR: file not saved to RO-07-NIPNE_USERDISK in attempt number 3 ...
 
  ERROR: file not saved to RO-07-NIPNE_USERDISK - using now IN2P3-CC_USERDISK
 
  <span style="color:green;">DONE</span>
 
* <b>Scheduled Tests </b>
 
Start Time: 2008-12-08 10:00:00
 
End Time: 2008-12-10 10:00:00
 
Test #61: http://gangarobot.cern.ch/st/test_61/
 
  Sites: IN2P3-LPC_MCDISK GRIF-LPNHE_MCDISK TOKYO-LCG2_MCDISK IN2P3-CPPM_MCDISK
 
 
  Max Jobs Per Site: 300
 
  Max Jobs Per Site: 300
Test #62: http://gangarobot.cern.ch/st/test_62/
+
* <b>Test 62 - December 8-10</b> http://atlas-ganga-storage.cern.ch/test_62/
  Sites: TOKYO-LCG2_MCDISK
+
  Sites: TOKYO
 
  Max Jobs Per Site: 300
 
  Max Jobs Per Site: 300
Output Dataset: user08.JohannesElmsheuser.ganga.sitetest.FR.081208.<sitename>
 
Input Type: DQ2_LOCAL
 
 
  Input DS Patterns: mc08.*Wmunu*.recon.AOD.e*_s*_r5*tid*  
 
  Input DS Patterns: mc08.*Wmunu*.recon.AOD.e*_s*_r5*tid*  
 
                     mc08.*Zprime_mumu*.recon.AOD.e*_s*_r5*tid*  
 
                     mc08.*Zprime_mumu*.recon.AOD.e*_s*_r5*tid*  
Line 72: Line 56:
 
                     mc08.*H*zz4l*.recon.AOD.e*_s*_r5*tid*  
 
                     mc08.*H*zz4l*.recon.AOD.e*_s*_r5*tid*  
 
                     mc08.*.recon.AOD.e*_s*_r5*tid*
 
                     mc08.*.recon.AOD.e*_s*_r5*tid*
 +
* <b>Test 82 - December 15-17</b> http://atlas-ganga-storage.cern.ch/test_82/
  
Tokyo will get 600 jobs over the 2 tests. Other sites will get 300 jobs, except CPPM which will get ~80 jobs because there are not many datasets available there.
+
=====[http://lcg.in2p3.fr/wiki/index.php/Atlas:Analysis_Challenge-STsummary  ST 2008 summary]=====
  
 
===== Phase 2 : Pathena Analysis Challenge  =====
 
===== Phase 2 : Pathena Analysis Challenge  =====
Line 80: Line 65:
 
* Physicists involved : Julien Donini, Arnaud Lucotte, Bertrand Brelier, Eric Lançon, LAL ?, LPNHE ?
 
* Physicists involved : Julien Donini, Arnaud Lucotte, Bertrand Brelier, Eric Lançon, LAL ?, LPNHE ?
  
==== Planning ====
+
===== Dec. 2008 Planning =====
  
 
* Dec 8 : stop of MC production  
 
* Dec 8 : stop of MC production  
Line 89: Line 74:
 
* Dec 17 : <b> restart of MC production</b>
 
* Dec 17 : <b> restart of MC production</b>
 
* Dec 17 : <b> Beginning of Analysis Challenge (Phase 2) </b>
 
* Dec 17 : <b> Beginning of Analysis Challenge (Phase 2) </b>
 
=== Target and metrics ===
 
* Nb of events : Few hundred up to 1000 jobs/site
 
* Rate (evt/s) : up to 15 Hz
 
* Efficiency (success/failure rate) : 80 %
 
* CPU utilization :  CPUtime / Walltime > 50 %
 

Latest revision as of 17:06, 12 March 2009

21/01/2009 : E.Lançon, F.Chollet
Thanks to Johannes Elmsheuser, Dan van der Ster, Cédric Serfon

Information & Contact

Mailing list ATLAS-LCG-OP-L@in2p3.fr

Goals

  • measure "real" analysis job efficiency and turnaround on several sites of a given cloud
  • measure data access performance
  • check load balancing between different users and different analysis tools (Ganga vs pAthena)
  • check load balancing between analysis and MC production

Required services @ T1 and GRIF

  • LFC catalog : lfc-prod.in2p3.fr
  • ATLAS Disk space : ATLASUSERDISK on T1 SE (fail-over for outputs in case of problems with T2 disk storage)
  • GRIF and CC-IN2P3 TOP BDII : should be available as they are used remotely by some sites
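The fail-over role of ATLASUSERDISK on the T1 SE can be illustrated with a minimal Python sketch. It assumes a retry-then-fall-back policy: each storage endpoint is tried a fixed number of times before moving to the next one. The function `store_with_failover`, the `store` callable and the limit of three attempts are hypothetical names and values chosen for illustration; they are not the actual Ganga or lcg-utils implementation.

```python
from typing import Callable, Sequence


def store_with_failover(
    store: Callable[[str], bool],
    endpoints: Sequence[str],
    attempts_per_endpoint: int = 3,  # assumed value, for illustration only
) -> str:
    """Try each endpoint in order and return the first one that accepts the file."""
    for endpoint in endpoints:
        for _attempt in range(attempts_per_endpoint):
            if store(endpoint):
                return endpoint
    raise RuntimeError("output could not be saved to any endpoint")


# Hypothetical stand-in for the real copy command: here the T2 token always
# fails, so the output ends up on the T1 fail-over space.
def fake_store(endpoint: str) -> bool:
    return endpoint == "IN2P3-CC_USERDISK"


chosen = store_with_failover(
    fake_store, ["RO-07-NIPNE_USERDISK", "IN2P3-CC_USERDISK"]
)
print(chosen)
```

Listing the T2 endpoint first and the T1 endpoint last keeps outputs local whenever possible and uses the T1 space only when the T2 storage is unavailable.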

Target and metrics

  • Number of jobs : a few hundred up to 1000 jobs/site
  • Rate (evt/s) : up to 15 Hz
  • Efficiency (success/failure rate) : 80 %
  • CPU utilization : CPUtime / Walltime > 50 %
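As a rough illustration of how these targets could be evaluated, the following Python sketch computes the three ratios from a list of job records. The `JobRecord` fields and the `summarize` function are hypothetical. The sketch assumes that efficiency is the share of completed jobs among finished (completed or failed) jobs, and that the event rate is averaged over the wall time of completed jobs. The sample numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    status: str         # "c" (completed), "r" (running) or "f" (failed)
    events: int         # number of events processed
    cpu_time_s: float   # CPU time in seconds
    wall_time_s: float  # wall-clock time in seconds


def summarize(jobs):
    """Return efficiency, CPU utilization and event rate for a set of jobs."""
    finished = [j for j in jobs if j.status in ("c", "f")]
    completed = [j for j in finished if j.status == "c"]
    wall = sum(j.wall_time_s for j in completed)
    cpu = sum(j.cpu_time_s for j in completed)
    events = sum(j.events for j in completed)
    return {
        "efficiency": len(completed) / len(finished) if finished else 0.0,
        "cpu_utilization": cpu / wall if wall else 0.0,
        "event_rate_hz": events / wall if wall else 0.0,
    }


# Made-up sample: two completed jobs, one failed job, one still running.
jobs = [
    JobRecord("c", 10000, 600.0, 1000.0),
    JobRecord("c", 5000, 400.0, 1000.0),
    JobRecord("f", 0, 10.0, 50.0),
    JobRecord("r", 0, 0.0, 0.0),
]
summary = summarize(jobs)
print(summary)
print("efficiency target met:", summary["efficiency"] >= 0.80)
print("CPU utilization target met:", summary["cpu_utilization"] > 0.50)
```

Running jobs are left out of the efficiency denominator because their outcome is not yet known; a site report could equally count them separately.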

FR Cloud ST (2009 plans)

Test report sent to ATLAS (D.van der Ster and J. Elmsheuser) Jan.2009

http://lcg.in2p3.fr/wiki/images/FR-ST82-Feedback.pdf

First exercise on the FR Cloud (December 2008)

Phase 1 : Site stress test organized by ATLAS and run centrally in a controlled manner (2 days)

DA challenges were performed on the IT and DE clouds in October 2008. A proposal has been made to extend this cloud-by-cloud challenge to the FR Cloud. See the ATLAS coordination DA challenge meeting (Nov. 20).

The first exercise will help to identify breaking points and bottlenecks. It is limited in time (a few days) and requires careful attention from site administrators during that period, in particular to network (internal & external), disk and CPU monitoring. This first try (stress tests) can be run centrally in a controlled manner. The testing framework is Ganga-based.

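  • Test 61 - December 8-10 : http://atlas-ganga-storage.cern.ch/test_61/
Sites: IN2P3-LPC, GRIF-LPNHE, TOKYO-LCG2, IN2P3-CPPM
Max Jobs Per Site: 300
  • Test 62 - December 8-10 : http://atlas-ganga-storage.cern.ch/test_62/
Sites: TOKYO
Max Jobs Per Site: 300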
Input DS Patterns: mc08.*Wmunu*.recon.AOD.e*_s*_r5*tid* 
                   mc08.*Zprime_mumu*.recon.AOD.e*_s*_r5*tid* 
                   mc08.*Zmumu*.recon.AOD.e*_s*_r5*tid* 
                   mc08.*T1_McAtNlo*.recon.AOD.e*_s*_r5*tid* 
                   mc08.*H*zz4l*.recon.AOD.e*_s*_r5*tid* 
                   mc08.*.recon.AOD.e*_s*_r5*tid*
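The input dataset patterns above can be tried out with a short Python sketch. It assumes that `*` behaves as a shell-style wildcard, which is how `fnmatch` from the standard library treats it; the test framework's own matching rules may differ. The dataset names in `examples` are made up for illustration.

```python
from fnmatch import fnmatchcase

PATTERNS = [
    "mc08.*Wmunu*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*Zprime_mumu*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*Zmumu*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*T1_McAtNlo*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*H*zz4l*.recon.AOD.e*_s*_r5*tid*",
    "mc08.*.recon.AOD.e*_s*_r5*tid*",
]


def first_matching_pattern(dataset_name):
    """Return the first pattern that matches the dataset name, or None."""
    for pattern in PATTERNS:
        if fnmatchcase(dataset_name, pattern):
            return pattern
    return None


# Made-up dataset names, used only to exercise the patterns.
examples = [
    "mc08.000001.SampleWmunu.recon.AOD.e100_s200_r500_tid000001",
    "mc08.000002.SampleOther.recon.AOD.e100_s200_r500_tid000002",
    "mc08.000003.SampleWmunu.recon.ESD.e100_s200_r500_tid000003",
]
for name in examples:
    print(name, "->", first_matching_pattern(name))
```

Under this wildcard interpretation the last pattern is a catch-all for any mc08 AOD dataset with an r5 reconstruction tag, so the order of the list decides which pattern is reported for a given dataset.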
ST 2008 summary
Phase 2 : Pathena Analysis Challenge
  • Data analysis exercise open to physicists with their favorite application
  • Physicists involved : Julien Donini, Arnaud Lucotte, Bertrand Brelier, Eric Lançon, LAL ?, LPNHE ?
Dec. 2008 Planning
  • Dec 8 : stop of MC production
  • Dec. 8-9 : 1st round with Tokyo, CPPM, LPC (LAN limited to 1 Gbps), GRIF-LPNHE
  • Dec 17 : restart of MC production
  • Dec 14 : stop of MC production
  • Dec. 15-16 : 2nd round with LAPP, CC-IN2P3-T2 (to be confirmed), Tokyo, CPPM, LPC, possibly GRIF (SACLAY, IRFU, LPNHE), RO-07 and RO-02
  • Dec 17 : restart of MC production
  • Dec 17 : Beginning of Analysis Challenge (Phase 2)