Atlas:Analysis Challenge

From lcgwiki.

Revision as of 11:12, 3 December 2008

02/12/08: E. Lançon, F. Chollet

Information & Contact

Mailing list: ATLAS-LCG-OP-L@in2p3.fr

Goals

  • measure "real" analysis job efficiency and turnaround time on several sites of a given cloud
  • measure data access performance
  • check load balancing between different users and different analysis tools (Ganga vs pAthena)
  • check load balancing between analysis and MC production

First exercise on the FR Cloud (from December 8th)

Phase 1 : Site stress test run centrally in a controlled manner (2 days)

Distributed Analysis (DA) challenges were performed on the IT and DE clouds in October 2008. It has been proposed to extend this cloud-by-cloud challenge to the FR Cloud. See the ATLAS coordination DA challenge meeting (Nov. 20).

This first exercise will help identify breaking points and bottlenecks. It is limited in time (a few days) and requires close attention from site administrators during that period, in particular monitoring of the network (internal & external), disks and CPUs. This first round (stress tests) can be run centrally in a controlled manner. The testing framework is Ganga-based.

Participation is required at cloud and site level. Any site in the Tiers_of_ATLAS list can participate. ATLAS coordination (Dan van der Ster and Johannes Elmsheuser) needs to know which sites are to be tested and when.

Sites can limit the number of jobs sent to them at a time.
The DA team is ready to take site constraints into account.
The DA team is open to any metrics.

Requirement: GlueCEPolicyMaxCPUTime >= 1440 (minutes, i.e. 1 day)
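A site can check this requirement against what its CEs actually publish. The sketch below is a minimal illustration, assuming the BDII output has already been saved as LDIF text (e.g. from an ldapsearch query); the function names and the sample input are hypothetical. It relies on GLUE 1.x publishing GlueCEPolicyMaxCPUTime in minutes, so 1440 corresponds to one day.

```python
# Hypothetical helper: checks published GlueCEPolicyMaxCPUTime values
# against the challenge requirement. Input is assumed to be LDIF text.

REQUIRED_MAX_CPU_MINUTES = 1440  # GLUE 1.x unit is minutes; 1440 min = 1 day


def max_cpu_times(ldif_text: str) -> list[int]:
    """Return every GlueCEPolicyMaxCPUTime value found in the LDIF text."""
    values = []
    for line in ldif_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "GlueCEPolicyMaxCPUTime":
            values.append(int(value.strip()))
    return values


def meets_requirement(ldif_text: str) -> bool:
    """True only if at least one value is published and all are >= 1 day."""
    values = max_cpu_times(ldif_text)
    return bool(values) and min(values) >= REQUIRED_MAX_CPU_MINUTES
```

Requiring every published queue to pass is a deliberately conservative choice: analysis jobs may land on any queue the CE advertises.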

Start Time: 2008-11-28 13:00:00
End Time: 2008-11-28 21:00:00 (test interrupted after 8 hours; some jobs remained in "running" state)
Job status : c (Completed) / r (Running) / f (Failed)
o|e : links to stdout (o) and stderr (e)

Problems seen:
- Missing ATLAS release 14.2.20 @ CPPM
- Mapping CE-SE @ GRIF-LAL and GRIF-SACLAY: publication of multiple CloseSEs by LAL and IRFU, cf https://savannah.cern.ch/bugs/index.php?44824
- LFC bulk reading...End of LFC bulk reading...ERROR: Dataset(s)...is/are empty at GRIF-SACLAY_MCDISK / LAL_MCDISK (could be related to the previous problem)
- ERROR@RO07-NIPNE: failed jobs with Athena errors; completed jobs with
    ERROR during execution of lcg-cr --vo atlas -s ATLASUSERDISK
    ERROR: file not saved to RO-07-NIPNE_USERDISK in attempt number 3 ...
    ERROR: file not saved to RO-07-NIPNE_USERDISK - using now IN2P3-CC_USERDISK ..
  • Planning
- Dec. 8-9: 1st round with Tokyo, LPC (LAN limited to 1 Gbps), CPPM
- Dec. 14: MC production stopped
- Dec. 15-16:
    possibly: LAPP, CC-IN2P3-T2 (to be confirmed), Tokyo, GRIF
    sites with 1 Gbps LAN: CPPM, NIPNE, LPC
- Dec. 17: MC production restarted
Phase 2 : pAthena Analysis Challenge
- Dec. 17-18: data analysis exercise open to physicists using their favorite application
  • Physicists involved: Julien Donini, Arnaud Lucotte, Bertrand Brelier, Eric Lançon, LAL ?, LPNHE ?

Target and metrics

  • Nb of jobs: from a few hundred up to 1000 jobs/site
  • Event rate (evt/s): up to 15 Hz
  • Efficiency (job success rate): 80 %
  • CPU utilization: CPU time / wall time > 50 %
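The targets above can be evaluated per site from job totals. The following is a minimal sketch, assuming per-site sums of completed/failed jobs, CPU time, wall time and processed events are available; the `SiteSummary` record is hypothetical and not a Ganga or DA-team output format. Running jobs are left out of the efficiency, and the event rate is taken as total events over total wall time (an average per-job rate), which is one possible reading of the 15 Hz figure.

```python
from dataclasses import dataclass

TARGET_EFFICIENCY = 0.80       # fraction of finished jobs that succeed
TARGET_CPU_UTILIZATION = 0.50  # CPU time / wall time must exceed this


@dataclass
class SiteSummary:
    """Hypothetical per-site totals collected after a test round."""
    completed: int
    failed: int
    cpu_time_s: float
    wall_time_s: float
    events: int


def efficiency(s: SiteSummary) -> float:
    """Success rate over finished jobs (running jobs are excluded)."""
    finished = s.completed + s.failed
    return s.completed / finished if finished else 0.0


def cpu_utilization(s: SiteSummary) -> float:
    """Aggregate CPU time divided by aggregate wall time."""
    return s.cpu_time_s / s.wall_time_s if s.wall_time_s else 0.0


def event_rate_hz(s: SiteSummary) -> float:
    """Average events processed per second of job wall time."""
    return s.events / s.wall_time_s if s.wall_time_s else 0.0


def evaluate(s: SiteSummary) -> dict[str, bool]:
    """Compare a site against the efficiency and CPU utilization targets."""
    return {
        "efficiency": efficiency(s) >= TARGET_EFFICIENCY,
        "cpu_utilization": cpu_utilization(s) > TARGET_CPU_UTILIZATION,
    }
```

For example, a site with 90 completed and 10 failed jobs, 6000 s of CPU time over 10000 s of wall time and 100000 events gives 90 % efficiency, 60 % CPU utilization and 10 Hz, so it passes both checked targets.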