Difference between revisions of "Atlas:Analysis Challenge"

An article from lcgwiki.
(First exercise on the FR Cloud (>= December 8th ))
Line 13: Line 13:
 
The first exercise will help to identify breaking points and bottlenecks. It is limited in time (a few days) and requires careful attention from site administrators during that period, in particular for network (internal and external), disk, and CPU monitoring.
 
  
This first try can be run centrally. ATLAS coordination (Dan van der Ster and Johannes Elmsheuser) needs to know which sites are to be tested and when.
+
This first try (stress tests) can be run centrally in a controlled manner. ATLAS coordination (Dan van der Ster and Johannes Elmsheuser) needs to know which sites are to be tested and when.
 
* See Results : http://gangarobot.cern.ch/st/
 
* Nov. 28 DA Submission (total 200 jobs on 12 sites): http://gangarobot.cern.ch/st/test_43/
+
====== Phase 1 - Stress test run centrally in a controlled manner (2 days) ======
 +
* Nov. 28 Submission (total 200 jobs on 12 sites): http://gangarobot.cern.ch/st/test_43/
 +
* Dec. 8-9: first round with Tokyo and possibly GRIF
 +
====== Phase 1 - id. (2 days) MC production stopped ======
 +
* Dec. 15-16: possibly LAPP, CC-IN2P3-T2 (to be confirmed), Tokyo, GRIF, LPC (to be contacted) (all T2s expected)
 +
* Dec. 17: Restart of MC production
 +
* Dec. 17-18: Data Analysis exercise open to physicists with their favorite application and tools
  
 
==== Procedure ====
 

Revision as of 17:31, 1 December 2008

Goals

  • measure "real" analysis job efficiency and turnaround on several sites of a given cloud
  • measure data access performance
  • check load balancing between different users and different analysis tools (Ganga vs pAthena)
  • check load balancing between analysis and MC production

First exercise on the FR Cloud (>= December 8th )

DA challenges were performed on the IT and DE clouds in October 2008. It has been proposed to extend this cloud-by-cloud challenge to the FR Cloud. See the ATLAS coordination DA challenge meeting (Nov. 20).

The first exercise will help to identify breaking points and bottlenecks. It is limited in time (a few days) and requires careful attention from site administrators during that period, in particular for network (internal and external), disk, and CPU monitoring.

This first try (stress tests) can be run centrally in a controlled manner. ATLAS coordination (Dan van der Ster and Johannes Elmsheuser) needs to know which sites are to be tested and when.

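  • See Results: http://gangarobot.cern.ch/st/

Phase 1 - Stress test run centrally in a controlled manner (2 days)
  • Nov. 28 Submission (total 200 jobs on 12 sites): http://gangarobot.cern.ch/st/test_43/
  • Dec. 8-9: first round with Tokyo and possibly GRIF

Phase 1 - id. (2 days) MC production stopped
  • Dec. 15-16: possibly LAPP, CC-IN2P3-T2 (to be confirmed), Tokyo, GRIF, LPC (to be contacted) (all T2s expected)
  • Dec. 17: Restart of MC production
  • Dec. 17-18: Data Analysis exercise open to physicists with their favorite application and tools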

Procedure

  • Replication of target datasets across the cloud
  • Preparation of the job
  • Generation of n jobs per site (each job processes 1 dataset)
  • Bulk submission to WMS (1 per site)
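
The generation and grouping steps above can be sketched as follows. This is only an illustrative outline in plain Python: it does not use the Ganga API, and the names `AnalysisJob`, `generate_jobs`, the site names and the dataset names are all hypothetical placeholders. It assumes that the replication step has already produced a mapping from each site to the datasets available there.

```python
from dataclasses import dataclass


@dataclass
class AnalysisJob:
    """Hypothetical description of one analysis job (not a Ganga object)."""
    site: str
    dataset: str


def generate_jobs(site_datasets, n_jobs_per_site):
    """Build one bulk (list of jobs) per site.

    Each job processes exactly one dataset, and at most
    n_jobs_per_site jobs are generated for a given site.
    """
    bulks = {}
    for site, datasets in site_datasets.items():
        bulks[site] = [AnalysisJob(site, ds) for ds in datasets[:n_jobs_per_site]]
    return bulks


# Hypothetical replica map: site name -> datasets replicated at that site.
replicas = {
    "SITE_A": ["dataset_1", "dataset_2", "dataset_3"],
    "SITE_B": ["dataset_1", "dataset_2"],
}

bulks = generate_jobs(replicas, n_jobs_per_site=2)
for site, jobs in bulks.items():
    # In the real exercise each bulk would go to the WMS in one submission.
    print(site, [job.dataset for job in jobs])
```

Keeping one bulk per site mirrors the "1 per site" submission rule, so a site can be added to or removed from a round without touching the others.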

Test conditions

The testing framework is Ganga-based. It currently uses the LCG backend, but it will soon be possible to use the PANDA backend as well. Both POSIX I/O and "copy mode" may be used, allowing a performance comparison of the two modes.
It uses regular AOD analysis in 14.2.20 with mc08*AOD*e*s*r5 DQ2 inputs.
Input datasets are read from ATLASMCDISK and outputs are stored on ATLASUSERDISK (no special requirements there). Input data access is the main issue; data output poses no problem.
Required CPU time: ~1 day (typical job duration: 5 h)
LAN saturation is observed at ~3 Hz with a 1 Gb network connection between WN and SE.

  • GangaRobot

The DA challenge runs similarly to the existing analysis functional tests: http://gangarobot.cern.ch/
See J. Elmsheuser's presentation (Nov. 5, 2008).
Sites should pass the GangaRobot test successfully, especially:

Participation required at cloud and site level. Any site in the Tiers_of_ATLAS list can participate.

It is possible for sites to limit the number of jobs sent at a time.
The DA team is ready to take site constraints into account.
The DA team is open to any metrics.
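
As an illustration of how a per-site limit on the number of jobs sent at a time could be honoured, the sketch below splits a site's job list into successive batches. It is a generic Python example, not part of the testing framework: the function name `split_into_batches` and the parameter `max_jobs_at_a_time` are assumptions made for this example.

```python
def split_into_batches(jobs, max_jobs_at_a_time):
    """Split a list of jobs into consecutive batches of bounded size.

    max_jobs_at_a_time is the (hypothetical) cap requested by the site.
    """
    if max_jobs_at_a_time < 1:
        raise ValueError("max_jobs_at_a_time must be at least 1")
    return [
        jobs[start:start + max_jobs_at_a_time]
        for start in range(0, len(jobs), max_jobs_at_a_time)
    ]


# Ten placeholder job identifiers, with a site cap of four jobs at a time.
batches = split_into_batches(list(range(10)), max_jobs_at_a_time=4)
for batch in batches:
    print(len(batch), batch)
```

Each batch would then be submitted only after the previous one has been accepted or completed, depending on what the site asks for.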

Target and metrics

  • Number of events: a few hundred up to 1000 jobs/site
  • Rate (evt/s): up to 15 Hz
  • Efficiency (success/failure rate): 80 %
  • CPU utilization: CPU time / wall time > 50 %
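
The metrics listed above can be computed from per-job records along these lines. This is a minimal sketch under stated assumptions: the `JobRecord` fields and the `summarize` function are hypothetical, efficiency is taken as successful jobs over all jobs, and the event rate is taken as events per wall-clock second summed over successful jobs, which is only one possible definition.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical per-job accounting record."""
    succeeded: bool
    events: int
    cpu_seconds: float
    wall_seconds: float


def summarize(records):
    """Return efficiency, CPU utilization and event rate for a set of jobs."""
    ok = [r for r in records if r.succeeded]
    total_cpu = sum(r.cpu_seconds for r in ok)
    total_wall = sum(r.wall_seconds for r in ok)
    total_events = sum(r.events for r in ok)
    return {
        # Fraction of jobs that finished successfully.
        "efficiency": len(ok) / len(records) if records else 0.0,
        # CPU time / wall time, over successful jobs only (assumption).
        "cpu_utilization": total_cpu / total_wall if total_wall else 0.0,
        # Events processed per second of wall time (assumption).
        "event_rate_hz": total_events / total_wall if total_wall else 0.0,
    }


# Made-up numbers, for illustration only.
records = [
    JobRecord(True, 10000, 600.0, 1000.0),
    JobRecord(True, 20000, 1400.0, 3000.0),
    JobRecord(False, 0, 10.0, 100.0),
]
summary = summarize(records)

# Compare against the targets quoted in the list above.
meets_efficiency = summary["efficiency"] >= 0.80
meets_cpu = summary["cpu_utilization"] > 0.50
print(summary, meets_efficiency, meets_cpu)
```

With these sample records the CPU utilization sits exactly at 0.5, which does not satisfy the strict "> 50 %" target.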

Results

See

2009 plans