Atlas:Analysis ST 2009 Errors

--[[User:Chollet|Chollet]] 14:24, 15 May 2009 (CEST)
== Errors follow-up - Known issues ==
* '''Corrupted input AOD files found'''
  AOD.027097._37998.pool.root 
  AOD.027579._24654.pool.root
  AOD.027076._10514.pool.root
- Badread error example
rfio:clrgpfssrv03-dpm.in2p3.fr//storage/atlas1/atlas/2008-10-30/AOD.027579._24654.pool.root.2781052.0 at byte:32849441, branch:m_genParticles.m_endVtx, entry:217, badread=0
- To check whether the file is corrupted on the SE, please refer to the ATLAS procedure
(requires a certificate approved by ATLAS) defined here: <br>
https://twiki.cern.ch/twiki/bin/view/Atlas/DDMOperationProcedures#Checksum_error_triggered_by_dq2
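
As a quick local cross-check (a sketch only, not a replacement for the ATLAS procedure linked above), something along these lines can recompute the adler32 checksum of a downloaded replica for comparison with the value recorded in the catalogue; the file path and the expected checksum are passed on the command line as placeholders.

 #!/usr/bin/env python
 # Sketch: recompute the adler32 checksum of a locally downloaded replica
 # and compare it with the value expected from the catalogue.
 # The file path and expected checksum are command-line placeholders.
 import sys
 import zlib
 
 def adler32(path, chunk_size=1024 * 1024):
     """Return the adler32 checksum of a file as an 8-digit hex string."""
     value = 1  # standard adler32 seed
     with open(path, 'rb') as f:
         while True:
             chunk = f.read(chunk_size)
             if not chunk:
                 break
             value = zlib.adler32(chunk, value)
     return '%08x' % (value & 0xffffffff)
 
 if __name__ == '__main__':
     path = sys.argv[1]      # e.g. AOD.027579._24654.pool.root.2781052.0
     expected = sys.argv[2]  # checksum recorded in the catalogue
     actual = adler32(path)
     if actual == expected:
         print('checksum OK: %s' % actual)
     else:
         print('MISMATCH: local=%s catalogue=%s' % (actual, expected))
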
* IN2P3-CPPM / IN2P3-LPC :  '''overload of lcg-gt during SURL to TURL conversion'''
Jobs run forever with this error and are eventually killed by the batch system.
The error should at least be caught by Ganga - Savannah ticket opened:
https://savannah.cern.ch/bugs/index.php?48537
  send2nsd: NS009 - fatal configuration error: Host unknown:  dpnshome.in2p3.fr
  send2dpm: DP000 - disk pool manager not running on marwn04.in2p3.fr
This happened for 13 jobs, all starting to run at nearly the same time (Thu Jan 29 22:37:53) and failing around Jan 30 00:21.
The stdout and stderr of two of them are available here :<br>
http://marwww.in2p3.fr/~knoops/752629.marce01.in2p3.fr/<br>
http://marwww.in2p3.fr/~knoops/752631.marce01.in2p3.fr/<br>
Heavy load on the local DPM server was observed at that time.
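A possible mitigation on the job side (a sketch only, assuming the usual lcg-gt call of SURL plus protocol; the SURL and the 300 s limit below are placeholders) is to wrap the SURL-to-TURL conversion in a hard timeout, so that a hanging lcg-gt fails quickly instead of leaving the job running until the batch system kills it:

 #!/usr/bin/env python
 # Sketch: run an lcg-gt SURL->TURL conversion with a hard timeout so that a
 # hanging call fails quickly instead of blocking the job forever.
 # The SURL and the timeout value are placeholders.
 import subprocess
 import time
 
 def run_with_timeout(cmd, timeout):
     """Run cmd (a list); kill it and return None if it exceeds timeout seconds."""
     proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
     deadline = time.time() + timeout
     while proc.poll() is None:
         if time.time() > deadline:
             proc.kill()
             return None
         time.sleep(1)
     return proc.communicate()[0]
 
 if __name__ == '__main__':
     surl = 'srm://clrgpfssrv03-dpm.in2p3.fr/dpm/in2p3.fr/home/atlas/...'  # placeholder
     out = run_with_timeout(['lcg-gt', surl, 'rfio'], 300)
     if out is None:
         print('lcg-gt did not answer within 300 s, giving up on this replica')
     else:
         print(out.decode().strip())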


== Comments, site feedback and errors follow-up related to ST tests 124 and 125 ==
*http://gangarobot.cern.ch/st/test_124/
*http://gangarobot.cern.ch/st/test_125/
* IN2P3-LAPP : during ST125 (30/01/09), '''jobs still running after 2500 minutes, failing to connect to LFC''' with the messages:
  send2nsd: NS002 - send error : _Csec_recv_token: Received magic:30e1301 expecting ca03
  cannot connect to LFC
'''Note that ATLAS Production was ON on the FR-Cloud on January 29.'''
The service was up and running fine at that time. Is it due to an expired proxy?
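To test the expired-proxy hypothesis, a small check like the following could be run just before the LFC lookup (a sketch; it assumes voms-proxy-info is available in the job environment), so that an expired proxy produces a clear error rather than a hang:

 #!/usr/bin/env python
 # Sketch: check the remaining proxy lifetime before contacting the LFC,
 # so that an expired proxy is reported explicitly instead of hanging.
 # Assumes voms-proxy-info is available in the job environment.
 import subprocess
 import sys
 
 def proxy_time_left():
     """Return the remaining proxy lifetime in seconds, or 0 on any error."""
     try:
         out = subprocess.Popen(['voms-proxy-info', '-timeleft'],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE).communicate()[0]
         return int(out.decode().strip())
     except (OSError, ValueError):
         return 0
 
 if __name__ == '__main__':
     left = proxy_time_left()
     if left <= 0:
         sys.exit('proxy expired or missing - refusing to contact the LFC')
     print('proxy still valid for %d s' % left)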


* IN2P3-LPC_MCDISK: f(w)  - Errors due to load induced by MC production running at that time + ST tests (2 x 50 jobs added)
* RO_07 : ST jobs cannot store their output locally and always use the fail-over, storing output files in Lyon
Jobs are aborted by the WMS with a logged reason <br>
Is there still a configuration problem with the USERDISK space token on the tbit00.nipne.ro SE?
- Got a job held event, reason: Unspecified gridmanager error <br>
  lcg-cr --vo atlas -s ATLASUSERDISK -t 2400 -d srm://tbit00.nipne.ro/dpm/.....
- Job got an error while in the CondorG queue.<br>
  dpm_getspacetoken: Unknown user space token description
Submission to the batch system failed because the '''maximum number of jobs accepted in the queue by the site was reached'''.
** cf. queue atlas max_queuable = 200 in the batch system ('GlueCEPolicyMaxTotalJobs')
  Jan 29 23:54:46 clrlcgce03 gridinfo: [25608-30993] Job 1233269583: lcgpbs:internal_ FAILED during submission to batch system lcgpbs
  01/29/2009 23:55:07;0080;PBS_Server;Req;req_reject;Reject reply code=15046(Maximum number of jobs already in queue), aux=0..
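For reference, a minimal sketch of the fail-over behaviour described above: try lcg-cr against the local SE and space token first, then fall back to the Lyon SE. The output file path, both destination SURLs and the Lyon endpoint are placeholders; the lcg-cr options are the ones quoted in the log above.

 #!/usr/bin/env python
 # Sketch of the fail-over logic: try lcg-cr to the local SE / space token,
 # fall back to the Lyon SE if that fails. All paths and SURLs are placeholders;
 # the lcg-cr options match those quoted above.
 import subprocess
 
 def lcg_cr(local_file, dest_surl, space_token):
     """Copy and register local_file to dest_surl; return True on success."""
     cmd = ['lcg-cr', '--vo', 'atlas', '-s', space_token, '-t', '2400',
            '-d', dest_surl, 'file:%s' % local_file]
     return subprocess.call(cmd) == 0
 
 if __name__ == '__main__':
     output = '/tmp/user.analysis.output.root'                         # placeholder
     local_dest = 'srm://tbit00.nipne.ro/dpm/nipne.ro/home/atlas/...'  # placeholder
     lyon_dest = 'srm://ccsrm.in2p3.fr/pnfs/in2p3.fr/data/atlas/...'   # placeholder
     if lcg_cr(output, local_dest, 'ATLASUSERDISK'):
         print('output stored on the local SE')
     elif lcg_cr(output, lyon_dest, 'ATLASUSERDISK'):
         print('local SE failed, output stored in Lyon (fail-over)')
     else:
         raise SystemExit('both local and fail-over storage failed')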
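To see whether the queue limit explains the rejected submissions, the published GLUE values can be compared with the number of jobs already queued; below is a sketch querying the CE's resource BDII with ldapsearch (the host name and LDAP base are assumptions and may differ per site):

 #!/usr/bin/env python
 # Sketch: query the CE resource BDII for the queue limit and current load.
 # Host name and LDAP base are placeholders; the GLUE attribute names are
 # standard GLUE 1.3 ones (GlueCEPolicyMaxTotalJobs, GlueCEStateTotalJobs).
 import subprocess
 
 BDII_HOST = 'clrlcgce03.in2p3.fr'  # placeholder: CE / resource BDII host
 CMD = ['ldapsearch', '-x', '-LLL',
        '-h', BDII_HOST, '-p', '2170',
        '-b', 'mds-vo-name=resource,o=grid',
        '(objectClass=GlueCE)',
        'GlueCEUniqueID', 'GlueCEPolicyMaxTotalJobs', 'GlueCEStateTotalJobs']
 
 if __name__ == '__main__':
     out = subprocess.Popen(CMD, stdout=subprocess.PIPE).communicate()[0]
     print(out.decode())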
