Difference between revisions of "Atlas:Analysis ST 2009 Errors"

An article from lcgwiki.
(Comments, Sites feed-back and Errors follow-up related to ST test 124 and 125)
 
30.01.09
== Comments, Sites feed-back and Errors follow-up related to ST test 124 and 125==
*http://gangarobot.cern.ch/st/test_124/
*http://gangarobot.cern.ch/st/test_125/
'''Note that ATLAS Production was ON on the FR-Cloud on January 29'''

* IN2P3-LPC_MCDISK: f(w)  - Errors due to load induced by MC production running at that time + ST tests (2 x 50 jobs added)
Jobs are aborted with the following Logged Reason from the WMS: <br>
- Got a job held event, reason: Unspecified gridmanager error <br>
- Job got an error while in the CondorG queue.<br>
The submission to the batch system failed because the '''maximum number of jobs accepted in queue by the site was reached'''
  Jan 29 23:54:46 clrlcgce03 gridinfo: [25608-30993] Job 1233269583:
  lcgpbs:internal_ FAILED during submission to batch system lcgpbs
  01/29/2009 23:55:07;0080;PBS_Server;Req;req_reject;Reject reply code=15046(Maximum number of jobs already in queue), aux=0..

Latest revision as of 13:25, 15 May 2009

--Chollet 14:24, 15 May 2009 (CEST)

Errors follow-up - Known issues

  • Corrupted input AOD files found
 AOD.027097._37998.pool.root  
 AOD.027579._24654.pool.root 
 AOD.027076._10514.pool.root

- Badread error example

rfio:clrgpfssrv03-dpm.in2p3.fr//storage/atlas1/atlas/2008-10-30/AOD.027579._24654.pool.root.2781052.0 at byte:32849441, branch:m_genParticles.m_endVtx, entry:217, badread=0

- To check whether the file is corrupted on the SE, please refer to the ATLAS procedure (it requires a certificate approved by ATLAS) defined here:
https://twiki.cern.ch/twiki/bin/view/Atlas/DDMOperationProcedures#Checksum_error_triggered_by_dq2
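As a complement to the procedure above, a site admin who has a local copy of a suspect file can compute its checksum and compare it by hand with the value recorded in the catalogue. The sketch below assumes the catalogue checksum is an Adler-32 value written as 8 hexadecimal digits; the function name `adler32_of_file` is purely illustrative and is not part of any ATLAS or DQ2 tool.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits.

    The file is read in chunks so that large AOD files do not need to
    fit in memory. (Assumption: the reference checksum is Adler-32.)
    """
    value = 1  # Adler-32 is defined to start from 1
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    # Mask to 32 bits so the result is always an unsigned value
    return format(value & 0xFFFFFFFF, "08x")
```

A mismatch between this value and the catalogue value would suggest that the replica on the SE is corrupted. A match would point instead to a transient read problem.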

  • IN2P3-CPPM / IN2P3-LPC : overload of lcg-gt during SURL to TURL conversion

Jobs run forever with an error and are eventually killed by the batch system. The error should at least be caught by Ganga - Savannah ticket opened : https://savannah.cern.ch/bugs/index.php?48537

 send2nsd: NS009 - fatal configuration error: Host unknown:  dpnshome.in2p3.fr
 send2dpm: DP000 - disk pool manager not running on marwn04.in2p3.fr 

This happened for 13 jobs, which all started running at nearly the same time (Thu Jan 29 22:37:53) and went into error around Jan 30 00:21. I have put the stdout and stderr of two of them here :
http://marwww.in2p3.fr/~knoops/752629.marce01.in2p3.fr/
http://marwww.in2p3.fr/~knoops/752631.marce01.in2p3.fr/
Heavy load of the local DPM server observed at that time.
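To illustrate what "catching" such errors could look like, here is a minimal sketch that scans a job's stderr for the two fatal messages quoted above, so that a wrapper could fail the job early instead of letting it run until the batch system kills it. This is not Ganga code: the names `FATAL_PATTERNS` and `find_fatal_errors` are hypothetical, and the patterns cover only the two messages seen in this report.

```python
import re

# Fatal messages observed in the stderr of the failing jobs (from this page).
FATAL_PATTERNS = {
    "NS009": re.compile(r"send2nsd: NS009 - fatal configuration error"),
    "DP000": re.compile(r"send2dpm: DP000 - disk pool manager not running"),
}


def find_fatal_errors(stderr_text):
    """Return a list of (code, line) pairs for every fatal message found."""
    hits = []
    for line in stderr_text.splitlines():
        for code, pattern in FATAL_PATTERNS.items():
            if pattern.search(line):
                hits.append((code, line.strip()))
    return hits
```

A wrapper could call `find_fatal_errors` on the stderr collected so far and abort the job as soon as the returned list is non-empty.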

  • IN2P3-LAPP : during ST125 (30/01/09), jobs were still running after 2500 minutes, failing to connect to LFC with the message :
send2nsd: NS002 - send error : _Csec_recv_token: Received magic:30e1301 expecting ca03
cannot connect to LFC 

The service was up and running fine at that time. Is it due to an expired proxy?
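The expired-proxy hypothesis can be checked with simple arithmetic: 2500 minutes is roughly 41.7 hours, so any proxy with a shorter lifetime would have expired while the job was still running. The sketch below only encodes that comparison. The function name is illustrative, and the lifetimes in the usage note are example values, not the values actually used in ST125.

```python
def proxy_expired_during_job(proxy_lifetime_hours, job_elapsed_minutes):
    """Return True if the job has run longer than the proxy lifetime.

    Both arguments are plain numbers; the proxy lifetime would have to be
    taken from the actual proxy used by the job (an assumption here).
    """
    return job_elapsed_minutes > proxy_lifetime_hours * 60
```

For example, `proxy_expired_during_job(12, 2500)` is true, while a hypothetical 48-hour proxy would still be valid after 2500 minutes.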

  • RO_07 : ST jobs cannot store their output locally and always use the fail-over, storing output files in Lyon

Is there still a configuration problem with the USERDISK token on the tbit00.nipne.ro SE?

lcg-cr --vo atlas -s ATLASUSERDISK  -t 2400 -d srm://tbit00.nipne.ro/dpm/.....
dpm_getspacetoken: Unknown user space token description