Difference between revisions of "Atlas:Analysis ST 2009 Errors"
Latest revision as of 13:25, 15 May 2009
--Chollet 14:24, 15 May 2009 (CEST)
Errors follow-up - Known issues
- Corrupted input AOD files found
AOD.027097._37998.pool.root
AOD.027579._24654.pool.root
AOD.027076._10514.pool.root
- Badread error example
rfio:clrgpfssrv03-dpm.in2p3.fr//storage/atlas1/atlas/2008-10-30/AOD.027579._24654.pool.root.2781052.0 at byte:32849441, branch:m_genParticles.m_endVtx, entry:217, badread=0
- To check whether the file is corrupted on the SE, please refer to the ATLAS procedure
(requires a certificate approved by ATLAS) defined here:
https://twiki.cern.ch/twiki/bin/view/Atlas/DDMOperationProcedures#Checksum_error_triggered_by_dq2
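As a rough illustration of the kind of check involved, and not a replacement for the ATLAS procedure linked above, the sketch below computes the Adler-32 checksum of a local copy of a file so that it can be compared by hand with the value recorded in the catalogue. It assumes that the catalogue stores an Adler-32 checksum and that the file has already been copied locally; the function names are hypothetical.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as an 8-digit lowercase hex string."""
    checksum = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed the running checksum back in so large files are read in pieces.
            checksum = zlib.adler32(chunk, checksum)
    return format(checksum & 0xFFFFFFFF, "08x")


def matches_catalogue(path, expected_hex):
    """Compare the local checksum with a catalogue value (assumed to be Adler-32 hex)."""
    return adler32_of_file(path) == expected_hex.strip().lower().zfill(8)
```

A mismatch on a local copy suggests, but does not prove, that the replica on the SE is corrupted; the copy itself may have failed.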
- IN2P3-CPPM / IN2P3-LPC : overload of lcg-gt during SURL to TURL conversion
Jobs run indefinitely in an error state until they are killed by the batch system. The error should at least be caught by Ganga. Savannah ticket opened: https://savannah.cern.ch/bugs/index.php?48537
send2nsd: NS009 - fatal configuration error: Host unknown: dpnshome.in2p3.fr
send2dpm: DP000 - disk pool manager not running on marwn04.in2p3.fr
This happened for 13 jobs, which all started running at nearly the same time (Thu Jan 29 22:37:53) and went into error around Jan 30 00:21.
The stdout and stderr of two of these jobs are available here:
http://marwww.in2p3.fr/~knoops/752629.marce01.in2p3.fr/
http://marwww.in2p3.fr/~knoops/752631.marce01.in2p3.fr/
Heavy load of the local DPM server observed at that time.
- IN2P3-LAPP : during ST125 (30/01/09), jobs were still running after 2500 minutes, failing to connect to the LFC with the message:
send2nsd: NS002 - send error : _Csec_recv_token: Received magic:30e1301 expecting ca03
cannot connect to LFC
The service was up and running fine at that time. Is it due to an expired proxy?
- RO_07 : ST jobs cannot store their output locally and always use the fail-over, storing output files in Lyon.
Is there still a configuration problem with the USERDISK token on the tbit00.nipne.ro SE?
lcg-cr --vo atlas -s ATLASUSERDISK -t 2400 -d srm://tbit00.nipne.ro/dpm/.....
dpm_getspacetoken: Unknown user space token description
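
The error messages collected on this page could be matched automatically in job stderr files instead of letting jobs run until the batch system kills them. The sketch below is only an illustration of that idea: it is not Ganga code, the labels and the function name are hypothetical, and the patterns cover only the messages quoted above.

```python
import re

# Signatures taken from the messages quoted on this page; labels are illustrative.
KNOWN_ISSUES = [
    (re.compile(r"badread=\d+"), "corrupted input file (bad read)"),
    (re.compile(r"send2nsd: NS009\b"), "name server host unknown"),
    (re.compile(r"send2dpm: DP000\b"), "disk pool manager not running"),
    (re.compile(r"send2nsd: NS002\b|cannot connect to LFC"), "LFC connection failure"),
    (re.compile(r"dpm_getspacetoken: Unknown user space token"), "unknown space token"),
]


def classify_stderr(text):
    """Return the labels of known issues found in text, in order of first appearance."""
    found = []
    for line in text.splitlines():
        for pattern, label in KNOWN_ISSUES:
            if pattern.search(line) and label not in found:
                found.append(label)
    return found
```

For example, feeding it the two lines reported for IN2P3-CPPM / IN2P3-LPC yields the "name server host unknown" and "disk pool manager not running" labels, which a wrapper script could use to abort the job early.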