Difference between revisions of "Atlas:Analysis ST 2009 Errors"
Revision as of 16:40, 20 March 2009
30.01.09
Errors follow-up
- IN2P3-CPPM_MCDISK: Jobs kept running indefinitely after hitting an error and were killed by the batch system.
"send2dpm: DP000 - disk pool manager not running on marwn04.in2p3.fr".
This happened for 13 jobs, which all started running at nearly the same time (Thu Jan 29 22:37:53) and hit the error around Jan 30 00:21.
The stdout and stderr of two of these jobs are available here:
http://marwww.in2p3.fr/~knoops/752629.marce01.in2p3.fr/
http://marwww.in2p3.fr/~knoops/752631.marce01.in2p3.fr/
The load on the local DPM server was around 9 at that time.
- IN2P3-LAPP: during ST125 (30/01/09), jobs were still running after 2500 minutes, failing to connect to the LFC with the message:
send2nsd: NS002 - send error : _Csec_recv_token: Received magic:30e1301 expecting ca03 cannot connect to LFC
The service was up and running fine at that time. Is this due to an expired proxy?
- RO_07: ST jobs cannot store their output locally and always use the fail-over, storing output files in Lyon.
Is there still a configuration problem with the USERDISK token on the tbit00.nipne.ro SE?
lcg-cr --vo atlas -s ATLASUSERDISK -t 2400 -d srm://tbit00.nipne.ro/dpm/.....
dpm_getspacetoken: Unknown user space token description
- SOLVED IN2P3-LPC_MCDISK: f(w) - Errors due to the load induced by MC production running at that time. The ST test jobs (2 x 50 jobs added) were then aborted with an "Unspecified gridmanager error" logged by the WMS.
In fact, the submission to the batch system was failing because the maximum total number of jobs (GlueCEPolicyMaxTotalJobs) had been reached. The WMS probably does not check this value before submission, although it could.
Jan 29 23:54:46 clrlcgce03 lcgpbs:internal_ FAILED during submission to batch system lcgpbs(Maximum number of jobs already in queue)..
This limit has since been removed by the site.
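
Because the WMS only reported an "Unspecified gridmanager error", the real cause had to be read from the CE log. A minimal sketch of how such lines could be picked out of a log is given below. The regular expression is modelled only on the single log line quoted above, and the names `SUBMIT_FAILURE` and `find_submission_failures` are hypothetical, introduced for illustration; other lcgpbs message layouts are not covered.

```python
import re

# Assumption: the layout follows the one log line quoted on this page
# (syslog-style timestamp, host, then the lcgpbs failure message).
SUBMIT_FAILURE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) .*FAILED during submission to batch system "
    r"(?P<batch>\w+)\((?P<reason>[^)]*)\)"
)


def find_submission_failures(lines):
    """Return one dict per line that reports a failed batch submission."""
    failures = []
    for line in lines:
        match = SUBMIT_FAILURE.search(line)
        if match:
            failures.append(match.groupdict())
    return failures


# The log line quoted above, used as sample input.
sample = [
    "Jan 29 23:54:46 clrlcgce03 lcgpbs:internal_ FAILED during submission "
    "to batch system lcgpbs(Maximum number of jobs already in queue)..",
]

for failure in find_submission_failures(sample):
    print(failure["timestamp"], failure["host"], failure["reason"])
```

Grouping the extracted reasons by host would show whether a burst of aborted jobs comes from a queue limit rather than from storage load.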