Atlas:Analysis ST 2009 Errors


Revision as of 14:16, 4 February 2009

30.01.09

Comments and Errors follow-up

Note that ATLAS production was running on the FR-Cloud on January 29.

* IN2P3-LPC_MCDISK: f(w) - Errors due to the load induced by MC production running at that time. The ST test jobs (2 x 50 jobs added) were then aborted, with the reason logged by the WMS:

 Got a job held event, reason: Unspecified gridmanager error
 Job got an error while in the CondorG queue.

The submission to the batch system failed because the '''maximum total number of jobs (GlueCEPolicyMaxTotalJobs) was reached'''; on the batch system this corresponds to max_queuable = 200 on the atlas queue:

 Jan 29 23:54:46 clrlcgce03 gridinfo: [25608-30993] Job 1233269583:
 lcgpbs:internal_ FAILED during submission to batch system lcgpbs
 01/29/2009 23:55:07;0080;PBS_Server;Req;req_reject;Reject reply code=15046(Maximum
 number of jobs already in queue), aux=0..

Probably this value is not checked by the WMS before submission, as it could be.
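
For reference, the limit published in the information system can be compared with the setting actually enforced by the batch system. A minimal sketch using standard BDII/Torque query tools; the BDII host and the exact GlueCEUniqueID below are placeholders, not the actual LPC values:

 # Query the limit published in the information system (placeholder host names)
 ldapsearch -x -H ldap://bdii.example.org:2170 -b o=grid \
     '(GlueCEUniqueID=ce.example.org:2119/jobmanager-lcgpbs-atlas)' \
     GlueCEPolicyMaxTotalJobs
 # Compare with the limit enforced by PBS/Torque on the CE itself
 qmgr -c "list queue atlas max_queuable"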

* IN2P3-CPPM_MCDISK: The same problem as in the previous test. Jobs ran forever with the error "send2dpm: DP000 - disk pool manager not running on marwn04.in2p3.fr". This happened for 13 jobs, which all started running at nearly the same time (Thu Jan 29 22:37:53) and failed around Jan 30 00:21. The stdout and stderr of two of them are available here:

http://marwww.in2p3.fr/~knoops/752629.marce01.in2p3.fr/

http://marwww.in2p3.fr/~knoops/752631.marce01.in2p3.fr/

The load of the local DPM server was around 9 at that time.
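
When DP000 errors like these appear, a quick first check on the DPM head node is whether the daemons are alive and how loaded the machine is. A minimal sketch, assuming a standard DPM service layout (run as root on the head node):

 # Check that the disk pool manager and name server daemons are running
 service dpm status
 service dpnsdaemon status
 # Current load average on the head node
 uptime
 # If the dpm daemon responds, this prints the pool configuration
 dpm-qryconf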