SA3
TSA3.2.3: Operate certification and testing test beds

Torque/Maui Certification Test Plan and Results

February 14, 2007

Table of Contents:

  1. - Conclusions for the impatient reader
  2. - Introduction
    • Document Status
    • Overview
    • Scope
  3. - Certification Approach
    • Test Planning
    • Test Framework
  4. - Test Environment
    • Hardware
    • Software
  5. - Performed Tests
  6. - Appendix A: Required Modifications
  7. - Appendix B: Test Data and Logs

1 - Conclusions for the impatient reader

The Torque-Maui batch system should be considered one of the most robust components in the software stack used by a grid site. All of the test cases described in the following report have been executed numerous times. There were repetitions of these tests that did not conclude, yet Torque-Maui was rarely the reason: Torque-Maui comes last in the chain of steps required for a job to be completed, and most of the time failures of other software components did not allow the tests to reach this point. On the other hand, functionality and stress tests designed to submit jobs directly to the Local Resource Management System (LRMS) showed that Torque-Maui handles load very well and behaves as expected. In fact, given an average grid site configuration and normal usage patterns, it is quite difficult to stress a server node dedicated to the Torque-Maui system before other nodes of the gLite middleware (e.g. WMS, gliteCE) have already reached their limits.

Regarding Maui, apart from some additional configuration steps that need to be performed during the installation, there is only a memory management issue worth mentioning. Maui's process size is proportional to its maximum queue length: as the maximum length of its queues increases, the process size also increases, and the allocated memory is never freed, even after hours of idle time and empty queues. While this might be considered normal behaviour, in accordance with Maui's internal design and structure management, it requires careful planning of the memory resources allocated to the node that will host the Torque and Maui servers.

2 - Introduction

2.1 - Document Status

Information for the latest version:

Document Name: Torque/Maui Certification Test Plan and Results
Version: 1.0
Publication Date: 2007-02-13
Author(s): Nikos Voutsinas, Dimitrios Apostolou, Konstantinos Koukopoulos
Contact: For any questions or comments about this document, please write to contact@gridctb.uoa.gr
Status: In progress

Revision history
Version: 1.2
Date: 2007-02-14
Partner/Author(s): UOA/Nikos Voutsinas, UOA/Dimitrios Apostolou, UOA/Konstantinos Koukopoulos
Comments: First Release; Torque 2.1.6, Maui 3.2.6p17

2.2 - Overview

2.3 - Scope

Torque-Maui is the LRMS proposed by the gLite middleware. In the framework of the SA3 activity, GRNET/UOA is the partner responsible for the certification of new Torque-Maui releases. This document provides a description of the certification method followed.

Test results from the certification of the latest Torque-Maui release (Torque 2.1.6, Maui 3.2.6p17) are also provided.

3 - Certification Approach

3.1 - Test Planning

The test cases are organized in two levels. The first level covers Torque-Maui as part of the full stack of the gLite middleware; in this case, test cases are based on gLite tools related to grid job submission and resource monitoring. The second level concentrates on verifying Torque-Maui as a stand-alone batch system, independently of other middleware components. Different types of test cases covered the certification needs at each level, across the full range of functionality, stress and performance tests.

The certification covers both the installation and configuration of Torque-Maui and its run-time behaviour.

Installation and Configuration Certification:

The installation and configuration scenarios include test cases to verify that:

  1. The installation procedure doesn't fail (Procedure Completes Tests)
  2. The installation and configuration procedure brings the system to an acceptable initial state (Install Integrity Tests)
  3. The upgrade from the previous version works as expected, i.e. without breaking existing functionality and system configuration (Upgrade Completes Tests)

System Operation Certification:

In order to validate the entire system against its operational requirements, the related scenarios include test cases to verify that:

  1. The system operates in accordance with the functional requirements (Functionality Tests)
  2. The system behaves acceptably at established peak load conditions, when subjected to extreme data and event traffic (Stress Tests)
  3. The system scales at an established rate (Scalability Tests)
  4. The system is resistant against compromise attempts (Security Tests)
  5. The system meets its performance requirements under various circumstances (Performance Tests)
  6. The system uses an established amount of resources (memory, disk, network, ...) (Resource Usage Tests)
  7. The system works properly with other applications (Compatibility Tests)
  8. The system is available over an extended period or number of requests (Reliability/Availability Tests)
  9. The system recovers after failure to a previous working state with minimum loss of information (Resilience Tests)

3.2 - Test Framework

The test framework consists of several processes and utilities to facilitate testing automation and results analysis. The following are considered the main building blocks of this framework:

  1. JDLs: Various job types have been used to simulate different types of load.
  2. gLite job submission process: Multiple UIs and multiple WMS are used to submit jobs to the LRMS that is being tested. It should be noted that the test scripts used in this case are based on the glite-wms-* utilities (e.g. glite-wms-job-submit, glite-wms-job-cancel etc.) and, as a consequence, this process is highly dependent on the operation of the WMProxy and gliteCE middleware components.
  3. Batch system job submission process: Jobs are submitted directly to the LRMS that is being tested. The test scripts used in this case are based on the qsub, qstat and qdel utilities.
  4. Monitoring process and results archiving: LRMS processes and system nodes are monitored using simple scripts, to facilitate results validation and verification.
  5. Certification documents and report: The certification method, the utilities used and the certification results. The latest version of all of these is made available through CVS as the 'ctb' module at :pserver:anoncvs@email.uoa.gr/egee (Web access).

Limitations

Various limitations, such as the number of available hardware resources and the test environment configuration, need to be considered during the evaluation of the test results. Due to these limitations, there were test cases that could not be executed.

The most important limitations on the current version were the following:

All of the above limit the test coverage of concurrency issues and the simulation of specific real-world usage patterns.

Tools

A test suite (bash shell scripts) has been developed to support a series of test cases, especially those covering the Stress, Performance and Resource tests. An external node (see Test Environment) has been used to execute the test scripts in test cases where script operation might impact the test results.

Presentation and archiving of results

In order to proceed with a quantitative analysis of the results, metrics were specified to facilitate system and process monitoring. A collection of bash shell scripts and cron jobs is used to gather monitoring data from the involved grid nodes on a per-minute basis and store them in text files residing on an NFS filesystem. Following that, to ease the comparison and in-depth analysis of the results, a batch process inserts the monitoring data into a database backend. To illustrate test results and conclusions, monitoring data are provided through graphical representations where appropriate.
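
A minimal sketch of such a per-minute monitoring script follows, run from cron on each node; the output directory, file layout and field names are assumptions for illustration, and the actual scripts are available through the CVS 'ctb' module mentioned above.

#!/bin/bash
# Sketch: gather per-minute monitoring data and append it to a text file
# residing on the shared NFS filesystem. OUTDIR and the output format are
# assumptions for illustration.

OUTDIR=/mnt/nfs/monitoring              # assumed NFS mount point
HOST=$(hostname -s)
NOW=$(date '+%Y-%m-%d %H:%M')

# 1-minute load average
LOAD=$(cut -d' ' -f1 /proc/loadavg)

# %cpu and virtual size of the LRMS-related processes present on this node
PROCS=$(ps -C pbs_server,pbs_mom,maui -o comm=,pcpu=,vsz= | tr '\n' ';')

# queued/running job counts as reported by the LRMS (on nodes with qstat)
JOBS=$(qstat 2>/dev/null | awk 'NR>2 {s[$5]++} END {printf "Q=%d R=%d", s["Q"], s["R"]}')

echo "$NOW $HOST load=$LOAD procs=$PROCS jobs=$JOBS" >> "$OUTDIR/$HOST.log"

A crontab entry such as "* * * * * /usr/local/bin/monitor.sh" (path assumed) would invoke it once a minute.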

Monitored data by node (X = data collected on that node):

Node              Load Average   %cpu,size of processes           Jobs status (qstat)
gliteCE/SiteBDII  X              GIP, BDII                        -
Worker            X              pbs_mom                          -
Torque/Maui       X              pbs_server, maui, BLParserPBS    X

4 - Test Environment

4.1 - Hardware

Site used: EGEE-SEE-CERT
Date: 2007/02/12
Hosts:
  ctb03.gridctb.uoa.gr - site BDII, gliteCE
  ctb07.gridctb.uoa.gr - Torque head node

4.2 - Software

LRMS type: TORQUE
LRMS version: 2.1.6
JobManager type: Maui
JobManager version: 3.2.6p17

RPM packages installed:

TORQUE_server:
  torque-2.1.6-1cri_sl3_2st
  torque-client-2.1.6-1cri_sl3_2st
  torque-devel-2.1.6-1cri_sl3_2st
  torque-server-2.1.6-1cri_sl3_2st
  glite-torque-server-config-2.3.4-0
  maui-client-3.2.6p17-1_sl3
  maui-server-3.2.6p17-1_sl3

gliteCE:
  torque-2.1.6-1cri_sl3_2st
  torque-docs-2.1.6-1cri_sl3_2st
  torque-client-2.1.6-1cri_sl3_2st
  maui-client-3.2.6p17-1_sl3
  maui-3.2.6p17-1_sl3

WN:
  glite-torque-client-config-2.1.2-0
  torque-mom-2.1.6-1cri_sl3_2st

Configuration: BS head node installed off the CE

Repository base URL: http://lxb2042.cern.ch/gLite/APT/R3.0-cert
Subdirectories: rhel30, externals, Release3.0, updates, updates.certified, internal,
patch950.uncertified, patch985.uncertified, patch991.uncertified, patch1010.uncertified

5 - Performed Tests

The following table gives an overview of the test cases and links to more detailed test case descriptions.

Test Case Overview

Test case - Description
5.1. check Batch System information published through BDII
5.2. check Batch System configuration
5.3. check network ports and services
5.4. check logging
5.5. checking General Information Providers (GIPs)
5.6. job submission of a few long lived, cpu intensive jobs
5.7. 200 job submissions using 1 WMS
5.8. 400 job submissions using 2 WMS
5.9. 600 job submissions using 3 WMS
5.10. job submissions directly to Torque
5.11. parallel job submissions directly to Torque
5.12. stressing the LRMS memory management
5.13. Batch System Resilience Tests

5.1. check Batch System information published through BDII

Description:
Check the Batch System specific entries published on the root and the site BDII. Specifically, check whether the type and version of the LRMS are published correctly.
Comments:
As can be seen from the test results, the information is published correctly through both the Root and the Site BDII.

5.2. check Batch System configuration

Description:
View the Batch System configuration by using various commands and by viewing the relevant configuration files, on all three related nodes (gliteCE, TORQUE_server, WN_torque).
Comments:
The configuration files have default values, except for specific changes that were applied to make the site work with separate Torque and gliteCE nodes.

5.3. check network ports and services

Description:
Check which Batch System related services run on which ports, on related nodes (gliteCE, TORQUE_server, WN_torque).
Comments:
On the gliteCE, BLParserPBS is not running by default and needs to be started manually after each reboot, due to a bug in yaim. No listening daemons providing Batch System services were found on the gliteCE. The only Batch System specific connections on the gliteCE are outgoing ones, initiated from client programs.

5.4. check logging

Description:
Check proper logging of all services in all three nodes (gliteCE, TORQUE_server, WN_Torque). Log files are also monitored during job submissions, and specific job related information is checked.
Comments:
Everything is logged as expected. The test job is properly logged through the Batch System. The BLahPD logger was reporting a wrong lrmsID, which was fixed after applying patch 991.

5.5. checking General Information Providers (GIPs)

Description:
Watch Information Providers' results while running long lived, CPU intensive jobs. Compare with results received directly from the LRMS.
Comments:
The information providers give results comparable to what the LRMS client tools show. The only exception is when the LRMS is loaded with many short duration jobs. In this case, lcg-info-dynamic-scheduler-wrapper spontaneously reports "GlueCEStateEstimatedResponseTime: 777777" for the empty queues.

5.6. job submission of a few long lived, cpu intensive jobs

Description:
Job submission of a few long lived, cpu intensive jobs (jdl). In particular, submission of 10 jobs through the following path:
Results:
Comments:
As shown in the first of the following graphs, the lcg-info-dynamic-scheduler-wrapper results provided by the rgma user (every 10 min) are correct. In contrast, the lcg-info-dynamic-scheduler-wrapper results provided by the edguser user (every 1 min) fall back to the static LDIF, because of the user's insufficient permissions. The problem was resolved when edguser was added to the ADMIN line in /var/spool/maui/maui.cfg on the TORQUE_server node. Why is the BDII being updated by both the rgma (every 10 min) and the edguser (every 1 min) users?
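
The fix amounts to a single line in /var/spool/maui/maui.cfg on the TORQUE_server node; the sketch below appends edguser to the ADMIN3 line shown in Appendix B (whether ADMIN3 is the appropriate level is an assumption based on the existing configuration):

ADMIN3                  edginfo rgma edguser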

10 long lived, cpu intensive jobs

5.7. 200 job submissions using 1 WMS

Description:
Job submission of many short lived (instantaneous) simple jobs. In particular, 100 job submissions using each one of the following paths:
Results:
Comments:
Job submission via a single WMS is not enough to stress the LRMS subsystem. Average loads on the gliteCE and TORQUE_server nodes are minimal and the LRMS queue length doesn't grow, as jobs seem to finish faster than they are being submitted from the WMS.

200 job submissions using 1 WMS

5.8. 400 job submissions using 2 WMS

Description:
Job submission of many short lived (instantaneous) simple jobs. In particular, 100 job submissions using each one of the following paths:
Results:
Comments:

This test is not capable of stressing the gliteCE or the TORQUE_server either, as the load average is minimal on these nodes. However, it seems that jobs are being submitted faster than they are run (even though they are instantaneous), and the LRMS queue reaches a maximum length of 118.

The only Grid component that seems to be stressed in this test is the WMS, in particular WMS-3, even though the same number of jobs was submitted to WMS-3 and WMS-1. Perhaps this is a result of the network distance between the UIs and WMS-3, of overload due to other submissions taking place, or generally of bad scalability of the WMS subsystem. It should finally be noted that WMS-3 reported incorrect numbers for a short period of time, such as 0 jobs completed (22:53-22:54) or more jobs completed than the actual results (23:12).

400 job submissions using 2 WMS

5.9. 600 job submissions using 3 WMS

Description:
Job submission of many short lived (instantaneous) simple jobs. In particular, 100 job submissions using each one of the following paths:
Results:
Comments:
  1. Many jobs never reached the LRMS at all, due to WMS problems. For example, WMS-2 reported 19 jobs as "running" for a long time (many hours later they became aborted) and reported 1 job as "waiting" *forever*. WMS-3 didn't even accept some job submissions from the beginning and reported many jobs as aborted; its numbers were increasing and decreasing pretty much randomly, for example the number of completed jobs was at some point diminishing greatly. Because of the abysmal WMS performance, it was very hard to produce a graph showing the details, as in the previous cases.
  2. WMS-1 reached the 1 min LA threshold (15)
  3. WMS-2 reached the 1 min LA threshold (10)
  4. Maximum Torque jobs queue length: 250
  5. Torque/Maui load: negligible
  6. WorkerNode load: negligible
  7. gliteCE load: normal. As also demonstrated in the following test cases, the gliteCE's load is directly proportional to the number of queued jobs. While one would expect the gliteCE to be unaffected by the LRMS queue length, exactly the opposite happens, because the GIP scripts running on it are stressed more and more for each queued job.

600 job submissions using 3 WMS

5.10. job submissions directly to Torque

Description:
Many instantaneous jobs submitted directly to the LRMS using qsub, to stress Torque as no WMS can, and to show its actual limits (the rate at which jobs are actually being submitted, and the rate at which instantaneous jobs finish and are replaced by others). In particular, 1000 simple jobs were submitted, followed by 2000, 4000 and finally 5000 jobs. All Torque client commands were executed from a third node, to avoid unnecessarily stressing the important nodes.
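
A minimal sketch of such a serial submission run, assuming the dteam queue and a trivial job body; the actual test suite resides in the CVS 'ctb' module noted in section 3.2.

#!/bin/bash
# Sketch: serially submit N instantaneous jobs directly to Torque.
# Queue name and job body are assumptions for illustration.
N=${1:-1000}
for i in $(seq 1 "$N"); do
    echo '/bin/true' | qsub -q dteam -N "stress$i"   # qsub reads the job script from stdin
done
qstat -q    # observe queue length and completion rate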
Comments:

5.11. parallel job submissions directly to Torque

Description:
Many instantaneous jobs submitted directly to the LRMS via many parallel connections. As soon as the jobs are queued for execution, the test script automatically requests their deletion from the queue. The objective is to stress the LRMS by simulating many requests via many paths. All Torque client commands were executed from a third node, to avoid unnecessarily stressing the important nodes.
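
A minimal sketch of the parallel submit-and-delete driver, assuming the dteam queue; the default thread and iteration counts match the 50x100 run shown below.

#!/bin/bash
# Sketch: THREADS parallel workers, each performing ITER qsub+qdel cycles.
# Queue name is an assumption for illustration.
THREADS=${1:-50}
ITER=${2:-100}

worker() {
    for i in $(seq 1 "$ITER"); do
        jobid=$(echo '/bin/true' | qsub -q dteam)   # submit an instantaneous job
        qdel "$jobid"                               # request deletion as soon as it is queued
    done
}

for t in $(seq 1 "$THREADS"); do
    worker &    # each worker acts as an independent request path
done
wait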
Comments:

50x100 parallel job submissions directly to Torque

70x100 parallel job submissions directly to Torque

5.12. Stressing the LRMS memory management

Description:
Following the previous test cases, strange memory management behaviour was noticed for various processes on the TORQUE_server node. It was therefore decided to run a mix of the direct submission tests, this time monitoring the memory usage of the maui and pbs_server processes. 1000-7000 instantaneous jobs were submitted serially, or 70 threads of 100 iterations each performed qsub and qdel in parallel, in various combinations.
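
A minimal sketch of the memory sampling loop, run on the TORQUE_server node; the sampling interval and output format are assumptions.

#!/bin/bash
# Sketch: sample the resident and virtual size of the maui and pbs_server
# processes once a minute while the submission tests run.
while true; do
    ps -C maui,pbs_server -o comm=,rss=,vsz= \
        | awk -v t="$(date '+%H:%M:%S')" '{print t, $0}'
    sleep 60
done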
Comments:

5.13. Batch System Resilience Tests

Description:
Several simple, medium lived jobs are submitted directly to the LRMS. While the jobs are running, the Worker Node is switched to the offline state and to the down state using the command-line interface of Torque, and is afterwards switched back to the online state. Moreover, the resilience of the LRMS is checked by shutting down pbs_server and the TORQUE_server node, and starting them again after a while.
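
A sketch of the state changes applied, based on the commands listed in Appendix B (ctb05.gridctb.uoa.gr is the site's WN):

pbsnodes -o ctb05.gridctb.uoa.gr    # mark the WN offline
pbsnodes -c ctb05.gridctb.uoa.gr    # clear the offline state, back online
service pbs_server stop             # on the TORQUE_server node
service pbs_server start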
Comments:

No matter the state of the WN, the job already running on it was never lost. The only effect of bringing the WN offline or down was that no more jobs were submitted to it until it was marked online again. Moreover, in the case of switching the WN's state to down, after a few seconds the WN was automatically marked online by the LRMS, which realized that it was not truly down.

Stopping the pbs_server only caused the relevant command-line utilities to stop working; again, no jobs were lost and the queue resumed normally a while after the service was restarted.

 

Appendix A: Required Modifications

By default, yaim has several bugs which prevent normal functionality of the gLite middleware when configured with the TORQUE_server on a separate node. This appendix lists the necessary changes that should be applied manually in order for the middleware to work properly in the given configuration. The mandatory ones are marked in bold.

 

Appendix B: Test Data and Logs

Test Case 5.1. check Batch System information published through BDII

Commands Used:
$ ldapsearch -h ctb03.gridctb.uoa.gr -p 2170 -x -b mds-vo-name=EGEE-SEE-CERT,\
o=grid "(&(GlueForeignKey=GlueClusterUniqueID=ctb03.gridctb.uoa.gr)"\
"(GlueCEName=dteam))" GlueCEInfoHostName GlueCEInfoLRMSType \
GlueCEInfoLRMSVersion GlueCEInfoJobManager
$ ldapsearch -h ctb08.gridctb.uoa.gr -p 2170 -x -b mds-vo-name=local,\
o=grid "(&(GlueForeignKey=GlueClusterUniqueID=ctb03.gridctb.uoa.gr)"\
"(GlueCEName=dteam))" GlueCEInfoHostName GlueCEInfoLRMSType \
GlueCEInfoLRMSVersion GlueCEInfoJobManager
Results:
GlueCEInfoHostName: ctb03.gridctb.uoa.gr
GlueCEInfoLRMSType: pbs
GlueCEInfoLRMSVersion: 2.1.6
GlueCEInfoJobManager: pbs

Test Case 5.2. check Batch System configuration

Commands Used:
[root@ctb07 root]# qmgr -c 'print server'
[root@ctb07 root]# cat /var/spool/pbs/server_priv/nodes
[root@ctb07 root]# showconfig
[root@ctb07 root]# cat /var/spool/maui/maui.cfg
[root@ctb05 root]# cat /var/spool/pbs/mom_priv/config
[root@ctb03 root]# cat /opt/lcg/etc/lcg-info-dynamic-scheduler.conf
Results:
[root@ctb07 root]# qmgr -c 'print server'
........ [*]
[root@ctb07 root]# cat /var/spool/pbs/server_priv/nodes
ctb05.gridctb.uoa.gr np=2 lcgpro
[root@ctb07 root]# showconfig
........ [*]
[root@ctb07 root]# cat /var/spool/maui/maui.cfg
SERVERHOST              ctb07.gridctb.uoa.gr
ADMIN1                  root
ADMIN3                  edginfo rgma
ADMINHOST               ctb07.gridctb.uoa.gr
RMCFG[base]             TYPE=PBS
SERVERPORT              40559
SERVERMODE              NORMAL
RMPOLLINTERVAL        00:00:10
LOGFILE               /var/log/maui.log
LOGFILEMAXSIZE        10000000
LOGLEVEL              1
DEFERTIME       00:01:00
ENABLEMULTIREQJOBS TRUE
[root@ctb05 root]# cat /var/spool/pbs/mom_priv/config
$pbsserver ctb07.gridctb.uoa.gr
$restricted ctb07.gridctb.uoa.gr
$logevent 255
$ideal_load 1.6
$max_load 2.1
[root@ctb03 root]# cat /opt/lcg/etc/lcg-info-dynamic-scheduler.conf
[Main]
static_ldif_file: /opt/lcg/var/gip/ldif/lcg-info-static-ce.ldif
vomap :
   ops:ops
   dteam:dteam
module_search_path : ../lrms:../ett
[LRMS]
lrms_backend_cmd: /opt/lcg/libexec/lrmsinfo-pbs
[Scheduler]
vo_max_jobs_cmd: /opt/lcg/libexec/vomaxjobs-maui -h ctb07.gridctb.uoa.gr
cycle_time : 0

Test Case 5.3. check network ports and services

Commands Used:
# netstat
Results:
TORQUE_server
--------------

port		process		service
------------------------------------------
40559		maui		
40560		maui		
33332		BLParserPBS	
15001		pbs_server	pbs
15004		maui		pbs_sched
15001(udp)	pbs_server	pbs
1022(udp)	pbs_server
WN_torque
----------

port		process		service
------------------------------------------
20001		BPRserver.1811	
20002		BPRserver.1811	
15002		pbs_mom		pbs_mom
15003		pbs_mom		pbs_resmom
15003(udp)	pbs_mom		pbs_resmom
1022(udp)	pbs_mom	
gliteCE
----------
No listening services

Test Case 5.4. check logging

Commands used:
[root@ctb03 root]# cat /var/log/glite/accounting/blahp.log-`date +%Y%m%d`\
	|grep '10237.'
[root@ctb07 root]# cat /var/spool/pbs/server_logs/`date +%Y%m%d` \
	|grep '10237.'
[root@ctb07 root]# cat /var/log/maui.log |grep '10237.'
[root@ctb05 root]# cat /var/spool/pbs/mom_logs/`date +%Y%m%d` \
	|grep '10237.'
Results:

Test Case 5.5.: checking General Information Providers (GIPs)

JDL used:
JobType         = "Normal";
ShallowRetryCount = 0;
RetryCount      = 0;
Executable      = "job3.sh";
Arguments       = "180 1";
StdOutput       = "job3.out";
StdError        = "job3.err";
InputSandbox    = {"JOBs/job3.sh"};
OutputSandbox   = {"job3.out","job3.err"};
Requirements = other.GlueCEUniqueID=="ctb03.gridctb.uoa.gr:2119/blah-pbs-dteam";
job3.sh
#!/bin/bash

SLEEP_TIME=${1:-0};
DO_MD5=${2:-0};
FILE_MD5=${3:-/dev/urandom};

NUM_OF_BG_JOBS=2

date
hostname
whoami
pwd


if [ "$DO_MD5" != "0" -a -r $FILE_MD5 ];then 
	for i in `seq 1 $NUM_OF_BG_JOBS`;do
		md5sum $FILE_MD5 &
		job_id=$!
		jobs_array[i]=$job_id
	done
fi

if [ "$SLEEP_TIME" != "0" ];then
	sleep $SLEEP_TIME
fi

if [ "$DO_MD5" != "0" -a -f $FILE_MD5 ];then 
	for i in `seq 1 $NUM_OF_BG_JOBS`;do
		kill ${jobs_array[i]}
	done
fi
exit 0;

Commands used:
$ for i in `seq 1 10`; do glite-wms-job-submit -a -o IDs/ids_uoa \
-e https://ctb05.gridctb.uoa.gr:7443/glite_wms_wmproxy_server \
JDLs/ctb03_4.gridctb.uoa.gr.blah-pbs-dteam.jdl;done
[root@ctb03 root]# watch "qstat -q"
[root@ctb03 root]# watch showstate
[root@ctb03 root]# watch /opt/lcg/var/gip/plugin/lcg-info-dynamic-scheduler-wrapper
[root@ctb03 root]# watch /opt/lcg/var/gip/plugin/ce-pbs.sh
Results:

Test Case 5.6: Job Submission of a few long lived, cpu intensive jobs

JDL used:
JobType         = "Normal";
ShallowRetryCount = 0;
RetryCount      = 0;
Executable      = "job3.sh";
Arguments       = "180 1";
StdOutput       = "job3.out";
StdError        = "job3.err";
InputSandbox    = {"JOBs/job3.sh"};
OutputSandbox   = {"job3.out","job3.err"};
Requirements = other.GlueCEUniqueID=="ctb03.gridctb.uoa.gr:2119/blah-pbs-dteam";
job3.sh
#!/bin/bash

SLEEP_TIME=${1:-0};
DO_MD5=${2:-0};
FILE_MD5=${3:-/dev/urandom};

NUM_OF_BG_JOBS=2

date
hostname
whoami
pwd


if [ "$DO_MD5" != "0" -a -r $FILE_MD5 ];then 
	for i in `seq 1 $NUM_OF_BG_JOBS`;do
		md5sum $FILE_MD5 &
		job_id=$!
		jobs_array[i]=$job_id
	done
fi

if [ "$SLEEP_TIME" != "0" ];then
	sleep $SLEEP_TIME
fi

if [ "$DO_MD5" != "0" -a -f $FILE_MD5 ];then 
	for i in `seq 1 $NUM_OF_BG_JOBS`;do
		kill ${jobs_array[i]}
	done
fi
exit 0;

Commands used:
glite-wms-job-submit

Test Case 5.7: 200 job submissions using 1 WMS

JDL used:
JobType         = "Normal";
ShallowRetryCount = 0;
RetryCount      = 0;
Executable      = "job2.sh";
Arguments       = "0 0";
StdOutput       = "job2.out";
StdError        = "job2.err";
InputSandbox    = {"JOBs/job2.sh"};
OutputSandbox   = {"job2.out","job2.err"};
Requirements = other.GlueCEUniqueID=="ctb03.gridctb.uoa.gr:2119/blah-pbs-dteam";
job2.sh
#!/bin/bash

SLEEP_TIME=${1:-0};
DO_MD5=${2:-0};
FILE_MD5=${3:-/dev/null};

date
hostname
whoami
pwd


if [ "$DO_MD5" != "0" -a -r $FILE_MD5 ];then 
	md5sum $FILE_MD5
fi

if [ "$SLEEP_TIME" != "0" ];then
	sleep $SLEEP_TIME
fi
exit 0;
Commands used:
 glite-wms-job-submit

Test Case 5.8: 400 job submissions using 2 WMS

JDL used:
JobType         = "Normal";
ShallowRetryCount = 0;
RetryCount      = 0;
Executable      = "job2.sh";
Arguments       = "0 0";
StdOutput       = "job2.out";
StdError        = "job2.err";
InputSandbox    = {"JOBs/job2.sh"};
OutputSandbox   = {"job2.out","job2.err"};
Requirements = other.GlueCEUniqueID=="ctb03.gridctb.uoa.gr:2119/blah-pbs-dteam";
job2.sh
#!/bin/bash

SLEEP_TIME=${1:-0};
DO_MD5=${2:-0};
FILE_MD5=${3:-/dev/null};

date
hostname
whoami
pwd


if [ "$DO_MD5" != "0" -a -r $FILE_MD5 ];then 
	md5sum $FILE_MD5
fi

if [ "$SLEEP_TIME" != "0" ];then
	sleep $SLEEP_TIME
fi
exit 0;
Commands used:
 glite-wms-job-submit

Test Case 5.9: 600 job submissions using 3 WMS

JDL used:
JobType         = "Normal";
ShallowRetryCount = 0;
RetryCount      = 0;
Executable      = "job2.sh";
Arguments       = "0 0";
StdOutput       = "job2.out";
StdError        = "job2.err";
InputSandbox    = {"JOBs/job2.sh"};
OutputSandbox   = {"job2.out","job2.err"};
Requirements = other.GlueCEUniqueID=="ctb03.gridctb.uoa.gr:2119/blah-pbs-dteam";
job2.sh
#!/bin/bash

SLEEP_TIME=${1:-0};
DO_MD5=${2:-0};
FILE_MD5=${3:-/dev/null};

date
hostname
whoami
pwd


if [ "$DO_MD5" != "0" -a -r $FILE_MD5 ];then 
	md5sum $FILE_MD5
fi

if [ "$SLEEP_TIME" != "0" ];then
	sleep $SLEEP_TIME
fi
exit 0;
Commands used:
glite-wms-job-submit

Test Case 5.10: job submissions directly to Torque

Commands used:
qsub
qstat
Results:
TODO missing

Test Case 5.11: parallel job submissions directly to Torque

Commands used:
qsub
qdel
qstat
Results:

Test Case 5.12: Stressing the LRMS Memory Management

Commands used:
qsub
qdel
qstat

Test Case 5.13: Batch System Resilience Tests

Commands used:
pbsnodes [-o|-r|-c] node
service pbs_server [stop|start]