Torque/Maui Certification Plan and Results

The following table gives an overview of the test cases and links to more detailed test case descriptions.

Monitored data by node
Node	Load Average	%cpu,size: pbs_server	%cpu,size: pbs_mom	%cpu,size: maui	%cpu,size: BLParserPBS	Jobs Status: GIP	Jobs Status: BDII	Jobs Status: qstat
gliteCE/SiteBDII	X					X	X
Worker	X		X
Torque/Maui	X	X		X	X			X

Test Case Overview
Test case	Description
5.1.	check Batch System information published through BDII
5.2.	check Batch System configuration
5.3.	check network ports and services
5.4.	check logging
5.5.	checking General Information Providers (GIPs)
5.6.	job submission of few long lived, cpu intensive jobs
5.7.	200 job submissions using 1 WMS
5.8.	400 job submissions using 2 WMS
5.9.	600 job submissions using 3 WMS
5.10.	job submissions directly to Torque
5.11.	parallel job submissions directly to Torque
5.12.	Stressing the LRMS memory management
5.13.	Batch System Resilience Tests

5.1. check Batch System information published through BDII

Description:

Check the Batch System specific entries published on the root and the site BDII. Specifically, check whether the type and version of the LRMS are published correctly.

Comments:

As can be seen from the test results the information is published correctly through both the Root and the Site BDII.
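
A minimal sketch of how such a check could be automated, assuming the BDII entries have already been dumped to LDIF text (for example with an LDAP query tool). The attribute names GlueCEUniqueID, GlueCEInfoLRMSType and GlueCEInfoLRMSVersion come from the GLUE 1.x schema. The sample entry, its host name and its queue name are hypothetical, and LDIF line folding is not handled.

```python
def lrms_info(ldif_text):
    """Map each GlueCEUniqueID to its published (LRMS type, LRMS version)."""
    result = {}
    # LDIF entries are separated by blank lines.
    for entry in ldif_text.strip().split("\n\n"):
        attrs = {}
        for line in entry.splitlines():
            if line.startswith("#"):
                continue
            key, sep, value = line.partition(":")
            if sep:
                attrs[key.strip()] = value.strip()
        if "GlueCEUniqueID" in attrs:
            result[attrs["GlueCEUniqueID"]] = (
                attrs.get("GlueCEInfoLRMSType"),
                attrs.get("GlueCEInfoLRMSVersion"),
            )
    return result


# Hypothetical sample entry (host and queue names are made up).
SAMPLE_LDIF = """\
dn: GlueCEUniqueID=ce.example.org:2119/blah-pbs-short,mds-vo-name=example,o=grid
GlueCEUniqueID: ce.example.org:2119/blah-pbs-short
GlueCEInfoLRMSType: torque
GlueCEInfoLRMSVersion: 2.1.6
"""

if __name__ == "__main__":
    for ce, (lrms_type, version) in lrms_info(SAMPLE_LDIF).items():
        print(ce, lrms_type, version)
```

Running the same function on the output of the root BDII and of the site BDII and comparing the two dictionaries would show whether both publish the same LRMS type and version.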

5.2. check Batch System configuration

Description:

View the Batch System configuration by using various commands and by viewing the relevant configuration files, on all three related nodes (gliteCE, TORQUE_server, WN_torque).

Comments:

The configuration files have default values, except for the specific changes that were applied so that the site works with separate Torque and gliteCE nodes.

5.3. check network ports and services

Description:

Check which Batch System related services run on which ports, on related nodes (gliteCE, TORQUE_server, WN_torque).

Comments:

On the gliteCE, BLParserPBS is not started by default and must be started manually after each reboot, due to a bug in yaim. No listening daemons providing Batch System services were found on the gliteCE. The only Batch System specific connections on the gliteCE are outgoing ones, initiated by client programs.
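
As an illustration only, a TCP reachability probe of the kind that could complement this check is sketched below. The port numbers 15001 (pbs_server) and 15002 (pbs_mom) are Torque's customary defaults and are an assumption here, since this report does not list the ports actually in use; the host names are hypothetical.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Assumed default ports and hypothetical host names, not taken from this report.
ASSUMED_SERVICES = {
    ("torque.example.org", 15001): "pbs_server",
    ("wn.example.org", 15002): "pbs_mom",
}

if __name__ == "__main__":
    for (host, port), name in ASSUMED_SERVICES.items():
        state = "open" if port_is_open(host, port) else "closed/unreachable"
        print(f"{name} on {host}:{port} is {state}")
```

A probe like this only shows that something accepts connections on the port; identifying the owning daemon still requires a local tool on the node itself.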

5.4. check logging

Description:

Check proper logging of all services in all three nodes (gliteCE, TORQUE_server, WN_Torque). Log files are also monitored during job submissions, and specific job related information is checked.

Comments:

Everything is logged as expected. The test job is properly logged through the Batch System. The BLahPD logger was reporting a wrong lrmsID, which was fixed by applying patch 991.

5.5. checking General Information Providers (GIPs)

Description:

Watch Information Providers' results while running long lived, CPU intensive jobs. Compare with results received directly from the LRMS.

Comments:

The information providers give results comparable to what the LRMS client tools show. The only exception occurs when the LRMS is loaded with many short duration jobs. In this case, lcg-info-dynamic-scheduler-wrapper sporadically reports "GlueCEStateEstimatedResponseTime: 777777" for the empty queues.
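
The anomaly described above could be detected automatically along the following lines. This is a sketch that assumes the information provider output is available as LDIF text; the sample entries and queue names are hypothetical, and only the value 777777 is taken from the observation in this test.

```python
PLACEHOLDER_ERT = 777777  # value observed in this test for empty queues


def queues_with_placeholder_ert(ldif_text):
    """Return the GlueCEUniqueIDs whose estimated response time is the placeholder."""
    flagged = []
    for entry in ldif_text.strip().split("\n\n"):
        ce_id = None
        ert = None
        for line in entry.splitlines():
            key, sep, value = line.partition(":")
            if not sep:
                continue
            key, value = key.strip(), value.strip()
            if key == "GlueCEUniqueID":
                ce_id = value
            elif key == "GlueCEStateEstimatedResponseTime":
                ert = int(value)
        if ce_id is not None and ert == PLACEHOLDER_ERT:
            flagged.append(ce_id)
    return flagged


# Hypothetical provider output for two queues.
SAMPLE_GIP_OUTPUT = """\
GlueCEUniqueID: ce.example.org:2119/blah-pbs-short
GlueCEStateEstimatedResponseTime: 777777

GlueCEUniqueID: ce.example.org:2119/blah-pbs-long
GlueCEStateEstimatedResponseTime: 120
"""

if __name__ == "__main__":
    print(queues_with_placeholder_ert(SAMPLE_GIP_OUTPUT))
```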

5.6. job submission of few long lived, cpu intensive jobs

Description:

Job submission of few long lived, cpu intensive jobs (jdl). In particular submission of 10 jobs through the following path:

UI -> WMS -> gliteCE -> Torque_server -> WN

Results:

START TIME: 2007-02-11T20:12+0200
END TIME: 2007-02-11T20:39+0200
NUM OF JOBS: 10
WMS SUBMITTED: 10
SCHEDULER SUCCESS/COMPLETED: 10
gLite SUCCESS: 10
Chart: BS node 1 min LA, LRMS Completed jobs, LRMS Queued jobs, LRMS running jobs
Chart: LRMS Completed jobs, LRMS|BDII Queued jobs, LRMS|BDII running jobs
Test Results:
- PATH 1: Done=10

Comments:

As shown in the first of the following graphs, the lcg-info-dynamic-scheduler-wrapper results provided by the rgma user (every 10 min) are correct. The results provided by the edguser user (every 1 min), however, fall back to the static ldif because that user has insufficient permissions. The problem was resolved when edguser was added to the ADMIN line in /var/spool/maui/maui.cfg on the TORQUE_server node. Open question: why is the BDII being updated by both the rgma (every 10 min) and the edguser (every 1 min) users?
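
A small sketch of how the permission fix could be verified, assuming the admin entries in maui.cfg are whitespace-separated lines whose first token is ADMIN optionally followed by a digit (for example ADMIN1). The sample configuration text and its host name are hypothetical; only the user names rgma and edguser and the file path come from this report.

```python
import re

ADMIN_KEY = re.compile(r"^ADMIN\d*$")


def users_on_admin_lines(cfg_text):
    """Collect the user names listed on ADMIN lines of a maui.cfg-style file."""
    users = set()
    for raw in cfg_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        tokens = line.split()
        if ADMIN_KEY.match(tokens[0]):
            users.update(tokens[1:])
    return users


# Hypothetical configuration content, for illustration only.
SAMPLE_CFG = """\
SERVERHOST torque.example.org
# administrators
ADMIN1 root edguser
"""

if __name__ == "__main__":
    # On a real node the text would be read from /var/spool/maui/maui.cfg.
    admins = users_on_admin_lines(SAMPLE_CFG)
    for user in ("rgma", "edguser"):
        print(user, "listed" if user in admins else "not listed")
```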

10 long lived, cpu intensive jobs

5.7. 200 job submissions using 1 WMS

Description:

Job submission of many short lived (instantaneous) simple jobs. In particular 100 job submissions using each one of the following paths:

Path 1:UI-1 -> WMS-1 -> gliteCE -> Torque_server -> WN
Path 2:UI-2 -> WMS-1 -> gliteCE -> Torque_server -> WN

Results:

START TIME: 2007-02-11T21:50+0200
END TIME: 2007-02-11T22:15+0200
NUM OF JOBS: 200
WMS SUBMITTED: 200
SCHEDULER SUCCESS/COMPLETED: 200
gLite SUCCESS: 200
Chart: WMS jobs' status, LRMS Completed jobs (?)
Test Results:
- PATH 1: Done=100
- PATH 2: Done=100

Comments:

Job submission via a single WMS is not enough to stress the LRMS subsystem. Average loads on the gliteCE and TORQUE_server nodes are minimal and the LRMS queue length doesn't grow, as jobs appear to finish faster than they are submitted by the WMS.

200 job submissions using 1 WMS

5.8. 400 job submissions using 2 WMS

Description:

Job submission of many short lived (instantaneous) simple jobs. In particular 100 job submissions using each one of the following paths:

Path 1:UI-1 -> WMS-1 -> gliteCE -> Torque_server -> WN
Path 2:UI-2 -> WMS-1 -> gliteCE -> Torque_server -> WN
Path 3:UI-1 -> WMS-3 -> gliteCE -> Torque_server -> WN
Path 4:UI-2 -> WMS-3 -> gliteCE -> Torque_server -> WN

Results:

START TIME: 2007-02-11T22:17+0200
END TIME:
- WMS-1: 2007-02-11T22:52+0200
- WMS-3: 2007-02-11T23:13+0200
NUM OF JOBS: 400
WMS SUBMITTED: 400
- WMS-1: 200
- WMS-3: 200
Test Results: 351 Jobs Done
- PATH 1: Done=100
- PATH 2: Done=100
- PATH 3: Done=74 Aborted=26
- PATH 4: Done=77 Aborted=23
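
The per-path figures above can be aggregated per WMS with a short script such as the following sketch. The path-to-WMS mapping and the counts are copied from this test case; the function name and data layout are illustrative choices.

```python
import re

# Per-path results of this test case, with the WMS each path goes through.
RESULTS = {
    "PATH 1": ("WMS-1", "Done=100"),
    "PATH 2": ("WMS-1", "Done=100"),
    "PATH 3": ("WMS-3", "Done=74 Aborted=26"),
    "PATH 4": ("WMS-3", "Done=77 Aborted=23"),
}


def tally(results):
    """Sum the status counters of all paths, grouped by WMS."""
    totals = {}
    for wms, summary in results.values():
        per_wms = totals.setdefault(wms, {})
        for status, count in re.findall(r"(\w+)=(\d+)", summary):
            per_wms[status] = per_wms.get(status, 0) + int(count)
    return totals


if __name__ == "__main__":
    totals = tally(RESULTS)
    for wms, counters in sorted(totals.items()):
        print(wms, counters)
    print("Total done:", sum(c.get("Done", 0) for c in totals.values()))
```

With these inputs all 49 aborted jobs belong to WMS-3 (49 of its 200 submissions, or 24.5%), while WMS-1 completed all 200 of its jobs.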

Comments:

This test does not stress the gliteCE or the TORQUE_server either, as the load average on these nodes is minimal. However, jobs appear to be submitted faster than they are run (even though they are instantaneous), and the LRMS queue reaches a maximum length of 118.

The only Grid component that seems to be stressed in this test is the WMS, in particular WMS-3, even though the same number of jobs was submitted to WMS-3 and WMS-1. Perhaps this is a result of the network distance between the UIs and WMS-3, of overload due to other submissions taking place, or more generally of poor scalability of the WMS subsystem. Finally, it should be noted that for short periods WMS-3 reported incorrect numbers, such as 0 jobs completed (22:53-22:54) or more jobs completed than the actual results (23:12).

400 job submissions using 2 WMS

5.9. 600 job submissions using 3 WMS

Description:

Job submission of many short lived (instantaneous) simple jobs. In particular 100 job submissions using each one of the following paths:

Path 1:UI-1 -> WMS-1 -> gliteCE -> Torque_server -> WN
Path 2:UI-2 -> WMS-1 -> gliteCE -> Torque_server -> WN
Path 3:UI-1 -> WMS-2 -> gliteCE -> Torque_server -> WN
Path 4:UI-2 -> WMS-2 -> gliteCE -> Torque_server -> WN
Path 5:UI-1 -> WMS-3 -> gliteCE -> Torque_server -> WN
Path 6:UI-2 -> WMS-3 -> gliteCE -> Torque_server -> WN

Results:

START TIME: 2007-02-13T19:02+0200
END TIME:
- WMS-1: 2007-02-13T23:09+0200
- WMS-2: 2007-02-14T01:06+0200
- WMS-3: 2007-02-13T23:06+0200
NUM OF JOBS: 600
WMS SUBMITTED: 565
Test Results: 382 Jobs Done
- PATH 1: Done=99 Aborted=1
- PATH 2: Done=99 Aborted=1
- PATH 3: Done=86 Aborted=13 Waiting=1
- PATH 4: Done=78 Aborted=22
- PATH 5: Done=13 Aborted=59
- PATH 6: Done=7 Aborted=65 Waiting=1

Comments:

Many jobs never reached the LRMS at all, due to WMS problems. For example, WMS-2 reported 19 jobs as "running" for a long time before they became aborted many hours later, and reported 1 job as "waiting" *forever*. WMS-3 rejected some job submissions from the beginning and reported many jobs as aborted; its counters increased and decreased almost randomly, and at one point the number of completed jobs dropped sharply. Because of the very poor WMS performance, it was very hard to produce a graph showing the details as in the previous cases.
WMS-1, reached 1min LA threshold (15)
WMS-2, reached 1min LA threshold (10)
Maximum Torque job queue length: 250
Torque/Maui load: negligible
WorkerNode load: negligible
gliteCE load: normal. As also demonstrated in the following test cases, the gliteCE's load is directly proportional to the number of queued jobs. While one would expect the gliteCE to be unaffected by the LRMS queue length, exactly the opposite happens, because the GIP scripts running on it do more work for each additional queued job.

600 job submissions using 3 WMS

5.10. job submissions directly to Torque

Description:

Many instantaneous jobs are submitted directly to the LRMS using qsub, to stress Torque as no WMS can and to show its actual limits (the rate at which jobs are actually submitted, and the rate at which instantaneous jobs finish and are replaced by others). In particular, 1000 simple jobs were submitted, followed by 2000, 4000 and finally 5000 jobs. All Torque client commands were executed from a third node to avoid unnecessarily stressing important nodes.
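
The serial submission loop can be sketched as follows. This is not the script used in the test: the job script content, the function names and the injectable qsub callable are assumptions made for illustration. It relies on qsub reading the job script from standard input when no script file is given and printing the job identifier on success.

```python
import subprocess

# Assumed trivial job script standing in for an "instantaneous" job.
JOB_SCRIPT = "#!/bin/sh\ntrue\n"


def run_qsub(script):
    """Submit one job by piping the script to qsub and return the printed job id."""
    proc = subprocess.run(
        ["qsub"], input=script, capture_output=True, text=True, check=True
    )
    return proc.stdout.strip()


def submit_jobs(count, script=JOB_SCRIPT, qsub=run_qsub):
    """Submit `count` jobs one after another and return their job ids."""
    return [qsub(script) for _ in range(count)]


if __name__ == "__main__":
    # Dry run with a stand-in for qsub, so the sketch works without Torque.
    fake_ids = iter(range(1, 1001))
    ids = submit_jobs(1000, qsub=lambda script: f"{next(fake_ids)}.dryrun")
    print(len(ids), "jobs submitted, first id:", ids[0])
```

Passing the submit function as a parameter keeps the loop testable on a machine without Torque; on a client node the default run_qsub would be used instead.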

Comments:

The following graph pairs depict the queue length (as reported by the GIP and the BDII) and the average load on the CE, WN and LRMS during the 1000, 2000, 4000 and 5000 job submissions (qsub). The TORQUE_server is stressed only in the last two cases, but handles the load gracefully and performs correctly. Further stress could possibly be applied to that node by submitting the jobs to many Worker Nodes.

1000 jobs submitted directly to Torque

2000 jobs submitted directly to Torque

4000 jobs submitted directly to Torque

5000 jobs submitted directly to Torque

5.11. parallel job submissions directly to Torque

Description:

Many instantaneous jobs are submitted directly to the LRMS via many parallel connections. As soon as the jobs are queued for execution, the test script automatically requests their deletion from the queue. The objective is to stress the LRMS by simulating many requests via many paths. All Torque client commands were executed from a third node to avoid unnecessarily stressing important nodes.
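
A sketch of the submit-then-delete pattern with parallel workers is given below. It is an illustration under stated assumptions, not the test script itself: the submit and delete callables are hypothetical wrappers (on a real client node they would invoke qsub and qdel), and threads are used as the parallel connections.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(iterations, submit, delete):
    """Submit a job and immediately request its deletion, `iterations` times."""
    completed = 0
    for _ in range(iterations):
        job_id = submit()
        delete(job_id)
        completed += 1
    return completed


def parallel_stress(threads, iterations, submit, delete):
    """Run `threads` workers concurrently and return the total submit/delete cycles."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [
            pool.submit(worker, iterations, submit, delete) for _ in range(threads)
        ]
        return sum(f.result() for f in futures)


if __name__ == "__main__":
    # Dry run of a 50x100 pattern with stand-ins for qsub and qdel.
    deleted = []
    total = parallel_stress(50, 100, lambda: "job.dryrun", deleted.append)
    print(total, "cycles,", len(deleted), "delete requests")
```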

Comments:

50x100 parallel job submissions directly to Torque

70x100 parallel job submissions directly to Torque

5.12. Stressing the LRMS memory management

Description:

Following the previous test cases, strange memory management behaviour was noticed for various processes on the TORQUE_server node. It was therefore decided to run a mix of the direct submission tests, this time monitoring the memory usage of the maui and pbs_server processes. In various combinations, 1000-7000 instantaneous jobs were submitted serially, or 70 threads of 100 iterations each performed qsub and qdel in parallel.
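
One way to sample the memory footprint of the two daemons is to parse the output of a process listing, as in the sketch below. It assumes input in the two-column form produced by a command such as `ps -eo comm=,rss=` (command name and resident set size in KiB); the sample values are hypothetical and are not measurements from this test.

```python
def rss_by_process(ps_output, names=("maui", "pbs_server")):
    """Sum the resident set size (KiB) of the named processes from ps-style output."""
    totals = {name: 0 for name in names}
    for line in ps_output.splitlines():
        fields = line.split()
        # Expect exactly "<command> <rss>"; skip anything else.
        if len(fields) == 2 and fields[0] in totals and fields[1].isdigit():
            totals[fields[0]] += int(fields[1])
    return totals


# Hypothetical listing, for illustration only.
SAMPLE_PS = """\
init 620
pbs_server 20480
maui 71680
sshd 1500
"""

if __name__ == "__main__":
    for name, kib in rss_by_process(SAMPLE_PS).items():
        print(f"{name}: {kib / 1024:.1f} MiB")
```

Calling such a function at fixed intervals during and after each submission batch would give the time series needed to see whether memory is released once the queue drains.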

Comments:

The first two graphs show the memory usage of the maui and pbs_server processes, together with the queue length, during the submission of batches of 1000, 2000, 4000 and 5000 jobs followed by 5000 and 7000 parallel submissions/deletions. The maui process starts with almost 70MB of memory usage, which stabilizes at a higher value after each submission batch. After the final (7000 qsub/qdel) job batch, maui's memory usage stabilizes at more than 335MB. The pbs_server memory footprint also grows with each job submission, but much of the memory used during the submission is freed afterwards.

From this test, as well as from other similar tests carried out, it is evident that maui does not free any significant part of its allocated memory even after long periods of inactivity. This could pose a significant problem to other processes running on the same machine: in the event of a surge in job submissions, maui claims memory that it never releases. It is not uncommon to see the memory usage of maui constant at over 300MB after some uptime of the corresponding node.

LRMS memory usage case 1

pbs_server on the other hand behaves better, as can be seen in the next two graphs. The graphs depict the memory footprint and queue length during the course of three 7000 qsub/qdel's followed by three 5000 qsub's. While maui keeps all the memory it needed during the job submission, pbs_server stabilizes at a much lower memory footprint than what it needed during the submission:

LRMS memory usage case 2

5.13. Batch System Resilience Tests

Description:

Several simple, medium lived jobs are submitted directly to the LRMS. While the jobs are running, the Worker Node is switched to the offline state and to the down state using the command-line interface of Torque, and is afterwards switched back to the online state. The resilience of the LRMS is also checked by shutting down pbs_server and the TORQUE_server node, and starting them again after a while.

Comments:

Regardless of the state of the WN, the job already running on it was never lost. The only effect of bringing the WN offline or down was that no more jobs were dispatched to it until it was marked online again. Moreover, when the WN was switched to the down state, the LRMS automatically marked it online again after a few seconds, having detected that it was not truly down.

Stopping pbs_server only caused the relevant command-line utilities to stop working; again no jobs were lost, and the queue resumed normally a while after the service was restarted.
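
The node state transitions described in this test could be tracked by parsing the node listing of Torque, as in the following sketch. It assumes the usual `pbsnodes -a` layout, where each node name starts a block at column 0 and its attributes follow on indented "key = value" lines. The sample listing and node names are hypothetical. In Torque a node is typically marked offline with `pbsnodes -o <node>` and cleared with `pbsnodes -c <node>`; the exact commands used in this test are not recorded here.

```python
def node_states(pbsnodes_output):
    """Map each node name to its list of states from pbsnodes-style output."""
    states = {}
    node = None
    for line in pbsnodes_output.splitlines():
        if not line.strip():
            node = None  # a blank line ends the current node block
        elif not line[0].isspace():
            node = line.strip()  # unindented line: node name
        elif node and line.strip().startswith("state ="):
            states[node] = line.split("=", 1)[1].strip().split(",")
    return states


# Hypothetical listing, for illustration only.
SAMPLE_PBSNODES = """\
wn01.example.org
     state = offline
     np = 2

wn02.example.org
     state = free
     np = 2
"""

if __name__ == "__main__":
    for name, state in node_states(SAMPLE_PBSNODES).items():
        print(name, ",".join(state))
```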

Document Name:	Torque/Maui certification Test Plan and Results
Version:	1.0
Publication Date:	2007-02-13
Author(s):	Nikos Voutsinas, Dimitrios Apostolou, Konstantinos Koukopoulos
Contact:	For any questions or comments about this document please contact contact@gridctb.uoa.gr
Status:	In progress

Version	Date	Partner/Author(s)	comments
1.2	2007-02-14	UOA/Nikos Voutsinas UOA/Dimitrios Apostolou UOA/Konstantinos Koukopoulos	First Release Torque 2.1.6, Maui 3.2.6p17

SA3 TSA3.2.3: Operate certification and testing test beds