The cluster is the computational workhorse for the department, and all users are encouraged to run jobs on it. As stated above, the machine enigma2 is the access host for the cluster. You will not log in directly to the compute nodes of the cluster; rather, you will log on to enigma2 and then submit jobs to the cluster nodes.
As of 2010-08-20, the cluster has 376 64-bit cores with an aggregate of 1.78TB RAM. Individual cluster nodes have memory capacities ranging from 20GB to 64GB. However, we are continually upgrading its capacity. The most up-to-date configuration information is here.
The above configuration may change depending on maintenance needs and not all nodes are available for all types of jobs. For example, due to licensing restrictions, one 20GB node is reserved for SAS.
Everything related to job submission, scheduling, and execution on the cluster is under the control of Sun's Grid Engine software (SGE). The Grid Engine project, sponsored by Sun Microsystems, is an open source community effort to facilitate the adoption of distributed computing solutions. Among other things, we use SGE to limit the total number of jobs/slots each user is allowed to run simultaneously on the cluster (currently 16, but subject to change). However, you may submit more jobs than the limit; the extra jobs will be queued and will run as your other jobs finish. When the cluster nodes are all at maximum capacity, jobs waiting to run will be subject to a functional share priority algorithm as we have defined it using SGE.
qrsh
*** Be sure to see the section   Specifying your job's memory needs
You will be logged into a "random" cluster node and get an interactive shell prompt, just as if you had logged into enigma2. Now you can run whatever program you want. For example, you can run R. However, you must remember to log out ('exit' or 'CTRL D'). Otherwise, you will be taking up a slot in the queue, which will not be available to others.
While you are logged into a cluster node via qrsh, if you run
qstat -u YOUR_USER_ID
you'll see something like the following:
job-ID  prior    name        user   state  submit/start at      queue                          slots
----------------------------------------------------------------------------------------------------
15194   1.53962  Pf_3D7      maryj  r      08/15/2007 16:04:16  standard.q@compute-0-20.local  1
15299   2.00790  BootA10600  maryj  r      08/17/2007 15:26:06  standard.q@compute-0-10.local  1
15290   2.35449  QRLOGIN     maryj  r      08/17/2007 15:20:00  standard.q@compute-0-11.local  1
The job labeled QRLOGIN is the interactive session (for more info see Checking the status of your job).
You may also specify memory requirements or special queues on your qrsh command just as you do on the qsub command (see below). For interactive work we strongly encourage users to work on the cluster via qrsh (rather than use enigma2).
NOTE: Do not run background or 'nohup' jobs while using qrsh. Sun Grid Engine (SGE) must know about your job/session so that it can manage and account for cluster resources. Additionally, SGE assumes one slot (corresponding to one CPU core) for each qrsh session. If you still have running programs and no session appears for you in qstat, then you have done something that is not appropriate for the way the HPSCC Cluster is managed. If jobs are found running on cluster nodes with no associated SGE entry, they will be killed.
NOTE: If you encounter an error while running a program interactively on a cluster node and your program crashes, it still might be in the cluster's process queue. If you don't quit out of your program normally, make sure to check the cluster queue (via qstat, see below) and see if your (interactive) job is still there. If it is, get the job-ID and kill the job using qdel.
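As a minimal sketch of the check described in the note above, the function below picks the job-IDs of interactive (QRLOGIN) sessions out of qstat-style output. The function name qrlogin_ids is hypothetical, and the sketch assumes the job name sits in the third column, as in the qstat listing shown in this document. The sample input is embedded so that the example runs without SGE.

```shell
# Print the job-IDs of interactive (QRLOGIN) sessions found in
# qstat-style output read from stdin.
# Assumption: the job name is the third whitespace-separated column.
qrlogin_ids() {
    awk '$3 == "QRLOGIN" { print $1 }'
}

# Sample input in the same layout as the qstat listing in this document.
ids=$(qrlogin_ids <<'EOF'
job-ID  prior    name        user   state  submit/start at      queue                          slots
----------------------------------------------------------------------------------------------------
15194   1.53962  Pf_3D7      maryj  r      08/15/2007 16:04:16  standard.q@compute-0-20.local  1
15299   2.00790  BootA10600  maryj  r      08/17/2007 15:26:06  standard.q@compute-0-10.local  1
15290   2.35449  QRLOGIN     maryj  r      08/17/2007 15:20:00  standard.q@compute-0-11.local  1
EOF
)
echo "$ids"
```

On the cluster you would pipe real output into the function, for example qstat -u YOUR_USER_ID | qrlogin_ids, and pass a leftover job-ID to qdel. Take care not to delete the QRLOGIN entry of a session you are still using.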
To run an R BATCH job on the cluster using the mycommands.R file, your batch.sh file need only have this one line in it:
R CMD BATCH mycommands.R
The file might have other lines in it to specify SGE job options or commands to run before or after the "R CMD BATCH ..." line. The technical name for this file is "shell script". Knowing this might help you communicate with the system administrator.
qsub -cwd batch.sh
*** Be sure to see the section   Specifying your job's memory needs
The -cwd option tells SGE to execute the batch.sh script on the cluster from the current working directory (otherwise, it will run from your home directory, which is probably not what you want).
We now recommend that all users use both the mem_free AND the h_vmem parameters on all qsub jobs AND qrsh sessions. That will ensure that your session/job gets to a proper node and may protect the node (which also has other users' jobs running on it) from crashing due to a memory mistake in your session or job.
When submitting your job(s), if you do not specify any memory requirements, SGE will choose the cluster node(s) with the lowest CPU load WITHOUT REGARD TO MEMORY AVAILABILITY (subject to other scheduling parameters which we have defined).
It is, therefore, IMPORTANT to specify your expected memory requirements when submitting cluster jobs. After calculating approximately how much memory your job will need, you should add a memory resource requirement to your qsub (or qrsh ) command.
qsub -cwd -l mem_free=MEM_NEEDED,h_vmem=MEM_MAX batch.sh
For example, if your job will require 4GB of memory, you could type
qsub -cwd -l mem_free=4G,h_vmem=6G batch.sh
In the above case, your job would go to a node with at least 4GB of memory available at the time the job starts and the job would automatically be stopped if it exceeded 6GB of memory usage at any time.
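As a rough sketch of the pattern above, the helper below assembles the qsub command line from a memory request and a memory limit and prints it rather than submitting anything. The function name build_qsub_cmd is hypothetical and is not a cluster command; only qsub, -cwd, -l, mem_free, and h_vmem come from this document.

```shell
# Hypothetical helper: build_qsub_cmd MEM_NEEDED MEM_MAX SCRIPT
# Prints the qsub command line; it does not submit the job.
build_qsub_cmd() {
    mem_needed="$1"   # memory that should be free on the node, e.g. 4G
    mem_max="$2"      # hard limit at which the job is stopped, e.g. 6G
    script="$3"       # the shell script to submit, e.g. batch.sh
    # The resource list is comma-delimited and contains no spaces.
    printf 'qsub -cwd -l mem_free=%s,h_vmem=%s %s\n' \
        "$mem_needed" "$mem_max" "$script"
}

cmd=$(build_qsub_cmd 4G 6G batch.sh)
echo "$cmd"
```

Printing the command first makes it easy to check the resource list for stray spaces before you type or paste it on enigma2.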
NOTES:
-l is a 'minus' followed by the 'lower-case letter L'.
No spaces in the comma-delimited list of resources and limits.
For a more detailed explanation of what   -l mem_free   and   h_vmem   imply see below in this section.
To see a summary of available nodes and their memory capacity and current load, use the command   qhostw   .
After submitting your job with qsub, use the qu or qstat command to see which queue (node) your job actually went to (see Checking the status of your job). In the output of qu , the next to last column lists the queue name.
Here are some notes explaining the use of   mem_free and   h_vmem, as well as   h_fsize :
------------------------------------------------
mem_free
You should use approx what you think your job will need (or a little more) on the mem_free request. This does not reserve memory for your job. It simply puts your job on a node with that amount of memory currently available (see example under h_vmem below).
h_vmem
To avoid running away at the high end, use the h_vmem parameter to limit your job's total memory use. (We are now encouraging all users to use this parameter to stop a runaway job from crashing the node.)
Something like:
qsub -cwd -l mem_free=12G,h_vmem=16G batch.sh
... or similarly on a qrsh command.
h_fsize
Some users might also want to limit the size of files that can be created by their job (to avoid the consequences of any bug in their program that might, under certain conditions, cause a file to grow without bounds ... NOT A GOOD THING).
Something like:
qsub -cwd -l mem_free=12G,h_vmem=16G,h_fsize=1G batch.sh
would, additionally, limit the size of any file created by the job to not more than 1 GB. It would behave similarly on a qrsh command.
NOTES:
No spaces in the comma delimited list of resources and limits.
You can abbreviate mem_free as 'mf'. For example, here is a simple qrsh command you might use if you expect to use 5G or less memory during an interactive session:
qrsh -l mf=5G,h_vmem=6G
One easy way to always have a default limit on your jobs and sessions is to put a line like this
-l mem_free=5G,h_vmem=6G
in the file .sge_request in your home directory.
On the command line, the "-l" and its list of resource specifications must go immediately after the qrsh (or qsub) command. Any other resource requirements usually specified with the "-l" option, such as a special queue or whatever, can also be in the comma-delimited resource list (no spaces in the resource list).
If you usually invoke a program (such as R) when you type your qrsh command, it might look like this:
qrsh -l mf=5G,h_vmem=6G R
Here is one scenario explaining why you should use mem_free even if you expect your job to only use 3G :
It's possible that a user has requested mem_free=18G and is using all of that or more on a 20G node.
That user may be the only one on that node and if it has 8 cpu-cores there could be 7 slots open.
Your job with no mem_free requirement might, in fact, go to that node, since it looks lightly loaded in terms of job count.
The node will happily accept your job, begin swapping, and slow down both jobs now running on it.
-l mem_free=10G,h_vmem=12G,h_stack=256M
The mem_free and h_vmem values may vary according to your needs but the h_stack value should always be 256M (as far as we know from our experience thus far).
qstat -j NNNNN | grep vmem
where NNNNN is your specific cluster job number ... look at the "vmem"
and "maxvmem" entries.
To make it easier to monitor memory usage for your currently running jobs, we have created the command
qmem
If you have no jobs running on the cluster qmem will print nothing, but if you do, the results will look something like:
[enigma2]$ qmem
10506 maryj node=33 vmem=289.1M, maxvmem=294.3M howMany10.sh
14257 maryj node=8 vmem=231.5M, maxvmem=238.0M s.all.sh
16695 maryj node=25 vmem= 1.8G, maxvmem= 1.8G mergedoc1.3.sh
17464 maryj node=15 vmem=272.9M, maxvmem=284.0M simulateVariance.sh
17555 maryj node=12 vmem=N/A, maxvmem=N/A QRLOGIN
17584 maryj node=6 vmem=315.1M, maxvmem=334.3M calculateVaried-emp.genSampScheme.sh
To see your job's memory usage upon job completion, use email notification, which works for aborted jobs as well. See the job status via email discussion for instructions on how to use email notification.
Note: qrsh sessions will not report memory usage using the above method. You will simply see "N/A" in the entries for vmem and maxvmem, as shown in the above example.
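To turn a qmem listing into numbers you can compare against your mem_free and h_vmem settings, something like the following sketch may help. The function name maxvmem_mb is hypothetical, and the parsing assumes the "maxvmem=VALUE" layout shown in the qmem example in this document, with M and G treated as binary units (1G = 1024M). Sessions that report N/A are skipped.

```shell
# Read qmem-style lines on stdin and print "JOB_ID MAXVMEM_IN_MB" for
# every job that reports a numeric maxvmem.
# Assumption: values look like "maxvmem=294.3M" or "maxvmem= 1.8G".
maxvmem_mb() {
    awk '
        match($0, /maxvmem= *[0-9.]+[MG]/) {
            val = substr($0, RSTART, RLENGTH)    # e.g. "maxvmem= 1.8G"
            sub(/^maxvmem= */, "", val)          # e.g. "1.8G"
            unit = substr(val, length(val), 1)   # "M" or "G"
            num = substr(val, 1, length(val) - 1) + 0
            if (unit == "G") num = num * 1024    # convert G to M
            printf "%s %.1f\n", $1, num
        }'
}

# Sample lines taken from the qmem example in this document.
report=$(maxvmem_mb <<'EOF'
10506 maryj node=33 vmem=289.1M, maxvmem=294.3M howMany10.sh
16695 maryj node=25 vmem= 1.8G, maxvmem= 1.8G mergedoc1.3.sh
17555 maryj node=12 vmem=N/A, maxvmem=N/A QRLOGIN
EOF
)
echo "$report"
```

On the cluster the input would come from qmem itself (qmem | maxvmem_mb). A job whose peak usage sits close to its h_vmem limit is a candidate for a larger request on the next run.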
By default, under our version of SGE, qstat with no arguments shows cluster jobs for all users. To restrict the output to show only your jobs, use the -u USERID argument. For example:
qstat -u maryj
would only display active/pending jobs for user maryj.
However, we have created the command qu to easily accomplish the same thing (view only your jobs). If you have no jobs running on the cluster qu will print nothing, but if you do, the results will look something like:
[enigma2]$ qu
job-ID  prior    name        user   state  submit/start at      queue                          slots
----------------------------------------------------------------------------------------------------
15194   1.53962  Pf_3D7      maryj  r      08/15/2007 16:04:16  standard.q@compute-0-20.local  1
15299   2.00790  BootA10600  maryj  r      08/17/2007 15:26:06  standard.q@compute-0-10.local  1
15290   2.35449  QRLOGIN     maryj  r      08/17/2007 15:20:00  standard.q@compute-0-11.local  1
Under the state column you can see the status of your job. Some of the codes are
Another important thing to note is the job-ID for your job. You need to know this if you ever want to make changes to your job. For example, to delete your job from the cluster, you can run
qdel 15299
where 15299 is the job-ID I got from running qstat.
qsub -m e -M your_email@jhsph.edu your_job.sh
which means: send email to the given address(es) when the job ends.
If you want to automatically have such options (or others) always added to your job(s), simply put them in a file named .sge_request in your home directory. You can also have working-directory-specific .sge_request files (see the man page for sge_request - man sge_request).
Lines like this in your .sge_request file:
-M your_email@jhsph.edu -m e
will cause an email to be sent, when your job ends, for every cluster job that you start (including, for what it's worth, a qrsh 'job').
You could use -m n on individual qsub job command lines to suppress email notification for certain jobs.
Or, better yet, you might put only the -M your_email@jhsph.edu in the .sge_request file and simply use the -m e option on jobs for which you want email notification.
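As an illustration, here is one way such a .sge_request file might be put together, combining the default memory settings recommended earlier with the email address. The sketch writes the file to a scratch directory (the variable name demo_dir is made up for this example) so that nothing in your real home directory is touched. It also assumes that lines starting with # are treated as comments in this file; check man sge_request before relying on that.

```shell
# Build a sample .sge_request in a scratch directory for inspection.
demo_dir=$(mktemp -d)

cat > "$demo_dir/.sge_request" <<'EOF'
# Default memory request and limit for every qsub job and qrsh session
-l mem_free=5G,h_vmem=6G
# Address for job notifications; add "-m e" on individual jobs to get mail
-M your_email@jhsph.edu
EOF

# Show the result before copying it anywhere.
cat "$demo_dir/.sge_request"
```

Once you are happy with the contents, copying the file to .sge_request in your home directory makes the options apply to every job and session you start.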
Note: You may also invoke the options shown above (and others) by including special lines at the top of your job shell scripts. Lines beginning with #$ are interpreted as qsub options for that job. For example, if the first few lines of your script look like the following:
#!/bin/bash
#$ -M joe_x@gmail.com
#$ -m e
The lines beginning with #$ would cause SGE to send email to 'joe_x@gmail.com' when the job ends.
#$ -m be
would cause an email to be sent when the job begins ('b') and ends ('e'). See the manual page for qsub (type man qsub at a shell prompt) to get more information.
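Putting the pieces together, a job script might carry all of its usual options in #$ lines, so that a plain qsub of the script is enough. The following is only a sketch: it generates such a script in a temporary file (the variable name job_script is made up for this example) and counts its option lines. It assumes that -cwd and -l may be embedded the same way as the -M and -m options shown above.

```shell
# Generate a sample job script with embedded SGE options.
job_script=$(mktemp)

cat > "$job_script" <<'EOF'
#!/bin/bash
#$ -cwd
#$ -l mem_free=4G,h_vmem=6G
#$ -M your_email@jhsph.edu
#$ -m e
R CMD BATCH mycommands.R
EOF

# Count the embedded option lines (those starting with "#$ ").
option_count=$(grep -c '^#\$ ' "$job_script")
echo "$option_count embedded qsub options"
```

With the options inside the script, the memory request and email settings travel with the job and cannot be forgotten on the command line.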
A special queue has been created (currently consisting of 4 slots on one node) for "express" jobs.
Use this express queue if you have max'd out your available standard queue cluster slots and you need to run a relatively quick job, whether it be with qrsh or qsub. It could also be used to get around "traffic jams" on the standard cluster queue (those rare times when no standard queue slots are available) to run that short urgent job.
The express queue currently allows 1 slot per user (if available). Since the express queue has a limited number of slots, if too many users are currently using it you may not be granted a slot until a current user's session times out.
You can access the express queue by using the "-l express" option on your qrsh or qsub command. So, for example, ...
qrsh -l express
would connect you to a slot on the express queue.
For now, please do not run big memory jobs on the express queue.
Please send any questions or comments about this document to BITSUPPORT (bitsupport at jhsph.edu).
This document was last modified on 2011-Feb-02