tutorial:torque [2016/10/18 15:13] sertalpbilal — Minor fixes
tutorial:torque [2024/02/28 13:12] (current) mjm519
  
TORQUE provides control over batch jobs and distributed computing resources. It is an advanced open-source product based on the original PBS project and incorporates the best of both community and professional development. It incorporates significant advances in the areas of scalability, reliability, and functionality and is currently in use at tens of thousands of leading government, academic, and commercial sites throughout the world. TORQUE may be freely used, modified, and distributed under the constraints of the included license.


===== Prerequisite =====
To retrieve the output and error files of your Torque jobs, you need password-less SSH between the nodes. If you have not set this up before, run the following commands. They create a public/private key pair so that when a node transfers a file to your home folder, no password is required.
After connecting to the polyps, enter:

<code bash>
ssh-keygen -N ""
</code>

Press ENTER at every prompt. Then run the following commands:

<code bash>
touch ~/.ssh/authorized_keys2
chmod 600 ~/.ssh/authorized_keys2
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys2
</code>
Now you will get the error log and output log files for your jobs.

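As a small aside, the append step can be made idempotent so that re-running the setup never duplicates key entries. A sketch in a scratch directory (the ''demo_ssh'' directory and the key text are stand-ins, not your real ''~/.ssh''):

```shell
# Scratch-directory demo (stand-in paths): append a public key to authorized_keys2
# only if it is not already present, so the setup can be re-run safely.
mkdir -p demo_ssh
touch demo_ssh/authorized_keys2
chmod 600 demo_ssh/authorized_keys2
echo "ssh-rsa AAAAB3demo user@polyps" > demo_ssh/id_rsa.pub   # stand-in public key
grep -qxF "$(cat demo_ssh/id_rsa.pub)" demo_ssh/authorized_keys2 \
  || cat demo_ssh/id_rsa.pub >> demo_ssh/authorized_keys2
```

Running the last command a second time leaves the file unchanged.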
  
===== Hardware =====
| polyp30 | 24 Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz | 128 GB | 2x K80 (4 GPUs) |
  
  
Configured resources as provided in the Maui scheduler. This is pulled from Torque:
                        PROCS: 16
                        MEM: 31G
                        SWAP: 63G

===== Submitting Jobs =====

Jobs can be submitted either with a submission file or directly from the command line. First we explain how this is done, and then we discuss the options.
#PBS -o /home/mat614/TEST.out
#PBS -l nodes=1:ppn=4
#PBS -l pmem=2GB,vmem=1GB
#PBS -q batch
  
</code>
If you do not want to write a submission script, you can submit directly by calling
<code>qsub -N JobName -q batch -l nodes=1:ppn=2 myscript.sh</code>
Here we run the same code, but the job parameters are set using ''-'' options (e.g. ''-N JobName'').
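To tie the pieces together, here is a minimal end-to-end sketch: it writes a small submission script and shows the submit command. The job name, queue choice, and file names are placeholders, and ''qsub'' itself only exists on the cluster, so it is left commented out:

```shell
# Write a minimal submission script; every name and path below is a placeholder.
cat > hello.pbs <<'EOF'
#!/bin/bash
#PBS -N HelloJob
#PBS -q batch
#PBS -l nodes=1:ppn=1
#PBS -o hello.out
#PBS -e hello.err
echo "running on $(hostname)"
EOF

# On the cluster you would now submit it with:
# qsub hello.pbs
```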
  
===== Options =====
  
^ Option  ^ Description  ^
| ''-q <queue>''  | Set the queue. Often you will use the standard queue, so there is no need to set this. |
| ''-V''  | Pass all environment variables to the job |
| ''-v var[=value]''  | Pass the specific environment variable ''var'' to the job |
| ''-b y''  | Allow the command to be a binary file instead of a script |
| ''-w e''  | Verify options and abort if there is an error |
| ''-N <jobname>''  | Name of the job. This is what you will see when you use ''qstat'' to check the status of your jobs. |
| ''-l resource_list''  | Specify resources |
| ''-l h_rt=<hh:mm:ss>''  | Specify the maximum (hard) run time in hours, minutes and seconds |
| ''-l s_rt=<hh:mm:ss>''  | Specify the soft run time limit in hours, minutes and seconds. Remember to set both ''s_rt'' and ''h_rt''. |
| ''-cwd''  | Run in the current working directory |
| ''-wd <dir>''  | Set the working directory for this job to <dir> |
| ''-o <output_logfile>''  | Name of the output log file |
| ''-e <error_logfile>''  | Name of the error log file |
| ''-m ea''  | Send email when the job ends or aborts |
| ''-P <projectName>''  | Set the job's project |
| ''-M <emailaddress>''  | Email address to send notifications to |
| ''-t <start>-<end>:<incr>''  | Submit a job array with the given start index, end index, and increment |
  
You can find detailed information [[http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html|here]].
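As an illustration of the ''-t'' option, the sketch below builds a tiny job-array script in which each array member reads its index from ''PBS_ARRAYID'' (the script name and input-file naming are hypothetical). Off the cluster, you can simulate one member by setting ''PBS_ARRAYID'' by hand:

```shell
# Hypothetical job-array script: each member processes the input file matching
# its array index, which Torque exposes as PBS_ARRAYID.
cat > array.pbs <<'EOF'
#!/bin/bash
#PBS -N ArrayDemo
#PBS -t 1-3
echo "processing input_${PBS_ARRAYID}.dat"
EOF

# On the cluster: qsub array.pbs   (submits members with indices 1, 2, 3)
# Local simulation of array member 2:
PBS_ARRAYID=2 bash array.pbs
# prints: processing input_2.dat
```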
  
<note tip>You need to use the ''-V'' option to pass environment variables, which is needed to run solvers such as CPLEX, Gurobi, and MOSEK. [[tutorial:torque#running_solvers|See here]].</note>
===== Monitoring and Removing jobs =====
  
To show jobs, use ''qstat'' or ''qstat -a''. You can see more details using
<code>qstat -f</code>
To show the jobs of a specific user, use ''qstat -u "mat614"''. To remove a job, use
<code shell>
qdel JOB_ID
</code>
  
Moreover, you can use the following commands:
  * ''qstat -r'' provides the list of running jobs
  * ''qstat -i'' provides the list of jobs waiting in the queue
  * ''qstat -n'' provides the polyps node(s) executing each job
==== Queues ====
  
We have several queues; you can list them with ''qstat -Q'':
<code>
Queue              Max    Tot   Ena   Str   Que   Run   Hld   Wat   Trn   Ext T   Cpt
----------------   ---   ----    --    --   ---   ---   ---   ---   ---   --- -   ---
MOSEK               48      0   yes   yes     0     0     0     0     0     0 E     0
AMPL                 0      0   yes   yes     0     0     0     0     0     0 E     0
long                30      1   yes   yes     0     1     0     0     0     0 E     0
gpu                  4      0   yes   yes     0     0     0     0     0     0 E     0
verylong            20      0   yes   yes     0     0     0     0     0     0 E     0
medium             100      0   yes   yes     0     0     0     0     0     0 E     0
coraverylong         0      0    no    no     0     0     0     0     0     0 E     0
special             24      0   yes   yes     0     0     0     0     0     0 E     0
batch                0      1   yes   yes     0     1     0     0     0     0 E     0
short                0      0   yes   yes     0     0     0     0     0     0 E     0
urgent               0      0    no    no     0     0     0     0     0     0 E     0
background           0      0   yes   yes     0     0     0     0     0     0 E     0
mediumlong          60      0   yes   yes     0     0     0     0     0     0 E     0
</code>

If you want to use AMPL or MOSEK, you must submit to the corresponding ''AMPL'' or ''MOSEK'' queue, because we have a limited number of licenses for them.

  
You can see the limits using the command ''qstat -f -Q'':
^ Queue       ^ Wall Time  ^ Max Queueable  ^ Max Running  ^ Max User Run  ^ Max User Queuable  ^ Notes                         ^
| urgent      |            |                |              |               |                    | high priority - upon request  |
| batch       | 01:00:00   |                |              |               |                    |                               |
| short       | 02:00:00   |                |              |               |                    |                               |
| medium      | 04:00:00   |                | 100          | 40            | 200                |                               |
| mediumlong  | 24:00:00   | 1200           | 60           |               |                    |                               |
| long        | 72:00:00   |                | 30           | 20            | 900                |                               |
| verylong    | 240:00:00  |                | 20           | 10            | 600                |                               |
| special     | 72:00:00   |                | 24           |               |                    |                               |
| background  | unlimited  |                |              |               |                    | low priority                  |
| gpu         |            |                | 4            | 1             |                    | GPU node is not in Torque     |
| AMPL        |            | 200            | 8            | 6             |                    |                               |
| MOSEK       |            | 50             | 48           |               |                    |                               |
  


Notes:
  * The urgent queue has no limits, and its jobs have higher priority than all other jobs in the queues. Please be respectful of others when using this queue for time-sensitive or critical jobs.
  * The background queue has no limits, and its jobs have lower priority than all other jobs in the queues.

===== Examples =====
  
==== Submitting a Small or Large Memory Job ====

You can use the option ''-l pmem=size,vmem=size'' to limit the memory usage of your job.

<code bash limited.sh>
qsub -l pmem=4gb,vmem=4gb test.pbs
</code>

Sometimes your job needs more memory. You can choose a larger memory size with the same option:

<code bash large.pbs>qsub -l pmem=20gb test.pbs</code>

To see what resources have been assigned by the batch queuing system, run the ''ulimit'' command (bash) or the ''limit'' command (csh):
<code bash pbs job submission command>qsub -I -l nodes=1:ppn=1 -l pmem=30GB,vmem=4GB -q short -N test -e TEST.err -o TEST.out -w e</code>
<code bash ulimit>user@polyp13:~$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) 31457280
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 128344
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 31457280
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 128344
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited</code>

**[[https://www.geeksforgeeks.org/ulimit-soft-limits-and-hard-limits-in-linux|For more information on the ulimit command, review this link.]]**
==== Running MATLAB ====
  
You just have to create a submission script which looks like this:
#PBS -o /home/mat614/TEST.out
#PBS -l nodes=1:ppn=4
#PBS -l pmem=2GB,vmem=1GB
#PBS -q batch
  
</code>
  
<note tip>Use the **-singleCompThread** [[https://www.mathworks.com/help/matlab/ref/maxnumcompthreads.html|option]] to make Matlab use a single thread. A similar option may be needed for the program/solver you are using.</note>
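Combining the note above with a submission script, here is a sketch of a single-threaded MATLAB job (the script name and ''mycode.m'' are placeholders; adjust paths to your own files):

```shell
# Placeholder submission script that launches MATLAB with a single computation thread.
cat > matlab_single.pbs <<'EOF'
#!/bin/bash
#PBS -N MatlabSingle
#PBS -q batch
#PBS -l nodes=1:ppn=1
matlab -nodisplay -singleCompThread -r "run('mycode.m'); exit"
EOF
```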

==== Running Solvers ====

In order to run solvers (such as Gurobi, CPLEX, MOSEK, or AMPL), you need to use the ''-V'' option (note the uppercase), i.e.:

<code>qsub -V submitFile.pbs</code>

This flag passes your environment variables to the job, which enables the solver to find the necessary authentication information.
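The effect of ''-V'' can be seen with a small local experiment: without it, the job starts from a stripped-down environment (imitated below with ''env -i''), so any license variable you exported becomes invisible. The variable name here is a stand-in for whatever your solver actually needs:

```shell
export DEMO_LICENSE_FILE=/path/to/license.lic   # stand-in for a solver license variable

# Without -V: the job would see a stripped environment (simulated with `env -i`).
env -i bash -c 'echo "without -V: ${DEMO_LICENSE_FILE:-unset}"'

# With -V: exported variables travel into the job's environment.
bash -c 'echo "with -V: ${DEMO_LICENSE_FILE:-unset}"'
```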
  
==== Interactive Jobs ====
  
If you do not care where your job runs, just use ''-I'' and do not specify any script to run.
and you will be running an interactive session on polyp15.
  
==== Using GPUs ====
  
  
However, you first need permission to use the GPU (granted by Prof. Takac); this is just a formality that allows certain users to use the video driver on polyp30.
  
If you are using TensorFlow in Python, you can limit the amount of GPU memory it uses with:
<code>config_tf = tf.ConfigProto()
config_tf.gpu_options.per_process_gpu_memory_fraction = p</code>
in which **//p//** is the fraction of GPU memory to use (a number between zero and one).

==== Running MPI and Parallel Jobs ====
  
<code bash mpi.pbs>
c2
</code>
  
===== Mass Operations =====
</code>
to cancel all of your running jobs.

<code bash>
qselect -u <username> | xargs qdel
</code>
will cancel all of your jobs (both running and queued).

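The ''qselect | xargs qdel'' combination works because ''qselect'' prints one job id per line and ''xargs'' packs those ids onto a ''qdel'' command line. A local simulation, with ''printf'' standing in for ''qselect'' and ''echo qdel'' standing in for ''qdel'':

```shell
# Simulated: printf stands in for qselect, `echo qdel` for qdel.
printf '101.polyps\n102.polyps\n' | xargs echo qdel
# prints: qdel 101.polyps 102.polyps
```

On the cluster, ''qselect'' can also filter by job state, e.g. ''qselect -u <username> -s Q | xargs qdel'' removes only your queued (not yet running) jobs.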
  
===== Advanced =====


The qsub command will pass certain environment variables in the Variable_List attribute of the job. These variables will be available to the job. The value for the following variables will be taken from the environment of the qsub command:
  * **HOME** (the path to your home directory)
  * **LANG** (which language you are using)
  * **LOGNAME** (the name that you logged in with)
  * **PATH** (standard path to executables)
  * **MAIL** (location of the user's mail file)
  * **SHELL** (command shell, e.g. bash, sh, zsh, csh, etc.)
  * **TZ** (time zone)
These values will be assigned to a new name which is the current name prefixed with the string "PBS_O_". For example, the job will have access to an environment variable named PBS_O_HOME which has the value of the variable HOME in the qsub command environment. In addition to these standard environment variables, there are additional environment variables available to the job.
  * **PBS_O_HOST** (the name of the host upon which the qsub command is running)
  * **PBS_SERVER** (the hostname of the pbs_server to which qsub submits the job)
  * **PBS_O_QUEUE** (the name of the original queue to which the job was submitted)
  * **PBS_O_WORKDIR** (the absolute path of the current working directory of the qsub command)
  * **PBS_ARRAYID** (each member of a job array is assigned a unique identifier)
  * **PBS_ENVIRONMENT** (set to PBS_BATCH to indicate the job is a batch job, or to PBS_INTERACTIVE to indicate the job is a PBS interactive job)
  * **PBS_JOBID** (the job identifier assigned to the job by the batch system)
  * **PBS_JOBNAME** (the job name supplied by the user)
  * **PBS_NODEFILE** (the name of the file containing the list of nodes assigned to the job)
  * **PBS_QUEUE** (the name of the queue from which the job is executed)
  * **PBS_WALLTIME** (the walltime requested by the user or the default walltime allotted by the scheduler)
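These variables can be used directly inside a job script, for example to log what is running where and to start from the submission directory. A sketch (''report.sh'' is a hypothetical script; off the cluster, the PBS variables are simply set by hand to simulate the batch environment):

```shell
# Hypothetical job script that uses the PBS_* variables set by the batch system.
cat > report.sh <<'EOF'
#!/bin/bash
echo "job ${PBS_JOBID:-<none>} (${PBS_JOBNAME:-<none>}) in queue ${PBS_QUEUE:-<none>}"
cd "${PBS_O_WORKDIR:-.}"   # jobs start in $HOME; move to where qsub was run
EOF

# Local simulation of the variables Torque would provide:
PBS_JOBID=42.polyps PBS_JOBNAME=demo PBS_QUEUE=batch bash report.sh
# prints: job 42.polyps (demo) in queue batch
```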


==== Tensorflow with GPU ====
To use TensorFlow with a specific GPU, say GPU 1, you can simply set
<code bash>
export CUDA_VISIBLE_DEVICES=1
</code>
and then schedule your jobs with Torque to perform experiments on GPU 1.


====== MOAB Scheduler ======
PBS Torque is used to schedule and run jobs on our cluster. Two PBS processes are required to run jobs. On the PBS server, the pbs_server process runs to accept your job and add it to the queue. It also dispatches the job to the nodes, where it runs under the pbs_mom process.


==== Useful MOAB Commands ====
  - [[https://docs.adaptivecomputing.com/maui/commands/showq.php|showq]] - Displays information about active, eligible, blocked, and/or recently completed jobs.
  - [[https://docs.adaptivecomputing.com/maui/commands/showstart.php|showstart]] - Displays the estimated start time of a job based on a number of analysis types.
  - [[https://docs.adaptivecomputing.com/maui/commands/checkjob.php|checkjob]] - Allows end users to view the status of their own jobs.

==== Useful External Resources ====
[[https://www.icer.msu.edu/sites/default/files/files/understand_job_scheduler_v2.pdf|MSU - Understand job scheduler and resource manager]] - Describes the batch queuing system and has some useful diagrams explaining the interrelationship between the scheduler and the server.

[[https://wvuhpc.github.io/2019-Intro-HPC/07-jobs/index.html|WVU - Job Submission (Torque and Moab)]] - Lists frequently used commands for Torque and Moab. Also includes information on Prologue and Epilogue scripts.

[[http://docs.adaptivecomputing.com/mwm/7-1-3/help.htm#pbsintegration.html|Moab-TORQUE/PBS Integration Guide]] - A guide for administrators and integrators on the deployment and integration of PBS Torque and Moab into a computer system.

[[https://silas.net.br/tech/hpc/torque.html|Torque Notes]] - Information about the processes involved in using Torque and debugging information.
  
  
tutorial/torque.1476818025.txt.gz · Last modified: 2016/10/18 15:13 by sertalpbilal