======CONDOR======
===== What is CONDOR =====
===== Using CONDOR =====
==== Submitting A Single Job ====
To submit a job via CONDOR, you need to create a .sub file. This .sub file must include the program that you will execute (e.g., matlab, cplex, etc.) along with the arguments for the program (such as the file to be executed).
=== A case study: Matlab ===

Suppose that we want to run a MATLAB code on Polyps. Here is an example .sub file which submits the matlab file ''myexp.m'' to condor:
<code bash myexp.sub>
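## An illustrative sketch: the Matlab path and arguments below are
## assumptions; adjust them to your own files and installation.
getenv     = TRUE
universe   = vanilla

executable = /usr/local/bin/matlab              ## assumed Matlab path (see the tips below)
arguments  = -nodisplay -nosplash -r myexp      ## runs myexp.m without a display
output     = myexp.out
error      = myexp.err
log        = myexp.log
queue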
</code>
Then run
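<code bash>
condor_submit myexp.sub
</code>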
to submit the file to condor.\\

<note tip>
You can find the full path of a program (for the ''executable'' line) with the ''which'' command, e.g., ''which matlab''.
</note>
<note tip>
Full paths of some frequently used programs:
  * Matlab: /
  * Cplex: /
  * Mosek: /
  * Ampl: /
</note>

==== Submitting Multiple Jobs ====

There are multiple ways to submit a set of experiments (multiple jobs). Here are two different ways to achieve the same result.

=== 1. Via a Submit File ===

A simple example demonstrates the use of nested loops in multiple-job submission. In this example, the executable ''test'' takes two arguments, ''i'' and ''j'', and we want to run it once for every combination of the two. This is equivalent to running the nested loop:
<code cpp>
for(int i = 0; i < ilimit; ++i)
{
    for(int j = 0; j < jlimit; ++j)
    {
        test -i -j;
    }
}
</code>
The following example demonstrates a two-layer nested loop with ''ilimit=5'' and ''jlimit=10''. Nested loops with more than two layers can be built with the same logic.

<code bash nestedloop.sub>
getenv = TRUE
Universe = vanilla

## ilimit=5, jlimit=10
## N = (ilimit)*(jlimit) = 50
## ilimit is implicitly included in the value of "N"

jlimit = 10
N = 50

## Map the job number 0..N-1 to the pair (I, J)
I = $$([ $(Process) / $(jlimit) ])
J = $$([ $(Process) % $(jlimit) ])

Executable = test
arguments = "$(I) $(J)"
output = test$(Process).txt
Error = test.err
Log = test.log
queue $(N)
</code>
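
Submit the file as usual; CONDOR expands ''$(Process)'' separately for each of the 50 jobs:
<code bash>
condor_submit nestedloop.sub
</code>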

Output correspondence:
<code>
## test -i=0 -j=0 -> test0.txt
## test -i=0 -j=1 -> test1.txt
## ......
## test -i=4 -j=9 -> test49.txt
</code>

----

A simple example to demonstrate the use of variables in multiple-job submission. In this example, the executable ''test'' is run with different arguments, using the predefined macro ''$(Process)'', which expands to the job number (0, 1, 2, ...) and so can be used as an iteration counter.

In the first block below, ''test'' is run with arguments ''0'' through ''4'', and the corresponding output files are ''test0.txt'' through ''test4.txt''. The other two blocks show how to do the same with a job-count variable ''N'' and with a shifted counter ''I''.

<code bash variable.sub>
getenv = TRUE
Universe = vanilla

## Executable "test" is run 5 times with arguments 0,...,4
Executable = test
arguments = $(Process)
output = test$(Process).txt
Error = test.err
Log = test.log
queue 5

## Executable "test" is run as above, but this time
## a variable N is defined to specify the number of jobs
N = 5
Executable = test
arguments = $(Process)
output = test$(Process).txt
Error = test.err
Log = test.log
queue $(N)

## Executable "test" is run with arguments 5,...,9
## The corresponding output files are "test$(Process).txt"
## Variable I is defined based on the macro $(Process)
I = $$([ $(Process) + 5 ])
Executable = test
arguments = $(I)
output = test$(Process).txt
Error = test.err
Log = test.log
queue 5
</code>


=== 2. Via Python (Script) ===

You can use the same executable, options, etc. and change some of them to create new jobs. Then, when you submit your file using ''condor_submit'', all of the jobs are queued at once.

For your experiments, you can write a Python script that generates the submission file automatically. Here is an example (the executable name, argument values, and output file names below are illustrative):

<code python create.py>
# create.py writes a condor submission file (condor.sub) that runs
# the same problem with different arguments
# (the executable path and argument values below are illustrative)

# Open file and write common part
cfile = open('condor.sub', 'w')
common_command = \
'Executable = ../mycode \n\
Universe   = vanilla\n\
getenv     = true\n\
transfer_executable = false \n\n'
cfile.write(common_command)

# Loop over various values of an argument and create different output file for each
# Then put it in the queue
for a in xrange(5, 20, 5):
    run_command = \
'arguments  = %d\n\
output     = out_%d.txt\n\
queue 1\n\n' %(a, a)
    cfile.write(run_command)

cfile.close()
</code>

This script will generate the following condor file:
<code bash condor.sub>
Executable = ../mycode
Universe   = vanilla
getenv     = true
transfer_executable = false

arguments  = 5
output     = out_5.txt
queue 1

arguments  = 10
output     = out_10.txt
queue 1

arguments  = 15
output     = out_15.txt
queue 1
</code>
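
A typical workflow with these files would then be:
<code bash>
python create.py          ## writes condor.sub
condor_submit condor.sub  ## submits all three jobs at once
</code>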

<note important>
Be sure to provide an ''output'' argument in your Condor submissions. Otherwise, you may not be able to see the results of your tasks.
</note>

==== Checking Jobs ====
To check the job progress, use the command
<code bash>
condor_q userid
</code>

<note tip>If you think your jobs are somehow not being processed, you can debug and see the reasons by calling ''condor_q -analyze''.</note>
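
For example, to see why job 42989.5 (from the example below) is not running:
<code bash>
condor_q -analyze 42989.5
</code>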

==== Removing Jobs ====

First find the ID of the job you will terminate using ''condor_q userid''. Then call

<code bash>
condor_rm ID
</code>

Example:
I call ''condor_q sertalpbilal'' and get the following list:
<code>
-- Submitter: polyp1.ie.lehigh.edu : <...> : polyp1.ie.lehigh.edu
 ID       OWNER         SUBMITTED     RUN_TIME ST PRI SIZE CMD
42989.0   sertalpbilal  ...
42989.1   sertalpbilal  ...
42989.5   sertalpbilal  ...
</code>
Now let's say I want to terminate 42989.5. I call ''condor_rm 42989.5'' and the job is removed.

You can remove all of your jobs at once using ''condor_rm userid''.

===== Frequently Used CONDOR Commands =====

A summary of frequently used commands in CONDOR:

^ Command ^ Action ^ Basic Usage ^ Example ^
| ''condor_submit'' | submit jobs to the pool | ''condor_submit <submit file>'' | ''condor_submit myexp.sub'' |
| ''condor_q'' | show the status of jobs | ''condor_q <userid>'' | ''condor_q sertalpbilal'' |
| ''condor_rm'' | remove jobs from the queue | ''condor_rm <job ID>'' | ''condor_rm 42989.5'' |
| ''condor_status'' | show the status of machines in the pool | ''condor_status'' | ''condor_status'' |

The [[http://research.cs.wisc.edu/htcondor/manual/|HTCondor manual]] covers the complete list of commands and options.

===== Some other CONDOR commands =====

^ Command ^ Action ^ Info ^
| ''condor_history'' | show completed jobs | lists jobs that have already left the queue |
| ''condor_userprio'' | show user priorities | a lower value means a better priority |


===== Running MPI Jobs with Condor =====

FIXME To submit MPI jobs to our condor pool, you can check Dr. Takac's instructions.


===== Using AMPL with Condor =====

We have a limited AMPL license installed in the COR@L Lab, which allows at most 10 simultaneous AMPL jobs. If you are using AMPL in your experiments, you can let condor know about this, and it will schedule all jobs that need AMPL with the license limit in mind. For this, add the following line to your condor submission file (the limit name ''ampl'' is an assumption; the actual name is set by the administrators):

''concurrency_limits = ampl''
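
A minimal sketch of where the line goes, assuming the limit is named ''ampl'' (the AMPL path and file names are illustrative):
<code bash ampljob.sub>
getenv = TRUE
Universe = vanilla

Executable = /usr/local/bin/ampl   ## assumed AMPL path; see the paths tip above
arguments = myexp.run              ## illustrative AMPL script
concurrency_limits = ampl          ## condor counts this job against the AMPL limit
output = ampl.out
Error = ampl.err
Log = ampl.log
queue
</code>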

===== Condor Jobs Memory Usage =====

Please check the status of your condor jobs regularly, especially their memory usage. Each polyp node has 16 processors and 32 GB of memory, so one process gets 2 GB of memory on average.

When a polyp node runs out of memory it starts using the hard drive (swap) as memory, but reading from and writing to a hard drive is 1000 times slower. This means that if your jobs use large amounts of memory and the polyp node processing them runs out of memory, do not expect your jobs to terminate.

Tips:
You can see the memory usage of your job using ''condor_q'' (the SIZE column gives memory usage in MB).

You can check the node your job is running on using ''condor_q -run''.

You can check the memory status on a node using ''free -m''. For more memory-checking commands, google is your friend.
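
For example (the node name is illustrative):
<code bash>
condor_q -run userid   ## shows the host each running job occupies
ssh polyp10 free -m    ## memory status on that node, in MB
</code>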

**Your job might get killed if it is using swap. Do not waste your system administrators' time. Just control your jobs and submit jobs that are reasonable.**