Using Condor
An Introduction
ICE 2011
The Condor Project
(Established ‘85)
Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff, and students.
Condor is a batch computing system
High Throughput Computing (HTC),
not High Performance Computing (HPC)
Originated from desktop cycle scavenging

Cycle Scavenging
A good metaphor, even for clusters that are dedicated
Cycles Are Cheap!
Amazon.com EC2: 10 cents/hour
Academic computing: 4 cents/hour
Opportunistic computing: even cheaper
Total Usage between 2011-07-21 and 2011-07-22

Group Usage Summary
   User                            Hours    Pct   Demand
-- ------------------------------ -------- ------ ------
 1 Physics_Balantekin               6224.8  16.8%  46.4%
 2 ChE_dePablo                      5932.3  16.0% 100.0%
 3 Astronomy_Friedman               5764.0  15.5%   0.0%
 4 Economics_Traczynski             4218.4  11.4%  61.1%
 5 Chemistry_Skinner                4186.5  11.3%  45.4%
 6 BMRB                             1731.5   4.7%  15.6%
 7 Physics_Petriello                1708.3   4.6%   7.1%
 8 CMS                              1494.6   4.0%  31.8%
 9 LMCG                             1444.4   3.9%  27.3%
10 Biochem_Sussman                   996.3   2.7%   3.6%
11 Atlas                             847.9   2.3%  79.9%
12 MSE                               812.5   2.2%   2.9%
-- ------------------------------ -------- ------ ------
   TOTAL                          37126.7  100.0% 100.0%
HTC in a nutshell
Work is divided into “jobs”
A cluster of machines is divided into “machines”
HTC runs jobs on machines.
Condor Tutorial
Definitions
Job
The Condor representation of your work
Machine
The Condor representation of a computer that can perform the work
Matchmaking
Matching a job with a machine (“Resource”)
Job
Jobs state their requirements and preferences:
I need a Linux/x86 platform
I need a machine with at least 500 MB
I prefer a machine with more memory
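In a submit description file, preferences like these are written as Requirements and Rank expressions; a hedged sketch (OpSys, Arch, and Memory are standard machine ClassAd attributes, with Memory in MB):

```
Requirements = (OpSys == "LINUX") && (Arch == "INTEL") && (Memory >= 500)
Rank         = Memory
```

Requirements must evaluate to true for a match; Rank breaks ties in favor of higher values, here preferring machines with more memory.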
Machine
Machines state their requirements and preferences:
Run jobs only when there is no keyboard activity
I prefer to run Frieda’s jobs
I am a machine in the econ department
Never run jobs belonging to Dr. Smith
The Magic of Matchmaking
Jobs and machines state their requirements and preferences
Condor matches jobs with machines
based on requirements and preferences
Getting Started:
Submitting Jobs to Condor
Overview:
Choose a “Universe” for your job
Make your job “batch-ready”
Create a submit description file
Run condor_submit to put your job in the queue
1. Choose the “Universe”
Controls how Condor handles jobs
Choices include:
Vanilla
Standard
Grid
Java
Parallel
VM
Using the Vanilla Universe
The Vanilla Universe:
Allows running almost any “serial” job
Provides automatic file transfer, etc.
Like vanilla ice cream
Can be used in just about any situation
2. Make your job batch-ready
Must be able to run in the background
No interactive input
No GUI/window clicks
No music ;^)
Make your job batch-ready (continued)…
Job can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices
Similar to UNIX (or DOS) shell:
$ ./myprogram <input.txt >output.txt
3. Create a Submit Description File
A plain ASCII text file
Condor does
not
care about file extensions
Tells Condor about your job:
Which executable, universe, input, output and error files to use, command-line arguments, environment variables, any special requirements or preferences (more on this later)
Can describe many jobs at once (a “cluster”), each with different input, arguments, output, etc.
Simple Submit Description File
# Simple condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
# case sensitive, but filenames are!
Universe = vanilla
Executable = my_job
Output = output.txt
Queue
4. Run condor_submit
You give condor_submit the name of the submit file you have created:
condor_submit my_job.submit
condor_submit:
Parses the submit file, checks for errors
Creates a “ClassAd” that describes your job(s)
Puts job(s) in the Job Queue
The Job Queue
condor_submit sends your job’s ClassAd(s) to the schedd
The schedd (more details later):
Manages the local job queue
Stores the job in the job queue
Atomic operation, two-phase commit
“Like money in the bank”
View the queue with condor_q
Example
condor_submit and condor_q
%
condor_submit my_job.submit
Submitting job(s).
1 job(s) submitted to cluster 1.
%
condor_q
-- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> :
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1.0 frieda 6/16 06:52 0+00:00:00 I 0 0.0 my_job
1 jobs; 1 idle, 0 running, 0 held
%
Input, output & error files
Controlled by submit file settings
You can define the job’s standard input, standard output and standard error:
Read job’s standard input from “input_file”:
Input = input_file
Shell equivalent:
program <input_file
Write job’s standard output to “output_file”:
Output = output_file
Shell equivalent:
program >output_file
Write job’s standard error to “error_file”:
Error = error_file
Shell equivalent:
program 2>error_file
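A quick way to see all three redirections together, using sort as a stand-in for the job’s program (file names match the slide’s examples):

```shell
# Create some input, then run the stand-in program with all three
# streams redirected to files, just as Condor arranges for a vanilla job.
printf 'banana\napple\n' > input_file
sort < input_file > output_file 2> error_file
head -n 1 output_file    # first line of the sorted output
```

On success, output_file holds the sorted lines and error_file is empty.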
Email about your job
Condor sends email about job events to the submitting user
Specify “notification” in your submit file to control which events:
Notification = complete (the default)
Notification = never
Notification = error
Notification = always
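For example, a submit file that only emails when something goes wrong could look like this (a sketch; my_job is a placeholder executable):

```
# Hypothetical fragment: email only on job errors
Universe     = vanilla
Executable   = my_job
Notification = error
Queue
```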
Feedback on your job
Create a log of job events
Add to the submit description file: log = sim.log
Becomes the Life Story of a Job
Shows all events in the life of a job
Always have a log file
Sample Condor User Log
000 (0001.000.000) 05/25 19:10:03 Job submitted from host: <128.105.146.14:1816>
...
001 (0001.000.000) 05/25 19:12:17 Job executing on host: <128.105.146.14:1026>
...
005 (0001.000.000) 05/25 19:13:06 Job terminated.
    (1) Normal termination (return value 0)
...
Example Submit Description File With Logging
# Example condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
# case sensitive, but filenames are!
Universe = vanilla
Executable = /home/frieda/condor/my_job.condor
Log = my_job.log                      # Job log (from Condor)
Input = my_job.in                     # Program’s standard input
Output = my_job.out                   # Program’s standard output
Error = my_job.err                    # Program’s standard error
Arguments = -a1 -a2                   # Command line arguments
InitialDir = /home/frieda/condor/run
Queue
Let’s run a job
First, you need a terminal emulator
http://www.putty.org (or similar)
Login to chopin.cs.wisc.edu as cguserXX, with the given password
Logged In?
source /scratch/setup.sh
mkdir /scratch/your_name
cd /scratch/your_name
condor_q
condor_status
Create submit file
nano submit.your_initials

universe = vanilla
executable = /bin/echo
Arguments = hello world
Should_transfer_files = yes
When_to_transfer_output = on_exit
Output = out
Log = log
queue
And submit it…
condor_submit submit.your_initials
(wait… remember the HTC bit?)
condor_q xx
cat out
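You can predict what the exercise should produce by running the job’s command locally, mirroring Output = out from the submit file (a sketch; Condor adds the matchmaking, transfer, and queueing):

```shell
# The job is just /bin/echo with "hello world" as its arguments;
# its stdout lands in the file named by Output.
/bin/echo hello world > out
cat out
```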
A matlab example
#!/s/std/bin/octave -qf
printf "Hello World\n";

Save as Hello.o
chmod 0755 Hello.o
./Hello.o
submit file
nano submit.your_initials

universe = vanilla
executable = Hello.o
Should_transfer_files = yes
When_to_transfer_output = on_exit
Output = out
Log = log
queue
“Clusters” and “Processes”
If your submit file describes multiple jobs, we call this a “cluster”
Each cluster has a unique “cluster number”
Each job in a cluster is called a “process”
Process numbers always start at zero
A Condor “Job ID” is the cluster number, a period, and the process number (e.g. 2.1)
A cluster can have a single process
Job ID = 20.0 (Cluster 20, process 0)
Or, a cluster can have more than one process
Job IDs: 21.0, 21.1, 21.2 (Cluster 21, processes 0, 1, 2)
Submit File for a Cluster
# Example submit file for a cluster of 2 jobs
# with separate input, output, error and log files
Universe = vanilla
Executable = my_job
Arguments = -x 0
log = my_job_0.log
Input = my_job_0.in
Output = my_job_0.out
Error = my_job_0.err
Queue            # Job 2.0 (cluster 2, process 0)

Arguments = -x 1
log = my_job_1.log
Input = my_job_1.in
Output = my_job_1.out
Error = my_job_1.err
Queue            # Job 2.1 (cluster 2, process 1)
Submitting The Job

% condor_submit my_job.submit
Submitting job(s).
2 job(s) submitted to cluster 2.
% condor_q
-- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> :
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   frieda   4/15 06:52   0+00:02:11 R  0   0.0  my_job -a1 -a2
   2.0   frieda   4/15 06:56   0+00:00:00 I  0   0.0  my_job -x 0
   2.1   frieda   4/15 06:56   0+00:00:00 I  0   0.0  my_job -x 1
3 jobs; 2 idle, 1 running, 0 held
%
Organize your files and directories for big runs
Create subdirectories for each “run”
run_0, run_1, … run_599
Create input files in each of these:
run_0/simulation.in
run_1/simulation.in
…
run_599/simulation.in
The output, error & log files for each job will be created by Condor from your job’s output
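That layout can be generated with a small shell loop; this sketch uses 3 runs instead of 600, and the input contents are placeholders:

```shell
# Create run_0 .. run_2, each holding its own simulation.in
for i in 0 1 2; do
    mkdir -p "run_$i"
    printf 'run number %d\n' "$i" > "run_$i/simulation.in"
done
ls run_*/simulation.in
```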
Submit Description File for 600 Jobs
# Cluster of 600 jobs with different directories
Universe = vanilla
Executable = sim
Log = simulation.log
...
Arguments = -x 0
InitialDir = run_0    # Log, input, output & error files -> run_0
Queue                 # Job 3.0 (Cluster 3, Process 0)

Arguments = -x 1
InitialDir = run_1    # Log, input, output & error files -> run_1
Queue                 # Job 3.1 (Cluster 3, Process 1)

# …do this 598 more times…
Submit File for a Big
Cluster of Jobs
We just submitted 1 cluster with 600 processes
All the input/output files will be in different directories
The submit file is pretty unwieldy (over 1200 lines)
Isn’t there a better way?
Submit File for a Big
Cluster of Jobs (the better way) #1
We can queue all 600 in 1 “Queue” command:
Queue 600
Condor provides $(Process) and $(Cluster)
$(Process) will be expanded to the process number for each job in the cluster: 0, 1, … 599
$(Cluster) will be expanded to the cluster number: 4 for all jobs in this cluster
Submit File for a Big
Cluster of Jobs (the better way) #2
The initial directory for each job can be specified using $(Process):
InitialDir = run_$(Process)
Condor will expand these to the “run_0”, “run_1”, … “run_599” directories
Similarly, arguments can be variable:
Arguments = -x $(Process)
Condor will expand these to “-x 0”, “-x 1”, … “-x 599”
Better Submit File for 600 Jobs
# Example condor_submit input file that defines
# a cluster of 600 jobs with different directories
Universe = vanilla
Executable = my_job
Log = my_job.log
Input = my_job.in
Output = my_job.out
Error = my_job.err
Arguments = -x $(Process)      # -x 0, -x 1, … -x 599
InitialDir = run_$(Process)    # run_0 … run_599
Queue 600                      # Jobs 4.0 … 4.599
Now, we submit it…
$ condor_submit my_job.submit
Submitting job(s) ............................................................
Logging submit event(s) ......................................................
600 job(s) submitted to cluster 4.
And, Check the queue
$
condor_q
-- Submitter: x.cs.wisc.edu : <128.105.121.53:510> : x.cs.wisc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
   4.0   frieda   4/20 12:08   0+00:00:05 R  0   9.8  my_job -arg1 -x 0
   4.1   frieda   4/20 12:08   0+00:00:03 I  0   9.8  my_job -arg1 -x 1
   4.2   frieda   4/20 12:08   0+00:00:01 I  0   9.8  my_job -arg1 -x 2
   4.3   frieda   4/20 12:08   0+00:00:00 I  0   9.8  my_job -arg1 -x 3
...
   4.598 frieda   4/20 12:08   0+00:00:00 I  0   9.8  my_job -arg1 -x 598
   4.599 frieda   4/20 12:08   0+00:00:00 I  0   9.8  my_job -arg1 -x 599
600 jobs; 599 idle, 1 running, 0 held
Removing jobs
If you want to remove a job from the Condor queue, you use condor_rm
You can only remove jobs that you own
A privileged user can remove any jobs:
“root” on UNIX
“administrator” on Windows
Removing jobs (continued)
Remove an entire cluster:
condor_rm 4      # Removes the whole cluster
Remove a specific job from a cluster:
condor_rm 4.0    # Removes a single job
Or, remove all of your jobs with “-a”:
condor_rm -a     # Removes all jobs / clusters
Submit cluster of 10 jobs
nano submit

universe = vanilla
executable = /bin/echo
Should_transfer_files = yes
When_to_transfer_output = on_exit
Arguments = hello world $(PROCESS)
Output = out.$(PROCESS)
Log = log
Queue 10
And submit it…
condor_submit submit
(wait…)
condor_q xx
cat log
cat out.yy
My new jobs run for 20 days…
What happens when a job is forced off its CPU?
Preempted by higher priority user or job
Vacated because of user activity
How can I add fault tolerance to my jobs?
Condor’s
Standard Universe
to the rescue!
Support for transparent process checkpoint and restart
Remote system calls (remote I/O)
Your job can read / write files as if they were local
Remote System Calls in
the Standard Universe
I/O system calls are trapped and sent back to the submit machine
Examples: open a file, write to a file
No source code changes typically required
Programming language independent
Process Checkpointing in the
Standard Universe
Condor’s process checkpointing provides a mechanism to automatically save the state of a job
The process can then be restarted
from right where it was checkpointed
After preemption, crash, etc.
Checkpointing: Process Starts
A checkpoint is the entire state of a program, saved in a file:
CPU registers, memory image, I/O
[timeline figure: the process runs forward in time]

Checkpointing: Process Checkpointed
[timeline figure: checkpoints 1, 2 and 3 are taken as the process runs]

Checkpointing: Process Killed
[timeline figure: the process is killed after checkpoint 3; checkpoint 3 survives]

Checkpointing: Process Resumed
[timeline figure: the process restarts from checkpoint 3; the work preserved is “goodput”, the work lost and redone is “badput”]
When will Condor checkpoint your job?
Periodically, if desired (for fault tolerance)
When your job is preempted by a higher priority job
When your job is vacated because the execution machine becomes busy
When you explicitly run a condor_checkpoint, condor_vacate, condor_off or condor_restart command
Making the Standard Universe Work
The job must be relinked with Condor’s standard universe support library
To relink, place condor_compile in front of the command used to link the job:
% condor_compile gcc -o myjob myjob.c
  - OR -
% condor_compile f77 -o myjob filea.f fileb.f
  - OR -
% condor_compile make -f MyMakefile
Limitations of the
Standard Universe
Condor’s checkpointing is not at the kernel level.
In the Standard Universe, the job may not:
fork()
Use kernel threads
Use some forms of IPC, such as pipes and shared memory
You must have access to the source code to relink
Many typical scientific jobs are OK
Submitting Std uni job
#include <stdio.h>

int main(int argc, char **argv)
{
    int i;
    for (i = 0; i < 10000000; i++) {}
    return 0;
}
And submit…
condor_compile gcc -o foo foo.c
-- Change "vanilla" to "standard"
-- Change "/bin/echo" to "foo" (the program above)
My jobs have dependencies…
Can Condor help solve my dependency problems?
Condor Universes:
Scheduler and Local
Scheduler Universe
Plug in a meta-scheduler
Developed for DAGMan (more later)
Similar to Globus’s fork job manager
Local
Very similar to vanilla, but jobs run on the local host
Has more control over jobs than the scheduler universe
DAGMan
Directed Acyclic Graph Manager
DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.
(e.g., “Don’t run job B until job A has completed successfully.”)
What is a DAG?
A DAG is the data structure used by DAGMan to represent these dependencies.
Each job is a “node” in the DAG.
Each node can have any number of “parent” or “children” nodes – as long as there are no loops!
[diagram: Job A is the parent of Jobs B and C; Jobs B and C are parents of Job D]
Defining a DAG
A DAG is defined by a .dag file, listing each of its nodes and their dependencies:
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
Each node will run the Condor job specified by its accompanying Condor submit file
[diagram: the diamond-shaped DAG of Jobs A, B, C and D]
Submitting a DAG
To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:
% condor_submit_dag diamond.dag
The DAGMan daemon is run by the schedd
DAGMan itself is “watched” by Condor, so you don’t have to
Running a DAG
DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.
[diagram: DAGMan reads the .dag file; node A is submitted to the Condor job queue while B, C and D wait]
Running a DAG (cont’d)
DAGMan holds & submits jobs to the Condor queue at the appropriate times.
[diagram: after A completes, DAGMan submits B and C; D is still held]
Running a DAG (cont’d)
In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG.
[diagram: a node fails, so DAGMan writes a rescue file recording which nodes have completed]
Recovering a DAG
Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.
[diagram: DAGMan reads the rescue file and resubmits the failed node]
Recovering a DAG (cont’d)
Once that job completes, DAGMan will continue the DAG as if the failure never happened.
[diagram: the remaining node is submitted to the Condor job queue]
Finishing a DAG
Once the DAG is complete, the DAGMan job itself is finished, and exits.
[diagram: the queue drains and DAGMan exits]
Additional DAGMan Features
Provides other handy features for job management…
Nodes can have PRE & POST scripts
Failed nodes can be automatically re-tried a configurable number of times
Job submission can be “throttled”
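A hedged sketch of how these features appear in a DAG file (the node name A and the script names are placeholders; throttling is set at submit time, e.g. condor_submit_dag -maxjobs 5 my.dag):

```
# Node A runs a.sub, with scripts before and after, retried up to 3 times
Job A a.sub
SCRIPT PRE  A setup.sh
SCRIPT POST A check_results.sh
RETRY A 3
```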
What about Licensed Jobs?
e.g. matlab
Site license?
matlab compiler
Octave
Chirp
condor_chirp get_file remote local
condor_chirp put_file local remote
General User Commands
condor_status        View Pool Status
condor_q             View Job Queue
condor_submit        Submit new Jobs
condor_rm            Remove Jobs
condor_prio          Intra-User Prios
condor_history       Completed Job Info
condor_submit_dag    Submit new DAG
condor_checkpoint    Force a checkpoint
condor_compile       Link Condor library
Statistical Bootstrap
Build up from the worker side out
The matlab/octave worker, worker.m:

#!/s/octave/bin/octave -q
load "subset" subset;
subset = subset(floor(rand(10,1) .* 1000));
printf("%f ", mean(subset));
Run the worker alone
(won’t work – why?)
chmod 0755 worker.m
./worker.m
Create the initial data
driver.m
#!/s/octave/bin/octave -qf
dist_size = 100000;
d = rand(dist_size, 1) .* 500;
subset = d(floor(rand(1000,1) .* 100000));
save "subset" subset;
Submit file
universe = vanilla
executable = worker.m
should_transfer_files = true
when_to_transfer_output = on_exit
transfer_input_files = subset
output = mean.$(PROCESS)
error = foo
log = log
queue 10
And submit the job…
condor_submit submit
Add the submission to
the driver script…
#!/s/octave/bin/octave -qf
dist_size = 100000;
d = rand(dist_size, 1) .* 500;
subset = d(floor(rand(1000,1) .* 100000));
save "subset" subset;
system("condor_submit submit");
system("condor_wait log");
And run the driver!
./driver.m
Master – Worker:
Many very short jobs
Condor doesn’t run short jobs well: time is needed to transmit the executable, data, and results.
Condor doesn’t deal directly with parallel algorithms.
You can have a process on the user’s workstation generate waves of “worker” jobs to run in parallel, but:
each worker job must be scheduled anew in the Condor pool, and
the master application has to handle all the details of scheduling, rescheduling after faults, managing inputs and outputs to workers, etc.
Master-Worker (MW) addresses these issues!
Master – Worker:
Master assigns tasks to the workers
Workers perform tasks, and report results back to the master
Workers do not communicate (except through the master)
Simple!
Fault-tolerant
Dynamic
Programming model reusable across many applications.
Master – Worker:
Data common to all tasks is sent to workers only once
(Try to) retain workers until the whole computation is complete – don’t release them after a single task is done.
These features make for much higher parallel efficiency:
We now need to transmit much less data between master and workers.
We avoid the overhead of putting each task on the Condor queue and waiting for it to be allocated to a processor.
Three abstractions in the master-worker paradigm: Master, Worker, and Task.
The MW package encapsulates these abstractions as C++ abstract classes
The user writes 10 functions (templates and skeletons supplied in the distribution)
The MWized code will adapt transparently to the dynamic and heterogeneous environment
The back side of MW interfaces to resource management
MW Functions
MWMaster
  get_userinfo()
  setup_initial_tasks()
  pack_worker_init_data()
  act_on_completed_task()
MWTask
  (un)pack_work()
  (un)pack_result()
MWWorker
  unpack_worker_init_data()
  execute_task()
But Wait, there’s more..
User-defined checkpointing of the master. (Don’t lose the whole run if the master crashes.)
(Rudimentary) task scheduling:
MW assigns the first task to the first idle worker
Lists of tasks and workers can be arbitrarily ordered and reordered
The user can set task rescheduling policies
User-defined benchmarking:
A (user-defined) task is sent to each worker upon initialization
By accumulating normalized task CPU time, MW computes a performance statistic that is comparable between runs, even though the properties of the pool may differ between runs.
There’s an App for that..
MWFATCOP (Chen, Ferris, Linderoth) – a branch-and-cut code for linear integer programming
MWQAP (Anstreicher, Brixius, Goux, Linderoth) – a branch-and-bound code for solving the quadratic assignment problem
MWATR (Linderoth, Shapiro, Wright) – a trust-region-enhanced cutting plane code for two-stage linear stochastic programming and statistical verification of solution quality
MWKNAP (Glankwamdee, Linderoth) – a simple branch-and-bound knapsack solver
MWAND (Linderoth, Shen) – a nested decomposition-based solver for multistage stochastic linear programming
MWSYMCOP (Linderoth, Margot, Thain) – an LP-based branch-and-bound solver for symmetric integer programs
Other frameworks
CCTools group at Notre Dame:
All Pairs
WaveFront
Makeflow
Condor and Big Data
Big Data driving web development
Web developments driving Big Data
Big Data: Definition
Like Condor for disks
In 2003…
http://labs.google.com/papers/gfs.html
http://labs.google.com/papers/mapreduce.html
Shortly thereafter…
Two main Hadoop parts
For more detail
CondorWeek 2009 talk by Dhruba Borthakur:
http://www.cs.wisc.edu/condor/CondorWeek2009/condor_presentations/borthakur-hadoop_univ_research.ppt
HDFS overview
Making a POSIX distributed file system go fast is easy…
HDFS overview
…if you get rid of the POSIX part
Remove:
Random access
Support for small files
Authentication
In-kernel support
HDFS Overview
Add in:
Data replication (key for distributed systems)
Command line utilities
HDFS Architecture
HDFS Condor Integration
HDFS daemons run under the master
Management/control
Input/Output files can be in HDFS:
Input = hdfs://some/pathname
Parallel convergence checking:
Another DAGman example
Evaluating a function at many points
Check for convergence -> retry
Particle Swarm Optimization
[flow diagram: Prepare -> four parallel Compute nodes -> Converge? -> Done if yes; otherwise back to Compute]
Any Guesses?
Who has thoughts?
Best to work from “inside out”
The job itself.
#!/bin/sh
###### random.sh
echo $RANDOM
exit 0
The submit file
Any guesses?
The submit file
# submitRandom
universe = vanilla
executable = random.sh
Should_transfer_files = yes
When_to_transfer_output = on_exit
output = out
log = log
queue
Next step: the inner DAG
[diagram: the inner DAG – a first node (Node0) fans out to middle nodes Node1 … Node4, which all feed into a last node (Node11)]
The DAG file
Any guesses?
The inner DAG file
Job Node0 submitRandom
Job Node1 submitRandom
Job Node2 submitRandom
Job Node3 submitRandom
PARENT Node0 CHILD Node1
PARENT Node0 CHILD Node2
PARENT Node0 CHILD Node3
Job Node11 submitRandom
PARENT Node1 Node2 Node3 CHILD Node11
Inner DAG
Does this work?
At least one iteration?
How to iterate
DAGMan has simple control structures (makes it reliable)
SUBDAGs!
Remember what happens if post fails?
The Outer Dag
Another Degenerate Dag (But Useful!)
[diagram: a SubDAG node (with retry) followed by a Post script (with exit value)]
This one is easy!
Can you do it yourself?
The outer DAG file
####### Outer.dag #############
SUBDAG EXTERNAL A inner.dag
SCRIPT POST A converge.sh
RETRY A 10

#### converge.sh could look like:
#!/bin/sh
echo "Checking convergence" >> converge
exit 1
Let’s run that…
condor_submit_dag outer.dag
Does it work? How can you tell?
DAGman a bit verbose…
$ condor_submit_dag outer.dag
-----------------------------------------------------------------------
File for submitting this DAG to Condor   : submit.dag.condor.sub
Log of DAGMan debugging messages         : submit.dag.dagman.out
Log of Condor library output : submit.dag.lib.out
Log of Condor library error messages : submit.dag.lib.err
Log of the life of condor_dagman itself : submit.dag.dagman.log
-no_submit given, not submitting DAG to Condor. You can do this with:
"condor_submit submit.dag.condor.sub"
-----------------------------------------------------------------------
-----------------------------------------------------------------------
File for submitting this DAG to Condor : outer.dag.condor.sub
Log of DAGMan debugging messages : outer.dag.dagman.out
Log of Condor library output : outer.dag.lib.out
Log of Condor library error messages : outer.dag.lib.err
Log of the life of condor_dagman itself : outer.dag.dagman.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 721.
-----------------------------------------------------------------------
Debugging helps
Look in the user log file, “log”
Look in the DAGMan debugging log: “foo”.dagman.out
What does converge.sh need?
Note the output files?
How to make them unique?
Add DAG variables to the inner DAG
And to the submitRandom file
The submit file (again)
# submitRandom
universe = vanilla
executable = random.sh
output = out
log = log
queue
The submit file
# submitRandom
universe = vanilla
executable = random.sh
output = out.$(NodeNumber)
log = log
queue
The inner DAG file (again)
Job Node0 submit_pre
Job Node1 submitRandom
Job Node2 submitRandom
Job Node3 submitRandom
PARENT Node0 CHILD Node1
PARENT Node0 CHILD Node2
PARENT Node0 CHILD Node3
Job Node11 submit_post
PARENT Node1 CHILD Node11
PARENT Node2 CHILD Node11
PARENT Node3 CHILD Node11
The inner DAG file (again)
Job Node0 submit_pre
Job Node1 submitRandom
Job Node2 submitRandom
Job Node3 submitRandom
…
VARS Node1 NodeNumber="1"
VARS Node2 NodeNumber="2"
VARS Node3 NodeNumber="3"
…
Then converge.sh sees:
$ ls out.*
out.1 out.10 out.2 out.3 out.4 out.5 out.6 out.7 out.8 out.9
$

And can act accordingly…
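As one illustration of “acting accordingly”, here is a hedged sketch of a convergence check over those files; the out.* naming matches the slides, but the sample values and the 10-sample threshold are assumptions, and the fixture files are created inline so the sketch runs standalone:

```shell
# Simulate ten finished worker jobs, each writing one number to out.N
for i in 1 2 3 4 5 6 7 8 9 10; do
    echo "$i" > "out.$i"
done

# converge.sh logic: once 10 samples exist, report their mean
count=$(ls out.* | wc -l)
if [ "$count" -ge 10 ]; then
    result=$(awk '{ s += $1; n++ } END { printf "mean=%.1f", s / n }' out.*)
    echo "$result"        # a real POST script would exit 0 here
else
    echo "not converged"  # a nonzero exit would trigger DAGMan's RETRY
fi
```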
Thank you!
Check us out on the Web:
http://www.condorproject.org
Email:
condor-admin@cs.wisc.edu