Using Condor - PowerPoint Presentation

olivia-moreira
Uploaded On 2017-04-02



Presentation Transcript

Slide1

Using Condor

An Introduction

ICE 2011

Slide2

The Condor Project

(Established ‘85)

Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff, and students.

Slide3

Condor is a batch computing system

High Throughput Computing (HTC),

not High Performance Computing (HPC)

Originated from desktop cycle scavenging

Slide4

Cycle Scavenging

Good metaphor even for clusters which are dedicated

Slide5

Cycles Are Cheap!

Amazon.com EC2: 10 cents/hour

Academic computing: 4 cents/hour

Opportunistic computing: even cheaper

Slide6

Total Usage between 2011-07-21 and 2011-07-22

Group Usage Summary                User Hours    Pct  Demand
-- ------------------------------  ----------  -----  ------
 1 Physics_Balantekin                  6224.8  16.8%   46.4%
 2 ChE_dePablo                         5932.3  16.0%  100.0%
 3 Astronomy_Friedman                  5764.0  15.5%    0.0%
 4 Economics_Traczynski                4218.4  11.4%   61.1%
 5 Chemistry_Skinner                   4186.5  11.3%   45.4%
 6 BMRB                                1731.5   4.7%   15.6%
 7 Physics_Petriello                   1708.3   4.6%    7.1%
 8 CMS                                 1494.6   4.0%   31.8%
 9 LMCG                                1444.4   3.9%   27.3%
10 Biochem_Sussman                      996.3   2.7%    3.6%
11 Atlas                                847.9   2.3%   79.9%
12 MSE                                  812.5   2.2%    2.9%
---------------------------------  ----------  -----  ------
TOTAL                               37126.7  100.0%  100.0%

Slide7
Slide8

HTC in a nutshell

Work is divided into “jobs”

Cluster of machines is divided into "machines"

HTC runs jobs on machines.

Slide9

Condor Tutorial

Slide10

Definitions

Job

The Condor representation of your work

Machine

The Condor representation of computers that can perform the work

Match Making

Matching a job with a machine ("Resource")

Slide11

Job

Jobs state their requirements and preferences:

I need a Linux/x86 platform

I need a machine with at least 500 MB of memory

I prefer a machine with more memory

Slide12

Machine

Machines state their requirements and preferences:

Run jobs only when there is no keyboard activity

I prefer to run Frieda's jobs

I am a machine in the econ department

Never run jobs belonging to Dr. Smith

Slide13

The Magic of Matchmaking

Jobs and machines state their requirements and preferences

Condor matches jobs with machines based on requirements and preferences

Slide14

Getting Started:

Submitting Jobs to Condor

Overview:

Choose a "Universe" for your job

Make your job "batch-ready"

Create a submit description file

Run condor_submit to put your job in the queue

Slide15

1. Choose the “Universe”

Controls how Condor handles jobs

Choices include:

Vanilla

Standard

Grid

Java

Parallel

VM

Slide16

Using the Vanilla Universe

The Vanilla Universe:

Allows running almost any “serial” job

Provides automatic file transfer, etc.

Like vanilla ice cream

Can be used in just about any situation

Slide17

2. Make your job batch-ready

Must be able to run in the background

No interactive input

No GUI/window clicks

No music ;^)

Slide18

Make your job batch-ready (continued)…

Job can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices

Similar to a UNIX (or DOS) shell:

$ ./myprogram <input.txt >output.txt

Slide19

3. Create a Submit Description File

A plain ASCII text file

Condor does

not

care about file extensions

Tells Condor about your job:

Which executable, universe, input, output and error files to use, command-line arguments, environment variables, any special requirements or preferences (more on this later)

Can describe many jobs at once (a "cluster"), each with different input, arguments, output, etc.

Slide20

Simple Submit Description File

# Simple condor_submit input file

# (Lines beginning with # are comments)

# NOTE: the words on the left side are not

# case sensitive, but filenames are!

Universe = vanilla

Executable = my_job

Output = output.txt

Queue

Slide21

4. Run condor_submit

You give condor_submit the name of the submit file you have created:

condor_submit my_job.submit

condor_submit:

Parses the submit file, checks for errors

Creates a "ClassAd" that describes your job(s)

Puts job(s) in the Job Queue

Slide22

The Job Queue

condor_submit sends your job’s ClassAd(s) to the schedd

The schedd (more details later):

Manages the local job queue

Stores the job in the job queue

Atomic operation, two-phase commit

“Like money in the bank”

View the queue with

condor_q

Slide23

Example

condor_submit and condor_q

%

condor_submit my_job.submit

Submitting job(s).

1 job(s) submitted to cluster 1.

%

condor_q

-- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> :

ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD

1.0 frieda 6/16 06:52 0+00:00:00 I 0 0.0 my_job

1 jobs; 1 idle, 0 running, 0 held

%

Slide24

Input, output & error files

Controlled by submit file settings

You can define the job’s standard input, standard output and standard error:

Read job’s standard input from “input_file”:

Input = input_file

Shell equivalent:

program <input_file

Write job’s standard ouput to “output_file”:

Output = output_file

Shell equivalent:

program >output_file

Write job’s standard error to “error_file”:

Error = error_file

Shell equivalent:

program 2>error_file

Slide25
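The three submit-file settings above map directly onto shell redirection, which can be tried outside Condor; in this sketch /bin/cat stands in for the job and the filenames are purely illustrative:

```shell
# Stand-in "job": cat copies stdin to stdout; filenames are illustrative.
printf 'hello\n' > input_file
cat < input_file > output_file 2> error_file   # the same plumbing Condor sets up
cat output_file    # prints: hello
```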

Email about your job

Condor sends email about job events to the submitting user

Specify "notification" in your submit file to control which events:

Notification = complete (the default)

Notification = never

Notification = error

Notification = always

Slide26

Feedback on your job

Create a log of job events

Add to submit description file:

log = sim.log

Becomes the Life Story of a Job

Shows all events in the life of a job

Always have a log file

Slide27

Sample Condor User Log

000 (0001.000.000) 05/25 19:10:03 Job submitted from host: <128.105.146.14:1816>
...
001 (0001.000.000) 05/25 19:12:17 Job executing on host: <128.105.146.14:1026>
...
005 (0001.000.000) 05/25 19:13:06 Job terminated.
    (1) Normal termination (return value 0)
...

Slide28

Example Submit Description File With Logging

# Example condor_submit input file

# (Lines beginning with # are comments)

# NOTE: the words on the left side are not

# case sensitive, but filenames are!

Universe = vanilla

Executable = /home/frieda/condor/my_job.condor

Log = my_job.log

·Job log (from Condor)

Input = my_job.in

·Program’s standard input

Output = my_job.out

·Program’s standard output

Error = my_job.err

·Program’s standard error

Arguments = -a1 -a2

·Command line arguments

InitialDir = /home/frieda/condor/run

Queue

Slide29

Let's run a job

First, need a terminal emulator

http://www.putty.org (or similar)

Login to chopin.cs.wisc.edu as cguserXX, with the given password

Slide30

Logged In?

source /scratch/setup.sh
mkdir /scratch/your_name
cd /scratch/your_name
condor_q
condor_status

Slide31

Create submit file

nano submit.your_initials

universe = vanilla
executable = /bin/echo
Arguments = hello world
Should_transfer_files = yes
When_to_transfer_output = on_exit
Output = out
Log = log
queue

Slide32

And submit it…

condor_submit submit.your_initials

(wait… remember the HTC bit?)

condor_q XX

cat out

Slide33

A matlab example

#!/s/std/bin/octave -qf
printf "Hello World\n";

Save as Hello.o

chmod 0755 Hello.o

./Hello.o

Slide34

submit file

nano submit.your_initials

universe = vanilla
executable = Hello.o
Should_transfer_files = yes
When_to_transfer_output = on_exit
Output = out
Log = log
queue

Slide35

“Clusters” and “Processes”

If your submit file describes multiple jobs, we call this a “cluster”

Each cluster has a unique “cluster number”

Each job in a cluster is called a “process”

Process numbers always start at zero

A Condor "Job ID" is the cluster number, a period, and the process number (e.g. 2.1)

A cluster can have a single process

Job ID = 20.0

·Cluster 20, process 0

Or, a cluster can have more than one process

Job ID: 21.0, 21.1, 21.2

·Cluster 21, process 0, 1, 2

Slide36

Submit File for a Cluster

# Example submit file for a cluster of 2 jobs

# with separate input, output, error and log files

Universe = vanilla

Executable = my_job

Arguments = -x 0

log = my_job_0.log

Input = my_job_0.in

Output = my_job_0.out

Error = my_job_0.err

Queue

·Job 2.0 (cluster 2, process 0)

Arguments = -x 1

log = my_job_1.log

Input = my_job_1.in

Output = my_job_1.out

Error = my_job_1.err

Queue

·Job 2.1 (cluster 2, process 1)

Slide37

%

condor_submit my_job.submit-file

Submitting job(s).

2 job(s) submitted to cluster 2.

%

condor_q

-- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> :

ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD

1.0 frieda 4/15 06:52 0+00:02:11 R 0 0.0 my_job –a1 –a2

2.0

frieda 4/15 06:56 0+00:00:00

I

0 0.0

my_job –x 0

2.1

frieda 4/15 06:56 0+00:00:00

I

0 0.0

my_job –x 1

3 jobs; 2 idle, 1 running, 0 held

%

Submitting The Job

Slide38

Organize your files and directories for big runs

Create subdirectories for each "run"

run_0, run_1, … run_599

Create input files in each of these

run_0/simulation.in
run_1/simulation.in
…
run_599/simulation.in

The output, error & log files for each job will be created by Condor from your job's output

Slide39

Submit Description File for 600 Jobs

# Cluster of 600 jobs with different directories

Universe = vanilla

Executable = sim

Log = simulation.log

...

Arguments = -x 0

InitialDir = run_0

·Log, input, output & error files -> run_0

Queue

·Job 3.0 (Cluster 3, Process 0)

Arguments = -x 1

InitialDir = run_1

·Log, input, output & error files -> run_1

Queue

·Job 3.1 (Cluster 3, Process 1)

·Do this 598 more times…

Slide40

Submit File for a Big

Cluster of Jobs

We just submitted 1 cluster with 600 processes

All the input/output files will be in different directories

The submit file is pretty unwieldy (over 1200 lines)

Isn't there a better way?

Slide41

Submit File for a Big

Cluster of Jobs (the better way) #1

We can queue all 600 in 1 “Queue” command

Queue 600

Condor provides $(Process) and $(Cluster)

$(Process)

will be expanded to the process number for each job in the cluster

0, 1, … 599

$(Cluster)

will be expanded to the cluster number

Will be 4 for all jobs in this cluster

Slide42

Submit File for a Big

Cluster of Jobs (the better way) #2

The initial directory for each job can be specified using

$(Process)

InitialDir = run_$(Process)

Condor will expand these to “

run_0

”, “

run_1

”, … “

run_599

” directories

Similarly, arguments can be variable

Arguments = -x $(Process)

Condor will expand these to

"-x 0", "-x 1", … "-x 599"

Slide43
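The expansion described above is plain textual substitution; this shell loop mimics it for a small "Queue 3" cluster (an illustration of the idea, not Condor's actual implementation):

```shell
# Mimic $(Process) expansion for "Queue 3": one InitialDir line per process.
for PROCESS in 0 1 2; do
  echo "InitialDir = run_${PROCESS}"
done > expanded.sub
cat expanded.sub    # run_0, run_1, run_2
```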

Better Submit File for 600 Jobs

# Example condor_submit input file that defines

# a cluster of 600 jobs with different directories

Universe = vanilla

Executable = my_job

Log = my_job.log

Input = my_job.in

Output = my_job.out

Error = my_job.err

Arguments = –x $(Process)

·–x 0, -x 1, … -x 599

InitialDir = run_$(Process)

·run_0 … run_599

Queue 600

·Jobs 4.0 … 4.599

Slide44

Now, we submit it…

$ condor_submit my_job.submit

Submitting job(s) ............................................................

Logging submit event(s) ......................................................

600 job(s) submitted to cluster 4.

Slide45

And, Check the queue

$

condor_q

-- Submitter: x.cs.wisc.edu : <128.105.121.53:510> : x.cs.wisc.edu

ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD

4.0 frieda 4/20 12:08 0+00:00:05 R 0 9.8 my_job -arg1 –x 0

4.1 frieda 4/20 12:08 0+00:00:03 I 0 9.8 my_job -arg1 –x 1

4.2 frieda 4/20 12:08 0+00:00:01 I 0 9.8 my_job -arg1 –x 2

4.3 frieda 4/20 12:08 0+00:00:00 I 0 9.8 my_job -arg1 –x 3

...

4.598 frieda 4/20 12:08 0+00:00:00 I 0 9.8 my_job -arg1 –x 598

4.599 frieda 4/20 12:08 0+00:00:00 I 0 9.8 my_job -arg1 –x 599

600 jobs; 599 idle, 1 running, 0 held

Slide46

Removing jobs

If you want to remove a job from the Condor queue, you use

condor_rm

You can only remove jobs that you own

Privileged user can remove any jobs

“root” on UNIX

"administrator" on Windows

Slide47

Removing jobs (continued)

Remove an entire cluster:

condor_rm 4

·Removes the whole cluster

Remove a specific job from a cluster:

condor_rm 4.0

·

Removes a single job

Or, remove all of your jobs with "-a":

condor_rm -a

·Removes all jobs / clusters

Slide48

Submit cluster of 10 jobs

nano submit

universe = vanilla
executable = /bin/echo
Should_transfer_files = yes
When_to_transfer_output = on_exit
Arguments = hello world $(PROCESS)
Output = out.$(PROCESS)
Log = log
Queue 10

Slide49

And submit it…

condor_submit submit

(wait…)

condor_q XX

cat log

cat out.YY

Slide50

My new jobs run for 20 days…

What happens when a job is forced off its CPU?

Preempted by higher priority user or job

Vacated because of user activity

How can I add fault tolerance to my jobs?

Slide51

Condor's Standard Universe to the rescue!

Support for transparent process checkpoint and restart

Remote system calls (remote I/O)

Your job can read / write files as if they were local

Slide52

Remote System Calls in

the Standard Universe

I/O system calls are trapped and sent back to the submit machine

Examples: open a file, write to a file

No source code changes typically required

Programming language independent

Slide53

Process Checkpointing in the

Standard Universe

Condor’s process checkpointing provides a mechanism to automatically save the state of a job

The process can then be restarted

from right where it was checkpointed

After preemption, crash, etc.

Slide54
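The save-and-resume idea can be illustrated with an ordinary shell loop (a toy sketch, not Condor's actual mechanism): the state is written to a file after every step, so a rerun picks up from the last saved value instead of starting over.

```shell
# Toy checkpointing: persist the loop counter after every iteration.
i=$(cat ckpt 2>/dev/null || echo 0)   # resume from the checkpoint if present
while [ "$i" -lt 5 ]; do
  i=$((i + 1))
  echo "$i" > ckpt                    # save state; a restarted run resumes here
done
echo "done at $i"
```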

Checkpointing: Process Starts

checkpoint: the entire state of a program, saved in a file

CPU registers, memory image, I/O

[timeline figure]

Slide55

Checkpointing: Process Checkpointed

[timeline figure: checkpoints 1, 2, 3 taken as time passes]

Slide56

Checkpointing: Process Killed

[timeline figure: the process is killed after checkpoint 3]

Slide57

Checkpointing: Process Resumed

[timeline figure: work up to checkpoint 3 is "goodput", work lost after it is "badput"; the resumed run produces goodput again]

Slide58

When will Condor checkpoint your job?

Periodically, if desired

For fault tolerance

When your job is preempted by a higher priority job

When your job is vacated because the execution machine becomes busy

When you explicitly run the condor_checkpoint, condor_vacate, condor_off, or condor_restart commands

Slide59

Making the Standard Universe Work

The job must be relinked with Condor's standard universe support library

To relink, place condor_compile in front of the command used to link the job:

% condor_compile gcc -o myjob myjob.c

- OR -

% condor_compile f77 -o myjob filea.f fileb.f

- OR -

% condor_compile make -f MyMakefile

Slide60

Limitations of the

Standard Universe

Condor's checkpointing is not at the kernel level.

In the Standard Universe the job may not:

fork()

Use kernel threads

Use some forms of IPC, such as pipes and shared memory

Must have access to source code to relink

Many typical scientific jobs are OK

Slide61

Submitting Std uni job

#include <stdio.h>

int main(int argc, char **argv) {
    int i;
    for(i = 0; i < 10000000; i++) {}
    return 0;
}

Slide62

And submit…

condor_compile gcc -o foo foo.c

-- Change "vanilla" to "standard"

-- Change "/bin/echo" to "foo" (or above)

Slide63

My jobs have dependencies…

Can Condor help solve my dependency problems?

Slide64

Condor Universes:

Scheduler and Local

Scheduler Universe

Plug in a meta-scheduler

Developed for DAGMan (more later)

Similar to Globus’s fork job manager

Local

Very similar to vanilla, but jobs run on the local host

Has more control over jobs than the scheduler universe

Slide65

DAGMan

Directed Acyclic Graph Manager

DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.

(e.g., "Don't run job B until job A has completed successfully.")

Slide66

What is a DAG?

A DAG is the data structure used by DAGMan to represent these dependencies.

Each job is a "node" in the DAG.

Each node can have any number of "parent" or "children" nodes, as long as there are no loops!

[diamond DAG figure: Job A → Job B, Job C → Job D]

Slide67

Defining a DAG

A DAG is defined by a .dag file, listing each of its nodes and their dependencies:

# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D

each node will run the Condor job specified by its accompanying Condor submit file

Slide68
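The no-loops rule can be checked mechanically with the standard tsort(1) utility, which topologically sorts an edge list and fails if there is a cycle. Here are the diamond DAG's dependencies written as parent/child edges (the file name is illustrative):

```shell
# Edges of the diamond DAG above: "parent child", one pair per line.
cat > edges.txt <<'EOF'
A B
A C
B D
C D
EOF
tsort edges.txt    # prints a valid running order; exits non-zero on a loop
```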

Submitting a DAG

To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:

% condor_submit_dag diamond.dag

condor_submit_dag is run by the schedd

The DAGMan daemon itself is "watched" by Condor, so you don't have to

Slide69

DAGMan: Running a DAG

DAGMan acts as a "meta-scheduler", managing the submission of your jobs to Condor based on the DAG dependencies.

[figure: DAGMan reads the .dag file and submits ready nodes to the Condor job queue]

Slide70

DAGMan: Running a DAG (cont'd)

DAGMan holds & submits jobs to the Condor queue at the appropriate times.

[figure: as parent nodes complete, DAGMan submits their children to the queue]

Slide71

Running a DAG (cont'd)

In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a "rescue" file with the current state of the DAG.

[figure: a node fails; DAGMan writes a rescue file recording the DAG's state]

Slide72

Recovering a DAG

Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.

[figure: DAGMan reloads the rescue file and resubmits the failed node]

Slide73

DAGMan: Recovering a DAG (cont'd)

Once that job completes, DAGMan will continue the DAG as if the failure never happened.

[figure: the DAG proceeds; the remaining nodes enter the queue]

Slide74

DAGMan: Finishing a DAG

Once the DAG is complete, the DAGMan job itself is finished, and exits.

[figure: all nodes complete; the queue empties]

Slide75

Additional DAGMan Features

Provides other handy features for job management…

nodes can have PRE & POST scripts

failed nodes can be automatically re-tried a configurable number of times

job submission can be "throttled"

Slide76
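Put together, those features might look like this in a .dag file; the node, submit-file, and script names here are made up for illustration, and throttling can also be set globally with condor_submit_dag's -maxjobs option:

```
# Hypothetical fragment: PRE/POST scripts, retries, and throttling
Job A a.sub
SCRIPT PRE  A prepare_inputs.sh
SCRIPT POST A check_outputs.sh
RETRY A 3           # re-run node A up to 3 times on failure
CATEGORY A bulk
MAXJOBS bulk 50     # at most 50 "bulk" jobs submitted at once
```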

What about Licensed Jobs?

e.g. matlab

Site license?

matlab compiler

Octave

Slide77

Chirp

condor_chirp get_file remote local

condor_chirp put_file local remote

Slide78

General User Commands

condor_status

View Pool Status

condor_q

View Job Queue

condor_submit

Submit new Jobs

condor_rm

Remove Jobs

condor_prio

Intra-User Prios

condor_history

Completed Job Info

condor_submit_dag

Submit new DAG

condor_checkpoint

Force a checkpoint

condor_compile

Link Condor library

Slide79

Statistical Bootstrap

Build up from the worker side out

The matlab/octave worker, worker.m:

#!/s/octave/bin/octave -q
load "subset" subset;
subset = subset(floor(rand(10,1) .* 1000));
printf("%f ", mean(subset));

Slide80

Run the worker alone

(won't work – why?)

chmod 0755 worker.m
./worker.m

Slide81

Create the initial data

driver.m

#!/s/octave/bin/octave -qf
dist_size = 100000;
d = rand(dist_size, 1) .* 500;
subset = d(floor(rand(1000,1) .* 100000));
save "subset" subset;

Slide82

Submit file

universe = vanilla
executable = worker.m
should_transfer_files = true
when_to_transfer_output = on_exit
transfer_input_files = subset
output = mean.$(PROCESS)
error = foo
log = log
queue 10

Slide83

And submit the job…

condor_submit submit

Slide84

Add the submission to the driver script…

#!/s/octave/bin/octave -qf
dist_size = 100000;
d = rand(dist_size, 1) .* 500;
subset = d(floor(rand(1000,1) .* 100000));
save "subset" subset;
system("condor_submit submit");
system("condor_wait log");

Slide85

And run the driver!

./driver.m

Slide86
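For readers without Octave, each worker's resampling step can be sketched in shell with awk (the data values and seed are illustrative stand-ins for the saved "subset"): draw 10 values with replacement and report their mean.

```shell
# Generate stand-in data, then resample 10 values and write their mean.
seq 1 1000 > subset.txt
awk 'BEGIN { srand(1) }
     { v[NR] = $1 }
     END { for (i = 0; i < 10; i++) s += v[int(rand() * NR) + 1];
           printf "%f\n", s / 10 }' subset.txt > mean.0
cat mean.0    # one bootstrap mean, somewhere in 1..1000
```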

Master – Worker: Many very short jobs

Condor doesn't run short jobs well: time is needed to transmit the executable/data/results.

Condor doesn't deal directly with parallel algorithms.

Can have a process on the user's workstation generating waves of "worker" jobs to run in parallel, but

each worker job must be scheduled anew in the Condor pool, and

the master application has to handle all the details of scheduling, rescheduling after faults, managing input and outputs to workers, etc.

Master-Worker (MW) addresses these issues!

Slide87

Master – Worker:

Master assigns tasks to the workers

Workers perform tasks, and report results back to master

Workers do not communicate (except through the master)

Simple!

Fault-tolerant

Dynamic

Programming model reusable across many applications.

Slide88

Master – Worker:

Data common to all tasks is sent to workers only once

(Try to) Retain workers until the whole computation is complete; don't release them after a single task is done.

These features make for much higher parallel efficiency.

We now need to transmit much less data between master and workers.

We avoid the overhead of putting each task on the Condor queue and waiting for it to be allocated to a processor.

Slide89
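A toy version of the pattern can be run locally with xargs -P acting as the "master" that keeps two "workers" busy; the squaring task and result file names are invented for illustration:

```shell
# Master: feed task IDs 0..3 to 2 parallel workers; each worker writes
# its result (the square of its task ID) to its own file.
seq 0 3 | xargs -P 2 -I{} sh -c 'echo $(( {} * {} )) > result.{}'
cat result.0 result.1 result.2 result.3    # 0, 1, 4, 9
```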

Three abstractions in the master-worker paradigm:

Master, Worker, and Task.

The MW package encapsulates these abstractions

C++ abstract classes

User writes 10 functions (Templates and skeletons supplied in distribution)

The MWized code will adapt transparently to the dynamic and heterogeneous environment

The back side of MW interfaces to resource management

Slide90

MW Functions

MWMaster

get_userinfo()
setup_initial_tasks()
pack_worker_init_data()
act_on_completed_task()

MWTask

(un)pack_work()
(un)pack_result()

MWWorker

unpack_worker_init_data()
execute_task()

Slide91

But Wait, there's more..

User-defined checkpointing of master. (Don't lose the whole run if the master crashes.)

(Rudimentary) Task Scheduling

MW assigns first task to first idle worker

Lists of tasks and workers can be arbitrarily ordered and reordered

User can set task rescheduling policies

User-defined benchmarking

A (user-defined) task is sent to each worker upon initialization

By accumulating normalized task CPU time, MW computes a performance statistic that is comparable between runs, though the properties of the pool may differ between runs.

Slide92

There's an App for that..

MWFATCOP (Chen, Ferris, Linderoth) – A branch and cut code for linear integer programming

MWQAP (Anstreicher, Brixius, Goux, Linderoth) – A branch-and-bound code for solving the quadratic assignment problem

MWATR (Linderoth, Shapiro, Wright) – A trust-region-enhanced cutting plane code for two-stage linear stochastic programming and statistical verification of solution quality

MWKNAP (Glankwamdee, Linderoth) – A simple branch-and-bound knapsack solver

MWAND (Linderoth, Shen) – A nested decomposition-based solver for multistage stochastic linear programming

MWSYMCOP (Linderoth, Margot, Thain) – An LP-based branch-and-bound solver for symmetric integer programs

Slide93

Other frameworks

CCTools group at Notre Dame

All Pairs

WaveFront

Makeflow

Slide94

Condor and Big Data

Big Data driving web development

Web developments driving Big Data

Big Data: Definition

Like Condor for disks

Slide95

In 2003…

http://labs.google.com/papers/gfs.html

http://labs.google.com/papers/mapreduce.html

Slide96
Slide97
Slide98

Shortly thereafter…

Slide99

Two main Hadoop parts

Slide100

For more detail

CondorWeek 2009 talk by Dhruba Borthakur

http://www.cs.wisc.edu/condor/CondorWeek2009/condor_presentations/borthakur-hadoop_univ_research.ppt

Slide101
Slide102

HDFS overview

Making a POSIX distributed file system go fast is easy…

Slide103

HDFS overview

…if you get rid of the POSIX part

Remove:

Random access

Support for small files

Authentication

In-kernel support

Slide104

HDFS Overview

Add in:

Data replication (key for distributed systems)

Command line utilities

Slide105

HDFS Architecture

Slide106

HDFS Condor Integration

HDFS daemons run under the master

Management/control

Input/Output files can be in hdfs:

Input = hdfs://some/pathname

Slide107

Parallel convergence checking: Another DAGMan example

Evaluating a function at many points

Check for convergence -> retry

Particle Swarm Optimization

Slide108

[flowchart: Prepare → Compute (four parallel nodes) → Converge? → Yes: Done / No: retry]

Slide109

Any Guesses?

Who has thoughts?

Best to work from "inside out"

Slide110

The job itself.

#!/bin/sh
###### random.sh
echo $RANDOM
exit 0

Slide111

The submit file

Any guesses?

Slide112

The submit file

# submitRandom
universe = vanilla
executable = random.sh
Should_transfer_files = yes
When_to_transfer_output = on_exit
output = out
log = log
queue

Slide113

Next step: the inner DAG

[DAG figure: a first node (Node0) fans out to worker nodes (Node1, Node2, Node3, Node4, …); all feed into the last node (Node11)]

Slide114

The DAG file

Any guesses?

Slide115

The inner DAG file

Job Node0 submitRandom
Job Node1 submitRandom
Job Node2 submitRandom
Job Node3 submitRandom
PARENT Node0 CHILD Node1
PARENT Node0 CHILD Node2
PARENT Node0 CHILD Node3
Job Node11 submitRandom
PARENT Node1 Node2 Node3 CHILD Node11

Slide116

Inner DAG

Does this work?

At least one iteration?

Slide117

How to iterate

DAGMan has simple control structures

(Makes it reliable)

SUBDAGs!

Remember what happens if post fails?

Slide118

The Outer DAG

Another Degenerate DAG (But Useful!)

[figure: a single SubDag node (with retry) followed by a Post Script (with exit value)]

Slide119

This one is easy!

Can you do it yourself?

Slide120

The outer DAG file

####### Outer.dag #############
SUBDAG EXTERNAL A inner.dag
SCRIPT POST A converge.sh
RETRY A 10

#### converge.sh could look like
#!/bin/sh
echo "Checking convergence" >> converge
exit 1

Slide121

Let's run that…

condor_submit_dag outer.dag

Does it work? How can you tell?

Slide122

DAGMan is a bit verbose…

$ condor_submit_dag outer.dag

-----------------------------------------------------------------------
File for submitting this DAG to Condor  : submit.dag.condor.sub
Log of DAGMan debugging messages        : submit.dag.dagman.out
Log of Condor library output            : submit.dag.lib.out
Log of Condor library error messages    : submit.dag.lib.err
Log of the life of condor_dagman itself : submit.dag.dagman.log
-no_submit given, not submitting DAG to Condor. You can do this with:
"condor_submit submit.dag.condor.sub"
-----------------------------------------------------------------------
-----------------------------------------------------------------------
File for submitting this DAG to Condor  : outer.dag.condor.sub
Log of DAGMan debugging messages        : outer.dag.dagman.out
Log of Condor library output            : outer.dag.lib.out
Log of Condor library error messages    : outer.dag.lib.err
Log of the life of condor_dagman itself : outer.dag.dagman.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 721.
-----------------------------------------------------------------------

Slide123

Debugging helps

Look in the user log file, "log"

Look in the DAGMan debugging log, "foo".dagman.out

Slide124

What does converge.sh need?

Note the output files?

How to make them unique?

Add DAG variables to the inner DAG

And the submitRandom file

Slide125

The submit file (again)

# submitRandom
universe = vanilla
executable = random.sh
output = out
log = log
queue

Slide126

The submit file

# submitRandom
universe = vanilla
executable = random.sh
output = out.$(NodeNumber)
log = log
queue

Slide127

The inner DAG file (again)

Job Node0 submit_pre
Job Node1 submitRandom
Job Node2 submitRandom
Job Node3 submitRandom
PARENT Node0 CHILD Node1
PARENT Node0 CHILD Node2
PARENT Node0 CHILD Node3
Job Node11 submit_post
PARENT Node1 CHILD Node11
PARENT Node2 CHILD Node11
PARENT Node3 CHILD Node11

Slide128

The inner DAG file (again)

Job Node0 submit_pre
Job Node1 submitRandom
Job Node2 submitRandom
Job Node3 submitRandom
…
VARS Node1 NodeNumber="1"
VARS Node2 NodeNumber="2"
VARS Node3 NodeNumber="3"

Slide129

Then converge.sh sees:

$ ls out.*
out.1 out.10 out.2 out.3 out.4 out.5 out.6 out.7 out.8 out.9
$

And can act accordingly…

Slide130
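One way converge.sh could "act accordingly" is to average the out.* files and signal convergence through its exit status; here is a sketch with stand-in worker output (the values and the pass/fail criterion are invented for illustration):

```shell
# Stand-in worker output, then the averaging step converge.sh might do.
for n in 1 2 3; do echo $((n * 10)) > out.$n; done
mean=$(cat out.* | awk '{ s += $1; n++ } END { print s / n }')
echo "mean is $mean"    # mean is 20
[ "$mean" = "20" ]      # exit 0 only if the (toy) criterion is met
```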

Thank you!

Check us out on the Web:

http://www.condorproject.org

Email:

condor-admin@cs.wisc.edu