
Ian C. Smith* - PowerPoint Presentation

Uploaded On 2015-11-12



Presentation Transcript

Slide1

Ian C. Smith*

Harvesting unused clock cycles with Condor

*Advanced Research Computing

The University of Liverpool

Slide2

Overview

what is Condor ?

High Performance versus High Throughput Computing

Condor fundamentals

setting up and running a Condor Pool

The University of Liverpool Condor Pool

example applications

Slide3

What is Condor ?

a specialized system for delivering High Throughput Computing

a harvester of unused computing resources

developed by Computer Science Dept at University of Wisconsin in late '80s

free and (now) open source software

widely used in academia and increasingly in industry

available for many platforms: Linux, Solaris, AIX, Windows XP/Vista/7, Mac OS

Slide4

HPC vs HTC (1)

High Performance Computing (HPC)

delivers large amounts of computing power over relatively short periods of time (peak FLOPS ratings important)

can also provide lots of memory, large amounts of fast (parallel) storage

fairly exotic hardware, may need plenty of TLC

large capital outlay on hardware

need to run specialised parallel (MPI) codes to get the benefit (can run serial codes but these are a poor use of resources)

users run relatively small numbers of parallel jobs

essential for certain time-critical applications

Slide5

HPC vs HTC (2)

High Throughput Computing (HTC)

allows many computational tasks to be completed over a long period of time (peak FLOPS ratings not so important)

users more concerned with running large numbers of jobs over a long time span than a few short burst computations

makes use of existing commodity hardware (e.g. desktop PCs)

small capital outlay on hardware possible

limited memory and storage available generally

mostly aimed at running concurrent serial jobs (although MPI and PVM are supported by Condor)

Slide6

Types of Condor application

large numbers of independent calculations typically (“pleasantly parallel”)

data parallel applications – split large datasets into smaller parts and analyse independently

biological sequence analysis

processing of census data

optimisation problems

microprocessor design and testing

applications based on Monte Carlo methods

radiotherapy treatment analysis

epidemiological studies

Slide7

A “typical” Condor pool

[Diagram: a central manager, a submit/execute host, a submit host and two groups of execute hosts]

Slide8

A “typical” Condor pool

[Diagram: the same pool, annotated with "ClassAds"]

Slide9

A “typical” Condor pool

[Diagram: the same pool, annotated with "Match Info"]

Slide10

A “typical” Condor pool

[Diagram: the same pool, annotated with "Jobs"]

Slide11

A “typical” Condor pool

[Diagram: the same pool, annotated with "Output"]

Slide12

ClassAds and Matchmaking

ClassAds are a fundamental part of Condor

similar to classified advertisements in a paper

“Job Ads” represent jobs to Condor (similar to “wanted” ads)

“Machine Ads” represent compute resources in a Condor Pool (similar to “for sale” ads)

Condor central manager matches Machine Ads to Job Ads and hence machines to jobs

Job Ads are created using submit description files

Slide13
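The matchmaking described above can be pictured with a toy model. The sketch below is not Condor's ClassAd language or its negotiator: it assumes, purely for illustration, that ads are Python dictionaries and that a job's requirements and rank are Python functions. The machine names (pc01 and so on) and attribute values are made up.

```python
# Toy illustration of ClassAd-style matchmaking (NOT Condor's real implementation).
# All machine names and attribute values below are hypothetical.

machine_ads = [
    {"Name": "pc01", "Arch": "Intel", "OpSys": "WINNT51", "Memory": 2048, "Kflops": 900000},
    {"Name": "pc02", "Arch": "Intel", "OpSys": "LINUX", "Memory": 4096, "Kflops": 1200000},
    {"Name": "pc03", "Arch": "Intel", "OpSys": "WINNT51", "Memory": 1024, "Kflops": 1500000},
    {"Name": "pc04", "Arch": "Intel", "OpSys": "WINNT51", "Memory": 2048, "Kflops": 1100000},
]


def job_requirements(machine):
    """The "wanted" part of the job ad: which machines are acceptable."""
    return (
        machine["Arch"] == "Intel"
        and machine["OpSys"] == "WINNT51"
        and machine["Memory"] >= 2000
    )


def job_rank(machine):
    """Preference among acceptable machines: higher is better."""
    return machine["Kflops"]


def match(machines, requirements, rank):
    """Return the acceptable machine with the highest rank, or None if none qualifies."""
    candidates = [m for m in machines if requirements(m)]
    if not candidates:
        return None
    return max(candidates, key=rank)


best = match(machine_ads, job_requirements, job_rank)
print(best["Name"] if best else "no match")  # prints pc04
```

In Condor itself the machine ads also carry their own requirements, so matching is two-sided; the sketch leaves that out to keep the idea visible.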

Simple submit description file

# simple submit description file
# (a line starting with # is a comment and is ignored by Condor)
# this would be used for Windows XP based execute hosts

universe = vanilla

# what to run
executable = example.exe

# job's standard output
output = stdout.out$(PROCESS)

# log job's activities
log = mylog.log$(PROCESS)

# input files needed
transfer_input_files = common.txt, myinput$(PROCESS).txt

# what machines to run on
requirements = ( Arch=="Intel" ) && ( OpSys=="WINNT51" )

# number of jobs to queue
queue 2

Slide14
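To see what "queue 2" produces from the submit description file above, the sketch below expands the $(PROCESS) macro once per queued job. It is only a string substitution written for illustration and does not call any Condor tool; the helper name expand_process is hypothetical. It relies on $(PROCESS) counting from 0.

```python
# Illustrative only: mimic how $(PROCESS) is replaced by 0, 1, ... for "queue N".
# expand_process is a hypothetical helper, not part of Condor.

TEMPLATE = {
    "output": "stdout.out$(PROCESS)",
    "log": "mylog.log$(PROCESS)",
    "transfer_input_files": "common.txt, myinput$(PROCESS).txt",
}


def expand_process(template, count):
    """Return one dictionary of expanded settings per queued job."""
    jobs = []
    for process in range(count):
        jobs.append(
            {key: value.replace("$(PROCESS)", str(process)) for key, value in template.items()}
        )
    return jobs


for process, settings in enumerate(expand_process(TEMPLATE, 2)):
    print(process, settings)
```

Both jobs share common.txt, while each receives its own input file (myinput0.txt, myinput1.txt) and writes its own output and log files.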

Requirements and Rank

Requirements expression determines where (and when) a job will run e.g.

# Windows XP OS wanted, on an Intel/compatible processor,
# with at least 2 GB memory and at least 32 GB of free disk,
# must have MATLAB installed,
# only run jobs after 5 pm OR at weekends
Requirements = ( OpSys == "WINNT51" ) && \
               ( Arch == "Intel" ) && \
               ( Memory >= 2000 ) && \
               ( Disk >= 33554432 ) && \
               ( HAS_MATLAB == TRUE ) && \
               ( ( ClockMin > 1020 ) || \
                 ( ClockDay == 6 ) || ( ClockDay == 0 ) )

Rank is used to express a preference

# run on machines with best floating point performance first
Rank = Kflops

Slide15
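The numeric constants in the Requirements expression above are easier to read once the units are spelled out. The sketch assumes ClockMin is minutes since midnight, ClockDay is the day of the week with 0 for Sunday and 6 for Saturday, and Disk is measured in KB. The weekend test is written here as ClockDay == 6 or ClockDay == 0. The function name time_window_ok is hypothetical.

```python
# Unit check for the constants used in the Requirements expression.
# Assumptions: ClockMin = minutes since midnight,
# ClockDay = 0 (Sunday) .. 6 (Saturday), Disk in KB.

five_pm_in_minutes = 17 * 60          # 1020, the ClockMin threshold
disk_kb_for_32_gb = 32 * 1024 * 1024  # 33554432, the Disk threshold


def time_window_ok(clock_min, clock_day):
    """True after 5 pm on any day, or at any time on Saturday or Sunday."""
    return clock_min > 1020 or clock_day == 6 or clock_day == 0


print(five_pm_in_minutes, disk_kb_for_32_gb)
print(time_window_ok(9 * 60, 3))   # Wednesday 09:00
print(time_window_ok(18 * 60, 3))  # Wednesday 18:00
print(time_window_ok(9 * 60, 6))   # Saturday 09:00
```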

Job submission and monitoring

[einstein@submit ~]$ condor_submit example.sub
Submitting job(s).
2 job(s) submitted to cluster 100.

[einstein@submit ~]$ condor_q

-- Submitter: submit.chtc.wisc.edu : <128.104.55.9:51883> : submit.chtc.wisc.edu
    ID OWNER       SUBMITTED      RUN_TIME ST PRI  SIZE CMD
   1.0 sagan       7/22 14:19 172+21:28:36 R    0  22.0 checkprogress.cron
   2.0 heisenberg  1/13 13:59   0+00:00:00 I    0   0.0 env
   3.0 hawking     1/15 19:18   0+04:29:33 R    0   0.0 script.sh
   4.0 hawking     1/15 19:33   0+00:00:00 R    0   0.0 script.sh
   5.0 hawking     1/15 19:33   0+00:00:00 H    0   0.0 script.sh
   6.0 hawking     1/15 19:34   0+00:00:00 R    0   0.0 script.sh
...
  96.0 bohr        4/5  13:46   0+00:00:00 I    0   0.0 c2b_dops.sh
  97.0 bohr        4/5  13:46   0+00:00:00 I    0   0.0 c2b_dops.sh
  98.0 bohr        4/5  13:52   0+00:00:00 I    0   0.0 c2b_dopc.sh
  99.0 bohr        4/5  13:52   0+00:00:00 I    0   0.0 c2b_dopc.sh
 100.0 einstein    4/5  13:55   0+00:00:00 I    0   0.0 cosmos

557 jobs; 402 idle, 145 running, 1 held

[einstein@submit ~]$

Slide16

Condor policies

Condor supports a wide range of policies for when to start jobs e.g.

  run jobs only outside office hours

  run jobs only if load average on host is small and there has been no recent activity

  run jobs at any time on one core (at low priority)

  run jobs only submitted by certain users

also a wide choice of what to do when a job is about to be interrupted e.g.

  suspend the job for a limited time then let it resume

  checkpoint the job and migrate it to another machine

  kill off the job immediately

Slide17

UNIX or Windows execute hosts ? (1)

UNIX

Condor’s natural environment

not widely installed on desktop machines (but depends on institution...)

supports the Condor “standard universe” containing many useful features

  checkpointing allows jobs to be migrated from one machine to another without loss of useful work

  Remote Procedure Calls give transparent access to files on submit host

  streaming of standard output (stdout) from jobs to submit host

Network filesystems work well making installation and configuration much easier

leverages large amount of scientific and engineering codes which have been developed under UNIX

Slide18

UNIX or Windows execute hosts ? (2)

Windows

world’s most widely installed OS – rich source of execute hosts

many commercial 3rd party applications run on Windows

using shared (network) filesystems can be difficult under Condor

only supports the “vanilla” Condor universe

  no checkpointing – evicted jobs may waste a lot of cycles

  all input and output files need to be transferred to/from execute host

  output streaming not supported

may be difficult to port “legacy” UNIX codes (although Cygwin and Co-Linux can make life easier)

Windows support from the U-W Condor Team tends to lag behind UNIX

Slide19

Setting up a Condor pool

best to start off small and build up pool slowly

need to understand Condor fundamentals:

  role of Condor processes and how they interact

  life-cycle of jobs

  ClassAds and Matchmaking

avoid firewalls if possible (may be easier said than done ...)

talk to central IT services (particularly network and PC teams)

submit hosts may need to be fairly high spec if large numbers of jobs are to be run - ideally want

  multi-core/processor machine (quad core at least)

  plenty of memory (say 8 GB or more)

  large fast access filestore (e.g. 1 TB RAID)

Slide20

Where to go for help

Read The Fine Manual !

log files contain a lot of useful information

take a look at the presentations, tutorials and “how-to recipes” on the Condor website: (www.cs.wisc.edu/condor)

search the condor-users mail list archive: (lists.cs.wisc.edu/archive/condor-users)

subscribe to the condor-users mail list

join the Campus Grids SIG: (wikis.nesc.ac.uk/escinet/Campus_Grids)

commercial support is also available (e.g. Cycle Computing)

Slide21

University of Liverpool Condor Pool

contains around 400 machines running the University’s Managed Windows Service (currently XP but moving to Windows 7 soon)

most have 2.33 GHz Intel Core 2 processors with 2 GB RAM, 80 GB disk, configured with two job slots / machine

single submission point for Condor jobs provided by Sun Solaris V445 SMP server

policy is to run jobs only if at least 5 minutes of inactivity and low load average during office hours, and at any time outside of office hours

job will be killed off if running when a user logs in to a PC

web interface for specific applications

support for running large numbers of MATLAB jobs

Slide22
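The start and eviction policy of the Liverpool pool described above can be written as a small predicate. This is a Python sketch of the logic only, not Condor configuration. The office-hours window (09:00 to 17:00, Monday to Friday) and the load threshold of 0.3 are assumptions chosen for illustration, since the slide does not give them, and all function names are hypothetical.

```python
# Sketch of the pool's start/eviction policy as plain Python (not Condor config).
# OFFICE_START, OFFICE_END and LOW_LOAD are assumed values for illustration.

OFFICE_START = 9 * 60      # minutes since midnight (assumed 09:00)
OFFICE_END = 17 * 60       # minutes since midnight (assumed 17:00)
LOW_LOAD = 0.3             # assumed load-average threshold
MIN_IDLE_SECONDS = 5 * 60  # "at least 5 minutes of inactivity"


def in_office_hours(clock_min, weekday):
    """weekday uses Python's datetime convention: 0 = Monday .. 6 = Sunday."""
    return weekday < 5 and OFFICE_START <= clock_min < OFFICE_END


def may_start_job(clock_min, weekday, idle_seconds, load_avg):
    """Outside office hours: always. In office hours: only on an idle, lightly loaded PC."""
    if not in_office_hours(clock_min, weekday):
        return True
    return idle_seconds >= MIN_IDLE_SECONDS and load_avg < LOW_LOAD


def must_kill_job(user_logged_in):
    """A running job is killed off as soon as a user logs in to the PC."""
    return user_logged_in


print(may_start_job(10 * 60, 1, 600, 0.1))  # Tuesday 10:00, idle for 10 minutes
print(may_start_job(10 * 60, 1, 60, 0.1))   # Tuesday 10:00, idle for 1 minute
print(may_start_job(22 * 60, 1, 0, 2.0))    # Tuesday 22:00, busy machine
```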

Condor service caveats

only suitable for DOS-based applications running in batch mode

no communication between processes possible (“pleasantly parallel” applications only)

statically linked executables work best (although can cope with DLLs)

all files needed by application must be present on local disk (cannot access network drives)

shorter jobs more likely to run to completion (10-20 min seems to work best)

very long running jobs can be accommodated using Condor DAGMan or user level check-pointing (details available soon on the Condor website)

Slide23
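User-level checkpointing, mentioned above, means the application itself saves its progress periodically, so that a job restarted after being killed resumes rather than starting again. A minimal sketch, assuming a loop whose state fits in a small JSON file: the file name checkpoint.json, the function names and the stand-in workload (a sum of squares) are all hypothetical, and getting the checkpoint file back to the submit host between runs is left out.

```python
import json
import os

CHECKPOINT = "checkpoint.json"  # hypothetical file name


def load_state(path=CHECKPOINT):
    """Resume from a previous run if a checkpoint exists, otherwise start fresh."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def save_state(state, path=CHECKPOINT):
    """Write to a temporary file first so an interrupted write cannot corrupt the checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(n_steps, checkpoint_every=100, path=CHECKPOINT, stop_after=None):
    """Sum of squares as stand-in work; stop_after simulates the job being killed."""
    state = load_state(path)
    steps_this_run = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            return None  # simulated eviction: work since the last checkpoint is lost
        i = state["next_step"]
        state["total"] += i * i
        state["next_step"] = i + 1
        steps_this_run += 1
        if state["next_step"] % checkpoint_every == 0:
            save_state(state, path)
    save_state(state, path)
    return state["total"]


if __name__ == "__main__":
    print(run(1000))
```

The checkpoint interval trades wasted work against I/O: a job killed between checkpoints only repeats the steps since the last save.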

Running MATLAB jobs under Condor

many users prefer to create applications using MATLAB rather than traditional compiled languages (e.g. FORTRAN, C)

need to create standalone application from M-file(s) using MATLAB compiler

standalone application can run without a MATLAB license

run-time libraries still need to be accessible to MATLAB jobs

nearly all toolbox functions available to standalone applications

simple (but powerful) file I/O makes checkpointing easier

see Liverpool Condor website for more information

Slide24

Power-saving and Green IT at Liverpool

we have around 2 000 centrally managed classroom PCs across campus which were powered up overnight, at weekends and during vacations

original power-saving policy was to power off machines after 30 minutes of inactivity; we now hibernate them after 15 minutes of inactivity

policy has reduced wasteful inactivity time by ~ 200 000 – 250 000 hours per week (equivalent to 20-25 MWh) leading to an estimated saving of approx. £125 000 p.a.

3rd party power management software (PowerMAN) prevents machines hibernating whilst Condor jobs are running

Condor’s own power management features allow machines to be woken up automatically according to demand

Slide25
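The power-saving figures quoted above imply an average power draw per idle PC and an electricity price, neither of which the slide states. The sketch derives both from the quoted numbers, assuming 52 weeks per year. The results are back-of-envelope values, not figures from the presentation, and the function names are hypothetical.

```python
# Back-of-envelope check of the quoted power-saving figures.
# Quoted: 200 000 - 250 000 PC-hours/week, 20-25 MWh/week, approx. GBP 125 000 per year.


def implied_watts_per_pc(mwh_per_week, hours_per_week):
    """Average power of an idle PC implied by the energy and hours saved."""
    return mwh_per_week * 1_000_000 / hours_per_week  # MWh -> Wh, divided by hours


def implied_pence_per_kwh(annual_saving_gbp, mwh_per_week, weeks_per_year=52):
    """Electricity price implied by the annual saving (52 weeks/year is an assumption)."""
    kwh_per_year = mwh_per_week * 1_000 * weeks_per_year
    return 100 * annual_saving_gbp / kwh_per_year


for hours, mwh in ((200_000, 20), (250_000, 25)):
    print(
        f"{hours} h/week: {implied_watts_per_pc(mwh, hours):.0f} W per PC, "
        f"{implied_pence_per_kwh(125_000, mwh):.1f} p/kWh"
    )
```

Both ends of the range correspond to about 100 W per idle PC, and the quoted saving corresponds to roughly 10 to 12 pence per kWh.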

Condor-G and Grid Computing

Condor-G is an extension to Condor allowing job submission to remote resources using Globus

provides familiar Condor-like interface to users hiding the underlying middleware complexity

we have used Condor-G to give users grid access to a variety of HPC resources:

  local HPC clusters (UL-Grid)

  NW-Grid resources at Daresbury Lab, Lancaster and Manchester

  National Grid Service facilities

Grid Computing Server tools provide a batch environment similar to that of cluster systems (e.g. Sun Grid Engine)

Web portal removes the need for command line use completely

Slide26

Radiotherapy example

3D model of normal tissue was developed in which complications are generated when ‘irradiated’ [1]

aim is to provide insight into connection between dose-distribution characteristics, different organ architectures and complication rates beyond that of analytical methods

code written in MATLAB and compiled into standalone executable

set of 800 simulations took ~ 36 hours to run on Condor pool

would require 4-5 months of computing time on a single PC

several dozen sets of simulations have since been completed

[1] Rutkowska E., Baker C.R. and Nahum A.E. Mechanistic simulation of normal-tissue damage in radiotherapy—implications for dose–volume analyses. Phys. Med. Biol. 55 (2010) 2121–2136.

Slide27
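The radiotherapy timings above translate into a speed-up factor once the single-PC estimate is put in hours. The sketch assumes 30-day months and a PC running round the clock; both are assumptions, so the result is only indicative.

```python
# Indicative speed-up for the radiotherapy runs: ~36 hours on the pool
# versus 4-5 months on a single PC (800 simulations per set).

HOURS_PER_MONTH = 30 * 24  # assumption: 30-day months, running continuously
POOL_HOURS = 36
SIMULATIONS = 800


def speedup(single_pc_hours, pool_hours):
    """How many times faster the pool finished than a single PC would have."""
    return single_pc_hours / pool_hours


for months in (4, 5):
    single_pc_hours = months * HOURS_PER_MONTH
    print(
        f"{months} months: about {speedup(single_pc_hours, POOL_HOURS):.0f}x faster, "
        f"{single_pc_hours / SIMULATIONS:.1f} hours per simulation on one PC"
    )
```

Under these assumptions the pool delivered roughly an 80 to 100 times speed-up.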

Personalised Medicine example

project is a Genome-Wide Association Study

aims to identify genetic predictors of response to anti-epileptic drugs

try to identify regions of the human genome that differ between individuals (referred to as SNPs)

800 patients genotyped at 500 000 SNPs along the entire genome

test statistically the association between SNPs and outcomes (e.g. time to withdrawal of drug due to adverse effects)

very large data-parallel problem – ideal for Condor

divide datasets into small partitions so that individual jobs run for 15-30 minutes

batch of 26 chromosomes (2 600 jobs) required ~ 5 hours compute time on Condor but ~ 5 weeks on a single PC

Slide28
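The partitioning rule described above (jobs of 15-30 minutes) can be checked against the quoted totals. The sketch assumes the "~ 5 weeks on a single PC" figure means continuous running, and the helper names are hypothetical.

```python
import math

# Quoted figures: 2 600 jobs, ~5 hours on the pool, ~5 weeks on a single PC.
SINGLE_PC_MINUTES = 5 * 7 * 24 * 60  # assumption: the PC runs round the clock
POOL_HOURS = 5
N_JOBS = 2600


def minutes_per_job(total_minutes, n_jobs):
    """Average run time of one job if the work is split evenly."""
    return total_minutes / n_jobs


def jobs_needed(total_minutes, target_minutes_per_job):
    """Number of equal partitions so that each job runs for about the target time."""
    return math.ceil(total_minutes / target_minutes_per_job)


print(round(minutes_per_job(SINGLE_PC_MINUTES, N_JOBS), 1))  # average job length in minutes
print(jobs_needed(SINGLE_PC_MINUTES, 30), jobs_needed(SINGLE_PC_MINUTES, 15))
print(SINGLE_PC_MINUTES / 60 / POOL_HOURS)  # speed-up versus the pool
```

The implied average of about 19 minutes per job sits inside the 15-30 minute target, which would correspond to somewhere between 1 680 and 3 360 jobs for this amount of work.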

Epidemiology example

researchers have simulated the consequences of an incursion of H5N1 avian influenza into the British poultry flock [2]

Monte Carlo type method - highly parallel

original code written in MATLAB and compiled into standalone application

individual simulations take only 10-15 minutes to run – ideal for Condor

require ~ 10 000 - 20 000 simulations per scenario

would have needed several years of compute time on single machine, on Condor needed a few weeks

[2] Sharkey K.J., Bowers R.G., Morgan K.L., Robinson S.E. and Christley R.M. Epidemiological consequences of an incursion of highly pathogenic H5N1 avian influenza into the British poultry flock. Proc. R. Soc. B 275 (2008) 19–28.

Slide29

Further Information

http://www.liv.ac.uk/e-science/condor

i.c.smith@liverpool.ac.uk