Ian C. Smith*
Harvesting unused clock cycles with Condor
*Advanced Research Computing
The University of Liverpool
Overview
what is Condor?
High Performance versus High Throughput Computing
Condor fundamentals
setting up and running a Condor pool
the University of Liverpool Condor Pool
example applications
What is Condor?

a specialized system for delivering High Throughput Computing
a harvester of unused computing resources
developed by the Computer Science Dept at the University of Wisconsin in the late '80s
free and (now) open source software
widely used in academia and increasingly in industry
available for many platforms: Linux, Solaris, AIX, Windows XP/Vista/7, Mac OS
HPC vs HTC (1)

High Performance Computing (HPC)
delivers large amounts of computing power over relatively short periods of time (peak FLOPS ratings important)
can also provide lots of memory and large amounts of fast (parallel) storage
fairly exotic hardware which may need plenty of TLC
large capital outlay on hardware
need to run specialised parallel (MPI) codes to get the benefit (serial codes can run but are a poor use of resources)
users run relatively small numbers of parallel jobs
essential for certain time-critical applications
HPC vs HTC (2)

High Throughput Computing (HTC)
allows many computational tasks to be completed over a long period of time (peak FLOPS ratings not so important)
users more concerned with running large numbers of jobs over a long time span than a few short-burst computations
makes use of existing commodity hardware (e.g. desktop PCs)
small capital outlay on hardware possible
limited memory and storage generally available
mostly aimed at running concurrent serial jobs (although MPI and PVM are supported by Condor)
Types of Condor application
typically large numbers of independent calculations ("pleasantly parallel")
data parallel applications – split large datasets into smaller parts and analyse them independently
biological sequence analysis
processing of census data
optimisation problems
microprocessor design and testing
applications based on Monte Carlo methods
radiotherapy treatment analysis
epidemiological studies
A "typical" Condor pool

A pool consists of a central manager, one or more submit hosts, and many execute hosts (a single machine can act as both submit and execute host). The matchmaking cycle proceeds as follows:

ClassAds – submit and execute hosts advertise their jobs and resources to the central manager
Match Info – the central manager matches jobs to machines and notifies both parties
Jobs – submit hosts send matched jobs to the execute hosts
Output – results are returned from the execute hosts to the submit host
ClassAds and Matchmaking
ClassAds are a fundamental part of Condor
similar to classified advertisements in a newspaper
"Job Ads" represent jobs to Condor (similar to "wanted" ads)
"Machine Ads" represent compute resources in a Condor pool (similar to "for sale" ads)
the Condor central manager matches Machine Ads to Job Ads, and hence machines to jobs
Job Ads are created using submit description files
Simple submit description file
# simple submit description file
# (anything following a # is a comment and is ignored by Condor)
# this would be used for Windows XP based execute hosts

universe     = vanilla
executable   = example.exe                              # what to run
output       = stdout.out$(PROCESS)                     # job's standard output
log          = mylog.log$(PROCESS)                      # log job's activities
transfer_input_files = common.txt,myinput$(PROCESS).txt # input files needed
requirements = ( Arch == "INTEL" ) && ( OpSys == "WINNT51" )  # what machines to run on
queue 2                                                 # number of jobs to queue
Requirements and Rank
the Requirements expression determines where (and when) a job will run; Rank is used to express a preference, e.g.

Requirements = ( OpSys == "WINNT51" ) && \     # Windows XP OS wanted
               ( Arch == "INTEL" ) && \        # Intel/compatible processor
               ( Memory >= 2000 ) && \         # want at least 2 GB memory and
               ( Disk >= 33554432 ) && \       # at least 32 GB of free disk
               ( HAS_MATLAB == TRUE ) && \     # must have MATLAB installed
               ( ( ClockMin > 1020 ) || \      # only run jobs after 5 pm OR ...
                 ( ClockDay == 6 ) || ( ClockDay == 0 ) )  # at weekends

Rank = KFlops    # run on machines with best floating point performance first
Job submission and monitoring
[einstein@submit ~]$ condor_submit example.sub
Submitting job(s).
2 job(s) submitted to cluster 100.

[einstein@submit ~]$ condor_q

-- Submitter: submit.chtc.wisc.edu : <128.104.55.9:51883> : submit.chtc.wisc.edu
 ID      OWNER       SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   sagan       7/22 14:19 172+21:28:36 R  0  22.0 checkprogress.cron
   2.0   heisenberg  1/13 13:59   0+00:00:00 I  0   0.0 env
   3.0   hawking     1/15 19:18   0+04:29:33 R  0   0.0 script.sh
   4.0   hawking     1/15 19:33   0+00:00:00 R  0   0.0 script.sh
   5.0   hawking     1/15 19:33   0+00:00:00 H  0   0.0 script.sh
   6.0   hawking     1/15 19:34   0+00:00:00 R  0   0.0 script.sh
   ...
  96.0   bohr        4/5  13:46   0+00:00:00 I  0   0.0 c2b_dops.sh
  97.0   bohr        4/5  13:46   0+00:00:00 I  0   0.0 c2b_dops.sh
  98.0   bohr        4/5  13:52   0+00:00:00 I  0   0.0 c2b_dopc.sh
  99.0   bohr        4/5  13:52   0+00:00:00 I  0   0.0 c2b_dopc.sh
 100.0   einstein    4/5  13:55   0+00:00:00 I  0   0.0 cosmos

557 jobs; 402 idle, 145 running, 10 held

[einstein@submit ~]$
Condor policies
Condor supports a wide range of policies for when to start jobs, e.g.
run jobs only outside office hours
run jobs only if the load average on the host is small and there has been no recent activity
run jobs at any time on one core (at low priority)
run jobs only if submitted by certain users
there is also a wide choice of what to do when a job is about to be interrupted, e.g.
suspend the job for a limited time then let it resume
checkpoint the job and migrate it to another machine
kill off the job immediately
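Policies like these are written as expressions in the execute host's configuration file. A minimal sketch using Condor's standard policy macros (the thresholds are illustrative, not any site's actual settings):

```
# start jobs only after 15 minutes with no keyboard/mouse
# activity and a low load average
START    = ( KeyboardIdle > 15 * $(MINUTE) ) && ( LoadAvg < 0.3 )
# suspend a running job as soon as the user returns
SUSPEND  = ( KeyboardIdle < $(MINUTE) )
CONTINUE = ( KeyboardIdle > 5 * $(MINUTE) )
# evict a job that has been suspended for more than 10 minutes
PREEMPT  = ( (Activity == "Suspended") && \
             ((CurrentTime - EnteredCurrentActivity) > 10 * $(MINUTE)) )
```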
UNIX or Windows execute hosts ? (1)
UNIX
Condor's natural environment
not widely installed on desktop machines (but depends on institution ...)
supports the Condor "standard universe" containing many useful features:
checkpointing allows jobs to be migrated from one machine to another without loss of useful work
Remote Procedure Calls give transparent access to files on the submit host
streaming of standard output (stdout) from jobs to the submit host
network filesystems work well, making installation and configuration much easier
leverages the large amount of scientific and engineering code which has been developed under UNIX
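To use the standard universe, a program is relinked against Condor's libraries with condor_compile. A sketch (program name hypothetical):

```
condor_compile gcc -o mysim mysim.c   # relink for the standard universe
```

The submit description file then specifies universe = standard instead of vanilla, which enables checkpointing and remote system calls for that job.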
UNIX or Windows execute hosts ? (2)
Windows
world's most widely installed OS – a rich source of execute hosts
many commercial 3rd party applications run on Windows
using shared (network) filesystems can be difficult under Condor
only supports the "vanilla" Condor universe:
no checkpointing – evicted jobs may waste a lot of cycles
all input and output files need to be transferred to/from the execute host
output streaming not supported
may be difficult to port "legacy" UNIX codes (although Cygwin and Co-Linux can make life easier)
Windows support from the U-W Condor Team tends to lag behind UNIX
Setting up a Condor pool
best to start off small and build up the pool slowly
need to understand Condor fundamentals:
role of the Condor processes and how they interact
life-cycle of jobs
ClassAds and matchmaking
avoid firewalls if possible (may be easier said than done ...)
talk to central IT services (particularly the network and PC teams)
submit hosts may need to be fairly high spec if large numbers of jobs are to be run – ideally want:
multi-core/processor machine (quad core at least)
plenty of memory (say 8 GB or more)
large fast access filestore (e.g. 1 TB RAID)
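The roles in a pool are determined by which Condor daemons each machine runs. A minimal configuration sketch (host name illustrative):

```
CONDOR_HOST = manager.example.ac.uk
# on the central manager:
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR
# on a submit host use:    DAEMON_LIST = MASTER, SCHEDD
# on an execute host use:  DAEMON_LIST = MASTER, STARTD
```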
Where to go for help
Read The Fine Manual!
log files contain a lot of useful information
take a look at the presentations, tutorials and "how-to recipes" on the Condor website (www.cs.wisc.edu/condor)
search the condor-users mail list archive (lists.cs.wisc.edu/archive/condor-users)
subscribe to the condor-users mail list
join the Campus Grids SIG (wikis.nesc.ac.uk/escinet/Campus_Grids)
commercial support is also available (e.g. Cycle Computing)
University of Liverpool Condor Pool
contains around 400 machines running the University's Managed Windows Service (currently XP but moving to Windows 7 soon)
most have 2.33 GHz Intel Core 2 processors with 2 GB RAM and 80 GB disk, configured with two job slots per machine
single submission point for Condor jobs provided by a Sun Solaris V445 SMP server
policy is to run jobs only after at least 5 minutes of inactivity and with a low load average during office hours, and at any time outside office hours
a job will be killed off if it is running when a user logs in to the PC
web interface for specific applications
support for running large numbers of MATLAB jobs
Condor service caveats
only suitable for DOS-based applications running in batch mode
no communication between processes possible ("pleasantly parallel" applications only)
statically linked executables work best (although Condor can cope with DLLs)
all files needed by the application must be present on the local disk (cannot access network drives)
shorter jobs are more likely to run to completion (10-20 min seems to work best)
very long running jobs can be accommodated using Condor DAGMan or user-level checkpointing (details available soon on the Condor website)
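DAGMan expresses a long computation as a chain of short, restartable stages. A sketch of a two-stage DAG input file (file names hypothetical):

```
# mysim.dag - stage2 starts only after stage1 completes
JOB stage1 stage1.sub
JOB stage2 stage2.sub
PARENT stage1 CHILD stage2
```

The DAG is submitted with condor_submit_dag mysim.dag; if a stage fails or is evicted, only that stage is rerun.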
Running MATLAB jobs under Condor
many users prefer to create applications using MATLAB rather than traditional compiled languages (e.g. FORTRAN, C)
need to create a standalone application from the M-file(s) using the MATLAB compiler
the standalone application can run without a MATLAB licence
the run-time libraries still need to be accessible to MATLAB jobs
nearly all toolbox functions are available to standalone applications
simple (but powerful) file I/O makes checkpointing easier
see the Liverpool Condor website for more information
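The compile step uses MATLAB's mcc command; a sketch (file name hypothetical):

```
mcc -m mymodel.m    # build a standalone executable from the M-file
```

The resulting executable is then submitted with an ordinary vanilla-universe submit file, transferring the executable and its input files as in the earlier example.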
Power-saving and Green IT at Liverpool
we have around 2 000 centrally managed classroom PCs across campus which were powered up overnight, at weekends and during vacations
the original power-saving policy was to power off machines after 30 minutes of inactivity; we now hibernate them after 15 minutes of inactivity
this policy has reduced wasteful inactivity time by ~ 200 000 – 250 000 hours per week (equivalent to 20-25 MWh), leading to an estimated saving of approx. £125 000 p.a.
3rd party power management software (PowerMAN) prevents machines hibernating whilst Condor jobs are running
Condor's own power management features allow machines to be woken up automatically according to demand
Condor-G and Grid Computing
Condor-G is an extension to Condor allowing job submission to remote resources using Globus
provides a familiar Condor-like interface to users, hiding the underlying middleware complexity
we have used Condor-G to give users grid access to a variety of HPC resources:
local HPC clusters (UL-Grid)
NW-Grid resources at Daresbury Lab, Lancaster and Manchester
National Grid Service facilities
Grid Computing Server tools provide a batch environment similar to that of cluster systems (e.g. Sun Grid Engine)
a web portal removes the need for command line use completely
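A Condor-G job uses the grid universe in its submit file. A sketch routing a job to a Globus GRAM gatekeeper (endpoint name hypothetical):

```
universe      = grid
grid_resource = gt2 gatekeeper.example.ac.uk/jobmanager-pbs
executable    = mysim
output        = mysim.out
queue
```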
Radiotherapy example
a 3D model of normal tissue was developed in which complications are generated when 'irradiated' [1]
the aim is to provide insight into the connection between dose-distribution characteristics, different organ architectures and complication rates, beyond that of analytical methods
code written in MATLAB and compiled into a standalone executable
a set of 800 simulations took ~ 36 hours to run on the Condor pool but would require 4-5 months of computing time on a single PC
several dozen sets of simulations have since been completed

[1] Rutkowska E., Baker C.R. and Nahum A.E. Mechanistic simulation of normal-tissue damage in radiotherapy – implications for dose-volume analyses. Phys. Med. Biol. 55 (2010) 2121-2136.
Personalised Medicine example
the project is a Genome-Wide Association Study which aims to identify genetic predictors of response to anti-epileptic drugs
tries to identify regions of the human genome that differ between individuals (referred to as SNPs)
800 patients genotyped at 500 000 SNPs along the entire genome
test statistically the association between SNPs and outcomes (e.g. time to withdrawal of a drug due to adverse effects)
very large data-parallel problem – ideal for Condor
divide datasets into small partitions so that individual jobs run for 15-30 minutes
a batch of 26 chromosomes (2 600 jobs) required ~ 5 hours of compute time on Condor but ~ 5 weeks on a single PC
Epidemiology example
researchers have simulated the consequences of an incursion of H5N1 avian influenza into the British poultry flock [2]
Monte Carlo type method – highly parallel
original code written in MATLAB and compiled into a standalone application
individual simulations take only 10-15 minutes to run – ideal for Condor
require ~ 10 000 - 20 000 simulations per scenario
would have needed several years of compute time on a single machine; on Condor it needed a few weeks

[2] Sharkey K.J., Bowers R.G., Morgan K.L., Robinson S.E. and Christley R.M. Epidemiological consequences of an incursion of highly pathogenic H5N1 avian influenza into the British poultry flock. Proc. R. Soc. B 2008 275, 19-28.
Further Information
http://www.liv.ac.uk/e-science/condor
i.c.smith@liverpool.ac.uk