Slide1
Parrot and ATLAS Connect
Rob Gardner, Dave Lesny
Slide2
ATLAS Connect
A Condor- and PanDA-based batch service to easily connect resources
Connect to ATLAS-compliant resources like a Tier2
Connect to opportunistic resources such as campus clusters
Stampede cluster at the Texas Advanced Computing Center
Midway cluster at University of Chicago
Illinois Campus Cluster at UIUC/NCSA
Each is RHEL6 or equivalent with either SLURM or PBS as local scheduler
Slide3
Accessing Stampede
Use simple Condor submit using BLAHP protocol (ssh login to Stampede local submit host) (factory based on http://bosco.opensciencegrid.org)
Test for prerequisites
APF uses same mechanism
PanDA queues – operated from MWT2
APF for pilot submission
CONNECT: production queue
ANALY_CONNECT: analysis queue
MWT2 storage for DDM endpoints
Frontier squid service
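The Condor submission route on this slide can be sketched as a small Python helper that renders a submit description. This is an illustration only: it assumes the remote batch target is expressed in the grid universe as `grid_resource = batch <scheduler> user@host`, and the user, login host and executable names are hypothetical placeholders, not values from the slides.

```python
def render_submit(user, host, scheduler, executable):
    """Render an HTCondor submit description for a remote batch target.

    Assumption: grid universe with the "batch" grid type, reached over ssh
    as user@host. All argument values used below are hypothetical.
    """
    if scheduler not in ("slurm", "pbs"):
        raise ValueError("expected 'slurm' or 'pbs', got %r" % scheduler)
    lines = [
        "universe = grid",
        "grid_resource = batch %s %s@%s" % (scheduler, user, host),
        "executable = %s" % executable,
        "output = job.$(Cluster).out",
        "error = job.$(Cluster).err",
        "log = job.$(Cluster).log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example values; a real setup would use the site's login host.
submit_text = render_submit("someuser", "login.example.org", "slurm", "wrapper.sh")
```

The rendered text would be written to a file and handed to `condor_submit`. A pilot factory using the same mechanism would generate an equivalent description.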
Slide4
Challenges
Additional system libraries (“ATLAS compatibility libraries”) as packaged in HEP_Oslibs_SL6
Access to CVMFS clients and cache
Environment variables normally set up by an OSG CE, needed by the pilot
$OSG_APP, $OSG_GRID, $VO_ATLAS_SW_DIR
Approach was to provide these components via the user job wrapper
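As a minimal sketch of the wrapper idea, the Python function below builds the environment a pilot would normally inherit from an OSG CE. The directory layout under `install_root` is a hypothetical example, and the default for `VO_ATLAS_SW_DIR` is an assumption.

```python
import os


def pilot_environment(base_env, install_root):
    """Return a copy of base_env with CE-style variables added.

    The sub-directories under install_root are hypothetical; a real wrapper
    would point at wherever the worker-node client was actually installed.
    """
    env = dict(base_env)
    env["OSG_APP"] = os.path.join(install_root, "osg", "app")
    env["OSG_GRID"] = os.path.join(install_root, "osg", "wn-client")
    # setdefault keeps a value that was already supplied upstream.
    env.setdefault("VO_ATLAS_SW_DIR", "/cvmfs/atlas.cern.ch/repo/sw")
    return env


env = pilot_environment({"PATH": "/usr/bin"}, "/tmp/connect")
```

The resulting dictionary can be passed as the `env` argument of `subprocess.call` when the wrapper starts the pilot.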
Slide5
Approaches
Linux image with all libraries built using fake[ch]root
Deploy this image locally via tarball or via a CVMFS repo
Use the CERN VM3 image in /cvmfs/cernvm-prod.cern.ch
Use Parrot to provide access to CVMFS repositories
Use Parrot “--mount” to map file references into the image
/usr/lib64 → /cvmfs/cernvm-prod.cern.ch/cvm3/usr/lib64
Install a Certificate Authority and OSG WN Client
Emulate the CE by defining env vars
Some defined in APF ($VO_ATLAS_SW_DIR, $OSG_SITE_NAME)
Others defined in “wrapper” ($OSG_APP, $OSG_GRID)
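The path mapping on this slide can be illustrated with a short Python helper that assembles a Parrot command line. It is a sketch under the assumption that `parrot_run` accepts `--mount /visible=/backing`. The job script name is hypothetical.

```python
def parrot_command(mounts, job_argv, parrot="parrot_run"):
    """Build an argv list that runs job_argv under Parrot.

    mounts maps a path as seen by the job to the path that backs it.
    Assumption: each mapping is passed as "--mount /visible=/backing".
    """
    argv = [parrot]
    for visible, backing in sorted(mounts.items()):
        argv += ["--mount", "%s=%s" % (visible, backing)]
    return argv + list(job_argv)


cmd = parrot_command(
    {"/usr/lib64": "/cvmfs/cernvm-prod.cern.ch/cvm3/usr/lib64"},
    ["./wrapper.sh"],  # hypothetical job script
)
```

Building the command as a list rather than a single string avoids shell quoting problems when it is passed to `subprocess`.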
Slide6
Problems (1)
Symlinks cannot be followed between repositories
Not possible with Parrot due to restrictions with libcvmfs
/cvmfs/osg.mwt2.org/atlas/sw → /cvmfs/atlas.cern.ch/repo/sw
In general, we find cross-referencing CVMFS repos unreliable
A python script located in atlas.cern.ch needs a lib.so
If lib.so resides in another repo, might get “File not found”
Solution was to use a local disk for the Linux image
Solution:
Download a tarball and install it locally on disk
Also install local OSG worker-node client and CA in same location
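The local-disk solution amounts to unpacking an image tarball into a directory the job can write to. The Python sketch below shows one way to do that. The function name and paths are hypothetical, and the download step is left out.

```python
import os
import tarfile


def install_image(tarball_path, install_root):
    """Unpack an image tarball under install_root and return that path.

    Member names are checked first so that nothing is written outside
    install_root.
    """
    root = os.path.realpath(install_root)
    os.makedirs(root, exist_ok=True)
    with tarfile.open(tarball_path) as tar:
        for member in tar.getmembers():
            target = os.path.realpath(os.path.join(root, member.name))
            if target != root and not target.startswith(root + os.sep):
                raise ValueError("unsafe path in tarball: %s" % member.name)
        tar.extractall(root)
    return root
```

Installing the worker-node client and CA files under the same root keeps everything the job needs on one local path.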
Slide7
Problems (2): Parrot stability
Parrot is very sensitive to the kernel version
When used on kernels 2.x, many ATLAS programs hang
Parrot uses ptrace and clones the system call
Bug in ptrace in some kernels causes a timing problem
Program being traced is awakened with SIGCONT before it should be
Result is that the program stays in “T” state forever
Kernels known to have issues with Parrot
ICC 2.6.32-358.23.2.el6.x86_64
Stampede 2.6.32-358.18.1.el6.x86_64
Midway 2.6.32-431.11.2.el6.x86_64
Custom kernel at MWT2 which seems to work is “3.2.13-UL3.el6”
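A job stuck in the “T” state can be spotted from the process table. The Python sketch below reads the state letter from `/proc/<pid>/stat` on Linux. The function names are illustrative, and deciding how long a process must remain in “T” before it counts as hung is left to the caller.

```python
def process_state(stat_line):
    """Return the one-letter state from the text of /proc/<pid>/stat.

    The command name sits in parentheses and may itself contain spaces or
    parentheses, so split after the last ')'.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def is_stopped(pid):
    """True if the process is in state 'T' at the moment of the check."""
    with open("/proc/%d/stat" % pid) as handle:
        return process_state(handle.read()) == "T"
```

A single observation of “T” is not proof of a hang, so a watchdog would sample `is_stopped` several times over an interval before killing the job.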
Slide8
Towards a solution: Parrot 4.1.4rc5
To work around the hangs, the CCTools team provided a feature
--cvmfs-enable-thread-clone-bugfix
Stops many (not all) hangs, with a huge performance penalty
A simple ALRB with an asetup of a release takes 10x to 100x longer
Needed on 2.x kernels to avoid many of the hangs
Programs which tend to run on 2.x without “bugfix” are
ATLAS Local Root Base setup (and diagnostics db-readReal and db-fnget)
Reconstruction
Panda Pilots
Validation jobs
Programs which tend to hang
Sherpa (always)
Release 16.x jobs
Some HammerCloud tests (16.x always, 17.x sometimes)
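Because the workaround is slow and only needed on 2.x kernels, a wrapper might enable it conditionally. The Python sketch below does this from the kernel release string. The flag name is taken from the slide, while the helper names and the "major version below 3" rule are assumptions made for illustration.

```python
import platform

# Flag name as given on the slide.
BUGFIX_FLAG = "--cvmfs-enable-thread-clone-bugfix"


def kernel_major(release):
    """Return the major version from a kernel release string."""
    return int(release.split(".", 1)[0])


def parrot_options(release=None):
    """Return extra Parrot options for the given (or running) kernel.

    Assumption: the workaround is switched on for any kernel below 3.x.
    """
    if release is None:
        release = platform.release()
    return [BUGFIX_FLAG] if kernel_major(release) < 3 else []
```

With the kernels listed earlier, `parrot_options("2.6.32-358.18.1.el6.x86_64")` includes the flag and `parrot_options("3.2.13-UL3.el6")` does not.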
Slide9
Alternatives to Parrot?
The CCTools team will be working on Parrot to fix bugs
May need to use kernel 3.x on target site for reliability
Three solutions we are pursuing:
Parrot with Chirp (avoid libcvmfs)
NFS mounting of local CVMFS (requires admin)
Use Environment Modules, common on HPC facilities
Treat CVMFS client as a user application
Jobs “module load cvmfs-client”
Prefix has privileges – can load needed FUSE modules
Cache re-use by multi-core job slots
Might be more palatable to HPC admins
Slide10
Conclusions
Good experience accessing opportunistic resources without WLCG or ATLAS services
A general problem for campus clusters
Would greatly help if we:
Relied on only one CVMFS repo + stock SL6 (like CMS)
Will continue pursuing the three alternatives
Hope we can learn from others here!