
Parrot and ATLAS Connect - PowerPoint Presentation

Uploaded On 2016-07-19




Presentation Transcript

Slide1

Parrot and ATLAS Connect

Rob Gardner, Dave Lesny

Slide2

ATLAS Connect

A Condor and Panda-based batch service to easily connect resources

Connect to ATLAS Compliant resources like a Tier2

Connect to opportunistic resources such as campus clusters

Stampede cluster at the Texas Advanced Computing Center

Midway cluster at University of Chicago

Illinois Campus Cluster at UIUC/NCSA

Each is RHEL6 or equivalent with either SLURM or PBS as local scheduler

Slide3

Accessing Stampede

Use simple Condor submit using BLAHP protocol (ssh login to stampede local submit host) (factory based on http://bosco.opensciencegrid.org)

Test for prerequisites

APF uses same mechanism

PanDA queues – operated from MWT2

APF for pilot submission

CONNECT: production queue

ANALY_CONNECT: analysis queue

MWT2 storage for DDM endpoints

Frontier squid service

Slide4

Challenges

Additional system libraries (“ATLAS compatibility libraries”) as packaged in HEP_OSlibs_SL6

Access to CVMFS clients and cache

Environment variables normally set up by an OSG CE, needed by the pilot

$OSG_APP, $OSG_GRID, $VO_ATLAS_SW_DIR

Approach was to provide these components via the user job wrapper
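As an illustration of the wrapper idea, here is a minimal Python sketch that reports which of the three CE variables are missing and fills them from defaults. The function names and any default paths passed in are hypothetical; the actual wrapper used for ATLAS Connect is not shown in the slides and may work differently.

```python
import os

# Variables the slide lists as normally set up by an OSG CE.
REQUIRED_CE_VARS = ("OSG_APP", "OSG_GRID", "VO_ATLAS_SW_DIR")


def missing_ce_vars(env=None):
    """Return the required CE variables that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_CE_VARS if not env.get(name)]


def emulate_ce(env, defaults):
    """Return a copy of env with missing CE variables filled from defaults.

    Values already present in env are never overwritten. The defaults
    mapping is supplied by the caller (hypothetical, site-specific paths).
    """
    patched = dict(env)
    for name in missing_ce_vars(env):
        if name in defaults:
            patched[name] = defaults[name]
    return patched
```

A wrapper could call `emulate_ce(os.environ, defaults)` and pass the result as the environment of the pilot process, so that variables already set by APF are left alone.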

Slide5

Approaches

Linux Image with all libraries built using fake[ch]root

Deploy this image locally via tarball or via a CVMFS repo

Use the CERN VM3 image in /cvmfs/cernvm-prod.cern.ch

Use Parrot to provide access to CVMFS repositories

Use Parrot “--mount” to map file references into the Image

/usr/lib64 → /cvmfs/cernvm-prod.cern.ch/cvm3/usr/lib64

Install a Certificate Authority and OSG WN Client

Emulate the CE by defining env vars

Some defined in APF ($VO_ATLAS_SW_DIR, $OSG_SITE_NAME)

Others defined in “wrapper” ($OSG_APP, $OSG_GRID)
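The mount mapping can be sketched as a small Python helper that assembles a Parrot command line. This assumes the Parrot executable is called `parrot_run` and accepts `--mount <path>=<target>`; both should be checked against the installed CCTools version. The job script name `./job.sh` is hypothetical.

```python
def parrot_mount_command(job_argv, mounts, parrot="parrot_run"):
    """Build an argv list that runs job_argv under Parrot with redirects.

    mounts maps a path the job refers to onto the path that should
    actually be opened, for example a directory inside the Linux image.
    The command is only constructed here, not executed.
    """
    argv = [parrot]
    for source, target in mounts.items():
        # Assumed syntax: --mount /foo=/bar redirects /foo to /bar.
        argv += ["--mount", f"{source}={target}"]
    return argv + list(job_argv)


# Image location taken from the slide; "./job.sh" is a placeholder.
IMAGE = "/cvmfs/cernvm-prod.cern.ch/cvm3"
cmd = parrot_mount_command(["./job.sh"], {"/usr/lib64": IMAGE + "/usr/lib64"})
```

The resulting list can be handed to `subprocess.run(cmd)` on a host where Parrot is installed.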

Slide6

Problems (1)

Symlinks cannot be followed between repositories

Not possible with Parrot due to restrictions with libcvmfs

/cvmfs/osg.mwt2.org/atlas/sw → /cvmfs/atlas.cern.ch/repo/sw

In general, we find cross-referencing CVMFS repos unreliable

A python script located in atlas.cern.ch needs a lib.so

If lib.so resides in another repo, might get “File not found”

Solution was to use a local disk for the Linux Image

Solution:

Download a tarball and install it locally on disk

Also install local OSG worker-node client and CA in same location
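To make the cross-repository case concrete, the following Python sketch flags a symlink whose absolute target lies in a different CVMFS repository than the link itself. It is an illustrative check only, not part of Parrot or libcvmfs; the helper names are hypothetical, and relative link targets would need to be resolved to absolute paths first.

```python
from pathlib import PurePosixPath


def cvmfs_repo(path):
    """Return the repository name for a /cvmfs/<repo>/... path, else None."""
    parts = PurePosixPath(path).parts
    if len(parts) >= 3 and parts[0] == "/" and parts[1] == "cvmfs":
        return parts[2]
    return None


def crosses_repos(link_path, target_path):
    """True when a symlink inside one CVMFS repo points into a different one."""
    source = cvmfs_repo(link_path)
    target = cvmfs_repo(target_path)
    return source is not None and target is not None and source != target
```

With the paths from the slide, `crosses_repos("/cvmfs/osg.mwt2.org/atlas/sw", "/cvmfs/atlas.cern.ch/repo/sw")` is true, which is the situation reported as failing under Parrot.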

Slide7

Problems (2): Parrot stability

Parrot is very sensitive to the kernel version

When used on kernels 2.x, many ATLAS programs hang

Parrot uses ptrace and clones the system call

A bug in ptrace in some kernels causes a timing problem

Program being traced is awakened with “SIGCONT” before it should be

Result is that the program stays in “T” state forever

Kernels known to have issues with Parrot

ICC 2.6.32-358.23.2.el6.x86_64

Stampede 2.6.32-358.18.1.el6.x86_64

Midway 2.6.32-431.11.2.el6.x86_64

Custom kernel at MWT2 which seems to work is “3.2.13-UL3.el6”
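A job wrapper could classify the host kernel before deciding how to run Parrot. The sketch below encodes only the rough pattern reported on this slide (2.x kernels affected, the one 3.x kernel not); it is a heuristic under that assumption, not a definitive test for the ptrace bug, and the function names are hypothetical.

```python
import re


def kernel_major(release):
    """Extract the leading major version from a kernel release string."""
    match = re.match(r"(\d+)\.", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1))


def likely_affected(release):
    """Heuristic from the slide: 2.x kernels showed hangs, the 3.x one did not."""
    return kernel_major(release) < 3


# Kernel release strings as listed on the slide.
SITES = {
    "ICC": "2.6.32-358.23.2.el6.x86_64",
    "Stampede": "2.6.32-358.18.1.el6.x86_64",
    "Midway": "2.6.32-431.11.2.el6.x86_64",
    "MWT2": "3.2.13-UL3.el6",
}
affected = sorted(site for site, rel in SITES.items() if likely_affected(rel))
```

On a worker node the release string would come from `platform.release()` in the standard library.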

Slide8

Towards a solution: Parrot 4.1.4rc5

To work around the hangs, CCTools team provided a feature

--cvmfs-enable-thread-clone-bugfix

Stops many (not all) hangs with a huge performance penalty

Simple ALRB with an asetup of a release takes 10x to 100x longer

Needed on 2.x kernels to avoid many of the hangs

Programs which tend to run on 2.x without “bugfix” are

Atlas Local Root Base setup (and diagnostics db-readReal and db-fnget)

Reconstruction

Panda Pilots

Validation jobs

Programs which tend to hang

Sherpa (always)

Release 16.x jobs

Some HammerCloud tests (16.x always, 17.x sometimes)
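Because the bugfix costs 10x to 100x in run time, a wrapper might enable it only where it is needed. The following Python sketch adds the flag for 2.x kernels and omits it otherwise. The flag name is the one given on the slide; the executable name `parrot_run`, the helper name, and the policy of keying purely on the kernel major version are assumptions.

```python
BUGFIX_FLAG = "--cvmfs-enable-thread-clone-bugfix"


def parrot_argv(job_argv, kernel_release, parrot="parrot_run"):
    """Build a Parrot argv, adding the bugfix flag only on 2.x kernels.

    kernel_release is a string such as "2.6.32-358.18.1.el6.x86_64".
    The command is only constructed here, not executed.
    """
    major = int(kernel_release.split(".", 1)[0])
    argv = [parrot]
    if major < 3:
        # Slow path: avoids many (not all) hangs on affected kernels.
        argv.append(BUGFIX_FLAG)
    return argv + list(job_argv)
```

A finer policy could also skip the flag for workloads the slide lists as running without it, such as pilots and reconstruction, and reserve it for the ones that tend to hang.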

Slide9

Alternatives to Parrot?

The CCTools team will be working on Parrot to fix bugs

May need to use kernel 3.x on target site for reliability

Three solutions we are pursuing:

Parrot with Chirp (avoid libcvmfs)

NFS mounting of local CVMFS (requires admin)

Use Environment Modules, common on HPC facilities

Treat CVMFS client as a user application

Jobs “module load cvmfs-client”

Prefix has privileges – can load needed FUSE modules

Cache re-use by multi-core job slots

Might be more palatable to HPC admins

Slide10

Conclusions

Good experience accessing opportunistic resources without WLCG or ATLAS services

A general problem for campus clusters

Would greatly help if we:

Relied on only one CVMFS repo + stock SL6 (like CMS)

Will continue pursuing the three alternatives

Hope we can learn from others here!
