AliEn and jAliEn status - PowerPoint Presentation

Uploaded On 2020-08-03

Presentation Transcript

Slide1

AliEn and jAliEn status

Miguel Martinez Pedreira

Slide2

Outlook

- AliEn status
  - Operations
    - Jobs
    - Central Services
    - Sites
  - Developments
- jAliEn status
  - Developments
  - First site to-do
- New catalogue backend status

AliEn and jAliEn status - Miguel Martinez Pedreira


Slide3

AliEn operations


Slide4

AliEn operations: jobs

=> 175K jobs running on Christmas day

Slide5

AliEn operations: jobs

=> 175K jobs running on Christmas day

Slide6

AliEn operations: Central Services Databases

- Two main MySQL backends: File Catalogue + TaskQueue
- Slave in sync -> backups and swap in case of failure
- Two ways to backup/restore
  - SQL dump: 6h to backup catalogue, ~1.5 days to restore
  - Binary copy: 1.5h
- But with 1 slave, this still means there is downtime if master fails
- Added second slave for both backends
  - Now we can swap/restore master and slave from the extra slave with just “no” downtime
- Created a procedure to restore the DBs into other servers
- Set up 10 Gbps links between backends to allow for such fast restore speed

Slide7

AliEn operations: Central Services Databases


Slide8

AliEn operations: Central Services Databases


Slide9

AliEn operations: Central Services LDAP

- Master and slave: similar to MySQL setup
  - ‘syncrepl refreshAndPersist’ -> master pushes updates
- Added second slave
- Stopped global alien LDAP
  - All LDAP lookups passed through it
  - It held LDAP address per VO
  - AliEn code now has an over-writable default value: alice-ldap.cern.ch:8389
- LDAP service very stable
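
The "over-writable default value" idea can be sketched as follows. This is only an illustration in Python, not the AliEn code: the default address is the one quoted on the slide, while the environment variable name `ALIEN_LDAP_ADDRESS` is a hypothetical placeholder.

```python
import os

# Default taken from the slide; the environment variable name is hypothetical.
DEFAULT_LDAP = "alice-ldap.cern.ch:8389"
LDAP_ENV_VAR = "ALIEN_LDAP_ADDRESS"  # assumed name, for illustration only


def ldap_address(env=None):
    """Return (host, port), preferring an override over the built-in default."""
    env = os.environ if env is None else env
    value = env.get(LDAP_ENV_VAR) or DEFAULT_LDAP
    host, sep, port = value.rpartition(":")
    if not sep:
        raise ValueError(f"expected host:port, got {value!r}")
    return host, int(port)
```

With no override set, `ldap_address({})` resolves to the built-in default, so clients no longer need a global LDAP lookup to find the per-VO server.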

Slide10

AliEn operations: Central Services API services

- Started investigation after incident with some train jobs
  - Too many `ls` commands
  - API services became unstable
- Why `ls`?
  - Issue investigated (by M. Zimmermann and R. Ehlers together with Miguel)
  - Preparation of train checking if some files exist
  - Optimized since it could be done on train preparation phase instead of on each job
- Why APIs unstable?
  - std::bad_alloc: buffers allocation not foreseen for such number of jobs/calls
  - Tuned kernel parameters for ApiService (rss, as, nproc) => no more crashes
  - Error rates / efficiency after..?
- Prepared extra servers in case of need to scale
  - Investigating machines sets with IT
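
A quick way to see which values a process actually runs with is to read its resource limits. The sketch below assumes that rss, as and nproc on the slide are the per-process limits of the same names used in limits.conf; it only inspects the current process and is not the tuning procedure applied to ApiService.

```python
import resource

# rss, as and nproc are the names from the slide; the mapping to RLIMIT_*
# constants is an assumption based on the usual limits.conf item names.
LIMITS = {
    "rss": resource.RLIMIT_RSS,
    "as": resource.RLIMIT_AS,
    "nproc": resource.RLIMIT_NPROC,
}


def current_limits():
    """Return {name: (soft, hard)} for the current process."""
    return {name: resource.getrlimit(const) for name, const in LIMITS.items()}


def show(value):
    """Render a limit value, spelling out the 'no limit' sentinel."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(f"{name:6} soft={show(soft)} hard={show(hard)}")
```

Running it inside the service's own environment (same user, same startup script) shows whether the tuned values are the ones in effect.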

Slide11

AliEn operations: Central Services AliEn services

- Httpd at the limit of its threads on heavy load
  - Heavy load doesn’t really mean more running jobs always
  - Seen by jobs unable to contact TaskQueue (JobManager) and/or Catalogue (Authen)
  - Apparently httpd queueing/threading not enough to hold the requests
- Tuned the Apache services to allow many more connections
  - While monitoring the effect on the databases
  - Need to increase the max number of connections on the databases as well
- Added httpd threads and DB connections to MonaLisa

Slide12

AliEn operations: Sites

- SE availability (add+get) last 6 months: 88%
  - Excluding continuously faulty ones
  - Issues being addressed in ORNL::EOS -> fs, performance
  - CERN::EOS upgraded to citrine, xrootd4 and IPv6, improved namespace boot time
  - Some other sites upgraded to citrine
- Error/Done ratio last month on T0+T1s: 16.4% - global 17.5%
- 2 new sites
  - SARFTI at Sarov, Russia: around 400 slots and 120TB for now
  - HelixNebula Science Cloud at CERN, 10K slots to share among VOs
    - VoBoxes are Capella and Regulus
- RRC T1 error rates being investigated (network?) since beginning of 2018

Slide13

AliEn operations: Central Services OpenSSL

- Vulnerability discovered in the OpenSSL version/configuration used in CS
  - Related to RSA_Export cipher suites
  - CVE-2015-0204
  - EXP-DES-CBC-SHA Kx=RSA(512) Au=RSA Enc=DES-CBC(40) Mac=SHA1 export
  - EXP-RC2-CBC-MD5 Kx=RSA(512) Au=RSA Enc=RC2-CBC(40) Mac=MD5 export
  - EXP-RC4-MD5 Kx=RSA(512) Au=RSA Enc=RC4(40) Mac=MD5 export
- Fix
  - Updated OpenSSL to latest compatible with xalienfs: 0.9.8zh
  - Disabled all weak ciphers in the configuration
- xalienfs rework vs jAliEn in production
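
To check a configuration for such suites, the cipher listing can be scanned for export-grade entries. The sketch below is plain string parsing over the three lines quoted on the slide. The `AES256-SHA` line is added here only as a contrasting non-export example and is not from the slide. The function names are illustrative.

```python
SUITES = [
    "EXP-DES-CBC-SHA Kx=RSA(512) Au=RSA Enc=DES-CBC(40) Mac=SHA1 export",
    "EXP-RC2-CBC-MD5 Kx=RSA(512) Au=RSA Enc=RC2-CBC(40) Mac=MD5 export",
    "EXP-RC4-MD5 Kx=RSA(512) Au=RSA Enc=RC4(40) Mac=MD5 export",
    "AES256-SHA Kx=RSA Au=RSA Enc=AES(256) Mac=SHA1",  # contrast example
]


def parse_cipher_line(line):
    """Split one cipher listing line into name, key=value fields and flags."""
    name, *tokens = line.split()
    fields = {}
    flags = []
    for token in tokens:
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
        else:
            flags.append(token)
    return name, fields, flags


def is_export_grade(line):
    """True for suites marked 'export' or named with the EXP- prefix."""
    name, _fields, flags = parse_cipher_line(line)
    return name.startswith("EXP-") or "export" in flags


weak = [line.split()[0] for line in SUITES if is_export_grade(line)]
```

After the fix described above, the same scan over the server's enabled suites should return an empty list.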

Slide14

AliEn development


Slide15

AliEn development

- JobAgent: subfolders for job outputs
  - The use of “Output” tag is the same, but now relative paths can be added before the filenames; the JobAgent will `mkdir -p` them in the catalogue
  - Also different sub-paths in archives
  - Request for MC-to-MC embedding
- JobAgent: retrials for critical calls to CS
  - changeStatusCommand / getCatalogue
- JobAgent: traces cleanup
  - Possibly we can go further with this, not a problem for now
- Authen: `mirror -x <SE>`
  - Mirrors to closest site excluding the one of SE
  - Used by TAlienFile (see later)
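
The retrials for critical calls follow a common pattern: repeat the call a bounded number of times, waiting longer between attempts. The sketch below is a generic Python version under assumed parameters (3 attempts, doubling delay). It is not the JobAgent implementation, and `call` stands in for something like a status change request.

```python
import time


def call_with_retries(call, attempts=3, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Run call() up to `attempts` times, sleeping longer after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception as error:  # a real agent would catch narrower errors
            last_error = error
            if attempt < attempts:
                sleep(delay)
                delay *= backoff
    raise last_error
```

The `sleep` argument is injectable so the helper can be exercised without real waiting.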

Slide16

AliEn development

- TaskQueue: fix for multiple MySQL queries
  - New version of MySQL broke some queries (that were anyway badly written)
  - This caused an incident on Central Services regarding job submission and status updates (3rd week of January)
  - The new MySQL version came from swapping slave to master due to broken filesystem on the master (disk failures included)
- Splitting
  - Removed cap of 1000 InputData files per subjob
  - Possibility to enable baskets dump on splitting optimizer logs
  - Trace message in case subjob submission fails with some issue on InputData
- JobManager: kill jobagent on active job resubmit

Slide17

AliEn development

- JobAgent, Monitor, CMreport, JobManager: fully use JobToken to authenticate all calls from JobAgent (one payload attached)
  - This prevents job executions from colliding (status changes,…) by rejecting invalid calls
  - No way to know without the token or some extra parameters…
  - Created a Monitor and JobManager version compatible with the new and old JA to ease deployment
- JobAgent: first approach to Singularity
  - Overload the CE environment with:
    CE_SINGULARITY = singularity exec --bind /cvmfs:/cvmfs /cvmfs/atlas.cern.ch/repo/images/singularity/x86_64-centos7.img
  - Startup script and job execution wrapped (payload with --contain)
  - Never really used/pushed, just tests
  - More on containerization/Singularity in a few slides
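
Wrapping the startup script and job execution amounts to prefixing the command line with the launcher taken from the CE environment. The sketch below assumes the `--bind` spelling of the Singularity option (the dashes are not legible in the transcript) and uses a hypothetical `wrap_command` helper. It builds the argument list only and does not launch anything.

```python
import shlex

# Value as read from the slide; the "--bind" spelling is an assumption.
CE_SINGULARITY = (
    "singularity exec --bind /cvmfs:/cvmfs "
    "/cvmfs/atlas.cern.ch/repo/images/singularity/x86_64-centos7.img"
)


def wrap_command(command, env):
    """Prefix `command` with the container launcher if CE_SINGULARITY is set."""
    prefix = env.get("CE_SINGULARITY", "").strip()
    if not prefix:
        return list(command)  # no container configured: run as before
    return shlex.split(prefix) + list(command)
```

Because the launcher comes from the environment, a site without Singularity keeps the old behaviour simply by leaving the variable unset.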

Slide18

AliEn development: TAlienFile

- ROOT interface
- Users of AliEn Analysis Plugin having issues when their local SE is not working correctly
  - While uploading JDLs, collections
  - [AliROOT] ANALYSIS/ANALYSISaliceBase/AliAnalysisAlien.cxx -> [ROOT] TFile::Cp -> TAlienFile::Open
  - TAlienFile::Open for write exits after first try
  - It also doesn’t `mirror` successfully uploaded file
- Addressed in: https://alice.its.cern.ch/jira/browse/ALIROOT-7715
  - You can try in alice-nightlies -> AliPhysics/v5-09-24-01-rc1-1
  - Bumped into production ROOT as of Tuesday 13th March
  - Would be good to keep an eye on first production/s
  - Users feedback
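
One plausible reading of the intended behaviour is: try the next storage element when a write fails, then mirror the file away from the SE that accepted it. The sketch below illustrates that flow in Python. It is not the TAlienFile code, and `upload` and `mirror` are hypothetical callables, with `mirror` standing in for `mirror -x <SE>`.

```python
def upload_with_fallback(lfn, storage_elements, upload, mirror):
    """Try each SE in order; after the first success, request a mirror elsewhere.

    `upload(lfn, se)` should raise OSError on failure and `mirror(lfn, exclude)`
    asks for a replica on a site other than `exclude`. Both are placeholders.
    """
    errors = {}
    for se in storage_elements:
        try:
            upload(lfn, se)
        except OSError as error:
            errors[se] = error
            continue  # do not give up after the first failed SE
        mirror(lfn, exclude=se)
        return se
    raise OSError(f"all uploads of {lfn} failed: {errors}")
```

The returned SE name tells the caller where the first replica landed, which is the one excluded from the mirror request.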

Slide19

jAliEn development


Slide20

jAliEn development

- First “JAliEn developers week” last week of November 2017
  - Structure, prioritize, find holes, brainstorm and push jAliEn development
  - 5 participants (the currently active jAliEn developers)
    - Costin, Miguel, Max, Tim, Vova
  - Very fruitful discussions
  - Clear path on what to do
  - Managed to push some new code too

Slide21

jAliEn development

Result: …nicely growing to-do list




Slide22

jAliEn development: architecture

[Architecture diagram. Recoverable labels:]
- Layers: Central services, Site services, Worker node, User machine
- Central services: jCentral, Catalogue, TaskQueue, Transfers, LDAP
  - jCentral contains:
    - Brokers (jobs, trans)
    - Optimizers (quota, prio)
    - Authen
    - More
- Other components: jSite, jBox, JobAgent, jsh, ROOT on WN slot, ROOT on User Machine
- Scale labels: O(10), O(100), O(1000)
- Connection types:
  - SSL(Compressed(Java serialized object stream))
  - WebsocketS, JSON serialization of requests/replies
- Link types: default uplink, optional uplink

Slide23

jAliEn development: new JobAgent

- See full report on Max’s presentation
- Split the current JobAgent in jAliEn in two
  - Pilot
    - Takes care of the matching, using a JobAgent token, and payload monitoring
  - Payload process
    - Gets from the pilot (in a safe way, streams) the JDL, Job token,… and takes care of the second layer of isolation and execution
- Isolation pilot-pilot and pilot-payload
  - Solution-agnostic approach to containerize the different components
  - WG Containers actively looking for best fit for WLCG
  - Singularity still the main candidate and partly used in production in ATLAS/CMS
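
Handing the JDL and job token to the payload over a stream, rather than through command-line arguments or environment variables, can be sketched as below. This is a Python illustration, not the jAliEn pilot (which is Java). The JSON message layout and the stand-in payload program are assumptions made for the example.

```python
import json
import subprocess
import sys

# Stand-in payload process: reads one JSON document from stdin and reports
# which keys it received, without echoing the token value.
PAYLOAD_CODE = (
    "import json, sys\n"
    "job = json.load(sys.stdin)\n"
    "print(','.join(sorted(job)))\n"
)


def run_payload(jdl, token):
    """Start the payload and hand it the JDL and token through stdin."""
    message = json.dumps({"jdl": jdl, "token": token})
    result = subprocess.run(
        [sys.executable, "-c", PAYLOAD_CODE],
        input=message,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Data passed this way does not show up in the process list or in the child's environment, which is the usual motivation for using a pipe here.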

Slide24

jAliEn development: logging

- Currently jAliEn uses standard java logging libs
  - Not fine grained, only structured on log levels
  - All components share logs
- We need a log per logical service or unit
  - One approach is to recreate the logger endpoint after getting an instance of the class logger
    - Used on the JobAgent to log as done in AliEn
  - Contextual logging, fully detailed on Tim’s presentation
    - Each class/service has a default logging context
    - The default context can be changed or you can set extra contexts
    - Experimental: annotation-based logging context
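
The two ideas, one log per logical service and a context attached to each record, can be illustrated with Python's standard `logging` module. jAliEn itself is Java, so this is an analogy rather than its implementation. The `jalien.` logger prefix and the helper names are made up for the example.

```python
import io
import logging


def service_logger(service, stream):
    """Return a logger that writes only this service's records to `stream`."""
    logger = logging.getLogger(f"jalien.{service}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of the shared root log
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(name)s [%(context)s] %(message)s"))
    logger.handlers = [handler]  # replace the endpoint for this logger
    return logger


def with_context(logger, context):
    """Attach a logging context that is printed with every record."""
    return logging.LoggerAdapter(logger, {"context": context})


agent_log, broker_log = io.StringIO(), io.StringIO()
agent = with_context(service_logger("JobAgent", agent_log), "default")
broker = with_context(service_logger("JobBroker", broker_log), "default")
agent.info("payload started")
broker.info("job matched")
```

Each service ends up with its own output, and switching the context only requires wrapping the same logger in another adapter.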

Slide25

jAliEn development: transfers and recovery

- See Vova’s presentation for full details
- Creating a tool to deal with transfers and SE operations
  - SE cleanups
    - Scheduling of data migration transfers
    - Check consistency of results, LDAP/ML cleanup …
  - SE migrations
    - E.g. for storage decommission
  - SE <=> File Catalogue synchronization
    - Orphans, dark data
  - SE utilization optimization
    - Explore archives, delete unnecessary files, repack
  - File recovery
    - 0-sized, broken/corrupted replicas
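
At its core, the SE <=> File Catalogue synchronization is a comparison of two listings. The sketch below assumes both can be obtained as plain collections of paths. "Dark data" is used in its usual sense of files present on storage but unknown to the catalogue, and `missing` is a name chosen here for the opposite case.

```python
def compare_se_with_catalogue(se_files, catalogue_files):
    """Return (dark_data, missing).

    dark_data: paths found on the SE but not registered in the catalogue.
    missing:   paths registered in the catalogue but not found on the SE.
    """
    on_se = set(se_files)
    in_catalogue = set(catalogue_files)
    dark_data = sorted(on_se - in_catalogue)
    missing = sorted(in_catalogue - on_se)
    return dark_data, missing
```

For real storage elements the listings are far too large for in-memory sets, so a production tool would stream and compare sorted dumps instead. The logic stays the same.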

Slide26

jAliEn development: site services

- CE: submission chain ready
  - Brokering
  - Startup scripts -> downloading jar to the node
  - Token Certificates
  - Batch interfaces
    - CREAM
    - HTCONDOR
- Test Monitor functions can be proxied through JSite (or JBox on VoBox)
  - To help socket management on JCentral

Slide27

jAliEn development

- JobBroker: adapted to use Token Certificates
  - The typical Job Token would be a combination of user, jobId and resubmission number
  - Subject: OU="queueid=1038905674/resubmission=3/user=mmmartin", OU=jobagent, CN=jobagent, CN=Jobs, O=AliEn, C=ch
  - Issuer: JAliEn-CS
- JobManager: resubmission
- Job Quotas
- Find
- JobAgent insertion and output cleanups
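
The job-specific part of the token subject is the quoted OU value, which packs three key=value pairs separated by slashes. A minimal parser for that value, using the example from the slide, could look like this. The function name is illustrative and certificate handling itself is out of scope.

```python
def parse_job_ou(ou_value):
    """Split 'queueid=.../resubmission=.../user=...' into a typed dict."""
    fields = dict(part.split("=", 1) for part in ou_value.split("/"))
    return {
        "queueid": int(fields["queueid"]),
        "resubmission": int(fields["resubmission"]),
        "user": fields["user"],
    }


job = parse_job_ou("queueid=1038905674/resubmission=3/user=mmmartin")
```

Here `job["resubmission"]` is the integer 3, which lets a service reject calls that carry a token from an earlier resubmission of the same job.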

Slide28

jAliEn development

- Wildcards expansions
- Token auto-refresh and creating from string
  - For the startup scripts
- Extension-parsing for Tokens
- Validation/fixes for ROOT client commands
  - Find -x and collection listing
- Using FindBugs + style template to validate the source code
- More…

Slide29

jAliEn: to-do for first full JAliEn site

- AliPhysics build with TJAlien (instead of TAlien)
- Validate all parameters coming from JobAgent and test proxying messages through VoBox
  - Check parameters clustered correctly in ML
  - Preferably only test with the new implementation of the pilot
- Logging contexts applied at least to the components involved
- Test production jobs full chain

Slide30

New file catalogue


Slide31

New catalogue

- Setup dual-instance Cassandra nodes
  - 2(+) Cassandra JVMs per machine
  - Set private IPs using the client-server 10Gbps dedicated switch
  - Cassandra-stress shown ~30% performance increase (why?)
    - 700 KHz ops, 25% cpu idle
  - Instrumented the monitoring to get JMX metrics on custom host/port
- Scylla Summit 2017, San Francisco
  - Invited by the company to present our work in ALICE
  - My presentation here
  - Very accessible and professional team, and the rest of the audience too..!
  - Clarified few important items, and learned a few others
    - AGPL, Open version vs Enterprise version -> support and nice functionalities, protection against cloud/private providers
    - Management tools for repair (that we got for free) and other
    - New hybrid compaction strategy
    - Lot of new developments coming or ready: Secondary Indexes, Materialized Views, User Defined Types
    - Hardware (NVMs, Optane), best practices, performance debugging, monitoring

Slide32

New catalogue: latest results

[Chart: throughput in KHz (ops/second) for benchmarks A to G; series: Cassandra read, Cassandra write, Scylla read, Scylla write; highlighted value: 1.6 MHz]

|   | Benchmark (default cassandra-stress)            | Cassandra                    | Scylla              | Diff  |
| A | Insert only                                     | 18% util, 2% iowait          | 100% util           | x2    |
| B | Read-only Gauss(5B,2.5B,10K)                    | Disk idle, 50% cpu           | 100% cpu            | x3    |
| C | Read-only Gauss(5B,2.5B,1M)                     | 11% util, 40% cpu            | 11% util, 100% cpu  | x3.28 |
| D | Mixed (10r,1w) Gauss(5B,2.5B,100K)              | 2% util, 45% cpu             | 10% util, 100% cpu  | x5.8  |
| E | Mixed (10r,1w) Gauss(5B,2.5B,1M)                | 5% util, 50% cpu             | 16% util, 100% cpu  | x6.4  |
| F | Mixed (10r,1w) Gauss(5B,2.5B,10M)               | 9% util, 45% cpu             | 40% util, 100% cpu  | x6.1  |
| G | Mixed 2K thrd. read, 200 write, G(5B,2.5B,100K) | 8% util, 40% cpu, no iowait  | 26% util, 100% cpu  | x5.62 |

Slide33

New catalogue: what next

- Full table scans
  - SE sync, quotas, analytics
- Failure/production environment framework
  - Recreate node failures, network breakdown, disk issues…

Slide34

Thanks! Questions?
