Slide1
AliEn and jAliEn status
Miguel Martinez Pedreira
Slide2
Outlook
AliEn status
  Operations
    Jobs
    Central Services
    Sites
  Developments
jAliEn status
  Developments
  First site to-do
New catalogue backend status
AliEn and jAliEn status - Miguel Martinez Pedreira
Slide3
AliEn operations
Slide4
AliEn operations: jobs
=> 175K jobs running on Christmas day
Slide5
AliEn operations: jobs
=> 175K jobs running on Christmas day
Slide6
AliEn operations: Central Services Databases
Two main MySQL backends: File Catalogue + TaskQueue
  Slave in sync -> backups and swap in case of failure
Two ways to backup/restore
  SQL dump: 6h to backup catalogue, ~1.5 days to restore
  Binary copy: 1.5h
  but with 1 slave, this still means there is downtime if master fails
Added second slave for both backends
  Now we can swap/restore master and slave from the extra slave with just “no” downtime
Created a procedure to restore the DBs into other servers
Set up 10 Gbps links between backends to allow for such fast restore speed
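As a rough check on the restore figures above, the following sketch works out how much data a 10 Gbps link can move in the 1.5h quoted for a binary copy. The database size is not given on the slide, so the functions only relate size, link speed and time; the `efficiency` parameter is a hypothetical allowance for protocol and disk overhead.

```python
def max_copy_size_tb(link_gbps: float, hours: float) -> float:
    """Upper bound on the data moved at full line rate, in TB (10^12 bytes)."""
    bytes_per_second = link_gbps * 1e9 / 8
    return bytes_per_second * hours * 3600 / 1e12


def copy_time_hours(size_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Time to copy size_tb over the link; efficiency < 1 models overhead (assumption)."""
    bytes_per_second = link_gbps * 1e9 / 8 * efficiency
    return size_tb * 1e12 / bytes_per_second / 3600


# 10 Gbps is 1.25 GB/s, so 1.5h at line rate moves at most 6.75 TB.
print(max_copy_size_tb(10, 1.5))
```

In practice the copy is often limited by disks rather than the network, so this is only an upper bound.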
Slide7
AliEn operations: Central Services Databases
Slide8
AliEn operations: Central Services Databases
Slide9
AliEn operations: Central Services LDAP
Master and slave: similar to MySQL setup
  ‘syncrepl refreshAndPersist’ -> master pushes updates
  Added second slave
Stopped global alien LDAP
  All LDAP lookups passed through it
  It held LDAP address per VO
  AliEn code now has an over-writable default value
    alice-ldap.cern.ch:8389
LDAP service very stable
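The "over-writable default value" can be pictured as a lookup that falls back to the address on the slide unless an override is set. This is a minimal sketch, not the AliEn code: the environment variable name `ALIEN_LDAP_ADDRESS` is hypothetical, and AliEn itself is not written in Python.

```python
import os

# Default address quoted on the slide.
DEFAULT_LDAP = "alice-ldap.cern.ch:8389"


def ldap_address(env=None, var="ALIEN_LDAP_ADDRESS"):
    """Return the override if set and non-empty, else the built-in default.

    The variable name is a hypothetical placeholder for whatever
    override mechanism the real code uses.
    """
    env = os.environ if env is None else env
    value = env.get(var, "").strip()
    return value or DEFAULT_LDAP


def split_host_port(address):
    """Split 'host:port' into (host, int(port))."""
    host, _, port = address.rpartition(":")
    return host, int(port)


print(split_host_port(ldap_address({})))
```

With this pattern no global lookup service is needed just to find the LDAP server of a VO.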
Slide10
AliEn operations: Central Services API services
Started investigation after incident with some train jobs
  too many `ls` commands
  API services became unstable
Why `ls`?
  Issue investigated (by M. Zimmermann and R. Ehlers together with Miguel)
  Preparation of train checking if some files exist
  Optimized since it could be done on train preparation phase instead of on each job
Why APIs unstable?
  std::bad_alloc: buffer allocation not foreseen for such a number of jobs/calls
  Tuned kernel parameters for ApiService (rss, as, nproc) => no more crashes
  Error rates / efficiency after..?
Prepared extra servers in case of need to scale
Investigating machine sets with IT
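The `ls` optimization above amounts to doing the existence check once, at train preparation, and handing the result to the jobs. The sketch below illustrates the effect on the number of catalogue lookups; the class and function names are hypothetical and the catalogue is a simple in-memory stand-in.

```python
class CountingCatalogue:
    """In-memory stand-in for the catalogue that counts lookups."""

    def __init__(self, present):
        self.present = set(present)
        self.lookups = 0

    def exists(self, path):
        self.lookups += 1
        return path in self.present


def prepare_train(required_files, exists):
    """Check each distinct file once during preparation; return the missing ones."""
    return sorted(f for f in set(required_files) if not exists(f))


def run_job(job_id, missing):
    """Jobs reuse the preparation result instead of issuing their own lookups."""
    return "skipped" if missing else "ran"


catalogue = CountingCatalogue({"/alice/a.root", "/alice/b.root"})
missing = prepare_train(["/alice/a.root", "/alice/b.root"], catalogue.exists)
results = [run_job(i, missing) for i in range(1000)]
print(catalogue.lookups, results.count("ran"))
```

The load on the API services then scales with the number of files per train rather than files times jobs.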
Slide11
AliEn operations: Central Services AliEn services
Httpd at the limit of its threads under heavy load
  Heavy load doesn’t always mean more running jobs
  Seen by jobs unable to contact TaskQueue (JobManager) and/or Catalogue (Authen)
  Apparently httpd queueing/threading not enough to hold the requests
Tuned the Apache services to allow many more connections
  While monitoring the effect on the databases
  Need to increase the max number of connections on the databases as well
Added httpd threads and DB connections to MonaLisa
Slide12
AliEn operations: Sites
SE availability (add+get) last 6 months: 88%
  Excluding continuously faulty ones
  Issues being addressed in ORNL::EOS -> fs, performance, …
  CERN::EOS upgraded to citrine, xrootd4 and IPv6, improved namespace boot time
  Some other sites upgraded to citrine
Error/Done ratio last month on T0+T1s: 16.4% - global 17.5%
2 new sites
  SARFTI at Sarov, Russia: around 400 slots and 120TB for now
  HelixNebula Science Cloud at CERN, 10K slots to share among VOs
    VoBoxes are Capella and Regulus
RRC T1 error rates being investigated (network?) since beginning of 2018
Slide13
AliEn operations: Central Services OpenSSL
Vulnerability discovered in the OpenSSL version/configuration used in CS
  Related to RSA_Export cipher suites
  CVE-2015-0204
  EXP-DES-CBC-SHA Kx=RSA(512) Au=RSA Enc=DES-CBC(40) Mac=SHA1 export
  EXP-RC2-CBC-MD5 Kx=RSA(512) Au=RSA Enc=RC2-CBC(40) Mac=MD5 export
  EXP-RC4-MD5 Kx=RSA(512) Au=RSA Enc=RC4(40) Mac=MD5 export
Fix
  Updated OpenSSL to the latest version compatible with xalienfs: 0.9.8zh
  Disabled all weak ciphers in the configuration
xalienfs rework vs jAliEn in production
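The cipher listing above can be checked mechanically. The sketch below parses lines in that format and flags export-grade or short-key suites. The 128-bit threshold is an assumption for illustration, and the `AES256-SHA` line is added here only as a contrasting strong suite; it does not come from the slide.

```python
import re

CIPHER_LINES = """\
EXP-DES-CBC-SHA Kx=RSA(512) Au=RSA Enc=DES-CBC(40) Mac=SHA1 export
EXP-RC2-CBC-MD5 Kx=RSA(512) Au=RSA Enc=RC2-CBC(40) Mac=MD5 export
EXP-RC4-MD5 Kx=RSA(512) Au=RSA Enc=RC4(40) Mac=MD5 export
AES256-SHA Kx=RSA Au=RSA Enc=AES(256) Mac=SHA1
"""


def parse_cipher_line(line):
    """Split one listing line into its name, key=value attributes and export flag."""
    fields = line.split()
    attrs = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
    return {
        "name": fields[0],
        "kx": attrs.get("Kx"),
        "enc": attrs.get("Enc"),
        "mac": attrs.get("Mac"),
        "export": "export" in fields[1:],
    }


def is_weak(cipher, min_bits=128):
    """Flag export suites and suites whose Enc key size is below min_bits (assumed threshold)."""
    match = re.search(r"\((\d+)\)", cipher["enc"] or "")
    bits = int(match.group(1)) if match else None
    return cipher["export"] or (bits is not None and bits < min_bits)


ciphers = [parse_cipher_line(l) for l in CIPHER_LINES.splitlines() if l.strip()]
print([c["name"] for c in ciphers if is_weak(c)])
```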
Slide14
AliEn development
Slide15
AliEn development
JobAgent: subfolders for job outputs
  the use of the “Output” tag is the same, but relative paths can now be added before the filenames; the JobAgent will `mkdir -p` them in the catalogue
  also different sub-paths in archives
  Request for MC-to-MC embedding
JobAgent: retrials for critical calls to CS
  changeStatusCommand / getCatalogue
JobAgent: traces cleanup
  possibly we can go further with this, not a problem for now
Authen: `mirror -x <SE>`
  mirrors to closest site excluding the one of SE
  used by TAlienFile (see later)
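The "retrials for critical calls" item follows a common pattern: wrap the call, retry a bounded number of times, and re-raise the last error if all attempts fail. This is a generic sketch, not the JobAgent implementation; the attempt count, delay and backoff values are assumptions.

```python
import time


def call_with_retries(call, attempts=3, delay=0.0, backoff=2.0, sleep=time.sleep):
    """Run call() up to `attempts` times, sleeping between failures.

    The delay grows by `backoff` after each failure; the last error is
    re-raised if no attempt succeeds.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception as error:  # a real agent would catch narrower error types
            last_error = error
            if attempt < attempts:
                sleep(delay)
                delay *= backoff
    raise last_error
```

A caller would wrap only the calls that must not be lost, for example a status change, and leave best-effort calls unwrapped.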
Slide16
AliEn development
TaskQueue: fix for multiple MySQL queries
  new version of MySQL broke some queries (that were anyway badly written)
  this caused an incident on Central Services regarding job submission and status updates (3rd week of January)
  the new MySQL version came from swapping slave to master due to broken filesystem on the master (disk failures included)
Splitting
  Removed cap of 1000 InputData files per subjob
  Possibility to enable baskets dump on splitting optimizer logs
  Trace message in case subjob submission fails with some issue on InputData
JobManager: kill jobagent on active job resubmit
Slide17
AliEn development
JobAgent, Monitor, CMreport, JobManager: fully use JobToken to authenticate all calls from JobAgent (one payload attached)
  this prevents job executions from colliding (status changes,…) by rejecting invalid calls
    no way to know without the token or some extra parameters…
  created a Monitor and JobManager version compatible with the new and old JA to ease deployment
JobAgent: first approach to Singularity
  overload the CE environment with:
    CE_SINGULARITY = singularity exec --bind /cvmfs:/cvmfs /cvmfs/atlas.cern.ch/repo/images/singularity/x86_64-centos7.img
  Startup script and job execution wrapped (payload with --contain)
  Never really used/pushed, just tests
  More on containerization/Singularity in a few slides
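The token check described above can be sketched as follows: each execution of a job gets its own token, and any call that presents a different token is rejected, so a stale execution cannot change the status of a resubmitted job. The class and method names are hypothetical and the token format is an assumption.

```python
import hmac
import secrets


class JobTokenRegistry:
    """Toy central service that accepts status changes only with the current token."""

    def __init__(self):
        self._tokens = {}
        self.status = {}

    def start_execution(self, job_id):
        """Issue a fresh token; a resubmission invalidates the previous one."""
        token = secrets.token_hex(16)
        self._tokens[job_id] = token
        self.status[job_id] = "STARTED"
        return token

    def change_status(self, job_id, token, new_status):
        expected = self._tokens.get(job_id)
        # compare_digest avoids leaking token contents through timing differences
        if expected is None or not hmac.compare_digest(expected, token):
            raise PermissionError(f"invalid token for job {job_id}")
        self.status[job_id] = new_status


registry = JobTokenRegistry()
old_token = registry.start_execution(42)
new_token = registry.start_execution(42)  # resubmission
registry.change_status(42, new_token, "RUNNING")
```

A call made with `old_token` after the resubmission raises `PermissionError`, which is the collision the slide refers to.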
Slide18
AliEn development: TAlienFile
ROOT interface
Users of the AliEn Analysis Plugin have issues when their local SE is not working correctly
  While uploading JDLs, collections
  [AliROOT] ANALYSIS/ANALYSISaliceBase/AliAnalysisAlien.cxx -> [ROOT] TFile::Cp -> TAlienFile::Open
  TAlienFile::Open for write exits after first try
  It also doesn’t `mirror` a successfully uploaded file
Addressed in:
  https://alice.its.cern.ch/jira/browse/ALIROOT-7715
  You can try in alice-nightlies -> AliPhysics/v5-09-24-01-rc1-1
Bumped into production ROOT as of Tuesday 13th March
  Would be good to keep an eye on first production/s
  Users feedback
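The behaviour users were missing can be summarised as: try the next SE when a write fails, then mirror the file to SEs other than the one that took the first copy. The sketch below illustrates that flow only; it is not the TAlienFile fix, and the SE names and the `write` callback are hypothetical.

```python
def upload_with_fallback(path, storage_elements, write):
    """Try each SE in order; return the first that accepts the file plus the errors seen."""
    errors = {}
    for se in storage_elements:
        try:
            write(path, se)
            return se, errors
        except OSError as error:
            errors[se] = str(error)
    raise OSError(f"all storage elements failed for {path}: {errors}")


def mirror_targets(storage_elements, used_se):
    """Candidate SEs for extra replicas, excluding the one that already holds the file."""
    return [se for se in storage_elements if se != used_se]


def fake_write(path, se):
    # Stand-in for a real write; the first SE is treated as broken.
    if se == "SITE_A::SE":
        raise OSError("storage not responding")


ses = ["SITE_A::SE", "SITE_B::SE", "SITE_C::SE"]
used, errors = upload_with_fallback("/alice/user/job.jdl", ses, fake_write)
print(used, mirror_targets(ses, used))
```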
Slide19
jAliEn development
Slide20
jAliEn development
First “JAliEn developers week”, last week of November 2017
  Structure, prioritize, find holes, brainstorm and push jAliEn development
  5 participants (the currently active jAliEn developers)
    Costin, Miguel, Max, Tim, Vova
  Very fruitful discussions
  Clear path on what to do
  Managed to push some new code too
Slide21
jAliEn development
Result: …nicely growing to-do list
Slide22
jAliEn development: architecture
[Architecture diagram. Labels recovered from the figure:]
  Central services: jCentral (Brokers (jobs, trans), Optimizers (quota, prio), Authen, more…), Catalogue, TaskQueue, Transfers, LDAP
  Site services: jSite, jBox
  Worker node: JobAgent, jBox, ROOT on WN slot
  User machine: ROOT on User Machine, jBox, jsh
  Scale labels: O(10), O(100), O(1000)
  Connections: SSL(Compressed(Java serialized object stream)); WebsocketS, JSON serialization of requests/replies
  Legend: Default uplink, Optional uplink
Slide23
jAliEn development: new JobAgent
See full report on Max’s presentation
Split the current JobAgent in jAliEn in two
  Pilot
    Takes care of the matching (using a JobAgent token) and of payload monitoring
  Payload process
    Gets from the pilot (in a safe way, streams) the JDL, Job token,… and takes care of the second layer of isolation and execution
Isolation pilot-pilot and pilot-payload
  solution-agnostic approach to containerize the different components
  WG Containers actively looking for best fit for WLCG
  Singularity still the main candidate and partly used in production in ATLAS/CMS
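Passing the JDL and job token "in a safe way, streams" can be illustrated with a parent process that writes them to the child's standard input rather than to the command line or environment, where other processes on the node could read them. This is a sketch of the idea in Python, not the jAliEn pilot; the message fields and the JDL text are illustrative assumptions.

```python
import json
import subprocess
import sys

# Code run by the child ("payload") process: it reads its job description from stdin.
PAYLOAD_CODE = r"""
import json, sys
job = json.load(sys.stdin)  # JDL and token arrive on stdin, not in argv or env
print(json.dumps({"queueid": job["queueid"], "has_token": bool(job["token"])}))
"""


def run_payload(jdl, token, queueid):
    """Start the payload process and hand it the JDL and token over a pipe."""
    message = json.dumps({"jdl": jdl, "token": token, "queueid": queueid})
    result = subprocess.run(
        [sys.executable, "-c", PAYLOAD_CODE],
        input=message,
        capture_output=True,
        text=True,
        check=True,
        timeout=30,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    print(run_payload('Executable = "/bin/true";', "example-token", 42))
```

The pilot keeps only its own JobAgent token, while the payload sees only the job token it was handed.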
Slide24
jAliEn development: logging
Currently jAliEn uses standard Java logging libs
  Not fine grained, only structured on log levels
  All components share logs
We need a log per logical service or unit
  One approach is to recreate the logger endpoint after getting an instance of the class logger
    used on the JobAgent to log as done in AliEn
  Contextual logging, fully detailed on Tim’s presentation
    each class/service has a default logging context
    the default context can be changed or you can set extra contexts
    experimental: annotation-based logging context
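The two ideas above, one log per service and a context attached to each message, can be sketched with Python's standard `logging` module as a stand-in for the Java logging libraries jAliEn uses. The logger names and the `context` field are assumptions for illustration and do not reflect the jAliEn design.

```python
import io
import logging


def service_logger(service, stream):
    """Give each logical service its own logger and its own output endpoint."""
    logger = logging.getLogger(f"jalien.{service}")
    logger.setLevel(logging.INFO)
    logger.propagate = False   # keep services out of the shared root log
    logger.handlers.clear()    # "recreate the endpoint" for this logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(name)s [%(context)s] %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


def with_context(logger, context):
    """Attach a logging context; wrap again to switch to a different one."""
    return logging.LoggerAdapter(logger, {"context": context})


job_agent_log = io.StringIO()
log = with_context(service_logger("JobAgent", job_agent_log), "default")
log.info("started")
with_context(log.logger, "job-42").info("payload running")
print(job_agent_log.getvalue(), end="")
```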
Slide25
jAliEn development: transfers and recovery
See Vova’s presentation for full details
Creating a tool to deal with transfers and SE operations
  SE cleanups
    scheduling of data migration transfers
    check consistency of results, LDAP/ML cleanup…
  SE migrations
    E.g. for storage decommission
  SE <=> File Catalogue synchronization
    Orphans, dark data
  SE utilization optimization
    Explore archives, delete unnecessary files, repack…
  File recovery
    0-sized, broken/corrupted replicas…
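The SE <=> File Catalogue synchronization and the file-recovery checks reduce to comparing two listings. The sketch below assumes both are available as path-to-size mappings; that input format and the category names are assumptions for illustration ("dark data" is taken here to mean files on the SE that the catalogue does not know about).

```python
def compare_se_with_catalogue(catalogue_entries, se_listing):
    """Compare a catalogue view (path -> size) with an SE listing (path -> size)."""
    in_catalogue, on_se = set(catalogue_entries), set(se_listing)
    both = in_catalogue & on_se
    return {
        # registered in the catalogue but absent from the SE
        "missing_on_se": sorted(in_catalogue - on_se),
        # present on the SE but unknown to the catalogue
        "dark_data": sorted(on_se - in_catalogue),
        # replica exists but is empty although the catalogue expects data
        "zero_sized": sorted(
            p for p in both if se_listing[p] == 0 and catalogue_entries[p] > 0
        ),
        # replica size differs from the catalogue (possible corruption)
        "size_mismatch": sorted(
            p for p in both if se_listing[p] not in (0, catalogue_entries[p])
        ),
    }


catalogue = {"/a": 100, "/b": 200, "/c": 300, "/d": 400}
storage = {"/a": 100, "/b": 0, "/c": 299, "/x": 50}
print(compare_se_with_catalogue(catalogue, storage))
```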
Slide26
jAliEn development: site services
CE: submission chain ready
  Brokering
  Startup scripts -> downloading jar to the node
  Token Certificates
Batch interfaces
  CREAM
  HTCONDOR
Test Monitor functions can be proxied through JSite (or JBox on VoBox)
  To help socket management on JCentral
Slide27
jAliEn development
JobBroker: adapted to use Token Certificates
  The typical Job Token would be a combination of user, jobId and resubmission number
  Subject: OU="queueid=1038905674/resubmission=3/user=mmmartin", OU=jobagent, CN=jobagent, CN=Jobs, O=AliEn, C=ch
  Issuer: JAliEn-CS
JobManager: resubmission
Job Quotas
Find -j
JobAgent insertion and output cleanups
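The job identity carried in the token certificate subject above can be recovered with a small parser. This sketch only handles the subject format shown on the slide and does no certificate validation; the function name is hypothetical.

```python
import re

# Subject string as shown on the slide.
SUBJECT = (
    'OU="queueid=1038905674/resubmission=3/user=mmmartin", '
    "OU=jobagent, CN=jobagent, CN=Jobs, O=AliEn, C=ch"
)


def parse_job_token_subject(subject):
    """Extract queueid, resubmission and user from the quoted OU field."""
    match = re.search(r'OU="([^"]*)"', subject)
    if not match:
        raise ValueError("no quoted OU field with job information")
    fields = dict(item.split("=", 1) for item in match.group(1).split("/"))
    return {
        "queueid": int(fields["queueid"]),
        "resubmission": int(fields["resubmission"]),
        "user": fields["user"],
    }


print(parse_job_token_subject(SUBJECT))
```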
Slide28
jAliEn development
Wildcards expansions
Token auto-refresh and creating from string
  For the startup scripts
Extension-parsing for Tokens
Validation/fixes for ROOT client commands
  Find -x and collection listing
Using FindBugs + style template to validate the source code
More…
Slide29
jAliEn: to-do for first full JAliEn site
AliPhysics build with TJAlien (instead of TAlien)
Validate all parameters coming from JobAgent and test proxying messages through VoBox
  check parameters clustered correctly in ML
  preferably only test with the new implementation of the pilot
Logging contexts applied at least to the components involved
Test production jobs full chain
Slide30
New file catalogue
Slide31
New catalogue
Setup dual-instance Cassandra nodes
  2(+) Cassandra JVMs per machine
  Set private IPs using the client-server 10Gbps dedicated switch
  Cassandra-stress showed ~30% performance increase (why?)
    700 KHz ops, 25% cpu idle
  Instrumented the monitoring to get JMX metrics on custom host/port
Scylla Summit 2017, San Francisco
  Invited by the company to present our work in ALICE
  My presentation here
  Very accessible and professional team, and the rest of the audience too..!
  Clarified a few important items, and learned a few others
    AGPL, Open version vs Enterprise version -> support and nice functionalities, protection against cloud/private providers
    Management tools for repair (that we got for free) and other
    New hybrid compaction strategy
    Lot of new developments coming or ready: Secondary Indexes, Materialized Views, User Defined Types
    Hardware (NVMs, Optane), best practices, performance debugging, monitoring…
Slide32
New catalogue: latest results
[Plot: throughput in KHz (ops/second) for benchmarks A to G, with a 1.6 MHz label; series: Cassandra read, Cassandra write, Scylla read, Scylla write]

   Benchmark (default cassandra-stress)             Cassandra                     Scylla                Diff
A  Insert only                                      18% util, 2% iowait           100% util             x2
B  Read-only Gauss(5B,2.5B,10K)                     Disk idle, 50% cpu            100% cpu              x3
C  Read-only Gauss(5B,2.5B,1M)                      11% util, 40% cpu             11% util, 100% cpu    x3.28
D  Mixed (10r,1w) Gauss(5B,2.5B,100K)               2% util, 45% cpu              10% util, 100% cpu    x5.8
E  Mixed (10r,1w) Gauss(5B,2.5B,1M)                 5% util, 50% cpu              16% util, 100% cpu    x6.4
F  Mixed (10r,1w) Gauss(5B,2.5B,10M)                9% util, 45% cpu              40% util, 100% cpu    x6.1
G  Mixed 2K thrd. read, 200 write, G(5B,2.5B,100K)  8% util, 40% cpu, no iowait   26% util, 100% cpu    x5.62
Slide33
New catalogue: what next
Full table scans
  SE sync, quotas, analytics
Failure/production environment framework
  Recreate node failures, network breakdown, disk issues…
Slide34
Thanks! Questions?