Slide 1: ARC6 @ECDF
What we do, what we found, what we did, what we fixed…
Slide 2: Previous Tier2 @ECDF (pre-2019)
ARC5 running atop SGE
4 additional cron jobs across 3 servers to provide better integration/support
4 very large patches (>300 lines of code) against ARC5:
  2 for the general job parsing and GLUE
  2 patches to scan-sge-job & submit-sge-job for site support
This was unstable/unpredictable, which was reflected in poor A/R
The old accounting/job-parsing was very I/O-intensive
The size of these hacks introduced bugs (some severe!) and many other issues…
Slide 3: Current Tier2 @ECDF (mid-2019+)
Running UNIVA (SGE) with ARC 6.6.0 vanilla + the old lhcb-glue-fixes (planning to update soon)
Our UNIVA backend sits behind a custom SGE-ECDF tooling layer (Python 3)
  Lots of custom caching/site logic is captured here
Running with the new accounting stack
No issues to report
(ARC support for SGE is best effort, so ¯\_(ツ)_/¯)
Slide 4: Tier3 @ECDF
The Tier3 is composed of ‘spare kit’ running ARC6 + HTCondor
The HTCondor deployment is very minimal: cgroups, caching and more advanced tooling are still planned for when we have extra effort
HTCondor support in ARC6 is first class: after we had HTCondor working, integrating the first jobs with ARC6 took <1 hr
Accounting/reporting here initially had issues (hence the rest of this talk)
Slide 5: The ARC6 Accounting Bug: How did we find the problem?
Setting the Tier3 up was an exercise in making sure we knew how to do all this with GGUS/EGI/etc., so we were checking the EGI dashboards
The 2 metrics which we checked were `Normalised Sum Elapsed` and `Sum CPU`
Ideally these are related by the benchmark (HEPSPEC) value of the site; both sets of data were the same for us
The problem was originally spotted only a few hours after the T3-CE came online and the first few jobs were manually published
Slides 6-7: How to spot if you have a problem at your site?
For October, compare the normalised CPU time:
https://accounting-next.egi.eu/egi/site/UKI-SCOTGRID-ECDF/normelap/SubmitHost/DATE/2020/10/2020/10/all/localinfrajobs
with the sum of CPU time used:
https://accounting-next.egi.eu/egi/site/UKI-SCOTGRID-ECDF/sumcpu/SubmitHost/DATE/2020/10/2020/10/all/localinfrajobs
(changing the URLs above for your site)
If these show the same numbers for a given CE in a given month, you’re impacted by this ARC6 bug.
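The check above can be sketched as a few lines of Python. The CE names and accounting figures below are made-up placeholders, not real EGI dashboard data; the point is only the comparison rule: identical `Normalised Sum Elapsed` and `Sum CPU` values for the same month suggest every job was published with a benchmark of 1.0.

```python
def affected(norm_elapsed_hours: float, sum_cpu_hours: float) -> bool:
    """With a real HEPSPEC benchmark the two metrics differ by the scaling
    factor; identical values are the signature of the ARC6 accounting bug."""
    return norm_elapsed_hours == sum_cpu_hours

# Hypothetical per-CE figures: (Normalised Sum Elapsed, Sum CPU), in hours.
metrics = {
    "ce-good.example.org": (13000.0, 1000.0),   # scaled by HEPSPEC:13.0
    "ce-buggy.example.org": (1000.0, 1000.0),   # identical -> suspicious
}

flagged = [ce for ce, (norm, cpu) in metrics.items() if affected(norm, cpu)]
```
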
Slide 8: So what was going wrong?
We resubmitted the accounting data (for October this would be):
# arcctl -d DEBUG accounting republish -b 2020-10-01 -e 2020-10-31 -t egi
The output is highly verbose, but shows:
  Accounting data is extracted from /var/spool/arc/jobstatus/accounting/accounting.db
  The S/MIME-signed message with the accounting data is viewable via DEBUG
  The accounting tools are reporting a benchmark of HEPSPEC:1.0 for all jobs
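The "HEPSPEC:1.0 for all jobs" symptom can also be checked directly in the DB. A minimal sketch, using a toy in-memory database rather than the real /var/spool/arc/jobstatus/accounting/accounting.db, and assuming only the (RecordID, InfoKey, InfoValue) columns of the JobExtraInfo table that this talk queries:

```python
import sqlite3

# Toy stand-in for ARC6's accounting.db with just the columns used here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE JobExtraInfo (RecordID INT, InfoKey TEXT, InfoValue TEXT)")
con.executemany(
    "INSERT INTO JobExtraInfo VALUES (?, ?, ?)",
    [(1, "benchmark", "HEPSPEC:1.0"),   # the bug: default benchmark
     (2, "benchmark", "HEPSPEC:1.0")],
)

# If the only distinct benchmark is HEPSPEC:1.0, the site is affected.
benchmarks = {row[0] for row in
              con.execute("SELECT DISTINCT InfoValue FROM JobExtraInfo "
                          "WHERE InfoKey = 'benchmark'")}
```

Against a real accounting.db the same SELECT can be run via the sqlite3 CLI.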
Slide 9: ARC accounting DB
Inspecting the ARC accounting DB on the ARC+HTCondor host revealed:
# sqlite3 ./accounting-test.db "SELECT * FROM JobExtraInfo ORDER BY RecordID DESC;" | less
This differed from our ARC+UNIVA host, which had an additional |benchmark| entry for each job
Slide 10: How did we fix this @ECDF? Fix 1/2
Current/new jobs aren’t storing the benchmark in the accounting DB.
Identifying what might fix this previously required hacking the ARC job-parsing code, introducing breakpoints and tracing what was going on…
This is fixed now, so the simplest solution is to update ARC and add something like the following to your arc.conf:
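The exact arc.conf snippet from the slide was not captured in this transcript. As a hypothetical sketch only, a queue-level benchmark definition might look like the following; the block name, option name and value syntax are assumptions and should be checked against the NorduGrid accounting-benchmark documentation for your ARC6 version:

```ini
; Hypothetical example -- verify option name/syntax against the
; nordugrid accounting-benchmark documentation before using.
[queue:main]
; Site HEPSPEC value to record when the LRMS reports none (13.0 at ECDF)
benchmark = HEPSPEC 13.0
```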
Slide 11: How did we fix this @ECDF? Fix 2/2 (part 1…)
Jobs which have previously run are now in a bad state in the accounting DB.
For jobs with missing benchmarking data, this can be fixed by inserting the correct site benchmark values into the DB where missing (HEPSPEC:13.0 @ECDF):
sqlite> INSERT INTO JobExtraInfo (RecordID, InfoKey, InfoValue)
        SELECT RecordID, 'benchmark', 'HEPSPEC:13.0' FROM AAR
        WHERE RecordID NOT IN (SELECT RecordID FROM JobExtraInfo WHERE InfoKey = 'benchmark')
          AND LocalJobID IS NOT NULL AND LocalJobID <> '';
Then re-publish your results:
# arcctl -d DEBUG accounting republish -b 2020-08-01 -e 2020-10-31 -t egi
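The backfill logic of that INSERT … SELECT can be exercised on a toy database before touching the real one. A sketch assuming only the minimal subset of the ARC6 accounting schema used in this talk (AAR with RecordID/LocalJobID, key-value rows in JobExtraInfo):

```python
import sqlite3

# Toy stand-in for accounting.db: two tables, minimal columns.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE AAR (RecordID INT, LocalJobID TEXT)")
con.execute("CREATE TABLE JobExtraInfo (RecordID INT, InfoKey TEXT, InfoValue TEXT)")
con.executemany("INSERT INTO AAR VALUES (?, ?)",
                [(1, "1001"),    # job missing its benchmark row
                 (2, "1002"),    # job that already has one
                 (3, None)])     # job with no LocalJobID: left alone
con.execute("INSERT INTO JobExtraInfo VALUES (2, 'benchmark', 'HEPSPEC:13.0')")

# The slide's backfill: add HEPSPEC:13.0 for every job that has a
# LocalJobID but no benchmark entry yet.
con.execute("""
    INSERT INTO JobExtraInfo (RecordID, InfoKey, InfoValue)
    SELECT RecordID, 'benchmark', 'HEPSPEC:13.0' FROM AAR
    WHERE RecordID NOT IN
          (SELECT RecordID FROM JobExtraInfo WHERE InfoKey = 'benchmark')
      AND LocalJobID IS NOT NULL AND LocalJobID <> ''
""")

fixed = sorted(r[0] for r in con.execute(
    "SELECT RecordID FROM JobExtraInfo WHERE InfoKey = 'benchmark'"))
```

Only record 1 gains a new row; record 2 keeps its existing entry and record 3 is skipped, mirroring the guards in the slide's SQL.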
Slide 12: How did we fix this @ECDF? Fix 2/2 (part 2…)
Depending on the build/config of ARC6 you may also have jobs with an incorrect benchmark (doh!), and these jobs need fixing too.
Update the entries that have the wrong benchmark in the accounting DB:
sqlite> UPDATE JobExtraInfo SET InfoValue = 'HEPSPEC:13.0' WHERE InfoKey = 'benchmark';
Then re-publish your results:
# arcctl -d DEBUG accounting republish -b 2020-08-01 -e 2020-10-31 -t egi
Slide 13: Where are we now?
ARC has had all of this fixed since at least 2020-09-14; accounting is fixed in the latest builds
We fixed the benchmarks in our accounting DB
Everything appears to work
Re-publishing updated the accounting-next.egi.eu dashboard within hours for our CE
Slide 14: Summary
The new ARC accounting subsystem is very friendly
Instructions from the ARC6 devs: http://www.nordugrid.org/documents/arc6/admins/details/accounting-benchmark.html
Had to brush up on my Perl, sqlite3 and grid skills when things went wrong… but learnt a bit along the way
HTCondor+ARC6 works and feels better than UNIVA+ARC6
Managed to get a simplified working Tier3 with correct accounting up within a week(-end)