Uploaded On 2022-07-28


Slide1

ARC6 @ECDF

What we do, what we found, what we did, what we fixed…

Slide2

Previous Tier2 @ECDF (pre-2019)

ARC5 running atop SGE

4 additional cron jobs across 3 servers to provide better integration/support

4 very large patches (>300 lines of code) against ARC5:

2 for the general job parsing and GLUE

2 patches to sge-scan-job & submit-sge-job for site support

This was unstable/unpredictable and was reflected in poor A/R

Old accounting/job-parsing very I/O intensive

Size of hacks introduced bugs (some severe!) and many other issues…

Slide3

Current Tier2 @ECDF (mid-2019+)

Running UNIVA (SGE) with ARC-6.6.0 vanilla + old-lhcb-glue-fixes (planning to update soon)

Our UNIVA backend is behind a custom SGE-ECDF tooling layer (Python3)

^ Lots of custom-caching/site-logic captured here ^

Running with new accounting stack

No issues to report

(ARC support for SGE is best effort so ¯\_(ツ)_/¯)

Slide4

Tier3 @ECDF

Tier3 composed of ‘spare kit’ running ARC6+HTCondor

HTCondor deployment is very minimal.

cgroups, caching and more advanced tooling still planned for when we have extra effort.

HTCondor support by ARC6 is 1st class.

After we had HTCondor working, integrating first jobs with ARC6 took <1hr.

Accounting/Reporting here initially had issues

(Hence the rest of this talk)

Slide5

The ARC6 Accounting Bug: How did we find the problem?

Setting Tier3 up was an exercise in making sure we knew how to do all this with GGUS/EGI/etc.

We were checking the EGI dashboards.

The 2 metrics which we checked were: `Normalised Sum Elapsed` and `Sum CPU`

Ideally these are related by the benchmark HEPSPEC value of the site.

(Both sets of data were the same for us.)

Originally the problem was spotted only a few hours after the T3-CE came online and the first few jobs were manually published.

Slide6

Slide7

How to spot if you have a problem at your site?

For October 2020, compare the normalised elapsed time:

https://accounting-next.egi.eu/egi/site/UKI-SCOTGRID-ECDF/normelap/SubmitHost/DATE/2020/10/2020/10/all/localinfrajobs

With the sum of CPU time used:

https://accounting-next.egi.eu/egi/site/UKI-SCOTGRID-ECDF/sumcpu/SubmitHost/DATE/2020/10/2020/10/all/localinfrajobs

(Change the site name in the URLs above to your own site)

If these have the same numbers for a given CE for a given month, you’re impacted by this ARC6 bug.
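The check above can be sketched as a simple ratio test. This is a minimal illustration, assuming the normalised figure is the raw figure scaled by the site benchmark, so a ratio of about 1 suggests a benchmark of 1.0 was applied. The function names, the tolerance and the sample numbers are all hypothetical and are not taken from the EGI portal or from ARC.

```python
def implied_benchmark(normalised_hours: float, raw_hours: float) -> float:
    """Ratio of normalised to raw time, roughly the benchmark factor applied."""
    if raw_hours <= 0:
        raise ValueError("raw_hours must be positive")
    return normalised_hours / raw_hours


def looks_unnormalised(normalised_hours: float, raw_hours: float,
                       tolerance: float = 0.01) -> bool:
    """True when the two dashboard figures are effectively identical."""
    ratio = implied_benchmark(normalised_hours, raw_hours)
    return abs(ratio - 1.0) <= tolerance


# Made-up monthly figures for one CE, not real ECDF data.
print(looks_unnormalised(1200.0, 1200.0))   # identical numbers -> True
print(looks_unnormalised(15600.0, 1200.0))  # ratio of 13 -> False
```

The numbers still have to be read off the two dashboard pages by hand; the sketch only makes the comparison explicit.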

Slide8

So what was going wrong?

We resubmitted the accounting data (for October this would be):

# arcctl -d DEBUG accounting republish -b 2020-10-01 -e 2020-10-31 -t egi

Output is highly verbose, but shows:

Accounting data is extracted from: /var/spool/arc/jobstatus/accounting/accounting.db

Essentially the S/MIME signed message with the accounting data is viewable via DEBUG

Accounting tools are reporting a HEPSPEC:1.0 for all jobs

Slide9

ARC accounting db

Inspecting the ARC accounting DB from ARC+HTCondor host revealed:

# sqlite3 ./accounting-test.db "select * from JobExtraInfo ORDER BY RecordID DESC;" | less

This differed from our ARC+UNIVA host which had an additional |benchmark| entry for each job
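The same inspection can be narrowed to just the affected jobs. The sketch below runs against a throwaway in-memory database with a simplified stand-in schema: it uses only the table and column names quoted in this talk (AAR, JobExtraInfo, RecordID, LocalJobID, InfoKey, InfoValue), and the real ARC accounting database has more tables and columns. The helper name and sample rows are hypothetical.

```python
import sqlite3

# Simplified stand-in schema (an assumption): only the columns named in the talk.
SCHEMA = """
CREATE TABLE AAR (RecordID INTEGER PRIMARY KEY, LocalJobID TEXT);
CREATE TABLE JobExtraInfo (RecordID INTEGER, InfoKey TEXT, InfoValue TEXT);
"""

MISSING_BENCHMARK = """
SELECT RecordID FROM AAR
WHERE RecordID NOT IN (
    SELECT RecordID FROM JobExtraInfo WHERE InfoKey = 'benchmark'
)
ORDER BY RecordID;
"""


def jobs_missing_benchmark(conn):
    """Return the RecordIDs of jobs with no 'benchmark' row in JobExtraInfo."""
    return [row[0] for row in conn.execute(MISSING_BENCHMARK)]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO AAR VALUES (?, ?)",
                 [(1, "101"), (2, "102"), (3, "103")])
conn.executemany("INSERT INTO JobExtraInfo VALUES (?, ?, ?)",
                 [(1, "benchmark", "HEPSPEC:13.0"), (2, "somekey", "x")])

print(jobs_missing_benchmark(conn))  # [2, 3]
```

Running a query like this against a copy of the database, rather than the live file, avoids contending with the running service.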

Slide10

How did we fix this @ECDF? Fix 1/2

Current/New jobs aren’t storing the benchmark in the accounting db.

Identifying what might fix this previously required hacking the ARC job parsing code, introducing breakpoints and tracing what was going on…

This is fixed now, so the simplest solution is to update ARC and add something like the following to your arc.conf:

Slide11

How did we fix this @ECDF? Fix 2/2 (part 1…)

Jobs which have previously run are now in a bad state in the accounting db.

For jobs with missing benchmark data this can be fixed as follows:

Insert the correct site benchmarking values into the db where missing (HEPSPEC:13 @ECDF):

sqlite> INSERT INTO JobExtraInfo (RecordID, InfoKey, InfoValue) SELECT RecordID, 'benchmark', 'HEPSPEC:13.0' FROM AAR WHERE RecordID NOT IN (SELECT RecordID FROM JobExtraInfo WHERE InfoKey = 'benchmark') AND LocalJobID IS NOT NULL AND LocalJobID <> '';

Re-publish your results:

# arcctl -d DEBUG accounting republish -b 2020-08-01 -e 2020-10-31 -t egi
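To see what the backfill statement does before running it on a real accounting db, it can be tried on a throwaway database. This is a minimal sketch using the same simplified stand-in schema as an assumption (only the columns named in the talk); the helper name and sample rows are hypothetical, and the benchmark string is the ECDF value, so substitute your own site's figure.

```python
import sqlite3

SITE_BENCHMARK = "HEPSPEC:13.0"  # ECDF value from the talk; use your own site's

# Simplified stand-in schema (an assumption), not the full ARC schema.
SCHEMA = """
CREATE TABLE AAR (RecordID INTEGER PRIMARY KEY, LocalJobID TEXT);
CREATE TABLE JobExtraInfo (RecordID INTEGER, InfoKey TEXT, InfoValue TEXT);
"""

BACKFILL = """
INSERT INTO JobExtraInfo (RecordID, InfoKey, InfoValue)
SELECT RecordID, 'benchmark', ?
FROM AAR
WHERE RecordID NOT IN (
    SELECT RecordID FROM JobExtraInfo WHERE InfoKey = 'benchmark'
)
AND LocalJobID IS NOT NULL AND LocalJobID <> '';
"""


def backfill_benchmark(conn, benchmark):
    """Add a benchmark row for every job lacking one; return rows added."""
    before = conn.total_changes
    with conn:  # commit on success, roll back on error
        conn.execute(BACKFILL, (benchmark,))
    return conn.total_changes - before


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO AAR VALUES (?, ?)",
                 [(1, "101"), (2, "102"), (3, ""), (4, None)])
conn.execute("INSERT INTO JobExtraInfo VALUES (1, 'benchmark', 'HEPSPEC:13.0')")
conn.commit()

print(backfill_benchmark(conn, SITE_BENCHMARK))  # 1 (only job 2 qualifies)
print(backfill_benchmark(conn, SITE_BENCHMARK))  # 0 (second run adds nothing)
```

Because the statement only touches jobs without a benchmark row, running it twice is harmless, which makes it reasonably safe to retry after an interrupted session.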

Slide12

How did we fix this @ECDF? Fix 2/2 (part 2…)

Depending on the build/config of ARC6 you may have jobs with an incorrect benchmark (Doh!)

Need to fix these jobs.

Update entries that have the wrong benchmark in the accounting db:

sqlite> UPDATE JobExtraInfo SET InfoValue = 'HEPSPEC:13.0' WHERE InfoKey = 'benchmark';

Re-publish your results:

# arcctl -d DEBUG accounting republish -b 2020-08-01 -e 2020-10-31 -t egi
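Before overwriting values it can help to see which benchmark strings are actually stored. The sketch below is a slightly narrower variant of the UPDATE above, not the statement from the talk: it counts the distinct stored values first and then rewrites only rows that differ from the site value. It assumes the same simplified stand-in JobExtraInfo table; helper names and sample rows are hypothetical.

```python
import sqlite3

SITE_BENCHMARK = "HEPSPEC:13.0"  # ECDF value from the talk; use your own site's


def benchmark_counts(conn):
    """Map each stored benchmark string to the number of jobs carrying it."""
    rows = conn.execute(
        "SELECT InfoValue, COUNT(*) FROM JobExtraInfo "
        "WHERE InfoKey = 'benchmark' GROUP BY InfoValue ORDER BY InfoValue"
    )
    return dict(rows.fetchall())


def correct_benchmarks(conn, benchmark):
    """Rewrite benchmark rows that differ from `benchmark`; return rows changed."""
    before = conn.total_changes
    with conn:  # commit on success, roll back on error
        conn.execute(
            "UPDATE JobExtraInfo SET InfoValue = ? "
            "WHERE InfoKey = 'benchmark' AND InfoValue <> ?",
            (benchmark, benchmark),
        )
    return conn.total_changes - before


conn = sqlite3.connect(":memory:")
# Simplified stand-in table (an assumption), not the full ARC schema.
conn.execute("CREATE TABLE JobExtraInfo (RecordID INTEGER, InfoKey TEXT, InfoValue TEXT)")
conn.executemany("INSERT INTO JobExtraInfo VALUES (?, ?, ?)", [
    (1, "benchmark", "HEPSPEC:1.0"),
    (2, "benchmark", "HEPSPEC:13.0"),
    (3, "benchmark", "HEPSPEC:1.0"),
    (3, "otherkey", "HEPSPEC:1.0"),   # not a benchmark row, must stay untouched
])
conn.commit()

print(benchmark_counts(conn))                    # {'HEPSPEC:1.0': 2, 'HEPSPEC:13.0': 1}
print(correct_benchmarks(conn, SITE_BENCHMARK))  # 2
print(benchmark_counts(conn))                    # {'HEPSPEC:13.0': 3}
```

The extra `InfoValue <> ?` condition only changes how many rows are reported as modified; the end state matches the unconditional UPDATE.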

Slide13

Where are we now?

ARC has had all of this fixed since at least 2020/09/14

Accounting fixed in latest builds

Fixed benchmarks in the accounting db

Everything appears to work

Re-publishing updated the “accounting-next.egi.eu” dashboard within hours for our CE

Slide14

Summary

New ARC accounting subsystem is very friendly

Instructions from ARC6 devs: http://www.nordugrid.org/documents/arc6/admins/details/accounting-benchmark.html

Had to brush up on my Perl, sqlite3, grid-skills when things went wrong… but learnt a bit along the way.

HTCondor+ARC6 works and feels better than UNIVA+ARC6

Managed to get a simplified working Tier3 with correct accounting up within a week(-end)