ALICE DCS upgrades - ALICE WINCC setup

Presentation Transcript

Slide1

ALICE DCS upgrades

Slide2

ALICE WINCC setup

Each detector builds a local distributed system

We try to keep subsystems on separate computers (HV, LV, FEE, …)

Central DCS systems connect to all WINCC systems in ALICE

Scattered systems used for UI (central operator UI, alert screen, displays in the control center, expert UIs, …)

Slide3

[Diagram: three detectors, each running its own set of WCC OA systems]

Each detector builds its own distributed control system

For example, to restore from SAFE:

Some detectors end in the Configured state

Other detectors need special configuration

Others do not change state at all

The state of the experiment depends on:

The state of each subdetector, each having its own control logic

Internal and external services

ALICE online and offline systems…

Each detector system is built as an autonomous distributed system of 2-16 WINCC systems

Slide4

ALICE DCS Systems

[Diagram: the central DCS systems connected to the WCC OA systems of every detector]

The central systems connect to the detector systems and form one large distributed system, sharing the data and the control

Slide5

The ALICE DCS scale

21 autonomous detector systems

120 WINCC OA systems

>100 subsystems

160 control computers

>700 embedded computers

1200 network-attached devices

300 000 OPC and front-end items

Slide6

OPC servers and drivers in ALICE

Wiener: 23 OPC servers

Systec: 25 drivers

ELMB: 13 OPC servers

ISEG: 5 OPC servers + 1 custom (DIM-based) server

CAEN: 20 OPC servers

Slide7

(Some of) the present ALICE and DCS challenges

ALICE Pixel detector (SPD):

10 000 000 configurable pixels

1.3 kW power dissipation on (200+150) μm assemblies

Contingency in case of cooling failure: less than 1 minute

ALICE Pixel detector

ALICE Transition Radiation Detector (TRD):

760 m² covered by drift chambers

17 TB/s raw data processed by 250 000 tracklet processors directly on the detector

65 kW power provided by 89 LV power supplies

Transition Radiation Detector

ALICE Time Projection Chamber (TPC):

96 m³ gas volume (the largest ever), stabilized at 0.1 °C

Cooling system has to remove 28 kW of dissipated power

Installed just next to the TRD

557 568 readout pads

100 kV voltage in the drift cage

Time Projection Chamber

Slide8

Slide9

The ALICE O2 Project

ALICE 2007-2017: ~10 GB/s

Slide10

LHC will provide for RUN3 ~100x more Pb-Pb collisions compared to RUN2 (10¹⁰ to 4×10¹¹ collisions/year)

ALICE O2 Project merges online and offline into one large system

TDR approved in 2015!

Some O2 challenges:

New detector readout: 8400 optical links

Data rate: 1.1 TB/s

Global computing: ~100k CPU cores, 5k GPUs, 500 FPGAs

Data storage requirements: ~60 PB/year

[Figure: The ALICE O2 Project - ALICE 2007-2017 (~10 GB/s) vs. ALICE 2019-2027]

Slide11

The O2 architecture

Detectors

250 First Level processors (FLP)

1250 Event Processing Nodes (EPN)

~100 000 processor cores

Slide12

DCS in the context of O2

DCS provides input to O2

~100 000 conditions parameters are requested for the event reconstruction

Data has to be injected into each 20 ms data frame

Slide13

DCS-O2 interface

[Diagram: the ALICE Data Collector and Publisher sits between the ALICE DCS Production System, the ALICE DCS Storage System (ORACLE), the Process Image, and O2]

The ALICE Data Collector receives data from the DCS

Depending on the type and priority of the data, the Data Collector can connect to different layers of the system

A process image, containing the conditions data, is sent to O2

Slide14

DCS-O2 interface

ALICE Data Collector:

Data Finder and Subscriber: finds the data and subscribes to the published values

Process Image: keeps an up-to-date copy of the data received by the Subscriber

Data Publisher: sends the conditions data to O2 (sketched below)
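To make the Data Collector idea concrete, here is a minimal Python sketch of the pattern described above. All names (ProcessImage, run_subscriber, send_to_o2, …) are hypothetical illustrations, not the actual ALICE code: a subscriber keeps the process image up to date as values arrive, and a publisher snapshots it into every ~20 ms data frame.

import threading
import time

# Minimal sketch only: hypothetical names, not the real ALICE Data
# Collector. A subscriber keeps a "process image" (the latest value of
# each conditions parameter) up to date; a publisher snapshots it into
# every ~20 ms data frame sent to O2.

class ProcessImage:
    def __init__(self):
        self._values = {}
        self._lock = threading.Lock()

    def update(self, parameter, value):
        # Called by the subscriber whenever a new value is published.
        with self._lock:
            self._values[parameter] = value

    def snapshot(self):
        # A consistent copy of all current values for one data frame.
        with self._lock:
            return dict(self._values)

def run_subscriber(image, source):
    # 'source' stands in for a real subscription to published DCS values
    # (e.g. over DIM); here it is any iterable of (name, value) pairs.
    for parameter, value in source:
        image.update(parameter, value)

def run_publisher(image, send_to_o2, n_frames, frame_period=0.020):
    # Inject the current conditions snapshot into each data frame.
    for _ in range(n_frames):
        send_to_o2(image.snapshot())
        time.sleep(frame_period)

# Example: feed in a few values, then publish three frames.
image = ProcessImage()
run_subscriber(image, [("hv.voltage", 1500.0), ("env.temp", 21.3)])
run_publisher(image, send_to_o2=print, n_frames=3)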

Prototypes proved the feasibility of the concept

Larger scale prototypes being prepared

…but this is just the beginning of the story

Slide15

New readout electronics for ALICE

The O2 interfaces to detector frontends using a new Common Readout Unit (CRU)

All data is transmitted over 8400 optical links

DCS data provided by the front-end is interleaved with the physics data

The FLP strips the DCS data off the global stream and sends it to the DCS (see the sketch below)

The same link is used to send DCS commands to the front-end
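As a toy illustration of the stripping step, the Python sketch below routes an interleaved stream into separate DCS and physics paths; the real CRU stream format is not described in these slides, so the (kind, payload) tagging is purely an assumption.

# Toy sketch of the FLP stripping step; the (kind, payload) tagging is
# an assumption made for illustration, not the real CRU stream format.

DCS, PHYSICS = "dcs", "physics"

def strip_dcs(stream, to_dcs, to_physics):
    # Route the interleaved stream: DCS words go to the DCS system,
    # everything else stays in the physics data path.
    for kind, payload in stream:
        if kind == DCS:
            to_dcs(payload)
        else:
            to_physics(payload)

# Example with two in-memory "queues":
dcs_out, physics_out = [], []
mixed = [(PHYSICS, b"event-1"), (DCS, b"temp=21.3"), (PHYSICS, b"event-2")]
strip_dcs(mixed, dcs_out.append, physics_out.append)
assert dcs_out == [b"temp=21.3"] and len(physics_out) == 2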

Slide16

The CONTROL and CONFIGURATION

[Diagram: WCC OA systems connect to FRED; FRED talks to ALF running on the FLP; ALF drives the CRU]

DETECTOR-SPECIFIC LAYER: FRED (Front End Device) runs on a dedicated server

Receives commands from WCC OA and forwards them to ALF

Receives data from ALF and publishes it to WCC OA

DETECTOR-NEUTRAL LAYER: ALF (ALICE Low Level Frontend interface) provides the communication interface to the CRU firmware (see the sketch below)
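The two-layer split can be sketched in Python as below; the class and method names are illustrative assumptions, not the real FRED/ALF API.

# Illustrative sketch of the two layers above; the class and method
# names are assumptions, not the real FRED/ALF interfaces.

class Alf:
    """Detector-neutral layer: access to CRU firmware registers
    (emulated here with a plain dict instead of the optical link)."""
    def __init__(self):
        self._registers = {}

    def write(self, register, value):
        self._registers[register] = value

    def read(self, register):
        return self._registers.get(register)

class Fred:
    """Detector-specific layer: translates WCC OA commands for ALF and
    publishes front-end data back to WCC OA."""
    # Hypothetical mapping of high-level commands to register writes.
    COMMANDS = {"HV_ON": ("hv_ctrl", 1), "HV_OFF": ("hv_ctrl", 0)}

    def __init__(self, alf, publish):
        self.alf = alf
        self.publish = publish  # callback delivering values to WCC OA

    def on_command(self, command):
        # Forward a WCC OA command down to ALF.
        register, value = self.COMMANDS[command]
        self.alf.write(register, value)

    def on_poll(self, register):
        # Forward front-end data back up to WCC OA.
        self.publish(register, self.alf.read(register))

# Example: a WCC OA command travels down, a value travels back up.
fred = Fred(Alf(), publish=lambda reg, val: print(f"{reg} -> {val}"))
fred.on_command("HV_ON")
fred.on_poll("hv_ctrl")   # prints: hv_ctrl -> 1

The point of the split is that only FRED carries detector-specific knowledge (the command table here), while ALF stays identical for every detector.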

Slide17

How ALICE performs updates

Slide18

ALICE update procedures

General rules:

Big updates only during the YETS

Operating systems and patches, DB servers, WINCC upgrades, new FW….

Critical and security updates during the technical stops

One exception: no updates during the last TS before the HI run

Exceptional updates installed at any time, as needed (fixes for urgent issues like the OPC memory leak …)

On Monday of each TS a full backup of all WINCC systems is taken (a sketch of the idea follows below)

Detectors are responsible for taking additional backups after each system change

The DCS team can roll the systems back to their state at the beginning of the last TS
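As a loose illustration of the backup and rollback rule above, the sketch below date-stamps a full copy of a project directory and can restore it later; the paths and naming are assumptions, not the actual ALICE tooling.

import shutil
from datetime import date
from pathlib import Path

# Loose illustration only: paths and naming are assumptions, not the
# actual ALICE backup tooling (which relies on CMF and homegrown
# scripts, see the next slide).

BACKUP_ROOT = Path("/backups/wincc")   # hypothetical backup location

def backup_project(project_dir):
    """Take a full, date-stamped copy of one WINCC project directory."""
    project_dir = Path(project_dir)
    target = BACKUP_ROOT / f"{project_dir.name}-{date.today():%Y%m%d}"
    shutil.copytree(project_dir, target)
    return target

def rollback_project(project_dir, stamp):
    """Restore a project from the backup taken on the given date stamp
    (e.g. the Monday of the last TS)."""
    project_dir = Path(project_dir)
    source = BACKUP_ROOT / f"{project_dir.name}-{stamp}"
    shutil.rmtree(project_dir)            # drop the current state...
    shutil.copytree(source, project_dir)  # ...and bring the backup back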

Slide19

ALICE update procedures

The operating systems, OS patches, drivers, WINCC OA and OPC servers are installed centrally

Using CMF and homegrown scripts/procedures

Detectors install only WINCC OA projects

Trap: WINCC experts are granted overly elevated privileges, which leads to frequent misuse

A question about the required privileges has been submitted; the answer was: "We would like to know as well, let us know if you find out the answer"

Slide20

ALICE update procedures

Framework update procedure:

Install new FW in the lab to spot obvious issues

Create the ALICE FW package in the ALICE repositories (JCOP FW + ALICE extensions)

Upgrade and test the central DCS systems maintained by the central team

Publish the new FW to detectors and set a deadline for the upgrade (grace period of a few weeks, depending on the ALICE activities)

We rely on rollback in case of troubles

Integration session with all detectors to re-establish the full system functionality

The ALICE FW is frozen after the release, to assure uniformity

Critical patches are maintained in separate repositories (we do not modify the already released distribution)

Slide21

Testing of components

Ad-hoc tests are performed in the DCS lab (performance, long-term stability…)

Scale of the system is not realistic

A small project for this summer aims at creating reference slices for some standard devices (CAEN, ISEG, …)

Full slice from the device up to the user interface

The lab is used for debugging before a support request is submitted to BE-ICS

Standing offer to use the lab for more standardized tests

We could give another try to revive this project

Slide22

Operating systems in ALICE DCS

Windows Server

WINCC is installed exclusively on Windows Server systems

Currently WS2008R2

Windows desktop systems

Exceptional cases: devices without WS support, like stepper motors or a gas chromatograph

Single exception: an exotic PC controlling the cooling plant, running XP on an isolated network

Linux

SLC6 typically used for front-end control

SLC5 on some VME controllers

SLC4 installed on ~750 FPGA-based control boards

Slide23

Operating system versions

The natural choice would be to always install the latest OS for the upgrades

Good support for the new computer hardware

Compliant with CERN policies

Up-to-date features

The installed version is determined by many constraints: most often there is a lack of support by vendors of the components (drivers, etc.)

In many cases the preferred platform is not supported, because the company did not try

Slide24

Operating systems upgrade

We do not yet have enough arguments for upgrading the operating systems during the 2016/2017 YETS

Still an open question

At the end of RUN2 we plan to move to Windows Server 2016 and CENTOS

Current front-end software is running stably on SLC6, so there is no need for a change before the end of RUN2

Reminder: ALICE is not running WINCC on Linux

Slide25

Windows Server lifecycle (valid for all Microsoft products)

Mainstream support:

New feature requests and product design changes accepted

Non-security and security updates

Extended support:

Security updates

Paid updates

Slide26

Windows Server Lifecycle

[Timeline: support periods of WS 2008 R2, WS 2012 R2 and WS 2016 against LS2 and R3]

Slide27

Windows Server Lifecycle

From the Microsoft roadmap it is clear that at the start of R3, WS2016 will be the supported operating system

End of mainstream support will occur around the end of R3 (this is where we are now with WS2008R2)

WS 2016 shall therefore be our next platform

OPC, WINCC OA, … components shall be available on that platform

A release in 2016 makes it comfortable for:

Testing of installation procedures and various configurations (for example, PVSS did not start on WS2008 if the File Server role was enabled)

Definition of hardware requirements, compatible with the hardware purchasing procedure

Compatibility, stability and performance tests/tuning of all software components

Slide28

Some of the upgrade issues

Slide29

WARNING

The following slides do not aim to blame ANYONE!

We just provide a summary of events, linked to relatively simple cases, to illustrate the complex dependencies affecting a smooth upgrade

Slide30

Support requests

The ALICE central team reports all issues to BE-ICS directly

We do not let the detectors contact BE-ICS directly

Reason:

We can keep an overview and track all support requests

Before contacting BE-ICS we always do the initial tests and debugging, to prevent false reports and exclude ALICE-specific dependencies

It is clear that sometimes we manage to report issues where a fix would be evident and trivial

Sometimes we report issues based on careful debugging, but we get back very basic questions and it takes quite some time until the problem is recognized as a severe issue

To speed things up we developed a bad habit of contacting the expert who might have the answer directly, to make him/her aware of the ticket. Nasty but efficient.

Slide31

2015/2016 YETS upgrade: Sequence of events

ALICE prepared a production release of the framework based on FW 5.2.0 and asked the detectors to upgrade their systems immediately

ALICE framework release: FEBRUARY 10

JCOP FW 5.2.1 announced on FEBRUARY 11

JCOP FW 5.2.1 released on FEBRUARY 12

Confused by the UNICOS affiliation, and maybe by a not-so-clear understanding of the severity of the problem, ALICE continued upgrades with the wrong component for a few more days

We also lost some time until we clarified the versions and created a new release

Patched ALICE framework (containing the new DIM and the new unDistributed component) released on the night of February 14/15

fwInstallation tool 7.2.8 released on February 15 (we are not sure if the intention to release a new tool was announced)

Slide32

Example: unDistributed

The unDistributed component badly damaged the config file

Reported on 11-FEB, fixed immediately, and released in FW 5.2.1 on 12-FEB

We appreciate the extremely quick fix!

We would appreciate having a bit more detail: what was fixed, how to repair the damaged config, and where to get the fixed component (on 11-FEB, FW 5.2.1 was not yet available)

The information about the critical bug did not reach ALICE (maybe due to our ignorance)

We noticed the new FW release, but it was not immediately evident that it fixes this bug and that this bug would affect us so badly. The UNICOS project is confusing

We were deploying the new FW exactly on the day when this bug was fixed – an early warning could have stopped us!

Slide33

Example: unDistributed

A short summary of the versions:

FW 5.2.0 contained version 6.2.3, which screwed up the [DIST]

FW 5.2.1 contains version 6.2.4

The XML file inside says: <component><name>unDistributedControl</name><version>6.2.4</version><date>28/01/2016</date>

As FW 5.2.0 was already installed on many systems, we wished to extract only unDistributed and fix the already affected systems

We received a recommendation: pick the correct version from the individual components download page. BUT:

The component download page for FW 5.2.1 STILL contains unDistributed version 5.2.1, now tagged as released on 4-Mar-2016

CONFUSING VERSION – it resembles the FW release 5.2.1, so nobody will get immediately alerted

The XML file inside says: <component><name>unDistributedControl</name><version>5.2.1</version><date>17/10/2011</date>

ALICE solution:

We manually extracted the correct version from the full FW 5.2.1 release and patched the ALICE distribution

We prepared a script which corrects the [DIST] (taking the original settings from backups)

We still do not know where this manager is needed if we do not use UNICOS

Well… we see that it is used in the JCOP AES, for example

Details in the still-open case ENS-1617 (a version-check sketch follows below)
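A simple guard against this kind of mislabeling is to read the version from the component XML itself before installing. The sketch below parses the <name>/<version>/<date> fields quoted on this slide; the surrounding file layout and function names are assumptions.

import xml.etree.ElementTree as ET

# Sanity-check sketch motivated by the mix-up above: trust the
# <name>/<version>/<date> fields inside the component XML (quoted on
# this slide) rather than the download-page label.

def component_info(xml_path):
    root = ET.parse(xml_path).getroot()
    # Look for the fields as direct children of <component>, falling
    # back to a deep search if the XML wraps them one level deeper.
    def find(tag):
        return root.findtext(tag) or root.findtext(f".//{tag}")
    return {tag: find(tag) for tag in ("name", "version", "date")}

def check_component(xml_path, expected_version):
    info = component_info(xml_path)
    if info["version"] != expected_version:
        raise RuntimeError(
            f"{info['name']}: expected version {expected_version}, "
            f"but the XML says {info['version']} (dated {info['date']})")
    return info

# Example: check_component("unDistributedControl.xml", "6.2.4") would
# have rejected the stale 5.2.1 package dated 17/10/2011.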

Slide34

Slide35

unDistributed issue summary and possible lesson to be learned

We appreciate the quick fix of the problem

We could still work on the warning information flow

It would be good to announce critical bugs to the JCOP CB members or contact persons immediately after they are reported

It would be good to know which components are going to be released (~within the next 2 weeks)

It would be good to know what issues are being fixed and what the expected schedule is

Also, the experiments shall announce the dates of their upgrades to BE-ICS

Slide36

Release notes issue: example FW 5.2.1

[Screenshot: release-note entry showing the problem description and solution]

Slide37

What is the issue?

One of the systems lost connection to an ISEG power supply via DIM

We knew that there was a new version of DIM

All the information we had came from the release notes

Unfortunately, ICS support did not have more details either

We blindly installed the new DIM

Miraculously, it helped!

Later we learned that the fix was absolutely not related to our issue, so we do not know what really happened

Slide38

Possible improvements

We would prefer to have real release notes again, summarizing all fixes and improvements, instead of JIRA links

If JIRA is used, we would appreciate it a lot if we could have:

A full description of the problem (unDistributed was described OK, for example)

If necessary, some instructions on the required actions (unDistributed would profit from this, for example)

All details in one place (not references to other JIRA cases)

Slide39

Conclusions

No big changes during RUN2

No revolution for RUN3, but new interfaces need to be developed

Operating systems to be used: WS2016, CENTOS

Current upgrade issues:

The latest framework installed smoothly and performs very well

Relatively simple issues created big trouble

We see possible improvements in the communication flow and documentation

ALICE is ready to move to WINCC 3.14 during the next YETS, unless major issues are discovered during the next months

We would need all components ready by this summer to allow for testing