Slide 1: ALICE DCS upgrades
Slide 2: ALICE WINCC setup
Each detector builds a local distributed system
We try to keep subsystems on separate computers (HV, LV, FEE..)
Central DCS systems connect to all WINCC systems in ALICE
Scattered systems used for UI (central operator UI, alert screen, displays in the control center, expert UIs, …)
Slide 3: [Diagram] Each detector builds its own distributed control system of multiple WCC OA (WINCC OA) systems
For example, to restore from SAFE:
Some detectors end in the Configured state
Other detectors need special configuration
Others do not change state at all
…
The state of the experiment depends on:
The state of each subdetector, each having its own control logic
Internal and external services
ALICE online and offline systems…
Each detector system is built as an autonomous distributed system of 2-16 WINCC systems
Slide 4: [Diagram] ALICE DCS Systems: the central systems connect to all detector systems and form one large distributed system sharing the data and the control
Slide 5: The ALICE DCS scale
21 autonomous detector systems
120 WINCC OA systems
>100 subsystems
160 control computers
>700 embedded computers
1200 network-attached devices
300 000 OPC and front-end items
Slide 6: OPC servers and drivers in ALICE
Wiener: 23 OPC servers
Systec: 25 drivers
ELMB: 13 OPC servers
Iseg: 5 OPC servers + 1 custom (DIM-based) server
CAEN: 20 OPC servers
Slide 7: (Some of) the present ALICE and DCS challenges
ALICE Pixel Detector (SPD):
10 000 000 configurable pixels
1.3 kW power dissipation on (200+150) μm assemblies
Contingency in case of cooling failure less than 1 minute
ALICE Transition Radiation Detector (TRD):
760 m² covered by drift chambers
17 TB/s raw data processed by 250 000 tracklet processors directly on the detector
65 kW power provided by 89 LV power supplies
ALICE Time Projection Chamber (TPC):
96 m³ gas volume (largest ever), stabilized at 0.1 °C
Cooling system has to remove 28 kW of dissipated power
Installed just next to the TRD
557 568 readout pads
100 kV voltage in the drift cage
Slide 8: [Figure]
Slide 9: The ALICE O2 Project
ALICE 2007-2017: ~10 GB/s
Slide 10: LHC will provide for RUN3 ~100x more Pb-Pb collisions compared to RUN2 (10^10 – 4×10^11 collisions/year)
The ALICE O2 Project merges online and offline into one large system
TDR approved in 2015!
Some O2 challenges:
New detector readout, 8400 optical links
Data rate: 1.1 TB/s
Global computing: ~100k CPU cores, 5k GPUs, 500 FPGAs
Data storage requirements: ~60 PB/year
[Figure] The ALICE O2 Project: ALICE 2007-2017 (~10 GB/s) vs ALICE 2019-2027
Slide 11: The O2 architecture
Detectors
250 First Level processors (FLP)
1250 Event Processing Nodes (EPN)
~100 000 processor cores
Slide 12: DCS in the context of O2
DCS provides input to O2
~100 000 conditions parameters are requested for the event reconstruction
Data has to be injected into each 20ms data frame
Slide 13: DCS-O2 interface
ALICE Data Collector and Publisher
ALICE DCS Production System
ALICE DCS Storage System (ORACLE)
Process Image
O2
The ALICE Data Collector receives data from the DCS
Depending on the type and priority of the data, the Data Collector can connect to different layers of the system
A process image containing the conditions data is sent to O2
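The Data Collector / process image flow described on this slide can be sketched in a few lines. The class and field names below are illustrative assumptions, not the actual ALICE implementation:

```python
import json
import time


class ProcessImage:
    """Hypothetical sketch of the DCS process image: a snapshot of
    conditions parameters that is periodically shipped to O2."""

    def __init__(self):
        self._values = {}  # parameter name -> (value, timestamp)

    def update(self, name, value):
        # Called by the subscriber whenever a published DCS value changes.
        self._values[name] = (value, time.time())

    def snapshot(self):
        # Serialize the current image; O2 expects fresh conditions
        # data injected into each ~20 ms time frame.
        return json.dumps(
            {name: {"value": v, "ts": ts}
             for name, (v, ts) in self._values.items()}
        )


image = ProcessImage()
image.update("TPC/HV/sector00/vMon", 1423.7)   # hypothetical parameter names
image.update("TRD/LV/crate05/iMon", 2.31)
payload = image.snapshot()
```

In the real system the Subscriber keeps the image up to date from the published DCS values and the Publisher ships a fresh snapshot towards O2; the JSON encoding here is only a stand-in for whatever wire format is actually used.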
Slide 14: DCS-O2 interface
ALICE Data Collector
Data Finder and Subscriber
Finds the data and subscribes to the published values
Process Image
Keeps an updated copy of the data received by the Subscriber
Data Publisher
Sends conditions data to O2
Prototypes proved the feasibility of the concept
Larger scale prototypes being prepared
…but this is just the beginning of the story
Slide 15: New readout electronics for ALICE
The O2 interfaces to detector frontends using a new Common Readout Unit (CRU)
All data is transmitted over 8400 optical links
DCS data provided by the front-end is interleaved with the physics data
The FLP strips the DCS data off the global stream and sends it to the DCS
The same link is used to send DCS commands to the front-end
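The stripping step described above can be illustrated with a toy demultiplexer. The frame layout and type tags are invented for illustration; the real CRU data format is not described here:

```python
# Toy model of the FLP demultiplexing step: DCS words arrive
# interleaved with physics data on the same optical link, and the
# FLP separates them before forwarding.

DCS_FRAME = "DCS"
PHYSICS_FRAME = "PHY"


def split_stream(frames):
    """Separate interleaved (type, payload) frames into the physics
    stream and the DCS stream."""
    physics, dcs = [], []
    for frame_type, payload in frames:
        if frame_type == DCS_FRAME:
            dcs.append(payload)       # forwarded to the DCS
        else:
            physics.append(payload)   # continues to event processing
    return physics, dcs


stream = [
    (PHYSICS_FRAME, b"\x01\x02"),
    (DCS_FRAME, b"temp=23.4"),   # made-up DCS payload
    (PHYSICS_FRAME, b"\x03\x04"),
]
physics, dcs = split_stream(stream)
```

The command path runs the same link in the other direction, which this one-way sketch does not model.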
Slide 16: The CONTROL and CONFIGURATION
[Diagram] WCC OA → FRED → ALF (on FLP) → CRU
DETECTOR-SPECIFIC LAYER: FRED (Front End Device) runs on a dedicated server
Receives commands from WCC OA and forwards them to ALF
Receives data from ALF and publishes it to WCC OA
DETECTOR-NEUTRAL LAYER: ALF (ALICE Low Level Frontend interface) provides the communication interface to the CRU firmware
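The two-layer split described on this slide can be sketched as follows. All class, method, and register names are assumptions for illustration; the real FRED and ALF are separate services communicating over DIM, not direct function calls:

```python
class Alf:
    """Detector-neutral layer: the only component that talks to the
    CRU firmware (modeled here as a flat register space)."""

    def __init__(self):
        self.registers = {}

    def write(self, addr, value):
        self.registers[addr] = value

    def read(self, addr):
        return self.registers.get(addr, 0)


class Fred:
    """Detector-specific layer: translates high-level commands coming
    from WCC OA into ALF register operations and publishes results
    back towards WCC OA."""

    # Hypothetical command map: one high-level command -> one register.
    REGISTER_MAP = {"SET_THRESHOLD": 0x40, "SET_GAIN": 0x44}

    def __init__(self, alf):
        self.alf = alf
        self.published = {}  # values made visible to WCC OA

    def handle_command(self, command, value):
        self.alf.write(self.REGISTER_MAP[command], value)

    def poll(self, name, addr):
        # Read back via ALF and publish the value.
        self.published[name] = self.alf.read(addr)


fred = Fred(Alf())
fred.handle_command("SET_THRESHOLD", 128)
fred.poll("threshold", 0x40)
```

The design point the slide makes is that only ALF knows the CRU firmware interface, so detector teams customize FRED without touching the detector-neutral layer.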
Slide 17: How ALICE performs updates
Slide 18: ALICE update procedures
General rules:
Big updates only during the YETS
Operating systems and patches, DB servers, WINCC upgrades, new FW, …
Critical and security updates during the technical stops
One exception: no updates during the last TS before the HI run
Exceptional updates installed at any time, as needed
Fixes for urgent issues like the OPC memory leak, …
On the Monday of each TS a full backup of all WINCC systems is taken
Detectors are responsible for taking additional backups after each system change
The DCS team can roll back the systems to the state at the beginning of the last TS
Slide 19: ALICE update procedures
The operating systems, OS patches, drivers, WINCC OA and OPC servers are installed centrally
Using CMF and homegrown scripts/procedures
Detectors install only WINCC OA projects
Trap: WINCC experts are granted overly elevated privileges, which leads to frequent misuse
A question about the required privileges has been submitted; the answer was: “We would like to know as well, let us know if you find out the answer”
Slide 20: ALICE update procedures
Framework update procedure:
Install the new FW in the lab to spot obvious issues
Create the ALICE FW package in the ALICE repositories
JCOP FW + ALICE extensions
Upgrade and test the central DCS systems maintained by the central team
Publish the new FW to the detectors and set a deadline for the upgrade (grace period of a few weeks, depending on the ALICE activities)
We rely on rollback in case of troubles
Integration session with all detectors to re-establish the full system functionality
The ALICE FW is frozen after the release, to assure uniformity
Critical patches are maintained in separate repositories (we do not modify the already released distribution)
Slide 21: Testing of components
Ad-hoc tests are performed in the DCS lab (performance, long-term stability, …)
The scale of the system is not realistic
A small project for this summer aims at creating reference slices for some standard devices (CAEN, ISEG, …)
Full slice from the device up to the user interface
The lab is used for debugging before a support request is submitted to BE-ICS
Standing offer to use the lab for more standardized tests
We could give another try to revive this project
Slide 22: Operating systems in ALICE DCS
Windows Server
WINCC is installed exclusively on Windows Server systems
Currently WS2008R2
Windows desktop systems
Exceptional cases: devices without WS support, like stepper motors or the gas chromatograph
Single exception: an exotic PC controlling a cooling plant, running XP on an isolated network
Linux
SLC6 used typically for front-end control
SLC5 Linux on some VME controllers
SLC4 Linux installed on ~750 FPGA-based control boards
Slide 23: Operating system versions
The natural choice would be to always install the latest OS for the upgrades
Good support for new computer hardware
Compliant with CERN policies
Up-to-date features
The installed version is determined by many constraints:
Most often there is a lack of support by the vendors of the components (drivers, etc.)
In many cases the preferred platform is not supported, because the company did not try
Slide 24: Operating systems upgrade
We do not yet have enough arguments for upgrading the operating systems during the 2016/2017 YETS
Still an open question
At the end of RUN2 we plan to move to Windows Server 2016 and CENTOS
Current front-end software is running stably on SLC6, no need for a change before the end of RUN2
Reminder: ALICE is not running WINCC on Linux
Slide 25: Windows Server lifecycle (valid for all Microsoft products)
Mainstream support: new feature requests and product design changes accepted
Non-security and security updates
Extended support:
Security updates
Paid updates
Slide 26: [Timeline] Windows Server Lifecycle: WS 2008 R2, WS 2012 R2 and WS 2016 against LS2 and R3
Slide 27: Windows Server lifecycle
From the Microsoft roadmap it is clear that at the start of R3, WS2016 will be the supported operating system
End of mainstream support will occur around the end of R3 (this is where we are now with WS2008R2)
WS2016 shall therefore be our next platform
OPC, WINCC OA, … components shall be available on that platform
The release in 2016 makes it comfortable for:
Testing of installation procedures and various configurations
For example, PVSS did not start on WS2008 if the File Server role was enabled
Definition of hardware requirements, compatible with the hardware purchasing procedure
Compatibility, stability and performance tests/tuning of all software components
Slide 28: Some of the upgrade issues
Slide 29: WARNING
The following slides do not aim to blame ANYONE!
We just provide a summary of events linked to relatively simple cases, to illustrate the complex dependencies affecting a smooth upgrade
Slide 30: Support requests
The ALICE central team reports all issues to BE-ICS directly
We do not let the detectors contact BE-ICS directly
Reason:
We can keep an overview and track all support requests
Before contacting BE-ICS we always do the initial tests and debugging, to prevent false reports and exclude ALICE-specific dependencies
It is clear that sometimes we manage to report issues where a fix would be evident and trivial
Sometimes we report issues based on careful debugging, but we get back very basic questions, and it takes quite some time until the problem is recognized as a severe issue
To speed things up we developed a bad habit of contacting the expert who might have the answer directly, to make him/her aware of the ticket. Nasty but efficient.
Slide 31: 2015/2016 YETS upgrade: sequence of events
ALICE prepared a production release of the framework based on FW 5.2.0 and asked the detectors to upgrade their systems immediately
ALICE framework release: FEBRUARY 10
JCOP FW 5.2.1 announced on FEBRUARY 11
JCOP FW 5.2.1 released on FEBRUARY 12
… Confused by the UNICOS affiliation, and with a maybe not so clear understanding of the severity of the problem, ALICE continued upgrades with the wrong component for a few more days… We also lost some time until we clarified the versions and created a new release
Patched ALICE framework (containing the new DIM and the new unDistributed component) released on the night of February 14/15
FwInstallation tool 7.2.8 released on February 15 (we are not sure if the intention to release a new tool had been announced)
Slide 32: Example: unDistributed
The unDistributed component badly damaged the config file
Reported on 11-FEB, fixed immediately and released in FW 5.2.1 on 12-FEB
We appreciate the extremely quick fix!
We would appreciate having a bit more detail: what was fixed, how to repair the damaged config, and where to get the fixed component (on 11-FEB, FW 5.2.1 was not yet available)
The information about the critical bug did not reach ALICE (maybe due to our ignorance)
We noticed the new FW release, but it was not immediately evident that it fixes this bug and that this bug would affect us so badly. The UNICOS project name is confusing
We were deploying the new FW exactly on the day this bug was fixed; an early warning could have stopped us!
Slide 33: Example: unDistributed
A short summary of the versions:
FW 5.2.0 contained version 6.2.3, which screwed up the [DIST] section
FW 5.2.1 contains version 6.2.4
The XML file inside says: <component><name>unDistributedControl</name><version>6.2.4</version><date>28/01/2016</date>
As FW 5.2.0 was already installed on many systems, we wished to extract only unDistributed and fix the already affected systems
We received a recommendation: pick the correct version from the individual components download page. BUT…:
The component download page for FW 5.2.1 STILL contains unDistributed version 5.2.1, tagged now as released on 4-Mar-2016
CONFUSING VERSION: it resembles the FW release 5.2.1, so nobody will get immediately alerted
The XML file inside says: <component><name>unDistributedControl</name><version>5.2.1</version><date>17/10/2011</date>
ALICE solution:
We manually extracted the correct version from the full FW 5.2.1 release and patched the ALICE distribution
We prepared a script which corrects the [DIST] section (taking the original settings from the backups)
We still do not know where this manager is needed if we do not use UNICOS
Well… we see that it is used in the JCOP AES, for example
Details in the still-open case ENS-1617
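A version mix-up like this one could be caught by checking the component's embedded XML descriptor before deployment. A minimal sketch; the XML fragment mirrors the one quoted above, and the function name is our own invention:

```python
import xml.etree.ElementTree as ET


def check_component(xml_text, expected_name, expected_version):
    """Return True only if the component descriptor matches what we
    intend to install (guards against mislabeled downloads)."""
    root = ET.fromstring(xml_text)
    name = root.findtext("name")
    version = root.findtext("version")
    return name == expected_name and version == expected_version


# Descriptor as quoted on this slide (closing tag added for validity).
descriptor = (
    "<component><name>unDistributedControl</name>"
    "<version>6.2.4</version><date>28/01/2016</date></component>"
)

ok = check_component(descriptor, "unDistributedControl", "6.2.4")
stale = check_component(descriptor, "unDistributedControl", "5.2.1")
```

Such a check run by the installation tooling would have flagged the confusingly numbered 5.2.1 component before it reached any production system.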
Slide 34: [Figure]
Slide 35: unDistributed issue summary and possible lessons to be learned
We appreciate the quick fix of the problem
We could still work on the warning information flow
It would be good to announce critical bugs to the JCOP CB members or contact persons immediately after they are reported
It would be good to know which components are going to be released (~within the next 2 weeks)
It would be good to know what issues are being fixed and what the expected schedule is
Also, the experiments shall announce the dates of their upgrades to BE-ICS
Slide 36: Release notes issue: example FW 5.2.1
Problem description
Solution
Slide 37: What is the issue?
One of the systems lost its connection to an ISEG power supply via DIM
We knew that there was a new version of DIM
All the information we had came from the release notes
Unfortunately, ICE support did not have more details either
We blindly installed the new DIM
Miraculously, it helped!
Later we learned that the fix was absolutely not related to our issue, so we do not know what really happened
Slide 38: Possible improvements
We would prefer to have real release notes again, summarizing all fixes and improvements, instead of JIRA links
If JIRA is used, we would appreciate it a lot if we could have:
A full description of the problem (unDistributed was described OK, for example)
If necessary, some instructions on required actions (unDistributed would profit from this, for example)
All details in one place (not references to other JIRA cases)
Slide 39: Conclusions
No big changes during RUN2
No revolution for RUN3, but new interfaces need to be developed
Operating systems to be used: WS2016, CENTOS
Current upgrade issues:
The latest framework installed smoothly and performs very well
Relatively simple issues created big trouble
We see possible improvements in the communication flow and documentation
ALICE is ready to move to WINCC 3.14 during the next YETS, unless major issues are discovered during the next months
We would need all components ready by this summer to allow for testing