
Slide1

Fermilab Site Report - HEPiX Fall 2011

Keith Chadwick

Work supported by the U.S. Department of Energy under contract No. DE-AC02-07CH11359

Slide2

Physics!

CDF & D0 - Top quark asymmetry results.
CDF - Discovery of the Ξb0.
CDF - Λc(2595) baryon.
Combined CDF & D0 limits on the Standard Model Higgs mass.

24-Oct-2011

1

Fermilab Site Report - HEPiX Fall 2011

Slide3

TeVatron Shutdown

On Friday 30-Sep-2011, the Fermilab TeVatron was shut down following 28 years of operation,
The collider reached peak luminosities of 4 x 10^32 cm^-2 s^-1,
The CDF and DZero detectors recorded 8.63 PiB and 7.54 PiB of data respectively, corresponding to nearly 12 inverse femtobarns of data,
CDF and DZero data analysis continues; Fermilab has committed to 5+ years of support for analysis and 10+ years of access to the data.


Slide4

CDF & D0 Publications

[Plots: CDF and D0 publication counts]


Slide5

CD to CS Reorganization

Computing Sector: Vicky White serves as the Associate Director for Computing & CIO.

Computing Sector has two divisions:

Core Computing Division, led by Jon Bakken.

Scientific Computing Division, led by TBD.

http://cdorg.fnal.gov/adm/orgcharts/orgchart.pdf


Slide6

Core Computing Division – a strong base for science

Scientific Computing relies on Core Computing services and Computing Facility infrastructure:

Core networking and network services
Computer rooms, power, and cooling
Enterprise virtualization
Email, videoconferencing, web services
Document databases, Indico, calendaring
Service desk
Monitoring and alerts
Logistics
Desktop support (Windows and Mac)
Printer support
Computer security
Business services, including many projects: identity management, fermidash, teamcenter, EBS, timecards, etc.
… and more


Slide7

Scientific Computing Division

Scientific Computing is responsible for meeting the ever-evolving needs of the Fermilab scientific program:

Data custodianship (enStore, dCache, Lustre),
Data reduction and analysis (FermiGrid, CDF & D0 analysis clusters, CMS LPC & Tier 1, GPCF, FermiCloud, etc.),
Experiment support (electronics, hardware, software, etc.),
Physicists, engineers, developers, system administrators, database administrators, computer services specialists.


Slide8

Soudan Mine Update

CDMS detector and MINOS "far detector" back in operation as of 25-May-2011.


Slide9

Fermilab Computing Facilities

Feynman Computing Center:
FCC2 computer room, FCC3 computer rooms,
High availability services - e.g. core network, email, etc.,
Tape robotic storage (3 x 10,000-slot libraries),
UPS & standby power generation,
ARRA project: upgrade cooling and add HA computing room - completed.

Grid Computing Center:
3 computer rooms - GCC-[A,B,C],
Tape robot room - GCC-TRR,
High density computational computing: CMS, Run II, Grid Farm batch worker nodes; Lattice HPC nodes,
Tape robotic storage (4 x 10,000-slot libraries),
UPS & taps for portable generators.

Lattice Computing Center:
High Performance Computing (HPC): accelerator simulation, cosmology nodes,
Systems for integration & development,
No UPS.

EPA Energy Star award for 2010.


Slide10

It’s not easy being Green…

It Requires a Lot of Work

Lots of space has been retired / consolidated: LCC 108, Tape Vault Mezz, FCC1.

Ensured acquisition of EPEAT registered (95%), ENERGY STAR qualified (100%), or FEMP designated (95%) electronic office products when procuring electronics in eligible product categories.

http://www.epeat.net/

Fermilab participates in the Federal Electronics Challenge (FEC). The FEC focuses on the procurement, use, and disposal of electronics (computers, cell phones, etc.).

We have received two bronze awards.

Personal Computing Environmental Policy

http://computing.fnal.gov/xms/About/Computing_Policies/Personal_Computing_Environmental_Policy

Think Green – Guidance on procuring and using computing

http://computing.fnal.gov/xms/Services/Think_Green


Slide11

Cooling Incidents at GCC

Tuesday 7-Jun-2011 through Wednesday 8-Jun-2011:
Outside temperature of 93F / 34C caused loss of cooling due to "high head" - refrigerant returning too hot from the condensers.
Additional monitoring installed.

Tuesday 19-Jul-2011 through Monday 25-Jul-2011:
Additional internal and external cooling installed due to predicted temperatures of 100F / 38C.
Ran at a 30% capacity reduction through Monday 25-Jul-2011.

An engineering study is in progress to design a fix for the underlying issue(s).


Slide12

Power Incidents - 1

Thursday 28-Jul-2011 & Friday 29-Jul-2011 [site wide]:
Site-wide power outage due to a lightning strike on the 345 kV primary power lines to the site; the site switched to secondary (lower capacity) power lines.
GCC was down for ~8 hours on Thursday and ~2 hours on Friday when the site was switched back to the primary power lines.
FCC rode through on UPS + generators.

Tuesday 16-Aug-2011 [GCC-B]:
The GCC-B computer room lost power at ~20:45.
Investigation determined that an indicator light on a PLC board in the main room breaker enclosure failed in such a way as to "short out" the PLC board and trip off the entire room.
Power was restored at ~14:00 on Wednesday 17-Aug-2011.


Slide13

Power Incidents - 2

Thursday 25-Aug-2011 [GCC-B]:
On Tuesday 23-Aug-2011, an internal UPS that supports the PLC board (replaced on Tuesday 16-Aug-2011) was identified as "unstable" - likely damaged during the previous incident.
A scheduled power down of GCC-B was taken on Thursday 25-Aug-2011 from 0830-1130 to replace the failing internal UPS.

Saturday 15-Oct-2011 [FCC2]:
Scheduled ~8-hour downtime to complete ARRA-funded electrical work and upgrade the EPO for the FCC2 computer room.
Downtime preparations started at 0300, power shut down at 0600, power back at 1300, EPO tests until 1400, network restart at 1400, file servers restarted at 1500, services restarted at 1600, all complete by 1800.

An annual report is written on all data center service disruptions, including RCAs and lessons learned as appropriate.


Slide14

Tape Robots & Tape Drives

7th SL8500 tape robot installed - now 4 in GCC & 3 in FCC.

10 T10KC tape drives were put into limited production on 20-Jun-2011:
Writing a second copy of data onto T10KC tapes,
Encountered two types of errors once the T10KC entered limited production; fortunately no data loss, since the T10KC held a second copy,
New firmware was installed in all T10KC drives to address these issues.

After testing LTO5 and T10KC, Fermilab has decided to adopt T10KC tape technology and will start migrating to this media in FY2012.
USCMS has purchased 30 T10KC drives (will still write a second copy to LTO4 for a while)!
Run II (CDF/D0) has purchased 8 T10KC drives.

We are working to provide a small file aggregation/cache for enStore; expect to release early in CY2012.


Slide15

Storage

The Data Movement and Storage department is exploring alternative disk storage for analysis (e.g. Lustre) and will be capable of bringing up a small production-level Lustre system.

The CMS Tier 1 has deployed CERN's EOS for user data files & reports that it has worked wonderfully. More disk will probably be added and the system will grow in importance.

All CMS Tier 1 data files on disk are available to external users via xrootd's data reflector capabilities & via the OSG "Data Anytime, Anywhere" program of work.


Slide16

Computing Sector Project Management

Capital “P” Projects and small “p” projects:

Capital “P” projects typically run under formal project management:
Service-Now migration, Email migration, FermiDash, TeamCenter, etc.

Small “p” projects typically run under line management:
Yearly worker node procurements, GPCF, FermiCloud, FermiGrid-HA2, VoIP pilot, enhancements to NIMI & Tissue work, server consolidation and virtualization, etc.


Slide17

Capital “P” Projects

Project Name | Description | Current Status
Teamcenter | Implement a common Engineering Data Management System (EDMS) to capture all elements of the Fermilab engineering process and documents. | Target go-live is 1Q CY2012
Exchange Migration | Deliver an e-mail and calendaring service based on Exchange Server 2010; migrate all Imap4, Exchange 2007, Lotus Notes, and Meeting Maker users to it. | Target go-live is ~now
Service-Now | Migrate the Fermilab ITIL support tool from BMC Remedy to cloud-based Service-Now in support of ISO 20K certification. | Go-live was 19-Oct-2011
FermiDash | Deliver a management dashboard for senior Laboratory management. | 1st draft of dashboard available
SharePoint Deployment | Production SharePoint deployment. | SharePoint in production
FY11 Computer Security Compliance | Address FY11 computer security compliance issues. | Planning in progress
Windows 7 Deployment | Manage the Windows 7 pre-deployment testing and deployment; phase 1 is complete, phase 2 (replacement of hardware that is unable to run Windows 7) is being planned. | Roll-out in progress
EBS R12 Upgrade | Update E-Business Suite to release 12. | Planning in progress
Identity Management | Provide a single authoritative source of truth for managing and maintaining information about individuals (employees, visitors, contractors, etc.) associated with the laboratory; provide a secure, trusted electronic identity that can be used in a variety of ways, in particular to authorize use of computing services. | Technology investigations underway


Slide18

Email Migration

Migration from obsolete IMAP servers to Exchange 2010 with anti-virus & anti-spam filters:
Exchange servers are hosted onsite but managed by off-site subcontractors.
Anti-virus & anti-spam filters are hosted offsite “in the cloud”.

1st phase of migration implemented on Tuesday 02-Aug-2011 - anti-virus / anti-spam service commissioned:
We encountered a few issues over the next 36 hours related to email processing of non-“fnal.gov” domains that were hosted at Fermilab.
Resolved by a reconfiguration of the email delivery ACLs late on Wednesday 03-Aug-2011.

Email delivery incident on Saturday 08-Oct-2011:
User SMTP credentials compromised, likely through an IM client that transmitted (shared) credentials in the clear.
Spamcop notifications started at 6:21, user credentials were randomized at 12:08, the subcontractor disabled external routing of Fermilab email, and a workaround for external email routing was implemented at 18:42.


Slide19

Windows 7 Rollout

Phase 1 is complete:
Migrated all the desktops with hardware capable of running Windows 7.

Phase 2 is being planned:
Will target the older systems that need an upgrade or replacement to run Windows 7.


[Plot: Windows 7 deployment]

Slide20

Mac OS X

10.4 (Tiger), 10.5 (Leopard), 10.6 (Snow Leopard), 10.7 (Lion).

10.7 (Lion):
Released by Apple on 20-Jul-2011.
Approved for Fermilab deployment on 28-Sep-2011.

Desktop Support is deploying the Casper service to enable central management of Macs; this replaces QMX.


[Plot: Mac OS X version distribution]

Slide21

Scientific Linux

SL(F/C) 4 - end-of-life 2/12/2012!
SL(F/C) 5
SL(F/C) 6

Jason Harrington & Tyler Parsons have joined the SL team.
Departure of Troy Dawson; addition of Pat Riehecky.

You will hear more about SL in Connie's talk.


Slide22

Fermilab CPU Core Count


Slide23

Data Storage at Fermilab


Slide24

High Speed Networking

The Chicago MAN and Fermilab LightPath connections to StarLight are working extremely well, delivering production high-bandwidth data transfers in and out of Fermilab for the LHC experiments.

We encountered and resolved some performance issues with various network fabric extenders.

Work is underway to implement a distributed core for the network, in order to be resilient against building outages.

Working on an IPv6 testbed; FermiGrid and FermiCloud will actively participate.

Working on preparations to participate in the DOE 100 Gb/sec test network.


Slide25

HPC

Lattice QCD: Ds (~430 WN), J/Psi (~860 WN), Kaon (~284 WN) all running well.

Computational Cosmology: ~1200 cores, running well.

Wilson Cluster: in the process of adding 34 nodes (each with 32 cores) to the cluster.

GPU Cluster: ordered at the end of FY2011; delivery will be in the next couple of weeks.


Slide26

FermiGrid Occupancy & Utilization

Cluster(s) | Current Size (Slots) | Average Size (Slots) | Average Occupancy | Average Utilization
CDF | 5630 | 5477 | 93% | 67%
CMS | 7132 | 6772 | 94% | 87%
D0 | 6540 | 6335 | 79% | 53%
GP | 3042 | 2890 | 78% | 68%
Total | 21927 | 21463 | 87% | 70%


Slide27

FY2011 Worker Node Acquisition

Specifications:
Quad processor, 8 core, AMD 6128/6128HE, 2.0 GHz,
64 GBytes DDR3 memory,
3 x 2 TBytes disk,
4-year warranty,
$3,654 each.

Who | Retirements | Base | Option | Purchase | Assign
CDF | 200 | 0 | 16+36 | 52 | 36
CMS | -- | 40 | 4+20 | 64 | 64
D0 | 23 | 30 | 37 | 67 | 67
IF | -- | 0 | 40 | 40 | 0
GP | 48 | 0 | 9 | 9 | 65
Wilson | -- | 34 | -- | 34 | 34
Total | -- | 99 | 106+61 | 266 | 266
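The rows of the table above appear to follow a simple rule: Base plus exercised Options equals Purchase, and the quoted unit price then gives the total spend. A minimal sanity-check sketch (the per-row Base/Option split and the $3,654 unit price are from the slides; the "base + option = purchase" relation is an inference, not stated explicitly):

```python
# Quick consistency check of the FY2011 worker node purchase table.
# Per-node price from the specifications slide; quantities from the table.
UNIT_PRICE = 3654  # dollars per node

# (who, base, option, purchase) -- option entries like "16+36" summed.
rows = [
    ("CDF", 0, 16 + 36, 52),
    ("CMS", 40, 4 + 20, 64),
    ("D0", 30, 37, 67),
    ("IF", 0, 40, 40),
    ("GP", 0, 9, 9),
    ("Wilson", 34, 0, 34),
]

for who, base, option, purchase in rows:
    # Inferred rule: each purchase is the base quantity plus exercised options.
    assert base + option == purchase, who

total_purchase = sum(purchase for _, _, _, purchase in rows)
total_cost = total_purchase * UNIT_PRICE
print(total_purchase)  # 266 nodes, matching the Purchase column total
print(total_cost)      # 971964 dollars
```

Every row checks out, and the 266-node total implies a spend of roughly $972k at the quoted unit price.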


Slide28

Other Significant FY2011 Purchases

Storage:
5.52 Petabytes of Nexsan E60 SATA drives for raw cache disk (will mostly be used for dCache, with some Lustre),
180 Terabytes of BlueArc SATA disk,
60 Terabytes of BlueArc 15K SAS disk.

Servers:
119 servers, 16 different configurations, all done in a single order.


Slide29

FermiGrid-HA2 Deployment

1st rack moved from FCC1 to FCC2 on Tuesday 24-May-2011.

2nd rack moved from FCC1 to GCC-B on Tuesday 7-Jun-2011; the FermiGrid-HA2 physical reorganization was completed on Tuesday 07-Jun-2011 at ~1300. Critical services are now hosted in two data centers (FCC2 & GCC-B); non-critical services are split across the two data centers.

The plan had been to utilize a scheduled power outage of FCC2 on 13-Aug-2011 as the final acceptance test for the FermiGrid-HA2 project; this power outage was later moved to 15-Oct-2011. The GCC-B cooling outage at 1500 on Tuesday 07-Jun-2011 resulted in all systems in GCC-B being immediately shut down when facilities personnel switched off the main electrical panel breakers.

FermiGrid-HA2 functioned exactly as designed:
The critical services failed over to the single remaining copy of each service on FCC2.
The non-critical services went to reduced capacity.

When power was restored at 1700, the second copy of the critical services transparently rejoined the service “pool”, and the non-critical services resumed operation.


Slide30

FermiGrid-HA2 Service Availability

Service | Raw Availability | HA Configuration | Measured HA Availability | Minutes of Downtime
VOMS - VO Management Service | 99.657% | Active-Active | 100.000% | 0
GUMS - Grid User Mapping Service | 99.652% | Active-Active | 100.000% | 0
SAZ - Site AuthoriZation Service | 99.657% | Active-Active | 100.000% | 0
Squid - Web Cache | 99.640% | Active-Active | 100.000% | 0
MyProxy - Grid Proxy Server | 99.954% | Active-Standby | 99.954% | 240
ReSS - Resource Selection Service | 99.635% | Active-Active | 100.000% | 0
Gratia - Fermilab and OSG Accounting | 99.365% | Active-Standby | 99.997% | 120
Databases | 99.765% | Active-Active | 99.988% | 60
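The downtime figures track the availability percentages arithmetically. A minimal sketch of the relationship, assuming a roughly one-year measurement window and independent failures of the two active-active replicas (neither assumption is stated on the slide):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumed measurement window

def downtime_minutes(availability, period=MINUTES_PER_YEAR):
    """Downtime over the period implied by an availability fraction."""
    return (1.0 - availability) * period

def active_active(a):
    """Combined availability of two replicas, each with availability a,
    assuming independent failures: down only when both copies are down."""
    return 1.0 - (1.0 - a) ** 2

# MyProxy (active-standby, no transparent failover): its raw availability
# maps directly to the ~240 minutes of downtime reported in the table.
print(round(downtime_minutes(0.99954)))  # ~242 minutes over a year

# VOMS: two copies at 99.657% raw availability give >99.998% combined,
# consistent with the measured 100.000% (no overlapping outages occurred).
print(active_active(0.99657))
```

This is why the active-active services show zero measured downtime despite individual raw availabilities around 99.6%: overlapping outages of both copies are rare.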


Slide31

FermiCloud

Collaboration with KISTI:
Vcluster (Grid cluster on demand).
MPI on FermiCloud - achieving 70% of “bare metal” performance without Mellanox SR-IOV drivers.

Relocated ~half of FermiCloud from GCC-B to the new FCC3 computer rooms.

Collaboration with OpenNebula:
OpenNebula 3.0 includes X.509 authentication patches written at Fermilab & contributed back to the OpenNebula project!

SAN upgrade underway:
2 x SATABeast,
2x2 x Brocade switches,
Dual FC HBAs in all systems,
Designed to be fault tolerant in the event of a building outage.


Slide32

Summary

Physics results from CDF, D0, CMS, the Intensity Frontier, and the Cosmic Frontier are continuing.

Fermilab is in a time of transition, with a very bright and interesting future.

The Computing Sector has reorganized.

The Fermilab computing facilities have faced several challenges and performed exceedingly well.

Several large “P” computing Projects and numerous small “p” computing projects are underway.

Virtualization, Grid & Cloud computing are well established and growing.

Thanks to the extremely hard-working members of the Computing Sector!
