Presentation Transcript

Slide1

Ceph Deployment at Target:
Best Practices and Lessons Learned

Slide2


Introduction

Will Boege

Sr. Technical Architect

RAPID Team (Private Cloud and Automation)

Slide3


First Ceph environment at Target went live in October of 2014

"Firefly" release

Ceph was backing Target's first 'official' Openstack release

Icehouse based

Ceph is used for:

libRBD for Openstack instances and volumes
RADOSGW for object storage (instead of Swift)
Kernel RBD backing Ceilometer MongoDB volumes

Currently DEV is the largest environment, with ~1500 instances

Replaced the traditional array-based approach that was implemented in our prototype Havana environment:

The traditional storage model was problematic to integrate
Maintenance/purchase costs from Expensive Machine Corporations can get prohibitive
Traditional SAN just doesn't 'feel' right in this space
Ceph's tight integration with Openstack

Slide4


Initial Ceph Deployment:

3 x Monitor Nodes – Cisco B200

12 x OSD Nodes – Cisco C240 LFF

12 x 4TB SATA disks

10 OSDs per server

Journal partition co-located on each OSD disk

120 OSDs total = ~400 TB

2 x 10GbE per host: 1 "front end", 1 private Ceph

Basic LSI 'MegaRaid' controller – SAS 2008M-8i
No supercap or cache capability onboard
10 x RAID0

Slide5

Post rollout it became evident that there were performance issues within the environment:

KitchenCI users would complain of slow Chef converge times
Yum transactions would take abnormal amounts of time to complete
Instance boot times, especially for images using cloud-init, would take an excessively long time
Database slowness
General user griping about 'slowness'

Lesson #1 – Instrument Your Deployment!

Track statistics/metrics that have real impact to your users
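As a minimal starting point (a sketch, not the monitoring stack we used), cluster health and per-OSD latency can be sampled straight from Ceph and fed into whatever graphing you already have:

$ ceph -s          # overall health, PG states, client IO rates
$ ceph osd perf    # per-OSD commit/apply latency (ms)
$ ceph -w          # follow cluster events in real time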

Slide6

Unacceptable levels of latency even while the cluster was under relatively little load

High levels of CPU IOWait% on the OSD servers and on IDLE Openstack instances

Poor IOPS / latency – FIO benchmarks running INSIDE Openstack instances:

$ fio --rw=write --ioengine=libaio --runtime=100 --direct=1 --bs=4k --size=10G --iodepth=32 --name=/tmp/testfile.bin

test: (groupid=0, jobs=1): err= 0: pid=1914
  read : io=1542.5MB, bw=452383 B/s, iops=110, runt=3575104msec
  write: io=527036KB, bw=150956 B/s, iops=36, runt=3575104msec
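On the OSD hosts themselves the same pain shows up in plain disk statistics; a minimal check (assumes the sysstat package is installed):

$ iostat -x 5    # watch await and %util on the OSD data disks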

Having more effective instrumentation from the outset would have revealed obvious problems with our architecture

Slide7

Compounding the performance issues, we began to see mysterious reliability issues.

OSDs would randomly fall offline
The cluster would enter a HEALTH_ERR state about once a week with 'unfound objects' and/or inconsistent placement groups that required manual intervention to fix

These problems were usually coupled with a large drop in our already suspect performance levels

Lesson #2 – Do your research on the hardware your server vendor provides!

Don't just blindly accept whatever they had laying around – be proactive!

Slide8

Root cause of the HEALTH_ERRs was "unnamed vendor's" SATA drives in our solution 'soft-failing' – slowly gaining media errors without reporting themselves as failed. Don't rely on SMART. Interrogate your disks with an array-level tool, like MegaCli, to identify drives for proactive replacement.

$ /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep Media
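A slightly broader variant (a sketch – field names as they appear in typical MegaCli PDList output, path per your install) pairs each drive slot with its error counters so soft-failing disks stand out:

$ /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | egrep 'Slot Number|Media Error Count|Predictive Failure Count'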

In installations with co-located journal partitions, a RAID solution with cache+BBU for writeback operation would have been a huge performance gain. Paying more attention to the suitability of the hardware our vendor of choice provided would have saved a lot of headaches.

Slide9

Which leads us to –

Lesson #3 – Ceph is not magic. It does its best with the hardware you give it!

There is much ill-advised advice floating around that if you throw enough crappy disks at Ceph you will achieve enterprise-grade performance. Garbage in – garbage out. Don't be greedy and build for capacity if high performance is your objective.

Slide10


New Ceph OSD Deployment:

5 x OSD Nodes – Cisco C240 SFF

18 x 10k Seagate SAS 1.1TB
6 x 480GB Intel S3500 SSD

Intel SSDs are the DE FACTO SSD for use with Ceph. Caveat emptor with other brands!

18 OSDs per server
Journal partition on SSD with a 4/5:1 OSD/journal ratio
90 OSDs total = ~100 TB

Improved LSI 'MegaRaid' controller – SAS-9271-8i
Supercap / writeback capability
18 x RAID0
Writethrough on journals, writeback on spinning OSDs. Still experimenting with this!

Based on the "Hammer" Ceph release

Lessons learned – we set out to rebuild

Slide11

Obtaining metrics from our design change was nearly immediate due to having effective monitoring in place

Latency improvements have been extreme

IOWait% within Openstack instances has been greatly reduced

Raw IOPS throughput has skyrocketed

Testing Ceilometer backed by MongoDB on RBD, I've seen this 5 node / 90 OSD cluster spike to 18k IOPS

Throughput testing with RADOS bench and osd bench shows an approx. 10-fold increase

User feedback has been extremely positive; the general Openstack experience at Target is much improved

Performance within Openstack instances has increased about 10x

Results

Before (original cluster):

test: (groupid=0, jobs=1): err= 0: pid=1914
  read : io=1542.5MB, bw=452383 B/s, iops=110, runt=3575104msec
  write: io=527036KB, bw=150956 B/s, iops=36, runt=3575104msec

After (rebuilt cluster):

test: (groupid=0, jobs=1): err= 0: pid=2131
  read : io=2046.6MB, bw=11649KB/s, iops=2912, runt=179853msec
  write: io=2049.1MB, bw=11671KB/s, iops=2917, runt=179853msec

Slide12

Before embarking on creating a Ceph environment, have a good idea of what your objectives are for the environment.

Capacity? Performance?

Once you understand your objective, understand that your hardware selection is crucial to your success

Unless you are architecting for raw capacity, use SSDs for your journal volumes without exception

If you must co-locate journals, use a RAID adapter with BBU + writeback cache

A hybrid approach may be feasible with SATA 'capacity' disks and SSD journals. I've yet to try this; I'd be interested in seeing some benchmark data on a setup like this.
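For reference, putting the journal on a separate SSD partition is just the optional journal field in Hammer-era ceph-deploy – a sketch with hypothetical host and device names (exact syntax varies by ceph-deploy version):

$ ceph-deploy osd create osd-node1:sdb:/dev/sdf1    # data on sdb, journal on SSD partition sdf1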

Research, experiment, consult with Red Hat / Inktank

Monitor, monitor, monitor and provide a very short feedback loop for your users to engage you with their concerns

Conclusion

Slide13

Looking to test all-SSD pool performance

Ceph wasn't there in Firefly, but SSD optimizations were a major part of the Hammer release. We have a need for an 'ultra' Cinder tier for workloads that require high IOPS / low latency, for use cases such as Kafka and Cassandra.

Also considering SolidFire for this use case. If anyone has experience with this – I'd love to hear about it!

Repurposing legacy SATA hardware into a dedicated object pool

High capacity, low performance drives should work well in an object use case – more research is needed into end-user requirements.

Broadening Ceph beyond the cloud niche use case. Repurposing 'capacity' frames coupled with Ceph erasure coding could be compelling for certain use cases:

Video archiving for security camera footage

Register / POS log archiving
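A minimal sketch of standing up an erasure-coded pool for that kind of archive data (profile name and k/m values are illustrative assumptions, not a recommendation):

$ ceph osd erasure-code-profile set archive_profile k=4 m=2 ruleset-failure-domain=host
$ ceph osd pool create archive 256 256 erasure archive_profile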

Next Steps

Slide14

Plan time into your deployment schedule to iron out dependency hell, especially if you are moving from Inktank packages to Red Hat packages.

In Hammer, you no longer have to use Apache and the FastCGI shim for the RADOSGW object service. Enable civetweb with the following entry in the [client.radosgw.gateway] section of ceph.conf – and make sure you shut off Apache!

rgw_frontends = "civetweb port=80"
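In context, a minimal sketch of that stanza (only the rgw_frontends line comes from the slide – keep whatever keys your existing gateway section already has), then stop Apache so civetweb can bind port 80 and restart your radosgw service; service names vary by distro:

[client.radosgw.gateway]
    rgw_frontends = "civetweb port=80"

$ systemctl stop httpd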

Use the new and improved CRUSH tunables. This WILL trigger a lot of rebalancing activity!

$ ceph osd crush tunables optimal

In the [osd] section of ceph.conf set the following directive. This prevents new OSDs from triggering rebalancing. Nope, setting NOIN won't do the trick!

osd_crush_update_on_start = false

Ceph's default recovery settings are far too aggressive. Tone them down with the following in the [osd] section or recovery will impact client IO:

osd_max_backfills = 1
osd_recovery_priority = 1
osd_client_op_priority = 63
osd_recovery_max_active = 1
osd_recovery_max_single_start = 1

General Tips on Migrations/Upgrades to Hammer

Slide15
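The same recovery throttles can be pushed to a running cluster – handy before big rebalances like the reweight work below. A minimal sketch using the standard injectargs mechanism (verify flag names against your release):

$ ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'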

The best method to 'drain' hosts is by adjusting the CRUSH weight of the OSDs on those hosts, NOT the OSD weight. CRUSH weight dictates cluster-wide data distribution. OSD weights only impact the host the OSD is on and can cause unpredictability.
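To see current CRUSH weights and which OSDs sit under each host before touching anything:

$ ceph osd tree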

Don't work serially host by host – drop the CRUSH weight of all the OSDs you are removing across the cluster simultaneously. I used a 'reduce by 50% and allow recovery' scheme. Your mileage may vary.

$ for i in {0..119}; do ceph osd crush reweight osd.$i 3.0; done
$ for i in {0..119}; do ceph osd crush reweight osd.$i 1.5; done
$ for i in {0..119}; do ceph osd crush reweight osd.$i .75; done

Consider numad to auto-magically set NUMA affinities. Still experimenting with the impact of this on cluster performance.

Last but not least – VERY important. You WILL run out of threads and OSDs WILL crash if you don't tune the kernel.pid_max value:

$ echo "kernel.pid_max = 4194303" >> /etc/sysctl.conf

Slide16

Thanks For Your Time!

Questions?
