Ceph Deployment at Target:
Best Practices and Lessons Learned
Introduction

Will Boege
Sr. Technical Architect
RAPID Team (Private Cloud and Automation)
The first Ceph environment at Target went live in October of 2014
- "Firefly" release
Ceph was backing Target's first 'official' OpenStack release
- Icehouse based
Ceph is used for:
- libRBD for OpenStack instances and volumes
- RADOSGW for object storage (instead of Swift)
- Kernel RBD backing Ceilometer MongoDB volumes
Currently DEV is the largest environment, with ~1500 instances
Replaced the traditional array-based approach implemented in our prototype Havana environment:
- The traditional storage model was problematic to integrate
- Maintenance/purchase costs from Expensive Machine Corporations can get prohibitive
- A traditional SAN just doesn't 'feel' right in this space
- Ceph integrates tightly with OpenStack
Initial Ceph Deployment:

3 x Monitor Nodes – Cisco B200
12 x OSD Nodes – Cisco C240 LFF
- 12 x 4TB SATA disks
- 10 OSDs per server
- Journal partition co-located on each OSD disk
- 120 OSDs total = ~400 TB
2 x 10GbE per host
- 1 "front end", 1 private Ceph
Basic LSI 'MegaRAID' controller – SAS 2008M-8i
- No supercap or cache capability onboard
- 10 x RAID0
Post-rollout, it became evident that there were performance issues within the environment:
- KitchenCI users would complain of slow Chef converge times
- Yum transactions would take abnormal amounts of time to complete
- Instance boot times, especially for images using cloud-init, were excessively long
- General user griping about 'slowness'

Lesson #1 – Instrument Your Deployment!
Track statistics/metrics that have real impact on your users.
Unacceptable levels of latency even while the cluster was relatively unworked
High levels of CPU IOWait% on the OSD servers & IDLE OpenStack instances
Poor IOPS / latency – FIO benchmarks running INSIDE OpenStack instances:

$ fio --rw=write --ioengine=libaio --runtime=100 --direct=1 --bs=4k --size=10G --iodepth=32 --name=/tmp/testfile.bin
test: (groupid=0, jobs=1): err= 0: pid=1914
read : io=1542.5MB, bw=452383 B/s, iops=110, runt=3575104msec
write: io=527036KB, bw=150956 B/s, iops=36, runt=3575104msec
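As a sketch of the kind of lightweight check that belongs in that instrumentation: the helper below parses `ceph osd perf`-style output and flags any OSD whose commit latency crosses a threshold. The sample data is an illustrative stand-in, not captured from our cluster; on a live cluster you would pipe `ceph osd perf` into it.

```shell
# check_latency: read `ceph osd perf`-style output on stdin and print any
# OSD whose fs_commit_latency exceeds the threshold (in ms) given as $1.
check_latency() {
  awk -v t="$1" 'NR > 1 && $2 + 0 > t + 0 { print "osd." $1 }'
}

# Illustrative sample standing in for live `ceph osd perf` output:
sample='osd fs_commit_latency(ms) fs_apply_latency(ms)
0 12 3
1 450 120
2 8 2'

# On a real cluster: ceph osd perf | check_latency 100
printf '%s\n' "$sample" | check_latency 100
```

Wired into a cron job or a monitoring agent, a check like this gives a short feedback loop on OSD latency instead of waiting for user complaints.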
Having more effective instrumentation from the outset would have revealed obvious problems with our architecture.
Compounding the performance issues, we began to see mysterious reliability issues:
- OSDs would randomly fall offline
- The cluster would enter a HEALTH_ERR state about once a week with 'unfound objects' and/or inconsistent placement groups that required manual intervention to fix
- These problems were usually coupled with a large drop in our already suspect performance levels

Lesson #2 – Do your research on the hardware your server vendor provides!
Don't just blindly accept whatever they had laying around – be proactive!
Root cause of the HEALTH_ERRs was the "unnamed vendor's" SATA drives in our solution 'soft-failing' – slowly accumulating media errors without reporting themselves as failed. Don't rely on SMART. Interrogate your disks with an array-level tool, like MegaCli, to identify drives for proactive replacement:

$ /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep Media
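A slightly fuller version of that check, as a sketch: count the `Media Error Count` lines in the `-PDList` output and flag any drive reporting a non-zero count. The sample output below is illustrative stand-in data, not captured from our controllers.

```shell
# flag_media_errors: read `MegaCli64 -PDList -aALL` output on stdin and
# print the (0-based) drive index of any drive reporting media errors.
flag_media_errors() {
  awk 'BEGIN { n = 0 }
       /Media Error Count:/ {
         if ($NF + 0 > 0) print "drive " n " media errors: " $NF
         n++
       }'
}

# Illustrative PDList excerpt (two drives, one soft-failing):
sample='Media Error Count: 0
Other Error Count: 0
Media Error Count: 137
Other Error Count: 2'

# On a real host: /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | flag_media_errors
printf '%s\n' "$sample" | flag_media_errors
```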
In installations with co-located journal partitions, a RAID solution with cache + BBU for writeback operation would have been a huge performance gain. Paying more attention to the suitability of the hardware our vendor of choice provided would have saved a lot of headaches.
Which leads us to –

Lesson #3 – Ceph is not magic. It does the best it can with the hardware you give it!

There is much ill-advised advice floating around that if you throw enough crappy disks at Ceph, you will achieve enterprise-grade performance. Garbage in – garbage out. Don't be greedy and build for capacity if high performance is your objective.
Lessons learned – we set out to rebuild.

New Ceph OSD Deployment:

5 x OSD Nodes – Cisco C240 SFF
- 18 x 10k Seagate SAS 1.1TB
- 6 x 480GB Intel S3500 SSD
  - Intel SSDs are the de facto SSD for use with Ceph. Caveat emptor with other brands!
- 18 OSDs per server
- Journal partition on SSD with a 4/5:1 OSD/journal ratio
- 90 OSDs total = ~100 TB
Improved LSI 'MegaRAID' controller – SAS-9271-8i
- Supercap
- Writeback capability
- 18 x RAID0
- Writethrough on journals, writeback on spinning OSDs. Still experimenting with this!
Based on the "Hammer" Ceph release
Results

Obtaining metrics from our design change was nearly immediate due to having effective monitors in place:
- Latency improvements have been extreme
- IOWait% within OpenStack instances has been greatly reduced
- Raw IOPS throughput has skyrocketed
- Testing Ceilometer backed by MongoDB on RBD, I've seen this 5-node / 90-OSD cluster spike to 18k IOPS
- Throughput testing with RADOS bench and osd bench shows an approx. 10-fold increase
- User feedback has been extremely positive; the general OpenStack experience at Target is much improved
- Performance within OpenStack instances has increased about 10x

Before:
test: (groupid=0, jobs=1): err= 0: pid=1914
read : io=1542.5MB, bw=452383 B/s, iops=110, runt=3575104msec
write: io=527036KB, bw=150956 B/s, iops=36, runt=3575104msec

After:
test: (groupid=0, jobs=1): err= 0: pid=2131
read : io=2046.6MB, bw=11649KB/s, iops=2912, runt=179853msec
write: io=2049.1MB, bw=11671KB/s, iops=2917, runt=179853msec
Conclusion

Before embarking on creating a Ceph environment, have a good idea of what your objectives are for the environment:
- Capacity?
- Performance?
Once you understand your objective, understand that your hardware selection is crucial to your success:
- Unless you are architecting for raw capacity, use SSDs for your journal volumes without exception
- If you must co-locate journals, use a RAID adapter with BBU + writeback cache
- A hybrid approach may be feasible with SATA 'capacity' disks and SSD journals. I've yet to try this; I'd be interested in seeing some benchmark data on a setup like this
Research, experiment, consult with Red Hat / Inktank
Monitor, monitor, monitor – and provide a very short feedback loop for your users to engage you with their concerns
Next Steps

Looking to test all-SSD pool performance
- Ceph wasn't there in Firefly, but SSD optimizations were a major part of the Hammer release. We have a need for an 'ultra' Cinder tier for workloads that require high IOPS / low latency, such as Kafka and Cassandra
- Also considering SolidFire for this use case. If anyone has experience with this – I'd love to hear about it!
Repurposing legacy SATA hardware into a dedicated object pool
- High-capacity, low-performance drives should work well in an object use case – more research is needed into end-user requirements
Broadening Ceph beyond its cloud niche use case
- Repurposing 'capacity' frames coupled with Ceph erasure coding could be compelling for certain use cases:
  - Video archiving for security camera footage
  - Register / POS log archiving
General Tips on Migrations / Upgrades to Hammer

Plan time into your deployment schedule to iron out dependency hell, especially if you are moving from Inktank packages to Red Hat packages.

In Hammer, you no longer have to use Apache and the FastCGI shim for RADOSGW object service. Enable civetweb with the following entry in the [client.radosgw.gateway] section of ceph.conf – and make sure you shut off Apache!

rgw_frontends = "civetweb port=80"

Use the new and improved CRUSH tunables. This WILL trigger a lot of rebalancing activity!

$ ceph osd crush tunables optimal

In the [osd] section of ceph.conf, set the following directive to prevent new OSDs from triggering rebalancing. Nope, setting NOIN won't do the trick!

osd_crush_update_on_start = false

Ceph's default recovery settings are far too aggressive. Tone them down with the following in the [osd] section, or they will impact client IO:

osd_max_backfills = 1
osd_recovery_priority = 1
osd_client_op_priority = 63
osd_recovery_max_active = 1
osd_recovery_max_single_start = 1
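Taken together, the ceph.conf changes above amount to a fragment like this (values are the recommendations from this deck, not universal defaults):

```ini
[client.radosgw.gateway]
rgw_frontends = "civetweb port=80"

[osd]
osd_crush_update_on_start = false
osd_max_backfills = 1
osd_recovery_priority = 1
osd_client_op_priority = 63
osd_recovery_max_active = 1
osd_recovery_max_single_start = 1
```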
The best method to 'drain' hosts is by adjusting the CRUSH weight of the OSDs on those hosts, NOT the OSD weight. CRUSH weight dictates cluster-wide data distribution; OSD weights only impact the host the OSD is on and can cause unpredictability.

Don't work serially host by host – drop the CRUSH weight of all the OSDs you are removing across the cluster simultaneously. I used a 'reduce by 50% and allow recovery' scheme. Your mileage may vary.

$ for i in {0..119}; do ceph osd crush reweight osd.$i 3.0; done
$ for i in {0..119}; do ceph osd crush reweight osd.$i 1.5; done
$ for i in {0..119}; do ceph osd crush reweight osd.$i .75; done

Consider numad to auto-magically set NUMA affinities. Still experimenting with the impact of this on cluster performance.

Last but not least – VERY important: you WILL run out of threads and OSDs WILL crash if you don't tune the kernel.pid_max value.

$ echo "kernel.pid_max = 4194303" >> /etc/sysctl.conf
$ sysctl -p
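The staged reweight loops above can be generalized into a small helper that prints each stage's commands for review before anything is run. The function name, OSD count, and starting weight here are illustrative, not from the deck:

```shell
# drain_schedule: print `ceph osd crush reweight` commands for a staged
# 50% drain. Arguments: OSD count, starting CRUSH weight, number of stages.
# Commands are printed rather than executed so each stage can be reviewed
# and given time to recover before the next one is applied.
drain_schedule() {
  osds=$1; weight=$2; stages=$3
  s=0
  while [ "$s" -lt "$stages" ]; do
    # halve the weight for this stage
    weight=$(awk -v w="$weight" 'BEGIN { printf "%.2f", w / 2 }')
    i=0
    while [ "$i" -lt "$osds" ]; do
      echo "ceph osd crush reweight osd.$i $weight"
      i=$((i + 1))
    done
    echo "# wait for recovery to finish before stage $((s + 2))"
    s=$((s + 1))
  done
}

# Three OSDs, starting CRUSH weight 6.0, two halving stages:
drain_schedule 3 6.0 2
```

Piping the reviewed output through `sh` (one stage at a time) reproduces the loops above.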
Thanks For Your Time!
Questions?