
BlueDBM - PowerPoint Presentation

Uploaded by min-jolicoeur on 2016-02-26




Presentation Transcript

Slide 1

BlueDBM: An Appliance for Big Data Analytics

Sang-Woo Jun*, Ming Liu*, Sungjin Lee*, Jamey Hicks+, John Ankcorn+, Myron King+, Shuotao Xu*, Arvind*
*MIT Computer Science and Artificial Intelligence Laboratory
+Quanta Research Cambridge


June 15, 2015

This work is funded by Quanta, Samsung and Lincoln Laboratory.

We also thank Xilinx for their hardware and expertise donations.

ISCA 2015, Portland, OR

Slide 2

Big data analytics

Analysis of previously unimaginable amounts of data can provide deep insight:

Google has predicted flu outbreaks a week earlier than the Centers for Disease Control (CDC)
Analyzing a personal genome can determine predisposition to diseases
Social network chatter analysis can identify political revolutions before newspapers
Scientific datasets can be mined to extract accurate models

Likely to be the biggest economic driver for the IT industry for the next decade

Slide 3

A currently popular solution: RAM Cloud
Cluster of machines with large DRAM capacity and fast interconnect

+ Fastest as long as data fits in DRAM
- Power hungry and expensive
- Performance drops when data doesn't fit in DRAM

Flash-based solutions may be a better alternative:
+ Faster than disk, cheaper than DRAM
+ Lower power consumption than both
- Legacy storage access interface is burdening
- Slower than DRAM

What if enough DRAM isn't affordable?

Slide 4

Related work

Use of flash: SSDs, FusionIO, Purestorage, Zetascale; SSD for database buffer pool and metadata [SIGMOD 2008], [IJCA 2013]
Networks: QuickSAN [ISCA 2013], Hadoop/Spark on Infiniband RDMA [SC 2012]
Accelerators: SmartSSD [SIGMOD 2013], Ibex [VLDB 2014], Catapult [ISCA 2014], GPUs

Slide 5

Latency profile of distributed flash-based analytics

Distributed processing involves many system components:

Flash device access: 75 μs (50~100 μs)
Storage software (OS, FTL, ...): 100 μs (100~1000 μs)
Network interface (10GbE, Infiniband, ...): 20 μs (20~1000 μs)
Actual processing: ...

Latency is additive

Slide 6

Latency profile of distributed flash-based analytics

Architectural modifications can remove unnecessary overhead:

Near-storage processing
Cross-layer optimization of flash management software*
Dedicated storage area network
Accelerator

Modified latency profile: flash access 75 μs (50~100 μs), network 20 μs (20~1000 μs), processing < 20 μs; the storage software layer (100 μs) is removed

Difficult to explore using flash packaged as off-the-shelf SSDs

Slide 7
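The additive-latency argument can be sketched numerically. In this back-of-the-envelope model the component latencies are the typical values from the profile above; the function and dictionary names are our own, for illustration only.

```python
# Back-of-the-envelope model: latencies along the access path simply add up.
# Typical per-component values are taken from the latency profile slides.

def end_to_end_latency_us(components):
    """Total access latency is the sum of every layer on the path."""
    return sum(components.values())

# Conventional stack: flash access + storage software + network + processing
conventional = {
    "flash_access": 75,       # 50~100 us
    "storage_software": 100,  # OS, FTL, ... (100~1000 us)
    "network": 20,            # 10GbE, Infiniband, ... (20~1000 us)
    "processing": 20,
}

# BlueDBM-style stack: storage software bypassed, processing near storage
bluedbm = {
    "flash_access": 75,
    "network": 20,
    "processing": 20,         # accelerated, < 20 us
}

print(end_to_end_latency_us(conventional))  # 215
print(end_to_end_latency_us(bluedbm))       # 115
```

Even with optimistic typical values, removing one layer cuts the end-to-end latency almost in half; with the pessimistic (1000 μs) values the gap is far larger.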

Custom flash card had to be built

[Board diagram: Artix 7 FPGA with an HPC FMC port to the VC707, network ports, flash array on both sides, flash buses 0-3]

Slide 8

BlueDBM: Platform with near-storage processing and inter-controller networks


20 24-core Xeon servers
20 BlueDBM storage devices
1TB flash storage
x4 20Gbps controller network
Xilinx VC707
2GB/s PCIe

Slide 9

BlueDBM: Platform with near-storage processing and inter-controller networks

[Photos: 1 of 2 racks (10 nodes); a BlueDBM storage device]

Slide 10

BlueDBM node architecture

[Diagram: host server connected over PCIe to a flash device containing an in-storage processor, flash controller, and network interface]

Flash controller: lightweight flash management with very low overhead; adds almost no latency; ECC support

Network interface: custom network protocol with low latency and high bandwidth; x4 20Gbps links at 0.5 μs latency; virtual channels with flow control

Host server: software has very low-level access to flash storage; high-level information can be used for low-level management; FTL implemented inside the file system

No time to go into gritty details!

Slide 11

BlueDBM software view

BlueDBM provides a generic file system interface as well as an accelerator-specific interface (aided by Connectal)

[Diagram: user space holds hardware-assisted applications and a Connectal proxy (generated by Connectal*); kernel space holds the file system, block device driver, accelerator manager, and Connectal (by Quanta); the FPGA holds the Connectal wrapper, HW accelerator, network interface, and flash controller over NAND flash]

Slide 12

Power consumption is low

Component             Power (Watts)
VC707                 30
Flash board (x2)      10
Storage device total  40

Component             Power (Watts)
Storage device        40
Xeon server           200+
Node total            240+

Storage device power consumption is a very conservative estimate
A GPU-based accelerator would double the power

Slide 13

Applications

Content-based image search*: faster flash with accelerators as a replacement for DRAM-based systems
BlueCache, an accelerated memcached*: a dedicated network and accelerated caching system with larger capacity
Graph analytics: benefits of lower-latency access into distributed flash for computation on large graphs

* Results obtained since the paper submission

Slide 14

Content-based image retrieval

Takes a query image and returns similar images in a dataset of tens of millions of pictures

Image similarity is determined by measuring the distance between the histograms of each image
Histograms are generated using RGB, HSV, "edgeness", etc.
Better algorithms are available!

Slide 15
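As a rough sketch of the histogram approach described above, the snippet below reduces each "image" (here just a list of 8-bit intensities) to a normalized histogram and compares candidates by L1 distance. The bin count, the L1 metric, and all names are our assumptions; the real system also combines RGB, HSV, and edgeness features.

```python
# Histogram-based image similarity, simplified to one intensity channel.

def normalized_histogram(pixels, bins=16):
    """Histogram of 8-bit intensity values, normalized to sum to 1."""
    hist = [0] * bins
    for p in pixels:
        hist[p * bins // 256] += 1
    total = len(pixels)
    return [h / total for h in hist]

def histogram_distance(h1, h2):
    """L1 distance between two histograms; smaller means more similar."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

query = normalized_histogram([10, 12, 200, 210, 30, 40])
candidate = normalized_histogram([11, 13, 205, 208, 28, 42])   # similar image
unrelated = normalized_histogram([255, 254, 250, 251, 249, 248])
print(histogram_distance(query, candidate) < histogram_distance(query, unrelated))  # True
```

The comparison is a streaming dot-product-style computation over fixed-size vectors, which is why it maps well onto an FPGA pipeline fed directly from flash.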

Image search accelerator (Sang-Woo Jun, Chanwoo Chung)

[Pipeline diagram: flash, through the flash controller, feeds a Sobel filter and histogram generator on the FPGA; a comparator matches against the query histogram and returns results to software]

Slide 16

Image query performance without sampling

Faster flash with acceleration can perform at DRAM speed

[Chart: off-the-shelf M.2 SSD vs. BlueDBM + CPU (CPU bottleneck) vs. BlueDBM + FPGA]

Slide 17

Sampling to improve performance

Intelligent sampling methods (e.g., Locality Sensitive Hashing) improve performance by dramatically reducing the search space, but introduce a random access pattern

[Diagram: a locality-sensitive hash table indexing into data; accesses corresponding to a single hash table entry result in many random accesses]

Slide 18
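The random-access effect can be illustrated with a minimal LSH sketch. A fixed projection vector stands in for a randomly drawn LSH hash (fixed here so the example is reproducible); the bucket width and all names are our own illustration.

```python
# Minimal locality-sensitive hashing sketch: similar feature vectors tend to
# land in the same bucket, so a query compares against one bucket instead of
# scanning the whole dataset.

BUCKET_WIDTH = 0.5
# A real LSH family draws this projection at random (e.g., Gaussian entries);
# it is fixed here so the example is reproducible.
projection = [0.5, -0.25, 0.75, 1.0]

def lsh_bucket(vec):
    """Quantized projection: nearby vectors tend to share a bucket."""
    dot = sum(v * p for v, p in zip(vec, projection))
    return int(dot // BUCKET_WIDTH)

# Build the hash table over the dataset (feature vectors, e.g. histograms)...
dataset = [
    [0.10, 0.20, 0.30, 0.40],
    [0.11, 0.19, 0.31, 0.39],  # similar to the first vector
    [0.90, 0.80, 0.10, 0.00],  # dissimilar
]
table = {}
for i, vec in enumerate(dataset):
    table.setdefault(lsh_bucket(vec), []).append(i)

# ...then a query touches only one bucket. The indices in a bucket are
# scattered across the dataset: this is the random access pattern that hurts
# disks but is cheap on flash.
query = [0.10, 0.20, 0.30, 0.41]
candidates = table.get(lsh_bucket(query), [])
print(candidates)  # [0, 1]
```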

Image query performance with sampling

A disk-based system cannot take advantage of the reduced search space

Slide 19

memcached service

A distributed in-memory key-value store that caches DB results indexed by query strings

Accessed via socket communication
Uses system DRAM for caching (~256GB)
Extensively used by database-driven websites: Facebook, Flickr, Twitter, Wikipedia, YouTube, ...

[Diagram: browser/mobile apps send web requests to application servers, which issue memcached requests to memcached servers and return the data]

Networking contributes 90% of the overhead

Slide 20

BlueCache: accelerated memcached service (Shuotao Xu)

Memcached server implemented in hardware:

Hashing and flash management implemented in FPGA
1TB hardware-managed flash cache per node
Hardware server accessed via local PCIe
Direct network between hardware (inter-controller network)

[Diagram: each web server connects over PCIe to a BlueCache accelerator with a flash controller and 1TB of flash; the accelerators are linked by the inter-controller network]

Slide 21

Effect of architecture modification (no flash, only DRAM)

PCIe DMA and the inter-controller network reduce access overhead
FPGA acceleration of memcached is effective: 11X performance

Slide 22

A high cache-hit rate outweighs slow flash accesses (small DRAM vs. large flash)

Key size = 64 bytes, value size = 8K bytes; 5 ms penalty per cache miss

BlueCache starts performing better at a 5% miss rate*
A "sweet spot" for large flash caches exists

* Assuming no cache misses for BlueCache

Slide 23
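The trade-off can be approximated with a simple expected-latency model. Only the 5 ms miss penalty comes from the slide; the DRAM and flash hit times below are illustrative assumptions, as is all the naming.

```python
# Expected access time for a small DRAM cache that sometimes misses vs. a
# large flash cache that (as the slide assumes) never misses.

DRAM_HIT_US = 10        # assumed DRAM cache access time
FLASH_HIT_US = 100      # assumed flash cache access time
MISS_PENALTY_US = 5000  # 5 ms penalty per cache miss (from the slide)

def avg_access_us(hit_us, miss_rate):
    """Expected access time: every access pays hit_us, misses add the DB penalty."""
    return hit_us + miss_rate * MISS_PENALTY_US

# The DRAM cache misses on the fraction of the working set that doesn't fit;
# the flash cache is large enough to hold everything (0% misses).
for miss_rate in (0.01, 0.05, 0.10):
    dram = avg_access_us(DRAM_HIT_US, miss_rate)
    flash = avg_access_us(FLASH_HIT_US, 0.0)
    print(f"miss rate {miss_rate:.0%}: DRAM {dram:.0f} us, flash {flash:.0f} us")
```

With these assumed numbers the flash cache already wins at a 5% miss rate, matching the qualitative point of the slide: a large miss penalty makes capacity matter more than raw hit latency.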

Graph traversal

A very latency-bound problem, because the next node to visit often cannot be predicted
Beneficial to reduce latency by moving computation closer to the data

[Diagram: hosts 1-3, each with an in-store processor, traversing a graph spread across flash devices 1-3]

Slide 24

Graph traversal performance

A flash-based system can achieve comparable performance with a much smaller cluster

[Chart: DRAM vs. flash]

* Used the fast BlueDBM network even for the separate-network configuration, for fairness

Slide 25

Other potential applications

Genomics
Deep machine learning
Complex graph analytics
Platform acceleration: Spark, MATLAB, SciDB, ...

Suggestions and collaboration are welcome!

Slide 26

Conclusion

Fast flash-based distributed storage systems with low-latency random access may be a good platform to support complex queries on Big Data

Reducing access latency for distributed storage requires architectural modifications, including in-storage processors and fast storage networks

Flash-based analytics hold a lot of promise, and we plan to continue demonstrating more application acceleration

Thank you

Slide 27

Slide 28

Near-data accelerator is preferable

[Diagrams: traditional approach vs. BlueDBM placement of the CPU, DRAM, flash, NIC, and accelerator FPGA on the motherboard]

Hardware and software latencies are additive

Slide 29

[Board photo: DRAM, PCIe, flash, network ports, Artix 7, network cable, Virtex 7 VC707]
