BlueDBM: An Appliance for Big Data Analytics
Sang-Woo Jun*, Ming Liu*, Sungjin Lee*, Jamey Hicks+, John Ankcorn+, Myron King+, Shuotao Xu*, Arvind*
*MIT Computer Science and Artificial Intelligence Laboratory
+Quanta Research Cambridge
June 15, 2015
This work is funded by Quanta, Samsung and Lincoln Laboratory.
We also thank Xilinx for their hardware and expertise donations.
ISCA 2015, Portland, OR
Big data analytics
Analysis of previously unimaginable amounts of data can provide deep insight:
- Google has predicted flu outbreaks a week earlier than the Centers for Disease Control (CDC)
- Analyzing a personal genome can determine predisposition to diseases
- Social network chatter analysis can identify political revolutions before newspapers
- Scientific datasets can be mined to extract accurate models
Big data analytics is likely to be the biggest economic driver for the IT industry for the next decade.
A currently popular solution: RAMCloud
A cluster of machines with large DRAM capacity and a fast interconnect
+ Fastest as long as data fits in DRAM
- Power hungry and expensive
- Performance drops when data doesn't fit in DRAM
Flash-based solutions may be a better alternative:
+ Faster than disk, cheaper than DRAM
+ Lower power consumption than both
- Legacy storage access interface is burdensome
- Slower than DRAM
What if enough DRAM isn't affordable?
Related work
- Use of flash: SSDs, FusionIO, Pure Storage, ZetaScale; SSD for database buffer pool and metadata [SIGMOD 2008], [IJCA 2013]
- Networks: QuickSAN [ISCA 2013], Hadoop/Spark on InfiniBand RDMA [SC 2012]
- Accelerators: SmartSSD [SIGMOD 2013], Ibex [VLDB 2014], Catapult [ISCA 2014], GPUs
Latency profile of distributed flash-based analytics
Distributed processing involves many system components:
- Flash device access: 50~100 μs
- Storage software (OS, FTL, …): 100~1000 μs
- Network interface (10GbE, InfiniBand, …): 20~1000 μs
- Actual processing
Latency is additive.
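The additive latency budget above can be sketched as a toy model. The per-component ranges are the typical figures quoted on the slide; the dictionary keys and the helper function are illustrative, not part of BlueDBM.

```python
# Illustrative model of the additive latency budget.
# The ranges are the typical per-component figures from the slide.

components_us = {
    "flash_access": (50, 100),        # flash device access
    "storage_software": (100, 1000),  # OS, FTL, ...
    "network": (20, 1000),            # 10GbE, InfiniBand, ...
}

def end_to_end_range(components):
    """Latencies are additive: the total is the sum over components."""
    lo = sum(lo for lo, hi in components.values())
    hi = sum(hi for lo, hi in components.values())
    return lo, hi

lo, hi = end_to_end_range(components_us)
print(f"end-to-end access latency: {lo}~{hi} us before any processing")
```

Even the best case (170 μs here) is dominated by software and network overhead rather than the flash itself, which motivates the architectural changes on the next slide.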
Latency profile of distributed flash-based analytics
Architectural modifications can remove unnecessary overhead:
- Near-storage processing (accelerator)
- Cross-layer optimization of flash management software
- Dedicated storage area network
[Figure: modified latency profile — flash access 75 μs, storage software 100 μs, network 20 μs, processing < 20 μs]
Difficult to explore using flash packaged as off-the-shelf SSDs.
Custom flash card had to be built
[Photo: custom flash card — Artix-7 FPGA, HPC FMC port to the VC707, network ports, four flash buses (Bus 0~3), flash array on both sides]
BlueDBM: Platform with near-storage processing and inter-controller networks
- 20 24-core Xeon servers
- 20 BlueDBM storage devices
- 1TB flash storage
- x4 20Gbps controller network
- Xilinx VC707
- 2GB/s PCIe
BlueDBM: Platform with near-storage processing and inter-controller networks
[Photo: 1 of 2 racks (10 nodes); a BlueDBM storage device. Same configuration as above.]
BlueDBM node architecture
[Diagram: host server connected over PCIe to the flash device, which contains an in-storage processor, flash controller, and network interface]
- Lightweight flash management with very low overhead: adds almost no latency; ECC support
- Custom network protocol with low latency and high bandwidth: x4 20Gbps links at 0.5 μs latency; virtual channels with flow control
- Software has very low-level access to flash storage: high-level information can be used for low-level management; the FTL is implemented inside the file system
No time to go into gritty details!
BlueDBM software view
BlueDBM provides a generic file system interface as well as an accelerator-specific interface (aided by Connectal, by Quanta).
[Diagram: user space — hardware-assisted applications, Connectal proxy*; kernel space — file system, accelerator manager, block device driver; FPGA — Connectal wrapper*, HW accelerator, flash controller, network interface, NAND flash. *Generated by Connectal]
Power consumption is low

Component              Power (Watts)
VC707                  30
Flash board (x2)       10
Storage device total   40

Component              Power (Watts)
Storage device         40
Xeon server            200+
Node total             240+

The storage device power consumption is a very conservative estimate.
A GPU-based accelerator would double the power.
Applications
- Content-based image search*: faster flash with accelerators as a replacement for DRAM-based systems
- BlueCache, an accelerated memcached*: dedicated network and accelerated caching systems with larger capacity
- Graph analytics: benefits of lower-latency access into distributed flash for computation on large graphs
* Results obtained since the paper submission
Content-based image retrieval
Takes a query image and returns similar images from a dataset of tens of millions of pictures.
Image similarity is determined by measuring the distance between histograms of each image.
The histogram is generated using RGB, HSV, "edgeness", etc.
Better algorithms are available!
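The histogram-distance idea can be sketched as follows. This is a minimal, hypothetical version using only per-channel RGB histograms and an L1 distance; the real pipeline also folds in HSV and "edgeness" features.

```python
# Minimal sketch of histogram-based image similarity (hypothetical;
# the real pipeline also uses HSV and "edgeness" features).

def histogram(pixels, bins=8):
    """Per-channel RGB histogram, normalized to sum to 1 per channel."""
    counts = [[0] * bins for _ in range(3)]
    for r, g, b in pixels:
        for ch, v in enumerate((r, g, b)):
            counts[ch][v * bins // 256] += 1
    n = len(pixels)
    return [[c / n for c in ch] for ch in counts]

def distance(h1, h2):
    """L1 distance between histograms; smaller means more similar."""
    return sum(abs(a - b)
               for ch1, ch2 in zip(h1, h2)
               for a, b in zip(ch1, ch2))

# Query: rank dataset images by distance to the query histogram.
query = histogram([(200, 30, 30)] * 4)           # reddish image
dataset = {
    "red":  histogram([(210, 40, 20)] * 4),
    "blue": histogram([(20, 30, 220)] * 4),
}
best = min(dataset, key=lambda name: distance(query, dataset[name]))
print(best)  # "red": the reddish image ranks closest
```

Scanning every stored histogram this way is a pure sequential read, which is why the unsampled query (next slides) is bandwidth-bound rather than latency-bound.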
Image search accelerator (Sang-Woo Jun, Chanwoo Chung)
[Diagram: data from flash flows through the flash controller into a Sobel filter and histogram generator on the FPGA; a comparator matches against the query histogram supplied by software]
Image query performance without sampling
Faster flash with acceleration can perform at DRAM speed.
[Graph: off-the-shelf M.2 SSD vs. BlueDBM + CPU vs. BlueDBM + FPGA; the CPU is the bottleneck without the accelerator]
Sampling to improve performance
Intelligent sampling methods (e.g., locality-sensitive hashing) improve performance by dramatically reducing the search space, but introduce a random access pattern.
[Diagram: locality-sensitive hash table pointing into the data]
Data accesses corresponding to a single hash table entry result in many random accesses.
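Why sampling turns a scan into random reads can be sketched with a toy locality-sensitive hash. The hash function (sign pattern of a feature vector) and the synthetic data layout are purely illustrative assumptions.

```python
# Sketch of why LSH-style sampling creates random access: items that
# hash to the same bucket are scattered across the store, so serving
# one bucket issues many small, non-contiguous reads.
# (Toy hash function and data layout; purely illustrative.)

def lsh_bucket(vec):
    """Toy locality-sensitive hash: the sign pattern of the vector."""
    return tuple(v >= 0 for v in vec)

# Items laid out sequentially in storage, keyed by their offset.
store = {offset: ((-1) ** offset * 1.0, (offset % 3) - 1.0)
         for offset in range(12)}

# Build the hash table: bucket -> list of storage offsets.
table = {}
for offset, vec in store.items():
    table.setdefault(lsh_bucket(vec), []).append(offset)

# A query touches only one bucket (reduced search space), but the
# offsets inside it are non-contiguous: a random access pattern.
bucket = lsh_bucket((0.9, 0.2))
print("offsets to read:", sorted(table[bucket]))
```

A disk pays a full seek for each scattered offset, while flash serves them at near-uniform latency, which is the point made two slides later.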
Image query performance with sampling
A disk-based system cannot take advantage of the reduced search space.
memcached service
A distributed in-memory key-value store that caches DB results indexed by query strings
- Accessed via socket communication
- Uses system DRAM for caching (~256GB)
- Extensively used by database-driven websites: Facebook, Flickr, Twitter, Wikipedia, YouTube, …
[Diagram: browser/mobile apps send a web request to the application servers, which issue a memcached request to the memcached servers and return the data]
Networking contributes 90% of the overhead.
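The request flow in the diagram is the standard cache-aside pattern, sketched below. The in-process dictionaries standing in for memcached and the database, and the `user:42` key, are hypothetical.

```python
# Sketch of the cache-aside lookup path described above: the app
# server asks memcached first and falls back to the database on a
# miss. The dictionaries and key names are stand-ins, not real APIs.

cache = {}                                  # stands in for memcached
database = {"user:42": {"name": "Ada"}}     # stands in for the DB

def get(query_key):
    value = cache.get(query_key)            # memcached request
    if value is not None:
        return value                        # memcached response (hit)
    value = database[query_key]             # miss: query the database
    cache[query_key] = value                # populate the cache
    return value

get("user:42")   # first request misses and goes to the database
print("cached:", "user:42" in cache)  # later requests hit the cache
```

In a real deployment each `cache.get` is a network round trip over sockets, which is where the 90% overhead figure comes from.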
BlueCache: Accelerated memcached service (Shuotao Xu)
- Memcached server implemented in hardware
- Hashing and flash management implemented in the FPGA
- 1TB of hardware-managed flash cache per node
- Hardware server accessed via local PCIe
- Direct network between hardware (inter-controller network)
[Diagram: each node pairs a web server, over PCIe, with a BlueCache accelerator (flash controller + 1TB flash); the accelerators are linked by the controller network]
Effect of architecture modifications (no flash, DRAM only)
PCIe DMA and the inter-controller network reduce access overhead.
FPGA acceleration of memcached is effective: 11X performance.
A high cache-hit rate outweighs slow flash accesses (small DRAM vs. large flash)
Key size = 64 bytes, value size = 8K bytes, 5 ms penalty per cache miss
BlueCache starts performing better at a 5% miss rate.
A "sweet spot" for large flash caches exists.
* Assuming no cache misses for BlueCache
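The break-even point can be worked out with back-of-envelope arithmetic. The 5 ms miss penalty is the slide's figure; the two hit latencies below are illustrative assumptions chosen to be plausible and to land on the slide's 5% crossover, not measured values.

```python
# Back-of-envelope model for the break-even point. The 5 ms miss
# penalty is from the slide; the hit latencies are assumed values.

MISS_PENALTY_MS = 5.0    # DB query on a cache miss (from the slide)
T_DRAM_HIT_MS = 0.05     # assumed DRAM memcached hit latency
T_FLASH_HIT_MS = 0.30    # assumed BlueCache (flash) hit latency

def dram_cache_time(miss_rate):
    """Expected time per request for a small DRAM cache."""
    return T_DRAM_HIT_MS + miss_rate * MISS_PENALTY_MS

def break_even_miss_rate():
    """Miss rate above which the large flash cache wins
    (assuming the flash cache itself never misses)."""
    return (T_FLASH_HIT_MS - T_DRAM_HIT_MS) / MISS_PENALTY_MS

print(f"break-even miss rate: {break_even_miss_rate():.0%}")
```

Because the miss penalty is orders of magnitude larger than either hit latency, even a small miss-rate reduction from the larger flash capacity dominates the slower per-hit access.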
Graph traversal
A very latency-bound problem, because the next node to visit often cannot be predicted.
Beneficial to reduce latency by moving computation closer to the data.
[Diagram: three hosts (Host 1~3), each with a flash device (Flash 1~3); the in-store processor traverses the graph across the flash devices]
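The latency-bound nature of traversal can be seen in a toy breadth-first search where each storage read must complete before the next read targets are known. The adjacency dictionary and `fetch` callback are illustrative stand-ins for reads from distributed flash.

```python
# Why graph traversal is latency-bound: each hop must finish reading
# a node's edge list before the next node to fetch is even known, so
# per-read latency cannot be hidden by prefetching or batching.

from collections import deque

# Toy adjacency store standing in for edge lists on flash.
graph = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}

def bfs(start, fetch):
    """fetch(node) models one storage read; its result determines
    which reads happen next (a dependent, unpredictable pattern)."""
    seen, order, frontier = {start}, [], deque([start])
    while frontier:
        node = frontier.popleft()
        order.append(node)
        for nxt in fetch(node):   # must wait for this read to finish
            if nxt not in seen:   # ...before the next reads are known
                seen.add(nxt)
                frontier.append(nxt)
    return order

reads = []
order = bfs(0, lambda n: (reads.append(n), graph[n])[1])
print(order)  # one dependent storage read per visited node
```

With one dependent read per node, total traversal time scales with per-access latency, so cutting that latency in the storage path pays off directly.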
Graph traversal performance
A flash-based system can achieve comparable performance with a much smaller cluster.
[Graph: DRAM vs. flash]
* Used the fast BlueDBM network even for the separate network, for fairness
Other potential applications
- Genomics
- Deep machine learning
- Complex graph analytics
- Platform acceleration: Spark, MATLAB, SciDB, …
Suggestions and collaboration are welcome!
Conclusion
Fast flash-based distributed storage systems with low-latency random access may be a good platform for supporting complex queries on big data.
Reducing access latency for distributed storage requires architectural modifications, including in-storage processors and fast storage networks.
Flash-based analytics hold a lot of promise, and we plan to continue demonstrating more application acceleration.
Thank you
Near-data accelerator is preferable
Traditional approach: data moves from flash through the CPU, DRAM, and NIC on the motherboard before reaching the accelerator FPGA; hardware and software latencies are additive.
BlueDBM: the accelerator FPGA sits next to the flash, bypassing the motherboard path.
[Diagram: traditional data path vs. the BlueDBM data path]
[Photo: VC707 board — Virtex-7 FPGA, DRAM, PCIe, network cable and ports — with the Artix-7 flash card]