BigDataBench: a Big Data Benchmark Suite from Internet Services
Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Gang Lu, Kent Zhang, Xiaona Li, and Bizhu Qiu
HPCA 2014
Why Big Data Benchmarking?
Measuring big data systems and architectures quantitatively.
What is BigDataBench?
An open source big data benchmarking project: http://prof.ict.ac.cn/BigDataBench/
- 6 real-world data sets
- Generates (4V) big data
- 19 workloads: OLTP, Cloud OLTP, OLAP, and offline analytics
- Same workloads, different implementations
Executive Summary
- Big data benchmarks: do we know enough about big data benchmarking?
- Big data workload characterization: what are the differences from traditional workloads?
- Exploring the best big data architectures: brawny-core, wimpy multi-core, or wimpy many-core?
Outline
- Benchmarking Methodology and Decision
- Big Data Workload Characterization
- Evaluating Hardware Systems with Big Data
- Conclusion
Methodology
[Figure: the 4V of big data and system and architecture characteristics are iteratively refined into BigDataBench.]
Methodology (Cont’d)
Diverse data sets:
- Data sources: text data, graph data, table data, extended ...
- Data types: structured, semi-structured, unstructured
- Big data sets preserving the 4V, generated by BDGS (big data generation tools)
Diverse workloads:
- Investigate typical application domains
- Application types: OLTP, Cloud OLTP, OLAP, offline analytics, extended ...
- Basic and important operations and algorithms
- Representative software stacks, extended ...
Together these yield the Big Data Workloads.
Top Sites on the Web
More details at http://www.alexa.com/topsites/global;0
Search engines, social networks, and e-commerce take 80% of the page views of all Internet services.
BigDataBench Summary
- 19 workloads: (Cloud) OLTP, OLAP, and offline analytics
- Application domains: search engine, social network, e-commerce
- Software stacks: MPI, Shark, Impala, NoSQL, ...
- Six real-world data sets: Google Web graph, e-commerce transactions, Wikipedia entries, Facebook social network, ProfSearch person resumes, and Amazon movie reviews
- BDGS (Big Data Generator Suite) for scalable data
Outline
- Benchmarking Methodology and Decision
- Big Data Workload Characterization
- Evaluating Hardware Systems with Big Data
- Conclusion
Big Data Workloads Analyzed
Input data sizes vary from 32 GB to 1 TB.
Other Benchmarks Compared
- HPCC: representative HPC benchmark suite, 7 benchmarks
- PARSEC: CMP (multi-threaded) benchmark suite, 12 benchmarks
- SPEC CPU: SPECFP and SPECINT
Metrics
User-perceivable metrics:
- OLTP services: requests per second (RPS)
- Cloud OLTP: operations per second (OPS)
- OLAP and offline analytics: data processed per second (DPS)
Micro-architecture characteristics: hardware performance counters.
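The user-perceivable metrics above are simple ratios of work done to elapsed time. The sketch below illustrates them in Python; all function names and the sample numbers are illustrative assumptions, not part of BigDataBench itself:

```python
# Hypothetical helpers illustrating the slide's user-perceivable metrics.

def rps(requests_served: int, seconds: float) -> float:
    """OLTP services: requests per second."""
    return requests_served / seconds

def ops(operations_done: int, seconds: float) -> float:
    """Cloud OLTP: operations per second."""
    return operations_done / seconds

def dps(input_bytes: int, seconds: float) -> float:
    """OLAP / offline analytics: data processed per second (bytes/s)."""
    return input_bytes / seconds

# Illustrative example: a 32 GB run finishing in 400 s.
print(dps(32 * 1024**3, 400) / 1024**2)  # throughput in MiB/s
```

Note that DPS is input-size based, so it stays comparable across workloads whose outputs differ wildly in size.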
Experimental Configurations
Testbed configuration:
- Fifteen nodes: 1 master + 14 slaves
- Data input size: 32 GB ~ 1 TB
- Each node: 2 * Xeon E5645, 16 GB memory, 8 TB disk
- Network: 1 Gb Ethernet

CPU: Intel Xeon E5645, 6 cores @ 2.40 GHz
L1D cache: 6*32KB; L1I cache: 6*32KB; L2 cache: 6*256KB; L3 cache: 12MB

Software configuration:
- OS: CentOS 5.5 with Linux kernel 2.6.34
- Stacks: Hadoop 1.0.2, HBase 0.94.5, Hive 0.9, MPICH2 1.5, Nutch 1.1, and RUBiS 5.0
Instruction Breakdown
Both data analytics and service workloads execute more integer instructions (and fewer floating point instructions); the average ratio of integer to floating point instructions is 75.
- FP instructions: X87 + SSE FP (X87, SSE_Pack_Float, SSE_Pack_Double, SSE_Scalar_Float, and SSE_Scalar_Double)
- Integer instructions: Total_Ins - FP_Ins - Branch_Ins - Store_Ins - Load_Ins
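The integer-instruction count is derived by subtraction from counter totals, as in the formula above. A minimal sketch of that arithmetic, with made-up counter values standing in for real hardware performance counter readings:

```python
# Sketch of the slide's instruction-breakdown arithmetic. The counter
# values below are illustrative, not measured BigDataBench numbers.

def integer_instructions(total, fp, branch, store, load):
    """Integer_Ins = Total_Ins - FP_Ins - Branch_Ins - Store_Ins - Load_Ins."""
    return total - fp - branch - store - load

total_ins  = 1_000_000
fp_ins     = 8_000      # X87 + SSE FP (packed and scalar, float and double)
branch_ins = 150_000
store_ins  = 120_000
load_ins   = 250_000

int_ins = integer_instructions(total_ins, fp_ins, branch_ins, store_ins, load_ins)
print(int_ins, int_ins / fp_ins)  # integer count and integer:FP ratio
```

In practice these totals would come from a counter tool such as `perf stat` over a full run of the workload.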
Floating Point Operation Intensity (E5310)
Defined as the total number of floating point instructions divided by the total number of memory access bytes in a run of a workload.
Very low floating point operation intensity: two orders of magnitude lower than in the traditional workloads.

CPU: Intel Xeon E5310, 4 cores @ 1.6 GHz
L1 cache (I/D): 4*32KB / 4*32KB; L2 cache: 2*4MB; L3 cache: none
Floating Point Operation Intensity
Floating point operation intensity on the E5645 is higher than on the E5310.
Integer Operation Intensity
- Integer operation intensity is of the same order as in the traditional workloads.
- Integer operation intensity on the E5645 is higher than on the E5310: the L3 cache is effective, and the bandwidth is improved.
Possible Reasons (Xeon E5645 vs. Xeon E5310)
Technique improvements of the Xeon E5645:
- More cores per processor: six cores in the Xeon E5645 vs. four cores in the Xeon E5310
- Deeper cache hierarchy (L1~L3 vs. L1~L2): the L3 cache is effective in decreasing memory access traffic for big data workloads
- Larger bandwidth: the Xeon E5645 adopts the Intel QuickPath Interconnect (QPI) to eliminate bottlenecks in the Front Side Bus [ASPLOS 2012]
- Hyper-threading technology: hyper-threading can improve performance by factors of 1.3~1.6 for scale-out workloads
Cache Behaviors
- Higher L1I cache misses than the traditional workloads
- Data analytics workloads have better L2 cache behaviors than service workloads, with the exception of BFS
- Good L3 cache behaviors
TLB Behaviors
Higher ITLB misses than the traditional workloads.
Computation Intensity (Integer Operations)
- X axis, integer operations per byte of memory access: (total number of integer instructions) / (total memory access bytes). Higher means more integer operations are executed between two memory accesses.
- Y axis, integer operations per byte received from networks: (total number of integer instructions) / (total bytes received from networks). Higher means more integer operations are executed on the same received bytes.
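Both axes are ratios of the same instruction count against different byte totals. A minimal sketch of the two intensities, using invented measurements (not actual BigDataBench data):

```python
# Sketch of the two computation-intensity axes from the slide.
# All numbers below are illustrative assumptions.

def intensity_per_mem_byte(int_instructions, mem_access_bytes):
    """X axis: integer instructions per memory-access byte."""
    return int_instructions / mem_access_bytes

def intensity_per_net_byte(int_instructions, net_recv_bytes):
    """Y axis: integer instructions per byte received from the network."""
    return int_instructions / net_recv_bytes

int_ins   = 4.0e11   # integer instructions in one run (illustrative)
mem_bytes = 2.0e11   # total memory-access bytes
net_bytes = 5.0e10   # total bytes received from the network

print(intensity_per_mem_byte(int_ins, mem_bytes))  # 2.0
print(intensity_per_net_byte(int_ins, net_bytes))  # 8.0
```

A workload high on one axis and low on the other is bound by a different resource, which is why the slide plots both ratios at once.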
Big Data Workload Characterization Summary
- Data-movement-dominated computing: low computation intensity
- Cache behaviors (Xeon E5645): very high L1I MPKI; the L3 cache is effective
- Diverse workload behaviors: computation/communication vs. computation/memory-access ratios
Outline
- Benchmarking Methodology and Decision
- Big Data Workload Characterization
- Evaluating Hardware Systems with Big Data
  (Y. Shi, S. A. McKee et al. Performance and Energy Efficiency Implications from Evaluating Four Big Data Systems. Submitted to IEEE Micro.)
- Conclusion
State-of-the-Art Big Data System Architectures
Big data system and architecture trends: brawny-core processors, wimpy multi-core processors, and wimpy many-core processors.
- Hardware designers: what are the best big data systems and architectures in terms of both performance and energy efficiency?
- Data center administrators: how to choose appropriate hardware for big data applications?
Evaluated Platforms
- Brawny-core: Xeon E5310, scaled up to Xeon E5645
- Wimpy multi-core: Atom D510, scaled out to the wimpy many-core TileGx36

Basic information:
Model               Xeon E5645   Xeon E5310   Atom D510   TileGx36
No. of processors   2            1            1           1
No. of cores/CPU    6            4            2           36
Frequency           2.4GHz       1.6GHz       1.66GHz     1.2GHz
L1 cache (I/D)      32KB/32KB    32KB/32KB    32KB/24KB   32KB/32KB
L2 cache            256KB*6      4096KB*2     512KB*2     256KB*36
L3 cache            12MB         none         none        none
TDP                 80W          80W          13W         45W

Architectural characteristics:
Model                    Xeon E5645   Xeon E5310   Atom D510   TileGx36
Pipeline depth           16           14           16          5
Superscalar width        4            4            2           3
Instruction set          x86          x86          x86         MIPS
Hyper-threading          Yes          No           Yes         No
Out-of-order execution   Yes          Yes          No          No
Dedicated FP unit        Yes          Yes          Yes         No
Chosen Workloads from BigDataBench
Offline analytics: Sort, Wordcount, Grep, Naive Bayes, K-means.
Realtime analytics: Select Query, Aggregation Query, Join Query.

Workload            Time complexity   Map operation                             Reduce operation   Reduce input / Map input
Sort                O(n*logn)         Quicksort                                 Merge sort         1
Wordcount           O(n)              String comparison & integer calculation   Combination        0.067
Grep                O(n)              String comparison & integer calculation   Combination        1.85e-6
Naive Bayes         O(m*n)            Statistics computation                    Merge              1.98e-5
K-means             O(m*n)            Distance computation                      Merge              2.64e-5
Select Query        O(n)              String comparison                         None               N/A
Aggregation Query   O(n)              String comparison & integer calculation   Combination        0.20
Join Query          —                 String comparison                         Cross product      0.19
Experimental Configurations
- Software stack: Hadoop 1.0.2
- Cluster configuration: Xeon- and Atom-based systems: 1 master + 4 slaves; Tilera system: 1 master + 2 slaves
- Data sizes: 500 MB, 2 GB, 8 GB, 32 GB, 64 GB, 128 GB
- Apples-to-apples comparison: deploy the systems with the same network and disk configurations; provide about 1 GB of memory per hardware thread/core; adjust the Hadoop parameters to optimize performance
Metrics
- Performance: data processed per second (DPS). DPS = data input size / running time.
- Energy efficiency: data processed per joule (DPJ). DPJ = data input size / energy consumption.
DPS and DPJ are reported per processor.
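The per-processor normalization makes platforms with different socket counts comparable. A minimal sketch of DPS and DPJ under the assumption that running time and total energy are measured externally; the sample numbers are illustrative only:

```python
# Sketch of the DPS / DPJ metrics, reported per processor as on the slide.

def dps(input_bytes: float, running_time_s: float) -> float:
    """Data processed per second (bytes/s)."""
    return input_bytes / running_time_s

def dpj(input_bytes: float, energy_joules: float) -> float:
    """Data processed per joule (bytes/J)."""
    return input_bytes / energy_joules

def per_processor(value: float, n_processors: int) -> float:
    """Normalize a metric by the number of processors in the system."""
    return value / n_processors

# Illustrative: 64 GB processed in 500 s using 100 kJ on a 2-socket node.
size = 64 * 1024**3
print(per_processor(dps(size, 500), 2))      # bytes/s per processor
print(per_processor(dpj(size, 100_000), 2))  # bytes/J per processor
```

Dividing by processor count rather than core count is a deliberate choice here: it compares sockets as the unit of procurement, which is how the platform comparison is framed.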
General Observations
[Figures: the average DPS comparison and the average DPJ comparison.]
- I/O-intensive workload (Sort): the many-core TileGx36 achieves the best performance and energy efficiency; the brawny-core processors do not provide performance advantages.
- CPU-intensive and floating-point-dominated workloads (Bayes & K-means): brawny-core processors show obvious performance advantages with energy efficiency close to that of wimpy-core processors.
- Other workloads: no platform consistently wins in terms of both performance and energy efficiency.
The averages are reported only for data sizes bigger than 8 GB (the systems are not fully utilized on smaller data sizes).
Improvements from Scaling Out the Wimpy Core (TileGx36 vs. Atom D510)
The core of the TileGx36 is wimpier than that of the Atom D510:
- Adopts a MIPS-derived VLIW instruction set
- Does not support hyper-threading
- Fewer pipeline stages
- No dedicated floating point units
But the TileGx36 integrates more cores on the NoC (Network on Chip): 36 cores vs. 4 cores in the Atom D510.
Improvements from Scaling Out the Wimpy Core (TileGx36 vs. Atom D510)
[Figures: the DPS comparison and the DPJ comparison.]
- I/O-intensive workload (Sort): the TileGx36 shows 4.1 times performance improvement and 1.01 times energy improvement (on average).
- CPU-intensive and floating-point-dominated workloads (Bayes & K-means): the TileGx36 shows 2.5 times performance advantage at 0.7 times the energy efficiency (on average).
- Other workloads: the TileGx36 shows 2.5 times performance improvement and 1.03 times energy improvement (on average).
Scaling out the wimpy core can bring a performance advantage by improving execution parallelism. Simplifying the wimpy cores and integrating more cores on the NoC is an option for big data workloads.
Scaling Up the Brawny Core (Xeon E5645) vs. Scaling Out the Wimpy Core (TileGx36)
[Figures: the DPS comparison and the DPJ comparison.]
- I/O-intensive workload (Sort): the TileGx36 shows 1.2 times performance improvement and 1.9 times energy improvement (on average).
- CPU-intensive and floating-point-dominated workloads (Bayes & K-means): the E5645 shows 4.2 times performance improvement and 2.0 times energy improvement (on average).
- Other workloads: the E5645 shows a performance advantage, but with no consistent energy improvement.
Hardware Evaluation Summary
- No one-size-fits-all solution: none of the microprocessors consistently wins in terms of both performance and energy efficiency across all of our big data workloads.
- One-size-fits-a-bunch solution: there are different classes of big data workloads, and each class realizes better performance and energy efficiency on a different architecture.
Outline
- Benchmarking Methodology and Decision
- Big Data Workload Characterization
- Evaluating Hardware Systems with Big Data
- Conclusion
Conclusion
- An open source big data benchmark suite with a data-centric benchmarking methodology: http://prof.ict.ac.cn/BigDataBench
- Big data workload characterization: data-movement-dominated computing with diverse behaviors; benchmarks must include diverse data and workloads
- Eschew one-size-fits-all solutions: tailor system designs to specific workload requirements
Thanks!