BigDataBench: a Big Data Benchmark Suite from Internet Services
Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Gang Lu, Kent Zhang, Xiaona Li, and Bizhu Qiu
HPCA 2014
Why Big Data Benchmarking?
Measuring big data systems and architectures quantitatively.
What is BigDataBench?
An open source big data benchmarking project: http://prof.ict.ac.cn/BigDataBench/
- 6 real-world data sets
- Generates (4V) big data
- 19 workloads: OLTP, Cloud OLTP, OLAP, and offline analytics
- Same workloads, different implementations
Executive Summary
- Big data benchmarks: do we know enough about big data benchmarking?
- Big data workload characterization: what are the differences from traditional workloads?
- Exploring the best big data architectures: brawny-core, wimpy multi-core, or wimpy many-core?
Outline
- Benchmarking Methodology and Decision
- Big Data Workload Characterization
- Evaluating Hardware Systems with Big Data
- Conclusion
Methodology
[Figure: the 4V of big data and system and architecture characteristics are iteratively refined into BigDataBench.]
Methodology (Cont’d)
Diverse data sets:
- Data sources: text data, graph data, table data, extended ...
- Data types: structured, semi-structured, unstructured
- Big data sets preserving the 4V, generated by BDGS (big data generation tools)
Diverse workloads:
- Investigate typical application domains
- Application types: OLTP, Cloud OLTP, OLAP, offline analytics, extended ...
- Basic and important operations and algorithms
- Representative software stacks, extended ...
Together these yield the Big Data Workloads.
Top Sites on the Web
More details at http://www.alexa.com/topsites/global;0
Search engines, social networks, and e-commerce take 80% of the page views of all Internet services.
BigDataBench Summary
- 19 workloads: (Cloud) OLTP, OLAP, and offline analytics
- Application domains: search engine, social network, e-commerce
- Software stacks: MPI, Shark, Impala, NoSQL, ...
- Six real-world data sets: Google Web graph, e-commerce transactions, Wikipedia entries, Facebook social network, ProfSearch person resumes, and Amazon movie reviews
- BDGS (Big Data Generator Suite) for scalable data
Outline
- Benchmarking Methodology and Decision
- Big Data Workload Characterization
- Evaluating Hardware Systems with Big Data
- Conclusion
Big Data Workloads Analyzed
Input data sizes vary from 32 GB to 1 TB.
Other Benchmarks Compared
- HPCC: representative HPC benchmark suite, 7 benchmarks
- PARSEC: CMP (multi-threaded) benchmark suite, 12 benchmarks
- SPEC CPU: SPECFP and SPECINT
Metrics
User-perceivable metrics:
- OLTP services: requests per second (RPS)
- Cloud OLTP: operations per second (OPS)
- OLAP and offline analytics: data processed per second (DPS)
Micro-architecture characteristics: hardware performance counters.
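The user-perceivable metrics above are simple ratios of work done to elapsed time. The sketch below illustrates them in Python; all function names and the sample numbers are illustrative assumptions, not part of BigDataBench itself:

```python
# Hypothetical helpers illustrating the slide's user-perceivable metrics.

def rps(requests_served: int, seconds: float) -> float:
    """OLTP services: requests per second."""
    return requests_served / seconds

def ops(operations_done: int, seconds: float) -> float:
    """Cloud OLTP: operations per second."""
    return operations_done / seconds

def dps(input_bytes: int, seconds: float) -> float:
    """OLAP / offline analytics: data processed per second (bytes/s)."""
    return input_bytes / seconds

# Illustrative example: a 32 GB run finishing in 400 s.
print(dps(32 * 1024**3, 400) / 1024**2)  # throughput in MiB/s
```

Note that DPS is input-size based, so it stays comparable across workloads whose outputs differ wildly in size.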
Experimental Configurations
Testbed configuration:
- Fifteen nodes: 1 master + 14 slaves
- Data input size: 32 GB ~ 1 TB
- Each node: 2 * Xeon E5645, 16 GB memory, 8 TB disk
- Network: 1 Gb Ethernet

CPU: Intel Xeon E5645, 6 cores @ 2.40 GHz
L1D cache: 6*32KB; L1I cache: 6*32KB; L2 cache: 6*256KB; L3 cache: 12MB

Software configuration:
- OS: CentOS 5.5 with Linux kernel 2.6.34
- Stacks: Hadoop 1.0.2, HBase 0.94.5, Hive 0.9, MPICH2 1.5, Nutch 1.1, and RUBiS 5.0
Instruction Breakdown
Both data analytics and service workloads execute more integer instructions (and fewer floating point instructions); the average ratio of integer to floating point instructions is 75.
- FP instructions: X87 + SSE FP (X87, SSE_Pack_Float, SSE_Pack_Double, SSE_Scalar_Float, and SSE_Scalar_Double)
- Integer instructions: Total_Ins - FP_Ins - Branch_Ins - Store_Ins - Load_Ins
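The integer-instruction count is derived by subtraction from counter totals, as in the formula above. A minimal sketch of that arithmetic, with made-up counter values standing in for real hardware performance counter readings:

```python
# Sketch of the slide's instruction-breakdown arithmetic. The counter
# values below are illustrative, not measured BigDataBench numbers.

def integer_instructions(total, fp, branch, store, load):
    """Integer_Ins = Total_Ins - FP_Ins - Branch_Ins - Store_Ins - Load_Ins."""
    return total - fp - branch - store - load

total_ins  = 1_000_000
fp_ins     = 8_000      # X87 + SSE FP (packed and scalar, float and double)
branch_ins = 150_000
store_ins  = 120_000
load_ins   = 250_000

int_ins = integer_instructions(total_ins, fp_ins, branch_ins, store_ins, load_ins)
print(int_ins, int_ins / fp_ins)  # integer count and integer:FP ratio
```

In practice these totals would come from a counter tool such as `perf stat` over a full run of the workload.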
Floating Point Operation Intensity (E5310)
Defined as the total number of floating point instructions divided by the total number of memory access bytes in a run of a workload.
Very low floating point operation intensity: two orders of magnitude lower than in the traditional workloads.

CPU: Intel Xeon E5310, 4 cores @ 1.6 GHz
L1 cache (I/D): 4*32KB / 4*32KB; L2 cache: 2*4MB; L3 cache: none
Floating Point Operation Intensity
Floating point operation intensity on the E5645 is higher than on the E5310.
Integer Operation Intensity
- Integer operation intensity is of the same order as in the traditional workloads.
- Integer operation intensity on the E5645 is higher than on the E5310: the L3 cache is effective, and the bandwidth is improved.
Possible Reasons (Xeon E5645 vs. Xeon E5310)
Technique improvements of the Xeon E5645:
- More cores per processor: six cores in the Xeon E5645 vs. four cores in the Xeon E5310
- Deeper cache hierarchy (L1~L3 vs. L1~L2): the L3 cache is effective in decreasing memory access traffic for big data workloads
- Larger bandwidth: the Xeon E5645 adopts the Intel QuickPath Interconnect (QPI) to eliminate bottlenecks in the Front Side Bus [ASPLOS 2012]
- Hyper-threading technology: hyper-threading can improve performance by factors of 1.3~1.6 for scale-out workloads
Cache Behaviors
- Higher L1I cache misses than the traditional workloads
- Data analytics workloads have better L2 cache behaviors than service workloads, with the exception of BFS
- Good L3 cache behaviors
TLB Behaviors
Higher ITLB misses than the traditional workloads.
Computation Intensity (Integer Operations)
- X axis, integer operations per byte of memory access: (total number of integer instructions) / (total memory access bytes). Higher means more integer operations are executed between two memory accesses.
- Y axis, integer operations per byte received from networks: (total number of integer instructions) / (total bytes received from networks). Higher means more integer operations are executed on the same received bytes.
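Both axes are ratios of the same instruction count against different byte totals. A minimal sketch of the two intensities, using invented measurements (not actual BigDataBench data):

```python
# Sketch of the two computation-intensity axes from the slide.
# All numbers below are illustrative assumptions.

def intensity_per_mem_byte(int_instructions, mem_access_bytes):
    """X axis: integer instructions per memory-access byte."""
    return int_instructions / mem_access_bytes

def intensity_per_net_byte(int_instructions, net_recv_bytes):
    """Y axis: integer instructions per byte received from the network."""
    return int_instructions / net_recv_bytes

int_ins   = 4.0e11   # integer instructions in one run (illustrative)
mem_bytes = 2.0e11   # total memory-access bytes
net_bytes = 5.0e10   # total bytes received from the network

print(intensity_per_mem_byte(int_ins, mem_bytes))  # 2.0
print(intensity_per_net_byte(int_ins, net_bytes))  # 8.0
```

A workload high on one axis and low on the other is bound by a different resource, which is why the slide plots both ratios at once.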
Big Data Workload Characterization Summary
- Data-movement-dominated computing: low computation intensity
- Cache behaviors (Xeon E5645): very high L1I MPKI; the L3 cache is effective
- Diverse workload behaviors: computation/communication vs. computation/memory-access ratios
Outline
- Benchmarking Methodology and Decision
- Big Data Workload Characterization
- Evaluating Hardware Systems with Big Data
  (Y. Shi, S. A. McKee et al. Performance and Energy Efficiency Implications from Evaluating Four Big Data Systems. Submitted to IEEE Micro.)
- Conclusion
State-of-the-Art Big Data System Architectures
Big data system and architecture trends: brawny-core processors, wimpy multi-core processors, and wimpy many-core processors.
- Hardware designers: what are the best big data systems and architectures in terms of both performance and energy efficiency?
- Data center administrators: how to choose appropriate hardware for big data applications?
Evaluated Platforms
- Brawny-core: Xeon E5310, scaled up to Xeon E5645
- Wimpy multi-core: Atom D510, scaled out to the wimpy many-core TileGx36

Basic information:
Model               Xeon E5645   Xeon E5310   Atom D510   TileGx36
No. of processors   2            1            1           1
No. of cores/CPU    6            4            2           36
Frequency           2.4GHz       1.6GHz       1.66GHz     1.2GHz
L1 cache (I/D)      32KB/32KB    32KB/32KB    32KB/24KB   32KB/32KB
L2 cache            256KB*6      4096KB*2     512KB*2     256KB*36
L3 cache            12MB         none         none        none
TDP                 80W          80W          13W         45W

Architectural characteristics:
Model                    Xeon E5645   Xeon E5310   Atom D510   TileGx36
Pipeline depth           16           14           16          5
Superscalar width        4            4            2           3
Instruction set          x86          x86          x86         MIPS
Hyper-threading          Yes          No           Yes         No
Out-of-order execution   Yes          Yes          No          No
Dedicated FP unit        Yes          Yes          Yes         No
Chosen Workloads from BigDataBench
Offline analytics: Sort, Wordcount, Grep, Naive Bayes, K-means.
Realtime analytics: Select Query, Aggregation Query, Join Query.

Workload            Time complexity   Map operation                             Reduce operation   Reduce input / Map input
Sort                O(n*logn)         Quicksort                                 Merge sort         1
Wordcount           O(n)              String comparison & integer calculation   Combination        0.067
Grep                O(n)              String comparison & integer calculation   Combination        1.85e-6
Naive Bayes         O(m*n)            Statistics computation                    Merge              1.98e-5
K-means             O(m*n)            Distance computation                      Merge              2.64e-5
Select Query        O(n)              String comparison                         None               N/A
Aggregation Query   O(n)              String comparison & integer calculation   Combination        0.20
Join Query          —                 String comparison                         Cross product      0.19
Experimental Configurations
- Software stack: Hadoop 1.0.2
- Cluster configuration: Xeon- and Atom-based systems: 1 master + 4 slaves; Tilera system: 1 master + 2 slaves
- Data sizes: 500 MB, 2 GB, 8 GB, 32 GB, 64 GB, 128 GB
- Apples-to-apples comparison: deploy the systems with the same network and disk configurations; provide about 1 GB of memory per hardware thread/core; adjust the Hadoop parameters to optimize performance
Metrics
- Performance: data processed per second (DPS). DPS = data input size / running time.
- Energy efficiency: data processed per joule (DPJ). DPJ = data input size / energy consumption.
DPS and DPJ are reported per processor.
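The per-processor normalization makes platforms with different socket counts comparable. A minimal sketch of DPS and DPJ under the assumption that running time and total energy are measured externally; the sample numbers are illustrative only:

```python
# Sketch of the DPS / DPJ metrics, reported per processor as on the slide.

def dps(input_bytes: float, running_time_s: float) -> float:
    """Data processed per second (bytes/s)."""
    return input_bytes / running_time_s

def dpj(input_bytes: float, energy_joules: float) -> float:
    """Data processed per joule (bytes/J)."""
    return input_bytes / energy_joules

def per_processor(value: float, n_processors: int) -> float:
    """Normalize a metric by the number of processors in the system."""
    return value / n_processors

# Illustrative: 64 GB processed in 500 s using 100 kJ on a 2-socket node.
size = 64 * 1024**3
print(per_processor(dps(size, 500), 2))      # bytes/s per processor
print(per_processor(dpj(size, 100_000), 2))  # bytes/J per processor
```

Dividing by processor count rather than core count is a deliberate choice here: it compares sockets as the unit of procurement, which is how the platform comparison is framed.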
General Observations
[Figures: the average DPS comparison and the average DPJ comparison.]
- I/O-intensive workload (Sort): the many-core TileGx36 achieves the best performance and energy efficiency; the brawny-core processors do not provide performance advantages.
- CPU-intensive and floating-point-dominated workloads (Bayes & K-means): brawny-core processors show obvious performance advantages with energy efficiency close to that of wimpy-core processors.
- Other workloads: no platform consistently wins in terms of both performance and energy efficiency.
The averages are reported only for data sizes bigger than 8 GB (the systems are not fully utilized on smaller data sizes).
Improvements from Scaling Out the Wimpy Core (TileGx36 vs. Atom D510)
The core of the TileGx36 is wimpier than that of the Atom D510:
- Adopts a MIPS-derived VLIW instruction set
- Does not support hyper-threading
- Fewer pipeline stages
- No dedicated floating point units
But the TileGx36 integrates more cores on the NoC (Network on Chip): 36 cores vs. 4 cores in the Atom D510.
Improvements from Scaling Out the Wimpy Core (TileGx36 vs. Atom D510)
[Figures: the DPS comparison and the DPJ comparison.]
- I/O-intensive workload (Sort): the TileGx36 shows 4.1 times performance improvement and 1.01 times energy improvement (on average).
- CPU-intensive and floating-point-dominated workloads (Bayes & K-means): the TileGx36 shows 2.5 times performance advantage at 0.7 times the energy efficiency (on average).
- Other workloads: the TileGx36 shows 2.5 times performance improvement and 1.03 times energy improvement (on average).
Scaling out the wimpy core can bring a performance advantage by improving execution parallelism. Simplifying the wimpy cores and integrating more cores on the NoC is an option for big data workloads.
Scaling Up the Brawny Core (Xeon E5645) vs. Scaling Out the Wimpy Core (TileGx36)
[Figures: the DPS comparison and the DPJ comparison.]
- I/O-intensive workload (Sort): the TileGx36 shows 1.2 times performance improvement and 1.9 times energy improvement (on average).
- CPU-intensive and floating-point-dominated workloads (Bayes & K-means): the E5645 shows 4.2 times performance improvement and 2.0 times energy improvement (on average).
- Other workloads: the E5645 shows a performance advantage, but with no consistent energy improvement.
Hardware Evaluation Summary
- No one-size-fits-all solution: none of the microprocessors consistently wins in terms of both performance and energy efficiency across all of our big data workloads.
- One-size-fits-a-bunch solution: there are different classes of big data workloads, and each class realizes better performance and energy efficiency on a different architecture.
Outline
- Benchmarking Methodology and Decision
- Big Data Workload Characterization
- Evaluating Hardware Systems with Big Data
- Conclusion
Conclusion
- An open source big data benchmark suite with a data-centric benchmarking methodology: http://prof.ict.ac.cn/BigDataBench
- Big data workload characterization: data-movement-dominated computing with diverse behaviors; benchmarks must include diverse data and workloads
- Eschew one-size-fits-all solutions: tailor system designs to specific workload requirements
Thanks!