
Performance Analysis and Optimization of Full GC in Memory-hungry Environments

Yang Yu, Tianyang Lei, Weihua Zhang, Haibo Chen, Binyu Zang
Institute of Parallel and Distributed Systems (IPADS)
Shanghai Jiao Tong University, China
Fudan University, China

VEE 2016

Big-data Ecosystem

JVM-based languages

Memory-hungry environments

- Memory bloat phenomenon in large-scale Java applications [ISMM ’13]
- Limited per-application memory in a shared-cluster design inside companies like Google [EuroSys ’13]
- Limited per-core memory in many-core architectures (e.g., Intel Xeon Phi)

Effects of Garbage Collection

- GC suffers severe strain
  - Accumulated stragglers [HotOS ’15]
  - Amplified tail latency [Commun. ACM]
- Where exactly is the bottleneck of GC in such memory-hungry environments?

Parallel Scavenge in a Production JVM – HotSpot

- Default garbage collector in OpenJDK 7 & 8
- Stop-the-world, throughput-oriented
- Heap space segregated into multiple areas
  - Young generation
  - Old generation
  - Permanent generation
- Young GC to collect young gen
- Full GC to collect all, mainly for old gen

Profiling of PS GC

- GC profiling of data-intensive Java programs from JOlden
- Heap size set close to the workload size to keep the programs memory-hungry

Full GC of Parallel Scavenge

- A variant of the Mark-Compact algorithm
  - Slides live objects towards the starting side
  - Two bitmaps mapping the heap
- Heap initially segregated into multiple regions
- Three phases: marking, summary & compacting

[Figure: the heap and its bitmaps]
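The three phases listed for the full GC can be pictured with a toy model. The Python sketch below is not HotSpot code: the region size, the `Obj` record and all function names are assumptions made for illustration, reachability is taken as given instead of being traced, and objects are assumed not to straddle region boundaries.

```python
from dataclasses import dataclass

REGION_WORDS = 8  # assumed toy region size, in heap words


@dataclass
class Obj:
    addr: int   # index of the object's first heap word
    size: int   # object size in words
    live: bool  # reachability, taken as given here


def mark(objects, heap_words):
    """Marking: record each live object's first and last word in two bitmaps."""
    beg = [False] * heap_words
    end = [False] * heap_words
    for o in objects:
        if o.live:
            beg[o.addr] = True
            end[o.addr + o.size - 1] = True
    return beg, end


def summarize(beg, end):
    """Summary: a region's destination is the live-word total of all earlier regions."""
    dest, total, inside = [], 0, False
    for w in range(len(beg)):
        if w % REGION_WORDS == 0:
            dest.append(total)
        inside = inside or beg[w]
        total += inside
        if end[w]:
            inside = False
    return dest, total


def compact(objects, dest):
    """Compacting: slide live objects towards the start, keeping address order."""
    moved, used_in_region = {}, {}
    for o in sorted(objects, key=lambda o: o.addr):
        if not o.live:
            continue
        r = o.addr // REGION_WORDS
        moved[o.addr] = dest[r] + used_in_region.get(r, 0)
        used_in_region[r] = used_in_region.get(r, 0) + o.size
    return moved


objects = [Obj(0, 3, True), Obj(3, 2, False), Obj(5, 3, True),
           Obj(8, 4, False), Obj(12, 4, True), Obj(16, 2, True)]
beg, end = mark(objects, heap_words=24)
dest, live_words = summarize(beg, end)
new_addr = compact(objects, dest)  # maps old address -> new address
```

In this toy heap the four live objects end up packed into words 0 to 11, in their original order, which is the "sliding" behaviour the slide describes.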

Decomposition of Full GC

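Update Refs Using Bitmaps

Updating process for a referenced live object O

[Figure: bitmaps over the source and destination spaces; objects B, O and A are labelled in both, and the new location N in the destination is marked "?"]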

Reference Updating Algorithm

Calculate the new location that the reference points to
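The transcript does not preserve the formula shown on these slides. As a hedged illustration, the Python sketch below assumes the new location is the destination of the object's region plus the number of live words that precede the object inside that region, counted by scanning the begin/end bitmaps. The region size, the function names and the sample heap are all assumptions, and objects are assumed not to straddle regions.

```python
REGION_WORDS = 8  # assumed toy region size, in heap words


def build_bitmaps(live_objects, heap_words):
    """Set a begin bit on the first word and an end bit on the last word of each live object."""
    beg = [False] * heap_words
    end = [False] * heap_words
    for addr, size in live_objects:
        beg[addr] = True
        end[addr + size - 1] = True
    return beg, end


def live_words_in_range(beg, end, lo, hi):
    """Count live words in [lo, hi); `lo` must not fall inside an object."""
    count, inside = 0, False
    for w in range(lo, hi):
        inside = inside or beg[w]
        count += inside
        if end[w]:
            inside = False
    return count


def calc_new_location(addr, beg, end, region_dest):
    """New address of the object starting at `addr` (assumed formula, see lead-in)."""
    region = addr // REGION_WORDS
    region_start = region * REGION_WORDS
    return region_dest[region] + live_words_in_range(beg, end, region_start, addr)


live_objects = [(0, 3), (5, 3), (12, 4), (16, 2)]  # (address, size in words)
beg, end = build_bitmaps(live_objects, heap_words=24)
region_dest = [0, 6, 10]  # what a summary phase would yield for this heap
refs = [5, 12, 16, 5]     # old addresses held by four references
updated = [calc_new_location(r, beg, end, region_dest) for r in refs]
```

Note that the reference to address 5 appears twice and triggers the same bitmap scan twice. Repeated scanning of this kind is the cost that grows when many references point into the same region.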

Decomposition of Full GC (cont.)

We found the bottleneck!!!

Solution: Incremental Query

- Key issue: repeated searching range when two sequentially searched objects reside in the same region
- Basic idea: reuse the result of the last query

[Figure: the last query in region R spans last_beg_addr to last_end_addr (the last searching range). If the current query's beg_addr matches, both queries are in the same region. Three candidate positions of the current end_addr are shown relative to last_end_addr, together with the quantity (last_end_addr – beg_addr) / 2. Labels Q, N and M also appear.]
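A minimal Python sketch of the reuse idea follows. It assumes a query returns the number of live words in [beg_addr, end_addr), that beg_addr is a region start, and that end_addr is always the start of an object. The three-way case split around the halfway point is one reading of the figure labels, not a transcription of the authors' code, and every name is hypothetical.

```python
class IncrementalQuery:
    """Caches the last query (beg_addr, end_addr, result) and reuses it when possible."""

    def __init__(self, beg_bits, end_bits):
        self.beg_bits, self.end_bits = beg_bits, end_bits
        self.last_beg_addr = None
        self.last_end_addr = None
        self.last_result = 0
        self.words_scanned = 0  # bookkeeping for the demo only

    def _scan(self, lo, hi):
        """Count live words in [lo, hi) by walking the bitmaps."""
        self.words_scanned += hi - lo
        count, inside = 0, False
        for w in range(lo, hi):
            inside = inside or self.beg_bits[w]
            count += inside
            if self.end_bits[w]:
                inside = False
        return count

    def live_words(self, beg_addr, end_addr):
        if beg_addr != self.last_beg_addr:
            # Different region: nothing to reuse, search the whole range.
            result = self._scan(beg_addr, end_addr)
        elif end_addr >= self.last_end_addr:
            # Same region, further right: only search the extra tail.
            result = self.last_result + self._scan(self.last_end_addr, end_addr)
        elif end_addr - beg_addr > self.last_end_addr - end_addr:
            # Past the halfway point of the last range: search backwards and subtract.
            result = self.last_result - self._scan(end_addr, self.last_end_addr)
        else:
            # Closer to the region start: a fresh search is shorter.
            result = self._scan(beg_addr, end_addr)
        self.last_beg_addr, self.last_end_addr = beg_addr, end_addr
        self.last_result = result
        return result


heap_words = 16
beg_bits = [False] * heap_words
end_bits = [False] * heap_words
for addr, size in [(0, 2), (4, 3), (9, 2), (13, 2)]:
    beg_bits[addr] = True
    end_bits[addr + size - 1] = True

iq = IncrementalQuery(beg_bits, end_bits)
results = [iq.live_words(0, end_addr) for end_addr in (13, 9, 4, 13)]
```

For these four queries the sketch scans 30 bitmap words, whereas answering each query from scratch would scan 13 + 9 + 4 + 13 = 39.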

Caching Types

SPECjbb2015, 1GB workload, 10GB heap

Query Patterns

- Local pattern
  - Sequentially referenced objects tend to lie in the same region
  - Results of the last queries can thus easily be reused
- Random pattern
  - Sequentially referenced objects always lie in random regions
  - The last results cannot be reused directly
- Most applications mix the two query patterns, in differing proportions

Optimistic IQ (1/3)

- A straightforward implementation
  - Complies with the basic idea
  - Each GC thread maintains one global result of the last query for all regions
- Pros & cons
  - Pros: little overhead in both memory utilization and calculation
  - Cons: relies heavily on the local pattern to be effective

Sort-based IQ (2/3)

- Dynamically reorder refs with a lazy update
  - References are first filled into a buffer before updating
  - Once the buffer is full, refs are reordered by region index
  - Buffer size close to L1 cache line size
- Pros & cons
  - Pros: gathers refs in the same region periodically
  - Cons: calculation overhead due to the extra sorting procedure
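The buffering step can be sketched as follows. This is only an illustration under assumptions: the region size, the buffer capacity of 8 entries, and the names `SortingRefBuffer` and `region_switches` are invented here, and each buffered entry is simply the old address a reference points to.

```python
REGION_WORDS = 16     # assumed toy region size, in heap words
BUFFER_CAPACITY = 8   # assumed; the slide only says the buffer is small


class SortingRefBuffer:
    """Collects references and updates them lazily, grouped by region."""

    def __init__(self, update_ref, capacity=BUFFER_CAPACITY):
        self.update_ref = update_ref  # callback that performs the real update
        self.capacity = capacity
        self.pending = []

    def add(self, target_addr):
        self.pending.append(target_addr)
        if len(self.pending) >= self.capacity:
            self.flush()

    def flush(self):
        # Stable sort by region index so refs into the same region become adjacent.
        self.pending.sort(key=lambda addr: addr // REGION_WORDS)
        for addr in self.pending:
            self.update_ref(addr)
        self.pending.clear()


def region_switches(addrs):
    """Number of times consecutive addresses fall in different regions."""
    regions = [a // REGION_WORDS for a in addrs]
    return sum(1 for prev, cur in zip(regions, regions[1:]) if prev != cur)


incoming = [3, 40, 5, 35, 20, 2, 44, 18]
processed = []
buf = SortingRefBuffer(processed.append)
for addr in incoming:
    buf.add(addr)
buf.flush()  # drain whatever is left at the end
```

In this example the incoming order changes region 7 times, while the processed order changes region only twice, so a cache of the last query is hit far more often. The price is the sort itself, which is the overhead named under "Cons".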

Region-based IQ (3/3)

- Maintain the result of the last query for each region, per GC thread
  - Fits both local and random query patterns
- A slicing scheme: divide each region into multiple slices, maintaining the last result for each slice
  - More aggressive
- Minimize memory overhead
  - 16-bit integer to store the calculated size of live objects
  - Offset instead of full-length address for the last queried object
  - Reduced to 0.09% of heap size with one slice per GC thread
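A per-region cache along these lines might look like the sketch below. It is an assumption-laden toy, not the authors' implementation: it keeps one cache entry per region (no slicing), stores both the offset and the live-word count in 16-bit unsigned arrays, assumes objects do not straddle regions, and uses made-up names throughout.

```python
from array import array

REGION_WORDS = 16  # assumed toy region size, in heap words


class RegionCachedQuery:
    """Keeps the last query result per region, as two 16-bit values each."""

    def __init__(self, beg_bits, end_bits):
        self.beg_bits, self.end_bits = beg_bits, end_bits
        num_regions = len(beg_bits) // REGION_WORDS
        # Offset of the last queried object within its region, and the live
        # words found before it. (0, 0) is a valid initial entry.
        self.last_offset = array("H", [0] * num_regions)
        self.last_live = array("H", [0] * num_regions)
        self.words_scanned = 0  # bookkeeping for the demo only

    def _scan(self, lo, hi):
        self.words_scanned += hi - lo
        count, inside = 0, False
        for w in range(lo, hi):
            inside = inside or self.beg_bits[w]
            count += inside
            if self.end_bits[w]:
                inside = False
        return count

    def live_words_before(self, addr):
        """Live words between the start of addr's region and addr."""
        region, offset = divmod(addr, REGION_WORDS)
        base = region * REGION_WORDS
        last_offset = self.last_offset[region]
        last_live = self.last_live[region]
        if offset >= last_offset:
            live = last_live + self._scan(base + last_offset, addr)
        elif offset > last_offset - offset:
            live = last_live - self._scan(addr, base + last_offset)
        else:
            live = self._scan(base, addr)
        self.last_offset[region] = offset
        self.last_live[region] = live
        return live


heap_words = 32
beg_bits = [False] * heap_words
end_bits = [False] * heap_words
for addr, size in [(0, 2), (4, 3), (9, 2), (13, 2), (16, 4), (22, 2), (26, 3)]:
    beg_bits[addr] = True
    end_bits[addr + size - 1] = True

rq = RegionCachedQuery(beg_bits, end_bits)
answers = [rq.live_words_before(a) for a in (9, 26, 13, 22)]
```

The queries alternate between two regions, which would defeat a single shared cache entry, yet the third and fourth queries still reuse the earlier result for their own region.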

Experimental environments

Parameter             | Intel(R) Xeon(R) CPU E5-2620                               | Intel Xeon Phi(TM) Coprocessor 5110P
Chips                 | 1                                                          | 1
Core type             | Out-of-order                                               | In-order
Physical cores        | 6                                                          | 60
Frequency             | 2.00 GHz                                                   | 1052.63 MHz
Data caches           | 32 KB L1d, 32 KB L1i, 256 KB L2 per core; 15 MB L3 shared  | 32 KB L1, 512 KB L2 per core
Memory capacity       | 32 GB                                                      | 7697 MB
Memory technology     | DDR3                                                       | GDDR5
Memory access latency | 140 cycles                                                 | 340 cycles

Experimental environments (cont.)

- OpenJDK 7u + HotSpot JVM
- JOlden + GCBench + DaCapo + SPECjvm2008 + Spark + Giraph
  (X.v & C.c refer to Xml.validation & Compiler.compiler)

Speedup of Full GC Thru. on CPU

Comparison of 3 query schemes and OpenJDK 8 with 1 & 6 GC threads

[Figure labels: 1.99x, 1.94x]

Improvement of App. Thru. on CPU

With 6 GC threads using region-based IQ

[Figure label: 19.3%]

Speedup on Xeon Phi

Speedup of full GC & app. thru. with 1 & 20 GC threads using region-based IQ

[Figure labels: 2.22x, 2.08x, 11.1%]

Reduction in Pause Time

Normalized elapsed time of full GC & total pause. Lower is better.

[Figure labels: 31.2%, 34.9%]

Speedup for Big-data on CPU

Speedup of full GC & app. thru. using region-based IQ with varying input and heap sizes

Conclusions

- Integrated into OpenJDK mainstream: JDK-8146987
- A thorough profiling-based analysis of Parallel Scavenge in a production JVM – HotSpot
- An incremental query model and three different schemes

Thanks

Questions

Backups

Port of Region-based IQ to OpenJDK 8

Speedup of full GC thru. of region-based IQ on JDK 8

Evaluation on Clusters

- Orthogonal to distributed execution
- A small-scale evaluation on a 5-node cluster, each node with two 10-core Intel Xeon E5-2650 v3 processors and 64GB DRAM
- Run Spark PageRank with 100 million edges as input and a 10GB heap on each node
- Record accumulated full GC time for all nodes and elapsed application time on the master
- 63.8% and 7.3% improvement in full GC and application throughput, respectively
- Smaller speedup because network communication becomes a more dominant factor during distributed execution