Performance Analysis and Optimization of Full GC in Memory-hungry Environments

Yang Yu, Tianyang Lei, Weihua Zhang, Haibo Chen, Binyu Zang
Institute of Parallel and Distributed Systems (IPADS)
Shanghai Jiao Tong University, China
Fudan University, China

VEE 2016
Big-data Ecosystem
JVM-based languages
Memory-hungry environments

- Memory bloat phenomenon in large-scale Java applications [ISMM '13]
- Limited per-application memory in the shared-cluster design inside companies like Google [EuroSys '13]
- Limited per-core memory on many-core architectures (e.g., Intel Xeon Phi)
Effects of Garbage Collection

- GC suffers severe strain
- Accumulated stragglers [HOTOS '15]
- Amplified tail latency [Commun. ACM]

Where exactly is the bottleneck of GC in such memory-hungry environments?
Parallel Scavenge in a Production JVM – HotSpot

- Default garbage collector in OpenJDK 7 & 8
- Stop-the-world, throughput-oriented
- Heap space segregated into multiple areas: young generation, old generation, permanent generation
- Young GC collects the young generation
- Full GC collects the whole heap, mainly the old generation
Profiling of PS GC

- GC profiling of data-intensive Java programs from JOlden
- Heap size set close to the workload size to keep memory hungry
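Such a memory-hungry setup can be reproduced with standard HotSpot options; an illustrative invocation (not the paper's exact harness) is:

    java -XX:+UseParallelGC -Xms1g -Xmx1g -verbose:gc -jar workload.jar

Pinning -Xms to -Xmx fixes the heap at the chosen size, so full GCs fire as the live data approaches it.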
Full GC of Parallel Scavenge

- A variant of the Mark-Compact algorithm
- Slides live objects towards the starting side of the heap
- Two bitmaps mapping the heap
- Heap initially segregated into multiple regions
- Three phases: marking, summary & compacting

(Figure: the heap and its two bitmaps)
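The C++ sketches below model these structures to make the later slides concrete. They are simplified assumptions, not HotSpot's actual code: one mark bit per heap word (HotSpot records object begin and end bits across two bitmaps) and a per-region summary entry produced by the summary phase.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // One mark bit per heap word (simplified).
    struct MarkBitmap {
        std::vector<uint64_t> bits;
        explicit MarkBitmap(size_t heap_words) : bits((heap_words + 63) / 64, 0) {}
        void mark(size_t w)            { bits[w / 64] |= uint64_t(1) << (w % 64); }
        bool is_marked(size_t w) const { return (bits[w / 64] >> (w % 64)) & 1; }
        // Live words in [beg, end): a linear scan over the bitmap.
        size_t live_words_in_range(size_t beg, size_t end) const {
            size_t n = 0;
            for (size_t w = beg; w < end; ++w) n += is_marked(w);
            return n;
        }
    };

    // Per-region summary data: where the region's live data will move.
    struct RegionSummary {
        size_t destination;  // first word of the region's new location
    };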
Decomposition of Full GC

(Figure: time breakdown of the full GC phases)
Update Refs Using Bitmaps

Updating process for a referenced live object O:

(Figure: the source space holds objects B, O, A with their bits set in the bitmaps; O's new location N in the destination space must be computed)
Reference Updating Algorithm

Calculate the new location that a reference points to.
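A sketch of the query this slide animates, building on the structures above (illustrative only; HotSpot's parallel-compact code is more involved): the new address is the region's destination plus the live words preceding the object inside its region, counted by scanning the mark bitmap from the region start.

    const size_t kRegionWords = 512;  // illustrative region size

    // New location for the object at word index old_addr.
    size_t calc_new_pointer(size_t old_addr, const MarkBitmap& bm,
                            const std::vector<RegionSummary>& regions) {
        size_t region_idx   = old_addr / kRegionWords;
        size_t region_start = region_idx * kRegionWords;
        // The costly step: every query rescans the bitmap from the
        // region start up to the object.
        size_t live_before  = bm.live_words_in_range(region_start, old_addr);
        return regions[region_idx].destination + live_before;
    }

Because live_words_in_range always restarts from the region start, updating many references into the same region rescans the same bits over and over.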
Decomposition of Full GC (cont.)

We found the bottleneck!!!
Solution: Incremental Query

- Key issue: repeated searching ranges when two sequentially searched objects reside in the same region
- Basic idea: reuse the result of the last query

(Figure: the last query in region R covered [last_beg_addr, last_end_addr]; when the current query's beg_addr falls in the same region, the scan resumes from last_end_addr instead of restarting from the region start)
Caching Types

(Figure: SPECjbb2015, 1 GB workload, 10 GB heap)
Query Patterns

- Local pattern: sequentially referenced objects tend to lie in the same region, so the results of previous queries can easily be reused
- Random pattern: sequentially referenced objects lie in random regions, so previous results cannot be reused directly
- Most applications mix the two patterns, differing only in their proportions
Optimistic IQ (1/3)

- A straightforward implementation that follows the basic idea
- Each GC thread maintains one global last-query result shared across all regions
- Pros: little overhead in both memory utilization and calculation
- Cons: relies heavily on the local pattern to be effective
Sort-based IQ (2/3)

- Dynamically reorders refs with a lazy update
- References are first filled into a buffer before updating
- Once the buffer fills up, refs are reordered by region index
- Buffer size close to the L1 cache line size
- Pros: periodically gathers refs from the same region
- Cons: calculation overhead from the extra sorting step
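A sketch of this scheme under the same assumed structures (the buffering details are illustrative): reference slots are buffered, sorted, and then updated, so consecutive queries land in the same region and keep the incremental cache warm.

    #include <algorithm>

    // Each buffered entry points at a slot holding an old address.
    void flush_and_update(std::vector<size_t*>& buf, const MarkBitmap& bm,
                          const std::vector<RegionSummary>& regions,
                          LastQuery& lq) {
        // Sorting by old address groups refs by region and orders them
        // forward within a region, so the incremental query rarely misses.
        std::sort(buf.begin(), buf.end(),
                  [](size_t* a, size_t* b) { return *a < *b; });
        for (size_t* slot : buf) {
            size_t old_addr   = *slot;
            size_t region_idx = old_addr / kRegionWords;
            *slot = regions[region_idx].destination
                  + live_words_before(old_addr, bm, lq);
        }
        buf.clear();
    }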
Region-based IQ (3/3)

- Maintains the result of the last query for each region, per GC thread
- Fits both local and random query patterns
- A slicing scheme (more aggressive): divide each region into multiple slices, maintaining the last result per slice
- Minimized memory overhead:
  - 16-bit integer to store the calculated size of live objects
  - Offset instead of a full-length address for the last queried object
  - Reduced to 0.09% of the heap size with one slice per GC thread
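A sketch of the per-region cache with the compact encoding the slide describes (the 16-bit fields follow the slide; the rest is an assumed layout). With one entry per region there is no cross-region interference, which is why this scheme also handles the random pattern.

    // One entry per region (or per slice), per GC thread.
    struct RegionCache {
        uint16_t last_offset = 0;  // in-region offset of the last queried word
        uint16_t live_count  = 0;  // live words before that offset
    };

    size_t live_words_before_cached(size_t old_addr, const MarkBitmap& bm,
                                    std::vector<RegionCache>& cache) {
        size_t region_idx   = old_addr / kRegionWords;
        size_t region_start = region_idx * kRegionWords;
        size_t offset       = old_addr - region_start;
        RegionCache& rc     = cache[region_idx];
        if (offset >= rc.last_offset) {
            // Forward query: extend the cached count from where it stopped.
            rc.live_count += (uint16_t)bm.live_words_in_range(
                region_start + rc.last_offset, old_addr);
        } else {
            // Backward query: recount from the region start.
            rc.live_count = (uint16_t)bm.live_words_in_range(region_start,
                                                             old_addr);
        }
        rc.last_offset = (uint16_t)offset;
        return rc.live_count;
    }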
Experimental environments

Parameter              | Intel Xeon CPU E5-2620                                    | Intel Xeon Phi Coprocessor 5110P
Chips                  | 1                                                         | 1
Core type              | Out-of-order                                              | In-order
Physical cores         | 6                                                         | 60
Frequency              | 2.00 GHz                                                  | 1052.63 MHz
Data caches            | 32 KB L1d, 32 KB L1i, 256 KB L2 per core; 15 MB L3 shared | 32 KB L1, 512 KB L2 per core
Memory capacity        | 32 GB                                                     | 7697 MB
Memory technology      | DDR3                                                      | GDDR5
Memory access latency  | 140 cycles                                                | 340 cycles
Experimental environments (cont.)

- OpenJDK 7u + HotSpot JVM
- Benchmarks: JOlden + GCBench + DaCapo + SPECjvm2008 + Spark + Giraph
- (X.v & C.c refer to Xml.validation & Compiler.compiler)
Speedup of Full GC Throughput on CPU

(Figure: comparison of the 3 query schemes and OpenJDK 8 with 1 & 6 GC threads; peak speedups of 1.99x and 1.94x)
Improvement of Application Throughput on CPU

(Figure: with 6 GC threads using region-based IQ; improvement of up to 19.3%)
Speedup on Xeon Phi

(Figure: speedup of full GC & application throughput with 1 & 20 GC threads using region-based IQ; full GC speedups of 2.22x and 2.08x, application throughput improved by 11.1%)
Reduction in Pause Time

(Figure: normalized elapsed time of full GC & total pause, lower is better; reductions of 31.2% and 34.9%)
Speedup for Big-data on CPU

(Figure: speedup of full GC & application throughput using region-based IQ with varying input and heap sizes)
Conclusions

- A thorough profiling-based analysis of Parallel Scavenge in a production JVM – HotSpot
- An incremental query model and three different schemes
- Integrated into the OpenJDK mainstream (JDK-8146987)
Thanks

Questions?
Backups
Port of Region-based IQ to OpenJDK 8

(Figure: speedup of full GC throughput with region-based IQ on JDK 8)
Evaluation on Clusters

- Orthogonal to distributed execution
- A small-scale evaluation on a 5-node cluster, each node with two 10-core Intel Xeon E5-2650 v3 processors and 64 GB DRAM
- Ran Spark PageRank with a 100-million-edge input and a 10 GB heap on each node
- Recorded the accumulated full GC time across all nodes and the elapsed application time on the master
- 63.8% and 7.3% improvement in full GC and application throughput, respectively
- Smaller speedup because network communication becomes a more dominant factor during distributed execution