D Leidel Yong Chen xiwangttuedu johnleidelttuedu yongchenttuedu Texas Tech University Concurrent Dynamic Memory Coalescing on GoblinCore64 Architecture Oct 3 2016 ID: 717075
Download Presentation The PPT/PDF document "Xi Wang , John" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.
Slide1
Xi
Wang, John, D. Leidel, Yong Chenxi.wang@ttu.edu, john.leidel@ttu.edu, yong.chen@ttu.eduTexas Tech University
Concurrent Dynamic Memory Coalescing on GoblinCore-64 Architecture
Oct. 3. 2016MEMSYS16Slide2
Overview
BackgroundDynamic Memory Coalescing
Concurrent Design
Evaluations
Conclusions Slide3
Background
:GC-64 GoblinCore-64 The GoblinCore-64 (GC-64), is the first RISC-V Extension for data intensive computing, such as graphs, sparse matrices, (et.al.) [1] [2].
We also utilize the Hybrid Memory Cube as the main memory[3] and discard the data cache.
3Hybrid Memory Cube Device [3]
MEMSYS 16Slide4
Background:GC-64
4
`
4
`
MEMSYS 16Slide5
Background
Cache structures provide good utilization for unit-stride memory access patterns.However, this convenience comes at a price: die area and power! To make it worse, cache structures provide little assistance for random memory
access patterns. Therefore, the GC-64 gets rid of cache.The
main purpose of Dynamic Memory Coalescing research is to reduce the memory
accesses
to the main memory, thereby providing equivalent performance on unit stride access or random access patterns.
5
MEMSYS 16Slide6
Dynamic Memory Coalescing:
Tree StructureSorting Binary Tree structures are defined to store the requests from processors. Each tree contains one root node that is nullAll nodes left of the top-level root are read operationsAll nodes right of the top-level root are write operations
Tree nodes are structures of: Task OperationTask IDAddress{TP:TID:ADDRESS}
Tree
Root
R
W
R
R
W
W
MEMSYS 16
6Slide7
Dynamic Memory Coalescing :
Tree LogicCoalescing Tree LogicMemory coalescing is conducted when the tree reaches the max read/write bytes (128), or exceeds
time limits.Firstly, the "First
Order Traversal" will be conducted. The most left child will be found as the base address.
Then the address and size of request data will
be
evaluated to check whether the
requests
are
consecutive
,
in the manner of
Left Child
=>
Parent
=>
Right Child
.
Afterwards
, coalescing tree will be expired and generate the HMC requests.
7
MEMSYS 16Slide8
Concurrent Design: Architecture
8 MEMSYS 16Slide9
Concurrent Design: Algorithms
9Following the design purpose of further coalescing the memory accesses, 2 algorithms are designed to increase the possiblilty of consecutive requests stored in the coalescing tree.Address Partitioned Algorithm (APA
): designed to coalesce adjacent memory accesses by partitioning the physical address.Work Partitioned Algorithm
(WPA): designed to coalesce same type of the memory accesses by partitioning the requests type
:
read/write, based on the APA.
MEMSYS 16Slide10
Concurrent Design: Sequential Example
10
6 HMC Requests
MEMSYS 16
S
uppos
e there are
8
requests are going to be
inserted
, as shown in the table 1.
OP
stands for the operation of the requests,
RD
represents read request,
WR
represents write
request
. Slide11
Concurrent Design: APA Example
11Suppose there are 16 threads.
The size of each memory partition is 0x1
0000000.Thread 0 => [0x00000000, 0x0FFFFFFF]Thread 1 => [0x10000000, 0x1FFFFFFF]
MEMSYS 16
4
HMC RequestsSlide12
Concurrent Design: WPA Example
12 Suppose there are 16 threads.
The size of each memory partition is 0x10000000.
MEMSYS 16
Thread
0,8
=> [0x00000000, 0x0FFFFFFF]
Thread
1,9
=> [0x10000000, 0x1FFFFFFF]
4
HMC RequestsSlide13
Evaluation: Test Cases
13A port of RISC-V Linux kernel built by UC Berkeley [6], is eventually
chosen as the environment to run the 5 test cases, including: HPCG,
SSCA#2, stream, scatter and gather benchmark [7][8][9].
The efficiency is
calculated
by the following equation:
Efficiency
[ 0 , 1 )
MEMSYS 16Slide14
Evaluation: APA & WAPA Evaluations
14MEMSYS 16APA Efficiency
WPA EfficiencySlide15
Evaluation: Requests Distribution
15Master Thesis Xi WangRequests Distribution of HPCG with APA on 2 & 8
threadsSlide16
Evaluation: Decreases of Total Request Cost
16APAWPA
Total Request Cost = Control (32 bytes) + DataFor instance, one
16 bytes HMC read request consists of:
-
R
equest packet = 16 bytes
(control
cost
)
-
R
esponse packet
= 32 bytes (16 control + 16 data)
Total Request Cost
=
32
bytes(
control
) + 16 bytes(
Data
) = 48 bytes
MEMSYS 16Slide17
Evaluation: Test Results
17As a conclusion to all the test results shown above:Across all the test cases, we find an average increase in efficiency of 55.17
% and 55.94% through APA and WPA approach, respectively. The average cost decreases across all APA tests
and WPA tests are 8.83% and 10.04%
,respectively
.
A
s such, the
WPA
approach is
also
considered to be more e
ffi
cient
in this research.
MEMSYS 16Slide18
Conclusions & Future Work
We construct the architecture and model for concurrent DMC design and implement two different parallel algorithms: APA, WPA;We also evaluate the concurrent DMC and prove the superiority of memory coalescing unit, in the perspective of efficacy.The future direction of the research will focus on the extension based on GC-64 and improve the concurrent DMC according to the HMC specification 2.1.
18MEMSYS 16Slide19
References
[1] GoblinCore-64, http://gc64.org/[2] riscv.org. Regents of the University of California. Retrieved August 25, 2014.[3] HMC Specification 2.0 from Micron, http://www.hybridmemorycube.org/[4] Memory Wall, https://en.wikipedia.org/wiki/Random-access_memory#Memory_wall[5] Yocto/OpenEmbedded RISC-V Port. http://riscv.org/software-tools/yocto/[6] Linux/RISC-V, Linux kernel port of RISC-V. http://riscv.org/software-tools/linux/[7] Dongarra J, Heroux M A, Luszczek P. A new metric for ranking high performance computing systems[J]. National Science Review, 2016: nwv084.[8] J. Kepner, D. P. Koester, and et al. HPCS SSCA#2 Graph Analysis Benchmark Specifications v1.0, April 2005.[9] McCalpin J D. A survey of memory bandwidth and machine balance in current high performance computers[J]. IEEE TCCA Newsletter, 1995: 19-25.
19 MEMSYS 16Slide20Slide21
21
Slide22
22
Slide23
Concurrent Design:
Components23Components of DMC UnitSpike is the modified RISC-V simulator which has been extended to support the memory tracing.Microcode is responsible for the requests
insertion and the generation the HMC requests.Library contains the data structure of the coalescing tree, requests from RISC-V cores and HMC requests etc.Master's Thesis
Xi WangSlide24
Concurrent Design:
Tree LogicCoalescing Tree logic-Reduce RequestsFor the case that the address of the requests in the tree are consecutive , we will reduce it into 1 HMC request regardless of the request type. As shown in the case a , b and c on the right
24
Case aCase cCase b
MEMSYS 16Slide25
Concurrent Design:
Tree LogicCoalescing Tree logic----Reduce RequestsFor the case that the requests data are not concecutive:If they are READ requests, we can still reduce it as one HMC request if the read bytes are smaller than the 128 bytes; (Max and Min read/write bytes are 128 and 16 respectively)25
Case dMEMSYS 16Slide26
Evaluation: Test Cases
26The first one is the High Performance Conjugate Gradient Benchmark, also known as HPCG, that performs symmetric Gauss-Seidel algorithms .The 2rd test case is a synthetic graph theory benchmark called
SSCA#2 which is developed by the High Productivity Computer Systems (HPCS).The 3rd test case is the stream benchmark, which is designed to measure sustainable memory bandwidth for contiguous, long-vector memory accesses.
The last 2 test cases are
the scatter and gather, which is a type of memory
a
ddressing that often arises when addressing vectors in sparse linear algebra operations. Slide27
Evaluation: Decreases of Total Request Cost
27APAWPA
Total Request Cost = Control (32 bytes) + DataFor instance, one
16 bytes HMC read request consists of:
-
R
equest packet = 16 bytes
(control
cost
)
-
R
esponse packet
= 32 bytes (16 control + 16 data)
Total Request Cost
=
32
bytes(
control
) + 16 bytes(
Data
) = 48 bytes
MEMSYS 16