
Xi Wang, John D. Leidel, Yong Chen (xi.wang@ttu.edu, john.leidel@ttu.edu, yong.chen@ttu.edu), Texas Tech University. Concurrent Dynamic Memory Coalescing on GoblinCore-64 Architecture. Oct. 3, 2016.

Presentation Transcript

Slide1

Concurrent Dynamic Memory Coalescing on GoblinCore-64 Architecture

Xi Wang, John D. Leidel, Yong Chen
xi.wang@ttu.edu, john.leidel@ttu.edu, yong.chen@ttu.edu
Texas Tech University

Oct. 3, 2016
MEMSYS 16

Slide2

Overview

Background
Dynamic Memory Coalescing
Concurrent Design
Evaluations
Conclusions

Slide3

Background: GC-64

GoblinCore-64 (GC-64) is the first RISC-V extension for data-intensive computing, such as graphs and sparse matrices [1] [2].

We also use the Hybrid Memory Cube (HMC) as the main memory [3] and discard the data cache.

[Figure: Hybrid Memory Cube device [3]]

Slide4

Background: GC-64

[Figure omitted]

Slide5

Background

Cache structures provide good utilization for unit-stride memory access patterns. However, this convenience comes at a price: die area and power. Worse, cache structures provide little assistance for random memory access patterns. Therefore, GC-64 omits the data cache.

The main purpose of Dynamic Memory Coalescing (DMC) research is to reduce the number of accesses to main memory, thereby providing equivalent performance for unit-stride and random access patterns.

Slide6

Dynamic Memory Coalescing: Tree Structure

Sorting binary tree structures are defined to store the requests from processors.
Each tree contains one root node that is null.
All nodes left of the top-level root are read operations.
All nodes right of the top-level root are write operations.

Tree nodes are structures of: Task Operation, Task ID, Address {TP:TID:ADDRESS}

[Figure: tree with a null root, read (R) nodes in the left subtree and write (W) nodes in the right subtree]
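The tree layout can be sketched in Python as follows. This is only an illustration: the names `Node`, `CoalescingTree`, `bst_insert`, and `in_order` are hypothetical, and ordering each subtree by address is an assumption, since the slide says only that the tree is sorted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    op: str                         # task operation (TP): "RD" or "WR"
    tid: int                        # task ID (TID)
    addr: int                       # request address (ADDRESS)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def bst_insert(root, node):
    """Insert a node into a subtree ordered by address (assumed sort key)."""
    if root is None:
        return node
    if node.addr < root.addr:
        root.left = bst_insert(root.left, node)
    else:
        root.right = bst_insert(root.right, node)
    return root


class CoalescingTree:
    """Null top-level root: reads hang off the left, writes off the right."""

    def __init__(self):
        self.reads = None    # left subtree of the null root
        self.writes = None   # right subtree of the null root

    def insert(self, op, tid, addr):
        node = Node(op, tid, addr)
        if op == "RD":
            self.reads = bst_insert(self.reads, node)
        elif op == "WR":
            self.writes = bst_insert(self.writes, node)
        else:
            raise ValueError(f"unknown operation: {op}")


def in_order(root):
    """Yield nodes in Left Child => Parent => Right Child order."""
    if root is not None:
        yield from in_order(root.left)
        yield root
        yield from in_order(root.right)
```

Keeping reads and writes in separate subtrees means a traversal of either side visits requests of one type in ascending address order.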

Slide7

Dynamic Memory Coalescing: Tree Logic

Coalescing tree logic

Memory coalescing is conducted when the tree reaches the maximum read/write size (128 bytes) or exceeds the time limit.

First, the "First Order Traversal" is conducted, and the leftmost child is taken as the base address.

Then the address and size of the requested data are evaluated to check whether the requests are consecutive, in the order Left Child => Parent => Right Child.

Afterwards, the coalescing tree expires and the HMC requests are generated.
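The consecutiveness check can be sketched as below, assuming the requests arrive as `(addr, size)` pairs already in the order the traversal yields them (ascending address). The function name `coalesce` is hypothetical, and starting a new request once the 128-byte cap is reached is an assumption about how the cap is enforced.

```python
MAX_BYTES = 128  # maximum read/write size stated on the slide


def coalesce(requests):
    """Merge address-ordered (addr, size) requests into larger requests.

    A request is folded into the current run only if it starts exactly
    where the run ends and the merged size stays within MAX_BYTES.
    """
    merged = []
    for addr, size in requests:
        if merged:
            base, length = merged[-1]
            if addr == base + length and length + size <= MAX_BYTES:
                merged[-1] = (base, length + size)
                continue
        merged.append((addr, size))
    return merged
```

For example, `coalesce([(0x100, 16), (0x110, 16), (0x120, 16), (0x200, 16)])` returns `[(0x100, 48), (0x200, 16)]`: the first three requests are back to back and collapse into one, while the fourth starts a new request.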

Slide8

Concurrent Design: Architecture

[Figure: architecture of the concurrent design]

Slide9

Concurrent Design: Algorithms

To further coalesce memory accesses, two algorithms are designed to increase the possibility that consecutive requests are stored in the same coalescing tree.

Address Partitioned Algorithm (APA): designed to coalesce adjacent memory accesses by partitioning the physical address space.

Work Partitioned Algorithm (WPA): designed to coalesce memory accesses of the same type by partitioning the requests by type (read/write), on top of the APA.

Slide10

Concurrent Design: Sequential Example

Suppose there are 8 requests to be inserted, as shown in Table 1. OP stands for the operation of the request: RD represents a read request and WR represents a write request.

[Table 1 and tree figure omitted]

Result: 6 HMC requests

Slide11

Concurrent Design: APA Example

Suppose there are 16 threads. The size of each memory partition is 0x10000000.

Thread 0 => [0x00000000, 0x0FFFFFFF]
Thread 1 => [0x10000000, 0x1FFFFFFF]
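The address-to-thread mapping in this example amounts to an integer division. The sketch below assumes the same pattern continues for threads 2 through 15; the function name `apa_thread` and the out-of-range handling are hypothetical.

```python
PARTITION_SIZE = 0x10000000  # partition size from the example
NUM_THREADS = 16             # thread count from the example


def apa_thread(addr):
    """Map a physical address to the thread that owns its partition."""
    tid = addr // PARTITION_SIZE
    if not 0 <= tid < NUM_THREADS:
        # Assumption: addresses beyond the 16 partitions are rejected.
        raise ValueError(f"address {addr:#x} is outside the partitioned range")
    return tid
```

Because neighbouring addresses land in the same partition, adjacent requests end up in the same thread's coalescing tree.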

Result: 4 HMC requests

Slide12

Concurrent Design: WPA Example

Suppose there are 16 threads. The size of each memory partition is 0x10000000.

Thread 0,8 => [0x00000000, 0x0FFFFFFF]
Thread 1,9 => [0x10000000, 0x1FFFFFFF]
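In this example two threads share each address partition, one per request type. A possible mapping is sketched below. The slide does not say which thread of a pair handles reads and which handles writes, so assigning reads to threads 0 to 7 and writes to threads 8 to 15 is an assumption, as is the name `wpa_thread`.

```python
PARTITION_SIZE = 0x10000000        # partition size from the example
NUM_THREADS = 16                   # thread count from the example
NUM_PARTITIONS = NUM_THREADS // 2  # two threads share each partition


def wpa_thread(addr, op):
    """Map (address, operation) to a thread.

    Assumption: the lower-numbered thread of each pair takes reads
    and the higher-numbered thread takes writes.
    """
    part = addr // PARTITION_SIZE
    if not 0 <= part < NUM_PARTITIONS:
        raise ValueError(f"address {addr:#x} is outside the partitioned range")
    if op == "RD":
        return part
    if op == "WR":
        return part + NUM_PARTITIONS
    raise ValueError(f"unknown operation: {op}")
```

With this split, each thread's tree holds requests of a single type from a single address range.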

Result: 4 HMC requests

Slide13

Evaluation: Test Cases

A port of the RISC-V Linux kernel built by UC Berkeley [6] is chosen as the environment to run the 5 test cases: the HPCG, SSCA#2, stream, scatter, and gather benchmarks [7][8][9].

The efficiency is calculated by the following equation:

[Equation omitted]

Efficiency ∈ [0, 1)

Slide14

Evaluation: APA & WPA Evaluations

[Figures: APA efficiency; WPA efficiency]

Slide15

Evaluation: Requests Distribution

[Figure: requests distribution of HPCG with APA on 2 and 8 threads]

Master's Thesis, Xi Wang

Slide16

Evaluation: Decreases of Total Request Cost

[Figures: decrease of total request cost for APA and WPA]

Total Request Cost = Control (32 bytes) + Data

For instance, one 16-byte HMC read request consists of:
- Request packet = 16 bytes (control cost)
- Response packet = 32 bytes (16 control + 16 data)

Total Request Cost = 32 bytes (control) + 16 bytes (data) = 48 bytes
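The cost formula is simple enough to check in a few lines of Python. The function names are hypothetical, and the coalescing comparison is an illustration derived from the formula, not a result reported in the slides.

```python
CONTROL_BYTES = 32  # 16-byte request packet + 16 control bytes in the response


def total_request_cost(data_bytes):
    """Total bytes moved for one request: control plus data."""
    return CONTROL_BYTES + data_bytes


def cost_saving(num_requests, bytes_each):
    """Fraction of bytes saved if the requests coalesce into one request.

    Assumes the combined payload fits in a single request.
    """
    separate = num_requests * total_request_cost(bytes_each)
    combined = total_request_cost(num_requests * bytes_each)
    return (separate - combined) / separate
```

Under this formula, eight separate 16-byte requests cost 8 × 48 = 384 bytes, while a single coalesced 128-byte request costs 32 + 128 = 160 bytes, because the control overhead is paid once instead of eight times.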

Slide17

Evaluation: Test Results

To summarize the test results shown above:

Across all the test cases, we find an average increase in efficiency of 55.17% and 55.94% with the APA and WPA approaches, respectively.

The average cost decreases across all APA tests and WPA tests are 8.83% and 10.04%, respectively.

As such, the WPA approach is also considered to be the more efficient one in this research.

Slide18

Conclusions & Future Work

We construct the architecture and model for the concurrent DMC design and implement two parallel algorithms: APA and WPA.

We also evaluate the concurrent DMC and demonstrate the benefit of the memory coalescing unit in terms of efficacy.

Future work will focus on extensions based on GC-64 and on improving the concurrent DMC according to the HMC specification 2.1.

Slide19

References

[1] GoblinCore-64, http://gc64.org/
[2] riscv.org. Regents of the University of California. Retrieved August 25, 2014.
[3] HMC Specification 2.0 from Micron, http://www.hybridmemorycube.org/
[4] Memory Wall, https://en.wikipedia.org/wiki/Random-access_memory#Memory_wall
[5] Yocto/OpenEmbedded RISC-V Port. http://riscv.org/software-tools/yocto/
[6] Linux/RISC-V, Linux kernel port of RISC-V. http://riscv.org/software-tools/linux/
[7] Dongarra J, Heroux M A, Luszczek P. A new metric for ranking high performance computing systems[J]. National Science Review, 2016: nwv084.
[8] J. Kepner, D. P. Koester, et al. HPCS SSCA#2 Graph Analysis Benchmark Specifications v1.0, April 2005.
[9] McCalpin J D. A survey of memory bandwidth and machine balance in current high performance computers[J]. IEEE TCCA Newsletter, 1995: 19-25.

Slide23

Concurrent Design: Components

Components of the DMC unit:

Spike is the modified RISC-V simulator, which has been extended to support memory tracing.

Microcode is responsible for inserting requests and generating the HMC requests.

Library contains the data structures of the coalescing tree, the requests from RISC-V cores, the HMC requests, etc.

Slide24

Concurrent Design: Tree Logic

Coalescing tree logic: reduce requests

When the addresses of the requests in the tree are consecutive, we reduce them into one HMC request regardless of the request type, as shown in cases a, b, and c.

[Figures: Case a, Case b, Case c]

Slide25

Concurrent Design: Tree Logic

Coalescing tree logic: reduce requests

When the requested data are not consecutive: if they are READ requests, we can still reduce them into one HMC request if the read bytes are smaller than 128 bytes. (The maximum and minimum read/write sizes are 128 and 16 bytes, respectively.)
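One way to read this rule is sketched below. It interprets "read bytes" as the span from the lowest requested byte to the highest, so that the gap between the reads is fetched along with the useful data; that interpretation, the inclusive 128-byte bound, and the name `merge_reads` are all assumptions.

```python
MAX_BYTES = 128  # maximum read/write size from the slide
MIN_BYTES = 16   # minimum read/write size from the slide


def merge_reads(reads):
    """Try to cover a list of (addr, size) reads with one request.

    Returns (base, length) if the span from the lowest to the highest
    requested byte fits in MAX_BYTES, otherwise None.
    """
    if not reads:
        return None
    base = min(addr for addr, _ in reads)
    end = max(addr + size for addr, size in reads)
    span = end - base
    if span > MAX_BYTES:
        return None
    return (base, max(span, MIN_BYTES))
```

For instance, two 16-byte reads at 0x100 and 0x140 span 80 bytes and can be served by one request, whereas reads at 0x100 and 0x200 span 272 bytes and cannot. The same trick is not applied to writes, since writing the gap would overwrite data that was never part of a request.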

[Figure: Case d]

Slide26

Evaluation: Test Cases

The first test case is the High Performance Conjugate Gradient benchmark (HPCG), which performs symmetric Gauss-Seidel algorithms.

The 2nd test case is a synthetic graph theory benchmark called SSCA#2, developed by the High Productivity Computer Systems (HPCS) program.

The 3rd test case is the stream benchmark, which is designed to measure sustainable memory bandwidth for contiguous, long-vector memory accesses.

The last 2 test cases are scatter and gather, a type of memory addressing that often arises when addressing vectors in sparse linear algebra operations.