Modern Hardware for DBMS - PowerPoint Presentation

Uploaded on 2019-06-29
Presentation Transcript

Slide1

Modern Hardware for DBMS

Mrutyunjay (Mjay)

University of Colorado, Denver

Slide2

Motivation

Hardware Trends

Multi-Core CPUs

Many-Core Co-Processors: GPU (NVIDIA, AMD Radeon)

Huge main memory capacity with complex access characteristics (Caches, NUMA)

Non-Volatile Storage: Flash SSD (Solid State Drive)

Slide3

Multi-Core CPU: Motivation

Around 2005, CPUs hit the frequency-scaling wall: clock rates could no longer keep rising, so performance improvements came from adding multiple processing cores to the same CPU chip, forming chip multiprocessors (CMPs).

Servers combine multiple CPU sockets of multicore processors (SMP of CMPs).

Slide4

The Multi-core Alternative

Use Moore’s law to place more cores per chip

2x cores/chip with each CMOS generation

Roughly same clock frequency

Known as multi-core chips or chip-multiprocessors (CMP)

The good news

Exponentially scaling peak performance

No power problems due to clock frequency

Easier design and verification

The bad news

We need a parallel program if we want to run a single app faster

Power density is still an issue as transistors shrink

Slide5

Multi-Core CPU: Challenges

This is how we think it works.

This is how it EXACTLY works.

Slide6

Multi-Core CPU: Challenges

Type of cores: e.g., few OOO (out-of-order) cores vs. many simple cores
Memory hierarchy: which caching levels are shared and which are private
Cache coherence
Synchronization
On-chip interconnect: bus vs. ring vs. scalable interconnect (e.g., mesh); flat vs. hierarchical

Slide7

Multi-Core CPU

All processors have access to a unified physical memory; they can communicate using loads and stores.

Advantages:
Looks like a better multithreaded processor (multitasking)
Requires only evolutionary changes to the OS
Threads within an app communicate implicitly without using the OS
Simpler to code for, and low overhead
App development: first focus on correctness, then on performance

Disadvantages:
Implicit communication is hard to optimize
Synchronization can get tricky
Higher hardware complexity for cache management

Slide8

NUMA Architecture

NUMA: Non-Uniform Memory Access

Slide9

Many-Core: GPU / GPGPU

A GPU (Graphics Processing Unit) is a specialized microprocessor for accelerating graphics rendering.
GPUs were traditionally used for graphics computing; they now allow general-purpose computing easily.
GPGPU: using the GPU for general-purpose computing, e.g., physics, finance, biology, geosciences, medicine.
Vendors: NVIDIA and AMD (Radeon).

Slide10

GPU vs CPU

GPU designs with up to a thousand cores enable massively parallel computing. GPU architectures built from streaming multiprocessors take the form of SIMD processors.

Slide11

SIMD Processor

SIMD: Single Instruction Multiple Data

Distributed-memory SIMD computer

Shared-memory SIMD computer

Slide12

NVIDIA GPUs with SIMD Processors

Each GPU has one or more Streaming Multiprocessors (SMs). Each SM has the design of a simple SIMD processor, with 8-192 Streaming Processors (SPs). Applies to NVIDIA GeForce 8-Series GPUs and later.

Slide13

Questions from Previous Session

SMP of CMPs:
SMP: sockets of multicore processors (multiple CPUs in a single system)
CMP: Chip Multiprocessor (a single chip with multiple/many cores)

SP: Streaming Processor
SFU: Special Function Unit
Double-Precision Unit
Multithreaded Instruction Unit: hardware thread scheduling

Slide14

GPU Cores

14 Streaming Multiprocessors per GPU, 32 cores per Streaming Multiprocessor (448 cores total)

Slide15

Development tools for GPU

Two main approaches (the next slides cover CUDA). Other tool: OpenACC.

Slide16

What is CUDA?

CUDA = Compute Unified Device Architecture
A development framework for NVIDIA GPUs
Extensions of the C language
Supports NVIDIA GeForce 8-Series and later

Definitions

Host = CPU, Device = GPU
Host memory = RAM, Device memory = RAM on the GPU
The host (CPU) and device (GPU) are connected by the PCI Express bus.

Slide17

CUDA Compute Model

1. CPU sends data to the GPU
2. CPU instructs the processing on the GPU
3. GPU processes data
4. CPU collects the results from the GPU

Slide18

CUDA Example

CPU sends data to the GPU; CPU instructs the processing on the GPU; GPU processes data; CPU collects the results from the GPU.

Host code:

    int N = 1000;
    int size = N * sizeof(float);
    float A[1000], *dA;
    cudaMalloc((void **)&dA, size);
    cudaMemcpy(dA, A, size, cudaMemcpyHostToDevice);
    ComputeArray<<<10, 100>>>(dA, N);  /* 10 blocks x 100 threads = one thread per element */
    cudaMemcpy(A, dA, size, cudaMemcpyDeviceToHost);
    cudaFree(dA);

Device code:

    __global__ void ComputeArray(float *A, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) A[i] = A[i] * A[i];
    }

Slide19

CUDA Example

A kernel is executed as a grid of blocks. A block is a batch of threads that can cooperate with each other by:
– sharing data through shared memory
– synchronizing their execution
Threads from different blocks cannot cooperate.

Slide20

GPU Computation Challenge

Limit kernel launches. Limit data transfers (solution: overlapped transfers).

GPU in Databases & Data Mining

GPU strengths are useful

Memory bandwidth

Parallel processing

Accelerating SQL queries – 10x improvement

Also well suited for stream mining

Continuous queries on streaming data instead of one-time queries on static database

Slide21

Memory/Storage

Slide22

Memory Hierarchy

Slowest parts: main memory and fixed disk. Can we decrease the latency gap between main memory and fixed disk? Solution: SSD.

Slide23

SSD: New Generation Non-Volatile Memory

A Solid-State Drive (SSD) is a data storage device that emulates a hard disk drive (HDD); unlike an HDD, it has no moving parts. NAND flash SSDs are essentially arrays of flash memory devices plus a controller that electrically and mechanically emulates, and is software-compatible with, magnetic HDDs.

Slide24

SSD: Architecture

Host Interface Logic
SSD Controller
RAM Buffer
Flash Memory Packages

Slide25

Flash Memory

NAND-flash cells have a limited lifespan due to their limited number of P/E (Program/Erase) cycles.

What is the initial state of an SSD?

Ans: Still looking for it.

Slide26

SSD: Architecture

Slide27

Read, Write and Erase

Reads are aligned on page size: it is not possible to read less than one page at once. One can of course request just one byte from the operating system, but a full page will be retrieved in the SSD, forcing a lot more data to be read than necessary.

Writes are aligned on page size: when writing to an SSD, writes happen in increments of the page size. So even if a write operation affects only one byte, a whole page will be written anyway. Writing more data than necessary is known as write amplification.

Pages cannot be overwritten: a NAND-flash page can be written to only if it is in the "free" state. When data is changed, the content of the page is copied into an internal register, the data is updated, and the new version is stored in a "free" page, an operation called "read-modify-write".

Erases are aligned on block size: pages cannot be overwritten, and once they become stale, the only way to make them free again is to erase them. However, it is not possible to erase individual pages; only whole blocks can be erased at once.

Slide28

Write optimizations:

Buffer small writes: to maximize throughput, whenever possible keep small writes in a buffer in RAM, and when the buffer is full, perform a single large write to batch all the small writes.

Align writes: align writes on the page size, and write chunks of data that are multiples of the page size.

Slide29

SSD: How it stores data?

Slide30

SSD: How it stores data?

Latency differs for each cell type. More levels (bits per cell) increase the latency: delays in read and write. Solution: hybrid SSDs, consisting of mixed levels.

Slide31

Garbage collection

The garbage collection process in the SSD controller ensures that "stale" pages are erased and restored to a "free" state so that incoming write commands can be processed.

Split cold and hot data: hot data is data that changes frequently; cold data is data that changes infrequently. If some hot data is stored in the same page as some cold data, the cold data will be copied along every time the hot data is updated in a read-modify-write operation, and will be moved along during garbage collection for wear leveling. Splitting cold and hot data as much as possible into separate pages makes the job of the garbage collector easier.

Buffer hot data: extremely hot data should be buffered as much as possible and written to the drive as infrequently as possible.

Slide32

Flash Translation Layer

The main factor that made adoption of SSDs so easy is that they use the same host interfaces as HDDs. Although presenting an array of Logical Block Addresses (LBAs) makes sense for HDDs, whose sectors can be overwritten, it is not fully suited to the way flash memory works.

For this reason, an additional component is required to hide the inner characteristics of NAND flash memory and expose only an array of LBAs to the host. This component is called the Flash Translation Layer (FTL), and resides in the SSD controller.

The FTL is critical and has two main purposes: logical block mapping and garbage collection.

The mapping takes the form of a table which, for any LBA, gives the corresponding physical block address (PBA). This mapping table is stored in the RAM of the SSD for speed of access, and is persisted in flash memory in case of power failure. When the SSD powers up, the table is read from the persisted version and reconstructed into the RAM of the SSD.

Slide33

Internal Parallelism in SSDs

Internal parallelism: internally, several levels of parallelism allow writing to several blocks at once into different NAND-flash chips, forming what is called a "clustered block".
Multiple levels of parallelism:
Channel-level parallelism
Package-level parallelism
Chip-level parallelism
Plane-level parallelism

Slide34

Characteristics and latencies of NAND-flash memory

Slide35

Advantages & Disadvantages

SSD Advantages

Reads and writes are much faster than on a traditional HDD
Allow PCs to boot up and launch programs far more quickly
More physically robust
Use less power and generate less heat

SSD Disadvantages

Lower capacity than HDDs
Higher storage cost per GB
Limited number of data write cycles
Performance degradation over time

Slide36

Reference

http://codecapsule.com/2014/02/12/coding-for-ssds-part-6-a-summary-what-every-programmer-should-know-about-solid-state-drives/

Slide37

Questions???