Slide 1: Modern Hardware for DBMS
Mrutyunjay (Mjay)
University of Colorado, Denver
Slide 2: Motivation
Hardware Trends:
- Multi-Core CPUs
- Many-Core Co-Processors: GPUs (NVIDIA, AMD Radeon)
- Huge main memory capacity with complex access characteristics (caches, NUMA)
- Non-Volatile Storage: Flash SSDs (Solid State Drives)
Slide 3: Multi-Core CPU: Motivation
Around 2005, CPUs hit the frequency-scaling wall. Further improvements came from adding multiple processing cores to the same CPU chip, forming chip multiprocessors, and from building servers with multiple CPU sockets of multicore processors (SMP of CMPs).
Slide 4: The Multi-Core Alternative
Use Moore's law to place more cores per chip:
- 2x cores/chip with each CMOS generation
- Roughly the same clock frequency
- Known as multi-core chips or chip multiprocessors (CMPs)
The good news:
- Exponentially scaling peak performance
- No power problems due to clock frequency
- Easier design and verification
The bad news:
- Need a parallel program if we want to run a single app faster
- Power density is still an issue as transistors shrink
Slide 5: Multi-Core CPU: Challenges
This is how we think it works.
This is how it ACTUALLY works.
Slide 6: Multi-Core CPU: Challenges
- Type of cores: e.g., a few out-of-order (OOO) cores vs. many simple cores
- Memory hierarchy: which caching levels are shared and which are private; cache coherence; synchronization
- On-chip interconnect: bus vs. ring vs. scalable interconnect (e.g., mesh); flat vs. hierarchical
Slide 7: Multi-Core CPU
All processors have access to unified physical memory; they can communicate using loads and stores.
Advantages:
- Looks like a better multithreaded processor (multitasking)
- Requires only evolutionary changes to the OS
- Threads within an app communicate implicitly without involving the OS
- Simpler to code for, with low overhead
- App development: first focus on correctness, then on performance
Disadvantages:
- Implicit communication is hard to optimize
- Synchronization can get tricky
- Higher hardware complexity for cache management
Slide 8: NUMA Architecture
NUMA: Non-Uniform Memory Access
Slide 9: Many-Core: GPU / GPGPU
A GPU (Graphics Processing Unit) is a specialized microprocessor for accelerating graphics rendering.
GPUs were traditionally used for graphics computing; they now easily allow general-purpose computing.
GPGPU: using the GPU for general-purpose computing in physics, finance, biology, geosciences, medicine, etc.
Vendors: NVIDIA and AMD (Radeon).
Slide 10: GPU vs CPU
GPU designs with up to a thousand cores enable massively parallel computing. GPU architectures built from streaming multiprocessors take the form of SIMD processors.
(Figure: CPU vs. GPU core layout)
Slide 11: SIMD Processor
SIMD: Single Instruction, Multiple Data
- Distributed-memory SIMD computer
- Shared-memory SIMD computer
Slide 12: NVIDIA GPUs with SIMD Processors
Each GPU has one or more Streaming Multiprocessors (SMs).
Each SM is designed as a simple SIMD processor with 8-192 Streaming Processors (SPs).
This applies to NVIDIA GeForce 8-Series GPUs and later.
Slide 13: Questions from Previous Session
SMP of CMPs:
- SMP: sockets of multicore processors (multiple CPUs in a single system)
- CMP: Chip Multiprocessor (a single chip with multiple/many cores)
SP: Streaming Processor
SFU: Special Function Unit
Double-Precision Unit
Multithreaded Instruction Unit: hardware thread scheduling
Slide 14: GPU Cores
14 Streaming Multiprocessors per GPU; 32 cores per Streaming Multiprocessor.
Slide 15: Development Tools for GPU
Two main approaches. Another tool: OpenACC.
Slide 16: What is CUDA?
CUDA = Compute Unified Device Architecture
- A development framework for NVIDIA GPUs
- Extends the C language
- Supports NVIDIA GeForce 8-Series GPUs and later
Definitions:
- Host = CPU; host memory = RAM
- Device = GPU; device memory = RAM on the GPU
The host and device are connected by a PCI Express bus.
Slide 17: CUDA Compute Model
1. CPU sends data to the GPU
2. CPU instructs the processing on the GPU
3. GPU processes the data
4. CPU collects the results from the GPU
Data moves between host memory and device memory over the PCI Express bus.
Slide 18: CUDA Example
1. CPU sends data to the GPU
2. CPU instructs the processing on the GPU
3. GPU processes the data
4. CPU collects the results from the GPU
Host code:

    int N = 1000;
    int size = N * sizeof(float);
    float A[1000], *dA;
    cudaMalloc((void **)&dA, size);
    cudaMemcpy(dA, A, size, cudaMemcpyHostToDevice);
    // launch enough threads to cover all N elements
    ComputeArray<<<(N + 255) / 256, 256>>>(dA, N);
    cudaMemcpy(A, dA, size, cudaMemcpyDeviceToHost);
    cudaFree(dA);

Device code:

    __global__ void ComputeArray(float *A, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            A[i] = A[i] * A[i];  // square each element in parallel
    }
Slide 19: CUDA Example
A kernel is executed as a grid of blocks.
A block is a batch of threads that can cooperate with each other by:
- Sharing data through shared memory
- Synchronizing their execution
Threads from different blocks cannot cooperate.
Slide 20: GPU Computation Challenges
- Limiting kernel launches
- Limiting data transfers (solution: overlapped transfers)
GPU in Databases & Data Mining
GPU strengths are useful here:
- Memory bandwidth
- Parallel processing
Accelerating SQL queries: 10x improvement.
Also well suited for stream mining: continuous queries on streaming data instead of one-time queries on a static database.
Slide 21: Memory/Storage
Slide 22: Memory Hierarchy
The slowest parts are main memory and fixed disk. Can we decrease the latency gap between main memory and fixed disk? Solution: the SSD.
Slide 23: SSD: New-Generation Non-Volatile Memory
A Solid-State Drive (SSD) is a data storage device that emulates a hard disk drive (HDD) but has no moving parts. NAND flash SSDs are essentially arrays of flash memory devices with a controller that electrically and mechanically emulates, and is software-compatible with, magnetic HDDs.
Slide 24: SSD: Architecture
- Host interface logic
- SSD controller
- RAM buffer
- Flash memory packages
Slide 25: Flash Memory
NAND flash cells have a limited lifespan due to their limited number of P/E (Program/Erase) cycles.
What will be the initial state of an SSD?
Answer: still looking for it.
Slide 26: SSD: Architecture
Slide 27: Read, Write, and Erase
Reads are aligned on page size: it is not possible to read less than one page at once. One can of course request just one byte from the operating system, but a full page will be retrieved inside the SSD, forcing much more data to be read than necessary.
Writes are aligned on page size: writes happen in increments of the page size, so even if a write operation affects only one byte, a whole page is written anyway. Writing more data than necessary is known as write amplification.
Pages cannot be overwritten: a NAND-flash page can be written to only if it is in the "free" state. When data is changed, the content of the page is copied into an internal register, the data is updated, and the new version is stored in a "free" page, an operation called "read-modify-write".
Erases are aligned on block size: pages cannot be overwritten, and once they become stale, the only way to make them free again is to erase them. However, it is not possible to erase individual pages; only whole blocks can be erased at once.
Slide 28: Example of Writes
Buffer small writes: to maximize throughput, whenever possible keep small writes in a RAM buffer, and when the buffer is full, perform a single large write that batches all the small writes.
Align writes: align writes on the page size, and write chunks of data that are multiples of the page size.
Slide 29: SSD: How It Stores Data
Slide 30: SSD: How It Stores Data
There is a latency difference for each cell type: more levels per cell increase latency, delaying reads and writes. Solution: hybrid SSDs, consisting of mixed levels.
Slide 31: Garbage Collection
The garbage collection process in the SSD controller ensures that "stale" pages are erased and restored to a "free" state so that incoming write commands can be processed.
Split cold and hot data: hot data changes frequently, while cold data changes infrequently. If some hot data is stored in the same page as some cold data, the cold data is copied along every time the hot data is updated in a read-modify-write operation, and is moved during garbage collection for wear leveling. Splitting cold and hot data into separate pages as much as possible makes the garbage collector's job easier.
Buffer hot data: extremely hot data should be buffered as much as possible and written to the drive as infrequently as possible.
Slide 32: Flash Translation Layer
The main factor that made adoption of SSDs so easy is that they use the same host interfaces as HDDs.
Although presenting an array of Logical Block Addresses (LBAs) makes sense for HDDs, since their sectors can be overwritten, it is not fully suited to the way flash memory works. For this reason, an additional component is required to hide the inner characteristics of NAND flash memory and expose only an array of LBAs to the host. This component is called the Flash Translation Layer (FTL) and resides in the SSD controller.
The FTL is critical and has two main purposes: logical block mapping and garbage collection.
This mapping takes the form of a table which, for any LBA, gives the corresponding physical block address (PBA). The mapping table is stored in the RAM of the SSD for speed of access, and is persisted in flash memory in case of power failure. When the SSD powers up, the table is read from the persisted version and reconstructed into the RAM of the SSD.
Slide 33: Internal Parallelism in SSDs
Internal parallelism: internally, several levels of parallelism allow writing to several blocks at once into different NAND-flash chips, forming what is called a "clustered block".
Multiple levels of parallelism:
- Channel-level parallelism
- Package-level parallelism
- Chip-level parallelism
- Plane-level parallelism
Slide 34: Characteristics and Latencies of NAND-Flash Memory
Slide 35: Advantages & Disadvantages
SSD advantages:
- Reads and writes are much faster than on a traditional HDD
- Allows PCs to boot up and launch programs far more quickly
- More physically robust
- Uses less power and generates less heat
SSD disadvantages:
- Lower capacity than HDDs
- Higher storage cost per GB
- Limited number of data write cycles
- Performance degradation over time
Slide 36: Reference
http://codecapsule.com/2014/02/12/coding-for-ssds-part-6-a-summary-what-every-programmer-should-know-about-solid-state-drives/
Slide 37: Questions?