Presentation Transcript

1. Graphics Hardware
UMBC Graphics for Games

2. CPU Architecture
Starts 1-4 instructions per cycle
Pipelined: each instruction takes 8-16 cycles to complete
Memory is slow:
  1 cycle for registers (~2^0)
  2-4 cycles for L1 cache (~2^2)
  10-20 cycles for L2 cache (~2^4)
  50-70 cycles for L3 cache (~2^6)
  200-300 cycles for memory (~2^8)

3. CPU Goals
Make one thread go very fast
Avoid pipeline stalls:
  Branch prediction: don't wait to know the branch target (misprediction costs ~64 instructions)
  Out-of-order execution: don't wait for one instruction to finish before starting others
  Memory prefetch: recognize common access patterns, get data to fast levels of cache early
  Big caches (though bigger caches add latency)

4. CPU Performance Tips
Avoid unpredictable branches
  Specialize code
  Avoid virtual functions when possible
  Sort similar cases together
Avoid caching data you don't use
  Core of data-oriented design / Structure of Arrays organization
  struct { position, velocity } [NUM] vs. struct { position[NUM], velocity[NUM] } (see the sketch below)
Avoid unpredictable access
  O(N) linear search is faster than O(log N) binary search up to 50-100 elements
  Linear access can prefetch; the branch is predictable (keep looping) until the final iteration
  Asymptotically worse, but for small N, constants matter
  Don't underestimate how big "small N" can be
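A minimal C++ sketch of the Array-of-Structures vs. Structure-of-Arrays layouts and a simple linear search; the struct names and element counts are illustrative, not from the slides.

```cpp
#include <cstddef>

constexpr std::size_t NUM = 64;   // illustrative element count

// Array of Structures: position and velocity interleaved per particle.
// Touching only positions still pulls velocities into cache.
struct ParticleAoS { float position[3]; float velocity[3]; };
ParticleAoS particlesAoS[NUM];

// Structure of Arrays: each component packed contiguously.
// A loop over positions streams through memory the prefetcher likes.
struct ParticlesSoA {
    float position[NUM][3];
    float velocity[NUM][3];
} particlesSoA;

// Linear search: predictable branch and sequential access, so it often
// beats binary search for small N (the slides suggest up to ~50-100 elements).
int linearFind(const int* keys, std::size_t n, int wanted) {
    for (std::size_t i = 0; i < n; ++i)
        if (keys[i] == wanted) return static_cast<int>(i);
    return -1;
}

int main() {
    int keys[5] = {4, 8, 15, 16, 23};
    return linearFind(keys, 5, 15);   // returns index 2
}
```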

5. GPU Goals
Make 1000's of threads running the same program go very fast
Hide stalls
  Share hardware: all threads run the same program
  Swap threads to hide stalls
  Large, flexible register set: enough for active and stalled threads

6. Architecture (MIMD vs SIMD)
MIMD (CPU-like): flexibility, ease of use
SIMD (GPU-like): horsepower
[Die-shot comparison: AMD K10 (chip-architect.org) vs. NVIDIA Volta (wikichip.org), highlighting the actual computational units]

7. SIMD Branching
if( x )   // mask threads
{
    // issue instructions
}
else      // invert mask
{
    // issue instructions
}
// unmask
[Figure: two cases — threads agree (only the taken side is issued) or threads disagree (both the THEN and ELSE sides are issued under the mask); a sketch follows below]
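A toy C++ sketch of how a SIMD machine executes both sides of a divergent if/else under a per-lane mask; the lane count and data are made up for illustration.

```cpp
#include <array>
#include <cstdio>

constexpr int LANES = 8;   // illustrative SIMD width

int main() {
    std::array<int, LANES>  x    = {3, 0, 5, 0, 2, 0, 0, 7};
    std::array<bool, LANES> mask{};
    std::array<int, LANES>  out{};

    // "if (x)": build the mask; lanes where x is nonzero stay active.
    for (int i = 0; i < LANES; ++i) mask[i] = (x[i] != 0);

    // THEN side: issued for every lane, but only active lanes write.
    for (int i = 0; i < LANES; ++i)
        if (mask[i]) out[i] = x[i] * 2;

    // ELSE side: invert the mask and issue again.
    for (int i = 0; i < LANES; ++i)
        if (!mask[i]) out[i] = -1;

    // Both instruction streams were issued; each lane kept one result.
    for (int v : out) std::printf("%d ", v);
    std::printf("\n");
}
```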

8. SIMD Looping
while( x )   // update mask
{
    // do stuff
}
Everyone runs as long as the slowest lane
% Active = Utilization (figure shows active vs. inactive lanes; in that example: 36% utilization)
(a utilization sketch follows below)
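A small C++ sketch that computes loop utilization for one warp given each lane's trip count; the per-lane counts here are invented just to show the arithmetic.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical trip counts for an 8-lane warp; every lane must wait
    // for the slowest one, so the loop runs max(counts) iterations.
    std::vector<int> trips = {1, 2, 1, 3, 1, 2, 9, 1};

    int slowest = *std::max_element(trips.begin(), trips.end());
    long active = 0;
    for (int t : trips) active += t;                          // lane-iterations doing work
    long issued = static_cast<long>(slowest) * trips.size();  // lane-iterations issued

    std::printf("utilization = %.0f%%\n", 100.0 * active / issued);
    // With these made-up counts: 20/72, roughly 28% -- the slide's figure showed 36%.
}
```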

9. GPU Programming Model
Vertex shader: runs on one vertex at a time
Rasterizer: collects sets of three vertices
Pixel shader: runs on one pixel at a time
[Pipeline diagram: Vertex → Rasterize → Pixel → Z-buffer/Blend → Displayed pixels, with Texture/Buffer/Cache feeding the shader stages]

10. GPU Processing Model
SIMD for efficiency: same processors for vertex & pixel
SIMD batches: limits on # of vertices & pixels that can run together
Improve utilization: limit divergence
Basic scheduling: if there is a batch of pixels, run it; otherwise run some vertices to make more
[Same pipeline diagram as before: Vertex → Rasterize → Pixel → Z-buffer/Blend → Displayed pixels, fed by Texture/Buffer/Cache]

11. NVIDIA Maxwell
[NVIDIA, NVIDIA GeForce GTX 980 Whitepaper, 2014]

12. Maxwell SIMD Processing Block
32 cores
8 Special Function Units (SFU): double precision, trig, …; still issue 1/thread, but run at ¼ rate
8 Load/Store memory access units
Hide latency by interleaving threads
  Wave (AMD) / Warp (NVIDIA): one set of lanes (AMD) / threads (NVIDIA)
  Want at least 4-8 interleaved
  % of max = Occupancy

13. GPU Registers
Scalar General Purpose Registers (SGPR)
  Same value for all threads
  AMD term; Pixar's RenderMan (& GLSL) called these uniform
Vector General Purpose Registers (VGPR)
  Different value in every thread
  AMD term; Pixar & GLSL called these varying
# of wavefronts is usually limited by VGPR use (see the occupancy sketch below)
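A back-of-envelope C++ sketch of why register (VGPR) use limits how many warps/wavefronts can be resident; the register-file size and warp limits below are rough assumed figures, not taken from the slides.

```cpp
#include <algorithm>
#include <cstdio>
#include <initializer_list>

int main() {
    // Assumed hardware figures (they vary by architecture):
    const int regFilePerSM   = 65536;  // 32-bit registers per SM
    const int threadsPerWarp = 32;
    const int maxWarpsPerSM  = 64;

    for (int regsPerThread : {32, 64, 128, 255}) {
        int warpsByRegs = regFilePerSM / (regsPerThread * threadsPerWarp);
        int resident    = std::min(warpsByRegs, maxWarpsPerSM);
        std::printf("%3d regs/thread -> %2d warps resident (occupancy %3.0f%%)\n",
                    regsPerThread, resident, 100.0 * resident / maxWarpsPerSM);
    }
    // Fewer registers per thread -> more warps resident -> more latency hiding.
}
```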

14. Maxwell Streaming Multiprocessor (SMM)
4 SIMD blocks (128 total cores)
Share L1 caches
Between-core shared memory: communication through this shared memory is fast
Share tessellation HW: hardware support for tessellation shaders

15. Maxwell Graphics Processing Cluster (GPC)
4 SMMs (512 total cores)
Share rasterizer

16. Full NVIDIA Maxwell
4 GPCs (2048 total cores)
Share L2
Share dispatch: decides which threads to launch and when

17. NVIDIA Volta
[NVIDIA, NVIDIA Tesla V100 GPU Architecture]

18. Things in newer GPUs
MMMMMOOOORRRRREEEE CORES
"Tensor cores" = a bunch of float16 multiply/add cores: targeting AI applications, but general purpose
Hardware support for new shader types: mesh shaders, ray tracing shaders

19. Graphics System Architecture
[Diagram: Your Code → API / Driver → Current Frame (buffering commands) → Previous Frame(s) (submitted, pending execution) → GPU(s) → Display Engine; the CPU produces commands, the GPU consumes them]

20. Care and Feeding of a GPU

21. Parallel Submission
OpenGL and DX11 had a one-CPU-thread bottleneck to the GPU
DX12, Vulkan and Metal are designed for multiple CPU cores to simultaneously submit work to a single GPU
  Any thread can build command lists & submit to the GPU when ready
  A command list includes all necessary state and can execute in any order
  Tell the GPU about resource dependencies (barriers / transitions); this enforces a partial ordering of command list execution
(a recording sketch follows below)
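A hedged C++ sketch of the multi-threaded recording pattern; CommandList, RecordScenePart, and SubmitToQueue are stand-ins for the real DX12/Vulkan/Metal objects, not actual API calls.

```cpp
#include <thread>
#include <vector>

struct CommandList { int id = -1; };   // stand-in for a DX12/Vulkan/Metal command list

// Hypothetical helpers (stubbed so the sketch compiles):
CommandList RecordScenePart(int part) { return CommandList{part}; }  // record one slice of the frame
void SubmitToQueue(const std::vector<CommandList>&) {}               // hand finished lists to the GPU queue

void BuildFrame(int numParts) {
    std::vector<CommandList> lists(numParts);
    std::vector<std::thread> workers;

    // Each CPU thread records its own command list independently;
    // all required state is baked into the list itself.
    for (int p = 0; p < numParts; ++p)
        workers.emplace_back([&lists, p] { lists[p] = RecordScenePart(p); });
    for (auto& t : workers) t.join();

    // Ordering across lists is only constrained by the
    // barriers/transitions recorded inside them.
    SubmitToQueue(lists);
}

int main() { BuildFrame(4); }
```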

22. Resource Transitions
What kind of memory operations does this buffer need to support?
  (CPU/GPU) (read/write) (once/many times)
  e.g. CPU write once, GPU read once; GPU write once, GPU read many times; …
Source/target stage for the transition
  When is it ready? When will it be used?
What use should it be optimized for?
  CPU staging, texture, render target, depth buffer, vertex buffer, index buffer, graphics read/write unordered access view (UAV), compute buffer, …
Explicitly transition between these
  e.g. between a render-target write in pass A and a texture read in pass B (see the sketch below)
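A sketch of the render-target-write to texture-read transition the slide describes, written against D3D12 as one example API; colorBuffer and cmdList are assumed to exist already, and the names should be double-checked against the SDK headers.

```cpp
#include <d3d12.h>

// Assumed to be created elsewhere:
//   ID3D12GraphicsCommandList* cmdList;
//   ID3D12Resource*            colorBuffer;   // written as a render target in pass A
void TransitionForPassB(ID3D12GraphicsCommandList* cmdList, ID3D12Resource* colorBuffer)
{
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type                   = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Flags                  = D3D12_RESOURCE_BARRIER_FLAG_NONE;
    barrier.Transition.pResource   = colorBuffer;
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;         // pass A wrote it
    barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE; // pass B samples it

    // The driver now knows pass B's reads depend on pass A's writes.
    cmdList->ResourceBarrier(1, &barrier);
}
```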

23. Setting up a Command List
What do I write? Render targets, UAVs
  Transition/barrier to make sure anyone else reading or writing them is done
What do I read?
  Transition if just written or in a different format
What shader(s) am I using?
What shader parameter blocks?
  Given by descriptors & root signature

24. Setting up a Graphics Command List
Begin Pass / End Pass
  Primarily necessary for batching on mobile
Rendering: Graphics Pipeline State Object (PSO)
  Primitive type: triangle list, fan, strip, quad, points, …
  Rasterizer state: solid/wireframe, two-sided / CW side / CCW side, MSAA
  Depth/stencil state: comparison (<, ≤, =, ≠, ≥, >), update?
  Blend state: transparent layer with opacity α: α * new + (1 - α) * old
    Generalized to (a * new (op) b * old) for a limited selection of a, b, and (op) (see the sketch below)
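A small C++ sketch of the fixed-function blend math the slide names; the single scalar channel and factor choices are just to make the formula concrete.

```cpp
#include <cstdio>

// Standard "over" blend for one color channel:
//   result = alpha * new + (1 - alpha) * old
float blendOver(float srcNew, float dstOld, float alpha) {
    return alpha * srcNew + (1.0f - alpha) * dstOld;
}

// Generalized hardware form: a * new (op) b * old, where a, b and the op
// come from a limited menu (here op = add, a = alpha, b = 1 - alpha).
float blendGeneral(float srcNew, float dstOld, float a, float b) {
    return a * srcNew + b * dstOld;
}

int main() {
    float old = 0.25f, neu = 1.0f, alpha = 0.5f;
    std::printf("over:    %.3f\n", blendOver(neu, old, alpha));                  // 0.625
    std::printf("general: %.3f\n", blendGeneral(neu, old, alpha, 1.0f - alpha)); // same result
}
```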

25. GPU Performance Tips

26. GPU Performance Tips: Communication
Reading results derails the CPU → GPU train…
Occlusion queries → death when used poorly; don't ask for the answer for 2-3 frames
Framebuffer reads → DEATH!!! (almost always…)
CPU-GPU communication should be one way
If you must read, do it a few frames later (a sketch of that follows below)
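A conceptual C++ sketch of reading query results a few frames late so the CPU never waits on the GPU; QueryIssue and QueryTryGetResult are hypothetical stand-ins for a real query API, and the 3-frame delay matches the slide's 2-3 frame suggestion.

```cpp
#include <cstdint>
#include <optional>

constexpr int kDelayFrames = 3;                 // read results this many frames later
constexpr int kSlots       = kDelayFrames + 1;  // ring of in-flight queries

// Hypothetical stand-ins for "issue an occlusion query" and "fetch its
// result without blocking" (stubbed so the sketch compiles):
void                    QueryIssue(int /*slot*/) {}
std::optional<uint64_t> QueryTryGetResult(int /*slot*/) { return 0; }

void PerFrame(uint64_t frameIndex) {
    int writeSlot = static_cast<int>(frameIndex % kSlots);
    QueryIssue(writeSlot);                       // ask the question this frame

    if (frameIndex >= kDelayFrames) {
        int readSlot = static_cast<int>((frameIndex - kDelayFrames) % kSlots);
        // The GPU has almost certainly finished that frame by now, so this
        // read should not stall; if it isn't ready, skip it rather than wait.
        if (auto visiblePixels = QueryTryGetResult(readSlot)) {
            (void)*visiblePixels;                // use the 3-frame-old answer
        }
    }
}

int main() { for (uint64_t f = 0; f < 8; ++f) PerFrame(f); }
```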

27. GPU Performance Tips: API & Driver
Batch shader/texture/constant changes
  Batches can execute together, but not if split by other stuff
  Some engines generate a list of stuff to render, then sort it by state (see the sketch below)
Minimize draw calls
  One instanced draw is much more efficient than many static draws
Minimize CPU → GPU traffic
  Use static vertex/index buffers if you can: copy from CPU to GPU once, then leave the data there
  Use dynamic buffers if you must, with discarding locks: region being updated / region in queue / region being rendered
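A minimal C++ sketch of the "collect then sort by state" idea; the DrawItem fields and key packing are invented for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One queued draw with the state it needs; the identifiers are illustrative.
struct DrawItem {
    uint16_t shaderId;
    uint16_t textureId;
    uint32_t meshId;

    // Pack state into a sort key so draws sharing a shader (and then a
    // texture) land next to each other and batch without state changes.
    uint64_t key() const {
        return (uint64_t(shaderId) << 48) | (uint64_t(textureId) << 32) | meshId;
    }
};

void SubmitSorted(std::vector<DrawItem>& items) {
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) { return a.key() < b.key(); });
    // Now walk the list: bind shader/texture only when the key's state
    // portion changes, and issue the draws in between as one batch.
}

int main() {
    std::vector<DrawItem> items = {{2, 7, 10}, {1, 3, 11}, {2, 3, 12}, {1, 3, 13}};
    SubmitSorted(items);
}
```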

28. GPU Performance Tips: Shaders
No unnecessary work!
  Precompute constant expressions
  Divide by constant → multiply by reciprocal
    x*(1./3.) vs. x/3. — not always the same in float math, so the compiler is not allowed to make that optimization (see the sketch below)
Minimize fetches
  Prefer compute (generally); if ALU/TEX < ~4, the ALU is under-utilized
  If combining static textures, bake it all down…
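A tiny C++ illustration of the reciprocal-multiply rewrite and why the compiler won't do it for you: the two forms can round differently, so the hand-rewrite is a judgment call the programmer makes.

```cpp
#include <cstdio>

int main() {
    // Hand-hoisted reciprocal: one divide total instead of one per use.
    const float inv3 = 1.0f / 3.0f;

    float x = 0.3f;
    float byDivide   = x / 3.0f;   // divide every time
    float byMultiply = x * inv3;   // cheaper multiply

    // The two results can differ in the last bit, which is exactly why a
    // standards-conforming compiler (without fast-math) leaves x/3 alone.
    std::printf("%.9g vs %.9g (equal: %d)\n",
                byDivide, byMultiply, byDivide == byMultiply);
}
```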

29. GPU Performance Tips: Shader Occupancy
Know what's limiting you
  Memory bandwidth limited, compute limited, local/shared memory capacity limited, VGPR limited, …
Limit VGPR usage by specializing shaders
  In UE4, #ifdefs in shaders compile versions with different #define choices
  But… exponential growth in the number of shader variants to compile
Limit VGPR usage with computation that's constant across the warp
  Can re-merge with cross-wave primitives

30. GPU Performance Tips: Shaders
Careful with flow control
  Avoid divergence
  Flatten small branches (map to conditional move / predicated instructions)
  Prefer simple control structure
  Specialize the shader (though this can lead to 1000's of shaders)
Double-check the compiler: shader compilers are getting better…
Look over artists' shoulders: material editors give them lots of rope…

31. GPU Performance Tips: Vertices
Use the right data format
  Cache-optimized index buffer
  Minimal, aligned, packed vertex data (see the sketch below)
    e.g. (position, normal, texture coordinate); (position, normal); (position); …
    Or index into separate buffers for each vertex component used
Cull invisible geometry
  Coarse-grained (a few thousand triangles) is enough
  "Heavy" geometry load is ~2M triangles and rising
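A short C++ sketch of a minimal, packed, aligned vertex layout; the exact field choices and the snorm/unorm packing are illustrative, not prescribed by the slides.

```cpp
#include <cstdint>

// Full-fat layout: 3 floats each for position/normal + 2 for UV = 32 bytes.
struct VertexFat {
    float position[3];
    float normal[3];
    float uv[2];
};

// Packed layout: normals as 8-bit signed-normalized values and UVs as
// 16-bit unsigned-normalized values cut the vertex to 20 bytes, which is
// less data for the vertex fetch to pull through the cache.
struct VertexPacked {
    float    position[3];   // 12 bytes, kept full precision
    int8_t   normal[4];     //  4 bytes (xyz snorm + 1 pad byte for alignment)
    uint16_t uv[2];         //  4 bytes (unorm16 texture coordinates)
};

static_assert(sizeof(VertexFat)    == 32, "fat vertex is 32 bytes");
static_assert(sizeof(VertexPacked) == 20, "packed vertex is 20 bytes");

int main() {}
```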

32. GPU Performance Tips: Pixels
Small triangles hurt performance
  GPU always renders 2x2 pixel blocks (for texture filtering)
  Renders extra "fake" pixels to fill the block → waste at triangle edges
Respect the texture cache
  Adjacent pixels should touch the same or adjacent texels
  Use mipmaps if you can: adjacent in a high mip level is a cache miss in the lowest level
Use the smallest possible texture format
  Per-texel sizes: DXT5 1 byte / RGBA8 4 bytes / RGBA16F 8 bytes / RGBA32F 16 bytes
Avoid incoherent texture reads (commonly from using a texture as a data table)
Do work per vertex: there are usually fewer of those (see small triangles)

33. GPU Performance Tips: Pixels
HW is very good at Z culling (early Z, hierarchical Z)
If possible, submit geometry front to back
"Z priming" is commonplace (UE4 does this)
  Render with a simple shader to the z-buffer
  Then render with the real shader
  Helps a ton for complex forward shaders, but useful even for a G-buffer

34. GPU Performance Tips: Frame Buffer
Turn off what you don't need: alpha blending, color/Z writes
Minimize redundant passes
  Multiple lights/textures in one pass, as long as it doesn't kill your occupancy
Use the smallest possible pixel format
Consider clip/discard in transparent regions: throws out pixels instead of blending them

35. GPU Performance Tips: Overall
Learn as much as possible about GPU internals; use that to guide your optimization decisions
Benchmark, don't assume: complex interrelationships can surprise you
Build a good A/B test timing framework
  UE4 has an abtest console command: do console command A, time a bit, do B, time a bit, …
Do at least big-picture optimizations early