DirectX 12: Improving Performance in Your Game
Bennett Sorbo
Program Manager
Direct3D, Windows Graphics
Agenda
Overview
Improving GPU efficiency
Reducing CPU overhead
Summary / Next Steps
Overview
DirectX 12 provides a single API for low-level access to a variety of GPU hardware
Enables games to leverage higher-level knowledge to achieve great performance gains
Today, we’ll discuss best practices for specific DirectX 12 features to achieve these gains in your game
Increasing GPU efficiency
GPU Efficiency
Three key areas for GPU-side gains:
Explicit resource transitions
Parallel GPU execution
GPU-generated workloads
GPU Efficiency: Explicit resource transitions
Modern GPUs require resources to be in different ‘states’ for different use cases, and something has to know when transitions between those states need to occur.
In DirectX 12, the app is responsible for identifying when these transitions need to occur.
Making these transitions explicit makes it clear when operations are expensive...
GPU Efficiency: Explicit resource transitions (cont’d)
... but also gives games the opportunity to eliminate unnecessary transitions. Two key opportunities:
First, UAV synchronization is now exposed as an explicit resource barrier.
Previously, the driver would ensure all writes to a UAV were in order of dispatch by inserting “Wait for Idle” commands after each dispatch.
< Diagram: command stream in which each Dispatch is followed by a WaitForIdle >
GPU Efficiency: Explicit resource transitions (cont’d)
If the app has high-level knowledge that dispatches can run out of order, the WaitForIdles can be removed.
More importantly, dispatches can then run in parallel to achieve higher GPU occupancy.
Particularly beneficial for large numbers of dispatches with low thread counts.
< Diagram: command stream in which groups of Dispatch commands run back to back, with a WaitForIdle only between groups >
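
To make the idea of dropping unnecessary waits concrete, here is a minimal sketch in portable C++. It does not call Direct3D; `DispatchDesc` and `placeUavBarriers` are hypothetical names, and resources are plain integer ids. The sketch is conservative: it asks for a barrier whenever a dispatch touches a resource written since the last barrier, or writes one that was read since then. An app that knows two writes to the same UAV are order-independent could relax this further.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical description of one compute dispatch: the UAV resources
// (arbitrary integer ids) it reads and the ones it writes.
struct DispatchDesc {
    std::set<int> reads;
    std::set<int> writes;
};

// Returns the indices i for which a UAV barrier is needed *before* dispatch i.
// Dispatches between two barriers have no tracked hazards, so they may overlap.
std::vector<std::size_t> placeUavBarriers(const std::vector<DispatchDesc>& dispatches) {
    std::vector<std::size_t> barriers;
    std::set<int> pendingWrites;  // resources written since the last barrier
    std::set<int> pendingReads;   // resources read since the last barrier
    for (std::size_t i = 0; i < dispatches.size(); ++i) {
        const DispatchDesc& d = dispatches[i];
        bool hazard = false;
        for (int r : d.reads) {
            hazard = hazard || pendingWrites.count(r) > 0;  // read-after-write
        }
        for (int w : d.writes) {
            hazard = hazard || pendingWrites.count(w) > 0   // write-after-write
                            || pendingReads.count(w) > 0;   // write-after-read
        }
        if (hazard) {
            barriers.push_back(i);
            pendingWrites.clear();
            pendingReads.clear();
        }
        pendingReads.insert(d.reads.begin(), d.reads.end());
        pendingWrites.insert(d.writes.begin(), d.writes.end());
    }
    return barriers;
}
```

With three dispatches that write to different resources followed by one that reads all three, the sketch asks for a single barrier before the fourth dispatch instead of one wait after every dispatch.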
GPU Efficiency: Explicit resource transitions (cont’d)
Second, the ResourceBarrier API allows the application to perform transitions over a period of time.
The app specifies starting/destination states at ‘begin’ and ‘end’ ResourceBarrier calls, and promises not to use the resource while it is in transition.
The driver can use this information to eliminate redundant pipeline stalls and cache flushes.
GPU Efficiency: Explicit resource transitions (cont’d)
Example rendering scenario (before), API calls with the hardware commands they generate:
Draw call that renders to Tex1
Resource Barrier (Tex1), Render Target -> SRV: driver emits ‘WaitForIdle’ command
SetDescriptorHeap: driver emits ‘WaitForIdle’ command
Bind Tex1 as SRV, sample in Draw call
…
Example rendering scenario (after), API calls with the hardware commands they generate:
Draw call that renders to Tex1
Resource Barrier (Tex1), Render Target -> SRV, BEGIN
SetDescriptorHeap: driver emits ‘WaitForIdle’ command
Resource Barrier (Tex1), Render Target -> SRV, END
Bind Tex1 as SRV, sample in Draw call
…
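
The begin/end contract can be sketched as a small state tracker. This is a model in portable C++ and not Direct3D code: `ResState` and `SplitBarrierTracker` are hypothetical names, and the rule that the ‘end’ call repeats the same before/after pair as the ‘begin’ call is an assumption of the sketch. In the shipping API the two halves are expressed with the `D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY` and `D3D12_RESOURCE_BARRIER_FLAG_END_ONLY` flags.

```cpp
#include <cassert>

// Hypothetical stand-ins for two resource states.
enum class ResState { RenderTarget, ShaderResource };

// Tracks one resource through a split ('begin' / 'end') transition.
class SplitBarrierTracker {
public:
    explicit SplitBarrierTracker(ResState initial) : current_(initial) {}

    // 'Begin' half: records the intended transition. From here until end(),
    // the resource is off-limits.
    bool begin(ResState before, ResState after) {
        if (inTransition_ || before != current_) return false;
        inTransition_ = true;
        target_ = after;
        return true;
    }

    // 'End' half: must name the same before/after pair as begin().
    bool end(ResState before, ResState after) {
        if (!inTransition_ || before != current_ || after != target_) return false;
        inTransition_ = false;
        current_ = after;
        return true;
    }

    // A use is legal only when no transition is in flight and the state matches.
    bool canUse(ResState required) const {
        return !inTransition_ && current_ == required;
    }

private:
    ResState current_;
    ResState target_ = ResState::RenderTarget;
    bool inTransition_ = false;
};
```

In the ‘after’ scenario above, Tex1 would call `begin` right after the draw that renders to it and `end` just before it is sampled, so the stall already caused by SetDescriptorHeap can cover the transition.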
GPU Efficiency: Parallel GPU execution
Modern hardware can run multiple workloads in parallel on multiple ‘engines’.
DirectX 12 allows games to target engines explicitly. The developer knows best which operations can happen in parallel and what the dependencies are.
Three engine types are exposed in DirectX 12: 3D, Compute, Copy.
It is up to the app to know and manage dependencies between queues.
GPU Efficiency: Parallel GPU execution (cont’d)
The copy engine type is great for moving data around without blocking or interrupting the main 3D engine.
Two notable use cases:
Texture streaming
‘Lazy’ CPU readback
Especially great if going across PCI-E
Demo
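
For texture streaming, the dependency the app has to manage is "do not sample a texture until its upload on the copy queue has finished". A minimal sketch of that bookkeeping, in portable C++ with a hypothetical `StreamingTracker` class: each upload is tagged with the fence value the copy queue will signal after it, and a texture counts as ready once the completed fence value has reached that tag. In a real renderer the values would come from `ID3D12CommandQueue::Signal` and `ID3D12Fence::GetCompletedValue`; here they are plain integers.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical bookkeeping for uploads issued on a copy queue.
class StreamingTracker {
public:
    // Records an upload and returns the fence value that the copy queue is
    // assumed to signal once this upload has finished.
    std::uint64_t scheduleUpload(int textureId) {
        const std::uint64_t value = ++lastSignaled_;
        readyAt_[textureId] = value;
        return value;
    }

    // True once the copy queue's completed fence value covers the upload.
    // Unknown textures are never ready.
    bool isReady(int textureId, std::uint64_t completedFenceValue) const {
        const auto it = readyAt_.find(textureId);
        return it != readyAt_.end() && it->second <= completedFenceValue;
    }

private:
    std::uint64_t lastSignaled_ = 0;
    std::unordered_map<int, std::uint64_t> readyAt_;
};
```

Until `isReady` returns true, the 3D queue can keep rendering with a lower-resolution fallback rather than waiting on the copy.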
GPU Efficiency: Parallel GPU execution
< GPUView comparison between serial/parallel execution >
GPU Efficiency: Parallel GPU execution (cont’d)
Really excited about compute engine scenarios as well.
Two notable use cases:
Long-running, low-priority compute work
Tightly interleaved 3D/Compute work within a frame
Get the gain from running different types of workloads that stress different parts of the GPU.
Canonical example: compute-heavy dispatches during shadow map generation.
GPU Efficiency: GPU-generated workloads
ID3D11Asynchronous -> ID3D12QueryHeap
Query Heaps generalize query functionality: output can be stored into any buffer on the GPU or in system memory.
ID3D12CommandList::ResolveQueryData(ID3D12QueryHeap *pQueryHeap, D3D12_QUERY_TYPE Type, UINT StartElement, UINT ElementCount, ID3D12Resource *pDestinationBuffer, UINT64 AlignedDestinationBufferOffset)
Two key performance opportunities:
Binary occlusion
Batched query ‘resolve’ operations
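
Because ResolveQueryData takes a start element and an element count, one way to batch resolves is to merge the query slots used this frame into contiguous ranges and issue one call per range. The sketch below shows only that merging step in portable C++; `QueryRange` and `coalesceQueryRanges` are hypothetical names and no Direct3D call is made.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One contiguous run of query slots: maps onto the (StartElement, ElementCount)
// pair of a single resolve call.
struct QueryRange {
    std::uint32_t start;
    std::uint32_t count;
};

// Merges an unordered list of used query slots (duplicates allowed) into the
// smallest set of contiguous ranges.
std::vector<QueryRange> coalesceQueryRanges(std::vector<std::uint32_t> slots) {
    std::sort(slots.begin(), slots.end());
    slots.erase(std::unique(slots.begin(), slots.end()), slots.end());

    std::vector<QueryRange> ranges;
    for (std::uint32_t slot : slots) {
        if (!ranges.empty() && ranges.back().start + ranges.back().count == slot) {
            ++ranges.back().count;                  // extends the current run
        } else {
            ranges.push_back(QueryRange{slot, 1});  // starts a new run
        }
    }
    return ranges;
}
```

Allocating query slots linearly each frame keeps the result at a single range, which is the cheapest case.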
GPU Efficiency: GPU-generated workloads (cont’d)
Predication has also been generalized.
ID3D12CommandList::SetPredication(ID3D12Resource *pBuffer, UINT64 AlignedBufferOffset, D3D12_PREDICATION_OP Operation)
Predicate on a general buffer (query-derived, CPU-populated, or GPU-populated), which enables new rendering scenarios.
GPU Efficiency: GPU-generated workloads (cont’d)
ExecuteIndirect: powerful new API for executing GPU-generated Draw/Dispatch workloads
Broad hardware compatibility
Can vary the following between invocations:
Vertex/Index buffers
Root constants
Inline SRV/UAV/CBV descriptors
Enables new scenarios, dramatic efficiency improvements
GPU Efficiency: GPU-generated workloads (cont’d)
Demo
Always going to be very efficient; two ways to maximize:
Set a proper ‘max count’, or just use the CPU count.
Group ExecuteIndirect calls together, and ideally put space between the generation and the consumption of arguments.
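
A rough CPU-side model of the argument buffer and the ‘max count’ follows. It is a sketch in portable C++, not Direct3D code: `DrawArgs` mirrors the four 32-bit fields of `D3D12_DRAW_ARGUMENTS`, while `SceneObject`, `buildArguments` and `commandsExecuted` are hypothetical names. The culling loop stands in for a GPU pass that would write the arguments and the count. When a count buffer is supplied, the number of commands run is the smaller of the count in that buffer and the maximum passed by the CPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Same four 32-bit fields as D3D12_DRAW_ARGUMENTS.
struct DrawArgs {
    std::uint32_t vertexCountPerInstance;
    std::uint32_t instanceCount;
    std::uint32_t startVertexLocation;
    std::uint32_t startInstanceLocation;
};

// Hypothetical scene entry used only for this sketch.
struct SceneObject {
    std::uint32_t vertexCount;
    bool visible;
};

// Stand-in for a GPU culling pass: writes one argument record per visible
// object, tightly packed, so the record count doubles as the draw count.
std::vector<DrawArgs> buildArguments(const std::vector<SceneObject>& objects) {
    std::vector<DrawArgs> args;
    for (const SceneObject& o : objects) {
        if (o.visible) {
            args.push_back(DrawArgs{o.vertexCount, 1u, 0u, 0u});
        }
    }
    return args;
}

// Number of commands that run when a count buffer is used: the GPU-written
// count, clamped to the maximum the CPU passed in.
std::uint32_t commandsExecuted(std::uint32_t countBufferValue,
                               std::uint32_t maxCommandCount) {
    return std::min(countBufferValue, maxCommandCount);
}
```

A tight maximum matters because it is the only bound the CPU side knows when the call is recorded.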
Reducing CPU Overhead
CPU Overhead
Many improvements just for showing up:
No high-frequency ref-counting
No hazard tracking
No state shadowing
Three other opportunities to take advantage of:
Resource Binding
Multi-threading
Memory allocation
CPU Overhead: Resource Binding
What’s new:
Descriptor Heap access
Root Signatures
Descriptor Heap: Actual GPU memory that contains resource access metadata
Root Signature: Binding parameters that can be passed to a shader invocation. Can contain:
Location in descriptor heap
‘Inline’ descriptors
Actual constant data
CPU Overhead: Resource Binding (cont’d)
Descriptor Heap best practices
Do: keep your descriptor heap as static as possible.
Avoid: frequently changing descriptor heaps.
Root Signature best practices
Do: keep your root signature small
Do: take advantage of inline descriptors/data
Avoid: binding unnecessary pipeline stages
This is an area where you can move the needle on CPU performance; take advantage of the new flexibility here.
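
"Small" can be made measurable. In the shipping API a root signature is limited to 64 DWORDs, where a descriptor table costs 1 DWORD, each 32-bit root constant costs 1 DWORD, and an inline (root) descriptor costs 2 DWORDs. The sketch below adds up that cost in portable C++; `RootParamKind`, `RootParam` and the two functions are hypothetical names, not Direct3D types.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the three kinds of root parameter.
enum class RootParamKind { DescriptorTable, RootConstants, RootDescriptor };

struct RootParam {
    RootParamKind kind;
    std::uint32_t num32BitValues;  // only meaningful for RootConstants
};

const std::uint32_t kMaxRootSignatureDwords = 64;

// Sums the DWORD cost of a root signature layout.
std::uint32_t rootSignatureCostDwords(const std::vector<RootParam>& params) {
    std::uint32_t total = 0;
    for (const RootParam& p : params) {
        switch (p.kind) {
            case RootParamKind::DescriptorTable: total += 1; break;
            case RootParamKind::RootConstants:   total += p.num32BitValues; break;
            case RootParamKind::RootDescriptor:  total += 2; break;
        }
    }
    return total;
}

bool fitsInRootSignature(const std::vector<RootParam>& params) {
    return rootSignatureCostDwords(params) <= kMaxRootSignatureDwords;
}
```

A layout with one descriptor table, four root constants and one inline CBV costs 7 DWORDs, well inside the limit; a full 4x4 matrix passed as root constants costs 16 on its own.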
CPU Overhead: Multi-threading
In DirectX 11, the driver created a background thread outside the app’s control.
In DirectX 12, multi-threading is app-controlled and a first-class citizen via ID3D12CommandList.
Not just command lists: you can create PSOs and buffers/textures on background threads.
Recommendation: Serial workload? Create your own background submission thread.
CPU Overhead: Resource allocation
In DirectX 11, the driver managed versioning and sub-allocation behind the app’s back.
DirectX 12 provides tools like fences, resource placement, and persistently mapped resources to put apps in charge.
Recommendations:
Use an appropriate number of fences
Expire resources based on engine knowledge
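
One common way to follow both recommendations is a deferred-release queue keyed by a single per-frame fence value. The sketch below is portable C++ with a hypothetical `DeferredReleaseQueue` class; resources are integer ids and fence values are plain integers that, in a real renderer, would come from the queue's fence. Many resources share one fence value, so the number of fences stays small.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical queue of resources waiting for the GPU to finish with them.
class DeferredReleaseQueue {
public:
    // Called when the engine is done with a resource. fenceValue is the value
    // the queue will signal after the last GPU work that uses the resource.
    // Values are assumed to be passed in non-decreasing order.
    void retire(int resourceId, std::uint64_t fenceValue) {
        assert(pending_.empty() || pending_.back().fenceValue <= fenceValue);
        pending_.push_back(Entry{resourceId, fenceValue});
    }

    // Removes and returns every resource whose fence value has been reached.
    std::vector<int> collect(std::uint64_t completedFenceValue) {
        std::vector<int> freed;
        while (!pending_.empty() &&
               pending_.front().fenceValue <= completedFenceValue) {
            freed.push_back(pending_.front().resourceId);
            pending_.pop_front();
        }
        return freed;
    }

    std::size_t pendingCount() const { return pending_.size(); }

private:
    struct Entry {
        int resourceId;
        std::uint64_t fenceValue;
    };
    std::deque<Entry> pending_;
};
```

Calling `collect` once per frame with the fence's completed value is enough; nothing is tracked per draw or per resource use.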
Ashes of the Singularity case study
Dan Baker
Graphics Architect, Oxide Games
Resource Binding in Nitrous
Nitrous was designed from the start to map to hardware binding models
Three key engine design points:
Textures pre-grouped in descriptor heap
Bindings shared across shader stages, so fewer bind calls
Built around Static Samplers
Findings:
Easy to stay within one descriptor heap/frame
Important to avoid redundant state sets
Optional usage of Root CBVs can provide a win
Result: resource binding overhead is a fraction of what it is on D3D11
Resource Management in Nitrous
Nitrous also benefits from more explicit resource management
Two classes of resources:
Formally tracked, persistent resources
Temporary, frame-specific resources
Frame-specific resources are linearly allocated out of a heap with no resource tracking, so overhead is minimal
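
Linear allocation of frame-specific resources can be sketched as a bump allocator that hands out offsets into one large heap. This is a generic illustration in portable C++, not Nitrous code: `FrameLinearAllocator` and `kInvalidOffset` are hypothetical names, and the heap is represented only by its capacity.

```cpp
#include <cassert>
#include <cstdint>

// Returned when the heap has no room left for a request.
const std::uint64_t kInvalidOffset = ~static_cast<std::uint64_t>(0);

// Hypothetical bump allocator over a heap of 'capacity' bytes.
class FrameLinearAllocator {
public:
    explicit FrameLinearAllocator(std::uint64_t capacity) : capacity_(capacity) {}

    // Returns the offset of a block of 'size' bytes aligned to 'alignment'
    // (which must be a power of two), or kInvalidOffset if it does not fit.
    std::uint64_t allocate(std::uint64_t size, std::uint64_t alignment) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::uint64_t aligned = (offset_ + alignment - 1) & ~(alignment - 1);
        if (aligned > capacity_ || size > capacity_ - aligned) {
            return kInvalidOffset;
        }
        offset_ = aligned + size;
        return aligned;
    }

    // Frees everything at once. No per-resource bookkeeping is needed.
    void reset() { offset_ = 0; }

    std::uint64_t used() const { return offset_; }

private:
    std::uint64_t capacity_;
    std::uint64_t offset_ = 0;
};
```

`reset` should only be called once the GPU has finished the frame that used the memory, for example after the frame's fence has been reached; keeping one allocator per in-flight frame is a simple way to guarantee that.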
Demo
Conclusion
Many opportunities with DirectX 12 to achieve dramatic performance improvements in your game
Get started today!
Enroll in the Early Access program at http://aka.ms/dxeap to receive the latest SDK, DirectX 12 drivers, documentation, …
Check out Channel9 for previous DirectX 12 talks
Q/A
Backup
< Would need to explain ‘residency’, how this worked in DX11 >
< WDDM2 residency management provides flexibility/performance. >
< Don’t need to track resource usage/frame if memory usage isn’t a concern – keep it all resident. >