Improving Performance in Your Game

Uploaded On 2015-10-10

Slide2

DirectX 12: Improving Performance in Your Game

Bennett Sorbo

Program Manager

Direct3D, Windows Graphics

Slide3

Agenda

Overview

Improving GPU efficiency

Reducing CPU overhead

Summary / Next Steps

Slide4

Overview

DirectX 12 provides a single API for low-level access to a variety of GPU hardware

Enables games to leverage higher-level knowledge to achieve great performance gains

Today, we’ll discuss best practices for specific DirectX 12 features to achieve these gains in your game

Slide5

Increasing GPU efficiency

Slide6

GPU Efficiency

Three key areas for GPU-side gains:

Explicit resource transitions

Parallel GPU execution

GPU-generated workloads

Slide7

Modern GPUs require resources to be in different ‘states’ for different use cases, and require knowledge of when transitions between these states need to occur

In DirectX 12, the app is responsible for identifying when these transitions need to occur.

Making these transitions explicit makes it clear when operations are expensive...

GPU Efficiency: Explicit resource transitions

Slide8

... but also gives games the opportunity to eliminate unnecessary transitions. Two key opportunities:

First, UAV synchronization is now exposed as an explicit resource barrier.

Previously, the driver would ensure all writes to a UAV were in order of dispatch by inserting “Wait for Idle” commands after each dispatch.

GPU Efficiency: Explicit resource transitions (cont’d)

[Diagram: Dispatch, WaitForIdle, Dispatch, WaitForIdle, Dispatch, WaitForIdle, Dispatch]

Slide9

If the app has high-level knowledge that dispatches can run out of order, the WaitForIdles can be removed

But more importantly, dispatches can then run in parallel to achieve higher GPU occupancy

Particularly beneficial for large numbers of dispatches with low thread counts
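A back-of-the-envelope model of why removing the waits helps. This is only a sketch under stated assumptions: a hypothetical GPU with a fixed number of thread slots, one time step per full wave of threads, and ideal packing when dispatches overlap. The function names and numbers are illustrative and are not part of any DirectX API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: the GPU has `slots` thread slots, and a dispatch with
// `threads` threads needs ceil(threads / slots) time steps when it runs alone.

// Serial case: a WaitForIdle after every dispatch, so each one runs alone.
inline std::uint64_t SerialSteps(const std::vector<std::uint64_t>& dispatchThreads,
                                 std::uint64_t slots) {
    assert(slots > 0);
    std::uint64_t steps = 0;
    for (std::uint64_t threads : dispatchThreads) {
        steps += (threads + slots - 1) / slots;  // ceiling division
    }
    return steps;
}

// Overlapped case: no waits between independent dispatches, idealized as
// packing the threads of all dispatches onto the machine together.
inline std::uint64_t OverlappedSteps(const std::vector<std::uint64_t>& dispatchThreads,
                                     std::uint64_t slots) {
    assert(slots > 0);
    std::uint64_t totalThreads = 0;
    for (std::uint64_t threads : dispatchThreads) {
        totalThreads += threads;
    }
    return (totalThreads + slots - 1) / slots;  // ceiling division
}
```

In this model, eight dispatches of 64 threads each on 1024 slots take 8 steps serially but 1 step overlapped, which is the "many dispatches, low thread counts" case from the slide.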

GPU Efficiency: Explicit resource transitions (cont’d)

[Diagram: groups of Dispatch commands, with only two WaitForIdle commands remaining between them]

Slide10

Second, the ResourceBarrier API allows the application to perform transitions over a period of time.

The app specifies starting/destination states at ‘begin’ and ‘end’ ResourceBarrier calls, and promises not to use the resource while it is in transition.

The driver can use this information to eliminate redundant pipeline stalls and cache flushes
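The begin/end contract above can be sketched as a small CPU-side state tracker. All types here are stand-ins invented for illustration; in shipping D3D12 the two halves correspond to the D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY and D3D12_RESOURCE_BARRIER_FLAG_END_ONLY barrier flags, and nothing below calls the real API.

```cpp
#include <cassert>

// Stand-in resource states (hypothetical; a real app would use D3D12 states).
enum class State { RenderTarget, ShaderResource };

struct Transition {
    State before;
    State after;
};

class TrackedResource {
public:
    explicit TrackedResource(State initial)
        : state_(initial), target_(initial), inTransition_(false) {}

    // 'Begin' half: announce the transition. From here on the resource is
    // off limits until the matching 'end'.
    bool Begin(Transition t) {
        if (inTransition_ || t.before != state_) return false;
        target_ = t.after;
        inTransition_ = true;
        return true;
    }

    // 'End' half: must name the same before/after states as the 'begin'.
    bool End(Transition t) {
        if (!inTransition_ || t.before != state_ || t.after != target_) return false;
        state_ = target_;
        inTransition_ = false;
        return true;
    }

    // The app's promise: no use while a transition is in flight.
    bool UsableAs(State s) const { return !inTransition_ && state_ == s; }

private:
    State state_;
    State target_;
    bool inTransition_;
};
```

A debug layer in an engine could run a tracker like this alongside command recording to catch a use of the resource between the two halves.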

GPU Efficiency: Explicit resource transitions (cont’d)

Slide11

Example rendering scenario (before)

API call: hardware command

Draw call that renders to Tex1

Resource Barrier (Tex1), Render Target -> SRV: driver emits ‘WaitForIdle’ command

SetDescriptorHeap: driver emits ‘WaitForIdle’ command

Bind Tex1 as SRV, sample in Draw call

Example rendering scenario (after)

API call: hardware command

Draw call that renders to Tex1

Resource Barrier (Tex1), Render Target -> SRV, BEGIN

SetDescriptorHeap: driver emits ‘WaitForIdle’ command

Resource Barrier (Tex1), Render Target -> SRV, END

Bind Tex1 as SRV, sample in Draw call

GPU Efficiency: Explicit resource transitions (cont’d)

Slide12

Modern hardware has the ability to run multiple workloads in parallel on multiple ‘engines’

DirectX 12 allows games to target engines explicitly. The developer knows best about what operations can happen in parallel, what the dependencies are

Three engine types exposed in DirectX 12: 3D, Compute, Copy

It is up to the app to know and manage dependencies between queues
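A minimal simulation of a cross-queue dependency, assuming the fence model that D3D12 exposes through ID3D12CommandQueue::Signal and ID3D12CommandQueue::Wait. The Fence, Command, and Run names are hypothetical stand-ins, and the round-robin scheduler is only one arbitrary interleaving, not how hardware schedules engines.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in fence: a monotonically increasing completed value.
struct Fence {
    std::uint64_t completed = 0;
};

struct Command {
    enum Kind { Work, Signal, Wait } kind;
    std::string name;     // used by Work
    Fence* fence;         // used by Signal and Wait
    std::uint64_t value;  // used by Signal and Wait
};

using Queue = std::vector<Command>;

// Round-robin over the queues, at most one command per queue per pass.
// Returns the Work names in execution order. Stops when every queue is
// finished or blocked on a Wait.
inline std::vector<std::string> Run(const std::vector<Queue>& queues) {
    std::vector<std::size_t> next(queues.size(), 0);
    std::vector<std::string> order;
    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (std::size_t q = 0; q < queues.size(); ++q) {
            if (next[q] >= queues[q].size()) continue;
            const Command& c = queues[q][next[q]];
            if (c.kind == Command::Wait && c.fence->completed < c.value) continue;  // blocked
            if (c.kind == Command::Signal) c.fence->completed = c.value;
            if (c.kind == Command::Work) order.push_back(c.name);
            ++next[q];
            progressed = true;
        }
    }
    return order;
}
```

For example, a 3D queue that waits on fence value 1 before drawing, paired with a copy queue that uploads a texture and then signals 1, always draws after the copy regardless of which queue the scheduler visits first.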

GPU Efficiency: Parallel GPU execution

Slide13

The copy engine type is great for getting data around without blocking/interrupting the main 3D engine.

Two notable use cases:

Texture streaming

‘Lazy’ CPU readback

Especially great if going across PCI-E

Demo

GPU Efficiency: Parallel GPU execution (cont’d)

Slide14

< GPUView comparison between serial/parallel execution >

GPU Efficiency: Parallel GPU execution

Slide15

Really excited about compute engine scenarios as well

Two notable use cases:

Long-running, low priority compute work

Tightly interleaved 3D/Compute work within a frame

Get the gain from running different types of workloads that stress different parts of GPU

Canonical example: compute-heavy dispatches during shadow map generation.

GPU Efficiency: Parallel GPU execution (cont’d)

Slide16

ID3D11Asynchronous -> ID3D12QueryHeap

Query Heaps generalize query functionality – output stored into any buffer on the GPU or in system memory.

ID3D12CommandList::ResolveQueryData(ID3D12QueryHeap *pQueryHeap, D3D12_QUERY_TYPE Type, UINT StartElement, UINT ElementCount, ID3D12Resource *pDestinationBuffer, UINT64 AlignedDestinationBufferOffset)

Two key performance opportunities:

Binary occlusion

Batched query ‘resolve’ operations

GPU Efficiency: GPU-generated workloads

Slide17

Predication has also been generalized

ID3D12CommandList::SetPredication(ID3D12Resource *pBuffer, UINT64 AlignedBufferOffset, D3D12_PREDICATION_OP Operation)

Predicate on general buffer: query-derived, CPU-populated, GPU-populated – enables new rendering scenarios

GPU Efficiency: GPU-generated workloads (cont’d)

Slide18

ExecuteIndirect – powerful new API for executing GPU-generated Draw/Dispatch workloads

Broad hardware compatibility

Can vary the following between invocations:

Vertex/Index buffers

Root constants

Inline SRV/UAV/CBV descriptors

Enables new scenarios, dramatic efficiency improvements

GPU Efficiency: GPU-generated workloads (cont’d)

Slide19

Demo

ExecuteIndirect is always going to be very efficient; two ways to maximize that efficiency:

Set a proper ‘max count’, or just use the CPU count.

Group these together, and ideally put space between generation and consumption of arguments.
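A CPU-side sketch of how the ‘max count’ and a GPU-written count interact, assuming the semantics of the shipping ExecuteIndirect call: when a count buffer is supplied, the number of commands executed is the smaller of the max count and the value in that buffer; without one, the max count is the exact count. DrawArgs and both function names are hypothetical, and the loop only emulates what the GPU would consume.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical argument record: one draw per record.
struct DrawArgs {
    std::uint32_t vertexCount;
    std::uint32_t instanceCount;
};

// `gpuCount` models the optional GPU-written count buffer (nullptr = absent).
// Returns how many argument records are actually executed.
inline std::uint32_t EffectiveCount(std::uint32_t maxCount, const std::uint32_t* gpuCount) {
    return gpuCount ? std::min(maxCount, *gpuCount) : maxCount;
}

// Emulates consumption of the argument buffer and returns the total number
// of vertices submitted, as a simple observable result.
inline std::uint64_t EmulateExecuteIndirect(const std::vector<DrawArgs>& args,
                                            std::uint32_t maxCount,
                                            const std::uint32_t* gpuCount) {
    const std::uint32_t n = EffectiveCount(maxCount, gpuCount);
    assert(n <= args.size());  // the argument buffer must hold every executed record
    std::uint64_t verticesSubmitted = 0;
    for (std::uint32_t i = 0; i < n; ++i) {
        verticesSubmitted += std::uint64_t(args[i].vertexCount) * args[i].instanceCount;
    }
    return verticesSubmitted;
}
```

The point of the model is that the max count is an upper bound the CPU commits to up front, so a value close to the real worst case keeps that bound meaningful.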

GPU Efficiency: GPU-generated workloads (cont’d)

Slide20

Reducing CPU Overhead

Slide21

CPU Overhead

Many improvements just for showing up:

No high-frequency ref-counting

No hazard tracking

No state shadowing

Three other opportunities to take advantage of:

Resource Binding

Multi-threading

Memory allocation

Slide22

CPU Overhead: Resource Binding

What’s new:

Descriptor Heap access

Root Signatures

Descriptor Heap: Actual GPU memory that contains resource access metadata

Root Signature: Binding parameters that can be passed to a shader invocation. Can contain:

Location in descriptor heap

‘Inline’ descriptors

Actual constant data

Slide23

Descriptor Heap best practices

Do: keep your descriptor heap as static as possible.

Avoid: frequently changing descriptor heaps.

Root Signature best practices

Do: keep your root signature small

Do: take advantage of inline descriptors/data

Avoid: binding unnecessary pipeline stages

This is an area where you can move the needle on CPU performance – take advantage of the new flexibility here.
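To make "keep your root signature small" concrete, here is a small budget calculator. It assumes the limits Microsoft documents for shipping D3D12 (64 DWORDs in total; 1 DWORD per descriptor table, 1 DWORD per 32-bit root constant, 2 DWORDs per inline root descriptor), which may differ from the early-access SDK this talk describes. RootLayout and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed budget: 64 DWORDs per root signature (documented D3D12 limit).
constexpr std::uint32_t kMaxRootDwords = 64;

// Hypothetical summary of a root signature's contents.
struct RootLayout {
    std::uint32_t descriptorTables;  // 'location in descriptor heap' entries
    std::uint32_t rootConstants32;   // number of 32-bit constants ('actual constant data')
    std::uint32_t rootDescriptors;   // 'inline' CBV/SRV/UAV descriptors
};

// Cost model: tables and constants cost 1 DWORD each, inline descriptors 2.
constexpr std::uint32_t CostInDwords(const RootLayout& l) {
    return l.descriptorTables * 1 + l.rootConstants32 * 1 + l.rootDescriptors * 2;
}

constexpr bool Fits(const RootLayout& l) {
    return CostInDwords(l) <= kMaxRootDwords;
}
```

For instance, two descriptor tables, four root constants, and one inline CBV cost 8 DWORDs, leaving most of the budget unused, while 60 root constants plus three inline descriptors would cost 66 and not fit.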

CPU Overhead: Resource Binding (cont’d)

Slide24

CPU Overhead: Multi-threading

In DirectX 11, the driver created a background thread outside the app’s control.

In DirectX 12, multi-threading is an app-controlled, first-class citizen via ID3D12CommandList.

Not just command lists: you can create PSOs and buffers/textures on background threads.
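A sketch of the app-controlled pattern, assuming one command list per worker thread and a fixed submission order chosen by the app. CommandList here is a hypothetical stand-in that only logs strings; no Direct3D objects are involved.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a command list: a log of recorded commands.
struct CommandList {
    std::vector<std::string> commands;
};

// Each worker records into its own list (nothing is shared, so no locks are
// needed), then the lists are submitted in a fixed, app-chosen order.
inline std::vector<std::string> RecordInParallel(std::size_t workerCount,
                                                 std::size_t drawsPerWorker) {
    std::vector<CommandList> lists(workerCount);  // sized before any thread starts
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < workerCount; ++w) {
        workers.emplace_back([&lists, w, drawsPerWorker] {
            for (std::size_t d = 0; d < drawsPerWorker; ++d) {
                lists[w].commands.push_back("draw " + std::to_string(w) + "." +
                                            std::to_string(d));
            }
        });
    }
    for (std::thread& t : workers) t.join();

    // "Submit": concatenate in list order, independent of thread timing.
    std::vector<std::string> submitted;
    for (const CommandList& cl : lists) {
        submitted.insert(submitted.end(), cl.commands.begin(), cl.commands.end());
    }
    return submitted;
}
```

Because each thread touches only its own list, the submitted order is deterministic even though recording runs concurrently.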

Recommendation: Serial workload? Create your own background submission thread.

Slide25

CPU Overhead: Resource allocation

In DirectX 11, the driver managed versioning and sub-allocation behind the app’s back.

DirectX 12 provides tools like fences, resource placement, and persistently-mapped resources to put apps in charge.
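One way an app might use fences to take over that job is a deferred-release queue: each retired resource is tagged with the fence value that will be signaled after its last GPU use, and it is freed once the fence has passed that value. This is a sketch with hypothetical names; resources are represented by strings, and the completed value would come from ID3D12Fence::GetCompletedValue in a real app.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical deferred-release queue keyed by fence value.
class DeferredReleaseQueue {
public:
    // `fenceValue` is the value signaled after the last GPU work that uses
    // the resource. Values must be enqueued in non-decreasing order.
    void Retire(std::string resource, std::uint64_t fenceValue) {
        assert(pending_.empty() || pending_.back().second <= fenceValue);
        pending_.emplace_back(std::move(resource), fenceValue);
    }

    // Call with the fence's completed value. Returns the resources that are
    // now safe to free or reuse, oldest first.
    std::vector<std::string> Collect(std::uint64_t completedValue) {
        std::vector<std::string> freed;
        while (!pending_.empty() && pending_.front().second <= completedValue) {
            freed.push_back(std::move(pending_.front().first));
            pending_.pop_front();
        }
        return freed;
    }

    std::size_t PendingCount() const { return pending_.size(); }

private:
    std::deque<std::pair<std::string, std::uint64_t>> pending_;
};
```

A single fence value per frame is enough to drive this queue, which is one reading of using an appropriate number of fences rather than one per resource.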

Recommendations:

Use appropriate number of fences

Expire resources based on engine knowledge

Slide26

Ashes of the Singularity case study

Dan Baker

Graphics Architect, Oxide Games

Slide27

Resource Binding in Nitrous

Nitrous designed from start to map to hardware binding models

Three key engine design points:

Textures pre-grouped in descriptor heap

Bindings shared across shader stages – fewer bind calls

Built around Static Samplers

Findings:

Easy to stay within one descriptor heap/frame

Important to avoid redundant state sets

Optional usage of Root CBVs can provide a win

Result: resource binding overhead is a fraction of what it is on D3D11

Slide28

Resource Management in Nitrous

Nitrous also benefits from more explicit resource management

Two classes of resources:

Formally tracked, persistent resources

Temporary, frame-specific resources
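The temporary, frame-specific class can be served by a linear (bump) allocator. The sketch below is a generic version of that technique, not Nitrous code: the class name, the offset-based interface, and the failure sentinel are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Sentinel returned when the heap is out of space (hypothetical convention).
constexpr std::uint64_t kAllocFailed = ~std::uint64_t(0);

// Bump allocator over one pre-created heap. It hands out byte offsets, and
// per-allocation tracking is replaced by a single Reset() per frame.
class LinearFrameAllocator {
public:
    explicit LinearFrameAllocator(std::uint64_t capacity) : capacity_(capacity) {}

    // `alignment` must be a power of two. Returns kAllocFailed when full.
    std::uint64_t Allocate(std::uint64_t size, std::uint64_t alignment) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::uint64_t start = (head_ + alignment - 1) & ~(alignment - 1);
        if (start > capacity_ || size > capacity_ - start) return kAllocFailed;
        head_ = start + size;
        return start;
    }

    // Call once the GPU has finished the frame that used these allocations.
    void Reset() { head_ = 0; }

    std::uint64_t Used() const { return head_; }

private:
    std::uint64_t capacity_;
    std::uint64_t head_ = 0;
};
```

Allocation is a few integer operations and freeing is a single store, which is where the minimal overhead comes from; the cost is that nothing can be released individually.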

Frame-specific resources are linearly allocated out of a heap, with no resource tracking – minimal overhead

Slide29

Demo

Slide30

Conclusion

Many opportunities with DirectX 12 to achieve dramatic performance improvements in your game

Get started today!

Enroll in the Early Access program at http://aka.ms/dxeap to receive the latest SDK, DirectX 12 drivers, documentation, …

Check out Channel9 for previous DirectX 12 talks

Q/A

Slide31
Slide32

Backup

< Would need to explain ‘residency’, how this worked in DX11 >

< WDDM2 residency management provides flexibility/performance. >

< Don’t need to track resource usage/frame if memory usage isn’t a concern – keep it all resident. >