Rendering Battlefield 4


Slide1

Rendering Battlefield 4 with Mantle

Johan Andersson – Electronic Arts

Slide2

Slide3

Core i7-3970X, AMD Radeon R9 290x, 1080p ULTRA:

DX11:   Avg: 78 fps,  Min: 42 fps
Mantle: Avg: 120 fps, Min: 94 fps

+58%!

Slide4

BF4 Mantle goals

Goals:
- Significantly improve CPU performance
- More consistent & stable performance
- Improve GPU performance where possible
- Add support for a new Mantle rendering backend in a live game
  - Minimize changes to engine interfaces
  - Compatible with built PC content
- Work on a wide set of hardware
  - APU to quad-GPU
  - But x64 only (32-bit Windows needs to die)

Non-goals:
- Design a new renderer from scratch for Mantle
- Take advantage of asymmetric MGPU (APU + discrete)
- Optimize video memory consumption

Slide5

BF4 Mantle strategic goals

- Prove that low-level graphics APIs work outside of consoles
- Push the industry towards low-level graphics APIs everywhere
- Build a foundation for the future that we can build great games on

Slide6

Shaders

Slide7

Shaders

- Shader resource bind points replaced with a resource table object: the descriptor set
  - This is how the hardware accesses the shader resources
  - Flat list of images, buffers and samplers used by any of the shader stages
- Vertex shader streams converted to vertex shader buffer loads
- Engine assigns each shader resource to a specific slot in the descriptor set(s)
  - Can share slots between shader stages = smaller descriptor sets
- The mapping takes a while to wrap one's head around

Slide8

Shader conversion

- DX11 bytecode shaders get converted to AMDIL & the mapping is applied using the ILC tool
  - Done at load time
  - Don't have to change our shaders!
- Have full source & control over the process
  - Could write AMDIL directly or use other frontends if wanted

Slide9

Descriptor sets

- Very simple usage in BF4: for each draw call, write a flat list of resources
  - Essentially a direct replacement of SetTexture/SetConstantBuffer/SetInputStream
- Single dynamic descriptor set object per frame
  - Sub-allocate for each draw call and write the list of resources
- ~15000 resource slots written per frame in BF4, still very fast
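As an illustration of this per-draw pattern, here is a minimal C++ sketch of a frame-wide descriptor set that draw calls sub-allocate slices of. The types are hypothetical stand-ins, not the actual Mantle or Frostbite API:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical flat resource slot, standing in for an image/buffer/sampler handle.
using ResourceHandle = uint64_t;

// One dynamic descriptor set for the whole frame; each draw call sub-allocates
// a slice and writes its flat resource list into it.
class FrameDescriptorSet {
public:
    explicit FrameDescriptorSet(size_t capacity) : slots_(capacity) {}

    // Reserve 'count' consecutive slots and write the draw's resources.
    // Returns the base slot the draw call binds with. Safe from any thread.
    uint32_t writeDrawResources(const ResourceHandle* resources, uint32_t count) {
        uint32_t base = cursor_.fetch_add(count, std::memory_order_relaxed);
        assert(base + count <= slots_.size()); // real code would fall back to another pool
        std::memcpy(&slots_[base], resources, count * sizeof(ResourceHandle));
        return base;
    }

    void reset() { cursor_.store(0, std::memory_order_relaxed); } // at frame start

private:
    std::vector<ResourceHandle> slots_; // persistently mapped GPU-visible memory in practice
    std::atomic<uint32_t> cursor_{0};
};
```

At ~15000 slots per frame, each draw costs one atomic add plus a small memcpy, which is why this stays very fast.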

Slide10

Descriptor sets

Slide11

Descriptor sets – future optimizations

- Use static descriptor sets when possible
- Reduce resource duplication by reusing & sharing more across shader stages
- Nested descriptor sets

Slide12

Compute pipelines

- 1:1 mapping between pipeline & shader
- No state built into the pipeline
- Can execute in parallel with rendering
- ~100 compute pipelines in BF4

Slide13

Graphics pipelines

- All graphics shader stages combined into a single pipeline object together with important graphics state
- ~10000 graphics pipelines in BF4 on a single level, ~25 MB of video memory
- Could use a smaller working pool of active state objects to keep a reasonable amount in memory
  - Has not been required for us

Slide14

Pre-building pipelines

- Graphics pipeline creation is an expensive operation; do it at load time instead of at runtime!
  - Creating one of our graphics pipelines takes ~10-60 ms each
- Pre-build using N parallel low-priority jobs
  - Avoids 99.9% of runtime stalls caused by pipeline creation!
- Requires knowing the graphics pipeline state that will be used with the shaders:
  - Primitive type
  - Render target formats
  - Render target write masks
  - Blend modes
- Not fully trivial to know all state; may require engine changes / pre-defining use cases
- Important to design for!
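A sketch of the load-time pre-build, assuming a hypothetical createGraphicsPipeline() wrapper around the driver call and using std::async in place of the engine's low-priority job system:

```cpp
#include <cstdint>
#include <future>
#include <vector>

// The state that must be known up front to build a pipeline (see list above).
struct PipelineDesc {
    uint32_t shaderComboId; // which shader stages to combine
    uint32_t primitiveType;
    uint32_t renderTargetFormats;
    uint32_t renderTargetWriteMasks;
    uint32_t blendModes;
};

struct Pipeline {}; // opaque; each one takes ~10-60 ms for the driver to build

// Stub for the expensive driver call.
Pipeline* createGraphicsPipeline(const PipelineDesc&) { return new Pipeline; }

// Build every known pipeline in parallel at load time so runtime draw calls
// almost never stall on pipeline creation.
std::vector<Pipeline*> prebuildPipelines(const std::vector<PipelineDesc>& descs) {
    std::vector<std::future<Pipeline*>> jobs;
    jobs.reserve(descs.size());
    for (const PipelineDesc& desc : descs)
        jobs.push_back(std::async(std::launch::async, createGraphicsPipeline, desc));

    std::vector<Pipeline*> pipelines;
    pipelines.reserve(jobs.size());
    for (auto& job : jobs)
        pipelines.push_back(job.get());
    return pipelines;
}
```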

Slide15

Pipeline cache

- Cache built pipelines both in a memory cache and a disk cache
  - Improved loading times
  - Max 300 MB
  - Simple LRU policy
  - LZ4 compressed (free)
- Database signature:
  - Driver version
  - Vendor ID
  - Device ID
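A minimal sketch of the in-memory side: LRU eviction over serialized pipeline blobs keyed by a pipeline hash, with the database signature fields from the slide. The disk path, LZ4 step and hash function are left out; all names are illustrative:

```cpp
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

// Cache entries are only valid for the exact driver & GPU that produced them,
// so the on-disk database is keyed by this signature.
struct CacheSignature {
    std::string driverVersion;
    uint32_t vendorId;
    uint32_t deviceId;
};

// In-memory LRU over pipeline blobs; the real cache persists to disk as well
// (max 300 MB, LZ4-compressed) with the same eviction policy.
class PipelineCache {
public:
    explicit PipelineCache(size_t maxBytes) : maxBytes_(maxBytes) {}

    const std::vector<uint8_t>* find(uint64_t pipelineHash) {
        auto it = map_.find(pipelineHash);
        if (it == map_.end()) return nullptr;
        lru_.splice(lru_.begin(), lru_, it->second); // move to front = most recent
        return &it->second->blob;
    }

    // Assumes 'pipelineHash' is not already present.
    void insert(uint64_t pipelineHash, std::vector<uint8_t> blob) {
        usedBytes_ += blob.size();
        lru_.push_front(Entry{pipelineHash, std::move(blob)});
        map_[pipelineHash] = lru_.begin();
        while (usedBytes_ > maxBytes_) { // evict least-recently-used
            usedBytes_ -= lru_.back().blob.size();
            map_.erase(lru_.back().hash);
            lru_.pop_back();
        }
    }

private:
    struct Entry { uint64_t hash; std::vector<uint8_t> blob; };
    size_t maxBytes_, usedBytes_ = 0;
    std::list<Entry> lru_;
    std::unordered_map<uint64_t, std::list<Entry>::iterator> map_;
};
```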

Slide16

Memory

Slide17

Memory management

- Mantle devices expose multiple memory heaps with different characteristics
  - Can differ between devices, drivers and OSes
- User explicitly places resources in the wanted heaps
  - Driver suggests preferred heaps when creating objects; not a requirement

Example heaps (GPU/CPU read & write columns are bandwidths in GB/s):

Type    Size      Page   CPU access                                              GPU Read  GPU Write  CPU Read  CPU Write
Local   256 MB    65535  CpuVisible|CpuGpuCoherent|CpuUncached|CpuWriteCombined  130       170        0.0058    2.8
Local   4096 MB   65535  (none)                                                  130       180        0         0
Remote  16106 MB  65535  CpuVisible|CpuGpuCoherent|CpuUncached|CpuWriteCombined  2.6       2.6        0.1       3.3
Remote  16106 MB  65535  CpuVisible|CpuGpuCoherent                               2.6       2.6        3.2       2.9
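Since the driver only suggests heaps, the engine needs its own placement policy. A toy example of explicit placement over heap properties like the table above; the struct and flags are illustrative, not the actual Mantle types:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical mirror of the per-heap properties the API reports.
enum CpuAccessFlags : uint32_t {
    CpuVisible       = 1 << 0,
    CpuGpuCoherent   = 1 << 1,
    CpuUncached      = 1 << 2,
    CpuWriteCombined = 1 << 3,
};

struct HeapInfo {
    bool     local;     // true = video memory, false = remote/system memory
    uint64_t sizeBytes;
    uint32_t cpuAccess; // CpuAccessFlags bitmask, 0 = no CPU access
    float    gpuReadGBps, gpuWriteGBps, cpuReadGBps, cpuWriteGBps;
};

// Pick the heap that satisfies the required CPU access and has the best
// GPU read bandwidth; the engine, not the driver, makes the final call.
int chooseHeap(const std::vector<HeapInfo>& heaps, uint32_t requiredCpuAccess) {
    int best = -1;
    for (int i = 0; i < (int)heaps.size(); ++i) {
        if ((heaps[i].cpuAccess & requiredCpuAccess) != requiredCpuAccess) continue;
        if (best < 0 || heaps[i].gpuReadGBps > heaps[best].gpuReadGBps) best = i;
    }
    return best; // -1 = no compatible heap
}
```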

Slide18

Frostbite memory heaps

- System Shared Mapped
  - CPU memory that is GPU visible. Write combined & persistently mapped = easy & fast to write to in parallel at any time
- System Shared Pinned
  - CPU cached, for readback. Not used much
- Video Shared
  - GPU memory accessible by the CPU. Used for descriptor sets and dynamic buffers
  - Max 256 MB (legacy constraint)
  - Avoid keeping it persistently mapped; WDDM doesn't like this and can decide to move it back to CPU memory
- Video Private
  - GPU private memory. Used for render targets, textures and other resources the CPU does not need to access

Slide19

Memory references

- WDDM needs to know which memory allocations are referenced by each command buffer
  - To make sure they are resident and not paged out
- Max ~1700 memory references are supported
  - There is overhead in having lots of references
- Engine needs to keep track of what memory is referenced while building the command buffers
  - Easy & fast to do
- Each reference is either read-only or read/write
- We use a simple global list of references shared by all command buffers
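A sketch of that simple global scheme; MemoryHandle is a stand-in for the API's allocation handle:

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using MemoryHandle = uint64_t; // stands in for the API memory object handle

// Global list of referenced allocations shared by all of the frame's command
// buffers. If any user writes, the reference is upgraded to read/write.
class MemoryReferenceList {
public:
    void addReference(MemoryHandle mem, bool readOnly) {
        auto it = refs_.find(mem);
        if (it == refs_.end()) refs_[mem] = readOnly;
        else it->second = it->second && readOnly; // any writer forces read/write
    }

    // Flattened list handed to the submit call; must stay under the
    // ~1700-reference limit mentioned above.
    std::vector<std::pair<MemoryHandle, bool>> flatten() const {
        return {refs_.begin(), refs_.end()};
    }

    void clear() { refs_.clear(); }

private:
    std::unordered_map<MemoryHandle, bool> refs_; // handle -> isReadOnly
};
```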

Slide20

Memory pooling

- Pooling memory allocations was required for us
  - Sub-allocate within larger 1-32 MB chunks
- All resources store a memory handle + offset
  - Not as elegant as a plain void* on consoles
- Fragmentation can be a concern; not too much of an issue for us in practice
- GPU virtual memory mapping is fully supported, can simplify & optimize management
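A minimal pooling sketch in the spirit of the slide: resources get a (memory handle, offset) pair sub-allocated from large chunks. The driver call is stubbed, the allocator is append-only for brevity, and a real pool would also track free lists to combat fragmentation:

```cpp
#include <cstdint>
#include <vector>

using MemoryHandle = uint64_t; // hypothetical API allocation handle

// A resource location is a (handle, offset) pair rather than a raw pointer.
struct SubAllocation {
    MemoryHandle memory;
    uint64_t offset;
};

class MemoryPool {
public:
    explicit MemoryPool(uint64_t chunkSize) : chunkSize_(chunkSize) {}

    // 'alignment' must be a power of two.
    SubAllocation allocate(uint64_t size, uint64_t alignment) {
        uint64_t offset = (cursor_ + alignment - 1) & ~(alignment - 1);
        if (chunks_.empty() || offset + size > chunkSize_) {
            chunks_.push_back(allocateChunkFromDriver(chunkSize_)); // new 1-32 MB chunk
            offset = 0;
        }
        cursor_ = offset + size;
        return {chunks_.back(), offset};
    }

private:
    // Stub for the real driver allocation (e.g. grAllocMemory in Mantle).
    MemoryHandle allocateChunkFromDriver(uint64_t) { return ++nextHandle_; }

    uint64_t chunkSize_, cursor_ = 0, nextHandle_ = 0;
    std::vector<MemoryHandle> chunks_;
};
```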

Slide21

Overcommitting video memory

- Avoid overcommitting video memory!
  - Will lead to severe stalls as VidMM moves blocks and memory back and forth
  - VidMM is a black box
  - One of the biggest issues we ran into during development
- Recommendations:
  - Balance memory pools
  - Make sure to use read-only memory references
  - Use memory priorities

Slide22

Memory priorities

- Setting priorities on the memory allocations helps VidMM choose what to page out when it has to
- 5 priority levels:
  - Very high = render targets with MSAA
  - High = render targets and UAVs
  - Normal = textures
  - Low = shader & constant buffers
  - Very low = vertex & index buffers
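The slide's mapping, written out directly; the enums are illustrative stand-ins for the API's own priority constants:

```cpp
// The five priority levels from the slide, chosen from the resource type.
enum class MemoryPriority { VeryLow, Low, Normal, High, VeryHigh };

enum class ResourceType {
    MsaaRenderTarget, RenderTarget, Uav, Texture,
    ShaderBuffer, ConstantBuffer, VertexBuffer, IndexBuffer
};

MemoryPriority priorityFor(ResourceType type) {
    switch (type) {
        case ResourceType::MsaaRenderTarget: return MemoryPriority::VeryHigh;
        case ResourceType::RenderTarget:
        case ResourceType::Uav:              return MemoryPriority::High;
        case ResourceType::Texture:          return MemoryPriority::Normal;
        case ResourceType::ShaderBuffer:
        case ResourceType::ConstantBuffer:   return MemoryPriority::Low;
        case ResourceType::VertexBuffer:
        case ResourceType::IndexBuffer:      return MemoryPriority::VeryLow;
    }
    return MemoryPriority::Normal; // unreachable; silences compiler warnings
}
```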

Slide23

Memory residency future

- For best results, manage which resources are in video memory yourself & keep only ~80% used
  - Avoid all stalls
  - Can async DMA resources in and out
- We are thinking of redesigning to fully avoid the possibility of overcommitting
- Hoping WDDM's memory residency management can be simplified & improved in the future

Slide24

Resource management

Slide25

Resource lifetimes

- App manages the lifetime of all resources
  - Have to make sure the GPU is not using an object or memory while we are freeing it on the CPU
  - How we've always worked with GPUs on the consoles
  - Multi-GPU adds some additional complexity that consoles do not have
- We keep track of lifetimes at a per-frame granularity
  - Queues for object destruction & free-memory operations
  - Add to queue at any time on the CPU
  - Process queues when the GPU command buffers for the frame are done executing
  - Tracked with command buffer fences
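A sketch of that per-frame destruction queue: requests can be enqueued from any thread, and are executed only once the frame's command-buffer fence has signaled. Names are illustrative:

```cpp
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>

// Objects freed on the CPU are only actually destroyed once the GPU work for
// their frame has completed. Assumes enqueue() is called with non-decreasing
// frame indices, so the deque stays sorted by frame.
class DeferredDestructionQueue {
public:
    // 'destroy' releases one object or memory block; callable from any thread.
    void enqueue(uint64_t frameIndex, std::function<void()> destroy) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back({frameIndex, std::move(destroy)});
    }

    // Called once per frame with the newest frame whose command buffers have
    // finished executing (tracked with command buffer fences in the engine).
    void collect(uint64_t lastCompletedFrame) {
        std::lock_guard<std::mutex> lock(mutex_);
        while (!pending_.empty() && pending_.front().frame <= lastCompletedFrame) {
            pending_.front().destroy();
            pending_.pop_front();
        }
    }

private:
    struct Item { uint64_t frame; std::function<void()> destroy; };
    std::mutex mutex_;
    std::deque<Item> pending_;
};
```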

Slide26

Linear frame allocator

- We use multiple linear allocators with Mantle for both transient buffers & images
  - Used for the huge amount of small constant data and other GPU frame data that the CPU writes
- Easy to use and very low overhead
  - Don't have to care about lifetimes or state
- Fixed memory buffers for each frame
  - Super cheap sub-allocation from any thread
  - If full, use heap allocation (also fast due to pooling)
- Alternative: ring buffers
  - Requires being able to stall & drain the pipeline at any allocation if full; additional complexity for us
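A minimal sketch of one such allocator, assuming a persistently mapped base pointer; plain malloc stands in for the engine's pooled heap fallback:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <mutex>
#include <vector>

// Fixed per-frame buffer in persistently mapped, write-combined memory;
// sub-allocation is a single atomic add, so any thread can use it.
class LinearFrameAllocator {
public:
    LinearFrameAllocator(void* mappedBase, size_t capacity)
        : base_(static_cast<uint8_t*>(mappedBase)), capacity_(capacity) {}

    // Assumes one common power-of-two alignment so bump offsets stay aligned.
    void* allocate(size_t size, size_t alignment = 16) {
        size_t padded = (size + alignment - 1) & ~(alignment - 1);
        size_t offset = cursor_.fetch_add(padded, std::memory_order_relaxed);
        if (offset + padded > capacity_)
            return fallbackHeapAlloc(padded); // slide: "if full, use heap allocation"
        return base_ + offset;
    }

    // Called at the frame boundary, once the frame's fence has signaled.
    void reset() {
        for (void* p : fallbacks_) std::free(p);
        fallbacks_.clear();
        cursor_.store(0, std::memory_order_relaxed);
    }

private:
    void* fallbackHeapAlloc(size_t size) {
        std::lock_guard<std::mutex> lock(fallbackMutex_);
        void* p = std::malloc(size); // the engine's pooled allocator in practice
        fallbacks_.push_back(p);
        return p;
    }

    uint8_t* base_;
    size_t capacity_;
    std::atomic<size_t> cursor_{0};
    std::mutex fallbackMutex_;
    std::vector<void*> fallbacks_;
};
```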

Slide27

Tiling

- Textures should be tiled for performance
  - Explicitly handled in Mantle; the user selects linear or tiled
  - Some formats (BC) can't be accessed as linear by the GPU
- On consoles we handle tiling offline as part of our data processing pipeline
  - We know the exact tiling formats and have separate resources per platform
- For Mantle:
  - Tiling formats are opaque; they can differ between GPU architectures and image types
  - Tile textures with a DMA image upload from SystemShared to VideoPrivate
    - Linear source, tiled destination
    - Free!

Slide28

Command buffers

Slide29

Command buffers

- Command buffers are the atomic unit of work dispatched to the GPU
- Creation is separate from execution
  - No "immediate context" a la DX11 that can execute work at any call
  - Makes resource synchronization and setup significantly easier & faster
- Typical BF4 scenes have around ~50 command buffers per frame
  - Reasonable tradeoff for us between submission overhead and CPU load-balancing

Slide30

Command buffer sources

- Frostbite has 2 separate sources of command buffers
- World rendering
  - Rendering the world with tons of objects, lots of draw calls
  - Has all frame data up front
  - All resources except for render targets are read-only
  - Generated in parallel up front each frame
- Immediate rendering ("the rest")
  - Setting up rendering and doing lighting, post-fx, virtual texturing, compute, etc.
  - Manages resource state and memory, and runs on different queues (graphics, compute, DMA)
  - Sequentially generated in a single job; simulates an immediate context by splitting the command buffer
- Both are very important and have different requirements

Slide31

Resource transitions

- Key design in Mantle to significantly lower driver overhead & complexity
- Explicit hazard tracking by the app/engine
  - Drives architecture-specific caches & compression
    - AMD: FMASK, CMASK, HTILE
  - Enables explicit memory management
- Examples:
  - Optimal render target writes → graphics shader read-only
  - Compute shader write-only → DrawIndirect arguments
- Mantle has a strong validation layer that tracks transitions, which is a major help

Slide32

Managing resource transitions

- Engines need a clear design for how to handle state transitions
- Multiple approaches possible:
  - Sequential in-order command buffers
    - Generate one command buffer at a time, in order
    - Transition resources on demand when doing an operation on them; very simple
    - Recommendation: start with this (see the sketch below)
  - Out-of-order multiple command buffers
    - Track state per command buffer, fix up transitions when the order of the command buffers is known
  - Hybrid approaches & more
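A sketch of the recommended on-demand approach: one tracked state per resource, with a transition recorded just before each use. The enum is illustrative and recordTransition() is a stub for the API's prepare/transition call (grCmdPrepareImages / grCmdPrepareMemoryRegions in Mantle):

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical resource states, loosely mirroring the API's image/memory states.
enum class ResourceState { Undefined, RenderTargetWrite, ShaderRead, ShaderWrite, IndirectArgs };

struct Resource { uint64_t id; };
struct CommandBuffer; // opaque

// Stub: the real code records the transition into the command buffer.
void recordTransition(CommandBuffer*, const Resource&, ResourceState /*from*/, ResourceState /*to*/) {}

// Sequential in-order generation: before any operation, require() the state it
// needs and emit a transition only when the tracked state differs.
class StateTracker {
public:
    void require(CommandBuffer* cb, const Resource& r, ResourceState needed) {
        ResourceState& current = states_[r.id]; // defaults to Undefined
        if (current != needed) {
            recordTransition(cb, r, current, needed);
            current = needed;
        }
    }

private:
    std::unordered_map<uint64_t, ResourceState> states_;
};
```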

Slide33

Managing resource transitions in Frostbite

- Current approach in Frostbite is quite basic:
  - We keep track of a single state for each resource (not subresource)
  - The "immediate rendering" transitions resources as needed, depending on the operation
  - The out-of-order "world rendering" command buffers don't need to transition states
    - They already have write access to MRTs and read access to all resources set up outside of them
    - Avoids the problem of them not knowing the state during generation
- Works now, but as we do more general parallel rendering it will have to change
  - Track resource state for each command buffer & fix up between command buffers

Slide34

Dynamic state objects

- Graphics state is only set with the pipeline object and 5 dynamic state objects
  - State objects: color blend, raster, viewport, depth-stencil, MSAA
  - No other parameters, such as DX11's stencil ref or SetViewport functions
- Frostbite use case:
  - Pre-create when possible, otherwise on-demand creation (hash map, see the sketch below)
  - Only ~100 state objects!
- Still possible to end up with lots of state objects
  - Esp. with state object float & integer values (depth bounds, depth bias, viewport)
  - But no need to store all permutations in memory; objects are fast to create & the app manages lifetimes
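A sketch of the on-demand path with a hash map, using a hypothetical raster-state description; hash collisions and object lifetimes are ignored for brevity:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical packed description of one dynamic state object (raster here).
struct RasterStateDesc {
    float depthBias;
    float slopeScaledDepthBias;
    uint32_t cullMode;
    uint32_t fillMode;
};

struct RasterState {};
RasterState* createRasterState(const RasterStateDesc&) { return new RasterState; } // API stub

// Pre-create what we can; hash & create the rest on demand. Only ~100 objects
// end up live in practice, so no eviction is needed.
class RasterStateCache {
public:
    RasterState* get(const RasterStateDesc& desc) {
        uint64_t key = hashDesc(desc);
        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;
        RasterState* state = createRasterState(desc);
        cache_[key] = state;
        return state;
    }

private:
    static uint64_t hashDesc(const RasterStateDesc& desc) {
        // FNV-1a over the raw bytes; fine for a POD desc with no padding garbage.
        uint64_t h = 1469598103934665603ull;
        const uint8_t* p = reinterpret_cast<const uint8_t*>(&desc);
        for (size_t i = 0; i < sizeof(desc); ++i) { h ^= p[i]; h *= 1099511628211ull; }
        return h;
    }

    std::unordered_map<uint64_t, RasterState*> cache_;
};
```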

Slide35

Queues

Slide36

Queues

- The universal queue can do graphics, compute and presents
- We also use additional queues to parallelize GPU operations:
  - DMA queue - improves perf with faster transfers & avoids idling graphics while transferring
  - Compute queue - improves perf by utilizing idle ALU and updating resources simultaneously with gfx
- More GPUs = more queues!

Slide37

Queue synchronization

- Order of execution within a queue is sequential
- Synchronize multiple queues with GPU semaphores (signal & wait)
- Also works across multiple GPUs

[Diagram: compute and graphics queue timelines synchronized with semaphore signal (S) and wait (W) operations]

Slide38

Queue synchronization (cont.)

- Started out with explicit semaphores
  - Error-prone when having lots of different semaphores & queues
  - Difficult to visualize & debug
- Switched to a representation more similar to a job graph
  - Just a model on top of the semaphores

Slide39

GPU job graph

- Each GPU job has a list of dependencies (other command buffers)
  - Dependencies have to finish before the job can run on its queue
  - The dependencies can be from any queue
- Easier to work with, debug and visualize
- Really extendable going forward

[Diagram: example job graph with Graphics 1, Graphics 2, DMA and Compute jobs and their dependencies]
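A toy model of the job graph layered on semaphores: submitting a job first submits its dependencies, then waits on the "done" semaphore of each dependency that lives on another queue. The printf calls stand in for the actual queue submission and semaphore API, and the graph is assumed to be acyclic:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Stand-ins for API objects; illustrative only, not the actual Mantle types.
struct Semaphore { uint32_t id; };
struct CommandBuffer { const char* name; };

// One GPU job: a command buffer, the queue it targets, and its dependencies.
// The graph is just a friendlier model layered on top of queue semaphores.
struct GpuJob {
    CommandBuffer cmd;
    uint32_t queueIndex;                // graphics, compute or DMA
    std::vector<GpuJob*> dependencies;  // may come from any queue (or GPU)
    Semaphore done{0};                  // signaled when this job finishes
    bool submitted = false;
};

uint32_t nextSemaphoreId = 1;

// Depth-first submission: every dependency is submitted first, and the job
// waits on the 'done' semaphore of each cross-queue dependency.
void submit(GpuJob& job) {
    if (job.submitted) return;
    for (GpuJob* dep : job.dependencies) {
        submit(*dep);
        if (dep->queueIndex != job.queueIndex)
            std::printf("queue %u: wait on semaphore %u\n", job.queueIndex, dep->done.id);
    }
    job.done.id = nextSemaphoreId++;
    std::printf("queue %u: submit '%s', then signal semaphore %u\n",
                job.queueIndex, job.cmd.name, job.done.id);
    job.submitted = true;
}
```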

Slide40

Async DMA

- AMD GPUs have dedicated hardware DMA engines; let's use them!
  - Uploading through DMA is faster than on the universal queue, even if blocking
  - DMA has alignment restrictions; have to support falling back to copies on the universal queue
- Use case: frame buffer & texture uploads
  - Used by resource initial-data uploads and our UpdateSubresource
  - Guaranteed to be finished before the GPU universal queue starts rendering the frame
- Use case: multi-GPU frame buffer copy
  - Peer-to-peer copy of the frame buffer to the GPU that will present it

Slide41

Async compute

- Frostbite has lots of compute shader passes that could run in parallel with graphics work
  - HBAO, blurring, classification, tile-based lighting, etc.
- Running them as async compute can improve GPU performance by utilizing "free" ALU
  - For example while doing shadowmap rendering (ROP bound)

Slide42

Async compute – tile-based lighting

- 3 sequential compute shaders
  - Input: zbuffer & gbuffer
  - Output: HDR texture/UAV
- Runs in parallel with the graphics pipeline, which renders to other targets

[Diagram: compute queue (TileZ, Cull lights, Lighting) running alongside the graphics queue (Gbuffer, Shadowmaps, Reflection, Distort, Transp), synchronized with semaphore signals (S) and waits (W)]

Slide43

Async compute – tile-based lighting

- We manually prepare the resources for the async compute
  - Important not to access the resources on other queues at the same time (unless in a read-only state)
  - Have to transition resources on the queue that last used them
- Up to 80% faster in our initial tests, but not fully reliable
  - But this is a pretty small part of the frame time
  - Not in BF4 yet

[Diagram: same tile-based lighting timeline as the previous slide]

Slide44

Multi-GPU

Slide45

Multi-GPU

- Multi-GPU alternatives:
  - AFR - Alternate Frame Rendering (1-4 GPUs of the same power)
  - Heterogeneous AFR - 1 small + 1 big GPU (APU + discrete)
  - SFR - Split Frame Rendering
  - Multi-GPU job graph - primary strong GPU + slave GPUs helping
- Frostbite supports AFR natively
  - No synchronization points within the frame
  - For resources that are not rendered every frame: re-render them for each GPU
    - Example: sky envmap update on weather change
- With Mantle, multi-GPU is explicit and we have to build support for it ourselves

Slide46

Multi-GPU AFR with Mantle

- All resources explicitly duplicated on each GPU, with async DMA
  - Hidden internally in our rendering abstraction
- Every frame, alternate which GPU we build command buffers for and use resources from
- Our UpdateSubresource has to make sure it updates resources on all GPUs
- Presenting the screen has to, in some modes, copy the frame buffer to the GPU that owns the display
- Bonus:
  - Can simulate multi-GPU mode even with a single GPU!
  - Multi-GPU works in windowed mode!
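A sketch of how AFR duplication can hide behind a single handle; with one copy the same code runs on a single GPU, which is how multi-GPU mode can be simulated. The template is illustrative, not the engine's actual abstraction:

```cpp
#include <cstdint>
#include <vector>

// One logical resource, duplicated per GPU. Each frame the renderer targets
// copy frameIndex % gpuCount; copies.size() == 1 degenerates to single-GPU.
template <typename T>
struct PerGpu {
    std::vector<T> copies; // one per GPU; must be non-empty
    T& forFrame(uint64_t frameIndex) { return copies[frameIndex % copies.size()]; }
};

// Updates (e.g. UpdateSubresource) must touch every GPU's copy, not just the
// copy belonging to the current frame.
template <typename T, typename UpdateFn>
void updateAllGpus(PerGpu<T>& resource, UpdateFn&& update) {
    for (T& copy : resource.copies)
        update(copy); // in the engine: queued as an async DMA upload per GPU
}
```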

Slide47

Multi-GPU issues

- GPUs are independently rendering & presenting to the screen, which can cause micro-stuttering
  - Frames are not presented at regular intervals
  - Frame rate can be high, but presentation & gameplay are not smooth
  - FCAT is a good tool to analyse this

[Diagram: GPU0 and GPU1 timelines presenting frames 0-3 at irregular presentation intervals]

Slide48

Multi-GPU issues (cont.)

- GPUs are independently rendering & presenting to the screen, which can cause micro-stuttering
  - Frames are not presented at regular intervals
  - Frame rate can be high, but presentation & gameplay are not smooth
  - FCAT is a good tool to analyse this
- We need to introduce dependencies & dampening between the GPUs to alleviate this - frame pacing

[Diagram: the same timelines, annotated with the ideal presentation interval]

Slide49

Frame pacing

- Measure the average frame rate on each GPU
  - Short history (10-30 frames)
  - Filter out spikes
- Insert a delay on the GPU before each present
  - Forces the frame times to become more regular and the GPUs to align
  - Delay value is based on the calculated avg frame rate

[Diagram: GPU0 and GPU1 timelines with a delay (D) inserted before each present so presentation intervals become regular]
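A sketch of the pacing logic described above; the exact spike filter and delay formula here are assumptions for illustration, not the shipped heuristic:

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>

// Frame pacing: average recent frame times (short history, spikes filtered
// out), then delay each present so presentation intervals become regular.
class FramePacer {
public:
    // Call once per frame with the measured frame time in milliseconds.
    void addSample(float frameTimeMs) {
        history_.push_back(frameTimeMs);
        if (history_.size() > kHistorySize) history_.pop_front();
    }

    // Delay to insert on the GPU before the next present.
    float presentDelayMs(float timeSinceLastPresentMs) const {
        if (history_.empty()) return 0.0f;
        // Spike filter: ignore samples more than 2x the rough running average.
        float rough = roughAverage();
        float sum = 0.0f; int count = 0;
        for (float t : history_)
            if (t < 2.0f * rough) { sum += t; ++count; }
        float target = (count > 0) ? sum / count : rough;
        // Pace presents one target interval apart.
        return std::max(0.0f, target - timeSinceLastPresentMs);
    }

private:
    float roughAverage() const {
        float sum = 0.0f;
        for (float t : history_) sum += t;
        return sum / history_.size();
    }

    static constexpr size_t kHistorySize = 20; // 10-30 frames per the slide
    std::deque<float> history_;
};
```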

Slide50

Conclusion

Slide51

Mantle dev recommendations

- The validation layer is a critical friend!
- You'll end up with a lot of object & memory management code; try to share it with console code
- Make sure you have control over memory usage and can avoid overcommitting video memory
- Build a robust solution for resource state management early
- Figure out how to pre-create your graphics pipelines; it can require engine design changes
- Build for multi-GPU support from the start; it is easier than retrofitting

Slide52

Future

- Second wave of Frostbite Mantle titles
- Adapt the Frostbite core rendering layer based on learnings from Mantle
  - Refine binding & buffer updates to further reduce overhead
  - Virtual memory management
  - More async compute & async DMAs
  - Multi-GPU job graph R&D
- Linux
  - Would like to see how our Mantle renderer behaves with a different memory management & driver model

Slide53

Questions?

Email: johan@frostbite.com
Web: http://frostbite.com
Twitter: @repi

