Slide1
The Intersection of
Game Engines & GPUs: Current & Future
Johan Andersson
Rendering Architect
2.5
Slide2
Agenda
Goal
  Share and discuss current & future graphics use cases in our games and implications for graphics hardware
Areas
  Engine overview
  Shaders
  Parallelization
  Texturing
  Raytracing
  GPU compute
  Conclusions
  Q & A
Slide3
Frostbite
DICE proprietary engine
  Xbox 360
  PS3
  Windows (Direct3D 10)
Focus
  Large outdoor environments
  Singleplayer & multiplayer
  Destruction!
New: Content workflows
Slide4
BFBC screenshot
Slide5
BFBC screenshot
Slide6
Slide7
Graph-based surface shaders
Artist-friendly
  Easy to create, tweak & manage
Flexible
  Programmers & artists can extend & expose features
Data-centric
  Encapsulates resources
  Transformable
Rich high-level shading framework
  Used by all content & systems
Slide8
Slide9
Shader permutations
Generate shader permutations
  For each used combination of features/data
  HLSL vertex & pixel shaders
Many features = permutation explosion
  Shader graphs, lighting, geometry
Balance perf. vs permutations vs features
  Dynamic branching
Live with many permutations
Slide10
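A minimal sketch of the permutation explosion from the shader permutations slide, assuming every feature is an independent on/off switch. The feature names and both helper functions are hypothetical and are not taken from Frostbite. The sketch only shows why generating variants "for each used combination" stays far below the 2^N worst case.

```python
from itertools import product

# Hypothetical on/off shader features (illustrative names only).
FEATURES = ["skinning", "normal_map", "shadows", "fog"]

def all_permutations(features):
    """Every on/off combination: 2**len(features) variants."""
    return [dict(zip(features, bits))
            for bits in product((False, True), repeat=len(features))]

def used_permutations(materials):
    """Only the distinct feature sets that content actually references."""
    return {frozenset(m) for m in materials}

# Three materials, two of which share the same feature set.
materials = [
    {"normal_map", "shadows"},
    {"normal_map", "shadows"},
    {"skinning", "fog"},
]
```

With four features the full set has 16 variants, while this content needs only 2.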
Shader subroutines
Next step: Static subroutine linking
  Inline in all subroutines at call site
    Similar to a switch statement
  Reduces # permutations
    Implementation moved to driver or GPU
  Doesn’t work with instancing
Future step: Dynamic subroutines
  Control function pointers inside shader
  Problem solved, but coherency important
Slide11
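A loose CPU-side analogy for the two steps on the shader subroutines slide, written in Python rather than shader code. The lighting functions and the `SUBROUTINES` table are hypothetical. Static linking picks the subroutine once, before shading runs. Dynamic subroutines pick it on every call, which is where coherency starts to matter.

```python
# Hypothetical lighting subroutines (illustrative only).
def lambert(n_dot_l):
    return max(n_dot_l, 0.0)

def half_lambert(n_dot_l):
    return (n_dot_l * 0.5 + 0.5) ** 2

SUBROUTINES = {"lambert": lambert, "half_lambert": half_lambert}

def link_static(name):
    """Static linking: the subroutine is chosen once, before any shading runs."""
    light = SUBROUTINES[name]

    def shade(albedo, n_dot_l):
        return albedo * light(n_dot_l)

    return shade

def shade_dynamic(albedo, n_dot_l, name):
    """Dynamic subroutines: the function is selected on every invocation."""
    return albedo * SUBROUTINES[name](n_dot_l)
```

One generic `shade` body replaces a separate permutation per lighting model; only the binding differs.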
Rendering & Parallelization
Slide12
Jobs
Must utilize multi-core
  6 HW threads on Xbox 360
  6 SPUs on PS3
  2-8 cores on PC
Job definition
  Fully independent stateless function
    PS3 SPU requirement
  Graph dependencies
  Task-parallel and data-parallel
Slide13
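A minimal sketch of the job definition from the jobs slide: stateless functions connected by graph dependencies. The scheduler below uses Kahn's algorithm and runs jobs serially. The job names and the `run_jobs` helper are assumptions for illustration, not the engine's job system.

```python
from collections import deque

def run_jobs(jobs, deps):
    """Run stateless jobs in dependency order (Kahn's algorithm).

    jobs: name -> function taking a dict of upstream results
    deps: name -> list of job names that must finish first
    """
    pending = {name: len(deps.get(name, [])) for name in jobs}
    dependents = {name: [] for name in jobs}
    for name, needs in deps.items():
        for n in needs:
            dependents[n].append(name)

    ready = deque(sorted(n for n, count in pending.items() if count == 0))
    results, order = {}, []
    while ready:
        name = ready.popleft()
        # Jobs that sit in `ready` together are independent of each other
        # and could be dispatched to separate cores.
        inputs = {d: results[d] for d in deps.get(name, [])}
        results[name] = jobs[name](inputs)
        order.append(name)
        for nxt in dependents[name]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(jobs):
        raise ValueError("dependency cycle")
    return order, results

# Hypothetical frame: culling feeds command buffer generation.
jobs = {
    "frustum_cull": lambda inp: [1, 2, 3, 4],
    "occlusion_cull": lambda inp: [x for x in inp["frustum_cull"] if x != 3],
    "build_commands": lambda inp: len(inp["occlusion_cull"]),
}
deps = {"occlusion_cull": ["frustum_cull"], "build_commands": ["occlusion_cull"]}
```

Because each job reads only its inputs and returns a value, the same graph could be handed to worker threads or SPUs without changing the job bodies.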
Rendering jobs
Refactor rendering systems to jobs
Most will move to GPU
  Eventually
  One-way data flow
  Compute shaders & stream output
Jobs
  Decal projection
  Particle simulation
  Terrain geometry processing
  Undergrowth generation [2]
  Frustum culling
  Occlusion culling
  Command buffer generation
  PS3: Triangle culling
Slide14
Parallel command buffer recording
Dispatch draw calls and state to multiple command buffers in parallel
  Scales linearly with # cores
  1500-4000 draw calls per frame
Super-important for all platforms, used on:
  Xbox 360
  PS3 (SPU-based)
No support in DX10!
Slide15
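A minimal sketch of the structure described on the parallel command buffer recording slide: the frame's draw calls are split into contiguous chunks, each chunk is recorded into its own buffer by a worker, and the buffers are kept in chunk order for submission. Command buffers are plain lists here and all names are hypothetical. Python threads only illustrate the shape of the work and do not reproduce the linear scaling quoted on the slide.

```python
from concurrent.futures import ThreadPoolExecutor

def record_chunk(draw_calls):
    """Record one chunk of draw calls into its own command buffer (a list here)."""
    buffer = []
    for mesh, material in draw_calls:
        buffer.append(("set_material", material))
        buffer.append(("draw", mesh))
    return buffer

def record_parallel(draw_calls, workers=4):
    """Split the frame's draw calls into contiguous chunks and record them in parallel."""
    size = max(1, -(-len(draw_calls) // workers))  # ceiling division
    chunks = [draw_calls[i:i + size] for i in range(0, len(draw_calls), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in chunk order, so submission order stays deterministic.
        return list(pool.map(record_chunk, chunks))

frame = [(f"mesh{i}", f"mat{i % 3}") for i in range(10)]
buffers = record_parallel(frame, workers=4)
```

Submitting `buffers` front to back reproduces the original draw order even though the recording ran out of order.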
DX10 parallel command buffer rec.
Single most important DX10 issue
  For us and many others (in the future)
Until future API support
  Reduce draw calls with instancing
    Trade GPU performance for CPU performance
  Reduce state & constant updates
    Slow dynamic constant path
  Manual software command buffers
    Difficult to update dynamic resources efficiently in parallel due to API
Slide16
PS3 geometry processing (1/2)
Slow GPU triangle & vertex setup
Unique situation with "free" processors
  Not fully utilized
Solution: SPU triangle culling
  Trade SPU time for GPU performance
  Cull back faces, micro-triangles, frustum
  Sony PS3 EDGE library
5 jobs process frame geometry in parallel
  Output is new index buffer for each draw call
Slide17
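A minimal sketch of the culling step from the PS3 geometry processing slide: triangles are tested in screen space and the survivors are written to a new index buffer. It assumes counter-clockwise front faces and positions already projected to pixel coordinates. The area threshold is a simplified stand-in for micro-triangle rejection and is not how the EDGE library does it.

```python
def signed_area2(a, b, c):
    """Twice the signed area of a screen-space triangle (positive = counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])

def cull_triangles(positions, indices, width, height, min_area2=1e-3):
    """Return a new index buffer without back-facing, tiny, or off-screen triangles."""
    kept = []
    for i in range(0, len(indices), 3):
        tri = indices[i:i + 3]
        a, b, c = (positions[j] for j in tri)
        area2 = signed_area2(a, b, c)
        if area2 <= 0:            # back-facing or degenerate
            continue
        if area2 < min_area2:     # micro-triangle (simplified area threshold)
            continue
        xs, ys = (a[0], b[0], c[0]), (a[1], b[1], c[1])
        if max(xs) < 0 or min(xs) > width or max(ys) < 0 or min(ys) > height:
            continue              # entirely outside the viewport
        kept.extend(tri)
    return kept

positions = [(0, 0), (10, 0), (0, 10), (-20, -20), (-10, -20), (-20, -10)]
indices = [0, 1, 2,   0, 2, 1,   3, 4, 5]
```

Here only the first triangle survives: the second is the same triangle with reversed winding and the third lies left of the viewport.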
PS3 geometry processing (2/2)
Great flexibility and programmability!
Custom processing
  Partition bounding box culling
  Triangle part culling
  Clip plane triangle trivial accept & reject
  Triangle cull volumes (inverse clip planes)
Future: No vertex & geometry shaders
  DIY compute shaders with fixed-func tessellation and triangle setup units
  Output buffer streaming still important
Slide18
Occlusion culling
Buildings occlude objects
  Tons of objects
Difficult to implement
  Building destruction
  Dynamic occludees
  Heavy GPU occlusion queries
Invisible objects still have to
  Update logic & animations
  Generate command buffer
  Processed on CPU & GPU
Slide19
Software occlusion culling
Solution: Rasterize coarse z-buffer on SPU/CPU
  Low-poly occluder meshes
    100m view distance
    Max 10000 vertices/frame
    Manually conservative
  256x114 float z-buffer
  Created for PS3, now on all
Cull all objects against z-buffer
  Before passed to all other systems = big savings
  Screen-space bbox test
Slide20
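A minimal sketch of the screen-space bbox test from the software occlusion culling slide, assuming a depth convention where larger values are farther away. Occluders are written as screen-aligned rectangles to keep the example short, whereas the real system rasterizes low-poly occluder meshes. All function names are hypothetical.

```python
def make_zbuffer(width, height, far=1.0):
    """Coarse depth buffer cleared to the far plane (larger depth = farther away)."""
    return [[far] * width for _ in range(height)]

def rasterize_occluder_rect(zbuf, x0, y0, x1, y1, depth):
    """Write a screen-aligned occluder rectangle; a stand-in for triangle rasterization."""
    for y in range(y0, y1):
        for x in range(x0, x1):
            zbuf[y][x] = min(zbuf[y][x], depth)

def is_occluded(zbuf, x0, y0, x1, y1, nearest_depth):
    """Bbox test: hidden only if every covered texel is closer than the object's nearest point."""
    height, width = len(zbuf), len(zbuf[0])
    x0, y0 = max(x0, 0), max(y0, 0)
    x1, y1 = min(x1, width), min(y1, height)
    if x0 >= x1 or y0 >= y1:
        return True  # bbox is entirely off-screen, nothing to draw
    return all(zbuf[y][x] < nearest_depth
               for y in range(y0, y1) for x in range(x0, x1))

zbuf = make_zbuffer(256, 114)  # resolution quoted on the slide
rasterize_occluder_rect(zbuf, 50, 20, 150, 100, depth=0.3)  # e.g. a building
```

The test is conservative in the direction that matters: an object is reported visible whenever any texel under its bbox is not covered by a nearer occluder.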
GPU occlusion culling
Want GPU rasterization & testing, but:
  Occlusion queries introduce overhead & latency
    Can be manageable, not ideal
  Conditional rendering only helps GPU
    Not CPU, frame memory or draw calls
Future 1: Low-latency extra GPU exec context
  Rasterization and testing done on GPU
  Lockstep with CPU
Future 2: Move entire cull & rendering to GPU
  Scene graph, cull, systems, dispatch. End goal.
Slide21
Texturing
Slide22
Texture formats
Using
  DXT1/5 color maps, sRGB
  BC5 (3Dc) normal maps
  BC4 (DXT5A) for grayscale masks
sRGB support for BC4/5 would be nice
DXT1 replacement needed
  Low quality
    565 color bleeding
    RG/RGB masks compress badly
  HDR envmaps & lightmaps
RGB DXT1 mask
DXT color bleed
Slide23
Slide24
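A minimal sketch of one contributor to the DXT1 quality complaint on the texture formats slide: block endpoints are stored as 5:6:5 bits, so most 8-bit colors cannot be represented exactly. The helper below is hypothetical and models only endpoint quantization with bit-replication expansion. The per-block interpolation between two endpoints is not modeled.

```python
def quantize_565(r, g, b):
    """Round an 8-bit RGB color to 5:6:5 bits and expand it back to 8 bits."""
    r5 = (r * 31 + 127) // 255
    g6 = (g * 63 + 127) // 255
    b5 = (b * 31 + 127) // 255
    # Bit replication is a common way to expand the channels back to 8 bits.
    return ((r5 << 3) | (r5 >> 2),
            (g6 << 2) | (g6 >> 4),
            (b5 << 3) | (b5 >> 2))
```

Mid gray (128, 128, 128) comes back as (132, 130, 132): green has one more bit than red and blue, so a neutral value picks up a slight tint.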
Future texture sampling
Texture sampling derivatives
  1st order texel derivatives
  2nd order as well?
  Implement in sampler unit
    Bad performance or quality with shader sampling
    Artifacts with ddx/ddy technique
  Replace normalmaps with easily compressed bumpmaps
Bicubic upsampling
  Terrain masks
Terrain heightmap
Derived normals [2]
Slide25
Slide26
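A minimal sketch of deriving a normal from first-order heightmap derivatives, the operation the future texture sampling slide asks the sampler unit to provide. It uses central differences on the CPU with z as the up axis. The function name and its scale parameters are assumptions, and the method in [2] may differ.

```python
import math

def derive_normal(height, x, y, texel_size=1.0, height_scale=1.0):
    """Normal at texel (x, y) from central differences, clamped at the borders.

    height is a list of rows. Clamping makes the border difference one-sided,
    which underestimates the slope there; acceptable for a sketch.
    """
    rows, cols = len(height), len(height[0])
    left = height[y][max(x - 1, 0)]
    right = height[y][min(x + 1, cols - 1)]
    down = height[max(y - 1, 0)][x]
    up = height[min(y + 1, rows - 1)][x]
    # 1st-order derivatives of the height field.
    dhdx = (right - left) * height_scale / (2.0 * texel_size)
    dhdy = (up - down) * height_scale / (2.0 * texel_size)
    nx, ny, nz = -dhdx, -dhdy, 1.0
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    return (nx / length, ny / length, nz / length)
```

Each normal needs four neighbouring height fetches, which is the cost the slide would like to move out of the shader.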
Current sparse textures
Save memory for terrain
  Static quadtree mask texture
  Dynamic sparse destruction mask
Implementation
  Indirection texture lookup in atlas
    Arrays too small, want 8192 slices
  Correct bilinear filtering by borders
  Siggraph’07 course for details [2]
Source mask
Atlas texture
Slide27
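A minimal sketch of the indirection lookup named on the current sparse textures slide: a small table maps each virtual tile to a slot in the atlas, and the texel offset inside the tile is carried over. The data layout, the default value for unstored tiles, and the function name are assumptions. The tile borders needed for correct bilinear filtering are left out, so this is a nearest-texel fetch only.

```python
def sample_sparse(indirection, atlas, tile_size, u, v):
    """Nearest-texel fetch from a sparse texture stored as indirection table + tile atlas.

    indirection[ty][tx] holds the (atlas_tx, atlas_ty) tile slot, or None if the
    tile is not stored; u and v are in [0, 1).
    """
    tiles_y, tiles_x = len(indirection), len(indirection[0])
    # Position in virtual texel space.
    vx = int(u * tiles_x * tile_size)
    vy = int(v * tiles_y * tile_size)
    tx, ty = vx // tile_size, vy // tile_size
    slot = indirection[ty][tx]
    if slot is None:
        return 0.0  # assumed default for tiles that are not stored
    # Offset inside the tile, relocated into the atlas.
    ax = slot[0] * tile_size + vx % tile_size
    ay = slot[1] * tile_size + vy % tile_size
    return atlas[ay][ax]

# 2x2 virtual tiles of 2x2 texels; only the top-right tile is stored.
indirection = [[None, (0, 0)],
               [None, None]]
atlas = [[1.0, 2.0],
         [3.0, 4.0]]
```

The virtual texture here covers 4x4 texels but only 4 are stored, which is the memory saving the slide is after.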
HW sparse textures
Virtual texture
  HW texture filtering & mipmapping
  Fallback on non-resident tile access
    Lower mipmap, default value or shader bool
  At least 32k x 32k, fp issues with larger?
Application-controlled tile commit/free
  ~128 x 128 tiles
Feedback mechanism for referenced tiles
  Easy view-dependent allocation
Future: Latency-free allocation & generation
  Alt1. CPU thread callback & block
  Alt2. Keep everything on GPU. "Command" shader?
Slide28
Cached Procedural Unique Texturing
Unique dynamic sparse texture on all objects
  Defined by texture shader graph
  Combine procedurals, compositing, streaming and uv-space geometry
Dynamically commit & render visible tiles
  Highly complex compositing
    Thanks to high frame-to-frame coherency
  Upsample and refine
New dynamic effects made possible
  Affect every surface
Slide29
Raytracing
Slide30
Raytracing
Much recent debate & interest in RTRT
What we are interested in:
  Performance!!
    Rasterization for primary rays
  Deterministic
  Easy integration into engines
    Just another method for certain effects & objects
    Not replace whole pipeline
  Efficient dynamic geometry
    Procedural & manual animation (foliage, characters)
    Destruction (foliage, buildings, objects)
Slide31
Mirror’s Edge
Slide32
Raytraced reflections wanted
Glass & metal
  Mostly planar surfaces
  Reflection locality
Correct reflections for important objects
  Main character
Simplified world geometry & shading for rest
  Common for games
  Brickmaps? [3]
Slide33
Soft reflections
Mirror’s Edge
Slide34
GPGPU
Slide35
GPGPU uses
Effect physics
  Particle vs world soft collision
AI pathfinding
AI visibility
  View rasterization. Obstruction from smoke & foliage
Procedural animation
  Trees, undergrowth, hair
Post-processing
Slide36
CUDA DOF post-process filter
Circle of confusion map
Thesis work at DICE [4]
  Test CUDA and performance
  Poisson disc blur
  Multi-pass diffusion
  Separable diffusion
Good:
  Easy to learn (C)
  Map complex algorithms
  Thread & memory control
Bad:
  Performance vs shaders
  Beta interop
  Vendor-specific
Output
Slide37
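The CUDA DOF slide does not say how its circle of confusion map is computed. As a sketch under that caveat, the standard thin-lens model gives a per-pixel CoC diameter from depth. The function and parameter names are assumptions, and the thesis [4] may use a different formulation.

```python
def circle_of_confusion(depth, focus_dist, focal_len, aperture):
    """Thin-lens CoC diameter for an object at `depth`.

    All arguments share one length unit; `aperture` is the lens diameter
    and `focus_dist` must be greater than `focal_len`.
    """
    if depth <= 0 or focus_dist <= focal_len:
        raise ValueError("depth must be positive and focus_dist > focal_len")
    return (aperture * focal_len * abs(depth - focus_dist)
            / (depth * (focus_dist - focal_len)))

def coc_map(depth_buffer, focus_dist, focal_len, aperture):
    """Apply the CoC formula to every texel of a depth buffer (list of rows)."""
    return [[circle_of_confusion(d, focus_dist, focal_len, aperture) for d in row]
            for row in depth_buffer]
```

A blur pass such as the Poisson disc or diffusion filters listed on the slide would then scale its kernel by this value per pixel.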
GPU Compute programming model
Wanted:
  Easy & efficient Direct3D 10 interop
  Low-latency Compute tasks
  Vendor-independent base interface
    OpenCL?
  Efficient CPU multi-core backend
    Server, older GPUs, debugging
    MCUDA [5]
  Eventually platform-independent
    Future consoles
Slide38
Conclusions
Shader subroutines
More software-controlled pipeline
More texture sampler functionality
Limited-case raytracing
GPU compute for games
Slide39
Questions?
Contact: johan.andersson@dice.se
Slide40
References
[1] Tatarchuk, Natasha & Andersson, Johan. "Rendering Architecture and Real-time Procedural Shading & Texturing Techniques". GDC 2007.
[2] Andersson, Johan. "Terrain Rendering in Frostbite using Procedural Shader Splatting". Siggraph 2007.
[3] Christensen, Per H. & Batali, Dana. "An Irradiance Atlas for Global Illumination in Complex Production Scenes". Eurographics Symposium on Rendering 2004.
[4] Lonroth, Per & Unger, Mattias. "Advanced Real-time Post-Processing using GPGPU techniques". Master's thesis, 2008.
[5] Stratton, John, Stone, Sam & Hwu, Wen-mei. "MCUDA: An Efficient Implementation of CUDA Kernels on Multi-cores". Technical report, University of Illinois at Urbana-Champaign, IMPACT-08-01, March 2008.
Slide41
Bonus slides
Slide42
Real-time REYES
Very interesting
  Displacement mapping & procedurals
  Stochastic sampling
Potentially more efficient & general
  Compared to maxed out rasterization & tessellation on everything = pixel-sized triangles
But
  No experience
  More research & experimentation needed
Slide43
Terrain detail
Deriving normal from heightfield good in distance
Future: HW tessellation & procedural displacement shaders for up close ground detail
Slide44
Texture arrays
Use cases:
  Everything!
  Rich parameterized shaders
    Vary slice index per instance, triangle or texel
    Instancing without compromising on variation or perf.
  Cascaded shadow maps
    HW PCF only in DX 10.1
    Stable Cascaded Bounding Box Shadow Maps
  Sparse textures
More slices plz
  For tile pools. 64x64x8192
Slide45
Other raytracing uses
Global Illumination & Ambient Occlusion
  Incremental Photon Mapping?
Async collision raycasts
  AI pathfinding, gameplay, sound obstruction
  Separate collision world from visual world
  CPU job-based now