For this model, we replaced the set of material-dependent closest-hit shaders with a single closest-hit shader that fetches only the data necessary to extract the geometric normal and the surface-cache parameterization. The ray-generation shader then applies lighting as a normal-weighted surface-cache evaluation.
Alongside the aggressive optimizations to the Surface-Cache pipeline, we created a new payload structure to minimize bandwidth pressure. The high-quality payload used in the UE4 model requires 64 bytes to store GBuffer-like parameters for dynamic lighting, including parameters such as BaseColor, Normal, Roughness, Opacity, Specular, and more. In comparison, the Surface-Cache payload only requires 20 bytes to store the parameters needed for a Surface-Cache lighting lookup.
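As an illustration, a reduced payload along these lines might look like the sketch below; the field names and packing are assumptions, not Epic's actual layout.

```hlsl
// Hypothetical 20-byte Surface-Cache payload (field names and packing are
// illustrative, not the engine's real layout).
struct FSurfaceCachePayload
{
    float HitT;            // 4 bytes: hit distance along the ray (negative on miss)
    uint  PackedNormal;    // 4 bytes: octahedral-encoded geometric normal
    uint2 SurfaceCacheUV;  // 8 bytes: packed surface-cache parameterization
    uint  PackedFlags;     // 4 bytes: per-hit flags (e.g. bit 0 = translucent hit, assumed encoding)
};                         // total: 20 bytes, versus ~64 bytes for the GBuffer-style payload
```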
After constructing the two models, we anticipated that mixing the two strategies could provide a natural scaling control to govern the performance vs. quality tradeoff. We accomplished this by conditionally separating surface-cache terms, such as albedo, direct lighting, and indirect lighting, depending on the desired level of dynamic evaluation. This also provided a mechanism to incorporate the surface-cache indirect lighting into the dynamic evaluation of the UE4 model, eliminating a fundamental problem with unshadowed SkyLight.
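A minimal sketch of how such per-term mixing could be expressed; the struct and parameter names below are hypothetical, not the engine's API.

```hlsl
// Illustrative per-term mixing of surface-cache data with dynamic evaluation.
struct FSurfaceCacheSample
{
    float3 Albedo;
    float3 DirectLighting;
    float3 IndirectLighting;
};

float3 ComposeLighting(FSurfaceCacheSample Sample,
                       float3 DynamicDirect, float3 DynamicIndirect,
                       bool bDynamicDirect, bool bDynamicIndirect)
{
    float3 Direct   = bDynamicDirect   ? DynamicDirect   : Sample.DirectLighting;
    // Taking indirect lighting from the surface cache replaces the unshadowed SkyLight term.
    float3 Indirect = bDynamicIndirect ? DynamicIndirect : Sample.IndirectLighting;
    return Sample.Albedo * (Direct + Indirect);
}
```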
*The Sorted-Deferred tracing pipeline is the original scheme; the material I found online comes from the article "Ray Tracing in Fortnite", which was later collected into the book Ray Tracing Gems. A link to the PDF will be attached at the end of this article.
This allows us to repurpose the Surface-Cache tracing stage as the prerequisite to the Hit-Lighting pipeline. It also grants us the flexibility to optionally invoke dynamic evaluation on a per-ray basis. For instance, we have experimented with this idea to optionally invoke Hit-Lighting on meshes which have no Surface Cache parameterization.
Since we chose to omit an any-hit shader, we must iteratively traverse through the scene in the ray-generation shader whenever we encounter a partially opaque surface. This iteration count is governed by a MaxTranslucentSkipCount shader parameter.
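A sketch of that re-trace loop is shown below; only MaxTranslucentSkipCount comes from the text, while the payload fields, flag encoding, and hit-group indices are assumed.

```hlsl
// Re-trace past partially opaque hits in the ray-generation shader, since no
// any-hit shader is bound (structure is a sketch, not Lumen's actual code).
RaytracingAccelerationStructure TLAS : register(t0);

void TraceSkippingTranslucents(float3 Origin, float3 Direction, float TMax,
                               uint MaxTranslucentSkipCount,
                               inout FSurfaceCachePayload Payload)
{
    RayDesc Ray;
    Ray.Origin = Origin;  Ray.Direction = Direction;
    Ray.TMin = 0.0f;      Ray.TMax = TMax;

    for (uint Skip = 0; Skip <= MaxTranslucentSkipCount; ++Skip)
    {
        TraceRay(TLAS, RAY_FLAG_FORCE_OPAQUE, 0xFF,
                 /*RayContributionToHitGroupIndex*/ 0,
                 /*MultiplierForGeometryContributionToHitGroupIndex*/ 1,
                 /*MissShaderIndex*/ 0, Ray, Payload);

        // Stop on a miss or a fully opaque hit; otherwise step just beyond the
        // partially opaque surface and trace again.
        bool bTranslucentHit = (Payload.PackedFlags & 0x1) != 0; // assumed encoding
        if (Payload.HitT < 0.0f || !bTranslucentHit)
        {
            break;
        }
        Ray.TMin = Payload.HitT + 0.01f;
    }
}
```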
We should also point out another useful feature in DXR 1.1: inline ray tracing. The RayQuery interface avoids the complexity of the shader binding table, allowing hardware traversal from standard compute and pixel shaders. Using ray queries also gives the compiler significant opportunity to optimize: in the ray-generation case, it is strongly recommended to minimize the amount of "live state" spanning a TraceRay() call, and with inline ray tracing the compiler can minimize this without developer intervention.
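For reference, a minimal DXR 1.1 RayQuery invocation looks roughly like this; it is a generic example rather than Lumen's actual shader code.

```hlsl
// Minimal inline ray tracing with a DXR 1.1 RayQuery (illustrative only).
RaytracingAccelerationStructure TLAS : register(t0);

float TraceClosestHitT(float3 Origin, float3 Direction, float TMax)
{
    RayDesc Ray;
    Ray.Origin = Origin;      Ray.TMin = 0.0f;
    Ray.Direction = Direction; Ray.TMax = TMax;

    // Force opaque so traversal never needs any-hit evaluation.
    RayQuery<RAY_FLAG_FORCE_OPAQUE> Query;
    Query.TraceRayInline(TLAS, RAY_FLAG_NONE, 0xFF, Ray);

    // With only opaque geometry this loop completes without candidate work.
    while (Query.Proceed()) {}

    return (Query.CommittedStatus() == COMMITTED_TRIANGLE_HIT)
         ? Query.CommittedRayT()
         : -1.0f;
}
```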
Except for the need to supply mesh-varying vertex and index buffer data to hit-group shaders, the Surface-Cache pipeline could use inline ray tracing. These auxiliary buffers are only a requirement on PC, however, as console ray-tracing intrinsics already provide access to the geometric normal as part of the ray-tracing Hit structure. Because of this, we do leverage inline ray tracing on consoles and benefit from a noticeable speedup on certain platforms. To learn more about this console specialization, please see Aleksander and Tiago's talk.
2 Integration Case Study: The Matrix Awakens
*Although this is split out into its own section here, in the context of the original text it is still mainly about hardware ray tracing.
"The Matrix Awakens" is a UE experiment for validating the ray-tracing model, and it packs in many difficult challenges. The open-world scene demands a huge instance count, which can easily blow past the rebuild budget of our top-level acceleration structure. The large numbers of moving cars and pedestrians also require many dynamic changes (refits) to bottom-level acceleration structures. And the car paint and glass materials cannot do without specular reflections if they are to look convincing.
The demo's target release platforms were Xbox Series X/S and PS5. These consoles have relatively immature ray-tracing support, with lower compute power and slower traversal than a high-end PC. However, the console APIs provide a great deal of ray-tracing pipeline flexibility that we can turn to our advantage. For example, acceleration structures for static meshes can be pre-built and streamed in, significantly reducing the per-frame time spent modifying bottom-level acceleration structures. For more detail on console optimizations, once again see Aleksander and Tiago's talk.
Instead, we must make some concessions with approximate geometric representations. We use Nanite fallback meshes as simplified representations of the rasterized geometry and store them in bottom-level acceleration structures. These fallback meshes bring their own set of challenges, as they do not provide any topological guarantees with respect to the base mesh.
Ray tracing against multiple geometric levels-of-detail is not a new problem, however. Tabellion and Lamorlette presented a solution for this issue back in 2004 [Tabellion et al 2004]. Unfortunately, a proper implementation was too expensive for our budget. Instead, we modified our traversal algorithm to first cast a short ray, out to some defined epsilon distance, ignoring back faces. Only after successfully traversing this distance did we cast a long ray, without any sidedness properties.
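A sketch of the two-phase cast is below; TraceWithFlags(), FHit, and the epsilon value are placeholders rather than engine API.

```hlsl
// Two-phase trace against the Nanite fallback mesh (sketch; TraceWithFlags()
// and FHit are hypothetical helpers).
bool TraceLodTolerantRay(float3 Origin, float3 Direction, float Epsilon,
                         float MaxTraceDistance, out FHit Hit)
{
    // Phase 1: a short, back-face-culled ray, so a ray origin that sits
    // slightly inside the fallback mesh can escape without a false self-hit.
    if (TraceWithFlags(Origin, Direction, /*TMax*/ Epsilon,
                       RAY_FLAG_CULL_BACK_FACING_TRIANGLES, Hit))
    {
        return true; // a legitimate nearby front-face hit
    }

    // Phase 2: the epsilon distance was traversed cleanly, so cast the full
    // ray with no sidedness flags.
    return TraceWithFlags(Origin + Direction * Epsilon, Direction,
                          MaxTraceDistance - Epsilon, RAY_FLAG_NONE, Hit);
}
```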
Screen traces also overcome self-intersection artifacts, by providing hardware traversal with a starting t-value that aligns with the bounding view frustum.
However, limiting trace distance had a profoundly negative impact on the overall look. Car reflections no longer showed the skyline in the distance. But more importantly, accurate sky occlusion from the global illumination solver was completely gone.
In the typical case, HLOD representations are direct replacements for rasterized geometry; however, due to limitations with performant trace distance, we needed to incorporate the HLOD representation before the rasterizer would typically need this substitution. Because of this, we were often presented with two different mesh representations occupying the same space in our top-level acceleration structure.
Incorporation of far-field traces into ordered traversal places a new set of stages into our original hardware traversal pipeline. We add an intermediary compaction step, collating near-field misses into new ray-tiles, and then a subsequent indirect dispatch to trace the ray-tiles against the far-field representation.
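A sketch of what such a compaction step might look like as a compute shader; the buffer names, the FTraceResult layout, and the group size are all assumptions.

```hlsl
// Sketch: compact near-field misses into ray-tiles for a far-field
// indirect dispatch (all names and layouts here are assumptions).
struct FTraceResult
{
    uint bMiss;   // 1 when the near-field trace missed
    // ... other hit data elided
};

cbuffer FCompactionParams : register(b0)
{
    uint NumRays;
};

StructuredBuffer<FTraceResult> NearFieldResults   : register(t0);
RWStructuredBuffer<uint>       FarFieldRayIndices : register(u0);
RWStructuredBuffer<uint>       FarFieldRayCount   : register(u1);

[numthreads(64, 1, 1)]
void CompactFarFieldRaysCS(uint RayIndex : SV_DispatchThreadID)
{
    if (RayIndex >= NumRays)
    {
        return;
    }

    if (NearFieldResults[RayIndex].bMiss != 0)
    {
        uint WriteIndex;
        InterlockedAdd(FarFieldRayCount[0], 1, WriteIndex);
        FarFieldRayIndices[WriteIndex] = RayIndex;
    }
    // A small follow-up pass rounds FarFieldRayCount up to whole ray-tiles and
    // writes the argument buffer consumed by the far-field indirect dispatch.
}
```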
Submitting both near-field and far-field representations to the same top-level acceleration structure is NOT ideal. Doing so creates needless geometric overlap. While the ray-mask removes unnecessary traversal, the damage to the top-level acceleration structure build is substantial. Our early experiments with overlaying the far-field representation revealed a show-stopping 44% penalty to all near-field traversal costs.
A proper solution would be to support multiple top-level acceleration structures, which would also avoid the need to use the ray-mask mentioned previously. We were hesitant to take on this architectural change mid-production under the already aggressive development schedule, but we needed to do something.
With some reordering of stages, we can minimize the dispatch costs by first resolving all Surface-Cache stages before optionally requeuing results for Hit-Lighting. We do this by cascading through both geometric representations. Hits are then compacted and optionally requeued for Hit-Lighting, while misses cascade to apply SkyLight evaluation.
A project that uses Lumen must choose one of the two tracing methods. Software ray tracing suits projects that need the absolute fastest tracing; it can run at 60 fps on the (then) next-generation consoles. Projects that rely on kitbashing, the practice of assembling scenes from pre-made model kits, which produces heavily overlapping meshes, should also use software ray tracing, as in the official samples "Lumen in the Land of Nanite" and "Valley of the Ancients".
Let's look at how the two tracing methods perform in different scenes, starting with "Lumen in the Land of Nanite", which contains massive amounts of overlapping meshes. This content was built as a Nanite stress test; any point on a surface in the cave may be covered by hundreds of overlapping meshes. With hardware ray tracing, a ray has to traverse every one of those overlapping meshes, whereas software ray tracing can fall back to a faster merged representation. The hardware ray tracing cost on this content was something we could not afford.
One of our early experiments was prefiltered cone tracing. It’s very difficult to implement, but if you can pull it off, tracing a single cone gives you the results of many rays. Cones can be very effective at solving noise and they essentially make the Final Gather trivial.
We implemented cone tracing against Mesh Signed Distance Fields. Whenever the cone intersects a surface, we calculate the mip of the surface cache to sample using the size of the cone intersection, giving prefiltered lighting for cones that intersect.
When the cone has a near miss with the surface, it’s only partially occluded so this becomes a transparency problem. We can approximate how much the cone was occluded using the distance from the cone axis to the surface, which we can get from the distance field.
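A sketch of both ideas, the footprint-based mip selection and the near-miss occlusion estimate, is shown below; every function and parameter name here is an illustrative placeholder.

```hlsl
// Prefiltered cone shading against a mesh SDF (sketch, placeholder names).

// Mip selection: a wider cone footprint at the hit samples a higher
// (blurrier) surface-cache mip, giving prefiltered lighting.
float3 ShadeConeHit(float2 SurfaceCacheUV, float HitT, float ConeHalfAngle,
                    float SurfaceCacheTexelWorldSize, float MaxMip)
{
    float ConeWidth = 2.0f * HitT * tan(ConeHalfAngle);
    float Mip = clamp(log2(ConeWidth / SurfaceCacheTexelWorldSize), 0.0f, MaxMip);
    return SampleSurfaceCacheLevel(SurfaceCacheUV, Mip); // placeholder sampler
}

// Near-miss occlusion: the distance-field value at a point on the cone axis,
// relative to the cone radius there, approximates how much of the cone is blocked.
float ConeVisibility(float DistanceFieldValue, float DistanceAlongCone, float ConeHalfAngle)
{
    float ConeRadius = DistanceAlongCone * tan(ConeHalfAngle);
    return saturate(DistanceFieldValue / ConeRadius);
}
```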
Instead, we chose Monte Carlo integration (a scheme that approximates an integral through sampling). This lets quality be pushed to the maximum while supporting every tracing type (hardware and software), but it leaves all of the noise problems for the Final Gather to solve.
*As mentioned in an earlier article, for the paired concepts Irradiance and Radiance I translate them once and then keep the original English terms.
The most popular approach for solving diffuse light transfer is the Irradiance Field. Irradiance Fields trace from probes placed in a volume, then pre-integrate irradiance at the probe position, and then interpolate that to the pixels on your screen.
In last year's talk I covered Lumen's opaque Final Gather, but since then we have stress-tested it in the "The Matrix Awakens" tech demo. It has also become the prototype for other areas, such as the volumetric Final Gather and the texture-space gather, so I'd like to share some of the insights we've gained since then.
The opaque Final Gather has three main parts, first there’s the Screen Space Radiance Cache which is operating at 1/16th resolution in each dimension. It’s backed up by the World Space Radiance Cache, which is handling distant lighting at a much lower resolution. At full resolution there’s the interpolation, integration, the temporal filter and Contact AO.
We importance sample the incoming lighting. Importance sampling is only as good as the importance estimate, and we have a very accurate estimate of the incoming lighting from last frame's Screen Space Radiance Cache, reprojected into the current frame. We can very efficiently find all of last frame's rays because they're indexed by direction as well as position in the radiance cache. Where the reprojection fails, like the edges of the screen, we fall back to the World Space Radiance Cache and still have effective importance sampling.
We’re operating in a downsampled space, so we can afford to launch a whole threadgroup per probe to do better sampling. Product Importance Sampling is normally only possible in offline rendering, but we can do it in real-time.
The World Space Radiance Cache has sparse coverage, and we use a clipmap distribution to maintain a bounded screen distance between the probes, to make sure that we don’t over-sample, or under-sample.
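One common way to get that bounded screen-space spacing is to pick the clipmap from the logarithm of the camera distance, since each clipmap doubles its extent and probe spacing; a sketch under assumed parameter names:

```hlsl
// Choose a World Space Radiance Cache clipmap so probe spacing grows with
// distance, keeping screen-space spacing roughly bounded (names are assumptions).
uint GetRadianceCacheClipmapIndex(float3 WorldPosition, float3 CameraPosition,
                                  float FirstClipmapWorldExtent, uint NumClipmaps)
{
    float Distance = length(WorldPosition - CameraPosition);
    // Each successive clipmap doubles its extent and its probe spacing.
    float ClipmapFloat = log2(max(Distance / FirstClipmapWorldExtent, 1.0f));
    return min((uint)ClipmapFloat, NumClipmaps - 1);
}
```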
"The Matrix Awakens" has an experimental night mode that is lit entirely by emissive meshes. Most of the scene's lighting comes from small, bright light bulbs, so handling their direct lighting is a real stress test for our GI method, because we do not explicitly sample the light sources (*instead we go through probes, and there are a great many emitters to account for). The World Space Radiance Cache can resolve this direct lighting more accurately, thanks to its high directional resolution and its temporal stability.
Our Volumetric Final Gather covers the view frustum with a probe volume, which is a froxel grid. We trace octahedral probes and skip invisible probes determined with an HZB test.
We overlap this new translucency Radiance Cache with the Opaque World Radiance Cache, so that its many dispatches fit in the gaps and it's almost free.
Looking at our pipeline at a high level, first we generate rays by importance sampling the visible GGX lobe, then we trace the rays using our ray tracing pipeline. Then we run our spatial reuse pass, which looks at screen space neighbors and reweights them based on their BRDF. Then we do a temporal accumulation, and finally a bilateral filter to clean up any remaining noise.
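For context, sampling the visible GGX lobe is typically done with Heitz's visible-normal (VNDF) method; below is a standard implementation of that sampler (not taken from Lumen's source). It returns a half vector, from which the reflection direction is built with reflect().

```hlsl
static const float PI = 3.14159265f;

// Visible-normal (VNDF) sampling of the GGX distribution [Heitz 2018].
// Ve is the view direction in tangent space, ax/ay the anisotropic roughness,
// u1/u2 uniform random numbers in [0, 1).
float3 SampleGGXVNDF(float3 Ve, float ax, float ay, float u1, float u2)
{
    // Stretch the view vector to the hemisphere configuration.
    float3 Vh = normalize(float3(ax * Ve.x, ay * Ve.y, Ve.z));

    // Build an orthonormal basis around Vh.
    float LenSq = Vh.x * Vh.x + Vh.y * Vh.y;
    float3 T1 = LenSq > 0.0f ? float3(-Vh.y, Vh.x, 0.0f) * rsqrt(LenSq)
                             : float3(1.0f, 0.0f, 0.0f);
    float3 T2 = cross(Vh, T1);

    // Sample a disk and warp it onto the visible hemisphere.
    float r = sqrt(u1);
    float phi = 2.0f * PI * u2;
    float t1 = r * cos(phi);
    float t2 = r * sin(phi);
    float s = 0.5f * (1.0f + Vh.z);
    t2 = (1.0f - s) * sqrt(1.0f - t1 * t1) + s * t2;

    // Reproject onto the hemisphere and unstretch to get the half vector.
    float3 Nh = t1 * T1 + t2 * T2 + sqrt(max(0.0f, 1.0f - t1 * t1 - t2 * t2)) * Vh;
    return normalize(float3(ax * Nh.x, ay * Nh.y, max(0.0f, Nh.z)));
}

// Usage: float3 H = SampleGGXVNDF(V, ax, ay, u1, u2);  float3 L = reflect(-V, H);
```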
The Bilateral Filter is our last ditch effort for when the physically based reuse isn’t enough. We run it on areas that had high variance after the spatial reuse pass, and we force it on at double strength in areas that were newly revealed by disocclusion, which don’t have any temporal history. We use tonemapped weighting in the Bilateral Filter to remove fireflies, which would crush our highlights if used in the spatial reuse pass, but works perfectly here.
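Tonemapped weighting here usually means weighting each sample by the inverse of its brightness so that bright outliers contribute less; a typical form, assumed rather than taken from the engine:

```hlsl
// Tonemapped (luminance-based) weighting: outliers get small weights, which
// suppresses fireflies in the bilateral filter (typical form, assumed).
float TonemapWeight(float3 Radiance)
{
    float Luma = dot(Radiance, float3(0.299f, 0.587f, 0.114f));
    return 1.0f / (1.0f + Luma);
}

// The filter multiplies each neighbor by its tonemap weight and divides the
// accumulated sum by the total weight to undo the bias.
```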
The roughest reflections, with roughness from 0.4 to 1, often cover half the screen and are a significant optimization opportunity. Then there are the glossy reflections, roughness 0.3 to 0.4, which require more directionality but are also quite slow to trace.
We can solve the incoherency in the roughest reflections by just reusing the work that we did for Diffuse GI. The Screen Space Radiance Cache has enough directional resolution for a wide specular lobe, and we can just resample it, by importance sampling the GGX lobe to get a direction, and then interpolating radiance from the screen probes.
For that middle range of glossy reflections, the Screen Space Radiance Cache doesn’t have enough directional resolution, but the World Space Radiance Cache does. We shorten the reflection ray, and interpolate from the World Space Radiance Cache on miss.
There’s a gotcha with implementing a tile based reflection pipeline - all of the denoising passes read from neighbors which may not have been processed. For the temporal filter’s neighborhood clamp, we could clear the unused regions of the texture, or branch in the temporal filter. It’s slightly faster to have the pass that runs before the temporal filter clear the tile border. We can only clear texels in unused tiles, to avoid a race condition with the other threads of the spatial reuse pass.