We don’t want to draw all low-error clusters. Many of them represent the same area, just at different levels of detail.
The cut happens at the point where a parent’s error is too high, but its child’s error is small enough to be valid to draw. The parent says no but the child says yes. This test is entirely local: it does not depend on the entire path to this node, and thus can be evaluated in parallel.
That sounds complicated, but it really just means the error calculated for a parent must always be at least as large as that of its children. This is enforced during the offline DAG build by modifying the parent’s stored error and the bounds used for projecting it.
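As a rough sketch of those two rules (all names, the data layout, and the projection formula below are illustrative assumptions, not Nanite’s actual code), the runtime cut test and the offline error propagation could look like this:

```cuda
#include <algorithm>
#include <cmath>
#include <vector>

// Illustrative cluster LOD data; field names and layout are assumptions.
struct ClusterLOD {
    float clusterError;   // error introduced by this cluster's own simplification
    float parentError;    // error of the coarser group it was simplified into
};

// Crude projection of a world-space error at some distance into screen pixels
// for a perspective view (an assumed approximation, not Nanite's formula).
__host__ __device__ inline float ProjectErrorToPixels(
    float worldError, float distance, float viewHeightPx, float cotHalfFov)
{
    return worldError * cotHalfFov * viewHeightPx * 0.5f / fmaxf(distance, 1e-6f);
}

// Runtime cut: draw a cluster when its parent's projected error is too high
// but its own is acceptable. The test reads only this cluster's data, so all
// clusters can be evaluated independently, in parallel.
__host__ __device__ inline bool IsDrawn(const ClusterLOD& c, float distance,
                                        float viewHeightPx, float cotHalfFov,
                                        float pixelThreshold = 1.0f)
{
    float parentPx  = ProjectErrorToPixels(c.parentError,  distance, viewHeightPx, cotHalfFov);
    float clusterPx = ProjectErrorToPixels(c.clusterError, distance, viewHeightPx, cotHalfFov);
    return parentPx > pixelThreshold && clusterPx <= pixelThreshold;  // parent says no, child says yes
}

// Offline, during DAG build: force each parent's stored error to be at least
// as large as its children's, so projected error is monotonic up the DAG and
// the cut above is well defined.
void PropagateErrorUp(ClusterLOD& parent, const std::vector<ClusterLOD>& children)
{
    for (const ClusterLOD& child : children)
        parent.clusterError = std::max(parent.clusterError, child.clusterError);
}
```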
TAA is built to blend subpixel differences over time. It does our work for us so long as the error is subpixel, which is why getting an accurate error estimate is so important.
The clusters we can cull are exactly the ones that fail the LOD selection test from before. Any cluster whose ParentError is already small enough can be culled. Interestingly, this means that an acceleration structure for LOD culling should be based on ParentError, not the ClusterError itself.
With that we build a BVH over the clusters. As with any BVH, the parents conservatively bound their children which in this case also includes ParentError.
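A minimal sketch of such a node, reusing the ProjectErrorToPixels helper from the previous sketch (layout and names are again assumptions): each node carries a conservative maximum of the ParentError of everything beneath it, so an entire subtree can be rejected with one test.

```cuda
// Illustrative BVH node for LOD culling; layout and names are assumptions.
struct LODBVHNode {
    float    center[3];        // conservative bounds over all children
    float    radius;
    float    maxParentError;   // >= ParentError of every cluster in the subtree
    unsigned firstChild;       // first child node index, or first cluster index if leaf
    unsigned childCount;
    unsigned isLeaf;
};

// If even the largest ParentError in the subtree projects below the threshold,
// every cluster beneath would be LOD-culled, so the whole subtree can be skipped.
__host__ __device__ inline bool SubtreeMayContainDrawnClusters(
    const LODBVHNode& node, float distance,
    float viewHeightPx, float cotHalfFov, float pixelThreshold = 1.0f)
{
    return ProjectErrorToPixels(node.maxParentError, distance,
                                viewHeightPx, cotHalfFov) > pixelThreshold;
}
```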
Traversing this tree is a classic parallel expansion work scheduling problem. Implemented naively, it looks like this: many passes, each processing a single level of the tree and appending any passing child nodes to a buffer to be processed by the next pass.
Each pass depends on the previous, so the GPU is completely drained at every level of the tree. Because the CPU doesn’t know how deep the recursion will go, enough dispatches have to be issued to cover the worst case. This means we can very easily end up with empty dispatches that don’t do any processing at all!
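A sketch of that naive scheme, building on the node type above (kernel structure and names are illustrative):

```cuda
// One pass: cull the nodes appended by the previous level and append surviving
// children (or leaf clusters) for the next. Structure and names are illustrative.
__global__ void CullBVHLevel(const LODBVHNode* nodes, float3 viewPos,
                             float viewHeightPx, float cotHalfFov,
                             const unsigned* inNodes, const unsigned* inCount,
                             unsigned* outNodes, unsigned* outCount,
                             unsigned* outClusters, unsigned* outClusterCount)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= *inCount) return;

    const LODBVHNode& node = nodes[inNodes[i]];
    float dx = node.center[0] - viewPos.x;
    float dy = node.center[1] - viewPos.y;
    float dz = node.center[2] - viewPos.z;
    float distance = fmaxf(sqrtf(dx * dx + dy * dy + dz * dz) - node.radius, 0.0f);

    if (!SubtreeMayContainDrawnClusters(node, distance, viewHeightPx, cotHalfFov))
        return;                                               // cull the whole subtree

    for (unsigned c = 0; c < node.childCount; ++c) {
        unsigned child = node.firstChild + c;
        if (node.isLeaf) outClusters[atomicAdd(outClusterCount, 1u)] = child;
        else             outNodes[atomicAdd(outCount, 1u)]           = child;
    }
}

// Host side: the CPU can't know how deep the recursion goes, so it records one
// dispatch per possible level up front (with a conservative or indirect grid
// size), ping-ponging the in/out node buffers. Every level fully drains the
// GPU before the next can start, and the trailing dispatches may be empty:
//
//   for (int level = 0; level < MAX_BVH_DEPTH; ++level)
//       CullBVHLevel<<<worstCaseGrid, 64>>>(/* swap in/out buffers */ ...);
```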
Because the BVH traversal can go deep and the number of active nodes at a given time can be small relative to the width of the GPU, the BVH culling phase will not always be able to fill the GPU.
Tiny triangles are terrible for a typical rasterizer, HW rasterizers included: they are designed to be highly parallel in pixels, not triangles, since that’s the typical workload.
Primitive shaders or mesh shaders can be faster but are still bottlenecked and not designed for this. Could we possibly beat the hardware with a software rasterizer?
Instead we use 64b atomics! Specifically, a global image InterlockedMax to the visibility buffer.
This 64b integer has depth in the high bits, which is what gives us the depth test, and the payload in the low bits. In our case the payload is the visible cluster index and triangle index.
With that detail the visibility buffer shows its true power. The payload needs to be small enough to pack in 34 bits or less. Without that we wouldn’t be able to do fast software rasterization.
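In CUDA terms, the equivalent of that image InterlockedMax would be a 64-bit atomicMax on a buffer with one packed word per pixel. The exact bit split below (32-bit depth, 25-bit cluster index, 7-bit triangle index) is an assumption for illustration:

```cuda
#include <cstdint>

// One 64-bit word per pixel: depth in the high bits, payload in the low bits,
// so a single atomicMax is both the depth test and the visibility write.
// Assumes a reverse-Z style convention where a larger depth value is closer.
__device__ void WriteVisibility(unsigned long long* visBuffer, int pixelIndex,
                                float depth, uint32_t clusterIndex, uint32_t triangleIndex)
{
    // For non-negative floats the IEEE bit pattern preserves ordering,
    // so the packed integers compare the same way the depths do.
    uint32_t depthBits = __float_as_uint(depth);
    unsigned long long packed =
        (static_cast<unsigned long long>(depthBits)    << 32) |
        (static_cast<unsigned long long>(clusterIndex) << 7)  |   // illustrative payload layout
         static_cast<unsigned long long>(triangleIndex & 0x7Fu);
    atomicMax(&visBuffer[pixelIndex], packed);
}
```

Because the depth test and the write happen in a single atomic, no fixed-function output merger is needed, which is what lets a plain compute shader rasterize at all.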
So, here is our scanline rasterizer. Instead of the inner loop iterating from rect min to max testing whether this pixel is in or out, we solve for the x interval that passes and only iterate over those.
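Here is a float-math sketch of that idea (Nanite’s actual rasterizer works in fixed point and is far more careful about fill rules and precision): for each row, solve the three edge equations for the x range that passes, then walk only that span.

```cuda
// Edge function E(x, y) = A*x + B*y + C; with this sign convention, a point
// is inside a counter-clockwise (y-up) triangle when all three E >= 0.
struct Edge { float A, B, C; };

__device__ inline Edge MakeEdge(float x0, float y0, float x1, float y1)
{
    return Edge{ y0 - y1, x1 - x0, x0 * y1 - x1 * y0 };
}

// Scanline sketch: instead of testing every pixel of the bounding rect,
// solve each row for the x interval where all three edges pass.
__device__ void RasterizeTriangleScanline(float x0, float y0, float x1, float y1,
                                          float x2, float y2,
                                          int minX, int minY, int maxX, int maxY)
{
    Edge e[3] = { MakeEdge(x0, y0, x1, y1),
                  MakeEdge(x1, y1, x2, y2),
                  MakeEdge(x2, y2, x0, y0) };

    for (int y = minY; y <= maxY; ++y) {
        float lo = (float)minX, hi = (float)maxX;
        bool rowFails = false;
        for (int i = 0; i < 3; ++i) {
            // Constraint at pixel centers: A*(x + 0.5) + B*(y + 0.5) + C >= 0.
            float c = e[i].B * (y + 0.5f) + e[i].C;
            if      (e[i].A > 0.0f) lo = fmaxf(lo, -c / e[i].A - 0.5f);
            else if (e[i].A < 0.0f) hi = fminf(hi, -c / e[i].A - 0.5f);
            else if (c < 0.0f)      rowFails = true;  // horizontal edge rejects the whole row
        }
        if (rowFails) continue;

        for (int x = (int)ceilf(lo); x <= (int)floorf(hi); ++x) {
            // Pixel (x, y) is inside: interpolate depth and attributes here and
            // do the 64-bit visibility write from the previous sketch.
        }
    }
}
```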
Because materials are still coherent pixel shaders, we still have finite difference based derivatives to use for texture filtering.
Unlike traditional rasterization the pixel quads span across triangles. This is a very good thing because with tiny triangles quad overdraw can get out of hand very quickly.
Unfortunately there are more shadow rays than primary rays, since there is on average more than one light per pixel. We need something at least as fast as what we have for the primary view.
Nanite supports normal shadow map drawing but this new architecture enables new techniques that weren’t practical before. It allowed us to implement efficient virtual shadow maps.
The resolution we rasterize into the shadow map is made to match the screen pixels that those triangles cast onto. If that region of the shadow map doesn’t cast onto anything on screen we don’t draw to it.
That’s tiny in proportion to the cost of a full resolution primary view, but spinning up the pipeline for a minor amount of work can be very inefficient.
Now not only can Nanite draw the entire scene with a single chain of dependent dispatch indirects, it can render all shadow maps for every light in the scene, to all of their virtualized mipmaps, at once.
The physical texture we are writing to isn’t contiguous in virtual space. This means clusters that overlap multiple pages can’t expect the addressing of a pixel to be direct.
For the software rasterizer it is best to keep the inner loop as simple as possible. We’ve found even a single additional shift in the inner loop is measurable. So instead we emit one visible cluster to the rasterizer per overlapped page, do the page translation once for the cluster, and scissor to the page’s pixels. SW clusters are small, so most overlap only a single page.
Hardware clusters are bigger and often overlap multiple pages, so duplicating the vertex and triangle cost doesn’t make sense. Instead we do the virtual-to-physical page table translation per pixel. Because we are doing atomic UAV writes, even in the HW path, we are free to scatter them.
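A sketch of that per-pixel translation in the hardware path (the page table layout, sizes, and names are assumptions, not UE’s actual virtual shadow map format):

```cuda
#include <cstdint>

// Sketch of a virtual-to-physical shadow page translation; the page size,
// table layout, and encoding are assumptions, not UE's actual format.
constexpr int kPageSizePx    = 128;   // pixels per page side (assumed)
constexpr int kVirtualPagesX = 128;   // width of the virtual page grid (assumed)

// Each entry maps a virtual page to a page origin in the physical pool,
// or 0xFFFFFFFF if the page isn't allocated (no screen pixel receives it).
__device__ bool VirtualToPhysical(const uint32_t* pageTable,
                                  int virtX, int virtY,          // virtual pixel coords
                                  int* physX, int* physY)
{
    int pageX = virtX / kPageSizePx;
    int pageY = virtY / kPageSizePx;
    uint32_t entry = pageTable[pageY * kVirtualPagesX + pageX];
    if (entry == 0xFFFFFFFFu)
        return false;                 // unmapped page: nothing to rasterize here
    int poolPageX = int(entry & 0xFFFFu);
    int poolPageY = int(entry >> 16);
    *physX = poolPageX * kPageSizePx + (virtX % kPageSizePx);
    *physY = poolPageY * kPageSizePx + (virtY % kPageSizePx);
    return true;
}
```

In the software path this lookup would instead be hoisted out of the inner loop, as described above: once per (cluster, page) pair, with a scissor to that page.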
Just like in the primary view, Nanite picks the LOD matching a 1 pixel error. In the case of shadows, the pixels in question are those of the mip level it is rasterizing to. This maintains the property that cost roughly scales with screen resolution, not scene complexity.
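A toy version of that test for an orthographic (directional) shadow projection, where the threshold becomes one texel of the target mip rather than one screen pixel (names and the formula are assumptions):

```cuda
// Same cut rule as the primary view, but measured in texels of the shadow
// mip being rasterized to; all names and the formula are illustrative.
__host__ __device__ inline bool ShadowLODCut(float clusterWorldError, float parentWorldError,
                                             float shadowWorldExtent,   // world size covered at mip 0
                                             int   shadowResolution,    // texels across at mip 0
                                             int   mipLevel)
{
    float texelWorldSize = shadowWorldExtent / float(shadowResolution >> mipLevel);
    float clusterTexels  = clusterWorldError / texelWorldSize;
    float parentTexels   = parentWorldError  / texelWorldSize;
    return parentTexels > 1.0f && clusterTexels <= 1.0f;   // parent says no, child says yes
}
```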
That does not mean the triangles drawn to the shadow map are exactly the same as those drawn in the primary view. That mismatch can cause incorrect self shadowing. We address that discrepancy with a short screen space trace to span the zone where they could differ.