It’s been two months of an architectural overhaul for the Bright Engine render pipeline. The engineering team have been making substantial changes to how the engine draws the scene, resulting in significant performance boosts without loss of graphical fidelity. Let’s dive under the hood to see what’s changed.
A Huge Thank you to our Patrons! Your contributions make Bright Engine Possible!
Zeppot • Daniel Burkhart • Massum Zaidi • Mark Makarainen
Improvement: Rendering Architecture
Render performance has been struggling for some time. In the last R1 update, we managed to introduce some significant improvements on the CPU side of the equation. The result was a 52% jump on a NVIDIA RTX 3070 reaching 79 frames per second. While that’s above the 60 FPS threshold, this is a pretty high-end GPU. And considering the test was done in the relatively basic Demo Plains scene, 79 frames per second is still pretty abysmal. We tested the same scene with a mid-range NVIDIA GTX 1660 Ti and it only managed to eke out 38 FPS. Needless to say, that’s not even close to good enough. That’s why we decided to begin investigating the problem more thoroughly using GPU and CPU profilers. It didn’t take long to reveal some glaring issues with the way Bright Engine renders the scene. And we’ve since begun fixing these issues - starting with the render loop. Overall not much has changed to the bulk of Bright Engine’s render loop. However, the start has been altered drastically. For starters, the engine now performs early depth testing. This is a trick where the whole scene is rendered once with very simplistic shaders for the sole purpose of calculating depth values.
Until now, Bright Engine has handled resource management pretty poorly. While there were systems in place to prevent double loading of texture and audio data, each system operated independently. Apart from making the codebase far more cluttered, the existing method was pretty useless when it came to handling model data.
All too often, duplicate data would be loaded into memory even though it wasn’t needed or ever used. And in some cases, it even caused tiny memory leaks that built up in the background. This was particularly noticeable when switching between different zones.
Needless to say, that’s not good. But rather than continue with the bandaid solutions we’ve been using so far, we decided to rip out the old system entirely and put in a new one. And the results are pretty encouraging with a 25% drop in memory usage.
Once the scene depth map has been calculated, we can insert it into all the necessary framebuffers, throughout the rest of the loop and disable shaders from writing any depth information. This is a massive time saver for the GPU, eliminating a lot of unnecessary work. Something else that needed changing was the way SSAO was handled. In the past, when this effect was turned on the whole scene would be rendered to generate a scene depth map and corresponding scene viewspace normal map. After all, this data is required to calculate the post-processing effect. In version R2, we’ve managed to completely eliminate this step. How? It’s simple, we’ve already calculated the scene depth map in the early depth pass, and view space normals have been shifted into the gBuffer shader calculations of the deferred renderer. In other words, we’re recycling data that’s already been calculated, resulting in another boost to performance
However, this change didn’t just improve performance, it also allowed an upgrade in visual quality. The original SSAO calculation shader used something called a renderbuffer to store depth information. Without going too deep into the weeds, this is a block of data inside the GPU that is exceptionally quick to read from. The problem with this approach is that render buffers are a bit like a black box, as in it’s hard to work out exactly what’s going on inside them. And it proves particularly problematic when it comes to alpha. So much so, that the new early depth pass uses a depth texture buffer instead. Yes, this is slower. But it enables us to handle alpha data in a more flexible way that renderbuffers simply don’t allow. The end result is dynamic foliage now being included in the SSAO pass (courtesy of the recycled data), while simultaneously shifting it into the deferred rendering pipeline.
Improvement: Vertex and Image Cache Optimisation
Reducing the render queue proved to be an excellent place to gain extra frames. But another area was inside the realm of memory itself. Modern GPUs have plenty of VRAM to spare, but the caches inside the processor remain relatively tiny. This is deliberate to maintain high performance. Unfortunately, it means optimising data to fit inside these tiny caches is a real nightmare. But in layman's terms, the smaller the buffers, the better. Model data got some attention with Bi-Tangent vector information no longer being stored as a vertex attribute. Since we already have the Normal and Tangent vectors, we can quickly calculate the Bi-Tangent vector using the cross-product method inside vertex shaders. While it’s technically more work for the GPU, the work is actually faster than trying to read the data from the cache. To save even more space, vertex colours are no longer stored as 32-bit floats, but rather 8 bit bytes. And index buffers, used to reduce the number of vertices proceeded, are now stored as 16 bits rather than 32 bits. We took the same strategy with framebuffers. Shadow maps are now drawn out with 16-bit precision instead of 32 bit (which was frankly overkill). And gBuffer textures now use 16-bit precision as well. The only exception is the position texture which continues to use a 32-bit precision level since it creates artefacts at 16 bits.
Improvement: Shader Performance
Shaders also got some love in this update. Until now, all uniform variables have been sent one by one by the CPU to the GPU each frame for each shader. While sending uniform variables in this fashion isn’t particularly processor-intensive, it does result in some notable idling by the GPU, and that is wasted performance.
Fortunately, there is a marvellous creation called Uniform Buffer Objects (UBOs) that allows us to overcome this problem. Instead of sending the data in pieces, Bright Engine now stores the data into UBOs which are sent to the GPU only once per frame. Needless to say, it’s a lot faster, reducing work for the CPU, and letting the GPU get started with its rendering tasks, that much sooner.
We also did some adjustments to the PBR shader adding new optimisations to point and spotlights calculations. And the dynamic foliage shader had a good chunk of duplicate calculation code removed for good measure.
As a reminder, before implementing these optimisations the Demo Plains scene ran at an average of 38 FPS on an NVIDIA GTX 1660 Ti. This frame rate has now been increased to 50 FPS - a 32% jump.
That’s a pretty substantial boost to performance. But obviously, we’ve still got a long way to go. 50 FPS is below the 60 FPS target. And lower-end GPUs are bound to continue struggling. Our investigations using profilers, revealed a lot of time being wasted on the CPU - something we plan on addressing in the coming updates. And with plenty of optimisations left to integrate on our list, we’re going to continue striving towards getting this number as high as possible.
- Fixed bug where SSAO texture was not being blurred correctly resulting in screen-space artefacts
- Fixed bug where GPU was still trying to calculate point light shadows even if no shadow-casting light source was in the scene (performance)
- Fixed bug where GPU was still trying to calculate spotlight shadows even if no shadow-casting light source was in the scene (performance)
- Fixed bug where scene depth data was being written into alpha channel of gBuffer Scene Albedo Map
- Fixed bug where the terrain would be fully emissive is no emission map was assigned to a layer
- Fixed bug where generated wind noise texture was being bound once per terrain chunk instead of just once per frame
- Fixed bug where a light's illuminance value wouldn't respect timeline settings
- Fixed bug where the units for a spot light's inner and outer radius were incorrectly displayed
With a renewed emphasis on performance, our plan for R3 is to continue boosting the frame rate. But we also want to progress with the engine’s feature set. Fortunately, we’ve found an area in which we can do both - dynamic foliage. The existing system is actually poorly optimised, and quite restrictive when it comes to using custom geometry. That’s why we’re going to revamp the dynamic foliage system to use custom meshes similar to the model brush tools while also linking both systems into a universal wind & weather system. We’ve actually had a universal wind & weather system planned for a while, and now the time has come to integrate it. But with foliage being such a key part of the world-building process, we’ve decided to start working on a new tool for Bright Engine - a 3D Flora generator. Stay tuned to see how the progress continues to unfold.