Before getting started, a note about the Terrain objects. Attentive readers of the last news posts know that „Terrain objects" have been replaced by a pure Component based system to reduce complexity and improve flexibility. Still, a lot of old Terrain Object code lingered in the engine code base. Since the last news post a big effort has been made to eradicate all remaining Terrain Object code, and that goal is now achieved: everything related to the old system is gone, dead, finito. This brings us a lot closer to an actual release. The second big step concerned transparency.
Due to the switch to the Component-only system and some recent changes, the render speed took an unfavorable dive. To counter this the rendering code, especially the shaders and the way rendering passes are combined, has been optimized. A bunch of superfluous work has been isolated and either removed or reworked to eliminate bottlenecks where they are not necessary. Furthermore, stencil buffering has been added to restrict rendering to only the affected pixels. The net result is a quite remarkable speed gain. There is still room for more optimizations, which are planned for the next couple of weeks.
In modern game engines shadow casting, and therefore lighting in general, is one of the biggest time sinks, followed by occlusion problems. Not far behind, though, is transparency. In general transparency requires touching pixels on screen multiple times, and whenever you touch pixels multiple times you take a speed hit. Transparency has therefore often been cheated around big time to keep the costs low. As a result many transparency solutions in game engines are very limited, and it is easy to bring an engine to its knees by using transparency in a suboptimal way. The goal in this game engine is to free transparency of as many shackles as possible to give game developers more freedom for their ideas. With the Terrain Objects finally gone for good, the transparency system could be reworked and improved quite a bit.
Transparency in conventional systems is applied using the Painter's Algorithm. In short, the solid geometry is rendered first, then all transparent geometry one by one, sorted from back to front. This is simple but suffers from various problems. For one, CPU time has to be spent sorting objects at a more or less fine-grained resolution. Another problem is intersecting transparent faces, which cannot be handled correctly this way. Last but not least, if distorted transparency is included (for example a door with uneven glass), render time shoots upwards, as you have to copy the color buffer before each new object using distorted transparency. Still, these solutions work well within their boundaries. But what about deferred rendering?
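The sorting step of the Painter's Algorithm can be sketched in a few lines. This is a hypothetical CPU-side illustration, not the engine's actual code; the object and camera representations are made up for the example.

```python
# Painter's Algorithm ordering: sort transparent objects by distance
# to the camera, farthest first, so they can be drawn back to front.

def painters_order(transparent_objects, camera_pos):
    """Return the objects sorted back to front relative to the camera."""
    def dist_sq(obj):
        x, y, z = obj["center"]
        cx, cy, cz = camera_pos
        return (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2
    # reverse=True: largest distance (farthest object) comes first
    return sorted(transparent_objects, key=dist_sq, reverse=True)

objects = [
    {"name": "near window", "center": (0.0, 0.0, 1.0)},
    {"name": "far window",  "center": (0.0, 0.0, 9.0)},
    {"name": "mid window",  "center": (0.0, 0.0, 5.0)},
]
order = [o["name"] for o in painters_order(objects, camera_pos=(0.0, 0.0, 0.0))]
# the farthest object is drawn first, the nearest last
```

Note that sorting whole objects like this is exactly why intersecting faces break: the ordering is per object, not per pixel.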
With deferred rendering, lighting is separated from geometry: first all geometry is written, then all lights are applied one by one. Obviously this poses a problem, as transparency is by definition incompatible with this separation. A typical approach is to apply transparency afterwards in the conventional way. While doable, you lose the advantages of deferred rendering this way. Is there a solution? There is.
To deal with this problem Depth Peeling can be used. The idea is rather simple, so I'll just outline it; interested readers will find more via Google (there is a paper as well as tutorials). The scene is rendered multiple times with the transparent objects. Instead of keeping the front-most pixels (as in the solid geometry case), the farthest away pixel is kept. This way you render, so to speak, the „farthest away layer" of all transparent objects, which you can then process just like the solid geometry. Then the depth buffer is copied and the process repeated, but this time the depth copy is used to reject the layer we just rendered. Hence we now render the 2nd farthest layer and process it. This way we work from back to front through the transparent objects. What is the advantage over the old way? Depth Peeling automatically sorts transparency for us at the pixel scale. If you have, for example, a bunch of windows not overlapping each other on screen, with the old way you need as many passes as there are transparent objects. With Depth Peeling you need only 1 pass, since no transparency overlaps.
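The peeling loop above can be simulated on the CPU for a single pixel. This is only a sketch of the idea, assuming larger depth values mean farther away; the real work happens in shaders with a depth texture from the previous pass.

```python
# Depth Peeling simulation for one pixel: each pass keeps the farthest
# fragment that is still strictly nearer than the layer peeled last pass.

def peel_layers(fragment_depths):
    """Return the fragment depths in the order Depth Peeling visits them."""
    layers = []
    last_peeled = float("inf")  # first pass rejects nothing
    while True:
        # the copied depth buffer rejects everything at or behind last_peeled
        candidates = [d for d in fragment_depths if d < last_peeled]
        if not candidates:
            break  # nothing rendered: we are past the front-most layer
        farthest = max(candidates)  # "keep the farthest away pixel"
        layers.append(farthest)
        last_peeled = farthest
    return layers

# three transparent surfaces covering the same pixel
print(peel_layers([0.3, 0.8, 0.5]))  # back to front: [0.8, 0.5, 0.3]
```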
Great tech, but where is the catch? The catch is knowing when to stop. The algorithm does not provide a stop condition: how many layers do you have to render in total? You don't know in advance! Hence you have to use an Occlusion Query for each pass to determine whether anything has been rendered at all. If nothing has been rendered, we are beyond the front-most layer and have to stop. Unfortunately Occlusion Queries are slow. Furthermore, we render an entire useless pass just to figure out that we have to stop. This makes Depth Peeling slow and barely usable, so many revert to the old method to get the job done. Is there a way to solve this problem? Well, there is... if you play „smart" with shaders :D
The title states it already: if we knew in advance how many layers we need, we could just rush through them at full throttle. The speed gain would be remarkable. But how do we determine the count? OpenGL does not provide any functionality to obtain such a result. An Occlusion Query is the first thing one thinks about, but it can only tell how many pixels on the entire screen have been rendered, not the amount of overdraw. What we need is some shader magic. Let's start at the beginning. Take the following example screenshot.
This image shows a bunch of transparent objects littered around the room. How many layers do we need? At the top left is a counter showing the number of layers: you need 5 layers to render this scene properly. Okay, how do we get this number without being slow? We use a temporary full-screen texture for this task; a single-value texture is all we need. Now all transparent objects are rendered without any depth testing, using a very simple shader that just adds the value 1/255 for every pixel. What does this give us? A lot, actually: we now have a texture where each pixel stores the number of layers required at that pixel. The image below shows an example output scaled by 25 so the result can be seen.
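The counting pass is easy to model on the CPU. A small sketch, assuming additive blending of a constant into a single-channel texture (here the 1/255 increments are kept as plain integer counts for readability); coordinates and sizes are made up:

```python
# Layer-counting pass: every transparent fragment adds one increment to a
# screen-sized single-channel texture, with no depth test, so each pixel
# ends up holding its own overdraw count.

WIDTH, HEIGHT = 4, 4

def count_layers(fragments):
    """fragments: (x, y) pixel positions touched by transparent geometry."""
    texture = [[0] * WIDTH for _ in range(HEIGHT)]
    for x, y in fragments:
        texture[y][x] += 1  # one additive blend of 1/255 on the GPU
    return texture

# two transparent surfaces overlap at pixel (1, 1), others touch once
frags = [(0, 0), (1, 1), (1, 1), (2, 2)]
tex = count_layers(frags)
needed_layers = max(max(row) for row in tex)  # the number we are after
```

The maximum over all pixels is exactly the number of peeling passes the whole frame needs.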
If you check the image with a color picker, you will notice that 125 is the maximum value, which (divided by the scale factor of 25) equals 5. That's right, that's the number of layers we need! But how on earth do you get this value back into your program? Any kind of read-back from GPU memory is so slow your game turns into a slide-show. Maybe we can play a little MacGyver on this one.
The next step is to determine the maximum value. This is a simple down-scale shader with ping-pong textures that reduces the image to a 1x1 texture. What we actually want is to apply a Maximum filter kernel to all image pixels. The Maximum kernel has the same nice property as the Gauss filter kernel: it is separable. A separable kernel simply means that for a 25x25 filter kernel you get the same result by applying first a 25x1 and then a 1x25 filter kernel. Instead of 625 taps per pixel you only need 50. Using this property the texture can be down-sampled very fast, as this chain shows:
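The separable max reduction looks like this as a CPU sketch. The pool factor and image values are arbitrary examples; on the GPU each pass would be one shader invocation into the other ping-pong texture.

```python
# Separable Maximum reduction: a horizontal max pass followed by a
# vertical max pass gives the same result as one full 2D max kernel.

def max_pool_1d(row, factor):
    """Reduce a 1D row by `factor`, keeping the maximum of each group."""
    return [max(row[i:i + factor]) for i in range(0, len(row), factor)]

def max_downsample(image, factor):
    """Horizontal pass, then vertical pass (via transpose)."""
    horizontal = [max_pool_1d(row, factor) for row in image]
    columns = list(zip(*horizontal))                      # transpose
    vertical = [max_pool_1d(list(col), factor) for col in columns]
    return [list(row) for row in zip(*vertical)]          # transpose back

image = [
    [1, 0, 2, 0],
    [0, 5, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 0, 4],
]
reduced = max_downsample(image, 2)  # 4x4 -> 2x2, repeat until 1x1
```

Repeating `max_downsample` until the texture is 1x1 leaves the global maximum, i.e. the layer count, in a single pixel.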
In my implementation the image is reduced by a factor of 8 in each step, so only a handful of steps are needed to bring the whole image down to 1x1. This is fast... damn fast!
But what now? We have a 1x1 texture with the value, but how do we get it out? MacGyver to the rescue again. Remember what an Occlusion Query does? It counts the number of pixels written to the screen. Can we misuse this somehow to determine the value of the pixel? We can! Let's say the value of the pixel is X. What properties does this value have from a computer science point of view? You have to increment a counter X times by 1 to reach it, and that we can use. If we render 100 pixels and keep each pixel only if its threshold (each one larger than the previous) is smaller than X, then exactly X pixels are written. That's the trick. Using an Occlusion Query we count the number of pixels that passed, which equals X itself. We are done: right before rendering the first transparent object we know how many layers we need, with next to no time lost determining this value. Here a test video showing the result.
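A minimal sketch of this read-back trick, with the shader's threshold test and the query's pixel counting simulated in one loop; the batch size of 100 follows the example in the text:

```python
# Occlusion Query read-back trick: render max_layers candidate pixels,
# discard pixel i unless i < X (the value in the 1x1 texture); the query
# then reports exactly X surviving pixels, so the count IS the value.

def occlusion_query_count(layer_value, max_layers=100):
    """Count pixels passing the threshold test; equals layer_value."""
    passed = 0
    for i in range(max_layers):     # one candidate pixel per threshold
        if i < layer_value:         # the shader keeps or discards the pixel
            passed += 1             # the occlusion query counts writes
    return passed
```

The query result never leaves the driver as a texture read, which is why this is so much cheaper than reading the pixel back directly.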
In the upper left corner is again the counter. It shows the number of transparency layers using a dot for each layer, as this is easier to read while moving around than a number. As you can see, dropping in a bunch of transparent objects cranks up the number of layers a lot, and you can also see how the number varies from location to location. Using this pre-counter, and the stencil buffer to restrict rendering to only the pixels that actually need it, even large numbers of overlapping transparent objects render fast, even if they are all distorted (a horror scenario for the traditional way of doing all this). Due to the video capturing the framerate is lower, but you can still see that this is remarkably fast.
In the end, more optimizations are possible this way. Knowing the layer count in advance allows, for example, using an optimized render path for the case of only 1 transparent layer. But I'll maybe talk about that another time.
With transparency back working and optimized, the next step is to take another look at particle systems and prop fields.