OpenTomb is a cross-platform reimplementation of the classic Tomb Raider 1-5 engines.
A step-by-step explanation of how to get all the performance out of a modern GPU, or at least far more than sixteen-year-old games need, mostly by being very kind to the driver.
Posted by ZCochrane on Jul 31st, 2013
Hi guys! I’m the one who does the Mac OS X port of OpenTomb and I’ve done the odd graphic improvement. Over on Tombraiderforums, I had written a few longer posts about this, and Lwmte suggested to me that I could post them here again, for easier reference. For this “official” publication, I’ve cleaned them up a bit. If you want to see my changes in action, not all of them are in TeslaRus’s main repository yet, but you can see them in my fork at Sourceforge.net.
I’ll go in chronological order, which means I’ll start with optimizing the graphics.
There is an old saying, dating back to biblical times: The root of all evil is premature optimization. And it’s 100% true. It is very easy to get lost making some arcane function completely unreadable, in hopes of having “better performance”, without having any performance impact at all. That does not mean do not optimize, but it means optimize when you have an actual performance problem, ideally one that is easily repeatable.
For me, that performance problem was a simple scenario: Start BOAT.TR2 (the Venice level from Tomb Raider 2), get out of free look, get into fly mode and fly a little to the left (towards the sewers). Then look at the bulk of the level. In the original version, on my laptop, the FPS dropped to around 20, even though my laptop was not cheap and the data comes from 1997. In my mind, that is a problem.
Finding out the cause of this problem requires specific tools. For Mac OS X, there is OpenGL Profiler, which can be downloaded for free from Apple’s developer pages (look for the Graphics Tools package). There are similar applications for Windows and Linux as well; they come in either free (distributed by AMD or Nvidia) or very expensive. I haven’t really worked with them, so I can’t give a recommendation (or recall a name), but all should be able to do the few things I show here.
These profilers help a lot. They can also easily mislead you. A common feature that does both very well is the timed trace. This produces a list of all OpenGL functions your application called in a given time (usually you’ll run it over one frame), and how long each call took. These traces tend to be long stretches of the same calls over and over again with different data, so I’ve shortened them a bit here. This is one generated with an early version of OpenTomb:
It may not be entirely obvious at first glance, but these function calls draw three rectangles. I’m guessing they’re from the same mesh, but don’t hold me to that. Of all these calls, the glDrawArrays call is the one doing the actual work. You can see the times it took: 4, 12, then 4 microseconds again, roughly. That variation is normal. Here’s the misleading part: The time is not how long it took to draw the polygon. The polygon will get drawn whenever the graphics card feels like it; the only guarantee is that it will be done by the end of the frame (in detail: The programmer decides when the frame is over by calling SDL_GL_SwapBuffers() or equivalent, and this function does not return until everything has been drawn. This function calls a system-specific one to do all the work; on Mac OS X, for example, it’s CGLFlushDrawable). The time you see there is how long OpenTomb took to talk with the driver about this polygon. While this isn’t a lot, it adds up, because so little is drawn per call: Each call produces one polygon with four corners.
One of the standard OpenGL optimizations is to draw more things per glDraw… call. Doing that with Tomb Raider levels takes a bit of work, because every mesh can refer to many different texture pages, and you can’t change the texture in the middle of a draw call. The solution is to combine all textures into one big texture, a so-called texture atlas. That is a tricky beast, so tricky that it deserves its own article. For now the important part is that it works well. With this atlas, sorted textures, and a few other changes, I was able to get this:
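To make the atlas idea a bit more concrete, here is a minimal sketch of the coordinate remapping it requires. It assumes the simplest possible layout, a regular grid of equally sized texture pages, which is not how OpenTomb’s atlas actually works (a real atlas packs irregular pieces, which is where the trickiness comes from). The names AtlasGrid, TexCoord and remapToAtlas are made up for this illustration.

```cpp
#include <cassert>

// Hypothetical, simplified atlas: columns x rows equally sized pages laid
// out in a regular grid. A real atlas packs irregular rectangles instead.
struct AtlasGrid {
    int columns;
    int rows;
};

struct TexCoord {
    float u;
    float v;
};

// Converts a coordinate that is local to one texture page (0..1 in both
// directions) into a coordinate inside the combined atlas texture.
TexCoord remapToAtlas(const AtlasGrid &atlas, int page, TexCoord local)
{
    assert(page >= 0 && page < atlas.columns * atlas.rows);
    const int column = page % atlas.columns;
    const int row = page / atlas.columns;

    TexCoord result;
    result.u = (static_cast<float>(column) + local.u) / static_cast<float>(atlas.columns);
    result.v = (static_cast<float>(row) + local.v) / static_cast<float>(atlas.rows);
    return result;
}
```

Once every vertex carries atlas coordinates like these, polygons that used to need different texture pages can share one bound texture and therefore one draw call.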
Now I’m using glDrawElements instead of glDrawArrays, and it takes far longer. It’s still a win, though, because each call now draws dozens of triangles versus two before. The first number in each call is the number of elements (triangle corners).
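For readers wondering where the element count comes from, this is a rough sketch of how rectangles can be turned into an element list for a single glDrawElements call. It assumes that the four corners of each rectangle are stored one after another in the vertex array; the function name buildQuadElements is invented here and is not taken from OpenTomb.

```cpp
#include <cstddef>
#include <vector>

// Builds the element (index) list for quadCount rectangles whose four
// corners are stored consecutively in the vertex array. Each rectangle
// becomes two triangles, so it contributes six elements.
std::vector<unsigned int> buildQuadElements(std::size_t quadCount)
{
    std::vector<unsigned int> elements;
    elements.reserve(quadCount * 6);
    for (std::size_t quad = 0; quad < quadCount; ++quad) {
        const unsigned int first = static_cast<unsigned int>(quad * 4);
        // First triangle: corners 0, 1, 2 of this rectangle.
        elements.push_back(first);
        elements.push_back(first + 1u);
        elements.push_back(first + 2u);
        // Second triangle: corners 0, 2, 3 of this rectangle.
        elements.push_back(first);
        elements.push_back(first + 2u);
        elements.push_back(first + 3u);
    }
    return elements;
}
```

The size of the resulting list is what would be passed as the count argument of glDrawElements when drawing with GL_TRIANGLES, so three rectangles give a count of 18 instead of three separate calls.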
But are these timings good or not? It helps to have something to compare to. I picked GLLara, another Tomb Raider related tool that I’m developing, in this case a clone (not a port) of the XNALara posing application. Here’s a typical trace for that:
There are a lot of differences here, because GLLara uses the OpenGL 3.2 Core Profile with shaders only, while OpenTomb uses OpenGL 1.5 (roughly) without any shaders. The important part stays, though: Some state gets set with functions that include terms like bind, pointer, attrib, use, enable, disable and so on, and then triangles get drawn with glDrawElements. In this case, the models are a lot more complex. However, the time spent in glDrawElements is only a fraction of the value for OpenTomb.
Here you can see the full effect of what I said above: The time taken by glDrawElements has nothing to do with how long it actually takes to draw something. It is how long you talk with the driver. The actual data is already on the GPU in both cases (GLLara uses Vertex Array Objects, so it uses glBindVertexArray to say where; glBindBuffer is used in OpenTomb, which requires a bit more work), so its size is not really relevant. The goal then is to get the driver out of the way, by talking to it exactly the way it wants to, so it has to do as little work as possible. There are two main ways of doing that, and you should do both. I started with the easier one: Vertex storage.
In OpenTomb, you see all the gl…Pointer calls. They don’t take much time and they’re necessary, but they tell you interesting information nonetheless. First of all, OpenTomb uses buffers in some cases (then the pointer is a low hexadecimal value) and data on the CPU in others (then the pointer is a huge hexadecimal value). It also uses data on the CPU as the final argument of glDrawElements, while GLLara uses buffers here too. Unless the data changes every frame (and sometimes even then), all data should be in buffers. Luckily, moving it there is very easy.
The other thing is not. Right now, position, color, normal and so on all have their own arrays. Instead, you can store them interleaved. Interleaved means you have an array of blocks. Each block belongs to one vertex, and contains its position, normal vector, color, texture coordinate and so on.
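As an illustration of what such a block can look like, here is a small sketch. The exact set of fields and their sizes are assumptions for the example, not OpenTomb’s real vertex layout, and the names InterleavedVertex and interleave are made up.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One block per vertex. The field set is an assumption for illustration.
struct InterleavedVertex {
    float position[3];
    float normal[3];
    float color[4];
    float texCoord[2];
};

// Distance in bytes from one block to the next. This is the "stride"
// argument of the gl...Pointer functions.
constexpr std::size_t kVertexStride = sizeof(InterleavedVertex);

// Byte offsets of the attributes inside a block. With a buffer bound, these
// offsets are what gets passed as the "pointer" argument.
constexpr std::size_t kPositionOffset = offsetof(InterleavedVertex, position);
constexpr std::size_t kNormalOffset = offsetof(InterleavedVertex, normal);
constexpr std::size_t kColorOffset = offsetof(InterleavedVertex, color);
constexpr std::size_t kTexCoordOffset = offsetof(InterleavedVertex, texCoord);

// Merges four separate attribute arrays into one array of blocks.
std::vector<InterleavedVertex> interleave(const std::vector<float> &positions,
                                          const std::vector<float> &normals,
                                          const std::vector<float> &colors,
                                          const std::vector<float> &texCoords)
{
    const std::size_t count = positions.size() / 3;
    assert(normals.size() == count * 3);
    assert(colors.size() == count * 4);
    assert(texCoords.size() == count * 2);

    std::vector<InterleavedVertex> vertices(count);
    for (std::size_t i = 0; i < count; ++i) {
        std::copy_n(positions.data() + i * 3, 3, vertices[i].position);
        std::copy_n(normals.data() + i * 3, 3, vertices[i].normal);
        std::copy_n(colors.data() + i * 4, 4, vertices[i].color);
        std::copy_n(texCoords.data() + i * 2, 2, vertices[i].texCoord);
    }
    return vertices;
}
```

The resulting array is uploaded as a single buffer, and every gl…Pointer call then uses the same stride with a different offset.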
It isn’t immediately obvious why this would be faster, but everyone says it is (notably Nvidia and AMD, but also Apple). Let’s try it out.
That is a very noticeable improvement. I was tempted to say, “good enough”, but in my test scenario above, I still didn’t get 60 frames per second (with V-sync enabled).
You can also see that there is still room for improvement here. We’re still a far cry from GLLara concerning the time for glDrawElements. And now other functions are becoming interesting, too. glPopClientAttrib did not get any faster or slower, but if we can get rid of it now, that would be a very noticeable speed-up in percent.
The one function that absolutely has to be there is glDrawElements, so it would be interesting to see what it actually does. Using a normal time profiler (on a Mac, this will always be Instruments; on PC and Linux, there are many options), I saw that it spent almost all of its time in a function called gldUpdateDispatch. This is private to the driver and not documented, but the consensus in the Mac game developer community is that it handles state changes. That function will obviously not be called the same on Windows and Linux, but you will probably find something similar.
State changes are a key concept in OpenGL. When drawing something, you first set some states (e.g. which textures to use, which features to enable), then draw the thing. This is the pattern you’ve seen in all the traces. What is perhaps not obvious from them is that the states remain set until you explicitly change them, even after drawing or after the frame has ended, which can cause all sorts of confusion.
One of the basic rules of OpenGL optimizing is that state changes are evil and you should avoid them. Sadly, the rule is usually not explained in more detail. Which state changes are evil? You have to change state at some point, so which state changes are worse than others? There are hardly any precise answers, so I’ve built up my own rule of thumb. This requires some background.
From a programmer’s point of view, OpenGL can be run in two ways. Either you write shaders to do all the actual maths of getting pictures on the screen, or you use the fixed-function pipeline, which you configure through various glEnable calls. For example, OpenTomb, which doesn’t use shaders, calls glEnable(GL_TEXTURE_2D) to enable texturing. GLLara does not. It tells the GPU to use a specific shader, and whether this shader uses textures or not is its own problem.
This is not how a modern graphics card (DirectX 9 or higher) works, though. Modern graphics cards use only shaders. If you use the fixed function pipeline, there are still shaders being used. Only they’re not written by you, they are automatically generated by the driver.
(If this thought sounds scary to you, because you don’t trust the driver to do it right: You’re right. Write your own shaders unless you absolutely have to target devices that don’t support OpenGL 2.0 or higher. The only reason I haven’t introduced shaders in OpenTomb yet is that I don’t know whether TeslaRus wants to support such older computers.)
The way the driver does this is by waiting until you draw something. When you do, it checks whether any state changes that affect the shader code were made since the last draw call. If yes, it looks whether it has a matching shader, and if not, it creates one. This is highly optimized and very fast, but obviously, it can’t ever beat an explicit instruction to “use this shader now”.
One optimization in particular is the assumption that you’ll use the same shader as before. That’s usually true. For example, look at GLLara’s trace, which draws three things but sets shaders only twice, because two of the things use the same shader. It just provides different data through the various glBindSomething and glVertexAttrib calls.
OpenTomb actually uses the same shader for everything, too. It always enables the texture, culling and the alpha test, so the state that affects the shader ends up being the same. The problem is that the driver doesn’t know that. It only sees that the program called functions that change the state. In the interest of optimal performance, OpenGL drivers will not check whether the new state is the same as the old. So the driver checks whether it needs a new shader. Looking at the trace, it is obvious that the answer will be “No”, but it takes time to get there.
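Since the real driver code is private, the following is only a toy model of the behavior described above, written to show why setting a state to the value it already has still costs something. Everything in it (the class ToyDriver, its methods and its counters) is invented for illustration and says nothing about how any actual driver is implemented.

```cpp
#include <map>

// Toy model of a fixed-function driver. Not real driver code.
class ToyDriver {
public:
    // Stands in for glEnable/glDisable: stores the new value and marks the
    // shader state dirty without comparing against the old value.
    void setFeature(unsigned featureBit, bool enabled)
    {
        if (enabled) {
            stateBits_ |= featureBit;
        } else {
            stateBits_ &= ~featureBit;
        }
        dirty_ = true;
    }

    // Stands in for a draw call: only if some shader-relevant state was
    // touched since the last draw does it look for a matching shader.
    void draw()
    {
        if (!dirty_) {
            return; // fast path: assume the same shader as before
        }
        dirty_ = false;
        ++shaderLookups;
        if (shaderCache_.find(stateBits_) == shaderCache_.end()) {
            shaderCache_[stateBits_] = nextShaderId_++;
            ++shadersGenerated;
        }
    }

    int shaderLookups = 0;    // how often draw() had to search the cache
    int shadersGenerated = 0; // how often a new shader had to be created

private:
    unsigned stateBits_ = 0;
    bool dirty_ = true;
    int nextShaderId_ = 1;
    std::map<unsigned, int> shaderCache_;
};
```

In this model, enabling the same two features before each of three draws creates only one shader but still causes three lookups, while draws without any setFeature call in between cause none.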
With that follows my rule of thumb: State changes either change the data (that is everything with glBind…() and gl…Pointer()), uniform values for the current shader (for example all changes to the current matrix), or the current shader itself (mostly glEnable()). Of these, state changes that change the current shader are evil (and they are all about equally evil), the others are harmless. And because of the way OpenGL works, even if the new state is the same as the old state, it still counts as a state change.
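One common way to keep identical state changes from ever reaching the driver is to remember the last value on the application side. The sketch below shows the idea; it is not necessarily what OpenTomb does, and the class name FeatureFilter is made up. The callback stands in for the real glEnable/glDisable calls so that the example does not need an OpenGL context.

```cpp
#include <functional>
#include <map>
#include <utility>

// Remembers the last value set for each feature and only forwards real
// changes. The apply callback is a stand-in for glEnable/glDisable.
class FeatureFilter {
public:
    explicit FeatureFilter(std::function<void(unsigned, bool)> apply)
        : apply_(std::move(apply))
    {
    }

    void set(unsigned feature, bool enabled)
    {
        const auto known = current_.find(feature);
        if (known != current_.end() && known->second == enabled) {
            return; // already in this state, so the driver never hears of it
        }
        current_[feature] = enabled;
        apply_(feature, enabled);
    }

private:
    std::function<void(unsigned, bool)> apply_;
    std::map<unsigned, bool> current_; // features not listed are unknown
};
```

The first call for each feature is always forwarded, because the filter does not know the initial state. It also only works if all state changes go through it.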
Applying that rule of thumb to OpenTomb required way more refactoring than you might imagine at first thought. After all, you have to make sure that the current shader is only changed when you actually want a different one, and that things with the same shader get all drawn together. It paid off, though:
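The “drawn together” part can be sketched as a sort over the list of things to draw. This is a simplified illustration under the assumption that all shader-affecting state of an item fits into one bit mask; the names DrawItem, sortByShaderState and countShaderSwitches are hypothetical, and the real refactoring in OpenTomb was considerably more involved.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct DrawItem {
    unsigned shaderState; // hypothetical bit mask of shader-affecting features
    int meshId;           // which mesh to draw
};

// Orders the items so that all items with the same shader state are
// adjacent. stable_sort keeps the original order inside each group.
void sortByShaderState(std::vector<DrawItem> &items)
{
    std::stable_sort(items.begin(), items.end(),
                     [](const DrawItem &a, const DrawItem &b) {
                         return a.shaderState < b.shaderState;
                     });
}

// Counts how often the shader state has to be set when drawing the items in
// the given order, including the initial set for the first item.
std::size_t countShaderSwitches(const std::vector<DrawItem> &items)
{
    std::size_t switches = 0;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i == 0 || items[i].shaderState != items[i - 1].shaderState) {
            ++switches;
        }
    }
    return switches;
}
```

For five items that alternate between two states, the unsorted order needs five switches and the sorted order only two.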
In general, a shorter trace for the same scene is always better. You can see the result of applying the above rule of thumb now: Between two draw calls, I change some bindings, I change pointers (to always the same value, but they are relative to the buffer, so there is no way to avoid that, other than using Vertex Array Objects), and I change the matrix, and then it’s off to drawing. Calling glDrawElements now consistently takes less than ten microseconds, and typically less than five. That means it is even better than GLLara. I have my suspicions of why GLLara is slower, but you didn’t come here to hear about that app.
As I mentioned above, still, this does not tell you how long it takes to actually draw stuff. Specifically for OpenTomb, though, you can roughly assume that drawing doesn’t take any time at all, because Tomb Raider is old and just doesn’t have a lot of stuff to draw. The main goal here is to make sure the driver doesn’t get in the way. For other applications, you will probably spend more time trying to make sure that things that don’t get seen don’t get drawn.
With that, I get 60 FPS in my test scenario, and quite a lot more with vertical sync turned off. The performance problem is solved.
Of course, there are still things that could be optimized further. For example, do I really have to bind the same texture over and over again? I could also use Vertex Array Objects to condense the buffer bindings and pointers into a single call. I could also stop using the matrix stack and do the multiplications myself to shave off a few fractions of microseconds. But all that is premature optimization again. I’ll keep it in mind if I ever hit another performance issue in OpenTomb.