Thus far, we’ve discussed the reasons for wanting to do depth of field (DoF) on the GPU. We’ve figured out that we need to get some idea of what our focal length should be, and to figure out by how much the depth of each fragment in our scene render differs from that focal length. All of this information is available in the depth buffer we generated when we rendered the scene’s geometry; if we rendered into a framebuffer object with a depth attachment, this means we have the information available on the GPU in the form of a texture.
Getting data back off the GPU is a pain. It can be accomplished effectively with pixel buffer objects, but the available bandwidth between CPU and GPU is comparatively tiny and we don’t really want to take any up if we can help it. Additionally, we don’t want either the CPU or the GPU stalled while waiting for each other’s contribution, because that’s inefficient. It’s therefore more logical to analyse the depth buffer on the GPU using a shader. As a bonus, you can do this at the same time as you’re linearising the depth buffer for other post-processing operations – for example, you might need a linear depth buffer to add fog to your scene or for SSAO.
What we’re going to do is generate a representation of how each fragment’s depth differs from the focal length, the results of which will look something like this:
Modern 3D games use a bunch of tricks to convince our brains that we are viewing their world through some bizarre hybrid sense organ which consists of about 30% human eye and 70% movie camera. Hence we get lens flares, aperture changes and other movie staples which aren’t exactly true to life; we accept these effects probably because a) we are all so highly mediated these days that we expect the things which appear on our TVs/monitors to look like that and b) because they make shiny lights dance around the screen, and us primates love that stuff.
(An aside; anybody who wears glasses is totally used to lens flares, bloom lighting and film grain effects in everyday life, which is probably another reason why us nerds are so accepting of seeing the world as a movie. These settings can be temporarily toggled off with the use of a small amount of detergent and a soft cloth, but tend to return to the defaults over time).
What the human eye does have in spades, though, is dynamic depth of field. Anything outside of the centre of the field of view is out of focus and therefore appears blurred (and also in black and white, but let’s pretend we don’t know that). Humans generally focus on the thing in the centre of their visual field, even when the thing they are actually attending to isn’t (hence when you watch something out of the corner of your eye, it’s still blurry). Because depth of field effects weren’t at all viable on early graphics hardware, a lot of people have got used to everything in a scene having the same sharpness and dislike the addition of depth of field. However, used tastefully, it can nicely work as a framing effect; in addition it’s pretty handy to hide lower-resolution assets in the background.
The technique I am going to explain here has a major advantage for my purposes; the whole thing can be done as a post-process on the GPU, meaning that you don’t have to fiddle around with scene graphs or reading your depth buffer back for calculations on the CPU.
I probably don’t need to tell anybody that the state of 3D graphics is somewhat sad under OS X when compared with Windows. This isn’t really due to differences between the OpenGL API used by MacOS and DirectX, as used by Windows; even OpenGL-only Windows applications typically run significantly better than their MacOS counterparts. It’s more easily attributable to two other factors:
There are far more computers running Windows than MacOS; these computers are more likely to be running intensive 3D applications (i.e. video games)
Apple is notoriously disinterested in games, and notoriously tardy in keeping up with new OpenGL specifications.
This means that a) it takes a while for any new OpenGL features to make it to the Mac platform and b) they suck when they finally arrive.
As of Mavericks, the OpenGL Core profile has been upgraded to 4.0, and GLSL version 400 is supported. This means that shader subroutines have become available. Theoretically, shader subroutines are a really neat feature. Because of the way that graphics cards work, conditionals and branching in shader code incur a large performance penalty. Similarly, dynamic loops are much less efficient than loops of a fixed length, even though more recent graphics APIs claim to have fixed that. What this means is that if you have a shader which choses whether to do a cheap operation, or a more expensive operation, then it will perform worse then either (usually because it’s doing both). If that shader then choses how many times to do the expensive operation, the performance gets even worse despite the fact that it should theoretically be avoiding unnecessary iterations through the loop. This means that the best option has always been to write two different shaders for the simple and the complex operation, and not to bother dynamically limiting the number of iterations in a loop, but just hard code the smallest number you think you can get away with.
Shader subroutines were supposed to fix the first of these problems; it was supposed to be possible to write “modular” shaders, where a uniform allows you to change which operations a shader uses. In my renderer, which is admittedly poorly optimised, I would like to chose between a parallax-mapped, self-shadowing shader (expensive) or a simpler version which uses vertex normals – in this specific case, the simpler version is for drawing reflections, which don’t require the same level of detail. Here’s the results (Nvidia GTX680, MacOS X 10.9.4, similar results using both default Apple drivers and Nvidia web drivers):
No subroutines – both main render and reflection render use expensive shader: frame time approx. 0.028s
Subroutines coded in the shader, and uniforms set, but subroutines never actually called: frame time approx. 0.031s
Subroutines in use, cheaper subroutine used for drawing reflections: frame time approx. 0.035s
Vertex normals should be really cheap, and save a lot of performance when compared with parallax mapping everything. In addition, I’m not mixing and matching different subroutines – one subroutine is used for each pass, so switching only occurs once per frame. The problem is, the mere existence of code indicating the presence of a subroutine incurs a significant performance hit, which is actually more expensive than just giving up and using the much more complicated and expensive shader for everything.