Category Archives: Unsubstantiated bitching

How I Learned to Stop Blitting and Love the Framebuffer

A deferred rendering pipeline provides great opportunities to run post processing shaders on the results from your G-buffer (A G-buffer is the target for the first pass in a deferred renderer, usually consisting of colour, normal, depth ± material textures, which stores the information needed for subsequent lighting and effects passes). There are a couple of issues, however:

1) OpenGL can’t both read from and write to a texture at the same time; attempting to do so will give an undefined result (i.e. per OpenGL tradition it will look like the result you wanted, except when it doesn’t)
2) What, therefore, do you do about transparency in a deferred renderer? You need the information about what’s behind the transparent object, and how far away it is. That’s in your G-buffer, which you’re already writing to (I’m assuming here that you’re rendering transparent materials last, which is really the only sane way to do it).

What you’re going to need is a copy of a subset of your G-buffer to provide the information you need to draw the stuff behind your transparent material. There’s an expensive way and a cheap way to do this; the expensive way is to do all of your deferred lighting before you render any transparent materials (which does make life easier in some ways, but means you need to do two lighting passes, one for opaque and one for transparent materials), the cheap way is to decide that you’re not going to bother to light the stuff behind the transparency because you’re already going to be throwing a bunch of refraction effects on top anyway and all you really need is a bit of detail to sell the effect.

Here’s where I tripped myself up, in the usual manner for a novice learning OpenGL from 10-year-old tutorials on the internet; I naïvely thought that the most logical thing to do at this point would be to blit (i.e. copy the pixels directly) from my G-buffer to another set of textures which I would then use as the source for rendering transparency. Duplicating a chunk of memory seemed like it was going to be a much faster operation than actually drawing anything. Because I’m not a total idiot, I did at least avoid using glCopyTexImage2D and went straight for the faster glCopyTexSubImage2D operation instead. The typical way you would use this is:

//bind the framebuffer you’re going to read from
glBindFrameBuffer(GL_FRAMEBUFFER, myGBuffer);
glViewport(0.0, 0.0, my_gBuffer_width, my_gBuffer_height);
//specify which framebuffer attachment you’re going to read
glReadBuffer(GL_COLOR_ATTACHMENT0); //or whichever
//bind the texture you’re going to copy to
glBindTexture(GL_TEXTURE_2D, myDuplicateTexture);
//do the copy operation
glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0,  my_gBuffer_width, my_gBuffer_height);

The problem with this is, the performance is terrible, especially if you’re copying across a PCI bus. I was copying a colour and a depth texture for a water effect, and a quick root around in the driver monitor led to the discovery that the GL was spending at least 70% of its time on those two copy operations alone. I’d imagine that this is likely a combination of copying into system memory for some reason and stalling the pipeline; to make matters worse, in this sort of situation you can only really copy the textures immediately before you need to use them, so unless you’re going to rig up some complicated double-buffer solution copying the last frame, asynchronous pixel buffer transfers aren’t going to save you. Using driver hints to keep the texture data in GPU memory might work, or it might not.

Here’s one solution:
Continue reading How I Learned to Stop Blitting and Love the Framebuffer

On GLSL subroutines in Mac OS X and disappointment

I probably don’t need to tell anybody that the state of 3D graphics is somewhat sad under OS X when compared with Windows. This isn’t really due to differences between the OpenGL API used by MacOS and DirectX, as used by Windows; even OpenGL-only Windows applications typically run significantly better than their MacOS counterparts. It’s more easily attributable to two other factors:

  1. There are far more computers running Windows than MacOS; these computers are more likely to be running intensive 3D applications (i.e. video games)
  2. Apple is notoriously disinterested in games, and notoriously tardy in keeping up with new OpenGL specifications.

This means that a) it takes a while for any new OpenGL features to make it to the Mac platform and b) they suck when they finally arrive.

As of Mavericks, the OpenGL Core profile has been upgraded to 4.0, and GLSL version 400 is supported. This means that shader subroutines have become available. Theoretically, shader subroutines are a really neat feature. Because of the way that graphics cards work, conditionals and branching in shader code incur a large performance penalty. Similarly, dynamic loops are much less efficient than loops of a fixed length, even though more recent graphics APIs claim to have fixed that. What this means is that if you have a shader which choses whether to do a cheap operation, or a more expensive operation, then it will perform worse then either (usually because it’s doing both). If that shader then choses how many times to do the expensive operation, the performance gets even worse despite the fact that it should theoretically be avoiding unnecessary iterations through the loop. This means that the best option has always been to write two different shaders for the simple and the complex operation, and not to bother dynamically limiting the number of iterations in a loop, but just hard code the smallest number you think you can get away with.

Shader subroutines were supposed to fix the first of these problems; it was supposed to be possible to write “modular” shaders, where a uniform allows you to change which operations a shader uses. In my renderer, which is admittedly poorly optimised, I would like to chose between a parallax-mapped, self-shadowing shader (expensive) or a simpler version which uses vertex normals – in this specific case, the simpler version is for drawing reflections, which don’t require the same level of detail. Here’s the results (Nvidia GTX680, MacOS X 10.9.4, similar results using both default Apple drivers and Nvidia web drivers):

  • No subroutines – both main render and reflection render use expensive shader: frame time approx. 0.028s
  • Subroutines coded in the shader, and uniforms set, but subroutines never actually called: frame time approx. 0.031s
  • Subroutines in use, cheaper subroutine used for drawing reflections: frame time approx. 0.035s

Vertex normals should be really cheap, and save a lot of performance when compared with parallax mapping everything. In addition, I’m not mixing and matching different subroutines – one subroutine is used for each pass, so switching only occurs once per frame. The problem is, the mere existence of code indicating the presence of a subroutine incurs a significant performance hit, which is actually more expensive than just giving up and using the much more complicated and expensive shader for everything.

So, yeah; disappointing.

Lies, damn lies and GL_MAX_TEXTURE_UNITS

Warning: this post contains much bitching, only some of which is substantiated, and much of which probably only applies to Intel integrated graphics

So, I guess you could probably point out that I’m being a bit melodramatic and that, essentially, anybody who tries to do much in the way of multitexturing using integrated graphics gets what they deserve.

However, you may find it useful to know that, despite OpenGL telling you that you have 48 texture units available, don’t, under any circumstances, try to actually use all of them. In fact, you’re playing fast-and-loose if you even try to use some. It might seem logical to you, as an OpenGL novice, to write your code so that each texture unit is only used in one of your shaders and is reserved for a particular texture function; say, I have a shader for drawing grass, to I bind my grass texture to GL_TEXTURE23, set my sampler uniform to use that texture unit, and call it a day.

Don’t do that.

In my testing, again on pretty limited integrated hardware, I halved my drawing time by using a total of less than 8 texture units and binding textures as required. This includes the fact that I use GL_TEXTURE0 both for a material’s main texture in the first pass, and for doing post-processing on the entire framebuffer in a later pass.

In short – “fewer texture units used” trumps “fewer texture binds” every time, when using limited hardware.