Hackers News

computer graphics, mathematics, shaders, fractals, demoscene and more

Intro


In this article I want to correct a popular misconception that’s been making the rounds in computer graphics aficionado circles for a long time now. It has to do with branching in the GPUs. Unfortunately there are a couple of educational websites out there that are spreading some misinformation and it would be nice correcting that. I tried contacting the authors without success, so without further ado, here goes my attempt to fix things up:

The issue


So, say I have this code, which I actually published the other day:

vec2 snap45( in vec2 v )
{
vec2 s = sign(v);
float x = abs(v.x);
return x>0.923880?vec2(s.x,0.0):
x>0.382683?s*sqrt(0.5):
vec2(0.0,s.y);
}

The exact details of what it does don’t matter for this discussion. All we care about is the two ternary operations, which as you know, implement conditional execution. Indeed, depending on the value of the variable x, the function will return different results. This could be implemented also with regular if statements, and all that I’m going to say stays the same.

But here’s the problem – when seeing code like this, somebody somewhere will invariably propose the following “optimization”, which replaces what they believe (erroneously) are “conditional branches” by arithmetical operations. They will suggest something like this:

vec2 snap45( in vec2 v )
{
vec2 s = sign(v);
float x = abs(v.x);

float w0 = step(0.92387953,x);
float w1 = step(0.38268343,x)*(1.0-w0);
float w2 = 1.0-w0-w1;

vec2 res0 = vec2(s.x,0.0);
vec2 res1 = vec2(s.x,s.y)*sqrt(0.5);
vec2 res2 = vec2(0.0,s.y);

return w0*res0 + w1*res1 + w2*res2;
}

There are two things wrong with this practice. The first one shows an incorrect understanding of how the GPU works. In particular, the original shader code had no conditional branching in it. Selecting between a few registers with a ternary operator or with a plain if statement does not lead to conditional branching; all it involves is a conditional move (a.k.a. “select”), which is a simple instruction to route the correct bits to the destination register. You can think of it as a bitwise AND+NAND+OR on the source registers, which is a simple combinational circuit. Again, there is no branching – the instruction pointer isn’t manipulated, there’s no branch prediction involved, no instruction cache to invalidation, no nothing.

For the record, of course real branches do happen in GPU code, but those are not what’s used by the GPU for small moves between registers like we have here. This is true for any GPU made in the last 20+ years. While I’m not an expert in CPUs, I am pretty sure this is true for them as well.

The second wrong thing with the supposedly optimizer version is that it actually runs much slower than the original version. The reason is that the step() function is actually implemented like this:

float step( float x, float y )
{
return x < y ? 1.0 : 0.0;
}

So people using the step() “optimization” are using the ternary operation anyways, which produces the 0.0 or 1.0 which they will use to select the output, only wasting two multiplications and one or two additions. The values could have been conditionally moved directly, which is what the original shader code did.

But don’t take my word for it, let’s look at the generated machine code for the relevant part of the shader I published:

GLSL


return x>0.923880?vec2(s.x,0.0):
x>0.382683?s*sqrt(0.5):
vec2(0.0,s.y);

AMD Compiler


s_mov_b32 s0, 0x3ec3ef15
v_mul_f32 v3, 0x3f3504f3, v1
v_mul_f32 v4, 0x3f3504f3, v0
s_mov_b32 s1, 0x3f6c835e
v_cmp_gt_f32 vcc, abs(v2), s0
v_cndmask_b32 v3, 0, v3, vcc
v_cndmask_b32 v0, v0, v4, vcc
v_cmp_ngt_f32 vcc, abs(v2), s1
v_cndmask_b32 v1, v1, v3, vcc
v_cndmask_b32 v0, 0, v0, vcc

Microsoft Compiler


lt r0.xy, l(0, 0), v0.xy
lt r0.zw, v0.xy, l(0, 0)
iadd r0.xy, -r0.xyxx, r0.zwzz
itof r0.xy, r0.xyxx
mul r1.xyzw, r0.xyxy, l4(0.707107)
lt r2.xy, l(0.923880,0.382683), |v0.xx|
mov r0.z, l(0)
movc r1.xyzw, r2.yyyy, r1.xyzw, r0.zyzy
movc o0.xyzw, r2.xxxx, r0.xzxz, r1.xyzw

Here we can see that the GPU is not branching. Instead, according to the AMD compiler, it’s performing the required comparisons (v_cmp_gt_f32 and v_cmp_ngt_f32 – cmp=compare, gt=greater than, ngt=not greated than), and then using the result to mask the results with the bitwise operations mentioned earlier (v_cndmask_b32 – cnd=conditional).

The Microsoft compiler has expressed the same idea/implementation in a different format, but you can still see the comparison (lt – “lt”=less than) and the masking or conditional move (movc – mov=move, c=conditionally).

Not related to the discussion, but also note that the abs() call does not become a GPU instruction and instead becomes an instruction modifier, which is free.

Conclusion


So, if you ever see somebody proposing this

float a = mix( b, c, step( y, x ) );

as an optimization to

float a = x < y ? b : c

then please correct them for me. The misinformation has been around for 20 years / 10 GPU generation, and that’s more than too long.

Thanks!




admin

The realistic wildlife fine art paintings and prints of Jacquie Vaux begin with a deep appreciation of wildlife and the environment. Jacquie Vaux grew up in the Pacific Northwest, soon developed an appreciation for nature by observing the native wildlife of the area. Encouraged by her grandmother, she began painting the creatures she loves and has continued for the past four decades. Now a resident of Ft. Collins, CO she is an avid hiker, but always carries her camera, and is ready to capture a nature or wildlife image, to use as a reference for her fine art paintings.

Related Articles

Leave a Reply