Neighbourhood sampling order during texture filtering
While tinkering with pixel-art scaling I found a source of glitches in anti-aliased threshold operations that use the derivative functions in shaders, so I worked out a workaround that saves me from having to think too hard about a bunch of corner cases.
When applying a window function in a shader (here I’ll use linear interpolation for simplicity; normally you would let the hardware do something like that, but real logic can be more complex), it’s perfectly reasonable to end up with something like this:
ivec2 i = ivec2(floor(uv));
vec2 weight = uv - vec2(i);
vec4 a = texelFetch(s, i + ivec2(0,0), 0), b = texelFetch(s, i + ivec2(1,0), 0),
     c = texelFetch(s, i + ivec2(0,1), 0), d = texelFetch(s, i + ivec2(1,1), 0);
return mix(mix(a, b, weight.x), mix(c, d, weight.x), weight.y);
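For reference, the same index-and-weight arithmetic can be checked on the CPU. Here’s a minimal one-dimensional Python sketch over a toy one-channel texture (the texture values and the helper names are made up for illustration):

```python
import math

# Toy one-channel "texture"; texelFetch becomes a plain list lookup.
tex = [0.0, 1.0, 4.0, 9.0, 16.0]

def sample(u):
    """One-dimensional analogue of the shader snippet above."""
    i = math.floor(u)             # ivec2 i = ivec2(floor(uv));
    w = u - i                     # vec2 weight = uv - vec2(i);
    a, b = tex[i], tex[i + 1]     # the two texel fetches
    return a * (1.0 - w) + b * w  # mix(a, b, w)

print(sample(1.5))  # halfway between tex[1] and tex[2] -> 2.5
```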
From that you might build things up, apply other filters, and whatever else; but what I noticed is that when subsequent operations want to use the derivative functions (dFdx(), dFdy(), and fwidth()) then there can be problems.
Problems like this:
Imagine you’re sampling along a line spanning from pixels M to R in the example below. Ignore the Y dimension for now. You’ll see a, b, and weight take step changes every time an integer boundary is crossed:
Those step changes cause dFdx() to return much larger values than expected at the transitions, even while the output of the function itself appears smooth.
This can affect the LOD calculation in mipmapping (though in a case like this you should probably be doing that manually), and the small transition band used for anti-aliased thresholding becomes an unexpectedly large band.
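To make the spike concrete, here’s a hypothetical CPU reproduction, using finite differences across an integer boundary to stand in for what dFdx() sees between adjacent fragments (the texture values and step size are made up):

```python
import math

tex = [0.0, 1.0, 4.0, 9.0, 16.0]

def sample(u):
    i = math.floor(u)
    w = u - i
    return tex[i] * (1.0 - w) + tex[i + 1] * w

# Central differences across the boundary at u = 2, standing in for dFdx().
h = 0.25
d_out = sample(2.0 + h) - sample(2.0 - h)        # the blended output
d_i = math.floor(2.0 + h) - math.floor(2.0 - h)  # the integer index i
d_w = (2.0 + h - math.floor(2.0 + h)) - (2.0 - h - math.floor(2.0 - h))

print(d_out)  # 2.0  -- the output varies smoothly across the boundary
print(d_i)    # 1    -- i steps, though u only moved by 0.5
print(d_w)    # -0.5 -- weight steps backwards while u moved forwards
```

The output itself changes by a sensible amount, but the intermediates i and weight jump, which is exactly what blows up any derivative taken from them.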
Here’s a workaround I came up with:
ivec4 i = ivec4(floor((uv.xyxy + vec4(1,1,0,0)) / 2.0) * 2.0 + vec4(0,0,1,1));
vec2 weight = abs(uv - vec2(i.xy));
vec4 a = texelFetch(s, i.xy, 0), b = texelFetch(s, i.zy, 0),
     c = texelFetch(s, i.xw, 0), d = texelFetch(s, i.zw, 0);
return mix(mix(a, b, weight.x), mix(c, d, weight.x), weight.y);
This rearranges the offset coordinates so that a always gets a pixel from an even column and even row index, d always gets a pixel from an odd column and odd row, etc.
Consequently, variables change like so:
While a and b do still take step changes, they do so when their corresponding weights are zero. Depending on the situation this may take care of the problem already, or it may be necessary to rearrange a bit more of the arithmetic so that the multiplication by the zero weight happens earlier, forcing the switch to appear smooth.
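A one-dimensional Python sketch of the even/odd indexing (toy texture values, helper names of my own choosing) makes the zero-weight property easy to verify:

```python
import math

tex = [0.0, 1.0, 4.0, 9.0, 16.0]

def indices(u):
    # Nearest even integer and nearest odd integer to u, mirroring
    # floor((u + 1) / 2) * 2 and floor(u / 2) * 2 + 1 from the shader.
    even = int(math.floor((u + 1.0) / 2.0) * 2.0)
    odd = int(math.floor(u / 2.0) * 2.0 + 1.0)
    return even, odd

def sample(u):
    even, odd = indices(u)
    w = abs(u - even)  # 0 at even texels, 1 at odd texels
    return tex[even] * (1.0 - w) + tex[odd] * w

# The even tap only changes at odd integers (where its weight is 1), and
# the odd tap only changes at even integers (where its weight is 0):
print(indices(1.9), indices(2.1))  # (2, 1) (2, 3) -- odd tap steps at u = 2
print(sample(1.5), sample(2.5))    # matches plain bilinear: 2.5 6.5
```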
Alternatively, it’s possible to be careful and calculate the derivatives on the continuous values earlier in the code. That’s probably better if you don’t mind doing it, but it can mean carrying around a bit more data and remembering to do things that you might not otherwise want to remember.
Another benefit of doing it this way: if one input pixel contains an outlier which causes a different path to be taken, then all the fragments touching that pixel see the aberration at the same stage in the function, so they can all take the same branch in unison, rather than each having to branch at a different stage depending on that pixel’s position relative to themselves (though branching may still cause problems with derivatives).
This can be extended to a mod-n system for kernels of size n, capturing pixels into something more like a ring buffer, where only the edge cases (still the zero-weighted cases) get updated during a transition.
vec4 ix = floor((uv.x + vec4(2,1,0,-1)) / 4.0) * 4.0 + vec4(0,1,2,3);
vec4 iy = floor((uv.y + vec4(2,1,0,-1)) / 4.0) * 4.0 + vec4(0,1,2,3);
vec4 weightx = window(abs(uv.x - ix));  // window() is zero at distance 2
vec4 weighty = window(abs(uv.y - iy));
vec4 acc = vec4(0);
for (int i = 0; i < 4; ++i) {
    for (int j = 0; j < 4; ++j) {
        acc += texelFetch(s, ivec2(ix[j], iy[i]), 0) * weightx[j] * weighty[i];
    }
}
The ring-buffer analogy is misleading, of course, because the adjacent pixels are computed concurrently and they all fill up their own private copies of the buffer at the same time without sharing context, so there isn’t the bandwidth saving of a classical ring buffer. But the real point is that they all have mostly the same values at the same offsets, and so this mitigates a class of glitches in the derivatives.
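One way to pin down the tap positions is to keep each mod-4 slot holding the unique congruent integer inside a window centred on the sample point, so a tap only moves when it sits at the window edge, where a radius-2 kernel weights it at zero. A quick Python check (the helper name is my own):

```python
import math

def tap_indices(u, n=4):
    # For each residue k mod n, the unique integer congruent to k inside
    # the window (u - n/2, u + n/2]. A tap only moves when it sits exactly
    # n/2 from u, where a radius-n/2 window weights it at zero.
    return [int(math.floor((u + n // 2 - k) / n) * n + k) for k in range(n)]

print(tap_indices(5.5))  # [4, 5, 6, 7] -- four consecutive texels,
                         # each held in a fixed mod-4 slot

# The k = 0 slot holds texel 4 until u crosses 6, where it jumps to 8;
# both the old and new taps sit at distance 2, i.e. at zero weight.
print(tap_indices(5.9)[0], tap_indices(6.1)[0])  # 4 8
```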