Compressed GPU texture formats – a review and compute shader decoders – part 2

This is the second part of the blog series I started in part 1. We have covered the S3TC, RGTC and ETC family of formats. This served as a good introduction to the topic of texture compression, but from here, the complexity will explode.

BPTC

This post will be dedicated to the BPTC compressed formats. These are the BC6 and BC7 formats in Vulkan, and they are the state of the art in texture compression on desktop GPUs, completing RGB and RGBA compression. BC1 and BC3 were the only proper desktop alternatives before these formats came along, and the quality of BC1 and BC3 isn’t … great.

BPTC also adds support for HDR. This is a big deal as before BPTC it was not possible to properly compress HDR data.

I pondered a bit over which format I should present first, but I think it’s appropriate to start with BC6 since it is much simpler than BC7 in terms of overall complexity. I feel like when BC6 was designed it sacrificed complexity from BC7 in order to add HDR.

Common ideas

Partitions

As we saw in ETC2, there was a very crude and early attempt to add partitions into the block format. The T and H formats in particular let you specify two different color endpoint pairs which could be selected at will using 1 bit per texel. Rather than letting the partition be dictated by the 2×4 / 4×2 sub-block, you could select any partition scheme.

Specifying an entire bit per texel to select a partition is quite overkill however, especially for a format which just gives you 64 bits for color. This makes the T and H modes crude hacks on top of ETC1. BPTC instead adds the concept of a pre-made table of partition shapes, and support for more than 2 partitions. Rather than specifying the partition index for each texel, we specify say, 64 common shapes, and just spend up to 6 bits per block (6 / 16 bits per texel) to encode the information. Of course, this idea falls flat if the exact partition we need doesn’t exist, but BPTC’s partitions seem well thought out. This is just a pain to implement since it means copy-pasting over a lot of tables ._.
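To make the table idea concrete, here is a minimal sketch in plain C rather than shader code. The two shapes are made up for illustration and are not the tables from the specification; the names `example_shapes` and `texel_partition` are hypothetical as well. Each shape is stored as a 16-bit mask with one bit per texel, so a block only has to store a small shape index.

```c
#include <stdint.h>

/* Illustrative shapes only: the real BPTC tables come from the specification.
 * Bit i of a mask holds the partition (0 or 1) of texel i, where
 * i = y * 4 + x inside the 4x4 block. */
static const uint16_t example_shapes[2] = {
    0xCCCCu, /* columns 2-3 belong to partition 1 (left/right split) */
    0xFF00u, /* rows 2-3 belong to partition 1 (top/bottom split) */
};

/* Returns which partition a texel belongs to for a given shape index. */
int texel_partition(unsigned shape_index, unsigned x, unsigned y)
{
    unsigned texel = y * 4u + x;
    return (int)((example_shapes[shape_index & 1u] >> texel) & 1u);
}
```

With 64 such masks the shape index costs 6 bits per block, compared to the 16 bits a per-texel selector would need.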

Variable bit depth for endpoints and weights

Earlier formats were extremely rigid in how the blocks were to be encoded. You get X number of bits to specify endpoints and 2 bits per texel for the weights, a neat 32/32 bit split between the control block and texel data. Overall, this is a good setup for 64 bit block formats like BC1 and ETC, but it falls a bit short for 128 bit block formats like BPTC. Now we have more freedom and headroom to either spend lots of bits on endpoint accuracy, more weight bits per texel, or perhaps more partitions, which require a lot more bits since every partition needs its own endpoints. Essentially, everything becomes a trade-off. More partitions can deal better with uncorrelated texels, more endpoint precision can deal with smooth gradients, and more weight bits improve general PSNR when the dynamic range in the block is large.

How all of this is configured is done through …

Mode hell

As we also saw with ETC2, there were multiple “modes” a block could enter into, depending on the encoded bits. We have the “differential”, “individual”, “T”, “H” and “Planar” modes. This already adds a lot of complexity to an encoder — which needs to figure out which mode is best through heuristics and trial and error — and decoder — which has to implement everything.

BPTC makes the mode idea more explicit and a certain number of bits are reserved to express the “mode” a block will be in.

Exploiting endpoint symmetry

While earlier formats used endpoint symmetry as a way to enable different modes (e.g. BC1), BPTC has already encoded the mode explicitly. Instead, we exploit symmetry to save one weight bit. We can simply assume that the MSB weight bit of the first texel is 0. We can always flip the order of the endpoints to make this work. Actually, we can save one bit per partition, since one texel per partition can be assumed to have MSB weight bit of 0. Which texels to treat specially is all done through another look-up table (sigh …) defined by the specification. Unfortunately, this means bitfield extraction from the 128-bit payload becomes very awkward and irregular … For this purpose I had to make a bitfield helper function. I doubt it’s very efficient, but you gotta do what you gotta do …

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/bitextract.h
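For illustration, a bitfield helper of this kind could look roughly like the following in plain C. This is a sketch and not the contents of bitextract.h; the names `extract_bits128` and `weight_offset_1p` are hypothetical, and the payload is assumed to be four 32-bit words with bit 0 in the first word.

```c
#include <stdint.h>

/* Reads `count` bits (1..32) starting at bit `offset` (0..127) from a
 * 128-bit block stored as four 32-bit words, lowest word first. */
uint32_t extract_bits128(const uint32_t payload[4], unsigned offset, unsigned count)
{
    unsigned word = offset >> 5;
    unsigned shift = offset & 31u;
    uint64_t window = payload[word];
    /* Pull in the next word when the field straddles a 32-bit boundary. */
    if (shift + count > 32u && word < 3u)
        window |= (uint64_t)payload[word + 1u] << 32;
    uint64_t mask = (count < 32u) ? ((1ull << count) - 1ull) : 0xFFFFFFFFull;
    return (uint32_t)((window >> shift) & mask);
}

/* Bit position of a texel's weight in a single-partition block, where only
 * texel 0 is an anchor and therefore stores one bit less than the others. */
unsigned weight_offset_1p(unsigned base, unsigned texel, unsigned bits)
{
    return texel == 0u ? base : base + texel * bits - 1u;
}
```

With a 65-bit header and 4-bit weights, texel 15 starts at bit 124 and ends exactly at bit 128, which shows where the saved anchor bit goes. With several partitions the offsets also depend on the anchor table.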

Normalized weight un-quantization

Up until now there hasn’t been any good way of interpolating between endpoints in any format. Interpolation in S3TC and RGTC involves a non-POT divider, which is awkward to make bit-exact, and ETC technically does not interpolate between endpoints, as there is an offset table (which is basically the same thing in practice, but it doesn’t concern itself with a divider). While ETC has bit-exact decoding, S3TC and RGTC do not, as the interpolation is defined in terms of floating point.

With the number of weight bits being variable in BPTC, it makes sense to normalize the weights to a range which is easier to work with, and can easily support bit-exact decoding. Values in the range of [0, 64] were chosen. BPTC can use either 2, 3 or 4-bit weights, and the specification defines a table which normalizes the weights onto the [0, 64] range. Interpolation can then be easily done in fixed point as:

interpolated = (a * (64 - weight) + b * weight + 32) >> 6;

This avoids the non-POT divider headache we’ve seen earlier. I tried finding a neat arithmetic expression to re-normalize the weights without a LUT, but I gave up pretty quickly.
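As a small worked example, the normalization tables and the interpolation can be sketched in plain C as below. The table values are the ones tabulated by the specification; the function names `normalized_weight` and `interpolate_endpoints` are made up for this sketch.

```c
#include <stdint.h>

/* Normalized weights for 2-, 3- and 4-bit indices. */
static const uint8_t weights2[4] = { 0, 21, 43, 64 };
static const uint8_t weights3[8] = { 0, 9, 18, 27, 37, 46, 55, 64 };
static const uint8_t weights4[16] = { 0, 4, 9, 13, 17, 21, 26, 30,
                                      34, 38, 43, 47, 51, 55, 60, 64 };

/* Maps a raw weight index of the given bit width onto the [0, 64] range. */
int normalized_weight(unsigned index, unsigned bits)
{
    switch (bits) {
    case 2: return weights2[index & 3u];
    case 3: return weights3[index & 7u];
    default: return weights4[index & 15u];
    }
}

/* Fixed-point blend of two endpoints; weight 0 gives a, weight 64 gives b. */
int interpolate_endpoints(int a, int b, int weight)
{
    return (a * (64 - weight) + b * weight + 32) >> 6;
}
```

For example, blending 0 and 255 with the 2-bit index 1 (normalized weight 21) gives (255 * 21 + 32) >> 6 = 84.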

No planar mode?

One thing I found rather puzzling was the lack of a planar mode in BPTC. ETC2 has it and seems kinda useful …

BC6 – 4×4 – 128 bits

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/bc6.comp

BC6 is laser focused on compressing HDR data (FP16). There is only RGB support, no RGBA at all.

Representing floating point without floating point

In the other formats we have looked at, we have used normalized integers as a way to represent endpoints and colors. Interpolation of integer endpoints is very easy since it’s just fixed point math. When we start introducing floating point into the mix, there is a question of how we represent this efficiently. As we have seen, we cannot spend too many bits on representing endpoints, even with 8-bit color. With FP16, representing endpoints in full 16-bits per channel is extremely wasteful. As with 8-bit color, we need to have a simple and efficient solution for quantization of FP16 values with an arbitrary amount of bits.

A hypothetical solution for FP16 could be achieved by storing the exponent (5 bits) and just quantizing the mantissa accordingly. BC6 isn’t far from this idea, although it is much simpler than that. BC6 exploits the fact that the bit patterns of finite, positive floating point numbers increase monotonically with the values they represent.

For example, if we consider that the internal representation of endpoints is 16-bit unsigned, we interpolate endpoints as 16-bit integers, and perform a scaling operation and bitcast to FP16:

fp16_bits = (interpolated_value * 31) / 64;
fp16_texel = bitcast<fp16>(fp16_bits);

What this effectively does is to map 0 to 0.0 in FP16 and 0xffff to 0x7bff, which is 65504.0 and the largest finite representable value in FP16 (0x7c00 is +Inf). Everything in-between is monotonically increasing. The interpolation itself ends up being non-linear (closer to logarithmic), but for HDR data this should be fine. Linear light values are perceptually logarithmic anyways, so it might actually make more sense to interpolate in a logarithmic space rather than linear here, a very nice hack indeed! Of course, this is just a crude approximation as the FP16 representation is only piece-wise logarithmic.
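The scaling step can be tried out in plain C. This is only a sketch of the arithmetic described above; it returns the raw FP16 bit pattern instead of a float, and the function name `bc6_finish_unsigned` is hypothetical.

```c
#include <stdint.h>

/* Maps an interpolated unsigned 16-bit value onto FP16 bit patterns:
 * 0x0000 -> 0x0000 (0.0), 0x8000 -> 0x3e00 (1.5), 0xffff -> 0x7bff (65504.0). */
uint16_t bc6_finish_unsigned(uint32_t interpolated_value)
{
    return (uint16_t)(((interpolated_value & 0xffffu) * 31u) >> 6);
}
```

The midpoint of the integer range decodes to 1.5 rather than to something near 32752, which shows how strongly non-linear the mapping is.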

For the signed format (BC6H_SFLOAT), we do a very similar fix-up step, but it is a bit more involved since we need to take care of the sign bit. The interpolated value is assumed to be a signed integer in [-0x8000, 0x7fff] range.

signed = interpolated_value < 0;
fp16_bits = (abs(interpolated_value) * 31) / 32;
fp16_bits |= signed ? 0x8000 : 0;
fp16_texel = bitcast<fp16>(fp16_bits);

A recognized bug in the specification (or feature depending on the situation) is that -0x8000 will be translated to -Inf (0x7c00 | sign_bit). It is not possible to represent +Inf.
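The signed fix-up, including the -Inf corner case, can be reproduced with a few lines of plain C. This is a sketch of the pseudocode above with a hypothetical function name; it returns the FP16 bit pattern.

```c
#include <stdint.h>

/* Converts an interpolated signed value in [-0x8000, 0x7fff] to FP16 bits. */
uint16_t bc6_finish_signed(int32_t interpolated_value)
{
    int negative = interpolated_value < 0;
    uint32_t magnitude = (uint32_t)(negative ? -interpolated_value : interpolated_value);
    uint32_t fp16_bits = (magnitude * 31u) >> 5;
    if (negative)
        fp16_bits |= 0x8000u; /* set the FP16 sign bit */
    return (uint16_t)fp16_bits;
}
```

Here 0x7fff maps to 0x7bff (65504.0) and -0x7fff to 0xfbff (-65504.0), while -0x8000 scales to exactly 0x7c00 and comes out as 0xfc00, which is -Inf.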

Transformed endpoints

A thing BC6 can do is to assume correlation between two endpoints. This is similar to the “differential” mode in ETC2, but BC6’s variant of it is far more flexible. Rather than giving us X number of bits per component, we encode the first endpoint with more bits, and then the other endpoint as an offset from the first, with fewer bits. This combines very well with multiple partitions, since the other partition’s endpoints are also encoded as differentials from the base endpoint.
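A rough sketch of how a transformed endpoint could be reconstructed, in plain C. It assumes that the delta is sign-extended, added to the base endpoint and wrapped to the endpoint bit width, which is my reading of the specification for the unsigned case; the signed format would additionally sign-extend the result. Both function names are hypothetical.

```c
#include <stdint.h>

/* Sign-extends the low `bits` bits of value (1 <= bits <= 31). */
int32_t sign_extend(uint32_t value, unsigned bits)
{
    uint32_t sign = 1u << (bits - 1u);
    value &= (sign << 1) - 1u;
    return (int32_t)((value ^ sign) - sign);
}

/* Rebuilds one component of a transformed endpoint from the base endpoint
 * and its stored delta, wrapping the sum to `endpoint_bits` bits. */
uint32_t apply_endpoint_delta(uint32_t base, uint32_t delta,
                              unsigned endpoint_bits, unsigned delta_bits)
{
    int32_t offset = sign_extend(delta, delta_bits);
    return (uint32_t)((int32_t)base + offset) & ((1u << endpoint_bits) - 1u);
}
```

With a 10-bit base of 512 and a 5-bit delta of 0x1f (which is -1), the second endpoint becomes 511.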

Mode “hell”

BC6 has a lot of modes to choose from, 14 to be specific. However, that is only at first glance. I like to separate these into 2 major types:

  • 2 partition modes
  • 1 partition modes

After selecting how many partitions you have, most of the modes just let you specify how many bits are spent on the base endpoint color, and how many bits are spent for each transformed endpoint (delta bits). Essentially, modes 0, 1, 2, 6, 10, 14, 18, 22, 26 are all the same, with different bit-allocation schemes. Mode 30 appears to be slightly different at first glance since it does not have transformed endpoints, but that’s just because delta bits == endpoint bits at this point, so it’s meaningless to transform the endpoints. The story is the same for the modes where the number of partitions is 1.

Based on the mode, we either get 3 bits per texel of weights (2 partitions) or 4 bits (1 partition).

The major issue in decoding BC6 is that the bit layout for each mode is completely nonsensical and irregular. The bits are packed in seemingly random places, so unfortunately the decoder ends up with:

if ((mode & 2) == 0)
{
    if ((mode & 1) != 0)
        interp = decode_bc6_mode1(payload, linear_pixel, part, anchor_pixel);
    else
        interp = decode_bc6_mode0(payload, linear_pixel, part, anchor_pixel);
}
else
{
    switch (mode)
    {
    case 2:
        interp = decode_bc6_mode2(payload, linear_pixel, part, anchor_pixel);
        break;
    case 3:
        interp = decode_bc6_mode3(payload, linear_pixel);
        break;
    case 6:
        interp = decode_bc6_mode6(payload, linear_pixel, part, anchor_pixel);
        break;
    case 7:
        interp = decode_bc6_mode7(payload, linear_pixel);
        break;
    case 10:
        interp = decode_bc6_mode10(payload, linear_pixel, part, anchor_pixel);
        break;
    case 11:
        interp = decode_bc6_mode11(payload, linear_pixel);
        break;
    case 14:
        interp = decode_bc6_mode14(payload, linear_pixel, part, anchor_pixel);
        break;
    case 15:
        interp = decode_bc6_mode15(payload, linear_pixel);
        break;
    case 18:
        interp = decode_bc6_mode18(payload, linear_pixel, part, anchor_pixel);
        break;
    case 22:
        interp = decode_bc6_mode22(payload, linear_pixel, part, anchor_pixel);
        break;
    case 26:
        interp = decode_bc6_mode26(payload, linear_pixel, part, anchor_pixel);
        break;
    case 30:
        interp = decode_bc6_mode30(payload, linear_pixel, part, anchor_pixel);
        break;
    default:
        interp = DecodedInterpolation(ivec3(0), ivec3(0), 0);
        break;
    }
}

Not pretty 🙁 Most of the code in bc6.comp is used to decode all the weird variants.
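The mode numbers in the switch above come from a variable-length field at the start of the block. A small plain C sketch of that first step is shown here; the function names are hypothetical, and the partition counts are read off from which decode functions above take the partition arguments.

```c
#include <stdint.h>

/* Modes 0 and 1 use a 2-bit field; every other mode uses 5 bits. */
unsigned bc6_block_mode(uint32_t first_word)
{
    unsigned low2 = first_word & 3u;
    return low2 < 2u ? low2 : (first_word & 0x1fu);
}

/* Returns the number of partitions for a mode, or 0 for a reserved mode. */
int bc6_partition_count(unsigned mode)
{
    switch (mode) {
    case 0: case 1: case 2: case 6: case 10:
    case 14: case 18: case 22: case 26: case 30:
        return 2;
    case 3: case 7: case 11: case 15:
        return 1;
    default:
        return 0;
    }
}
```

The four remaining 5-bit patterns (19, 23, 27 and 31) end up in the default case, just like in the shader.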

Endpoint quantization

Once we have the endpoints, we un-quantize them to the full 16-bit integer range with a rounded shift, special-casing zero and the maximum value. At this point, we can interpolate the endpoints using the normalized weights, and apply the BC6 fixup to turn the result into a final FP16 value.

ivec3 unquantize_endpoint(ivec3 ep, int bits)
{
    ivec3 unq;
    // Specialization constant
    if (SIGNED)
    {
        // Sign-extend
        ep = bitfieldExtract(ep, 0, bits);
        if (bits < 16)
        {
            ivec3 sgn = 1 - ((ep >> 30) & 2); // 1 or -1
            ivec3 abs_ep = abs(ep);
            unq = ((abs_ep << 15) + 0x4000) >> (bits - 1);
            // Special cases. Boolean mix FTW.
            unq = mix(unq, ivec3(0), equal(ep, ivec3(0)));
            unq = mix(unq, ivec3(0x7fff), greaterThanEqual(abs_ep, ivec3((1 << (bits - 1)) - 1)));
            unq *= sgn;
        }
        else
            unq = ep;
    }
    else
    {
        // Zero-extend
        ep = ivec3(bitfieldExtract(uvec3(ep), 0, bits));
        if (bits < 15)
        {
            unq = ((ep << 15) + 0x4000) >> (bits - 1);
            // Special-cases.
            unq = mix(unq, ivec3(0), equal(ep, ivec3(0)));
            unq = mix(unq, ivec3(0xffff), equal(ep, ivec3((1 << bits) - 1)));
        }
        else
            unq = ep;
    }
    return unq;
}

Summary

BC6 introduces a lot of new concepts that BC7 will make use of as well. The main complication is all the different modes and the awkward bit packing layouts. After extracting the quantized endpoints, the process from there is quite straightforward.

BC7 – 4×4 – 128 bits

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/bc7.comp

BC7 is similar to BC6 in many ways, and uses many of the same ideas.

8-bit endpoints

Just as BC6 endpoints are un-quantized to 16 bits before interpolation, BC7 endpoints are un-quantized to 8-bit integers. This is done through the typical “bit-replication” algorithm that we often see when expanding RGB444 or RGB565 to 8 bits per channel. The weight interpolation is done the exact same way as in BC6, with a [0, 64] range of weights.
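Bit replication can be sketched in plain C as follows. The function name `expand_to_8bit` is hypothetical, and the sketch assumes the source width is between 4 and 8 bits so that a single replication step fills all the vacated low bits.

```c
#include <stdint.h>

/* Expands a `bits`-wide value (4 <= bits <= 8) to 8 bits by copying its
 * most significant bits into the vacated low bits. */
uint8_t expand_to_8bit(uint32_t value, unsigned bits)
{
    uint32_t shifted = value << (8u - bits);
    return (uint8_t)(shifted | (shifted >> bits));
}
```

A 5-bit value of 31 becomes 255 and 0 stays 0, so the full range is preserved without any division.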

Support for 3 partitions

BC6 only supports up to 2 partitions, but BC7 can support 3. There are separate tables for 3 partition modes compared to 2 partition modes. (Fun …)

Flexible alpha channel

Unlike BC6, BC7 fully supports alpha, and finally we have a way to encode RGB and A together. This is a big improvement over the older formats like S3TC and ETC2 which encode alpha completely separately in different block formats. When we encode RGB and A together, we can allocate bits between them as we please, e.g.:

  • No alpha. If alpha is constant 1.0 inside a block, we can use all 128 bits for RGB.
  • Correlated RGB and A. If these components are correlated, we can encode endpoints as RGBA with one weight per texel. This saves a lot of weight bits.
  • Uncorrelated RGB and A. This is the case essentially assumed by BC2, BC3 and ETC2. However, BC7 will let us tune a little bit how many bits we spend on color and how many bits we spend on alpha.

Uncorrelated color channel support

Since we can encode RGB and A as uncorrelated endpoints, what if we could freely select which component is uncorrelated? BC7 can do this, and this is expressed through the rotation bits. After decoding, we can swap A out with any other component. That way we can technically encode (R, GB), (G, RB) or (B, RG) for example, with alpha taking the place of the component that was split off. This seems rather niche, but it’s there … I suspect the designers happened to have 2 extra bits lying around they could use for this purpose.
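The swap itself is tiny. Here is a plain C sketch, assuming the usual meaning of the 2-bit rotation field (0 = no swap, 1 = swap A with R, 2 = swap A with G, 3 = swap A with B); the function name is hypothetical.

```c
#include <stdint.h>

/* Applies the rotation to a decoded texel in place.
 * rgba is indexed as R = 0, G = 1, B = 2, A = 3. */
void apply_rotation(uint8_t rgba[4], unsigned rotation)
{
    rotation &= 3u;
    if (rotation != 0u) {
        unsigned channel = rotation - 1u;
        uint8_t tmp = rgba[channel];
        rgba[channel] = rgba[3];
        rgba[3] = tmp;
    }
}
```

With rotation 2, a decoded texel of (10, 20, 30, 40) becomes (10, 40, 30, 20), so the separately weighted channel ends up in G.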

No transformed endpoints

In a somewhat curious design choice, there is no support for transformed (correlated) endpoints in BC7. I suspect that for this reason, the number of modes could be kept somewhat sensible. I also suspect the need for transformed endpoints isn’t as great since we’re not working with HDR values anymore.

Modes

There are 8 modes in BC7, which all seem to be carefully chosen based on their use cases. Each mode makes some choices like:

  • How many partitions? (1 – 3)
  • Is there alpha? (Mode [0, 3] don’t have, [4, 7] have).
  • How many bits per endpoint component? (Color and alpha have separate bit depth)
  • How many bits per index?
  • Is alpha uncorrelated or correlated? (Mode 4 and 5 have two weight indices)
  • Which component is uncorrelated? (For mode 4 and 5)

This is all tabulated by the specification, and from there the algorithms are similar for all the modes. Other than this, there might be some leftover bits, and I believe the BC7 designers just invented some use for them. In some modes, the bits can be used as a shared LSB of the endpoints for example, or they can be used to select how many weight bits the color and alpha channels receive respectively in mode 4.
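One way to get a feel for the mode table is to add up the bits. The plain C sketch below transcribes the per-mode parameters as I understand them from the specification and checks the budget; treat the values as something to verify against the spec. The struct and field names are made up for this example. It assumes the mode is stored as a unary prefix of mode + 1 bits and that one weight bit is saved per partition.

```c
#include <stdint.h>

typedef struct {
    uint8_t partitions, partition_bits, rotation_bits, index_select_bits;
    uint8_t color_bits, alpha_bits, endpoint_pbits, shared_pbits;
    uint8_t index_bits, secondary_index_bits;
} Bc7Mode;

static const Bc7Mode bc7_modes[8] = {
    { 3, 4, 0, 0, 4, 0, 1, 0, 3, 0 },
    { 2, 6, 0, 0, 6, 0, 0, 1, 3, 0 },
    { 3, 6, 0, 0, 5, 0, 0, 0, 2, 0 },
    { 2, 6, 0, 0, 7, 0, 1, 0, 2, 0 },
    { 1, 0, 2, 1, 5, 6, 0, 0, 2, 3 },
    { 1, 0, 2, 0, 7, 8, 0, 0, 2, 2 },
    { 1, 0, 0, 0, 7, 7, 1, 0, 4, 0 },
    { 2, 6, 0, 0, 5, 5, 1, 0, 2, 0 },
};

/* Adds up every field of a mode; each mode should fill the block exactly. */
unsigned bc7_mode_total_bits(unsigned mode)
{
    mode &= 7u;
    const Bc7Mode *m = &bc7_modes[mode];
    unsigned endpoints = 2u * m->partitions;
    unsigned bits = mode + 1u; /* unary mode prefix */
    bits += m->partition_bits + m->rotation_bits + m->index_select_bits;
    bits += endpoints * (3u * m->color_bits + m->alpha_bits);
    bits += endpoints * m->endpoint_pbits + m->partitions * m->shared_pbits;
    bits += 16u * m->index_bits - m->partitions; /* anchors save one bit each */
    if (m->secondary_index_bits)
        bits += 16u * m->secondary_index_bits - 1u;
    return bits;
}
```

Every mode adds up to exactly 128 bits, which fits the impression that leftover bits were always given some job.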

Summary

BPTC is a fairly advanced texture compression format. BC6 and BC7 share many ideas, as expected, and introduce a large, but not unmanageable, configuration space. From an implementation point of view though, the heavy reliance on look-up tables is annoying, but understandable.

BPTC is focused on perfecting color texture encoding and leaves the other kind of encoding to RGTC, which I think does a great job with 1 and 2 channel textures already.

Up next?

In the third (probably not final) part we tackle the monster that is ASTC. Complexity will jump another 10x.