Compressed GPU texture formats – a review and compute shader decoders – part 2

This is the second part of the blog series I started in part 1. We have covered the S3TC, RGTC and ETC family of formats. This served as a good introduction to the topic of texture compression, but from here, the complexity will explode.

BPTC

This post will be dedicated to the BPTC compressed formats. These formats represent BC6 and BC7 formats in Vulkan, and is the state-of-the-art in texture compression on desktop GPUs, completing RGB and RGBA compression. BC1 and BC3 were the only proper desktop alternatives before these formats came along, and the quality of BC1 and BC3 aren’t … great.

BPTC also adds support for HDR. This is a big deal as before BPTC it was not possible to properly compress HDR data.

I pondered a bit over which format I should present first, but I think it’s appropriate to start with BC6 since it is much simpler than BC7 in terms of overall complexity. I feel like when BC6 was designed it sacrificed complexity from BC7 in order to add HDR.

Common ideas

Partitions

As we saw in ETC2, there was a very crude and early attempt to add partitions into the block format. The T and H formats in particular let you specify two different color endpoint pairs which could be selected at will using 1 bit per texel. Rather than letting the partition be dictated by the 2×4 / 4×2 sub-block, you could select any partition scheme.

Specifying an entire bit per texel to select a partition is quite overkill however, especially for a format which just gives you 64 bits for color. This makes the T and H modes crude hacks on top of ETC1. BPTC instead adds the concept of a pre-made table of partition shapes, and support for more than 2 partitions. Rather than specifying the partition index for each texel, we specify say, 64 common shapes, and just spend up to 6 bits per block (6 / 16 bits per texel) to encode the information. Of course, this idea falls flat if the exact partition we need doesn’t exist, but BPTC’s partitions seem well thought out. This is just a pain to implement since it means copy-pasting over a lot of tables ._.

Variable bit depth for endpoints and weights

Earlier formats were extremely rigid in how the blocks were to be encoded. You get X number of bits to specify endpoints and 2 bits per texel for the weights, a neat 32/32 bit split between the control block and texel data. Overall, this is a good setup for 64 bit block formats like BC1 and ETC, but it falls a bit short for 128 bit block formats like BPTC. Now we have more freedom and headroom to either spend lots of bits on endpoint accuracy, more weight bits per texel, or perhaps more partitions — which requires a lot more bits to encode multiple endpoints. Essentially, everything becomes a trade-off. More partitions can deal better with uncorrelated texels, more endpoint precision can deal with smooth gradients and more weight bits improves general PSNR when dynamic range in the block is large.

How all of this is configured is done through …

Mode hell

As we also saw with ETC2, there were multiple “modes” a block could enter into, depending on the encoded bits. We have the “differential”, “individual”, “T”, “H” and “Planar” modes. This already adds a lot of complexity to an encoder — which needs to figure out which mode is best through heuristics and trial and error — and decoder — which has to implement everything.

BPTC makes the mode idea more explicit and a certain number of bits are reserved to express the “mode” a block will be in.

Exploiting endpoint symmetry

While earlier formats used endpoint symmetry as a way to enable different modes (e.g. BC1), BPTC has already encoded the mode explicitly. Instead, we exploit symmetry to save one weight bit. We can simply assume that the MSB weight bit of the first texel is 0. We can always flip the order of the endpoints to make this work. Actually, we can save one bit per partition, since one texel per partition can be assumed to have MSB weight bit of 0. Which texels to treat specially is all done through another look-up table (sigh …) defined by the specification. Unfortunately, this means bitfield extraction from the 128-bit payload becomes very awkward and irregular … For this purpose I had to make a bitfield helper function. I doubt it’s very efficient, but you gotta do what you gotta do …

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/bitextract.h

Normalized weight un-quantization

Up until now there hasn’t been any good way of interpolating between endpoints in any format. Interpolation in S3TC and RGTC involves a non-POT divider, which is awkward to make bit-exact, and ETC technically does not interpolate between endpoints, as there is an offset table (which is basically the same thing in practice, but it doesn’t concern itself with a divider). While ETC has bit-exact decoding, S3TC and RGTC do not, as the interpolation is defined in terms of floating point.

With the number of weight bits being variable in BPTC, it makes sense to normalize the weights to a range which is easier to work with, and can easily support bit-exact decoding. Values in the range of [0, 64] were chosen. BPTC can use either 2, 3 or 4-bit weights, and the specification defines a table which normalizes the weights onto the [0, 64] range. Interpolation can then be easily done in fixed point as:

interpolated = (a * (64 - weight) + b * weight + 32) >> 6;

This avoids the non-POT divider headache we’ve seen earlier. I tried finding a neat arithmetic expression to re-normalize the weights without a LUT, but I gave up pretty quickly.

No planar mode?

One thing I found rather puzzling was the lack of a planar mode in BPTC. ETC2 has it and seems kinda useful …

BC6 – 4×4 – 128 bits

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/bc6.comp

BC6 is laser focused on compressing HDR data (FP16). There is only RGB support, no RGBA at all.

Representing floating point without floating point

In the other formats we have looked at, we have used normalized integers as a way to represent endpoints and colors. Interpolation of integer endpoints is very easy since it’s just fixed point math. When we start introducing floating point into the mix, there is a question of how we represent this efficiently. As we have seen, we cannot spend too many bits on representing endpoints, even with 8-bit color. With FP16, representing endpoints in full 16-bits per channel is extremely wasteful. As with 8-bit color, we need to have a simple and efficient solution for quantization of FP16 values with an arbitrary amount of bits.

A hypothetical solution for FP16 could be achieved by storing the exponent (5 bits) and just quantize the mantissa accordingly. BC6 isn’t far from this idea, although, it is much simpler than that. BC6 exploits the fact that floating point representations of finite numbers is monotonic for positive numbers.

For example, if we consider that the internal representation of endpoints is 16-bit unsigned, we interpolate endpoints as 16-bit integers, and perform a scaling operation and bitcast to FP16:

fp16_bits = (interpolated_value * 31) / 64;
fp16_texel = bitcast<fp16>(fp16_bits);

What this effectively does is to map 0 to 0.0 in FP16 and 0xffff to 0x7bff, which is 65504.0 and the largest finite representable value in FP16 (0x7c00 is +Inf). Everything in-between is monotonically increasing. The interpolation itself ends up being non-linear (closer to logarithmic), but for HDR data this should be fine. Linear light values are perceptually logarithmic anyways, so it might actually make more sense to interpolate in a logarithmic space rather than linear here, a very nice hack indeed! Of course, this is just a crude approximation as the FP16 representation is only piece-wise logarithmic.

For the signed format (BC6H_SFLOAT), we do a very similar fix-up step, but it is a bit more involved since we need to take care of the sign bit. The interpolated value is assumed to be a signed integer in [-0x8000, 0x7fff] range.

signed = interpolated_value < 0;
fp16_bits = (abs(interpolated_value) * 31) / 32;
fp16_bits |= signed ? 0x8000 : 0;
fp16_texel = bitcast<fp16>(fp16_bits);

A recognized bug in the specification (or feature depending on the situation) is that -0x8000 will be translated to -Inf (0x7c00 | sign_bit). It is not possible to represent +Inf.

Transformed endpoints

A thing BC6 can do is to assume correlation between two endpoints. This is similar to the “differential” mode in ETC2, but BC6’s variant of it is far more flexible. Rather than giving us X number of bits per component, we encode the first endpoint with more bits, and then the other endpoint as an offset from the first, with fewer bits. This combines very well with multiple partitions, since the other partition’s endpoints are also encoded as differentials from the base endpoint.

Mode “hell”

BC6 has a lot of modes to choose from, 14 to be specific. However, that is only at first glance. I like to separate these into 2 major types:

  • 2 partition modes
  • 1 partition modes

After selecting how many partitions you have, most of the modes just let you specify how many bits are spent on the base endpoint color, and how many bits are spent for each transformed endpoint (delta bits). Essentially, modes 0, 1, 2, 6, 10, 14, 18, 22, 26 are all the same, with different bit-allocation schemes. Mode 30 appears to be slightly different on first glance since it does not have transformed endpoints, but that’s just because delta bits == endpoint bits at this point, so it’s meaningless to transform the endpoints. The story is the same for the modes where number of partitions is 1.

Based on the mode, we either get 2 bits per texel of weights or 3 bits.

The major issue in decoding BC6 is that the bit layout for each mode is completely nonsensical and irregular. The bits are packed in seemingly random places, so unfortunately the decoder ends up with:

if ((mode & 2) == 0)
{
    if ((mode & 1) != 0)
        interp = decode_bc6_mode1(payload, linear_pixel, part, anchor_pixel);
    else
        interp = decode_bc6_mode0(payload, linear_pixel, part, anchor_pixel);
}
else
{
    switch (mode)
    {
    case 2:
        interp = decode_bc6_mode2(payload, linear_pixel, part, anchor_pixel);
        break;
    case 3:
        interp = decode_bc6_mode3(payload, linear_pixel);
        break;
    case 6:
        interp = decode_bc6_mode6(payload, linear_pixel, part, anchor_pixel);
        break;
    case 7:
        interp = decode_bc6_mode7(payload, linear_pixel);
        break;
    case 10:
        interp = decode_bc6_mode10(payload, linear_pixel, part, anchor_pixel);
        break;
    case 11:
        interp = decode_bc6_mode11(payload, linear_pixel);
        break;
    case 14:
        interp = decode_bc6_mode14(payload, linear_pixel, part, anchor_pixel);
        break;
    case 15:
        interp = decode_bc6_mode15(payload, linear_pixel);
        break;
    case 18:
        interp = decode_bc6_mode18(payload, linear_pixel, part, anchor_pixel);
        break;
    case 22:
        interp = decode_bc6_mode22(payload, linear_pixel, part, anchor_pixel);
        break;
    case 26:
        interp = decode_bc6_mode26(payload, linear_pixel, part, anchor_pixel);
        break;
    case 30:
        interp = decode_bc6_mode30(payload, linear_pixel, part, anchor_pixel);
        break;
    default:
        interp = DecodedInterpolation(ivec3(0), ivec3(0), 0);
        break;
    }
}

Not pretty 🙁 Most of the code in bc6.comp is used to decode all the weird variants.

Endpoint quantization

Once we have the endpoints, we un-quantize them to full 16-bit integer range by a simple shift. At this point, we can interpolate the endpoints using the normalized weights, and apply the BC6 fixup to turn it into a final FP16 value.

ivec3 unquantize_endpoint(ivec3 ep, int bits)
{
    ivec3 unq;
    // Specialization constant
    if (SIGNED)
    {
        // Sign-extend
        ep = bitfieldExtract(ep, 0, bits);
        if (bits < 16)
        {
            ivec3 sgn = 1 - ((ep >> 30) & 2); // 1 or -1
            ivec3 abs_ep = abs(ep);
            unq = ((abs_ep << 15) + 0x4000) >> (bits - 1);
            // Special cases. Boolean mix FTW.
            unq = mix(unq, ivec3(0), equal(ep, ivec3(0)));
            unq = mix(unq, ivec3(0x7fff), greaterThanEqual(abs_ep, ivec3((1 << (bits - 1)) - 1)));
            unq *= sgn;
        }
        else
            unq = ep;
    }
    else
    {
        // Zero-extend
        ep = ivec3(bitfieldExtract(uvec3(ep), 0, bits));
        if (bits < 15)
        {
            unq = ((ep << 15) + 0x4000) >> (bits - 1);
            // Special-cases.
            unq = mix(unq, ivec3(0), equal(ep, ivec3(0)));
            unq = mix(unq, ivec3(0xffff), equal(ep, ivec3((1 << bits) - 1)));
        }
        else
            unq = ep;
    }
    return unq;
}

Summary

BC6 introduces a lot of new concepts that BC7 will make use of as well. The main complication is all the different modes and the awkward bit packing layouts. After extracting the quantized endpoints, the process from there is quite straight forward.

BC7 – 4×4 – 128 bits

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/bc7.comp

BC7 is similar to BC6 in many ways, and uses many of the same ideas.

8-bit endpoints

Just like BC6 endpoints are un-quantized to 16 bits before interpolation, BC7 un-quantizes to 8-bit integers. This is done through the typical “bit-replication” algorithm that we often see when converting RGB444 and RGB565. The weight interpolation is done the exact same way as BC6 with a [0, 64] range of weights.

Support for 3 partitions

BC6 only supports 2 partitions, but BC7 can support 3. There are separate tables for 3 partition modes compared to 2 partition modes. (Fun …)

Flexible alpha channel

Unlike BC6, BC7 fully supports alpha, and finally we have a way to encode RGB and A together. This is a big improvement over the older formats like S3TC and ETC2 which encode alpha completely separately in different block formats. When we encode RGB and A together, we can allocate bits between them as we please, e.g.:

  • No alpha. If alpha is constant 1.0 inside a block, we can use all 128 bits for RGB.
  • Correlated RGB and A. If these components are correlated, we can encode endpoints as RGBA with one weight per texel. This saves a lot of weight bits.
  • Uncorrelated RGB and A. This is the case essentially assumed by BC2, BC3 and ETC2. However, BC7 will let us tune a little bit how many bits we spend on color and how many bits we spend on alpha.

Uncorrelated color channel support

Since we can encode RGB and A as uncorrelated endpoints, what if we could freely select which component is uncorrelated? BC7 can do this, and this is expressed through the rotation bits. After decoding, we can swap A out with any other component. That way we can technically encode (R, GB), (RB, G) or (GB, R) for example. This seems rather niche, but it’s there … I suspect the designers happened to have 2 extra bits lying around they could use for this purpose.

No transformed endpoints

In a somewhat curious design choice, there is no support for transformed (correlated) endpoints in BC7. I suspect that for this reason, the number of modes could be kept somewhat sensible. I also suspect the need for transformed endpoints isn’t as great since we’re not working with HDR values anymore.

Modes

There are 8 modes in BC7, which all seem to be carefully chosen based on their use cases. Each mode makes some choices like:

  • How many partitions? (1 – 3)
  • Is there alpha? (Mode [0, 3] don’t have, [4, 7] have).
  • How many bits per endpoint component? (Color and alpha have separate bit depth)
  • How many bits per index?
  • Is alpha uncorrelated or correlated? (Mode 4 and 5 have two weight indices)
  • Which component is uncorrelated? (For mode 4 and 5)

This is all tabulated by the specification, and from there the algorithms are similar for all the modes. Other than this, there might be some leftover bits, and I believe the BC7 designers just invented some use for them. In some modes, the bits can be used as a shared LSB of the endpoints for example, or it can be used to select how many weight bits color and alpha channels receive respectively in mode 4.

Summary

BPTC is a fairly advanced texture compression format. BC6 and BC7 share many ideas as expected and introduces a large, but not unmanageable configuration space. From an implementation point-of-view though, the large reliance on look-up tables is annoying, but understandable.

BPTC is focused on perfecting color texture encoding and leaves the other kind of encoding to RGTC, which I think does a great job with 1 and 2 channel textures already.

Up next?

In the third (probably not final) part we tackle the monster that is ASTC. Complexity will jump another 10x.

Compressed GPU texture formats – a review and compute shader decoders – part 1

Compressed texture formats is one of the esoteric aspects of graphics programming almost no one cares all that much about. Neither did I, however, I’ve recently taken an academic interest in the zoo of compressed texture formats.

During development in Granite, I occasionally find it useful to test scenes which target mobile on desktop and vice versa, and in Vulkan, where there are no fallback paths for unsupported compression formats, we gotta roll our own decompression.

While it really isn’t all that useful to write a decoder for these formats, my goal is to create a suite of reasonably understandable compute shader kernels which can decode all of the standard formats I care about. Of course, I could just use a Frankenstein decoder which merges together a lot of C reference decoders and call it a day, but that’s not aesthetically pleasing or interesting to me. By implementing these formats straight from the Khronos Data Format specification, I learned a lot of things I would not otherwise know about these formats.

There are several major families of formats we can consider multi-vendor and standardized. Each of them fill their own niche. Unfortunately, desktop and mobile each have their own timelines with different texture compression standards, which is not fully resolved to this day in GPU hardware. (Basis Universal is something I will need to study eventually as well as it aims to solve this problem in software.)

By implementing all these formats, I got to see the evolution of block compression formats, see the major differences and design decisions that went into each format.

The major format families

First, it is useful to summarize all the families of texture compression I’ve looked at.

S3TC / DXT

The simplest family of formats. These formats are also known as the “BC” formats in Vulkan, or rather, BC 1, 2 and 3. This is the granddad of texture compression, similar to how I view MPEG1 in the video compression world.

These formats are firmly rooted in desktop GPUs. They are basically non-existent on mobile GPUs, probably for historical patent reasons.

RGTC

A very close relative of S3TC. These formats are very simple formats which specialize in encoding 1 and 2 uncorrelated channels, perfect for normal maps, metallic-roughness maps, etc. It is somewhat questionable to call these a separate family of formats (the Data Format specification separates them), since the basic format is basically exactly equal to the alpha format of S3TC, except that it extends the format to also support SNORM (-1, 1 range) alongside UNORM. These formats represent BC4 and BC5 in Vulkan.

These formats are firmly rooted in desktop GPUs. They are basically non-existent on mobile GPUs.

ETC

The ETC family of formats is very similarly laid out to S3TC in how different texture types are supported, but the implementation detail is quite different (and ETC2 is quite the interesting format). To support encoding full depth alpha and 1/2-component textures, there is the EAC format, which mirrors the RGTC formats.

These formats are firmly rooted in mobile GPUs. ETC1 was originally the only mandated format for OpenGLES 2.0 implementations, and ETC2 was mandated for OpenGLES 3.0 GPUs. It has almost no support on desktop GPUs. Intel iGPU is an exception here.

BPTC

This is where complexity starts to explode and where things get interesting. BC6 and BC7 are designed to compress high quality color images at 8bpp. BC6 adds support for HDR, which is to this day, one of only two ways to compress HDR images.

On desktop, BPTC is the state of the art in texture compression and was introduced around 2010.

ASTC

ASTC is the final boss of texture compression, and is the current state of the art in texture compression. Its complexity is staggering and aims to dominate the world with 128 bits. Mere mortals are not supposed to understand this format.

ASTC’s roots are on Mali GPUs, but it was always a Khronos standard, and is widely supported now on mobile Vulkan implementation (and Intel iGPU :3), at least the LDR profile. What you say, profiles in a texture compression format? Yes … yes, this is just the beginning of the madness that is ASTC.

PVRTC?

PVRTC is a PowerVR-exclusive format that has had some staying power due to iOS and I will likely ignore it in this series. However, it seems like a very different kind of format to all the others and studying it might be interesting. However, there is zero reason to use this format in Granite, and I don’t want to chew over too much.

What is a texture compression format anyway?

In a texture compression format, the specification describes a process for taking random bits given to it, and how to decode the bit-soup into texels. There are fundamental constraints in texture compression which is unique to this problem domain, and these restrictions heavily influence the design of the formats in question.

Fixed block size

To be able to randomly access any texel in a texture, there must be an O(1) mapping from texture coordinate to memory address. The only reasonable way to do this is to have a fixed block size. In all formats, 4×4 is the most common one. (As you can guess, ASTC can do odd-ball block sizes like 6×5).

Similarly, for reasons of random access, the number of bits spent per block must be constant. The typical block sizes are 64-bits and 128-bits, which is 4bpp and 8bpp respectively at 4×4 block size.

Image and video compression has none of these restrictions. That is a major reason why image and video compression is so much more efficient.

A set of coding tools

Each format has certain things it can do. The more complex the operations the format can do, the more expensive the decoding hardware becomes (and complex a software decoder becomes), so there’s always a challenge to balance complexity with quality per bit when standardizing a format. The most typical way to add coding tools is to be able to select between different modes of operation based on the content of the block, where each mode is suited to certain patterns of input. Use the right tool for the job! As we will see in this study, the number of coding tools will increase exponentially, and it starts to become impossible to make good use of all the tools given to you by the format.

Encoding becomes an optimization task where we aim to figure out the best coding tools to use among the ones given to us. In simpler formats, there are very few things to try, and approaching the optimal solution becomes straight forward, but as we get into the more esoteric formats, the real challenge is to prune dead ends early, since brute forcing our way through a near-infinite configuration space is not practical (but maybe it is with GPU encode? :3)

Commonalities across formats

Image compression and video compression uses the Discrete Cosine Transform (DCT) even to this day. This fundamental compression technique has been with us since the 80s and refuses to die. All the new compression formats just keep piling on complexity on top of more complexity, but in the center of it all, we find the DCT, or some equivalent of it.

Very similarly, texture compression achieves its compression through interpolation between two color values. Somehow, the formats all let us specify two endpoints which are constant over the 4×4 block and interpolation weights, which vary per pixel. Most of the innovation in the formats all comes down to how complicated and esoteric we can make the process of generating the endpoints and weights.

The weight values are typically expressed with very few bits of precision per texel (usually 2 or 3), and this is the main way we will keep bits spent per pixel down. This snippet is the core coding tool in all the formats I have studied:

decoded_texel = mix(endpoint0, endpoint1, weight_between_0_and_1);

To correlate, or not to correlate?

The endpoint model blends all components in lock-step. Typically the endpoint will be an RGB value. We call this correlated, because this interpolation will only work well if chrominance remains fairly constant with luminance being the only component which varies significantly. In uncorrelated input, say, RGB with an alpha mask, many formats let us express decorrelated inputs with two sets of weights.

decoded_rgb = mix(endpoint0_rgb, endpoint1_rgb, rgb_weight);
decoded_alpha = mix(endpoint0_alpha, endpoint1_alpha, alpha_weight);

This costs a lot more bits to encode since alpha_weight is very different from rgb_weight, but it should be worth it.

Many formats let us express if there is correlation or not. Correlation should always be exploited.

Working around the horrible endpoint interpolation artifacts

Almost all formats beyond the most trivial ones try really hard to come up with ways to work around the fact that endpoint interpolation leads to horrible results in all but the simplest input. The most common approach here is to split the block into partitions, where each partition has its own endpoints.

S3TC – The basics

A compute shader decoder:

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/s3tc.comp

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/rgtc.h

BC1 – 4×4 – 64 bits

The BC1 format is extremely simple and a good starting point. 32 bits is used to encode two RGB endpoints in RGB565 format. The other 32 bits encode 16 weights, with 2 bits allocated to each texel.

This lets us represent interpolation weights of 0, 1/3, 2/3 and 1.

Since there is a symmetry in this design, i.e.:

mix(a, b, l) == mix(b, a, 1.0 - l)

there would be two ways to specify the same block, where we swap endpoints and invert the weights to compensate. This is an extra bit of information we can exploit. Based on the integer representation of the two endpoints, we can check if one of greater than the other, and use a different decoding mode based on that information. This exploitation of symmetry will pop up again in many formats later! In the secondary mode, we add support for 1-bit alpha, called a punch-through in most formats. In this mode, the interpolation weights become 0, 1/2, 1 and BLACK. This lets us represent fully transparent pixels. However, color becomes BLACK, so this will only work with pre-multiplied alpha schemes, otherwise there will be black rings around textures. I don’t think this mode is used all that much these days, but it is an option.

That is it for this format, it really is that simple.

One thing to note is that the specification is defined in terms of floating point with under-specified requirements for precision, and thus there is no bit-exact representation of the decoded values. Almost all hardware decoders of this format will give slightly different results, which is unfortunate. MPEG1 and MPEG2 also made the same mistake back in the day, where the DCT is specified in terms of floating point.

BC2 – 4×4 – 128 bits

BC2 is a format which adds alpha support by splicing together two blocks. A BC1 block describes color, and a second block adds an alpha plane with 4-bit UNORM. This format is quite obscure since the next format, BC3, generally does a much better job at compressing alpha. A curious side effect of BC2 and 3 is that the punch-through mode in the BC1 block no longer exists, i.e. the symmetry exists, so we lose 1 bit of information. I wonder why that information bit was not used to toggle between BC2 alpha decode (noisy alpha) and BC3 alpha decode (smooth alpha) …

BC3 – 4×4 – 128 bits

The alpha encoding of BC3 is very similar to how BC1 works. It also forms the basis of RGTC. The 64 bits of the alpha block spends 16 bits to encode 2 8-bit endpoints for alpha. We now get 3 bits as interpolation weights instead of 2 since there’s 48 bits left for this purpose. Similar to BC1, we can exploit symmetry and get two different modes based on the endpoints. In the first mode, the interpolation weights are as expected, [0, 1, 2, …, 7] / 7. In the second mode, we have 6 interpolated values [0, 1, 2, …, 5] / 5, and two values are reserved to represent 0.0 and 1.0. This can be very useful. This is essentially a very early attempt to introduce partitions into the mix, as we can essentially split up a block into 3 partitions on demand: (Fully opaque texels, fully transparent texels, the in-betweens). This can let us specify a tighter range for the endpoints as there is never a need to use a full [0, 0xff] endpoint range.

Summary

The S3TC formats are very simple, but there are certainly things to note about them. Alpha support is just bolted on, it is not integrated into the format. This means that even though the block is 128 bits, there is no way to spend more than 64 bits on color, even if the alpha plane has a completely flat value.

RGTC

RGTC (red-green) is basically BC3’s alpha block format turned into its own thing. Their main use is with non-color textures, e.g. normal maps, metallic-roughness maps, luminance maps, etc. It is very simple, and quality is quite good.

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/rgtc.comp

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/rgtc.h

BC4 – 4×4 – 64 bits

This is BC3’s alpha block format as its own format, which returns one component. The only real difference from BC3 alpha is that it also supports an SNORM variant, which is very useful for normal maps, although I only bothered with UNORM, since my shaders need to assume input can be from any format.

BC5 – 4×4 – 128 bits

RGTC assumes uncorrelated channels, and thus the only sensible choice was to just slap together two BC4 blocks side-by-side, and voila, we can encode 2 channels instead of 1.

Summary

RGTC is simple and nice. It only needs to consider single channels of data, and writing encoders for it is very easy, and it is probably the simplest format out there. For what they do, I really like these formats.

Like S3TC, there is no bit-exact decoding, which is rather unfortunate.

ETC – Refining S3TC

ETC, or Ericsson Texture Compression is a family with multiple generations. ETC2 is backwards compatible with ETC1 in that all valid ETC1 blocks will decode the same way in ETC2, but ETC2 exploits some undefined behavior of ETC1 to extend the format into something more interesting.

ETC has a bit-exact decode, which makes verification very easy. 😀

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/etc2.comp

ETC1 – 4×4 – 64 bits

In many ways, ETC1 is quite similar to BC1, but there are some key differences. Just like BC1, 32 bits are spent to encode endpoints, and 32 bits are spent to give 16 texels 2 bits of weight information each. The main difference between ETC1 and BC1 is how endpoints work.

Sub-blocks

As a very crude form of partitioning, ETC1 allows you to split a block into either 2×4 or 4×2 sub-blocks, where each sub-block has its own endpoints. To do this, endpoints are expressed in a more compact way than BC1. Rather than specifying two RGB values, ETC in general likes to express endpoints as RGB +/- delta-intensity, where delta-intensity is described by a table. This makes things far more compact since we enforce constant chrominance. By saving so many bits, we can express 4 endpoints in total, 2 for each sub-block.

Uncorrelated or correlated endpoints?

Since we have to specify two sets of endpoints, the format gives us a way to specify if the two endpoints have completely different colors, or if the endpoints should be specified in base + offset form. This is controlled with a single bit, which changes the encoding from ep0 = RGB444, ep1 = RGB444 to ep0 = RGB555, ep1 = ep0 + sign_extend(RGB333). These values are not allowed to overflow in any way, which is something ETC2 exploits to great effect later.

NOTE: I found it more instructional to call it uncorrelated and correlated endpoint modes, but the specification calls it “individual” and “differential” modes.

Summary

ETC1 is somewhat different than BC1, but overall, it’s quite similar. It turns out, if you add a few restrictions on top of ETC1, you get ETC1S, which can trivially be transcoded to BC1. Basically, enforce the correlated endpoint mode, set the RGB333 bits to 0, enforce that delta-luma is the same for both sub-blocks.

There is also no way to express alpha with ETC1, which is unfortunate, but ETC1 is completely obsolete for my use cases either way.

ETC2 RGB – 4×4 – 64 bits

As mentioned earlier, ETC1 is a sub-set of ETC2. ETC2 adds a bunch of new and curious modes which gives us a small glimpse into more flexible ways to express endpoints, and even adds a mode which I have never seen in any other formats ever since.

Exploiting undefined behavior

When you select the correlated (differential) endpoint mode, there were some restrictions on overflow. We can exploit this fact in order to gain 3 new modes of operation for the ETC2 color codec!

First, we check if R + sign_extend(dR) is outside [0, 31] range. If so, we activate the so-called “T” mode. In this mode, we essentially add a partitioning scheme to the codec. We now remove the concept of two sub-blocks and let all texels access all available endpoints. We encode two RGB444 values (A, B), and a delta value (d). We form a T-shape by specifying 4 possible color values as A, B, B + d, B – d. This can be useful if the block is smooth, except for some weird outliers. A would be the outlier color, and B represents the middle of the smooth colors.

If G + sign_extend(dG) overflows, we enter a very similar “H” mode. In this mode, we do the exact same thing, except that the 4 possible colors become A + d, A – d, B + d, B – d.

If B + sign_extend(dB) overflows, we enter a very interesting mode, which I have never seen again in future formats. I’m not sure why, since it seems very useful for expressing smooth gradients. Essentially, in this mode we don’t encode weights per texel, but rather express RGB at texel (0, 0), at texel (4, 0) and texel(0, 4), and just bilinearly interpolate across the block to obtain the actual color. This is very different from the other endpoint interpolation we’ve seen earlier, because that flattens everything into a single line in the color space, but now we can access an entire 2D plane in the space instead.

Punch-through alpha

Like BC1’s 1-bit alpha scheme, ETC2 is very similar. When this format is enabled, we remove the capability for uncorrelated endpoints (individual mode), and replace the bit with a selection to select if all texels are Opaque, or potentially transparent. This idea is the exact same as BC1. In the transparent mode, code == 2 marks the texel as being transparent black. It does not work in planar mode though, this bit is ignored there.

Alpha support

Very similar to BC3, ETC2 also supports full 8-bit alpha by slapping together a separate block alongside the color block. The way this works is very similar to how RGTC works, but instead of two endpoints, ETC2 encodes a center point, and then uses tables to expand that range into 8 possible values using a table selector and a multiplier. These 8 possible values for the block are then selected with 3bpp indices. We lose the capability to cleanly represent 0.0 and 1.0 though, which is somewhat curious.

EAC – 4×4 – 64 bits

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/eac.comp

EAC is ETC2’s version of RGTC, it is designed as a way to encode 1 and 2 de-correlated channels, basically the exact same approach as RGTC where the alpha block format is reused for the 1/2-channel formats. EAC is a bit different in that the internal precision is 11 bits (for some reason).

Unfortunately, EAC is kinda awkward since it’s technically bit-exact in the final fixed point value between [0, 2047], but it specifies many different ways how this can be converted to a floating point value.

Next up, BPTC

S3TC, RGTC and ETC represents the simpler formats, and hopefully I’ve summarized what these formats can do, next up, I’ll go through the BPTC formats, which significantly increases complexity.