Compressed GPU texture formats – a review and compute shader decoders – part 3/3

This is a long overdue blog post. Last year, I began a blog series where I explored the weird and wonderful world of texture compression formats. Unfortunately, I didn't have the time or motivation to finish it back then, but it's time I got back to it and wrapped up the series.

Up until now, I’ve covered the major families: DXT (BC 1-5), ETC2 and BPTC (BC 6-7) in part 1 and part 2. Complexity has been steadily increasing, but now we’re finally going to tackle the end boss of texture compression, ASTC.

ASTC

ASTC, or Adaptive Scalable Texture Compression, is the result of a fever dream which attempts to answer the question of "how complex can we make a texture compression format and still trick IHVs into putting it in silicon". Complexity in codec design generally translates to better quality / bit at the cost of more expensive decode/encode. As hardware designs evolve, we can afford more complex codecs. Finding the sweet spot is incredibly hard, and ASTC cranked it to 11, for better or worse.

It began life on the Mali Midgard GPU series way back in the day, but has since been adopted across the mobile GPU ecosystem by all the relevant vendors in the Khronos sphere, at least for the core LDR profile.

Profiles you say, in my texture formats? What is this, H.264? Well … we’ll get to that 🙂

Notably, it’s also supported on the Nintendo Switch (Tegra supports Android after all), causing all sorts of fun issues for emulators … There’s actually multiple independent compute shader decoders of ASTC out there in the wild!

ASTC (released in 2012) is still considered the state of the art, and I don't think anyone else has made a serious attempt at introducing a new hardware accelerated texture compression format since. The state of the art in ASTC compression keeps moving as well, with recent improvements from Pete Harris on astc-encoder, the official reference encoder, which is to my knowledge the only encoder implementation of the ASTC HDR/3D profiles.

Desktop IHVs other than Intel have so far refused to expose support for ASTC, which is annoying, since it seems we’re forever stuck with two different worlds of texture compression, BC variant zoo on desktop and ASTC on mobile.

The compute shader decoder

Last year, around the time I wrote blogs on BPTC, I also implemented an ASTC LDR/HDR decoder that produced bit-exact results compared to the reference implementation and real hardware in my validation suite. Compared to the BC6 and 7 shaders, the shader complexity is ridiculous, and somehow it works. It’s integrated in Granite such that I can load ASTC compressed scenes transparently, which is useful for testing in a “because I can!” sense.

The shader is certainly not optimized for speed. I focused on readability and debuggability, since it was hard enough to implement this as-is. It’s more than fast enough for my purposes at least.

Enter the horror here. Here be dragons.

One format to bind them all

ASTC has a goal of better quality / bit than every existing format at the same time. This is not easy, considering that BC went the route of many specialized formats, each designed for a specific texture compression need, and you would expect each of them to excel at its own use case. As far as I know, ASTC succeeds here, given a sufficiently sophisticated encoder.

To supplant every format, you’d need to support these features at the very least:

  • R (BC4, EAC), RG (BC5, EAC), RGB (BC1, ETC2), RGBA (BC3, ETC2, BC7)
  • 4bpp (BC1, ETC2 RGB, BC4), 8bpp (BC3, BC5, ETC2 RGBA, BC6, BC7), and even lower (2bpp PVRTC?)
  • LDR vs HDR (BC6)
  • Decent multi-partition support
  • Decorrelated channel support

ASTC goes beyond these requirements in an attempt to future-proof the format. The bit-rate is fine-grained, from below 1 bpp up to 8 bpp in small steps. 2D is boring, so ASTC also supports 3D blocks, although adoption of this is non-existent outside Arm as far as I know. For SDF rendering, 3D block compression seems useful, in theory at least.

Codec concepts

The ASTC specification in the Khronos Data Format document is 35 pages (!). There’s a lot of ground to cover when implementing this format, but fortunately, the codec features are mostly orthogonal, i.e. the complexity is additive, not exponential, making it somewhat manageable.

Bit-rate scalability

The ASTC format, like every other format we've seen up until now, is block based, where each block consumes a fixed number of bits. ASTC chose 128 bits, and bit-rate scalability is achieved through different block sizes. No longer can we rely on nice 4×4 blocks :(. Instead, we have ridiculous things like 8×6 blocks or 5×5 blocks, all the way up to 12×12 blocks. Since every block is 128 bits, the block size directly sets the bit-rate: 4×4 is 8 bpp, 6×6 is about 3.6 bpp, 8×8 is 2 bpp, and 12×12 gets down to roughly 0.89 bpp. Very little actually changes depending on block size, we just get more weights to decode.

In practice from what I have seen, encoders tend to focus on specific block sizes like 4×4, 6×6 and 8×8. In an ideal world, we’d be able to change the block size dynamically within a texture (think H.264 macro blocks being split into smaller blocks adaptively), but we need random access to stay sane.
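To make the random access point concrete, here's the kind of addressing a decoder does. A trivial sketch, illustrative only and not lifted from Granite; block_size here stands in for the per-texture constant like 6×6:

// Locate the 128-bit block (one uvec4 in a buffer) holding a given texel.
// blocks_per_row = (texture_width + block_size.x - 1) / block_size.x.
uint block_index_for_texel(uvec2 texel, uvec2 block_size, uint blocks_per_row)
{
    uvec2 block_coord = texel / block_size;
    return block_coord.y * blocks_per_row + block_coord.x;
}

With variable block sizes inside one texture, this pure arithmetic lookup would no longer exist, which is why adaptive splitting à la video codecs is off the table.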

Good old color endpoint + weight architecture

At a fundamental level, ASTC does not change how texture compression works. There's still the concept of encoding color endpoints on a block level, and then per-texel weights are applied to interpolate between them. However, as we'll see, a ton of complexity goes into efficiently encoding these color endpoints and weights.

If you’re familiar with video encoding, there are some fun parallels to draw here, since H.264 and beyond have just piled on more and more complexity on top of the basic motion compensated DCT blocks since the 80s. Texture compression is very similar. Keep piling codec features on top of the same fundamental architecture forever, what could go wrong!

Bits are boring, trits and quints is where it’s at!

Up until now, color endpoints and weights have been encoded with plain bits, but especially for weights, it's very hard to achieve fine-grained bit-rate deltas that way. What if we could spend a fractional number of bits instead? ASTC achieves this through trinary and quintary (is this a word?) numbers, encoded into a binary representation.

Every encoded weight or endpoint is given N bits as LSBs, and optionally one trit or quint as the most significant part. A value with, say, two bits plus a trit covers 3 × 4 = 12 levels at an average cost of 2 + 8/5 = 3.6 bits.

The binary encoding works by grouping values together in blocks. For trits, we're aiming to encode 5 values together into a single number, which gives 3^5 = 243 combinations. Packed into 8 bits (256 values), that's within about 1% of the theoretical optimum of log2(243) ≈ 7.93 bits.

This is where we start running into sprinkles of incomprehensible code snippets in the spec which is a mix of C and Verilog designed to be fast in HDL, but unreadable nonsense in software. I have no idea what this code is trying to do, so to decode, I just build a LUT which converts 8 bits into 5 trits and call it a day. Quints are similar, where we encode 3 values together, which is 5^3 = 125 combinations, and fits snugly into 7 bits.

As an example, here’s what I ended up with in GLSL:

// Trit-decoding.
// quant.x = number of bits per value
int block = idiv5_floor(index); // Classic cute trick: (v * 0x3334) >> 16
int offset = index - block * 5;
start_bit += block * (5 * quant.x + 8);

int t0_t1_offset = start_bit + (quant.x * 1 + 0);
int t2_t3_offset = start_bit + (quant.x * 2 + 2);
int t4_offset = start_bit + (quant.x * 3 + 4);
int t5_t6_offset = start_bit + (quant.x * 4 + 5);
int t7_offset = start_bit + (quant.x * 5 + 7);

// ;__;
int t = (extract_bits(payload, t0_t1_offset, 2) << 0) |
    (extract_bits(payload, t2_t3_offset, 2) << 2) |
    (extract_bits(payload, t4_offset, 1) << 4) |
    (extract_bits(payload, t5_t6_offset, 2) << 5) |
    (extract_bits(payload, t7_offset, 1) << 7);

// LUT magic
t = int(texelFetch(LUTTritQuintDecode, t).x);
t = (t >> (3 * offset)) & 7;

int m_offset = offset * quant.x;
m_offset += idiv5_ceil(offset * 8); // (1) Explanation below

if (quant.x != 0)
{
    int m = extract_bits(payload, m_offset + start_bit, quant.x);
    ret = (t << quant.x) | m;
}

… and similar garbage code for quints.

Note for (1): The reason the T value is scattered around is a special feature where the bit stream can be terminated early if the block is only partially filled. Each trit costs at most 8/5 bits to encode, so after `count` values have been emitted, we know the trit portion must occupy ceil(8 * count / 5) bits of the binary encoding.
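The idiv5_floor and idiv5_ceil helpers used above aren't shown in the snippet; they're just integer division by 5 without an actual divide. A minimal version, assuming small non-negative inputs (my reconstruction of the intent, not necessarily the exact code in the shader):

// Integer divide-by-5 via the multiply-shift trick from the comment above.
// Exact for the small ranges of weight/endpoint indices we deal with here.
int idiv5_floor(int v)
{
    return (v * 0x3334) >> 16;
}

int idiv5_ceil(int v)
{
    // ceil(v / 5) == floor((v + 4) / 5)
    return ((v + 4) * 0x3334) >> 16;
}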

Complex weight un-quantization

For purposes of orthogonality, it's generally desirable that one part of the decoding process does not affect other parts of the decoding process. Weights are one such case. The color interpolator shouldn't have to care how many bits the weights were encoded with, and thus we decode weights to a fixed range. In our case, weights end up in the range [0, 64].

This is very similar to BPTC. In BPTC, the codec defines un-quantization LUTs for 2bpp, 3bpp and 4bpp like this:

const int weight_table2[4] = int[](0, 21, 43, 64);
const int weight_table3[8] = int[](0, 9, 18, 27, 37, 46, 55, 64);
const int weight_table4[16] = int[](0, 4, 9, 13, 17, 21, 26, 30, 34, 38, 43, 47, 51, 55, 60, 64);

Why not just bit-replication, you ask? Well, dividing by non-POT values after the weight scale is a PITA, that's why, and we cannot bit-replicate trits and quints anyway. ASTC un-quantizes in a more roundabout way, but uses the same idea as BPTC, where the weight range is normalized to [0, 64].

Again, the un-quantization step is gnarly enough that I just made a LUT for that as well, because why not:

int decode_weight(uvec4 payload, int weight_index, ivec4 quant)
{
    int primary_weight = decode_integer_sequence(payload, 0,
        weight_index, quant.xyz);
    // quant.w is offset into unquant LUT.
    primary_weight = int(texelFetch(LUTWeightUnquantize,
        primary_weight + quant.w).x);
    return primary_weight;
}

Color endpoints

Color endpoints are weird in that there are multiple phases to decoding them:

  • Decode N values of integer sequence; which N values to use depends on which partition the texel belongs to
  • Un-quantize N values to 8 bits in range [0, 0xff]
  • Interpret N 8-bit values in some magical way depending on the endpoint format (of which there are 16!)
  • Emit multiple RGBA values in UNORM8 (LDR) or UNORM12 (HDR)

The resulting decode function in the shader ends up looking something like this:

void decode_endpoint(out ivec4 ep0, out ivec4 ep1, out int decode_mode,
    uvec4 payload, int bit_offset, ivec4 quant, int ep_mode,
    int base_endpoint_index, int num_endpoint_bits)
{
    num_endpoint_bits += bit_offset;
    // Need to explicitly terminate bitstream for color end points.
    payload &= build_bitmask(num_endpoint_bits);

    // Could of course use an array,
    // but that doesn't lower nicely to indexed registers on all GPUs.
    int v0, v1, v2, v3, v4, v5, v6, v7;

    // End point mode is designed so that the top 2 bits encode
    // how many value pairs are required.
#define DECODE_EP(i) \
    int(texelFetch(LUTEndpointUnquantize, quant.w + \
        decode_integer_sequence(payload, bit_offset, i + \
        base_endpoint_index, quant.xyz)).x)

    int hi_bits = ep_mode >> 2;
    v0 = DECODE_EP(0); v1 = DECODE_EP(1);
    if (hi_bits >= 1) { v2 = DECODE_EP(2); v3 = DECODE_EP(3); }
    if (hi_bits >= 2) { v4 = DECODE_EP(4); v5 = DECODE_EP(5); }
    if (hi_bits >= 3) { v6 = DECODE_EP(6); v7 = DECODE_EP(7); }

    switch (ep_mode) { ... }
}

The 16 endpoint modes are defined in a table in the spec; the top two bits of the mode number encode how many value pairs are consumed, which is exactly what the decode loop above exploits.

Every endpoint mode has a code snippet in the spec explaining how to take the N values and turn them into RGBA values, which are then fed to the weight interpolator stage. As expected, we just need a horrible switch statement which handles every case. :<

Each partition can even select its own encoding format, which is pretty wild. Poor encoder that has to consider all these scenarios …

LDR endpoints

The endpoint decoding process starts off with trivial cases like luminance endpoints.

Mode 0 (direct)

e0 = (v0, v0, v0, 0xFF);
e1 = (v1, v1, v1, 0xFF);

Since we have an explicit decode step here, ASTC also allows us to take advantage of redundancy between the endpoint values, e.g.:

Mode 1 (base + offset)

L0 = (v0 >> 2) | (v1 & 0xC0);
L1 = L0 + (v1 & 0x3F);
if (L1 > 0xFF) { L1 = 0xFF; }
e0 = (L0, L0, L0, 0xFF);
e1 = (L1, L1, L1, 0xFF);

This encoding scheme improves precision when L0 and L1 are sufficiently close, i.e. correlated. BC7 doesn't really attempt to take advantage of correlation between endpoints, outside the minimal shared endpoint / subset bits. BC6 does, however, through its use of delta bits.

One question that immediately pops into my head is how this is supposed to work in practice for trits and quint encoding. Since bits are reinterpreted and shuffled around like this, any un-quantization that’s not just bit replication would probably create unexpected results in the MSBs.

Luminance + Alpha (Mode 4/5)

ASTC is a little awkward in how 2-component textures are encoded. The common case for 2-component textures is normal maps, which we normally encode as RG, a la BC5 or EAC. ASTC only supports luminance-alpha, so in order to encode these efficiently, we have to pre-swizzle the texture into (R, R, R, G) and apply an (R, W, x, x) swizzle in the VkImageView.

Similar to luminance-only, there is a direct mode, and a correlated mode.

RGB

Other highlights from the LDR formats include base + scale, where the two endpoints are encoded as RGB and RGB * A. That seems very nice for luminance gradients, and it's quite compact! We also have direct modes, with a special feature which borrows some ideas from YUV: blue contraction improves color accuracy close to gray. And of course, there's a base + offset mode like the one we already saw for the luma and luma + alpha formats.
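If memory serves, the base + scale decode is about as simple as it sounds, roughly something like this, with v3 acting as the scale (check the spec for the authoritative snippet):

e0 = (v0 * v3 >> 8, v1 * v3 >> 8, v2 * v3 >> 8, 0xFF);
e1 = (v0, v1, v2, 0xFF);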

RGBA

RGBA encoding is very similar to RGB, with two tacked on alpha values, nothing particularly interesting here.

HDR endpoint insanity

While the LDR encoding modes are pretty straightforward once you stare at them long enough, HDR is just incomprehensible. Where BC6 was fairly naive w.r.t. endpoints, ASTC is the complete opposite.

Similar to BC6, HDR is implemented in a way where we interpret a floating point format as UNORM. This means we're interpolating in a pseudo-logarithmic domain, i.e. the perceptual domain: stepping the FP16 bit pattern linearly roughly corresponds to multiplicative steps in the actual value.

While BC6 basically directly interpolates in U/SNORM16 with a direct conversion to FP16, ASTC is a little more … particular about it.

When decoding HDR endpoints, UNORM12 values are generated. It starts innocently enough in mode 2, where 8-bit inputs are just shifted into place:

int y0, y1;
if (v1 >= v0)
{
    y0 = v0 << 4;
    y1 = v1 << 4;
}
else
{
    // Oh hai thar BPTC shared bits +
    // BC1 symmetry exploitation
    y0 = (v1 << 4) + 8;
    y1 = (v0 << 4) - 8;
}

ep0 = ivec4(ivec3(y0), 0x780 /* 1.0 */);
ep1 = ivec4(ivec3(y1), 0x780 /* 1.0 */);

but eventually you just have to give up trying to reason about the spec once you hit the wall of code that defines the remaining HDR modes.

This is the point where you yell WTF out loud, cry a little inside, contemplate your life decisions and copy paste the spec.

Partition hell

Like BPTC, ASTC supports partitions. As we’ve seen before, this feature is designed to deal with sharp transitions in the block. BPTC made it pretty simple where it was possible to spend 6 bits to select one of the 64 partition layouts. This works fine for BPTC since it’s locked to 4×4 blocks and using LUTs makes sense.

ASTC does not have fixed block sizes, so what to do? One LUT for every possible combination? No. ASTC went the route of procedurally generated partition assignments using a 10-bit seed. This works for any block size and partition count, which "solves" that problem. Again, the spec has a long, incomprehensible code snippet defining the process.

Screw this. As usual, a look-up texture it is. ASTC can support up to 4 partitions, which is pretty wild. No idea how useful this is in practice, as we'll probably end up spending all our bits just encoding color endpoints at that rate …

This seems like a nightmare for encoders. Most of the resulting partitions are worthless garbage, and exhaustively testing over 1000 partition combinations per block does not seem very practical …

Mode hell

Speaking of many combinations, there’s a lot of different modes. BC7 has only 8 modes, and BC6H has 14. From what I’ve seen, even BPTC encoders just focus on 2 or 3 modes at most. ASTC has several thousand modes if we follow the same logic! 😀

Most of the mode bits are spent on encoding things like:

  • Quantization mode of weight grid
  • Resolution of weight grid [see below]
  • Decorrelated colors
  • Void-extent [see below]
  • Reserved [see below]

Once we have looked at the mode bits, we can compute how many bits are required to encode the weights, and from that we also know how many bits remain for the color endpoints. The color endpoint quantization level is thus implicit (so much fun …). There are also tons of edge cases to account for, including all the error cases we have to handle …
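To give a flavour of the bookkeeping involved: the number of bits an integer sequence consumes has a simple closed form, and the endpoint quantization level is essentially the largest one whose sequence still fits in whatever is left of the 128 bits. A rough counting helper, my own illustration rather than code from the spec or my shader:

// Bits consumed by an ISE-encoded sequence of 'count' values, where each
// value has 'bits' plain LSBs plus optionally a trit or quint on top.
// Trits pack 5-into-8-bits and quints 3-into-7-bits, and partial groups
// terminate early, hence the ceiling divisions.
int ise_sequence_bits(int count, int bits, bool trits, bool quints)
{
    int total = count * bits;
    if (trits)
        total += (8 * count + 4) / 5; // ceil(8 * count / 5)
    if (quints)
        total += (7 * count + 2) / 3; // ceil(7 * count / 3)
    return total;
}

As far as I understand it, the decoder then picks the highest quantization level whose sequence for the required number of endpoint values still fits in the bits left over after the weight grid, which is how the "implicit" endpoint quantization falls out.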

Weight grid interpolation

A special ASTC feature is that the weight grid isn't necessarily 1:1 with texels, unlike every other format we've seen so far. The weight grid can be much lower resolution than the block itself, and the per-texel weights are reconstructed with bilinear interpolation. This seems quite useful for scenarios where the block encodes a smooth gradient and the weights are highly predictable. Might as well spend more bits on encoding color endpoints instead. The spec outlines a specific fixed-point scheme that must be followed exactly, and finally the code snippets in the spec actually make some sense!
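For reference, the infill scheme boils down to roughly the following. This is paraphrased from the spec from memory, so treat the details with suspicion and consult the spec for the authoritative version; decoded_weights is a stand-in for however you fetch the already un-quantized [0, 64] weights:

// Stand-in storage for the un-quantized weights of the current block.
int decoded_weights[64];

// Bilinear infill of a grid_size.x * grid_size.y weight grid onto a texel
// at integer position 'texel' within a block_size.x * block_size.y block.
int infill_weight(ivec2 texel, ivec2 block_size, ivec2 grid_size)
{
    // Per-texel step, mapping texel coordinates into a 0..1024 range.
    int Ds = (1024 + block_size.x / 2) / (block_size.x - 1);
    int Dt = (1024 + block_size.y / 2) / (block_size.y - 1);

    // Position within the weight grid in 1.4 fixed point.
    int gs = (Ds * texel.x * (grid_size.x - 1) + 32) >> 6;
    int gt = (Dt * texel.y * (grid_size.y - 1) + 32) >> 6;
    int js = gs >> 4, fs = gs & 0xf;
    int jt = gt >> 4, ft = gt & 0xf;

    // Bilinear weights which always sum to 16.
    int w11 = (fs * ft + 8) >> 4;
    int w10 = ft - w11;
    int w01 = fs - w11;
    int w00 = 16 - fs - ft + w11;

    // Fetch the four neighbors. At the block edges some of these can be
    // out-of-range reads with zero weight, so real code has to tolerate that.
    int v0 = js + jt * grid_size.x;
    int p00 = decoded_weights[v0];
    int p01 = decoded_weights[v0 + 1];
    int p10 = decoded_weights[v0 + grid_size.x];
    int p11 = decoded_weights[v0 + grid_size.x + 1];

    return (p00 * w00 + p01 * w01 + p10 * w10 + p11 * w11 + 8) >> 4;
}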

This feature is a must-have for larger block sizes. There just isn’t room for a full weight grid when we start to hit big boy block sizes like 8×8. At this low bit-rate, the only tool we have to encode high frequency features is partitions …

Void extent

Encoding constant high-precision colors can actually be kind of awkward in most texture compression formats with limited endpoint precision, but ASTC has a special mode for it, where exact FP16 values can be encoded if desired. The feature even goes so far as to specify that a region outside the block also has the same color, which theoretically allows a texture filtering unit to short circuit. This doesn’t sound very practical, and I’m not sure if this feature is actually used.

Painful error handling

Another “feature” of ASTC is how errors are handled. Unlike other texture compression formats, there are many encodings which are illegal, and a correct decoder must detect all error scenarios and return the proper error color. This error color is either saturated magenta or NaN (ææææææææ!) depending on the implementation and whether or not HDR is supported.

Inconsistent decode formats

ASTC decoders will either decode to UNORM8 or FP16, depending on the profile supported by the GPU. This is kinda annoying, especially when we consider formats like SRGB8 where apparently the RGB portion is supposed to decode to UNORM8 viewed as SRGB, but alpha is still FP16? (._.) Not even the reference decoder seems to get that right, so I just decode alpha as UNORM8 in that case.

Interpolation and final encode

Finally, a breather. Once color endpoints and weights are fully decoded, we enter interpolation. It’s not all trivial though. That would be boring.

The first step is always to expand the UNORM8 or UNORM12 results into UNORM16. This expansion depends on things like the endpoint format as well as which format we're encoding to after interpolation. That final encode is the raw data that is fed into the texture filtering unit.

if (decode_mode == MODE_HDR)
{
    ep0 <<= 4; // Simple expansion 12 -> 16-bit
    ep1 <<= 4;
}
else if (decode_mode == MODE_HDR_LDR_ALPHA)
{
    ep0.rgb <<= 4;
    ep1.rgb <<= 4;
    ep0.a *= 0x101; // Bit replicate UNORM8 to UNORM16,
    ep1.a *= 0x101; // for FP16 conv later
}
else if (DECODE_8BIT)
{
    ep0 = (ep0 << 8) | ivec4(0x80); // Treat the data as 8.8 fixed point.
    ep1 = (ep1 << 8) | ivec4(0x80); // Also bake in 0.5 rounding factor.
}
else
{
    ep0 *= 0x101;
    ep1 *= 0x101;
}

// This is why weights have [0, 64] range and not [0, 63].
ivec4 color = (ep0 * (64 - weight) + ep1 * weight + 32) >> 6;

Now we have a complete UNORM16 color value that needs to be encoded.

Encode to UNORM8 / SRGB8

For 8-bit decode, we make use of the 8.8 fixed point expansion, and just shift the result down. Easy!

 imageStore(OutputImage, coord.xy, uvec4(final_color >> 8));

Encode LDR to FP16

This one is annoying. When interpolating, we did bit replication, and we need to convert the UNORM16 value to FP16. It’s not as simple as just dividing by 0xffff, of course. We explicitly need to treat 0xffff as 1.0, and for other values, we divide by 2^16 with round-to-zero semantics. We have to do this by hand of course. Soft-float is so much fun! <_<

// ASTC has a very peculiar way of converting the decoded result to FP16.
// 0xffff -> 1.0, and for everything else we get
// roundDownQuantizeFP16(vec4(c) / vec4(0x10000)).
ivec4 msb = findMSB(color);
ivec4 shamt = msb;
ivec4 m = ((color << 10) >> shamt) & 0x3ff;
ivec4 e = msb - 1;
uvec4 decoded = m | (e << 10);
uvec4 denorm_decode = color << 8;
decoded = mix(decoded, uvec4(denorm_decode), lessThan(e, ivec4(1)));
decoded = mix(decoded, uvec4(0x3c00), equal(color, ivec4(0xffff)));
return decoded;

Encode HDR to FP16

The intention is that the interpolation is logarithmic, but floats are just piecewise logarithmic. BC6 doesn’t care at all and just reinterprets the resulting UNORM16 interpolation as FP16, but ASTC tries to be a little smarter. The mantissa is tweaked a bit.

// Interpret the value as FP16, but with some extra fixups along the way to make the interpolation more
// logarithmic (apparently). From spec:
ivec4 e = color >> 11;
ivec4 m = color & 0x7ff;
ivec4 mt = 4 * m - 512;
mt = mix(mt, ivec4(3 * m), lessThan(m, ivec4(512)));
mt = mix(mt, ivec4(5 * m - 2048), greaterThanEqual(m, ivec4(1536)));

From what I can tell, there is no signed float support in ASTC, except for the void extent.

Decode format extensions

In Vulkan, there is a weird extension called VK_EXT_astc_decode_mode which allows a VkImageView to select which format an ASTC block should decode to. Why is this even a thing? One quirk of ASTC is that even UNORM blocks full of LDR data must be decoded to bit-exact FP16. This means that if a GPU architecture decodes compressed textures into L1 cache, we would end up consuming a lot more cache than strictly necessary. The whole point of the decode mode extension is to allow us to explicitly decode to a lower precision format to improve cache hit rates, presumably. (That makes me wonder what GPUs actually do in practice …)

In Granite, I setup the decode mode as:

// ...
if (format_is_srgb(create_info.format))
   return true;

if (format_is_compressed_hdr(create_info.format))
{
   auto &features = device->get_device_features();
   if (features.astc_decode_features.decodeModeSharedExponent)
      astc_info.decodeMode = VK_FORMAT_E5B9G9R9_UFLOAT_PACK32; // Nice
   else
      astc_info.decodeMode = VK_FORMAT_R16G16B16A16_SFLOAT;
}
else
{
   astc_info.decodeMode = VK_FORMAT_R8G8B8A8_UNORM;
}
// ...

Fortunately, there are now special SFLOAT format variants for ASTC in the VkFormat enum, which are tied to the ASTC HDR extension and can be used to signal to applications that a texture likely contains HDR blocks. The awkward part of ASTC is that even when using the UNORM format, you can still decode HDR blocks, and it just happens to work on GPUs which support ASTC HDR. Profiles are fun, right? ._. With decode mode, we can at least clamp down on these shenanigans: HDR blocks fail to decode in UNORM8 mode.

Conclusion

So far, this is the endgame of texture compression, and I really don’t want to look at this again. 🙂 It’s been a year since I looked into this last time, so I might have missed some details.

There’s tons of smart ideas in this format, but I also feel it’s too clever for its own good. Writing a decoder was difficult enough, and I don’t even want to think about how painful an encoder would be. A format is only as good as its encoding ecosystem after all.