As I previously discussed in my PyroFling post about real-time video streaming, one remaining challenge was error correction. With UDP, packet loss is inevitable, so there are two approaches to reduce streaming jank:
- Error masking – hallucinate missed frames
- Forward error correction (FEC) – add redundancy to avoid dropped packets
Re-sending packets is a waste of time in a low-latency environment like this, so we can ignore that. If re-sending were acceptable, I’d just use TCP and forget about all of this anyway.
With intra-refresh, error masking is half decent, so I wanted to focus on FEC. Error correction is its own field of study, but I didn’t have the time to actually study the field. State-of-the-art error correction is extremely advanced, complex to implement, and IP-encumbered (*ahem*, RaptorQ, *ahem*), so I evaluated some less recent approaches instead.
Understanding the data
The choice of FEC mechanism needs to take the input data into consideration.
N bytes split into 1024-byte sub-packets
A video packet is a variable number of bytes every frame, which I split into sub-packets of 1024 bytes (+ header) each. A successful transmission only happens when all N bytes are received. Sending partially valid data to the video decoder is likely to result in horrible things happening, so if even one sub-packet is dropped, I have to drop the full frame.
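As a concrete sketch, the splitting might look something like this. The header layout is hypothetical (field names and sizes are my own assumptions, not PyroFling’s actual wire format); the point is just that each sub-packet carries enough metadata to reassemble the frame and detect losses:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr size_t SUB_PACKET_SIZE = 1024;

// Hypothetical header, for illustration only.
struct SubPacketHeader
{
    uint32_t frame_id;     // which video frame this belongs to
    uint16_t packet_index; // index within [0, num_packets)
    uint16_t num_packets;  // ceil(N / 1024) for this frame
};

static std::vector<std::vector<uint8_t>> split_frame(
    const uint8_t *data, size_t size, uint32_t frame_id)
{
    const size_t num_packets = (size + SUB_PACKET_SIZE - 1) / SUB_PACKET_SIZE;
    std::vector<std::vector<uint8_t>> packets;
    packets.reserve(num_packets);

    for (size_t i = 0; i < num_packets; i++)
    {
        const size_t offset = i * SUB_PACKET_SIZE;
        const size_t len = std::min(SUB_PACKET_SIZE, size - offset);

        SubPacketHeader header = { frame_id, uint16_t(i), uint16_t(num_packets) };
        std::vector<uint8_t> packet(sizeof(header) + len);
        memcpy(packet.data(), &header, sizeof(header));
        memcpy(packet.data() + sizeof(header), data + offset, len);
        packets.push_back(std::move(packet));
    }
    return packets;
}
```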
Some error correction schemes rely on fixed block lengths, which isn’t ideal for our variable-length input. The classic example everyone taking classes on the subject learns is the Hamming (7, 4) code, but that code is better suited for noisy analog channels where we don’t know whether any given bit was received correctly. What we really want is a method that takes our extra knowledge about packet loss into account.
Erasure channel
Sending UDP packets over the internet functions like an erasure channel: at the receiver, we know which packets were missed. Corrupt packets are dropped by the network (random bit-errors causing the CRC check to pass are theoretically possible, I suppose, but I don’t consider that).
Small block size
Since a packet is received all or nothing, we’re actually error correcting in a vectorized fashion. The message we’re error correcting is byte n across all sub-packets P in [0, ceil(N / 1024)), where N is the video packet size in bytes. The error correction algorithm performs the exact same operation for every byte n in a given packet.
Another way of looking at this is to consider every packet a single 8192-bit number, but that’s a very mathematical way of looking at it. Either way, given a typical 10 Mbit/s stream at 60 fps, we expect about 20 sub-packets per video frame (10 Mbit/s ÷ 60 fps ≈ 21 kB ≈ 20 × 1024 bytes). Some frames will be very small, and some will be larger.
Flexible redundancy ratios and block sizes
Some block codes like Reed-Solomon are well known and very powerful, but seemed a bit too rigid in their block structure. Their block lengths also seem better suited to bit-streams.
Being able to adapt how much FEC is used is quite useful in a dynamic system such as streaming. A Compact Disc (which uses Reed-Solomon) has to bake in a fixed amount of error correction, but with streaming, a feedback channel can let us dynamically adjust the amount of error correction used as needed. I quickly rejected these codes.
The most basic FEC – YOLO XOR
I don’t think this code has a formal name, but understanding it is the foundation for the upcoming section. Given N packets, just take the XOR of all the packets and send that as one FEC block. If there is 1 packet loss, we can recover it by taking the XOR of all received packets and the FEC block. For small video frames with a small number of packets, this is the method I actually went with.
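A minimal sketch of the idea, assuming all blocks are already padded to the same length (helper names are mine, not PyroFling’s). Note how the work is purely byte-wise XOR, which is the vectorized view from earlier:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Block = std::vector<uint8_t>; // all blocks padded to the same size

// Encoder: the single FEC block is the XOR of all data blocks.
static Block make_xor_fec(const std::vector<Block> &blocks)
{
    Block fec(blocks[0].size(), 0);
    for (const Block &b : blocks)
        for (size_t i = 0; i < fec.size(); i++)
            fec[i] ^= b[i];
    return fec;
}

// Decoder: if exactly one data block was lost, XOR-ing the FEC block with
// every block that did arrive reproduces the missing one, since x ^ x == 0.
static Block recover_single_loss(const std::vector<Block> &received, const Block &fec)
{
    Block missing = fec;
    for (const Block &b : received)
        for (size_t i = 0; i < missing.size(); i++)
            missing[i] ^= b[i];
    return missing;
}
```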
The downside of course is that there is no obvious way to recover more than 1 packet. I found that spurious packet loss could have 2 or 3 drops in some cases, especially in very large video frames that span up to 100 sub-packets, so this approach was too naive for me.
Fountain codes
While looking around, I ran into a very clever scheme called a fountain code, in particular, the Luby Transform. There is a nice YouTube video explaining it. I also dug up an old textbook from my university studies, whose Chapter 50 covers this with more mathematical rigor.
This method has some nice properties that are well suited for network transmission:
- Designed for erasure channels (e.g. IP networks)
- Flexible FEC ratios
- Receiver can complete decode after receiving enough data packets (with some major caveats)
A fountain code is so called because the encoder can spit out an arbitrary number of packets. There is no fixed block structure, and the process is pseudo-random. As long as the encoder and decoder agree on a seed for the process, very little side-channel data needs to be communicated.
The algorithm is essentially YOLO XOR, with a lot of statistical tweaks. First, we consider the degree d of a packet, which is the number of blocks we take the XOR of when generating a packet. A degree of 1 is the base case where we send a block as-is, and a degree of K = ceil(N / 1024), the total number of blocks, is taking the XOR of everything (i.e. the YOLO XOR case).
The packets chosen for XOR-ing are randomized. On the receiver end, we look at all received packets, and if we find a packet with degree d where d − 1 of the blocks it covers have already been recovered, we can recover the last one through … more XOR-ing. Recovering a new block may cause other packets to reach this condition, and the cycle continues. To kick-start this process, some packets with degree = 1 must be transmitted.
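A sketch of that receiver-side “peeling” loop. I’m using a uint64_t bitmask to track which data blocks each received packet still depends on (so this assumes K ≤ 64); names are illustrative:

```cpp
#include <bit>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ReceivedPacket
{
    uint64_t mask;                // bit i set if data block i was XOR-ed in
    std::vector<uint8_t> payload; // XOR of those blocks
};

// Peel until no packet is left with exactly one unresolved block.
// recovered[i] holds data block i once known, and is empty otherwise.
static void peel(std::vector<ReceivedPacket> &packets,
                 std::vector<std::vector<uint8_t>> &recovered)
{
    bool progress = true;
    while (progress)
    {
        progress = false;
        for (auto &p : packets)
        {
            // Subtract every already-recovered block from this packet.
            for (size_t i = 0; i < recovered.size(); i++)
            {
                if ((p.mask & (1ull << i)) != 0 && !recovered[i].empty())
                {
                    for (size_t j = 0; j < p.payload.size(); j++)
                        p.payload[j] ^= recovered[i][j];
                    p.mask &= ~(1ull << i);
                }
            }

            // Effective degree 1: the one remaining block is now known.
            if (p.mask != 0 && (p.mask & (p.mask - 1)) == 0)
            {
                size_t index = size_t(std::countr_zero(p.mask));
                if (recovered[index].empty())
                {
                    recovered[index] = p.payload;
                    progress = true; // may unlock other packets
                }
                p.mask = 0; // consumed
            }
        }
    }
}
```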
To make this work well, the literature describes a very particular distribution for d (the robust soliton distribution) to minimize the expected redundancy. I implemented all of this, but found some unfortunate practical problems. It is (very) possible I had bugs, of course, but debugging completely random processes is not very fun, and it’s not like I had a reference result to compare against.
Non-deterministic number of packets needed to complete decode
Given the completely random process, there is no bound on how many packets must be encoded (and received) before a decode can actually complete, even with no packet loss. Studying the literature, the examples I found seemed to assume a very large number of blocks K, e.g. K = 10000, and as K increases, the variance of the redundancy ratio decreases. For my case of K = 20, the algorithm seemed to collapse. Occasionally, I needed 2–3× redundancy to complete the decode, which is obviously unacceptable.
Painful statistical modelling
The statistical distribution for the degree factor d depends on the number of blocks to send, K, and that value changes every frame. K can be arbitrarily large, so precomputing LUTs got awkward.
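For reference, the baseline distribution in the literature is the ideal soliton distribution: ρ(1) = 1/K and ρ(d) = 1/(d(d − 1)) for d = 2..K. The robust variant used in practice adds extra mass at low degrees plus a spike near K/R, which introduces yet more parameters to tune. A sketch of the ideal version, which would have to be recomputed for every frame’s K:

```cpp
#include <cstddef>
#include <vector>

// Ideal soliton distribution over degrees 1..K.
// rho[0] is unused so that rho[d] is the probability of picking degree d.
// The terms telescope: 1/K + sum of 1 / (d * (d - 1)) for d in 2..K == 1.
static std::vector<double> ideal_soliton(size_t K)
{
    std::vector<double> rho(K + 1, 0.0);
    rho[1] = 1.0 / double(K);
    for (size_t d = 2; d <= K; d++)
        rho[d] = 1.0 / double(d * (d - 1));
    return rho;
}
```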
Smol brain LT code
Following the enlightened example of grug brain, I massively simplified the LT code down to something that maybe isn’t as theoretically good, but it worked very well in practice for my particular needs, where packet loss ratios are fairly low and K is low.
This basically boils down to a heavily “rigged” LT code, but otherwise the encoder and decoder do not really change.
Send all packets as-is first (d = 1)
This is an obvious thing to do. If there is no packet loss, we guarantee that we can start decoding immediately (good for latency). There is no randomness in this process.
Fixed degree factor
I found that a fixed d factor of K / 2 worked well. For odd K, alternate d between ceil(K / 2) and floor(K / 2). For large K, clamping the factor d to something reasonable like 64 worked well too.
Mirror selection of blocks
For every pair of FEC blocks, we ensure that the sets of data blocks selected for XOR-ing are complements of each other. This guarantees that by receiving a pair of FEC blocks, we will always be able to correct one missed packet. With 50% probability, we can recover two packet losses. The odd/even split above is designed to make sure that an odd/even pair always covers all K blocks. As the number of FEC blocks increases, we’ll be able to recover more losses (with lower and lower probability).
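Putting the fixed degree and the mirroring together, generating one pair of FEC blocks might look like this (a sketch under my own naming; the seed is whatever the encoder and decoder agreed on up front):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

constexpr size_t BLOCK_SIZE = 1024;
using Block = std::vector<uint8_t>; // always BLOCK_SIZE bytes

static void xor_into(Block &dst, const Block &src)
{
    for (size_t i = 0; i < BLOCK_SIZE; i++)
        dst[i] ^= src[i];
}

// Returns { fec_even, fec_odd }, whose source sets partition [0, K):
// the even block XORs a random ceil(K / 2) subset, the odd block XORs
// the complement, so together the pair always covers all K data blocks.
static std::pair<Block, Block> make_mirrored_pair(
    const std::vector<Block> &data, uint64_t seed)
{
    const size_t K = data.size();
    const size_t d = (K + 1) / 2; // ceil(K / 2); complement gets floor(K / 2)

    // Shuffle indices with a seed shared by encoder and decoder.
    std::vector<size_t> order(K);
    for (size_t i = 0; i < K; i++)
        order[i] = i;
    std::mt19937_64 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);

    Block even(BLOCK_SIZE, 0), odd(BLOCK_SIZE, 0);
    for (size_t i = 0; i < K; i++)
        xor_into(i < d ? even : odd, data[order[i]]);
    return { even, odd };
}
```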
Results
Given a fixed number of data packets, and a given number of randomly lost packets, we can observe how well this FEC recovers data. The recovery rate for 1 lost packet is 100% by design, so that’s not interesting. The XOR degree factor is 20.
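Since payload contents don’t affect recoverability, numbers like these can be reproduced with a mask-only Monte Carlo simulation. Here’s a sketch of how I’d measure it (assumes K ≤ 64, and mirrors the pair construction above):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Fraction of trials where all K data blocks are recoverable after
// dropping 'losses' random packets from K data + 2 * fec_pairs FEC packets.
static double recovery_rate(size_t K, size_t fec_pairs, size_t losses, size_t trials)
{
    std::mt19937_64 rng(1234);
    const uint64_t all = (K >= 64) ? ~0ull : ((1ull << K) - 1);
    size_t successes = 0;

    for (size_t t = 0; t < trials; t++)
    {
        std::vector<uint64_t> masks;
        for (size_t i = 0; i < K; i++)
            masks.push_back(1ull << i); // data packets sent as-is, d = 1

        for (size_t p = 0; p < fec_pairs; p++)
        {
            // Mirrored pair: a random ceil(K / 2) subset and its complement.
            std::vector<size_t> order(K);
            for (size_t i = 0; i < K; i++)
                order[i] = i;
            std::shuffle(order.begin(), order.end(), rng);
            uint64_t even = 0;
            for (size_t i = 0; i < (K + 1) / 2; i++)
                even |= 1ull << order[i];
            masks.push_back(even);
            masks.push_back(~even & all);
        }

        // Drop 'losses' packets uniformly at random.
        std::shuffle(masks.begin(), masks.end(), rng);
        masks.resize(masks.size() - losses);

        // Peel on coverage masks alone.
        uint64_t known = 0;
        bool progress = true;
        while (progress)
        {
            progress = false;
            for (uint64_t m : masks)
            {
                uint64_t rem = m & ~known;
                if (rem != 0 && (rem & (rem - 1)) == 0)
                {
                    known |= rem;
                    progress = true;
                }
            }
        }
        if (known == all)
            successes++;
    }
    return double(successes) / double(trials);
}
```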
With larger data blocks, and more FEC blocks to match the redundancy ratio, the recovery ratio improves, but beyond 4 losses, the code starts collapsing. That’s fine; I haven’t had too many issues with burst losses like these. The XOR degree factor is 40. There’s some interesting stair-stepping here, which might be caused by the mirroring. This suggests we get the most bang for our bandwidth by using an odd number of FEC blocks.
It’s possible to tweak things, however. Using a smaller degree is good when using a lot of FEC packets and more errors are expected. Maybe it’s possible to use a blend of high-degree and low-degree packets as well (basically the entire point of LT), but that kind of tweaking can be left for another time. PyroFling’s expectation is a low number of losses per video frame, and simplicity beats theoretical performance.
If we add e.g. a 0.5% random, uncorrelated packet loss ratio, a degree factor of d = K / 2 seems much better.
For larger data sets, the expected number of losses starts increasing, so a degree factor of d = 100 seems preferable to d = 200. For larger video frames, we’re far more likely to encounter at least one packet loss, which is why the loss ratio for 0 FEC packets approaches 100%.
Is this even good?
Compared to the state of the art, this is likely far from optimal, but it was good enough for my use case. Here’s the latest log from a 4-hour play session between Trondheim and Bergen:
2322932 complete, 54 dropped video, 9683 FEC recovered
~99.5% of dropped packets were avoided. This was at a 25% FEC redundancy rate. Every dropped video packet is disruptive and lasts many frames, so this improvement was transformative.
This isn’t an academic project, so I don’t really care about comparing against a million different FEC algorithms. 🙂
UDP pacing to reduce bursty drops
A common technique in UDP streaming is to not send all packets immediately, but to pace them out over an interval. Sending over the full frame interval would increase latency by quite a bit, but pacing the stream to a max instantaneous rate of e.g. 60 Mbit/s worked alright. The added latency is only a few ms for a ~10–15 Mbit/s stream, which is acceptable. Since I’m on Linux, and I was lazy, I rely on the kernel to do this automatically:
sudo tc qdisc add dev $iface root fq maxrate 60000000
Conclusion
Hopefully this demonstrates a simple FEC that is fairly accessible. The last piece of the PyroFling adventure will be to finally tackle Vulkan Video encode.