Part 0: The why


I should probably head this whole adventure with a little backstory. As part of a project grant, I was tasked at work with implementing low-latency streaming of video and the occasional still image. Though the main assignment was “streaming any raw data”, I vividly remembered how the current video and image streaming modules “worked”, and wanted to do something about them while I was at it.

The current state of affairs was a Python script that took in RTSP streams through OpenCV, individually re-encoded every frame as a JPEG, then sent that along for storage and processing. I’d been itching to flex the power of Rust for a couple of months, and this was exactly the kind of systems-level, low-latency, high-performance task where it would shine.

So I set off with a simple game plan for a proof-of-concept implementation:

  1. Minimise work in the ingest module. Just take the RTSP stream in as-is and send the individual H.264 (or whatever else) frames along. No need to replace Python here for now. There has to be a simple library that just passes the raw encoded frames through.
  2. Implement a fancy as heck display module in Rust that can handle every codec we will probably have to deal with.

[a week or two passes]

Two issues:

The state of low-level RTSP libraries in Python

Let’s get the lay of the land on PyPI. Searching for “rtsp” brings up the following options:

rtsp

What does the readme-

        /((((((\\\\
=======((((((((((\\\\\
     ((           \\\\\\\
     ( (*    _/      \\\\\\\
       \    /  \      \\\\\\________________
        |  |   |      </    __             ((\\\\
        o_|   /        ____/ / _______       \ \\\\    \\\\\\\
             |  ._    / __/ __(_-</ _ \       \ \\\\\\\\\\\\\\\\
             | /     /_/  \__/___/ .__/       /    \\\\\\\     \\
     .______/\/     /           /_/           /         \\\
    / __.____/    _/         ________(       /\
   / / / ________/`---------'         \     /  \_
  / /  \ \                             \   \ \_  \
 ( <    \ \                             >  /    \ \
  \/      \\_                          / /       > )
           \_|                        / /       / /
                                    _//       _//
                                   /_|       /_|

…that’s definitely one way to get my attention! This unicorn is Brackets Approved!

The next line though…

Convenience-wrapper around OpenCV-Python RTSP functions

Yup, it’s just OpenCV. It provides fully decoded images, with no accessible way to bypass that. Also, pulling in all of OpenCV just for reading a network stream sounds like the definition of overkill.

rtsp-curl

Convert rtsp.c to rtsp_curl.py

I’m sure this works just fine for the author, but I’d rather use something a bit more documented and tested.

zephyr-rtsp

Version 0.0.6

Solid start. The API example seems much better thought out, but…

import cv2
from zephyr import Client

if __name__ == "__main__":
  client = Client(url="rtsp://localhost:8554/test")

  while True:
    ret, frame = client.read()
    cv2.imshow('frame', frame)

    if cv2.waitKey(1) & 0xFF == ord("q"):
      client.release()
      break

It does its own decoding yet again, though at least it seems to only call into ffmpeg instead of commandeering all of OpenCV.

fast-rtsp

Python extension for fast opencv-

And that’s all I needed to hear.

It was at this point that I gave up. Had I gone just a couple of items further down the search results, I would have run across…

aiortsp

This is a very simple asyncio library for interacting with an RTSP server, with basic RTP/RTCP support.

The intended use case is to provide a pretty low level control of what happens at RTSP connection level, all in python/asyncio.

This library does not provide any decoding capability, it is up to the client to decide what to do with received RTP packets.

…the exact library I needed.

But that didn’t happen. I got discouraged by a sea of OpenCV, and so began the journey of rewriting everything from scratch.

Screw Python, let’s do everything in Rust

Searching for “rtsp” on crates.io turns up far fewer options. rtsp is a stub, gst-plugin-rtsp is for GStreamer, rtsp-types is just types and parsers, that sort of thing, and I’d rather not have to strip functions out of rave.

This is when, while scrolling the IETF Datatracker and dreading having to implement the protocol from scratch, a stray web search guided me to retina: a library written for a network video recorder, with pretty solid support for H.264, which is what I’ll have to deal with most of the time. Nearly perfect, and the few remaining issues can be worked around.

Game plan part one is finally coming together, so let’s get on with game plan part two.

Fancy as heck display module

We have data coming in; it’s time to process it. Searching for “h264” gives us…

OpenH264 and the game of Not Getting Sued

Let’s read the decoding example:

use openh264::decoder::Decoder;
use openh264::{nal_units, OpenH264API};

let h264_in = include_bytes!("../tests/data/multi_512x512.h264");
let api = OpenH264API::from_source();
let mut decoder = Decoder::new(api)?;

// Split H.264 into NAL units and decode each.
for packet in nal_units(h264_in) {
    // On the first few frames this may fail, so you should check the result
    // a few packets before giving up.
    let maybe_some_yuv = decoder.decode(packet);
}

Nice and reassuringly brief. There are some weird terms here like “NAL units” and “YUV”, but we’ll cross those bridges when we get there.

I’d like to be able to show the decoded image in a window, so I’ll use the ubiquitous winit to create a window, and softbuffer to present to it.

For testing, it would be handy to have a single frame of raw H.264 bitstream saved to a file, so let’s generate one.

Test frame

Let’s convert something appropriate, like this PAL video test pattern:

Image of a TV test signal. It has a main grid, colours, high and low contrast areas, everything you need to make sure you're perfectly tuned in
Zacabeb, CC0, via Wikimedia Commons

Originally I used the openh264 library to encode, but for the sake of brevity here I’ll just use ffmpeg with some flags that make it compatible with the openh264 decoder. All of these concepts will be explained in later posts.

$ ffmpeg -i PAL_test_pattern.png -pix_fmt yuv420p -profile:v baseline \
 -c:v libx264 PAL_test_pattern.h264

And just like that, we have a single frame in PAL_test_pattern.h264. Just to make sure, let’s verify real quick with ffplay:

The previous image being displayed through ffplay

Yup, looks right. With that sorted, it’s coding time!

Coding time!

Let’s start with the softbuffer example. First, we’ll set up a binary crate with the required dependencies.

[dependencies]
openh264 = "0.5.0"
softbuffer = "0.4.1"
winit = "0.29.15"

The example code spawns a window with winit, attaches a softbuffer Surface, then renders a basic test pattern on the window every time the compositor asks for it.

use std::num::NonZeroU32;
use std::rc::Rc;
use winit::event::{Event, WindowEvent};
use winit::event_loop::{ControlFlow, EventLoop};
use winit::window::WindowBuilder;

fn main() {
    let event_loop = EventLoop::new().unwrap();
    let window = Rc::new(WindowBuilder::new().build(&event_loop).unwrap());
    let context = softbuffer::Context::new(window.clone()).unwrap();
    let mut surface = softbuffer::Surface::new(&context, window.clone()).unwrap();

    event_loop.run(move |event, window_target| {
        window_target.set_control_flow(ControlFlow::Wait);

        match event {
            Event::WindowEvent {window_id, event: WindowEvent::RedrawRequested} if window_id == window.id() => {
                let (width, height) = {
                    let size = window.inner_size();
                    (size.width, size.height)
                };
                surface.resize(
                    NonZeroU32::new(width).unwrap(),
                    NonZeroU32::new(height).unwrap(),
                ).unwrap();

                let mut buffer = surface.buffer_mut().unwrap();
                for index in 0..(width * height) {
                    let y = index / width;
                    let x = index % width;
                    let red = x % 255;
                    let green = y % 255;
                    let blue = (x * y) % 255;

                    buffer[index as usize] = blue | (green << 8) | (red << 16);
                }

                buffer.present().unwrap();
            }
            Event::WindowEvent {
                event: WindowEvent::CloseRequested,
                window_id,
            } if window_id == window.id() => {
                window_target.exit();
            }
            _ => {}
        }
    }).unwrap();
}

Running that provides us with a nice pattern:

A window showing a tiling pattern of black transitioning to red on the X axis and green on the Y axis, with a blue stripey pattern overlaid

Continuing the classic programmer tradition of copy-pasting things until they work, let’s insert the openh264 decoding example at the top before we enter the event loop.

[...]
let mut surface = softbuffer::Surface::new(&context, window.clone()).unwrap();

// Pack the frame into the binary for simplicity
const H264_IN: &[u8] = include_bytes!("PAL_test_pattern.h264");

// Prepare the decoder
let api = OpenH264API::from_source(); // Why does this matter?
let mut decoder = Decoder::new(api).unwrap();

// Find the decoded frame
let mut frame = vec![];
for nal in nal_units(H264_IN) {
    let frame_maybe = decoder.decode(nal).unwrap();
    if let Some(decoded_frame) = frame_maybe {
        frame.resize((decoded_frame.width() * decoded_frame.height() * 3) as usize, 0);
        decoded_frame.write_rgb8(frame.as_mut());
        break;
    }
}

event_loop.run(move |event, window_target| {
[...]

And now we can replace the contents of the for loop drawing the test pattern with a loop that converts this pixel data into a format that softbuffer understands.

[...]
let mut buffer = surface.buffer_mut().unwrap();
for index in 0..(width * height) {
    let data_starting_index = index as usize * 3;
    buffer[index as usize] =
        frame[data_starting_index+2] as u32 | // Blue
        ((frame[data_starting_index+1] as u32) << 8) | // Green
        ((frame[data_starting_index] as u32) << 16); // Red
}

buffer.present().unwrap();
[...]

For now, we’ll force the window to be the correct size when creating it (PhysicalSize here comes from winit::dpi, so it needs its own use line).

[...]
let event_loop = EventLoop::new().unwrap();
let window = Rc::new(WindowBuilder::new()
    .with_inner_size(PhysicalSize::new(768, 576))
    .with_resizable(false)
    .build(&event_loop).unwrap());
let context = softbuffer::Context::new(window.clone()).unwrap();
[...]

And just like that…

A window created by winit showing the test pattern image

Let’s pipe the RTSP stream into this. Spawning the decoder into a thread that receives messages and making winit constantly request repaints… This is going to be easy as p-

thread 'main' panicked at src/main.rs:28:47:
called `Result::unwrap()` on an `Err` value: Error { native: 16, decoding_state: 0, misc: None }

Huh. What gives?

H.264 and its many flavours

Maybe I should stop trying to make things happen without understanding what they really entail. H.264 is a very old standard designed by lots of people and companies, and it turns out it’s not even really a single codec. There are lots of profiles that each add more and more advanced coding tools to the mix, and as if the burden of implementing all of that weren’t enough already, they are also covered by separate patents.

The biggest pro of OpenH264 is Cisco’s pledge: as long as you use the precompiled library they provide, they won’t pass the licensing costs on to you. This makes OpenH264 very popular, especially with companies, because it removes a lot of legal headaches from software development. However…

The biggest con of OpenH264 is that it doesn’t support many profiles. Actually, it only supports the lowest one, Constrained Baseline. Most of the devices we’d have to interface with would much rather run at Main or even High, so the bitstreams they put out over RTSP might as well be Greek to the decoder.
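
As an aside, the failure is at least easy to surface: decode() hands back a Result, so a small tweak of the earlier loop (same variables as before; this is just a sketch) prints what the decoder thought of each NAL unit instead of panicking on unwrap.

// Sketch: the decode loop from before, minus the unwrap, so an unsupported
// (e.g. Main or High profile) bitstream shows up as a readable error.
for nal in nal_units(H264_IN) {
    match decoder.decode(nal) {
        // A full picture came out; convert and present it as before.
        Ok(Some(decoded_frame)) => {
            frame.resize((decoded_frame.width() * decoded_frame.height() * 3) as usize, 0);
            decoded_frame.write_rgb8(frame.as_mut());
            break;
        }
        // The decoder accepted the NAL unit but has no picture for us yet.
        Ok(None) => continue,
        // This is where that Err value above comes from.
        Err(e) => eprintln!("decoder rejected this NAL unit: {e:?}"),
    }
}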

So what now?

We would definitely prefer not to have to pay MPEG-LA for using software like libav in our pipeline, but we’d also like to support as many of the codec’s features as possible.

However, there is a way to have our cake and eat it too: we’ve already paid MPEG-LA for a hardware license! A couple of cents from every processor and graphics card sale goes straight to the video codec patent holders, because all three major silicon vendors build video encoding and decoding pipelines into their products.

Intel has QuickSync, NVIDIA has VDPAU and NVENC/NVDEC, but the common denominator seems to be Intel’s open source VAAPI. Intel (obviously) supports it, NVIDIA hardware can be made to support it through a compatibility layer, and AMD also provides official implementations.

Now I’ll just have to learn to make use of it.

Help.