Part 1: Overview

5 minute read

The task has been set, so let’s do some research.

Every good API needs documentation. For VA-API, it lives at https://intel.github.io/libva/. As of writing this article, the published version is 2.19.0. From there we can also follow the links to the source code of the reference implementation.

I’m not the first person to attempt developing Rust bindings to this API: there’s Ferrovanadium, advertising a “Blatantly unsound API” among other humorous features. As for other resources, FFmpeg obviously supports VA-API acceleration, so we’ll also be able to study how that works in case we get stuck. Finally, there’s the official libva-utils providing further examples.

Introduction

Reading the documentation’s introduction, we can see the separate pieces of functionality that VA-API provides.

We also get a guarantee that all functions are inherently thread-safe (though we should be careful about what order they execute in), as long as they’re not called from signal handlers. Not sure how we can forbid that with Rust’s type system, but we’ll figure something out. Worst case we’ll just replicate the warning.

Finally, we get a quick C example with a few comments:

// Initialization
dpy = vaGetDisplayDRM(fd);
vaInitialize(dpy, ...);
// Create surfaces required for decoding and subsequent encoding
vaCreateSurfaces(dpy, VA_RT_FORMAT_YUV420, width, height, &surfaces[0], ...);
// Set up a queue for the surfaces shared between decode and encode threads
surface_queue = queue_create();
// Create decode_thread
pthread_create(&decode_thread, NULL, decode, ...);
// Create encode_thread
pthread_create(&encode_thread, NULL, encode, ...);
// Decode thread function
decode() {
  // Find the decode entrypoint for H.264
  vaQueryConfigEntrypoints(dpy, h264_profile, entrypoints, ...);
  // Create a config for H.264 decode
  vaCreateConfig(dpy, h264_profile, VAEntrypointVLD, ...);
  // Create a context for decode
  vaCreateContext(dpy, config, width, height, VA_PROGRESSIVE, surfaces,
    num_surfaces, &decode_context);
  // Decode frames in the bitstream
  for (;;) {
    // Parse one frame and decode
    vaBeginPicture(dpy, decode_context, surfaces[surface_index]);
    vaRenderPicture(dpy, decode_context, buf, ...);
    vaEndPicture(dpy, decode_context);
    // Poll the decoding status and enqueue the surface in display order after
    // decoding is complete
    vaQuerySurfaceStatus();
    enqueue(surface_queue, surface_index);
  }
}
// Encode thread function
encode() {
  // Find the encode entrypoint for HEVC
  vaQueryConfigEntrypoints(dpy, hevc_profile, entrypoints, ...);
  // Create a config for HEVC encode
  vaCreateConfig(dpy, hevc_profile, VAEntrypointEncSlice, ...);
  // Create a context for encode
  vaCreateContext(dpy, config, width, height, VA_PROGRESSIVE, surfaces,
    num_surfaces, &encode_context);
  // Encode frames produced by the decoder
  for (;;) {
    // Dequeue the surface enqueued by the decoder
    surface_index = dequeue(surface_queue);
    // Encode using this surface as the source
    vaBeginPicture(dpy, encode_context, surfaces[surface_index]);
    vaRenderPicture(dpy, encode_context, buf, ...);
    vaEndPicture(dpy, encode_context);
  }
}

This last part looks interesting, so let’s go through it step by step:

High-level operation

First off, we have to initialize the library. To do that, we need to acquire a so-called “display handle”:

dpy = vaGetDisplayDRM(fd);

In this case, we’re grabbing onto Linux’s Direct Rendering Manager device (the good DRM) through a file descriptor.

What's the Direct Rendering Manager?

The Direct Rendering Manager (DRM) is a subsystem of the Linux kernel responsible for interfacing with GPUs of modern video cards.

We all know that in Unix, everything is a file. And for everything to be a universally usable file, there should be a unified and standardised way to make use of that file. That is what the Direct Rendering Manager subsystem is for.

Pop open a terminal on a Linux machine:

$ ls /dev/dri
by-path  card1  renderD128

This specific computer has a single graphics card (built into the AMD APU), so there are two device files, plus a by-path directory that exposes the same two files as nodes on the PCIe device tree.

Each GPU gets a card[x] file and a renderD[x] file. The card file is for privileged access, meant for setting global state (like the current screen resolution), while the renderD[x] files (numbered from 128 up) are render nodes that regular user applications use to submit work to the device.
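Grabbing a render node is nothing exotic, by the way: it’s a plain open(2) call on the device file. A minimal sketch, assuming the node sits at /dev/dri/renderD128 (the number varies per machine):

#include <fcntl.h>   /* open, O_RDWR */
#include <stdio.h>   /* perror */

int open_render_node(void)
{
    /* Path assumed for illustration; check /dev/dri on your own machine. */
    int fd = open("/dev/dri/renderD128", O_RDWR);
    if (fd < 0)
        perror("open /dev/dri/renderD128");
    return fd;
}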

It is this latter one that we’ll eventually work with for our VA-API adventures.

vaInitialize(dpy, ...);

The next line actually tells VA-API that we are indeed going to do VA-API things with this display handle, so it better wake up.
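Put together with the parameters the snippet elides, the initialization boils down to something like this. A rough sketch, reusing the render-node fd from before and keeping error handling minimal:

#include <va/va.h>
#include <va/va_drm.h>

VADisplay dpy = vaGetDisplayDRM(fd);   /* fd: the render node we opened earlier */

int major, minor;
VAStatus status = vaInitialize(dpy, &major, &minor);  /* also reports the API version */
if (status != VA_STATUS_SUCCESS) {
    /* vaErrorStr(status) gives a human-readable error message */
}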

vaCreateSurfaces(dpy, VA_RT_FORMAT_YUV420, width, height, &surfaces[0], ...);

Next we’re going to create some surfaces that will store images in a specified “render target format” and have a set resolution. Surfaces can be thought of as special data buffers that (for all intents and purposes) live on the GPU and have the required metadata to be interpreted as an image.
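The real signature also takes the number of surfaces and an optional attribute list; creating a small pool of YUV 4:2:0 surfaces might look roughly like this (the resolution and pool size are made up for the example):

#define NUM_SURFACES 4

VASurfaceID surfaces[NUM_SURFACES];
vaCreateSurfaces(dpy,
                 VA_RT_FORMAT_YUV420,    /* render target format */
                 1920, 1080,             /* width and height */
                 surfaces, NUM_SURFACES,
                 NULL, 0);               /* no extra surface attributes */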

surface_queue = queue_create();
pthread_create(&decode_thread, NULL, decode, ...);
pthread_create(&encode_thread, NULL, encode, ...);

Then there’s some C-style threading stuff. We set up a queue to send messages between threads, then spawn two threads that will be executing the two functions defined below.

Decoding workflow

To decode images, the function has to go through a couple of setup steps:

vaQueryConfigEntrypoints(dpy, h264_profile, entrypoints, ...);

First it lists the supported “entrypoints” for a “profile”.

A profile is a specific codec (and codec profile, like H.264 Main), and an entrypoint is a processing pipeline that the hardware can implement for that profile. In this case we’re listing the supported pipelines for the H.264 profile.
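In practice that means allocating an array big enough for every possible entrypoint (vaMaxNumEntrypoints gives the upper bound), letting the driver fill it in, and scanning it for the one we want. A sketch, assuming the H.264 Main profile:

VAEntrypoint entrypoints[vaMaxNumEntrypoints(dpy)];  /* upper bound on the count */
int num_entrypoints = 0;
vaQueryConfigEntrypoints(dpy, VAProfileH264Main, entrypoints, &num_entrypoints);

int can_decode = 0;
for (int i = 0; i < num_entrypoints; i++)
    if (entrypoints[i] == VAEntrypointVLD)
        can_decode = 1;   /* the hardware offers a decode pipeline for this profile */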

vaCreateConfig(dpy, h264_profile, VAEntrypointVLD, ...);

Then it creates a “config” targeting the H.264 profile through the VLD entrypoint.

VLD stands for Variable Length Decoding, I imagine because that’s a good collective description of what these codecs do: decoding variable-length inputs into known-resolution outputs.
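Spelled out with the arguments the snippet elides, the call is short; we pass no extra config attributes here:

VAConfigID config;
vaCreateConfig(dpy, VAProfileH264Main, VAEntrypointVLD,
               NULL, 0,        /* no config attributes */
               &config);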

vaCreateContext(dpy, config, width, height, VA_PROGRESSIVE, surfaces,
    num_surfaces, &decode_context);

A config by itself is just an inert handle, vaguely pointing at one part of the processing tools exposed by the API. To be able to make use of it, we have to create a “context” from the config.

All further operations are done within this context, which lets us keep independent pipelines (like our decode and encode) separate without having to initialize the same display multiple times.
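With the elided parameters filled in, context creation looks roughly like this, reusing the surface pool from earlier (the resolution is again just an example):

VAContextID decode_context;
vaCreateContext(dpy, config,
                1920, 1080,              /* coded width and height */
                VA_PROGRESSIVE,          /* progressive (non-interlaced) content */
                surfaces, NUM_SURFACES,  /* surfaces the context may render into */
                &decode_context);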

Now we decode the given frames:

for (;;) {
    vaBeginPicture(dpy, decode_context, surfaces[surface_index]);
    vaRenderPicture(dpy, decode_context, buf, ...);
    vaEndPicture(dpy, decode_context);
    
    vaQuerySurfaceStatus();
    enqueue(surface_queue, surface_index);
}

If you’ve seen OpenGL, this might look familiar. We begin a picture, which marks a surface and starts collecting information. Then we provide the data we want it to process in the form of buffers, and finally we tell it we’re done, and it starts crunching in the background.

If we want to make sure that operations on a surface have completed, we can poll vaQuerySurfaceStatus.
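vaQuerySurfaceStatus is a non-blocking poll; if we’d rather just wait, there’s also vaSyncSurface, which blocks until everything queued against the surface has finished. Roughly:

/* Blocking: wait until all operations targeting the surface are done. */
vaSyncSurface(dpy, surfaces[surface_index]);

/* Non-blocking: peek at the current state instead. */
VASurfaceStatus surface_status;
vaQuerySurfaceStatus(dpy, surfaces[surface_index], &surface_status);
if (surface_status == VASurfaceReady) {
    /* safe to hand the surface over to the encoder */
}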

Encoding workflow

Encoding is done very similarly:

// Find the encode entrypoint for HEVC
// Every process we do in VA-API goes through the same rigmarole,
// just different entrypoints and buffers. 
vaQueryConfigEntrypoints(dpy, hevc_profile, entrypoints, ...);
// Create a config for HEVC encode
vaCreateConfig(dpy, hevc_profile, VAEntrypointEncSlice, ...);
// Create a context for encode
vaCreateContext(dpy, config, width, height, VA_PROGRESSIVE, surfaces,
    num_surfaces, &encode_context);

Same setup process as last time, but this time we use the HEVC (H.265) profile and the EncSlice entrypoint.
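Before creating the encode config, we can run the same entrypoint check as on the decode side, just with the HEVC profile and EncSlice. A sketch, assuming VAProfileHEVCMain:

VAEntrypoint entrypoints[vaMaxNumEntrypoints(dpy)];
int num_entrypoints = 0;
vaQueryConfigEntrypoints(dpy, VAProfileHEVCMain, entrypoints, &num_entrypoints);

VAConfigID enc_config;
for (int i = 0; i < num_entrypoints; i++) {
    if (entrypoints[i] == VAEntrypointEncSlice) {
        vaCreateConfig(dpy, VAProfileHEVCMain, VAEntrypointEncSlice,
                       NULL, 0, &enc_config);
        break;
    }
}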

And the actual frame processing is done the same way:

for (;;) {
    // Dequeue the surface enqueued by the decoder
    surface_index = dequeue(surface_queue);
    // Encode using this surface as the source
    vaBeginPicture(dpy, encode_context, surfaces[surface_index]);
    // Submit buffers
    vaRenderPicture(dpy, encode_context, buf, ...);
    vaEndPicture(dpy, encode_context);
}

Easy-peasy! Let’s get some FFI bindings going.