GeoTIFFs to GPUs part 2: Barrels Out of Bytes - streaming Rust bits to the GPU

My interactions with low-level programming languages like C/C++ have always been superficial. Usually I just want to run them, and will dive into the CMake config file to get something to compile. But otherwise, I'll leave troubleshooting segfaults to colleagues who know what they're doing.

Things changed after I learned Rust.

Rust is a language for building Foundational Software. To me, it felt really empowering to be able to make use of computational resources more efficiently, while having safeguards in place to avoid runtime crashes. I started off in 2024 wanting to get closer to the metal with fearless concurrency for geospatial data processing at scale, and found that while there exist great tools for vector/tabular data, the raster/n-dimensional scene wasn't nearly as developed.

Along the way, I played with a library (or 'crate' as Rustaceans say) called burn that had Keras-like vibes (for those who did ML 'back in the day'). The fact that I could train a machine learning model on an integrated GPU via WebGPU was fun, though I knew of course that a CUDA backend was still better for real work. It got me thinking: what else am I missing to stream raster data, essentially graphical images, to GPU memory (the G being for Graphics!), in Rust?

Long story short, I think we're almost there. What I'm about to present below are all the pieces that wire a raster stored in the GeoTIFF file format to the bytes that will be resident in CUDA GPU memory.

Overview

At a fairly medium level, the steps are:

  1. Read the raw bytes (typically compressed) from the TIFF file into CPU/host memory
  2. Parse the TIFF tag metadata from the file header
  3. Allocate GPU memory based on the size and data type of the image (obtained from metadata)
  4. Decompress the raw image bytes (if needed)
  5. Decode the uncompressed image bytes into GPU memory

What I won't cover:

Then there are things that nvTIFF (v0.5.0) simply doesn't support yet:

Pre-requisites

You will need:

Things to install:

If you're on Debian Linux like me, and have NVIDIA repos enabled, do:

sudo apt install nvtiff-cuda-12 nvcomp-cuda-12

Step 0: Binding nvTIFF (C++) to Rust with bindgen

TLDR: Skip this section and just install nvtiff-sys if creating foreign function interfaces (FFI) isn't your thing!

In this section, we'll be using bindgen to create Rust bindings to the nvTIFF library. To be fair, the bindgen tutorial is probably where you should start looking first.

References:

Note: I've also tried autocxx and considered cxx, which are supposed to be easier, but I didn't quite get them to work.

Assuming that you've already started a Rust project (e.g. with cargo init), you'll want to add bindgen as a build dependency in your 'Cargo.toml' file via:

cargo add --build bindgen

Next, create a 'wrapper.h' file in the project root folder with this line:

#include <nvtiff.h>

This will point to the nvTIFF header file, which, at least on my system, is found at '/usr/lib/nvtiff.h' (a symlink to '/usr/include/libnvtiff/12/nvtiff.h').

Now comes the cool part, creating the 'build.rs' file, also in the project root folder. There are three main parts to this:

  1. Tell cargo to tell rustc to link the system nvTIFF shared library.
use std::env;
use std::path::PathBuf;

fn main() {
    println!("cargo:rustc-link-lib=nvtiff");

    ...
  2. Configure and generate the bindings with bindgen::Builder. We'll point to the 'wrapper.h' file created above, and also add an include path to where the 'cuda_runtime.h' file is located. There are some extra configuration options possible (e.g. different enum flavors) which I've omitted for brevity.

    Note that we're only allow-listing certain functions from nvTIFF's API related to decoding TIFFs. If you'd like to see more API methods wrapped, feel free to contribute to nvtiff-sys!

    let bindings = bindgen::Builder::default()
        // The input header we would like to generate bindings for.
        .header("wrapper.h")
        .clang_arg("-I/usr/local/cuda/include") // Path to cuda_runtime.h
        // Tell cargo to invalidate the built crate whenever any of the included header
        // files changed.
        .parse_callbacks(Box::new(bindgen::CargoCallbacks::new()))
        // Only allow certain functions to have Rust bindings generated for
        // https://docs.rs/bindgen/0.71.1/bindgen/struct.Builder.html#method.allowlist_function
        .allowlist_function("nvtiffDecodeImage")
        ...
        .allowlist_function("nvtiffStreamParse")
        ...
        // Finish the builder and generate the bindings.
        .generate()
        // Unwrap the Result and panic on failure.
        .expect("Unable to generate bindings");
  3. Finally, we can write the output bindings! I've set it to '$CARGO_MANIFEST_DIR/src/nvtiff.rs', where 'CARGO_MANIFEST_DIR' is an environment variable that corresponds to where the Cargo.toml file is (usually the project root dir).
    let out_path = PathBuf::from(env::var("CARGO_MANIFEST_DIR").unwrap());
    bindings
        .write_to_file(out_path.join("src/nvtiff.rs"))
        .expect("Couldn't write bindings!");
}

With that, you can run cargo build, and the bindings should appear under 'src/nvtiff.rs'. Here is the sample output for the nvtiffStreamParse function:

unsafe extern "C" {
    pub fn nvtiffStreamParse(
        buffer: *const u8,
        buffer_size: usize,
        tiff_stream: nvtiffStream_t,
    ) -> nvtiffStatus_t::Type;
}

I've created quite a few of these bindings already, and you can preview the code at https://docs.rs/nvtiff-sys/0.1.2/src/nvtiff_sys/nvtiff.rs.html 🤗.
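
To actually use the generated bindings from the rest of your crate, you would declare the module in 'src/lib.rs' and silence the style lints that machine-generated FFI code tends to trip. A minimal sketch, assuming the 'src/nvtiff.rs' output path configured above:

// src/lib.rs

// The bindgen output doesn't follow Rust naming conventions, so allow those lints
// for the generated module only.
#[allow(non_camel_case_types, non_snake_case, non_upper_case_globals)]
pub mod nvtiff;

// Optionally re-export everything at the crate root for convenience.
pub use nvtiff::*;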

Note that raw bindings to CUDA functions are typically unsafe. While unsafe Rust does sound scary, there are still some guardrails in place, and we can look into creating safe(r) abstractions over these in the future.
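
As a taste of what a safe(r) abstraction could look like, here's a hedged sketch of an RAII-style wrapper that creates the stream handle, checks the status code, and cleans up on Drop. It's purely illustrative, and assumes nvtiffStreamDestroy is also among the allow-listed functions in your bindings:

use nvtiff_sys::{nvtiffStatus_t, nvtiffStream, nvtiffStreamCreate, nvtiffStreamDestroy};

/// Illustrative owned wrapper around an nvTIFF CPU stream handle.
pub struct TiffStream {
    handle: *mut nvtiffStream,
}

impl TiffStream {
    pub fn new() -> Result<Self, u32> {
        let mut handle: *mut nvtiffStream = std::ptr::null_mut();
        // SAFETY: nvtiffStreamCreate writes a valid handle into `handle` on success.
        let status = unsafe { nvtiffStreamCreate(&mut handle) };
        if status == nvtiffStatus_t::NVTIFF_STATUS_SUCCESS {
            Ok(Self { handle })
        } else {
            Err(status)
        }
    }
}

impl Drop for TiffStream {
    fn drop(&mut self) {
        // SAFETY: the handle was created by nvtiffStreamCreate and is freed exactly once.
        let _ = unsafe { nvtiffStreamDestroy(self.handle) };
    }
}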

Step 1: Putting bytes onto a CPU stream

The next steps will make use of the foreign function interface (FFI) bindings we just made for nvTIFF. Essentially, the code below will be a Rust version of the nvTIFF Decode quickstart guide.

Note: The terms host (CPU) and device (GPU) will be used interchangeably.

Let's start on the CPU side of things. First, initialize a TIFF stream on the host using nvtiffStreamCreate:

use nvtiff_sys::{nvtiffStream, nvtiffStatus_t, nvtiffStreamCreate};

let mut host_stream = std::mem::MaybeUninit::uninit();
let tiff_stream: *mut *mut nvtiffStream = host_stream.as_mut_ptr();

let status_cpustream: u32 = unsafe { nvtiffStreamCreate(tiff_stream) };
assert_eq!(status_cpustream, nvtiffStatus_t::NVTIFF_STATUS_SUCCESS); // 0

The first two lines are how we get a mutable raw pointer. The output from nvtiffStreamCreate is an unsigned integer status code, and we check that it is 0 (NVTIFF_STATUS_SUCCESS) before continuing (you will see this pattern used a lot below).
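
Since this check-the-status-code dance repeats for every nvTIFF call, a tiny helper that converts the status into a Result can keep things tidy. This isn't part of nvtiff-sys, just an illustrative convenience sketch:

use nvtiff_sys::nvtiffStatus_t;

/// Illustrative helper: turn an nvTIFF status code into a Result (0 means success).
fn check_nvtiff(status: u32) -> Result<(), u32> {
    if status == nvtiffStatus_t::NVTIFF_STATUS_SUCCESS {
        Ok(())
    } else {
        Err(status)
    }
}

// Usage:
// check_nvtiff(unsafe { nvtiffStreamCreate(tiff_stream) }).expect("nvtiffStreamCreate failed");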

Next, with the tiff_stream handle initialized, we can parse the TIFF file buffer into that CPU memory space using nvtiffStreamParse.

use bytes::Bytes;
use nvtiff_sys::nvtiffStreamParse;

let v: Vec<u8> = std::fs::read("images/float32.tif").unwrap();
let bytes: Bytes = Bytes::copy_from_slice(&v);

let status_parse: u32 = unsafe { nvtiffStreamParse(bytes.as_ptr(), bytes.len(), *tiff_stream) };
assert_eq!(status_parse, nvtiffStatus_t::NVTIFF_STATUS_SUCCESS); // 0

The input bytes (essentially uint8 values) could be read from disk (as in the example above), or fetched over the network using a crate like object_store, minreq, etc. Either way, we pass those bytes to nvtiffStreamParse which will fill the tiff_stream handle with the TIFF data.
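
For example, fetching the bytes over HTTP with minreq might look like this (a hedged sketch; the URL is just a placeholder):

use bytes::Bytes;

// Hypothetical URL, substitute a real (Cloud-Optimized) GeoTIFF of your own.
let url = "https://example.com/float32.tif";
let response = minreq::get(url).send().expect("HTTP request failed");
let bytes: Bytes = Bytes::copy_from_slice(response.as_bytes());
// `bytes` can then be handed to nvtiffStreamParse exactly as above.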

Keep this tiff_stream around, because we'll be parsing metadata and decoding data from it later!

Step 2: Parse TIFF metadata

Now nvTIFF is able to parse a variety of TIFF metadata, both at the file level and at the tag level. For the purposes of this post, we just want to work out one thing: how many bytes to allocate in CUDA memory. The formula for that is:

(Width * Height * Bits per pixel) / 8

i.e., the image's width in pixels, multiplied by the image's height in pixels, multiplied by how many bits are used per pixel (determined by the data type), divided by 8 (8 bits = 1 byte). So for an image with a Width of 3 pixels, Height of 4 pixels, and float32 dtype, the number of bytes = 3 * 4 * 32 / 8 = 48 bytes!
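
As a sanity check, the worked example above translates directly into code. Here's a small (hypothetical) helper encoding the formula:

// Illustrative helper: Width * Height * Bits per pixel / 8 bits-per-byte.
fn tiff_num_bytes(width: usize, height: usize, bits_per_pixel: usize) -> usize {
    width * height * bits_per_pixel / 8
}

assert_eq!(tiff_num_bytes(3, 4, 32), 48); // the 3x4 pixel float32 example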

This image/pixel size information can be retrieved using nvtiffStreamGetFileInfo like so:

use nvtiff_sys::{nvtiffFileInfo, nvtiffStreamGetFileInfo};

let mut file_info = nvtiffFileInfo::default();
let status_fileinfo: u32 = unsafe { nvtiffStreamGetFileInfo(*tiff_stream, &raw mut file_info) };
assert_eq!(status_fileinfo, nvtiffStatus_t::NVTIFF_STATUS_SUCCESS); // 0

Here, we need to instantiate an nvtiffFileInfo struct first. We then pass the TIFF stream handle and a pointer to file_info (using &raw mut) to nvtiffStreamGetFileInfo, which will populate the file_info struct with the TIFF file metadata.

Finally, we can compute the number of bytes needed to store the image in memory:

let num_bytes: usize = file_info.image_width as usize // Width
    * file_info.image_height as usize // Height
    * (file_info.bits_per_pixel as usize / 8); // Bytes per pixel (e.g. 4 bytes for f32)

This num_bytes will be used in the very next step!

Step 3: Allocate GPU memory

I want to emphasize that this CUDA device stream/memory handling part took me a good week to figure out! At least two hard lessons were learned:

  1. My initial attempt was to use nvtiffDecode, but to this day, I cannot fully wrap my head around the double asterisk **image_out argument, specifically how to initialize what (I think) is essentially a Vec of a Vec in CUDA memory. Trying this in unsafe Rust meant a lot of panics, and no useful hints from the compiler on what I should do. Solution (or workaround): Use nvtiffDecodeImage instead!
  2. For a while, I thought the Rust-CUDA project was what I needed to look into. I was almost handwriting my own wrapper around cudaMalloc and other CUDA functions, in unsafe Rust, with no idea whether it'd all work in the end. Pro tip: Use cudarc, which has safe wrappers that are more battle-tested in crates like dfdx, candle-core, cubecl-cuda and so on. It just works™.

With that, allocating memory on the GPU is done using cudarc::driver::CudaStream::alloc_zeros as follows:

use std::sync::Arc;
use cudarc::driver::{CudaContext, CudaSlice, CudaStream, DevicePtrMut, DriverError};

let ctx: Arc<CudaContext> = CudaContext::new(0)?; // Set on GPU:0
let stream: Arc<CudaStream> = ctx.default_stream();

let mut image_stream: CudaSlice<u8> = stream.alloc_zeros::<u8>(num_bytes)?;

This should allocate memory of size num_bytes on GPU:0, provided that you have a CUDA-enabled GPU and enough GPU RAM for the number of bytes needed; otherwise, a cudarc::driver::DriverError will be returned.
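
If you'd rather handle that error explicitly instead of propagating it with ?, a minimal sketch could look like:

use cudarc::driver::CudaSlice;

// Same allocation as above, but matching on the Result instead of using `?`.
let mut image_stream: CudaSlice<u8> = match stream.alloc_zeros::<u8>(num_bytes) {
    Ok(slice) => slice,
    Err(err) => panic!("Failed to allocate {num_bytes} bytes on the GPU: {err:?}"),
};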

The next line is a little bit of pointer type casting. We'll need to cast the pointer of the cudarc-initialized CUDA stream into a CUDA stream type that nvTIFF understands.

let cuda_stream: *mut nvtiff_sys::CUstream_st = stream.cu_stream().cast::<_>();

At this point, we have a cuda_stream handle that nvTIFF can work with, and GPU memory allocated in image_stream; next is to decode tiff_stream from host (CPU) to device (GPU)!

Steps 4 & 5: Decompress and decode raw image bytes into GPU memory

The moment of truth. To do the TIFF decoding, we must first set up a decoder handle, here using nvtiffDecoderCreateSimple:

use nvtiff_sys::{nvtiffDecoder, nvtiffDecoderCreateSimple};

let mut decoder_handle = std::mem::MaybeUninit::uninit();
let nvtiff_decoder: *mut *mut nvtiffDecoder = decoder_handle.as_mut_ptr();

let status_decoder: u32 = unsafe { nvtiffDecoderCreateSimple(nvtiff_decoder, cuda_stream) };
assert_eq!(status_decoder, nvtiffStatus_t::NVTIFF_STATUS_SUCCESS);

This nvtiffDecoderCreateSimple function will instantiate a decoder handle with default memory allocators. Next come the decoding parameters: you could set them up properly using nvtiffDecodeParamsCreate (sketched a little further below), but I'm lazy and just want the entire image (not a subset), so am just gonna use empty/default params:

let mut params = std::mem::MaybeUninit::uninit();
let decode_params: *mut nvtiffDecodeParams = params.as_mut_ptr();
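
If you did want properly initialized parameters, the non-lazy path goes through nvtiffDecodeParamsCreate. I haven't wired this up myself, so treat the following as an assumption-laden sketch: it presumes nvtiffDecodeParamsCreate is allow-listed in the bindings and fills in a params handle the same way nvtiffStreamCreate does:

use nvtiff_sys::{nvtiffDecodeParams, nvtiffDecodeParamsCreate};

// Assumed to mirror nvtiffStreamCreate: pass a pointer to the handle to be filled in.
let mut decode_params: *mut nvtiffDecodeParams = std::ptr::null_mut();
let status_params: u32 = unsafe { nvtiffDecodeParamsCreate(&mut decode_params) };
assert_eq!(status_params, nvtiffStatus_t::NVTIFF_STATUS_SUCCESS);
// The handle could then be configured (e.g. with a region of interest) via the
// nvtiffDecodeParamsSet* family before being passed to nvtiffDecodeImage below.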

For the careful ones, you could use nvtiffDecodeCheckSupported now to determine whether the TIFF file can be decoded by nvTIFF, or...

Just do the decoding straightaway using nvtiffDecodeImage like so:

use nvtiff_sys::nvtiffDecodeImage;

let (image_ptr, _record): (u64, _) = image_stream.device_ptr_mut(&stream);
let image_out_d = image_ptr as *mut std::ffi::c_void;
let status_decode: u32 = unsafe {
    nvtiffDecodeImage(
        *tiff_stream,
        *nvtiff_decoder,
        decode_params,
        0, // image_id
        image_out_d,
        cuda_stream,
    )
};
assert_eq!(status_decode, nvtiffStatus_t::NVTIFF_STATUS_SUCCESS);

This nvtiffDecodeImage call happens asynchronously with respect to the host. It'll call nvCOMP to do GPU-based decompression if it has to.

Sure, we are passing around pointers very precariously. And yes, we grab the output pointer from the cudarc-allocated image_stream (the buffer that receives the decoded pixels) rather than from the cast cuda_stream (which just orders the work on the GPU). This was the best I could come up with to satisfy the Rust type checker.

The result is the same regardless: we end up with the TIFF data decoded into GPU memory 🎉
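
One way to convince yourself that the bytes actually landed is to synchronize the stream and copy the buffer back to the host for inspection. A sketch using cudarc's memcpy_dtov (assuming a recent cudarc version):

// Wait for the asynchronous decode (and any nvCOMP decompression) to finish.
stream.synchronize()?;

// Copy the decoded bytes back into host memory just to eyeball them.
let host_bytes: Vec<u8> = stream.memcpy_dtov(&image_stream)?;
assert_eq!(host_bytes.len(), num_bytes);

// Reinterpret the first pixel as an f32, since the sample image has a float32 dtype.
let first_pixel = f32::from_ne_bytes(host_bytes[0..4].try_into().unwrap());
println!("First pixel value: {first_pixel}");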

Further reading: for those interested, check out my Pull Request at https://github.com/weiji14/cog3pio/pull/27 where all of the above nvTIFF decoding code was implemented, in a slightly saner manner.

Next steps?

So does the story end here? Not quite, because while the TIFF data is in GPU memory now, it's not very user-friendly yet. The data hasn't been placed in a 'Tensor' format, and is essentially still in a very raw 1-dimensional form.

This will be the focus of the next chapter, as we coerce this 1D array on the GPU into a portable format called DLPack that is cross-language (Rust/C++/Python/etc) and cross-framework (CuPy/Torch/JAX/etc). Crucially, DLPack's DLTensor struct has a shape field that we'll need to provide to make our 1D array look like an n-dimensional 'Tensor'.

Stay tuned for the last episode~

Follow along this trilogy!