GeoTIFFs to GPUs part 3: The Last Stage - via DLPack into the world

A lot has happened in the past year. I had hoped to finish this last blog post before my talk at FOSS4G 2025, but getting the Rust-to-Python zero-copy took longer than expected. In some sense, maybe it was good to have done the benchmarks on just the low-level Rust bindings.

The delay was mainly me fighting dependency hell. CUDA 13 was released and I semi-regret trying to jump on it too quickly because packages for Debian 13 Forky took ages and I couldn't develop on my local GPU for weeks (double unlucky 13)... Also, nvTIFF v0.6.0 got released, and then nvTIFF v0.7.0..., with a bunch of changes to the decoder API I had to deal with.

Alas, I could blabber on with more excuses, but let's get on.

Our journey ended last time when we got our TIFF data decoded into a CudaSlice<T>, essentially a Vec<T> equivalent on a CUDA GPU for the Rust folks, or a 1-dimensional array of numbers for anyone else. It's good enough for benchmarking (hey look, I got the data on the GPU!), but not immediately usable because:

Data science libraries are mostly in Python, not Rust, and
Most algorithms on TIFF images typically expect a 2D matrix + bands (so 3D) rather than a 1D flat array.

The rest of this blog post will thus cover two things. First is DLPack - a tensor exchange protocol that we'll coerce our CudaSlice<T> into, and use it to cross the Rust/Python boundary. Second is the user-facing interface. I'll be using a Python library called xarray which ties into a rich ecosystem of data science tools, but notably, lets you define custom backends to open file formats (and in which I'll handle the 1D -> 3D reshaping).

DLPack - the tensor exchange protocol

As a bit of history, DLPack has been around for almost a decade now (v0.1 was released June 2017). I was vaguely aware of DLPack around 2021-ish maybe, when I switched from Chainer (backed by CuPy) to Pytorch or was playing around with RAPIDS AI libraries, it was just one of those curious methods that appeared in the API docs.

When I started my job at DevSeed in 2023, I saw how Arrow really standardized exchange of tabular data, and contemplated using FixedShapeTensor for n-dimensional array data exchange. For a while, I was looking for a Arrow-but-for-raster thing, having forgotten or rather, not known that DLPack is exactly that thing!

So what is DLPack? It is the best n-dimensional array data exchange mechanism. I'll sum it up with a table:

	DLPack	Buffer Protocol	`__cuda_array_interface__`
Cross-language	✅	✅ (usually C to Python)	(Python-only)
Cross-device	✅	(CPU-only)	(GPU-only)

An exchange protocol itself implies being able to work across libraries or frameworks. In Python-land, I've managed to get away with array libraries (e.g. numpy, cupy) using protocols like the (Python) Buffer Protocol or __cuda_array_interface__ without me having to dig into those internals. But once you want to work across languages (e.g. Rust to Python), or across hardware devices (e.g. CPU to GPU), that's where DLPack comes in.

The choice of cog3pio using DLPack as the exchange protocol is a key difference with async-tiff that decided to just use the Buffer Protocol. As a library focused on GPUs, cog3pio does associate itself more with the machine learning field, and so DLPack is a more natural fit given that it handles special dtypes (e.g. bfloat16) while the Buffer Protocol doesn't yet (though it could be argued that float16 TIFF files are already rare, let alone bfloat16!). There's no right or wrong here really, just different design decisions (GPU-compatible vs CPU-only, language-agnostic vs Python-centric).

I would love to see more DLPack implementors. I've evangelized it in zarr-python (zarrs already has it), and think this is the way forward to decouple file formats from array libraries and/or compute hardware. We only need to look at Arrow to realize how nice things could be.

Rust-side DLPack

Back to our TIFF data in a CudaSlice<T>. How do we put that into a DLPack tensor? Using a crate called dlpark (there's another one called dlpk but no proper cuda support yet :p).

Easiest case is if the underlying TIFF data type is uint8. We do a zero-copy of the CUDA slice in GPU memory to a DLManagedTensorVersioned using a dlpark::SafeManagedTensorVersioned::new method.

use cudarc::driver::CudaSlice;
use dlpark::SafeManagedTensorVersioned;

let cuslice: CudaSlice<u8> = ...;  // from nvtiffDecodeImage
let tensor: SafeManagedTensorVersioned = SafeManagedTensorVersioned::new(cuslice)?;

For TIFF data which are not uint8, we have to do a transmute operation. Technically, nvTIFF decodes everything to a uint8 (byte) buffer, and it is up to us to interpret those bytes as a different dtype based on the TIFF metadata (specifically the SampleFormat and BitsPerSample tags).

use nvtiff_sys::result::{NvTiffError, NvTiffStatusError};

let cuslice: CudaSlice<u8> = ...;
let cuview: CudaView<_> = unsafe { cuslice.transmute::<T>(len_elem) }
    .ok_or_raise(|| NvTiffError::StatusError(NvTiffStatusError::BadTiff))?;
let tensor = SafeManagedTensorVersioned::new(cuview)
    .or_raise(|| NvTiffError::StatusError(NvTiffStatusError::AllocatorFailure))?;

For the full implementation details, see https://github.com/weiji14/cog3pio/pull/57. We've now got our tensor into DLPack in Rust, the next step is to zero-copy it into Python.

DLPack streamed to Python

At this point, we have a SafeManagedTensorVersioned thing which is just a safe wrapper around a non-null (DL)ManagedTensorVersioned C struct. The good thing is that dlpark already implements the IntoPyObject trait here, so it should be straightforward to write a pyo3 method that returns the SafeManagedTensorVersioned object and be done.

It is indeed easy if the DLPack tensor was in CPU memory (as I've done before), but having it in CUDA GPU made it harder 🫠

There is a Python Specification for DLPack which details how Python libraries should produce or consume DLPack tensors. For me, on the producer side, I had to implement two methods:

Second one is fairly easy, just a (device_type, device_id) tuple that indicates where the data is stored (e.g. (2, 0) means CUDA device 0). The first one was much more challenging, in the sense that I had to deal with CUDA stream synchronization properly.

Well actually, the stream handling wasn't that bad. I was stuck for months thinking there was a memory leak because of segfaults I couldn't debug, with a workaround I didn't like. Finally managed to narrow it down to some silly difference between two types of tokio thread schedulers that might have corrupted the network fetch somehow.

In the interim, I did get my head around CUDA's legacy default stream vs per-thread default stream. So happened that cudarc=0.18.0 added the ability to create per-thread streams while I was stuck. Then I was able to do a much-needed refactor to defer the TIFF-decode-to-GPU from the CudaCogReader struct new instantiation to the dlpack call as you'll see below.

All that said, this is what I ended up with, with all the wisdom of hindsight. I'll just share the struct method signatures.

On the pure Rust side:

use cudarc::driver::CudaStream;

pub struct CudaCogReader {
    tiff_stream: *mut nvtiffStream, // data
    image_info: nvtiffImageInfo, // metadata
}

impl CudaCogReader {
    pub fn new(byte_stream: &Bytes) -> NvTiffResult<Self> {}
    
    pub fn dlpack(&self, stream: &Arc<CudaStream>) -> NvTiffResult<SafeManagedTensorVersioned> {}
}

Notice how the CUDA stream is passed in later at the dlpack() method. This means only CPU RAM is needed for the new() call to store the TIFF data stream (compressed) and metadata at the start. CUDA GPU memory is only required later by calling dlpack() which triggers the decode.

Here's the corresponding pyo3 side that will be exposed through Python:

pub(crate) struct PyCudaCogReader {
    inner: CudaCogReader,
    device: dlpark::ffi::Device,
}

#[pymethods]
impl PyCudaCogReader {
    #[new]
    fn new(path: &str, device_id: usize) -> PyResult<Self> {}
    
    fn __dlpack__(
        &self,
        stream: Option<u8>,
        max_version: Option<(u32, u32)>,
        dl_device: Option<(usize, usize)>,
        copy: Option<bool>,
    ) -> PyResult<SafeManagedTensorVersioned> {}
}

The new method here differs from the Rust version by requiring a device_id integer, corresponding to the CUDA device id (usually 0) you want to decode on. I debated on whether to have it accept Option<usize>, but decided to keep it an explicitly required parameter for now (a default is set however in the Xarray section below). For the __dlpack__ method, I basically had to conform to the specification.

See https://github.com/weiji14/cog3pio/pull/58 for how PyCudaCogReader was implemented initially, and https://github.com/weiji14/cog3pio/pull/77 for the final tweaks that completes the __dlpack__ method implementation. We're done with seeing Rust code now, and shall move into Python-land.

Xarray - nice backend entrypoints

The xarray data structure adds coordinates and labels the dimensions of an array. So in our case, having TIFF data in what should be a 3-D array, we can call the dimensions or axis as bands (channels or samples), y (height) and x (width), with numbered coordinates for each pixel.

You could also add attributes or be fancy and use rasterix to lazily generate coordinates based on the TIFF's affine transformation. But we'll keep it simple and focus on the minimum required to get our DLPack tensor into an xarray.DataArray data object for now.

The way an xarray user would typically read a file format is with xarray.open_dataarray() like so:

import xarray as xr

path: str = "something.tif"
data: xr.DataArray = xr.open_dataarray(path, engine="...")

That engine parameter is what we'll plug into. This requires subclassing from BackendEntrypoint (see guide).

Side note: originally I wanted to modify my existing backend entrypoint in cog3pio to support decoding to CUDA too (originally only handled CPU decoding). In https://github.com/weiji14/cog3pio/pull/71, I implemented a new device: tuple[int, int] parameter (mimicking the __dlpack_device__ tuple mentioned above) that would allow the user to choose which device (CUDA or CPU) to decode TIFFs into. The code worked, but it was not elegant, requiring a nasty match-case, and opened up the question of whether I should handle decoding to devices in the future (via CPU or CUDA intermediary)? In other words, it didn't fit the UNIX philosophy of "Do One Thing and Do it Well". So I decided to drop xarray in cog3pio, and have the backend/engine go into cupy-xarray instead, which will focus solely on decoding TIFFs to CuPy arrays.

The custom BackendEntrypoint class signature (simplified to the relevant bits) is as follows:

from xarray.backends import BackendEntrypoint

class Cog3pioBackendEntrypoint(BackendEntrypoint):
    def open_dataset(
        self,
        filename_or_obj: str,
        *,
        device_id: int | None = None,
    ) -> xr.Dataset:
        ...

A pet peeve of mine is that we can only implement open_dataset -> xr.Dataset even though we may only be interested in open_dataarray -> xr.DataArray (see xarray issue). But I digress.

The open_dataset() method can be broken down into three parts.

Get device_id either from user, or via cupy.cuda.runtime.getDevice().
```
import cupy as cp

def open_dataset(
    self,
    filename_or_obj: str,
    *,
    device_id: int | None = None,
) -> xr.Dataset:
    
    if device_id is None:
        device_id: int = cp.cuda.runtime.getDevice()
```
Why is the default device_id None and not 0? See in-depth answer at https://github.com/xarray-contrib/cupy-xarray/pull/81#discussion_r2889190821. The meaning of 0 gets complicated when doing multi-node GPU, or even single node + multi-worker jobs (and I'm not sure how the underlying nvTIFF library would handle this), so we'll let CuPy handle that.
Pass in the filepath (or a http URL) into cog3pio.CudaCogReader, and then handle the output shape.
```
with cp.cuda.Stream(ptds=True):
    cog = CudaCogReader(path=filename_or_obj, device_id=device_id)
    array: cp.ndarray = cp.from_dlpack(cog)  # 1-D Array
    x_coords, y_coords = cog.xy_coords()  # TODO consider using rasterix
    height, width = (len(y_coords), len(x_coords))
    channels: int = math.prod(array.shape) // (height * width)
    if array.ndim == 1:
        # Workaround to convert from 1-D to 3-D shape
        # TODO remove once PyCapule returns 3-D shape (e.g. via cuTENSOR)
        array = array.reshape(height, width, channels)  # HWC
        array = array.transpose(2, 0, 1)  # CHW
```
We're putting this in a per-thread default stream (ptds) context instead of the legacy default stream. The cog variable is a PyCapule that is DLPack compatible, so we can call cupy.from_dlpack to put things into a CuPy array. Theoretically, you could also use torch.from_dlpack, jax.dlpack.from_dlpack and so on, but interchange between CuPy to other GPU array libraries should be zero-copy anyway.

Now you might notice that if-statement which does reshaping from a 1-D vector to a 3-D tensor shape. This is a current limitation of cog3pio v0.1.0, due to cudarc not having a safe Tensor struct that has a shape attribute (technically, arrays are all 1-D!). I'm hoping that cuTensor support will make the reshape unnecessary soon, hence the if-statement which is hopefully future-proof.
Put the array data and coordinates into an xarray.DataArray, then convert to an xarray.Dataset.
```
dataarray: xr.DataArray = xr.DataArray(
    data=array,
    coords={
        "band": np.arange(channels, dtype=np.uint8),
        "y": y_coords,
        "x": x_coords,
    },
    name=None,
    attrs=None,
)
return dataarray.to_dataset(name="raster")
```
Here we wrap the CuPy tensor (assuming channel-first ordering) with coordinates (band, y, x). Note that band count starts from 0, contrary to GDAL which starts counting from 1. And because code lives inside open_dataset instead of open_dataarray (see pet peeve above), we need to call .to_dataset() at the end.

After all that, one more thing to do, is to add this entry in pyproject.toml:

[project.entry-points."xarray.backends"]
cog3pio = "cupy_xarray.cog3pio:Cog3pioBackendEntrypoint"

And finally, we can call the code like so:

with xr.open_dataarray(
    filename_or_obj="https://github.com/blacha/cogeotiff/raw/core-v9.5.0/packages/core/data/DEM_BS28_2016_1000_1141.tif",
    engine="cog3pio",
    device_id=0,
) as da:
    print(da.data.__class__)
    print(da)

which should output

<class 'cupy.ndarray'>
<xarray.DataArray 'raster' (band: 1, y: 244, x: 63)> Size: 61kB
array([[[-9999.   , -9999.   , ..., -9999.   , -9999.   ],
        [-9999.   , -9999.   , ..., -9999.   , -9999.   ],
        ...,
        [-9999.   , -9999.   , ...,   242.714,   242.302],
        [-9999.   , -9999.   , ...,   242.144,   241.707]]],
      shape=(1, 244, 63), dtype=float32)
Coordinates:
  * band     (band) uint8 1B 0
  * y        (y) float64 2kB 5.362e+06 5.362e+06 ... 5.362e+06 5.362e+06
  * x        (x) float64 504B 1.68e+06 1.68e+06 1.68e+06 ... 1.68e+06 1.68e+06

There you have it, a GeoTIFF decoded into a CuPy array/GPU memory 😌.

Again, see at https://github.com/xarray-contrib/cupy-xarray/pull/81 for the full implementation code! That Pull Request hasn't been merged as of this blog post's original publication date, but that means there's still time for feedback ;)

Where to from here?

We've gone on quite a journey, from wrapping C++ code in Rust and piping tensor data to Python via DLPack protocol, and wrapping it in a nice xarray structure. I wish it stopped here, but there's still work to do:

Correct DLPack shape (mentioned above), by using cuTensor?
Handle NaN value masking somehow
Partial decode - get just a region of interest or specific bands/channels

Something I'm dreading is the mention in nvTIFF v0.7.0's release notes that they will overhaul their entire decode API in the next version. Eventually, I'm thinking of rewriting the entire stack in Mojo (as mentioned in my FOSS4G 2025 talk) to enable non-CUDA GPU decoding, and maybe even try my hand at SIMD instructions. Those, and working towards direct memory access transfers which will involve understanding the hardware level even more.

But I'm also content with what exists now :)

Recently I've been thinking about Jevons paradox a lot. Greater efficiency might still induce more energy consumption. So no need to rush, happy if it takes a couple more years, as good craft takes time.

Anyways, this is the last episode, thanks for reading!

Previous trilogy chapters:

Part 1: An Unexpected Conflict - multi-threaded reads, too many jobs

Part 2: Barrels Out of Bytes - streaming Rust bits to the GPU

Part 3: The Last Stage - via DLPack into the world