The Pangeo Machine Learning Ecosystem in 2023

TLDR

Open source tools developed by the Pangeo ML community are enabling the shift to cloud-native geospatial Machine Learning. Join the Pangeo ML community in working towards scalable GPU-native workflows! 🚀

Overview

Building next-generation Machine Learning (ML) tools

At FOSS4G SotM Oceania 2023, we presented on "The ecosystem of geospatial machine learning tools in the Pangeo World" (see the recording here). One of the driving forces of the Pangeo community is to build better tools that will enable scientific workflows on petabyte-scale datasets, such as Climate/Weather projections that will impact the planet over the coming decades.

To do that, we need to be fast.

These next-generation tools need to be scalable, efficient, and modular. So we are designing them with three aspects in mind:

- Work with cloud-native data using GPU-native compute
- Be able to stream subsets of data on-the-fly
- Go from single sensor to multi-modal models

None of these core technologies is particularly new. NVIDIA has been leading the development of GPU-native RAPIDS AI libraries since 2018. Streaming has been around since the 2010s if not earlier, and is practically the most common way to consume music and video content nowadays. More recently, we have also seen the rise of multi-modal Foundation Models that are able to take in visual (image) and language (text and sound) cues.

Let's now take a step back, and picture what we're working with.

Layers of the Pangeo Machine Learning stack

There are three main layers to a Machine Learning data pipeline. It starts with data storage file formats at the bottom row, an in-memory array representation in the middle, and high-level libraries and documentation resources that users or developers interact with at the top.

The key to connecting all of these layers is open standards.

Cloud-native geospatial file formats

For the file formats, we favour cloud-native geospatial because it allows us to efficiently access subsets of data without reading the entire file. Generally speaking, you would store rasters as Zarr or Cloud-Optimized GeoTIFFs, and vectors (points/lines/polygons) in FlatGeobuf or (Geo)Parquet. Ideally though, these files would be indexed using a SpatioTemporal Asset Catalog (STAC) which makes it easier to discover datasets using standardized queries. This can be a whole topic in itself, so check out this guide published in October 2023 for more details!
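To make the "read a subset without reading the whole file" idea concrete, here is a minimal pure-Python sketch of the bookkeeping a chunked format makes possible: given a requested window, work out which chunks overlap it. This is a toy illustration and not the Zarr or Cloud-Optimized GeoTIFF API; the function name `chunks_for_window` and the raster and chunk sizes are assumptions made up for the example.

```python
from itertools import product


def chunks_for_window(window, chunk_shape):
    """Return the chunk-grid indices that overlap a half-open window.

    window: tuple of (start, stop) pairs, one per dimension (stop exclusive).
    chunk_shape: tuple with the chunk size along each dimension.
    Hypothetical helper for illustration only.
    """
    per_dim = []
    for (start, stop), size in zip(window, chunk_shape):
        if not 0 <= start < stop:
            raise ValueError("each window must satisfy 0 <= start < stop")
        first = start // size       # chunk holding the first requested cell
        last = (stop - 1) // size   # chunk holding the last requested cell
        per_dim.append(range(first, last + 1))
    return list(product(*per_dim))


# A hypothetical 10_000 x 10_000 raster stored in 1_000 x 1_000 chunks.
needed = chunks_for_window(
    window=((1500, 2500), (0, 900)),
    chunk_shape=(1000, 1000),
)
total_chunks = (10_000 // 1000) ** 2
print(f"{len(needed)} of {total_chunks} chunks are needed: {needed}")
```

In this toy case only the overlapping chunks would need to be fetched from object storage, which is the property that makes such formats a good fit for streaming.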

In-memory array representations

In the Python world, NumPy arrays have been the core way of representing arrays in memory, but there are many others too, along with an ongoing movement to standardize the array/dataframe API at https://data-apis.org. Geospatial folks would most likely be familiar with vector libraries like GeoPandas GeoDataFrames (built on top of pandas), or raster libraries like rioxarray and stackSTAC that read into xarray data structures.

NumPy arrays are CPU-based, but there are also libraries like CuPy which can do GPU-accelerated computations. Instead of GeoPandas, you could use libraries like cuSpatial (built on top of cuDF and part of RAPIDS AI) to run GPU-accelerated algorithms. Deep Learning libraries like PyTorch, TensorFlow or JAX tend to be GPU-based as well, but there are also libraries like Datashader (for visualization) and Xarray that are designed to be CPU/GPU agnostic and can hold either kind of array.
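As a rough sketch of what "CPU/GPU agnostic" can look like in user code, the snippet below writes one function that runs on whichever array type it is given. It assumes NumPy is installed and treats CuPy as optional; the helper names `get_namespace` and `standardize` are made up for this example and are not part of any of the libraries mentioned above.

```python
import numpy as np

try:
    import cupy as cp  # optional: only present on machines with a GPU stack
except ImportError:
    cp = None


def get_namespace(x):
    """Pick the array module (numpy or cupy) that matches the input array."""
    if cp is not None:
        # cupy.get_array_module returns cupy for GPU arrays, numpy otherwise
        return cp.get_array_module(x)
    return np


def standardize(x):
    """Rescale an array to zero mean and unit variance on its own device."""
    xp = get_namespace(x)
    return (x - xp.mean(x)) / xp.std(x)


cpu_result = standardize(np.array([1.0, 2.0, 3.0]))
print(type(cpu_result).__name__, cpu_result)
```

Passing a CuPy array instead should keep the whole computation on the GPU, since the same function body dispatches to the matching module.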

High-level Pangeo ML libraries

Finally, to make life simpler, we have high-level convenience libraries wrapping the low-level stuff. These are designed to have a nicer user interface to connect the underlying file formats and in-memory array representations. The Pangeo Machine Learning Working Group mostly works on Climate/Weather datasets, so we'll focus on multi-dimensional arrays for now.

Stepping into the GPU-native world, cupy-xarray allows users to use GPU-backed CuPy arrays in n-dimensional Xarray data structures (see our previous blog post on this). An exciting development on this front is the experimental kvikIO engine that enables low-latency reading of data from Zarr stores into GPU memory using NVIDIA GPUDirect Storage technology (see this blog post). Preliminary benchmarks suggest that the GPU-based kvikIO engine can take about 25% less time for data reads compared to the regular CPU-based zarr engine!

Once you have tensors loaded (lazily) into an Xarray data structure, xbatcher enables efficient iteration over batches of data in a streaming fashion. This library makes it easier to train machine learning models on big datacubes such as time-series datasets or multi-variate ocean/climate model outputs, as users can do on-the-fly slicing using named variables (more readable than numbered indexes). There is also an experimental cache mechanism we'd like more people to try and provide feedback on!
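To give a feel for slicing by named dimensions, here is a small pure-Python sketch that yields one dictionary of `{dimension name: slice}` per batch. It is a conceptual stand-in and not xbatcher's implementation; the function `batch_slices` is hypothetical, its `input_dims` argument is only loosely modelled on the argument of the same name in xbatcher's `BatchGenerator`, and dropping incomplete trailing batches is a simplification chosen for this toy.

```python
from itertools import product


def batch_slices(sizes, input_dims):
    """Yield dicts of dimension name -> slice, one per non-overlapping batch.

    sizes: dict of dimension name -> full length along that dimension.
    input_dims: dict of dimension name -> batch length along that dimension.
    Dimensions missing from input_dims are left unsliced, and trailing
    remainders that cannot fill a whole batch are dropped (toy behaviour).
    """
    per_dim = []
    for name, length in input_dims.items():
        starts = range(0, sizes[name] - length + 1, length)
        per_dim.append([(name, slice(s, s + length)) for s in starts])
    for combo in product(*per_dim):
        yield dict(combo)


# Hypothetical datacube with 10 time steps on a 4 x 4 grid.
batches = list(
    batch_slices(
        sizes={"time": 10, "lat": 4, "lon": 4},
        input_dims={"time": 4, "lat": 2},
    )
)
for batch in batches:
    print(batch)
```

With an Xarray object, each dictionary could be passed to `.isel(...)` to pull out one batch lazily, which reads more clearly than positional indexing like `data[0:4, 0:2, :]`.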

To connect all of the pieces, zen3geo implements Composable DataPipes for geospatial. It acts as the glue to chain together different building blocks, such as readers for Vector/Raster file formats, converters between different in-memory array representations, and even custom pre-processing functions. The composable design pattern makes it well suited for building complex machine learning data pipelines for multi-modal models that can take in different inputs (e.g. Images, Point Clouds, Trajectory, Text/Sound, etc). Going forward, there are plans to refactor the backend to be asynchronous-first to overcome I/O bottlenecks.
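The composable pattern can be sketched in a few lines of plain Python. The `Pipe` class below is a hypothetical toy, not the zen3geo or DataPipes API: each method returns a new lazy pipe, so readers, converters and pre-processing functions can be chained like building blocks. The file names and the lambda standing in for a "reader" are invented for the example.

```python
class Pipe:
    """A tiny lazy, re-iterable pipeline stage (illustrative only)."""

    def __init__(self, make_iter):
        # make_iter: zero-argument callable returning a fresh iterator
        self._make_iter = make_iter

    @classmethod
    def from_iterable(cls, items):
        return cls(lambda: iter(items))

    def __iter__(self):
        return self._make_iter()

    def map(self, fn):
        """Apply fn to every item, lazily."""
        return Pipe(lambda: (fn(item) for item in self))

    def filter(self, predicate):
        """Keep only the items for which predicate(item) is true."""
        return Pipe(lambda: (item for item in self if predicate(item)))

    def batch(self, size):
        """Group items into lists of at most `size` elements."""
        def generate():
            chunk = []
            for item in self:
                chunk.append(item)
                if len(chunk) == size:
                    yield chunk
                    chunk = []
            if chunk:  # emit the final, possibly shorter, batch
                yield chunk
        return Pipe(generate)


paths = ["a.tif", "b.txt", "c.tif", "d.tif"]  # made-up file names
pipeline = (
    Pipe.from_iterable(paths)
    .filter(lambda p: p.endswith(".tif"))         # keep raster files only
    .map(lambda p: {"path": p, "name": p[:-4]})   # stand-in for a reader
    .batch(2)
)
result = list(pipeline)
print(result)
```

Because every stage only wraps the previous one, nothing is read until the pipeline is iterated, and swapping one block (say, a different reader) leaves the rest of the chain untouched.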

Summary

We've presented a snapshot of the Pangeo Machine Learning ecosystem in 2023. The basis of any machine learning project is the data, and we touched on how cloud-native geospatial file formats and in-memory array representations built on open standards act as the foundation for our work. Lastly, we highlighted some of the high-level Pangeo ML libraries enabling user-friendly access to GPU-native compute, streaming data batches, and composable geospatial data pipelines.

Where to learn more

Educational resources:
Pangeo ML Working Group:
- Monthly meetings
- Discourse Forum

Acknowledgments

The work above is the cumulative effort of folks from the Pangeo, Xarray and RAPIDS AI communities, plus more! In particular, we'd like to acknowledge Deepak Cherian at Earthmover and Negin Sobhani at NCAR for their work on cupy-xarray/kvikIO, Max Jones at CarbonPlan for recent developments on the xbatcher package, and Wei Ji Leong at Development Seed for the development of zen3geo.

Appendix I: Further Reading

Note: A version of this blog post has been published at https://xarray.dev/blog after https://github.com/xarray-contrib/xarray.dev/pull/625 was merged!