Xarray 101#

Overview: Why Xarray?#

Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called “tensors”) are an essential part of computational science. They are encountered in a wide range of fields, including physics, astronomy, geoscience, bioinformatics, engineering, finance, and deep learning. In Python, NumPy provides the fundamental data structure and API for working with raw ND arrays. However, real-world datasets are usually more than just raw numbers; they have labels which encode information about how the array values map to locations in space, time, etc.

Here is an example of how we might structure a dataset for a weather forecast:

You’ll notice multiple data variables (temperature, precipitation), coordinate variables (latitude, longitude), and dimensions (x, y, t). We’ll cover how these fit into Xarray’s data structures below.
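As a rough sketch of the layout described above, the following builds a small synthetic Dataset with the same ingredients: two data variables on dimensions (x, y, t) and two-dimensional latitude and longitude coordinates. The grid values, dates, and random numbers are made up for illustration and do not come from a real forecast.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical grid: 2 x 3 points in space and 4 forecast times (all values made up).
rng = np.random.default_rng(seed=0)
times = pd.date_range("2014-09-06", periods=4, freq="D")
lat = np.array([[42.25, 42.21, 42.17], [42.63, 42.59, 42.55]])
lon = np.array([[-99.83, -99.32, -98.81], [-99.79, -99.23, -98.67]])

forecast = xr.Dataset(
    data_vars={
        # Each data variable is given as (dimension names, values).
        "temperature": (("x", "y", "t"), 15 + 8 * rng.standard_normal((2, 3, 4))),
        "precipitation": (("x", "y", "t"), 10 * rng.random((2, 3, 4))),
    },
    coords={
        # Coordinate variables label positions along one or more dimensions.
        "latitude": (("x", "y"), lat),
        "longitude": (("x", "y"), lon),
        "t": times,
    },
    attrs={"description": "synthetic forecast, for illustration only"},
)

print(forecast)
```

Printing the Dataset lists its dimensions, coordinates, data variables, and attributes in one summary.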

Xarray doesn’t just keep track of labels on arrays – it uses them to provide a powerful and concise interface. For example:

Apply operations over dimensions by name: x.sum('time').
Select values by label (or logical location) instead of integer location: x.loc['2014-01-01'] or x.sel(time='2014-01-01').
Mathematical operations (e.g., x - y) vectorize across multiple dimensions (array broadcasting) based on dimension names, not shape.
Easily use the split-apply-combine paradigm with groupby: x.groupby('time.dayofyear').mean().
Database-like alignment based on coordinate labels that smoothly handles missing values: x, y = xr.align(x, y, join='outer').
Keep track of arbitrary metadata in the form of a Python dictionary: x.attrs.
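
The operations in this list can be tried on a small synthetic array. The sketch below assumes a made-up 4 x 3 array with a daily time coordinate and three location labels; the names x, weights, and the values are illustrative only.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic data: 4 daily time steps x 3 locations (values are made up).
x = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("time", "space"),
    coords={
        "time": pd.date_range("2014-01-01", periods=4, freq="D"),
        "space": ["IA", "IL", "IN"],
    },
    attrs={"units": "degC"},
)

# Reduce over a dimension by name.
total = x.sum("time")

# Select by label, either positionally with .loc or by dimension name with .sel.
first_day_loc = x.loc["2014-01-01"]
first_day_sel = x.sel(time="2014-01-01")

# Arithmetic broadcasts on dimension names, not on shape.
anomaly = x - x.mean("time")
weights = xr.DataArray([1.0, 10.0], dims="level")
scaled = x * weights  # result has dimensions (time, space, level)

# Split-apply-combine with groupby.
daily = x.groupby("time.dayofyear").mean()

# Outer alignment pads the non-overlapping labels with NaN.
a, b = xr.align(x.isel(time=slice(0, 3)), x.isel(time=slice(1, 4)), join="outer")

# Arbitrary metadata lives in a plain dictionary.
print(x.attrs["units"])
```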

The N-dimensional nature of xarray’s data structures makes it suitable for dealing with multi-dimensional scientific data. Its use of dimension names instead of axis numbers (dim='time' instead of axis=0) makes such arrays much more manageable than the raw NumPy ndarray: with xarray, you don’t need to keep track of the order of an array’s dimensions or insert dummy dimensions of size 1 to align arrays (e.g., using np.newaxis).
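
To make the comparison concrete, here is a minimal sketch that performs the same computation with NumPy and with xarray. The arrays temps and offsets and the dimension names time and station are hypothetical, chosen only for this example.

```python
import numpy as np
import xarray as xr

temps = np.arange(6.0).reshape(2, 3)  # axis 0 is time, axis 1 is station (by convention only)
offsets = np.array([10.0, 20.0])      # one offset per time step

# NumPy: remember that time is axis 0 and insert a dummy axis by hand.
np_result = temps + offsets[:, np.newaxis]
np_mean = temps.mean(axis=0)

# xarray: the dimension names do the bookkeeping.
da = xr.DataArray(temps, dims=("time", "station"))
off = xr.DataArray(offsets, dims="time")
xr_result = da + off
xr_mean = da.mean(dim="time")

# The arithmetic still works if the dimension order changes.
xr_result_t = da.transpose("station", "time") + off
```

The transposed case is the one that would silently fail or raise a shape error in NumPy unless the dummy axis were moved accordingly.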

The immediate payoff of using xarray is that you’ll write less code. The long-term payoff is that you’ll understand what you were thinking when you come back to look at it weeks or months later.

Xarray’s Data structures#

Xarray provides two data structures: the DataArray and the Dataset. The DataArray class attaches dimension names, coordinates and attributes to a multi-dimensional array, while a Dataset is a dict-like container that combines multiple DataArray objects sharing dimensions.

Both classes are most commonly created by reading data. To learn how to create a DataArray or Dataset manually, see the Working with labeled data tutorial.

Xarray has a few small real-world tutorial datasets hosted in the pydata/xarray GitHub repository. We’ll use the xarray.tutorial.load_dataset convenience function to download and open the air_temperature (National Centers for Environmental Prediction) Dataset by name.

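import numpy as np
import xarray as xr

print("Numpy version:", np.__version__)
print("Xarray version:", xr.__version__)

# Download (on first use, requires network access) and open the tutorial dataset used below.
ds = xr.tutorial.load_dataset("air_temperature")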

Numpy version: 1.25.2
Xarray version: 2023.8.0

To Pandas and back#

DataArray and Dataset objects are frequently created by converting from other libraries such as pandas or by reading from data storage formats such as NetCDF or zarr.

To convert from / to pandas, we can use the to_xarray methods on pandas objects or the to_pandas methods on xarray objects:

import pandas as pd

print("Pandas version:", pd.__version__)

Pandas version: 2.0.3

series = pd.Series(np.ones((10,)), index=list("abcdefghij"))
series

a    1.0
b    1.0
c    1.0
d    1.0
e    1.0
f    1.0
g    1.0
h    1.0
i    1.0
j    1.0
dtype: float64

arr = series.to_xarray()
arr

<xarray.DataArray (index: 10)>
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
Coordinates:
  * index    (index) object 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j'

arr.to_pandas()

index
a    1.0
b    1.0
c    1.0
d    1.0
e    1.0
f    1.0
g    1.0
h    1.0
i    1.0
j    1.0
dtype: float64
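
The same round trip works for a DataFrame, where to_xarray returns a Dataset with one data variable per column and one dimension per index level. The sketch below uses a made-up two-level index; the names letter, number, value, and flag are illustrative only.

```python
import numpy as np
import pandas as pd
import xarray as xr

# A DataFrame with a two-level MultiIndex (made-up values).
index = pd.MultiIndex.from_product(
    [["a", "b"], [0, 1, 2]], names=["letter", "number"]
)
df = pd.DataFrame(
    {"value": np.arange(6.0), "flag": [True, False] * 3}, index=index
)

# Each index level becomes a dimension, each column a data variable.
ds_from_df = df.to_xarray()
print(ds_from_df)

# Convert back, fixing the order of the index levels explicitly.
round_trip = ds_from_df.to_dataframe(dim_order=["letter", "number"])
```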

We can also control what pandas object is used by calling to_series or to_dataframe:

to_series will always convert DataArray objects to pandas.Series, using a MultiIndex for higher dimensions.

ds.air.to_series()

time                 lat   lon  
2013-01-01 00:00:00  75.0  200.0    241.199997
                           202.5    242.500000
                           205.0    243.500000
                           207.5    244.000000
                           210.0    244.099991
                                       ...    
2014-12-31 18:00:00  15.0  320.0    297.389984
                           322.5    297.190002
                           325.0    296.489990
                           327.5    296.190002
                           330.0    295.690002
Name: air, Length: 3869000, dtype: float32

to_dataframe will always convert DataArray or Dataset objects to a pandas.DataFrame. Note that DataArray objects have to be named for this.

ds.air.to_dataframe()

                                       air
time                lat  lon
2013-01-01 00:00:00 75.0 200.0  241.199997
                         202.5  242.500000
                         205.0  243.500000
                         207.5  244.000000
                         210.0  244.099991
...                                    ...
2014-12-31 18:00:00 15.0 320.0  297.389984
                         322.5  297.190002
                         325.0  296.489990
                         327.5  296.190002
                         330.0  295.690002

3869000 rows × 1 columns
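
The naming requirement can be seen on a small unnamed array. This is a minimal sketch with made-up values; the column name "value" is an arbitrary choice.

```python
import xarray as xr

# An unnamed DataArray (made-up values).
unnamed = xr.DataArray([1.0, 2.0, 3.0], dims="x", coords={"x": [10, 20, 30]})

# Without a name there is no column label, so the conversion raises ValueError.
error_message = None
try:
    unnamed.to_dataframe()
except ValueError as err:
    error_message = str(err)
print("unnamed DataArray:", error_message)

# Either pass a name to to_dataframe, or name the array first.
df_named = unnamed.to_dataframe(name="value")
df_renamed = unnamed.rename("value").to_dataframe()
```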

Since all columns in a DataFrame share the same index, variables with fewer dimensions are broadcast against the full set of dimensions.

ds.to_dataframe()

                                       air
lat  time                lon
75.0 2013-01-01 00:00:00 200.0  241.199997
                         202.5  242.500000
                         205.0  243.500000
                         207.5  244.000000
                         210.0  244.099991
...                                    ...
15.0 2014-12-31 18:00:00 320.0  297.389984
                         322.5  297.190002
                         325.0  296.489990
                         327.5  296.190002
                         330.0  295.690002

3869000 rows × 1 columns
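
The air_temperature Dataset has a single data variable, so the broadcasting is not visible there. A small synthetic Dataset whose variables have different dimensions shows it more clearly; the variable names and values below are made up for illustration.

```python
import xarray as xr

# Two variables with different dimensions: one value per x, and one per (x, y).
ds_mixed = xr.Dataset(
    {
        "station_height": ("x", [100, 250]),
        "reading": (("x", "y"), [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]),
    },
    coords={"x": [0, 1], "y": ["a", "b", "c"]},
)

# Both columns share the (x, y) MultiIndex, so station_height repeats along y.
df_mixed = ds_mixed.to_dataframe(dim_order=["x", "y"])
print(df_mixed)
```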

Indexing and Selecting Data#

Xarray offers extremely flexible indexing routines that combine the best features of NumPy and Pandas for data selection.

The most basic way to access elements of a DataArray object is to use Python’s [] syntax, such as array[i, j], where i and j are both integers.

As xarray objects can store coordinates corresponding to each dimension of an array, label-based indexing is also possible (e.g. .sel(latitude=0), similar to pandas.DataFrame.loc). In label-based indexing, the element position i is looked up automatically from the coordinate values.

Because dimensions and coordinates are labeled, you can subset data along several dimensions at once by name, which supports operations such as slicing, masking, and aggregating on specific criteria. Selections written this way stay readable and reproducible, since they do not depend on the order of an array’s dimensions.

In total, xarray supports four different kinds of indexing, as described below and summarized in this table:

Dimension lookup	Index lookup	`DataArray` syntax	`Dataset` syntax
Positional	By integer	`da[:,0]`	not available
Positional	By label	`da.loc[:,'IA']`	not available
By name	By integer	`da.isel(space=0)` or `da[dict(space=0)]`	`ds.isel(space=0)` or `ds[dict(space=0)]`
By name	By label	`da.sel(space='IA')` or `da.loc[dict(space='IA')]`	`ds.sel(space='IA')` or `ds.loc[dict(space='IA')]`
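
The four kinds of indexing in the table can be compared side by side on a small array. The sketch below assumes a made-up DataArray with dimensions time and space; all four selections pick out the same column.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic array with a daily time coordinate and three labeled locations.
da = xr.DataArray(
    np.arange(12).reshape(4, 3),
    dims=("time", "space"),
    coords={
        "time": pd.date_range("2000-01-01", periods=4),
        "space": ["IA", "IL", "IN"],
    },
)

by_position_int = da[:, 0]            # positional dimension, integer index
by_position_label = da.loc[:, "IA"]   # positional dimension, label index
by_name_int = da.isel(space=0)        # named dimension, integer index
by_name_label = da.sel(space="IA")    # named dimension, label index

# The dictionary forms are equivalent to isel and sel.
by_dict_int = da[dict(space=0)]
by_dict_label = da.loc[dict(space="IA")]

# A Dataset only supports lookup by dimension name.
ds_idx = da.to_dataset(name="foo")
ds_by_int = ds_idx.isel(space=0)
ds_by_label = ds_idx.sel(space="IA")

# Label-based slices include both endpoints, as in pandas.
two_days = da.sel(time=slice("2000-01-02", "2000-01-03"))
```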

Visualization#

Xarray makes it easy to visualize datasets with its built-in plotting methods and integrates well with the HoloViz ecosystem.

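Plot the seasonal mean air temperature computed with groupby:

seasonal_mean = ds.groupby("time.season").mean()  # assumed definition; the groupby step is not shown on this page
seasonal_mean.air.plot(col="season", col_wrap=2);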

# contours
seasonal_mean.air.plot.contour(col="season", levels=20, add_colorbar=True);

# line plots as well
seasonal_mean.air.mean("lon").plot.line(hue="season", y="lat");

For more, see the user guide, the gallery, and the tutorial material.

What’s Next#

Read the tutorial material and user guide
See the description of common terms used in the xarray documentation
The “How do I ...” page of the xarray documentation answers common questions on how to do X with Xarray
Ryan Abernathey has a book on data analysis with a chapter on Xarray
Project Pythia has foundational and more advanced material on Xarray. Pythia also aggregates other Python learning resources.
The Xarray GitHub Discussions and Pangeo Discourse are good places to ask questions.
Tell your friends! Tweet!
