xarray

N-dimensional labeled arrays and datasets for scientific data analysis

Quick Info
  • Category: Advanced Data Analysis & Machine Learning
  • Level: Essential
  • Type: Core Tool

Why We Recommend xarray

Xarray extends pandas-style labels and indexing to multi-dimensional arrays, which makes it well suited to complex scientific datasets such as electrophysiology recordings, imaging data, and climate data. Dimension names, coordinates, and metadata travel with the data, so analysis code can refer to "time" or "channel" instead of axis numbers.

Common Use Cases

  • Multi-dimensional electrophysiology data analysis
  • Time-series data with multiple channels
  • NetCDF file handling, including HDF5-based NetCDF4 files
  • Coordinate-based data selection and alignment

Getting Started

Xarray is an open-source Python library for labeled, multi-dimensional arrays. It is built on top of NumPy and pandas and gives complex scientific datasets a label-aware interface.

Why xarray?

  • Labeled Dimensions: Index data by dimension names instead of axis numbers
  • Coordinate-Based Selection: Select data using meaningful coordinates (time, frequency, channel)
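  • NetCDF Integration: Built-in reading and writing of NetCDF files, including the HDF5-based NetCDF4 format, through backends such as netCDF4 or h5netcdf
  • Broadcasting and Alignment: Arrays are broadcast by dimension name and aligned by coordinate labels automatically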
  • Metadata: Keep track of units, descriptions, and other metadata
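
As a minimal sketch of the points above, the following uses small made-up values (the array contents, the "uV" unit, and the gain factors are illustrative assumptions) to show reduction by dimension name, selection by coordinate label, and broadcasting by dimension name:

```python
import numpy as np
import xarray as xr

# Hypothetical example data: 3 time points x 2 channels
signal = xr.DataArray(
    np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),
    dims=["time", "channel"],
    coords={"time": [0.0, 0.5, 1.0], "channel": ["Ch1", "Ch2"]},
    attrs={"units": "uV"},  # metadata stored alongside the values
)

# Per-channel gain, defined only along the "channel" dimension
gain = xr.DataArray(
    [10.0, 100.0],
    dims=["channel"],
    coords={"channel": ["Ch1", "Ch2"]},
)

# Reduce by dimension name instead of axis number
channel_mean = signal.mean(dim="time")

# Select by coordinate label instead of integer position
ch2_at_half = signal.sel(time=0.5, channel="Ch2")

# Broadcasting matches on the "channel" dimension name, not on axis order
scaled = signal * gain
```

Here `scaled` keeps the `("time", "channel")` layout of `signal`, with each channel multiplied by its own gain.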

Key Concepts

DataArray

A single multi-dimensional array with labeled dimensions:

import xarray as xr
import numpy as np

# Create a DataArray with labeled dimensions
data = xr.DataArray(
    np.random.randn(3, 4, 5),
    dims=["time", "channel", "trial"],
    coords={
        "time": np.arange(3),
        "channel": ["Ch1", "Ch2", "Ch3", "Ch4"],
        "trial": np.arange(5)
    }
)
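
Once a DataArray exists, its structure can be inspected through a few attributes. The sketch below rebuilds an array of the same shape (filled with zeros here, purely as a placeholder) and reads back its dimension names, sizes, and coordinate labels:

```python
import numpy as np
import xarray as xr

# Same layout as the example above, with placeholder zeros as values
data = xr.DataArray(
    np.zeros((3, 4, 5)),
    dims=["time", "channel", "trial"],
    coords={
        "time": np.arange(3),
        "channel": ["Ch1", "Ch2", "Ch3", "Ch4"],
        "trial": np.arange(5),
    },
)

dim_names = data.dims                    # tuple of dimension names
n_channels = data.sizes["channel"]       # length along one named dimension
channel_labels = list(data.coords["channel"].values)

# Selecting a single label drops that dimension from the result
first_trial = data.sel(trial=0)
```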

Dataset

A dict-like collection of DataArrays that can share dimensions and coordinates:

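# Placeholder arrays sharing the "time" dimension (100 samples)
lfp_data = np.random.randn(100, 4)             # (time, channel)
spike_data = np.random.poisson(0.1, (100, 6))  # (time, unit)

ds = xr.Dataset({
    "lfp": (["time", "channel"], lfp_data),
    "spikes": (["time", "unit"], spike_data),
})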
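
A short sketch of how such a Dataset might be used, assuming random placeholder data and a hypothetical 1 kHz time axis: individual variables come back as DataArrays, and an indexing operation on a shared dimension applies to every variable that has it.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

# Placeholder data: 100 samples, 4 LFP channels, 6 units (all assumed)
ds = xr.Dataset(
    {
        "lfp": (["time", "channel"], rng.standard_normal((100, 4))),
        "spikes": (["time", "unit"], rng.poisson(0.1, size=(100, 6))),
    },
    coords={"time": np.arange(100) / 1000.0},  # seconds, assuming 1 kHz sampling
)

# Accessing a variable by name returns a DataArray
lfp = ds["lfp"]

# Indexing along the shared "time" dimension affects both variables
first_half = ds.isel(time=slice(0, 50))

# Per-unit spike counts, summed over time
spike_counts = ds["spikes"].sum(dim="time")
```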

Common Use Cases in Research

Electrophysiology Data

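# Wrap a multi-channel recording in a DataArray.
# recording_data has shape (time, channel, trial); times, channel_names,
# and trial_ids are 1-D label arrays with matching lengths.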
lfp = xr.DataArray(
    recording_data,
    dims=["time", "channel", "trial"],
    coords={
        "time": times,
        "channel": channel_names,
        "trial": trial_ids
    }
)

# Select specific channels and a time window (label-based; both slice ends are inclusive)
baseline = lfp.sel(channel=["Ch1", "Ch2"], time=slice(0, 1000))

# Average across trials
mean_response = lfp.mean(dim="trial")
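
One common follow-up is baseline correction. The sketch below is an illustration under stated assumptions, not a prescribed pipeline: the data are random, and the time axis is assumed to be in milliseconds relative to stimulus onset, so negative times form the baseline window.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

# Assumed time axis: -200 ms to 799 ms relative to stimulus onset
times = np.arange(-200, 800)
lfp = xr.DataArray(
    rng.standard_normal((times.size, 2, 5)),  # placeholder recording
    dims=["time", "channel", "trial"],
    coords={"time": times, "channel": ["Ch1", "Ch2"], "trial": np.arange(5)},
)

# Mean over the pre-stimulus window, one value per channel and trial
baseline = lfp.sel(time=slice(-200, -1)).mean(dim="time")

# Subtraction broadcasts the (channel, trial) baseline along "time"
corrected = lfp - baseline

# Average the corrected signal across trials
mean_response = corrected.mean(dim="trial")
```

Because the subtraction is matched on dimension names, no manual reshaping of `baseline` is needed.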

Time-Frequency Analysis

# Store spectrogram with time and frequency coordinates
spectrogram = xr.DataArray(
    tfr_data,
    dims=["time", "frequency", "channel"],
    coords={
        "time": time_bins,
        "frequency": freq_bins,
        "channel": channels
    }
)

# Select theta band
theta = spectrogram.sel(frequency=slice(4, 8))
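
To make the band selection concrete, here is a small sketch with a placeholder spectrogram (constant values and a hypothetical 1 to 40 Hz frequency axis). It averages power over the theta band and also shows nearest-label lookup for a frequency that is not exactly on the grid:

```python
import numpy as np
import xarray as xr

# Assumed axes: 20 time bins of 0.1 s, frequencies 1..40 Hz, 2 channels
time_bins = np.arange(20) * 0.1
freq_bins = np.arange(1.0, 41.0)
spectrogram = xr.DataArray(
    np.ones((time_bins.size, freq_bins.size, 2)),  # placeholder power values
    dims=["time", "frequency", "channel"],
    coords={"time": time_bins, "frequency": freq_bins, "channel": ["Ch1", "Ch2"]},
)

# Label-based slices include both endpoints: 4, 5, 6, 7, and 8 Hz
theta = spectrogram.sel(frequency=slice(4, 8))

# Collapse the band to one theta-power trace per channel
theta_power = theta.mean(dim="frequency")

# Pick the closest available frequency bin to 6.2 Hz
nearest = spectrogram.sel(frequency=6.2, method="nearest")
```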

Integration with Other Tools

  • Pandas: Convert between DataFrames and DataArrays
  • NumPy: NumPy universal functions such as np.sqrt work on xarray objects and keep their labels; .values gives access to the underlying array
  • Matplotlib: Direct plotting with labeled axes
  • Dask: Parallel computing with large arrays
  • NetCDF4: Read and write NetCDF files through the netCDF4 backend
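
The NumPy and pandas entries can be sketched as follows, using a small made-up array named "power" (the name and values are illustrative). Plotting, Dask, and file I/O are left out because they need optional dependencies.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 2 x 2 array; a name is required for to_dataframe()
da = xr.DataArray(
    np.array([[1.0, 4.0], [9.0, 16.0]]),
    dims=["time", "channel"],
    coords={"time": [0, 1], "channel": ["Ch1", "Ch2"]},
    name="power",
)

# NumPy ufuncs return a DataArray with the same labels
root = np.sqrt(da)

# Convert to pandas: a DataFrame indexed by (time, channel)
df = da.to_dataframe()

# Convert back from a MultiIndex Series to a DataArray
roundtrip = xr.DataArray.from_series(df["power"])

# Drop down to the plain NumPy array when labels are not needed
raw = da.values
```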

Installation

pixi add xarray
# or
conda install -c conda-forge xarray
# or
pip install xarray

Best Practices

  • Use meaningful dimension and coordinate names
  • Include units and descriptions in attributes
  • Save to NetCDF format for efficient, self-describing storage that keeps labels and attributes with the data
  • Use .sel() for label-based indexing, .isel() for position-based
  • Leverage automatic broadcasting for operations across dimensions
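
A brief sketch of several of these practices together, with made-up values and units (the "uV" and "s" units and the description text are assumptions). The time axis starts at 10.0 s on purpose, so the difference between label-based and position-based indexing is visible:

```python
import numpy as np
import xarray as xr

# Assumed time stamps in seconds: 10.0, 10.5, 11.0, 11.5, 12.0
times = np.arange(5) * 0.5 + 10.0

lfp = xr.DataArray(
    np.arange(10.0).reshape(5, 2),
    dims=["time", "channel"],
    coords={
        # (dims, values, attrs) form attaches units to the coordinate itself
        "time": ("time", times, {"units": "s"}),
        "channel": ["Ch1", "Ch2"],
    },
    name="lfp",
    attrs={"units": "uV", "description": "example local field potential"},
)

# Label-based: the sample whose time coordinate equals 10.5 s
by_label = lfp.sel(time=10.5)

# Position-based: the second sample along "time" (the same row here)
by_position = lfp.isel(time=1)

# Writing to disk needs a NetCDF backend (netCDF4, h5netcdf, or scipy) installed:
# lfp.to_netcdf("recording.nc")
```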
