HDF5 (Hierarchical Data Format version 5) is a file format and library designed for storing and managing large, complex scientific datasets. It provides a flexible, efficient structure for organizing heterogeneous data with metadata.
Key Features
- Hierarchical organization: Store data in groups and datasets, analogous to directories and files in a filesystem
- Self-describing: Metadata embedded within the file structure
- Efficient storage: Compression and chunking for large arrays
- Cross-platform: Consistent format across operating systems and languages
- Partial I/O: Read subsets of large datasets without loading entire file
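The compression, chunking, and partial-I/O features above can be sketched with h5py. This is a minimal illustration; the file name, dataset name, and array shapes are placeholders, not from the original text:

```python
import h5py
import numpy as np

# Illustrative array: 64 channels x 30000 samples
data = np.random.randn(64, 30000)

with h5py.File('features_demo.h5', 'w') as f:
    # Chunked, gzip-compressed dataset: each chunk holds one channel
    f.create_dataset('traces', data=data,
                     chunks=(1, 30000), compression='gzip')

with h5py.File('features_demo.h5', 'r') as f:
    # Partial I/O: slicing reads only the chunks it needs from disk,
    # without loading the whole dataset into memory
    first_channel = f['traces'][0, :]     # one channel
    window = f['traces'][:, 1000:2000]    # a time window across channels
```

Choosing chunks that match the access pattern (here, one channel per chunk) is what makes partial reads efficient.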
Scientific Applications
HDF5 is widely used in neuroscience for:
- Multi-channel electrophysiology recordings
- Calcium imaging time series
- Simulation outputs and model parameters
- Large-scale experimental metadata
Python Integration
```python
import h5py
import numpy as np

# Example data: 64 channels sampled for one second at 30 kHz
neural_data = np.random.randn(64, 30000)
timestamps = np.arange(30000) / 30000

# Write data
with h5py.File('experiment.h5', 'w') as f:
    # Create a group to hold the recording
    recording = f.create_group('recording')
    # Store datasets inside the group
    recording.create_dataset('traces', data=neural_data)
    recording.create_dataset('timestamps', data=timestamps)
    # Add metadata as attributes on the group
    recording.attrs['sampling_rate'] = 30000
    recording.attrs['subject_id'] = 'S01'

# Read data
with h5py.File('experiment.h5', 'r') as f:
    traces = f['recording/traces'][:]
    rate = f['recording'].attrs['sampling_rate']
```
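Because HDF5 files are self-describing, the hierarchy can also be discovered at runtime rather than hard-coded. A small sketch using h5py's `visititems` (the file name and contents here are illustrative, created inline so the example is self-contained):

```python
import h5py
import numpy as np

# Build a small file so the sketch is self-contained
with h5py.File('explore_demo.h5', 'w') as f:
    g = f.create_group('recording')
    g.create_dataset('traces', data=np.zeros((4, 100)))
    g.attrs['sampling_rate'] = 30000

names = []
with h5py.File('explore_demo.h5', 'r') as f:
    # visititems invokes the callback for every group and dataset,
    # passing the path within the file and the object itself
    f.visititems(lambda name, obj: names.append((name, type(obj).__name__)))
    rate = f['recording'].attrs['sampling_rate']
```

This pattern is useful for inspecting files produced by other labs or tools, where the group layout is not known in advance.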
When to Use HDF5
Best for:
- Large multidimensional arrays (>1GB)
- Complex hierarchical data structures
- Data requiring partial access
- Long-term archival storage
Consider alternatives for:
- Simple tabular data (use Parquet or CSV)
- Small datasets (<100MB)
- Highly nested JSON-like structures