HDF5

Hierarchical Data Format for storing large scientific datasets

Quick Info
  • Category: Open Science & Data Sharing
  • Level: Intermediate
  • Type: Data Format

Why We Recommend HDF5

HDF5 is the standard format for storing large, complex scientific datasets with rich metadata. Its hierarchical structure and efficient compression make it ideal for neuroscience data like multi-channel recordings and imaging time series.

Common Use Cases

  • Store multi-channel electrophysiology recordings
  • Organize calcium imaging time series with metadata
  • Archive large-scale experimental data
  • Enable partial I/O for large datasets

Getting Started

HDF5 (Hierarchical Data Format version 5) is a file format and library designed for storing and managing large, complex scientific datasets. It provides a flexible, efficient structure for organizing heterogeneous data with metadata.

Key Features

  • Hierarchical organization: Store data in groups and datasets, similar to filesystem directories
  • Self-describing: Metadata embedded within the file structure
  • Efficient storage: Compression and chunking for large arrays
  • Cross-platform: Consistent format across operating systems and languages
  • Partial I/O: Read subsets of large datasets without loading entire file
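The compression, chunking, and partial-I/O features above can be sketched with h5py. This is a minimal illustration with made-up data and a hypothetical filename; the chunk shape and gzip level are arbitrary choices, not recommendations:

```python
import h5py
import numpy as np

# Hypothetical data: 1000 samples x 64 channels
data = np.random.rand(1000, 64)

with h5py.File('features_demo.h5', 'w') as f:
    dset = f.create_dataset(
        'signals',
        data=data,
        chunks=(100, 64),       # chunking: stored in 100-row blocks
        compression='gzip',     # transparent compression per chunk
        compression_opts=4,     # gzip level (1-9)
    )
    dset.attrs['units'] = 'uV'  # self-describing: metadata lives in the file

with h5py.File('features_demo.h5', 'r') as f:
    subset = f['signals'][:10]  # partial I/O: only the first rows are read
    print(subset.shape)         # (10, 64)
```

Because data is stored chunk by chunk, a slice like `f['signals'][:10]` touches only the chunks that overlap the requested region rather than decompressing the whole array.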

Scientific Applications

HDF5 is widely used in neuroscience for:

  • Multi-channel electrophysiology recordings
  • Calcium imaging time series
  • Simulation outputs and model parameters
  • Large-scale experimental metadata

Python Integration

import h5py
import numpy as np

# Example data: 1 s of a 4-channel recording at 30 kHz
neural_data = np.random.randn(4, 30000).astype(np.float32)
timestamps = np.arange(30000) / 30000.0

# Write data
with h5py.File('experiment.h5', 'w') as f:
    # Create groups
    recording = f.create_group('recording')

    # Store datasets
    recording.create_dataset('traces', data=neural_data)
    recording.create_dataset('timestamps', data=timestamps)

    # Add metadata
    recording.attrs['sampling_rate'] = 30000
    recording.attrs['subject_id'] = 'S01'

# Read data
with h5py.File('experiment.h5', 'r') as f:
    traces = f['recording/traces'][:]
    rate = f['recording'].attrs['sampling_rate']
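Reading with `[:]` as above loads the full array. For large recordings, slicing the dataset object directly pulls only the requested region from disk. A minimal sketch, using a hypothetical file that mirrors the `recording/traces` layout:

```python
import h5py
import numpy as np

# Build a small file with the same layout (hypothetical data).
with h5py.File('partial_demo.h5', 'w') as f:
    f.create_dataset('recording/traces', data=np.random.randn(4, 90000))

with h5py.File('partial_demo.h5', 'r') as f:
    dset = f['recording/traces']  # handle only; no data read yet
    window = dset[:, 1000:2000]   # reads just this slice from disk
    print(window.shape)           # (4, 1000)
```

The key distinction is that `f['recording/traces']` returns a dataset handle, not an array; indexing the handle performs the actual (partial) read.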

When to Use HDF5

Best for:

  • Large multidimensional arrays (>1GB)
  • Complex hierarchical data structures
  • Data requiring partial access
  • Long-term archival storage

Consider alternatives for:

  • Simple tabular data (use Parquet or CSV)
  • Small datasets (<100MB)
  • Highly nested JSON-like structures