
DuckDB

In-process SQL database designed for analytical queries

Quick Info
  • Category: Open Science & Data Sharing
  • Level: Intermediate
  • Type: Data Tool

Why We Recommend DuckDB

DuckDB brings the power of SQL to your research data without the complexity of server setup. It is fast for analytical queries, works directly with Pandas DataFrames and Parquet files, and requires zero configuration, which makes it well suited to querying datasets too large to fit in memory.

Common Use Cases

  • Query large CSV, JSON, and Parquet files with SQL
  • Build relational databases without server management
  • Perform complex joins and aggregations on research data
  • Work with data larger than RAM

Getting Started

DuckDB is an in-process SQL OLAP database management system designed for analytical workloads. Think of it as “SQLite for analytics”: embedded, serverless, and optimized for analytical queries on large datasets.

Why DuckDB?

  • Fast: Columnar storage optimized for analytics
  • Embedded: No server to manage, just a Python import
  • SQL Interface: Use familiar SQL syntax
  • Large Data: Handles datasets larger than RAM
  • Format Support: Query CSV, JSON, Parquet directly
  • Pandas Integration: Seamless DataFrame conversion

Key Features

Direct File Querying

import duckdb

# Query CSV files directly
result = duckdb.query("""
    SELECT subject, AVG(reaction_time) as avg_rt
    FROM 'experiment_data.csv'
    GROUP BY subject
    ORDER BY avg_rt
""").df()

Pandas Integration

import pandas as pd

# Query Pandas DataFrames
df = pd.read_csv('data.csv')
result = duckdb.query("""
    SELECT *
    FROM df
    WHERE condition = 'A'
""").df()

Multiple File Formats

# JSON
duckdb.query("SELECT * FROM 'data.json'")

# Parquet (very efficient)
duckdb.query("SELECT * FROM 'data.parquet'")

# Multiple CSVs with wildcards
duckdb.query("SELECT * FROM 'data/*.csv'")

Complex Queries

con = duckdb.connect()

# Register an existing Pandas DataFrame (here called trials_df) as a view
con.register('trials', trials_df)

# Complex analytical query
result = con.execute("""
    WITH trial_stats AS (
        SELECT
            subject,
            condition,
            COUNT(*) as n_trials,
            AVG(accuracy) as mean_acc,
            STDDEV(reaction_time) as std_rt
        FROM trials
        WHERE reaction_time > 0.1
        GROUP BY subject, condition
    )
    SELECT
        condition,
        AVG(mean_acc) as avg_accuracy,
        AVG(std_rt) as avg_variability
    FROM trial_stats
    GROUP BY condition
""").df()

Persistent Databases

# Create persistent database file
con = duckdb.connect('research.db')

# Load data into tables
con.execute("""
    CREATE TABLE experiments AS
    SELECT * FROM 'experiments.csv'
""")

# Query anytime
results = con.execute("""
    SELECT * FROM experiments
    WHERE date > '2024-01-01'
""").df()

Performance Optimization

Parquet for Storage

# Convert CSV to Parquet (much faster to query)
duckdb.query("""
    COPY (SELECT * FROM 'large_data.csv')
    TO 'large_data.parquet' (FORMAT PARQUET)
""")

Indexes

# Indexes require a real table, so materialize the registered view first
con.execute("CREATE TABLE trials_tbl AS SELECT * FROM trials")
con.execute("CREATE INDEX idx_subject ON trials_tbl(subject)")

Partitioned Data

# Query Hive-partitioned Parquet files efficiently
duckdb.query("""
    SELECT * FROM read_parquet('data/year=2024/month=*/data.parquet',
                               hive_partitioning = true)
    WHERE month = 3  -- prunes non-matching partition directories
""")

Research Use Cases

Behavioral Data Analysis

# Aggregate across subjects and conditions
duckdb.query("""
    SELECT
        condition,
        COUNT(*) as n_trials,
        AVG(correct) as accuracy,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY rt) as median_rt
    FROM trials
    GROUP BY condition
""")

Joining Multiple Datasets

# Combine trials, subjects, and session info
duckdb.query("""
    SELECT t.*, s.age, s."group", sess.date
    FROM trials t
    JOIN subjects s ON t.subject_id = s.id
    JOIN sessions sess ON t.session_id = sess.id
    WHERE s."group" = 'control'  -- "group" is a reserved word, so quote it
""")

Time Series Queries

# Window functions for time-based analysis
duckdb.query("""
    SELECT
        subject,
        trial_num,
        accuracy,
        AVG(accuracy) OVER (
            PARTITION BY subject
            ORDER BY trial_num
            ROWS BETWEEN 9 PRECEDING AND CURRENT ROW
        ) as rolling_accuracy
    FROM trials
""")

Installation and Basic Usage

Install DuckDB:

pip install duckdb

Basic usage:

import duckdb

# Quick query
result = duckdb.query("""
    SELECT
        condition,
        COUNT(*) as n,
        AVG(reaction_time) as mean_rt
    FROM 'data.csv'
    GROUP BY condition
""").df()

print(result)

Tips

  • Use Parquet format for large datasets (often 10-100x faster to scan than CSV)
  • Learn SQL window functions for time-series analysis
  • Use EXPLAIN to understand query performance
  • Register Pandas DataFrames with con.register()
  • Use df() to convert results back to Pandas
  • Leverage wildcard patterns for multiple files
  • Consider partitioning large datasets by date/subject
  • Use COPY to export results efficiently