DuckDB is an in-process SQL OLAP database management system. Think of it as “SQLite for analytics”: it’s embedded, serverless, and optimized for analytical queries on large datasets.
Why DuckDB?
- Fast: Columnar storage optimized for analytics
- Embedded: No server to manage, just a Python import
- SQL Interface: Use familiar SQL syntax
- Large Data: Handles datasets larger than RAM
- Format Support: Query CSV, JSON, Parquet directly
- Pandas Integration: Seamless DataFrame conversion
Key Features
Direct File Querying
import duckdb
# Query CSV files directly
result = duckdb.query("""
SELECT subject, AVG(reaction_time) as avg_rt
FROM 'experiment_data.csv'
GROUP BY subject
ORDER BY avg_rt
""").df()
Pandas Integration
import pandas as pd
# Query Pandas DataFrames
df = pd.read_csv('data.csv')
result = duckdb.query("""
SELECT *
FROM df
WHERE condition = 'A'
""").df()
Multiple File Formats
# JSON
duckdb.query("SELECT * FROM 'data.json'")
# Parquet (very efficient)
duckdb.query("SELECT * FROM 'data.parquet'")
# Multiple CSVs with wildcards
duckdb.query("SELECT * FROM 'data/*.csv'")
Complex Queries
con = duckdb.connect()
# Register the DataFrame as a virtual table (a view; no data is copied)
con.register('trials', trials_df)
# Complex analytical query
result = con.execute("""
WITH trial_stats AS (
SELECT
subject,
condition,
COUNT(*) as n_trials,
AVG(accuracy) as mean_acc,
STDDEV(reaction_time) as std_rt
FROM trials
WHERE reaction_time > 0.1
GROUP BY subject, condition
)
SELECT
condition,
AVG(mean_acc) as avg_accuracy,
AVG(std_rt) as avg_variability
FROM trial_stats
GROUP BY condition
""").df()
Persistent Databases
# Create persistent database file
con = duckdb.connect('research.db')
# Load data into tables
con.execute("""
CREATE TABLE experiments AS
SELECT * FROM 'experiments.csv'
""")
# Query anytime
results = con.execute("""
SELECT * FROM experiments
WHERE date > '2024-01-01'
""").df()
Performance Optimization
Parquet for Storage
# Convert CSV to Parquet (much faster to query)
duckdb.query("""
COPY (SELECT * FROM 'large_data.csv')
TO 'large_data.parquet' (FORMAT PARQUET)
""")
Indexes
con.execute("CREATE INDEX idx_subject ON trials(subject)")
Partitioned Data
# Query partitioned Parquet files efficiently; hive_partitioning
# exposes the year/month directory keys as queryable columns
duckdb.query("""
    SELECT *
    FROM read_parquet('data/year=2024/month=*/data.parquet', hive_partitioning=true)
    WHERE month = 3
""")
Research Use Cases
Behavioral Data Analysis
# Aggregate across subjects and conditions
duckdb.query("""
SELECT
condition,
COUNT(*) as n_trials,
AVG(correct) as accuracy,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY rt) as median_rt
FROM trials
GROUP BY condition
""")
Joining Multiple Datasets
# Combine trials, subjects, and session info
# ("group" is quoted because GROUP is an SQL keyword)
duckdb.query("""
    SELECT t.*, s.age, s."group", sess.date
    FROM trials t
    JOIN subjects s ON t.subject_id = s.id
    JOIN sessions sess ON t.session_id = sess.id
    WHERE s."group" = 'control'
""")
Time Series Queries
# Window functions for time-based analysis
duckdb.query("""
SELECT
subject,
trial_num,
accuracy,
AVG(accuracy) OVER (
PARTITION BY subject
ORDER BY trial_num
ROWS BETWEEN 9 PRECEDING AND CURRENT ROW
) as rolling_accuracy
FROM trials
""")
Getting Started
Install DuckDB:
pip install duckdb
Basic usage:
import duckdb
# Quick query
result = duckdb.query("""
SELECT
condition,
COUNT(*) as n,
AVG(reaction_time) as mean_rt
FROM 'data.csv'
GROUP BY condition
""").df()
print(result)
Tips
- Use Parquet format for large datasets (10-100x faster than CSV)
- Learn SQL window functions for time-series analysis
- Use EXPLAIN to understand query performance
- Register Pandas DataFrames with con.register()
- Use df() to convert results back to Pandas
- Leverage wildcard patterns for multiple files
- Consider partitioning large datasets by date/subject
- Use COPY to export results efficiently (see the sketch below)
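On the last tip, a minimal export sketch (the output file name is hypothetical):
# Export an aggregate straight to CSV with a header row
con.execute("""
    COPY (
        SELECT condition, AVG(reaction_time) as mean_rt
        FROM trials
        GROUP BY condition
    ) TO 'summary.csv' (HEADER, DELIMITER ',')
""")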