
Pydoit

Python task automation tool for defining tasks with dependencies and executing them in order

Quick Info
  • Category: Research Infrastructure & Reproducibility
  • Level: Intermediate
  • Type: Recommended Tool
  • Requires:
    • python

Why We Recommend Pydoit

Pydoit brings make-style build automation to Python with a simple, Pythonic interface. It's perfect for automating repetitive analysis tasks, managing data processing pipelines, and ensuring your analysis steps run in the correct order.

Common Use Cases

  • Automating data analysis workflows
  • Task dependency management
  • Incremental builds (only run what changed)
  • Jupyter notebook automation

Getting Started

Pydoit (or just “doit”) is a task management and automation tool written in Python. It allows you to define tasks and their dependencies in Python code, and automatically determines which tasks need to be executed.

Why Pydoit?

  • Python-Based: Define tasks in Python, not a DSL
  • Smart Execution: Only runs tasks when inputs change
  • Dependency Tracking: Automatic task ordering
  • Jupyter Integration: Load tasks directly in notebooks
  • Simple: Easier learning curve than Snakemake/Make
  • Flexible: Python code for task definitions

Basic Concepts

Tasks as Python Functions

# dodo.py
def task_download_data():
    """Download raw data files"""
    return {
        'actions': ['wget https://example.com/data.csv -O data/raw.csv'],
        'targets': ['data/raw.csv'],
    }

def task_process_data():
    """Process downloaded data"""
    return {
        'actions': ['python scripts/process.py'],
        'file_dep': ['data/raw.csv'],
        'targets': ['data/processed.csv'],
    }

def task_analyze():
    """Run analysis"""
    return {
        'actions': ['python scripts/analyze.py'],
        'file_dep': ['data/processed.csv'],
        'targets': ['results/analysis.png'],
    }

Running Tasks

# List all tasks
doit list

# Run all tasks
doit

# Run specific task
doit process_data

# Force re-run (ignore up-to-date)
doit -a

# Clean generated files
doit clean
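
Because doit tracks file_dep and targets, running the same command twice skips everything that is already up to date. A sketch of typical output for the dodo.py above (executed tasks are prefixed with ".", up-to-date tasks with "--"; exact formatting may vary between versions):

# First run: all tasks execute
$ doit
.  download_data
.  process_data
.  analyze

# Second run with no input changes: everything is reported up to date
$ doit
-- download_data
-- process_data
-- analyze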

Task Structure

Basic Task

def task_example():
    return {
        'actions': ['echo "Hello World"'],
        'verbosity': 2,  # Show command output
    }

With Dependencies

def task_preprocess():
    return {
        'actions': ['python preprocess.py %(dependencies)s %(targets)s'],
        'file_dep': ['data/raw.csv'],
        'targets': ['data/clean.csv'],
    }

def task_analyze():
    return {
        'actions': ['python analyze.py'],
        'file_dep': ['data/clean.csv'],  # Depends on preprocess
        'targets': ['results/plot.png'],
    }

Python Actions

def process_data(input_file, output_file):
    import pandas as pd
    df = pd.read_csv(input_file)
    # ... process ...
    df.to_csv(output_file, index=False)

def task_process():
    return {
        'actions': [(process_data, ['data/raw.csv', 'data/clean.csv'])],
        'file_dep': ['data/raw.csv'],
        'targets': ['data/clean.csv'],
    }

Research Workflow Example

Complete Analysis Pipeline

# dodo.py
import pandas as pd
from pathlib import Path

SUBJECTS = ['S01', 'S02', 'S03']

# Download data
def task_download():
    """Download dataset from repository"""
    return {
        'actions': ['wget https://repo.org/data.zip', 'unzip data.zip'],
        'targets': ['data/raw/dataset.csv'],
        'uptodate': [True],  # Only download once
    }

# Process each subject
def task_preprocess():
    """Preprocess individual subject data"""
    for subject in SUBJECTS:
        yield {
            'name': subject,
            'actions': [f'python scripts/preprocess.py {subject}'],
            'file_dep': [f'data/raw/{subject}.csv'],
            'targets': [f'data/processed/{subject}.csv'],
        }

# Aggregate results
def task_aggregate():
    """Combine all subject data"""
    return {
        'actions': ['python scripts/aggregate.py'],
        'file_dep': [f'data/processed/{s}.csv' for s in SUBJECTS],
        'targets': ['data/combined.csv'],
    }

# Statistical analysis
def task_statistics():
    """Run statistical tests"""
    return {
        'actions': ['python scripts/stats.py'],
        'file_dep': ['data/combined.csv'],
        'targets': ['results/statistics.csv'],
    }

# Generate figures
def task_figures():
    """Create publication figures"""
    return {
        'actions': ['python scripts/plot.py'],
        'file_dep': ['data/combined.csv', 'results/statistics.csv'],
        'targets': ['figures/figure1.png', 'figures/figure2.png'],
    }

# Generate report
def task_report():
    """Generate final report"""
    return {
        'actions': ['jupyter nbconvert --execute report.ipynb'],
        'file_dep': ['report.ipynb', 'results/statistics.csv'],
        'targets': ['report.html'],
    }
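
With this dodo.py in place, a single doit call runs only what is out of date, and the generator-based preprocess task exposes one sub-task per subject, addressable as task:name. A brief usage sketch based on the task names defined above:

# Run the whole pipeline
doit

# Re-run preprocessing for a single subject
doit preprocess:S01

# Run only the later stages (their dependencies run first if needed)
doit statistics figures report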

Jupyter Notebook Integration

Load Doit in Notebook

# In a Jupyter/IPython cell, load the doit extension
%load_ext doit

# Now the %doit magic is available

Define Tasks in Cells

# Cell 1: Define task
def task_process():
    return {
        'actions': ['python process.py'],
        'file_dep': ['data.csv'],
        'targets': ['processed.csv'],
    }

# Cell 2: Run task
%doit process

Notebook as Task

def task_run_notebook():
    """Execute analysis notebook"""
    return {
        'actions': ['jupyter nbconvert --execute analysis.ipynb'],
        'file_dep': ['analysis.ipynb', 'data/input.csv'],
        'targets': ['analysis.html'],
    }

Advanced Features

Task Groups

from pathlib import Path

def task_process_all():
    """Process all data files"""
    for filename in Path('data/raw').glob('*.csv'):
        yield {
            'name': filename.stem,
            'actions': [f'python process.py {filename}'],
            'file_dep': [str(filename)],
            'targets': [f'data/processed/{filename.stem}.csv'],
        }

Cleanup

def task_analyze():
    return {
        'actions': ['python analyze.py'],
        'file_dep': ['data.csv'],
        'targets': ['results.png'],
        'clean': True,  # Include in 'doit clean'
    }

# Run cleanup
# doit clean
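
'clean': True removes the declared targets. The clean attribute can also be a list of actions if extra files need to be removed; a sketch, assuming doit.task.clean_targets to keep the default target removal alongside a custom command:

from doit.task import clean_targets

def task_analyze():
    return {
        'actions': ['python analyze.py'],
        'file_dep': ['data.csv'],
        'targets': ['results.png'],
        # remove the declared targets plus a cache directory on 'doit clean'
        'clean': [clean_targets, 'rm -rf cache/'],
    }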

Custom Uptodate Check

def check_modified(task, values):
    """Consider the task up to date unless its dependency changed in the last hour"""
    from pathlib import Path
    import time

    # task.file_dep is a set, so take its (single) entry rather than indexing
    file_path = Path(next(iter(task.file_dep)))
    if not file_path.exists():
        return False

    # Returning True means "up to date": skip the task if the file has not
    # been modified within the last hour
    age = time.time() - file_path.stat().st_mtime
    return age > 3600

def task_analyze():
    return {
        'actions': ['python analyze.py'],
        'file_dep': ['data.csv'],
        'uptodate': [check_modified],
    }
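
For common cases like this, doit ships ready-made uptodate helpers in doit.tools; a sketch using two of them (timeout and config_changed), hedged on the exact behaviour of your doit version:

import datetime
from doit.tools import timeout, config_changed

def task_refresh():
    return {
        'actions': ['python refresh.py'],
        # re-run at most once per hour, and whenever the config dict changes
        'uptodate': [timeout(datetime.timedelta(hours=1)),
                     config_changed({'threshold': 0.5})],
    }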

Parameters

# Pass a configurable parameter to a task
def task_process():
    def process(threshold):
        print(f"Processing with threshold={threshold}")
        # ... process ...

    return {
        'actions': [(process,)],
        'params': [{'name': 'threshold', 'long': 'threshold',
                    'type': float, 'default': 0.5}],
    }

# Run with parameter:
# doit process --threshold 0.8

Comparison with Other Tools

Pydoit vs Make

  • Pydoit: Python-based, easier for Python users
  • Make: Industry standard, steeper learning curve

Pydoit vs Snakemake

  • Pydoit: Simpler, good for straightforward pipelines
  • Snakemake: More powerful, better for complex workflows

When to Use Pydoit

  • Medium-sized projects (10-50 tasks)
  • Python-focused workflows
  • Need simplicity over features
  • Working primarily in Jupyter

When to Use Snakemake

  • Large projects (100+ tasks)
  • Need cluster execution
  • Complex dependency patterns
  • Bioinformatics workflows

Common Patterns

Parallel Execution

# dodo.py
DOIT_CONFIG = {
    'num_process': 4,  # Run 4 tasks in parallel
}

def task_process():
    for i in range(10):
        yield {
            'name': f'file_{i}',
            'actions': [f'python process.py file_{i}.csv'],
        }
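
Parallelism can also be requested per invocation with the -n/--process command-line option instead of hard-coding it in DOIT_CONFIG:

# Run up to 4 sub-tasks in parallel for this invocation only
doit -n 4 process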

Conditional Execution

from pathlib import Path

def task_download():
    """Download data only if it is not already present"""
    data_file = Path('data/dataset.csv')

    if data_file.exists():
        actions = []
    else:
        actions = ['wget https://example.com/data.csv -O data/dataset.csv']

    return {
        'actions': actions,
        'targets': ['data/dataset.csv'],
    }
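
An alternative to building the action list by hand is the run_once helper from doit.tools, which marks a task as up to date after its first successful run; a minimal sketch:

from doit.tools import run_once

def task_download():
    """Download the dataset once; remove the target to force a new download"""
    return {
        'actions': ['wget https://example.com/data.csv -O data/dataset.csv'],
        'targets': ['data/dataset.csv'],
        'uptodate': [run_once],
    }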

Logging

import logging

def task_analyze():
    def analyze():
        logger = logging.getLogger('doit')
        logger.info("Starting analysis...")
        # ... analysis code ...
        logger.info("Analysis complete")

    return {
        'actions': [analyze],
        'verbosity': 2,
    }

Best Practices

  1. Organize in dodo.py: Keep tasks in a dodo.py file at the project root
  2. Use generators: Yield sub-tasks from a generator for families of similar tasks
  3. File dependencies: Always specify file_dep and targets so doit can track changes
  4. Small tasks: Break work into small, reusable tasks
  5. Clean targets: Add 'clean': True for generated files
  6. Document tasks: Use docstrings, which are shown by doit list (see the example below)
  7. Test tasks: Run tasks individually during development
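
doit list prints each task name together with the first line of its docstring, so well-documented tasks are self-describing. A sketch of the expected output for the research pipeline above (exact formatting may vary):

$ doit list
aggregate     Combine all subject data
download      Download dataset from repository
figures       Create publication figures
preprocess    Preprocess individual subject data
report        Generate final report
statistics    Run statistical tests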

Configuration

dodo.py Configuration

DOIT_CONFIG = {
    'default_tasks': ['all'],
    'verbosity': 2,
    'num_process': 4,
    'backend': 'json',  # or 'sqlite3', 'dbm'
}

Installation

# With pixi
pixi add doit

# With pip
pip install doit

# With conda
conda install -c conda-forge doit

Summary

Pydoit is perfect for:

  • Automation: Repeating analysis workflows
  • Dependencies: Managing task order automatically
  • Incremental: Only rerun what changed
  • Jupyter-friendly: Works great in notebooks

For Python-based research workflows that don’t need the complexity of Snakemake, Pydoit provides an excellent middle ground between manual scripts and full workflow systems.

Prerequisites

  • python