
CSV

Plain text file format for tabular data using commas to separate values

Quick Info
  • Category: Open Science & Data Sharing
  • Level: Essential
  • Type: Core Tool

Why We Recommend CSV

CSV is the de facto standard for tabular data. It's human-readable, supported by virtually every data analysis tool, and well suited to sharing datasets. Its simplicity makes it a strong choice for reproducible research and long-term data storage.

Common Use Cases

  • Storing experimental data and results
  • Sharing datasets with collaborators
  • Exchanging data between tools
  • Long-term archival of tabular data

Getting Started

CSV (Comma-Separated Values) is a simple file format for storing tabular data. Each line represents a row, and commas separate the columns. It’s the most common format for data exchange in research.

Why CSV?

  • Universal: Supported by virtually every data analysis tool
  • Human-Readable: Can be opened in any text editor
  • Simple: Easy to create, edit, and understand
  • Platform-Independent: Works on any operating system
  • Archival: Plain text ensures long-term accessibility
  • Version Control Friendly: Text format works with Git

Structure

Basic Format

name,age,score
Alice,25,95
Bob,30,87
Charlie,28,92

Research Data Example

subject_id,condition,reaction_time,accuracy
S01,control,0.45,0.95
S01,treatment,0.38,0.98
S02,control,0.52,0.89
S02,treatment,0.41,0.96

Working with CSV in Python

Reading with Pandas

import pandas as pd

# Read CSV file
df = pd.read_csv('data.csv')

# Specify delimiter
df = pd.read_csv('data.txt', delimiter='\t')  # Tab-separated

# Skip rows
df = pd.read_csv('data.csv', skiprows=2)

# Specify data types
df = pd.read_csv('data.csv', dtype={'subject_id': str, 'score': float})

# Handle missing values
df = pd.read_csv('data.csv', na_values=['NA', 'missing', '-'])

# Use specific column as index
df = pd.read_csv('data.csv', index_col='subject_id')
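
After loading, it's worth confirming that pandas inferred the structure you expect. A minimal sanity check (using the same hypothetical data.csv as above) might look like this:

# Quick look at shape, inferred types, and the first rows
df = pd.read_csv('data.csv')
print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # inferred type of each column
print(df.head())   # first five rows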

Writing with Pandas

# Save DataFrame to CSV
df.to_csv('output.csv', index=False)

# With custom separator
df.to_csv('output.txt', sep='\t', index=False)

# Control precision
df.to_csv('output.csv', index=False, float_format='%.3f')

# Handle missing values
df.to_csv('output.csv', index=False, na_rep='NA')

# Append to existing file
df.to_csv('output.csv', mode='a', header=False, index=False)

Using Python’s csv Module

import csv

# Reading
with open('data.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['subject_id'], row['score'])

# Writing
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'age', 'score'])
    writer.writerow(['Alice', 25, 95])
    writer.writerow(['Bob', 30, 87])

Best Practices for Research Data

Include Metadata Header

# Experiment: Visual discrimination task
# Date: 2024-01-15
# PI: Dr. Smith
subject_id,session,trial,stimulus,response,reaction_time
S01,1,1,left,left,0.45
S01,1,2,right,right,0.38
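
Comment lines like these will trip up a naive parser. With pandas you can declare the comment character so they are skipped on load; a minimal sketch (the filename is hypothetical):

import pandas as pd

# Lines beginning with '#' are ignored when a comment character is declared
df = pd.read_csv('experiment.csv', comment='#')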

Use Descriptive Column Names

# Good
subject_id,reaction_time_ms,accuracy_percent,condition

# Avoid
s,rt,acc,c
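
If you receive a file with terse headers, you can rename the columns on load instead of editing the file by hand. A minimal sketch, assuming the abbreviated headers shown above:

import pandas as pd

df = pd.read_csv('data.csv')
# Map terse headers to descriptive names
df = df.rename(columns={'s': 'subject_id',
                        'rt': 'reaction_time_ms',
                        'acc': 'accuracy_percent',
                        'c': 'condition'})
df.to_csv('data_renamed.csv', index=False)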

Consistent Data Types

# Good - consistent date format
subject,date,score
S01,2024-01-15,95
S02,2024-01-16,87

# Avoid - mixed formats
subject,date,score
S01,2024-01-15,95
S02,Jan 16 2024,87
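
Mixed date formats can usually be normalized once and written back in ISO form. A sketch assuming pandas 2.x (format='mixed' was added in pandas 2.0) and a hypothetical filename:

import pandas as pd

df = pd.read_csv('scores.csv')
# Parse whatever formats are present, then standardize to YYYY-MM-DD
df['date'] = pd.to_datetime(df['date'], format='mixed').dt.strftime('%Y-%m-%d')
df.to_csv('scores_clean.csv', index=False)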

Handle Missing Data Explicitly

subject_id,pre_score,post_score
S01,85,90
S02,NA,88
S03,78,NA
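
pandas recognizes 'NA' (among other tokens) as missing by default, so explicit markers like these are easy to work with. A small sketch with a hypothetical filename:

import pandas as pd

df = pd.read_csv('scores.csv')               # 'NA' becomes NaN automatically
print(df.isna().sum())                       # missing values per column
complete = df.dropna(subset=['post_score'])  # keep only rows with a post_score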

Common Issues and Solutions

Commas in Data

# Problem: Values contain commas
name,location,score
"Smith, John","Boston, MA",95

# Solution: quote fields that contain commas; pandas parses quoted fields automatically
df = pd.read_csv('data.csv')

Different Encodings

# Specify encoding
df = pd.read_csv('data.csv', encoding='utf-8')

# Try latin-1 for older files
df = pd.read_csv('data.csv', encoding='latin-1')

Large Files

# Read in chunks
chunks = pd.read_csv('large_data.csv', chunksize=10000)
for chunk in chunks:
    process(chunk)

# Read specific columns only
df = pd.read_csv('data.csv', usecols=['subject_id', 'score'])

Neuroscience Examples

Behavioral Data

subject_id,session,trial,stimulus_type,contrast,choice,reaction_time,correct
S01,1,1,visual,0.5,left,0.45,1
S01,1,2,visual,1.0,right,0.38,1
S01,1,3,visual,0.25,left,0.52,0
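
Data in this long format summarizes naturally with a pandas groupby. A minimal sketch, assuming the table above is saved as a hypothetical behavior.csv:

import pandas as pd

df = pd.read_csv('behavior.csv')
# Mean reaction time and proportion correct per subject and contrast level
summary = (df.groupby(['subject_id', 'contrast'])
             .agg(mean_rt=('reaction_time', 'mean'),
                  prop_correct=('correct', 'mean'))
             .reset_index())
print(summary)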

Spike Times

unit_id,spike_time,trial,brain_region
1,0.152,1,V1
1,0.247,1,V1
1,0.389,1,V1
2,0.103,1,V1
2,0.445,1,V1
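
A long-format spike table makes per-unit statistics straightforward. A sketch assuming a hypothetical spikes.csv and an assumed 1-second trial window:

import pandas as pd

spikes = pd.read_csv('spikes.csv')
# Spike count per unit and trial
counts = spikes.groupby(['unit_id', 'trial']).size().rename('n_spikes').reset_index()
# Convert counts to a firing rate using the assumed trial duration
trial_duration_s = 1.0
counts['rate_hz'] = counts['n_spikes'] / trial_duration_s
print(counts)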

ROI Measurements

roi_id,area_pixels,mean_intensity,max_intensity,x_centroid,y_centroid
1,245,128.5,255,120.3,85.7
2,189,95.2,198,145.8,92.1
3,312,142.7,243,98.2,110.5

Alternatives and When to Use Them

Use CSV When:

  • Data is tabular
  • File size is manageable (<1GB)
  • Need human readability
  • Sharing with non-programmers
  • Long-term archival

Consider Alternatives:

  • Parquet: Large datasets, better compression (see the conversion sketch after this list)
  • HDF5: Hierarchical data, very large files
  • JSON: Nested/hierarchical structures
  • NetCDF: Multi-dimensional arrays
  • Excel: Need formatting, multiple sheets
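
Converting between CSV and these formats is usually a one-liner in pandas. For example, a Parquet round trip (this assumes a Parquet engine such as pyarrow is installed):

import pandas as pd

df = pd.read_csv('data.csv')
df.to_parquet('data.parquet', index=False)   # requires pyarrow or fastparquet
df_back = pd.read_parquet('data.parquet')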

Data Validation

import pandas as pd

# Read CSV
df = pd.read_csv('experiment_data.csv')

# Validate structure
required_columns = ['subject_id', 'trial', 'response', 'reaction_time']
assert all(col in df.columns for col in required_columns)

# Check data types
assert pd.api.types.is_integer_dtype(df['trial'])
assert pd.api.types.is_float_dtype(df['reaction_time'])

# Check for missing values
assert df['subject_id'].notna().all()

# Check value ranges
assert (df['reaction_time'] >= 0).all()
assert df['response'].isin(['left', 'right', 'none']).all()

Converting to/from Other Formats

CSV to Excel

df = pd.read_csv('data.csv')
df.to_excel('data.xlsx', index=False, sheet_name='Results')

Excel to CSV

df = pd.read_excel('data.xlsx', sheet_name='Results')
df.to_csv('data.csv', index=False)

Multiple CSVs to Single File

import pandas as pd
from pathlib import Path

# Combine multiple CSV files
csv_files = Path('data').glob('*.csv')
df_list = [pd.read_csv(f) for f in csv_files]
combined = pd.concat(df_list, ignore_index=True)
combined.to_csv('combined_data.csv', index=False)
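
When combining files, it can help to record which file each row came from; the column name below is an arbitrary choice:

import pandas as pd
from pathlib import Path

# Variant that tags each row with its source file
df_list = [pd.read_csv(f).assign(source_file=f.name) for f in Path('data').glob('*.csv')]
combined = pd.concat(df_list, ignore_index=True)
combined.to_csv('combined_with_sources.csv', index=False)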

Tips for Reproducible Research

  1. Document Structure: Include README explaining columns
  2. Use ISO Dates: YYYY-MM-DD format (2024-01-15); see the write-time sketch after this list
  3. Include Units: Either in column names or documentation
  4. Version Files: Use timestamps or version numbers
  5. Validate Data: Check structure before analysis
  6. Use UTF-8: Standard encoding for international characters
  7. No Formatting: Keep plain CSV, no colors or fonts
  8. Backup Regularly: CSV corruption is rare but possible
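
Several of these tips can be applied directly at write time. A minimal sketch combining ISO dates, UTF-8 encoding, and controlled precision (filenames and columns are hypothetical):

import pandas as pd

df = pd.read_csv('raw_scores.csv')
# Tip 2: ISO dates; tip 6: UTF-8; fixed precision also keeps diffs small under version control
df['date'] = pd.to_datetime(df['date']).dt.strftime('%Y-%m-%d')
df.to_csv('scores_v1.csv', index=False, encoding='utf-8', float_format='%.3f')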

Tools for Working with CSV

  • Python: pandas, csv module
  • R: read.csv(), readr
  • Excel: Opens CSV directly
  • Text Editors: Any editor (Notepad++, VSCode, etc.)
  • Command Line: awk, cut, grep
  • Databases: COPY command (PostgreSQL), LOAD DATA (MySQL)

Standard Library csv Module

For simple reads and writes where pandas isn't needed, the standard library's csv module is enough:

import csv

# Read
with open('data.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        print(row)

# Write
data = [
    ['name', 'score'],
    ['Alice', 95],
    ['Bob', 87]
]

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(data)

Summary

CSV is the workhorse format for tabular research data. Its simplicity and universal support make it ideal for:

  • Sharing data with collaborators
  • Publishing datasets
  • Long-term archival
  • Working across different tools and platforms

While it has limitations for very large or complex data, CSV remains the first choice for most research data exchange scenarios.
