CSV (Comma-Separated Values) is a simple file format for storing tabular data. Each line represents a row, and commas separate the columns. It is one of the most common formats for exchanging data in research.
Why CSV?
- Universal: Supported by every data analysis tool
- Human-Readable: Can be opened in any text editor
- Simple: Easy to create, edit, and understand
- Platform-Independent: Works on any operating system
- Archival: Plain text ensures long-term accessibility
- Version Control Friendly: Text format works with Git
Structure
Basic Format
name,age,score
Alice,25,95
Bob,30,87
Charlie,28,92
With Header Row
subject_id,condition,reaction_time,accuracy
S01,control,0.45,0.95
S01,treatment,0.38,0.98
S02,control,0.52,0.89
S02,treatment,0.41,0.96
Working with CSV in Python
Reading with Pandas
import pandas as pd
# Read CSV file
df = pd.read_csv('data.csv')
# Specify delimiter
df = pd.read_csv('data.txt', delimiter='\t') # Tab-separated
# Skip rows
df = pd.read_csv('data.csv', skiprows=2)
# Specify data types
df = pd.read_csv('data.csv', dtype={'subject_id': str, 'score': float})
# Handle missing values
df = pd.read_csv('data.csv', na_values=['NA', 'missing', '-'])
# Use specific column as index
df = pd.read_csv('data.csv', index_col='subject_id')
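After loading, it is worth checking what pandas inferred before doing any analysis; a minimal sketch:
# Quick inspection of the loaded DataFrame
print(df.head())   # first few rows
print(df.dtypes)   # column types pandas inferred
print(df.shape)    # (number of rows, number of columns)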
Writing with Pandas
# Save DataFrame to CSV
df.to_csv('output.csv', index=False)
# With custom separator
df.to_csv('output.txt', sep='\t', index=False)
# Control precision
df.to_csv('output.csv', index=False, float_format='%.3f')
# Handle missing values
df.to_csv('output.csv', index=False, na_rep='NA')
# Append to existing file
df.to_csv('output.csv', mode='a', header=False, index=False)
Using Python’s csv Module
import csv
# Reading
with open('data.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['subject_id'], row['score'])
# Writing
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'age', 'score'])
    writer.writerow(['Alice', 25, 95])
    writer.writerow(['Bob', 30, 87])
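For writing rows as dictionaries, csv.DictWriter is the counterpart of DictReader; a minimal sketch:
with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'age', 'score'])
    writer.writeheader()
    writer.writerow({'name': 'Alice', 'age': 25, 'score': 95})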
Best Practices for Research Data
Include Metadata Header
# Experiment: Visual discrimination task
# Date: 2024-01-15
# PI: Dr. Smith
subject_id,session,trial,stimulus,response,reaction_time
S01,1,1,left,left,0.45
S01,1,2,right,right,0.38
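CSV has no official comment syntax, so readers of the file need to be told to skip these metadata lines. With pandas this is a single argument (sketch; the filename is illustrative):
# Skip metadata lines that start with '#'
df = pd.read_csv('experiment.csv', comment='#')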
Use Descriptive Column Names
# Good
subject_id,reaction_time_ms,accuracy_percent,condition
# Avoid
s,rt,acc,c
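If a file arrives with terse names, they can be mapped to descriptive ones after loading; a sketch with a hypothetical mapping:
# Rename terse columns to descriptive ones
df = df.rename(columns={'s': 'subject_id', 'rt': 'reaction_time_ms',
                        'acc': 'accuracy_percent', 'c': 'condition'})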
Consistent Data Types
# Good - consistent date format
subject,date,score
S01,2024-01-15,95
S02,2024-01-16,87
# Avoid - mixed formats
subject,date,score
S01,2024-01-15,95
S02,Jan 16 2024,87
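One way to keep dates consistent is to parse them on load and write them back in ISO format; a minimal sketch (output filename is illustrative):
# Parse the date column and write it back as YYYY-MM-DD
df = pd.read_csv('data.csv', parse_dates=['date'])
df.to_csv('data_clean.csv', index=False, date_format='%Y-%m-%d')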
Handle Missing Data Explicitly
subject_id,pre_score,post_score
S01,85,90
S02,NA,88
S03,78,NA
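pandas treats the string NA (among other markers) as missing by default; a quick check of how much is missing (sketch; the filename is illustrative):
# Count missing values per column
df = pd.read_csv('scores.csv')
print(df.isna().sum())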
Common Issues and Solutions
Commas in Data
# Problem: Values contain commas
name,location,score
"Smith, John","Boston, MA",95
# Solution: Pandas handles this automatically
df = pd.read_csv('data.csv')
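When writing, pandas uses minimal quoting by default, so any value containing the delimiter is quoted automatically; a sketch that makes the setting explicit:
import csv
# QUOTE_MINIMAL (the default) quotes only fields that need it
df.to_csv('output.csv', index=False, quoting=csv.QUOTE_MINIMAL)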
Different Encodings
# Specify encoding
df = pd.read_csv('data.csv', encoding='utf-8')
# Try latin-1 for older files
df = pd.read_csv('data.csv', encoding='latin-1')
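If the encoding is unknown, a common pattern is to try UTF-8 first and fall back; a minimal sketch:
try:
    df = pd.read_csv('data.csv', encoding='utf-8')
except UnicodeDecodeError:
    df = pd.read_csv('data.csv', encoding='latin-1')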
Large Files
# Read in chunks
chunks = pd.read_csv('large_data.csv', chunksize=10000)
for chunk in chunks:
    process(chunk)  # process() is a placeholder for your own analysis step
# Read specific columns only
df = pd.read_csv('data.csv', usecols=['subject_id', 'score'])
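Chunking also works for simple aggregates that never need the whole file in memory; a sketch assuming the file has a reaction_time column:
# Running mean of one column across chunks
total, count = 0.0, 0
for chunk in pd.read_csv('large_data.csv', chunksize=10000, usecols=['reaction_time']):
    total += chunk['reaction_time'].sum()
    count += len(chunk)
mean_rt = total / count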
Neuroscience Examples
Behavioral Data
subject_id,session,trial,stimulus_type,contrast,choice,reaction_time,correct
S01,1,1,visual,0.5,left,0.45,1
S01,1,2,visual,1.0,right,0.38,1
S01,1,3,visual,0.25,left,0.52,0
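A long-format file like this summarizes directly with groupby; a sketch assuming the columns above (filename is illustrative):
df = pd.read_csv('behavior.csv')
# Accuracy and mean reaction time at each contrast level
summary = df.groupby('contrast').agg(accuracy=('correct', 'mean'),
                                     mean_rt=('reaction_time', 'mean'))
print(summary)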
Spike Times
unit_id,spike_time,trial,brain_region
1,0.152,1,V1
1,0.247,1,V1
1,0.389,1,V1
2,0.103,1,V1
2,0.445,1,V1
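Spike tables in long format aggregate the same way; a sketch assuming the columns above (filename is illustrative):
spikes = pd.read_csv('spikes.csv')
# Spike count per unit and trial; divide by trial duration for a firing rate
counts = spikes.groupby(['unit_id', 'trial'])['spike_time'].count()
print(counts)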
ROI Measurements
roi_id,area_pixels,mean_intensity,max_intensity,x_centroid,y_centroid
1,245,128.5,255,120.3,85.7
2,189,95.2,198,145.8,92.1
3,312,142.7,243,98.2,110.5
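ROI tables can be filtered and ranked just as easily; a sketch with an illustrative area threshold and filename:
rois = pd.read_csv('roi_measurements.csv')
# Keep ROIs above a minimum area, brightest first
large_rois = rois[rois['area_pixels'] > 200].sort_values('mean_intensity', ascending=False)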
Alternatives and When to Use Them
Use CSV When:
- Data is tabular
- File size is manageable (<1GB)
- Need human readability
- Sharing with non-programmers
- Long-term archival
Consider Alternatives:
- Parquet: Large datasets, better compression (conversion sketch below)
- HDF5: Hierarchical data, very large files
- JSON: Nested/hierarchical structures
- NetCDF: Multi-dimensional arrays
- Excel: Need formatting, multiple sheets
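If a dataset outgrows CSV, converting is usually a one-liner in pandas; for example, to Parquet (sketch; requires an optional engine such as pyarrow):
df = pd.read_csv('data.csv')
df.to_parquet('data.parquet', index=False)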
Data Validation
import pandas as pd
# Read CSV
df = pd.read_csv('experiment_data.csv')
# Validate structure
required_columns = ['subject_id', 'trial', 'response', 'reaction_time']
assert all(col in df.columns for col in required_columns)
# Check data types
assert pd.api.types.is_integer_dtype(df['trial'])
assert pd.api.types.is_float_dtype(df['reaction_time'])
# Check for missing values
assert df['subject_id'].notna().all()
# Check value ranges
assert (df['reaction_time'] >= 0).all()
assert df['response'].isin(['left', 'right', 'none']).all()
Converting to/from Other Formats
CSV to Excel
df = pd.read_csv('data.csv')
df.to_excel('data.xlsx', index=False, sheet_name='Results')
Excel to CSV
df = pd.read_excel('data.xlsx', sheet_name='Results')
df.to_csv('data.csv', index=False)
Multiple CSVs to Single File
import pandas as pd
from pathlib import Path
# Combine multiple CSV files
csv_files = Path('data').glob('*.csv')
df_list = [pd.read_csv(f) for f in csv_files]
combined = pd.concat(df_list, ignore_index=True)
combined.to_csv('combined_data.csv', index=False)
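When combining per-subject or per-session files, it often helps to record which file each row came from; a sketch (the source_file column name is illustrative):
# Tag each row with the name of the file it came from
df_list = [pd.read_csv(f).assign(source_file=f.name) for f in Path('data').glob('*.csv')]
combined = pd.concat(df_list, ignore_index=True)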
Tips for Reproducible Research
- Document Structure: Include README explaining columns
- Use ISO Dates: YYYY-MM-DD format (2024-01-15)
- Include Units: Either in column names or documentation
- Version Files: Use timestamps or version numbers
- Validate Data: Check structure before analysis
- Use UTF-8: Standard encoding for international characters
- No Formatting: Keep plain CSV, no colors or fonts
- Backup Regularly: CSV corruption is rare but possible
Tools for Working with CSV
- Python: pandas, csv module
- R: read.csv(), readr
- Excel: Opens CSV directly
- Text Editors: Any editor (Notepad++, VSCode, etc.)
- Command Line: awk, cut, grep
- Databases: COPY command (PostgreSQL), LOAD DATA (MySQL)
Standard Library csv Module
For simple cases without pandas:
import csv
# Read
with open('data.csv', 'r') as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        print(row)
# Write
data = [
    ['name', 'score'],
    ['Alice', 95],
    ['Bob', 87]
]
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(data)
Summary
CSV is the workhorse format for tabular research data. Its simplicity and universal support make it ideal for:
- Sharing data with collaborators
- Publishing datasets
- Long-term archival
- Working across different tools and platforms
While it has limitations for very large or complex data, CSV remains the first choice for most research data exchange scenarios.