Python CSV Validator Library

Posted: 21 July 2011 by Alistair Miles in Uncategorized
Tags: , , , ,

As part of ongoing data quality-assurance work for MalariaGEN’s P. falciparum Genome Variation project, I’ve written a small Python library called csvvalidator for validating data in CSV files or similar row-oriented data sources.

The source code for csvvalidator is on github, and you call find csvvalidator on the Python package index (so you can do easy_install csvvalidator).

Here’s a simple example:

import sys
import csv
from csvvalidator import *

field_names = (
               'study_id',
               'patient_id',
               'gender',
               'age_years',
               'age_months',
               'date_inclusion'
               )

validator = CSVValidator(field_names)

# basic header and record length checks
validator.add_header_check('EX1', 'bad header')
validator.add_record_length_check('EX2', 'unexpected record length')

# some simple value checks
validator.add_value_check('study_id', int,
                          'EX3', 'study id must be an integer')
validator.add_value_check('patient_id', int,
                          'EX4', 'patient id must be an integer')
validator.add_value_check('gender', enumeration('M', 'F'),
                          'EX5', 'invalid gender')
validator.add_value_check('age_years', number_range_inclusive(0, 120, int),
                          'EX6', 'invalid age in years')
validator.add_value_check('date_inclusion', datetime_string('%Y-%m-%d'),
                          'EX7', 'invalid date')

# a more complicated record check
def check_age_variables(r):
    age_years = int(r['age_years'])
    age_months = int(r['age_months'])
    valid = (age_months >= age_years * 12 and
             age_months % age_years < 12)
    if not valid:
        raise ValueError(age_years, age_months)
validator.add_record_check(check_age_variables,
                           'EX8', 'invalid age variables')

# validate the data and write problems to stdout
data = csv.reader('/path/to/data.csv', delimiter='\t')
problems = validator.validate(data)
write_problems(problems, sys.stdout)

For more involved examples, see example.py and tests.py … or use the source, Luke 🙂

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s