Data Transformation Patterns for Clinical Data Curation – Dealing with Time Series

Posted: 21 June 2011 by Alistair Miles in Uncategorized

Background: Clinical Data Curation in MalariaGEN and WWARN

In both MalariaGEN's Consortial Projects and WWARN we've been involved in aggregating clinical data from different studies and research groups, and a big challenge is dealing with heterogeneity in the source data. There is heterogeneity at multiple levels. We see a variety of file formats. Mostly the data are laid out as columnar tables, but we also see some weird and wonderful layouts. Then there is variety in how the tables are designed – some prefer relatively flat tables with one row per patient, others prefer one row per clinical event, observation or visit. And then there is a lot of diversity in which variables (like temperature, parasitaemia, etc.) have been recorded, how the variables have been named, what units have been used, and so on. Finally, top all that off with plenty of subtlety in the semantics of the variables and the data (how was the temperature measured?).

The general approach through this morass is to design a standard schema for the data, with a well-defined set of variables. A transformation is then designed for each of the source datasets, mapping the data onto the standard schema.

The problem we have is that designing a transformation for each of the source datasets is a time-consuming task, requiring expertise on the part of the curator in data transformation techniques as well as lots of knowledge about the domain and experience of different ways of representing the data. These skills don’t often come together in one person. We’ve made various attempts at developing software tools that make designing transformations much easier and less technical, but we certainly don’t have it solved.

The other day I realised what now seems blindingly obvious, which is that SQL and relational views provide a declarative language and tool for designing transformations on columnar tables. This is still not the holy grail of a non-programmer’s tool for designing data transformations, but I thought if I could describe some transformation patterns, along with examples in SQL, that would take us a step in the right direction.

Now, rather than start with the easy stuff like converting temperature in Fahrenheit to temperature in Celsius, or multiplying two columns together, I thought I'd start with the harder cases involving transformations on time series data. Below are a couple of patterns with some SQL; this is not exhaustive by any means, but hopefully an interesting start.
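For contrast, the "easy" case really is a one-liner: a single view with an arithmetic expression over one column. Here is a minimal sketch using SQLite via Python's standard library (the table and column names are invented for illustration):

```python
import sqlite3

# Hypothetical one-row-per-patient table with temperature in Fahrenheit.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source (pid INTEGER, temp_f REAL)")
con.executemany("INSERT INTO source VALUES (?, ?)", [(1, 98.6), (2, 104.0)])

# A view with one arithmetic expression is all the simple case needs.
con.execute(
    "CREATE VIEW output AS "
    "SELECT pid, (temp_f - 32) * 5.0 / 9.0 AS temp_c FROM source"
)
converted = con.execute("SELECT * FROM output ORDER BY pid").fetchall()
print(converted)
```

The time-series patterns below are harder precisely because the transformation has to change the shape of the table, not just the values in a column.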

Pattern – Melting Out Events

Often we receive data in a one-row-per-patient table. E.g., consider the following data:

pid	d0	parad0	parad1	parad2
1	2011-04-12	134	12	1
2	2011-05-14	24	45	78
3	2011-03-01	38	12	34
4	2011-02-01	91	67	32
5	2011-02-28	34	125	87

The first column (pid) is the patient's unique ID, the second (d0) is the date on which the patient joined the trial ("day 0"), and the other columns (parad0, parad1, parad2) are parasitaemia measurements on days 0, 1 and 2 respectively.

We want to transform this table into a table that is one-row-per-event. I.e., we want to “melt out” the time interval that is implicit in the parad0, parad1, and parad2 variables, and come to a more generalised representation.

This can be done in SQL using a view constructed via a union, with a bit of date arithmetic thrown in. With the data above in a tab-delimited file called "source_para.csv", the MySQL script below does the transformation.

DROP DATABASE IF EXISTS test_curation;

CREATE DATABASE test_curation CHARACTER SET 'utf8';

USE test_curation;

CREATE TABLE source (
	pid INT PRIMARY KEY,
	d0 DATE,
	parad0 INT,
	parad1 INT,
	parad2 INT
) ENGINE = MyISAM;

LOAD DATA LOCAL INFILE '/path/to/source_para.csv' 
	REPLACE 
	INTO TABLE source 
	CHARACTER SET 'utf8'
	FIELDS 
		TERMINATED BY '\t'
		OPTIONALLY ENCLOSED BY '"'
		ESCAPED BY '\\'
	LINES 
		STARTING BY ''
		TERMINATED BY '\n'
	IGNORE 1 LINES
;

-- UNION ALL is used rather than plain UNION: the (pid, date) pairs are
-- already distinct, so the duplicate-elimination pass UNION performs is
-- unnecessary work here.
CREATE VIEW output AS 
	(
		SELECT 
			pid,
			DATE_ADD(d0, INTERVAL 0 DAY) AS date,
			parad0 AS parasitaemia
		FROM source
	)
	UNION ALL
	(
		SELECT
			pid,
			DATE_ADD(d0, INTERVAL 1 DAY) AS date,
			parad1 AS parasitaemia
		FROM source
	)
	UNION ALL
	(
		SELECT
			pid,
			DATE_ADD(d0, INTERVAL 2 DAY) AS date,
			parad2 AS parasitaemia
		FROM source
	)
;

SELECT * FROM output ORDER BY pid;

Running this script gives the following data:

pid	date	parasitaemia
1	2011-04-14	1
1	2011-04-12	134
1	2011-04-13	12
2	2011-05-16	78
2	2011-05-14	24
2	2011-05-15	45
3	2011-03-03	34
3	2011-03-01	38
3	2011-03-02	12
4	2011-02-03	32
4	2011-02-01	91
4	2011-02-02	67
5	2011-03-01	125
5	2011-03-02	87
5	2011-02-28	34
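The same melt can be reproduced end-to-end with SQLite via Python's standard library, which is handy for experimenting with the pattern without a MySQL server. This is a sketch, not the production script above; SQLite's date() modifiers stand in for MySQL's DATE_ADD():

```python
import sqlite3

# In-memory recreation of the one-row-per-patient source table above.
rows = [
    (1, "2011-04-12", 134, 12, 1),
    (2, "2011-05-14", 24, 45, 78),
    (3, "2011-03-01", 38, 12, 34),
    (4, "2011-02-01", 91, 67, 32),
    (5, "2011-02-28", 34, 125, 87),
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE source (pid INT, d0 TEXT, parad0 INT, parad1 INT, parad2 INT)"
)
con.executemany("INSERT INTO source VALUES (?, ?, ?, ?, ?)", rows)

# Melt: one SELECT per implicit day, each shifting d0 by the day offset.
melted = con.execute("""
    SELECT pid, date(d0, '+0 days') AS date, parad0 AS parasitaemia FROM source
    UNION ALL
    SELECT pid, date(d0, '+1 days'), parad1 FROM source
    UNION ALL
    SELECT pid, date(d0, '+2 days'), parad2 FROM source
    ORDER BY pid, date
""").fetchall()

for row in melted:
    print(row)
```

The structure is identical to the MySQL view: one branch of the union per column being melted out.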

Pattern – Flattening Time Series Data

This is the inverse of the first pattern above. Sometimes we have time-series data, and we want to “flatten” these into a one-row-per-patient table.

E.g., if we started with:

pid	date	parasitaemia
1	2011-04-14	1
1	2011-04-12	134
1	2011-04-13	12
2	2011-05-16	78
2	2011-05-14	24
2	2011-05-15	45
3	2011-03-03	34
3	2011-03-01	38
3	2011-03-02	12
4	2011-02-03	32
4	2011-02-01	91
4	2011-02-02	67
5	2011-03-01	125
5	2011-03-02	87
5	2011-02-28	34

…and we wanted to restore the original source table, with variables pid (the patient's unique ID), d0 (the date on which the patient joined the trial), and parad0, parad1, and parad2 (parasitaemia measurements on days 0, 1 and 2 respectively).

One way to do this in SQL is with two queries: the first a query with a GROUP BY clause using the MIN() aggregate function to reconstruct the d0 variable, then a second query joining back onto the original table three times to reconstruct the parad0, parad1, and parad2 variables, again with some date arithmetic. With the data above in a tab-delimited file called 'output_para.csv', the following MySQL script does the job.

DROP DATABASE IF EXISTS test_curation;

CREATE DATABASE test_curation CHARACTER SET 'utf8';

USE test_curation;

CREATE TABLE source (
	pid INT,
	date DATE,
	parasitaemia INT
) ENGINE = MyISAM;

LOAD DATA LOCAL INFILE '/path/to/output_para.csv' 
	REPLACE 
	INTO TABLE source
	CHARACTER SET 'utf8'
	FIELDS 
		TERMINATED BY '\t'
		OPTIONALLY ENCLOSED BY '"'
		ESCAPED BY '\\'
	LINES 
		STARTING BY ''
		TERMINATED BY '\n'
	IGNORE 1 LINES
;

CREATE VIEW trans AS SELECT 
	pid,
	MIN(date) AS d0
	FROM source GROUP BY pid
;

CREATE VIEW output AS 
    SELECT trans.*, t0.parasitaemia AS parad0, t1.parasitaemia AS parad1, t2.parasitaemia AS parad2
    FROM trans
        JOIN source AS t0
            ON (trans.pid = t0.pid AND trans.d0 = t0.date)
        JOIN source AS t1
            ON (trans.pid = t1.pid AND DATE_ADD(trans.d0, INTERVAL 1 DAY) = t1.date)
        JOIN source AS t2
            ON (trans.pid = t2.pid AND DATE_ADD(trans.d0, INTERVAL 2 DAY) = t2.date)
;

SELECT * FROM output ORDER BY pid;

I’m sure this could be reduced to a single query, but I thought I’d leave it as two for ease of reading.

This script and the one in the pattern above should be round-trippable, i.e., you can feed the output from one as the input to the other, and you should end up back where you started.
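The flattening half of that round trip can likewise be sketched in SQLite via Python's standard library, feeding in the melted rows from the first pattern and recovering the one-row-per-patient layout; SQLite's date() modifiers again replace DATE_ADD():

```python
import sqlite3

# The melted, one-row-per-event table (output of the first pattern).
events = [
    (1, "2011-04-12", 134), (1, "2011-04-13", 12), (1, "2011-04-14", 1),
    (2, "2011-05-14", 24), (2, "2011-05-15", 45), (2, "2011-05-16", 78),
    (3, "2011-03-01", 38), (3, "2011-03-02", 12), (3, "2011-03-03", 34),
    (4, "2011-02-01", 91), (4, "2011-02-02", 67), (4, "2011-02-03", 32),
    (5, "2011-02-28", 34), (5, "2011-03-01", 125), (5, "2011-03-02", 87),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source (pid INT, date TEXT, parasitaemia INT)")
con.executemany("INSERT INTO source VALUES (?, ?, ?)", events)

# Flatten: recover d0 as MIN(date) per patient, then self-join once per
# follow-up day to pull each day's measurement back into its own column.
flat = con.execute("""
    SELECT trans.pid, trans.d0,
           t0.parasitaemia AS parad0,
           t1.parasitaemia AS parad1,
           t2.parasitaemia AS parad2
    FROM (SELECT pid, MIN(date) AS d0 FROM source GROUP BY pid) AS trans
        JOIN source AS t0
            ON t0.pid = trans.pid AND t0.date = trans.d0
        JOIN source AS t1
            ON t1.pid = trans.pid AND t1.date = date(trans.d0, '+1 day')
        JOIN source AS t2
            ON t2.pid = trans.pid AND t2.date = date(trans.d0, '+2 days')
    ORDER BY trans.pid
""").fetchall()

for row in flat:
    print(row)
```

Note the joins are inner joins, so a patient missing any of the three measurements would silently drop out of the flattened table – exactly the missingness caveat mentioned above.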

Of course, this could all break if you sprinkled some missingness through the source data … but that’s a problem for another day 🙂

