Background: Clinical Data Curation in MalariaGEN and WWARN
In both MalariaGEN‘s Consortial Projects and WWARN we’ve been involved in aggregating clinical data from different studies and research groups, and a big challenge is dealing with heterogeneity in the source data. There is heterogeneity at multiple levels. We see a variety of file formats. Mostly the data are laid out as columnar tables, but we also see some weird and wonderful layouts. Then there is variety in how the tables are designed – some prefer relatively flat tables with one row per patient, others prefer one row per clinical event, observation or visit. And then there is a lot of diversity in which variables (like temperature, parasitaemia, etc.) have been recorded, how the variables have been named, what units have been used, etc. Finally, top that all off with plenty of subtlety in the semantics of the variables and the data (how was the temperature measured?).
The general approach through this morass is to design a standard schema for the data, with a well-defined set of variables. A transformation is then designed for each of the source datasets, mapping the data onto the standard schema.
The problem we have is that designing a transformation for each of the source datasets is a time-consuming task, requiring expertise on the part of the curator in data transformation techniques as well as lots of knowledge about the domain and experience of different ways of representing the data. These skills don’t often come together in one person. We’ve made various attempts at developing software tools that make designing transformations much easier and less technical, but we certainly don’t have it solved.
The other day I realised what now seems blindingly obvious, which is that SQL and relational views provide a declarative language and tool for designing transformations on columnar tables. This is still not the holy grail of a non-programmer’s tool for designing data transformations, but I thought if I could describe some transformation patterns, along with examples in SQL, that would take us a step in the right direction.
Now, rather than start with the easy stuff like converting temperature in Fahrenheit to temperature in Celsius, or multiplying two columns together, I thought I’d start with the harder cases involving transformations on time series data. Below are a couple of patterns with some SQL, this is not exhaustive by any means, but hopefully an interesting start.
Pattern – Melting Out Events
Often we received data in a one-row-per-patient table. E.g., consider the following data:
pid d0 parad0 parad1 parad2 1 2011-04-12 134 12 1 2 2011-05-14 24 45 78 3 2011-03-01 38 12 34 4 2011-02-01 91 67 32 5 2011-02-28 34 125 87
The first column (pid) is patient unique ID, the second (d0) is the date on which the patient joined the trial (“day 0″), the other columns (parad0, parad1, parad2) are parasitaemia measurements on day 0, 1 and 2 respectively.
We want to transform this table into a table that is one-row-per-event. I.e., we want to “melt out” the time interval that is implicit in the parad0, parad1, and parad2 variables, and come to a more generalised representation.
This can be done in SQL using a view constructed via a union, with a bit of date arithmetic thrown in. With the data above in a file called “source_para.csv”, the MySQL script below does the transformation.
DROP DATABASE IF EXISTS test_curation; CREATE DATABASE test_curation CHARACTER SET 'utf8'; USE test_curation; CREATE TABLE source ( pid INT PRIMARY KEY, d0 DATE, parad0 INT, parad1 INT, parad2 INT ) ENGINE = MyISAM; LOAD DATA LOCAL INFILE '/path/to/source_para.csv' REPLACE INTO TABLE source CHARACTER SET 'utf8' FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\' LINES STARTING BY '' TERMINATED BY '\n' IGNORE 1 LINES ; CREATE VIEW output AS ( SELECT pid, DATE_ADD(d0, INTERVAL 0 DAY) AS date, parad0 AS parasitaemia FROM source ) UNION ( SELECT pid, DATE_ADD(d0, INTERVAL 1 DAY) AS date, parad1 AS parasitaemia FROM source ) UNION ( SELECT pid, DATE_ADD(d0, INTERVAL 2 DAY) AS date, parad2 AS parasitaemia FROM source ) ; SELECT * FROM output ORDER BY pid;
Running this query gives the following data:
pid date parasitaemia 1 2011-04-14 1 1 2011-04-12 134 1 2011-04-13 12 2 2011-05-16 78 2 2011-05-14 24 2 2011-05-15 45 3 2011-03-03 34 3 2011-03-01 38 3 2011-03-02 12 4 2011-02-03 32 4 2011-02-01 91 4 2011-02-02 67 5 2011-03-01 125 5 2011-03-02 87 5 2011-02-28 34
Pattern – Flattening Time Series Data
This is the inverse of the first pattern above. Sometimes we have time-series data, and we want to “flatten” these into a one-row-per-patient table.
E.g., if we started with:
pid date parasitaemia 1 2011-04-14 1 1 2011-04-12 134 1 2011-04-13 12 2 2011-05-16 78 2 2011-05-14 24 2 2011-05-15 45 3 2011-03-03 34 3 2011-03-01 38 3 2011-03-02 12 4 2011-02-03 32 4 2011-02-01 91 4 2011-02-02 67 5 2011-03-01 125 5 2011-03-02 87 5 2011-02-28 34
…and we wanted to restore the original source table, with variables pid (patient unique ID), d0 (the date on which the patient joined the trial), and parad0, parad1, and parad2 (parasitaemia measurements on day 0, 1 and 2 respectively).
One way to do this in SQL is with two queries, the first a query with a GROUP BY clause using the MIN() aggregate function to reconstruct the d0 variable, then a second query joining back onto the original table a couple of times to reconstruct the parad0, parad1, and parad2 variables, again with some date arithmetic. With the data above in a file called 'output_para.csv', the following MySQL query does the job.
DROP DATABASE IF EXISTS test_curation;
CREATE DATABASE test_curation CHARACTER SET 'utf8';
USE test_curation;
CREATE TABLE source (
pid INT,
date DATE,
parasitaemia INT
) ENGINE = MyISAM;
LOAD DATA LOCAL INFILE '/home/aliman/Desktop/output_para.csv'
REPLACE
INTO TABLE source
CHARACTER SET 'utf8'
FIELDS
TERMINATED BY '\t'
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '\\'
LINES
STARTING BY ''
TERMINATED BY '\n'
IGNORE 1 LINES
;
CREATE VIEW trans AS SELECT
pid,
MIN(date) AS d0
FROM source GROUP BY pid
;
CREATE VIEW output AS
SELECT trans.*, t0.parasitaemia AS parad0, t1.parasitaemia AS parad1, t2.parasitaemia AS parad2
FROM trans JOIN (source AS t0, source AS t1, source AS t2)
ON (
trans.pid = t0.pid
AND trans.d0 = t0.date
AND trans.pid = t1.pid
AND DATE_ADD(trans.d0, INTERVAL 1 DAY) = t1.date
AND trans.pid = t2.pid
AND DATE_ADD(trans.d0, INTERVAL 2 DAY) = t2.date
)
;
SELECT * FROM output ORDER BY pid;
I’m sure this could be reduced to a single query, but I thought I’d leave it as two for ease of reading.
This script and the one in the pattern above should be round-trippable, i.e., you can feed the output from one as the input to the other, and you should end up back where you started.
Of course, this could all break if you sprinkled some missingness through the source data … but that’s a problem for another day
