It makes life easier when you have all of your data in one place, and easier still when you can get all of it into one simple file, ready for analysis. In the Topheno application, we have a good set of tools that help us gather and transform clinical data into a standard set of fields. Over time, multiple standardised data files are produced, both for different studies and for the same study. Some curation is often required to consolidate these clinical data into one resource, so that the amalgamated data can be compared with genotyping results. To help with this amalgamation, I built a web application called dataMerger, which can import CSV data files and produce one flat file containing all of these data combined into one table. One complication is that data from separate sources can disagree, so a significant part of the dataMerger application is designed to help us identify and resolve such conflicts.
As an example of what this application can do, imagine that we have two files containing some overlapping data for a set of individuals.
File 1:
| ID | Name | Location | DOB | Telephone |
| 101 | Billy | London | 1st April 1964 | 020 7123 1234 |
| 102 | Bob | Paris | 2nd June 1978 | 01 23 45 67 89 |
| 103 | Sally | New York | 3rd August 1939 | |
| 104 | Jane | Rome | 4th May 1946 | 06 1234 1234 |
File 2:
| ID | Name | Location | Email | Telephone |
| 103 | Sally | Oxford | sally@example.org | 01865 123456 |
| 104 | Jane | Oxford | jane@example.org | 01865 123456 |
| 105 | Pete | Bamako | pete@example.org | 223 12345678 |
| 106 | Fred | Bangkok | fred@example.org | 02-1234567 |
The dataMerger application lets us combine these data into one file, resolving conflicts and filling gaps along the way. The output contains data from both sources, and where the sources disagree the user decides which value to keep. In this example, Sally's location is New York in File 1 but Oxford in File 2, and the File 2 value was chosen. The data provenance is also recorded.
Output:
| ID | Name | Location | DOB | Email | Telephone |
| 101 | Billy | London | 1st April 1964 | | 020 7123 1234 |
| 102 | Bob | Paris | 2nd June 1978 | | 01 23 45 67 89 |
| 103 | Sally | Oxford | 3rd August 1939 | sally@example.org | 01865 123456 |
| 104 | Jane | Oxford | 4th May 1946 | jane@example.org | 01865 123456 |
| 105 | Pete | Bamako | | pete@example.org | 223 12345678 |
| 106 | Fred | Bangkok | | fred@example.org | 02-1234567 |
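To make the idea concrete, here is a minimal Python sketch of this kind of merge. It is not the dataMerger code, which is a Java and MySQL web application. The function names (`is_missing`, `merge_records`), the `prefer` parameter, the source labels `file1` and `file2`, and the column name `Email` for the e-mail addresses are all assumptions made for illustration. Rows are matched on ID, gaps are filled from whichever file has a value, disagreements are logged as conflicts, and the source of each chosen value is noted.

```python
def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or str(value).strip() == ""


def merge_records(file1, file2, key="ID", prefer="file2"):
    """Merge two lists of row dicts on `key` (hypothetical helper).

    Returns (merged, conflicts, provenance):
      merged     - dict mapping key value -> combined row dict
      conflicts  - list of (key value, field, file1 value, file2 value)
      provenance - dict mapping (key value, field) -> "file1" or "file2"
    """
    rows1 = {row[key]: row for row in file1}
    rows2 = {row[key]: row for row in file2}

    # Collect column names in first-seen order across both inputs.
    fields = []
    for row in file1 + file2:
        for name in row:
            if name != key and name not in fields:
                fields.append(name)

    merged, conflicts, provenance = {}, [], {}
    for record_id in sorted(set(rows1) | set(rows2)):
        row1 = rows1.get(record_id, {})
        row2 = rows2.get(record_id, {})
        out = {key: record_id}
        for name in fields:
            v1, v2 = row1.get(name), row2.get(name)
            if is_missing(v1) and is_missing(v2):
                out[name] = ""  # neither file has a value
                continue
            if is_missing(v2):
                chosen = "file1"
            elif is_missing(v1):
                chosen = "file2"
            elif v1 == v2:
                chosen = "file1"  # sources agree, either will do
            else:
                # Sources disagree: log it and fall back to the preference.
                conflicts.append((record_id, name, v1, v2))
                chosen = prefer
            out[name] = v1 if chosen == "file1" else v2
            provenance[(record_id, name)] = chosen
        merged[record_id] = out
    return merged, conflicts, provenance


# A subset of the example rows from the two files above.
file1 = [
    {"ID": "103", "Name": "Sally", "Location": "New York",
     "DOB": "3rd August 1939", "Telephone": ""},
    {"ID": "104", "Name": "Jane", "Location": "Rome",
     "DOB": "4th May 1946", "Telephone": "06 1234 1234"},
]
file2 = [
    {"ID": "103", "Name": "Sally", "Location": "Oxford",
     "Email": "sally@example.org", "Telephone": "01865 123456"},
    {"ID": "104", "Name": "Jane", "Location": "Oxford",
     "Email": "jane@example.org", "Telephone": "01865 123456"},
    {"ID": "105", "Name": "Pete", "Location": "Bamako",
     "Email": "pete@example.org", "Telephone": "223 12345678"},
]

merged, conflicts, provenance = merge_records(file1, file2)
for record_id, field_name, v1, v2 in conflicts:
    print(f"Conflict for {record_id} in {field_name}: {v1!r} vs {v2!r}")
```

In this sketch a fixed preference stands in for the user's decision. In an interactive tool, the `conflicts` list is what would be shown to the user for resolution, one disagreement at a time.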
The application code is open source and freely available via SVN from the following URL: http://cggh.googlecode.com/svn/networks/MalariaGEN/projects/dataMerger/trunk/
This tool uses relatively simple technologies, such as Java servlets, JSP, Maven, MySQL, JavaScript (jQuery, JSON, AJAX) and of course CSS and XHTML. Maven was convenient, but not essential. The application also makes use of Andrew Valums’ rather nifty jQuery plug-in for uploading files, which is itself freely available and open source (GPL): http://valums.com/ajax-upload/
Even though dataMerger is narrowly focussed upon merging data, I encountered a wide range of topics during its development, for instance:
- Requirements gathering and issue tracking
- Technology choice, architecture choice, development approaches
- Web application security, user-access schemes, resource sharing
- Off-the-shelf versus tailor-made, open source integrations
- User interface design, REST, user experience, workflow
- Import / export of data file formats, cross-platform compatibility
- Dynamic database structures, data storage efficiency
- Strategies for handling data conflicts, nulls and missingness
- Database query performance and benchmarking procedural algorithms
- Balancing scalability with urgency and purpose-built engineering
- Balancing portability with close-coupling and interoperability
- Software versioning, data provenance, deployment strategies.
All in all, I had a lot of fun working on this project and I learnt a fair bit along the way too.