Posted: 22 October 2014 by Alistair Miles in Uncategorized
The Anopheles gambiae 1000 genomes project is presenting us with some technical challenges, as genetic diversity within the mosquito populations we are studying is extremely high. Although the A. gambiae reference genome (~250Mb) is an order of magnitude smaller than the human genome, we still discover about 100 million SNPs, of which about half pass a reasonably conservative set of filters, which works out to about 1 good SNP every 5 bases or so.
Doing any kind of exploratory analysis of a dataset of ~100 million SNPs genotyped across ~1000 samples is difficult, and working directly from VCF files is impractical because of the time it takes to parse them. Genotype calls can be represented as two-dimensional arrays of numerical data, and there are a number of well-established and emerging software tools and standards for dealing in a generic way with large multi-dimensional arrays. We’ve been investigating these tools and trying to build on them to speed up our analysis workflow.
In particular, the HDF5 format is well supported, and we’ve got a lot of mileage out of it already. I’ve been working on a package called vcfnp which provides support for converting data from a VCF file first into NumPy arrays, and from there to HDF5. You have to make some choices when loading data into an HDF5 file, in particular what type of compression to use, and how to chunk the data. In order to make an informed decision, I did some benchmarking, looking at performance under a number of access scenarios, comparing different compression options and chunk layouts.
The main finding was that a chunk size of around 128 KB, with a fairly narrow chunk width of around 10, provides a very good compromise, with good read performance under both column-wise and row-wise access patterns. While other compression options are available and are slightly faster, gzip is very acceptable and is more widely supported, so we’ll be sticking with that for now. See the notebook linked above for the gory details.
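To get a feel for why a narrow chunk works for both access patterns, here is a rough sketch of the chunk arithmetic. It is not taken from the benchmarking notebook. It assumes that "128kb" means 128 KiB and that each genotype call is stored in a single byte; the array dimensions and the helper names `chunk_shape` and `chunks_touched` are made up for illustration.

```python
def chunk_shape(target_bytes, chunk_width, itemsize=1):
    """Return (rows, cols) for a chunk holding roughly target_bytes."""
    rows = max(1, target_bytes // (chunk_width * itemsize))
    return rows, chunk_width


def chunks_touched(shape, row_range, col_range):
    """Count the chunks that overlap a half-open [start, stop) selection."""
    def span(start, stop, size):
        # Index of the last chunk hit minus index of the first, plus one.
        return (stop - 1) // size - start // size + 1
    return span(*row_range, shape[0]) * span(*col_range, shape[1])


# Assumptions: 128 KiB target chunk size, one byte per genotype call.
shape = chunk_shape(128 * 1024, chunk_width=10)

# Illustrative array dimensions only (variants x samples).
n_variants, n_samples = 100_000, 1_000

# Column-wise access: every variant for a single sample.
one_sample = chunks_touched(shape, (0, n_variants), (0, 1))

# Row-wise access: 1,000 consecutive variants across all samples.
one_window = chunks_touched(shape, (0, 1_000), (0, n_samples))

print(shape, one_sample, one_window)  # (13107, 10) 8 100
```

With h5py, a shape like this can be passed as the `chunks` argument to `create_dataset`, together with `compression="gzip"`.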
Posted: 22 October 2014 by Alistair Miles in Uncategorized
Back in June we officially launched the Anopheles gambiae 1000 genomes project, which is a consortial project generating and analysing whole genome sequence data on wild-caught mosquitoes of the species Anopheles gambiae and Anopheles coluzzii, the major vectors of Plasmodium falciparum malaria in Africa.
Along with the initial web page, we also made our first data release. The phase 1 preview release contains genotype data on 103 mosquitoes from Uganda, contributed by Martin Donnelly and David Weetman of the Liverpool School of Tropical Medicine. VCF files are available to download from the Ag1000G public FTP site, and there is also an early version of the Panoptes web application which provides an interactive environment for exploring the data.
The consortium is currently working hard on preparing and analysing the full phase 1 dataset, which comprises 765 samples from 8 countries spanning sub-Saharan Africa. We hope to release at least a beta version of these data before the end of the year; I’ll post here when it’s available.
Posted: 31 October 2013 by Alistair Miles in Jobs
Join the MalariaGEN team! We’re currently recruiting for bioinformatics positions; see the MalariaGEN jobs web page for further details and how to apply. The closing date for applications is 4 November.
We’re primarily looking for bioinformaticians to join the methods development team, which works on evaluating methodologies for processing next-generation sequence data and analysing genetic variation. We are currently working with deep sequence data for nearly 3,000 Plasmodium samples and over 1,000 Anopheles samples, and a human resequencing project is just getting underway. So we are up to our eyeballs in data, and need people who have a keen eye for sifting the signal from the noise.
If you have any questions about the roles, please feel free to contact me.
A short video on the problem of anti-malarial drug resistance and the role of genome sequencing in parasite surveillance.
Posted: 5 September 2013 by Alistair Miles in Uncategorized
Recently Olivo Miotto and members of the MalariaGEN teams at Oxford and Sanger, in collaboration with teams studying malaria at 10 locations in West Africa and Southeast Asia, published a paper on multiple populations of artemisinin-resistant Plasmodium falciparum in Cambodia. To present the findings at this week’s Royal Society Summer Science Exhibition, the MalariaGEN communications team have put together a short animation. Enjoy!
Posted: 3 July 2013 by Alistair Miles in Uncategorized
Posted: 25 February 2013 by Alistair Miles in Uncategorized
I’ve created a liftover chain file to migrate genomic data from the “version 2” 3D7 reference genome to the newer “version 3” reference genome. You can download the chain file at the link below, as well as a binary for the liftOver program compiled for x86_64:
To check it works, download the above and test.bed to a local directory then run:
chmod +x ./liftOver
./liftOver test.bed 2to3.liftOver test.v3.bed test.v3.unmapped
This should create the file test.v3.bed, containing:
Pf3D7_07_v3 403620 403621 crt
Note that this expects chromosome names in the input to be like “Pf3D7_01”. If you’re using chromosome names like “MAL1”, you’ll need to convert those before applying the liftover to version 3.
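If your BED file does use “MAL1”-style names, a small script can do the renaming. This is only a sketch: it assumes that names of the form MAL<n> map to Pf3D7_<nn> with a zero-padded two-digit number, which is inferred from the two examples above, so check it against the chromosome names in your own data. The function names and the coordinates in the usage line are hypothetical.

```python
import re

# Assumed naming pattern: "MAL1" -> "Pf3D7_01", "MAL14" -> "Pf3D7_14".
_MAL_NAME = re.compile(r"^MAL(\d+)$")


def to_pf3d7_name(chrom):
    """Rename a MAL-style chromosome; leave any other name unchanged."""
    match = _MAL_NAME.match(chrom)
    if match is None:
        return chrom
    return "Pf3D7_%02d" % int(match.group(1))


def rename_bed_line(line):
    """Rewrite the first (chromosome) column of a tab-separated BED line."""
    fields = line.rstrip("\n").split("\t")
    fields[0] = to_pf3d7_name(fields[0])
    return "\t".join(fields)


# Hypothetical input line; the coordinates are for illustration only.
print(rename_bed_line("MAL7\t100\t101\texample"))
```

Apply `rename_bed_line` to each line of the input and write the result to a new file before running liftOver.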
Posted: 22 February 2013 by Alistair Miles in Uncategorized
I’ve recently been doing some analysis of SNPs and indels from the MalariaGEN P. falciparum genetic crosses project, and have found it convenient to load variant call data from VCF files into numpy arrays to compute summary statistics, make plots, etc.
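To illustrate the idea of turning variant calls into arrays, here is a minimal sketch that converts VCF-style GT strings into counts of non-reference alleles. This is just one common encoding, not necessarily the one used by either library described below, and the function name `gt_to_alt_count` is made up for this example. It uses plain Python lists to stay self-contained; in practice the result would go into a numpy array with a small integer dtype.

```python
def gt_to_alt_count(gt):
    """Count non-reference alleles in a GT string such as '0/1' or '1|1'.

    Returns -1 if any allele is missing ('.').
    """
    alleles = gt.replace("|", "/").split("/")
    if any(allele == "." for allele in alleles):
        return -1
    return sum(1 for allele in alleles if allele != "0")


# Toy data: two variants (rows) genotyped in three samples (columns).
calls = [
    ["0/0", "0/1", "1/1"],
    ["./.", "0|1", "0/0"],
]

matrix = [[gt_to_alt_count(gt) for gt in row] for row in calls]
print(matrix)  # [[0, 1, 2], [-1, 1, 0]]
```

Once the calls are numeric, summary statistics such as the number of missing calls per variant reduce to simple array operations.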
Attempt 1: vcfarray
I initially wrote a small Python library for loading the arrays based on the excellent PyVCF module. This works well but is a little slow, and profiling showed that the VCF parsing was the bottleneck, so I went in search of a C/C++ library I could use from Cython…
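For anyone wanting to run a similar check on their own loading code, the standard library’s cProfile module is enough to see where the time goes. The sketch below profiles a toy tab-separated parser, not vcfarray or PyVCF; the functions `parse_line` and `load` and the synthetic input lines are stand-ins invented for this example.

```python
import cProfile
import io
import pstats


def parse_line(line):
    """Toy parser: pull chromosome, position and GT fields from one line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos = fields[0], int(fields[1])
    genotypes = [sample.split(":")[0] for sample in fields[9:]]
    return chrom, pos, genotypes


def load(lines):
    """Parse every non-header line."""
    return [parse_line(line) for line in lines if not line.startswith("#")]


# Synthetic input: 2,000 variants x 100 samples, all with the same call.
lines = [
    "chr1\t%d\t.\tA\tT\t50\tPASS\t.\tGT:GQ\t" % pos + "\t".join(["0/1:30"] * 100)
    for pos in range(1, 2001)
]

profiler = cProfile.Profile()
profiler.enable()
records = load(lines)
profiler.disable()

# Collect the report as text, sorted by cumulative time per function.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats()
report = buffer.getvalue()
print(report)
```

The functions near the top of the cumulative-time listing are the ones worth optimising or replacing first.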
Attempt 2: vcfnp
Erik Garrison’s vcflib library provides a nice C++ API for parsing a VCF file, so I had a go at writing a Cython module based on that. Performance is better: I get roughly a 2-4x speed-up over the PyVCF-based implementation, although I was hoping for an order of magnitude … I guess it’s just the case that string parsing is relatively slow, even in C/C++, and we should be using BCF2.
To install and try vcfnp for yourself, do:
pip install vcfnp
See the vcfnp README for some examples of usage.