One of our main tasks is to develop and run a data processing pipeline that takes in raw sequencing data from many different parasite DNA samples, finds all the positions in the parasite genome where some of these samples differ from others (“SNP discovery”), and then makes a “call” for each sample at each of those variable positions (“SNPs”) as to which nucleotide (A, C, G or T) is found there (“genotyping”).
Once we’ve done that, we can tell you something about how one parasite is different from another, or how one group of parasites is different from another group, which might be interesting if, for example, one or more of those parasites are resistant to a particular drug.
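To make the two steps a bit more concrete, here is a toy sketch in Python. It assumes something the real pipeline does not get for free: that each sample has already been boiled down to a single sequence lined up against the same genome coordinates. The function names (`discover_snps`, `genotype`) and the sample data are made up for illustration and are not how SNP-o-matic works internally.

```python
def discover_snps(samples):
    """Return the positions where at least two samples carry different nucleotides.

    `samples` maps a sample name to a sequence string; all sequences are
    assumed (for this sketch) to be the same length and already aligned.
    """
    length = len(next(iter(samples.values())))
    snps = []
    for pos in range(length):
        # Collect the distinct nucleotides seen at this position,
        # ignoring anything that is not a plain A, C, G or T.
        alleles = {seq[pos] for seq in samples.values() if seq[pos] in "ACGT"}
        if len(alleles) > 1:
            snps.append(pos)
    return snps


def genotype(samples, snps):
    """For every sample, report the nucleotide found at every SNP position."""
    return {name: {pos: seq[pos] for pos in snps} for name, seq in samples.items()}


# Hypothetical toy data: three samples, eight positions.
samples = {
    "sample_1": "ACGTACGT",
    "sample_2": "ACGTTCGT",
    "sample_3": "ACCTACGT",
}

snps = discover_snps(samples)    # positions 2 and 4 vary (0-based)
calls = genotype(samples, snps)  # one call per sample per SNP
print(snps)
print(calls)
```

Once you have a table of calls like this, comparing one parasite (or one group of parasites) with another comes down to comparing rows.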
Anyway, this is harder than you might think, and is complicated by all sorts of factors. For one, we have to work with lots and lots of relatively short pieces of DNA. For another, the malaria parasite genome has long stretches of really repetitive sequence. On top of that, a DNA sample might actually be from a person who was infected with two or more different parasites (a “mixed infection”), so you’re actually looking at two or more genomes in one sample, … not to mention that you have to work with lots of data, so use of memory and compute power needs to be efficient.
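The mixed infection problem is perhaps the easiest of these to show in code. Here is a rough sketch of making a call at a single position from the bases seen in the short reads covering it. The function name `call_genotype` and both thresholds (`min_depth`, `min_minor_fraction`) are assumptions chosen purely for illustration, not values or logic taken from our pipeline.

```python
from collections import Counter


def call_genotype(bases, min_depth=5, min_minor_fraction=0.2):
    """Make a call at one position from the bases observed in the reads there.

    Returns None if too few reads cover the position, a single nucleotide
    if one allele dominates, or e.g. "A/T" if a second allele is common
    enough to suggest a mixed infection. Thresholds are illustrative only.
    """
    counts = Counter(b for b in bases if b in "ACGT")
    depth = sum(counts.values())
    if depth < min_depth:
        return None  # not enough evidence to make a call

    ranked = counts.most_common()
    major = ranked[0][0]
    # Keep any other allele seen in a large enough share of the reads.
    minor = [b for b, n in ranked[1:] if n / depth >= min_minor_fraction]
    return "/".join([major] + minor)


print(call_genotype("AAAAAAAAAT"))  # one stray T among ten reads
print(call_genotype("AAAAAATTTT"))  # six A and four T
print(call_genotype("AAT"))         # only three reads
```

Even in this toy version you can see the trade-off: set the minor-allele threshold too low and sequencing errors look like mixed infections, too high and real mixtures get missed.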
This is something we’ve been working on for a couple of years (well, I say “we”, but I take no credit; this is all down to Magnus, Gareth, Dominic and others). Back in 2009 Magnus published an article on something called SNP-o-matic, which is a piece of software he wrote to perform SNP discovery and genotyping for parasite samples, and to deal with some of the complications I’ve mentioned above. SNP-o-matic is still a key part of our pipelines, and Magnus has just finished rewriting it, primarily to add some new features for finding different types of genetic variation. But this is not a solved problem, and we know we can do better, so this is going to keep us busy for a while yet!