Previously, this blog has talked about genotyping, SNP-o-matic, and how to find genetic variants in our DNA samples. But what data are these variant “calls” actually based on? And what does this data look like, in its raw, “un-called” form? Does variant discovery rely exclusively on fancy software, or can the good ol’ Mark I eyeball still uncover hidden genetic features?
A brief history of new sequencing technology
The current generation of DNA sequencing machines, which feed our hungry processing pipeline, do not simply generate “the” genomic sequence of an organism. Instead, they chop the DNA into many short, overlapping fragments (think strings of 200-400 chars). Then, they give you the “substrings” at either end of a fragment, usually 54 or 76 chars (aka “bases”). They do not tell you precisely how long the fragment is, or the bases in between the ends, or where in the genome the fragment originally came from. But no worries: to compensate for this massive lack of information, you get lots of it. As in millions of fragment ends (aka “reads”). It then falls to alignment algorithms such as SNP-o-matic to solve the vast puzzle of putting the reads into their rightful place, using an existing “reference” genome as a blueprint. Sometimes, a read is a perfect match to this reference; sometimes, it almost matches, but contains a mismatch (either a sequencing error, or a real variation); and sometimes, all the king’s horses and all the king’s men…
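To make the three outcomes above concrete, here is a toy sketch in Python. It is not how SNP-o-matic or any real aligner works (those use indexes instead of brute force); the function names, the tiny reference string and the one-mismatch budget are all made up for illustration.

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def place_read(read, reference, max_mismatches=1):
    """Slide the read along the reference (brute force) and return a list of
    (position, mismatches) for every placement within the mismatch budget."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = hamming(read, window)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


# Hypothetical toy data: a 21-base "genome" and three 8-base reads.
reference = "ACGTTGCAAGGCTTAACCGGT"
perfect = place_read("GCAAGGCT", reference)  # matches the reference exactly
near = place_read("GCAAGTCT", reference)     # one base differs
lost = place_read("TTTTTTTT", reference)     # fits nowhere

print(perfect, near, lost)
```

The first read lands at a single position with zero mismatches, the second at the same position with one mismatch (sequencing error or real variant, the code cannot tell), and the third returns an empty list: the king’s horses case.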
The data tsunami
Since the genomic position of the fragments is random, one has to generate a lot of them to be sure to have the entire genome represented in the form of fragments. Ideally, every position in the genome should be represented in several fragments, to detect erroneous bases, as well as “mixed” variants from multiple cells within the same DNA sample. The mean number of fragments overlapping each position in the genome is called depth. In our samples, the depth can range from less than 10 to several hundred. Of course, each base in each fragment needs to be stored, as well as every quality value (the machine’s guess of how accurate the base is), plus fragment identifier, position information, metadata, etc. This adds up pretty quickly; even for our “house parasite”, Plasmodium falciparum, with a meagre 24 megabases (24 million characters, or 24 megabytes) of DNA, the (compressed) sequencing data for a single sample usually takes up 2-4 GB of storage.
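For the programmers among you, depth is easy to sketch in Python. This is a minimal illustration that counts aligned reads covering each position, not our actual pipeline; the read coordinates are invented, and the figure of 10 million reads in the back-of-the-envelope estimate is an assumption for the sake of the arithmetic, not a measured value.

```python
def depth_per_position(genome_length, reads):
    """reads is a list of (start, length) tuples with 0-based starts.
    Returns a list holding the number of reads covering each position."""
    depth = [0] * genome_length
    for start, length in reads:
        # Clip reads that would run past the end of the genome.
        for pos in range(start, min(start + length, genome_length)):
            depth[pos] += 1
    return depth


def mean_depth(depth):
    """Average coverage over all positions."""
    return sum(depth) / len(depth)


def expected_depth(read_count, read_length, genome_length):
    """Back-of-the-envelope estimate: total sequenced bases / genome size."""
    return read_count * read_length / genome_length


# Hypothetical toy example: a 10-base genome and four 4-base reads.
toy_depth = depth_per_position(10, [(0, 4), (2, 4), (3, 4), (7, 4)])
toy_mean = mean_depth(toy_depth)

# Assumed numbers: 10 million reads of 76 bases on a 24-megabase genome.
estimate = expected_depth(10_000_000, 76, 24_000_000)

print(toy_depth, toy_mean, round(estimate, 1))
```

With these assumed numbers the estimate comes out at roughly 32, comfortably inside the range quoted above. Real coverage is far less even than this average suggests, which is exactly why looking at it per position is worthwhile.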
The visual puzzle
In 2009, I published a paper on LookSeq, a web-based tool for browsing these vast data sets in visual form. Since then, it has gone through several iterations, and I only recently finished the latest one. (Check out some public samples yourself, or get an overview of the programme and the code!) LookSeq allows the user to browse multiple samples, aligned with various algorithms, in various display modes, zooming from an entire chromosome down to individual bases.
A quick guide: What you see are the reads aligned to the reference genome (x-axis). The bases matching the reference genome are blue, mismatches are in red. You can zoom in by double-clicking, or use the buttons to zoom in and out or jump to pre-set sizes (2kb, 50kb). Once zoomed in, you can drag the display to reveal the adjacent alignments. The default display mode shows the apparent size of the respective fragment as position on the y-axis; this is useful to detect structural variations like insertions and deletions (see figure on the left). There are also a coverage view (depth per base) and a pileup view (reads piled on top of each other; see “depth of coverage” figure above).
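Why does apparent fragment size give away insertions and deletions? If the sample lacks a stretch of DNA that the reference has, the two reads of a fragment spanning that gap land further apart on the reference than usual, so the fragment looks too big; an insertion in the sample has the opposite effect. The Python sketch below illustrates the idea. It is not LookSeq’s code: the function names, the 100-base tolerance and the read positions are all hypothetical, and it assumes each pair is given as the reference start positions of its left and right reads.

```python
from statistics import median


def apparent_size(left_start, right_start, read_length):
    """Distance on the reference from the start of the left read
    to the end of the right read."""
    return right_start + read_length - left_start


def flag_outliers(pairs, read_length, tolerance=100):
    """Compare each pair's apparent size against the median and flag
    pairs that deviate by more than `tolerance` bases."""
    sizes = [apparent_size(left, right, read_length) for left, right in pairs]
    typical = median(sizes)
    flagged = []
    for (left, right), size in zip(pairs, sizes):
        if size - typical > tolerance:
            flagged.append((left, right, size, "possible deletion"))
        elif typical - size > tolerance:
            flagged.append((left, right, size, "possible insertion"))
    return flagged


# Hypothetical read pairs: (left read start, right read start), 76-base reads.
pairs = [(100, 324), (150, 380), (200, 420), (250, 980), (300, 530)]
flagged = flag_outliers(pairs, read_length=76)

print(flagged)
```

Four of the five pairs have apparent sizes of around 300 bases; the fourth one stretches to 806 and gets flagged. In the default display mode, such a pair would sit well above the main band of fragments on the y-axis, which is the kind of thing the eyeball picks up at a glance.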