Posted: 1 June 2011 by magnusmanske in Software

Previously, this blog has talked about genotyping, SNP-o-matic, and how to find genetic variants in our DNA samples. But what data are these variant “calls” actually based on? And what does this data look like in its raw, “un-called” form? Does variant discovery rely exclusively on fancy software, or can the good ol’ Mark I eyeball still uncover hidden genetic features?

A brief history of new sequencing technology

The current generation of DNA sequencing machines, which feed our hungry processing pipeline, do not simply generate “the” genomic sequence of an organism. Instead, they chop the DNA into many short, overlapping fragments (think strings of 200-400 chars). Then, they give you the “substrings” at either end of a fragment, usually 54 or 76 chars (aka “bases”). They do not tell you precisely how long the fragment is, or the bases in between the ends, or where in the genome the fragment originally came from. But no worries: to compensate for this massive lack of information, you get lots of it. As in millions of fragment ends (aka “reads”). It then falls to alignment algorithms such as SNP-o-matic to solve the vast puzzle of putting the reads into their rightful place, using an existing “reference” genome as blueprint. Sometimes, a read is a perfect match to this reference; sometimes, it almost matches, but contains some mismatch (either a sequencing error, or a real variation); and sometimes, all the king’s horses and all the king’s men…
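To make the "fragment ends" idea concrete, here is a minimal Python sketch that simulates it on a made-up random reference. The function name, the default sizes and the toy reference are assumptions for illustration only; it ignores strand orientation, sequencing errors and quality values, and it is not how any real sequencer or aligner is implemented.

```python
import random

def sample_read_pairs(reference, n_pairs, read_len=76,
                      min_frag=200, max_frag=400, seed=0):
    """Pick random fragments and keep only the two ends of each one."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        frag_len = rng.randint(min_frag, max_frag)
        start = rng.randint(0, len(reference) - frag_len)
        fragment = reference[start:start + frag_len]
        # Only the ends are reported: the fragment length, the bases in
        # between and the genomic position are all thrown away here.
        pairs.append((fragment[:read_len], fragment[-read_len:]))
    return pairs

# Toy "genome" of 5000 random bases (hypothetical data).
rng = random.Random(42)
reference = "".join(rng.choice("ACGT") for _ in range(5000))
pairs = sample_read_pairs(reference, n_pairs=1000)
```

Each element of `pairs` is just two short strings; putting them back in their place on the reference is the job of the alignment step.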

The data tsunami

Depth of coverage.

Since the genomic position of the fragments is random, one has to generate a lot of them to be sure to have the entire genome represented in the form of fragments. Ideally, every position in the genome should be represented in several fragments, to detect erroneous bases, as well as “mixed” variants from multiple cells within the same DNA sample. The mean number of fragments overlapping each position in the genome is called depth. In our samples, the depth can range from less than 10 to several hundred. Of course, each base in each fragment needs to be stored, as well as every quality value (the machine’s guess of how accurate the base is), plus fragment identifier, position information, metadata, etc. This adds up pretty quickly; even for our “house parasite”, Plasmodium falciparum, with a meagre 24 megabases (24 million characters, or 24 megabytes) of DNA, the (compressed) sequencing data for a single sample usually takes up 2-4 GB of storage.
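Depth as defined above is easy to compute once fragments have positions. The following is a small sketch, assuming fragments are given as 0-based, half-open `(start, end)` intervals on the reference; the function names and the toy intervals are made up for illustration.

```python
def depth_per_position(genome_len, intervals):
    """Number of fragments overlapping each position of the genome."""
    # Difference array: +1 where a fragment starts, -1 where it ends.
    diff = [0] * (genome_len + 1)
    for start, end in intervals:
        diff[start] += 1
        diff[end] -= 1
    depth, running = [], 0
    for i in range(genome_len):
        running += diff[i]
        depth.append(running)
    return depth

def mean_depth(genome_len, intervals):
    """Mean depth: total covered bases divided by genome length."""
    return sum(end - start for start, end in intervals) / genome_len

# Three overlapping toy fragments on a 10-base "genome".
toy_intervals = [(0, 5), (2, 8), (4, 10)]
toy_depth = depth_per_position(10, toy_intervals)
# toy_depth is [1, 1, 2, 2, 3, 2, 2, 2, 1, 1]; the mean depth is 1.7
```

The per-position list is what a coverage plot would draw; the mean is the single "depth" figure quoted for a sample.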

The visual puzzle

LookSeq screenshot of three samples, entire chromosome.

In 2009, I published on LookSeq, a web-based tool for browsing these vast data sets in visual form. Since then, it has gone through several iterations, and I only recently finished the latest one. (Check out some public samples yourself, or get an overview of the programme and the code!) LookSeq allows the user to browse multiple samples, aligned with various algorithms, in various display modes, zooming from an entire chromosome down to individual bases.

Default "InDel" view of three samples. The tower-like structure on the left is a deletion. The two red "lines" are SNPs.

A quick guide: What you see are the reads aligned to the reference genome (x-axis). Bases matching the reference genome are blue; mismatches are red. You can zoom in by double-clicking, or use the buttons to zoom in and out or to jump to pre-set sizes (2kb, 50kb). Once zoomed in, you can drag the display to reveal the adjacent alignments. The default display mode uses the y-axis position to show the apparent size of the respective fragment; this is useful to detect structural variations like insertions and deletions (see figure on the left). There is also a coverage view (depth per base) and a pileup view (reads piled on top of each other; see “depth of coverage” figure above).
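Why does apparent fragment size reveal deletions? If the sample lacks a stretch of DNA that the reference has, the two reads of a fragment spanning that stretch map further apart on the reference than the fragment really is long. A rough sketch of that reasoning follows, with hypothetical function names, toy coordinates and an arbitrary tolerance; it is not LookSeq's actual code.

```python
from statistics import median

def apparent_fragment_size(left_start, right_start, read_len):
    """Distance on the reference from the first base of the left read
    to the last base of the right read."""
    return right_start + read_len - left_start

def flag_unusual_pairs(pairs, read_len, tolerance=100):
    """Return (pair, size) for pairs whose apparent size is far from
    the median apparent size of all pairs."""
    sizes = [apparent_fragment_size(l, r, read_len) for l, r in pairs]
    typical = median(sizes)
    return [(pair, size) for pair, size in zip(pairs, sizes)
            if abs(size - typical) > tolerance]

# Toy mapped start positions (left read, right read) on the reference.
toy_pairs = [(100, 324), (150, 380), (200, 440), (260, 1010), (300, 520)]
flagged = flag_unusual_pairs(toy_pairs, read_len=76)
# flagged is [((260, 1010), 826)]: this pair looks far longer than the
# others, which is the pattern expected around a deletion.
```

Plotted with apparent size on the y-axis, such pairs rise above the main band of reads.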

The tech

Now for the implementation. As LookSeq is browser-based, no client-side installation is required; only JavaScript needs to run, both for the “visual experience” and to communicate with the server. Communication runs through a server-side Perl script (which handles security, logins, metadata etc.) and a C++ backend for the heavy lifting, that is, the image rendering. Data is read directly from the alignment data in BAM format.

