Visualisation software round-up
A common need of ours is to view directly, or visualise interpretively, information (both numeric and categorical) that is attached to a particular point on a genomic sequence, often in relation to some attribute of that sequence. The number of file formats and tools that have been written for doing this surprised me when I first looked. This post is the first step in looking at what is out there. For this purpose I’m limiting myself to tools that read the popular VCF format (not to be confused with vCard). This will only scratch the surface – a quick look at this list shows there is more bioinformatics software than you can shake a double helix at.
IGV – ‘Integrative Genomics Viewer’
IGV is a Java app that loads a veritable kitchen sink of formats. It is integrated with the Java Web Start system, which allows a Java app to be launched with ‘one click’ from your web browser. Files can be loaded from disk, over HTTP/FTP/DAS, or from a curated set that the app has metadata for. The state of the app on startup can be specified by command-line args or XML config. Combined with a PHP script that cooks up custom Web Start files, one can actually link to a specific view on a specific dataset (if that data is public). This gives web-app-style linking, albeit with a bit of a wait and no access control.
At the moment I’m only interested in using IGV to look at SNP data from VCF files – it does much more than this, for example displaying reads from BAM files. Before loading you need to pre-process the VCF to create an index using igvtools, which is accessed from the ‘File’ menu in IGV. Indexing our VCF files originally failed – IGV complained that they did not comply with the VCF 4.0 spec as they had whitespace in the INFO field. I confirmed this with VCFtools – in fact the error message from IGV was more instructive as it included the line number of the problem. This is definitely the fault of our systems and something I hope we can eradicate through better automated testing and by persuading people that sticking to standards is in their best interest. For now I just fixed this by truncating the file before the problem.
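The kind of automated test I have in mind can be quite small. Below is a rough Python sketch that scans VCF lines and reports the line number of any record whose INFO column contains whitespace. It assumes the file is tab-delimited with INFO as the eighth column; the function name and the sample records are made up for illustration, and it is in no way a full validator.

```python
def find_info_whitespace(lines):
    """Yield (line_number, info) for records whose INFO column contains whitespace.

    Assumes tab-delimited VCF records with INFO as the 8th column.
    Records with fewer than 8 columns are reported with info=None.
    """
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        fields = line.split("\t")
        if len(fields) < 8:
            yield line_number, None
            continue
        info = fields[7]
        if any(ch.isspace() for ch in info):
            yield line_number, info


# Made-up example records; the last one has a stray space in INFO.
example = [
    "##fileformat=VCFv4.0",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "MAL1\t100\t.\tA\tT\t50\tPASS\tDP=10;AF=0.5",
    "MAL1\t200\t.\tG\tC\t50\tPASS\tDP=12; AF=0.1",
]
problems = list(find_info_whitespace(example))
for line_number, info in problems:
    print(f"line {line_number}: bad INFO field {info!r}")
```

Because it takes any iterable of lines, the same function can be pointed at an open file handle instead of the in-memory list.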
First one picks the reference genome onto which the VCF file will be mapped. IGV comes with quite a selection in its curated set, or one can be loaded. The VCF needs to use the same chromosome names as the reference or it will not map. For example, I had to pick an old Plasmodium reference as our VCF had the old ‘MAL1’-style chromosome names. IGV is very flexible in what it will load – one could add extra columns to the VCF and have them displayed alongside.
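A quick way to catch a naming mismatch before loading anything is to compare the chromosome names in the VCF with the sequence names in the reference FASTA. The sketch below assumes the FASTA name is the first word after ‘>’ and that the VCF chromosome is the first tab-separated column; the function names and the example sequence names are illustrative only.

```python
def fasta_names(lines):
    """Collect sequence names: the first word after '>' on each header line."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                names.add(words[0])
    return names


def vcf_chroms(lines):
    """Collect the CHROM column from every non-header, non-blank VCF line."""
    chroms = set()
    for line in lines:
        if line.strip() and not line.startswith("#"):
            chroms.add(line.split("\t", 1)[0])
    return chroms


def unmapped_chroms(vcf_lines, fasta_lines):
    """Return VCF chromosome names that are absent from the reference, sorted."""
    return sorted(vcf_chroms(vcf_lines) - fasta_names(fasta_lines))


# Illustrative data: an old-style name in the VCF, new-style names in the FASTA.
fasta = [">Pf3D7_01_v3 example description", "ACGT", ">Pf3D7_02_v3", "GGCC"]
vcf = [
    "##fileformat=VCFv4.0",
    "#CHROM\tPOS\tID\tREF\tALT",
    "MAL1\t100\t.\tA\tT",
    "Pf3D7_02_v3\t5\t.\tG\tC",
]
missing = unmapped_chroms(vcf, fasta)
print(missing)
```

Anything this prints is a chromosome whose variants would have no reference sequence to sit on.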
Once loaded, one is presented with a layout with base position as the X-axis and sample as the Y-axis. You can drag the view around, use the arrow keys to move left and right, or use the stylised scroll bar at the top. I couldn’t use the mouse wheel to zoom, but you can use Ctrl-+. The app keeps a record of your locations, which you can navigate with the forward and back buttons. You can skip to a point or gene label using the search box at the top – this auto-completes and will give you a list to pick from if it gets more than one match. I managed to make the app hang by searching for a single letter though.
Hovering over any point gives information about that point in a window that disappears as soon as you move away – making you feel like you’re playing some kind of steady hand game. This also means that you can’t cut and paste that info out of it. There is a toolbar button that replaces the pop-up with a separate window with text, but I couldn’t copy out of that either. Right-clicking brings up a context menu that lets you sort by the selection or change how it is displayed, for example switching between colouring by allele or by genotype. As far as I can see the display is always relative to the reference genome. Although you can mark regions of interest, you can’t pick a set of SNP positions and then view just those without the intervening bases, or order them by any criterion but genomic position. Above the individual samples is a summary section which, for each position, shows a small bar coloured in proportion to the samples’ genotype distribution.
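To make the summary bar concrete, here is a sketch of how the proportions behind such a bar could be worked out from a single VCF record. This is my own guess at the calculation, not IGV’s code: it assumes the standard layout with FORMAT in the ninth column and samples after it, and the four category labels are made up.

```python
from collections import Counter


def genotype_proportions(vcf_line):
    """Return the fraction of samples in each genotype class for one record.

    Classes (labels are illustrative): hom_ref, het, hom_alt, missing.
    Assumes FORMAT is column 9 and contains a GT key.
    """
    fields = vcf_line.rstrip("\r\n").split("\t")
    gt_index = fields[8].split(":").index("GT")
    counts = Counter()
    for sample in fields[9:]:
        parts = sample.split(":")
        gt = parts[gt_index] if gt_index < len(parts) else "."
        alleles = gt.replace("|", "/").split("/")  # treat phased like unphased
        if "." in alleles:
            label = "missing"
        elif all(a == "0" for a in alleles):
            label = "hom_ref"
        elif len(set(alleles)) == 1:
            label = "hom_alt"
        else:
            label = "het"
        counts[label] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Made-up record with four samples, one in each class.
record = "MAL1\t100\t.\tA\tT\t50\tPASS\tDP=10\tGT:DP\t0/0:5\t0/1:3\t1/1:4\t./.:0"
proportions = genotype_proportions(record)
print(proportions)
```

Each fraction would then set the height of one coloured segment of the bar.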
The code is on GitHub (yay!) and appears to be under active development. In summary, IGV is a flexible tool for viewing data, but it does not offer any tools specifically for exploring variation through SNPs, as in our use case.
VARB
VARB is a C++/Qt app that only views VCF files. It is distributed as a binary but links against shared Qt libraries, so I had to ‘apt-get libqt’ before it would start. The source is distributed as a zip file, so I can’t tell whether it is under active development, nor can I submit changes as anything but a patch file. VARB requires three files: a reference in FASTA format, an annotation file in GFF format and finally the VCF. I used the FASTA from here and the GFF from the VARB example files. VARB also failed to load our malformed VCF, but gave no clue beyond saying that the file was malformed.
VARB offers the same kind of navigation as IGV, again with no mouse-wheel zoom, and strangely zooming is relative to the left edge. SNPs can disappear and reappear as one zooms, as the rasterisation algorithm doesn’t cope with sparse SNPs in zoomed-out regions. The controls and drawing appear to run in the same thread, which makes navigation hard. There is an annotation search, but with no auto-complete. The selection tool was much more useful, however: the details come up in the sidebar and are easily copied, as clicking makes them stick in the window until cleared.
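I haven’t read VARB’s drawing code, so the following is only a sketch of one way a viewer could avoid losing sparse SNPs when zoomed out: rather than sampling one base per pixel, bin every SNP position into a pixel column and light any column that contains at least one SNP. The function name and parameters are hypothetical.

```python
def snp_pixels(positions, view_start, view_end, width_px):
    """Return the sorted pixel columns that contain at least one SNP.

    positions  -- SNP coordinates (same coordinate system as the view)
    view_start -- first base shown (inclusive)
    view_end   -- end of the view (exclusive)
    width_px   -- width of the drawing area in pixels
    """
    span = view_end - view_start
    if span <= 0 or width_px <= 0:
        raise ValueError("view and width must be positive")
    lit = set()
    for pos in positions:
        if view_start <= pos < view_end:
            # Integer binning: every base maps to exactly one pixel column.
            lit.add((pos - view_start) * width_px // span)
    return sorted(lit)


# Three lone SNPs across a megabase, drawn into a 100-pixel-wide view.
columns = snp_pixels([10, 500_000, 999_999], 0, 1_000_000, 100)
print(columns)
```

With this approach a single SNP always owns a pixel no matter how far out you zoom, at the cost of not showing how many SNPs share a column.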
As well as the information from the VCF, VARB adds some analytical output at the bottom of the window. This is fixed to GC density, relative variant density, Fst and Tajima’s D, which are updated as one changes the quality, depth and SNP type filters on the left. The windows used for calculating these are fixed and independent of zoom. Samples can be grouped, and this grouping is used for the Fst calculation – although I’m not sure how it works out Fst for more than one group. As in IGV, there is no way to view the SNPs or samples in any way but in sequence order and with separation. The colours can be re-assigned – I found that setting the reference allele colour to white let me see the variation much more clearly. With a few tweaks VARB could be a very usable SNP browser.
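As an illustration of what a fixed-window statistic looks like, here is a sketch of GC density computed over non-overlapping windows of a reference sequence. The window size is an assumption of mine (I don’t know what VARB uses), as is the choice to count every base in the window, including any N, in the denominator.

```python
def gc_density(sequence, window=1000):
    """Return (window_start, gc_fraction) for each non-overlapping window.

    The final window may be shorter than `window`. Every character in a
    window counts towards the denominator, including ambiguous bases.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    sequence = sequence.upper()
    result = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        result.append((start, gc / len(chunk)))
    return result


# Tiny made-up sequence with a window of four bases.
densities = gc_density("GGCCAATT", window=4)
print(densities)
```

Because the windows are tied to sequence coordinates rather than to the screen, the values stay the same however far you zoom.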
BAMSeek
BAMSeek isn’t so much a visualisation tool as a file inspection tool. It is distributed as a JAR file with source on Google Code. It supports quite a few formats and is primarily designed for loading large files, as it indexes, and then pages, the file as needed. Anyone who has opened a large file in a normal text editor will know the pain (I have found Sublime Text handles them well though, after a slightly long load). BAMSeek successfully loaded our off-spec VCF file – probably because it does not fully parse it in order to display its textual content. The VCF file is simply displayed in a table with the header in a separate section. The paging is done by having actual pages that you flip through with a control at the bottom. The line numbers on the left are relative to the page, which is a little frustrating: to get the actual line number you have to do (page - 1) * rows_per_page + line_no in your head. Hovering over a cell gives you the information formatted vertically. There’s not much more to it than that!
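For anyone who would rather not do that sum in their head, the conversion is a one-liner. This sketch assumes pages and per-page line numbers both start at 1; the function name is my own.

```python
def absolute_line_number(page, rows_per_page, line_no):
    """Convert a 1-based page and per-page line number to a file line number."""
    if page < 1 or rows_per_page < 1 or not 1 <= line_no <= rows_per_page:
        raise ValueError("page and line_no are 1-based; line_no must fit on a page")
    return (page - 1) * rows_per_page + line_no


# Line 42 on page 3, with 100 rows per page.
line = absolute_line_number(3, 100, 42)
print(line)  # 242
```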
Next time we’ll look at some web-based apps that do a similar job.