I’ve created a liftover chain file to migrate genomic data from the “version 2″ 3D7 reference genome to the newer “version 3″ reference genome. You can download the chain file at the link below, as well as a binary for the liftOver program compiled for x86_64:

To check it works, download the above and test.bed to a local directory then run:

chmod +x ./liftOver
./liftOver test.bed 2to3.liftOver test.v3.bed test.v3.unmapped

This should create the file test.v3.bed containing:

Pf3D7_07_v3	403620	403621	crt

Note that this expects chromosome names in the input to be like “Pf3D7_01″. If you’re using chromosome names like “MAL1″ you’ll need to convert those first prior to applying the liftover to version 3.

Read the rest of this entry »

Load data from a VCF file into numpy arrays

Posted: February 22, 2013 by Alistair Miles in Uncategorized

I’ve recently been doing some analysis of SNPs and indels from the MalariaGEN P. falciparum genetic crosses project, and have found it convenient to load variant call data from VCF files into numpy arrays to compute summary statistics, make plots, etc.

Attempt 1: vcfarray

I initially wrote a small Python library for loading the arrays based on the excellent PyVCF module. This works well but is a little slow, and when I profiled it it was the VCF parsing that was the bottleneck, so I went in search of a C/C++ library I could use from Cython…

Attempt 2: vcfnp

Erik Garrison’s vcflib library provides a nice C++ API for parsing a VCF file, so I had a go at writing a Cython module based on that. Performance is better, I get roughly 2-4X speed-up over the PyVCF-based implementation, although I was hoping for an order of magnitude … I guess it’s just the case that string parsing is relatively slow, even in C/C++, and we should be using BCF2.

To install and try vcfnp for yourself, do:

pip install vcfnp

See the vcfnp README for some examples of usage.

Visualisation software round-up

A common need that we have is to directly view, or interpretively visualise information (both numeric and categoric) that is attached to a particular point on genomic sequence, often in relation to some attributes of that sequence. The number of file formats and tools that have been written for doing this surprised me when I first looked. This post is the first step in looking at what is out there. For this purpose I’m limiting myself to looking at tools that read the popular VCF format (Not to be confused with v-Card). This will only scratch the surface – a quick look at this list shows there is more bioinformatics software than you can shake a double helix at.

IGV – ‘Integrative Genomics Veiwer’

IGV is a java app, that loads a veritable kitchen sink of formats. It is integrated with the Java Web Start system that allows launching of a Java app with ‘one-click’ from your web browser. Files can be loaded from disk or over http/ftp/DAS or from a curated set that the app has metadata for. The state of the app on startup can be specified by command line args or XML config. Combined with a php script that cooks up custom Web Start files one can actually link to a specific view on a specific dataset (if that data is public). This gives web-app style linking, albeit with a bit of a wait and no access control.

I’m only interested at the moment in using IGV to look at SNP data from VCF files – it does much more than this, for example reads from BAM files. Before loading you need to pre-process the VCF to create an index using igvtools which is accessed from the ‘File’ menu in IGV. Indexing our VCF files originally failed – IGV complained that they did not comply with the the VCF4.0 spec as they had whitespace in the INFO field. I confirmed this with VCF tools - in fact the error message from IGV was more instructive as it had the line number of the problem. This is defiantly the fault of our systems and something I hope we can eradicate through better automated testing and persuading people that sticking to standards is in their best interest. For now I just fixed this by truncating the file before the problem.

Firstly one picks the reference genome onto which the VCF file will be mapped. IGV comes with quite a selection available in its curated set, or one can be loaded. The VCF will need the same chromosome names or it will not map. For example I had to pick an old Plasmodium reference as our VCF had the old ‘MAL1′ style chromosome names. IGV is very flexible in what it will load – one could add extra columns to the VCF and have them displayed along side.

IGV InterfaceOnce loaded one is presented with a layout with base as the X-axis and sample as the Y, you can drag around and use the arrow keys to move left right or use the stylised scroll bar at the top. I couldn’t use the mouse wheel to zoom :( but you can use Ctrl-+.  The app keeps a record of your locations which you can navigate with the forward and back buttons. You can skip to a point or gene label using the search box at the top – this auto-completes and will give you a list to pick from if it gets more than one match. I managed to make the app hang by searching for a single letter though.

IGV Pop-upHovering over any point gives information about that point in a window that disappears as soon as you move away – making you feel like you’re playing some kind of steady hand game. This also means that you can’t cut and paste that info out of it. There is a toolbar button that replaces the pop-up with a separate window with text, but I couldn’t copy out of that either. Right-clicking brings up a context menu that lets you sort by the selection or change how it displayed, for example switching between colouring for allele or genotype. As far as I can see the display is always relative to the reference genome. Although you can mark regions of interest you can’t pick a set of SNP positions and then just view those without the intervening bases, or order them by any criteria but genomic position. Above the individual samples is a summary section which for each position shows a small bar which is coloured in proportion the samples’ genotype distribution.

The code is on github (yay!) and appears to be under active development. In summary IGV is a flexible tool for viewing data, but does not offer any tools specifically for exploring variation through SNPs as in our use case.

VARiation Browser

VARB is a C++/QT app that only views VCF files. It is distributed as a binary but with shared linking to QT so I had to ‘apt-get libqt’ before it would start. The source is distributed as a zip file so I can’t tell if it is under active development or submit changes as anything but a patchfile. VARB loads requires three files, a reference in FASTA format, an annotation file in GFF format and finally the VCF. I used the FASTA from here and the GFF from the VARB example files. In loading our malformed VCF VARB also failed but did not provide any clue beyond saying that the file was malformed.

VARB offers the same kind of navigation as IGV, again no mouse-wheel and strangely zooming is relative to the left edge. SNPs can disappear and re-appear as one zooms as the rasterisation algorithm doesn’t cope with sparse SNPs on zoomed out regions. The controls and drawing appear to run in the same thread which makes navigation hard. There is an annotation search, but with no complete. The selection tool was much more useful however with the details coming up in the sidebar and easily copied as clicking makes the details stick in the window until cleared.

As well as the information from the VCF VARB adds some analytical output at the bottom of the window. This is fixed to the GC density, Relative variant density, Fst and Tajima’s D, these are updated as one changes the quality, depth and SNP type filters on the left. The windows used for calculating these are fixed and zoom independent. Samples can be grouped, and this grouping is used for the Fst calculation – although I’m not sure how it works out Fst for more than one group. As in IGV there is no way to view the SNPs or samples in any way but sequence order and with separation. The colours can be re-assigned – I found that setting the reference allele colour to white let me see the variation much more clearly. With a few tweaks VARB could be a very use-able SNP browser.

BAMSeek

BAMSeek isn’t so much a visualisation tool as it is a file inspection tool. It is distributed as a JAR file with source on Google Code. It supports quite a few formats and is primarily designed for loading large files as it indexes, and then pages, the file as needed.  Anyone who has used a normal text editor will know the pain of large files (I have found Sublime Text handles them well though after a slightly long loading).  BAMSeek successfully loaded our off-spec VCF file – probably as does not fully parse it in order to display its textual content. The VCF file is simply displayed in a table with the header in a separate section. The paging is done by having actual pages that you flip through with a control on the bottom. The line numbers on the left are relative to the page – which is a little frustrating as to get the actual line number you have to do ((page-1)*(rows_per_page)+line_no) in your head. Hovering over a cell gives you the information formatted vertically. There’s not much more to it than that!

Next time we’ll look at some web-based apps that do a similar job.

These days, virtualisation is all the rage. The various competing virtualisation products have reached a level of maturity where they can be reliably used for server consolidation. VirtualBox is one of the easiest to use, most featureful programs available in this space and with the ability to run on many different OSes on hardware with or without VM extensions, it is also one of the most popular. However, there is one wrinkle when it comes to using it for server consolidation: the proprietary RDP/USB2 extension pack.

The conventional wisdom when running a headless server with VirtualBox is that you need to install this proprietary extension pack from Oracle. This is fine until you want to use the server in production: as the PUEL only covers you for personal use and evaluation, you must purchase licenses. You can either pay £34 per user or £670 per “socket” (which has quite a convoluted definition). This gets you USB2 and RDP support.

However, there is another way, at least when it comes to RDP support. Read the rest of this entry »

Update 2012-04-25: mysql workbench has now appeared in the universe package archive. You should be able to install it with a simple:

sudo apt-get install mysql-workbench

Read on if you still want to compile from source.

Right now (2012-04-04), Ubuntu 12.04 hasn’t been released yet, and so there is no binary package from Oracle of MySQL Workbench for Precise. I managed to get the MySQL Workbench binaries for Oneiric to run, by manually installing libzip1_0.9.3-1_amd64.deb from Oneiric, but this wasn’t stable (crashed as soon as I tried to run a SQL Query).

So I decided to build from source. Here’s how I did it Read the rest of this entry »

We use Sun Grid Engine here at WTCHG for managing our compute resources. Many of the analyses I’m doing are best run as array jobs, which generally works very well, but sometimes one or more tasks will fail for one reason or another, and I’ve been casting around for best practice when it comes to (a) verifying which tasks succeeded and which failed, and (b) re-running failed tasks.

I found a nice post by Shiran Pasternak on resubmitting failed SGE array tasks, however Shiran doesn’t say how he determined which tasks had failed, and the set of tasks to rerun is specified manually. I have thousands of tasks in each array job, and so I really need an automated way of determining the success/failure of each task and rerunning those that failed.

I came up with the following pattern.

Read the rest of this entry »

Jobs – Bioinformatics

Posted: March 23, 2012 by Alistair Miles in Jobs
Tags:

We’re advertising bioinformatics jobs at both Oxford and Sanger (near Cambridge), see the following links for job descriptions and information on how to apply:

Here’s an excerpt from the job description:

Overview of role

All MalariaGEN projects working on parasite and vector biology depend on next-generation sequencing. Over 2,000 samples of parasite DNA have been sequenced, and at least 10,000 samples will have been sequenced by 2015. Genome sequencing has been carried out on approximately 200 Anopheles samples to date, and the aim is to sequence approximately 2,500 individuals over the next 4 years. Most parasite samples have been extracted directly from infected blood samples, and so present additional complexities such as small quantities of DNA and mixed infection.

Raw next-generation sequence data is the beginning of a complex and intellectual demanding analysis process. The primary goal is to discover robust evidence for genetic variation. However, building from raw sequence data to robust variation data is and will continue to be one of the most significant challenges facing the malaria research community over coming years. Working to iteratively improve the quality of our genetic variation data and reach deeper into the Plasmodium and Anopheles genomes is the main focus of the MalariaGEN Bioinformatician roles.

This is an extremely fast-paced area of current research and development, and new methods and tools are emerging from many leading research groups and projects, many of whom we have close contacts with. However, we have to strike a balance between looking to the future, and delivering data to MalariaGEN partners that might not be perfect or complete but which nevertheless provides a highly valuable research tool for a range of studies, such as genotype-phenotype association studies, and studies of parasite and vector population structure and dynamics.

To achieve this balance between methods development on the one hand, and production of data on the other, our bioinformatics programme is organised around two working groups. The methods development group is focused on the development, exploration and thorough evaluation of new methods, including methods for sequence alignment, variation calling and genotyping, working closely with statisticians. The production group is focused on establishing tightly specified data analysis pipelines and using them to produce high quality variation data in a reproducible way. Both working groups work to a quarterly data release cycle, where the methods development looks ahead to the next release and determines the best available methods, which are then adopted and implemented by production.

While this role may focus more on methods development or production at different times, we encourage participation in both working groups, as there are important insights that can only be gained by working across both.