Using Interval Trees to Query Genome Annotations by Position

Posted: 7 July 2011 by Alistair Miles in Uncategorized
Tags: , , , , , , ,

This week I’ve been doing quality assurance work on some data we’re about to send back to partners of the P. falciparum Genome Variation project. These data include some SNP lists – files listing positions in the P. falciparum genome believed to be variable from one parasite to another. To make these files useful, it helps to include genome annotations – information about which gene (if any) can be found at each variable position. Constructing these files means joining a list of variable positions with a set of genome annotations, where each annotation has a start and end position on some chromosome. I.e., for each variable position, find all genome annotations overlapping that position.

Because I need to do this lookup once for each of about a million SNPs, I wanted to know what the most efficient algorithm for doing this type of query would be. It turns out that Interval Trees are the way to go (thanks Lee for discovering this). It also turns out that there is an implementation of interval trees tailored for searching genome annotations in a package called bx-python, which is very handy as I’ve been writing my QA scripts in Python.

On my Ubuntu desktop installing bx-python is as easy as sudo easy_install bx-python. There are also instructions for manually installing bx-python if you don’t have access to easy_install.

Below is a snippet from one of my QA scripts which uses the IntervalTree class from bx-python and builds a set of interval trees from a GFF3 annotations file.

import csv
from bx.intervals.intersection import IntervalTree

def index_gff3(gff3_file_path):

    # dictionary mapping chromosome names to interval trees
    genome = dict()

    # parse the annotations file (GFF3) and build the interval trees
    with open(gff3_file_path, 'r') as annotations_file:
        reader = csv.reader(annotations_file, delimiter='\t')
        for row in reader:
            if len(row) == 9 and not row[0].startswith('##'):
                seqid = row[0]
                start = int(row[3])
                end = int(row[4])
                tree = None
                # one interval tree per chromosome
                if seqid in genome:
                    tree = genome[seqid]
                    # first time we've encountered this chromosome, create an interval tree
                    tree = IntervalTree()
                    genome[seqid] = tree
                # index the feature
                tree.add(start, end, tuple(row)) 

    return genome

I can then use this function to do things like…

genome = index_gff3('/path/to/annotations.gff')
annotations = genome['apidb|MAL1'].find(474889, 474889) # find annotations overlapping a single position
annotations = genome['apidb|MAL1'].find(470000, 480000) # find annotations overlapping an interval
  1. I discovered that, if you want to find annotations overlapping a single position at the start or end of the annotation, then you need to do:

     # find annotations which overlap 474889 on MAL1, including start and end position
    annotations = genome['apidb|MAL1'].find(474888, 474890) 

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s