Tuesday, February 16, 2010

organizing genomic data

In some cases, using sqlite, a DBM, or just a python pickle is as much of a pain as dealing with something like gff (generic feature format) which is comprehensive but requires a lot of manipulation. Relational db's are slow and have extra overhead when I don't need the relational or ACID parts and DBM's, for the most part, only allow querying by keys. These are 2 cases where I've found a nice alternative.

Big ol' Matrix

because it was already nicely pre-processed, I downloaded some Arabidopsis expression data from here. They keep the data in a hierarchy of files. I want to be able to quickly find the coexpression for any gene pair. 22573 * 22573 == 509 million entries is a bit large for a hash (aka python dictionary)--and even disregarding speed concerns it's more than I want to push into an sqlite db--but fits in a 1.9 gig file of 32 bit floats. So each entry is a measure of coexpression between 2 genes. In that sense, this is a key-key-value store. [for more details on how the coexpression is calculated, see the link above, I'm using this dataset specifically so I don't have to think about the raw expression data which is difficult to deal with].

I'm using numpy to create and then read that binary file (though it's just 32bit floats so the format will work for any language). The index of a particular gene in the array can be set as it's alphabetic order. So, to find the level of coexpression between the 13656th gene and the 12355th gene is arr[13657,12356]--or using pointer arithmetic if you're not using numpy. To find the average level of coexpression of the 13656th gene with all others is arr[13657].mean(). Looking up the index of each gene is a simple matter of reading in the file of gene names (already ordered alphabetically) and creating a dictionary lookup of gene => index. The one line of code to do that looks like this:

idx_lookup = dict((accn.rstrip(), i) for i, accn in enumerate(open(path + "/names.order")))
which can take an accn name like "At2g23450" and return its integer index.
Since the binary file is 1.9G of data, it's not ideal to read the entire thing into memory. But, it's simple to memory map it with numpy into a big ol' matrix/array:

L = len(idx_lookup)
coexp = np.memmap(path + '/coexp.bin', dtype=np.float32, shape=(L, L), mode='r')

so then, given a pair like: AT1G47770,AT2G26550
the code to find the coexpression is:

ai = idx_lookup["AT1G47770"]
bi = idx_lookup["AT2G26550"]
ce = coexp[ai, bi]

this is stupid fast. To memory map the array, read in the idx_lookup and do 6294 queries takes 0.17 seconds on my machine--and much of that is the one-time cost of reading in idx_lookup. The code for all of this is readable here.
and svn'able via:
svn checkout http://bpbio.googlecode.com/svn/trunk/scripts/coex/

That includes the code to get the data (get.sh) convert it to the .bin file (to_array.py) and serve it as a web script (coexp.wsgi).


In most eukaryote organisms there are on the order of 30K genes (varying from maybe a couple thousand up to 60K or so). Each gene can have multiple sub-features: CDS coding regions, mRNAs, UTR's, introns, etc. So it's a pretty small amount of data, but large enough and with enough structure that it's nice to have it in an efficient data structure. After five years, I think am finally converging on a good abstraction. It holds 1 feature per row in a subclass of a numpy array. This hides some complexity arising from the fact that each gene feature has multiple exons and each gene can have a different number of exons (gff uses one row per cds * gene * mRNA * exon). Numpy has typed data "fields" which are something like columns. Common types are floats and ints of various precisions and python Objects. So, the exons start, stops are stored as a list of python integers in an object field of a python array and that python array is stored as a 'locs' field in each row of the numpy array. The .flat file format looks like this:
#id  chr accn    start   end    strand  ftype   locs
1 1 AT1G01010 3631 5899 + CDS 3760,3913,3996,4276,4486,4605,4706,5095,5174,5326,5439,5630

where the columns should be mostly understandable by the header. `start` is the left-most position of the feature on the genome and `end` is the rightmost. `ftype` tells the type of the feature(s) in the following column, `locs` which is a list of start,stops. most often ftype is `CDS` (or exon) for coding sequence--or the part of the gene that gets translated. From there, it's possible to take advantage of the speed and syntax of numpy:

>>> from flatfeature import Flat
# note also attaching the genomic fasta sequence.
>>> flat = Flat('data/thaliana_v8.flat', 'data/thaliana_v8.fasta')
# so can get coding sequence.
>>> cds_seq = flat.row_cds_sequence('AT1G01370')
# find all the gene names in a given chromosome, region:
>>> [gene['accn'] for gene in flat.get_features_in_region('1', 5000, 7000)]
['AT1G01010', 'AT1G01020']

# how many features on chromosome 4 are on the - strand:
>>> flat[(flat['seqid'] == '4') & (flat['strand'] == '-')].shape

The row_cds_sequence, while an ugly name, shows the advantage of tyeing the fasta to the Flat object--it allows extracting the genomic sequence from the fasta based on the coordinates in each row. It also highlights a problem with this data structure--what about alternative splicings which are very common? For our lab, we always use the union of all coding sequences so I report the resulting intervals as the 'locs' column. I intend to improve the flexibility of how that is handled in the code, if not the file.

As a side note, the final example in that session shows how you can think of the numpy slicing syntax as a sort of SQL-like selection. so: flat[(flat['seqid'] == '4') & (flat['strand'] == '-')] reads like: "select * from flat where seqid = '4' and strand = '-'", but there's no database, it's all in memory, and the work is all done in C, very quickly. Actually, I think flatfeature and the huge matrix could both classify as NoSQL.

The code for flatfeature is on github it includes an example .flat file for arabidopsis thaliana in the data directory. I'll be tinkering with that for some time. As with all my code, any forks, patches, suggestions or ridicule will be welcome (in that order of preference). Finally, to add that for many cases genometools works great, they do a nice job of wrapping GFF for a number of languages including python. But for some things GFF is too much trouble.