Saturday, May 23, 2009

displaying and serving big(ish) data with numpy and memmap

In this case, "big" is only 8 million rows, but I have it working for 40+ million extremely fast and it will work for anything that can fit in memory (and larger with a bit of extra work). So the idea is to create simple web "service" (where "service" is just the URL scheme I pull out of my .. uh genes ... ) implemented in a wsgi script.

A web request containing a width=XXX in the url will show the user wants an image. so a url like:
http://128.32.8.28/salktome/?seqid=1&xmin=491520&xmax=499712&&width=512
will give an image:

(geo-hackers will recognize that with a URL scheme like that, it'll be simple to put this into
an openlayers slippy map.) where the each horrible color represents a different tissue, and values extending up from the middle represent the top (+) strand (W) while those on the bottom represent the - or Crick strand. The heights are the levels of expression.
Without the width=XXX param, the service returns JSON of values for the + and minus strand for each basepair and tissue in the original file--which is less useful than having a summary of mean value for each tissue over the range:
http://128.32.8.28/salktome/?seqid=1&xmin=491520&xmax=499712&summary=t
gives:

{"+": [0.008325556293129921, 0.0072462377138435841, 0.0074741491116583347, 0.0017609233036637306, 0.010895536281168461], "-": [0.0088664013892412186, 0.017081102356314659, 0.009236418642103672, 0.0043859495781362057, 0.021286465227603912]}

which are the mean values from basepair 491520 to 499712 for the + and - strands and each of the 5 tissues. A single tissue can be requested by number as:
http://128.32.8.28/salktome/?seqid=1&xmin=491520&xmax=499712&summary=t&tissue=2
which gives:

{"+": 0.0074741492195734898, "-": 0.0092364182547917447}

Likewise, one can request and image with only a single tissue:
http://128.32.8.28/salktome/?seqid=1&xmin=491520&xmax=499712&&width=512&tissue=2



Transcriptome data often has enough rows of data that it's possible, but not really convenient to stick it into an RDBMS. the data I'm using is available here and looks like:

26 C 125.05 106.18 119.80 159.78 101.40
26 W 141.90 139.07 137.50 151.91 171.18
51 C 131.85 108.57 71.00 156.41 133.79
51 W 123.83 129.97 122.50 100.23 139.09
76 C 136.84 104.98 120.30 88.88 151.64
76 W 119.35 112.79 160.00 119.51 130.29
126 C 138.93 100.92 111.00 84.82 103.48
126 W 87.82 118.30 110.30 157.17 126.42
151 C 184.50 71.24 157.80 396.25 86.42
151 W 136.76 119.95 107.00 98.10 108.60
176 C 135.75 86.98 115.80 188.41 93.53
176 W 147.57 158.19 131.00 117.45 164.77
[... millions more ...]

where the first column is basepair position, the second is either 'W' or 'C' for the Watson or Crick strand of the (double-stranded) DNA sequence. Columns 2 and on are measurements of transcription (see the wikipedia article) for various tissues in the plant Arabidopsis Thaliana which we use because it has a small genome, it's well annotated and doesn't have too much repetitive DNA or too many transposons.

The data is somewhat irregular in that not every 'C' row has a 'W' counterpart and sometimes vice-versa, for those cases, I create a row to fill the missing row with the same values as the existing row. The code below runs through and creates numpy .npy files of the data:

It saves the positions and the actual transcriptome data into separate files because for the web service, the transcriptome files will be memory-mapped while the smaller position files will be read into memory. This code is run at startup (and so not done for every request):

from numpy.lib import format
for ichr in range(1, 6): # for 5 at chrs
schr = str(ichr)
# memmap these
seqids[schr] = format.open_memmap(os.path.join(path, \
'at.tome.data.%s.npy' % (schr,)), mode='r')
# read these into memory.
posns[schr] = np.load(os.path.join(path, 'at.tome.posn.%s.npy' % schr))

So that the seqids dict has values that are the memmaped contents of the expression data in the .npy files. For each web request, it then uses numpy's searchsorted to do a binary search so that when a user requests &xmin=1234&xmax=4567 searchsorted() finds the position in the array where 1234 would fall (exactly like python's bisect module). and like wise for 4567. The pair of array indexes from the searchsorted() calls with xmin and xmax give the indexes into the tome array for which to grab the data:

minidx = posns[seqid].searchsorted(xmin)
maxidx = posns[seqid].searchsorted(xmax)

data = seqids[seqid][minidx: maxidx]
data_idx = posns[seqid][minidx: maxidx]

where data contains the values of the transcriptome data and data_idx contains the associated base_pair positions. I could have saved the basepair positions in the seqids/data numpy array, but since searchsorted() will likely traverse many chunks of the array, I think it's best to have that part in memory rather than memory mapped.
The rest of the code is a bunch of if statements (which I should really spread across multiple functions...) deciding whether to show an image with matplotlib or grab a summary of the data and return it via simplejson. The gist of the entire wsgi script is at the end of this post.
An example of what this looks like in an openlayers map with gene annotations is in the image below.


where one can see some correlation between where the (green) genes are and the higher transcriptome levels. In summary, it's nice to be able to take data in this format and write a 120 line script which provides full, fast access both to the raw data and to images which we can use to find patterns to explore.


Tuesday, April 21, 2009

Needleman-Wunsch global sequence alignment

I've written a simple, fast, python version of Needleman-Wunsch as I couldn't find one to use. It uses Cython and specifically cython-numpy goodness. It's easy-installable as:
 sudo easy_install -UZ http://bpbio.googlecode.com/svn/trunk/nwalign/

or via svn from:
svn co http://bpbio.googlecode.com/svn/trunk/nwalign/

it will put an executable 'nwalign' into /usr/bin/ which when run will give this message:

Usage:
nwalign [options] seq1 seq2


Options:
-h, --help show this help message and exit
--gap=GAP gap extend penalty (must be integer <= 0)
--gap_init=GAP_INIT gap start penalty (must be integer <= 0)
--match=MATCH match score (must be integer > 0)
--mismatch=MISMATCH gap penalty (must be integer < 0)
--matrix=MATRIX scoring matrix in ncbi/data/ format,
if not specificied, match/mismatch are used

where the matrix is optional but can be the full path specifying the transition score from row to column as here.
If the matrix is specified, match and mismatch are not used. If the matrix is not specified, match and mismatch are used. a typical run looks like:

$ nwalign alphabet alpet
alphabet
alp---et

$ nwalign --matrix /usr/share/ncbi/data/BLOSUM62 EEAEE EEEEG
EEAEE-
EE-EEG
And usage from a python script can be seen in test.py.

I wrote this for a colleague who was using a perl script. This is over 100 times faster than the perl script for long sequences--and the perl script had the BLOSUM62 matrix hard-coded as a hash. There are a couple places that could still be sped up, but I think the improvement will be relatively small. This kind of script is perfect for the numpy-cython mix as I can just access the contents of the 2d array at c-speed without having to do pointer arithmetic. If there's any obvious optimizations I missed, I'd be glad to know them.

Thursday, April 16, 2009

python object initialization speed

On the Cython mailing list, I saw this mentioned for avoiding init overhead, so i wrote up some code to try it. Basically, instead of using an __init__, it uses the PY_NEW macro (which I don't pretend to understand fully).
I ran a benchmark with 5 cases:

  1. PY_NEW macro (still has python overhead for each call to the creator function)

  2. regular python init

  3. python init using __slots__

  4. cython init (cdef'ed class)

  5. batch PY_NEW: calling PY_NEW from inside cython to avoid python call overhead

  6. batch init on cython class


the timings look like this:

PY_NEW on Cython class: 1.160
__init__ on Python class: 30.414
__init__ on Python class with slots: 10.242
__init__ on Cython class 1.185
batch PY_NEW total: 0.855 , interval only: 0.383
batch __init__ on Cython class total 0.998 , interval_only: 0.540


So, the PY_NEW is .383 compared to .540 for using a __init__ on a Cython class, but both are much faster than python. I was surprised that using slots gives a 3x speed improvement over a regular python class. That Cython is faster is no surprise.
Stefan Behnel explains better than I could.

All the code is smashed uncomfortably into this gist.

for kicks, i tried with unladen-swallow. It comes out almost 2x faster on the python times both with and without slots. I didn't use the optimization stuff. Cython even works with unladen-swallow--just have to rebuild the .so--and the timings are the same as Cython-with-CPython.

Sunday, April 12, 2009

apache mpm-worker with php on low memory servers

I'm partly writing this because I think such valuable info is too hard to find, I'd read a lot that if you want php, you have to use mpm-prefork but it's not true!

I've been using slicehost for dev server for almost a year now. Since my 1 year deal is up, I decided to switch to their affiliate mosso. I really like slicehost, but Mosso "cloud servers" seem a good fit for a server that goes through spurts of development and use followed by weeks of non-use. So now, I can keep it as a 256MB instance at about $10/month and update to a larger instance when doing real dev.
I built it today as a 1024MB instance -- installed all my usual stuff, and updated my build script for ubuntu. That's here.
The machine I'm on is extremely fast, normally I set GDAL building and leave, but it finished before I had a chance. After all was built, I resized it to a 256MB server -- that took 12minutes, but my instance was accessible for at least 10 of those.
After that, I log right back in and all works fine. On this dev server, I have to run PHP (at least I dont have to code in it). I've been getting tired of having to use apache's mpm-prefork because ubuntu won't let me have php and mpm-worker. (For more from someone who actually knows what they are talking about, Graham Dumpleton, the author of mod_wsgi, has a good write up here.) So, every apache process takes up a good chunk of the available memory and things are sloooow. Even just running trac, it starts swapping.
I found a good post for using fcgi for php and followed it blindly and all works!
Using worker_mpm, Trac (via mod_wsgi) is so fast I doubled checked to make sure my instance was still at 256MB. After setting ThreadStackSize to 1.5MB trac takes up only 16% of the available memory. The relevant bits of my apache config look like:

<IfModule mpm_worker_module>
StartServers 2
MaxClients 40
MinSpareThreads 5
MaxSpareThreads 20
ThreadsPerChild 20
MaxRequestsPerChild 5000
</IfModule>

MaxKeepAliveRequests 1000
Timeout 30

# i keep this fairly high for serving image tiles
# from mapserver
KeepAliveTimeout 40

# default is 8 per child process, only use 1.5MB
ThreadStackSize 1500000

AddHandler fcgid-script .php
FCGIWrapper /usr/lib/cgi-bin/php5 .php

and then in 000-default, i added +ExecCGI to the DocumentRoot Directory Options.
If any of that config is not sane, let me know.

I should say that I have no idea how PHP will do under fast-cgi, if it uses more memory, or not, or if it will have any problems, on quick look, everything seems ok.

Apparently, you can also use mpm-worker with php if you build php from source. Since using fcgi worked so easily, I didn't try that.

Friday, February 20, 2009

biohash

This is a quick project I did a while back but, I've seen people interested in similar ideas, so I'll post my implementation here.

Geohash encodes latitude/longtitude locations into a string such that "nearby places will often present similar prefixes" and the longer the string, the greater the precision. Using this python implementation by Schuyler as a reference, I ported the concept to a "biohash" which can encode intervals. It works in a similar fashion, starting with the extremes and halving the space until it finds the smallest space that contains the interval.

The use to allow efficient search of intervals using a BTree index, as in any relational db. It's implemented with only a dumps() and loads() function after the pickle interface. The dumps function takes start and end args and returns a 1/0 encoded string. The loads takes a 1/0 encoded string and returns the tightest interval it can given that string. Both functions take a rng kwarg, which can be as small as the maximum end value. If all the intervals are small, and the rng is very large, the biohash will not help much. The rng used to load must be the same as the one used to dump or the values won't be correct.

I had plans to finish up a set of SQLAlchemy models for this that would save the hash and use it to do range queries behind the scenes, but haven't finished that up yet. The code is in my google SVN. It will even pull in a fast cython version of the encoder.
If anyone wants to use it, improve it or get it going with SQLAlchemy, it available from SVN with:
svn checkout https://bpbio.googlecode.com/svn/trunk/ihash/
it has tests in __init__.py

Friday, January 09, 2009

Stupid things I did as a Bioinformatics Programmer in 2008

In 2008, I was good enough at programming to get my ass kicked by hard problems. I think that's the most positive way to say it.
My main bioinformatics project was a long annotation pipeline. It takes days to run, often using 8 CPUs. It's driven by a big ol' Makefile. I made the mistake of passing data between steps in un-structured text files or python pickles.

I'd create one at the beginning of the pipeline and not notice it was messed up in a way that affected other parts until the entire pipeline was done, days later. Toward the end of the pipeline, I'd need something simple, like the strand of a BLAST hit, but I'd have to parse through an entire GFF file, or load some huge pickle into memory just to get to that. Then I'd need some annotation, and I'd have to add a slow step of doing a lookup in a script that'd otherwise run very quickly.

I was passing around data in arrays and tuples, so then when I changed the order or added another element in the script that was creating the tuple, a downstream script that was using the tuple would be using the wrong index to access data. If I was lucky, my code would fail, if not, it'd be using the strand when what it should have had was the start location.

I hit problems where I'd run out of memory. At one point, I ran out of disk space (it's a big series of datasets), hit bugs of software I was using.

When I should have run just 1 chromosome to test the pipleline in 1/100th (comparing 10 chromsomes to 10 chromosomes) the time, I ran it over an entire
genome.

When I should have taken the time to really fix small mistakes as I found them, I instead worked around them, making the code unnecessarily complex as a result. If I had fixed instead of fudged in those cases, I would have been more productive.

I did write tests, but not enough, and I didn't set up the project in a way that it was really testable. I'm still learning how to do that. All the other stuff may just be discipline, but the testing is very difficult for me in bioinformatics. I've extracted what I could into tested libraries and added checks for the intermediate data, and every time I found a dumb error, I'd add an assert. (there's a discussion of pretty much exactly the problems I'm describing here: http://ivory.idyll.org/blog/sep-08/the-future-of-bioinformatics-part-1b.html )

I'd assume that it was ok to use a tuple, because I was only going to store start and stop, but then later I'd need to add chromosome, then strand, then score, and pretty soon i'd have code elsewhere like:
if cns[2][4] < cns[2][5]: 
...

where i have no idea what i'm comparing there. That one's simple to fix, just use objects, or dicts at the very least, but a 2-tuple is so tempting, and
a 3-tuple not so bad, and ...

On top of all this, the project was changing as I was working on it, so I was changing what went in to the start of the pipeline, what we were getting out, and the steps I was doing. And because of all the extra little hacks in there, I would be stuck in some function far away from the data that I needed (*).

It sucked. I can make those the sort of mistakes on a project that runs in 5 seconds and has a very simple, and relatively known output. But, not for complex pipelines.

It's been painful watching myself do such stupid stuff, and then reading the code afterward, but I think I've learned a lot. Much of that code still sucks, but I've moved to more rigid data-structures. Every time I make a change now, I do it for real, it's not just hacked in. I have to deal with someone pacing around, waiting for the results and asking me questions like "why is it so hard to just add in the RNA?" or whatever. Also, statistically speaking, I'm
probably running out of mistakes I can make...

And actually, the most difficult things have been inter-personal relationships, but that's for another post--as are the more positive, awesome things I did with projects I set up correctly, and had good test coverage...
And finally, I did make a lot of mistakes, but this has been genuinely a difficult problem.



* see this post, especially the last 4 paragraphs. It's good to hear that someone with 43 years of programming language has the same problems as I do.

Wednesday, December 17, 2008

landsummary.com

One of the things I like the least about my real job, and much of the contract work that I do is that i'm usually the only programmer working on each task. So, it's been very fun to work on a project with Josh Livni (His writeup). We got together one afternoon, and by the time we left, we had a reasonable start of what we call landsummary, we've since put in a fair bit of work sprinkled here and there. Josh set up an AWS server--it's nice for me to have fewer sys-admin duties too.
What's actually on display is fairly modest. What it does is takes a user-drawn square, circle, or arbitrary polygon, and uses that to summarize the NLCD dataset along with some census data. The things that make it more than lame are that it's very fast, it can easily be extended to summarize any raster dataset, and we have a sorta cool API (not documented) which allows us or anyone to query the data with a WKT Polygon and request a particular service -- currently nlcd, population, weather.precip, and a couple of services with environmental engineering application. All of them use the same libraries--postGIS for doing the census related stuff, and GDAL, gdal_array for doing the raster (currently just NLCD) queries. Josh handled all the census data, I know very little about that, except that whenever I've tried previously, it's been a pain to work with, and now, Josh has a nice set up for it. I took the lead on the raster summaries. For that, I wrote a little library that wraps gdal_array, so you can take a GDAL datasource:

>>> g = AGoodle("something.tif")
>>> a = g.read_array_bbox([xmin, ymin, xmax, ymax])

And then 'a' is a numpy array with all the niceties that entails. So, if we want to get just the food cells, which have values of 81 and 82 in the NLCD dataset, it's just:

>>> a[(a == 81) | (a == 82)]

For arbitrary polygons, we use a surprisingly fast function from matplotlib to mask anything that's not inside a list of verticies (polygon). Then do any summary stats on the masked array.

Thanks to Josh, we have a fairly nice django project structure, with separate apps for each little analysis we've added. In my previous django projects, I've dumped everything into a single app and hacked away, the structure we have now makes it easier to keep what's needed in my brain. Also, when hacking with someone, I'm less likely to put in total crap code. Josh has already had a good laugh at some code where I found 25 closest weather stations using:

SELECT * from stations ORDER BY ABS(lat - ?) + ABS(lon - ?) LIMIT 25

then sorted those 25 using geopy.distance to make sure it was the real distance. In my defence, I really wanted to use vanilla sqlite and so didn't have postGIS at my disposal--also, it was quite fast for only 6,000 stations. We've since dumped it all into postGIS. There's probably a couple of other gems in there.

So, back to the modest functionality part... Actually, it turns out, this is a fairly difficult thing to do in McClick software--time consuming in user and processor time. So, having a way to click a point and see land-use stats and population data appear in about a second is pretty cool--and it's on the web. We've already found a couple folks with interesting applications, and we're interested in finding more--the original motivation was 'foodmiles'-- from this post. And there's a couple things we'll probably add in from that, people I happened to hear talking in a cafe today were talking about foodmiles and seemed interested in incorporating the carbon foot-print of exporting / importing food vs. growing locally. My friend, Megan also has lots other ideas for things that firms commonly do McClick style with the NLCD data.
There's more info on the about page, but suffice to say we make full use of all the usual open-source GIS, science tools.

Monday, November 03, 2008

python interval tree

EDIT: added a couple points inline.

I'm obsessed with trees lately -- of the CS variety, not the plant variety. Although we are studying poplar, so I'll be using trees to study trees.
I'd tried a couple times to implement an interval tree from scratch following the Wikipedia entry.
Today I more or less did that in python. It's the simplest possible form. There's no insertion (though that's trivial to add), it just takes a list of 'things' with start and stop attributes and creates a tree with a .find() method.
The wikipedia entry that baffled me was about storing 2 copies of each node's intervals--one sorted by start and the other by stop. I didn't do that as I think in most cases it won't improve lookup time. I figure if you have 1 million elements and a tree of depth 16, then you have on average 15 intervals per node (actually fewer since there are the non-leaf nodes). So I just brute force each of those nodes and move to the next. I think that increases the worst-case, but makes no difference in actual search time--with the benefit of halving storage.

EDIT: now the version in my repo keeps the intervals sorted by start, so it can avoid doing the brute for search at each node during a search when search.stop < node.intervals[0].start. This did improve performance.

The tree class takes a list of intervals and calculates a center point. From there it partitions them into left, overlapping, and right in terms of their relation to the center point. Overlapping are assigned to the current node, and left and right are recursively partitioned in that fashion until there are only `minbucket` intervals per node, or the specified `depth` has been reached AND there are fewer intervals than `maxbucket`. So a tree can have a greater `depth` than requested if it would otherwise have more than `maxbucket` intervals in a single node. The Wikipedia version doesn't have maxbucket or minbucket...

EDIT: the maxbucket actually only works on leaf-nodes, and has no effect otherwise.

I'm sure that's painfully obvious for anyone who's ever taken a CS course, but it was foggy at best for me until I implemented. Below is the entire implementation:

class IntervalTree(object):
__slots__ = ('intervals', 'left', 'right', 'center')

def __init__(self, intervals, depth=16, minbucket=96, _extent=None, maxbucket=4096):

depth -= 1
if (depth == 0 or len(intervals) < minbucket) and len(intervals) > maxbucket:
self.intervals = intervals
self.left = self.right = None
return

left, right = _extent or \
(min(i.start for i in intervals), max(i.stop for i in intervals))
center = (left + right) / 2.0


self.intervals = []
lefts, rights = [], []


for interval in intervals:
if interval.stop < center:
lefts.append(interval)
elif interval.start > center:
rights.append(interval)
else: # overlapping.
self.intervals.append(interval)

self.left = lefts and IntervalTree(lefts, depth, minbucket, (left, center)) or None
self.right = rights and IntervalTree(rights, depth, minbucket, (center, right)) or None
self.center = center


def find(self, start, stop):
"""find all elements between (or overlapping) start and stop"""
overlapping = [i for i in self.intervals if i.stop >= start
and i.start <= stop]

if self.left and start <= self.center:
overlapping += self.left.find(start, stop)

if self.right and stop >= self.center:
overlapping += self.right.find(start, stop)

return overlapping

Only 45 lines of code. I had added a couple extra attributes so that searching could do fewer checks, but it only improved performance by ~15% and I liked the simplicity. One way to improve the search speed, and the distribution on skewed data would be to sort the intervals at the top node, so they'd then be sorted for all other nodes. Then instead of using center = (left + right)/2, It'd could use the center point of the center interval at each node. That would also allow short-circuiting the brute-force search at the top of the find method with something like:

if not (start > self.intervals[-1].stop and stop < self.intervals[0].start):
overlapping = [ .. list comprehension ]

But all told, that adds 5 or so lines of code. Oh, and depending on how it's used, it's between 15 and 25 times faster than brute-force search.

EDIT: I added the above check, but it can only do the 2nd comparison "stop < self.intervals.start as the first is invalid given a very long interval. Regarding speed, the smaller the search window, the better the performance improvement. The code is now > 20 times as fast brute force for a very (speaking in terms of looking for genomic features) large swath of 100K. with a search space of 50K, it's 50+ times as fast as linear search.

The full code (including a docstring with homer simpson quote) is in my google code repo. If I've made obvious mistakes or you have improvements, I'd be glad to know them.

Saturday, October 25, 2008

twill with XHTML (not viewing HTML)

Since I couldn't find this anywhere, I'll add it here for those who have the same problem:

I was trying to test a website with twill and got this at the end of my traceback:

raise BrowserStateError("not viewing HTML")
BrowserStateError: not viewing HTML

After spending a bunch of time making sure that, yes, it was spitting out HTML, I figured out that it specifically means that twill (actually mechanize) doesnt like XHTML.

You can likely fix it by adding this at the top of the script:

b = twill.get_browser()
b._browser._factory.is_html = True
twill.browser = b

Presumably, there's a real reason that check is in place, but works-4-me...

Friday, October 24, 2008

appengine memcache memoize decorator

I've been playing with google appengine lately. I'm working on a fun, pointless side project. Here's what I came up with for a cache decorator that pulls from memcache based on the args, kwargs and function name if no explicit key is given. The code for creating a key from those is from the recipe linked in the docstring.
"""
a decorator to use memcache on google appengine.
optional arguments:
`key`: the key to use for the memcache store
`time`: the time to expiry sent to memcache

if no key is given, the function name, args, and kwargs are
used to create a unique key so that the same function can return
different results when called with different arguments (as
expected).

usage:
NOTE: actual usage is simpler as:
@gaecache()
def some_function():
...

but doctest doesnt seem to like that.

>>> import time

>>> def slow_fn():
... time.sleep(1.1)
... return 2 * 2
>>> slow_fn = gaecache()(slow_fn)

this run take over a second.
>>> t = time.time()
>>> slow_fn(), time.time() - t > 1
(4, True)

this grab from cache in under .01 seconds
>>> t = time.time()
>>> slow_fn(), time.time() - t < .01
(4, True)

modified from
http://code.activestate.com/recipes/466320/
and
http://code.activestate.com/recipes/325905/
"""

from google.appengine.api import memcache
import logging
import pickle

class gaecache(object):
"""
memoize decorator to use memcache with a timeout and an optional key.
if no key is given, the func_name, args, kwargs are used to create a key.
"""
def __init__(self, time=3600, key=None):
self.time = time
self.key = key

def __call__(self, f):
def func(*args, **kwargs):
if self.key is None:
t = (f.func_name, args, kwargs.items())
try:
hash(t)
key = t
except TypeError:
try:
key = pickle.dumps(t)
except pickle.PicklingError:
logging.warn("cache FAIL:%s, %s", args, kwargs)
return f(*args, **kwargs)
else:
key = self.key

data = memcache.get(key)
if data is not None:
logging.info("cache HIT: key:%s, args:%s, kwargs:%s", key, args, kwargs)
return data

logging.warn("cache MISS: key:%s, args:%s, kwargs:%s", key, args, kwargs)
data = f(*args, **kwargs)
memcache.set(key, data, self.time)
return data

func.func_name = f.func_name
return func

Friday, October 03, 2008

genedex.fasta with numpy.memmap

EDIT: added job posting to comments.
I've been working a bit on genedex, I'm still not happy with the way it stores features. Which is a huge pickle of dictionaries where every dictionary is a 'feature' that looks like: {'name':'At2g26540', 'start': 1234, 'stop': 3456, 'strand': 1, 'chr': 2}. So the only way to do a search is by location--and that is _very_ fast, thanks to rtree, but there's no way to search by name or any other attribute--and an entire organism is loaded into memory at once--that part actually works out ok, but it feels dirty. I quickly wrote an SQLAlchemy backed interface to a simple db schema do allow this sort of searching here: http://code.google.com/p/genedex/source/browse/trunk/genedex/models/sqla.py. That already supports Feature.upstream(), downstream(), etc. methods, but it will work nicely once python supports sqlite rtree without any extra work--for now, it just uses BTree indexes on the start and stop. I could use rtree to index the sqlite database, but I'd like to move away from the LGPL. Maybe this KDTree that's already in a scipy branch with a more permissive license. Then it could do both spatial, and attribute queries...
That's all tinkering...

I also cleaned up the genedex.fasta module. The useage is nice, if not entirely the implementation. A fasta file can look like:

>chr1
ATGTCGTCGGCCGC
GGGCCAAGA
CAACGGAGA

>chr3
ATGGAGGAGGCTGGCGAGCGG

>chr2
ATGGCGTGC
ACGGCGGCG
CGCATGTT
CGCCT

where a line starting with > is the name of the sequence ('chr1') and everything up until the next > is the sequence. The problem is the newlines, so every time you want to look at chr1 basepairs 10 to 20, you have to find where the sequence starts, and account for newlines. -- Acutally one should never do that, as Biopython, and pretty much any library will take care of that for you. Pygr for example, creates a new file something.fasta.pureseq which removes all newlines and labels and indexes where the starts and stop are. genedex.fasta.Fasta now does something similar, here's example useage on the file above ('123.fasta').

from genedex import Fasta
f = Fasta('123.fasta')
print f.keys()
print f['chr1'][9:20].tostring()
print f['chr1'][9:20]

after, the fasta file looks like this:

>chr1
ATGTCGTCGGCCGCGGGCCAAGACAACGGAGA
>chr3
ATGGAGGAGGCTGGCGAGCGG
>chr2
ATGGCGTGCACGGCGGCGCGCATGTTCGCCT

with spaces removed--so it's still a valid fasta file (you can also send an argument to the constructor and it will create a new file) and there's a new file 123.fasta.gdx that is a python pickle containing:
{'chr3': (45L, 67L), 'chr2': (73L, 105L), 'chr1': (6L, 39L)}
which indicate the start and stop positions of the sequence in the file.
So the file remains a valid fasta file, but now it can be efficiently sliced. For now, it actually uses a numpy mmmap (numpy.memmap), to take advantage of the broadcasting, other than that, it'd be simpler to just use python mmap. So, when it sees f['chr1'][10:20] it acts just like it's indexing a numpy array, but it's accessing the data directly from the disk (well, not actually, but mmap works it's magic and I dont have to think about that). I like that--I can keep my fasta files as valid, add only a small python pickle file, and get simple, fast, pythonic indexing into them. It does take about 12 seconds to index and flatten the entire sorghum genome (629MB, ~660 million basepairs) on first view, after that first time, it's instantaneous.
Anyway, the source is available and easy_install-able:

svn checkout http://genedex.googlecode.com/svn/trunk/ genedex
As always, I'll gladly take any improvements, bugs, enhancements.

Also, our lab at UCB is looking for another (full-time or close to it, on-site) programmer who knows some biology, perl, and hopefully some python. If you're interested, or know anyone, send me an email. I have no real authority in the matter (or any matter) but I will have some say in this. I'd like to work with someone I can learn from. I'll add a link to the job posting in the comments below once it's posted.

Wednesday, August 06, 2008

choosing django

I prefer sqlalchemy and genshi (or mako) and was therefore looking at using turbogears, but I saw a demo of the django admin, and that sold me. Certainly the templating language did not. Before this, I'd only used web.py in my projects. These are the things I've liked/noted:


The first and most important: community. (Oddly enough, as I write this there are 666 projects tagged as django. 'turbogears', 'tg2', 'tg' give less than 50 projects combined. Think someone might have already written what you need? yep.
Also, a great site: http://www.djangosnippets.org/, where I've learned a lot just by reading, and saved myself a lot of time, by extending ideas there.
And the development is active.

Second, django.contrib.*
  • User authentication is simple, and check google-code for various alterations on the theme.

  • admin. This was what first made me decide on django. And now, new-forms admin is in trunk. This gives you a pretty nice CRUD interface for models in your app. In the app I've been working on, we have row _and_ field level permissions. We also need to let users edit the fields of other users stored in the database--but only certain fields, with which fields depending on the user viewing and the user being edited. This bit is more hacky than it should be, but quite simple.
    My biggest gripe about the admin is that it's too complicated to use custom widgets or validation. Hopefully, this will change.
    Oh, and it's too hard to have read_only privileges. (Yeah, I know the admin mantra).

  • GIS: nuf said.


i18n, t9n. Anywhere there's some text to be displayed in the app, I wrap it in ugettext_lazy (aliased to _), and later, dump it and send it to someone who knows Chinese. When it returns, I make messages, and the app will show in English or Chinese depending on browser preferences. Simple.


docs/help: yeah, the docs may have trouble keeping up, especially with recent rate of change, but it's easy enough to find what you need, and the django official docs are pretty nice. And if you can handle the fact that most responses you'll get on #django will begin with "of course, ...", then it's a great place to get help.


ModelForms: I've just started using these, but, they've already saved me a lot of code.




There's a lot of other nice things, and a lot of django that I don't even know. I'd still consider myself a newb, but it's still possible to get sh*t done.

Friday, June 20, 2008

pylab matplotlib imagemap

UPDATE 7-10-08:
+ add example for scatter plot
+ link to ken-ichi
===
Figuring how to make a client side image map from a matplotlib image has stumped me more than once. Andrew Dalke does have a working example. Below, I have the minimal example.

It's simple once you get the steps right: just use mpl's transform() to convert the data into the image's coordinate system. Then flip the y-axis as required by the imagemap, then do the normal imagemap stuff and save the html. The only real gotcha, is to make sure to put the dpi in the call to savefig().

import pylab
import sys
import random

name = 'imap'

# make some fake data
xs = range(15)
ys = [random.choice(xs) for i in range(len(xs))]

xys = zip(xs, ys)

# can also use : f = pylab.subplot(121)
f, = pylab.plot(xs, ys, 'ro')
dpi = f.figure.get_dpi()
height = f.figure.get_figheight() * dpi

# convert the x,y coords into image coords.
transform = f.get_transform()
icoords = transform.transform(xys)


# the minimal 'template' to generate an image map.
tmpl = """
<html><head></head><body>
<img src="%s.png" usemap="#points" border="0">
<map name="points">%s</map>
</body></html>"""

# change this as needed, e.g. if not plotting points.
fmt = "<area shape='circle' coords='%i,%i,2' href='http://example.com/%i/%i' >"

# need to do height - y for the image-map
fmts = [fmt % (ix, height - iy, x, y) for (ix, iy), (x, y) in zip(icoords, xys) ]

# NOTE, this dpi is needed!
pylab.savefig('imap' + '.png', dpi=dpi)
print >> open(name + ".html", 'w'), tmpl % (name, "\n".join(fmts))

UPDATE:
When trying to figure how to do this for a pylab.scatter plot, I found Ken-ichi had also done this for a scatter plot.
As of a matplotlib trunk revision 5711, the transform does not get set when the scatter plot is drawn. To set it, I had to set the transform explicitly:

s = pylab.scatter(xs, ys)
s.set_transform(s.axes.transData)
transformed_xys = s.get_transform().transform(zip(xs,ys))

Wednesday, June 04, 2008

binary search over intervals

[EDIT: update location of code repo]

This isn't particularly advanced or clever, it's just a simple implementation--less code than anything else I could come up with.

Binary search is easy. Just look at the python std library implementation (and the C API version). When you play the game with a friend of guessing a number between 0 and 100, you guess 50, your friend tells you "higher", you guess 75. That's pretty much binary search. It takes about 7 guesses max to guess a number between 0 and 100. It just requires that the numbers be in order.
Interval search is more difficult. It's not just looking for a single value, but rather for a range of values that overlap a given interval. Also, you can't sort on just start, because some intervals are longer than others, so when sorting by start, the stops may be out of order. So, you have to arrange some protocol. In all the examples I've seen, including the explanation here, that means storing not only the start, but the stop and often the center of every interval--and using them to do the search. That makes things considerably more complicated than binary search. I started with an implementation of interval search here, but couldn't figure out how to customize.

Binary search is kind of a special case of binary search where intervals are of exactly length 0, e.g. start == stop. So, if all intervals are of exactly length 2? Well, then you can sort by start, find the left-most index by looking for start - 2 and find the right most index by searching for the (highest) index of start. That will give you the indicies in the sorted array of all intervals that overlap the query. The highest and lowest correspond to python's bisect.bisect_left, and bisect.bisect_right respectively.
This carries to any length. But, if all the intervals are different lengths. Well, then you can save the longest length, and then for any search, it's:
p_overlaps = intervals[bisect_left(start - max_len):bisect_right(stop)]
but that only gives the putative overlapping. Since we extended the left by max_len, and we may have found an interval whose length was < max_len (meaning its stop is before the start of the query) we have to explicitly test for distance:
real_overlaps = [iv for iv in p_overlaps if distance(query_interval, iv) == 0]
which gives only the intervals that overlap. So, that part is the price to pay for the simplified search. Another way, as suggested in the wikipedia article is to store the length of each interval as part of the interval.

In this setup, the worst case scenario is when a single looooong interval covers the entire range of the list of intervals. Then every search is linear, brute-force search. However, my use for this is genomic data. There, I'll have an entire range of say... 50 million, and the intervals (genes) are rarely longer than 4000 in length (basepairs). So, it's useful to optimize simplicity.

My cython version of this is in my googlecode repo here. It has all the stuff I use, methods for left(), right(), upstream(), downstream(), nearest_neighbors(). Most of the searching work is in the binsearch_* functions -- I couldn't use the python ones. There are a couple of hacks in there:
1) because pyrex/cython don't support closures
2) the left() method is confusing because the intervals are sorted by start, and the left() has to find the nearest intervals by stop(). That's where complexity is increased because of this setup.

On my machine, it creates a tree with 6815 intervals in .016 seconds and does 50000 searches on that tree in .14 seconds. It seems to scale well with the number of features as 50K searches on 68150 features takes .50 seconds. Adding an interval that covers the entire range of all other features (results in worst-case linear search) makes the 50K searches (on 6185 intervals) take 1.54 seconds--which is only so good because the brute-force in method is pretty close to c-speed. This could be optimized by saving the stops in a separate array, or by saving long intervals in a separate array, but it's rare enough, and the code is simple enough as-is, that I'll probably leave it for now.

The "proper" way to do this is with an interval/segment tree, of which there's a very readable version in bx-python. If I'd found that earlier, I probably wouldn't have coded this... The tree is faster for larger number of intervals, but that's rarely going to be an issue, it does take much less memory...

Wednesday, May 21, 2008

wherecamp

I agree with every report I've seen. Wherecamp was awesome. I've been telling people, and I'm still sure it's true, that I met exactly zero people who I'd physically seen before. Ordinarily, I avoid meetings, but this is a good format and seems to attract good people. It's fun to meet and work with people who are really into what they do. The talks are less "talky" and more like chat sessions--which is possible when the groups are small.

There was also plenty of time to hack, which was the original reason I went. During and after, I learned some simple things which I'm trying to incorporate into my usual workflow:
In the shell, background a job with "ctrl + z" then get back to it with %i where i is the number shown in the output from "jobs". That's a trick from jlivni.

From crschmidt, I added:
alias doctest="nosetests --with-doctest --doctest-extension=.txt"
to my .bash_aliases. Which let's me do:
doctest tests/
or
doctest tests/test_somefile.txt
to run my doctests instead of python -c "import doctest;doctest.testfile('...')"

And springmeyer showed me a ton of django and geodjango. The admin stuff is just ... nice -- it's how making a db front end should be. I still don't know how to learn that stuff on my own, it seems a lot of it, you just have to know which modules to import and the django book doesn't cover newforms or the new admin stuff as far as I can tell.

Quotes


On friday night, we met up in SF to do some hacking, the never went down, as I couldnt get wireless and it turned into more of a real bar trip. We, were however, talking about python. At one point, it was sorta quiet and out of the silence, comes:
"Python sucks"
from a true lisp hacker in the next booth--complete with curly grey beard and spectacles. He actually turned out to be a cool guy, I think maybe he even admitted that if he couldn't use lisp, python was a reasonable choice--I think that's about as much as you can expect from a lisper.

From crschmidt:
"I don't really know python that well"
Then who was it that basically rewrote featureserver between the hours of 2AM and 9AM when everyone else was sleeping?

Wednesday, May 14, 2008

slicehost, trac, wherecamp

I have a development "server" here beside me. It's actually a budget laptop that sold for $799 2 years ago. It's a xubuntu machine, hosting a trac instance, a development server for mapping stuff, postgresql/postgis, mapserver, mysql, and couple svn repos, anything I do for contracting, etc. Oh, and it's also hosting a couple of sites for the multi-national company that my gf works for! all of their servers are windows machines (long rant suppressed).
It used to get warm, so propped it up on 4 tuna cans, 1 for each corner, now it stays cooler. Yep, it's a sweet setup.
Anyway, I pay AT&T or SBC --or whatever they are now called-- for static IP's and a supposedly faster internet connection. My 1 year contract for that is nearly up, so I'm switching to slicehost.

I'm not a sys-admin, I sorta do that for 4 gentoo (not my choice) machines at $work, and my strategy is to set up, rsnapshot and never, ever emerge -u world. ever. So far, it's mostly working. When I have the choice, I use (x/k)ubuntu, I don't care if they do magical stuff (or even if I had to redo my ssh keys today), it just works.

Anyway, I want something easy and idiot proof. I'd heard the hype about slicehost when it came out, and figured it was just that, hype. It's not. I've never used shared hosting before, but this was pretty simple. From entering my payment to ssh'ing as root into my slice took ~ 2 minutes. /proc/cpuinfo shows it's a machine with 4 opteron dual-cores. I started with a 256 slice with back-up. You can start new slices and restore from the backups. That's cool, and less $$ than I pay AT&T for static IPs and faster uploads.

I ran a script to apt-get all the packages I use, and had a base, working system in under 1 hour. They have a lot of articles about how to set stuff up, mostly basic (even for me), but I followed their info on setting up iptables. Predictably, I forgot to leave open port 22 and locked myself out of ssh, but they have a web-based console, so, not a problem. It seems to be idiot proof...

Trac

Also, in the theme of things that just work as they should, upgrading to Trac 0.11 (still in rc).

sudo easy_install -UZ Trac==0.11rc1
cd /path/to/trac/project/
sudo trac-admin . upgrade
sudo trac-admin . upgrade wiki
sudo apache2ctl restart

WhereCamp


I'll likely be at wherecamp, it'll be good to learn some stuff, and meet people I only know from IRC. If anyone needs a ride from the east-bay, let me know.

Thursday, May 08, 2008

openlayers, genomes and image-maps

In response to Titus' post on using imagemaps for genomic visualization:
Why are imagemaps so popular in genomics? As an extreme and unfair comparison, just imagine if http://maps.google.com was an image map.
Given a CGI script that can accept a url like
&start=1024&stop=2048&chr=3
and return an appropriate image, you can provide a substantial set of tools using openlayers, which is developed by what must be one of the largest and active developer communities in GIS. (Yes, I am an openlayers fan-boy.)
You can do that with a small addition to openlayers which I updated a couple weeks ago to OL version 2.6. In that update, I removed > 140 lines of code. So, it's now even less of a change to OL. Maybe when 2.7 comes out, I'll figure out how to provide a patch that allows an extra argument to the OpenLayers.Map constructor that limits panning to the horizontal direction -- in which case genome-browser will cease to exist and only the single file containing OpenLayers.Layer.Genomic would be needed.

Maybe I'd need to make a real example of using openlayers for genomics, making more obvious use of layers, the vector stuff, fractional zoom, markers, geocoding, projections, power steering, etc to make it more obvious how badass OL can be. As far as I know, I'm the only one using it for genomics, and 98% of the time I spent developing it was in my spare time--I only have a good excuse to hack on it at $work when something breaks. If more people were using it, I might be more motivated to figure out how to do stacking of images vertically, but still restricting scrolling to the horizontal--to essentially allow the same thing as "tracks" in other genome browsers. Maybe that's the killer feature.

And


An observation on the difference between the bio and geo programming environments as I interpret them:
There's some tools that (IMHO) are better in the geo world than in the bio. Perhaps that's because of GDAL, the keystone for geo-data formats and projections. Since any other software (in any SWIG-able language) that makes use of gdal can access pretty much any format and I/O to any projection, then geo developers can do things like make nice renderers, or web-based map browsers rather than figuring how to convert that mrSID image in epsg:4326 to a tif in epsg:900913. (I'm also a GDAL fan-boy.)

Contrast this to the bio world where there's a bio for nearly every common language, bio-java, bio-perl, bio-python, bio-ruby, each with it's own blast parser, gen-bank parser, sequence objects, alignment objects. There is no keystone, so there's more duplication of effort, and there's no parser that's used across languagues as is the case for GIS. -- I'm not suggesting that shouldn't be the case, simply making an observation.