<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-21289662</id><updated>2011-12-13T22:57:44.395-08:00</updated><category term='flash'/><category term='mmap'/><category term='mod_wsgi'/><category term='pyfasta'/><category term='twill'/><category term='pylab kmeans'/><category term='pylab'/><category term='gdal'/><category term='lua'/><category term='bioinformatics'/><category term='biohash'/><category term='seqfind'/><category term='ngs'/><category term='aligner'/><category term='ctypes'/><category term='biotools'/><category term='genedex'/><category term='fasta'/><category term='cython'/><category term='python'/><category term='wsgi'/><category term='browser'/><category term='tokyocabinet'/><category term='web.py'/><category term='fastq'/><category term='htseq'/><category term='ecology'/><category term='apache'/><category term='oss'/><category term='python gis'/><category term='gis'/><category term='geo'/><category term='django'/><category term='vis'/><category term='gis oss'/><category term='nwalign'/><category term='c'/><category term='s5'/><category term='fcsh'/><category term='numpy'/><category term='pygments'/><category term='appengine'/><category term='bio'/><category term='spatial'/><category term='gsnap'/><category term='testing'/><category term='flash gis'/><category term='methylcoder'/><category term='tree'/><category term='haxe'/><title type='text'>Bio and Geo Informatics</title><subtitle type='html'>Genomics Programming</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/21289662/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>72</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-21289662.post-4731561902882505008</id><published>2011-04-12T18:40:00.000-07:00</published><updated>2011-04-12T18:40:21.428-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='browser'/><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><title type='text'>Adding bed/wig data to dalliance genome browser</title><content type='html'>I have been playing a bit with the &lt;a href="http://www.biodalliance.org/"&gt;dalliance genome browser&lt;/a&gt;. It is quite useful and I have started using it to generate links to send to researchers to show regions of interest we find from bioinformatics analyses.&lt;br /&gt; I added a document to &lt;a href="https://github.com/brentp"&gt;my github&lt;/a&gt; repo describing how to display a bed file in the browser. 
That rst is &lt;a href="https://github.com/brentp/bio-playground/blob/master/ngs-notes/dalliance.rst"&gt;here&lt;/a&gt; and displayed inline below.&lt;br /&gt;It uses the UCSC binaries for creating BigWig/BigBed files because dalliance can request a subset of the data without downloading the entire file given the correct apache configuration (also described below).&lt;br /&gt;This will require a recent version of dalliance because there was a bug in the BigBed parsing until recently.&lt;br /&gt;&lt;div class="document"&gt; &lt;br /&gt;&lt;div class="section" id="dalliance-data-tutorial"&gt;&lt;h1&gt;Dalliance Data Tutorial&lt;/h1&gt;&lt;p&gt;&lt;a class="reference external" href="http://www.biodalliance.org/"&gt;dalliance&lt;/a&gt; is a web-based scrolling genome-browser. It can display data from&lt;br /&gt;remote &lt;a class="reference external" href="http://dasregistry.org/"&gt;DAS&lt;/a&gt; servers or local or remote &lt;a class="reference external" href="http://genome.ucsc.edu/goldenPath/help/bigWig.html"&gt;BigWig&lt;/a&gt; or &lt;a class="reference external" href="http://genome.ucsc.edu/goldenPath/help/bigBed.html"&gt;BigBed&lt;/a&gt; files.&lt;br /&gt;This will cover how to set up an html page that links to remote &lt;a class="reference external" href="http://dasregistry.org/"&gt;DAS&lt;/a&gt; services.&lt;br /&gt;It will also show how to create and serve &lt;a class="reference external" href="http://genome.ucsc.edu/goldenPath/help/bigWig.html"&gt;BigWig&lt;/a&gt; and &lt;a class="reference external" href="http://genome.ucsc.edu/goldenPath/help/bigBed.html"&gt;BigBed&lt;/a&gt; files.&lt;/p&gt;&lt;div class="note"&gt;&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;&lt;p class="last"&gt;This tutorial uses hg18, but it is applicable to&lt;br /&gt;any version available from your favorite database or &lt;a class="reference external" href="http://dasregistry.org/"&gt;DAS&lt;/a&gt; .&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div 
class="section" id="creating-a-bigbed"&gt;&lt;h1&gt;Creating A BigBed&lt;/h1&gt;&lt;div class="section" id="getting-a-bed-file-from-ucsc"&gt;&lt;h2&gt;Getting a bed file from UCSC&lt;/h2&gt;&lt;blockquote&gt;&lt;ul&gt;&lt;li&gt;&lt;p class="first"&gt;From the &lt;a class="reference external" href="http://genome.ucsc.edu/cgi-bin/hgTables"&gt;UCSC table browser&lt;/a&gt; choose&lt;/p&gt;&lt;ul class="simple"&gt;&lt;li&gt;genome: Human&lt;/li&gt; &lt;li&gt;assembly:  NCBI36/hg18&lt;/li&gt; &lt;li&gt;group: Genes and Gene Prediction Tracks&lt;/li&gt; &lt;li&gt;track: UCSC Genes&lt;/li&gt; &lt;li&gt;table: knownGene&lt;/li&gt; &lt;li&gt;output format &amp;quot;selected fields from primary and related tables&amp;quot;&lt;/li&gt; &lt;li&gt;in text box, name it &amp;quot;knownGene.hg18.stuff.txt&amp;quot;&lt;/li&gt; &lt;li&gt;&lt;em&gt;click&lt;/em&gt; &amp;quot;get output&amp;quot;&lt;/li&gt; &lt;li&gt;&lt;em&gt;check&lt;/em&gt; kgXref under 'Linked Tables'&lt;/li&gt; &lt;li&gt;&lt;em&gt;click&lt;/em&gt; 'Allow Selection From Checked Tables' at bottom of page.&lt;/li&gt; &lt;li&gt;&lt;em&gt;check&lt;/em&gt; 'geneSymbol' from hg18.kgXref fields section&lt;/li&gt; &lt;li&gt;&lt;em&gt;click&lt;/em&gt; 'get output' and a file named 'knownGene.hg18.stuff.txt' will be saved to your downloads directory. Move it to your current directory.&lt;/li&gt; &lt;/ul&gt;&lt;/li&gt; &lt;li&gt;&lt;p class="first"&gt;To get this into bed format, copy and paste this onto the command line:&lt;/p&gt;&lt;pre class="literal-block"&gt;grep -v '#' knownGene.hg18.stuff.txt | awk 'BEGIN { OFS = &amp;quot;\t&amp;quot;; } ;&lt;br /&gt;{   split($9, astarts, /,/);&lt;br /&gt;    split($10, aends, /,/);&lt;br /&gt;    starts=&amp;quot;&amp;quot;&lt;br /&gt;    ends=&amp;quot;&amp;quot;&lt;br /&gt;    for(i in astarts){&lt;br /&gt;        if (! 
astarts[i]) continue&lt;br /&gt;        ends=ends&amp;quot;&amp;quot;(aends[i] - astarts[i])&amp;quot;,&amp;quot;&lt;br /&gt;        starts=starts&amp;quot;&amp;quot;(astarts[i] = astarts[i] - $4)&amp;quot;,&amp;quot;&lt;br /&gt;    }&lt;br /&gt;    print $2,$4,$5,$1&amp;quot;,&amp;quot;toupper($13),1,$3,$6,$5,&amp;quot;.&amp;quot;,$8,ends,starts&lt;br /&gt;}' | sort -k1,1 -k2,2n &amp;gt; knownGene.hg18.bed&lt;br /&gt;&lt;/pre&gt;&lt;/li&gt; &lt;li&gt;&lt;p class="first"&gt;To create a &lt;a class="reference external" href="http://genome.ucsc.edu/goldenPath/help/bigBed.html"&gt;BigBed&lt;/a&gt; from this, do (note: if you're not on a 64-bit&lt;br /&gt;machine, you'll have to find the 32-bit binaries):&lt;/p&gt;&lt;pre class="literal-block"&gt;wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/fetchChromSizes&lt;br /&gt;wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bedToBigBed&lt;br /&gt;chmod +x fetchChromSizes bedToBigBed&lt;br /&gt;./fetchChromSizes hg18 &amp;gt; data/hg18.chrom.sizes&lt;br /&gt;./bedToBigBed knownGene.hg18.bed data/hg18.chrom.sizes knownGene.hg18.bb&lt;br /&gt;&lt;/pre&gt;&lt;/li&gt; &lt;/ul&gt;&lt;/blockquote&gt;&lt;p&gt;Now knownGene.hg18.bb is a &lt;a class="reference external" href="http://genome.ucsc.edu/goldenPath/help/bigBed.html"&gt;BigBed&lt;/a&gt; file containing both the UCSC and the common&lt;br /&gt;name in the name column.&lt;/p&gt;&lt;/div&gt;&lt;div class="section" id="sql"&gt;&lt;h2&gt;SQL&lt;/h2&gt;&lt;p&gt;UCSC also has a public mysql server so the process of downloading to a bed can be simplified to:&lt;/p&gt;&lt;pre class="literal-block"&gt;mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18 -P 3306   -e &amp;quot;select chrom,txStart,txEnd,K.name,X.geneSymbol,strand,exonStarts,exonEnds from knownGene as K,kgXref as X where  X.kgId=K.name;&amp;quot; &amp;gt; tmp.notbed&lt;br /&gt;grep -v txStart tmp.notbed | awk 'BEGIN { OFS = &amp;quot;\t&amp;quot;; } ;&lt;br /&gt;    {   split($7, astarts, /,/);&lt;br /&gt;        split($8, aends, 
/,/);&lt;br /&gt;        starts=&amp;quot;&amp;quot;&lt;br /&gt;        sizes=&amp;quot;&amp;quot;&lt;br /&gt;        exonCount=0&lt;br /&gt;        for(i in astarts){&lt;br /&gt;            if (! astarts[i]) continue&lt;br /&gt;            sizes=sizes&amp;quot;&amp;quot;(aends[i] - astarts[i])&amp;quot;,&amp;quot;&lt;br /&gt;            starts=starts&amp;quot;&amp;quot;(astarts[i] = astarts[i] - $2)&amp;quot;,&amp;quot;&lt;br /&gt;            exonCount=exonCount + 1&lt;br /&gt;        }&lt;br /&gt;        print $1,$2,$3,$4&amp;quot;,&amp;quot;$5,1,$6,$2,$3,&amp;quot;.&amp;quot;,exonCount,sizes,starts&lt;br /&gt;    }' | sort -k1,1 -k2,2n &amp;gt; knownGene.hg18.bed&lt;br /&gt;&lt;/pre&gt;&lt;p&gt;Then proceed with the last steps above to create the BigBed file.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="section" id="displaying-a-bigbed-in-dalliance"&gt;&lt;h1&gt;Displaying A BigBed in Dalliance&lt;/h1&gt;&lt;p&gt;From there, download dalliance:&lt;/p&gt;&lt;pre class="literal-block"&gt;$ git clone git://github.com/dasmoth/dalliance.git&lt;br /&gt;cd dalliance&lt;br /&gt;&lt;/pre&gt;&lt;p&gt;and edit test.html, adding:&lt;/p&gt;&lt;pre class="literal-block"&gt;{name: 'UCSC Genes',&lt;br /&gt; bwgURI:               '/dalliance/knownGene.hg18.bb',&lt;br /&gt;},&lt;br /&gt;&lt;/pre&gt;&lt;p&gt;before the line that looks like:&lt;/p&gt;&lt;pre class="literal-block"&gt;{name: 'Repeats',&lt;br /&gt;&lt;/pre&gt;&lt;p&gt;at around &lt;em&gt;line 55&lt;/em&gt;.&lt;/p&gt;&lt;p&gt;Then edit your apache.conf (e.g. 
&lt;cite&gt;/etc/apache2/sites-enabled/000-default&lt;/cite&gt;)&lt;br /&gt;and put the following&lt;br /&gt;(here I assume you cloned dalliance into &lt;cite&gt;/usr/local/src/dalliance-git&lt;/cite&gt;):&lt;/p&gt;&lt;pre class="literal-block"&gt;Alias /dalliance &amp;quot;/usr/local/src/dalliance-git&amp;quot;&lt;br /&gt;&amp;lt;Directory &amp;quot;/usr/local/src/dalliance-git&amp;quot;&amp;gt;&lt;br /&gt; &lt;br /&gt;    Header set Access-Control-Allow-Origin &amp;quot;*&amp;quot;&lt;br /&gt;    Header set Access-Control-Allow-Headers &amp;quot;Range&amp;quot;&lt;br /&gt; &lt;br /&gt;    Options Indexes MultiViews FollowSymLinks&lt;br /&gt;    AllowOverride None&lt;br /&gt;    Order allow,deny&lt;br /&gt;    Allow from all&lt;br /&gt;&amp;lt;/Directory&amp;gt;&lt;br /&gt;&lt;/pre&gt;&lt;p&gt;Then enable the mod_headers Apache module. On Ubuntu, that looks like:&lt;/p&gt;&lt;pre class="literal-block"&gt;sudo a2enmod headers&lt;br /&gt;&lt;/pre&gt;&lt;p&gt;Then point your browser to: &lt;a class="reference external" href="http://yourhost/dalliance/test.html"&gt;http://yourhost/dalliance/test.html&lt;/a&gt; &lt;br /&gt;And you should see your 'UCSC Genes' track in full glory along&lt;br /&gt;with the other niceties of the &lt;a class="reference external" href="http://www.biodalliance.org/"&gt;dalliance&lt;/a&gt; browser.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-4731561902882505008?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/4731561902882505008/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=4731561902882505008' title='6 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/21289662/posts/default/4731561902882505008'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/4731561902882505008'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2011/04/adding-bedwig-data-to-dalliance-genome.html' title='Adding bed/wig data to dalliance genome browser'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-2962726271183857257</id><published>2010-10-22T09:12:00.000-07:00</published><updated>2010-10-22T09:12:39.596-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ngs'/><category scheme='http://www.blogger.com/atom/ns#' term='htseq'/><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><title type='text'>(bloom) filter-ing repeated reads</title><content type='html'>In this post, I'll talk a &lt;i&gt;bit&lt;/i&gt; about using a bloom filter as a pre-&lt;i&gt;filter&lt;/i&gt; for large amounts of data, specifically some next-gen sequencing reads.&lt;br /&gt;&lt;h3&gt;Bloom Filters&lt;/h3&gt;A &lt;a href="http://en.wikipedia.org/wiki/Bloom_filter"&gt;Bloom Filter&lt;/a&gt; is a memory efficient way of determining if an element is in a set. It can have false positives, but not false negatives. A while ago, I wrote a Cython/Python wrapper for the C code that powers the perl module, &lt;a href="http://search.cpan.org/~palvaro/Bloom-Faster-1.6/lib/Bloom/Faster.pm"&gt;Bloom::Faster&lt;/a&gt;. It has a nice API and seems very fast. It allows specifying the false positive rate. 
As with any bloom-filter there's a tradeoff between the amount of memory used and the expected number of false positives.&lt;br /&gt;The code for that wrapper is in my github, &lt;a href="http://github.com/brentp/pybloomfaster"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Big Data&lt;/h3&gt;It's a common request to filter out repeated reads from next-gen sequencing data. Just see &lt;a href="http://biostar.stackexchange.com/questions/3010/how-to-remove-the-same-sequences-in-the-fasta-files"&gt;this question&lt;/a&gt; on biostar. &lt;a href="http://biostar.stackexchange.com/questions/3010/how-to-remove-the-same-sequences-in-the-fasta-files/3011#3011"&gt;My answer&lt;/a&gt;, and every answer in that thread, uses a tool that must read all the sequences into memory. This is an easy problem to solve in any language: just read the records into a dict/hash structure and use that to find duplicates and print out only the best or first or whatever. However, once you start getting "next-gen", this is less useful. For example, someone kindly &lt;a href="http://github.com/brentp/bio-playground/issues/#issue/2"&gt;reported a bug&lt;/a&gt; on my simple &lt;a href="http://github.com/brentp/bio-playground/tree/master/reads-utils/"&gt;c++ filter&lt;/a&gt; because he had 84 gigs of reads and was running out of memory. And that &lt;b&gt;is&lt;/b&gt; a bug. Anything that's supposed to deal with next-gen sequences has to deal with that stuff. &lt;br /&gt;&lt;br /&gt;As a side note, my &lt;a href="http://tanghaibao.blogspot.com/"&gt;friend/co-worker&lt;/a&gt; came up with an elegant solution for filtering larger files: split into 4 files based on the first base in the read (A, C, T, or G) then filter each of those files to uniqueness and merge. I like that approach, and as the file sizes grow it could be extended to separate into 16 files by reading the first 2 bases. 
But, it does require a lot of extra file io.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Just Do It&lt;/h3&gt;So, I wrote a &lt;a href="http://github.com/brentp/pybloomfaster/tree/master/examples"&gt;simple script&lt;/a&gt; that filters to unique FASTQ reads using a bloom-filter in front of a python set. Basically only stuff that is flagged as appearing in the bloom-filter is added to the set. This trades speed--it iterates over the file 3 times--for memory. The amount of memory is tunable by the specified error-rate. It's not pretty, but it should be simple enough to demonstrate what's going on. It only reads from stdin and writes to stdout, with some information about total reads and the number of false positives in the bloom-filter sent to stderr.&lt;br /&gt;Usage looks like:&lt;br /&gt;&lt;i&gt;python fastq-unique.py &amp;lt; in.fastq &amp;gt; out.unique.fastq&lt;/i&gt;&lt;br /&gt;On my machine, a particular run with a decent sized file looks like this:&lt;br /&gt;&lt;script src="http://gist.github.com/640806.js?file=gistfile2.sh"&gt;&lt;/script&gt;&lt;br /&gt;and here's the code:&lt;br /&gt;&lt;script src="http://gist.github.com/640806.js?file=fastq-unique.py"&gt;&lt;/script&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-2962726271183857257?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/2962726271183857257/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=2962726271183857257' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/2962726271183857257'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/2962726271183857257'/><link rel='alternate' type='text/html' 
href='http://hackmap.blogspot.com/2010/10/bloom-filter-ing-repeated-reads.html' title='(bloom) filter-ing repeated reads'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-9108554039693898509</id><published>2010-09-20T16:34:00.000-07:00</published><updated>2011-10-21T12:08:09.427-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='fastq'/><category scheme='http://www.blogger.com/atom/ns#' term='ngs'/><title type='text'>filtering paired end reads (high throughput sequencing)</title><content type='html'>NOTE: I don't recommend using this code. It is not supported and currently does not work for some sets of reads. If you use it, be prepared to fix it.&lt;br /&gt;&lt;br /&gt;I wrote &lt;a href="http://hackmap.blogspot.com/2010/09/ngs-high-throughput-sequencing-pipeline.html"&gt;last time&lt;/a&gt; about a pipeline for high-throughput sequence data. In it, I mentioned that the &lt;a href="http://hannonlab.cshl.edu/fastx_toolkit/"&gt;fastx toolkit&lt;/a&gt; works well for filtering but does not handle paired end reads. The problem is that you can filter each end (file) of reads independently, but most aligners expect that the &lt;i&gt;nth&lt;/i&gt; record in file 1 will be the pair of the &lt;i&gt;nth&lt;/i&gt; record in file 2. That may not be the case if one end of the pair is completely removed while the other remains.&lt;br /&gt;At the end of this post is the code for a simple python script that clips adaptor sequences and trims low-quality bases from paired end reads. It simply calls fastx toolkit (which is assumed to be on your path). 
It uses &lt;a href="http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_clipper_usage"&gt;fastx_clipper&lt;/a&gt; if an adaptor sequence is specified and pipes the output to &lt;i&gt;fastq_quality_trimmer&lt;/i&gt; for each file. It then loops through the filtered output and keeps only reads that appear in both. Usage is something like:&lt;br /&gt;&lt;pre class="prettyprint"&gt;ADAPTORS=GAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGT,GAAGAGCGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGT,GAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCG&lt;br /&gt;pair_fastx_clip_trim.py --sanger -a $ADAPTORS -M 20 -t 28 -l 40 en.wt.1.fastq en.wt.2.fastq&lt;br /&gt;&lt;/pre&gt;Where the -a (adaptor), -M (length of adaptor match), -t (min quality threshold), and -l (min length after quality chops) options are copied directly from (and sent directly to) fastx toolkit. &lt;i&gt;--sanger&lt;/i&gt; indicates that the reads have &lt;a href="http://en.wikipedia.org/wiki/FASTQ_format#Quality"&gt;fastq qualities&lt;/a&gt; in the sanger encoding. If that option is not specified, qualities are assumed to be in illumina 1.3 format where the ascii offset is 64.&lt;br /&gt;&lt;b&gt;Output&lt;/b&gt;&lt;br /&gt;This example will create 2 new files: &lt;i&gt;en.wt.1.fastq.trim&lt;/i&gt; and &lt;i&gt;en.wt.2.fastq.trim&lt;/i&gt; each with the same number of corresponding records that pass the filtering above.&lt;br /&gt;&lt;b&gt;Adaptors&lt;/b&gt;&lt;br /&gt;As described in my &lt;a href="http://hackmap.blogspot.com/2010/09/ngs-high-throughput-sequencing-pipeline.html"&gt;previous post&lt;/a&gt;, sometimes there are multiple adaptor sequences in the reads. 
This script can filter out any number of adaptors--specified in a comma delimited option &lt;i&gt;-a&lt;/i&gt;--in a single run.&lt;br /&gt;&lt;b&gt;Script&lt;/b&gt;&lt;br /&gt;It's not too pretty, but it does the job:&lt;br /&gt;&lt;script src="http://gist.github.com/588841.js?file=fastq_pair_filter.py"&gt;&lt;/script&gt;&lt;br /&gt;As always, let me know of any feedback.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-9108554039693898509?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/9108554039693898509/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=9108554039693898509' title='31 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/9108554039693898509'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/9108554039693898509'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2010/09/filtering-paired-end-reads-high.html' title='filtering paired end reads (high throughput sequencing)'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>31</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-635529086048036271</id><published>2010-09-12T18:21:00.000-07:00</published><updated>2010-09-12T18:27:51.388-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ngs'/><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><title type='text'>ngs / high-throughput sequencing pipeline</title><content 
type='html'>This is the minimal set of preprocessing steps I run on high-throughput sequencing data (mostly from the Illumina sequencers) and then how I prep and view the alignments. If there's something I should add or consider, please let me know.&lt;br /&gt;I'll put it in the form of a shell script that assumes you've got &lt;a href="http://gist.github.com/407882"&gt;this software&lt;/a&gt; installed.&lt;br /&gt;I'll also assume your data is in the &lt;a href="http://en.wikipedia.org/wiki/FASTQ_format"&gt;FASTQ format&lt;/a&gt;. If it's in illumina's qseq format, you can convert to FastQ with &lt;a href="http://gist.github.com/549824"&gt;this script&lt;/a&gt; by sending a list of qseq files as the command-line arguments.&lt;br /&gt;If your data is in color-space, you can just tell bowtie that's the case, but the FASTX stuff below will not apply.&lt;br /&gt;This post will assume we're aligning genomic reads to a reference genome. I may cover bisulfite treated reads and RNA-Seq later, but the initial filtering and visualization will be the same. I also assume you're on *Nix or Mac.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Setup&lt;/h4&gt;The programs and files will be declared as follows.&lt;br /&gt;&lt;pre class="prettyprint"&gt;#programs&lt;br /&gt;fastqc=/usr/local/src/fastqc/FastQC/fastqc&lt;br /&gt;bowtie_dir=/usr/local/src/bowtie/bowtie-0.12.7/&lt;br /&gt;samtools=/usr/local/src/samtools/samtools&lt;br /&gt;&lt;br /&gt;#files. (you'll want to change these)&lt;br /&gt;FASTQ=/path/to/a.fastq&lt;br /&gt;REFERENCE=/path/to/reference.fasta&lt;br /&gt;&lt;/pre&gt;Most of the following will run as-is for any set of reads, you'll only need to change the &lt;i&gt;FASTQ&lt;/i&gt; and &lt;i&gt;REFERENCE&lt;/i&gt; variables above.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Seeing the Reads&lt;/h4&gt;If you have a file with 2GB of reads, you likely can't just &lt;i&gt;read&lt;/i&gt; (pun intended) it and get an idea of what's going on -- though I've tried. 
There are a number of tools to give you a summary of the data including stats such as quality per base, nucleotide frequency, etc. While &lt;a href="http://hannonlab.cshl.edu/fastx_toolkit/"&gt;fastx toolkit&lt;/a&gt; will do this for you, I've found &lt;a href="http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/"&gt;fastqc&lt;/a&gt; to be the best choice.&lt;br /&gt;The command: &lt;pre class="prettyprint"&gt;$fastqc $FASTQ&lt;/pre&gt;will write a folder "a_fastqc/" containing the html report in fastqc_report.html&lt;br /&gt;Here's an example of the nicely formatted and informative FastQC report before quality filtering and trimming (scroll to see the interesting stuff):&lt;br /&gt;&lt;iframe src="http://syntelog.com/t/save/a_fastqc/fastqc_report.html" height=400 width="100%"&gt;&lt;/iframe&gt;&lt;br /&gt;From that, we can see that (as with all illumina datasets) quality scores are lower at the 3' end of the read. For this analysis, I'll choose to chop bases with a quality score under 28. In addition to the other information, we can see that there is an Illumina Primer still present on many of the reads, so we'll want to chop that out in the filtering step below.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Filtering the Reads&lt;/h4&gt;The &lt;a href="http://hannonlab.cshl.edu/fastx_toolkit/"&gt;fastx toolkit&lt;/a&gt; is not extremely fast, but it has a simple command-line interface for most common operations I use to filter reads (though it--like similar libraries--is lacking in support for filtering paired-end reads). For this case, we want to trim nucleotides with quality less than 28 from the ends of each read and then remove reads shorter than 40 bases. In addition, we want to clip the Illumina adaptor sequence. 
This can be done by piping 2 commands together:&lt;pre class="prettyprint"&gt;# the sequence identified by FastQC above:&lt;br /&gt;CLIP=GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAA&lt;br /&gt;fastx_clipper -a $CLIP -i $FASTQ -M 30 | fastq_quality_trimmer -t 28 -l 40 &gt; ${FASTQ}.trim&lt;br /&gt;&lt;/pre&gt;Both &lt;i&gt;fastx_clipper&lt;/i&gt; and &lt;i&gt;fastq_quality_trimmer&lt;/i&gt; are provided in the fastx toolkit.&lt;br /&gt;The first command clips the adaptor sequence from $FASTQ and pipes the output to the 2nd command, fastq_quality_trimmer, which chomps bases with quality less than 28 (-t) then discards any read with a remaining length less than 40 (-l) and sends the output to a &lt;i&gt;.trim&lt;/i&gt; file.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Seeing the &lt;i&gt;filtered&lt;/i&gt; Reads&lt;/h4&gt;We then want to re-run fastqc on the trimmed, clipped reads.&lt;br /&gt;The command: &lt;pre class="prettyprint"&gt;$fastqc ${FASTQ}.trim&lt;/pre&gt;and the output looks like:&lt;br /&gt;&lt;iframe src="http://syntelog.com/t/save/a.fastq.trim_fastqc/fastqc_report.html" height=400 width="100%"&gt;&lt;/iframe&gt;&lt;br /&gt;where we can see the quality looks much better and there are no primer sequences remaining.&lt;br /&gt;This took the number of reads from 14,368,483  to 11,579,512 (and many shorter--trimmed--reads in the latter as well).&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Analysis&lt;/h4&gt;Up to this point, the analysis will have been very similar for RNA-Seq, BS-Seq, and genomic reads, but you'll want to customize your filtering steps. The following will go through a simple alignment using bowtie that assumes you have genomic reads.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Aligning&lt;/h4&gt;The first step is to build the reference bowtie index:&lt;pre class="prettyprint"&gt;${bowtie_dir}/bowtie-build --quiet $REFERENCE bowtie-index&lt;/pre&gt;That will create the bowtie index for your reference fasta as a set of .ebwt files whose names start with bowtie-index. 
It can take a while so you may want to download the pre-built indexes from the &lt;a href="http://bowtie-bio.sourceforge.net/index.shtml"&gt;bowtie website&lt;/a&gt; if your organism is available. &lt;br /&gt;Then, you can run the actual alignment as:&lt;pre class="prettyprint"&gt;${bowtie_dir}/bowtie --tryhard --phred64-quals -p 4 -m 1 -S bowtie-index -q ${FASTQ}.trim ${FASTQ}.trim.sam&lt;/pre&gt;which tells bowtie to try hard (--tryhard), assume quality scores are the latest schema from illumina (--phred64-quals), use 4 processors (-p 4), discard any reads that map to more than one location in the reference (-m 1), use SAM output format (-S) and then where to find the bowtie index and the reads. The output is sent to ${FASTQ}.trim.sam.&lt;br /&gt;We've done a lot of experimenting with different values for -m, and it can affect your results, but -m 1 seems a common choice in the literature. And it's clearly less sketchy than, for example, -m 100 which would report up to 100 alignments for any read that maps to fewer than 100 locations in the genome.&lt;br /&gt;&lt;h4&gt;View the Alignment&lt;/h4&gt;From there, we want to view the alignment. Most tools can handle SAM formatted files, but will perform better with an indexed bam. 
To get that, we do:&lt;pre class="prettyprint"&gt;# use samtools view to filter out unmapped reads, then convert to a sorted, indexed bam.&lt;br /&gt;${samtools} view -bu -S -F 0x0004 ${FASTQ}.trim.sam -o ${FASTQ}.trim.unsorted.bam&lt;br /&gt;${samtools} sort ${FASTQ}.trim.unsorted.bam ${FASTQ}.trim&lt;br /&gt;${samtools} index ${FASTQ}.trim.bam&lt;br /&gt;&lt;/pre&gt;where now ${FASTQ}.trim.bam is sorted and indexed and contains only mapped reads (the .sam file from bowtie contains unmapped reads).&lt;br /&gt;You can have a quick look at the alignment stats with:&lt;pre class="prettyprint"&gt;${samtools} flagstat ${FASTQ}.trim.bam&lt;/pre&gt;and you can see an awesome ascii &lt;a href="http://linux.die.net/man/3/ncurses"&gt;curses&lt;/a&gt; view of the alignment with:&lt;pre class="prettyprint"&gt;${samtools} tview ${FASTQ}.trim.bam&lt;/pre&gt;to get something that looks like:&lt;br /&gt;&lt;img src="http://lh6.ggpht.com/_uU_kLC5AdTc/S5h7fACdteI/AAAAAAAAAyc/r5l345gDPCo/methylcode.png" /&gt;&lt;br /&gt;But you'll probably want to use a "real" viewer such as &lt;a href="http://www.broadinstitute.org/igv/"&gt;IGV&lt;/a&gt;, &lt;a href="http://johnsonlab.ucsf.edu/sj/mochiview-start/"&gt;MochiView&lt;/a&gt;, or &lt;a href="http://bioinf.scri.ac.uk/tablet/"&gt;Tablet&lt;/a&gt;, all of which I have had some success with.&lt;br /&gt;&lt;br /&gt;Note that you may want to choose another aligner, because &lt;i&gt;bowtie&lt;/i&gt; does poorly with indels. 
Something like &lt;a href="http://www.gene.com/share/gmap/"&gt;GSNAP&lt;/a&gt; will be better suited for reads where you have more complex variants.&lt;br /&gt;I welcome any feedback on these methods.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-635529086048036271?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/635529086048036271/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=635529086048036271' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/635529086048036271'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/635529086048036271'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2010/09/ngs-high-throughput-sequencing-pipeline.html' title='ngs / high-throughput sequencing pipeline'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://lh6.ggpht.com/_uU_kLC5AdTc/S5h7fACdteI/AAAAAAAAAyc/r5l345gDPCo/s72-c/methylcode.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-4849647232804345635</id><published>2010-07-14T12:50:00.000-07:00</published><updated>2010-07-14T12:50:30.028-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='methylcoder'/><category scheme='http://www.blogger.com/atom/ns#' term='gsnap'/><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><category 
scheme='http://www.blogger.com/atom/ns#' term='aligner'/><title type='text'>GSNAP</title><content type='html'>&lt;h4&gt;Aligners&lt;/h4&gt;Since starting the &lt;a href="http://github.com/brentp/methylcode/"&gt;methylcoder&lt;/a&gt; project, I've been using the &lt;a href="http://bowtie-bio.sourceforge.net"&gt;bowtie short read aligner&lt;/a&gt;. It's very fast, uses very little memory, aligns Illumina, SOLID, and colorspace reads, and has enough &lt;a href="http://bowtie-bio.sourceforge.net/manual.html"&gt;options&lt;/a&gt; to keep you busy (including my favorite: --tryhard).&lt;br /&gt;There's a new short-read aligner in my feed-reader each week. I wish, as a service, they'd tell me what they do that bowtie doesn't. There are 2 scenarios I know of that bowtie doesn't handle.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;As with many aligners, bowtie creates an index of the reference genome. It uses a &lt;a href="http://en.wikipedia.org/wiki/Suffix_array"&gt;suffix array&lt;/a&gt; compressed with the &lt;a href="http://en.wikipedia.org/wiki/Burrows-Wheeler_transform"&gt;Burrows-Wheeler Transform&lt;/a&gt; (BWT). It uses 32-bit integers to store the offsets into that index, so it's limited to under 4 gigabases of reference sequence. This is more than enough to hold the human genome and so is not a problem for most users, but for bisulfite (BS) treated reads, I need to store converted (cytosine to thymine) forward and reverse copies of the genome, which doubles the size of the reference and puts the human genome past that limit. The "solution" is to split the reference and create multiple indexes. But, with that setup, when requesting reads that map &lt;i&gt;uniquely&lt;/i&gt; to a single best location, it's only guaranteed that the mapping is unique to each index independently and post-processing is required. This is also a problem for any large genome that will not fit into a bowtie index.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Bowtie does not handle indels very well. 
Even the other popular aligners SOAP and BWA can only handle very short indels (up to about 3 bp) while MAQ and SOAP2 do not align reads with indels.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;h4&gt;GSNAP&lt;/h4&gt;&lt;a href="http://research-pub.gene.com/gmap/"&gt;GSNAP&lt;/a&gt;: Genomic Short-read Nucleotide Alignment Program (from Thomas Wu at Genentech) addresses both of those shortcomings and seems reasonably fast. For the reference index, rather than use a suffix-array, GSNAP uses a hash of 12-mers sampled every 3 basepairs (by my reading, sampling complicates the search somewhat but reduces the size of the index). Since it can then use 12-mer seeds to span gaps, it's better at dealing with indels and can also be used to map RNA-seq data to a reference genome, with or without known, annotated splice sites! In addition, it can take known SNPs and add them to the 12-mer hash index so that reads mapping over a SNP will not be wrongly accused of having a mismatch.&lt;br /&gt;Finally, it can map bisulfite-treated reads to the genome without requiring the C=&gt;T conversion--using the same re-indexed genome.&lt;br /&gt;Its input format is a bit wonky for paired end reads. Normally, paired-end reads are specified in 2 separate files: file_1.fastq, file_2.fastq with the header (and order) indicating the pairing. GSNAP requires a single FASTA file (it will not accept FASTQ) with the format:&lt;pre&gt;&amp;gt;header&lt;br /&gt;ACTCTCAGCGGGACGTTAACGCGACCGATTACGGTGACC&lt;br /&gt;CCACGTGCCGACTTAGGCAGACCGACGTTACGCACCACA&lt;/pre&gt;where the first sequence is from the file_1.fastq and the 2nd is from file_2.fastq.&lt;br /&gt;As of yesterday, &lt;a href="http://github.com/brentp/methylcode/"&gt;MethylCoder&lt;/a&gt; supports GSNAP as an aligner, in addition to Bowtie. (This addition was quite simple since GSNAP does all the work anyway). MethylCoder includes a script to convert paired-end FASTQ files to GSNAP format. 
&lt;a href="http://github.com/brentp/methylcode/blob/master/methylcoder/gsnap.py#L75"&gt;The script&lt;/a&gt; simply takes as arguments paired end FASTA or FASTQ files and puts them in a single file of the GSNAP paired end format. It's almost too simple to worry about, but I needed it so it's there...&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://bioinformatics.oupjournals.org/cgi/reprint/btq057v1"&gt;GSNAP paper&lt;/a&gt; is an interesting read. Ever heard of "galloping binary search"? I hadn't, but apparently it's used in &lt;a href="http://svn.python.org/projects/python/trunk/Objects/listsort.txt"&gt;python's timsort&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-4849647232804345635?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/4849647232804345635/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=4849647232804345635' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/4849647232804345635'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/4849647232804345635'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2010/07/aligners-since-starting-methylcoder.html' title='GSNAP'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-1688755842948903655</id><published>2010-06-07T23:45:00.000-07:00</published><updated>2010-06-07T23:45:59.605-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='biotools'/><title type='text'>python tools for bioinformatics II: group-ing</title><content type='html'>This is the second post describing python tools and idioms I've found useful in bioinformatics.&lt;br /&gt;&lt;br /&gt;This post involves grouping items either using a &lt;a href="http://en.wikipedia.org/wiki/Disjoint_sets"&gt;disjoint set&lt;/a&gt; or python's&lt;br /&gt;&lt;a href="http://docs.python.org/library/itertools.html#itertools.groupby"&gt;itertools.groupby&lt;/a&gt;. I was introduced to both of these&lt;br /&gt;by &lt;a href="http://tanghaibao.blogspot.com/"&gt;the post-doc&lt;/a&gt; who sits behind me.&lt;br /&gt;&lt;h4&gt;Grouper&lt;/h4&gt;Grouper is an implementation of a disjoint set, initially from &lt;a href="http://code.activestate.com/recipes/387776-grouping-objects-into-disjoint-sets/"&gt;this recipe&lt;/a&gt; by Michael Droettboom and included in matplotlib.cbook.&lt;br /&gt;The slightly modified implementation we use is &lt;a href="http://github.com/tanghaibao/quota-alignment/blob/master/grouper.py"&gt;here&lt;/a&gt;.&lt;br /&gt;The docstring gives a decent idea of usage:&lt;br /&gt;&lt;pre class="prettyprint"&gt;"""&lt;br /&gt;   This class provides a lightweight way to group arbitrary objects&lt;br /&gt;   together into disjoint sets when a full-blown graph data structure&lt;br /&gt;   would be overkill. 
&lt;br /&gt;&lt;br /&gt;   Objects can be joined using .join(), tested for connectedness&lt;br /&gt;   using .joined(), and all disjoint sets can be retrieved using list(g)&lt;br /&gt;   The objects being joined must be hashable.&lt;br /&gt;&lt;br /&gt;   &gt;&gt;&gt; g = Grouper()&lt;br /&gt;   &gt;&gt;&gt; g.join('a', 'b')&lt;br /&gt;   &gt;&gt;&gt; g.join('b', 'c')&lt;br /&gt;   &gt;&gt;&gt; g.join('d', 'e')&lt;br /&gt;   &gt;&gt;&gt; list(g)&lt;br /&gt;   [['a', 'b', 'c'], ['d', 'e']]&lt;br /&gt;   &gt;&gt;&gt; g.joined('a', 'b')&lt;br /&gt;   True&lt;br /&gt;   &gt;&gt;&gt; g.joined('a', 'c')&lt;br /&gt;   True&lt;br /&gt;   &gt;&gt;&gt; 'f' in g&lt;br /&gt;   False&lt;br /&gt;   &gt;&gt;&gt; g.joined('a', 'd')&lt;br /&gt;   False"""&lt;/pre&gt;Basically the idea is that if there's an explicit connection between `A` and `B` and an explicit connection between `A` and `C`, the Grouper class will create a connection between `B` and `C`. In the implementation above, the explicit connections are created with `Grouper.join()`. For genomics stuff, we can use this in many places; one simple use-case is finding local duplicates. These are also called tandem duplications and are identified as a group of genes with very similar sequence co-occurring in a local chromosomal region. 
We want to group these into a single "family" with all members, even if there is not an explicit connection between all genes--due to slight sequence divergence or sequence alignment (BLAST) oddities.&lt;br /&gt;A snippet of code from &lt;a href="http://github.com/tanghaibao/quota-alignment/blob/master/scripts/synteny_score.py"&gt;here&lt;/a&gt; (more nice code from Haibao) shows the use of a grouper to find syntenic regions by checking if adjacent members of an (already sorted) list are within a certain distance (window) of each other:&lt;br /&gt;&lt;script src="http://gist.github.com/428236.js?file=gistfile2.pyw"&gt;&lt;/script&gt;&lt;br /&gt;after this, the elements that are grouped in the `g` Grouper object are in the same window (though not strictly syntenic).&lt;br /&gt;This is a very nice abstraction for anything where you have groups of related objects. It reduces the accounting involved because once you have `join`ed all the elements, querying the Grouper object with any element will return all elements in the group to which it is joined.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;groupby&lt;/h4&gt;Python's itertools has a number of useful features. `groupby` is a means to partition an iterable into groups of similar adjacent items, where `similar` is determined by a function you supply to groupby. The example from the &lt;a href="http://docs.python.org/library/itertools.html#itertools.groupby"&gt;docs&lt;/a&gt; is:&lt;br /&gt;&lt;script src="http://gist.github.com/428236.js?file=gistfile3.pyw"&gt;&lt;/script&gt;&lt;br /&gt;but it's also possible to specify an arbitrary grouping function, so grouping capital letters together:&lt;br /&gt;&lt;script src="http://gist.github.com/428236.js?file=gistfile4.pyw"&gt;&lt;/script&gt;&lt;br /&gt;So the `k`ey tells what the grouping function returned and the values or groups are grouped with other &lt;b&gt;adjacent&lt;/b&gt; items from the list that also pass the test. 
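&lt;br /&gt;To make the adjacency point concrete, here is a small sketch (the input string is made up for illustration and this is not one of the gists above). It groups letters by case, first as they come and then after a stable sort on the same key:&lt;br /&gt;

```python
from itertools import groupby

letters = "AAbbCdEE"

# groupby only merges *adjacent* items that share a key, so upper-case
# letters separated by lower-case ones end up in separate groups.
runs = [(k, "".join(g)) for k, g in groupby(letters, key=str.isupper)]
print(runs)
# [(True, 'AA'), (False, 'bb'), (True, 'C'), (False, 'd'), (True, 'EE')]

# Sorting on the same key first (sorted() is stable) yields one group per key.
by_case = sorted(letters, key=str.isupper)
merged = [(k, "".join(g)) for k, g in groupby(by_case, key=str.isupper)]
print(merged)
# [(False, 'bbd'), (True, 'AACEE')]
```

&lt;br /&gt;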
So we often want to sort the input before using groupby; then we can get groups and avoid an extra nested for-loop that we would otherwise need for filtering.&lt;br /&gt;An example use is grouping BLAST hits by query and subject. Here, BlastLine is a simple object that takes a line from a tab-delimited blast and converts the e-value and bitscore to float and the starts and stops to ints, etc.&lt;br /&gt;&lt;script src="http://gist.github.com/428236.js?file=gistfile5.pyw"&gt;&lt;/script&gt;&lt;br /&gt;and since the blasts are sorted beforehand, this gives a useful set of blasts in blast_iter--containing only blasts with a matching query and subject. To do this without groupby requires quite a bit more code and reduces readability.&lt;br /&gt;&lt;br /&gt;I recently saw a post about &lt;a href="http://drj11.wordpress.com/2010/02/22/python-getting-fasta-with-itertools-groupby/"&gt;using `groupby` to parse fasta files.&lt;/a&gt; It's a good idea as a fasta file consists of pairs of a single header line, followed by N lines of sequence. A header line always starts with "&gt;" so the test function for groupby is clear; what remains is simply getting them out in pairwise fashion. Here's one way, building from the post above:&lt;br /&gt;&lt;script src="http://gist.github.com/428236.js?file=gistfile1.pyw"&gt;&lt;/script&gt;&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;groupby-grouper&lt;/h4&gt;Here's a final example, both of these used together to find tandem duplications given a genomic Bed file (contains chromosome, start, stop) and a list of blasts. 
Again this is code from Haibao from &lt;a href="http://github.com/tanghaibao/quota-alignment/blob/master/scripts/blast_to_raw.py"&gt;here.&lt;/a&gt;&lt;br /&gt;&lt;script src="http://gist.github.com/428236.js?file=gistfile6.pyw"&gt;&lt;/script&gt;&lt;br /&gt;So, this creates a `simple_blast` that's easier to sort and group, iterates over the groups based on the subject&lt;br /&gt;and then groups hits if they're on the same chromosome and within a given distance (where distance is measured in genes).&lt;br /&gt;I like this example because I'd previously written code to do the same thing without Grouper or groupby and it was longer, slower and less readable.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-1688755842948903655?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/1688755842948903655/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=1688755842948903655' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/1688755842948903655'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/1688755842948903655'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2010/06/python-tools-for-bioinformatics-ii.html' title='python tools for bioinformatics II: group-ing'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-7785772758112093705</id><published>2010-05-30T09:19:00.000-07:00</published><updated>2010-05-30T09:22:53.229-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='biotools'/><title type='text'>python tools for bioinformatics I: convolve for moving average</title><content type='html'>This will be the first post of a few describing python tools or idioms that I have found to work well for doing bioinformatics -- or more specifically genomics.&lt;br /&gt;&lt;h3&gt;Convolve&lt;/h3&gt;&lt;a href="http://numpy.scipy.org/"&gt;numpy&lt;/a&gt; makes a lot of things easier--for example, a moving average. We often want to know the GC-content for a given sequence, which is the proportion of DNA sequence in a given window that is made up of (G)uanine or (C)ytosine. You can often see moving average code done with a nested for loop--loop over each nucleotide and then over each surrounding nucleotide in that window. Below is the gist of the simpler, faster method using &lt;a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.convolve.html"&gt;convolve&lt;/a&gt;. 
It does require you to get your sequence data into a numpy array, but that's easily done from any DNA (or protein) sequence string doing "np.array(sequence_string, dtype='c')".&lt;br /&gt;&lt;script src="http://gist.github.com/419125.js?file=gistfile1.pyw"&gt;&lt;/script&gt;So the workflow is to&lt;ul&gt;&lt;li&gt;get the sequence into a numpy array of dtype 'c'&lt;/li&gt;&lt;li&gt;convert the boolean array to int so True is 1, False is 0&lt;/li&gt;&lt;li&gt;decide on the window size and make a kernel of that shape that sums to 1&lt;/li&gt;&lt;li&gt;profit&lt;/li&gt;&lt;/ul&gt;with a quick addition for plotting:&lt;br /&gt;&lt;script src="http://gist.github.com/419125.js?file=gistfile3.pyw"&gt;&lt;/script&gt;the first 100 kilobases of gc content for 50 and 250bp windows look like:&lt;br /&gt;&lt;img src="http://lh5.ggpht.com/_uU_kLC5AdTc/TAKNNBJmJZI/AAAAAAAAA1o/vOndvsv51aQ/gc50.png" /&gt;&lt;br /&gt;&lt;img src="http://lh6.ggpht.com/_uU_kLC5AdTc/TAKNNAMTe4I/AAAAAAAAA1s/qzk_ejxULBA/gc250.png" /&gt;&lt;br /&gt;where, as expected, the 250 basepair window is more "smoothed".&lt;br /&gt;From there, it's possible to do some analysis, for example grab the regions with the highest gc content.&lt;br /&gt;&lt;script src="http://gist.github.com/419125.js?file=gistfile2.pyw"&gt;&lt;/script&gt;That's some fairly dense code to pull out the values centered in the highest GC-content windows and then show the mean of those windows. 
(Arabidopsis thaliana, which I used for that example, has a fairly low genome-wide GC-content, so the 0.61 average &lt;i&gt;is&lt;/i&gt; quite high.)&lt;br /&gt;&lt;h4&gt;Another Example&lt;/h4&gt;There was a &lt;a href="http://biostar.stackexchange.com/questions/1050/enriched-regions-search-program"&gt;recent question&lt;/a&gt; on &lt;a href="http://biostar.stackexchange.com/"&gt;biostar&lt;/a&gt; where someone wanted to find regions with an abundance of a couple of amino acids above some cutoff.&lt;br /&gt;The main part of that looks like:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;abool = (seq == 'N') | (seq == 'Q') # convert to boolean&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;mw = np.convolve(abool, kern, mode='same')&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;if mw.max() &amp;lt; cutoff: continue&lt;/pre&gt;&lt;br /&gt;Where that operates on 'N' and 'Q' amino acids in a protein sequence instead of 'G' and 'C' as in the example above.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Kernel&lt;/h4&gt;Finally, this approach is &lt;b&gt;much&lt;/b&gt; more flexible than I've shown: the kernel does not have to be uniform, it can be Gaussian, or even only taking values to the left of each nucleotide, or weighting the nearest 10 nucleotides at 1 and the rest at 0.2. 
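&lt;br /&gt;As a rough sketch of swapping kernels (assuming Python 3 and a reasonably recent numpy; the toy sequence, the window size and the Gaussian-like weights are made up for illustration and are not taken from the gists above):&lt;br /&gt;

```python
import numpy as np

# Toy sequence (hypothetical); a real one would come from a FASTA file.
seq = np.array(list("GGCCATATGCGC"))

# 1.0 where the base is G or C, 0.0 elsewhere.
is_gc = np.isin(seq, ["G", "C"]).astype(float)

window = 4

# Uniform kernel: a plain moving average over the window.
uniform = np.ones(window) / window

# Gaussian-like kernel of the same width, normalised to sum to 1.
x = np.arange(window) - (window - 1) / 2.0
gauss = np.exp(-0.5 * x ** 2)
gauss /= gauss.sum()

# Only the kernel changes; the convolve call stays the same.
gc_uniform = np.convolve(is_gc, uniform, mode="same")
gc_gauss = np.convolve(is_gc, gauss, mode="same")
```

&lt;br /&gt;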
Even given those changes, once the kernel is chosen, the rest of the "workflow" remains unchanged.&lt;br /&gt;&lt;h4&gt;Notes&lt;/h4&gt;In cases where the input array is numeric--not sequence--there is no need to do the conversion to a boolean, simply run the convolution on the original array with the chosen kernel.&lt;br /&gt;&lt;br /&gt;In the examples above, I've shown only np.convolve with mode='same', for most cases dealing with sequence, I think this is a good choice, but it's best to consult the documentation for each specific case.&lt;br /&gt;&lt;br /&gt;Finally, in cases where the kernel and the input array are very large, it may be better to use fft convolution from &lt;a href="http://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.fftconvolve.html"&gt;fftconvolve&lt;/a&gt; in scipy. I haven't used this much, I think it may require doing your own padding and chopping at the edges...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-7785772758112093705?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/7785772758112093705/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=7785772758112093705' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/7785772758112093705'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/7785772758112093705'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2010/05/python-tools-for-bioinformatics-i.html' title='python tools for bioinformatics I: convolve for moving average'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://lh5.ggpht.com/_uU_kLC5AdTc/TAKNNBJmJZI/AAAAAAAAA1o/vOndvsv51aQ/s72-c/gc50.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-4008638184294744128</id><published>2010-05-19T17:44:00.000-07:00</published><updated>2010-05-19T18:05:00.615-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='geo'/><title type='text'>numpy + GDAL = agoodle</title><content type='html'>It's been a while since I've posted anything geo-related. Just want to let folks know about &lt;a href="http://github.com/brentp/agoodle"&gt;agoodle&lt;/a&gt; a project that makes it easy to access raster geo files as numpy arrays (thanks to GDAL), and to query them with polygons. The simplest way to see it in action is to go to &lt;a href="http://landsummary.com/map/"&gt;landsummary&lt;/a&gt;, a project Josh Livni and I worked on a couple years ago, and query the map with a polygon (click "Draw Polygon" on the right side of the map first). Then you'll see a nice summary and a pie-chart (due to some google-chart-api usage by Josh) of the land-use types in the polygon you queried. The backend of that is agoodle. The map is, of course, &lt;a href="http://openlayers.org/"&gt;openlayers&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;You can see the usage on the &lt;a href="http://github.com/brentp/agoodle"&gt;github page&lt;/a&gt; but basically, it'll be something like:&lt;script src="http://gist.github.com/407055.js?file=gistfile1.pyw"&gt;&lt;/script&gt; which gives you a dictionary of raster grid code keys to the area values-- meaning, for each landclass, it tells you the sum area of cells of that class that fall inside the polygon. 
Note that summarize_wkt can also take keyword arguments for raster_epsg and poly_epsg, so you can query with a polygon that's in a different projection than the raster.&lt;br /&gt;&lt;br /&gt;The library could use some work. There's sketchiness about "inside", so the sum of the areas in the returned dictionary will not match the area of the polygon. Internally, we use matplotlib's &lt;a href="http://matplotlib.sourceforge.net/api/nxutils_api.html#"&gt;nxutils&lt;/a&gt; which does a very fast point-in-poly test for every cell in the raster that's within the bounding box of the polygon. So any cell that passes that test will be included in the results--all or nothing; there are no corrections for the case where a cell is half in the polygon. This is not a problem unless the polygon is small relative to the size of the raster cells (or the perimeter of the query polygon is large). &lt;i&gt;Patches Welcome&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;Check out the code, we made &lt;a href="http://gdal.osgeo.org"&gt;gdal&lt;/a&gt;, numpy and nxutils do all the hard work, but it comes together pretty well. If there are other open-source projects out there that do this well, let me know and I'll link to them. 
I'm posting this because as far as I know, no others exist.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-4008638184294744128?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/4008638184294744128/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=4008638184294744128' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/4008638184294744128'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/4008638184294744128'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2010/05/numpy-gdal-agoodle.html' title='numpy + GDAL = agoodle'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-2168741247436501916</id><published>2010-04-10T17:27:00.000-07:00</published><updated>2010-04-10T18:20:45.713-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='tokyocabinet'/><category scheme='http://www.blogger.com/atom/ns#' term='bioinformatics'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><title type='text'>fileindex</title><content type='html'>Disclaimer: This is about a generic indexing method that fits in &lt; 50 lines of code, while it's cool, I'm not suggesting it be used in place of a real project. 
The code is &lt;a href="http://github.com/brentp/bio-playground/tree/master/fileindex/"&gt;here&lt;/a&gt; and explained below.&lt;br /&gt;&lt;br /&gt;Everyone's indexing big sequence files. Peter Cock has a &lt;a href="http://github.com/peterjc/biopython/tree/index-sqlite"&gt;bio-python branch&lt;/a&gt; to index via sqlite. Brad Chapman &lt;a href="http://bcbio.wordpress.com/2009/07/26/sorting-genomic-alignments-using-python/"&gt;writes up&lt;/a&gt; a nice approach using tools from &lt;a href="http://bitbucket.org/james_taylor/bx-python/wiki/Home"&gt;bx-python&lt;/a&gt;. And there's my own &lt;a href="http://pypi.python.org/pypi/pyfasta/"&gt;pyfasta&lt;/a&gt; for fasta files. This morning I set out to use another one: &lt;a href="http://github.com/acr/screed"&gt;screed&lt;/a&gt; from the pygr guys to index fastq and fasta files via sqlite db. Titus &lt;a href="http://ivory.idyll.org/blog/mar-10/storing-and-retrieving-sequences.html"&gt;wrote about&lt;/a&gt; screed and posted to the biology-in-python mailing list, which is how I originally heard about it.&lt;br /&gt;&lt;br /&gt;Screed and the biopython branch use sqlite to get quickly from name to thing--random access. This is a nice approach because sqlite comes with python and it's easy to use and quite fast.  Thinking simply about an index, all it really does is get you from some id (e.g. a fasta header or fastq name) to the &lt;i&gt;thing&lt;/i&gt; (the fasta sequence or the fastq record). &lt;br /&gt;In the case of flat files, it's a mapping from an id or name to the &lt;i&gt;fseek&lt;/i&gt; file-position at the start of a record. 
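&lt;br /&gt;A minimal sketch of that mapping, using a plain in-memory dict rather than a database (the helper name build_offsets and the toy FASTQ data are hypothetical, and this is not the FileIndex implementation):&lt;br /&gt;

```python
import io

def build_offsets(fh, lines_per_record=4):
    """Map each record's first line (its id) to the byte offset where it starts."""
    offsets = {}
    while True:
        pos = fh.tell()            # position at the start of this record
        header = fh.readline()
        if not header:             # an empty read means end of file
            break
        for _ in range(lines_per_record - 1):
            fh.readline()          # skip the remaining lines of the record
        offsets[header.rstrip(b"\r\n")] = pos
    return offsets

# Two toy FASTQ records held in memory; a real file opened in 'rb' mode
# would be used the same way.
data = b"@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nHHHH\n"
fh = io.BytesIO(data)
offsets = build_offsets(fh)

# Random access: seek straight to the start of record '@r2' and read it.
fh.seek(offsets[b"@r2"])
record = [fh.readline().rstrip(b"\r\n") for _ in range(4)]
print(offsets)
print(record)
```

&lt;br /&gt;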
Given that, it's possible to make a generic file indexer that creates an index given a file and a function that advances the file pointer (fh) to the next record and returns an id.&lt;br /&gt;&lt;br /&gt;So for the case of the &lt;a href="http://en.wikipedia.org/wiki/FASTQ_format"&gt;FASTQ&lt;/a&gt; format, which contains a new record every 4 lines, the first of which is the 'id', the parser could be:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;def fastq_next(fh):&lt;br /&gt;    id = fh.readline() # save the id.&lt;br /&gt;    # advance the file handle 3 lines.&lt;br /&gt;    fh.readline(); fh.readline(); fh.readline()&lt;br /&gt;    return id&lt;/pre&gt;&lt;br /&gt;So regardless of the file format, this is the interface. The function takes a file handle and 1) advances the file position to the start of the next record and 2) returns the id.&lt;br /&gt;Given that, the indexer call looks like:&lt;br /&gt;&lt;pre class="prettyprint"&gt;FileIndex.create('some.fastq', fastq_next)&lt;/pre&gt;&lt;br /&gt;All that call has to do is repeatedly send a filehandle to fastq_next, accept the id returned by fastq_next, and save a mapping of id to the (previous) file position. An implementation detail is how that mapping is saved. I use a tokyo-cabinet BTree database.&lt;br /&gt;Once the index is created, usage is:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;fi = FileIndex(somefile, handler)&lt;br /&gt;record = fi[somekey]&lt;/pre&gt;&lt;br /&gt;where handler is a callable (including a class) that takes a pointer and returns a thing. 
For fastq, it could be:&lt;pre class="prettyprint"&gt;&lt;br /&gt;class FastQEntry(object):&lt;br /&gt;    def __init__(self, fh):&lt;br /&gt;        self.name = fh.readline().rstrip('\r\n')&lt;br /&gt;        self.seq = fh.readline().rstrip('\r\n')&lt;br /&gt;        self.l3 = fh.readline().rstrip('\r\n')&lt;br /&gt;        self.qual = fh.readline().rstrip('\r\n')&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;so usage for fastq looks like:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;&gt;&gt;&gt; fi = FileIndex('some.fastq', FastQEntry)&lt;br /&gt;&gt;&gt;&gt; some_id = '@HWI-EAS105_0001:1:1:7:1680#0/1'&lt;br /&gt;&gt;&gt;&gt; record = fi[some_id]&lt;br /&gt;&gt;&gt;&gt; assert isinstance(record, FastQEntry)&lt;br /&gt;&gt;&gt;&gt; record.seq&lt;br /&gt;TATTTATTGTTATTAGTTATTTTANTANAAATANTNGANGGGGAGGAAGGNNNNNNTNNNNNNNNGANNNNANGAG&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;h4&gt;FastQ&lt;/h4&gt;So, from any name, there's direct access to each record. Given the implementation of &lt;i&gt;FileIndex&lt;/i&gt; setup, it's actually possible to create and use an index for a fastq file with 9 total lines of code, 6 of which are the class itself:&lt;pre class="prettyprint"&gt;&lt;br /&gt;class FastQEntry(object):&lt;br /&gt;    def __init__(self, fh):&lt;br /&gt;        self.name = fh.readline().rstrip('\r\n')&lt;br /&gt;        self.seq = fh.readline().rstrip('\r\n')&lt;br /&gt;        self.l3 = fh.readline().rstrip('\r\n')&lt;br /&gt;        self.qual = fh.readline().rstrip('\r\n')&lt;br /&gt;&lt;br /&gt;# note the 2nd argument takes a file handle and returns a name.&lt;br /&gt;FileIndex.create('some.fastq', lambda fh: FastQEntry(fh).name) &lt;br /&gt;fi = FileIndex('some.fastq', FastQEntry)&lt;br /&gt;print fi['@HWI-EAS105_0001:1:1:7:1680#0/1'].seq&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;h4&gt;SAM&lt;/h4&gt;That's decent, but what makes it cooler is that this same interface can be used to implement an index on a &lt;a href="http://samtools.sourceforge.net/"&gt;SAM&lt;/a&gt; 
(Sequence Alignment/Map) format file:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;class SamLine(object):&lt;br /&gt;    def __init__(self, fh):&lt;br /&gt;        line = fh.readline().split("\t") or [None]&lt;br /&gt;        self.name = line[0]&lt;br /&gt;        self.ref_seqid = line[2]&lt;br /&gt;        self.ref_loc = int(line[3])&lt;br /&gt;        # ... other SAM format stuff omitted.&lt;br /&gt;&lt;br /&gt;f = 'some.sam'&lt;br /&gt;FileIndex.create(f, lambda fh: SamLine(fh).name, allow_multiple=True)&lt;br /&gt;fi = FileIndex(f, SamLine, allow_multiple=True)&lt;br /&gt;print [(s.name, s.ref_seqid, s.ref_loc) for s in fi['23351265']]&lt;br /&gt;&lt;br /&gt;# output: [('23351265', '2', 8524), ('23351265', '3', 14202495)]&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;That's it. Note one extra thing: In an alignment represented in the SAM file, I used the read id (the first column) as the id for the indexing. That does not have to be unique, so I specify allow_multiple=True and it returns an array of records for every key. (So although using tokyocabinet is an implementation detail, putcat() functions make it much easier to support multiple entries per id.).&lt;br /&gt;&lt;h4&gt;Performance&lt;/h4&gt;For the SAM file I tested, the original file is 2.5G and the index created by FileIndex is 493M. For a large fastq file of 62.5 million lines (about 15.5 million records) the file is 3.3G and the index is 1.2G (compared to 3.8G for screed). It takes about 5 minutes to index the FASTQ file (compared to 11 for screed). &lt;br /&gt;On my machine, with that FASTQ file, I get about 17,000 queries per second (including object creation). 
For the same fastq, screed gives about 10,000 queries per second (which matches the image in &lt;a href="http://ivory.idyll.org/blog/mar-10/storing-and-retrieving-sequences.html"&gt;Titus' post&lt;/a&gt;).&lt;br /&gt;&lt;h4&gt;Implementation&lt;/h4&gt;The full implementation right now is 44 lines, including (sparse) comments, and is shown in the gist at the end of this post. It lacks a lot of nice things (like, um, tests) and I'm not recommending anyone use it in place of a well-thought-out project like the biopython stuff or screed. Still, I think the concept is useful--and I am using it for SAM files.&lt;br /&gt;I've put the full code in &lt;a href="http://github.com/brentp/bio-playground/tree/master/fileindex/"&gt;bio-playground&lt;/a&gt; on github, including the &lt;a href="http://github.com/brentp/bio-playground/tree/master/fileindex/examples/"&gt;examples&lt;/a&gt; from this post. But, this is generic enough that it could be used to index anything: blast-files, fasta, ... whatever. As always, I welcome any feedback.&lt;br /&gt;&lt;h4&gt;About Tokyo Cabinet&lt;/h4&gt;In the course of this, I found major problems in 3 different python wrappers for tokyo-cabinet. I reported bugs to 2 of them. So, kudos to &lt;a href="http://code.google.com/p/py-tcdb/"&gt;py-tcdb&lt;/a&gt;. While I don't like that I have to explicitly tell it _not_ to pickle what I send it, it's well tested, and the code is very nice and ... best of all, I haven't been able to make it segfault. It is also easy_install'able.&lt;br /&gt;Another thing I learned today is that you can't use a tokyo-cabinet hash database for more than a couple million records. A web-search shows that lots of people have asked about this and the answers always have to do with adjusting the bnum bucket parameter. &lt;b&gt;That does not work&lt;/b&gt;. If you can create a tokyo cabinet hash database with over 5 million records that does not slow down on inserting, please show me how. 
Until I see it, I think one has to use the BTree database in TC.&lt;br /&gt;&lt;script src="http://gist.github.com/362402.js?file=gistfile1.pyw"&gt;&lt;/script&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-2168741247436501916?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/2168741247436501916/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=2168741247436501916' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/2168741247436501916'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/2168741247436501916'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2010/04/fileindex.html' title='fileindex'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-8543142008770277567</id><published>2010-04-04T12:43:00.000-07:00</published><updated>2010-04-09T19:17:01.272-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='c'/><category scheme='http://www.blogger.com/atom/ns#' term='lua'/><title type='text'>writing and building a lua c extension module</title><content type='html'>[Update: 2010-04-09]&lt;br /&gt;This package is now on &lt;a href="http://luarocks.org"&gt;luarocks&lt;/a&gt; under its new name: "stringy". It now includes startswith and endswith methods (more coming). Under advice of lua gurus, I no longer add to the string base methods. 
And finally, it's available in my &lt;a href="http://github.com/brentp/lua-projects"&gt;lua github repo&lt;/a&gt;.&lt;br /&gt;[/Update]&lt;br /&gt;I've been messing with &lt;a href="http://lua.org"&gt;Lua programming language&lt;/a&gt; lately, mostly because I wanted to try to use &lt;a href="http://love2d.org"&gt;love2d&lt;/a&gt; as a visualization tool. I got side-tracked into building a c extension for lua. The C-API is much different from &lt;a href="http://docs.python.org/extending/extending.html#keyword-parameters-for-extension-functions"&gt;python&lt;/a&gt;.&lt;br /&gt;In lua, all the c functions have a signature like:&lt;pre class="prettyprint"&gt;int c_function(lua_State *L)&lt;/pre&gt;and you use the lua c API to pull values off the lua_state stack thingy -- &lt;i&gt;L&lt;/i&gt;.&lt;br /&gt;And then to return values, you just push them back onto the stack. I don't grok this fully yet, but it seems to handle all the memory allocation for you.&lt;br /&gt;Anyway, it's hard to find a full example of creating a C extension for lua 5.1. It actually seems more common just to &lt;a href="http://lua-users.org/wiki/LuaPowerPatches"&gt;provide patches for the lua distribution itself&lt;/a&gt;. There are some docs but they were difficult for me to find and it's not clear which docs are for lua-5.1--the current version. So I'm including a full example with sourcecode, simple tests, Makefile, and &lt;a href="http://luarocks.org"&gt;luarock spec&lt;/a&gt;.&lt;br /&gt;The full gist is &lt;a href="http://gist.github.com/355629"&gt;here&lt;/a&gt;.&lt;br /&gt;The C code (shamelessly stolen from the wiki and book -- not my own code) actually implements 2 very useful functions string.split and string.strip. 
These are otherwise easily added using lua's string patterns (its lightweight stand-in for regular expressions), but these are faster as they're not using the pattern-matching machinery:&lt;script src="http://gist.github.com/355629.js?file=stringext.c"&gt;&lt;/script&gt;&lt;br /&gt;Note the functions are included in a struct and "registered" with the &lt;i&gt;luaopen_stringext&lt;/i&gt; function.&lt;br /&gt;The Makefile then builds the shared library: &lt;script src="http://gist.github.com/355629.js?file=Makefile"&gt;&lt;/script&gt; The shared library stringext.so is then immediately usable from the lua interpreter. A test session looks like:&lt;script src="http://gist.github.com/355629.js?file=stringext_test.lua"&gt;&lt;/script&gt; Here, string.split() returns a table (array) of the tokens. Another cool thing about lua is visible in that script. The added functions actually become methods on Lua strings! So after importing stringext, all strings now have &lt;i&gt;strip()&lt;/i&gt; and &lt;i&gt;split()&lt;/i&gt; methods! This is because of the line in &lt;i&gt;stringext.c&lt;/i&gt;:&lt;pre&gt;luaL_openlib(L, "string", stringext, 0);&lt;/pre&gt; which tells it to add the methods to the "string" module.&lt;br /&gt;Finally, luarocks... &lt;a href="http://luarocks.org"&gt;Luarocks&lt;/a&gt; are the equivalent of python eggs or ruby gems (you gotta love all these clever names). They take a rockspec file (equiv of python's setup.{py,cfg}). Mine looks like &lt;a href="http://gist.github.com/raw/355629/94fc5febcaa21cc83b28530f90178148ae2de5b1/gistfile2.lua"&gt;this&lt;/a&gt;.&lt;br /&gt;I've requested that it be added to the main luarocks repository; there doesn't seem to be a way to upload your own rock directly. 
Still once you write the rockspec, you can build the C extension without a Makefile by typing &lt;pre&gt;luarocks make&lt;/pre&gt; and it handles all the appropriate flags and such.&lt;br /&gt;Anyone interested can download a tar-ball of the entire distribution &lt;a href="http://bpbio.googlecode.com/files/stringext-0.2.tar.gz"&gt;here&lt;/a&gt;.&lt;br /&gt;I add that the lua community seems very active and helpful. I asked a question about building the extension and received quick, helpful replies.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-8543142008770277567?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/8543142008770277567/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=8543142008770277567' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/8543142008770277567'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/8543142008770277567'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2010/04/writing-and-building-lua-c-extension.html' title='writing and building a lua c extension module'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-4445867074324593220</id><published>2010-02-16T07:06:00.000-08:00</published><updated>2010-02-16T07:13:49.250-08:00</updated><title type='text'>organizing genomic data</title><content type='html'>In some cases, using 
sqlite, a DBM, or just a python pickle is as much of a pain as dealing with something like &lt;a href="http://www.sequenceontology.org/gff3.shtml"&gt;gff&lt;/a&gt; (generic feature format) which is comprehensive but requires a lot of manipulation. Relational db's are slow and have extra overhead when I don't need the relational or ACID parts, and DBM's, for the most part, only allow querying by keys. These are 2 cases where I've found a nice alternative.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Big ol' Matrix&lt;/h4&gt;Because it was already nicely pre-processed, I downloaded some &lt;i&gt;Arabidopsis&lt;/i&gt; expression data from &lt;a href="http://atted.jp/"&gt;here&lt;/a&gt;. They keep the data in a hierarchy of files. I want to be able to quickly find the coexpression for any gene pair. 22573 * 22573 == 509 million entries is a bit large for a hash (aka python dictionary)--and even disregarding speed concerns it's more than I want to push into an sqlite db--but fits in a 1.9 gig file of 32 bit floats. So each entry is a measure of coexpression between 2 genes. In that sense, this is a key-key-value store. [for more details on how the coexpression is calculated, see the link above; I'm using this dataset specifically so I don't have to think about the raw expression data, which is difficult to deal with].&lt;br /&gt;&lt;br /&gt;I'm using numpy to create and then read that binary file (though it's just 32bit floats so the format will work for any language). The index of a particular gene in the array is its position in alphabetic order. So, since numpy indexes from zero, the level of coexpression between the 13656th gene and the 12355th gene is &lt;i&gt;arr[13655,12354]&lt;/i&gt;--or the equivalent pointer arithmetic if you're not using numpy. The average level of coexpression of the 13656th gene with all others is &lt;i&gt;arr[13655].mean()&lt;/i&gt;. 
Looking up the index of each gene is a simple matter of reading in the file of gene names (already ordered alphabetically) and creating a dictionary lookup of gene =&gt; index. The one line of code to do that looks like this:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;idx_lookup = dict((accn.rstrip(), i) for i, accn in enumerate(open(path + "/names.order")))&lt;br /&gt;&lt;/pre&gt;which can take an accn name like "At2g23450" and return its integer index.&lt;br /&gt;Since the binary file is 1.9G of data, it's not ideal to read the entire thing into memory. But, it's simple to &lt;a href="http://en.wikipedia.org/wiki/Memory-mapped_file"&gt;memory map&lt;/a&gt; it with numpy into a &lt;i&gt;big ol' matrix/array&lt;/i&gt;:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;L = len(idx_lookup)&lt;br /&gt;coexp = np.memmap(path + '/coexp.bin', dtype=np.float32, shape=(L, L), mode='r')&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;so then, given a pair like: AT1G47770,AT2G26550&lt;br /&gt;the code to find the coexpression is:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;ai = idx_lookup["AT1G47770"]&lt;br /&gt;bi = idx_lookup["AT2G26550"]&lt;br /&gt;ce = coexp[ai, bi]&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;this is stupid fast. To memory map the array, read in the idx_lookup and do 6294 queries takes 0.17 seconds on my machine--and much of that is the one-time cost of reading in idx_lookup. 
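&lt;br /&gt;As a rough sketch of how such a file could be written in the first place (this is not the script used for the real data; the three gene names, the 0.75 value, and the temporary path are all made up for illustration), numpy can create the memory-mapped file, fill it, and re-open it read-only:&lt;br /&gt;&lt;pre class="prettyprint"&gt;
import os
import tempfile

import numpy as np

# Hypothetical, already-sorted gene names standing in for names.order.
names = ["AT1G01010", "AT1G01020", "AT1G01030"]
idx_lookup = dict((accn, i) for i, accn in enumerate(names))
L = len(idx_lookup)

# Assumed scratch location for the demo file.
bin_path = os.path.join(tempfile.mkdtemp(), "coexp.bin")

# mode="w+" creates the file on disk with room for L * L float32 values.
out = np.memmap(bin_path, dtype=np.float32, shape=(L, L), mode="w+")
out[:] = 0.0
ai, bi = idx_lookup["AT1G01010"], idx_lookup["AT1G01030"]
out[ai, bi] = out[bi, ai] = 0.75  # made-up coexpression value
out.flush()
del out

# Re-open read-only, the same way as in the lookup code above.
coexp = np.memmap(bin_path, dtype=np.float32, shape=(L, L), mode="r")
pair_value = float(coexp[ai, bi])
file_size = os.path.getsize(bin_path)  # L * L * 4 bytes
&lt;/pre&gt;&lt;br /&gt;The file on disk is nothing more than L * L raw 4-byte floats, which is why any language can read it back.&lt;br /&gt;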
The code for all of this is readable &lt;a href="http://code.google.com/p/bpbio/source/browse/#svn/trunk/scripts/coex"&gt;here&lt;/a&gt; and svn'able via:&lt;br /&gt;&lt;pre&gt;svn checkout http://bpbio.googlecode.com/svn/trunk/scripts/coex/&lt;/pre&gt;&lt;br /&gt;That includes the code to get the data (get.sh), convert it to the .bin file (to_array.py), and serve it as a web script (coexp.wsgi).&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Flat&lt;/h4&gt;&lt;br /&gt;In most eukaryotic organisms there are on the order of 30K genes (varying from maybe a couple thousand up to 60K or so). Each gene can have multiple sub-features: CDS coding regions, &lt;a href="http://en.wikipedia.org/wiki/Messenger_RNA"&gt;mRNAs&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/UTR"&gt;UTR's&lt;/a&gt;, introns, etc. So it's a pretty small amount of data, but large enough and with enough structure that it's nice to have it in an efficient data structure. After five years, I think I am finally converging on a good abstraction. It holds 1 feature per row in a &lt;a href="http://docs.scipy.org/doc/numpy/user/basics.subclassing.html#basics-subclassing"&gt;subclass of a numpy array&lt;/a&gt;. This hides some complexity arising from the fact that each gene feature has multiple exons and each gene can have a different number of exons (gff uses one row per cds * gene * mRNA * exon). Numpy has typed data "fields" which are something like columns. Common types are floats and ints of various precisions and python Objects. So, the exon starts and stops are stored as a python list of integers, and that list is stored in the 'locs' object field of each row of the numpy array. 
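&lt;br /&gt;A minimal sketch of that layout, using a plain numpy structured array rather than the actual flatfeature subclass (the field names, the dtype, and the two rows are assumptions made up for illustration):&lt;br /&gt;&lt;pre class="prettyprint"&gt;
import numpy as np

# Assumed field layout; 'locs' is an object field so each row can hold
# a python list of any length.
dtype = [("accn", "U16"), ("seqid", "U8"), ("start", "i8"),
         ("end", "i8"), ("strand", "U1"), ("locs", "O")]

# Two made-up features with different numbers of exons.
rows = [
    ("geneA", "1", 100, 900, "+", [100, 250, 400, 900]),
    ("geneB", "4", 1500, 1800, "-", [1500, 1800]),
]
features = np.array(rows, dtype=dtype)

# Typed fields behave like columns and can be compared in bulk ...
on_chr4_minus = np.logical_and(features["seqid"] == "4",
                               features["strand"] == "-")
selected = features[on_chr4_minus]

# ... while 'locs' keeps the variable-length exon coordinates per row.
exon_counts = [len(locs) // 2 for locs in features["locs"]]
&lt;/pre&gt;&lt;br /&gt;The object field costs some speed compared to a purely numeric column, but it keeps one feature per row no matter how many exons that feature has.&lt;br /&gt;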
The .flat file format looks like this:&lt;br /&gt;&lt;pre&gt;#id  chr accn    start   end    strand  ftype   locs&lt;br /&gt;1   1   AT1G01010   3631    5899    +   CDS 3760,3913,3996,4276,4486,4605,4706,5095,5174,5326,5439,5630&lt;/pre&gt;&lt;br /&gt;where the columns should be mostly understandable by the header. `start` is the left-most position of the feature on the genome and `end` is the rightmost. `ftype` gives the type of the feature(s) in the following column, `locs`, which is a list of start,stop pairs. Most often ftype is `CDS` (or exon) for coding sequence--or the part of the gene that gets translated. From there, it's possible to take advantage of the speed and syntax of numpy:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt; &gt;&gt;&gt; from flatfeature import Flat&lt;br /&gt; # note also attaching the genomic fasta sequence.&lt;br /&gt; &gt;&gt;&gt; flat = Flat('data/thaliana_v8.flat', 'data/thaliana_v8.fasta')&lt;br /&gt; # so can get coding sequence.&lt;br /&gt; &gt;&gt;&gt; cds_seq = flat.row_cds_sequence('AT1G01370')&lt;br /&gt; # find all the gene names in a given chromosome, region:&lt;br /&gt; &gt;&gt;&gt; [gene['accn'] for gene in flat.get_features_in_region('1', 5000, 7000)]&lt;br /&gt; ['AT1G01010', 'AT1G01020']&lt;br /&gt;&lt;br /&gt; # how many features on chromosome 4 are on the - strand:&lt;br /&gt; &gt;&gt;&gt; flat[(flat['seqid'] == '4') &amp; (flat['strand'] == '-')].shape&lt;br /&gt; (2502,)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The row_cds_sequence method, while an ugly name, shows the advantage of tying the fasta to the Flat object--it allows extracting the genomic sequence from the fasta based on the coordinates in each row. It also highlights a problem with this data structure--what about &lt;a href="http://en.wikipedia.org/wiki/Alternative_splicing"&gt;alternative splicings&lt;/a&gt; which are very common? For our lab, we always use the union of all coding sequences so I report the resulting intervals as the 'locs' column. 
I intend to improve the flexibility of how that is handled in the code, if not the file.&lt;br /&gt;&lt;br /&gt;As a side note, the final example in that session shows how you can think of the numpy boolean-indexing syntax as a sort of SQL-like selection. So: &lt;i&gt;flat[(flat['seqid'] == '4') &amp; (flat['strand'] == '-')] &lt;/i&gt; reads like: "select * from flat where seqid = '4' and strand = '-'", but there's no database, it's all in memory, and the work is all done in C, very quickly. Actually, I think flatfeature and the huge matrix could both classify as &lt;a href="http://en.wikipedia.org/wiki/NoSQL"&gt;NoSQL&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The code for flatfeature is &lt;a href="http://github.com/brentp/flatfeature/"&gt;on github&lt;/a&gt;; it includes an example .flat file for arabidopsis thaliana in the data directory. I'll be tinkering with that for some time. As with all my code, any forks, patches, suggestions or ridicule will be welcome (in that order of preference). Finally, I'll add that for many cases &lt;a href="http://genometools.org"&gt;genometools&lt;/a&gt; works great; they do a nice job of wrapping GFF for a number of languages including python. 
But for some things GFF is too much trouble.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-4445867074324593220?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/4445867074324593220/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=4445867074324593220' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/4445867074324593220'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/4445867074324593220'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2010/02/organizing-genomic-data.html' title='organizing genomic data'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-1951637107107601515</id><published>2010-01-18T10:02:00.001-08:00</published><updated>2010-01-18T10:07:56.001-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bioinformatics'/><category scheme='http://www.blogger.com/atom/ns#' term='numpy'/><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><title type='text'>Loopless programming (for calculating methylation types)</title><content type='html'>DNA &lt;a href="http://en.wikipedia.org/wiki/DNA_Methylation"&gt;Methylation&lt;/a&gt; is used in plants as a means of epi-genetic regulation. 
&lt;a href="http://en.wikipedia.org/wiki/Bisulfite_sequencing"&gt;Bi-sulfite sequencing&lt;/a&gt; is a method used to determine the methylation pattern of a given set of DNA. Methylation occurs at 'C'ytosines. In plants, there are &lt;i&gt;3 types of methylation&lt;/i&gt;, determined by the nucleotides that follow the 'C':&lt;br /&gt;&lt;ul&gt;&lt;li&gt;CG: a 'G' follows the 'C'&lt;/li&gt;&lt;br /&gt;    &lt;li&gt;CHG: anything but a 'G' follows the 'C' and a 'G' follows that&lt;/li&gt;&lt;br /&gt;    &lt;li&gt;CHH: no 'G' in the 2 subsequent nucleotides.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;So, programmatically, it's a trivial matter to calculate the type of methylation that can occur at each cytosine in a given sequence. Though, once the edges, edge-cases, and reverse-complement are handled, it becomes a few lines of code full of loops and if's. For python (and most scripting languages) that's slow and loopy and iffy. The rest of this post explains how I used &lt;a href="http://numpy.scipy.org"&gt;numpy&lt;/a&gt; to make a fast methylation-type calculator without loops or ifs.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Loopless&lt;/h4&gt;&lt;br /&gt;Given a numpy array `seq` holding one nucleotide per element, it's possible to find all the 'C's with:&lt;pre&gt;np.where(seq == 'C')&lt;/pre&gt;
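&lt;br /&gt;To make that concrete, here is a small sketch of the idea (the example sequence is made up; it only covers the + strand and simply skips any 'C' in the last two positions, so it is an illustration rather than the code from this post):&lt;br /&gt;&lt;pre class="prettyprint"&gt;
import numpy as np

# Made-up sequence, one character per array element.
seq = np.array(list("ACGTCAGTCATCC"))

# Positions of every 'C' that still has two nucleotides after it.
c_idx, = np.where(seq[:-2] == "C")

# Look at the two following nucleotides for all of those Cs at once.
next1_is_g = seq[c_idx + 1] == "G"
next2_is_g = seq[c_idx + 2] == "G"

# 1 = CG, 2 = CHG, 3 = CHH; start with CHH and overwrite.
mtype = np.full(c_idx.shape, 3)
mtype[next2_is_g] = 2
mtype[next1_is_g] = 1  # CG wins even if a 'G' also sits two bases on
&lt;/pre&gt;&lt;br /&gt;Every step works on whole arrays, so the cost of the python interpreter is paid once per step rather than once per nucleotide.&lt;br /&gt;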
From there one can calculate the methylation type without looping or if'ing with:&lt;br /&gt;&lt;script src="http://gist.github.com/280224.js?file=loopless_methylation.py"&gt;&lt;/script&gt;&lt;br /&gt;which potentially takes more memory but is extremely fast, and (IMO) nicely shows how to take advantage of the things numpy is good at: slicing and indexing.&lt;br /&gt;From there, I put that (slightly modified to differentiate between + and - methylation) into a function named _calc_methylation, and call it from this function:&lt;br /&gt;&lt;script src="http://gist.github.com/280226.js?file=gistfile1.py"&gt;&lt;/script&gt;&lt;br /&gt;which does some work to handle the ends of the sequence and reverse-complementing. The result, as shown in the docstring, is a numpy array where the values indicate the type of methylation as:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;0: not a C or G&lt;br /&gt;1: CG at a 'C' (+ strand)&lt;br /&gt;2: CHG at a 'C' (+ strand)&lt;br /&gt;3: CHH at a 'C' (+ strand)&lt;br /&gt;4: CG at a 'G' (- strand)&lt;br /&gt;5: CHG at a 'G' (- strand)&lt;br /&gt;6: CHH at a 'G' (- strand)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The full implementation is &lt;a href="http://gist.github.com/280231"&gt;here&lt;/a&gt;. 
(even the reverse-complement is done without loops or ifs)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-1951637107107601515?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/1951637107107601515/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=1951637107107601515' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/1951637107107601515'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/1951637107107601515'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2010/01/loopless-programming-for-calculating.html' title='Loopless programming (for calculating methylation types)'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-8617627725846602698</id><published>2010-01-10T09:06:00.000-08:00</published><updated>2010-01-10T09:22:27.111-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bioinformatics'/><category scheme='http://www.blogger.com/atom/ns#' term='nwalign'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><title type='text'>needleman-wunsch global sequence alignment -- updates and optimizations to nwalign</title><content type='html'>I've written &lt;a href="http://hackmap.blogspot.com/2009/04/needleman-wunsch-global-sequence.html"&gt;previously&lt;/a&gt; about &lt;a 
href="http://pypi.python.org/pypi/nwalign/"&gt;nwalign&lt;/a&gt;, a python package I wrote to do fast (thanks to cython) global sequence alignment. Over break, I spent some time fixing some bugs and improving performance. &lt;br /&gt;&lt;h3&gt;Ack&lt;/h3&gt;It's actually nice to get bug reports for a relatively obscure bit of software like this as it shows it's getting used. Thanks especially to R. Christen for his patience in (repeatedly) showing me places where nwalign was not doing the right thing.&lt;br /&gt;&lt;h3&gt;Bugs&lt;/h3&gt;&lt;h4&gt;Placement of Gaps&lt;/h4&gt;Some of the "Bugs" were actually just "unspecified behavior". For example, given input text of "AGEBAMAM" and  "AGEBAM", the alignments:&lt;pre&gt;AGEBAMAM&lt;br /&gt;AGEBAM--&lt;/pre&gt;and&lt;pre&gt;AGEBAMAM&lt;br /&gt;AGEB--AM&lt;/pre&gt;Have the same score. However, the first version is generally more, um, polite. Previously, the behavior was deterministic, but depended on the length and order of the input sequence. As of version 0.3, nwalign will always put matches together when given a choice between 2 (or more) paths through the scoring matrix.&lt;br /&gt;&lt;h4&gt;Gap Extension&lt;/h4&gt;nwalign allows use of a &lt;a href="http://en.wikipedia.org/wiki/Substitution_matrix"&gt;scoring matrix&lt;/a&gt; to lookup the score/penalty for conversion from one letter to another or a simpler version where the user specifies gap_open, gap_extend, mismatch penalties, and match reward. For the simpler non-matrix path, earlier versions of nwalign did not heed the gap_extension--using the gap_open penalty regardless of the previous entries in the dynamic programming matrix. 
It's common to use only a gap penalty without a separate gap_extend penalty, but nwalign no longer assumes that's what is preferred -- if it is, one can simply specify a `gap_extend` penalty that is equal to `gap_open`.&lt;br /&gt;&lt;h3&gt;Scoring&lt;/h3&gt;Previous versions of nwalign only returned the alignment and didn't provide access to the score of the alignment. Recent versions allow the user to get the score via:&lt;pre class="prettyprint"&gt;&lt;br /&gt;&gt;&gt;&gt; nw.score_alignment('CEELECANTH', '-PELICAN--', gap_open=-5,&lt;br /&gt;...                     gap_extend=-2, matrix='PAM250')&lt;br /&gt;11&lt;/pre&gt;&lt;br /&gt;where `11` is the score of the given alignment under the gap_open and gap_extend penalties and the substitution scores in the file 'PAM250' (which is in the NCBI substitution matrix format). &lt;br /&gt;&lt;h3&gt;Performance&lt;/h3&gt;The earliest version of nwalign was literally about 100 times faster than the perl version from the BLAST book. But, there were a few places where performance has improved even more since then. The most dramatic was in the lookups for the scoring matrix values. Given, again, the 2 short strings "AGEBAMAM" and "AGEBAM", the alignment algorithm has to do the inner loop 48 (6 * 8) times, and in each iteration it has to look up the substitution score in the matrix. For example, in the first iteration, it has to find the score for "changing" from an "A" to an "A"; in the &lt;a href="http://prowl.rockefeller.edu/aainfo/pam250.htm"&gt;PAM250&lt;/a&gt; matrix, that score is +2. In the substitution matrix, the row and column "keys" are the amino acids and the values are the scores for changing from the row key to the column key. Previously, I stored the keys in their own array, then did a linear search to find the index. So, for both the row and column, an amino acid letter is translated to an index with a function like:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;aa_keys = ['A', 'R', 'N', 'D', ... 
] # or 'ARND...'&lt;br /&gt;def findpos(amino_acid, aa_keys):&lt;br /&gt;    i = 0&lt;br /&gt;    while i &lt; len(aa_keys):&lt;br /&gt;        if aa_keys[i] == amino_acid:&lt;br /&gt;            return i&lt;br /&gt;        i += 1&lt;br /&gt;    return -1&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;except in C(ython), so it was much faster. The alignment then uses the return values for the row and column amino acids to look up the score from the substitution matrix. When the strings are long enough--as they will be with real proteins--this is a huge speed bottleneck. So, the new version trades memory for speed. Instead of using a 25 * 25 matrix to store the substitution matrix, it now uses an X * X matrix where 'X' is the ord value of the largest amino acid. Usually this will be less than 'Z', so the matrix will be 90 * 90, stored efficiently as a numpy array. From there, the lookup is directly with matrix[ord(x_amino), ord(y_amino)] which is extremely fast in C because the ord is unnecessary as a char can index an array. This gave at least a 3X speed improvement. I could reduce the memory used by the matrix by subtracting ord('A'), but that would be trading speed for memory since it would require 2 extra subtractions per inner loop. &lt;br /&gt;&lt;br /&gt;Also, the latest version uses less memory in other areas; in order to do the dynamic programming, the algo has to save N * M arrays of direction "pointers" (in the left/right/diagonal sense, not the c sense), scores and gaps. However, the gaps are not actually needed for the entire trace-back; they are only needed 1 step back to determine if a current gap is a gap_open or a gap_extend. So, now, instead of an N*M matrix for the gap matrix, it's N*1 and M*1 arrays. 
For large N * M, this is enough to offset the increased memory incurred by how the substitution matrix is stored.&lt;br /&gt;&lt;br /&gt;Running a benchmark test script with nwalign-0.1 takes 12.13 seconds and the same script on nwalign-0.3.1 takes 3.41 seconds (and actually runs in 2.99 seconds with unladen-swallow... but that's another story). The script does an alignment on 1200 * 1600 basepair sequences 100 times. As with most things, I figured most of this by banging my head against the wall long enough that the stars &lt;i&gt;align&lt;/i&gt;ed, so to speak, anyone who cares to look at the &lt;a href="http://bitbucket.org/brentp/biostuff/src/ca3b83294a71/nwalign/"&gt;code&lt;/a&gt; and offer suggestions would be much appreciated.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-8617627725846602698?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/8617627725846602698/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=8617627725846602698' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/8617627725846602698'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/8617627725846602698'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2010/01/needleman-wunsch-global-sequence.html' title='needleman-wunsch global sequence alignment -- updates and optimizations to nwalign'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-666675435925245397</id><published>2010-01-04T18:24:00.001-08:00</published><updated>2010-01-04T18:49:18.714-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='pyfasta'/><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><title type='text'>updates to pyfasta</title><content type='html'>At the end of last year, I put in quite a few updates to &lt;a href="http://pypi.python.org/pypi/pyfasta/"&gt;pyfasta&lt;/a&gt;. One of the nicest is the new &lt;i&gt;flatten&lt;/i&gt; stuff. In order to provide fast access to the sequence data, pyfasta creates a separate flattened version of the sequence file containing no newlines or headers for any file that it interacts with. That flattened file is used as the basis for the index which allows fast random-access. This is an additional file, nearly the same size as the original, and can be more space overhead than one would like to incur when dealing with large files. The new "flatten_inplace" keyword arg to the pyfasta.Fasta() constructor will remove the newlines within each sequence but keep the headers. This will leave the fasta file in a valid FASTA format that BLAST or any sequence tools will understand, but will also allow fast access via pyfasta, since pyfasta only needs to know the file position where each sequence starts and ends.&lt;br /&gt;With this option, a file like:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;&gt;a&lt;br /&gt;actg&lt;br /&gt;actg&lt;br /&gt;actg&lt;br /&gt;&gt;b &lt;br /&gt;aaaacccc&lt;br /&gt;aaaacccc&lt;/pre&gt;&lt;br /&gt;will be overwritten (flattened in-place) to:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;&gt;a&lt;br /&gt;actgactgactg&lt;br /&gt;&gt;b&lt;br /&gt;aaaaccccaaaacccc&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Simple. 
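&lt;br /&gt;The transformation itself is easy to sketch. The function below is not pyfasta's implementation (which rewrites the file on disk and records file positions); flatten_text is a hypothetical in-memory helper that only shows what happens to the text:&lt;br /&gt;&lt;pre class="prettyprint"&gt;
def flatten_text(fasta_text):
    """Join the sequence lines of each record; header lines are kept."""
    out = []
    seq_parts = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record's sequence.
            if seq_parts:
                out.append("".join(seq_parts))
                seq_parts = []
            out.append(line)
        else:
            seq_parts.append(line)
    if seq_parts:
        out.append("".join(seq_parts))
    return "\n".join(out) + "\n"


original = ">a\nactg\nactg\nactg\n>b\naaaacccc\naaaacccc\n"
flattened = flatten_text(original)
&lt;/pre&gt;&lt;br /&gt;With one header line and one sequence line per record, the start and end of every sequence can be described by two file offsets.&lt;br /&gt;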
When dealing with large files (where this is actually useful), the flattened file does not behave well when opened in an editor because the editor will attempt to read a number of lines into the buffer and a single line may be 200 mega-bases. So, this is a pain if you're planning to sit down with a cup of joe and read through the genome, but otherwise, the fasta file should be un-affected.&lt;br /&gt;This method is not currently the default (though it may be so in future versions). But, it's possible to use the commandline: &lt;pre&gt;pyfasta flatten some.fasta&lt;/pre&gt; which will create the flattened fasta (and the index file) and a placeholder some.fasta.flat, containing the text "@flattened@" as a marker to pyfasta that it's ok to use the original (now-flattened) fasta. Once the file is flattened, there is no performance loss compared to having a separate flat file containing no headers.&lt;br /&gt;&lt;br /&gt;pyfasta was a fun project for me in 2009. It's a ridiculously simple little module, but when I started it, I didn't think there was a good alternative. (Though discriminating Fasta-ers should look at the sequence module in pygr, and the Bio.Seq module in BioPython which I think has improved quite a lot recently). It has over 100 tests and very close to 100% test coverage for the modules in pyfasta, and much of the code is run once for each of the 4 backends. 
&lt;br /&gt;&lt;br /&gt;the source is on &lt;a href="http://bitbucket.org/brentp/biostuff/src/"&gt;bitbucket&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-666675435925245397?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/666675435925245397/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=666675435925245397' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/666675435925245397'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/666675435925245397'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2010/01/updates-to-pyfasta.html' title='updates to pyfasta'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-5997151342279256585</id><published>2009-12-23T06:46:00.000-08:00</published><updated>2009-12-23T07:46:30.047-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><title type='text'>genome scrubber : mask repetitive sequence</title><content type='html'>This is to describe a simple tool I've made available (&lt;a href="http://bpbio.googlecode.com/svn/trunk/scripts/mask_genome"&gt;svn repo&lt;/a&gt;) for masking repetitive sequence.&lt;br /&gt;&lt;br /&gt;rice (Oryza Sativa) version 5 sequence looks something like below when run through &lt;a 
href="http://pypi.python.org/pypi/pyfasta/"&gt;pyfasta&lt;/a&gt; info.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;rice.fasta&lt;br /&gt;==========&lt;br /&gt;&gt;1 length:43596771 &lt;br /&gt;&gt;3 length:36345490 &lt;br /&gt;&gt;2 length:35925388 &lt;br /&gt;&gt;4 length:35244269 &lt;br /&gt;&gt;6 length:31246789 &lt;br /&gt;&gt;5 length:29874162 &lt;br /&gt;&gt;7 length:29688601 &lt;br /&gt;&gt;11 length:28462103 &lt;br /&gt;&gt;8 length:28309179 &lt;br /&gt;&gt;12 length:27497214 &lt;br /&gt;&gt;9 length:23011239 &lt;br /&gt;&gt;10 length:22876596 &lt;br /&gt;&lt;br /&gt;372.078M basepairs in 12 sequences&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;So, it's not huge (still only 1/10th the size of human), but it can be difficult to deal with the entire genome because of the large amount of repetitive sequence and transposable elements. This is sometimes mistakenly referred to as "junk DNA". While that's not true, it does make whole-genome analyses a pain as the output is dominated by repetitive sequences matching their own families. Doing a blast of the rice genome with this command:&lt;br /&gt;&lt;pre&gt;/usr/bin/blastall -K 100 -a 8 -d rice.fasta -e 0.001 -i rice.fasta -m 8 -o rice_rice.blast -p blastn&lt;/pre&gt;&lt;br /&gt;nearly causes my machine to run out of memory (16G), takes a couple of days to run, and results in a blast output of 5.1G and 84 million rows--that's 84 million blast hits with an e-value below 0.001! By definition, that output is dominated by the repetitive elements. Repetitive elements are interesting, but in the case where we want to look at &lt;a href="http://en.wikipedia.org/wiki/Synteny"&gt;synteny&lt;/a&gt;, we have to wade through that 5.1G of stuff to find the very small chunk of data we need. 
This adds time to run the sequence comparison, time to parse, time to plot, time to analyze, and data to store, etc...&lt;br /&gt;The solution we use in our lab is to create a "masked" sequence where we 'X' out any DNA sequence occurring more than a given number of times in the original blast. So, from the output of the blast above, any basepair that is covered more than 50 times is hard-masked with 'X'. The single script to run this is &lt;a href="http://code.google.com/p/bpbio/source/browse/trunk/scripts/mask_genome/mask_genome.py"&gt;here&lt;/a&gt;. It is available from svn via:&lt;br /&gt;&lt;pre&gt;svn checkout http://bpbio.googlecode.com/svn/trunk/scripts/mask_genome&lt;/pre&gt;&lt;br /&gt;When run as a script with no options, it will print some help text. To mask rice at 50X, it is run as:&lt;br /&gt;&lt;pre&gt;python mask_genome.py -b rice_rice.blast -f rice.fasta -o rice -c 50 -m X&lt;/pre&gt;&lt;br /&gt;This hard-masks with "X". To soft-mask (changing basepairs to upper-case, except those occurring more than 50 times, which are lower-cased), one can use the commandline option -m SOFT.&lt;br /&gt;The resulting file, "rice.masked.50.fasta", has the same number of chromosomes, each with the same number of basepairs, but with repetitive regions masked. To show the difference, I re-ran the full-genome blast, except this time on the masked genome:&lt;pre&gt;/usr/bin/blastall -K 100 -a 8 -d rice.masked.50.fasta -e 0.001 -i rice.masked.50.fasta -m 8 -o ricemasked.blast -p blastn&lt;/pre&gt;The resulting blast file is only 128MB and 2 million rows (contrast to 5.1G and 84 million rows), that is, a blast file roughly 2.4% of the original size. This makes things simpler, and makes doing genomic analyses more enjoyable. 
This is a different tool than the dust filter that comes with BLAST or many others, because it masks based on the global coverage at each location (there are similar tools).&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Implementation&lt;/h3&gt;&lt;br /&gt;&lt;h4&gt;Counting&lt;/h4&gt;&lt;br /&gt;In order to do the full-genome mask, mask_genome.py stores an array for each chromosome, with a length equal to that of the chromosome in a &lt;a href="http://pytables.org"&gt;pytables/hdf5&lt;/a&gt; structure which acts like a numpy array. So to increment the counts for a blast hit, the code looks like:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;        cache[qchr][qstart - 1: qstop] += 1&lt;br /&gt;        cache[schr][sstart - 1: sstop] += 1&lt;/pre&gt;&lt;br /&gt;where cache is the hdf5 structure (except that in some cases it's read into memory to improve speed), 'qchr' and 'schr' are the query and subject chromosome keys with the values being the count arrays, and 's/qstart', 's/qstop' are the bounds of the subject and query for a particular blast line. This sends much of the work to numpy, instead of iterating over the range of qstart, qstop in python. That is repeated for every row in the blast file.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Masking&lt;/h4&gt;&lt;br /&gt;Once all of the counts are created, the script uses that data to mask out the original sequence for any element in count that is greater than the cutoff. That is achieved in a single line iterated per chromosome:&lt;br /&gt;&lt;pre&gt;masked_seq = np.where(numexpr.evaluate("hit_counts &gt; %i" % cutoff)&lt;br /&gt;                                      , mask_value, seq).tostring()&lt;/pre&gt;&lt;br /&gt;where `np` is from "import numpy as np", `seq` is the original fasta sequence in upper case, `mask_value` is either a constant, like 'X', or in the case of soft-masking, a lower-case copy of `seq`, and `hit_counts` is the array of counts. 
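&lt;br /&gt;&lt;br /&gt;To make the counting and masking steps concrete, here is a toy version that can be pasted into a python shell. It is a sketch under assumptions, not the code in mask_genome.py: the ten-basepair "chromosome", the three hits and the cutoff of 2 are invented for the example, the counts live in a plain in-memory numpy array instead of pytables, and the comparison is done by numpy directly:&lt;br /&gt;&lt;pre class="prettyprint"&gt;
import numpy as np

seq = np.array(list("ACGTACGTAC"))               # toy chromosome, upper case
hit_counts = np.zeros(len(seq), dtype=np.uint32)

# made-up (start, stop) blast hits, 1-based and inclusive like blast output
for start, stop in [(1, 6), (3, 8), (3, 5)]:
    hit_counts[start - 1: stop] += 1             # one slice update per hit

cutoff = 2
# hard mask: replace over-represented basepairs with a constant
hard = "".join(np.where(hit_counts > cutoff, "X", seq))
# soft mask: replace them with a lower-case copy of the sequence
soft = "".join(np.where(hit_counts > cutoff, np.char.lower(seq), seq))

print(hard)   # ACXXXCGTAC
print(soft)   # ACgtaCGTAC
&lt;/pre&gt;&lt;br /&gt;Only positions 3 to 5 are covered by all three hits, so only those exceed the cutoff. The one-liner in the script does the same thing, except that it routes the comparison through numexpr.&lt;br /&gt;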
The only tricky bit there is the use of &lt;a href="http://code.google.com/p/numexpr/"&gt;numexpr&lt;/a&gt; which makes things a bit faster. The `masked_seq` is written to file after its corresponding header, and the masking is performed for the next chromosome. The entire masking takes under 10 minutes on my machine.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Plotting&lt;/h4&gt;&lt;br /&gt;Generally, we use the masked fasta file as the final output, but there's useful information in the hdf5 file--a sort of measure of repetitiveness at each basepair in the genome. Also in the svn repo is a short matplotlib web script (for mod_wsgi/mod_python but can be modified to run as cgi or command-line) to generate an image given a request like: &lt;br /&gt;&lt;pre&gt;?org=rice&amp;xmin=12&amp;xmax=90000&amp;width=800&amp;seqid=2&lt;/pre&gt;&lt;br /&gt;where `org` is the name of the organism sent to the mask_genome.py script, `xmin`, `xmax` specify the extents in basepairs, and `seqid` the chromosome.&lt;br /&gt;Here's an example from our genome viewer (built on &lt;a href="http://openlayers.org"&gt;openlayers&lt;/a&gt;) where I've added copy count as a layer (with admittedly poor cartography). The nice gene renderings and basepair location ruler are courtesy of Eric Lyons.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_uU_kLC5AdTc/SzItxwTCDOI/AAAAAAAAAxY/zmNs-wFFxoA/s1600-h/copy_count.png"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 400px; height: 119px;" src="http://2.bp.blogspot.com/_uU_kLC5AdTc/SzItxwTCDOI/AAAAAAAAAxY/zmNs-wFFxoA/s400/copy_count.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5418443634481695970" /&gt;&lt;/a&gt;The height of the blue line indicates the copy count. The flat red line is at a copy count of 50. 
So, the genes in this region are repetitive, but it's possible to see that they are surrounded by &lt;a href="http://en.wikipedia.org/wiki/Long_terminal_repeat"&gt;long terminal repeats&lt;/a&gt;, and that there is some repetitive sequence in the introns (the gray between the big green exons). Scrolling along, openlayers-style, it's possible to see patterns, and spot transposons.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;web&lt;/h4&gt;&lt;br /&gt;I've made a few genomes &lt;a href="http://copymask.syntelog.com/"&gt;available&lt;/a&gt; and will be adding more. There's also a (currently undocumented) service to plot the genome. So if the organism is brachy and the chromosome of interest is Bd1 then:&lt;br /&gt;&lt;pre&gt;http://copymask.syntelog.com/plot/brachy/Bd1/?xmin=12&amp;xmax=38345&amp;width=600&lt;/pre&gt;&lt;br /&gt;will result in:&lt;br /&gt;&lt;img src="http://copymask.syntelog.com/plot/brachy/Bd1/?xmin=12&amp;xmax=38345&amp;width=600" /&gt;&lt;br /&gt;where the image is described above.&lt;br /&gt;&lt;br /&gt;If you're doing full-genome analyses, give it a try and let me know any suggestions. There's a full example starting from only a genomic fasta, including the BLAST command in the README.rst included in svn. If you get a genome that's too big to run through a full-genome BLAST (maize, ahem). 
You can split with "pyfasta -split -n 4", run 4 blasts, concat the results, and send through mask_genome.py.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-5997151342279256585?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/5997151342279256585/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=5997151342279256585' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/5997151342279256585'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/5997151342279256585'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2009/12/genome-scrubber-mask-repetitive.html' title='genome scrubber : mask repetitive sequence'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_uU_kLC5AdTc/SzItxwTCDOI/AAAAAAAAAxY/zmNs-wFFxoA/s72-c/copy_count.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-7622560658333212225</id><published>2009-10-23T16:28:00.000-07:00</published><updated>2009-10-23T17:10:25.617-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='s5'/><category scheme='http://www.blogger.com/atom/ns#' term='pygments'/><title type='text'>rst2s5 with syntax highlighting</title><content type='html'>&lt;h3&gt;Restructured Text to S5 
Presentation&lt;/h3&gt;&lt;br /&gt;(lots of caffeine today so 2 posts in one day)&lt;br /&gt;The stub example presentation I'll be talking about is viewable as a presentation &lt;a href="http://bpgeo.googlecode.com/svn/trunk/rst2s5_template/index.html"&gt;here&lt;/a&gt; (click on that page to advance the demo slides).&lt;br /&gt;&lt;br /&gt;There's a nice browser-based tool for presentations called &lt;a href="http://meyerweb.com/eric/tools/s5/"&gt;S5&lt;/a&gt;. In recent python &lt;a href="http://docutils.sourceforge.net/"&gt;docutils&lt;/a&gt;, there's a tool called &lt;i&gt;rst2s5.py&lt;/i&gt; which converts restructured-text to an s5 presentation. However, it's not obvious how to get syntax highlighting for code blocks to work.&lt;br /&gt;&lt;a href="http://pygments.org/"&gt;pygments&lt;/a&gt;, a python library that will highlight syntax for many programming languages, comes with &lt;a href="http://dev.pocoo.org/projects/pygments/browser/external/rst-directive.py"&gt;this file&lt;/a&gt;, which they recommend you use as a starting point. That's what I did, and I've created a stub example project accessible via subversion:&lt;pre&gt;&lt;br /&gt;$ svn export http://bpgeo.googlecode.com/svn/trunk/rst2s5_template/&lt;/pre&gt;&lt;br /&gt;with a build script and a couple of example slides (and a nice theme). It's possible to change the theme by editing rst-directive.py (included in the source) and changing the &lt;i&gt;STYLE&lt;/i&gt; global to a theme that pygments knows about (being comfortable with my machismo, I chose "fruity"). One way to find other themes is to check the drop down box on &lt;a href="http://pygments.org/demo/2403/"&gt;the paste bin at the pygments site&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;To use the template, just edit the index.rst, then run build.sh and view the resulting index.html in a browser. 
The S5 "theme" it's using is specified in the &lt;i&gt;build.sh&lt;/i&gt; script and contained in the ui/ sub directory, you can find more themes on the s5 site and others that come with the s5 install.&lt;br /&gt;&lt;br /&gt;Check out the pretty example project &lt;a href="http://bpgeo.googlecode.com/svn/trunk/rst2s5_template/index.html"&gt;here&lt;/a&gt; (with python code 3 slides in). As with any S5 slideshow, you can click to advance slides or use the controls that appear in the bottom right when the mouse is in that area.&lt;br /&gt;&lt;br /&gt;As a bonus, if the index.rst file contains python shell sessions (doctests) like the example, you can check them with &lt;a href="http://somethingaboutorange.com/mrl/projects/nose/"&gt;nose&lt;/a&gt; using:&lt;pre&gt;&lt;br /&gt;$ nosetests --with-doctest --doctest-extension=.rst index.rst&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-7622560658333212225?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://bpgeo.googlecode.com/svn/trunk/rst2s5_template/index.html' title='rst2s5 with syntax highlighting'/><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/7622560658333212225/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=7622560658333212225' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/7622560658333212225'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/7622560658333212225'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2009/10/rst2s5-with-syntax-highlighting.html' title='rst2s5 with syntax 
highlighting'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-8768821429207662996</id><published>2009-10-22T17:55:00.000-07:00</published><updated>2009-10-23T08:07:21.152-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ctypes'/><category scheme='http://www.blogger.com/atom/ns#' term='gis'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><title type='text'>some python ctypes stuff in Rtree</title><content type='html'>I've been working with and on the &lt;a href="http://pypi.python.org/pypi/Rtree"&gt;Rtree&lt;/a&gt; python module. It's some cool work done by &lt;a href="http://hobu.biz"&gt;Howard Butler&lt;/a&gt; (hobu) (originally with &lt;a href="http://sgillies.net"&gt;Sean Gillies&lt;/a&gt;) to make python bindings for the &lt;a href="http://www.research.att.com/~marioh/spatialindex/index.html"&gt;Spatial Index&lt;/a&gt; C++ library by Marios Hadjieleftheriou which provides various tree structures for efficient spatial searches in n-dimensions. Hobu has written a C API for that along with a new &lt;a href="http://docs.python.org/library/ctypes.html"&gt;ctypes&lt;/a&gt; wrapper to that API which appears in Rtree 0.5 and greater. There is some cool &lt;a href="http://docs.python.org/library/ctypes.html"&gt;ctypes&lt;/a&gt; stuff in there which I'm starting to understand.&lt;br /&gt;From the website:&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;ctypes is a foreign function library for Python. It provides C compatible data types, and allows calling functions in DLLs or shared libraries. 
It can be used to wrap these libraries in pure Python.&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;As a simple example of how ctypes works, we can pretend there's no atan() in python's math module and access the one from libm in c like this:&lt;br /&gt;&lt;pre class="prettyprint"&gt;import ctypes&lt;br /&gt;libm = ctypes.CDLL('libm.so.6')&lt;br /&gt;&lt;br /&gt;# the following 2 lines correspond to the c signature: double atan(double)&lt;br /&gt;libm.atan.argtypes = [ctypes.c_double]&lt;br /&gt;libm.atan.restype = ctypes.c_double&lt;br /&gt;&lt;br /&gt;print(libm.atan(0.22))&lt;br /&gt;print(libm.atan(ctypes.c_double(0.22)))&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;where the ctypes.CDLL line tells ctypes how to find the library (have a look at &lt;a href="http://trac.gispython.org/lab/browser/Rtree/trunk/rtree/core.py#L12"&gt;Rtree&lt;/a&gt; or &lt;a href="http://trac.gispython.org/lab/browser/shapely.geos/trunk/shapely/geos/__init__.py"&gt;shapely&lt;/a&gt; source code to see the cross-platform way to do that). The argtypes and restype lines tell it the input types and return type respectively, and the two print lines call the c function by way of the ctypes wrapper. Here, it's calling the version of atan with a double precision number. With simple types, you can let ctypes wrap a python value in the type or you can do so explicitly as in the last line.&lt;br /&gt;&lt;br /&gt;Things get more interesting with more complicated return types. For a c function with a char * return type, e.g. this contrived example:&lt;pre class="prettyprint"&gt;// does not need to be freed.&lt;br /&gt;char* fn_char(){&lt;br /&gt;    char *s = "asdf";&lt;br /&gt;    return s;&lt;br /&gt;}&lt;/pre&gt;&lt;br /&gt;the ctypes invocation -- with 'ccode' being this contrived library as loaded by ctypes.CDLL -- looks like:&lt;pre class="prettyprint"&gt;ccode.fn_char.restype = ctypes.c_char_p&lt;br /&gt;print(ccode.fn_char())&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;which returns "asdf" as expected and does not leak. 
(Note that ctypes.c_char_p is "c char pointer" or "char *".) If you get a copy of a char * and are responsible for freeing its memory, e.g. from the c function:&lt;pre class="prettyprint"&gt;// needs to be freed.&lt;br /&gt;const char * fn_const_char(){&lt;br /&gt;    return (const char * )strdup("asdf");&lt;br /&gt;}&lt;/pre&gt;&lt;br /&gt;the ctypes for that looks like:&lt;pre class="prettyprint"&gt;&lt;br /&gt;def get_and_free(achar_p): &lt;br /&gt;    s = ctypes.string_at(achar_p)&lt;br /&gt;    libc.free(achar_p)&lt;br /&gt;    return s&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;ccode.fn_const_char.restype = get_and_free&lt;br /&gt;print(ccode.fn_const_char())&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;where libc is the standard c library defining the function to &lt;i&gt;free&lt;/i&gt;() memory. In this case, it takes advantage of the feature that .restype can be a callable which takes the pointer returned from the c code. In get_and_free(), ctypes.string_at() turns that pointer address into a python string. Then the char * pointer is free'd, and the python string is returned, and "asdf" is printed as expected. One caveat: ctypes treats a callable restype as if the c function returned an int, so on a 64-bit platform the pointer handed to get_and_free() may be truncated; the errcheck approach that follows avoids this.&lt;br /&gt;It's also possible to do more rigorous error checking with errcheck, in which case the ctypes looks like:&lt;pre class="prettyprint"&gt;&lt;br /&gt;def err_check(char_p, fn, args):&lt;br /&gt;    s = ctypes.string_at(char_p)&lt;br /&gt;    libc.free(char_p)&lt;br /&gt;    return s&lt;br /&gt;&lt;br /&gt;ccode.fn_const_char.restype = ctypes.POINTER(ctypes.c_char)&lt;br /&gt;ccode.fn_const_char.errcheck = err_check&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;where err_check gets 3 arguments: the result of the function, a reference to the function, and the args sent to it (in this case args is empty). Note that in this case, we have to specify the restype ctypes.POINTER(ctypes.c_char) so that we still have the pointer address--which we then free. 
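&lt;br /&gt;&lt;br /&gt;To try the errcheck pattern without compiling the contrived library, the c library's own strdup() can stand in for fn_const_char(), since it also returns a malloc'ed char * that the caller must free. This is a sketch with assumptions: the name copy_and_free is made up, and ctypes.CDLL(None) is used to reach the c library already loaded in the process, which works on linux and mac but not on windows:&lt;br /&gt;&lt;pre class="prettyprint"&gt;
import ctypes

# assumption: a POSIX system, where CDLL(None) exposes libc symbols
libc = ctypes.CDLL(None)
libc.free.argtypes = [ctypes.c_void_p]
libc.free.restype = None


def copy_and_free(char_p, fn, args):
    """errcheck hook: copy the c string into python, then free the buffer."""
    if not char_p:                        # NULL pointer, strdup failed
        raise MemoryError("strdup returned NULL")
    s = ctypes.string_at(char_p)          # copies up to the terminating NUL
    libc.free(char_p)                     # release the malloc'ed memory
    return s


libc.strdup.argtypes = [ctypes.c_char_p]
libc.strdup.restype = ctypes.POINTER(ctypes.c_char)   # keep the raw address
libc.strdup.errcheck = copy_and_free

result = libc.strdup(b"asdf")
print(result)
&lt;/pre&gt;&lt;br /&gt;Whatever the errcheck function returns becomes the result of the call, so the caller only ever sees a python string (a bytes object on python 3) and never has to think about the pointer.&lt;br /&gt;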
When the restype is specified as ctypes.c_char_p (char *), then ctypes automatically gives us the python string and we can't (as far as I know) free the memory and a leak occurs. Also, in the case above, I haven't actually added any extra error checking, in Rtree hobu has a few functions to check the return values, see that code &lt;a href="http://trac.gispython.org/lab/browser/Rtree/trunk/rtree/core.py#L12"&gt;here&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;This post has been pretty basic, there's also good ctypes code in &lt;a href="http://pypi.python.org/pypi/Shapely"&gt;shapely&lt;/a&gt;, &lt;a href="http://code.djangoproject.com/browser/django/trunk/django/contrib/gis"&gt;geodjango&lt;/a&gt;, and &lt;a href="http://pypi.python.org/pypi/libLAS/"&gt;libLAS&lt;/a&gt;. Next post I'll talk about callback functions -- calling a C function that expects a pointer-to-a-function with a python function as an argument.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-8768821429207662996?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/8768821429207662996/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=8768821429207662996' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/8768821429207662996'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/8768821429207662996'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2009/10/some-python-ctypes-stuff-in-rtree.html' title='some python ctypes stuff in Rtree'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-4080463775693667097</id><published>2009-10-14T09:14:00.000-07:00</published><updated>2009-10-14T10:21:42.876-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='fasta'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><title type='text'>biostuff</title><content type='html'>I've been trying in 2009 to write less throw-away code. I'm not sure how successful I've been at that, but at least I'm writing more code that I keep around.&lt;br /&gt;Previously, I stuck anything of at least marginal quality and re-usability into my google code project &lt;a href="http://code.google.com/p/bpbio"&gt;bpbio&lt;/a&gt;. As of yesterday, I've moved a lot of stuff from there to &lt;a href="http://bitbucket.org/brentp/biostuff/src/"&gt;bitbucket&lt;/a&gt;. "Biostuff" is where I'll put modules that are well documented and tested in hopes that using a distributed VCS and a project that doesn't contain my initials will foster contributions. Currently, all the modules on bitbucket are also on pypi. &lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;pyfasta&lt;/span&gt; provides pythonic access to fasta sequence files. Previously, it had been a part of &lt;a href="http://code.google.com/p/genedex/"&gt;genedex&lt;/a&gt; (which I've stopped supporting since &lt;a href="http://twitter.com/howardbutler"&gt;@hobu&lt;/a&gt; has done so much good work on &lt;a href="http://pypi.python.org/pypi/Rtree"&gt;Rtree&lt;/a&gt; that genedex is now pretty much obsolete) but it's been pulled out, simplified, and improved. 
Check out the docs on &lt;a href="http://pypi.python.org/pypi/pyfasta/"&gt;pypi&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;&lt;a href="http://pypi.python.org/pypi/nwalign"&gt;nwalign&lt;/a&gt;&lt;/span&gt; is a command-line or python interface to the &lt;a href="http://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm"&gt;Needleman-Wunsch&lt;/a&gt; global sequence alignment which I've blogged about &lt;a href="http://hackmap.blogspot.com/2009/04/needleman-wunsch-global-sequence.html"&gt;previously&lt;/a&gt;. Whenever I need to do stuff with cython and numpy I use &lt;a href="http://bitbucket.org/brentp/biostuff/src/tip/nwalign/nwalign.pyx"&gt;nwalign.pyx&lt;/a&gt; for reference (though there's probably better material out there). &lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;&lt;a href="http://pypi.python.org/pypi/simpletable"&gt;simpletable&lt;/a&gt;&lt;/span&gt;, as the name suggests, is a wrapper around &lt;a href="http://pytables.org"&gt;pytables&lt;/a&gt; to remove some of the boiler-plate in creating a table and dataset and Description. That's it.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;&lt;a href="http://pypi.python.org/pypi/skidmarks"&gt;skidmarks&lt;/a&gt;&lt;/span&gt; is a small module to check for runs in data (get it? skidmarks, runs). It implements &lt;a href="http://en.wikipedia.org/wiki/Wald-Wolfowitz_runs_test"&gt;Wald-Wolfowitz&lt;/a&gt;, autocorrelation, &lt;a href="http://books.google.com/books?id=EIbxfCGfzgcC&amp;lpg=PA141&amp;ots=o-8ymmqbs9&amp;pg=PA142#v=onepage&amp;q=&amp;f=false"&gt;serial&lt;/a&gt;, and &lt;a href="http://books.google.com/books?id=EIbxfCGfzgcC&amp;lpg=PA141&amp;ots=o-8ymmqbs9&amp;pg=PA142#v=onepage&amp;q=&amp;f=false"&gt;gap&lt;/a&gt; tests. Each function (implementing one of those tests) returns a p-value, which indicates the level of support to reject the null hypothesis that the sequence is random, along with the chi-square or z-score value as appropriate. 
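&lt;br /&gt;&lt;br /&gt;For a feel of what such a test computes, here is a standalone sketch of the Wald-Wolfowitz runs test using the usual normal approximation. It is not the skidmarks code or its interface: the function name runs_test and the returned tuple are made up for this example, and it only handles sequences with exactly two distinct values:&lt;br /&gt;&lt;pre class="prettyprint"&gt;
import math
from itertools import groupby


def runs_test(seq):
    """Wald-Wolfowitz runs test; returns (n_runs, z, two_sided_p).

    Hypothetical helper for illustration, not the skidmarks API.
    """
    values = sorted(set(seq))
    if len(values) != 2:
        raise ValueError("need exactly two distinct values")
    n1 = sum(1 for x in seq if x == values[0])
    n2 = len(seq) - n1
    n = n1 + n2
    n_runs = sum(1 for _ in groupby(seq))     # each group is one run
    # mean and variance of the run count under the null of random order
    mu = 2.0 * n1 * n2 / n + 1
    var = 2.0 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    if var == 0:
        raise ValueError("sequence too short for the approximation")
    z = (n_runs - mu) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))      # two-sided normal p-value
    return n_runs, z, p


n_runs, z, p = runs_test("ABABABABAB")        # strictly alternating: too many runs
print(n_runs)
&lt;/pre&gt;&lt;br /&gt;A z-score well above zero means more runs than chance would give (alternation), and one well below zero means fewer (clumping). The normal approximation is rough for short sequences, so for real data the module itself is the thing to use.&lt;br /&gt;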
I've been using this and monte-carlo simulations to see if runs in genomic data could be explained by random events.&lt;br /&gt;&lt;br /&gt;Any contributions, suggestions, or bug reports are welcomed--the interface at bitbucket should make this easier to do, just fork and fix and pull-request.&lt;br /&gt;Meanwhile, my less documented/ more crappy code will continue to live on google code--at least until it matures. I've got a couple modules in the pipeline that will be added once they're cleaned up and documented.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-4080463775693667097?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/4080463775693667097/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=4080463775693667097' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/4080463775693667097'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/4080463775693667097'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2009/10/biostuff.html' title='biostuff'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-6741679369932173474</id><published>2009-07-26T14:33:00.001-07:00</published><updated>2009-08-03T21:51:47.631-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='haxe'/><category scheme='http://www.blogger.com/atom/ns#' term='flash'/><title 
type='text'>starting haxe. (stuff i want to remember)</title><content type='html'>I've been tinkering with a flash project recently. Actually &lt;a href="http://haxe.org"&gt;haxe&lt;/a&gt;, so it's only linux tools -- VI, and the command line -- not the GUI interface people normally associate with flash. This post is a summary of how to get started with haxe using only the command line, and a project containing flash stuff I want to remember.&lt;br /&gt;To start, here's &lt;a href="http://gist.github.com/140607"&gt;a gist&lt;/a&gt; of shell commands that will set up haxe on an ubuntu machine. The installers from the &lt;a href="http://haxe.org"&gt;haxe&lt;/a&gt; website work fine for windows and mac (and I think 32 bit linux).&lt;br /&gt;&lt;br /&gt;Haxe has a slightly different syntax from actionscript 3, but for most things it is identical. &lt;a href="http://haxe.org/api"&gt;These&lt;/a&gt; docs are very good, and better than the adobe site; I always have that page open when working with haxe. I also grabbed an "actionscript.vim" from the internet somewhere and put it in ~/.vim/syntax/ for syntax highlighting and added this line to my .vimrc:&lt;pre&gt;autocmd BufRead *.hx set filetype=actionscript&lt;/pre&gt;&lt;br /&gt;Then compilation and code is simply a matter of following &lt;a href="http://haxe.org/doc/flash/0_start"&gt;this&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;If you only look at haxe/flash occasionally, some stuff can be less than obvious. 
I've started a github &lt;a href="http://github.com/brentp/learnflash/tree/master"&gt;project&lt;/a&gt; as a testing ground where I can record the stuff I figure out.&lt;br /&gt;Currently it has examples of:&lt;br /&gt;&lt;br /&gt;+ call javascript from flash (ExternalInterface.call())&lt;br /&gt;+ call flash from javascript (ExternalInterface.addCallback())&lt;br /&gt;+ add an image from a (local) url&lt;br /&gt;+ keyboard events&lt;br /&gt;+ interact with the bitmap data of an image.&lt;br /&gt;+ style a text field.&lt;br /&gt;&lt;br /&gt;all in a single .hx file. Most of those are fairly simple; they just require the correct incantation. There's a version of this running on my work machine &lt;a href="http://bit.ly/1aylGX"&gt;here&lt;/a&gt; to demo this stuff. The possible interactions are mostly listed as instructions in either the HTML or the flash movie in that page.&lt;br /&gt;To build the flash(9) movie, just type 'haxe build.hxml' in the learnflash/ directory. Commenting out the '--no-traces' line in build.hxml will cause any trace() call to be sent to firebug--this is extremely useful for debugging.&lt;br /&gt;&lt;br /&gt;I'll probably add more as I stumble upon it.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-6741679369932173474?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://github.com/brentp/learnflash/tree/master' title='starting haxe. 
(stuff i want to remember)'/><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/6741679369932173474/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=6741679369932173474' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/6741679369932173474'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/6741679369932173474'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2009/07/starting-haxe-stuff-i-want-to-remember.html' title='starting haxe. (stuff i want to remember)'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-2467638030048689244</id><published>2009-05-23T11:37:00.001-07:00</published><updated>2009-05-23T12:43:01.497-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mmap'/><category scheme='http://www.blogger.com/atom/ns#' term='numpy'/><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><title type='text'>displaying and serving big(ish) data with numpy and memmap</title><content type='html'>In this case, "big" is only 8 million rows, but I have it working for 40+ million extremely fast and it will work for anything that can fit in memory (and larger with a bit of extra work). So the idea is to create a simple web "service" (where "service" is just the URL scheme I pull out of my .. uh genes ... ) implemented in a wsgi script. &lt;br /&gt;&lt;br /&gt;A web request containing a width=XXX in the url indicates that the user wants an image. 
So a URL like:&lt;br /&gt;http://128.32.8.28/salktome/?seqid=1&amp;xmin=491520&amp;xmax=499712&amp;&amp;width=512&lt;br /&gt;will give an image:&lt;br /&gt;&lt;img src="http://128.32.8.28/salktome/?seqid=1&amp;xmin=491520&amp;xmax=499712&amp;&amp;width=512" /&gt;&lt;br /&gt;where each horrible color represents a different tissue, and values extending up from the middle represent the top (+) or Watson (W) strand while those on the bottom represent the - or Crick strand. The heights are the levels of expression. (Geo-hackers will recognize that with a URL scheme like that, it'll be simple to put this into an openlayers slippy map.)&lt;br /&gt;Without the width=XXX param, the service returns JSON of values for the + and - strands for each basepair and tissue in the original file--which is less useful than having a summary of the mean value for each tissue over the range. Adding summary=t:&lt;br /&gt;http://128.32.8.28/salktome/?seqid=1&amp;xmin=491520&amp;xmax=499712&amp;summary=t&lt;br /&gt;gives:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;{"+": [0.008325556293129921, 0.0072462377138435841, 0.0074741491116583347, 0.0017609233036637306, 0.010895536281168461], "-": [0.0088664013892412186, 0.017081102356314659, 0.009236418642103672, 0.0043859495781362057, 0.021286465227603912]}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;which are the mean values from basepair 491520 to 499712 for the + and - strands and each of the 5 tissues. 
A single tissue can be requested by its (zero-based) number as:&lt;br /&gt;http://128.32.8.28/salktome/?seqid=1&amp;xmin=491520&amp;xmax=499712&amp;summary=t&amp;tissue=2&lt;br /&gt;which gives:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;{"+": 0.0074741492195734898, "-": 0.0092364182547917447}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Likewise, one can request an image with only a single tissue:&lt;br /&gt;http://128.32.8.28/salktome/?seqid=1&amp;xmin=491520&amp;xmax=499712&amp;&amp;width=512&amp;tissue=2&lt;br /&gt;&lt;img src="http://128.32.8.28/salktome/?seqid=1&amp;xmin=491520&amp;xmax=499712&amp;&amp;width=512&amp;tissue=2" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Transcriptome data often has enough rows of data that it's possible, but not really convenient, to stick it into an RDBMS. The data I'm using is available &lt;a href="http://natural.salk.edu/database/transcriptome/salktome/"&gt;here&lt;/a&gt; and looks like:&lt;pre&gt;&lt;br /&gt;26 C 125.05 106.18 119.80 159.78 101.40&lt;br /&gt;26 W 141.90 139.07 137.50 151.91 171.18&lt;br /&gt;51 C 131.85 108.57 71.00 156.41 133.79&lt;br /&gt;51 W 123.83 129.97 122.50 100.23 139.09&lt;br /&gt;76 C 136.84 104.98 120.30 88.88 151.64&lt;br /&gt;76 W 119.35 112.79 160.00 119.51 130.29&lt;br /&gt;126 C 138.93 100.92 111.00 84.82 103.48&lt;br /&gt;126 W 87.82 118.30 110.30 157.17 126.42&lt;br /&gt;151 C 184.50 71.24 157.80 396.25 86.42&lt;br /&gt;151 W 136.76 119.95 107.00 98.10 108.60&lt;br /&gt;176 C 135.75 86.98 115.80 188.41 93.53&lt;br /&gt;176 W 147.57 158.19 131.00 117.45 164.77&lt;br /&gt;[... millions more ...]&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;where the first column is basepair position, the second is either 'W' or 'C' for the Watson or Crick strand of the (double-stranded) DNA sequence. 
The third and later columns are measurements of transcription (see the &lt;a href="http://en.wikipedia.org/wiki/Transcriptomics"&gt;wikipedia&lt;/a&gt; article) for various tissues in the plant &lt;a href="http://en.wikipedia.org/wiki/Arabidopsis_thaliana"&gt;Arabidopsis thaliana&lt;/a&gt;, which we use because it has a small genome, it's well annotated and doesn't have too much repetitive DNA or too many transposons.&lt;br /&gt;&lt;br /&gt;The data is somewhat irregular in that not every 'C' row has a 'W' counterpart and sometimes vice versa; for those cases, I fill the missing row with the same values as the existing row. The code below runs through and creates numpy .npy files of the data:&lt;br /&gt;&lt;script src="http://gist.github.com/116731.js"&gt;&lt;/script&gt;&lt;br /&gt;It saves the positions and the actual transcriptome data into separate files because for the web service, the transcriptome files will be memory-mapped while the smaller position files will be read into memory. This code is run at startup (and so not done for every request):&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;import os&lt;br /&gt;import numpy as np&lt;br /&gt;from numpy.lib import format&lt;br /&gt;&lt;br /&gt;# path (the directory holding the .npy files) is set elsewhere in the script&lt;br /&gt;seqids, posns = {}, {}&lt;br /&gt;for ichr in range(1, 6): # the 5 A. thaliana chromosomes&lt;br /&gt;    schr = str(ichr)&lt;br /&gt;    # memmap these&lt;br /&gt;    seqids[schr] = format.open_memmap(os.path.join(path,&lt;br /&gt;                               'at.tome.data.%s.npy' % (schr,)), mode='r')&lt;br /&gt;    # read these into memory.&lt;br /&gt;    posns[schr] = np.load(os.path.join(path, 'at.tome.posn.%s.npy' % schr))&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;So the seqids dict has values that are the memmapped contents of the expression data in the .npy files. For each web request, it then uses numpy's searchsorted to do a binary search so that when a user requests &amp;xmin=1234&amp;xmax=4567 searchsorted() finds the position in the array where 1234 would fall (exactly like python's bisect module), and likewise for 4567. 
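&lt;br /&gt;&lt;br /&gt;As a rough sketch of that lookup (not the code from the wsgi script): the standard-library bisect module stands in for numpy's searchsorted(), and the positions and values lists are toy data copied from the first tissue of the 'C' rows in the sample above. The function name window is made up for illustration.&lt;br /&gt;
&lt;pre class="prettyprint"&gt;
```python
from bisect import bisect_left

# Toy stand-ins: sorted basepair positions and one expression value each.
positions = [26, 51, 76, 126, 151, 176]
values = [125.05, 131.85, 136.84, 138.93, 184.50, 135.75]


def window(xmin, xmax):
    """Return positions and values from xmin (inclusive) up to xmax (exclusive)."""
    lo = bisect_left(positions, xmin)  # index where xmin would be inserted
    hi = bisect_left(positions, xmax)  # index where xmax would be inserted
    return positions[lo:hi], values[lo:hi]


print(window(50, 130))  # prints ([51, 76, 126], [131.85, 136.84, 138.93])
```
&lt;/pre&gt;
bisect_left() matches the default side='left' behaviour of searchsorted(), so the same pair of indexes would slice a numpy array.&lt;br /&gt;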
The pair of array indexes from the searchsorted() calls with xmin and xmax give the indexes into the tome array for which to grab the data:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;    minidx = posns[seqid].searchsorted(xmin)&lt;br /&gt;    maxidx = posns[seqid].searchsorted(xmax)&lt;br /&gt;&lt;br /&gt;    data = seqids[seqid][minidx: maxidx]&lt;br /&gt;    data_idx = posns[seqid][minidx: maxidx]&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;where data contains the values of the transcriptome data and data_idx contains the associated base_pair positions. I could have saved the basepair positions in the seqids/data numpy array, but since searchsorted() will likely traverse many chunks of the array, I think it's best to have that part in memory rather than memory mapped.&lt;br /&gt;The rest of the code is a bunch of if statements (which I should really spread across multiple functions...) deciding whether to show an image with matplotlib or grab a summary of the data and return it via simplejson. The gist of the entire wsgi script is at the end of this post.&lt;br /&gt;An example of what this looks like in an openlayers map with gene annotations is in the image below. &lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_uU_kLC5AdTc/ShhPiMVXQCI/AAAAAAAAAjE/nGgtgO76SIU/s1600-h/gbsalk.png"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 400px; height: 118px;" src="http://2.bp.blogspot.com/_uU_kLC5AdTc/ShhPiMVXQCI/AAAAAAAAAjE/nGgtgO76SIU/s400/gbsalk.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5339104807092699170" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;where one can see some correlation between where the (green) genes are and the higher transcriptome levels. 
In summary, it's nice to be able to take data in this format and write a 120 line script which provides full, fast access both to the raw data and to images which we can use to find patterns to explore.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;script src="http://gist.github.com/116734.js"&gt;&lt;/script&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-2467638030048689244?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/2467638030048689244/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=2467638030048689244' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/2467638030048689244'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/2467638030048689244'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2009/05/displaying-and-serving-bigish-data-with.html' title='displaying and serving big(ish) data with numpy and memmap'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_uU_kLC5AdTc/ShhPiMVXQCI/AAAAAAAAAjE/nGgtgO76SIU/s72-c/gbsalk.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-1729797576743998908</id><published>2009-04-21T18:09:00.000-07:00</published><updated>2010-01-01T12:04:46.198-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bioinformatics'/><category 
scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='cython'/><title type='text'>Needleman-Wunsch global sequence alignment</title><content type='html'>[EDIT 01-01-2010] this is now available at &lt;a href="http://bitbucket.org/brentp/biostuff/src/"&gt;bitbucket&lt;/a&gt;.&lt;br /&gt;I've written a simple, fast, python version of Needleman-Wunsch as I couldn't find one to use. It uses Cython and specifically &lt;a href="http://docs.cython.org/docs/numpy_tutorial.html"&gt;cython-numpy&lt;/a&gt; goodness. It's easy-installable as:&lt;br /&gt;&lt;pre&gt; sudo easy_install -UZ http://bpbio.googlecode.com/svn/trunk/nwalign/&lt;/pre&gt;&lt;br /&gt;or via svn from:&lt;br /&gt;&lt;pre&gt;svn co http://bpbio.googlecode.com/svn/trunk/nwalign/&lt;/pre&gt;&lt;br /&gt;it will put an executable 'nwalign' into /usr/bin/ which when run will give this message:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;Usage: &lt;br /&gt;    nwalign [options] seq1 seq2 &lt;br /&gt;    &lt;br /&gt;&lt;br /&gt;Options:&lt;br /&gt;  -h, --help           show this help message and exit&lt;br /&gt;  --gap=GAP            gap extend penalty (must be integer &lt;= 0)&lt;br /&gt;  --gap_init=GAP_INIT  gap start penalty (must be integer &lt;= 0)&lt;br /&gt;  --match=MATCH        match score (must be integer &gt; 0)&lt;br /&gt;  --mismatch=MISMATCH  gap penalty (must be integer &lt; 0)&lt;br /&gt;  --matrix=MATRIX      scoring matrix in ncbi/data/ format,&lt;br /&gt;                       if not specificied, match/mismatch are used&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;where the matrix is optional but can be the full path specifying the transition score from row to column as &lt;a href="http://www.ncbi.nlm.nih.gov/Class/FieldGuide/BLOSUM62.txt"&gt;here&lt;/a&gt;.&lt;br /&gt;If the matrix is specified, match and mismatch are not used. If the matrix is not specified, match and mismatch are used. 
A typical run looks like:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;$ nwalign alphabet alpet&lt;br /&gt;alphabet&lt;br /&gt;alp---et&lt;br /&gt;&lt;br /&gt;$ nwalign --matrix /usr/share/ncbi/data/BLOSUM62 EEAEE EEEEG&lt;br /&gt;EEAEE-&lt;br /&gt;EE-EEG&lt;br /&gt;&lt;/pre&gt;And usage from a python script can be seen in test.py.&lt;br /&gt;&lt;br /&gt;I wrote this for a colleague who was using a perl script. This is over 100 times faster than the perl script for long sequences--and the perl script had the BLOSUM62 matrix hard-coded as a hash. There are a couple of places that could still be sped up, but I think the improvement will be relatively small. This kind of script is perfect for the numpy-cython mix as I can just access the contents of the 2d array at c-speed without having to do pointer arithmetic. If there are any obvious optimizations I missed, I'd be glad to know them.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-1729797576743998908?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/1729797576743998908/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=1729797576743998908' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/1729797576743998908'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/1729797576743998908'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2009/04/needleman-wunsch-global-sequence.html' title='Needleman-Wunsch global sequence alignment'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' 
height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-4975398008969135399</id><published>2009-04-16T21:32:00.000-07:00</published><updated>2009-04-17T07:39:14.329-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='cython'/><title type='text'>python object initialization speed</title><content type='html'>On the Cython mailing list, I saw &lt;a href="http://wiki.cython.org/FAQ#CanCythoncreateobjectsorapplyoperatorstolocallycreatedobjectsaspureCcode.3F"&gt;this&lt;/a&gt; mentioned for avoiding init overhead, so I wrote up &lt;a href="http://gist.github.com/95916"&gt;some code&lt;/a&gt; to try it. Basically, instead of using an __init__, it uses the PY_NEW macro (which I don't pretend to understand fully). &lt;br /&gt;I ran a benchmark with 6 cases: &lt;ol&gt;&lt;br /&gt;&lt;li&gt;PY_NEW macro (still has python overhead for each call to the creator function)&lt;/li&gt;&lt;br /&gt;&lt;li&gt;regular python init&lt;/li&gt;&lt;br /&gt;&lt;li&gt;python init using __slots__&lt;/li&gt;&lt;br /&gt;&lt;li&gt;cython init (cdef'ed class)&lt;/li&gt;&lt;br /&gt;&lt;li&gt;batch PY_NEW: calling PY_NEW from inside cython to avoid python call overhead&lt;/li&gt;&lt;br /&gt;&lt;li&gt;batch init on cython class&lt;/li&gt;&lt;br /&gt;&lt;/ol&gt;&lt;br /&gt;The timings look like this:&lt;pre&gt;&lt;br /&gt;PY_NEW on Cython class: 1.160&lt;br /&gt;__init__ on Python class: 30.414&lt;br /&gt;__init__ on Python class with slots: 10.242&lt;br /&gt;__init__ on Cython class 1.185&lt;br /&gt;batch PY_NEW total: 0.855 , interval only: 0.383&lt;br /&gt;batch __init__ on Cython class total 0.998 , interval_only: 0.540&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;So, in the batch runs, PY_NEW takes 0.383 compared to 0.540 for using an __init__ on a Cython class, but both are much faster than python. 
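&lt;br /&gt;&lt;br /&gt;For reference, the two pure-Python cases (regular __init__ and __init__ with __slots__) can be sketched as below. This is not the code from the gist: the Plain and Slotted classes and the time_init helper are made-up names, and the absolute timings (and the size of the gap between them) will depend on the Python version.&lt;br /&gt;
&lt;pre&gt;
```python
import timeit


class Plain(object):
    """Regular Python class: every instance carries its own __dict__."""

    def __init__(self, start, end):
        self.start = start
        self.end = end


class Slotted(object):
    """Same fields, but __slots__ removes the per-instance __dict__."""

    __slots__ = ("start", "end")

    def __init__(self, start, end):
        self.start = start
        self.end = end


def time_init(cls, number=100000):
    """Seconds taken to create `number` instances of cls."""
    return timeit.timeit(lambda: cls(1, 2), number=number)


if __name__ == "__main__":
    for cls in (Plain, Slotted):
        print(cls.__name__, round(time_init(cls), 3))
```
&lt;/pre&gt;
&lt;br /&gt;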
I was surprised that using slots gives a 3x speed improvement over a regular python class. That Cython is faster is no surprise. &lt;br /&gt;Stefan Behnel &lt;a href="http://codespeak.net/pipermail/cython-dev/2009-April/004873.html"&gt;explains&lt;/a&gt; better than I could.&lt;br /&gt;&lt;br /&gt;All the code is smashed uncomfortably into &lt;a href="http://gist.github.com/95916"&gt;this gist.&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;for kicks, i tried with &lt;a href="http://code.google.com/p/unladen-swallow/"&gt;unladen-swallow&lt;/a&gt;. It comes out almost 2x faster on the python times both with and without slots. I didn't use the optimization stuff. Cython even works with unladen-swallow--just have to rebuild the .so--and the timings are the same as Cython-with-CPython.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-4975398008969135399?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/4975398008969135399/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=4975398008969135399' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/4975398008969135399'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/4975398008969135399'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2009/04/python-object-initialization-speed.html' title='python object initialization speed'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-1106620797663871780</id><published>2009-04-12T18:34:00.000-07:00</published><updated>2009-04-12T19:19:33.499-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='wsgi'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='apache'/><title type='text'>apache mpm-worker with php on low memory servers</title><content type='html'>I'm partly writing this because I think such valuable info is too hard to find. I'd read a lot that if you want php, you have to use mpm-prefork, but it's not true!&lt;br /&gt;&lt;br /&gt;I've been using &lt;a href="http://slicehost.com/"&gt;slicehost&lt;/a&gt; for a dev server for almost a year now. Since my 1 year deal is up, I decided to switch to their affiliate &lt;a href="http://mosso.com"&gt;mosso&lt;/a&gt;. I really like slicehost, but Mosso &lt;a href="http://www.mosso.com/cloudservers.jsp"&gt;"cloud servers"&lt;/a&gt; seem a good fit for a server that goes through spurts of development and use followed by weeks of non-use. So now, I can keep it as a 256MB instance at about $10/month and update to a larger instance when doing real dev. &lt;br /&gt;I built it today as a 1024MB instance -- installed all my usual stuff, and updated my build script for ubuntu. That's &lt;a href="http://code.google.com/p/bpgeo/source/browse/trunk/ubuntu_build/install_all.sh"&gt;here&lt;/a&gt;. &lt;br /&gt;The machine I'm on is extremely fast; normally I set GDAL building and leave, but it finished before I had a chance. After all was built, I resized it to a 256MB server -- that took 12 minutes, but my instance was accessible for at least 10 of those.&lt;br /&gt;After that, I logged right back in and all worked fine. On this dev server, I have to run PHP (at least I don't have to code in it). 
I've been getting tired of having to use apache's mpm-prefork because ubuntu won't let me have php and mpm-worker. (For more from someone who actually knows what they are talking about, Graham Dumpleton, the author of mod_wsgi, has a good write-up &lt;a href="http://blog.dscpl.com.au/2009/03/load-spikes-and-excessive-memory-usage.html"&gt;here&lt;/a&gt;.) So, every apache process takes up a good chunk of the available memory and things are sloooow. Even just running trac, it starts swapping. &lt;br /&gt;I found &lt;a href="http://ivan.gudangbaca.com/installing_apache2_and_php5_using_mod_fcgid"&gt;a good post for using fcgi for php&lt;/a&gt; and followed it blindly and all works!&lt;br /&gt;Using worker_mpm, Trac (via mod_wsgi) is so fast I double-checked to make sure my instance was still at 256MB. After setting ThreadStackSize to 1.5MB, trac takes up only 16% of the available memory. The relevant bits of my apache config look like:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;&amp;lt;IfModule mpm_worker_module&amp;gt;&lt;br /&gt;    StartServers          2&lt;br /&gt;    MaxClients          40&lt;br /&gt;    MinSpareThreads      5&lt;br /&gt;    MaxSpareThreads      20&lt;br /&gt;    ThreadsPerChild      20&lt;br /&gt;    MaxRequestsPerChild   5000&lt;br /&gt;&amp;lt;/IfModule&amp;gt;&lt;br /&gt;&lt;br /&gt;MaxKeepAliveRequests 1000&lt;br /&gt;Timeout 30&lt;br /&gt;&lt;br /&gt;# i keep this fairly high for serving image tiles&lt;br /&gt;# from mapserver&lt;br /&gt;KeepAliveTimeout 40&lt;br /&gt;&lt;br /&gt;# default is 8MB per thread, only use 1.5MB&lt;br /&gt;ThreadStackSize 1500000&lt;br /&gt;&lt;br /&gt;AddHandler fcgid-script .php&lt;br /&gt;FCGIWrapper /usr/lib/cgi-bin/php5 .php&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;And then, in 000-default, I added +ExecCGI to the DocumentRoot Directory Options. 
&lt;br /&gt;If any of that config is not sane, let me know.&lt;br /&gt;&lt;br /&gt;I should say that I have no idea how PHP will do under fast-cgi, if it uses more memory, or not, or if it will have any problems, on quick look, everything seems ok.&lt;br /&gt;&lt;br /&gt;Apparently, you can also use mpm-worker with php if you build php from source. Since using fcgi worked so easily, I didn't try that.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-1106620797663871780?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/1106620797663871780/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=1106620797663871780' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/1106620797663871780'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/1106620797663871780'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2009/04/apache-mpm-worker-with-php-on-low.html' title='apache mpm-worker with php on low memory servers'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-8784869318009085008</id><published>2009-02-20T09:51:00.000-08:00</published><updated>2009-02-20T10:19:26.230-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bioinformatics'/><category scheme='http://www.blogger.com/atom/ns#' term='biohash'/><category scheme='http://www.blogger.com/atom/ns#' 
term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><title type='text'>biohash</title><content type='html'>This is a quick project I did a while back, but I've seen &lt;a href="http://www.logarithmic.net/pfh/blog/01234937824"&gt;people interested&lt;/a&gt; in similar ideas, so I'll post my implementation here.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://en.wikipedia.org/wiki/Geohash"&gt;Geohash&lt;/a&gt; encodes latitude/longitude locations into a string such that "nearby places will often present similar prefixes" and the longer the string, the greater the precision. Using &lt;a href="http://mappinghacks.com/code/geohash.py.txt"&gt;this python implementation&lt;/a&gt; by Schuyler as a reference, I ported the concept to a "biohash" which can encode intervals. It works in a similar fashion, starting with the extremes and halving the space until it finds the smallest space that contains the interval.&lt;br /&gt;&lt;br /&gt;The use is to allow efficient search of intervals using a BTree index, as in any relational db. It's implemented with only a dumps() and a loads() function, modeled after the pickle interface. The dumps function takes start and end args and returns a 1/0 encoded string. The loads function takes a 1/0 encoded string and returns the tightest interval it can given that string. Both functions take a rng kwarg, which can be as small as the maximum end value. If all the intervals are small, and the rng is very large, the biohash will not help much. The rng used to load must be the same as the one used to dump or the values won't be correct.&lt;br /&gt;&lt;br /&gt;I had plans to finish up a set of SQLAlchemy models for this that would save the hash and use it to do range queries behind the scenes, but haven't finished that up yet. The code is in &lt;a href="http://code.google.com/p/bpbio/source/browse/trunk/ihash/ihash/__init__.py"&gt;my google SVN&lt;/a&gt;. 
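&lt;br /&gt;&lt;br /&gt;A minimal sketch of the halving idea, not the ihash code itself: it assumes integer coordinates that fit in the given width, a closed [start, end] interval, and a range of 2**bits, where bits is a hypothetical stand-in for the rng kwarg. Under those assumptions, halving the space until the interval no longer fits in one half is the same as taking the common prefix of the two endpoints written in binary.&lt;br /&gt;
&lt;pre&gt;
```python
from os.path import commonprefix


def dumps(start, end, bits=8):
    """Encode the closed interval [start, end] as a string of 1s and 0s."""
    width = "0{}b".format(bits)  # zero-padded binary, e.g. "08b"
    # commonprefix compares character by character, so it works on any strings.
    return commonprefix([format(start, width), format(end, width)])


def loads(code, bits=8):
    """Return the tightest (start, end) interval the code can describe."""
    pad = bits - len(code)
    start = int(code + "0" * pad, 2)  # fill the unknown bits with 0s
    end = int(code + "1" * pad, 2)  # fill the unknown bits with 1s
    return start, end


code = dumps(40, 47)
print(code, loads(code))  # prints 00101 (40, 47)
```
&lt;/pre&gt;
An interval nested inside another one gets a code that starts with the outer interval's code, which is the property a prefix search on a BTree index can exploit. As with rng in the post, the same bits value has to be used for dumping and loading.&lt;br /&gt;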
The ihash package will even pull in a fast Cython version of the encoder.&lt;br /&gt;If anyone wants to use it, improve it or get it going with SQLAlchemy, it is available from SVN with:&lt;br /&gt;svn checkout https://bpbio.googlecode.com/svn/trunk/ihash/&lt;br /&gt;It has tests in __init__.py.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-8784869318009085008?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/8784869318009085008/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=8784869318009085008' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/8784869318009085008'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/8784869318009085008'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2009/02/biohash.html' title='biohash'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-8460670704159332051</id><published>2009-01-09T08:00:00.000-08:00</published><updated>2009-01-09T08:03:47.996-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bioinformatics'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><title type='text'>Stupid things I did as a Bioinformatics Programmer in 2008</title><content type='html'>In 2008, I was good enough at programming to get my ass kicked by hard 
problems. I think that's the most positive way to say it.&lt;br /&gt;My main bioinformatics project was a long annotation pipeline. It takes days to run, often using 8 CPUs. It's driven by a big ol' Makefile. I made the mistake of passing data between steps in unstructured text files or python pickles. &lt;br /&gt;&lt;br /&gt;I'd create one at the beginning of the pipeline and not notice it was messed up in a way that affected other parts until the entire pipeline was done, days later. Toward the end of the pipeline, I'd need something simple, like the strand of a BLAST hit, but I'd have to parse through an entire GFF file, or load some huge pickle into memory just to get to that. Then I'd need some annotation, and I'd have to add a slow step of doing a lookup in a script that'd otherwise run very quickly. &lt;br /&gt;&lt;br /&gt;I was passing around data in arrays and tuples, so then when I changed the order or added another element in the script that was creating the tuple, a downstream script that was using the tuple would be using the wrong index to access data. If I was lucky, my code would fail; if not, it'd be using the strand when what it should have had was the start location.&lt;br /&gt;&lt;br /&gt;I hit problems where I'd run out of memory. At one point, I ran out of disk space (it's a big series of datasets). I also hit bugs in software I was using.&lt;br /&gt;&lt;br /&gt;When I should have run just 1 chromosome to test the pipeline in 1/100th of the time (the full run compares 10 chromosomes to 10 chromosomes), I ran it over an entire genome.&lt;br /&gt;&lt;br /&gt;When I should have taken the time to really fix small mistakes as I found them, I instead worked around them, making the code unnecessarily complex as a result. If I had fixed instead of fudged in those cases, I would have been more productive.&lt;br /&gt;&lt;br /&gt;I did write tests, but not enough, and I didn't set up the project in a way that it was really testable. 
I'm still learning how to do that. All the other stuff may just be discipline, but the testing is very difficult for me in bioinformatics. I've extracted what I could into tested libraries and added checks for the intermediate data, and every time I found a dumb error, I'd add an assert. (There's a discussion of pretty much exactly the problems I'm describing here: http://ivory.idyll.org/blog/sep-08/the-future-of-bioinformatics-part-1b.html )&lt;br /&gt;&lt;br /&gt;I'd assume that it was ok to use a tuple, because I was only going to store start and stop, but then later I'd need to add chromosome, then strand, then score, and pretty soon I'd have code elsewhere like:&lt;br /&gt;&lt;pre&gt;if cns[2][4] &lt; cns[2][5]: &lt;br /&gt;     ...&lt;/pre&gt;&lt;br /&gt;where I have no idea what I'm comparing. That one's simple to fix: just use objects, or dicts at the very least. But a 2-tuple is so tempting, and a 3-tuple not so bad, and ...&lt;br /&gt;&lt;br /&gt;On top of all this, the project was changing as I was working on it, so I was changing what went into the start of the pipeline, what we were getting out, and the steps I was doing. And because of all the extra little hacks in there, I would be stuck in some function far away from the data that I needed (*).&lt;br /&gt;&lt;br /&gt;It sucked. I can get away with those sorts of mistakes on a project that runs in 5 seconds and has a very simple and relatively well-known output, but not on complex pipelines.&lt;br /&gt;&lt;br /&gt;It's been painful watching myself do such stupid stuff, and then reading the code afterward, but I think I've learned a lot. Much of that code still sucks, but I've moved to more rigid data structures. Every time I make a change now, I do it for real; it's not just hacked in. I have to deal with someone pacing around, waiting for the results and asking me questions like "why is it so hard to just add in the RNA?" or whatever. Also, statistically speaking, I'm
Also, statistically speaking, I'm&lt;br /&gt;probably running out of mistakes I can make...&lt;br /&gt;&lt;br /&gt;And actually, the most difficult things have been inter-personal relationships, but that's for another post--as are the more positive, awesome things I did with projects I set up correctly, and had good test coverage...&lt;br /&gt;And finally, I did make a lot of mistakes, but this has been genuinely a difficult problem. &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;* see &lt;a href="http://blog.objectmentor.com/articles/2008/12/15/glory-and-success-are-not-a-destination-they-are-a-point-of-origin"&gt;this post&lt;/a&gt;, especially the last 4 paragraphs. It's good to hear that someone with 43 years of programming language has the same problems as I do.&lt;br /&gt;&lt;/blockquote&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-8460670704159332051?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/8460670704159332051/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=8460670704159332051' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/8460670704159332051'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/8460670704159332051'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2009/01/stupid-things-i-did-as-bioinformatics.html' title='Stupid things I did as a Bioinformatics Programmer in 2008'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-1131122869000107946</id><published>2008-12-17T19:31:00.000-08:00</published><updated>2008-12-17T20:26:33.028-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gis'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='numpy'/><title type='text'>landsummary.com</title><content type='html'>One of the things I like the least about my real job, and much of the contract work that I do is that i'm usually the only programmer working on each task. So, it's been very fun to work on a project with Josh Livni (&lt;a href="http://porcupinealley.com/entries/2008/dec/16/land-summary-beta/"&gt;His writeup&lt;/a&gt;). We got together one afternoon, and by the time we left, we had a reasonable start of what we call &lt;a href="http://landsummary.com"&gt;landsummary&lt;/a&gt;, we've since put in a fair bit of work sprinkled here and there. Josh set up an AWS server--it's nice for me to have fewer sys-admin duties too. &lt;br /&gt;What's actually on display is fairly modest. What it does is takes a user-drawn square, circle, or arbitrary polygon, and uses that to summarize the &lt;a href="http://www.epa.gov/mrlc/nlcd-2006.html"&gt;NLCD&lt;/a&gt; dataset along with some census data. The things that make it more than lame are that it's very fast, it can easily be extended to summarize any raster dataset, and we have a sorta cool API (not documented) which allows us or anyone to query the data with a WKT Polygon and request a particular service -- currently nlcd, population, weather.precip, and a couple of services with environmental engineering application. All of them use the same libraries--postGIS for doing the census related stuff, and GDAL, gdal_array for doing the raster (currently just NLCD) queries. 
Josh handled all the census data; I know very little about that, except that whenever I've tried previously, it's been a pain to work with, and now Josh has a nice setup for it. I took the lead on the raster summaries. For that, I wrote a little library that wraps gdal_array, so you can take a GDAL datasource:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;&gt;&gt;&gt; g = AGoodle("something.tif")&lt;br /&gt;&gt;&gt;&gt; a = g.read_array_bbox([xmin, ymin, xmax, ymax])&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;And then 'a' is a numpy array with all the niceties that entails. So, if we want to get just the food cells, which have values of 81 and 82 in the NLCD dataset, it's just:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;&gt;&gt;&gt; a[(a == 81) | (a == 82)]&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;For arbitrary polygons, we use a surprisingly fast function from matplotlib to mask anything that's not inside a list of vertices (the polygon). Then we do any summary stats on the masked array. &lt;br /&gt;&lt;br /&gt;Thanks to Josh, we have a fairly nice django project structure, with separate apps for each little analysis we've added. In my previous django projects, I've dumped everything into a single app and hacked away; the structure we have now makes it easier to keep what's needed in my brain. Also, when hacking with someone, I'm less likely to put in total crap code. Josh has already had a good laugh at some code where I found the 25 closest weather stations using:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;SELECT * FROM stations ORDER BY ABS(lat - ?) + ABS(lon - ?) LIMIT 25&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;then sorted those 25 using geopy.distance to make sure it was the real distance. In my defence, I really wanted to use vanilla sqlite and so didn't have postGIS at my disposal--also, it was quite fast for only 6,000 stations. We've since dumped it all into postGIS. 
There are probably a couple of other gems in there.&lt;br /&gt;&lt;br /&gt;So, back to the modest functionality part... Actually, it turns out, this is a fairly difficult thing to do in McClick software--time consuming in both user and processor time. So, having a way to click a point and see land-use stats and population data appear in about a second is pretty cool--and it's on the web. We've already found a couple of folks with interesting applications, and we're interested in finding more--the original motivation was 'foodmiles', from this &lt;a href="http://contours-coregis.blogspot.com/2008/07/sustainable-ballard-local-food.html"&gt;post&lt;/a&gt;. And there are a couple of things we'll probably add in from that. People I happened to overhear in a cafe today were talking about foodmiles and seemed interested in incorporating the carbon footprint of exporting / importing food vs. growing locally. My friend Megan also has lots of other ideas for things that firms commonly do McClick-style with the NLCD data. 
&lt;br /&gt;There's more info on the &lt;a href="http://landsummary.com/about/"&gt;about&lt;/a&gt; page, but suffice to say we make full use of all the usual open-source GIS, science tools.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-1131122869000107946?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://landsummary.com/' title='landsummary.com'/><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/1131122869000107946/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=1131122869000107946' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/1131122869000107946'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/1131122869000107946'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/12/landsummarycom.html' title='landsummary.com'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-3236816013215762418</id><published>2008-11-03T18:46:00.000-08:00</published><updated>2008-11-04T13:37:56.781-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='tree'/><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><title type='text'>python interval tree</title><content type='html'>EDIT: added a couple points inline.&lt;br /&gt;&lt;br /&gt;I'm obsessed with trees lately -- of the CS variety, not the 
plant variety. Although we are studying poplar, so I'll be using &lt;a href="http://en.wikipedia.org/wiki/Interval_tree"&gt;trees&lt;/a&gt; to study &lt;a href="http://genome.jgi-psf.org/Poptr1/Poptr1.home.html"&gt;trees&lt;/a&gt;. &lt;br /&gt;I'd tried a couple of times to implement an interval tree from scratch following the &lt;a href="http://en.wikipedia.org/wiki/Interval_tree"&gt;Wikipedia entry&lt;/a&gt;.&lt;br /&gt;Today I more or less did that in Python. It's the simplest possible form. There's no insertion (though that's trivial to add); it just takes a list of 'things' with start and stop attributes and creates a tree with a .find() method.&lt;br /&gt;The part of the Wikipedia entry that baffled me was about &lt;a href="http://en.wikipedia.org/wiki/Interval_tree#Construction"&gt;storing 2 copies of each node's intervals&lt;/a&gt;--one sorted by start and the other by stop. I didn't do that as I think in most cases it won't improve lookup time. I figure if you have 1 million elements and a tree of depth 16, then you have on average 15 intervals per node (actually fewer since there are the non-leaf nodes). So I just brute-force the intervals in each of those nodes and move to the next. I think that increases the worst-case time, but makes no difference in actual search time--with the benefit of halving storage.&lt;br /&gt;&lt;br /&gt;EDIT: now the &lt;a href="http://code.google.com/p/bpbio/source/browse/trunk/interval_tree/interval_tree.py"&gt;version&lt;/a&gt; in my repo keeps the intervals sorted by start, so it can avoid doing the brute-force search at a node when search.stop &lt; node.intervals[0].start. This did improve performance.&lt;br /&gt;&lt;br /&gt;The tree class takes a list of intervals and calculates a center point. From there it partitions them into left, overlapping, and right in terms of their relation to the center point. 
Overlapping intervals are assigned to the current node, and left and right are recursively partitioned in that fashion until there are fewer than `minbucket` intervals in a node, or the specified `depth` has been reached, AND there are fewer intervals than `maxbucket`. So a tree can have a greater `depth` than requested if it would otherwise have more than `maxbucket` intervals in a single node. The Wikipedia version doesn't have maxbucket or minbucket...&lt;br /&gt;&lt;br /&gt;EDIT: the maxbucket actually only works on leaf-nodes, and has no effect otherwise.&lt;br /&gt;&lt;br /&gt;I'm sure that's painfully obvious for anyone who's ever taken a CS course, but it was foggy at best for me until I implemented it. Below is the entire implementation:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;class IntervalTree(object):&lt;br /&gt;    __slots__ = ('intervals', 'left', 'right', 'center')&lt;br /&gt;&lt;br /&gt;    def __init__(self, intervals, depth=16, minbucket=96, _extent=None, maxbucket=4096):&lt;br /&gt;   &lt;br /&gt;        depth -= 1&lt;br /&gt;        if (depth == 0 or len(intervals) &lt; minbucket) and len(intervals) &lt; maxbucket:&lt;br /&gt;            self.intervals = intervals&lt;br /&gt;            self.left = self.right = None&lt;br /&gt;            return &lt;br /&gt;&lt;br /&gt;        left, right = _extent or \&lt;br /&gt;               (min(i.start for i in intervals), max(i.stop for i in intervals))&lt;br /&gt;        center = (left + right) / 2.0&lt;br /&gt;&lt;br /&gt;        &lt;br /&gt;        self.intervals = []&lt;br /&gt;        lefts, rights  = [], []&lt;br /&gt;        &lt;br /&gt;&lt;br /&gt;        for interval in intervals:&lt;br /&gt;            if interval.stop &lt; center:&lt;br /&gt;                lefts.append(interval)&lt;br /&gt;            elif interval.start &gt; center:&lt;br /&gt;                rights.append(interval)&lt;br /&gt;            else: # overlapping.&lt;br /&gt;                self.intervals.append(interval)&lt;br /&gt;     
           &lt;br /&gt;        self.left   = lefts  and IntervalTree(lefts,  depth, minbucket, (left,  center), maxbucket) or None&lt;br /&gt;        self.right  = rights and IntervalTree(rights, depth, minbucket, (center, right), maxbucket) or None&lt;br /&gt;        self.center = center&lt;br /&gt; &lt;br /&gt; &lt;br /&gt;    def find(self, start, stop):&lt;br /&gt;        """find all elements between (or overlapping) start and stop"""&lt;br /&gt;        overlapping = [i for i in self.intervals if i.stop &gt;= start &lt;br /&gt;                                              and i.start &lt;= stop]&lt;br /&gt;&lt;br /&gt;        if self.left and start &lt;= self.center:&lt;br /&gt;            overlapping += self.left.find(start, stop)&lt;br /&gt;&lt;br /&gt;        if self.right and stop &gt;= self.center:&lt;br /&gt;            overlapping += self.right.find(start, stop)&lt;br /&gt;&lt;br /&gt;        return overlapping&lt;/pre&gt;&lt;br /&gt;Only 45 lines of code. I had added a couple of extra attributes so that searching could do fewer checks, but it only improved performance by ~15% and I liked the simplicity. One way to improve the search speed, and the distribution on skewed data, would be to sort the intervals at the top node, so they'd then be sorted for all other nodes. Then instead of using center = (left + right)/2, it could use the center point of the center interval at each node. That would also allow short-circuiting the brute-force search at the top of the find method with something like:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;if not (start &gt; self.intervals[-1].stop or stop &lt; self.intervals[0].start):&lt;br /&gt;    overlapping = [ .. list comprehension ]&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;But all told, that adds 5 or so lines of code. Oh, and depending on how it's used, it's between 15 and 25 times faster than brute-force search. 
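&lt;br /&gt;&lt;br /&gt;To make the sorted-intervals idea concrete, here is a minimal sketch of the scan inside a single node. It is an illustration, not the code in my repo: it assumes the node's intervals are kept sorted by start, and Interval and overlaps_sorted are names made up for the example. It only uses the cutoff on start, because sorting by start says nothing about where a long interval stops:&lt;br /&gt;&lt;pre class="prettyprint"&gt;
from bisect import bisect_right
from collections import namedtuple

# Hypothetical stand-in for the 'things' with start and stop attributes.
Interval = namedtuple("Interval", "start stop")


def overlaps_sorted(intervals, starts, start, stop):
    """Overlaps of [start, stop] among intervals sorted by start.

    starts must be [i.start for i in intervals], built once per node.
    """
    # Intervals at index hi and beyond all begin after the query ends,
    # so they cannot overlap and are never looked at.
    hi = bisect_right(starts, stop)
    return [i for i in intervals[:hi] if i.stop &gt;= start]


node = sorted([Interval(30, 40), Interval(1, 5), Interval(10, 12), Interval(3, 20)])
starts = [i.start for i in node]
print(overlaps_sorted(node, starts, 6, 11))
&lt;/pre&gt;&lt;br /&gt;The stop check still has to run on every candidate that survives the cutoff, so the saving depends on how many intervals in the node begin after the query ends.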
&lt;br /&gt;&lt;br /&gt;EDIT: I added the above check, but it can only do the 2nd comparison, "stop &lt; self.intervals[0].start", as the first is invalid given a very long interval. Regarding speed, the smaller the search window, the better the performance improvement. The code is now &gt; 20 times as fast as brute force for a very large (speaking in terms of looking for genomic features) search window of 100K. With a search window of 50K, it's 50+ times as fast as linear search. &lt;br /&gt;&lt;br /&gt;The full code (including a docstring with a Homer Simpson quote) is in my &lt;a href="http://code.google.com/p/bpbio/source/browse/trunk/interval_tree/interval_tree.py"&gt;google code repo&lt;/a&gt;. If I've made obvious mistakes or you have improvements, I'd be glad to know them.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-3236816013215762418?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/3236816013215762418/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=3236816013215762418' title='12 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/3236816013215762418'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/3236816013215762418'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/11/python-interval-tree.html' title='python interval tree'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>12</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-4350390545788400188</id><published>2008-10-25T17:09:00.000-07:00</published><updated>2008-10-25T17:19:49.857-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='twill'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='testing'/><title type='text'>twill with XHTML (not viewing HTML)</title><content type='html'>Since I couldn't find this anywhere, I'll add it here for those who have the same problem:&lt;br /&gt;&lt;br /&gt;I was trying to test a website with &lt;a href="http://darcs.idyll.org/~t/projects/twill/doc/"&gt;twill&lt;/a&gt; and got this at the end of my traceback:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;    raise BrowserStateError("not viewing HTML")&lt;br /&gt;BrowserStateError: not viewing HTML&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;After spending a bunch of time making sure that, yes, it was spitting out HTML, I figured out that it specifically means that twill (actually mechanize) doesnt like &lt;b&gt;X&lt;/b&gt;HTML.&lt;br /&gt;&lt;br /&gt;You can likely fix it by adding this at the top of the script:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;b = twill.get_browser()&lt;br /&gt;b._browser._factory.is_html = True&lt;br /&gt;twill.browser = b&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Presumably, there's a real reason that check is in place, but works-4-me...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-4350390545788400188?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/4350390545788400188/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=4350390545788400188' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/4350390545788400188'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/4350390545788400188'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/10/twill-with-xhtml-not-viewing-html.html' title='twill with XHTML (not viewing HTML)'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-735018163515191434</id><published>2008-10-24T16:47:00.000-07:00</published><updated>2009-12-20T12:38:59.054-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='appengine'/><title type='text'>appengine memcache memoize decorator</title><content type='html'>[NOTE: see the 2nd comment below about using a tuple as a key. better to just use pickle.dumps]&lt;br /&gt;I've been playing with google appengine lately. I'm working on a fun, pointless side project. Here's what I came up with for a cache decorator that pulls from memcache based on the args, kwargs and function name if no explicit key is given. The code for creating a key from those is from the recipe linked in the docstring. 
&lt;br /&gt;&lt;pre class="prettyprint lang-py"&gt;"""&lt;br /&gt;a decorator to use memcache on google appengine.&lt;br /&gt;optional arguments:&lt;br /&gt;  `key`: the key to use for the memcache store&lt;br /&gt;  `time`: the time to expiry sent to memcache&lt;br /&gt;&lt;br /&gt;if no key is given, the function name, args, and kwargs are&lt;br /&gt;used to create a unique key so that the same function can return&lt;br /&gt;different results when called with different arguments (as&lt;br /&gt;expected).&lt;br /&gt;&lt;br /&gt;usage:&lt;br /&gt;NOTE: actual usage is simpler as:&lt;br /&gt;@gaecache()&lt;br /&gt;def some_function():&lt;br /&gt;...&lt;br /&gt;&lt;br /&gt;but doctest doesnt seem to like that.&lt;br /&gt;&lt;br /&gt;    &gt;&gt;&gt; import time&lt;br /&gt;&lt;br /&gt;    &gt;&gt;&gt; def slow_fn():&lt;br /&gt;    ...    time.sleep(1.1)&lt;br /&gt;    ...    return 2 * 2&lt;br /&gt;    &gt;&gt;&gt; slow_fn = gaecache()(slow_fn)&lt;br /&gt;&lt;br /&gt;this run take over a second.&lt;br /&gt;    &gt;&gt;&gt; t = time.time()&lt;br /&gt;    &gt;&gt;&gt; slow_fn(), time.time() - t &gt; 1&lt;br /&gt;    (4, True)&lt;br /&gt;&lt;br /&gt;this grab from cache in under .01 seconds&lt;br /&gt;    &gt;&gt;&gt; t = time.time()&lt;br /&gt;    &gt;&gt;&gt; slow_fn(), time.time() - t &lt; .01&lt;br /&gt;    (4, True)&lt;br /&gt;&lt;br /&gt;modified from&lt;br /&gt;http://code.activestate.com/recipes/466320/&lt;br /&gt;and&lt;br /&gt;http://code.activestate.com/recipes/325905/&lt;br /&gt;"""&lt;br /&gt;&lt;br /&gt;from google.appengine.api import memcache&lt;br /&gt;import logging&lt;br /&gt;import pickle&lt;br /&gt;&lt;br /&gt;class gaecache(object):&lt;br /&gt;    """&lt;br /&gt;    memoize decorator to use memcache with a timeout and an optional key.&lt;br /&gt;    if no key is given, the func_name, args, kwargs are used to create a key. 
&lt;br /&gt;    """&lt;br /&gt;    def __init__(self, time=3600, key=None):&lt;br /&gt;        self.time = time&lt;br /&gt;        self.key  = key&lt;br /&gt;&lt;br /&gt;    def __call__(self, f):&lt;br /&gt;        def func(*args, **kwargs):&lt;br /&gt;            if self.key is None:&lt;br /&gt;                t = (f.func_name, args, kwargs.items())&lt;br /&gt;                try:&lt;br /&gt;                    hash(t)&lt;br /&gt;                    key = t&lt;br /&gt;                except TypeError:&lt;br /&gt;                    try:&lt;br /&gt;                        key = pickle.dumps(t)&lt;br /&gt;                    except pickle.PicklingError:&lt;br /&gt;                        logging.warn("cache FAIL:%s, %s", args, kwargs)&lt;br /&gt;                        return f(*args, **kwargs)&lt;br /&gt;            else:&lt;br /&gt;                key = self.key&lt;br /&gt;&lt;br /&gt;            data = memcache.get(key) &lt;br /&gt;            if data is not None: &lt;br /&gt;                logging.info("cache HIT: key:%s, args:%s, kwargs:%s", key, args, kwargs)&lt;br /&gt;                return data&lt;br /&gt;&lt;br /&gt;            logging.warn("cache MISS: key:%s, args:%s, kwargs:%s", key, args, kwargs)&lt;br /&gt;            data = f(*args, **kwargs)&lt;br /&gt;            memcache.set(key, data, self.time) &lt;br /&gt;            return data&lt;br /&gt;&lt;br /&gt;        func.func_name = f.func_name&lt;br /&gt;        return func&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-735018163515191434?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/735018163515191434/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=735018163515191434' title='3 Comments'/><link rel='edit' 
type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/735018163515191434'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/735018163515191434'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/10/appengine-memcache-memoize-decorator.html' title='appengine memcache memoize decorator'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-2138435457805437993</id><published>2008-10-03T10:10:00.000-07:00</published><updated>2008-10-27T11:20:40.749-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='genedex'/><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><title type='text'>genedex.fasta with numpy.memmap</title><content type='html'>EDIT: added job posting to comments.&lt;br /&gt;I've been working a bit on &lt;a href="http://code.google.com/p/genedex/"&gt;genedex&lt;/a&gt;, I'm still not happy with the way it stores features. Which is a huge pickle of dictionaries where every dictionary is a 'feature' that looks like: {'name':'At2g26540', 'start': 1234, 'stop': 3456, 'strand': 1, 'chr': 2}. So the only way to do a search is by location--and that is _very_ fast, thanks to &lt;a href="http://pypi.python.org/pypi/Rtree"&gt;rtree&lt;/a&gt;, but there's no way to search by name or any other attribute--and an entire organism is loaded into memory at once--that part actually works out ok, but it feels dirty. 
I quickly wrote an SQLAlchemy-backed interface to a simple db schema to allow this sort of searching here: &lt;a href="http://code.google.com/p/genedex/source/browse/trunk/genedex/models/sqla.py"&gt;http://code.google.com/p/genedex/source/browse/trunk/genedex/models/sqla.py&lt;/a&gt;. That already supports Feature.upstream(), downstream(), etc. methods, but it will work nicely once python supports &lt;a href="http://www.sqlite.org/rtree.html"&gt;sqlite rtree&lt;/a&gt; without any extra work--for now, it just uses BTree indexes on the start and stop. I could use rtree to index the sqlite database, but I'd like to move away from the LGPL. Maybe I'll use &lt;a href="http://projects.scipy.org/pipermail/numpy-discussion/2008-September/037776.html"&gt;this&lt;/a&gt; KDTree, which is already in a scipy branch and has a more permissive license. Then it could do both spatial and attribute queries... &lt;br /&gt;That's all tinkering...&lt;br /&gt;&lt;br /&gt;I also cleaned up the genedex.fasta module. The usage is nice, even if the implementation isn't entirely. A fasta file can look like:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;&amp;gt;chr1&lt;br /&gt;ATGTCGTCGGCCGC&lt;br /&gt;GGGCCAAGA&lt;br /&gt;CAACGGAGA&lt;br /&gt;&lt;br /&gt;&amp;gt;chr3&lt;br /&gt;ATGGAGGAGGCTGGCGAGCGG&lt;br /&gt;&lt;br /&gt;&amp;gt;chr2&lt;br /&gt;ATGGCGTGC&lt;br /&gt;ACGGCGGCG&lt;br /&gt;CGCATGTT&lt;br /&gt;CGCCT&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;where a line starting with &amp;gt; is the name of the sequence ('chr1') and everything up until the next &amp;gt; is the sequence. The problem is the newlines, so every time you want to look at chr1 basepairs 10 to 20, you have to find where the sequence starts, and account for newlines. -- Actually, one should never do that by hand, as Biopython, and pretty much any library, will take care of that for you. Pygr, for example, creates a new file something.fasta.pureseq which removes all newlines and labels, and indexes where the starts and stops are. 
genedex.fasta.Fasta now does something similar; here's example usage on the file above ('123.fasta').&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;from genedex import Fasta&lt;br /&gt;f = Fasta('123.fasta')&lt;br /&gt;print f.keys()&lt;br /&gt;print f['chr1'][9:20].tostring()&lt;br /&gt;print f['chr1'][9:20]&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Afterward, the fasta file looks like this:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;&amp;gt;chr1&lt;br /&gt;ATGTCGTCGGCCGCGGGCCAAGACAACGGAGA&lt;br /&gt;&amp;gt;chr3&lt;br /&gt;ATGGAGGAGGCTGGCGAGCGG&lt;br /&gt;&amp;gt;chr2&lt;br /&gt;ATGGCGTGCACGGCGGCGCGCATGTTCGCCT&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;with the line breaks inside each sequence removed--so it's still a valid fasta file (you can also send an argument to the constructor and it will create a new file) and there's a new file 123.fasta.gdx that is a python pickle containing:&lt;br /&gt;{'chr3': (45L, 67L), 'chr2': (73L, 105L), 'chr1': (6L, 39L)}&lt;br /&gt;which indicates the start and stop positions of each sequence in the file.&lt;br /&gt;So the file remains a valid fasta file, but now it can be efficiently sliced. For now, it actually uses a numpy memmap (numpy.memmap), to take advantage of the broadcasting; other than that, it'd be simpler to just use python mmap. So, when it sees f['chr1'][10:20] it acts just like it's indexing a numpy array, but it's accessing the data directly from the disk (well, not actually, but mmap works its magic and I don't have to think about that). I like that--I can keep my fasta files valid, add only a small python pickle file, and get simple, fast, pythonic indexing into them. It does take about 12 seconds to index and flatten the entire sorghum genome (629MB, ~660 million basepairs) on first view; after that first time, it's instantaneous. 
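&lt;br /&gt;&lt;br /&gt;The flatten-and-index step can be sketched roughly as follows. This is an illustration under my own assumptions rather than the genedex implementation: flatten() is a hypothetical helper, the file name is made up, and the index uses an exclusive stop, so its numbers need not match the .gdx pickle above. The only numpy call is numpy.memmap with a one-byte string dtype:&lt;br /&gt;&lt;pre class="prettyprint"&gt;
import os
import tempfile

import numpy as np

HEADER = b"&gt;"


def flatten(raw):
    """Put each record's sequence on one line and index where it sits.

    Returns (flat, index); index[name] is the (start, stop) byte range
    of that sequence inside flat, with stop exclusive.
    """
    records = []                       # [name, [sequence chunks]] pairs
    for line in raw.splitlines():
        line = line.strip()
        if line.startswith(HEADER):
            records.append([line[1:], []])
        elif line:                     # skip blank lines
            records[-1][1].append(line)

    flat, index = bytearray(), {}
    for name, chunks in records:
        flat += HEADER + name + b"\n"
        seq = b"".join(chunks)
        index[name.decode()] = (len(flat), len(flat) + len(seq))
        flat += seq + b"\n"
    return bytes(flat), index


raw = b"&gt;chr1\nATGTCGTCGGCCGC\nGGGCCAAGA\nCAACGGAGA\n\n&gt;chr3\nATGGAGGAGGCTGGCGAGCGG\n"
flat, index = flatten(raw)

path = os.path.join(tempfile.mkdtemp(), "flat.fasta")
with open(path, "wb") as fh:
    fh.write(flat)

# One byte per array element, paged in from disk only when touched.
mm = np.memmap(path, dtype="S1", mode="r")
start, stop = index["chr1"]
chr1 = mm[start:stop]
print(chr1[9:20].tobytes())
&lt;/pre&gt;&lt;br /&gt;Slicing the memmap twice, first to the sequence and then to the region, keeps the coordinates relative to the chromosome, which is what f['chr1'][9:20] does from the caller's point of view.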
&lt;br /&gt;Anyway, the source is available and easy_install-able:&lt;pre&gt;&lt;br /&gt;svn checkout http://genedex.googlecode.com/svn/trunk/ genedex&lt;br /&gt;&lt;/pre&gt; As always, I'll gladly take any improvements, bugs, enhancements.&lt;br /&gt;&lt;br /&gt;Also, our lab at UCB is looking for another (full-time or close to it, on-site) programmer who knows some biology, perl, and hopefully some python. If you're interested, or know anyone, send me an email. I have no real authority in the matter (or any matter) but I will have some say in this. I'd like to work with someone I can learn from. I'll add a link to the job posting in the comments below once it's posted.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-2138435457805437993?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/2138435457805437993/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=2138435457805437993' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/2138435457805437993'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/2138435457805437993'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/10/genedexfasta-with-numpymemmap.html' title='genedex.fasta with numpy.memmap'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-943774744587032221</id><published>2008-08-06T12:16:00.000-07:00</published><updated>2008-08-06T12:16:14.922-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='django'/><category scheme='http://www.blogger.com/atom/ns#' term='gis'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><title type='text'>choosing django</title><content type='html'>I prefer sqlalchemy and genshi (or mako) and was therefore looking at using turbogears, but I saw a demo of the django admin, and that sold me. Certainly the templating language did not. Before this, I'd only used &lt;a href="http://webpy.org/"&gt;web.py&lt;/a&gt; in my projects. These are the things I've liked/noted:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The first and most important: &lt;a href="http://code.google.com/hosting/search?q=label%3Adjango"&gt;community&lt;/a&gt;. (Oddly enough, as I write this there are 666 projects tagged as django. 'turbogears', 'tg2', 'tg' give less than 50 projects combined. Think someone might have already written what you need? yep.&lt;br /&gt;Also, a great site: &lt;a href="http://www.djangosnippets.org/"&gt;http://www.djangosnippets.org/&lt;/a&gt;, where I've learned a lot just by reading, and saved myself a lot of time, by extending ideas there. &lt;br /&gt;And the development is &lt;a href="http://code.djangoproject.com/log/"&gt;active&lt;/a&gt;.  &lt;br /&gt;&lt;br /&gt;Second, django.contrib.* &lt;ul&gt;&lt;li&gt;User authentication is simple, and check google-code for various alterations on the theme.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;admin. This was what first made me decide on django. And now, &lt;a href="http://devpicayune.com/pycon2008/django_admin.html"&gt;new-forms admin&lt;/a&gt; is in trunk. This gives you a pretty nice CRUD interface for models in your app. 
In the app I've been working on, we have row _and_ field level permissions. We also need to let users edit the fields of other users stored in the database--but only certain fields, with which fields depending on the user viewing and the user being edited. This bit is &lt;a href="http://code.djangoproject.com/wiki/CookBookThreadlocalsAndUser"&gt;more hacky&lt;/a&gt; than it should be, but quite simple. &lt;br /&gt;My biggest gripe about the admin is that it's too &lt;a href="http://juripakaste.fi/cgi/pyblosxom.cgi/custom_django_newforms-admin_widgets.html"&gt;complicated&lt;/a&gt; to use custom widgets or validation. Hopefully, this &lt;a href="http://code.djangoproject.com/ticket/6845"&gt;will change&lt;/a&gt;.&lt;br /&gt;Oh, and it's too hard to have read_only privileges. (Yeah, I know the admin &lt;a href="http://www.google.com/search?q=trusted+users+editing+structred+content&amp;btnG=Search"&gt;mantra&lt;/a&gt;).&lt;br /&gt;&lt;/li&gt; &lt;br /&gt;&lt;li&gt;&lt;a href="http://geodjango.org/"&gt;GIS&lt;/a&gt;: nuf said.&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;p&gt;&lt;a href="http://www.djangoproject.com/documentation/i18n/"&gt;i18n, t9n&lt;/a&gt;. Anywhere there's some text to be displayed in the app, I wrap it in ugettext_lazy (aliased to _), and later, dump it and send it to someone who knows Chinese. When it returns, I make messages, and the app will show in English or Chinese depending on browser preferences. Simple.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;docs/help: yeah, the docs may have trouble keeping up, especially with recent rate of change, but it's easy enough to find what you need, and the django official docs are pretty nice. And if you can handle the fact that most responses you'll get on #django will begin with "of course, ...", then it's a great place to get help. 
&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;p&gt;&lt;a href="http://www.djangoproject.com/documentation/modelforms/"&gt;ModelForms&lt;/a&gt;: I've just started using these, but, they've already saved me a lot of code. &lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;There's a lot of other nice things, and a lot of django that I don't even know. I'd still consider myself a newb, but it's still possible to &lt;i&gt;get sh*t done&lt;/i&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-943774744587032221?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/943774744587032221/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=943774744587032221' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/943774744587032221'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/943774744587032221'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/08/choosing-django.html' title='choosing django'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-2654162750834493402</id><published>2008-06-20T16:51:00.000-07:00</published><updated>2008-07-10T08:29:01.402-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='pylab'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><title type='text'>pylab matplotlib imagemap</title><content type='html'>UPDATE 
7-10-08: &lt;br /&gt;+ add example for scatter plot&lt;br /&gt;+ link to ken-ichi&lt;br /&gt;===&lt;br /&gt;Figuring how to make a client side image map from a &lt;a href="http://matplotlib.sourceforge.net/"&gt;matplotlib&lt;/a&gt; image has stumped me more than once. Andrew Dalke does have a &lt;a href="http://www.dalkescientific.com/writings/diary/archive/2005/04/24/interactive_html.html"&gt;working&lt;/a&gt; example. Below, I have the minimal example.&lt;br /&gt;&lt;br /&gt;It's simple once you get the steps right: just use mpl's transform() to convert the data into the image's coordinate system. Then flip the y-axis as required by the imagemap, then do the normal imagemap stuff and save the html. The only real gotcha, is to make sure to put the dpi in the call to savefig().&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;import pylab&lt;br /&gt;import sys&lt;br /&gt;import random&lt;br /&gt;&lt;br /&gt;name = 'imap'&lt;br /&gt;&lt;br /&gt;# make some fake data&lt;br /&gt;xs = range(15)&lt;br /&gt;ys = [random.choice(xs) for i in range(len(xs))]&lt;br /&gt;&lt;br /&gt;xys = zip(xs, ys)&lt;br /&gt;&lt;br /&gt;# can also use : f = pylab.subplot(121)&lt;br /&gt;f, = pylab.plot(xs, ys, 'ro')&lt;br /&gt;dpi = f.figure.get_dpi()&lt;br /&gt;height = f.figure.get_figheight() * dpi&lt;br /&gt;&lt;br /&gt;# convert the x,y coords into image coords.&lt;br /&gt;transform = f.get_transform()&lt;br /&gt;icoords = transform.transform(xys)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;# the minimal 'template' to generate an image map.&lt;br /&gt;tmpl = """&lt;br /&gt;&amp;lt;html&amp;gt;&amp;lt;head&amp;gt;&amp;lt;/head&amp;gt;&amp;lt;body&amp;gt;&lt;br /&gt;&amp;lt;img src="%s.png" usemap="#points" border="0"&amp;gt;&lt;br /&gt;&amp;lt;map name="points"&amp;gt;%s&amp;lt;/map&amp;gt;&lt;br /&gt;&amp;lt;/body&amp;gt;&amp;lt;/html&amp;gt;"""&lt;br /&gt;&lt;br /&gt;# change this as needed, e.g. 
if not plotting points.&lt;br /&gt;fmt = "&amp;lt;area shape='circle' coords='%i,%i,2' href='http://example.com/%i/%i' &amp;gt;"&lt;br /&gt;&lt;br /&gt;# need to do height - y for the image-map&lt;br /&gt;fmts = [fmt % (ix, height - iy, x, y) for (ix, iy), (x, y) in zip(icoords, xys) ]&lt;br /&gt;&lt;br /&gt;# NOTE, this dpi is needed!&lt;br /&gt;pylab.savefig('imap' + '.png', dpi=dpi)&lt;br /&gt;print &gt;&gt; open(name + ".html", 'w'), tmpl % (name, "\n".join(fmts))&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;UPDATE:&lt;br /&gt;When trying to figure how to do this for a pylab.scatter plot, I found &lt;a href="http://www.pageofguh.org/random/668"&gt;Ken-ichi had also done this&lt;/a&gt; for a scatter plot.&lt;br /&gt;As of a matplotlib trunk revision 5711, the transform does not get set when the scatter plot is drawn. To set it, I had to set the transform explicitly:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;   s = pylab.scatter(xs, ys)&lt;br /&gt;   s.set_transform(s.axes.transData)&lt;br /&gt;   transformed_xys = s.get_transform().transform(zip(xs,ys))&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-2654162750834493402?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/2654162750834493402/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=2654162750834493402' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/2654162750834493402'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/2654162750834493402'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/06/pylab-matplotlib-imagemap.html' title='pylab matplotlib 
imagemap'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-5439167652659210202</id><published>2008-06-04T22:42:00.000-07:00</published><updated>2009-06-05T10:05:28.117-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='cython'/><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><title type='text'>binary search over intervals</title><content type='html'>[EDIT: update location of code repo]&lt;br /&gt;&lt;br /&gt;This isn't particularly advanced or clever, it's just a simple implementation--less code than anything else I could come up with. &lt;br /&gt; &lt;br /&gt;Binary search is easy. Just look at the python &lt;a href="http://svn.python.org/view/python/trunk/Lib/bisect.py?rev=54666&amp;view=markup"&gt;std library implementation&lt;/a&gt; (and the &lt;a href="http://svn.python.org/view/python/trunk/Modules/_bisectmodule.c?rev=42609&amp;view=markup"&gt;C API version&lt;/a&gt;). When you play the game with a friend of guessing a number between 0 and 100, you guess 50, your friend tells you "higher", you guess 75. That's pretty much binary search. It takes about 7 guesses max to guess a number between 0 and 100. It just requires that the numbers be in order. &lt;br /&gt;Interval search is more difficult. It's not just looking for a single value, but rather for a range of values that overlap a given interval. Also, you can't sort on just start, because some intervals are longer than others, so when sorting by start, the stops may be out of order. So, you have to arrange some protocol. 
In all the examples I've seen, including the &lt;a href="http://en.wikipedia.org/wiki/Interval_tree"&gt;explanation here&lt;/a&gt;, that means storing not only the start, but the stop and often the center of every interval--and using them to do the search. That makes things considerably more complicated than binary search. I started with an implementation of interval search &lt;a href="http://bx-python.trac.bx.psu.edu/"&gt;here&lt;/a&gt;, but couldn't figure out how to customize it. &lt;br /&gt;&lt;br /&gt;Binary search is kind of a special case of interval search where intervals are of exactly length 0, i.e. start == stop. So, what if all intervals are of exactly length 2? Well, then you can sort by start, find the left-most index by looking for start - 2 and find the right-most index by searching for the (highest) index of start. That will give you the indices in the sorted array of all intervals that overlap the query. The lowest and highest correspond to python's bisect.bisect_left and bisect.bisect_right, respectively. &lt;br /&gt;This carries to any fixed length. But what if the intervals are different lengths? Well, then you can save the longest length, and then for any search, it's:&lt;br /&gt;p_overlaps = intervals[bisect_left(start - max_len):bisect_right(stop)]&lt;br /&gt;but that only gives the putative overlaps. Since we extended the left by max_len, and we may have found an interval whose length was &lt; max_len (meaning its stop is before the start of the query) we have to explicitly test for distance:&lt;br /&gt;real_overlaps = [iv for iv in p_overlaps if distance(query_interval, iv) == 0]&lt;br /&gt;which gives only the intervals that overlap. So, that part is the price to pay for the simplified search. 
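&lt;br /&gt;&lt;br /&gt;A minimal pure-python sketch of the search described above, not the cython code in the repo. The names Interval and find_overlaps are hypothetical, intervals are assumed to be closed, and a direct test of each stop against the query start stands in for the distance() check:&lt;br /&gt;&lt;pre class="prettyprint"&gt;
```python
from bisect import bisect_left, bisect_right
from collections import namedtuple

# Hypothetical container; any object with .start and .stop would do.
Interval = namedtuple("Interval", ["start", "stop"])


def find_overlaps(intervals, starts, max_len, q_start, q_stop):
    """Return the intervals that overlap the closed query [q_start, q_stop].

    intervals must be sorted by start, starts is the parallel list of
    start values, and max_len is the longest (stop - start) in the list.
    """
    # Widen the left edge by max_len so a long interval that starts
    # before the query but reaches into it is not missed.
    lo = bisect_left(starts, q_start - max_len)
    hi = bisect_right(starts, q_stop)
    # The slice is only putative: drop anything that stops before the query.
    return [iv for iv in intervals[lo:hi] if iv.stop >= q_start]


ivs = sorted([Interval(0, 5), Interval(3, 4), Interval(10, 20), Interval(18, 30)])
starts = [iv.start for iv in ivs]
max_len = max(iv.stop - iv.start for iv in ivs)
hits = find_overlaps(ivs, starts, max_len, 5, 11)
print([(iv.start, iv.stop) for iv in hits])  # prints [(0, 5), (10, 20)]
```
&lt;/pre&gt;In this sketch max_len has to be recomputed (or at least updated) whenever an interval is added, otherwise the left edge is not widened enough.&lt;br /&gt;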
Another way, as suggested in the wikipedia article, is to store the length of each interval as part of the interval.&lt;br /&gt;&lt;br /&gt;In this setup, the worst case scenario is when a single looooong interval covers the entire range of the list of intervals. Then every search is a linear, brute-force search. However, my use for this is genomic data. There, I'll have an entire range of say... 50 million, and the intervals (genes) are rarely longer than 4000 in length (basepairs). So, it's useful to optimize for simplicity.  &lt;br /&gt;&lt;br /&gt;My &lt;a href="http://cython.org/"&gt;cython&lt;/a&gt; version of this is in my googlecode repo &lt;a href="http://code.google.com/p/bpbio/source/browse/trunk/intersect/intersection.pyx"&gt;here&lt;/a&gt;. It has all the stuff I use: methods for left(), right(), upstream(), downstream(), nearest_neighbors(). Most of the searching work is in the binsearch_* functions -- I couldn't use the python ones. There are a couple of hacks in there: &lt;br /&gt;1) because pyrex/cython don't support closures &lt;br /&gt;2) the left() method is confusing because the intervals are sorted by start, and the left() has to find the nearest intervals by stop(). That's where complexity is increased because of this setup. &lt;br /&gt;&lt;br /&gt;On my machine, it creates a tree with 6815 intervals in .016 seconds and does 50000 searches on that tree in .14 seconds. It seems to scale well with the number of features as 50K searches on 68150 features take .50 seconds. Adding an interval that covers the entire range of all other features (results in worst-case linear search) makes the 50K searches (on 6185 intervals) take 1.54 seconds--which is only so good because the brute-force in method is pretty close to c-speed. 
This could be optimized by saving the stops in a separate array, or by saving long intervals in a separate array, but it's rare enough, and the code is simple enough as-is, that I'll probably leave it for now.&lt;br /&gt;&lt;br /&gt;The "proper" way to do this is with an interval/segment tree, of which there's a very readable version in &lt;a href="http://bx-python.trac.bx.psu.edu/browser/trunk/lib/bx/intervals/operations/quicksect.py"&gt;bx-python&lt;/a&gt;. If I'd found that earlier, I probably wouldn't have coded this... The tree is faster for larger number of intervals, but that's rarely going to be an issue, it does take much less memory...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-5439167652659210202?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/5439167652659210202/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=5439167652659210202' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/5439167652659210202'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/5439167652659210202'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/06/binary-search-over-intervals.html' title='binary search over intervals'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-1718971444921677180</id><published>2008-05-21T18:44:00.000-07:00</published><updated>2008-05-21T18:44:24.612-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='oss'/><category scheme='http://www.blogger.com/atom/ns#' term='gis'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><title type='text'>wherecamp</title><content type='html'>I agree with every report I've seen. &lt;a href="http://wherecamp.pbwiki.org/WhereCamp2008"&gt;Wherecamp&lt;/a&gt; was awesome. I've been telling people, and I'm still sure it's true, that I met exactly zero people who I'd physically seen before. Ordinarily, I avoid meetings, but this is a good format and seems to attract good people. It's fun to meet and work with people who are really into what they do. The talks are less "talky" and more like chat sessions--which is possible when the groups are small. &lt;br /&gt;&lt;br /&gt;There was also plenty of time to hack, which was the original reason I went. During and after, I learned some simple things which I'm trying to incorporate into my usual workflow:&lt;br /&gt;In the shell, suspend a job with "ctrl + z" then get back to it with %i where i is the number shown in the output from "jobs". That's a trick from jlivni.&lt;br /&gt;&lt;br /&gt;From &lt;a href="http://crschmidt.net/"&gt;crschmidt&lt;/a&gt;, I added:&lt;br /&gt;alias doctest="nosetests --with-doctest --doctest-extension=.txt"&lt;br /&gt;to my .bash_aliases. Which lets me do:&lt;br /&gt;doctest tests/&lt;br /&gt;or &lt;br /&gt;doctest tests/test_somefile.txt&lt;br /&gt;to run my doctests instead of python -c "import doctest;doctest.testfile('...')"&lt;br /&gt;&lt;br /&gt;And springmeyer showed me a ton of django and &lt;a href="http://geodjango.org/"&gt;geodjango&lt;/a&gt;. The admin stuff is just ... 
nice -- it's how making a db front end should be. I still don't know how to learn that stuff on my own; it seems that for a lot of it, you just have to know which modules to import, and the django book doesn't cover newforms or the new admin stuff as far as I can tell.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Quotes&lt;/h4&gt;&lt;br /&gt;On Friday night, we met up in SF to do some hacking, which never went down, as I couldn't get wireless and it turned into more of a real bar trip. We were, however, talking about python. At one point, it was sorta quiet and out of the silence, comes:&lt;br /&gt;"Python sucks"&lt;br /&gt;from a true lisp hacker in the next booth--complete with curly grey beard and spectacles. He actually turned out to be a cool guy, I think maybe he even admitted that if he couldn't use lisp, python was a reasonable choice--I think that's about as much as you can expect from a lisper. &lt;br /&gt;&lt;br /&gt;From crschmidt:&lt;br /&gt;"I don't really know python that well"&lt;br /&gt;Then who was it that basically rewrote featureserver between the hours of 2AM and 9AM when everyone else was sleeping?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-1718971444921677180?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/1718971444921677180/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=1718971444921677180' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/1718971444921677180'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/1718971444921677180'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/05/wherecamp.html' 
title='wherecamp'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-7678669865807190036</id><published>2008-05-14T18:46:00.000-07:00</published><updated>2008-05-14T18:52:34.465-07:00</updated><title type='text'>slicehost, trac, wherecamp</title><content type='html'>I have a development "server" here beside me. It's actually a budget laptop that sold for $799 2 years ago. It's a xubuntu machine, hosting a trac instance, a development server for mapping stuff, postgresql/postgis, mapserver, mysql, and couple svn repos, anything I do for contracting, etc. Oh, and it's also hosting a couple of sites for the &lt;a href="http://www.environcorp.com/locations/"&gt;multi-national&lt;/a&gt; company that my gf works for! all of their servers are windows machines (long rant suppressed). &lt;br /&gt;It used to get warm, so propped it up on 4 tuna cans, 1 for each corner, now it stays cooler. Yep, it's a sweet setup.&lt;br /&gt;Anyway, I pay AT&amp;T or SBC --or whatever they are now called-- for static IP's and a supposedly faster internet connection. My 1 year contract for that is nearly up, so I'm switching to &lt;a href="http://slicehost.com"&gt;slicehost&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;I'm not a sys-admin, I sorta do that for 4 gentoo (not my choice) machines at $work, and my strategy is to set up, rsnapshot and never, ever emerge -u world. ever. So far, it's mostly working.  When I have the choice, I use (x/k)ubuntu, I don't care if they do magical stuff (or even if I had to redo my ssh keys today), it just works.&lt;br /&gt;&lt;br /&gt;Anyway, I want something easy and idiot proof. I'd heard the hype about slicehost when it came out, and figured it was just that, hype. 
It's not. I've never used shared hosting before, but this was pretty simple. From entering my payment to ssh'ing as root into my slice took ~ 2 minutes. /proc/cpuinfo shows it's a machine with 4 opteron dual-cores.  I started with a 256 slice with back-up. You can start new slices and restore from the backups. That's cool, and less $$ than I pay AT&amp;T for static IPs and faster uploads.&lt;br /&gt; &lt;br /&gt;I ran a script to apt-get all the packages I use, and had a base, working system in under 1 hour. They have a lot of articles about how to set stuff up, mostly basic (even for me), but I followed their info on setting up &lt;a href="http://articles.slicehost.com/2008/4/25/ubuntu-hardy-setup-page-1"&gt;iptables&lt;/a&gt;. Predictably, I forgot to leave open port 22 and locked myself out of ssh, but they have a &lt;a href="http://www.slicehost.com/articles/2006/9/18/ajax-console-for-your-slice"&gt;web-based console&lt;/a&gt;, so, not a problem. It seems to be idiot proof...&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Trac&lt;/h4&gt;Also, in the theme of things that just work as they should, &lt;a href="http://trac.edgewall.org/wiki/TracUpgrade"&gt;upgrading&lt;/a&gt; to Trac &lt;a href="http://trac.edgewall.org/wiki/TracInstall"&gt;0.11&lt;/a&gt; (still in rc). &lt;br /&gt;&lt;pre&gt;&lt;br /&gt;sudo easy_install -UZ Trac==0.11rc1&lt;br /&gt;cd /path/to/trac/project/&lt;br /&gt;sudo trac-admin . upgrade&lt;br /&gt;sudo trac-admin . upgrade wiki &lt;br /&gt;sudo apache2ctl restart&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;h4&gt;WhereCamp&lt;/h4&gt;&lt;br /&gt;I'll likely be at wherecamp, it'll be good to learn some stuff, and meet people I only know from IRC. 
If anyone needs a ride from the east-bay, let me know.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-7678669865807190036?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/7678669865807190036/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=7678669865807190036' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/7678669865807190036'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/7678669865807190036'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/05/slicehost-trac-wherecamp.html' title='slicehost, trac, wherecamp'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-3153422017378901513</id><published>2008-05-08T21:46:00.000-07:00</published><updated>2008-05-08T23:56:10.529-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bioinformatics'/><category scheme='http://www.blogger.com/atom/ns#' term='gis'/><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><title type='text'>openlayers, genomes and image-maps</title><content type='html'>In response to &lt;a href="http://ivory.idyll.org/blog/may-08/lazyweb-javascript-image-stuff"&gt;Titus' post&lt;/a&gt; on using imagemaps for genomic visualization:&lt;br /&gt;Why are imagemaps so popular in genomics? 
As an extreme and unfair comparison, just imagine if &lt;a href="http://maps.google.com"&gt;http://maps.google.com&lt;/a&gt; was an image map. &lt;br /&gt;Given a CGI script that can accept a url like &lt;pre&gt;&amp;start=1024&amp;stop=2048&amp;chr=3&lt;/pre&gt; and return an appropriate image, you can provide a substantial set of tools using &lt;a href="http://openlayers.org"&gt;openlayers&lt;/a&gt;, which is developed by what must be one of the largest and most active developer communities in GIS. (Yes, I am an openlayers fan-boy.)&lt;br /&gt;You can do that with &lt;a href="http://128.32.8.100/genome-browser/"&gt;a small addition&lt;/a&gt; to openlayers which I updated a couple weeks ago to OL version 2.6. In that update, &lt;a href="http://code.google.com/p/genome-browser/source/detail?r=18"&gt;I removed &gt; 140 lines of code&lt;/a&gt;. So, it's now even less of a change to OL. Maybe when 2.7 comes out, I'll figure out how to provide a patch that allows an extra argument to the OpenLayers.Map constructor that limits panning to the horizontal direction -- in which case genome-browser will cease to exist and only the single file containing OpenLayers.Layer.Genomic would be needed.&lt;br /&gt;&lt;br /&gt;Maybe I'd need to make a real example of using openlayers for genomics, making more obvious use of layers, the vector stuff, fractional zoom,  markers, geocoding, projections, power steering, etc to make it more obvious how badass OL can be. As far as I know, I'm the only one using it for genomics, and 98% of the time I spent developing it was in my spare time--I only have a good excuse to hack on it at $work when something breaks. If more people were using it, I might be more motivated to figure out how to do stacking of images vertically, but still restricting scrolling to the horizontal--to essentially allow the same thing as "tracks" in other genome browsers. 
Maybe that's the killer feature.&lt;br /&gt;&lt;h4&gt;And&lt;/h4&gt;&lt;br /&gt;An observation on the difference between the bio and geo programming environments as I interpret them: &lt;br /&gt;There are some tools that (IMHO) are better in the geo world than in the bio. Perhaps that's because of &lt;a href="http://gdal.org"&gt;GDAL&lt;/a&gt;, the keystone for geo-data formats and projections. Since any other software (in any SWIG-able language) that makes use of gdal can access pretty much any format and I/O to any projection, geo developers can do things like make nice &lt;a href="http://mapserver.gis.umn.edu/"&gt;renderers&lt;/a&gt;, or web-based map &lt;a href="http://openlayers.org/"&gt;browsers&lt;/a&gt; rather than figuring how to convert that mrSID image in epsg:4326 to a tif in epsg:900913. (I'm also a GDAL fan-boy.)&lt;br /&gt;&lt;br /&gt;Contrast this to the bio world where there's a bio-* project for nearly every common language: bio-java, bio-perl, bio-python, bio-ruby, each with its own blast parser, gen-bank parser, sequence objects, alignment objects. There is no keystone, so there's more duplication of effort, and there's no parser that's used across languages as is the case for GIS. 
-- I'm not suggesting that shouldn't be the case, simply making an observation.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-3153422017378901513?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/3153422017378901513/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=3153422017378901513' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/3153422017378901513'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/3153422017378901513'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/05/openlayers-genomes-and-image-maps.html' title='openlayers, genomes and image-maps'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-3872468823428420977</id><published>2008-05-05T17:15:00.000-07:00</published><updated>2008-05-05T17:23:10.463-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='seqfind'/><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><title type='text'>seqfind: levenshtein + bktree</title><content type='html'>I've copied &lt;a href="http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/572156"&gt;this&lt;/a&gt; recipe that I modified before, and added the BK Tree structure in cython. It's in my repo &lt;a href="http://code.google.com/p/bpbio/source/browse/trunk/seqfind/seqfind.pyx"&gt;here&lt;/a&gt;. 
&lt;br /&gt;Check it out with:&lt;pre&gt;svn checkout http://bpbio.googlecode.com/svn/trunk/seqfind&lt;/pre&gt;&lt;br /&gt;or easy_install with&lt;pre&gt;sudo easy_install http://bpbio.googlecode.com/svn/trunk/seqfind&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;It's now using the &lt;a href="http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance"&gt;Damerau-Levenshtein distance&lt;/a&gt; which is more sensible for bioinformatics where transpositions are frequent. &lt;br /&gt;&lt;br /&gt;Bearophile's original implementation used a tuple, which made sense, but in Cython, it's more efficient to use an object where the properties can be typed--as a class is converted to a c-struct--so there is no conversion when appending to a python array -- if I understand the generated c code correctly. &lt;br /&gt; &lt;br /&gt;Using an object also allows arbitrary info to be passed along with the word when creating the tree. Again, this is important for bioinformatics when the string is something like "actgcc ... acgtc" and it's useful to attach some annotation to it like:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;words = [Word("actgc ... 
acgtc", {'name': 'At2g26540'}), ...]&lt;br /&gt;#and then create a tree in the same fashion as with raw strings:&lt;br /&gt;tree = BKTree(words)&lt;br /&gt;# and search returns a word object:&lt;br /&gt;[(w.word, w.info['name']) for w in tree.find("atc", 2)]&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;It'll suffice to say (the obvious) it's a lot faster searching with the tree, than comparing every word, on every search.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-3872468823428420977?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://code.google.com/p/bpbio/source/browse/trunk/seqfind/seqfind.pyx' title='seqfind: levenshtein + bktree'/><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/3872468823428420977/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=3872468823428420977' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/3872468823428420977'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/3872468823428420977'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/05/seqfind-levenshtein-bktree.html' title='seqfind: levenshtein + bktree'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-2537082860383525360</id><published>2008-04-30T23:26:00.000-07:00</published><updated>2008-04-30T22:47:16.655-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='haxe'/><category scheme='http://www.blogger.com/atom/ns#' term='flash'/><title type='text'>flash (n)back</title><content type='html'>&lt;a href="http://www.newscientist.com/channel/being-human/dn13786-simple-brain-exercise-can-boost-iq.html"&gt;The PNAS article linked here&lt;/a&gt; found subjects could improve IQ with training. I've written a simple flash version of their protocol over a couple of long evenings in &lt;a href="http://haxe.org/"&gt;haxe&lt;/a&gt;/flash. &lt;br /&gt;&lt;br /&gt;The article's methods list 2 stimuli: a moving box and spoken letters. The test subject is to respond (click in my case) when the box position or the spoken letter is the same as it was 2 time steps ago, where 2 is increased as the subject gets better. I didn't do sound; I just show a big letter. Clearly, the logical thing to do is use it for a couple weeks and then implement the sound when I'm smarter. (I've never really used &lt;a href="http://swfmill.org"&gt;swfmill&lt;/a&gt;, but I think that'd be useful here...) &lt;br /&gt;&lt;br /&gt;The article is ambiguous about when the letter is to sound, so I've made both the letter and the box appear at the same time. The default, as in the article, is to have 3 seconds between events, and to show the box for 0.5 seconds. I also add some indication of whether the answer was correct (green +) or not (red -). That actually makes things more difficult as it's distracting to see something new flash on the screen. It also keeps a running total of misses (didn't click when should have), correct, and incorrect (clicked when shouldn't have). The grid size is set at 3 * 3 as that's more than difficult enough for me, and it appears to be what they used in the article. It actually only has 8 positions as I use the center to display the letter. Another ambiguity from the article is the number of letters. I use 3. That's easily changeable in the code. 
&lt;br /&gt;&lt;br /&gt;The length of the time step (time_step, 3000ms default), the amount of time to show the box and text (show_time, 500ms default) and the number of steps back (nback, 2 default) are settable via the url, so the default equates to:&lt;br /&gt;?time_step=3000&amp;nback=2&amp;show_time=500&lt;br /&gt;&lt;h4&gt;Observations&lt;/h4&gt;1. It's freakin hard. I suck at it. &lt;br /&gt;2. Haxe is nice, but jeez, I write ugly actionscript. &lt;br /&gt;3. I like quick, pointless evening projects like this where I have a clear idea of the outcome. It's good for learning. &lt;br /&gt;4. Whatever points there are against flash, it's easy to uh, "deploy".&lt;br /&gt; &lt;br /&gt;&lt;br /&gt;&lt;a href="http://bpgeo.googlecode.com/svn/trunk/nback/index.html"&gt;Here's a live version of the app&lt;/a&gt; (untested in IE, but &lt;a href="http://code.google.com/p/swfobject/"&gt;swfobject&lt;/a&gt; should do its thing). Just click in the flash movie when there's something that's the same as 2 time-steps ago.&lt;br /&gt;You can make it arbitrarily hard by sending in parameters on the url; see the links in the page for examples. &lt;br /&gt;&lt;br /&gt;And &lt;a href="http://code.google.com/p/bpgeo/source/browse/trunk/nback/NBack.hx"&gt;here's the code&lt;/a&gt;. 
Get it from svn via:&lt;br /&gt;svn co http://bpgeo.googlecode.com/svn/trunk/nback&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-2537082860383525360?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://bpgeo.googlecode.com/svn/trunk/nback/index.html' title='flash (n)back'/><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/2537082860383525360/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=2537082860383525360' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/2537082860383525360'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/2537082860383525360'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/04/flash-nback.html' title='flash (n)back'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-8395920164732414555</id><published>2008-04-27T22:43:00.000-07:00</published><updated>2008-04-30T18:06:34.484-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='cython'/><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><title type='text'>levenshtein in cython</title><content type='html'>EDIT(2): fix markup for &amp;lt;char *&amp;gt; casts ... fix malloc (see comments. thanks Bao).&lt;br /&gt;NOTE: using a kwarg for limit slows things down. 
Setting that to a required arg and using calloc for m2 speeds things up to nearly as fast as pylevenshtein.&lt;br /&gt;&lt;br /&gt;Well, it seems to be popular to code up the &lt;a href="http://en.wikipedia.org/wiki/Levenshtein_distance"&gt;levenshtein&lt;/a&gt;. I actually have a use for this and wanted to practice some &lt;a href="http://cython.org/"&gt;Cython&lt;/a&gt;, so I've written a version. I used &lt;a href="http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/572156"&gt;bearophile's recipe&lt;/a&gt;, &lt;a href="http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance#Python"&gt;wikibooks&lt;/a&gt;, &lt;a href="http://codepad.org/RxgfdYfy"&gt;this&lt;/a&gt; (from k4st on reddit) and &lt;a href="http://markos.gaivo.net/examples/distance/distance.py"&gt;this&lt;/a&gt; for reference. It follows bearophile's code closely, using only O(m) space instead of O(mn).&lt;br /&gt;&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;cdef extern from "stdlib.h":&lt;br /&gt;    ctypedef unsigned int size_t&lt;br /&gt;    void *malloc(size_t size)&lt;br /&gt;    void free(void *ptr)&lt;br /&gt;&lt;br /&gt;# strlen and strcmp are declared in string.h, not stdlib.h&lt;br /&gt;cdef extern from "string.h":&lt;br /&gt;    size_t strlen(char *s)&lt;br /&gt;    int strcmp(char *a, char *b)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;# smallest of three ints&lt;br /&gt;cdef inline int imin(int a, int b, int c):&lt;br /&gt;    if a &lt; b:&lt;br /&gt;        if c &lt; a:&lt;br /&gt;            return c&lt;br /&gt;        return a&lt;br /&gt;    if c &lt; b:&lt;br /&gt;        return c&lt;br /&gt;    return b&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;cpdef int levenshtein(char *a, char *b, int limit=100):&lt;br /&gt;    cdef int m = strlen(a), n = strlen(b)&lt;br /&gt;    cdef char *ctmp&lt;br /&gt;    cdef int i = 0, j = 0, retval&lt;br /&gt;    cdef int achr, bchr&lt;br /&gt;     &lt;br /&gt;    if strcmp(a, b) == 0:&lt;br /&gt;        return 0&lt;br /&gt;&lt;br /&gt;    # make sure a is the shorter string&lt;br /&gt;    if m &gt; n:&lt;br /&gt;        ctmp = a&lt;br /&gt;        a = b&lt;br /&gt;        b = ctmp&lt;br /&gt;        #a, b = b, a&lt;br /&gt;    
    m, n = n, m&lt;br /&gt;        &lt;br /&gt;    # short circuit.&lt;br /&gt;    if n - m &gt;= limit:&lt;br /&gt;        return n - m&lt;br /&gt;&lt;br /&gt;    # int rows rather than char rows, so distances above 127 don't overflow&lt;br /&gt;    cdef int *m1 = &amp;lt;int *&amp;gt;malloc((n + 2) * sizeof(int))&lt;br /&gt;    cdef int *m2 = &amp;lt;int *&amp;gt;malloc((n + 2) * sizeof(int))&lt;br /&gt;    &lt;br /&gt;    # initialize all n + 2 cells of both rows&lt;br /&gt;    for i from 0 &lt;= i &lt;= n + 1:&lt;br /&gt;        m1[i] = i&lt;br /&gt;        m2[i] = 0&lt;br /&gt;&lt;br /&gt;    # both loops run one step past the end, onto the trailing NUL of each string&lt;br /&gt;    for i from 0 &lt;= i &lt;= m:&lt;br /&gt;        m2[0] = i + 1&lt;br /&gt;        achr = a[i]&lt;br /&gt;        for j from 0 &lt;= j &lt;= n:&lt;br /&gt;            bchr = b[j]&lt;br /&gt;            if achr == bchr:&lt;br /&gt;                m2[j + 1] = m1[j]&lt;br /&gt;            else:&lt;br /&gt;                m2[j + 1] = 1 + imin(m2[j], m1[j], m1[j + 1])&lt;br /&gt;&lt;br /&gt;        m1, m2 = m2, m1&lt;br /&gt;&lt;br /&gt;    retval = m1[n + 1]&lt;br /&gt;    free(m2)&lt;br /&gt;    free(m1)&lt;br /&gt;    return retval&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;I believe that's correct (if not, let me know!); it matches output from other versions (which I used for testing) and it doesn't leak memory, so I must have done the malloc/free correctly despite my lack of C-fu. &lt;br /&gt;Again, I'm not sure, but I think that's a pretty good example of how to mix Python and C with Cython as it's not much longer than the other Python versions and pretty readable. 
And it's very fast: it does 500K iterations of:&lt;br /&gt;levenshtein('i ehm a gude spehlar', 'i am a good speller')&lt;br /&gt;in 2.5 seconds.&lt;br /&gt;bearophile's psyco-ed &lt;a href="http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/572156"&gt;editDistanceFast&lt;/a&gt; took 3 minutes, 31 seconds to do the same (which is why I started this project...).&lt;br /&gt;&lt;br /&gt;Literally, just before I was to post this, I found &lt;a href="http://code.google.com/p/pylevenshtein/"&gt;pylevenshtein&lt;/a&gt;, which is even faster: it does the 500K in 1.5 seconds. Ho hum. It's doing more, handling unicode, and &lt;a href="http://code.google.com/p/pylevenshtein/source/browse/trunk/Levenshtein.c#2161"&gt;apparently doing some optimizations&lt;/a&gt;.... So, use that instead! -- and contribute a test-suite.&lt;br /&gt;Next, I think I'll try a &lt;a href="http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance"&gt;variant that allows transpositions&lt;/a&gt; (Damerau) and/or implement the BK-Tree, again cribbing off bearophile's recipe.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;For reference, here are the contents of the setup.py to build the module.&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;from distutils.core import setup&lt;br /&gt;from distutils.extension import Extension&lt;br /&gt;from Cython.Distutils import build_ext&lt;br /&gt;&lt;br /&gt;setup( name = 'levenshtein',&lt;br /&gt;  ext_modules=[ Extension("levenshtein", sources=["levenshtein.pyx"] ), ],&lt;br /&gt;  cmdclass = {'build_ext': build_ext})&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-8395920164732414555?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/8395920164732414555/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=8395920164732414555' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/8395920164732414555'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/8395920164732414555'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/04/levenshtein-in-cython.html' title='levenshtein in cython'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>10</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-6170868514119854729</id><published>2008-04-23T18:57:00.000-07:00</published><updated>2008-11-15T18:19:31.794-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gis'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='numpy'/><category scheme='http://www.blogger.com/atom/ns#' term='gdal'/><title type='text'>numpy to tiff via gdal</title><content type='html'>EDIT: 7 months later I came back to this and found an error. It's updated in the code below, with the old line commented out.&lt;br /&gt;&lt;br /&gt;Rather than venting about a project I've recently decoupled myself from, I'll try to do something constructive... I also &lt;a href="http://lists.gispython.org/pipermail/community/2008-April/001648.html"&gt;posted&lt;/a&gt; this to the gispython mailing list, but I've had to figure it out a couple times, so I'll put it here for the record. 
Given an N * N numpy array, and a bounding box, it's actually fairly simple to make a georeferenced tiff:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;from osgeo import gdal, gdal_array&lt;br /&gt;import numpy&lt;br /&gt;from osgeo.gdalconst import GDT_Float64&lt;br /&gt;&lt;br /&gt;xsize, ysize = 10, 10&lt;br /&gt;# numpy shape is (rows, cols), i.e. (ysize, xsize)&lt;br /&gt;a = numpy.random.random((ysize, xsize)).astype(numpy.float64)&lt;br /&gt;&lt;br /&gt;xmin, xmax = -121., -119.&lt;br /&gt;ymin, ymax = 41., 43.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;driver = gdal.GetDriverByName('GTiff')&lt;br /&gt;# bad: out = driver.Create('a.tiff', a.shape[0], a.shape[1], 1, GDT_Float64)&lt;br /&gt;# the args to Create are 'name', xsize, ysize. and .shape[0] is rows, which is y.&lt;br /&gt;out = driver.Create('a.tiff', a.shape[1], a.shape[0], 1, GDT_Float64)&lt;br /&gt;&lt;br /&gt;# pixel width is the x extent over the number of columns (shape[1]),&lt;br /&gt;# pixel height is the y extent over the number of rows (shape[0])&lt;br /&gt;out.SetGeoTransform([xmin&lt;br /&gt;                  , (xmax - xmin)/a.shape[1]&lt;br /&gt;                  , 0&lt;br /&gt;                  , ymin&lt;br /&gt;                  , 0&lt;br /&gt;                  , (ymax - ymin)/a.shape[0]])&lt;br /&gt;&lt;br /&gt;gdal_array.BandWriteArray(out.GetRasterBand(1), a)&lt;br /&gt;# flush to disk so the file is complete before it's reopened&lt;br /&gt;out.FlushCache()&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;where the SetGeoTransform bit is (I believe) the same stuff you'd stick in a world file.&lt;br /&gt; &lt;br /&gt;and plottable in pylab:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;import pylab&lt;br /&gt;tif = gdal.Open('a.tiff')&lt;br /&gt;a = tif.ReadAsArray()&lt;br /&gt;pylab.imshow(a)&lt;br /&gt;pylab.show()&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Anyone checked out matplotlib's &lt;a href="http://matplotlib.sourceforge.net/matplotlib.toolkits.basemap.basemap.html"&gt;basemap&lt;/a&gt; recently? 
I just wish they didn't rely on geos &lt; 3...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-6170868514119854729?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/6170868514119854729/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=6170868514119854729' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/6170868514119854729'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/6170868514119854729'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/04/numpy-to-tiff-via-gdal.html' title='numpy to tiff via gdal'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-3863776097725369337</id><published>2008-04-11T19:00:00.000-07:00</published><updated>2008-04-11T18:58:32.799-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='python'/><title type='text'>python script as wsgi, cgi, or standalone</title><content type='html'>EDIT:&lt;br /&gt;See below for original, I realized this could be done cleanly with a &lt;a href="http://www.python.org/dev/peps/pep-0318/"&gt;decorator&lt;/a&gt;.&lt;br /&gt;The decorator &lt;i&gt;wrapplication&lt;/i&gt; takes the number of the port to use when called as a standalone server. 
The EMiddle class is unnecessary, it's just used as &lt;a href="http://groovie.org/articles/2005/10/06/wsgi-and-wsgi-middleware-is-easy"&gt;middleware&lt;/a&gt; to update the environ to show it came via wsgi. If there's a cleaner way, let me know.&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;#!/usr/bin/python&lt;br /&gt;import os&lt;br /&gt;&lt;br /&gt;class EMiddle(object):&lt;br /&gt;    def __init__(self, app):&lt;br /&gt;        self.app = app&lt;br /&gt;    def __call__(self, env, start_response):&lt;br /&gt;        env['hello'] = 'wsgi'&lt;br /&gt;        return self.app(env, start_response)&lt;br /&gt;&lt;br /&gt;def wrapplication(port):&lt;br /&gt;    def wrapper(wsgi_app):&lt;br /&gt;        if 'TERM' in os.environ:&lt;br /&gt;            print "serving on port: %i" % port&lt;br /&gt;            os.environ['hello'] = 'standalone'&lt;br /&gt;            from wsgiref.simple_server import make_server&lt;br /&gt;            make_server('', port, wsgi_app).serve_forever()&lt;br /&gt;&lt;br /&gt;        elif 'CGI' in os.environ.get('GATEWAY_INTERFACE',''):&lt;br /&gt;            os.environ['hello'] = 'cgi'&lt;br /&gt;            import wsgiref.handlers&lt;br /&gt;            wsgiref.handlers.CGIHandler().run(wsgi_app)&lt;br /&gt;        else:&lt;br /&gt;            return EMiddle(wsgi_app)&lt;br /&gt;    return wrapper&lt;br /&gt;&lt;br /&gt;@wrapplication(3000)&lt;br /&gt;def application(environ, start_response):&lt;br /&gt;    start_response("200 OK", [('Content-Type', 'text/plain')])&lt;br /&gt;    yield "How do you like the teaches of peaches?\n"&lt;br /&gt;    yield "from " + environ['hello']&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;hr/&gt;&lt;br /&gt;ORIGINAL VERSION:&lt;br /&gt;&lt;br /&gt;If you write a script with the &lt;i&gt;application&lt;/i&gt; entry point that fits the &lt;a href="http://wsgi.org/wsgi"&gt;wsgi&lt;/a&gt; spec, it's simple to make it run via mod_wsgi, cgi, or via standalone server depending on the context. 
I believe this is common knowledge, but for my own reference here's an example with the extra setup (which will work for any script) to do this:&lt;br /&gt;&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;#!/usr/bin/python&lt;br /&gt;&lt;br /&gt;def application(environ, start_response):&lt;br /&gt;    start_response("200 OK", [('Content-Type', 'text/plain')])&lt;br /&gt;    yield environ['QUERY_STRING']&lt;br /&gt;&lt;br /&gt;if __name__ == "__main__":&lt;br /&gt;    try:&lt;br /&gt;        from wsgiref.simple_server import make_server&lt;br /&gt;        import sys&lt;br /&gt;        port = int(sys.argv[1])&lt;br /&gt;        print "server on port: %i" % port&lt;br /&gt;        make_server('', port, application).serve_forever()&lt;br /&gt;    except Exception, e:&lt;br /&gt;&lt;br /&gt;        import wsgiref.handlers&lt;br /&gt;        wsgiref.handlers.CGIHandler().run(application)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;To change between wsgi and cgi toggle between &lt;br /&gt;&lt;blockquote&gt;AddHandler cgi-script .py&lt;br /&gt;AddHandler wsgi-script .py&lt;/blockquote&gt;&lt;br /&gt;For the stand alone server, just run it with an argument indicating the port:&lt;br /&gt;&lt;blockquote&gt;python app.py 3000&lt;/blockquote&gt;&lt;br /&gt;to use the &lt;a href="http://www.cherrypy.org/"&gt;cherrypy&lt;/a&gt; server instead of the wsgiref, replace the make_server() line with:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;from cherrypy import wsgiserver&lt;br /&gt;server = wsgiserver.CherryPyWSGIServer(('0.0.0.0', port), [('/', application)], server_name='')&lt;br /&gt;try:&lt;br /&gt;    server.start()&lt;br /&gt;except KeyboardInterrupt:&lt;br /&gt;    server.stop()&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;That also handles the server shutdown more politely.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-3863776097725369337?l=hackmap.blogspot.com' alt='' 
/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/3863776097725369337/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=3863776097725369337' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/3863776097725369337'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/3863776097725369337'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/04/python-script-as-wsgi-cgi-or-standalone.html' title='python script as wsgi, cgi, or standalone'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-5335203423476100607</id><published>2008-04-07T19:14:00.000-07:00</published><updated>2008-04-08T08:27:08.489-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='genedex'/><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><title type='text'>Genedex: query genomic features and sequence</title><content type='html'>Normally, I don't write libraries, I figure smarter people than I should do such things, and I should just use them. But, I got tired enough of writing one-off scripts for genomic feature manipulation-- find the upstream, downstream neighbors and get the sequence -- and I saw enough of the pieces coming together that I decided to build it. 
I'd &lt;a href="http://hackmap.blogspot.com/2008/03/rtree-know-your-nearest-neigbhors.html"&gt;complained before&lt;/a&gt; about how &lt;a href="http://pypi.python.org/pypi/Rtree"&gt;rtree&lt;/a&gt; didn't support 1D indices. Not only is this not a problem, it's beneficial. Genomic features should have strand information, so that's the 2nd dimension. Then rtree does containment queries, so it's simple to find only the features on a given strand. I realized this about the same time that the docstring for numpy's memmap went from 0 lines to about 100, &lt;i&gt;and&lt;/i&gt; it was enhanced to &lt;a href="http://projects.scipy.org/scipy/numpy/changeset/4856"&gt;take a filehandle&lt;/a&gt;, not just a filename. This means you can send in a start position (the offset argument) and a shape to the numpy.memmap constructor and it can create a numpy array of only that chunk. This means that &lt;b&gt;it's possible to slice an unaltered fasta file using the numpy array syntax&lt;/b&gt;. That's very good. &lt;br /&gt;&lt;br /&gt;So, if you put those 2 simple things together, you have the start of something powerful. That's what I did. Then I gave it a crappy name: Genedex (Gendex was taken) and slapped it into googlecode. Check it out: &lt;a href="http://code.google.com/p/genedex/"&gt;http://code.google.com/p/genedex/&lt;/a&gt;. My only design goal was to keep it as simple as possible. If the number of features is underwhelming, that's good.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;TDD&lt;/h4&gt;&lt;br /&gt;Also, I generally do &lt;a href="http://en.wikipedia.org/wiki/Test-driven_development"&gt;TDD&lt;/a&gt; very half-ass, with asserts and maybe a couple &lt;a href="http://docs.python.org/lib/module-doctest.html"&gt;doctests&lt;/a&gt;. 
However, I recently made fairly substantial changes to the SQLite datasource in &lt;a href="http://featureserver.org/"&gt;featureserver&lt;/a&gt;, and wrote &lt;a href="http://svn.featureserver.org/trunk/featureserver/doc/API.txt"&gt;this set of doctests&lt;/a&gt; while doing so. It works, and I've been using it. So, I did what the featureserver devs (presumably crschmidt) did and &lt;a href="http://svn.featureserver.org/trunk/featureserver/tests/tests.py"&gt;copied&lt;/a&gt; the &lt;a href="http://trac.gispython.org/projects/PCL/browser/Shapely/trunk/tests/test_doctests.py"&gt;setup for the shapely doctests&lt;/a&gt;. It's pretty useful for design: I'd just write out the code for how I wanted the API to look and then implement. The only thing is, for doctests, the way they're used (at least by me) is to copy the output from executing the code into the doctest. So, if your code is wrong to start with, you just copy the wrong answer into the doctest and it's broken but the tests pass. But, at least it's good for regressions, and I just had to remember not to blindly trust the output. That's true for all testing, but especially so for doctests.&lt;br /&gt;&lt;br /&gt;So, there are now more tests than code. But, since it's mostly just tying together pieces that do the real work, it's not much code. Doc-tests are also nice because (as the name suggests) they double as documentation. So, here's the genedex documentation:&lt;br /&gt;&lt;a href="http://genedex.googlecode.com/svn/trunk/doc/readme.html"&gt;http://genedex.googlecode.com/svn/trunk/doc/readme.html&lt;/a&gt;&lt;br /&gt;It's pretty! It gets colored by &lt;a href="http://pygments.org/"&gt;pygments&lt;/a&gt;, using &lt;a href="http://code.google.com/p/genedex/source/browse/trunk/doc/rst-directive.py"&gt;this script&lt;/a&gt;. The only major thing I'd like to add to the library is a plotting class using &lt;a href="http://matplotlib.sourceforge.net/"&gt;matplotlib&lt;/a&gt;. 
Then there are other, smaller tasks, like a method that takes 2 features and returns the sequence between them. &lt;br /&gt;Any fixes, enhancements, ridicule, etc. will be greeted with commit access.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-5335203423476100607?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/5335203423476100607/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=5335203423476100607' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/5335203423476100607'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/5335203423476100607'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/04/genedex-query-genomic-features-and.html' title='Genedex: query genomic features and sequence'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-5963372728799445582</id><published>2008-04-06T21:37:00.000-07:00</published><updated>2008-12-10T12:37:13.256-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gis'/><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><title type='text'>comparative genomics with openlayers</title><content type='html'>Traditional genome browsers look like &lt;a href="http://flybase.org/cgi-bin/gbrowse/dmel/"&gt;this&lt;/a&gt;. In fact, I think that's the most popular genome-browser used--gbrowse. 
They display information in tracks, so any layer of annotation just gets added on to the bottom of the image (after making the image taller). This doesn't work for genome-browser, the hack of openlayers to support only horizontal scrolling, because if you have 2 adjacent tiles and one has more features than the next, there's no guarantee that they'll be the same height, and no guarantee that a feature that's on both images will align correctly.&lt;br /&gt;&lt;br /&gt;I was just hacking around, trying to test some work I'd done, and realized that you can have annotation layers with OpenLayers: just add another map and tie them together!&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_uU_kLC5AdTc/R_mjw_qJOrI/AAAAAAAAAXA/LbFy519QJc4/s1600-h/stack.png"&gt;&lt;img style="cursor:pointer; cursor:hand;" src="http://4.bp.blogspot.com/_uU_kLC5AdTc/R_mjw_qJOrI/AAAAAAAAAXA/LbFy519QJc4/s400/stack.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5186356508011084466" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;So that's 2 OpenLayers.Map() instances. What makes this easy is the new Map.panTo() method in OpenLayers 2.6 (which is in release candidate 1). So, the top map registers for 'move' and 'zoomend' events with callbacks that update the bottom map with the position/zoom of the top map.&lt;br /&gt;That's it! And layers of annotation are available, along with the slippy map. &lt;a href="http://openlayers.org/"&gt;OpenLayers&lt;/a&gt; continues to amaze. 
&lt;br /&gt;That site with the linked maps is &lt;a href="http://128.32.8.100/genome-browser/examples/"&gt;here&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-5963372728799445582?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/5963372728799445582/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=5963372728799445582' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/5963372728799445582'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/5963372728799445582'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/04/comparative-genomics-with-openlayers.html' title='comparative genomics with openlayers'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_uU_kLC5AdTc/R_mjw_qJOrI/AAAAAAAAAXA/LbFy519QJc4/s72-c/stack.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-553084170753566239</id><published>2008-03-30T21:39:00.000-07:00</published><updated>2008-03-30T09:41:56.078-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gis'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><title type='text'>featureserver authentication</title><content type='html'>As the name implies, &lt;a href="http://featureserver.org/"&gt;featureserver&lt;/a&gt; serves vector 
features to &lt;a href="http://svn.featureserver.org/trunk/featureserver/FeatureServer/Service/"&gt;various formats&lt;/a&gt; from a number of datasources, including &lt;a href="http://www.gdal.org/"&gt;OGR&lt;/a&gt; -- which means pretty much any vector format. That's extremely powerful. Really. That means, for instance, that when you're working on a really cool project and all anyone wants to know is if they can see it in KML/Google Earth, it's no extra work. Just point them to the REST-ful url like "http://example.com/featureserver/all.kml", and continue working on the cool project. Likewise for all.gml, .atom, etc. And, if you have a project with spatial data, if you put it in a format that featureserver understands, it's displayable, and editable in openlayers.&lt;br /&gt;&lt;br /&gt;The next thing people want in a web application is some sort of user restrictions. In featureserver, by default, anyone can do any of the &lt;a href="http://en.wikipedia.org/wiki/Create,_read,_update_and_delete"&gt;CRUD&lt;/a&gt; operations on any feature. I've been playing with a soon-to-be-open-sourced &lt;a href="http://en.wikipedia.org/wiki/Public_Participation_GIS"&gt;PPGIS&lt;/a&gt; (apparently the trendy acronym for that is now &lt;a href="http://www.spatiallyadjusted.com/2008/03/13/vgi-meh/"&gt;VGI&lt;/a&gt;) project where people can go to report &lt;a href="http://nature.berkeley.edu/comtf/"&gt;sudden oak death&lt;/a&gt;. I want anyone to be able to report and view cases and add notes about existing cases, but only admins to be able to edit and delete existing reported sudden oak death cases. The simplest way is to use basic authentication in apache, but then anyone who goes to the site has to enter a user/password, and I think that &lt;b&gt;really&lt;/b&gt; limits the public participation bit. 
If there's a way to authenticate only some requests with apache authentication, let me know.&lt;br /&gt;&lt;h3&gt;unrested development&lt;/h3&gt;Since it's possible to &lt;a href="http://svn.featureserver.org/trunk/featureserver/doc/API.txt"&gt;use featureserver as an API&lt;/a&gt;, you can make your own server and add authentication in python. You can do this with any framework that supports sessions; I've done it with the development version (0.3) of &lt;a href="http://webpy.org/"&gt;web.py&lt;/a&gt;, using the nice &lt;a href="http://webpy.org/cookbook/sessions"&gt;sessions&lt;/a&gt; support. That it supports intuitive syntax for GET, PUT, DELETE, POST makes it a good fit as well. In the code below, the authentication related urls are /login and /logout, but those are never needed unless updating or deleting an existing point. Anyone can create new features. All featureserver related requests are made as with the original. Here's the wsgi script:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;#!/usr/bin/python&lt;br /&gt;import web&lt;br /&gt;from FeatureServer.Server import Server&lt;br /&gt;from FeatureServer.DataSource.SQLite import SQLite&lt;br /&gt;&lt;br /&gt;urls = ( '/logout',      'logout'&lt;br /&gt;        ,'/login',       'login'&lt;br /&gt;        ,'/(.*)',        'features')&lt;br /&gt;&lt;br /&gt;app = web.application(urls, globals())&lt;br /&gt;session = web.session.Session(app&lt;br /&gt;                , web.session.DiskStore('/tmp/sessions')&lt;br /&gt;                , initializer={'authorized': False})&lt;br /&gt;&lt;br /&gt;datasource    = SQLite('fsauth', file="/tmp/fsauth.sqlite")&lt;br /&gt;featureserver = Server({'fsauth': datasource })&lt;br /&gt;&lt;br /&gt;application = app.wsgifunc()&lt;br /&gt;&lt;br /&gt;class login(object):&lt;br /&gt;    """for a real app, save usernames, hashed pws in the db"""&lt;br /&gt;    def POST(self):&lt;br /&gt;        pw = web.input(password=None).password&lt;br /&gt;        user = 
web.input(user=None).user&lt;br /&gt;        if (user == 'abc' and pw == '123'):&lt;br /&gt;            session.authorized = True&lt;br /&gt;            return '[authorized]'&lt;br /&gt;        return '[NOT-authorized]'&lt;br /&gt;&lt;br /&gt;class logout(object):&lt;br /&gt;    def GET(self): session.kill()&lt;br /&gt;&lt;br /&gt;class features(object):&lt;br /&gt;    """all the featureserver routing"""&lt;br /&gt;    path = "/" + datasource.name + "/" # fsauth&lt;br /&gt;    format = "geojson"&lt;br /&gt;    def GET(self, feature_id=''):&lt;br /&gt;        if "." in feature_id:&lt;br /&gt;            feature_id, self.format = feature_id.split(".")&lt;br /&gt;&lt;br /&gt;        # get web.py parsed url&lt;br /&gt;        path = self.path + feature_id&lt;br /&gt;        data = dict(web.input().items())&lt;br /&gt;        data['format'] = self.format&lt;br /&gt;&lt;br /&gt;        format, rsp = featureserver.dispatchRequest(data, path, "", request_method="GET")&lt;br /&gt;        web.header('Content-type', format)&lt;br /&gt;        return rsp&lt;br /&gt;&lt;br /&gt;    def PUT(self, feature_id=None):&lt;br /&gt;        return self.POST(feature_id, "PUT")&lt;br /&gt;&lt;br /&gt;    def DELETE(self, feature_id=None):&lt;br /&gt;        if "." 
in feature_id:&lt;br /&gt;            feature_id, self.format = feature_id.split(".")&lt;br /&gt;        # cant delete unless authorized.&lt;br /&gt;        if not session.authorized: &lt;br /&gt;            web.header('Content-type', "text/plain")&lt;br /&gt;            return "not logged in"&lt;br /&gt;        path = self.path + feature_id&lt;br /&gt;        data = dict(web.input().items())&lt;br /&gt;        data['format'] = self.format&lt;br /&gt;        format, rsp = featureserver.dispatchRequest(data, path, "", request_method="DELETE")&lt;br /&gt;        web.header('Content-type', format)&lt;br /&gt;        return rsp&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;    def POST(self, feature_id=None, method="POST"):&lt;br /&gt;        if feature_id is None: return []&lt;br /&gt;        if "." in feature_id:&lt;br /&gt;            feature_id, self.format = feature_id.split(".")&lt;br /&gt;        # must be an admin to do something with an existing feature.&lt;br /&gt;        if not session.authorized:&lt;br /&gt;            if not feature_id in ('new', 'create'):&lt;br /&gt;                return 'not logged in'&lt;br /&gt;        e = web.ctx.environ&lt;br /&gt;        post_data = e['wsgi.input'].read(int(e['CONTENT_LENGTH']))&lt;br /&gt;        path = self.path + feature_id&lt;br /&gt;        format, rsp = featureserver.dispatchRequest({'format':self.format}, path, "", post_data=post_data, request_method=method)&lt;br /&gt;        web.header('Content-type', format)&lt;br /&gt;        return rsp&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;That'll be called from OpenLayers, but to demo, since &lt;a href="http://zcologia.com/news/430/feature-demo/"&gt;everyone&lt;/a&gt; &lt;a href="http://featureserver.org/"&gt;else&lt;/a&gt; is using curl which supports cookies:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;FS="http://localhost/fsauth/"&lt;br /&gt;&lt;br /&gt;echo "\n\nfirst borrow a feature. 
"&lt;br /&gt;curl --url "http://featureserver.org/featureserver.cgi/scribble/35.geojson" &gt; 35.geojson&lt;br /&gt;&lt;br /&gt;echo "\nsee the features ... "&lt;br /&gt;curl --url $FS&lt;br /&gt;&lt;br /&gt;echo "\n\nadd a feature (no auth required.)"&lt;br /&gt;curl -d @35.geojson --url "$FS/create.geojson"&lt;br /&gt;&lt;br /&gt;echo "\n\nsee the features ... "&lt;br /&gt;curl --url $FS&lt;br /&gt;&lt;br /&gt;echo "\n\ntry to delete ... but cant"&lt;br /&gt;curl -s -X DELETE $FS/1&lt;br /&gt;&lt;br /&gt;echo "\n\nlogin ... \n"&lt;br /&gt;curl -s --cookie-jar "cookies.txt" -d "password=123&amp;user=abc" --url $FS/login &gt; /dev/null&lt;br /&gt;&lt;br /&gt;echo "\n\nthen delete ... "&lt;br /&gt;curl -s -X DELETE -b "cookies.txt" $FS/1 &gt; /dev/null&lt;br /&gt;&lt;br /&gt;echo "\n\nsee the empty features ... \n"&lt;br /&gt;curl --url $FS&lt;br /&gt;&lt;br /&gt;echo "\n\nreturn the borrowed feature. thanks.  :-) "&lt;br /&gt;curl -s -X PUT -d @35.geojson --url "http://featureserver.org/featureserver.cgi/scribble/35.geojson"&lt;/pre&gt;&lt;br /&gt;the important point there being that the DELETE fails until the user is logged in. I'm pretty sure adding authentication makes it un-REST-ful. but &lt;a href="http://www.google.com/search?q=restful+authentication"&gt;???&lt;/a&gt;. 
Anyway, I won't lose any sleep over it.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-553084170753566239?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/553084170753566239/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=553084170753566239' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/553084170753566239'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/553084170753566239'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/03/featureserver-authentication.html' title='featureserver authentication'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-9147049073360160647</id><published>2008-03-25T22:25:00.000-07:00</published><updated>2008-03-25T22:28:43.976-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><title type='text'>reading code</title><content type='html'>A fasta file of rice genomic sequence is 355MB. It's not easy to understand how large that is. This is an attempt to come up with a quick metric.&lt;br /&gt;So, I &lt;a href="http://www.gutenberg.org/dirs/etext03/ulyss12.txt"&gt;downloaded&lt;/a&gt; &lt;a href="http://en.wikipedia.org/wiki/Ulysses_%28poem%29"&gt;Ulysses&lt;/a&gt;.&lt;br /&gt;wc shows it to have 267235 words. Some googling says the average person can read 250 words per minute. 
So that's 267,235 / 250 / 60 = 17.8 hours. Well, it's hard to believe anyone can really read Ulysses in 18 hours but... good enough.&lt;br /&gt;So on the rice fasta file I ran:&lt;pre&gt;grep -v "&gt;" rice.fasta | wc -c&lt;/pre&gt;to get rid of the 12 header lines (1 per chromosome) and only count sequence (should be within 12 characters counting the extra new-lines). That gives 372,077,765 characters. The average word-size in Ulysses is 5. I rounded up to 6. So, the rice sequence has the equivalent of 372,077,765 / 6 = 62,012,960 words.&lt;br /&gt;So, at 250 words per minute, it'd take:&lt;br /&gt;62012960 / 250 / 60 = &lt;b&gt;4,134 hours to read the rice genome&lt;/b&gt;. That's 172 days. Also, from what I know, the plot is hard to follow.&lt;br /&gt;Genome size varies widely among plants. I have a couple ideas for pointless visualizations of this...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-9147049073360160647?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/9147049073360160647/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=9147049073360160647' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/9147049073360160647'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/9147049073360160647'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/03/reading-code.html' title='reading code'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-7136725143809181735</id><published>2008-03-24T19:12:00.000-07:00</published><updated>2008-12-10T12:37:13.472-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gis'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><title type='text'>point partitioning</title><content type='html'>After you spend all day banging your head against your own problems, sometimes it's just nice to bang it on something else for a bit.&lt;br /&gt;&lt;a href="http://postgis.refractions.net/pipermail/postgis-users/2008-March/018937.html"&gt;This&lt;/a&gt;  question came through on the postgis mailing list and it seemed like a good diversion. I think it's a very clear description of the problem. To quote:&lt;br /&gt;&lt;blockquote&gt;I have around 300,000 points, each with a lat, lon and altitude (also converted to geometry). I need to get a subset of those points, where none of them are within 5m (or some other arbitrary distance) of each other. It doesnt matter which points get picked over another, as long as whatever the data set it creates, that none of the points are within that 5m radius and that relatively most of the points are used&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;So, I hacked up a quick solution. It's probably inefficient -- deleting keys from a dict, and removing entries from an &lt;a href="http://trac.gispython.org/projects/PCL/wiki/Rtree"&gt;rtree&lt;/a&gt; index. But, it's easy to understand, and (without the plotting) it runs in about 2 minutes for the requested 300000 points. 
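&lt;br /&gt;&lt;br /&gt;For anyone without rtree handy, here's a rough sketch of the same greedy idea in plain Python 3. It is not the rtree code from this post: it swaps the rtree index for a dict-based grid (a stand-in I'm assuming is good enough for a demo), walks the points in order instead of picking them at random, and does the actual distance check rather than only the bounding box. thin_points and its arguments are names made up for the example.&lt;br /&gt;&lt;pre class="prettyprint"&gt;
```python
import math
import random


def thin_points(points, dist):
    """Greedy thinning: keep a point only if no kept point lies within dist."""
    cell = dist  # grid cell size; anything closer than dist is in the 3x3 block
    grid = {}    # (col, row) -> list of kept points falling in that cell
    kept = []
    for x, y in points:
        col = int(math.floor(x / cell))
        row = int(math.floor(y / cell))
        # only kept points in the surrounding 3x3 cells can be closer than dist
        too_close = any(
            dist > math.hypot(x - kx, y - ky)
            for dc in (-1, 0, 1)
            for dr in (-1, 0, 1)
            for kx, ky in grid.get((col + dc, row + dr), ())
        )
        if not too_close:
            grid.setdefault((col, row), []).append((x, y))
            kept.append((x, y))
    return kept


# same toy setup as the post: 3000 random points, dist of 500
random.seed(0)
pts = [(random.random() * 10000, random.random() * 10000) for _ in range(3000)]
kept = thin_points(pts, 500)
```
&lt;/pre&gt;&lt;br /&gt;With cells the same size as dist, anything closer than dist has to sit in the same cell or one of the 8 around it, so each point only gets compared to a handful of kept points. The rtree version is still the one behind the plot and the code that follow.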
&lt;br /&gt;When plotting, the image looks like this:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_uU_kLC5AdTc/R-hdV_qJOqI/AAAAAAAAAWg/Ek8OgCOegEU/s1600-h/points_d.png"&gt;&lt;img style="cursor:pointer; cursor:hand;" src="http://3.bp.blogspot.com/_uU_kLC5AdTc/R-hdV_qJOqI/AAAAAAAAAWg/Ek8OgCOegEU/s400/points_d.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5181494003736591010" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;and the code...&lt;br /&gt;&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;import rtree&lt;br /&gt;import random&lt;br /&gt;import pylab&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;dist = 500&lt;br /&gt;index = rtree.Rtree()&lt;br /&gt;points = {}  # xy coords. use dict to delete without changing keys.&lt;br /&gt;groups = {}  # store the grouped points. all keys are &gt; dist apart&lt;br /&gt;&lt;br /&gt;# create some random points and put them in an index.&lt;br /&gt;for i in range(3000):&lt;br /&gt;    x = random.random() * 10000&lt;br /&gt;    y = random.random() * 10000&lt;br /&gt;    pt = (x, y)&lt;br /&gt;    points[i] =  pt&lt;br /&gt;    index.add(i, pt)&lt;br /&gt;&lt;br /&gt;print "index created..."&lt;br /&gt;&lt;br /&gt;while len(points.values()):&lt;br /&gt;    pt = random.choice(points.values())&lt;br /&gt;    print pt&lt;br /&gt;    bbox = (pt[0] - dist, pt[1] - dist, pt[0] + dist, pt[1] + dist)&lt;br /&gt;&lt;br /&gt;    idxs = index.intersection(bbox)&lt;br /&gt;    # add actual distance here, to get those within dist.&lt;br /&gt;&lt;br /&gt;    groups[pt] = []&lt;br /&gt;    for idx in sorted(idxs, reverse=True):&lt;br /&gt;        delpt = points[idx]&lt;br /&gt;        groups[pt].append(delpt)&lt;br /&gt;        index.delete(idx, delpt)&lt;br /&gt;        del points[idx]&lt;br /&gt;&lt;br /&gt;# groups contains keys where no key is within dist of any other pt&lt;br /&gt;# the values for a given key are all points with dist of that point.&lt;br /&gt;&lt;br /&gt;for pt, 
subpts in groups.iteritems():&lt;br /&gt;    subpts = pylab.array(subpts)&lt;br /&gt;    pylab.plot(subpts[:,0], subpts[:,1], 'k.')&lt;br /&gt;    pylab.plot([pt[0]], [pt[1]], 'ro')&lt;br /&gt;&lt;br /&gt;pylab.show()&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-7136725143809181735?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/7136725143809181735/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=7136725143809181735' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/7136725143809181735'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/7136725143809181735'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/03/point-partitioning.html' title='point partitioning'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_uU_kLC5AdTc/R-hdV_qJOqI/AAAAAAAAAWg/Ek8OgCOegEU/s72-c/points_d.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-4051527622277639172</id><published>2008-03-18T18:22:00.000-07:00</published><updated>2008-03-18T18:34:46.308-07:00</updated><title type='text'>OGR python projection</title><content type='html'>&lt;h1&gt;&lt;a href="http://www.gdal.org/ogr/"&gt;OGR&lt;/a&gt; Projection&lt;/h1&gt;&lt;br /&gt;&lt;br /&gt;If you're using &lt;a 
href="http://trac.gispython.org/projects/PCL/wiki/Shapely"&gt;shapely&lt;/a&gt; and you need to do projections, you'll either have a lot of boilerplate or a function like this one. Actually, even in OGR, there's a lot of boilerplate involved in transforming...&lt;br /&gt; &lt;br /&gt;&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;from osgeo import ogr&lt;br /&gt;from shapely.wkb import loads&lt;br /&gt;&lt;br /&gt;def project(geom, to_epsg=900913, from_epsg=4326):&lt;br /&gt;    """utility function to do quick projection with ogr,&lt;br /&gt;    to and from shapely objects&lt;br /&gt;    &gt;&gt;&gt; from shapely.geometry import LineString&lt;br /&gt;    &gt;&gt;&gt; l = LineString([[-121, 43], [-122, 42]])&lt;br /&gt;    &gt;&gt;&gt; lp = project(l, from_epsg=4326, to_epsg=26910)&lt;br /&gt;    &gt;&gt;&gt; lp.wkt&lt;br /&gt;    'LINESTRING (663019.0700828594854102 4762755.6415722491219640, 582818.0692490270594135 4650259.8474613213911653)'&lt;br /&gt;    """&lt;br /&gt;&lt;br /&gt;    to_srs = ogr.osr.SpatialReference()&lt;br /&gt;    to_srs.ImportFromEPSG(to_epsg)&lt;br /&gt;&lt;br /&gt;    from_srs = ogr.osr.SpatialReference()&lt;br /&gt;    from_srs.ImportFromEPSG(from_epsg)&lt;br /&gt;&lt;br /&gt;    ogr_geom = ogr.CreateGeometryFromWkb(geom.wkb)&lt;br /&gt;    ogr_geom.AssignSpatialReference(from_srs)&lt;br /&gt;&lt;br /&gt;    ogr_geom.TransformTo(to_srs)&lt;br /&gt;    return loads(ogr_geom.ExportToWkb())&lt;br /&gt;&lt;br /&gt;import doctest&lt;br /&gt;doctest.testmod()&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-4051527622277639172?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/4051527622277639172/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=4051527622277639172' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/4051527622277639172'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/4051527622277639172'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/03/ogr-python-projection.html' title='OGR python projection'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-6988725553885130756</id><published>2008-03-16T11:07:00.000-07:00</published><updated>2008-12-10T12:37:14.366-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ecology'/><category scheme='http://www.blogger.com/atom/ns#' term='vis'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='spatial'/><title type='text'>spatially explicit metapopulation models in scipy.</title><content type='html'>&lt;h2&gt;Making Pretty Pictures&lt;/h2&gt;&lt;br /&gt;I started to learn to program about 5 years ago running population ecology models in &lt;a href="http://www.wolfram.com/"&gt;mathematica&lt;/a&gt;. Yesterday, I found an old mma notebook with a model modified to include &lt;a href="http://scholar.google.com/scholar?hl=en&amp;q=Host-parasitoid+metapopulations%3A+the+consequences+of+parasitoid+aggregation+on+spatial+dynamics+and+searching+efficiency&amp;btnG=Search"&gt;differential parasitoid dispersal&lt;/a&gt; to adjacent host cells depending on the host density in those cells. 
It's basically a discrete-time &lt;a href="http://www.ento.vt.edu/~sharov/PopEcol/lec10/paras.html"&gt;Nicholson-Bailey model&lt;/a&gt;, but in a grid of cells, where each cell contains a population of hosts (H) and parasitoids (P) that give birth, die, eat, and get eaten according to the NB model. Each generation, following birth/reproduction/predation, the hosts and parasitoids disperse. The hosts disperse equally to the 8 surrounding cells in their neighborhood. The parasitoids can move irrespective of host densities when the aggregation parameter (eta) is 0. When aggregation is 1 (eta == 1), the parasitoids move to adjacent cells in exact proportion to host densities in each of the surrounding cells. muH and muP (I'm too lazy to figure out how to write the symbols for mu) determine the proportion of individuals in each population that disperse. I no longer have mathematica, but I downloaded all 96 megabytes of the reader to open the model:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_uU_kLC5AdTc/R91Y7V9L6_I/AAAAAAAAAWA/EdiIB0oCNwA/s1600-h/model.png"&gt;&lt;img style="cursor:pointer; cursor:hand;" src="http://4.bp.blogspot.com/_uU_kLC5AdTc/R91Y7V9L6_I/AAAAAAAAAWA/EdiIB0oCNwA/s400/model.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5178392923075242994" /&gt;&lt;/a&gt;&lt;br /&gt;I'm no longer a big fan of mathematica, but, it does make it easy to use funky characters and show equations nicely.&lt;br /&gt;There, the ListCorrelate does the dispersal. It's a convolution to "smear" out populations according to the kernels, which are defined by the muH, muP and for the parasitoid, eta.&lt;br /&gt;It took literally 10 minutes to translate that to &lt;a href="http://scipy.org/"&gt;numpy/scipy&lt;/a&gt;. 
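&lt;br /&gt;To make the "smear" concrete, here's a small sketch of just the host dispersal step in plain Python 3 (no numpy, so it is not the translation itself). It assumes a square kernel and wrap-around edges; dispersal_kernel and disperse are made-up names for the example, and mu plays the role of muH.&lt;br /&gt;&lt;pre class="prettyprint"&gt;
```python
def dispersal_kernel(mu, rng=1):
    """Square kernel: a fraction mu leaves, split evenly among the neighbors."""
    n = 2 * rng + 1
    share = mu / (n * n - 1.0)
    kern = [[share] * n for _ in range(n)]
    kern[rng][rng] = 1.0 - mu  # center cell: the fraction that stays put
    return kern


def disperse(grid, kern):
    """Spread each cell's population over its neighborhood, wrapping at edges."""
    rows, cols = len(grid), len(grid[0])
    rng = len(kern) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dr in range(-rng, rng + 1):
                for dc in range(-rng, rng + 1):
                    weight = kern[dr + rng][dc + rng]
                    out[(r + dr) % rows][(c + dc) % cols] += weight * grid[r][c]
    return out


# 100 hosts in the middle of an otherwise empty 5x5 grid
hosts = [[0.0] * 5 for _ in range(5)]
hosts[2][2] = 100.0
after = disperse(hosts, dispersal_kernel(0.2))
```
&lt;/pre&gt;&lt;br /&gt;Starting from 100 hosts in the middle cell with mu = 0.2, about 80 stay put and about 2.5 land in each of the 8 neighbors, so the total is unchanged. The numpy/scipy translation does this step with convolve2d.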
&lt;br /&gt;I'll paste that code at the end.&lt;br /&gt;The cool thing is that you can see &lt;a href="http://www.nature.com/nature/journal/v353/n6341/abs/353255a0.html"&gt;complex spatial patterns&lt;/a&gt; arising within the gridded metapopulation, even when the sum of the densities across cells is constant. Running the model gives the time series (top) and the &lt;a href="http://en.wikipedia.org/wiki/Phase_portrait"&gt;phase portrait&lt;/a&gt; (bottom) shown here.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_uU_kLC5AdTc/R91bgV9L7AI/AAAAAAAAAWI/XRASI3U7_6w/s1600-h/series.png"&gt;&lt;img style="cursor:pointer; cursor:hand;" src="http://4.bp.blogspot.com/_uU_kLC5AdTc/R91bgV9L7AI/AAAAAAAAAWI/XRASI3U7_6w/s400/series.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5178395757753658370" /&gt;&lt;/a&gt;&lt;br /&gt;The time-series is the average density of hosts and parasitoids across the metapopulation. It takes a while to stabilize, but they oscillate down to an equilibrium. The phase portrait is just plotting the mean density of hosts vs parasitoids for each generation. The generations are discrete, so the lines are merely to view the trajectory. It's hard to see from the scale of the time series, but the parasitoid cycles lag those of the host by about a generation. &lt;br /&gt;But the model is running in a spatially explicit manner, and averaging loses all of that spatial data. So, starting at 200 generations, the program then saves a snapshot of the grid, with higher host densities in white, and lower values in black. This is done every third generation. After the model run, there's a directory with a bunch of images. It'd be nice to play them as a movie... 
&lt;a href="http://www.imagemagick.org/"&gt;Imagemagick's&lt;/a&gt; convert is perfect for this:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;convert -delay 40 -loop 0 images/*.png metapop.gif&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;That command creates a nice little (4MB) animated gif:&lt;br /&gt;&lt;img style="cursor:pointer; cursor:hand;" src="http://128.32.8.100/SOD/metapop.gif" border="0" alt="metapop.gif" id="metapop.gif" /&gt;&lt;br /&gt;That shows the spatiotemporal dynamics of the host population.&lt;br /&gt;That's a pretty simple way to incorporate time into a visualization when you're otherwise limited to 2 dimensions... Maybe I'm too easily amused, but I could watch that all day. &lt;br /&gt;Anyway, here's the hastily translated code:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;import numpy&lt;br /&gt;import pylab&lt;br /&gt;from scipy.signal import convolve2d&lt;br /&gt;&lt;br /&gt;def doplot(hm, pm):&lt;br /&gt;    pylab.plot(range(gens), hm, 'b')&lt;br /&gt;    pylab.plot(range(gens), pm, 'g')&lt;br /&gt;    pylab.ylim(0, pm.mean() + hm.mean())&lt;br /&gt;    pylab.legend(('hosts', 'parasitoids'))&lt;br /&gt;    pylab.subplot(212)&lt;br /&gt;    pylab.plot(hm, pm, 'r')&lt;br /&gt;    pylab.plot(hm, pm, 'k.')&lt;br /&gt;    pylab.xlim(0, 1.3 * numpy.mean(hm) + 5)&lt;br /&gt;    pylab.ylim(0, 1.3 * numpy.mean(pm) + 5)&lt;br /&gt;    pylab.xlabel('hosts')&lt;br /&gt;    pylab.ylabel('parasitoids')&lt;br /&gt;    pylab.show()&lt;br /&gt;&lt;br /&gt;def metapop(H0, P0, a=0.05, l=3., K=1000, muH=0.2, muP=0.8&lt;br /&gt;              , Hrange=1, Prange=1, eta=3, size=32, gens=1000&lt;br /&gt;              , mode="same", boundary="wrap"):&lt;br /&gt;    E = numpy.e&lt;br /&gt;    Hmeta = []&lt;br /&gt;    Pmeta = []&lt;br /&gt;    Pkern = numpy.ones((2 * Prange + 1, 2 * Prange + 1)&lt;br /&gt;            , dtype=numpy.double)&lt;br /&gt;    Hkern = numpy.ones((2 * Hrange + 1, 2 * Hrange + 1)&lt;br /&gt;            , dtype=numpy.double)&lt;br /&gt;    
Hkern *= muH/((2 * Hrange + 1)**2 - 1.)&lt;br /&gt;    # center cell (0-based index): the fraction of hosts that stay put&lt;br /&gt;    Hkern[Hrange, Hrange] = 1.0 - muH&lt;br /&gt;&lt;br /&gt;    for gen in range(1, gens + 1):&lt;br /&gt;        Hmeta.append(H0.mean())&lt;br /&gt;        Pmeta.append(P0.mean())&lt;br /&gt;        &lt;br /&gt;        # poisson search.&lt;br /&gt;        f = E**(-a * P0)&lt;br /&gt;&lt;br /&gt;        # predation, births&lt;br /&gt;        H1 = l * H0 * f * numpy.exp(-numpy.log(l) * H0*f/K)&lt;br /&gt;        # parasitized hosts become next generation's parasitoids&lt;br /&gt;        P1 = H0 * (1 - f)&lt;br /&gt;&lt;br /&gt;        # simple movement between adjacent cells.&lt;br /&gt;        H0 = 0.001 + convolve2d(H1, Hkern, mode=mode, boundary=boundary)&lt;br /&gt;&lt;br /&gt;        # biased movement by parasitoid according to density of hosts in&lt;br /&gt;        # adjacent cells.&lt;br /&gt;        heta = H0**eta&lt;br /&gt;        B = heta / convolve2d(heta, Pkern, mode=mode&lt;br /&gt;               ,  boundary=boundary)&lt;br /&gt;&lt;br /&gt;        P0 = (1. - muP) * P1 + muP * B * convolve2d(P1, Pkern, mode=mode&lt;br /&gt;                , boundary=boundary)&lt;br /&gt;        P0 *= P1.sum()/P0.sum()&lt;br /&gt;        if gen &gt; 200 and not gen % 3:&lt;br /&gt;            pylab.figure(figsize=(4,4))&lt;br /&gt;            pylab.axes([0, 0, 1, 1])&lt;br /&gt;            pylab.imshow(H0, cmap=pylab.cm.gray)&lt;br /&gt;            pylab.xticks([])&lt;br /&gt;            pylab.yticks([])&lt;br /&gt;            pylab.savefig('images/%03i.png' % gen)&lt;br /&gt;            pylab.close()&lt;br /&gt;&lt;br /&gt;    return numpy.array(Hmeta), numpy.array(Pmeta)&lt;br /&gt;&lt;br /&gt;if __name__ == "__main__":&lt;br /&gt;    gens = 300&lt;br /&gt;    size = 64&lt;br /&gt;    H0 = 1000 * numpy.abs(numpy.random.randn(size * size).reshape(size, size))&lt;br /&gt;    P0 = numpy.zeros_like(H0)&lt;br /&gt;    P0[0, 0] = 1.&lt;br /&gt;    P0[32, 32] = 1.&lt;br /&gt;&lt;br /&gt;    pylab.close()&lt;br /&gt;    pylab.subplot(211)&lt;br /&gt;    hm, pm = metapop(H0, P0, gens=gens, a = 
0.02)&lt;br /&gt;    doplot(hm, pm)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The variable names are ugly, but it's a start for something to play with. It's interesting to see the effects of parasitoid aggregation (eta). And the parasitoid initialization--here, it introduces a single parasitoid at cells 0,0 and 32, 32. Should anyone actually use this, I have a derivation of the equations, and better explanation of the parameters available.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-6988725553885130756?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/6988725553885130756/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=6988725553885130756' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/6988725553885130756'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/6988725553885130756'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/03/spatially-explicit-metapopulation.html' title='spatially explicit metapopulation models in scipy.'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_uU_kLC5AdTc/R91Y7V9L6_I/AAAAAAAAAWA/EdiIB0oCNwA/s72-c/model.png' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-14263664841036458</id><published>2008-03-11T08:12:00.000-07:00</published><updated>2008-03-11T08:10:21.728-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gis'/><category scheme='http://www.blogger.com/atom/ns#' term='spatial'/><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><title type='text'>rtree: know your nearest neighbors</title><content type='html'>My computer spends a lot of time looking for neighbors of a given location-- even more so for bio, than for geo. This is what I've learned about the options for doing smarter search so far.&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;SELECT * from (&lt;br /&gt; (SELECT * FROM ((SELECT * FROM feature WHERE start&lt;= ? ORDER BY start DESC  LIMIT 1) UNION (SELECT * FROM feature where start&gt;= ?&lt;br /&gt;ORDER BY start LIMIT 1)) as u)&lt;br /&gt; UNION&lt;br /&gt; (SELECT * FROM ((SELECT * FROM feature where stop&lt;= ? ORDER BY stop   DESC  LIMIT 1) UNION (SELECT * FROM feature where stop&gt;= ?&lt;br /&gt;ORDER BY stop LIMIT 1)) as v)&lt;br /&gt;) as w&lt;br /&gt;ORDER BY ABS((start + stop)/2 - ?) LIMIT 1&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;If you fill in ? with an integer location, that query will return the closest feature most of the time. It's verbose if not ugly, and that's only for 1 dimension. It can return the wrong feature in certain cases.... You have to write it like that in MySQL, because it doesn't support &lt;a href="http://www.postgresql.org/docs/7.3/static/indexes-functional.html"&gt;functional indexes&lt;/a&gt;, so as soon as you do something like:&lt;br /&gt;ORDER BY ABS((start + stop)/2 - ?)&lt;br /&gt;it's no longer an indexed search.&lt;br /&gt;It's a hard problem, even if you're using &lt;a href="http://postgis.refractions.net/"&gt;postgis&lt;/a&gt;. 
And &lt;a href="http://www.bostongis.com/blog/index.php?/categories/7-nearest-neighbor"&gt;even if you're a postGIS badass&lt;/a&gt;. &lt;br /&gt;Other than postGIS, there are postgres's &lt;a href="http://www.postgresql.org/docs/8.0/interactive/functions-geometry.html"&gt;builtin geometric types&lt;/a&gt;. Even so, for most things, using SQL makes me feel far from the data.&lt;br /&gt;Generally, I pull my data into a numpy array and use slicing to get to a region of interest:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;feats = all_feats[(all_feats['bpmax'] &gt; bpmin) &amp;amp; (all_feats['bpmin'] &lt; bpmax)] &lt;/pre&gt;&lt;br /&gt;that's usually pretty fast. It can be even faster with &lt;a href="http://www.scipy.org/SciPyPackages/NumExpr"&gt;numexpr&lt;/a&gt;. I suppose it could be done using indexed searching in &lt;a href="http://www.pytables.org/"&gt;pytables&lt;/a&gt; pro. Still, doing a windowed search like that doesn't guarantee you get the nearest neighbor and you still have to pare it down and do some arithmetic to figure out the nearest gene. And, it can be the bottleneck.&lt;br /&gt;For bioinformatics libraries, there's &lt;a href="ftp://ftp.informatics.jax.org/pub/fjoin/README"&gt;fjoin&lt;/a&gt; which does indexing, but doesn't seem to be much of a library (plus the indentation is messed up in the file). There's also &lt;a href="http://bioinfo.mbi.ucla.edu/pygr"&gt;pygr&lt;/a&gt;, which, presumably would be perfect for this, if I could understand it. I keep looking at pygr occasionally, but it just hasn't sunk in yet.&lt;br /&gt;&lt;br /&gt;This &lt;a href="http://lin-ear-th-inking.blogspot.com/2008/03/branch-and-bound-algorithms-for-nearest.html"&gt; mention&lt;/a&gt; of rtree and nearest neighbor reminded me of &lt;a href="http://zcologia.com/news/595/plone-r-tree-spatial-index/"&gt;this&lt;/a&gt; (see Sean's comment). 
Originally, I thought between &lt;a href="http://trac.gispython.org/projects/PCL/wiki/Shapely"&gt;shapely&lt;/a&gt;, &lt;a href="http://trac.gispython.org/projects/PCL/wiki/Rtree"&gt;rtree&lt;/a&gt;, and &lt;a href="http://matplotlib.sourceforge.net/"&gt;pylab&lt;/a&gt;, you could have a pretty useful python-specific plot-table, spatial data structure. Rtree is a wrapper over &lt;a href="http://trac.gispython.org/projects/PrimaGIS/browser/SpatialIndex/trunk/README.txt"&gt;spatialindex&lt;/a&gt; which is pretty comprehensive, supporting Rtree, MVR tree, TP Rtree (&lt;a href="http://anandmu.googlepages.com/comparison3finalreport.pdf"&gt;pdf&lt;/a&gt;). It's pretty low-level: you have to add each item to the index, but that lets you do powerful things, like add an index to uhhh, anything. As with most geo libraries, you have to hack them a bit to handle 1D shapes like genomics data, which only have a start and a stop, or maybe just a midpoint. But, I now have the simplest possible proof of concept to do stuff like this:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;gstore = Gendex('genes')&lt;br /&gt;for start, stop in features:&lt;br /&gt;   gene = (start, stop)&lt;br /&gt;   gstore.append(gene)&lt;br /&gt;gstore.save()&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;and then to do a nearest neighbor query:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;gstore = Gendex('genes')&lt;br /&gt;for loc in locations:&lt;br /&gt;   neighbors = gstore.nearby(loc, 2)&lt;br /&gt;   do_something(neighbors)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Ideally, it'd allow for genomics upstream, downstream searches, with a syntax like:&lt;br /&gt;gstore.nearby(loc, 2, direction=-1)&lt;br /&gt;with -1 for upstream, and 1 for downstream locations.&lt;br /&gt;Gendex is a simple class that automatically creates a store of locations and indexes them. Rtree can already create a persistent index; this just creates a persistent pickle of the start, stop locations. 
It has a bit of syntax sugar for 1D, automatically setting the y-values to 0.01 or -0.01, and returns the coordinates via .nearby instead of their indices from rtree's .nearest() search. Eventually, instead of just using a tuple location, it'll take a class with name, strand, chromosome, etc. attributes, but this is good enough for tinkering:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;import os&lt;br /&gt;import pickle&lt;br /&gt;&lt;br /&gt;import rtree&lt;br /&gt;&lt;br /&gt;class Gendex(rtree.Rtree):&lt;br /&gt;&lt;br /&gt;    def __init__(self, *args, **kwargs):&lt;br /&gt;        self.filename = args[0]&lt;br /&gt;        # per-instance store; a class-level list would be shared by every Gendex&lt;br /&gt;        self._store = []&lt;br /&gt;        super(Gendex, self).__init__(*args, **kwargs)&lt;br /&gt;&lt;br /&gt;        if 'overwrite' not in kwargs and os.path.exists(args[0] + '.pkl'):&lt;br /&gt;            self.load()&lt;br /&gt;&lt;br /&gt;    def append(self, bds):&lt;br /&gt;        self.__setitem__(len(self._store), bds)&lt;br /&gt;&lt;br /&gt;    def __setitem__(self, i, bds):&lt;br /&gt;        # if &lt; 2 y values, search performs poorly.&lt;br /&gt;        y = (i % 2) and 0.01 or -0.01&lt;br /&gt;        if isinstance(bds, (list, tuple)):&lt;br /&gt;            assert len(bds) == 2&lt;br /&gt;            self.add(i, (bds[0], y, bds[1], y))&lt;br /&gt;        else:&lt;br /&gt;            self.add(i, (bds, y))&lt;br /&gt;        self._store.append(bds)&lt;br /&gt;&lt;br /&gt;    def __getitem__(self, i):&lt;br /&gt;        if isinstance(i, int):&lt;br /&gt;            return self._store[i]&lt;br /&gt;        assert(isinstance(i, (list, tuple)))&lt;br /&gt;        return [self._store[ii] for ii in i]&lt;br /&gt;&lt;br /&gt;    def save(self):&lt;br /&gt;        f = open(self.filename + '.pkl', 'wb')&lt;br /&gt;        pickle.dump(self._store, f, -1)&lt;br /&gt;        f.close()&lt;br /&gt;&lt;br /&gt;    def load(self):&lt;br /&gt;        self._store = pickle.load(open(self.filename + '.pkl', 'rb'))&lt;br /&gt;&lt;br /&gt;    def nearby(self, bds, n):&lt;br /&gt;        if isinstance(bds, (int, float)):&lt;br /&gt;            
bds = (bds, 0.1)&lt;br /&gt;        #return self.nearest(bds, n)&lt;br /&gt;        return self.__getitem__(self.nearest(bds, n))&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;According to &lt;a href="http://hobu.biz/"&gt;hobu&lt;/a&gt;, the spatial index library supports 1-d, it just has to be added to the rtree python wrapper. I've started to look at this, but given my limited c/c++ skillz, that may require a lot of hand-holding.&lt;br /&gt;&lt;br /&gt;A nice lightweight bio-informatics package for feature management could be the combination of:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://pytables.org/"&gt;Pytables&lt;/a&gt; or a memmaped &lt;a href="http://www.scipy.org/Numpy_Example_List#head-bf12166d60e12d84eebf19e80b5233e346e9d7f4"&gt;numpy recarray&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;rtree&lt;/li&gt;&lt;br /&gt;&lt;li&gt;pylab&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;that's it.&lt;br /&gt;Pytables could work especially nicely if &lt;a href="http://www.mail-archive.com/pytables-users@lists.sourceforge.net/msg00805.html"&gt;this&lt;/a&gt; gets implemented.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-14263664841036458?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/14263664841036458/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=14263664841036458' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/14263664841036458'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/14263664841036458'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/03/rtree-know-your-nearest-neigbhors.html' title='rtree: know your nearest 
neigbhors'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-5637003585818572269</id><published>2008-03-04T07:00:00.000-08:00</published><updated>2008-12-10T12:37:15.350-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gis'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><title type='text'>What a Shapely genome you have!</title><content type='html'>This might be a case of if you have a really cool &lt;a href="http://pypi.python.org/pypi/Shapely/"&gt;hammer&lt;/a&gt;, everything looks like a nail, but it was fun mixing tools from different disciplines.&lt;br /&gt;&lt;br /&gt;After &lt;a href="http://hackmap.blogspot.com/2008/02/synteny-mapping.html"&gt;finding synteny&lt;/a&gt;, there's a bunch of paired genes whose neighbors are also pairs.  Paired (&lt;a href="http://en.wikipedia.org/wiki/Homology_%28biology%29"&gt;homologous&lt;/a&gt;) genes have similar sequence because they have some function and can't change without loss of function. Non-gene sequence between the paired genes is mostly randomized via mutation, deletion, etc. But, there is non-gene sequence that is conserved &lt;u&gt;between&lt;/u&gt; the genes. 
These CNSs (conserved non-coding sequences) are usually sites that bind stuff that regulates the expression of a gene.&lt;br /&gt;That looks like this.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_uU_kLC5AdTc/R8td3CRJdNI/AAAAAAAAAUg/ELcZQwFvUus/s1600-h/syn1.png"&gt;&lt;img style="cursor: pointer;" src="http://3.bp.blogspot.com/_uU_kLC5AdTc/R8td3CRJdNI/AAAAAAAAAUg/ELcZQwFvUus/s400/syn1.png" alt="" id="BLOGGER_PHOTO_ID_5173331797048128722" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;One gene is on the top and its pair is below, both in yellow.  Pink lines in the foreground connect putative CNSs (similar sequences) between these genes. That the lines cross is bad. CNSs occur right at the level of noise. So even though a similar sequence occurs near both genes, it could be by chance. It is possible to reduce this noise using local synteny. In the figure, that means both ends of the line should be about equidistant from the end of either yellow gene. And lines that are going diagonally across the image should be removed. The goal is to find the parallel lines and remove those that cross.&lt;br /&gt;&lt;br /&gt;Luckily, there's a &lt;a href="http://pypi.python.org/pypi/Shapely/"&gt;"bioinformatics" library&lt;/a&gt; that lets me write code like this:&lt;br /&gt;&lt;pre class="prettyprint"&gt;for aline in pink_lines:&lt;br /&gt;    for bline in pink_lines:&lt;br /&gt;        # skip comparing a line with itself&lt;br /&gt;        if aline is bline: continue&lt;br /&gt;        if aline.crosses(bline):&lt;br /&gt;            aline.has_crossed.add(bline)&lt;br /&gt;            bline.has_crossed.add(aline)&lt;br /&gt;&lt;/pre&gt;The .has_crossed is a set() to keep track of which and how many lines any given line crosses.  
Then it's simple to find lines that crossed a lot of others and remove them.&lt;br /&gt;&lt;pre class="prettyprint"&gt;# sort with most crosses first&lt;br /&gt;pink_lines = sorted(pink_lines, key=lambda l: len(l.has_crossed), reverse=True)&lt;br /&gt;&lt;br /&gt;for cline in pink_lines:&lt;br /&gt;    if len(cline.has_crossed) &gt; THRESHOLD:&lt;br /&gt;        cline.remove = True&lt;br /&gt;        for dline in cline.has_crossed:&lt;br /&gt;            # remove cline from all lines it crossed&lt;br /&gt;            dline.has_crossed.discard(cline)&lt;br /&gt;&lt;br /&gt;# lines never flagged have no .remove attribute, so default to False&lt;br /&gt;pink_lines = [pl for pl in pink_lines if not getattr(pl, 'remove', False)]&lt;br /&gt;&lt;/pre&gt;Then there will still be some crosses remaining, so the code then loops through pink_lines, finds intersections and removes based on the score of the hit, and how many crosses it has.&lt;br /&gt;&lt;br /&gt;That is the simplest case. It also has to remove lines (CNSs) that are in the intron of one gene, and not in the intron of another:&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;if acns.within(agene) and bcns.within(bgene):&lt;br /&gt; do_something_with_intronic_cns(acns, bcns)&lt;br /&gt;elif acns.within(agene) != bcns.within(bgene):&lt;br /&gt; # bad! in only 1 intron&lt;br /&gt; remove(acns, bcns)&lt;br /&gt;&lt;/pre&gt;There are also cases where the lines of the CNSs don't cross, but the actual CNSs touch. So on one homeolog there might be a CNS from basepairs 920 to 936 and another from 922 to 937. We also want to get rid of those. So that'd be:&lt;br /&gt;&lt;pre class="prettyprint"&gt;if cnsa.overlaps(cnsb):&lt;br /&gt; remove_either(cnsa, cnsb)&lt;br /&gt;&lt;/pre&gt;Finally, the CNSs themselves have to be syntenic relative to the gene: that is, if a CNS cns_a is 9000 basepairs upstream from gene a, and its homeolog, cns_b, is 12 basepairs upstream from gene b, that's likely not a syntenic pair. 
As with synteny dot plots, sometimes it's good to flip the subject homeolog to the y-axis (in the images above the subject is the bottom gene) and keep the query (the top gene in the images above) along the x-axis. Then any hit between the x and the y gene can be drawn in the x-y space.  For the genes above, that looks like this:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_uU_kLC5AdTc/R8thdiRJdPI/AAAAAAAAAUw/ZG4RgIxnYck/s1600-h/bowtie.png"&gt;&lt;img style="cursor: pointer;" src="http://1.bp.blogspot.com/_uU_kLC5AdTc/R8thdiRJdPI/AAAAAAAAAUw/ZG4RgIxnYck/s400/bowtie.png" alt="" id="BLOGGER_PHOTO_ID_5173335757007975666" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Starting with the top image: the thick blue line in the center of the image is the gene itself. Any large green dots without a black dot over them are what the program called as CNSs. Red CNSs are on the wrong strand and are not considered. CNSs that overlap the grey, either along the x-axis or the y-axis, occur in the introns of other genes and are not considered. Green dots covering the blue gene are intronic CNSs. The thin blue line just extends the y=x line.  The purple bowtie is what I use to enforce synteny (bowtie.contains(cns)). Anything outside of the bowtie is either close to the query homeolog and far from the subject, or vice-versa. Real CNSs should fall along the diagonal. And we chose an arbitrary maximum distance from the gene of 12,000 basepairs. It's easy to see the syntenic genes line up along the y=x blue line.&lt;br /&gt;&lt;br /&gt;The bottom image, filled with lines of the bilious hue, is to visualize the crosses (as in the code above). The line with the black stripes has been removed because it crossed more than THRESHOLD lines. 
After that removal, no other lines crossed, and after removing non-syntenic genes outside the bowtie, only 20 CNSs remained.&lt;br /&gt;&lt;br /&gt;After all that noise is removed, the original image looks like this in our real viewer:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_uU_kLC5AdTc/R8tfxCRJdOI/AAAAAAAAAUo/dtdJdMzhxro/s1600-h/syn2.png"&gt;&lt;img style="cursor: pointer;" src="http://3.bp.blogspot.com/_uU_kLC5AdTc/R8tfxCRJdOI/AAAAAAAAAUo/dtdJdMzhxro/s400/syn2.png" alt="" id="BLOGGER_PHOTO_ID_5173333892992169186" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;and those are real CNSs with lovely parallel lines.  This runs for thousands of homologous gene pairs and creates a list of CNSs. There are other complexities, e.g. when pairs are on opposite strands or when there are other syntenic genes in the up/downstream regions.&lt;br /&gt;&lt;br /&gt;Other than the extra dimension, MultiLineString() works perfectly for a single gene with many CNSs. The fact that it integrates so well with &lt;a href="http://numpy.scipy.org/"&gt;numpy&lt;/a&gt;, and therefore &lt;a href="http://matplotlib.sourceforge.net/"&gt;matplotlib&lt;/a&gt;, is great for visualization. Compare that to making a map file for just a quick look at some data. Anyway, a fun project. 
Cheers &lt;a href="http://zcologia.com/news/"&gt;Sean&lt;/a&gt; and the makers of &lt;a href="http://geos.refractions.net/"&gt;geos&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-5637003585818572269?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/5637003585818572269/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=5637003585818572269' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/5637003585818572269'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/5637003585818572269'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/03/what-shapely-genome-you-have.html' title='What a Shapely genome you have!'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_uU_kLC5AdTc/R8td3CRJdNI/AAAAAAAAAUg/ELcZQwFvUus/s72-c/syn1.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-4356159929690276116</id><published>2008-02-28T07:17:00.000-08:00</published><updated>2008-02-28T07:17:55.330-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gis'/><category scheme='http://www.blogger.com/atom/ns#' term='fcsh'/><category scheme='http://www.blogger.com/atom/ns#' term='flash'/><title type='text'>flash, vi, fcsh</title><content type='html'>All my flash tinkering has been in VIM-- no 
IDE, no XML, just actionscript.  It's a little tough to deal with the &lt;a href="http://www.adobe.com/products/flex/sdk/"&gt;adobe compiler&lt;/a&gt; as it takes about 11 seconds to compile a large project like modestmaps on my machine. That's not good for a guess-and-check programmer. The typing does catch some errors.&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://worldkit.org/"&gt;worldkit&lt;/a&gt; project compiles instantaneously with &lt;a href="http://mtasc.org"&gt;mtasc&lt;/a&gt; (the predecessor to &lt;a href="http://haxe.org"&gt;haxe&lt;/a&gt;)--likewise for the as2 branch of modestmaps.  The &lt;a href="http://labs.adobe.com/wiki/index.php/Flex_Compiler_Shell"&gt;flash compiler shell&lt;/a&gt; drops the compile time for as3 modestmaps to under 3 seconds, so I've added this to my .bash_aliases:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;alias fcsh="/usr/bin/rlwrap /opt/src/flex2/bin/fcsh"&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;the rlwrap is to use readline in the flash shell--meaning I can just press up-arrow to get the previous compile command. 
By default, one has to paste or type the entire command again.&lt;br /&gt;With that, it's close to a reasonable workflow.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-4356159929690276116?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/4356159929690276116/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=4356159929690276116' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/4356159929690276116'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/4356159929690276116'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/02/flash-vi-fcsh.html' title='flash, vi, fcsh'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-1848391513278578301</id><published>2008-02-27T09:12:00.000-08:00</published><updated>2008-03-11T22:36:16.820-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bioinformatics'/><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><title type='text'>synteny mapping</title><content type='html'>&lt;span style="font-size:130%;"&gt;&lt;span style="font-style: italic;"&gt;&lt;u&gt;Living in Synteny&lt;/u&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;I've been working on automating &lt;a href="http://en.wikipedia.org/wiki/Synteny"&gt;synteny&lt;/a&gt; mapping between any pairs of genomes. 
Synteny is where there's a stretch of DNA or genes in some order on chromosomeA of organismX and, due to a shared evolutionary history, you can find a similar stretch of genes in order on chromosomeB of organismY.   Often there are small losses and inversions, but between closely related organisms like &lt;a href="http://scholar.google.com/scholar?hl=en&amp;amp;lr=&amp;amp;q=man+mouse+synteny&amp;amp;btnG=Search"&gt;man and mouse&lt;/a&gt;, there's still a lot of synteny.&lt;br /&gt;Plants can undergo &lt;a href="http://en.wikipedia.org/wiki/Polyploidy"&gt;polyploidy&lt;/a&gt;, following which a species can have 2 entire copies of its genome. Over time, much of the duplicated cruft is lost, and the homologous chromosomes diverge, but if the divergence is not too great, it's still possible (actually common) to find synteny within the genome of a single organism--as well as between organisms.&lt;br /&gt;I've written my own algorithm to find synteny which uses python sets and numpy array slicing to do the heavy lifting. It is quite clever [wink]. And it _almost_ works but it's ... sorta non-deterministic.&lt;br /&gt;The output looks like this for &lt;a href="http://www.arabidopsis.org/"&gt;Arabidopsis&lt;/a&gt; against itself:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href=""&gt;&lt;img style="cursor: pointer; width: 400px;" src="http://img84.imageshack.us/img84/4435/diagsvc0.png" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;That's compressed from a 3000px*3000px image, but it's 25 boxes for the 5 vs. 5 chromosomes, so above and below the diagonal should be mirror images, with:&lt;br /&gt;+ the diagonal dark line in the center being where each gene matches itself&lt;br /&gt;+ the dark patches in the center of each box being all the crap in the centromere of each chromosome.&lt;br /&gt;+ the speckled dots all over being because genes have many relatives, 
and there are lots of frequently appearing transposons.&lt;br /&gt;+ the red lines being what my algorithm finds as diagonals.&lt;br /&gt;The problem is it either extends too far into the sea of dots, or it doesn't seed on what should be a diagonal, depending on the parameters. The human eye is very good at this, but it's difficult to make a computer do it. Anyway, I'd previously tried published synteny finding programs and found them no better than mine, but just found &lt;a href="http://dagchainer.sourceforge.net/"&gt;dagchainer&lt;/a&gt;, where DAG is &lt;a href="http://en.wikipedia.org/wiki/Directed_acyclic_graph"&gt;directed-acyclic graph&lt;/a&gt;.  It's a bit fussy in that given too many points, it will find spurious diagonals, but it's predictable, and fast, and it's a simple matter to cull the points before sending them in. DAGChainer is &lt;a href="http://bioinformatics.oxfordjournals.org/cgi/content/abstract/20/18/3643"&gt;published&lt;/a&gt; and public domain, and the code is readable (made me think maybe C++ ain't so bad). Also, it doesn't rely on protein sequence as many synteny programs do. This is important when dealing with poorly annotated genomes. I had some trouble with it originally, and emailed the author, Brian Haas. He asked me to send my data set, so I sent the worst parts. He did some preprocessing, found some good parameters, ran it himself and sent me the results in under a day, including a couple of explanatory emails back and forth. So, now, I'm building my processing scripts around dagchainer, and it's working out great. Brian is extremely enthusiastic and helpful.  
Must be another one of those crazy folks who enjoy what they do.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-1848391513278578301?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/1848391513278578301/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=1848391513278578301' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/1848391513278578301'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/1848391513278578301'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/02/synteny-mapping.html' title='synteny mapping'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-6581792542155485976</id><published>2008-02-26T17:45:00.000-08:00</published><updated>2008-02-26T18:12:49.506-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='oss'/><category scheme='http://www.blogger.com/atom/ns#' term='gis'/><category scheme='http://www.blogger.com/atom/ns#' term='flash'/><title type='text'>open source gis and flash maps part two</title><content type='html'>&lt;span style="font-size:130%;"&gt;&lt;span style="font-style: italic;"&gt;&lt;u&gt;Mash Flap&lt;/u&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;I &lt;a href="http://hackmap.blogspot.com/2008/02/flash-y-map-plication-vs-500-marker.html"&gt;started&lt;/a&gt; looking into flash mapping stuff lately. 
For the patch I submitted to worldkit, &lt;a href="http://brainoff.com/weblog/"&gt;Mikel&lt;/a&gt; gave me commit access!  So, now the &lt;a href="http://worldkit.org/trac/index.fcgi/browser/worldkit"&gt;svn version of worldkit&lt;/a&gt; can be compiled with &lt;a href="http://mtasc.org/"&gt;mtasc&lt;/a&gt; by typing "make". I feel unreasonably proud of that, given that much of what I did was some global search and replace stuff in VI, and then reading and fixing mtasc compiler errors until they went away. It was good fun.&lt;br /&gt;&lt;a href="http://mike.teczno.com/"&gt;    Michal Migurski&lt;/a&gt; saw in that same post that I mentioned modestmaps and gave me some good ideas on getting WMS going. I just figured out how to get that working and &lt;a href="http://getsatisfaction.com/modestmaps/topics/simple_working_wms_overlays_over_commerical_mapproviders"&gt;posted a message&lt;/a&gt; to their overly web2.0 forums. Hopefully someone with some real actionscript skillz will clean it up. A mapping library without a good WMS interface is much less useful for most of the stuff I do.&lt;br /&gt;&lt;br /&gt;I haven't decided whether to use modestmaps or worldkit, or both. The &lt;a href="http://www.slideshare.net/mikel_maron/its-about-time-for-time/"&gt;time stuff&lt;/a&gt; that Mikel has done in worldkit is very cool and I haven't really looked at that yet, but I have a time-based project starting soon. But most of my GIS projects are reliant on good, high-resolution imagery and quality roads data, and it's nice if that's just "magically" included via one of the commercial map providers. That's a plus for modestmaps. Though I _still_ do not understand if the modestmaps usage of google, yahoo, microsoft images falls within the terms of use... Especially since google starts sending "X" tiles instead of imagery after browsing with modestmaps for a while. 
Meanwhile, I keep using &lt;a href="http://openlayers.org/"&gt;OpenLayers&lt;/a&gt;, because that's what sane people do.&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-6581792542155485976?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/6581792542155485976/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=6581792542155485976' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/6581792542155485976'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/6581792542155485976'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/02/open-source-gis-and-flash-maps-part-two.html' title='open source gis and flash maps part two'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-3927855024068457735</id><published>2008-02-24T09:36:00.000-08:00</published><updated>2008-02-24T10:43:39.310-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bioinformatics'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='bio'/><title type='text'>fast python with shedskin</title><content type='html'>There's a &lt;a href="http://shed-skin.blogspot.com/2008/02/shed-skin-0027.html"&gt;new release&lt;/a&gt; of the shedskin compiler. 
It is able to generate fast shared libraries that can be run from CPython. It can also create binaries, so I thought I'd see how it did on some code from &lt;a href="http://www.biomedcentral.com/1471-2105/9/82"&gt;this&lt;/a&gt; BMC bioinformatics article compared to &lt;a href="http://psyco.sourceforge.net/"&gt;psyco&lt;/a&gt; and CPython.&lt;br /&gt;I took this iterative, brute-force (&lt;a href="http://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm"&gt;Needleman-Wunsch&lt;/a&gt;?) alignment &lt;a href="ftp://ftp.bioinformatics.org/pub/benchmark/python/alignment.py"&gt;code&lt;/a&gt; and modified it slightly. That's pasted &lt;a href="http://pastebin.com/f470ac285"&gt;here&lt;/a&gt;. (Notice the first line! That's how it appears in the original code.) The modifications allow shedskin to infer the function and variable types. Plus, there are a couple of changes I made that improve the run-time for all cases. The max() function is also in the original, but unnecessary because of python's builtin max(); however, psyco does run much faster using their hand-coded max(). For the shedskin run, I removed that extra code and used shedskin's builtin 'cause it made me feel better.&lt;br /&gt;&lt;br /&gt;The python code was run as&lt;br /&gt;$ time python -c "import alignment; alignment.imain()"&lt;br /&gt;&lt;br /&gt;To run with psyco. 
The only change from the pasted script is to add&lt;br /&gt;'import psyco; psyco.full()' at the top.&lt;br /&gt;&lt;br /&gt;To run with shedskin built executable (after removing the max()):&lt;br /&gt;$ shedskin alignment.py  ### generates Makefile&lt;br /&gt;$ make&lt;br /&gt;$ time ./alignment&lt;br /&gt;&lt;br /&gt;To run with shedskin as a shared (.so) library&lt;br /&gt;$ shedskin -e -n alignment.py # generates Makefile&lt;br /&gt;$ make&lt;br /&gt;$ time python -c "import alignment; print alignment.imain()"&lt;br /&gt;# python finds and imports the shared library (.so) before the .py file.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Timing&lt;/span&gt;:&lt;br /&gt;time reported is real.&lt;br /&gt;Python 2.5.1: &lt;span style="font-style: italic;"&gt;19.030s&lt;/span&gt;&lt;br /&gt;Psyco: &lt;span style="font-style: italic;"&gt;1.1336s&lt;/span&gt;&lt;br /&gt;Shedskin shared: &lt;span style="font-style: italic;"&gt;0.921&lt;/span&gt;&lt;br /&gt;Shedskin binary: &lt;span style="font-style: italic;"&gt;0.818&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;CPython is much slower at 19 seconds compared to ~1 for psyco and shedskin. The output format for the shedskin binary is slightly different because it calls the stuff in __main__, but they all generate the same alignments. It's interesting to look at the alignment.ss.py that shedskin generates, as you can see all the inferred types. The alignment.cpp contains the generated cpp code, which is also quite readable. It's also nice to be able to get a binary executable without jumping through any extra hoops.&lt;br /&gt;&lt;br /&gt;Shedskin was very easy to use and faster than psyco for this case. I just pulled from &lt;a href="http://code.google.com/p/shedskin/source/checkout"&gt;SVN&lt;/a&gt; and it worked out of the box. It now has support for sets and regular expressions, and seems &lt;a href="http://code.google.com/p/shedskin/source/list"&gt;quite active&lt;/a&gt;. 
I can see using shedskin for purely brute force stuff where &lt;a href="http://numpy.scipy.org/"&gt;numpy&lt;/a&gt; is no help and I might otherwise have to resort to &lt;a href="http://cython.org/"&gt;cython&lt;/a&gt;.  I kinda like cython, but it's nice just to be able to get fast code with little to no modifications.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;EDIT:&lt;br /&gt;jython 2.3a0 runs this unaltered in 42 seconds.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-3927855024068457735?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/3927855024068457735/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=3927855024068457735' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/3927855024068457735'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/3927855024068457735'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/02/fast-python-with-shedskin.html' title='fast python with shedskin'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-1206996298087260480</id><published>2008-02-23T10:21:00.001-08:00</published><updated>2008-02-23T21:44:36.847-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='python gis'/><title type='text'>Python Mapscript Tricks</title><content type='html'>I &lt;a 
href="http://hackmap.blogspot.com/2008/02/flash-y-map-plication-vs-500-marker.html"&gt;mentioned previously&lt;/a&gt; a site which uses google maps with a WMS. The problem with using points (or labels) in a tiled application such as google maps, or &lt;a href="http://openlayers.org/"&gt;OpenLayers&lt;/a&gt; is that each tile can only draw its own contents. So if you draw a point with a radius of 10 pixels whose center is 2 px away  from the edge of the tile, then 8px of the entire 20px will be chopped. By default, those lost 8 pixels will not be drawn in the adjacent tile because the center of the point does not fall in that tile. so that gives something that looks like this for 2 adjacent tiles:&lt;br /&gt;&lt;nobr&gt;&lt;br /&gt;&lt;a style="border: 0pt" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://static.scribd.com/profiles/images/bdobl12q3cnti-full.png"&gt;&lt;img style="cursor: pointer; width: 256px;" src="http://static.scribd.com/profiles/images/bdobl12q3cnti-full.png" alt="" border="0" /&gt;&lt;/a&gt;&lt;/nobr&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://static.scribd.com/profiles/images/2m0h5l56kuxxf-large.png"&gt;&lt;img style="margin: 0pt; cursor: pointer; width: 256px;" src="http://static.scribd.com/profiles/images/2m0h5l56kuxxf-large.png" alt="" border="0" /&gt;&lt;/a&gt;&lt;nobr&gt;&lt;br /&gt;&lt;/nobr&gt;&lt;br /&gt;It's hard to even tell that those tiles belong together! Using some &lt;a href="http://mapserver.gis.umn.edu/docs/howto/mapscript_python"&gt;python mapscript&lt;/a&gt; and &lt;a href="http://www.pythonware.com/products/pil/"&gt;PIL&lt;/a&gt;, it's pretty simple to make those look like this( except I dont know how to tell blogger not to add the spacin...) 
:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://static.scribd.com/profiles/images/k1389rhnzo514-full.png"&gt;&lt;img style="width: 256px;" src="http://static.scribd.com/profiles/images/k1389rhnzo514-full.png" alt="" border="0" /&gt;&lt;/a&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://static.scribd.com/profiles/images/1weiu66xp7794-full.png"&gt;&lt;img style="margin: 0pt; cursor: pointer; width: 256px;" src="http://static.scribd.com/profiles/images/1weiu66xp7794-full.png" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;That's actually the same WMS request(s), just changing the call from a simple WMS CGI script that calls mapserver to a &lt;a href="http://wsgi.org/wsgi"&gt;WSGI&lt;/a&gt; script that uses python mapscript and PIL.&lt;br /&gt;The script:&lt;br /&gt;1. takes the current bounding box, width and height, and expands them all proportionally.&lt;br /&gt;2. has mapscript draw the expanded image, which is&lt;br /&gt;3. saved into a &lt;a href="http://docs.python.org/lib/module-cStringIO.html"&gt;StringIO&lt;/a&gt; object, which is&lt;br /&gt;4. sent to PIL to be cropped down to the requested size and extent, and&lt;br /&gt;5. 
returned to the browser.&lt;br /&gt;&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;br /&gt;#!/usr/bin/python&lt;br /&gt;&lt;br /&gt;import mapscript&lt;br /&gt;from cgi import parse_qsl&lt;br /&gt;import os&lt;br /&gt;import Image&lt;br /&gt;from cStringIO import StringIO&lt;br /&gt;&lt;br /&gt;BUFFER_PCT = 0.1&lt;br /&gt;MAPFILE = 'soil.map'&lt;br /&gt;&lt;br /&gt;########################################################&lt;br /&gt;MAPFILE = os.path.join(os.path.dirname(__file__), MAPFILE)&lt;br /&gt;&lt;br /&gt;def application(environ, start_response):&lt;br /&gt;  winput = dict(parse_qsl(environ['QUERY_STRING']))&lt;br /&gt;  format = winput.get('FORMAT', 'image/png')&lt;br /&gt;  start_response('200 OK', [('Content-Type', format)])&lt;br /&gt;  extension = format[format.find('/') + 1:]&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;  wms = mapscript.mapObj(MAPFILE)&lt;br /&gt;  req = mapscript.OWSRequest()&lt;br /&gt;&lt;br /&gt;  for k,v in winput.items():&lt;br /&gt;      req.setParameter(k, v)&lt;br /&gt;&lt;br /&gt;  buffer = BUFFER_PCT/2&lt;br /&gt;&lt;br /&gt;  bbox = map(float, winput['BBOX'].split(","))&lt;br /&gt;  rangex = bbox[2] - bbox[0]&lt;br /&gt;  rangey = bbox[3] - bbox[1]&lt;br /&gt;  w = int(winput['WIDTH'])&lt;br /&gt;  h = int(winput['HEIGHT'])&lt;br /&gt;  xdelta = int(round(w * buffer)) # e.g add 13px on each side&lt;br /&gt;  ydelta = int(round(h * buffer)) # for 256x256 image&lt;br /&gt;&lt;br /&gt;  bbox[0] -= rangex * buffer # extend the&lt;br /&gt;  bbox[1] -= rangey * buffer # bbox in&lt;br /&gt;  bbox[2] += rangex * buffer # all&lt;br /&gt;  bbox[3] += rangey * buffer # directions&lt;br /&gt;&lt;br /&gt;  # http://trac.osgeo.org/mapserver/ticket/2299&lt;br /&gt;  req.setParameter('WIDTH',  str(w + 2 * xdelta))   # and adjust the&lt;br /&gt;  req.setParameter('HEIGHT', str(h + 2 * ydelta))   # h, w by the same&lt;br /&gt;  req.setParameter('BBOX', ",".join(map(str,bbox))) # amount&lt;br /&gt;  req.setParameter("STYLES", "")&lt;br /&gt;  
req.setParameter("REQUEST", "GetMap")&lt;br /&gt;&lt;br /&gt;  # PIL doesnt like interlace.&lt;br /&gt;  wms.outputformat.setOption('FORMATOPTIONS', 'INTERLACE=OFF')&lt;br /&gt;  wms.loadOWSParameters(req)&lt;br /&gt;&lt;br /&gt;  im = Image.open(StringIO(wms.draw().getBytes()))&lt;br /&gt;  if im is None: return ['']&lt;br /&gt;&lt;br /&gt;  # crop the image back to the requested w, h&lt;br /&gt;  im = im.crop((xdelta, ydelta, w + xdelta, h + ydelta))&lt;br /&gt;&lt;br /&gt;  buffer = StringIO()&lt;br /&gt;  im.save(buffer, extension)&lt;br /&gt;  buffer.seek(0)&lt;br /&gt;  data = buffer.read()&lt;br /&gt;  return [ data ]&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;That's pretty simple, but pretty cool. And it actually runs slightly faster than the CGI version. It runs even more quickly if "wms = mapscript.mapObj(MAPFILE)" is moved into the global scope, so the mapfile doesnt have to be re-parsed each time, but I didnt test enough to make sure that's ok.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-1206996298087260480?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/1206996298087260480/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=1206996298087260480' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/1206996298087260480'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/1206996298087260480'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/02/i-mentioned-previously-site-which-uses.html' title='Python Mapscript 
Tricks'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-7753721811229556269</id><published>2008-02-21T16:59:00.000-08:00</published><updated>2008-02-21T17:30:27.709-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gis oss'/><title type='text'>Open Source GIS</title><content type='html'>Just saw &lt;a href="http://blog.entchev.com/2008/02/16/all-bikes-weigh-the-same.aspx"&gt;this&lt;/a&gt; linked from Sean's &lt;a href="http://zcologia.com/news/685/still-not-getting-it/"&gt;post&lt;/a&gt;. The quotes in there are absurd. But it reminded me there's an important point in favor of OSS that I haven't seen mentioned.&lt;br /&gt;&lt;br /&gt;With GDAL, you can still get help directly via the email list or IRC from &lt;a href="http://en.wikipedia.org/wiki/Frank_Warmerdam"&gt;F Warmerdam&lt;/a&gt;, its primary author. Likewise for the developers of &lt;a href="http://mapserver.gis.umn.edu/"&gt;Mapserver&lt;/a&gt; and &lt;a href="http://openlayers.org/"&gt;OpenLayers&lt;/a&gt; and &lt;a href="http://postgis.refractions.net/"&gt;PostGIS&lt;/a&gt;. I wonder if the lead developers of ESRI products spend their off-time perusing the forums or mailing lists and answering questions? (I don't know, they may. But I suspect not.)&lt;br /&gt;There's something to be said for enjoying what you do. And I think that's very true in the case of those in the open-source community. Happy coders make better software. Any programmer that denies that will leave me flabbergasted.&lt;br /&gt;&lt;br /&gt;I just don't understand how I could be effective only clicking the menus that were provided to me if I chose a black-box solution. 
But, call me crazy, I like linux.&lt;br /&gt;&lt;br /&gt;Also, in these parts, if you lock your single-speed bike &lt;a href="http://blog.entchev.com/2008/02/16/all-bikes-weigh-the-same.aspx"&gt;to a wooden chair&lt;/a&gt;, &lt;a href="http://www.ebbc.org/?q=theft_prevention"&gt;it'll get stolen&lt;/a&gt; regardless of the cost of the lock or the weight of the bike.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-7753721811229556269?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/7753721811229556269/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=7753721811229556269' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/7753721811229556269'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/7753721811229556269'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/02/open-source-gis.html' title='Open Source GIS'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-5629429583861096899</id><published>2008-02-12T17:21:00.000-08:00</published><updated>2008-12-10T12:37:16.122-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='flash gis'/><title type='text'>Flash-y Map-plication vs. 500 marker limit</title><content type='html'>I've been pushing capabilities of javascript mapping frameworks. 
There's a real limit on the number of markers a browser can display without getting bogged down. 500 is a good limit, and that's probably too high for internet explorer. You can put as many markers as you like via WMS tiles. You can even take map clicks, send them back as AJAX queries and open an info window. I recently helped my friend do this, and we have a pretty snappy map displaying 5,000+ clickable "markers" no problem. It's actually not markers, just tiles, but it works quite well:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_uU_kLC5AdTc/R7c82atKvzI/AAAAAAAAAT4/kyMH1ZTm7aA/s1600-h/mich.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://1.bp.blogspot.com/_uU_kLC5AdTc/R7c82atKvzI/AAAAAAAAAT4/kyMH1ZTm7aA/s320/mich.png" alt="" id="BLOGGER_PHOTO_ID_5167666003010305842" border="0" /&gt;&lt;/a&gt;That's a snapshot of the google maps application, with an info window that appears when a marker is clicked. It sends a GetFeatureInfo request back to a mapserver WMS.&lt;br /&gt;&lt;br /&gt;Vector drawing in the browser is limited. The amazing efforts of the &lt;a href="http://openlayers.org/"&gt;OpenLayers&lt;/a&gt;, &lt;a href="http://featureserver.org/"&gt;featureserver&lt;/a&gt; projects make this difficult to assert, as they abstract away all of the browser incompatibilities and give a nice platform to do &lt;a href="http://featureserver.org/demo.html"&gt;real vector editing in a map&lt;/a&gt;. Still, there's a limit, and even if browsers become twice as fast in the next 2-3 years (and that's a big if, because the slowest commonly used browser will be the weakest link), the limit will be at 1000 markers.&lt;br /&gt;Flash is built to do vector drawing. It's cross-browser. 
I've gotten my toes wet with a flash-based project and now that I am past (part of) the learning curve, I figured I'd see what's available for GIS applications in flash...&lt;br /&gt;&lt;br /&gt;Yahoo fairly recently released their &lt;a href="http://developer.yahoo.com/flash/astra-webapis/"&gt;Actionscript 3.0 version&lt;/a&gt; of their maps API. The examples &lt;a href="http://developer.yahoo.com/flash/maps/examples/YahooMap_Events/YahooMap_Events.mxml"&gt;look very straight-forward&lt;/a&gt;, if a bit heavy on the under-scores--even for a pythonista. But it's not open source, and there is no indication of how one would request tiles from any source--say WMS.&lt;br /&gt;There is a &lt;a href="http://freeearth.poly9.com/"&gt;3D earth viewer&lt;/a&gt; with a nice javascript API, but it's not open source and I don't need the 3D stuff.&lt;br /&gt;&lt;br /&gt;I'm still looking at &lt;a href="http://worldkit.org/"&gt;worldkit&lt;/a&gt;, and even &lt;a href="http://lists.brainoff.com/pipermail/worldkit-dev-brainoff.com/2008-February/000407.html"&gt;provided a patch&lt;/a&gt; to allow it to compile with mtasc. The worldkit approach, as I understand it, is to just provide the functionality in the flash movie, and only offer customization through an intuitive config.xml file, so no actionscript programming is necessary. Of course, this isn't strictly the case, but I think it's the way that it's most used. It's licensed GPL...&lt;br /&gt;&lt;br /&gt;Then there's &lt;a href="http://www.modestmaps.com/"&gt;modest maps&lt;/a&gt;, who are truly open source, with a trac bug tracker and BSD licensed code, multiple developers, clean design, and libraries for both actionscript 2.0 and 3.0. And they can display imagery from any of the major providers. I don't quite understand how they get by without violating the licensing restrictions, but they are making javascript calls--presumably to get the copyright info associated with the region of interest. 
They have a &lt;a href="http://modestmaps.mapstraction.com/trac/wiki/TileCoordinateComparisons"&gt;coordinate conversion system&lt;/a&gt; that seems very clean, though I also dont understand how to relate it to proj4 definitions--or the EPSG code in a WMS request. If I can understand how to use this, and even if it's possible to have layers that overlay base imagery I'd go with modest maps.&lt;br /&gt;&lt;br /&gt;Anyone know of any others? I guess my criteria are:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Active, currently developed&lt;/li&gt;&lt;li&gt;Open source&lt;/li&gt;&lt;li&gt;WMS friendly&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;EDIT:&lt;br /&gt;I was able to get a modestmaps actionscript 2.0 movie to show 2500 markers, with the map still very responsive. The as3 version should perform even better because a marker can be a subclass of the more lightweight Sprite, rather than the full movieclip.&lt;br /&gt;this is a shot of the randomly placed markers.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_uU_kLC5AdTc/R7fCRatKv0I/AAAAAAAAAUA/3p11dHw279o/s1600-h/m2500.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://1.bp.blogspot.com/_uU_kLC5AdTc/R7fCRatKv0I/AAAAAAAAAUA/3p11dHw279o/s320/m2500.png" alt="" id="BLOGGER_PHOTO_ID_5167812701913268034" border="0" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-5629429583861096899?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/5629429583861096899/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=5629429583861096899' title='7 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/21289662/posts/default/5629429583861096899'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/5629429583861096899'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/02/flash-y-map-plication-vs-500-marker.html' title='Flash-y Map-plication vs. 500 marker limit'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_uU_kLC5AdTc/R7c82atKvzI/AAAAAAAAAT4/kyMH1ZTm7aA/s72-c/mich.png' height='72' width='72'/><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-6511254872118571964</id><published>2008-02-10T09:58:00.000-08:00</published><updated>2008-02-10T13:57:19.218-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bioinformatics'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><title type='text'>python bioinformatics</title><content type='html'>There's a new article out in BMC Bioinformatics with a comparison of the speed and length of programs from various languages. This article was sent to the biology in python (&lt;a href="http://lists.idyll.org/pipermail/biology-in-python/2008-February/thread.html"&gt;BIP&lt;/a&gt;) mailing list. 
Looking at the code, it's not &lt;span style="font-style: italic;"&gt;that&lt;/span&gt; bad, but it is clear that the authors are not pythonistas, and that the reviewers have done a great job on the actual paper, but likely there was no thorough review of the code.&lt;br /&gt;The authors define their own max() function that needlessly overrides python's built-in, and they use the code:&lt;br /&gt;&lt;pre&gt;line.rstrip('/n')&lt;/pre&gt; that indicates there was not a thorough understanding of python, or a complete code review. Even a non-python programmer should have seen the intent was to strip a newline '\n', but the operation is not inplace, so the desired behavior could be achieved by:&lt;pre&gt;line = line.rstrip('\n')&lt;/pre&gt;Syntactical mistakes aside, python was given a poor review on speed. Andrew Dalke, of &lt;a href="http://www.dalkescientific.com/writings/diary/archive/2007/10/07/wide_finder.html"&gt;wide-finder&lt;/a&gt; (and general python-bio) fame ran the &lt;a href="ftp://ftp.bioinformatics.org/pub/benchmark/python/alignment.py"&gt;alignment&lt;/a&gt; program (which seems to be Needleman-Wunsch) and found the  &lt;a href="http://psyco.sourceforge.net/"&gt;psyco&lt;/a&gt; JIT'ed version to run in 1.7 seconds instead of the original 18+.&lt;br /&gt;&lt;br /&gt;Given the lack of polish on the programs, even from other languages, the question then is what's the recourse? On the BIP list, some even suggested requesting that the authors retract the article. I think that's going too far as (to my knowledge) the code runs, and does more/less what it should. An alternate approach may be to propose more thorough code review for all articles, not just those that are benchmarks.  As an example of informal code review, had the authors sent their code to the python list, they would undoubtedly have received numerous suggestions for making the code faster and more idiomatic. Likewise for the other languages. 
Clearly, that's not a solution for all cases, but given reasonably active communities for Bio-Python, Perl, Java, and Ruby, a journal or author should be able to find a competent reviewer--especially in the most common scenarios where a single language was used. Journals demand a very specific format for the text of an article; should they have similar standards for any code that is used in the article? Should they require automated tests? This presents a problem for those that use proprietary software. But one of the points of a scientific paper is to document the methods sufficiently to reproduce the results. Is the actual code required to do so?&lt;br /&gt;It's an interesting question, and I don't know the answer. I notice 2 things:&lt;br /&gt;1) A reviewer is not expected to know how to program, but she is expected to understand the science.&lt;br /&gt;2) We do not have the same standards for code review as for reviewing the text.&lt;br /&gt;&lt;br /&gt;Do the ends justify the means?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-6511254872118571964?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://www.biomedcentral.com/1471-2105/9/82' title='python bioinformatics'/><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/6511254872118571964/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=6511254872118571964' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/6511254872118571964'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/6511254872118571964'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/02/python-bioinformatics.html' title='python 
bioinformatics'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-906877501559890849</id><published>2008-02-02T16:15:00.000-08:00</published><updated>2008-12-10T12:37:16.380-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bioinformatics'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><title type='text'>parallel blasts using python's pp module</title><content type='html'>&lt;a href="http://en.wikipedia.org/wiki/BLAST"&gt;BLAST&lt;/a&gt; can utilize multiple cores for some scenarios using the -a flag. However, I often do full genome blasts -- blasting all chromosomes of one organism against all others. For the case of rice (O.s.) against itself, with 12 chromosomes, this is 144 jobs. 
There are simple tools in python to run a queued blast: on my 8 core machine, each core runs 1 of those blasts, and as a job finishes, the pp or &lt;a href="http://parallelpython.org/"&gt;parallel python&lt;/a&gt; module starts the next job, based on the number of cpus it has detected for your machine.&lt;br /&gt;The syntax for the script is:&lt;br /&gt;&lt;pre&gt;python pblast.py rice_rice_10kmers&lt;/pre&gt;&lt;br /&gt;where rice_rice_10kmers is the section in a config/.ini file from which to get the parameters.&lt;br /&gt;My fasta directory looks like:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;$ ls /tmp/rice/fasta/&lt;br /&gt;ricetenkmers_chr01.fasta  ricetenkmers_chr04.fasta  ricetenkmers_chr07.fasta  ricetenkmers_chr10.fasta  ricetenkmers.order&lt;br /&gt;ricetenkmers_chr02.fasta  ricetenkmers_chr05.fasta  ricetenkmers_chr08.fasta  ricetenkmers_chr11.fasta&lt;br /&gt;ricetenkmers_chr03.fasta  ricetenkmers_chr06.fasta  ricetenkmers_chr09.fasta  ricetenkmers_chr12.fasta&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The script will create (144 - repeats) output files with names that look like:&lt;br /&gt;&lt;pre&gt;ricetenkmers_chr06_vs_ricetenkmers_chr09.blast&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The code below is what does the work:&lt;br /&gt;&lt;pre class="prettyprint" id="python"&gt;&lt;br /&gt;"""&lt;br /&gt;this will parallelize the blast on a directory of fasta files. 
keeps&lt;br /&gt;the number of CPUs on the given machine full until all jobs are done.&lt;br /&gt;"""&lt;br /&gt;&lt;br /&gt;import commands&lt;br /&gt;import os&lt;br /&gt;import sys&lt;br /&gt;import pp&lt;br /&gt;import glob&lt;br /&gt;&lt;br /&gt;import ConfigParser&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;filen = lambda fullpath: fullpath[fullpath.rfind("/") + 1:fullpath.rfind(".")]&lt;br /&gt;"""&lt;br /&gt;&gt;&gt;&gt; filen = lambda path: path[path.rfind("/") + 1:path.rfind(".")]&lt;br /&gt;&gt;&gt;&gt; filen("/tmp/rice/fasta/ricetenkmers_chr06.fasta")&lt;br /&gt;'ricetenkmers_chr06'&lt;br /&gt;"""&lt;br /&gt;&lt;br /&gt;def gen_command(q_fastas, s_fastas, format_db, blast, out_dir):&lt;br /&gt;   """&lt;br /&gt;   generator of blast commands given arguments:&lt;br /&gt;   'q_fastas' : a list/iterable of strings indicating the full path&lt;br /&gt;                to the set of query fastas.&lt;br /&gt;   's_fastas' : a list/iterable of strings ... subject ...&lt;br /&gt;   'format_db': a formatdb command (see pblast.ini for example)&lt;br /&gt;   'blast'    :  a blast command string (see pblast.ini) &lt;br /&gt;   'out_dir'  :  directory to write the blast output files&lt;br /&gt;   yields the full blast commands.&lt;br /&gt;   """&lt;br /&gt;&lt;br /&gt;   for q_fasta in q_fastas:&lt;br /&gt;       q_name = filen(q_fasta)&lt;br /&gt;       for s_fasta in s_fastas:&lt;br /&gt;           s_name = filen(s_fasta)&lt;br /&gt;&lt;br /&gt;           out_name = os.path.join(out_dir, "%s_vs_%s.blast" % (q_name, s_name))&lt;br /&gt;          &lt;br /&gt;           if not os.path.exists("%s.nin" % s_fasta):&lt;br /&gt;               print &gt;&gt;sys.stderr, "formatting %s" % s_fasta&lt;br /&gt;               # formatting db is not parallelized here ...&lt;br /&gt;               commands.getoutput(format_db % s_fasta)&lt;br /&gt;&lt;br /&gt;           # tell it where to send the output.&lt;br /&gt;           command = (blast % (q_fasta, s_fasta)) + " -o " + 
out_name&lt;br /&gt;           yield command&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;def consume(command):&lt;br /&gt;   return command, commands.getoutput(command)&lt;br /&gt;&lt;br /&gt;def save_blast_info(out_dir, format_db, blast):&lt;br /&gt;   """ keep a record of the blast and format_db params used"""&lt;br /&gt;   bp = open(os.path.join(out_dir, "00blast.params"), "w")&lt;br /&gt;   print &gt;&gt; bp, format_db&lt;br /&gt;   print &gt;&gt; bp, blast&lt;br /&gt;   bp.close()&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;def main(q_fastas, s_fastas, format_db, blast, out_dir):&lt;br /&gt;   # create the server.&lt;br /&gt;   s = pp.Server()&lt;br /&gt;   if not os.path.exists(out_dir): os.makedirs(out_dir)&lt;br /&gt;   save_blast_info(out_dir, format_db, blast)&lt;br /&gt;&lt;br /&gt;   jobs = [] # this will hold all the pp jobs.&lt;br /&gt;&lt;br /&gt;   for c in gen_command(q_fastas, s_fastas, format_db, blast, out_dir):&lt;br /&gt;       # tell pp that we'll call the consume function&lt;br /&gt;       # (c,) is the argument list, where c is a string for the blast&lt;br /&gt;       # command&lt;br /&gt;       # ("commands",) is a tuple of modules that will be needed for the job&lt;br /&gt;       jobs.append(s.submit(consume, (c,), (), ("commands",)))&lt;br /&gt;&lt;br /&gt;   # loop through and collect the jobs and any output as they return&lt;br /&gt;   for j in jobs:&lt;br /&gt;       command, output = j()&lt;br /&gt;       print command&lt;br /&gt;       if output: print output&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;if __name__ == "__main__":&lt;br /&gt;   config = ConfigParser.ConfigParser()&lt;br /&gt;   config.read("pblast.ini")&lt;br /&gt;   params = dict(config.items(sys.argv[1]))&lt;br /&gt;&lt;br /&gt;   format_db = params.get("format_db", "/usr/bin/formatdb -p F -i %s")&lt;br /&gt;   blast     = params.get("blast", "/usr/bin/blastall -p blastn -K 80 -i %s -d %s -e 0.001 -m 8 ")&lt;br /&gt;  &lt;br /&gt;   query     = 
params.get("query_files")&lt;br /&gt;   subject   = params.get("subject_files")&lt;br /&gt;   out_dir   = os.path.join(params.get("out_dir", "/tmp") , sys.argv[1].strip())&lt;br /&gt;&lt;br /&gt;   fasta_list = {"q": [], "s": []}&lt;br /&gt;   patterns = ("fa", "fasta", "faa", "fas")&lt;br /&gt;&lt;br /&gt;   for qs, fileset in (("q", query), ("s", subject)):&lt;br /&gt;       # it's either a directory, in which case get all fasta files.&lt;br /&gt;       if os.path.isdir(fileset):&lt;br /&gt;           for pat in patterns:&lt;br /&gt;               fasta_list[qs].extend(glob.glob(os.path.join(fileset, "*." + pat)))&lt;br /&gt;       # or it's a glob pattern&lt;br /&gt;       else:&lt;br /&gt;           fasta_list[qs] = glob.glob(fileset)&lt;br /&gt;       assert len(fasta_list[qs]) &gt; 0, "didn't find any files for %s" % fileset&lt;br /&gt;          &lt;br /&gt;&lt;br /&gt;   main(fasta_list["q"], fasta_list["s"], format_db, blast, out_dir)&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;which reads this config file:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;[rice_rice_10kmers]&lt;br /&gt;&lt;br /&gt;# is it protein? leave the %s to be filled programmatically&lt;br /&gt;format_db=/usr/bin/formatdb -p F -i %s&lt;br /&gt;&lt;br /&gt;# just use correct path and use all parameters here:&lt;br /&gt;# the %s's get filled with query and subject files programmatically.&lt;br /&gt;# the output file is chosen and appended programmatically.&lt;br /&gt;blast=/usr/bin/blastall -p blastn -K 80 -i %s -d %s -e 0.001 -m 8&lt;br /&gt;&lt;br /&gt;# where to send the blast output becomes:&lt;br /&gt;# /tmp/rice_rice_10kmers/&lt;br /&gt;out_dir=/tmp/&lt;br /&gt;&lt;br /&gt;# a glob pattern or a directory&lt;br /&gt;query_files=/tmp/rice/fasta/&lt;br /&gt;# it's a self-self blast. 
so query and subject are same.&lt;br /&gt;subject_files=/tmp/rice/fasta/&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;This will run a blast of all rice chromosomes (split into 10,000mers) against all other 10,000 mers. This is what top looks like on my 8-core machine.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_uU_kLC5AdTc/R6X-Dt-egqI/AAAAAAAAATs/nT-xwKMUiuM/s1600-h/top_blast1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://4.bp.blogspot.com/_uU_kLC5AdTc/R6X-Dt-egqI/AAAAAAAAATs/nT-xwKMUiuM/s400/top_blast1.png" alt="" id="BLOGGER_PHOTO_ID_5162811887684846242" border="0" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-906877501559890849?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/906877501559890849/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=906877501559890849' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/906877501559890849'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/906877501559890849'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2008/02/parallel-blasts-using-pythons-pp-module.html' title='parallel blasts using python&apos;s pp module'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://4.bp.blogspot.com/_uU_kLC5AdTc/R6X-Dt-egqI/AAAAAAAAATs/nT-xwKMUiuM/s72-c/top_blast1.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-4491618757452450470</id><published>2007-11-26T15:21:00.000-08:00</published><updated>2008-02-03T09:54:47.345-08:00</updated><title type='text'>tinycc</title><content type='html'>i've been _trying_ to learn C. tinycc, besides being tiny, compiles very quickly, allowing you to do cool things like script in C&lt;br /&gt;&lt;pre class="prettyprint" id="C"&gt;#!/usr/bin/tcc -run&lt;br /&gt;&lt;br /&gt;#include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;&lt;br /&gt;int main(int argc, char *argv[]) {&lt;br /&gt; printf("Hello World %s, %s", argv[0], argv[1]);&lt;br /&gt; return 0;&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;and then run as&lt;br /&gt;./file.c arg_1&lt;br /&gt;&lt;br /&gt;which makes it easier for those without much c-fu to guess and check.&lt;br /&gt;it also allows such nice things as &lt;a href="http://www.cs.tut.fi/%7Eask/cinpy/"&gt;c in python&lt;/a&gt;&lt;br /&gt;which is like pyinline, but uses ctypes and doesn't need write access.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-4491618757452450470?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/4491618757452450470/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=4491618757452450470' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/4491618757452450470'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/4491618757452450470'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2007/11/tinycc.html' 
title='tinycc'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-1985493247497536070</id><published>2007-10-11T20:53:00.001-07:00</published><updated>2007-10-11T20:58:45.468-07:00</updated><title type='text'>Sorting by proximity to a date in PostgreSQL</title><content type='html'>postgreSQL has great support for dates, &lt;br /&gt;&lt;br /&gt;=&gt; SELECT '2007-08-23'::date - '2006-09-14'::date as days;&lt;br /&gt; days &lt;br /&gt;------&lt;br /&gt;  343&lt;br /&gt;&lt;br /&gt;given a date column and a date, to find the nearest date, you can "extract the epoch", here, i used ABS as i just want the nearest date, before or after.&lt;br /&gt;&lt;br /&gt;SELECT *, ABS(EXTRACT(EPOCH FROM(date - '2006-08-23'))::BIGINT) as date_order  FROM record WHERE well_id = 1234 ORDER BY date_order limit 1&lt;br /&gt;&lt;br /&gt;i suppose this could make a nice PL/PGSQL function...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-1985493247497536070?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/1985493247497536070/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=1985493247497536070' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/1985493247497536070'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/1985493247497536070'/><link rel='alternate' type='text/html' 
href='http://hackmap.blogspot.com/2007/10/sorting-by-proximity-to-date-in.html' title='Sorting by proximity to a date in PostgreSQL'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-8644205983159766144</id><published>2007-09-29T10:22:00.000-07:00</published><updated>2008-02-10T13:45:35.523-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='pylab kmeans'/><category scheme='http://www.blogger.com/atom/ns#' term='pylab'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><title type='text'>k-means clustering in scipy</title><content type='html'>it's fairly simple to do clustering of points with similar z-values in scipy:&lt;br /&gt;&lt;br /&gt;&lt;pre class="prettyprint" id="python"&gt;&lt;br /&gt;import numpy&lt;br /&gt;import matplotlib&lt;br /&gt;matplotlib.use('Agg')&lt;br /&gt;from scipy.cluster.vq import *&lt;br /&gt;import pylab&lt;br /&gt;pylab.close()&lt;br /&gt;&lt;br /&gt;# generate some random xy points and&lt;br /&gt;# give them some striation so there will be "real" groups.&lt;br /&gt;xy = numpy.random.rand(30,2)&lt;br /&gt;xy[3:8,1] -= .9&lt;br /&gt;xy[22:28,1] += .9&lt;br /&gt;&lt;br /&gt;# make some z values&lt;br /&gt;z = numpy.sin(xy[:,1]-0.2*xy[:,1])&lt;br /&gt;&lt;br /&gt;# whiten them&lt;br /&gt;z = whiten(z)&lt;br /&gt;&lt;br /&gt;# let scipy do its magic (k==3 groups)&lt;br /&gt;res, idx = kmeans2(numpy.array(zip(xy[:,0],xy[:,1],z)),3)&lt;br /&gt;&lt;br /&gt;# convert groups to rgb 3-tuples.&lt;br /&gt;colors = ([([0,0,0],[1,0,0],[0,0,1])[i] for i in idx])&lt;br /&gt;&lt;br /&gt;# show sizes and colors. 
each color belongs in diff cluster.&lt;br /&gt;pylab.scatter(xy[:,0],xy[:,1],s=20*z+9, c=colors)&lt;br /&gt;pylab.savefig('/var/www/tmp/clust.png')&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-8644205983159766144?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/8644205983159766144/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=8644205983159766144' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/8644205983159766144'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/8644205983159766144'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2007/09/k-means-clustering-in-scipy.html' title='k-means clustering in scipy'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-3624382985797247219</id><published>2007-07-17T15:50:00.000-07:00</published><updated>2008-02-03T09:56:16.618-08:00</updated><title type='text'>using python mapscript to create a shapefile and dbf</title><content type='html'>i always have trouble remembering how to use mapscript. it's pretty simple, but the docs are hard to find and the test cases (though excellent!) have a lot of abstraction. &lt;br /&gt;&lt;br /&gt;heres some code that creates a shapefile and dbf (using another module). 
it also does a quick projection at the start.&lt;br /&gt;&lt;br /&gt;&lt;pre class='prettyprint' id='python'&gt;&lt;br /&gt;import mapscript as M&lt;br /&gt;import random&lt;br /&gt;from dbfpy import dbf&lt;br /&gt;&lt;br /&gt;#########################################&lt;br /&gt;# do some projection&lt;br /&gt;#########################################&lt;br /&gt;&lt;br /&gt;p = 'POINT(466666 466000)'&lt;br /&gt;shape = M.shapeObj.fromWKT(p)&lt;br /&gt;projInObj  = M.projectionObj("init=epsg:32619")&lt;br /&gt;projOutObj = M.projectionObj("init=epsg:4326")&lt;br /&gt;shape.project(projInObj, projOutObj)&lt;br /&gt;print shape.toWKT()&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;#########################################&lt;br /&gt;# create a shapefile from scratch&lt;br /&gt;#########################################&lt;br /&gt;ms_dbf = dbf.Dbf("/tmp/t.dbf", new=True)&lt;br /&gt;ms_dbf.addField(('some_field', "C", 10))&lt;br /&gt;&lt;br /&gt;ms_shapefile = M.shapefileObj('/tmp/t.shp', M.MS_SHAPEFILE_POLYGON)&lt;br /&gt;&lt;br /&gt;for i in xrange(10):&lt;br /&gt;    ms_shape = M.shapeObj(M.MS_SHAPE_POLYGON)&lt;br /&gt;    ms_line  = M.lineObj()&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;    for j in xrange(10):&lt;br /&gt;        ms_line.add(M.pointObj(random.randint(0,99), -random.randint(0,99)))&lt;br /&gt;&lt;br /&gt;    ms_shape.add(ms_line)&lt;br /&gt;    ms_shapefile.add(ms_shape)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;    rec = ms_dbf.newRecord()&lt;br /&gt;    rec['some_field'] = 'hi' + str(i)&lt;br /&gt;    rec.store()&lt;br /&gt;&lt;br /&gt;ms_dbf.close()&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-3624382985797247219?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/3624382985797247219/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=3624382985797247219' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/3624382985797247219'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/3624382985797247219'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2007/07/using-python-mapscript-to-create.html' title='using python mapscript to create a shapefile and dbf'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-673021281880331431</id><published>2007-06-27T15:46:00.000-07:00</published><updated>2007-06-27T15:47:22.804-07:00</updated><title type='text'>Note to self: using python logging module</title><content type='html'>&lt;pre&gt;&lt;br /&gt;import logging&lt;br /&gt;&lt;br /&gt;logging.basicConfig(level=logging.DEBUG&lt;br /&gt;        ,format='%(asctime)s [[%(levelname)s]] %(message)s'&lt;br /&gt;        ,datefmt='%d %b %y %H:%M'&lt;br /&gt;        ,filename='/tmp/app.log'&lt;br /&gt;        ,filemode='a')&lt;br /&gt;&lt;br /&gt;logging.debug('A debug message')&lt;br /&gt;logging.info('Some information')&lt;br /&gt;logging.warning('A shot across the bows')&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-673021281880331431?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/673021281880331431/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=673021281880331431' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/673021281880331431'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/673021281880331431'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2007/06/note-to-self-using-python-logging.html' title='Note to self: using python logging module'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-8155597529861203900</id><published>2007-04-26T08:54:00.000-07:00</published><updated>2007-04-26T08:57:20.645-07:00</updated><title type='text'>Fix indentation in VIM</title><content type='html'>Often, i get a file which has the indentation completely messed up, not just mixing tabs/spaces, but really "whack"&lt;br /&gt;&lt;br /&gt;these commands seem to magically fix it for at least 2 test cases:&lt;br /&gt;&lt;pre style='border:1px solid black;'&gt;&lt;br /&gt;:set filetype=xml&lt;br /&gt;:filetype indent on&lt;br /&gt;:e&lt;br /&gt;gg=G&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-8155597529861203900?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://www.chovy.com/web-development/fix-indentation-and-tabs-in-vim/' title='Fix indentation in VIM'/><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/8155597529861203900/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=8155597529861203900' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/8155597529861203900'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/8155597529861203900'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2007/04/fix-indentation-in-vim.html' title='Fix indentation in VIM'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-2492852401480294251</id><published>2007-04-25T15:59:00.000-07:00</published><updated>2007-04-25T16:04:28.779-07:00</updated><title type='text'>Using Python MiddleWare</title><content type='html'>just trying to figure this stuff out. it's pretty simple, but there's one level of abstraction through web.py. 
you can use middleware to add keys to the environ for example.&lt;br /&gt;http://groovie.org/files/WSGI_Presentation.pdf&lt;br /&gt;&lt;br /&gt;&lt;pre style="border: 2px inset black; background-color: rgb(104, 104, 104);"&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;#!/usr/bin/python&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;import web&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;import random&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;class hi(object):&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;   def GET(self,who='world'):&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;       web.header('Content-type','text/html')&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;       print "hello %s" % who&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;class bye(object):&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;   def GET(self,who='world'):&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;       web.header('Content-type','text/plain')&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;       print "bye %s" % who&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;       for c in web.ctx.env:&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;           print c, web.ctx.env[c]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;class other(object):&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;   def GET(self):&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;       web.header('Content-type','text/plain')&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;       for c in web.ctx:&lt;/span&gt;&lt;br /&gt;&lt;span style="color: 
rgb(255, 255, 255);"&gt;           print c, web.ctx[c]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;urls = ( '/bye/(.*)', 'bye'&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;       ,'/hi/(.*)' , 'hi'&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;       , '/.*'     , 'other')&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;class RandomWare(object):&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;   def __init__(self, app):&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;       self.your_app = app;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;   def __call__(self,environ,start):&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;       environ['hello'] = random.random()&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;       return self.your_app(environ,start)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;def random_mw(app):&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;   return RandomWare(app)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;if __name__ == "__main__":&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;   web.run(urls,globals(),random_mw)&lt;/span&gt;&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-2492852401480294251?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/2492852401480294251/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=2492852401480294251' title='0 Comments'/><link rel='edit' 
type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/2492852401480294251'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/2492852401480294251'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2007/04/using-python-middleware.html' title='Using Python MiddleWare'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-817209007284603603</id><published>2007-04-24T16:43:00.000-07:00</published><updated>2008-02-10T13:45:08.686-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='mod_wsgi'/><category scheme='http://www.blogger.com/atom/ns#' term='web.py'/><category scheme='http://www.blogger.com/atom/ns#' term='apache'/><title type='text'>Install run, and benchmark mod_wsgi in &lt; 10 minutes</title><content type='html'>svn checkout http://modwsgi.googlecode.com/svn/trunk/ modwsgi&lt;br /&gt;cd modwsgi&lt;br /&gt;./configure&lt;br /&gt;make&lt;br /&gt;sudo make install&lt;br /&gt;# note where mod_wsgi.so went on your system&lt;br /&gt;&lt;nobr&gt;echo "LoadModule wsgi_module /path/to/mod_wsgi.so" &gt;&gt; /path/to/apache2.conf&lt;br /&gt;&lt;/nobr&gt;&lt;br /&gt;mkdir /var/www/wsgitest/&lt;br /&gt;cd /var/www/wsgitest/&lt;br /&gt;&lt;br /&gt;vi .htaccess&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;# [in .htaccess]&lt;/span&gt;&lt;br /&gt;Options +ExecCGI&lt;br /&gt;&amp;lt;Files hi.py&amp;gt;&lt;br /&gt;SetHandler wsgi-script&lt;br /&gt;&amp;lt;/Files&amp;gt;&lt;br /&gt;&lt;span style="color: rgb(0, 
0, 153);"&gt;# [ end .htaccess]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;vi hi.py&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;# [in hi.py]&lt;/span&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;#!/usr/bin/python&lt;br /&gt;&lt;br /&gt;import web&lt;br /&gt;&lt;br /&gt;class hi(object):&lt;br /&gt;    def GET(self,who='world'):&lt;br /&gt;        web.header('Content-type','text/html')&lt;br /&gt;        print "hello %s" % who&lt;br /&gt;&lt;br /&gt;class bye(object):&lt;br /&gt;    def GET(self,who='world'):&lt;br /&gt;        web.header('Content-type','text/html')&lt;br /&gt;        print "bye %s" % who&lt;br /&gt;&lt;br /&gt;urls = ( '/bye/?(.*)', 'bye'&lt;br /&gt;  ,'/hi/?(.*)' , 'hi' )&lt;br /&gt;&lt;br /&gt;application = web.wsgifunc(web.webpyfunc(urls, globals()))&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;#[end hi.py ]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;you can then browse to&lt;br /&gt;http://localhost/wsgitest/hi.py/hi/there&lt;br /&gt;# see "hello there"&lt;br /&gt;http://localhost/wsgitest/hi.py/bye/bye%20bye&lt;br /&gt;# see "bye bye bye"&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;(meaningless) Benchmarking:&lt;br /&gt;change last line in hi.py to:&lt;br /&gt;if __name__ == "__main__": web.run(urls,globals())&lt;br /&gt;and save as cgi.py&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;cgi&lt;/span&gt;&lt;br /&gt;&lt;nobr&gt;$ ab -n 1000 -c 30 http://localhost/wsgitest/cgi.py/hi/there | grep 'Requests per second'&lt;br /&gt;&lt;/nobr&gt;&lt;br /&gt;Requests per second:    4.08 [#/sec] (mean)&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;wsgi&lt;/span&gt;&lt;br /&gt;&lt;nobr&gt;$ ab -n 1000 -c 30 http://localhost/wsgitest/hi.py/hi/there | grep 'Requests per second'&lt;br /&gt;&lt;/nobr&gt;&lt;br /&gt;Requests per second:    351.05 [#/sec] (mean)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/21289662-817209007284603603?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/817209007284603603/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=817209007284603603' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/817209007284603603'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/817209007284603603'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2007/04/install-and-run-modwsgi-in-10-minutes.html' title='Install run, and benchmark mod_wsgi in &lt; 10 minutes'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-1855788695753821978</id><published>2007-03-21T09:45:00.000-07:00</published><updated>2007-03-21T09:49:07.821-07:00</updated><title type='text'>vim tricks</title><content type='html'>i've been trying to learn new stuff in vim, instead of doing same old. recently, i've been using :tabe to edit in tabs. lately, i've been trying the :sp to edit in splits.&lt;br /&gt;this set of tricks makes it even nicer:&lt;br /&gt;http://www.vim.org/tips/tip.php?tip_id=173&lt;br /&gt;now i can type ctrl+j to move down or ctrl+k to move up a split and&lt;br /&gt;have that split maximized.&lt;br /&gt;both tabs and split make it simple to yank and paste between files. 
something for which i had been using the mouse.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-1855788695753821978?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/1855788695753821978/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=1855788695753821978' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/1855788695753821978'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/1855788695753821978'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2007/03/vim-tricks.html' title='vim tricks'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-117027679441382998</id><published>2007-01-31T12:28:00.000-08:00</published><updated>2007-01-31T13:10:26.853-08:00</updated><title type='text'>postgresql and mysql: benchmark? how?</title><content type='html'>so somehow, my previous post on postgres / mysql made &lt;a href="http://programming.reddit.com/info/11wim/comments"&gt;reddit&lt;/a&gt; , which i happened to be reading yesterday afternoon. i didnt even realize it was my post until following the link.&lt;br /&gt;there were a couple harsh comments stating that i found what i wanted to find. ... which were merited given the sensationalist way i presented the results (50%) and the careless use of the term "benchmark". and yes, the config for mySQL was the default. 
still, i just presented what i found.&lt;br /&gt;i was surprised no one commented on the hackish way that i checked to see if it was a protein sequence in perl, rather than mysql--or the coolness of pre-fetching in DBIx (which is available as eager loading or setting lazy=False in the mapper in python's sqlalchemy).&lt;br /&gt;&lt;br /&gt;re the comments on things to change in the postgresql.conf... i'll try at some point.  are there any suggestions for mysql?&lt;br /&gt;the machine has 12G ram, 4 CPUs. likely, the raid configuration (i don't know how it's set up) is not optimal, but that is out of my hands.&lt;br /&gt;&lt;br /&gt;for a "real benchmark" it'd be nice to do this in sqlalchemy with the schema written in python and then just change the database engine between mysql/postgresql/sqlite. any pointers on how one would go about creating a "real benchmark"?&lt;br /&gt;&lt;br /&gt;actually, that would be a good ask reddit topic: "How to design a 'real database benchmark'?"&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-117027679441382998?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://programming.reddit.com/info/11wim/comments' title='postgresql and mysql: benchmark? 
how?'/><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/117027679441382998/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=117027679441382998' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/117027679441382998'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/117027679441382998'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2007/01/postgresql-and-mysql-benchmark-how.html' title='postgresql and mysql: benchmark? how?'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-116884236988079510</id><published>2007-01-14T21:48:00.000-08:00</published><updated>2007-02-04T15:12:21.776-08:00</updated><title type='text'>real-world postgresql vs mysql benchmark</title><content type='html'>At my work, we have a large MySQL database (15 MyISAM tables, 21 million rows, 10Gigs size). After seeing the &lt;a href="http://tweakers.net/reviews/657/5"&gt;benchmarks&lt;/a&gt; showing that Postgres out-performs MySQL on multi-core machines (our new db server has 4 CPU's), I ported the database to PostgreSQL.&lt;br /&gt;We have begun using the DBIx perl module since Class::DBI is too sloooow. The DBIx module allows closer access to the generated SQL, and it allows "prefetch"ing which eliminates extra back-and-forth (and object creation) between the server and client.  In addition, the connection string is in the script, not in the generated API. 
This makes it easy to benchmark as all that is required to change between db engines is to change the connection string.  This script:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;&lt;span style="font-size:85%;"&gt;use CoGeX; use strict;&lt;br /&gt;# mysql&lt;br /&gt;#my $connstr = 'dbi:mysql:genomes:host:3306';&lt;br /&gt;# postgresql&lt;br /&gt;my $connstr = 'dbi:Pg:dbname=genomes;host=host;port=5432';&lt;br /&gt;my $s = CoGeX-&gt;connect($connstr, 'user', 'pass' );&lt;br /&gt;&lt;br /&gt;my $rs = $s-&gt;resultset('Feature')-&gt;search({&lt;br /&gt;          'feature_type.name' =&gt;   'CDS' ,&lt;br /&gt;          'feature_names.name' =&gt; {like =&gt; 'At1g%' }&lt;br /&gt;      },&lt;br /&gt;      {&lt;br /&gt;          join =&gt; ['feature_names','feature_type'],&lt;br /&gt;          prefetch =&gt; ['feature_names','feature_type']&lt;br /&gt;      }&lt;br /&gt;);&lt;br /&gt;&lt;br /&gt;while (my $feat =$rs-&gt;next()){&lt;br /&gt;  my $fn = $feat-&gt;feature_names;&lt;br /&gt;  my $type = $feat-&gt;feature_type-&gt;name;&lt;br /&gt;&lt;br /&gt;  map { print  $_-&gt;name . ":". $type . "\t" } $fn-&gt;next();&lt;br /&gt;  print  "\n";&lt;br /&gt;&lt;br /&gt;  # this prefetch avoids n calls where n is number of sequences #&lt;br /&gt;  foreach my $seq ($feat-&gt;sequences({},{prefetch=&gt;"sequence_type"})){&lt;br /&gt;      print  $seq-&gt;sequence_data if $seq-&gt;sequence_type-&gt;name eq 'protein';&lt;br /&gt;  }&lt;br /&gt;  print  "\n\n";&lt;br /&gt;}&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 0);"&gt;fetches the protein sequence of any coding sequence (CDS) that has a feature name that starts with 'At1g' which would be any CDS on chromosome 1 of arabidopsis in our database.&lt;br /&gt;The script consistently runs in 45 seconds on MySQL and in 29-31 seconds in Postgres. 
Other scripts show about the same difference: PostgreSQL finishes in about 60-70% of the time that MySQL takes. Or, more dramatically: &lt;span style="font-weight: bold;"&gt;MySQL takes about 50% longer&lt;/span&gt;. That's pretty good for no change in the API, with all tables keeping the same indexing and structure.&lt;br /&gt;Postgres was set up as default except for these values in postgresql.conf:&lt;br /&gt;&lt;/span&gt;&lt;span style="color: rgb(0, 102, 0);font-family:Century Schoolbook L;" &gt;shared_buffers 40000&lt;br /&gt;max_connections 200&lt;br /&gt;work_mem 4096&lt;br /&gt;&lt;/span&gt;&lt;span style="color: rgb(0, 102, 0);font-family:Century Schoolbook L;" &gt;effective_cache_size 10000&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;&lt;span style="color: rgb(0, 0, 0);"&gt;&lt;span style="color: rgb(0, 0, 0);"&gt;&lt;/span&gt; &lt;a href="http://www.powerpostgresql.com/PerfList/"&gt;This&lt;/a&gt; was the most concise guide to which values to set, though closer tuning could likely improve performance (ahem, suggestions welcome). &lt;br /&gt;An added benefit is that we can now push more work into the database server using &lt;a href="http://www.postgresql.org/docs/8.2/interactive/plperl.html"&gt;PL/Perl&lt;/a&gt; or another procedural language, which can further reduce network back-and-forth and avoid creating perl objects when they are not necessary. 
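&lt;br /&gt;&lt;br /&gt;A quick check of the arithmetic behind the timing comparison above, as a minimal Python sketch. The 45 and 30 second values come from the runs reported above (30 is taken as the midpoint of the 29-31 second range), and the function name relative_timing is made up for illustration.&lt;br /&gt;&lt;pre&gt;
```python
def relative_timing(mysql_seconds, postgres_seconds):
    """Return (PostgreSQL share of MySQL time, extra time MySQL needs), in percent."""
    # How much of the MySQL runtime PostgreSQL needs.
    postgres_share = 100.0 * postgres_seconds / mysql_seconds
    # How much longer MySQL runs, relative to the PostgreSQL runtime.
    mysql_extra = 100.0 * (mysql_seconds - postgres_seconds) / postgres_seconds
    return round(postgres_share, 1), round(mysql_extra, 1)


if __name__ == "__main__":
    share, extra = relative_timing(45, 30)
    print("PostgreSQL needs %.1f%% of the MySQL time" % share)
    print("MySQL takes %.1f%% longer than PostgreSQL" % extra)
```
&lt;/pre&gt;The two percentages describe the same gap from opposite baselines, which is why about 67% of the time and 50% longer can both be true.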
&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-116884236988079510?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/116884236988079510/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=116884236988079510' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/116884236988079510'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/116884236988079510'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2007/01/real-world-postgresql-vs-mysql.html' title='real-world postgresql vs mysql benchmark'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-116459043944547297</id><published>2006-11-26T17:14:00.000-08:00</published><updated>2006-11-26T17:20:39.456-08:00</updated><title type='text'>python jumble solver</title><content type='html'>over thanksgiving, i figured it'd be a good hack to solve the &lt;a href="http://www.zdaily.com/jumble.shtml"&gt;jumble&lt;/a&gt; . 
Originally, i tried to generate all permutations of the letters in the word but got confused and just compared letter frequencies:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;import sys&lt;br /&gt;&lt;br /&gt;# word to unscramble: first command-line argument, or a default&lt;br /&gt;word = sys.argv[1].strip().lower() if len(sys.argv) &gt; 1 else "egaugnal"&lt;br /&gt;# keep only dictionary words of the same length (each line still ends with "\n")&lt;br /&gt;words = [w.strip("\n").lower() for w in open('/usr/share/dict/web2') if len(w) == len(word)+1]&lt;br /&gt;&lt;br /&gt;def lfreq(w):&lt;br /&gt;    # map each letter to the number of times it occurs in w&lt;br /&gt;    wfreq = {}&lt;br /&gt;    for letter in w: wfreq[letter] = wfreq.get(letter, 0) + 1&lt;br /&gt;    return wfreq&lt;br /&gt;&lt;br /&gt;wfreq = lfreq(word)&lt;br /&gt;match = [w for w in words if lfreq(w) == wfreq]&lt;br /&gt;&lt;br /&gt;print(match)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;that will print all matches for the word sent in on the command line, assuming your dictionary file is in &lt;span style="color: rgb(0, 0, 153);"&gt;/usr/share/dict/web2 &lt;span style="color: rgb(0, 0, 0);"&gt;:-( .  I tried for a while to write a recursive permute function which would take a word or list of letters and return all possible permutations:  permute('abc') -&gt; ['abc','acb','bca','bac','cba','cab'], which would clearly be a faster solution. but, as mentioned, i got confused and lazy. 
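&lt;br /&gt;&lt;br /&gt;For reference, the recursive permute idea described above could look roughly like this. It is a minimal sketch, not what the solver script does: the names permute and solve_by_permutation are made up here, and the dictionary is passed in as a plain set of lowercase words instead of being read from a file. A word of n letters has n! orderings, so this only stays practical for short words.&lt;br /&gt;&lt;pre&gt;
```python
def permute(letters):
    """Return all orderings of the characters in `letters` as a list of strings."""
    if len(letters) == 0:
        return [""]  # exactly one way to arrange nothing
    result = []
    for i, first in enumerate(letters):
        rest = letters[:i] + letters[i + 1:]  # every character except position i
        for tail in permute(rest):
            result.append(first + tail)
    return result


def solve_by_permutation(jumble, dictionary):
    """Return the sorted dictionary words that are rearrangements of `jumble`.

    `dictionary` is assumed to be a set of lowercase words (hypothetical input).
    """
    # set() drops the repeats that doubled letters produce
    candidates = set(permute(jumble.lower()))
    return sorted(candidates.intersection(dictionary))


if __name__ == "__main__":
    print(permute("abc"))
    print(solve_by_permutation("egaugnal", {"language", "angel", "glean"}))
```
&lt;/pre&gt;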
and the script above runs in &lt;&gt;&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-116459043944547297?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/116459043944547297/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=116459043944547297' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/116459043944547297'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/116459043944547297'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2006/11/python-jumble-solver.html' title='python jumble solver'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-21289662.post-116387862516554656</id><published>2006-11-18T11:36:00.000-08:00</published><updated>2006-11-18T11:37:05.166-08:00</updated><title type='text'>simple AJAX</title><content type='html'>this is the AJAX implementation i use:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;function jfetch(url,t,o) {&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;  var req = jfetch.xhr();&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;  req.open("GET",url,true);&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;  req.onreadystatechange = function() {&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;    
if(req.readyState == 4){&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;      var rsp = req.responseText;&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;      if(t.constructor == Function) return t.apply(o,[rsp]);&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;      t = document.getElementById(t);&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;      t[t.value ==undefined ? 'innerHTML': 'value'] = rsp;&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;      req = null;&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;    }&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;  };&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;  req.send(null);&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;}&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;jfetch.xhr =&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;    (window.ActiveXObject)&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;   ? function(){ return new ActiveXObject("Microsoft.XMLHTTP"); }&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153);"&gt;   : function(){ return new XMLHttpRequest()};&lt;/span&gt;&lt;br /&gt;&lt;/pre&gt;it's short, it only checks for the transport (ActiveX or XHR) once, and it takes either an element id or a call back function. 
and, i can understand it.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/21289662-116387862516554656?l=hackmap.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hackmap.blogspot.com/feeds/116387862516554656/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=21289662&amp;postID=116387862516554656' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/116387862516554656'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/21289662/posts/default/116387862516554656'/><link rel='alternate' type='text/html' href='http://hackmap.blogspot.com/2006/11/simple-ajax.html' title='simple AJAX'/><author><name>brentp</name><uri>http://www.blogger.com/profile/12236821145627337774</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
