I've written a simple, fast, python version of Needleman-Wunsch as I couldn't find one to use. It uses Cython and specifically cython-numpy goodness. It's easy-installable as:
sudo easy_install -UZ http://bpbio.googlecode.com/svn/trunk/nwalign/
or via svn from:
svn co http://bpbio.googlecode.com/svn/trunk/nwalign/
it will put an executable 'nwalign' into /usr/bin/ which when run will give this message:
Usage:
nwalign [options] seq1 seq2
Options:
-h, --help show this help message and exit
--gap=GAP gap extend penalty (must be integer <= 0)
--gap_init=GAP_INIT gap start penalty (must be integer <= 0)
--match=MATCH match score (must be integer > 0)
--mismatch=MISMATCH gap penalty (must be integer < 0)
--matrix=MATRIX scoring matrix in ncbi/data/ format,
if not specificied, match/mismatch are used
where the matrix is optional but can be the full path specifying the transition score from row to column as here.
If the matrix is specified, match and mismatch are not used. If the matrix is not specified, match and mismatch are used. a typical run looks like:
And usage from a python script can be seen in test.py.
$ nwalign alphabet alpet
alphabet
alp---et
$ nwalign --matrix /usr/share/ncbi/data/BLOSUM62 EEAEE EEEEG
EEAEE-
EE-EEG
I wrote this for a colleague who was using a perl script. This is over 100 times faster than the perl script for long sequences--and the perl script had the BLOSUM62 matrix hard-coded as a hash. There are a couple places that could still be sped up, but I think the improvement will be relatively small. This kind of script is perfect for the numpy-cython mix as I can just access the contents of the 2d array at c-speed without having to do pointer arithmetic. If there's any obvious optimizations I missed, I'd be glad to know them.
4 comments:
What is wrong with EMBOSS needle?
@diffusing thoughts: the only thing wrong with it is that i didnt know it existed, and searches in google for various combinations of needleman-wunsch, global sequence alignment ..., etc. didn't find it. thanks for the pointer.
here's the link for anyone else interested.
EMBOSS works well, but you have to a) write the sequences to files and b) do a process call/fork and c) read back the alignment/scores. I haven't tried this app yet, but it's a general problem (and Biopython only provides wrappers to EMBOSS so the problem remains)
@Michael, that's good to know. let me know if you have any recommendations to make it more useful. i've since uploaded it to pypi
Post a Comment