Sunday, January 10, 2010

needleman-wunsch global sequence alignment -- updates and optimizations to nwalign

I've written previously about nwalign, a python package I wrote to do fast (thanks to cython) global sequence alignment. Over break, I spent some time fixing some bugs and improving performance.

Ack

It's actually nice to get bug reports for a relatively obscure bit of software like this as it shows it's getting used. Thanks especially to R. Christen for his patience in (repeatedly) showing me places where nwalign was not doing the right thing.

Bugs

Placement of Gaps

Some of the "Bugs" were actually just "unspecified behavior". For example, given input text of "AGEBAMAM" and "AGEBAM", the alignments:
AGEBAMAM
AGEBAM--
and
AGEBAMAM
AGEB--AM
Have the same score. However, the first version is generally more, um, polite. Previously, the behavior was deterministic, but depended on the length and order of the input sequence. As of version 0.3, nwalign will always put matches together when given a choice between 2 (or more) paths through the scoring matrix.

Gap Extension

nwalign allows use of a scoring matrix to lookup the score/penalty for conversion from one letter to another or a simpler version where the user specifies gap_open, gap_extend, mismatch penalties, and match reward. For the simpler non-matrix path, earlier versions of nwalign did not heed the gap_extension--using the gap_open penalty regardless of the previous entries in the dynamic programming matrix. It's common to use only a gap penalty without a separate gap_extend penalty, nwalign no longer assumes that's what is preferred -- if it is, one can simply specify a `gap_extend` penalty that is equal to `gap_open`.

Scoring

Previous versions of nwalign only returned the alignment and didn't provide access to the score of the alignment. Recent versions allow the user to get the score via:

>>> nw.score_alignment('CEELECANTH', '-PELICAN--', gap_open=-5,
... gap_extend=-2, matrix='PAM250')
11

where `11` is the score of the alignment given, the gap_open and extend_penalty, and subsitution scores in the file 'PAM250' (which is in the NCBI substitution matrix format).

Performance

The earliest version of nwalign was literally about 100 times faster than the perl version from the BLAST book. But, there were a few places where performance has improved even more since that. The most dramatic was in the lookups for the scoring matrix values. Given, again the 2 short strings: "AGEBAMAM" and "AGEBAM", the alignment algorithm has to do the inner loop 48 ( 6 * 8 ) times, in each of those loops, it has to look up the substitution score in the matrix. For example in the first iteration, it has to find the score for "changing" from an "A" to an "A"; in the PAM250 matrix, that score is +2. In the substitution matrix, the row and column "keys" are the amino acids and the values are the scores for changing from the row key to the column key. Previously, I stored the keys in their own array, then did a linear search to find the index. So, for both the row and column, an amino acid letter is translated to an index with a function like:

aa_keys = ['A', 'R', 'N', 'D', ... ] # or 'ARND...'
def findpos(amino_acid, aa_keys):
i = 0
while i < len(aa_keys):
if aa_keys[i] == amino_acid:
return i
i += 1
return - 1

except in C(ython) so it was much faster, the alignment then uses the return values for row and column amino acid to look up the score from the substitution matrix. When the string are long enough-- as they will be with real proteins, this is a huge speed bottleneck. So, it trades memory for speed. Instead of using a 25 * 25 matrix to store the substitution matrix, it now uses an X * X matrix where 'X' is the ord value of the largest amino acid. Usually this will be less than 'Z', so the matrix will be 90 * 90, stored efficiently as a numpy array. From there, the lookup is directly with matrix[ord(x_amino), ord(y_amino)] which is extremely fast in C because the ord is unnecessary as a char can index an array. This gave at least a 3X speed improvement. I could reduce the memory used by the matrix by subtracting ord('A'), but that would be trading speed for memory since it would require 2 extra subtractions per inner loop.

Also, the latest version uses less memory in other areas; in order to do the dynamic programming, the algo has to save N * M arrays of direction "pointers" (in the left/right/diagonal sense (not the c sense), scores and gaps. However, the gaps are not actually needed for the entire trace-back, they are only needed 1 step back to determine if a current gap is a gap_open, or gap_extend. So, now, instead of an N*M matrix for the gap matrix, it's N*1 and M*1 arrays. For large N * M, this is enough to offset the increased memory incurred by how the substitution matrix is stored.

Running a benchmark test script with nwalign-0.1 takes 12.13 seconds and the same script on nwalign-0.3.1 takes 3.41 seconds (and actually runs in 2.99 seconds with unladen-swallow... but that's another story). The script does an alignment on 1200 * 1600 basepair sequences 100 times. As with most things, I figured most of this by banging my head against the wall long enough that the stars aligned, so to speak, anyone who cares to look at the code and offer suggestions would be much appreciated.

7 comments:

Carles said...

Hi.

Thanks a lot for your program, it saved me lot of trouble.

I found a situation where I got a non-satisfactory result, but as I'm inexperienced in alignments I'm not sure it's incorrect.

nwalign.global_align("FFLYVLEMVSKPMI","FFLYVG")

Result:

('FFLYVLEMVSKPMI', 'FFLY----V----G')

When I was expecting 'FFLYV--------G' for the second sequence.

It doesn't change a thing if I specify a gap_open or gap_extend penalty, or match score. On the other hand, if a use the BLOSUM62 matrix everything is fine. BTW, I've tried 0.3 and 0.3.1

I'm I doing something wrong?

Thanks for your work.

Carles.

brentp said...

hi carles, you are correct, that is a bad alignment, someone else has reported this same thing. i will have this fixed in the next release. hopefully that will be next week.
meanwhile, yes, i suggest you specify a matrix.

meric said...

Hi
thank you for the code.
however I need a similarity score (a percentage value) between two sequences. I have another code that will do that for me after your alignment, however I was wondering if your code will give me something like that?
thank you for sharing your work.

Meric said...

one more thing,
sudo easy_install -UZ http://bpbio.googlecode.com/svn/trunk/nwalign/
this is giving the following error

error: Can't download http://bpbio.googlecode.com/svn/trunk/nwalign/: 404 Not Found

Meric said...

one more thing

I get his error error: Can't download http://bpbio.googlecode.com/svn/trunk/nwalign/: 404 Not Found

when I want to use easy_install

brentp said...

@Meric, you'll want to install from pypi:
http://pypi.python.org/pypi/nwalign/

or use mercurial and get it from bitbucket:
http://bitbucket.org/brentp/biostuff/src

brentp said...

@Meric also check the docs on this page: http://pypi.python.org/pypi/nwalign/

there's a score_alignment() function that will tell you the score.