Sunday, April 27, 2008

levenshtein in cython

EDIT(2): fix markup for <char *> casts ... fix malloc (see comments. thanks Bao).
NOTE: using a kwarg for limit slows things down. setting that to a required arg and using calloc for m2 speed things up to nearly as fast as the pylevenshtein.

Well, it seems to be popular to code up the levenshtein. I actually have a use for this and wanted to practice some Cython, so I've written a version. I used bearophile's recipe, wikibooks, this (from k4st on reddit) and this for reference. It follows bearophile's code closely, using only O(m) space instead of O(mn).


cdef extern from "stdlib.h":
ctypedef unsigned int size_t
size_t strlen(char *s)
void *malloc(size_t size)
void free(void *ptr)
int strcmp(char *a, char *b)


cdef inline size_t imin(int a, int b, int c):
if a < b:
if c < a:
return c
return a
if c < b:
return c
return b


cpdef int levenshtein(char *a, char *b, int limit=100):
cdef int m = strlen(a), n = strlen(b)
cdef char *ctmp
cdef int i = 0, j = 0, retval
cdef int achr, bchr

if strcmp(a, b) == 0:
return 0

if m > n:
ctmp = a;
a = b;
b = ctmp;
#a, b = b, a
m, n = n, m

# short circuit.
if n - m >= limit:
return n - m

cdef char *m1 = <char *>malloc((n + 2) * sizeof(char))
cdef char *m2 = <char *>malloc((n + 2) * sizeof(char))

for i from 0 <= i <= n:
m1[i] = i
m2[i] = 0

for i from 0 <= i <= m:
m2[0] = i + 1
achr = a[i]
for j from 0 <= j <= n:
bchr = b[j]
if achr == bchr:
m2[j + 1] = m1[j]
else:
m2[j + 1] = 1 + imin(m2[j], m1[j], m1[j + 1])

m1, m2 = m2, m1

retval = m1[n + 1]
free(m2)
free(m1)
return retval

I believe that's correct (if not, let me know!), it matches output from other versions (which i used fort testing) and it doesn't leak memory, so I must have done the malloc/free correctly despite my lack of C-fu.
Again, I'm not sure, but I think that's a pretty good example of how to mix python and C with cython as it's not much longer than the other python versions and pretty readable. And, it's very fast, it does 500K iterations of:
levenshtein('i ehm a gude spehlar', 'i am a good speller')
in 2.5 seconds.
bearophile's pysco-ed editDistanceFast took 3 minutes, 31 seconds to do the same (which is why I started this project...).

Literally, just before I was to post this, I found pylevenshtein which is even faster, does the 500K in 1.5 seconds. ho hum. It's doing more, handling unicode, and apparently doing some optimizations.... So, use that instead! -- and contribute a test-suite.
Next, I think I'll try a variant that allows transpositions (Damerau) and/or implement the BK-Tree, again cribbing off bearophile's recipe.


for reference, here are the contents of the setup.py to build the module.

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup( name = 'levenshtein',
ext_modules=[ Extension("levenshtein", sources=["levenshtein.pyx"] ), ],
cmdclass = {'build_ext': build_ext})

10 comments:

emsi said...

Thanks for sharing that, it's clean and nice.

Bao said...

Looks quite nice indeed, one minor thing,

char *m1 = malloc((n + 1) * sizeof(char))

retval = m1[n + 1]

malloc n + 2 would be better; perhaps cython handles this well, but c may not.

brentp said...

thanks bao. i updated the code.
it's alarming that it didnt give any errors!

flow said...

i'm using this code on py3.1, but when i feed it b'1234', b'14215424223' for comparison, it returns -29!!! can you reproduce this? i copied the code from http://code.google.com/p/bpbio/source/browse/trunk/seqfind/seqfind.pyx

brentp said...

@flow i dont expect it will work without some changes on python 3.1, did you re-run cython and run setup.py again?

flow said...

i have opened two related questions on stackoverflow (here and here) and readers pointed out that using char as a fast integer data type is bad practice. also, it was suggested to replace m1, m2 = m2, m1; strcpy( m3, m2 ) by m3, m2, m1 = m2, m1, m3 pure and simple for a number of reasons. so i downloaded the code again and did these changes:

cdef int alen = strlen(a), blen = strlen(b), retval
cdef char *ctmp
cdef size_t i, j
cdef size_t achr, bchr

if strcmp(a, b) == 0:
return 0

if alen > blen:
ctmp = a;
a = b;
b = ctmp;
alen, blen = blen, alen

cdef unsigned int arraysize = ( blen + 2 ) * sizeof( unsigned int )
cdef unsigned int *m1 = malloc( arraysize )
cdef unsigned int *m2 = malloc( arraysize )
cdef unsigned int *m3 = malloc( arraysize )

for i from 0 <= i <= blen:
m1[ i ] = 0
m2[ i ] = i
m3[ i ] = 0

for i from 1 <= i <= alen:
m1[0] = i + 1
achr = a[i - 1]
for j from 1 <= j <= blen:
bchr = b[j- 1]
if achr == bchr:
m1[j] = m2[j - 1]
else:
m1[j] = 1 + imin(m1[j - 1], m2[j - 1], m2[j])

if i != 1 and j != 1 and achr == b[j - 2] and bchr == a[i - 2]: # and m1[j] > m3[j - 1] + cost:
m1[j] = m3[j - 1]

m3, m2, m1 = m2, m1, m3

retval = m2[blen]
free(m3)
free(m1)
free(m2)
return retval

(sorry how can i post a code snippet? i put it on pastebin: http://pastebin.com/FN5K8yZT)


now, curiously, when i compare b'helo' to b'hleo', your code gives me a difference of 1, while the active state code gives me 2; both seem to be correct, as there is one transposition involved. but when i compare b'1234' to b'154421152421351112544221', results are 21 and 20. i can transform the longer into the shorter string by deleting 20 bytes, and damerau distance should always be equal or less levenshtein distance, so 21 doesn't seem to be correct. when i compare b'1234' to b'1243', results are 1 and 2, but comparing b'1234' to b'1243xxxxxxxxxxxxxxxxxxxx', i get 21 from both algorithms (i think this is correct for damerau, but levenshtein should be 22, no?). i am still a bit puzzled.

flow said...

i've found this page: http://mwh.geek.nz/2009/04/26/python-damerau-levenshtein-distance/ and the code there reports a distance of 20 for b'1234' against b'154421152421351112544221' (yours gives me 21), and 13 for b'1234' against b'12321221234413343' (yours gives me 14). as for b'1234' against b'1243xxxxxxxxxxxxxxxxxxxx', all three versions say it's off by 21, so let's just believe it's correct.

brentp said...

@flow, it's hard to follow code updates in the comments here. could you send a patch that fixes the issues you're explaining?

flow said...

a patch... have to learn that part... i uploaded code to pastebin: http://pastebin.com/NzfmuLxE (slightly altered from the one above). it should be easy to follow as the code is not very long. i took out limit but it should be simple to put it back in again.

brentp said...

@flow if you send me an email (bpederse [at] gmail ), i'll make sure to credit you when i incorporate.