Friday, October 22, 2010

(bloom) filter-ing repeated reads

In this post, I'll talk a bit about using a bloom filter as a pre-filter for large amounts of data, specifically some next-gen sequencing reads.

Bloom Filters

A Bloom Filter is a memory efficient way of determining if an element is in a set. It can have false positives, but not false negatives. A while ago, I wrote a Cython/Python wrapper for the C code that powers the perl module, Bloom::Filter. It's has a nice API and seems very fast. It allows specifying the false positive rate. As with any bloom-filter there's a tradeoff between the amount of memory used and the expected number of false positives.
The code for that wrapper is in my github, here.

Big Data

It's a common request to filter out repeated reads from next-gen sequencing data. Just see this question on biostar. My answer, and every answer in that thread, uses a tool that must read all the sequences into memory. This is an easy problem to solve in any language, just read the records into a dict/hash structure and use that to find duplicates and print out only the best or first or whatever. However, once you start getting "next-gen", this is less useful. For example, someone kindly reported a bug on my simple c++ filter because he had 84Gigs of reads and was running out of memory. And that is a bug. Anything that's supposed to deal with next-gen sequences has to deal with that stuff.

As a side note, my friend/co-worker came up with an elegant solution for filtering larger files: split into 4 files based on the first base in the read (A, C, T, or G) then filter each of those files to uniqueness and merge. I like that approach, and as the file sizes grow it could be extended to separate into 16 files by reading the first 2 bases. But, it does require a lot of extra file io.

Just Do It

So, I wrote a simple script that filters to unique FASTQ reads using a bloom-filter in front of a python set. Basically only stuff that is flagged as appearing in the bloom-filter is added to the set. This trades speed--it iterates over the file 3 times--for memory. The amount of memory is tuneable by the specified error-rate. It's not pretty, but it should be simple enough to demonstrate what's going on. It only reads from stdin and writes to stdout, with some information about total reads an number of false positives in the bloom-filter sent to stderr.
usage looks like:
python fastq-unique.py > in.fastq < out.unique.fastq
On my machine, a particular run with a decent sized file looks like this:

and here's the code:

No comments: