Monday, September 20, 2010

filtering paired end reads (high throughput sequencing)

NOTE: I don't recommend using this code. It is not supported and currently does not work for some sets of reads. If you use it, be prepared to fix it.

I wrote last time about a pipeline for high-throughput sequence data. In it, I mentioned that the fastx toolkit works well for filtering but does not handle paired end reads. The problem is that you can filter each end (file) of reads independently, but most aligners expect that the nth record in file 1 will be the pair of the nth record in file 2. That may not be the case if one end of the pair is completely removed while the other remains.
At the end of this post is the code for a simple python script that clips an adaptor sequences and trims low-quality bases from paired end reads. It simply calls fastx toolkit (which is assumed to be on your path). It uses fastx_clipper if an adaptor sequence is specified and then pipes the output to fastq_quality_trimmer for each file then loops through the filtered output and keeps only reads that appear in both. Usage is something like:
ADAPTORS=GAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGT,GAAGAGCGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGT,GAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCG
pair_fastx_clip_trim.py --sanger -a $ADAPTORS -M 20 -t 28 -l 40 en.wt.1.fastq en.wt.2.fastq
Where the -a (adaptor) -M (length of adaptor match) -t (min quality threshold) and -l (min length after quality chops) options are copied directly from (and sent directly to) fastx toolkit. --sanger indicates that the reads have fastq qualities in the sanger encoding. If that option is not specified, qualities are assumed to be in illumina 1.3 format where the ascii offset is 64.
Output
This example will create 2 new files: en.wt.1.fastq.trim and en.wt.2.fastq.trim each with the same number of corresponding records that pass the filtering above.
Adaptors
As described in my previous post, sometimes there are multiple adaptor sequences in the reads. This script can filter out any number of adaptors--specified in a comma delimited option -a--in a single run.
Script
It's not too pretty, but it does the job:

As always, let me know of any feedback.

31 comments:

jeffhsu3 said...

Hey great post was potentially looking into a script that would trim paired-reads. Are you getting a lot of adapter sequence because you're using 75-bp and some of your insert sizes are small enough that you'll get read-through into the adapter. I noticed that FastQC only recognized one of the Illumina adapters. Do you see the other one as well?

brentp said...

@jeffhsu3
i believe fastQC finds over-represented sequences wether they are primers or not. for example it finds poly-A stuff left in the reads.
it often finds multiple primers (and potentially different primers for the _1 and _2 reads).

Anonymous said...

Getting error running the script

$ python fastq_pair_filter.py

File "fastq_pair_filter.py", line 29
% (FASTX_CLIPPER, a, M, fastq, "-Q 33" if sanger else ""))
^
SyntaxError: invalid syntax

brentp said...

@Anonymous, you probably need a version of python >= 2.5

Anonymous said...

$ python2.7 fastq_pair_filter.py
Traceback (most recent call last):
File "fastq_pair_filter.py", line 96, in
main(adaptors, opts.M, opts.t, opts.l, fastqs, opts.sanger)
File "fastq_pair_filter.py", line 41, in main
fha, fhb = procs[0].stdout, procs[1].stdout
IndexError: list index out of range

$ python2.4 fastq_pair_filter.py
File "fastq_pair_filter.py", line 29
% (FASTX_CLIPPER, a, M, fastq, "-Q 33" if sanger else ""))
^
SyntaxError: invalid syntax

brentp said...

@Anonymous, can you post the command you're using to call the program?

Anonymous said...

@brentp

I posted the command itself on the above posts.

"$ python2.7 fastq_pair_filter.py"

Even with input files (paired end files) same error.

If possible Can you also putup usage options of the programm

Thanks

brentp said...

@Anonymous . you are right! thanks for your patience. I fixed the gist so it should work now. just added parenthesis at line 92 so it looks like this:

if (not fastqs and len(fastqs)) == 2:

let me know if you have any more problems.

Anonymous said...

thanks, I followed your suggestion and edited the line 92

"if (not fastqs and len(fastqs)) == 2:"

still have same error

$ python2.7 fastq_pair_filter.py
Traceback (most recent call last):
File "fastq_pair_filter.py", line 96, in
main(adaptors, opts.M, opts.t, opts.l, fastqs, opts.sanger)
File "fastq_pair_filter.py", line 41, in main
fha, fhb = procs[0].stdout, procs[1].stdout
IndexError: list index out of range

brentp said...

please download from here:
http://gist.github.com/588841

i edited the gist manually and put the new bracket in the wrong place in the gist and in my comment. the not should be out side the ()

Anonymous said...

Thanks
No errors without inputs now
But with input has errors still:

$ python2.7 fastq_pair_filter.py -a ACCC,CGTA,CAGT,TTAG -t 24 SRR018062_1.fastq RR018062_2.fastq

Traceback (most recent call last):
File "fastq_pair_filter.py", line 96, in
main(adaptors, opts.M, opts.t, opts.l, fastqs, opts.sanger)
File "fastq_pair_filter.py", line 22, in main
trim_cmd = "%s -t %i -l %i" % (FASTQ_QUALITY_TRIMMER, t, l)
TypeError: %d format: a number is required, not NoneType

brentp said...

try specifying with -l 0 on the command-line, it should be optional, but it must be defaulting to None, i will fix this shortly.

Anonymous said...

@brentp
Thanks for your time. Its seems working...

BTW how to do de-multiplex (separate barcodes) in paired-samples and place reads in seperate bins?.

brentp said...

thanks for finding and reporting the errors. i have updated the gist.

for barcode splitting (which i havent done), i think you can use fastx directly:

http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_barcode_splitter_usage

Anonymous said...

fastx Barcode splitter is useful for SE reads directly. For Paired End reads without barcodes how I need to eliminate from from other mate?

It will be great if you can implement it. It will be complete package for PE reads for filtering, trimming and BC split.

brentp said...

i suspect you could run the barcode splitter on both ends of the reads independently, then run each pair of resulting files through this program. it should remove any reads that are unpaired -- either because of the barcode or because of quality filtering.

Martijn said...

So, you made 'fhr_headers' a generator which I think has the advantage that the headers don't have to be loaded in memory all at once.

However, in the end you still store them all in the 'seen' bitmap, right?

I don't think you need to do this, something like this pseudo code should do. Or am I missing something here?

Furthermore, I don't have a Fastq file handy at the moment, but if the headers occur in the files according to some order, you don't need an original file at all. But probably this is not the case.

Anonymous said...

If the length of read1 or read2 was trimmed to zero, could you add a function to report another read that was not trimmed to length of zero to the third output file?

Anonymous said...

Hi! Thanks so much for this script. I ran it and the files seemed to come out ok but I got the following message:

"Traceback (most recent call last):
File "./pairedtrim", line 81, in
main(adaptors, opts.M, opts.t, opts.l, fastqs, opts.sanger)
File "./pairedtrim", line 46, in main
for ra, rb in gen_pairs(procs[0].stdout, procs[1].stdout):
File "./pairedtrim", line 36, in gen_pairs
assert not all(b), ("files not same length")
AssertionError: files not same length
fastq_quality_trimmer: writing nucleotides failed: Broken pipe
"

the command line I ran was:
./pairedtrim -t 18 -l 10 ../s11utegl.txt ../s12utegl.txt &

I did not use any of the adapter specifications because I am new to this and was a bit unclear if we had adaptor sequences. We did illumina paired end. Any help would be great!

brentp said...

@Anon,
use this script. I haven't updated the one in this post, will do so shortly.
https://github.com/brentp/methylcode/blob/master/bench/scripts/fastq_pair_filter.py

Anonymous said...

I am pretty sre that was the one that I used. I am not sure if there is actually a problem. I used the output files to do an assembly and it worked fine.

Also, the original untrimmed files were 7 g and 6 g and the files after using the script were 6.6 g and 5.8 g. That seems like what they should be but I am not sure.

brentp said...

no. that's a different one.

Anonymous said...

It worked, thanks!

nike00 said...

Dear brentp, I'm interested in your script, but now I really don't know which is the latest version, the best working one :)

Can you put the link again?

Thanks so much,
nike

brentp said...

@nike00
the latest is here:
https://github.com/brentp/bio-playground/blob/master/reads-utils/fastq_pair_filter.py

but there are problems with the way fastx trims adaptors (often it removes the entire read) that I havent' dealt with. The quality trimming and pairing stuff should work find though.

Also check out the scythe and sickle projects from the group and UC Davis.

nike00 said...

@brentp

thanks a lot,
I don't want to use the adaptors part, I'm interested in the quality and pairing control.
I have just started to use it,
I will tell you if it works or not, if you like :)

nike00

Yue said...

I try the one downloaded from https://gist.github.com/588841. I found the output pairs DO NOT MATCH. Other version does not work: https://github.com/brentp/bio-playground/blob/master/reads-utils/fastq_pair_filter.py

Habib said...

Hi, i just used your script on about 1 giga paired reads. It has been a couple of hours that the scripts outputed:
writing read1.fastq.trim and read2.fastq.trim

Is it unusual? What is the expected running time of this script on such data, and looking for 5 different adaptors?

My machine has 24 cores.

Anonymous said...

Hi, Can you confirm if any of both scripts mentioned works properly? I tried pair_fastx_clip_trim.py and fastq_pair_filter.py. Both scripts crashes with the adaptors option. Without -a options, pair_fastx_clip_trim.py produces non sync files (one shorter than the other one and fastq_pair_filter.py produces files with a very few reads. With my 9GB start files the first script produces 7GB files and with the second one 150MB files. Thanks for your help

brentp said...

Hi, to any commenters having trouble, I have updated the first sentence of the post to indicate that this should not be used.
I have not been maintaining this and don't intend to do so in the near future.
There are multiple versions about (my fault) with various different bugs.

மித்ரன் said...

Hi BrentP,
Thanks for this tool. I came here after checking Fastx_clipper tool box manually. To my understanding it did NOT work as expected/or praised.

For sure, it pulled out all those reads with adapter sequences. But, in addition, it also pulled out reads that are originally came from my genome of intrest. It was very obvious, when i align those reads (containing adatper as told by fastx tool kit) aginst NCBI_NT db or my genome of intrest.....

Following is the command i used to see what the fastx_clipper thinks as adapter containing reads....


fastx_clipper -a GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG -l 1 -d 60 -k -i ES003_carlos_1220_sequence_1.10K.fasta > mp1_index1containg_reads_onlyadapt
er.fa

i use -d 60 so i can pull entire read again to check if it a good job of finding adapters

Anything i am doing worng here? Or is it something been overlooked.....

I will appreciate your help.