Filters an SMS file by read length, control frames, camera ROI, quality filter values, dinucleotide composition and alignment to "junk" sequences. It can also trim leading/trailing homopolymers. It writes the filtered file and reports the number of reads lost at each filtering step.
$ filterSMS --input_file reads.sms --output_file filtered_reads.sms [optional parameters]
Generic options:
--help Produce help message
Required options:
--input_file arg Input file name (sms)
--output_file arg Output file name (sms)
Optional options:
--minlen arg (=0) Min read length, for multiple passes use comma-separated values (e.g. 20,10)
--maxlen arg (=1000) Max read length, for multiple passes use comma-separated values
--quality arg (=10) Max quality score
--dinuc arg Dinuc content filter file name
--roi arg ROI file name
--trim_hp arg Trim heading/tailing homopolymers (eg 'T/H/2/0.75')
--trim_end arg (=0) Trim last bases of each read
--trim_beg arg (=0) Trim initial bases of each read
--trim_rate arg (=0) Trim prefix with high incorporation rate
--filter_rate arg (=0) Remove reads with high incorporation rate
--remove_lock arg (=0) Remove lock base after trimming tailing HP (min trim)
--align arg Junk sequence file name, for alignment filter (fasta)
--minscore arg (=4) minimum nscore, for alignment filter
--percent_error arg (=100) Percent error in bitHPDP
--config_file arg hpdp options config file (default is GL no HP)
--no_ctrl Do not filter by control (x) frames
--by_template Filter by first frame
--sample arg Process only a sample of given size
--garbage_file arg Discarded reads file name
--garbage_type arg (=txt) Discarded reads file type (txt/sms)
--prefix arg Prefix for stats file
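The option list above can be turned into a command line programmatically. The following is a minimal sketch, not part of filterSMS: the helper name build_filter_command is hypothetical, it uses only option names listed above, and it merely assembles the argument list without running the tool.

```python
import shlex


def build_filter_command(input_file, output_file, minlen=None, maxlen=None, **options):
    """Assemble a filterSMS argument list (hypothetical helper, does not run the tool)."""
    argv = ["filterSMS", "--input_file", input_file, "--output_file", output_file]
    # Multiple passes are written as comma-separated values, e.g. 20,10.
    for name, value in (("minlen", minlen), ("maxlen", maxlen)):
        if value is not None:
            values = value if isinstance(value, (list, tuple)) else [value]
            argv += [f"--{name}", ",".join(str(v) for v in values)]
    for name, value in options.items():
        if value is True:
            # Switches such as --no_ctrl take no argument.
            argv.append(f"--{name}")
        elif value is not None and value is not False:
            argv += [f"--{name}", str(value)]
    return argv


argv = build_filter_command(
    "reads.sms", "filtered_reads.sms",
    minlen=[20, 10], dinuc="dinuc.txt", no_ctrl=True,
)
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run on a machine where filterSMS is installed.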
Reads that pass all filtering steps are stored in a new SMS file. The total numbers of input and output reads, as well as the number of reads lost at each filter (including dinuc filter subtypes), are written to filter_stats.txt. Optionally, discarded reads can be stored in a separate text or SMS file for debugging (--garbage_file, --garbage_type). The text version of this file includes the source (fc/chan/pos/cam) and sequence of each filtered read, as well as the filter at which it was lost. It is recommended to combine this option with the --sample option to avoid creating huge text files. The SMS version of this file should be used to capture all the filtered reads for downstream processing.
The format of the output statistics file (filter_stats.txt) is as follows, where In and Out denote the total number of input and output reads per channel:
#PROGRAM=filterSMS
#VERSION=1.2.0
#DATETIME=2008-05-02-T09:30:23
#COMMAND=filterSMS infile.sms filtered.sms --minlen 20 --maxlen 70 --align P102.fa --dinuc dinuc.txt
#PARAMETER:input_file=infile.sms
#PARAMETER:output_file=filtered.sms
#PARAMETER:minlen=20
#PARAMETER:maxlen=70
#PARAMETER:orphan=2
#PARAMETER:dinuc=dinuc.txt
#PARAMETER:align=P102.fa
#PARAMETER:minscore=4
#PARAMETER:percent_error=100
#PARAMETER:config_file=/gpfs2/bioinf/config/HPDP/hpdp_GL_noHP_config
Flowcell  Channel  Position  Camera  In       Length   Ctrl  Qual    Dinuc (BAO)  Dinuc (AT)  Align (P102)  Out
1         1        all       all     4764253  2280473  85    196605  43892        23511       385639        1834048
1         2        all       all     5176371  2468959  84    222945  47220        21266       402100        2013797
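In the example statistics, the per-filter loss columns of each row sum to exactly In minus Out, which makes a convenient sanity check. The sketch below parses such a file and verifies that relationship. It assumes the table columns are tab-delimited (the source does not state the delimiter), and the function names are illustrative only.

```python
def parse_filter_stats(text):
    """Split a stats file into '#key=value' header entries, column names and rows."""
    header, columns, rows = {}, None, []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            key, _, value = line[1:].partition("=")
            header[key] = value
        elif columns is None:
            columns = line.split("\t")  # assumption: tab-delimited table
        else:
            rows.append(dict(zip(columns, line.split("\t"))))
    return header, columns, rows


def losses_add_up(columns, row):
    """True if In minus the reads lost at every filter equals Out."""
    first, last = columns.index("In"), columns.index("Out")
    lost = sum(int(row[name]) for name in columns[first + 1:last])
    return int(row["In"]) - lost == int(row["Out"])


SAMPLE = "\n".join([
    "#PROGRAM=filterSMS",
    "#VERSION=1.2.0",
    "#PARAMETER:minlen=20",
    "\t".join(["Flowcell", "Channel", "Position", "Camera", "In", "Length", "Ctrl",
               "Qual", "Dinuc (BAO)", "Dinuc (AT)", "Align (P102)", "Out"]),
    "\t".join("1 1 all all 4764253 2280473 85 196605 43892 23511 385639 1834048".split()),
    "\t".join("1 2 all all 5176371 2468959 84 222945 47220 21266 402100 2013797".split()),
])

header, columns, rows = parse_filter_stats(SAMPLE)
for row in rows:
    print(row["Flowcell"], row["Channel"], losses_add_up(columns, row))
```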
Filtering order is:
The ROI filter uses a tab-delimited input text file containing the ROI for each camera, e.g.:
Camera  x0  y0  x1    y1
0       0   0   1000  1000
1       0   0   1000  1000
2       0   0   1000  1000
3       0   0   1000  1000
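To illustrate how such an ROI file might be consumed, the sketch below reads the table and tests whether a position falls inside a camera's rectangle. This is not the filterSMS implementation: treating the bounds as inclusive and interpreting (x0, y0) and (x1, y1) as opposite corners are assumptions, and the function names are hypothetical.

```python
def parse_roi(text):
    """Read a tab-delimited ROI table into {camera: (x0, y0, x1, y1)}."""
    lines = [line for line in text.splitlines() if line.strip()]
    rois = {}
    for line in lines[1:]:  # skip the 'Camera x0 y0 x1 y1' header row
        camera, x0, y0, x1, y1 = (int(field) for field in line.split("\t"))
        rois[camera] = (x0, y0, x1, y1)
    return rois


def in_roi(rois, camera, x, y):
    """Assumption: bounds are inclusive and (x0, y0)-(x1, y1) are opposite corners."""
    x0, y0, x1, y1 = rois[camera]
    return x0 <= x <= x1 and y0 <= y <= y1


ROI_TEXT = "\n".join([
    "Camera\tx0\ty0\tx1\ty1",
    "0\t0\t0\t1000\t1000",
    "1\t0\t0\t1000\t1000",
    "2\t0\t0\t1000\t1000",
    "3\t0\t0\t1000\t1000",
])

rois = parse_roi(ROI_TEXT)
print(in_roi(rois, 0, 512.5, 300.0))   # position inside camera 0's rectangle
print(in_roi(rois, 3, 1200.0, 300.0))  # x lies beyond x1
```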
Filter  AA  AC  AG  AT  CA  CC  CG  CT  GA  GC  GG  GT  TA  TC  TG  TT  Thresh
BAO     0   0   1   0   0   0   0   1   0   1   0   0   1   0   0   0   0.7     // Base-addition order
AT      1   0   0   1   0   0   0   0   0   0   0   0   1   0   0   1   0.9     // AT-rich
Each row in the file describes a separate filter. The first column is the name used to describe the filter, the next 16 columns give the weight assigned to each of the 16 dinuc combinations, and the last column gives the total weight threshold of the filter. For example, the BAO (base addition order) filter gives a weight of 1 to each of the dinucs CT, TA, AG and GC, and 0 to all others. A read is filtered out if its score (total weight of all its dinucs divided by the number of dinucs) exceeds the filter's threshold, which is 0.7 for BAO in the example above.
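The scoring rule can be sketched in a few lines. The code below parses the example rows and computes a read's score as the mean weight over its dinucs, comparing it against the threshold in the last column. Counting overlapping dinucs (positions 1-2, 2-3, and so on) is an assumption, since the source does not say how dinucs are enumerated, and the function names are illustrative.

```python
def parse_dinuc_filters(text):
    """Return {filter_name: (weights_by_dinuc, threshold)} from a dinuc filter file."""
    lines = [line.split("//")[0].split() for line in text.splitlines() if line.strip()]
    dinucs = lines[0][1:17]  # the 16 dinuc column names from the header row
    filters = {}
    for fields in lines[1:]:
        weights = dict(zip(dinucs, (float(w) for w in fields[1:17])))
        filters[fields[0]] = (weights, float(fields[17]))
    return filters


def dinuc_score(sequence, weights):
    """Total weight of all dinucs divided by the number of dinucs."""
    # Assumption: dinucs are taken at every position, so they overlap.
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    if not pairs:
        return 0.0
    return sum(weights.get(pair, 0.0) for pair in pairs) / len(pairs)


DINUC_TEXT = """\
Filter AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT Thresh
BAO 0 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0 0.7 // Base-addition order
AT 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0.9 // AT-rich
"""

filters = parse_dinuc_filters(DINUC_TEXT)
for read in ("CTAGCTAGCT", "ACGTACGTAC"):
    for name, (weights, threshold) in filters.items():
        score = dinuc_score(read, weights)
        print(read, name, round(score, 3), "filtered" if score > threshold else "kept")
```

Here CTAGCTAGCT consists only of the dinucs CT, TA, AG and GC, so its BAO score is 1 and it would be removed under this reading of the rule, while ACGTACGTAC scores 2/9 and would be kept.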