Chapter 6. SMS File Tools

Table of Contents

6.1. filterSMS
6.2. extractSMS
6.3. sms2txt
6.4. smsls

6.1. filterSMS

6.1.1. Purpose

Filters an SMS file by read length, control frames, camera ROI, quality values, sequence dinucleotide composition, and alignment to "junk" sequences. It can also trim head/tail homopolymers. It outputs the filtered file and the number of reads lost at each filtering step.

6.1.2. Usage

$ filterSMS --input_file reads.sms --output_file filtered_reads.sms [optional parameters]

6.1.3. Parameters

Generic options:
  --help                Produce help message

Required options:
  --input_file arg      Input file name (sms)
  --output_file arg     Output file name (sms)

Optional options:
  --minlen arg (=0)          Min read length, for multiple passes use 
                             comma-separated values (e.g. 20,10)
  --maxlen arg (=1000)       Max read length, for multiple passes use 
                             comma-separated values
  --quality arg (=10)        Max quality score
  --dinuc arg                Dinuc content filter file name
  --roi arg                  ROI file name
  --trim_hp arg              Trim heading/tailing homopolymers (eg 'T/H/2/0.75'
                             )
  --trim_end arg (=0)        Trim last bases of each read
  --trim_beg arg (=0)        Trim initial bases of each read
  --trim_rate arg (=0)       Trim prefix with high incorporation rate
  --filter_rate arg (=0)     Remove reads with high incorporation rate
  --remove_lock arg (=0)     Remove lock base after trimming tailing HP (min 
                             trim)
  --align arg                Junk sequence file name, for alignment filter 
                             (fasta)
  --minscore arg (=4)        minimum nscore, for alignment filter
  --percent_error arg (=100) Percent error in bitHPDP
  --config_file arg          hpdp options config file (default is GL no HP)
  --no_ctrl                  Do not filter by control (x) frames
  --by_template              Filter by first frame
  --sample arg               Process only a sample of given size
  --garbage_file arg         Discarded reads file name
  --garbage_type arg (=txt)  Discarded reads file type (txt/sms)
  --prefix arg               Prefix for stats file

6.1.4. Output

Reads that pass all filtering steps are stored in a new SMS file. The total numbers of input and output reads, as well as the number of reads lost at each filter (including dinuc filter subtypes), are written to filter_stats.txt. Optionally, discarded reads can be stored in a separate text or SMS file for debugging (--garbage_file, with --garbage_type set to txt or sms). The text version of this file includes the source (fc/chan/pos/cam) and sequence of each discarded read, as well as the filter at which it was lost. It is recommended to combine the text version with the --sample option to avoid creating huge text files. The SMS version should be used to capture all discarded reads for downstream processing.

The format of the output statistics file (filter_stats.txt) is as follows, where In and Out denote the total number of input and output reads per channel:

#PROGRAM=filterSMS
#VERSION=1.2.0
#DATETIME=2008-05-02-T09:30:23
#COMMAND=filterSMS infile.sms filtered.sms --minlen 20 --maxlen 70 --align P102.fa --dinuc dinuc.txt
#PARAMETER:input_file=infile.sms
#PARAMETER:output_file=filtered.sms
#PARAMETER:minlen=20
#PARAMETER:maxlen=70
#PARAMETER:orphan=2
#PARAMETER:dinuc=dinuc.txt
#PARAMETER:align=P102.fa
#PARAMETER:minscore=4
#PARAMETER:percent_error=100
#PARAMETER:config_file=/gpfs2/bioinf/config/HPDP/hpdp_GL_noHP_config
Flowcell        Channel Position        Camera  In      Length  Ctrl    Qual    Dinuc (BAO)     Dinuc (AT)      Align (P102)    Out
1       1       all     all     4764253 2280473 85      196605  43892   23511   385639  1834048
1       2       all     all     5176371 2468959 84      222945  47220   21266   402100  2013797
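
The statistics file can be read back with a few lines of Python. The following is a minimal sketch, not part of the release: it assumes that the table is tab-delimited (as the column spacing above suggests) and that every header line has the form #KEY=VALUE or #PARAMETER:KEY=VALUE. The function names parse_filter_stats and lost_per_filter are hypothetical.

```python
def parse_filter_stats(text):
    """Split a filter_stats.txt report into metadata, parameters and table rows."""
    meta, params, rows, columns = {}, {}, [], None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#PARAMETER:"):
            key, _, value = line[len("#PARAMETER:"):].partition("=")
            params[key] = value
        elif line.startswith("#"):
            key, _, value = line[1:].partition("=")
            meta[key] = value
        elif columns is None:
            # First non-comment line is the column header (assumed tab-delimited).
            columns = line.split("\t")
        else:
            rows.append(dict(zip(columns, line.split("\t"))))
    return meta, params, rows


def lost_per_filter(row):
    """Return {filter column: reads lost} for one table row."""
    skip = {"Flowcell", "Channel", "Position", "Camera", "In", "Out"}
    return {name: int(value) for name, value in row.items() if name not in skip}


# First data row of the example above, rebuilt with explicit tabs.
SAMPLE = "\n".join([
    "#PROGRAM=filterSMS",
    "#PARAMETER:minlen=20",
    "\t".join(["Flowcell", "Channel", "Position", "Camera", "In", "Length",
               "Ctrl", "Qual", "Dinuc (BAO)", "Dinuc (AT)", "Align (P102)", "Out"]),
    "\t".join("1 1 all all 4764253 2280473 85 196605 43892 23511 385639 1834048".split()),
])

meta, params, rows = parse_filter_stats(SAMPLE)
for row in rows:
    lost = lost_per_filter(row)
    # In the example rows, In minus the per-filter losses equals Out.
    balanced = int(row["In"]) - sum(lost.values()) == int(row["Out"])
    print(row["Channel"], lost, balanced)
```

In both example rows above, the In count minus the sum of the per-filter columns equals the Out count, which makes a convenient sanity check when post-processing these files.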

6.1.5. Comments

  • Filtering order is:

    1. by ROI
    2. by length
    3. by control frames
    4. by template
    5. by quality value
    6. by length after HP trimming
    7. by length after end trimming
    8. by sequence dinucleotide composition
    9. by alignment
  • The ROI filter uses a tab-delimited text file containing the ROI for each camera, for example:

    Camera x0 y0 x1 y1
    0 0 0 1000 1000
    1 0 0 1000 1000
    2 0 0 1000 1000
    3 0 0 1000 1000
  • Head/tail homopolymer trimming is specified with a parameter of the form <base>/[H/T]/<min #>/<thresh>. For example, the parameter T/H/1/0.75 removes the longest prefix (head) of length 1 or more that consists of >75% T’s, while the parameter C/T/2/0.8 removes the longest suffix (tail) of length 2 or more that consists of >80% C’s. The removed prefix (suffix) must end (begin) with the relevant base. More than one trim type can be applied by specifying multiple --trim_hp switches. After trimming, the length filter is reapplied.
  • The dinuc filter uses an input file of the following format (this example is the default file that is included in the release as install/pypeline/config/dinuc.txt):
Filter AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT Thresh
BAO 0 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0 0.7 // Base-addition order
AT 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0.9 // AT-rich

Each row in the file describes a separate filter. The first column is a name used to identify the filter, the next 16 columns give the weight assigned to each of the 16 dinuc combinations, and the last column gives the score threshold of the filter. For example, the BAO (base addition order) filter gives a weight of 1 to each of the dinucs CT, TA, AG and GC, and 0 to all others. A read is filtered out if its score (total weight of all its dinucs divided by the number of dinucs) exceeds the threshold in the last column (0.7 for BAO in the file above).
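
The scoring rule described above can be sketched in Python as follows. This is an illustration only, not the filterSMS implementation: it assumes that dinucs are counted as overlapping pairs along the read, that the file is whitespace-delimited, and that anything after // on a line is a comment. The function names are hypothetical.

```python
from itertools import product

# Column order of the dinuc file: AA, AC, AG, AT, CA, ..., TT.
DINUCS = ["".join(pair) for pair in product("ACGT", repeat=2)]

DINUC_FILE = """\
Filter AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT Thresh
BAO 0 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0 0.7 // Base-addition order
AT 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0.9 // AT-rich
"""


def parse_dinuc_filters(text):
    """Return {filter name: (weights by dinuc, threshold)}."""
    filters = {}
    for line in text.splitlines():
        line = line.split("//")[0].strip()  # assumption: '//' starts a comment
        if not line or line.startswith("Filter"):
            continue
        fields = line.split()
        weights = dict(zip(DINUCS, map(float, fields[1:17])))
        filters[fields[0]] = (weights, float(fields[17]))
    return filters


def dinuc_score(seq, weights):
    """Total weight of all dinucs in seq divided by the number of dinucs."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]  # overlapping pairs (assumed)
    if not pairs:
        return 0.0
    return sum(weights.get(pair, 0.0) for pair in pairs) / len(pairs)


def failed_filters(seq, filters):
    """Names of the filters whose threshold the read exceeds."""
    return [name for name, (weights, thresh) in filters.items()
            if dinuc_score(seq, weights) > thresh]


filters = parse_dinuc_filters(DINUC_FILE)
# Every dinuc in this read is one of CT, TA, AG, GC, so its BAO score is 1.0.
print(failed_filters("CTAGCTAGCT", filters))
print(failed_filters("ACGTTGCA", filters))
```

Under these assumptions the first read exceeds the BAO threshold and would be filtered, while the second passes both filters.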

  • The alignment filter uses an input fasta file containing a set of "junk" sequences to compare against. Any read that aligns to one of these sequences with an nscore greater than the given threshold is filtered out.
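
The homopolymer trimming parameter described in the comments above can be illustrated with a short Python sketch. It is one reading of the stated rule, not the filterSMS code: the longest prefix (or suffix) that is at least <min #> bases long, contains more than <thresh> of the given base, and ends (or begins) with that base is removed. The function names are hypothetical.

```python
def parse_trim_spec(spec):
    """Split a '<base>/[H/T]/<min #>/<thresh>' string into typed fields."""
    base, end, min_len, thresh = spec.split("/")
    if end not in ("H", "T"):
        raise ValueError("second field must be H (head) or T (tail): %r" % spec)
    return base, end, int(min_len), float(thresh)


def trim_head(seq, base, min_len, thresh):
    """Remove the longest qualifying prefix of seq (see lead-in for the rule)."""
    best = 0   # length of the longest qualifying prefix found so far
    count = 0  # occurrences of base in the current prefix
    for length, ch in enumerate(seq, start=1):
        if ch != base:
            continue  # a removed prefix must end with the relevant base
        count += 1
        if length >= min_len and count / length > thresh:
            best = length
    return seq[best:]


def trim_hp(seq, spec):
    """Apply one trim specification to a read."""
    base, end, min_len, thresh = parse_trim_spec(spec)
    if end == "H":
        return trim_head(seq, base, min_len, thresh)
    # A tail trim is a head trim on the reversed read.
    return trim_head(seq[::-1], base, min_len, thresh)[::-1]


print(trim_hp("TTTATTGCA", "T/H/1/0.75"))   # GCA
print(trim_hp("GATTACCGCC", "C/T/2/0.8"))   # GATTACCG
```

In the first call the prefix TTTATT (5 of 6 bases are T) is removed. In the second call only the final CC is removed, because extending the suffix to CCGCC gives exactly 80% C, which does not exceed the 0.8 threshold.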