Chapter 8. Analysis Tools

Table of Contents

8.1. snpSniffer
8.2. coverageTool
8.3. transcriptCount
8.4. binReads

8.1. snpSniffer

8.1.1. Purpose

Detects SNPs from a set of aligned reads stored in a sorted binary alignment file (BAF). BAF files are generated from indexDPgenomic output.

8.1.2. Usage

$ snpSniffer --input_file reads.baf --out_prefix snp .....

8.1.3. Parameters

Generic options:
  --help                Produce help message
Required options:
  --input_file arg      Input sorted BAF file
  --out_prefix arg      Prefix for output
Optional options:
  --reference_file arg             Name of file containing the references that
                                   were used in alignment
  --reference_name arg             Perform SNP analysis on this reference
                                   (default all references)
  --pvalue_threshold arg           Maximum P-value considered significant
                                   (default 1e-6)
  --allele_frequency_threshold arg Flag alleles that have a detected allele
                                   frequency >= this threshold (default 0.2)
  --remove_binary_db arg (=1)      Remove binary files upon exit
  --strand arg (=both)             Perform analysis on a given strand (forward,
                                   reverse or both)
  --min_depth arg (=5)             Minimum depth for analysis
  --by_nuc_rate                    Use per-nucleotide error rates in SNP
                                   analysis (default false)
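
The options above can also be assembled programmatically. The following is a minimal sketch, not part of the tool: the helper name build_snpsniffer_command is hypothetical, and it only uses flags listed in the parameter table. It builds the argument list without running anything.

```python
from typing import List, Optional


def build_snpsniffer_command(
    input_file: str,
    out_prefix: str,
    pvalue_threshold: Optional[float] = None,
    min_depth: Optional[int] = None,
    strand: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for snpSniffer (hypothetical helper).

    Only flags documented in the parameter table are used; options left
    as None are omitted so the tool's own defaults apply.
    """
    if strand is not None and strand not in ("forward", "reverse", "both"):
        raise ValueError("strand must be 'forward', 'reverse' or 'both'")
    cmd = ["snpSniffer", "--input_file", input_file, "--out_prefix", out_prefix]
    if pvalue_threshold is not None:
        cmd += ["--pvalue_threshold", f"{pvalue_threshold:g}"]
    if min_depth is not None:
        cmd += ["--min_depth", str(min_depth)]
    if strand is not None:
        cmd += ["--strand", strand]
    return cmd


command = build_snpsniffer_command(
    "reads.baf", "snp", pvalue_threshold=1e-10, min_depth=20
)
print(" ".join(command))
```

Assuming snpSniffer is installed and on the PATH, the resulting list could be passed to subprocess.run(command, check=True).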

8.1.4. Output

  • snpSniffer generates a table of SNPs named <out_prefix>_SNP.txt. In addition, it generates two files that should be used as inputs to align2viz. The latter generates an HTML file that contains a visualization of the alignments in the neighborhood of each mutation.
  • An example of a SNP table is presented below. It contains 28 columns, so it is split into three parts. The first 10 columns are shown in the first table.

Num  RefName          Start  End    Ref  Type  ModelScore  Shift  Allele1  Allele2
1    supercontig_1.5  881    881    C    SUB   *           0      C        G
2    supercontig_1.5  1584   1584   T    SUB   *           0      C        T
3    supercontig_1.5  1588   1588   T    SUB   *           0      C        *
4    supercontig_1.5  13303  13303  T    SUB   *           0      C        T
5    supercontig_1.5  14640  14640  *    INS   *           -1     A        *
6    supercontig_1.5  14962  14962  G    SUB   *           0      A        G
7    supercontig_1.5  38096  38096  C    SUB   *           0      T        C
8    supercontig_1.5  58137  58137  T    SUB   *           0      A        T
9    supercontig_1.5  58142  58142  C    DEL   *           0      -        *
10   supercontig_1.5  58158  58158  A    SUB   *           0      G        A
11   supercontig_1.5  58823  58824  GG   LEN   0.986       0      GG       G
12   supercontig_1.5  61151  61151  T    SUB   *           0      C        T
13   supercontig_1.5  63110  63110  A    SUB   *           0      G        A
14   supercontig_1.5  72003  72003  G    SUB   *           0      A        *
15   supercontig_1.5  92479  92481  GGG  LEN   0.998       0      GGG      GG

The next 12 columns follow.

Depth  Count1  Count2  Freq1  Freq2  Pvalue1   Pvalue2   Freq A  Freq C  Freq G  Freq T  Freq -
10     5       5       0.5    0.5    5.59e-14  5.59e-14  0       0.5     0.5     0       0
9      3       6       0.33   0.6    3.40e-08  1.38e-17  0       0.33    0       0.66    0
9      7       *       0.77   *      4.39e-21  *         0       0.77    0       0.22    0
18     7       11      0.38   0.61   3.86e-18  1.16e-30  0       0.38    0       0.61    0
20     4       *       0.2    *      2.00e-07  *         0.2     0       0       0       0.8
17     6       10      0.35   0.58   2.02e-15  9.61e-28  0.35    0       0.58    0       0.05
12     3       9       0.25   0.75   8.89e-08  1.47e-26  0       0.75    0       0.25    0
28     9       18      0.32   0.64   4.57e-22  5.86e-50  0.32    0       0       0.64    0.03
34     10      *       0.29   *      9.00e-08  *         0       0.70    0       0       0.29
21     10      11      0.47   0.52   1.73e-26  1.28e-29  0.52    0       0.47    0       0
36     18      15      0.5    0.41   *         4.28e-11  *       *       *       *       *
27     10      15      0.37   0.55   4.14e-25  1.90e-40  0       0.37    0       0.55    0.07
19     4       13      0.21   0.68   1.15e-09  5.45e-37  0.68    0       0.21    0       0.10
7      6       *       0.85   *      1.15e-18  *         0.85    0       0.14    0       0
23     11      12      0.47   0.52   *         2.16e-08  *       *       *       *       *

The last 6 columns follow.

Pvalue A     Pvalue C     Pvalue G     Pvalue T     Pvalue -  Flanking
*            5.59e-14     5.59907e-14  *            *         ACCAAATTCATCATCAATATCAATCCACACATCAATATCAAGCAGCTTACC
*            3.40e-08     *            1.38e-17     *         ACTAGTATACTCTTGATGTTGACAGTAAATTGATCGTCTGATGGTGAGTCT
*            4.39e-21     *            1.96e-05     *         GTATACTCTTGATGTTGACAGTAAATTGATCGTCTGATGGTGAGTCTACCC
*            3.86e-18     *            1.16e-30     *         AGATTGTTAAATGATTGTGTAATTATTACTGATATTTATCCAAAAACAACC
2.0e-07      *            *            *            *         TCCCCAAACACATGGCACCACGGTCTTGGAACATCACTTCGACTGCCATTG
2.0e-15      *            9.61e-28     *            0.43      TATATAGACTTTCTTTGAACAGCGAGCCACAGCTTGGAGACGAAGCTTCAG
*            1.47e-26     *            8.89e-08     *         TGAAAACAGGAATCACAATCCCAATCAGAATTGAAATTGGACACATTAAAG
4.57e-22     *            *            5.86e-50     0.60      GTTTTTCTTTGGAGATAATGAGATGTCAAACTGTCAGGAAGTACTTAATCG
*            9.66e-68     *            *            9.e-08    TCTTTGGAGATAATGAGATGTCAAACTGTCAGGAAGTACTTAATCGCAAAA
1.28884e-29  *            1.73e-26     *            *         GATGTCAAACTGTCAGGAAGTACTTAATCGCAAAATAAGGCGAAGGGAAAC
*            *            *            *            *         GCTGATGCGAGCACAAAGGAGATGTGGAGGGCGTCTGTGTAAAAGAGCCTGT
*            4.14e-25     *            1.90e-40     0.22      AGGAGTACATAGAGCTAGAGCATGGTCGTGATGAACAAGAAAAACTATATG
5.45299e-37  *            1.15e-09     *            0.12      CAACATGCATATAGTCAGGTTAATCAAACAACATATTGCATACAATTGCCA
1.15475e-18  *            0.005        *            *         CTGGAATTTGTATGCCAGCAGACTAGAAACACTGGTTCGAGATTGTTGGCA
*            *            *            *            *         CTGGGTATGTTGTTGCATCAAAAGTGGGTGCTGTTGCTGCAGCTGATAATTCA

The column headings are explained below:

  • Num: mutation number
  • RefName: reference name
  • Start: start position of the mutation
  • End: end position of the mutation
  • Ref: reference nucleotide(s)
  • Type: type of mutation: SUB (substitution), INS (insertion), DEL (deletion) or LEN (homopolymer deletion)
  • ModelScore: defined only for LEN mutations; see the explanation in the Comments section
  • Shift: 0 for substitutions and deletions, negative for insertions. A shift of -1 indicates that the insertion is immediately to the left of the position indicated. A shift of -2 indicates that the insertion is to the left of the first insertion at -1.
  • Allele1/Allele2: the nucleotide composition of each allele
  • Depth: number of reads that span the mutation
  • Count1/Count2: number of reads that carry each of the alleles
  • Freq1/Freq2: frequency of each allele, that is, Count i/Depth where i = 1 or 2
  • Pvalue1/Pvalue2: for each allele, the probability that the mutation occurred by chance due to sequencing errors
  • Freq A/C/G/T/-: count/depth for each nucleotide type
  • Pvalue A/C/G/T/-: p-values for each of the nucleotides
  • Flanking: flanking region of the mutation, with an additional 25 nucleotides on each side
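
The relationship between the Depth, Count and Freq columns can be checked with a short script. This is a sketch under stated assumptions: the on-disk layout of <out_prefix>_SNP.txt is not specified here, so the code assumes a tab-separated file with a header row using the column names above, and it reads a small inline sample (three rows copied from the example tables) rather than a real file. The helper names are hypothetical.

```python
import csv
import io
from typing import Optional


def to_number(field: str) -> Optional[float]:
    """Convert a table field to float; '*' (undefined) becomes None."""
    field = field.strip()
    return None if field == "*" else float(field)


def allele_frequency(count: Optional[float], depth: float) -> Optional[float]:
    """Freq i = Count i / Depth; undefined when the count is undefined."""
    return None if count is None else count / depth


# Assumption: tab-separated values with a header row. Only five of the
# 28 columns are included; the values come from the example tables.
SAMPLE = (
    "Depth\tCount1\tCount2\tFreq1\tFreq2\n"
    "10\t5\t5\t0.5\t0.5\n"
    "18\t7\t11\t0.38\t0.61\n"
    "9\t7\t*\t0.77\t*\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
computed = []
for row in rows:
    depth = to_number(row["Depth"])
    freq1 = allele_frequency(to_number(row["Count1"]), depth)
    freq2 = allele_frequency(to_number(row["Count2"]), depth)
    computed.append((freq1, freq2))
    # Print the recomputed values next to the reported ones.
    print(row["Depth"], freq1, row["Freq1"], freq2, row["Freq2"])
```

In the example tables the reported frequencies appear to be truncated to two decimals rather than rounded (for instance 7/18 is reported as 0.38), so an exact comparison against the recomputed values would need a tolerance.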

8.1.5. Comments

  • It is recommended to consider only positions with a coverage of at least 20.
  • It is recommended to consider only mutations for which the largest allele p-value is 1e-10 or less. Such output can be generated directly by running the tool with the flag --pvalue_threshold 1e-10.
  • Currently snpSniffer uses alignments from both forward and reverse strands to perform the SNP analysis.
  • The BAF (binary alignment file) should be sorted using sortAlign.
  • The shift position is either 0 or negative. For SUB and DEL it is always 0, while for insertions it is negative. In the latter case it indicates how far to the left of a given position the insertion is. So -1 is immediately preceding the position, -2 precedes the first insertion at -1, and so on.
  • Deletions in homopolymers are indicated by the LEN mutation type. This type of mutation is determined by matching the observed distribution of HP (homopolymer) lengths in the reads to a set of possible models. Two alleles are present in the output table when the best matching model is the mixture of the distributions corresponding to the two lengths. The score with which the best matching model matches the data is the model score. The closer it is to one, the better the match. A score above 0.98 is a good score. A score of 0.99 and above is an excellent score. A score below 0.95 is not particularly good. To be confident of a length mutation, it is necessary to have both a good model score and a low p-value.
  • Currently, if one of the alleles in an HP length mutation is the reference allele, no p-value is reported for that allele. This issue will be resolved in a future release.
  • Mutations of length 2 can be explained by snpSniffer in a way that may be non-intuitive. The example below illustrates this situation by first showing a reference and a sample that differ by 2 substitutions. The alignment process favors a representation which contains one insertion and one deletion.
[Figure: alignment ambiguity]
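
The recommendations in this section (coverage of at least 20, largest allele p-value of 1e-10 or less, and a model score above 0.98 for LEN mutations) can be applied to the SNP table after the fact. The following is a minimal sketch, not part of snpSniffer: the function and constant names are hypothetical, each record is assumed to be a dictionary of strings keyed by the column headings, and a p-value reported as '*' is assumed to be skipped rather than treated as a failure. The records are copied from rows 1, 8, 9, 11 and 15 of the example table.

```python
from typing import Dict, Optional

MIN_DEPTH = 20          # recommended minimum coverage
MAX_PVALUE = 1e-10      # recommended bound on the largest allele p-value
MIN_MODEL_SCORE = 0.98  # "good" model score for LEN mutations


def parse_field(field: str) -> Optional[float]:
    """Return the numeric value of a field, or None for '*'."""
    return None if field == "*" else float(field)


def passes_recommendations(record: Dict[str, str]) -> bool:
    """Apply the depth, p-value and model-score recommendations."""
    if int(record["Depth"]) < MIN_DEPTH:
        return False
    pvalues = [parse_field(record[key]) for key in ("Pvalue1", "Pvalue2")]
    # Assumption: unreported ('*') p-values are ignored.
    reported = [p for p in pvalues if p is not None]
    if not reported or max(reported) > MAX_PVALUE:
        return False
    if record["Type"] == "LEN":
        score = parse_field(record["ModelScore"])
        if score is None or score < MIN_MODEL_SCORE:
            return False
    return True


records = [
    {"Num": "1", "Type": "SUB", "ModelScore": "*", "Depth": "10",
     "Pvalue1": "5.59e-14", "Pvalue2": "5.59e-14"},
    {"Num": "8", "Type": "SUB", "ModelScore": "*", "Depth": "28",
     "Pvalue1": "4.57e-22", "Pvalue2": "5.86e-50"},
    {"Num": "9", "Type": "DEL", "ModelScore": "*", "Depth": "34",
     "Pvalue1": "9.00e-08", "Pvalue2": "*"},
    {"Num": "11", "Type": "LEN", "ModelScore": "0.986", "Depth": "36",
     "Pvalue1": "*", "Pvalue2": "4.28e-11"},
    {"Num": "15", "Type": "LEN", "ModelScore": "0.998", "Depth": "23",
     "Pvalue1": "*", "Pvalue2": "2.16e-08"},
]

kept = [r["Num"] for r in records if passes_recommendations(r)]
print(kept)  # prints ['8', '11']
```

Mutation 1 is dropped for low depth, and mutations 9 and 15 are dropped because their reported p-values exceed 1e-10. Whether an unreported p-value should instead disqualify a mutation is a design choice left to the user.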