7.11. indexDP

7.11.1. Purpose

indexDP supports fast, mismatch-tolerant, indexed alignment of short reads.

7.11.2. Usage

$ indexDP --reads_file read_file ...
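
The usage line above elides most of the options. As a minimal sketch, the following Python helper assembles a full argument list from the required options listed under Parameters and refuses to build a command if one is missing. The helper name `build_indexdp_command`, the file names, and the numeric values are hypothetical placeholders, not recommended settings; only the option names come from the help text.

```python
import shlex

# Required options, as listed in the indexDP help text.
REQUIRED = (
    "reads_file", "reference_file", "seed_size", "num_errors",
    "weight", "out_prefix", "config_file", "template_repository",
)

def build_indexdp_command(options, flags=()):
    """Return an argv list for indexDP (hypothetical helper, not part of the tool)."""
    missing = [name for name in REQUIRED if name not in options]
    if missing:
        raise ValueError("missing required options: " + ", ".join(missing))
    argv = ["indexDP"]
    for name, value in options.items():
        argv += ["--" + name, str(value)]
    for flag in flags:  # value-less switches such as --best_only
        argv.append("--" + flag)
    return argv

# All values below are illustrative placeholders.
example = build_indexdp_command(
    {
        "reads_file": "reads.fa",
        "reference_file": "reference.fa",
        "seed_size": 18,
        "num_errors": 1,
        "weight": 12,
        "out_prefix": "run1",
        "config_file": "indexDP.conf",
        "template_repository": "templates/",
    },
    flags=["best_only"],
)
print(shlex.join(example))
```

The resulting list can be passed to `subprocess.run` on a system where indexDP is installed.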

7.11.3. Parameters

indexDP version INDEX_DP_v_1.3.082208
Generic options:
  --help                Produce help message

Required options:
  --reads_file arg          Read file name (fasta)
  --reference_file arg      Reference file name (fasta)
  --seed_size arg           Seed size
  --num_errors arg          Number of errors in seed
  --weight arg              Weight of template
  --out_prefix arg          Prefix for output file name
  --config_file arg         Configuration file name
  --template_repository arg Repository for template files

Optional options:
  --read_file_type arg          Type of read file (fasta or sms, default is 
                                fasta)
  --flow_cell arg               Flow cell number (required with sms read file)
  --channel arg                 Channel number (required with sms read file)
  --pass arg                    Pass to be aligned 1, 2 (required with sms read
                                file)
  --num_blocks arg              Number of blocks of reads a channel is 
                                partitioned into (required with sms read file)
  --block_index arg             Index of block of reads to be aligned (required
                                with sms read file)
  --terse_only                  Produce terse only output
  --binary_output               Produce binary output only (default is the 
                                standard text output)
  --best_only                   Print best only match
  --multithread                 Run with multiple threads
  --max_hit_duplication arg     Maximum number of times a seed can align before
                                it is filtered (default 25)
  --percent_error arg           Percent error in read threshold (default 30%)
  --read_step arg               Step between kmers in read (default 1)
  --min_norm_score arg          Min normalized score of alignments to be output
                                (default 0)
  --aligned_files_threshold arg Normalized score threshold for specifying read 
                                as aligned (default 4.0)
  --strands arg                 Strand option for reference: forward/both 
                                (forward)

7.11.4. Output

7.11.4.1. Read Sets

OUTPUT_PREFIX_indexDP_duplicates.sms
File containing reads that had too high a hit rate (based on the --max_hit_duplication parameter).
OUTPUT_PREFIX_indexDP_aligned.sms
File containing reads that had at least one alignment whose normalized score was equal to or greater than the value specified with the --aligned_files_threshold parameter, excluding those reads that had too high a hit rate (based on the --max_hit_duplication parameter).
OUTPUT_PREFIX_indexDP_nonAligned.sms
File containing reads that had no alignments whose normalized score was equal to or greater than the value specified with the --aligned_files_threshold parameter, and which did not exceed the hit rate based on the --max_hit_duplication parameter.
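
The three read sets partition the input reads. A minimal sketch of that rule in Python follows. The function name `classify_read` and the idea of a single per-read hit count compared with `>` are assumptions made for illustration; the tool's internal definition of "too high a hit rate" is not specified here. The defaults mirror the documented defaults of --max_hit_duplication (25) and --aligned_files_threshold (4.0).

```python
def classify_read(best_norm_score, hit_count,
                  max_hit_duplication=25, aligned_files_threshold=4.0):
    """Return the read set a read would fall into (illustrative sketch only).

    best_norm_score: best normalized alignment score, or None if no alignment.
    hit_count: assumed per-read hit count (hypothetical input).
    """
    # Reads exceeding the duplication limit are reported separately,
    # regardless of how well they align.
    if hit_count > max_hit_duplication:
        return "duplicates"
    # At least one alignment scoring at or above the threshold.
    if best_norm_score is not None and best_norm_score >= aligned_files_threshold:
        return "aligned"
    return "nonAligned"
```

For example, `classify_read(4.0, 3)` yields `"aligned"` because the comparison is "equal to or greater than", while `classify_read(5.0, 40)` yields `"duplicates"`.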

7.11.4.2. Alignments

OUTPUT_PREFIX_indexDP_verbose.txt
Verbose human-readable alignment file. Generated by default. Excluded with --terse_only or --binary_output flags.
OUTPUT_PREFIX_indexDP_terse.txt
Terse human-readable alignment file. Generated by default or with the --terse_only flag. Excluded with the --binary_output flag.
OUTPUT_PREFIX_indexDP_binary.bin
Binary alignment file. Included only if --binary_output flag is supplied.
OUTPUT_PREFIX_matches.bin
Binary matches file.
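
The rules above for which alignment files appear can be summarized in a short Python sketch. The function name `alignment_files` is hypothetical, and OUTPUT_PREFIX_matches.bin is left out because the conditions under which it is written are not stated.

```python
def alignment_files(prefix, terse_only=False, binary_output=False):
    """List the alignment files expected for a given flag combination (sketch)."""
    files = []
    # Verbose text: default, dropped by either flag.
    if not terse_only and not binary_output:
        files.append(prefix + "_indexDP_verbose.txt")
    # Terse text: default or --terse_only, dropped by --binary_output.
    if not binary_output:
        files.append(prefix + "_indexDP_terse.txt")
    # Binary: only with --binary_output.
    if binary_output:
        files.append(prefix + "_indexDP_binary.bin")
    return files
```

With no flags the sketch returns both text files; with `binary_output=True` it returns only the binary file.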

7.11.5. Resource utilization

Unlike search and alignment tools with persistent indexes (e.g. BLAST), indexDP holds the read set, the reference set, and the corresponding indexes in RAM. The size of the indexes depends on the read-length and reference-length distributions. As a result, indexDP can be memory-intensive for real-world tasks.

In tests within Helicos, alignment of 10 million reads (~400 MB file) against RefSeq transcripts using the 20:16:2 template family consumed approximately 6 GB of RAM. Systems with 8 GB of RAM per core are recommended. If read sets and reference sets are much larger, jobs can be broken up into smaller read sets and run in parallel, either manually or using DRM software such as Sun Grid Engine.
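
Breaking a fasta read set into smaller sets by hand can be done with a few lines of Python. The sketch below is one possible approach, not a Helicos-supplied utility: the function names are hypothetical, and it distributes reads round-robin so that the chunks end up with nearly equal read counts. Each chunk can then be written to its own file and passed to a separate indexDP job with --reads_file.

```python
def parse_fasta(text):
    """Parse fasta text into a list of (header, sequence) tuples."""
    records, header, seq = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records

def split_records(records, num_chunks):
    """Distribute records round-robin into num_chunks lists."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    chunks = [[] for _ in range(num_chunks)]
    for i, record in enumerate(records):
        chunks[i % num_chunks].append(record)
    return chunks

def to_fasta(records):
    """Serialize (header, sequence) tuples back to fasta text."""
    return "".join(">%s\n%s\n" % (h, s) for h, s in records)

# Tiny illustrative input; real read sets would be read from a file.
sample = ">r1\nACGT\n>r2\nGGCA\nTT\n>r3\nTTAA\n"
chunks = split_records(parse_fasta(sample), 2)
for index, chunk in enumerate(chunks):
    print("chunk", index, "holds", len(chunk), "reads")
```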

Due to caching and other issues, memory use does not necessarily scale linearly with input size. Even for small data sets, approximately 4 GB of RAM should be available.