2.10. Installing References

Helicos makes available a set of commonly used references at ftp://ftp.helicosbio.com/pub/distribution. These include Human.Genome.tar.gz, Human.Txome.tar.gz and others. The first two are the genomic and transcriptomic references for human, respectively. Transcriptome reference distributions include the assignment files needed by the DGE pipeline, so you do not have to generate them yourself.

We recommend installing reference files in a centralized reference repository; see Section 2.6, “Install, Incoming and Reference Directories”. References should be installed by a user with write privileges on this directory. To install a reference, download the desired file to your reference directory and untar it using:

             $ tar xfz <filename>.tar.gz
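The unpacking step above can be wrapped in a small helper that checks the target directory and the archive before extracting. This is a minimal sketch, not part of the Helicos distribution: the function name install_reference and the example path /data/references are assumptions chosen for illustration.

```shell
# Sketch: unpack a downloaded reference distribution into a reference directory.
# install_reference is a hypothetical helper, not a Helicos tool.
install_reference() {
    tarball=$1
    ref_dir=$2
    # Stop early if the reference directory is missing or not writable.
    if [ ! -d "$ref_dir" ] || [ ! -w "$ref_dir" ]; then
        echo "cannot write to reference directory: $ref_dir" >&2
        return 1
    fi
    # Listing the archive first confirms the download is a readable gzip tarball.
    tar tzf "$tarball" > /dev/null || return 1
    # Extract into the reference directory rather than the current directory.
    tar xzf "$tarball" -C "$ref_dir"
}
```

For example, after downloading Human.Genome.tar.gz, a call such as `install_reference Human.Genome.tar.gz /data/references` would unpack it into the (assumed) repository path.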

The contents of each reference distribution will include a FASTA file, plus supporting files such as transcriptome assignment files as appropriate. The FASTA file name typically includes source information, e.g. Homo_sapiens.UCSChg18+rDNA.fasta, which indicates that the sequence is UCSC human genome build hg18. The distribution may also include a symbolic link that provides a more generic name; in this case, Human.Genome.fasta is linked to the versioned file. This convention lets users refer to the Human.Genome reference while keeping the version information available.
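To see which versioned file a generic name points to, you can read the symbolic link. The helper below is an illustrative sketch; the function name resolve_reference is an assumption, and it relies on the readlink utility being available on your system.

```shell
# Sketch: print the file that a generic reference name resolves to.
# resolve_reference is a hypothetical helper, not a Helicos tool.
resolve_reference() {
    if [ -L "$1" ]; then
        # The name is a symbolic link: print its target.
        readlink "$1"
    else
        # The name is not a link, so it is already the real file name.
        echo "$1"
    fi
}
```

Run in the reference directory, `resolve_reference Human.Genome.fasta` would print the versioned file name if the link described above is present.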

Once you have installed a FASTA file, you should index it with preprocessDB. The typical command line is:

             $ preprocessDB --reference_file Human.Genome.fasta --out_prefix Human.Genome.seed18

which will create index files using the default indexing seed size of 18 bases. For large references this can take significant time, e.g. about 8 hours for the human reference.

We also recommend creating a samtools index file for each new reference with the command:

             $ samtools faidx Human.Genome.fasta

The resulting index file (Human.Genome.fasta.fai) is used by align2sam.
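Because samtools faidx writes its index next to the FASTA file with a .fai suffix, a quick check can show which installed references still need one. The following is a sketch only; the function name list_unindexed and the assumption that all references use the .fasta extension are illustrative.

```shell
# Sketch: list FASTA files in a directory that have no samtools index (.fai).
# list_unindexed is a hypothetical helper, not a Helicos or samtools command.
list_unindexed() {
    for fasta in "$1"/*.fasta; do
        # Skip the unexpanded pattern when the directory holds no FASTA files.
        [ -e "$fasta" ] || continue
        # Report the file only if its .fai companion is missing.
        [ -e "$fasta.fai" ] || echo "$fasta"
    done
}
```

Each file it prints can then be indexed with the samtools faidx command shown above.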