Chapter 3. Analysis Workflows

Table of Contents

3.1. Mid Run Analysis
3.2. Full Run Oligo Analysis
3.3. Analysis of Biological Sample Channels

Prior to any data analysis, the HeliSphere software should be installed and configured, and install, incoming and reference directories created as described in Section 2.6, “Install, Incoming and Reference Directories”. Analyses should be executed by a user with write permissions on the incoming data repository.

A typical data analysis workflow consists of the following steps:

  1. Download one or more SRF files from the Heliscope into a directory created under the incoming directory.
  2. Convert the SRF file(s) to SMS files.
  3. Run an analysis pipeline on the SMS files.

Some common workflows are described in more detail below.

3.1. Mid Run Analysis

Helicos recommends performing a midrun analysis as a QC step, typically after 60 incorporation cycles. This process analyzes the data for a predetermined sample of positions in each channel. Although this is only a partial analysis, it provides a good indicator of the quality of the final data and may give valuable information about the instrument status. The midrun analysis can also be used to analyze a complete 15 quad run.

  1. Download the midrun SRF file:

    1. Go to the HCC/Instrument Status and verify that the number of cycles completed for both flow cells has reached at least 61.
    2. Gather the information you will need for the download_srf command line. The instrument domain name is the instrument’s numeric IP address or its symbolic host name. The Run Integer ID, Run ID, and flow cell barcodes can typically be copied from the HCC/Run Metrics window and pasted into your command line later. To find the Run Integer ID, go to HCC/Run Metrics/Strand length and click Get SRF URL. The integer is the 1000X number.
    3. Log in to your bioinformatics computer as user helicos.
    4. Open a terminal window.
    5. Go to your system’s incoming directory, e.g.:

                $ cd /home/helicos/data/incoming
    6. Create a directory for the run under incoming and go to it. This is typically named for the date of the run. Substitute the appropriate date.

                $ mkdir 2009-07-22
                $ cd 2009-07-22
    7. Create a midrun directory and go to it.

                $ mkdir midrun
                $ cd midrun
    8. Invoke the download_srf command as follows, substituting the appropriate values for the example values in the {}'s. (There should not be any {}'s in the final command.) The \ at the end of each line tells the Linux shell that the command continues on the next line; you can combine the lines and remove the \'s if you prefer.

                download_srf --instrument_name={e.g. heliscope4} \
                     --instrument_domain_name={e.g. 10.0.1.13 or heliscope4.xxx.edu} \
                     --run_integer_id={e.g. 10048} \
                     --run_id={e.g. 2009-07-22} \
                     --flow_cell_1_barcode={e.g. 0226491251007} \
                     --flow_cell_2_barcode={e.g. 0226491251008} \
                     --intermediate=1 > download.log 2>&1 &
    9. You can use the top command to see a list of running processes. There should be a download_srf process near the top of the list until the download completes. You can exit top at any time by typing q. You can also monitor the download with the ls command:

                $ ls -l
                total 2.0G
                -rw-rw-r-- 1 helicos bioinf 1.8G Jul 22 20:22 Heliscope4.2009-07-22.intermediate.srf

      If you repeat this command periodically, you should see the file size increase. A typical midrun SRF file is about 1.5 GB. Download times can vary from 10 minutes to 2 hours depending on network speed.
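
As an illustration of the substitution described in step 8, the following sketch assembles the download_srf command from shell variables, so that no {}'s can be left behind by accident. The values are the example values shown in step 8 and the variable names are arbitrary; only the option names come from the command above. The script prints the command for review and does not start a download.

```shell
#!/bin/sh
# Sketch: assemble the download_srf command from shell variables.
# Every value below is an example value from step 8; replace with your own.
instrument_name="heliscope4"
instrument_domain_name="10.0.1.13"
run_integer_id="10048"
run_id="2009-07-22"
flow_cell_1_barcode="0226491251007"
flow_cell_2_barcode="0226491251008"

# Store the command as positional parameters so each option stays one word.
set -- download_srf \
    --instrument_name="$instrument_name" \
    --instrument_domain_name="$instrument_domain_name" \
    --run_integer_id="$run_integer_id" \
    --run_id="$run_id" \
    --flow_cell_1_barcode="$flow_cell_1_barcode" \
    --flow_cell_2_barcode="$flow_cell_2_barcode" \
    --intermediate=1

# Print the assembled command for review; nothing is executed here.
download_cmd_line="$*"
printf '%s\n' "$download_cmd_line"

# Once the values look right, the download could be started with:
#   "$@" > download.log 2>&1 &
```

Keeping the values in variables at the top of a script also makes it easy to reuse the same file for the next run by changing only the run ID, run integer ID, and barcodes.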

  2. When the download is complete, run preprocess_srf to convert the SRF file to SMS format as follows:

              $ preprocess_srf --srf_file=Heliscope4.2009-07-22.intermediate.srf --instrument_name=Heliscope4

    Typical preprocessing time is about 10 minutes for an intermediate SRF file containing only QC positions. (Analyzing all positions for a 15Q experiment would take considerably longer.) A message will be printed when the script completes. An SMS file will be created with the same name as the SRF file, but with a .sms extension.

  3. It is a good idea to trim the SMS file to ensure that it contains data only up to cycle 61. Run the following command to do this:

            $ extractSMS --input_file <sms created by preprocess_srf> --output_file <filename for trimmed sms> --max_cycle 61

    Use this trimmed SMS file as input to the analysis as described below.
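
The file names used in steps 2 and 3 can be derived from the SRF file name with shell parameter expansion. The sketch below assumes the example SRF name from step 1; the ".cycle61" suffix for the trimmed file is only a naming assumption for illustration, since any output file name can be used. It prints the extractSMS command with the names filled in and does not run it.

```shell
#!/bin/sh
# Sketch: derive the SMS and trimmed-SMS file names from the SRF file name.
srf_file="Heliscope4.2009-07-22.intermediate.srf"

# preprocess_srf writes an SMS file with the same name but a .sms extension.
sms_file="${srf_file%.srf}.sms"

# Hypothetical naming scheme for the trimmed file (an assumption, not a
# HeliSphere convention): insert ".cycle<N>" before the extension.
max_cycle=61
trimmed_sms_file="${sms_file%.sms}.cycle${max_cycle}.sms"

printf 'SMS file:     %s\n' "$sms_file"
printf 'Trimmed file: %s\n' "$trimmed_sms_file"

# The trim command from step 3 with the names filled in (printed only).
printf 'extractSMS --input_file %s --output_file %s --max_cycle %s\n' \
    "$sms_file" "$trimmed_sms_file" "$max_cycle"
```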

  4. Analyze the data

    1. Copy the sample midrun config file to the midrun directory. (Since it should be your current directory at this point, you can refer to it with ".", meaning the current directory.)

                $ cp $HELICOS_ANALYSIS_HOME/sample/run.midrun.conf .
    2. Edit the file run.midrun.conf in the working directory using a text editor as follows. Where it says

                run = 2009-09-04

      change the date to the name you used for the run folder. You can also restrict the channels list to a subset of the channels. For example, 1:3-5,7,9-15,2:1-4,6,18 would analyze flow cell 1 channels 3, 4, 5, 7, 9, 10, 11, 12, 13, 14, and 15, as well as flow cell 2 channels 1, 2, 3, 4, 6, and 18. The sample conf file specifies the current directory as the output location, so the output directories created by the pipeline are placed directly in the current directory. To place them in a subdirectory, or somewhere else entirely, change that setting to another path.
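
To check a channels list before running the pipeline, it can help to expand it by hand or with a small script. The function below is a sketch of the syntax as inferred from the example above (a "N:" prefix selects flow cell N for the entries that follow, and "a-b" denotes an inclusive range); it is not part of HeliSphere, and the name expand_channels is hypothetical.

```shell
#!/bin/sh
# expand_channels SPEC
# Prints one line per flow cell listing every channel selected by SPEC.
# The syntax rules are inferred from the example in the text, not taken
# from the pipeline's own parser.
expand_channels() {
    spec=$1
    cell=""
    line=""
    # Split the spec on commas into positional parameters.
    old_ifs=$IFS
    IFS=,
    set -- $spec
    IFS=$old_ifs
    for token in "$@"; do
        case $token in
            *:*)
                # "N:..." starts a new flow cell; flush the previous one.
                if [ -n "$cell" ]; then
                    printf 'flow cell %s:%s\n' "$cell" "$line"
                fi
                cell=${token%%:*}
                token=${token#*:}
                line=""
                ;;
        esac
        case $token in
            *-*) first=${token%-*}; last=${token#*-} ;;
            *)   first=$token; last=$token ;;
        esac
        # Append every channel in the inclusive range first..last.
        ch=$first
        while [ "$ch" -le "$last" ]; do
            line="$line $ch"
            ch=$((ch + 1))
        done
    done
    if [ -n "$cell" ]; then
        printf 'flow cell %s:%s\n' "$cell" "$line"
    fi
}

expand_channels "1:3-5,7,9-15,2:1-4,6,18"
```

For the example list, this prints channels 3 4 5 7 9 10 11 12 13 14 15 for flow cell 1 and channels 1 2 3 4 6 18 for flow cell 2, matching the expansion given in the text.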

    3. Be sure to save the conf file after editing to make your changes permanent.
    4. Enter this command to run the midrun pipeline. You may be able to copy and paste it.

                pypeline -p midrun -c run.midrun.conf -j 4

      The -j value should be chosen as described in Section 4.4, “Parallel Processing”.

    5. Typical time to complete is 25-60 minutes.
    6. Once the analysis is finished, ls will show a number of new directories. The reports directory will contain summary reports on channel yields, strand lengths, error rates, etc. The file reports/tsv.txt contains the high-level summary:
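
      | Flowcell | Channel | Initial Strand Density | Post Filter Strand Density | Initial Strands | Post Length Filter | Post Ctrl Filter | Post Qual Filter | Post Dinuc (BAO) Filter | Aligned | Growth | Error Rate | Mean Length | Zero Error Rate | Term Loss Rate | Registration Loss |
      |----------|---------|------------------------|----------------------------|-----------------|--------------------|------------------|------------------|-------------------------|---------|--------|------------|-------------|-----------------|----------------|-------------------|
      | 1 | 1 | 0 | 0 | 9590 | 2527 | 2527 | 2527 | 683 | 13 | 0.222 | NA | 34.8 | 8.3 | 0.0 | 48.1 |
      | 1 | 2 | 0 | 0 | 9964 | 5324 | 5324 | 5324 | 4358 | 3359 | 0.289 | 3.25 | 33.1 | 38.9 | 0.7 | 23.1 |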

See Section 4.5, “oligo pipeline” for an explanation of the columns. This completes the midrun analysis.
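
For a quick look at the summary without opening the whole file, the report can be filtered with awk. The sketch below assumes that reports/tsv.txt is tab-separated with a single header row using the column names shown above; that layout is an assumption based on the file name and should be checked against your own output. The aligned percentage it prints is simple arithmetic for illustration, not a Helicos QC criterion, and the function name summarize_report is hypothetical.

```shell
#!/bin/sh
# summarize_report FILE
# For each data row of a tab-separated report, prints the flow cell,
# channel, initial strands, aligned strands, and aligned strands as a
# percentage of initial strands.
summarize_report() {
    awk -F '\t' '
        NR == 1 {
            # Locate columns by header name rather than by position.
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        {
            initial = $(col["Initial Strands"])
            aligned = $(col["Aligned"])
            pct = (initial > 0) ? 100 * aligned / initial : 0
            printf "%s\t%s\t%d\t%d\t%.1f%%\n", \
                $(col["Flowcell"]), $(col["Channel"]), initial, aligned, pct
        }
    ' "$1"
}

# Example use from the midrun directory:
#   summarize_report reports/tsv.txt
```

With the two example rows above, the aligned percentages would come out at about 0.1% for channel 1 and 33.7% for channel 2.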