Prior to any data analysis, the HeliSphere software should be installed and configured, and install, incoming and reference directories created as described in Section 2.6, “Install, Incoming and Reference Directories”. Analyses should be executed by a user with write permissions on the incoming data repository.
A typical data analysis workflow consists of the following steps:
Some common workflows are described in more detail below.
Helicos recommends performing a midrun analysis as a QC step, typically after 60 incorporation cycles. This process analyzes the data for a predetermined sample of positions in each channel. Although this is only a partial analysis, it provides a good indicator of the quality of the final data and may give valuable information about the instrument status. The midrun analysis can also be used to analyze a complete 15 quad run.
Download Midrun SRF file:
Gather the values needed for the download_srf command line. The instrument domain name is the instrument’s IP address (numeric or symbolic format). The Run Integer ID, Run ID, and flow cell barcodes can typically be copied from the HCC/Run Metrics window and pasted into your command line later. To find the Run Integer ID, go to HCC/Run Metrics/Strand length and click Get SRF URL. The integer is the 1000X number.
helicos.
Go to your system’s incoming directory, e.g.:
$ cd /home/helicos/data/incoming
Create a directory for the run under incoming and go to it. This is typically named for the date of the run. Substitute the appropriate date.
$ mkdir 2009-07-22
$ cd 2009-07-22
Create a midrun directory and go to it.
$ mkdir midrun
$ cd midrun
Invoke the download_srf command as follows, substituting the appropriate values for the example values in the {}. (There should not be any {} in the final command.) The \ at the end of each line tells the Linux shell that the command continues on the next line; you can combine lines and eliminate the \ characters if you prefer.
download_srf --instrument_name={e.g. heliscope4} \
--instrument_domain_name={e.g. 10.0.1.13 or heliscope4.xxx.edu} \
--run_integer_id={e.g. 10048} \
--run_id={e.g. 2009-07-22} \
--flow_cell_1_barcode={e.g. 0226491251007} \
--flow_cell_2_barcode={e.g. 0226491251008} \
--intermediate=1 > download.log 2>&1 &
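As a minimal sketch of the substitution described above, the command can be assembled from positional values and checked for leftover {} placeholders before it is run. The function names build_download_cmd and has_placeholder are hypothetical helpers, not part of HeliSphere; the flags and example values are the ones shown in this section. The script only prints the command for review.

```shell
#!/usr/bin/env bash
# Sketch only: build_download_cmd and has_placeholder are hypothetical helpers.
# The flags and example values are copied from this section of the manual.

# Assemble the download_srf command on one line (no \ continuations needed).
build_download_cmd() {
    local name="$1" domain="$2" run_int="$3" run_id="$4" fc1="$5" fc2="$6"
    printf 'download_srf --instrument_name=%s --instrument_domain_name=%s --run_integer_id=%s --run_id=%s --flow_cell_1_barcode=%s --flow_cell_2_barcode=%s --intermediate=1' \
        "$name" "$domain" "$run_int" "$run_id" "$fc1" "$fc2"
}

# Succeeds (returns 0) if the string still contains a { or } character.
has_placeholder() {
    case "$1" in
        *[{}]*) return 0 ;;
        *) return 1 ;;
    esac
}

cmd=$(build_download_cmd heliscope4 10.0.1.13 10048 2009-07-22 0226491251007 0226491251008)

if has_placeholder "$cmd"; then
    echo "error: unreplaced {} placeholder in command" >&2
else
    # Printed for review; append '> download.log 2>&1 &' when running it for real.
    echo "$cmd"
fi
```

Printing the command first makes it easy to catch a stray space or a missing value before starting a long download.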
You can use the top command to see a list of running jobs. There should be a download_srf job near the top of the list until the download completes. You can exit top at any time by typing q. You can also monitor the download with the ls command:
$ ls -l
total 2.0G
-rw-rw-r-- 1 helicos bioinf 1.8G Jul 22 20:22 Heliscope4.2009-07-22.intermediate.srf
When the download is complete, run preprocess_srf to convert the SRF file to SMS format as follows:
$ preprocess_srf --srf_file=Heliscope4.2009-07-22.intermediate.srf --instrument_name=Heliscope4
Typical preprocessing time is about 10 minutes for an intermediate SRF file containing only QC positions. (Analyzing all positions for a 15Q experiment would take considerably longer.) A message will be printed when the script completes. An SMS file will be created with the same name as the SRF file, but with a .sms extension.
It is a good idea to trim the SMS file so that it contains data only up to cycle 61. Run the following command to do this:
$ extractSMS --input_file <sms created by preprocess_srf> --output_file <filename for trimmed sms> --max_cycle 61
Use this trimmed sms file as input to the analysis as described below.
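The file names in the trimming step can be derived from the SRF file name, as in the sketch below. The helper names sms_name_for and trimmed_name_for are hypothetical, and the ".trimmed.sms" suffix is only an assumed naming convention; any output file name can be used. The script prints the extractSMS command rather than running it.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the ".trimmed.sms" suffix are assumptions.

# preprocess_srf writes an SMS file named after the SRF file, with a .sms extension.
sms_name_for() {
    printf '%s.sms\n' "${1%.srf}"
}

# Hypothetical convention for naming the trimmed file.
trimmed_name_for() {
    printf '%s.trimmed.sms\n' "${1%.sms}"
}

srf=Heliscope4.2009-07-22.intermediate.srf
sms=$(sms_name_for "$srf")
trimmed=$(trimmed_name_for "$sms")

# Print the extractSMS command for review instead of executing it.
echo "extractSMS --input_file $sms --output_file $trimmed --max_cycle 61"
```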
Analyze the data
Copy the sample midrun config file to the midrun directory. (Since it should be your current directory at this point, you can refer to it with ".", meaning the current directory.)
$ cp $HELICOS_ANALYSIS_HOME/sample/run.midrun.conf .
Edit the file run.midrun.conf in the working directory using a text editor as follows. Where it says
run = 2009-09-04
change the date to the name you used for the run folder. If you need to change the channels list to fewer than all of the channels, you can do that as well. E.g., you could do 1:3-5,7,9-15,2:1-4,6,18. This would analyze flow cell 1 channels 3,4,5,7,9,10,11,12,13,14,15, as well as flow cell 2 channels 1,2,3,4,6,18.
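To check a channels list before editing the conf file, it can be expanded into individual flow cell and channel pairs. The sketch below is not part of HeliSphere: expand_channels is a hypothetical helper, and its reading of the syntax (a "N:" prefix selects the flow cell for the entries that follow, and "a-b" is an inclusive range) is inferred only from the example above.

```shell
#!/usr/bin/env bash
# Sketch only: expand_channels is a hypothetical helper. The syntax rules are
# inferred from the example in this section, not from the pipeline source.

expand_channels() {
    local spec="$1" fc="" token lo hi ch
    local IFS=','
    for token in $spec; do
        # A "N:" prefix switches the current flow cell for this and later entries.
        if [[ "$token" == *:* ]]; then
            fc="${token%%:*}"
            token="${token#*:}"
        fi
        # A single channel such as "7" is treated as the range 7-7.
        lo="${token%%-*}"
        hi="${token##*-}"
        for ((ch = lo; ch <= hi; ch++)); do
            echo "$fc:$ch"
        done
    done
}

# Prints one flowcell:channel pair per line for the example list.
expand_channels '1:3-5,7,9-15,2:1-4,6,18'
```

For the example list this prints 17 pairs: 11 channels on flow cell 1 and 6 on flow cell 2, matching the enumeration given above.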
The sample conf file specifies that the output should be written to the current directory, so output directories created by the pipeline will be placed directly in the current directory. You can change that to another path if you want them to be placed in a subdirectory, or somewhere else entirely.
Enter this command to run the midrun pipeline. You may be able to cut & paste it.
pypeline -p midrun -c run.midrun.conf -j 4
The -j value should be chosen as described in Section 4.4, “Parallel Processing”.
ls will show a number of new directories. The reports directory will contain summary reports on channel yields, strand lengths, error rates, etc. The file reports/tsv.txt contains the high-level summary:
| Flowcell | Channel | Initial Strand Density | Post Filter Strand Density | Initial Strands | Post Length Filter | Post Ctrl Filter | Post Qual Filter | Post Dinuc (BAO)Filter | Aligned | Growth | Error Rate | Mean Length | Zero Error Rate | Term Loss Rate | Registration Loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 9590 | 2527 | 2527 | 2527 | 683 | 13 | 0.222 | NA | 34.8 | 8.3 | 0.0 | 48.1 |
| 1 | 2 | 0 | 0 | 9964 | 5324 | 5324 | 5324 | 4358 | 3359 | 0.289 | 3.25 | 33.1 | 38.9 | 0.7 | 23.1 |

See Section 4.5, “oligo pipeline” for an explanation of the columns. This completes the midrun analysis.
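For a quick look at the summary from the command line, a few columns can be pulled out with awk. This is a sketch under assumptions: it supposes that reports/tsv.txt is tab-separated with a header row whose column names match the table above, which should be verified against your own output. The helper name summarize_tsv is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes reports/tsv.txt is tab-separated with a header row
# using the column names shown in the table above.

summarize_tsv() {
    awk -F'\t' '
        # Map each header name to its column position.
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Print flow cell, channel, aligned count, and error rate per row.
        {
            printf "%s:%s aligned=%s error_rate=%s\n",
                $(col["Flowcell"]), $(col["Channel"]),
                $(col["Aligned"]), $(col["Error Rate"])
        }
    ' "$1"
}

# Only run against the report if it exists in the current directory.
if [ -f reports/tsv.txt ]; then
    summarize_tsv reports/tsv.txt
fi
```

Looking columns up by header name rather than by position keeps the sketch working if the column order differs between software versions.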