Simple Fools Guide
 

guide map

Quality Control Processing of Raw Data

Quality control processing of RNA-seq data (FASTQ files)

Overview

Once the sequencing is finished, the data becomes available for download as "fastq" text files, in which each short read takes up four lines. The first line (starting with an @) is a read identifier, the second is the DNA sequence, the third another identifier (same as line 1, but starting with a +(or sometimes only consisting of a +)) and the fourth is a Phred quality score symbol for each base in the read. The quality score is based on the ASCII character code used by computer keyboards (http://www.ascii-code.com/). Illumina's current sequencing pipeline (as of January 2012) uses an offset of 64, so that an @ (ASCII code 64) is 0, and h (ASCII code 104) is 40 (other versions of the pipeline might use different offsets, however. If you have data with a different offset value, you will need to modify your commands accordingly to inform programs that this is the case). The quality score for each base ranges from -5 to 40 and is defined as Qphred =-10 log10(p), where p is the estimated probability of a base call being wrong. So a Qphred of 20 corresponds to a 99 % probability of a correctly identified base. The Illumina sequencing machine produces reads of a predefined length (currently 50 or 101 bases). As the mRNA was fragmented into small pieces before the adapters were ligated, it is possible that partial adapter sequences have been sequenced if any sequenced fragment was shorter than the read length. Also, it is possible that adapter-only sequences have been sequenced.
Before we can use our data to answer any biological questions, we must remove poorly identified bases as well as any adapter sequences from our reads. To evaluate the data set, it is also useful to know what the distribution of quality scores and nucleotides looks like. As the FASTQ files are too large to overview manually, we have to summarize the data and graph it, either by using command-line based software or web server applications. It is also useful to know the fraction of duplicate reads (identical reads present more than once in the dataset) and singletons (reads only present once in the dataset), for which we use command-line tools. There is still debate over whether duplicate reads represent very common transcripts or if they are due to primer or PCR bias, but a large fraction of duplicate reads may be indicative of a poor cDNA library. We have typically seen fractions of duplicate reads of 30-50 %.

In this section we will use bash scripts to process multiple files at once; before executing them we will need to open them and edit input and output filenames, as well as adapter sequences for step 2. Unfortunately, the nucleotide sequences of Illumina's TruSeq adapters are proprietary information, but they can be acquired by emailing Illumina customer support at info@illumina.com.

Objectives

The objectives of this section are to 1) remove all bases with a Phred quality score of less than 20, 2) remove any adapter sequences present in the data, 3) graph the distributions of quality scores and nucleotides, and 4) calculate the fractions of duplicate and singleton reads in the data.

Resources

fastx toolkit: http://hannonlab.cshl.edu/fastx_toolkit/ : A collection of programs for manipulating and examining FASTQ and FASTA files.

Galaxy: http://main.g2.bx.psu.edu/ : A web service performing the same tasks that the fastx toolkit does, and much more.

We will use four programs from the fastx toolkit (fastq_quality_trimmer, fastx_clipper, fastx_quality_stats and fastx_collapser) locally on our computers for most tasks, then upload summary statistics to the web site for plotting. Uploading the raw FASTQ files to Galaxy is not recommended, as they are very large.

flowchart

Process

  1. Download the data, uncompress and rename files.

a.   Find the files corresponding to your individuals on the sequencing center's website, and download them into a new folder on your computer.
b.   Use the sequencing center's notes to rename the files to reflect your sample names.
c.   Uncompress the .tar.gz files by double-clicking on them in the finder window.
d.   Change the file extension of the uncompressed files to .fastq if that is not already the case.
e.   Study the .fastq format by opening a Terminal window, moving into the folder where the files are located with

cd FOLDERNAME

then using the head command to display the top 10 lines of a randomly chosen file:

    head YOURFILE.fastq

  1. Quality trimming and adapter clipping, using the bash script TrimClip.sh

P: fastq_quality_trimmer, fastx_clipper(fastx toolkit)
S: TrimClip.sh
I: YOURFILE.fastq files for each individual
O: YOURFILE_trimmed_clipped.fastq file for each individual

  1. Open the bash script in a text editor with

    edit ~/scripts/TrimClip.sh

The bash script first invokes the quality trimmer, which scans through all reads, and when it encounters a base with a quality score of less than 20, trims off the rest of the read and then subsequently removes reads shorter than 20 bases. A file called YOURFILE_trimmed.fastq is created, which is then used as an input file for the adapter clipper. The clipper removes any read ends that match the defined adapter sequences, and then removes reads that after clipping are shorter than 20 bases.

  1. For each step in the bash script, copy the lines to reflect the number of files (samples) that you are analyzing, change all the input and output file names, as well as the adapter sequences, in the bash script to match your data, then re-save the bash script. In addition, if your quality score offset is not 64, you will need to add a flag to each command that looks like this: -Q QUALITYSCOREOFFSET
  2. Execute the bash script by typing:

    TrimClip.sh

while in the folder containing your data. Make note of how many reads are being trimmed and clipped through the screen output.

3) Calculate the fraction of duplicate and singleton reads, using the bash script CollapseDuplicateCount.sh.

P: fastx_collapser(fastx toolkit), fastqduplicatecounter.py
S: CollapseDuplicateCount.sh
I: YOURFILE_trimmed_clipped.fastq files for each individual
O: YOURFILE_collapsed.txt files for each individual
YOURFILE_duplicatecount.txt files for each individual

  1. Open the bash script in a text editor with

    edit ~/scripts/CollapseDuplicateCount.sh

The bash script first uses fastx_collapser to combine and count all identical reads.  A FASTA-formatted file called YOURFILE_collapsed.txt is created, which is then used as an input file for a python script (fastqduplicatecounter.py) that calculates the fractions of duplicate reads and singletons.

  1. For each step in the bash script, copy the lines to reflect the number of files that you are analyzing, change all the input and output file names in the script to match your data, then re-save the bash script. (As above, you will need to specify the quality score offset to the fastx_collapser commands with the –Q flag if it is not 64.)
  2. Execute the bash script by typing:

    CollapseDuplicateCount.sh

while in the folder containing your data.

  1. Open the files named YOURFILE_duplicatecount.txt and note what percentage of your reads is duplicates, how many different sequences you have (unique reads) and how many singletons there are. Depending on the quality of your initial tissue and sample preparation procedure, the fraction of duplicate reads can be as low as 5-10% or higher than 50 %.

4) Summarize quality score and nucleotide distribution data, and plot using the Galaxy web server.

P: fastx_quality_stats(fastx toolkit)
S: QualityStats.sh
I: YOURFILE_trimmed_clipped.fastq files for each individual
O: YOURFILE_qualstats.txt files for each individual

  1. Open the bash script in a text editor with

    edit ~/scripts/QualityStats.sh

The bash script uses fastq_quality_stats to summarize the data in the FASTQ file by read position (1-50 or 101) into a file named YOURFILE_qualstats.txt

  1. Copy the lines to reflect the number of files that you are analyzing, change all the input and output file names in the script to match your data, then re-save the bash script. (As above, you will need to specify the quality score offset for each of the fastq_quality_stats commands with the –Q flag if it is not 64.)
  2. Execute the bash script by typing:

    QualityStats.sh

while in the folder containing your data.

  1. Visit the Galaxy web server at http://main.g2.bx.psu.edu/. Upload the YOURFILE_qualstats.txt file under "Get Data" in the left panel.
  2. Plot it under NGS: QC and manipulation: Fastx Toolkit for FASTQ data. Choose files to plot under "Draw Quality Score Boxplot" and "Draw Nucleotide Distribution Chart" (you don't have to wait for one plotting job to be done before starting the next one), then save the plots on your computer. They should look something like Figs. 2a-b. If the mean quality scores are low throughout or if the nucleotides are non-randomly distributed, something could have gone wrong during sample preparation or sequencing.

5)  IF you have sequenced some or all of your samples with Paired-End sequencing, you will for these need to sort your two FASTQ files so that reads are in the same order in both files, and so that any reads present in one file but not the other ("orphans") get separated out into a separate file.

P: fastxcombinepairedend.py
S: PECombiner.sh
I: YOURFILE_trimmed_clipped#1_1.fastq (forward)
YOURFILE_trimmed_clipped#1_2.fastq (reverse)
O:   YOURFILE_1_trimmed_clipped_stillpaired.fastq
YOURFILE_2_trimmed_clipped_stillpaired.fastq
YOURFILE_trimmed_clipped_singles.fastq

  1. Open the bash script in a text editor with

    edit ~/scripts/PECombiner.sh

The bash script uses fastxcombinepairedend.py to sort your two FASTQ files so that the reads who are in both files will be in the same order. It also removes all reads only present in one file and saves them in another file. Notice that this script needs a SEQHEADER and a DELIMITER argument in addition to filenames.

  1. In the Terminal, type
  2. head –n 1 YOURFILE_trimmed_clipped.fastq

    (Any of your files)

    to examine the format of the identifier line of your FASTQ files. SEQHEADER will be the first 4 characters of this line (including the @ symbol). DELIMITER will be the character separating the last part of the identifier (that tells the software if the read is a forward or reverse read) from the rest of the identifier. It is usually either a "/" or a " " (space) character.

  3. Enter your SEQHEADER and DELIMITER into the PECombiner.sh bash script, then copy the line to reflect the number of files that you are analyzing, change all the input and output file names in the script to match your data and finally re-save the bash script.

  4. Execute the bash script by typing:

PECombiner.sh

while in the folder containing your data. This will create two FASTQ files with names ending in _stillpaired and one ending in _singles for each Paired-End sample. These will be the files that you will use for downstream analysis (De novo assembly and mapping to reference) for your Paired-End samples.

Summary
We have now cleaned up the raw data, so that they can be used for creating a de novo assembly or for mapping against a reference. Low quality bases and adapter sequences have been removed. We have also verified that the reads are not all identical, which would suggest an error somewhere in the sample preparation pipeline. We have also examined the trimmed dataset to make sure that quality scores are high and that nucleotides are evenly distributed.


Description: H6_QualBox


Figure 2a. Quality score boxplot of 50-bp Illumina reads (after quality trimming, Q>20), summarized by read position. Lower scores in the beginning of the reads is an artifact of the software used to calculate base quality scores.


Description: AB_51NucDist.png

 

Figure 2b. Nucleotide distribution chart of 50-bp Illumina reads, summarized by read position. A non-random distribution in the first 12 bases is common, and is thought to be an artifact of the random hexamer priming during sample preparation.

Tissue   Sequencing   Computer   QC   Assembly   Annotation   Mapping   Expression   SNP

De Wit P, Pespeni MH, Ladner JTBarshis DJ, Seneca F, Jaris H, Overgaard Therkildsen N, Morikawa M and Palumbi SR (2012) The simple fool's guide to population genomics via RNA-Seq: an introduction to high-throughput sequencing data analysis.  Molecular Ecology Resources 12, 1058-1067.