De novo assembly
RNA-Seq reads represent short pieces of all the mRNA present in the tissue at the time of sampling. To be useful, the reads need to be combined (assembled) into larger fragments, each representing an mRNA transcript. These combined sequences are called "contigs", short for "contiguous sequences". If you are working with an organism for which a genome is available, you can use the gene annotations to pull out the sequences coding for mRNA and use those as the reference for further processing. If you already have such a reference, download it in FASTA format and skip to section 4 (Mapping to reference). If not, you need to create your own catalog of contigs by performing a de novo assembly. A de novo assembly joins overlapping reads into contigs while allowing a user-defined number of mismatches (differences at nucleotide positions that can be due to sequencing error or biological variation).
Figure 3. Example of a contig assembled by the joining of many short reads.
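The idea of joining overlapping reads while tolerating a few mismatches can be illustrated with a small Python sketch. This is a toy greedy merge written for illustration only, not the algorithm used by any particular assembler; the function names, the example reads, and the `min_overlap` and `max_mismatches` values are all assumptions.

```python
def overlap_with_mismatches(left, right, min_overlap=4, max_mismatches=1):
    """Length of the longest suffix of `left` that matches a prefix of
    `right` with at most `max_mismatches` differing positions (0 if none)."""
    longest = min(len(left), len(right))
    for length in range(longest, min_overlap - 1, -1):
        suffix = left[-length:]
        prefix = right[:length]
        mismatches = sum(1 for a, b in zip(suffix, prefix) if a != b)
        if mismatches <= max_mismatches:
            return length
    return 0


def merge_reads(left, right, min_overlap=4, max_mismatches=1):
    """Join two reads on their overlap, or return None if they do not overlap.
    Inside the overlap, the bases of `left` are kept."""
    length = overlap_with_mismatches(left, right, min_overlap, max_mismatches)
    if length == 0:
        return None
    return left + right[length:]


# Hypothetical short reads, listed in the order in which they overlap.
reads = ["ATGGCGTAC", "CGTACCTTA", "CTTAGGA"]

contig = reads[0]
for read in reads[1:]:
    merged = merge_reads(contig, read)
    if merged is not None:
        contig = merged

print(contig)
```

Raising `max_mismatches` lets more divergent reads join the same contig, at the risk of merging reads that come from different transcripts.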
When the lengths and numbers of contigs from de novo assemblies are compared to the predicted number of transcripts from genome projects, the de novo contigs are typically shorter and more numerous. This is because the assembler cannot join contigs unless there is enough overlap and coverage in the reads, so several different contigs often match a single mRNA transcript. Biologically, alternative splicing of transcripts also inflates the number of contigs relative to predictions from genome projects. This is important to keep in mind, especially when analyzing gene expression data based on mapping to a de novo assembly. To minimize this issue, we want to use as many reads as possible in the assembly to maximize the coverage level. The assembler therefore pools the reads from all specified samples, which means that no information about the individual samples can be extracted from the assembly. To get that information, we need to map the reads from each sample individually to the assembly once it has been created (section 4).
Building a de novo assembly is a very memory-intensive process. There are many programs for this, some of which are listed in the Resources section of this chapter. In our experience, the one that can be used most effectively on any fairly new Mac computer is CLC Genomics Workbench, as most others require more RAM than is typically available on personal computers (hundreds of GB, depending on the number of reads). CLC is the only software in this protocol that is not open source (an academic license is currently $4,995), although a free two-week trial version is available. Unlike the other software in this protocol, CLC has a point-and-click graphical user interface and is very easy to use. CLC uses De Bruijn graphs to join reads together. More information about how the assembly algorithm works can be found here: http://www.clcbio.com/files/whitepapers/white_paper_on_de_novo_assembly_on_the_CLC_Assembly_Cell.pdf
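For readers unfamiliar with De Bruijn graphs, the following Python sketch shows the general idea: every read is cut into k-mers, each k-mer becomes an edge between its first and last k-1 bases, and a contig is read off by following unbranched paths. It is a minimal illustration and not CLC's implementation; the function names, the example reads, and the choice of k = 4 are assumptions.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:  # stop on cycles instead of looping forever
            break
        contig += next_node[-1]
        visited.add(next_node)
        node = next_node
    return contig


# Hypothetical overlapping reads.
reads = ["ATGGCGT", "GGCGTAC", "CGTACCT"]
graph = build_de_bruijn(reads, k=4)
print(walk_unambiguous(graph, "ATG"))
```

Real assemblers additionally handle branches, sequencing errors, and reverse complements, which is part of why the process needs so much memory.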
The parameters we use in this protocol have worked well for our data. Nevertheless, it is useful to run several assemblies on your dataset with varying parameter values (especially the mismatch costs) to see how the results differ.
The objectives of this section are to 1) import our reads into CLC, 2) build a de novo assembly, 3) examine the properties of the newly-created assembly, and 4) export our assembly from CLC.
CLC Genomics Workbench: http://www.clcbio.com/index.php?id=1240
Examples of other available software for short read assembly:
Here's also an excellent review describing and contrasting the different software packages in use:
Zhang W, Chen J, Yang Y, Tang Y, Shang J, et al. 2011. A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies. PLoS ONE 6: e17915. doi:10.1371/journal.pone.0017915
1) Import the quality-trimmed, adapter-clipped FASTQ files into CLC.
a. Open CLC
b. Import your _trimmed_clipped.fastq files to CLC:
File -> Import High-Throughput sequencing data -> Illumina
2) Build the de novo assembly.
a. Toolbox -> High-Throughput sequencing -> de novo Assembly
Select all samples.
b. Specify mapping parameters:
Mismatch cost 1. Limit 5. Uncheck "fast ungapped alignment". Insertion and Deletion costs: 2 (no global alignment)
The mismatch cost and limit determine how many nucleotide mismatches are tolerated before reads can no longer be joined. With a mismatch cost of 1 and a limit of 5, up to 5 mismatches are allowed in a read, which is 10% of a 50-base read, or up to 2 indels, as each indel costs 2 penalty units.
Vote for conflict resolution. Ignore non-specific matches.
The former prohibits ambiguities in the contigs, and instead uses the most common nucleotide. The latter option ignores all reads that match to more than one contig. As we cannot know which contig they belong to, it is safest to ignore them.
Minimum contig length 200 bases.
Map reads back to contigs and update contigs based on mapped reads.
This option makes the assembly considerably more time-consuming and can be skipped if you are pressed for time. However, mapping the reads to the assembly one extra time can improve it, and because a high-quality assembly is very important for downstream analysis, we recommend checking this option.
Create summary report and save log.
c. Complete the assembly by clicking "finish".
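The cost arithmetic behind the mismatch settings in step 2b can be written out in a few lines of Python. This sketch assumes that the total penalty (mismatches times mismatch cost plus indels times indel cost) is compared against the limit, and that reads are 50 bases long, as the 10% figure implies; the function names are hypothetical and nothing here calls CLC itself.

```python
def alignment_cost(mismatches, indels, mismatch_cost=1, indel_cost=2):
    """Total penalty for a read, using the costs chosen in step 2b."""
    return mismatches * mismatch_cost + indels * indel_cost


def within_limit(mismatches, indels, limit=5, mismatch_cost=1, indel_cost=2):
    """True if the read's total penalty does not exceed the limit (assumed rule)."""
    return alignment_cost(mismatches, indels, mismatch_cost, indel_cost) <= limit


READ_LENGTH = 50  # assumed read length in bases

print(within_limit(mismatches=5, indels=0))  # 5 mismatches cost 5
print(within_limit(mismatches=0, indels=2))  # 2 indels cost 4
print(within_limit(mismatches=0, indels=3))  # 3 indels cost 6
print(5 / READ_LENGTH)                       # fraction of a read that may differ
```

Changing `mismatch_cost` or `limit` in this sketch is a quick way to see how strict a given parameter combination would be before running a full assembly.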
3) To further study the results of the assembly, create a detailed mapping report.
a. Toolbox -> High-Throughput sequencing -> Create detailed mapping report
b. Study the mapping report, especially the contig length distribution, the proportion of reads used, and the coverage distribution. In most cases there will be a few contigs with high coverage and many with lower coverage. The more reads that are included in the assembly, the longer (and perhaps fewer) the contigs will be, as they will better represent complete mRNA transcripts.
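The contig length distribution can also be summarized outside CLC once the assembly is available as a FASTA file. The Python sketch below counts contig lengths and computes the N50, a commonly used summary statistic (the contig length at which contigs of that length or longer hold at least half of all assembled bases). The function names and the example sequences are hypothetical, and the N50 is offered here as one possible summary, not as part of the CLC report.

```python
import io


def read_fasta_lengths(handle):
    """Return {contig name: sequence length} for a FASTA file handle."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def n50(lengths):
    """Contig length at which the running total of sorted lengths
    (longest first) reaches half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


# Hypothetical miniature assembly; a real file would be opened with open(path).
example = io.StringIO(">contig1\nACGTACGT\n>contig2\nACGTAC\n>contig3\nACGT\n>contig4\nAC\n")
lengths = read_fasta_lengths(example)
values = list(lengths.values())

print("contigs:", len(values))
print("shortest:", min(values), "longest:", max(values))
print("mean length:", sum(values) / len(values))
print("N50:", n50(values))
```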
4) Export your newly created reference assembly in FASTA format, and rename the contigs. In a FASTA file, each sequence has an identifier line, starting with >, followed by the sequence itself. CLC names all contigs with the name of the first input file, plus a number. We want to change the names to something simpler, such as "contig#".
a. Select your de novo assembly in the left panel. File -> Export, choose FASTA format and .fasta as file extension, save in the folder containing your project.
b. Now, open your .fasta reference assembly in TextWrangler, and Find-Replace the contig names with something simpler. Make sure that the contig numbers remain, to keep each contig identifiable.
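As an alternative to Find-Replace in a text editor, the renaming can be scripted. The Python sketch below rewrites each identifier line to "contig" plus the number found at the end of the original name, so the contig numbers are preserved. It assumes that every CLC header ends in the contig number; the example header format and the function names are hypothetical, and headers without a trailing number are left unchanged.

```python
import io
import re

# Digits at the very end of the identifier line (assumed to be the contig number).
_TRAILING_NUMBER = re.compile(r"(\d+)\s*$")


def rename_header(header, prefix="contig"):
    """Turn '>long_name_12' into '>contig12'; leave other headers unchanged."""
    match = _TRAILING_NUMBER.search(header)
    if match is None:
        return header
    return f">{prefix}{match.group(1)}"


def rename_fasta(in_handle, out_handle, prefix="contig"):
    """Copy a FASTA file, simplifying identifier lines and keeping sequences as-is."""
    renamed = 0
    for line in in_handle:
        if line.startswith(">"):
            out_handle.write(rename_header(line.rstrip("\n"), prefix) + "\n")
            renamed += 1
        else:
            out_handle.write(line)
    return renamed


# Hypothetical exported assembly; real files would be opened with open(path).
source = io.StringIO(
    ">sampleA_trimmed_clipped_contig_1\nACGTACGT\n"
    ">sampleA_trimmed_clipped_contig_2\nGGCATT\n"
)
result = io.StringIO()
count = rename_fasta(source, result)
print(count, "headers processed")
print(result.getvalue())
```

Writing to a new file instead of overwriting the export keeps the original CLC names available in case a contig needs to be traced back.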
We have now created a de novo assembly, which we will use as a reference for downstream analysis. The assembly only contains information about contig sequences, and no information about how many reads were used to create them or what samples they came from. The assembly is a proxy for a library of all mRNA transcripts present in the tissue at the time of sampling, although several contigs could belong to different parts of the same mRNA molecule. In the assembly, we allowed for 5 mismatches in any one read (about 10%), ignored reads that matched to more than one contig, and set a minimum contig length of 200 bases.