|Mapping Reads to a Reference|
Mapping reads to a set of reference sequences
Mapping refers to the process of aligning short reads to a reference sequence, whether the reference is a complete genome, transcriptome, or de novo assembly. There are numerous programs that have been developed to map reads to a reference sequence that vary in their algorithms and therefore speed (see Flicek and Birney 2009 and references therein). The program that we utilize in this pipeline is called BWA (Li and Durbin 2009). It uses a Burrow's Wheeler Transform method that results in much faster processing than the first wave of programs that used a hash-based algorithm such as MAQ (Li et al. 2008). The goal of mapping is to create an alignment file also known as a Sequence/Alignment Map (SAM) file for each of your samples. This SAM file will contain one line for each of the reads in your sample denoting the reference sequence (genes, contigs, or gene regions) to which it maps, the position in the reference sequence, and a Phred-scaled quality score of the mapping, among other details (Li et al. 2009). You will use the SAM files for your samples to extract gene expression information (the number of reads that map to each reference sequence) and to identify polymorphisms across your data.
There are several parameters that can be defined for the alignment process, including: the number of differences allowed between reference and query (-n), the number of differences allowed in the seed (-k), the number allowed and penalty for gap openings (-o, -O), and the number and penalty for gap extensions (-e, -E). Changing these parameters will change the number and quality of reads that map to reference and the time it takes to complete mapping a sample. For a complete list of the parameters and their default values go to http://bio-bwa.sourceforge.net/bwa.shtml. To optimize the parameters for your data, i.e. to obtain the highest number of high quality mapped reads, you should generate an evaluation data file (cleaned and trimmed FASTQ file from a single sample) that you can use to run through several permutations of bwa, changing a single parameter each time. You can use the counts script from the gene expression section of this pipeline to see the number and quality of reads mapped with each set of parameters. The number of nucleotide differences (-n) is probably the most important mapping parameter to fine-tune for your data. This should match the expected number of differences between two sequences for your species. n can be an integer (maximum edit distance, in other words the number of changes needed to convert one string to another, e.g. 5) or a fraction of missing alignments given a 2% uniform base error rate (e.g. 0.04). Counter intuitively, the larger the fraction, the fewer mismatches allowed and visa versa.
Terminology: Seed – a stretch of nucleotides within a read that are used to initiate the alignment process. Phred-scale – a number scale often used for quality scores; given a p-value ranging from 0 to 1 (probability of the mapping being incorrect), the Phred score is -10*log10p, rounded to the nearest integer.
In this section, we will map the reads from each of your cleaned and trimmed FASTQ files from Section 1 of this guide to the de novo reference assembly that you created in Section 2 of this guide (or the predicted genes if your study species has a genome sequence). Specifically, we will 1) create an index for the reference assembly (just once), 2) for each sample, map reads to the reference assembly, and 3) convert the resulting file into the SAM file format and append "read group" names to the SAM file for each sample. Steps 2 and 3 are "piped," or put together feeding the output of one program in as the input for the next program. The read groups, which can have the same names as your sample names, will be appended to each file and will become critical for the downstream SNP detection step. The read group name in each SAM file will connect the reads back to individual samples after files have been merged for SNP detection. All of the above steps for all samples can be "batch" processed at once by editing the bash script BWAaln.sh. We then want to remove all duplicate reads, for which we need to use the MarkDuplicates program from the software package "Picard". Picard uses the binary equivalent of SAM files, BAM, as input, so first we need to convert the files using SAMtools. These steps are performed by the convert_to_BAM_and_dedup.sh bash script.
Burrows-Wheeler Aligner (BWA) – available at: http://bio-bwa.sourcefourge.net
1. Open the BWAaln.sh script in TextWrangler using Finder or from Terminal, by moving to the scripts folder and typing:
2. Edit your file names for your reference, samples, and the "read group" names (in pink font in the script if you are using TextWrangler with default settings, these can be the same as your sample names and are important for linking alleles to specific sample files in your downstream SNP analyses). The –I flag to each bwa command tells the program that you are working with Illumina reads with a quality score offset of 64. If you have data with a quality score offset of 33, this flag can be removed. Note also that if you are analyzing Paired-End samples, there are 4 lines per sample, first aligning the singles file and each of the stillpaired files separately, and then combining the paired files into one alignment file (The singles file will be combined with the others in the merging step at the end of this section).
3. Move into your working directory where your reference and sample files are (or specify the complete path to these files in the BWAaln.sh script).
4. You may want to make a smaller evaluation file with which to test alignment parameters, particularly those described in the overview. Type:
Note that the double arrow (">>") will create or append to an existing file that has the same name. It can be safe practice to use ">>" rather than ">" so that you do not overwrite an existing file (but may also result in unintended extra appended data in a file, so always pay attention).
You can examine the effects of the different parameters by using the countxpression.py script described in Section 5 on Gene expression. By changing mapping parameters with your test file, you will see differences in the number of multiply (to many contigs) and singly (to one contig) aligned reads, and in the time required to process the file. In general, you want to maximize the number of reads mapped singly and minimize computing time. You should have at least 50% of your reads mapping to at least one contig (column called FracAligned in output file) and hopefully around 70% of you reads that map should map to a single contig (column called FracSingleOfAligned in output file). But mapping results may be very species specific.
5. Make sure that your input and output file names, as well as the read group headers are correct, then execute the script by typing:
6. When the computer has finished mapping, we want to see what the .sam file format looks like. We can easily view the bottom 50 lines of a file using the very useful bash command "tail". Type:
See the pdf in the link provided in the resources above for details on the SAM file format. Note that the text for each line will wrap around; you can make your Terminal window wider to fit more of read row on a single line. Generally, what you will see is a row for each read and many columns of data associated with each read. The first column is a string of the read name or "query template name", the second column will tell you if the read mapped anywhere (e.g. "4" did not map, "0" or "16" did map), the 14th column will tell you if the read mapped to one or many contigs (if the number after the second colon is "1" then it mapped to one contig, if it is greater than "1" than it mapped to more than one contig). The 3rd column tells the reference contig name to which the read maps (or first mapped). The 4th column tells the first position relative to the reference where the read maps. The 5th column tells the mapping quality and is Phred-scaled (see overview). The 10th column gives the sequence of the read and the 11th column gives the ASCII of the base quality plus 33 (also Phred-scaled as described in the overview).
7. Convert your .sam files to .bam, sort and remove duplicate reads.
Open the convert_to_bam_and_dedup.sh script in TextWrangler and input your sample names as in- and output files (add lines for each additional file), then re-save the script. The convert_to_bam_and_dedup.sh script has 2 elements: 1) It converts the .sam file to a binary bam file and sorts the reads within it. 2) It marks and removes duplicate reads using the MarkDuplicates program from the Picard package.
Run the script by typing:
You have mapped each of the cleaned data files to a reference assembly to generate an alignment file for each sample. You have also removed all duplicate reads from the dataset. You are now ready to move on to gene expression analyses or SNP detection.
De Wit P, Pespeni MH, Ladner JT, Barshis DJ, Seneca F, Jaris H, Overgaard Therkildsen N, Morikawa M and Palumbi SR (2012) The simple fool's guide to population genomics via RNA-Seq: an introduction to high-throughput sequencing data analysis. Molecular Ecology Resources 12, 1058-1067.