Variant (SNP) detection
Different populations of a species are often quite genetically diverse, and this variation can have important implications for the future resilience and adaptive potential of a species. Patterns of genetic diversity in populations also provide a window into the demographic histories of species. Single nucleotide polymorphisms (SNPs) are one of the fundamental types of genetic variation, and with the growing popularity of next-generation sequencing, they are becoming the most ubiquitously used genetic markers in both evolutionary and biomedical research. Levels of polymorphism vary widely among species, but short reads from a single lane of an Illumina sequencer can easily yield the identification and genotyping of thousands of SNPs. A key challenge, however, is separating true polymorphisms from sequencing or alignment artifacts, which requires extensive filtering based on statistical models.
In this protocol, we will walk through the steps necessary to identify good quality SNPs from population-level next-generation sequencing data. We will also introduce several tools that can be used to begin examining patterns of genetic variability within and among populations. First, we will output information on genotypes at variant sites shared by all individuals, in a format usable by Microsoft Excel. While it is typically not practical to work with large datasets in spreadsheets, they can still be useful for visualizing the data and performing simple population genetic calculations. We will also explore the polymorphism data using Principal Components Analysis. This type of analysis does not depend on fixed sequence differences; instead, it can use patterns of allele frequency differences across many loci to detect population structure. Finally, we will use an FST outlier approach to detect variable positions that exhibit greater or lesser than average differentiation between two or more populations. These outlier loci are likely candidates for gene regions that have been affected by selection.
For all the data processing steps within this section, we have chosen to follow the recommendations of the Broad Institute, created for the 1000 Genomes Project:
http://www.broadinstitute.org/gsa/wiki/index.php/Best_Practice_Variant_Detection_with_the_GATK_v2. We highly recommend reading the instructions on this site for more information and updated protocols. The site also hosts an excellent forum for posting technical questions. The only step in their protocol that we do not use is Base Quality Score recalibration, as this step requires a list of known variant sites as input. This protocol focuses on non-model organisms, for which known variant sites are typically unavailable. If you do have access to this information, we recommend following the instructions of the Broad Institute.
There are three major steps to this section of the protocol. First, we need to process our alignment files slightly (section 6a). We start by merging all the deduplicated .bam files from section 4 into one file called merged.bam, which will be our base for SNP discovery. At this step, it is crucial that the "read group" headings for your samples (specified in section 4) are correct, as they will be used to keep track of the samples within the merged .bam file. For simplicity, from this point until the end of section 6b, we have dictated the file names for each step within the bash scripts, so you do not need to change them. We then index our merged .bam file and search through the file for areas containing indels, where the initial mapping might be of poor quality. By using information from all samples in the merged file in a realignment step, we improve our chances of correctly aligning these regions. The steps of section 6a have been combined into two scripts, marked in different colors in the flowchart of section 6a.
The merged realigned .bam file is what we will use in the next step, variant (SNP) detection and filtering (section 6b). An initial search for variant sites outputs a .vcf file, which is a list of all possible variant sites and the genotypes of all individuals at those sites. We then apply a number of filters to this list to separate sequencing and alignment artifacts from true variants, and save only the highest quality variant sites, which we will consider "true" sites for further processing. An additional search for variant sites, now with a lower quality threshold, is then conducted; using our "true" variant sites from the first search, we can build a Gaussian mixture model that separates true variants from false positives based on a log-odds ratio (VQSLOD) of a variant being true vs. being false (http://www.broadinstitute.org/gsa/wiki/index.php/Variant_quality_score_recalibration). As in section 6a, all 8 steps of section 6b have been combined into two scripts, marked in different colors in the flowchart of section 6b.
Following this, we can start to address our biological questions (section 6c). We can extract the genotype information from the .vcf file, while specifying a genotype quality threshold, and use this information to calculate allele and genotype frequencies in Excel. We can conduct PCA and FST outlier analyses, as well as calculate any number of other standard population genetic metrics with software such as Genepop or DNAsp (not covered in this protocol).
The objectives of this section are to 1) merge your alignment files and realign poorly mapped regions, 2) detect variant sites and separate true sites from false positives, 3) extract genotype information for all individuals at all variant sites, 4) calculate allele and genotype frequencies, 5) perform a Principal Components Analysis to find large-scale differences between populations, and 6) perform an FST outlier analysis to find loci potentially under selection.
SAMtools: http://samtools.sourceforge.net/ : a collection of programs to manipulate .sam and .bam files.
Li, H, Handsaker, B, Wysoker, A, Fennell, T, Ruan, J, Homer, N, Marth, G, Abecasis, G, Durbin, R, and 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078-2079.
Genome Analysis Toolkit (GATK): http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit : A collection of programs for searching through and manipulating .bam files; we use it for realigning .bam files, finding and filtering variant sites.
McKenna, A, Hanna, M, Banks, E, Sivachenko, A, Cibulskis, K, Kernytsky, A, Garimella, K, Altshuler, D, Gabriel, S, Daly, M, DePristo, MA. 2010. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research 20: 1297-1303.
DePristo, M, Banks, E, Poplin, R, Garimella, K, Maguire, J, Hartl, C, Philippakis, A, del Angel, G, Rivas, MA, Hanna, M, McKenna, A, Fennell, T, Kernytsky, A, Sivachenko, A, Cibulskis, K, Gabriel, S, Altshuler, D, Daly, M. 2011. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics 43: 491-498.
Smartpca: http://genepath.med.harvard.edu/~reich/Software.htm : Part of the Eigensoft package, smartpca performs principal components analysis on individuals using multi-locus genetic data. It also utilizes Tracy-Widom theory to assign levels of significance to each eigenvector.
Patterson, N, Price, A, Reich, D. 2006. Population structure and eigenanalysis. PLoS Genetics 2: e190.
BayeScan: http://cmpg.unibe.ch/software/bayescan/ : A Bayesian approach to identify loci putatively under selection.
Foll, M, Gaggiotti, OE. 2008. A genome scan method to identify selected loci appropriate for both dominant and codominant markers: A Bayesian perspective. Genetics 180: 977-993.
6a. Alignment processing
1) Merge your deduplicated .bam files into one file and index the merged file.
P: samtools merge and samtools index
I: YOURFILE_dedup.bam files for each individual; rg.txt
a. Create a tab-delimited text file called rg.txt and format it like this (new line for each sample):
@RG ID:READ_GROUP_ID_SPECIFIED_IN_SECTION_4 SM:SAMPLE_NAME PL:Illumina
b. Merge your deduplicated .bam files:
samtools merge -rh rg.txt merged.bam *dedup.bam
c. Index your merged .bam file so that GATK will be able to search through it:
samtools index merged.bam
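If you have many samples, the rg.txt file from step 1a can be generated programmatically rather than typed by hand. A minimal Python sketch, where the read-group IDs and sample names are hypothetical placeholders (the IDs must match those you specified during alignment in section 4):

```python
# Minimal sketch: build the tab-delimited @RG lines for rg.txt.
# Read-group IDs and sample names below are invented placeholders.
samples = {"ind1_lane1": "ind1", "ind2_lane1": "ind2"}  # RG ID -> sample name

lines = ["\t".join(["@RG", "ID:" + rg_id, "SM:" + sm, "PL:Illumina"])
         for rg_id, sm in samples.items()]

with open("rg.txt", "w") as f:
    f.write("\n".join(lines) + "\n")
```

Remember that the fields must be separated by real tab characters, or samtools will not parse the header lines.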
2) Realign around InDels using the GATK.
a. Open the realigner.sh script in TextWrangler and make sure that the path to your reference assembly (created in section 2) is correct, then re-save the script (as file names are already dictated, there is no need to change them).
b. There are two parts to this script: 1) it calls on RealignerTargetCreator to search for poorly mapped regions near indels and saves those intervals in an output.intervals file; 2) it calls on IndelRealigner to realign the intervals identified in part 1 and saves the realigned file as merged_realigned.bam.
c. Execute the script from the Terminal shell:
6b. Variant detection and annotation
1) Detect high quality variant sites.
P: Unified Genotyper, VariantAnnotator, VariantFiltration (GATK)
a. Open the SNP_detection.sh script in TextWrangler and make sure that the path to your reference assembly (from section 2) is correct (as file names are already dictated, there is no need to change them). Re-save the script. The script has six elements: 1) It calls on the Unified Genotyper to confidently call variant sites with a Phred-scaled quality of more than 30 (probability of being a true variant site >0.999); 2) It calls on VariantAnnotator to add annotations to the .vcf file output from step 1 (for more information, see http://www.broadinstitute.org/gsa/wiki/index.php/VariantAnnotator); 3) It calls on the Unified Genotyper to confidently call indel sites with a Phred-scaled quality of more than 30; 4) It filters the .vcf file and adds annotations in the filter column for all sites that are close to indels; 5) It applies a number of other quality filters to the .vcf file and annotates the sites that do not pass them; 6) It saves only the variant sites that have passed all the filters into a new file called highQualSNPS.vcf
b. Execute the script from the Terminal:
c. View the bottom 10 lines of the .vcf file with:
tail -10 highQualSNPS.vcf
and familiarize yourself with the format. For each line, the first nine columns contain information about the variant site (location and quality metrics) as well as a filter flag column. Columns 10 and above contain genotype information for each individual in the dataset. More information about the .vcf file format can be found at: http://www.1000genomes.org/wiki/Analysis/vcf4.0
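To make the column layout concrete, the sketch below parses a single, invented VCF data line in Python, pulling the genotype (GT) and genotype quality (GQ) entries for each individual out of the FORMAT and sample columns (getgenosfromvcf.py presumably does something similar internally):

```python
# Sketch: extract per-individual genotype (GT) and genotype quality (GQ)
# from one VCF data line. The line below is invented for illustration.
vcf_line = ("contig42\t1337\t.\tA\tG\t58.3\tPASS\tDP=88\t"
            "GT:AD:DP:GQ:PL\t0/1:12,9:21:99:210,0,180\t1/1:0,15:15:45:320,45,0")

fields = vcf_line.split("\t")
fmt_keys = fields[8].split(":")       # the FORMAT column, e.g. ['GT','AD','DP','GQ','PL']
genotypes = []
for sample_field in fields[9:]:       # columns 10 and above: one per individual
    values = dict(zip(fmt_keys, sample_field.split(":")))
    genotypes.append((values["GT"], int(values["GQ"])))

print(genotypes)  # [('0/1', 99), ('1/1', 45)]
```

A genotype of 0/1 means heterozygous for the reference (0) and alternate (1) alleles; 1/1 means homozygous alternate.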
2) Low threshold variant detection and Variant Quality Score Recalibration (VQSR)
a. Open the VQSR.sh script in TextWrangler and make sure that the path to your reference assembly (from section 2) is correct. Re-save the script. The script has eight elements: 1) It calls on the Unified Genotyper to call variant sites with a Phred-scaled quality of more than 4; 2) It calls on VariantAnnotator to add annotations to the .vcf file output from step 1; 3) It calls on the Unified Genotyper to call indel sites with a Phred-scaled quality of more than 4; 4) It filters the .vcf file and adds annotations in the filter column for all sites that are close to indels; 5) It applies a number of other quality filters to the .vcf file and annotates the sites that do not pass them; 6) It uses the high quality SNPs found in step 1 above to build a Gaussian mixture model that can accurately distinguish true variant sites from false positives; 7) It applies the model and annotates the variant sites that fail to pass the filter; 8) It saves only the variant sites that have passed the VQSR into a new file called VQSR_PASS_SNPS.vcf
b. Execute the script from the Terminal by typing:
6c. Genotype extraction and data analysis
1) Extract genotypes in variant sites shared by all individuals:
Extract genotypes of all individuals at all variable sites from the .vcf file into a format usable by Microsoft Excel, using a genotype quality threshold.
a. Execute the python script by typing:
getgenosfromvcf.py VQSR_PASS_SNPS.vcf Genotypes.txt rows 20
The final argument specifies a genotype Phred quality score cutoff of 20 (99% probability of being true). This parameter can be changed according to your needs. The "rows" argument specifies that SNPs will be output in rows, with two columns per individual, one for each allele (specifying "cols" would return the same output, but with SNPs as columns, two columns per SNP).
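The Phred scale used for these cutoffs maps a quality score Q to an error probability of 10^(-Q/10). A quick Python check of the two thresholds quoted in this protocol:

```python
import math

def phred_to_error(q):
    """Phred-scaled quality Q -> probability the call is wrong: 10^(-Q/10)."""
    return 10 ** (-q / 10)

# Q20 -> 1% error (99% probability of a correct call);
# Q30 -> 0.1% error (99.9%), matching the thresholds used in this protocol.
assert math.isclose(phred_to_error(20), 0.01)
assert math.isclose(phred_to_error(30), 0.001)
```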
b. Once the script has finished, we can create a new file containing only the variant sites for which we have genotype information for all individuals (all lines that do not contain any "."). We will do this using a combination of the unix commands 'cat' and 'grep':
cat Genotypes.txt | grep -v '\.' > genotypes_shared_by_all.txt
It is often a good idea to focus on the best quality SNPs when working with next-gen sequence data. However, some analyses do not require data for all individuals at every SNP (as you will see later in the FST outlier section).
c. genotypes_shared_by_all.txt can now be opened in Microsoft Excel to visualize the genotype data. Although it is not typically practical to work with next-gen sequencing data using Excel, it can be a useful format to get comfortable with the data as well as to calculate allele and genotype frequencies.
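As an example of the calculations mentioned above, allele and genotype frequencies for a single SNP can be computed directly from the two-columns-per-individual layout. A Python sketch with invented genotypes for five individuals:

```python
from collections import Counter

# Sketch: allele and genotype frequencies for one SNP, given one
# (allele, allele) pair per individual as in Genotypes.txt. Data invented.
genotypes = [("A", "A"), ("A", "G"), ("G", "G"), ("A", "G"), ("A", "A")]

allele_counts = Counter(a for pair in genotypes for a in pair)
n_alleles = sum(allele_counts.values())            # 2N alleles for N diploids
allele_freqs = {a: c / n_alleles for a, c in allele_counts.items()}

geno_counts = Counter("/".join(sorted(pair)) for pair in genotypes)
geno_freqs = {g: c / len(genotypes) for g, c in geno_counts.items()}

print(allele_freqs)  # {'A': 0.6, 'G': 0.4}
print(geno_freqs)    # {'A/A': 0.4, 'A/G': 0.4, 'G/G': 0.2}
```

The same arithmetic is easy to reproduce in Excel with COUNTIF formulas over the two allele columns.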
2) Principal Components Analysis:
a. Create the input files for the smartpca program.
O: PREFIX_Indiv, PREFIX_Geno, PREFIX_SNP, par.PREFIX
This script will convert the genotype data from the .vcf file into the proper format for analysis with smartpca. This comprises four different files: the Indiv file, the Geno file, the SNP file, and the parameter or 'par' file. A prefix, specified on the command line when the program is run, will be added to the names of all output files. In order to focus on high-quality portions of the dataset, this script will only use data from SNPs that have been genotyped in all individuals.
Make sure that you are in the same directory as your files, then execute the script with the following command:
vcf2smartpca.py VQSR_PASS_SNPS.vcf passing_snps_q20 20 popfile \
In this case, the prefix used is 'passing_snps_q20'. This argument can be any arbitrary string of characters, but it is best to make it as informative as possible.
The third argument (20) is an integer specifying a minimum genotype quality cutoff. This is a cut-off for the quality of the individual-level genotypes, as opposed to the quality of the SNP as a whole. Genotypes with qualities below the minimum threshold will be treated as missing data. Therefore, you should try preparing input files at a couple of different quality cut-offs. Choose a cut-off that provides a good number of SNPs (thousands, at least) while cutting out some portion of the worst quality genotypes. After creating the input files, the script will print to the screen the number of SNPs that were included in the input files for the smartpca analysis.
popfile should be a plain text file that relates each of your individuals to its population of origin. It should have one line per individual. Each line should contain two strings. The first should be an individual ID (exactly the same as specified in the .vcf file). The second should be the population that individual was sampled from. The two strings can be separated by any amount of whitespace. The popfile must have an entry for every individual in the .vcf file even if some individuals are omitted from the outfiles (see below).
Following the popfile argument, you can specify the names of individuals that are in your .vcf file but which you do not want to include in the PCA. These names should be identical to the names in the .vcf file and should be separated by whitespace. If no names are specified then all individuals will be included in the PCA.
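The popfile parsing and the optional exclusion of individuals can be pictured as follows. This is a Python sketch with invented individual IDs and population names; vcf2smartpca.py's actual implementation may differ:

```python
# Sketch: read a popfile (one line per individual: ID, then population,
# separated by any amount of whitespace) into a dict, dropping any
# individuals to be excluded from the PCA. All names here are invented.
popfile_text = """ind1   north
ind2   north
ind3   south
ind4   south"""

exclude = {"ind4"}                 # names listed after the popfile argument
pop_of = {}
for line in popfile_text.splitlines():
    indiv, pop = line.split()      # split() handles any run of whitespace
    if indiv not in exclude:
        pop_of[indiv] = pop

print(pop_of)  # {'ind1': 'north', 'ind2': 'north', 'ind3': 'south'}
```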
b. Run the principal components analysis using smartpca:
I: PREFIX_Indiv, PREFIX_Geno, PREFIX_SNP, par.PREFIX
O: PREFIX_#LOCI.eval, PREFIX_#LOCI.evec, PREFIX_snpweights,
Execute the script with the following command:
smartpca -p par.passing_snps_q20 > passing_snps_q20_logfile.txt
There are several runtime parameters that are specified in the 'par' file. The above script automatically sets these parameters to levels that we have found to be useful in our data analysis. However, these parameters can easily be changed prior to running the analysis. All parameters are described in the help pages for smartpca.
This analysis will generate a number of output files. The one we are most interested in has the extension '.evec'. This file contains each individual's loadings on each of the principal components, and it is these loadings that we will plot to visualize the data. Open this eigenvector file in TextWrangler. The top row specifies the amount of variance explained by each of the eigenvectors. This is followed by one row per individual containing that individual's loadings for each principal component.
To easily work with the data in Excel, we need to replace all of the whitespace in the file with tabs. We can do this with a simple find and replace within TextWrangler. To do so, press command + f, search for ' +' (that is a single space character followed by a plus sign) and then replace with '\t' (this is the regular expression for a tab character, make sure that the 'grep' box is checked in the 'find' dialog box). This search and replace, therefore, will replace any continuous stretch of spaces with a single tab character.
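If you prefer to script this step, the same space-run-to-tab replacement can be done with a regular expression. A Python sketch with an invented .evec line:

```python
import re

# Sketch: replace every run of spaces with a single tab, the programmatic
# equivalent of the TextWrangler grep find-and-replace described above.
evec_line = "  ind1     0.0521   -0.0132    0.0044   north"
tabbed = re.sub(" +", "\t", evec_line.strip())

print(tabbed.split("\t"))  # ['ind1', '0.0521', '-0.0132', '0.0044', 'north']
```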
We can then copy and paste the contents of this file into Excel and create scatterplots to visualize the data. It is best to start by comparing principal components 1 and 2, because these explain the largest share of the variance in the data. Plot a different 'series' for each population in the analysis so that differences can be easily visualized. If populations are significantly differentiated, then we expect to see clustering of individuals by their respective populations. This clustering will not necessarily occur on the first several principal components, though, so go ahead and visualize at least the first five or six to see the major trends in the data.
The number of individuals that are included in the analysis determines the number of possible principal components for a data set. However, the default for this script is to only report the top 8 (this can be changed in the 'par' file).
PCA is most useful for getting a general idea of the major trends in the data set. However, it is also possible to look at the loci that explain the most variance along a particular axis (i.e., loci likely to be under selection). This is done by examining the PREFIX_snpweights output file, which contains weights for each SNP along all reported principal component axes. The higher the absolute value of the weight, the stronger the correlation of that SNP with that principal component. Open the snpweights file in TextWrangler and replace all whitespace with tabs (as above). Then copy and paste the data into Excel and find the most strongly weighted SNPs for the first few principal components.
c. Identify the number of significant eigenvectors using twstats:
twstats is a second program that comes in the Eigensoft package. It takes the '.eval' file output during the smartpca analysis and calculates the significance of each principal component. This is a useful tool for determining the number of principal components exhibiting meaningful variation. However, the Tracy-Widom statistics used to determine significance can break down under certain conditions. See Patterson et al. (2006) for more details.
Execute the script with the following command:
twstats -t twtable -i passing_snps_q20_#loci.eval \
twtable is necessary for these calculations. It is distributed along with the Eigensoft package and should be located in your programs folder.
3) FST Outliers:
A targeted approach for detecting local selection is to look for FST outliers. This approach compares the level of differentiation at a given locus to levels of differentiation across the genome (or transcriptome) to determine whether there is evidence of selection. For example, a locus that is significantly more divergent than average has likely been affected by positive or directional selection. Similarly, lower than usual FST suggests either balancing or purifying selection. To conduct an FST outlier analysis we will use the program BayeScan, which uses a Bayesian framework to estimate the probability that each locus has been acted on by selection.
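The quantity being screened can be made concrete with a small calculation. Below is a Python sketch of a simple two-population FST for one biallelic SNP, computed from expected heterozygosities; the allele frequencies are invented, and BayeScan's Bayesian model is considerably more sophisticated than this illustration:

```python
# Sketch: Wright's FST for one biallelic SNP in two populations of equal
# size, FST = (HT - HS) / HT, where HS is the mean within-population
# expected heterozygosity and HT the expected heterozygosity at the
# pooled allele frequency. Frequencies below are invented.
def fst(p1, p2):
    p_bar = (p1 + p2) / 2                                # pooled frequency
    ht = 2 * p_bar * (1 - p_bar)                         # total heterozygosity
    hs = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2     # mean within-pop
    return (ht - hs) / ht if ht > 0 else 0.0

print(round(fst(0.9, 0.1), 3))  # 0.64 -- strongly differentiated locus
print(round(fst(0.5, 0.5), 3))  # 0.0  -- no differentiation
```

An outlier analysis asks whether a locus's FST is extreme relative to the genome-wide distribution of such values.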
a. Create input for BayeScan:
O: bayes_input.txt, snpkey.txt
As input, BayeScan requires allele frequency counts for each SNP in each population. To generate this input file we will use the script make_bayescan_input.py. This script extracts allele count data from a .vcf file. Because this analysis does not require individual level data, it is less critical to limit the analysis to SNPs with full coverage across all individuals. Therefore, by default this script will include any SNP that has high quality genotype information from at least 5 individuals in each population (the number of individuals required per population can be changed with an optional third argument on the command line). This allows us to increase the minimum genotype quality score while still maintaining a large number of SNPs.
Execute the script with the following command:
make_bayescan_input.py VQSR_PASS_SNPS.vcf popfile 20 [# indivs req/pop]
popfile should be a plain text file that relates each of your individuals to its population of origin. It should have one line per individual. Each line should contain two strings. The first should be an individual ID (exactly the same as specified in the .vcf file). The second should be the population that individual was sampled from. The two strings can be separated by any amount of whitespace. The popfile does not need to have an entry for every individual in the .vcf file. Therefore, this script can be used to pull out information from subsets of your data set.
The third argument on the command line (and the last required argument) is an integer specifying the minimum genotype quality for a SNP to be used in the analysis. Depending on the size of the .vcf file, this process may take several minutes. This step will generate two files: 1) bayes_input.txt, which includes the information needed by BayeScan to conduct the analysis, and 2) snpkey.txt, which is a reference for translating the SNP numbers used by BayeScan back to the contig and base pair location of each SNP in the reference contigs. The script will also print to your Terminal window the names of your populations along with the number each one was given in the BayeScan input file. Go ahead and open these two files in TextWrangler to get an idea of their formats. Also, check the number of SNPs that met the specified quality criteria.
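Conceptually, the allele counting that underlies bayes_input.txt looks like the sketch below (invented individuals, populations, and genotypes; the actual script additionally applies the quality and per-population coverage filters described above):

```python
from collections import Counter, defaultdict

# Sketch: per-population allele counts for one SNP, the core of the
# information BayeScan needs. All names and genotypes are invented.
pop_of = {"ind1": "north", "ind2": "north", "ind3": "south", "ind4": "south"}
genotype_of = {"ind1": ("A", "A"), "ind2": ("A", "G"),
               "ind3": ("G", "G"), "ind4": ("A", "G")}

counts = defaultdict(Counter)
for indiv, (a1, a2) in genotype_of.items():
    counts[pop_of[indiv]].update([a1, a2])   # each diploid adds two alleles

print(dict(counts["north"]))  # {'A': 3, 'G': 1}
print(dict(counts["south"]))  # {'G': 3, 'A': 1}
```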
b. Run the analysis for FST outliers using BayeScan:
O: bayes_input_fst.txt, bayes_input.sel, bayes_input_AccRte.txt, bayes_input_Verif.txt
Execute the program with the following command:
The first argument on the command line is the executable program. The exact name may vary depending on your operating system and version.
This analysis will likely take several hours to complete. When it is done, it will have created several output files. The main file of interest is bayes_input_fst.txt. One way to visualize the results is to use an R function included with the BayeScan distribution, in the script plot_R.r. To load this function into R, open the R console and type the following command (assuming your working directory contains the script):
source("plot_R.r")
Then, run the function on the BayeScan results using the following command:
This function will calculate the Posterior Odds (PO) threshold leading to a false discovery rate of no more than 5%. Using this threshold, it will find and list the SNPs that are FST outliers. It will also produce a figure of FST vs. log(PO).
Additionally, you can look at what SNPs are nearly significant by opening the file bayes_input_fst.txt in Excel and sorting based on the second column, which corresponds to the probability that a SNP has been affected by selection. The largest values indicate the SNPs that are most likely to have been affected by selection.
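To make the false discovery rate idea concrete, here is a Python sketch of how a column of posterior probabilities can be thresholded at a 5% FDR: for each candidate cutoff, the expected FDR is the mean of (1 - probability) among the SNPs called significant. The probabilities are invented, and plot_bayescan performs its own, posterior-odds-based calculation:

```python
# Sketch: pick the largest outlier list whose expected FDR stays <= 5%.
# Posterior probabilities of selection below are invented for illustration.
probs = {"snp1": 0.999, "snp2": 0.98, "snp3": 0.97, "snp4": 0.6, "snp5": 0.2}

target_fdr = 0.05
outliers = []
ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
for i, (snp, p) in enumerate(ranked, start=1):
    # Expected FDR if the top i SNPs are declared outliers
    expected_fdr = sum(1 - q for _, q in ranked[:i]) / i
    if expected_fdr <= target_fdr:
        outliers = [s for s, _ in ranked[:i]]

print(outliers)  # ['snp1', 'snp2', 'snp3']
```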
Within this section, we have processed our alignment files from section 4 and created a transcriptome-wide list of variant sites that we are confident in. From this list, we have extracted the variant sites for which we could confidently assign genotypes to all individuals in our dataset, both in a format that we could input into Excel and a format that we could input into a principal components analysis. For an FST outlier analysis, we extracted a somewhat larger dataset of all variant sites for which there were confident genotypes for at least 5 individuals per population.