Gene expression analyses from RNA-Seq data
In this section we will extract count data - the number of reads that map uniquely to each contig or gene - from the alignment files that you generated for each sample in Section 4. The count data will serve as a proxy for the magnitude of gene expression since transcripts of greater abundance in the cell will have more reads generated from libraries prepared from RNA. Gene expression may vary among samples or individuals in your study due to your experimental design, for example control versus heat shock or high versus low intertidal zones. The biological questions you can address using the analyses below include: How many genes, if any, are differentially expressed between my treatments? Are differentially expressed genes concentrated in specific functional types of genes or are they randomly distributed with respect to function across the transcriptome?
In the analyses described below, we consider only the reads that map to only one place across the entire reference. There are a number of reasons a read may map to multiple reference sequences such as sequence similarity due to gene duplications or homologous regions across gene families, or errors in the generation of the reference sequences considering one gene as two genes. It is not possible to distinguish between the possibilities, so the most conservative approach is to only consider the reads that map to one contig or gene.
The objectives of this section are to 1) convert our deduplicated .bam files back to .sam, 2) extract count data from the mapped reads for each individual (.sam files), 3) make a combined counts data file for all individuals in the gene expression study (a column for each sample and a row for each gene), 4) normalize across individuals and identify significantly differentially expressed genes using the program DESeq implemented in R, and 5) use p-values from DESeq to identify 'transcriptome'-wide patterns of enrichment for functional classes of proteins using the software ErmineJ.
DESeq: Gene expression data analysis program. There is a very useful manual available on the website. http://www-huber.embl.de/users/anders/DESeq/
Anders, S, Huber, W. 2010. Differential expression analysis for sequence count data, Genome Biology 11: R106
The website is very informative. The link below describes the four input file formats.
The website called Quick-R (http://www.statmethods.net/ ) provides basic information for learning how to code in the R environment.
In this pipeline, we use the program DESeq (Anders 2010) to normalize counts and test for differences in gene expression. However, there are many other programs that perform similar functions. (e.g. EdgeR and BaySeq (Hardcastle and Kelly 2010, Robinson et al. 2010)).
Hardcastle, T, Kelly, K. 2010. baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data. BMC bioinformatics 11: 422.
Robinson MD, McCarthy DJ, Smyth GK. 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26: 139.
Wang Z, Gerstein M, Snyder M. 2009. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 10: 57-63.
1) Convert your deduplicated .bam files from section 4 back into their text equivalent, .sam.
a. In Terminal, change your working directory to the folder containing your deduplicated .sam files (cd PATH/TO/FOLDER). Tip: drag and drop folder from finder rather than typing out the full path
b. Convert .bam to .sam using SAMtools:
samtools view –h FILENAME_dedup.bam > FILENAME_dedup.sam
2) Extract count data from the deduplicated .sam file from each individual using the countxpression.py script. Execute the script from the Terminal shell.
I: SAMPLENAME_dedup.sam files for each individual
O: SAMPLENAME_counts.txt file for each individual; counts.txt
a. Execute the countxpression.py script from the Terminal shell.
countxpression.py 20 20 summarystats.txt *.sam
The first argument is the mapping quality threshold, the second argument is a length threshold, summarystats.txt is an output file combining count statistics from all files. The last argument tells the program to count reads from all .sam files in the folder.
b. Take a look at the summary stats output. What fraction of your reads mapped to only one place in the assembly? We have seen that approximately 70% of the reads that map do so to only one reference sequence (column called FracSingleOfAligned).
3) Extract the second column of data (number of unique reads mapped to each contig) from the counts file from each individual.
I: ContigNames.txt, FILENAME_counts.txt files for each individual
Put all your FILENAME_counts.txt files into one folder
Change your working directory to that folder (cd PATH/TO/FOLDER)
Make a file that is a list of all your contig names called ContigNames.txt and the first row called ContigName.
Execute the ParseExpression2BigTable_counts.py script from the Terminal shell. Note that the wildcard "*" will tell the script to process all files in your current working directory whose name ends with "counts.txt".
ParseExpression2BigTable_counts.py ContigNames.txt \ CombinedCounts.txt *counts.txt
Alternatively, if you are only dealing with a few files, you can open each one and copy and paste the second column using Excel. The script has the advantage of grabbing the correct column each time and labeling the column with the input file name.
4) Identify differentially expressed genes using the program DESeq. In this section you will work in the R programming environment executing sequential parts of a script file. It is useful to move through the script a few lines at a time in order to learn about your data and to realize what each step of the script does.
Open R and open the DESeqScript.R script file from inside R.
In the DESeqScript.R script, change the working directory (the drag and drop trick works in R, too), enter your input file name (likely CombinedCounts.txt if you successfully used the script from step 2), enter your conditions (e.g. hot and cold, or whatever your conditions are for each of your samples). There should be a named condition for each of the samples or columns of data in your file, i.e. each condition corresponds to a column of data in your input file (they need to be in the same order).
Run the script through the line head(countsTable) by highlighting the lines and entering apple+return. You should see the first 6 rows for each of your columns in the table populated by your input file.
Run the script through sizeFactors(cds). The calculated size factors are factors such that values in a column can be brought to a common scale by dividing by the corresponding size factor. Ideally you want all of the factors near 1. If you see a factor much less than 1, then there were many fewer singly mapped reads for that sample and likely fewer reads for that sample, and visa versa.
Run the script through head(res). You should see the output columns from the results table.
Run the script through counting the number of significantly differentially expressed genes (the line that begins with the nrow function). You should see how many differentially expressed genes you have in your data set based on these conditions and at the p < 0.05 and p < 0.01 significance levels.
Run a few more lines of script to filter out the contigs where the average number of counts is less than 5. This filter may affect the number and fraction of significantly differentially expressed genes.
Now filter your results excluding contigs that have high variance (down to line ~73). Step through the lines stopping at the head and dim functions to follow how the data are being processed.
Now you can change the file names and generate various output items to be used for further analyses (lines ~76-110).
You can also make a heat map of your significantly differentially expressed genes using the heatmap.2 function from the gplotslibrary in R. You can install gplots from within R: Go to "Packages & Data," "Package Installer," search for gplots, select it and "install dependencies," then click "Install Selected." Note that you need to be connected to the internet. To make the plot you should normalize each gene to itself, i.e. divide the counts for each individual by the average counts across all individuals – essentially relative fold difference. This will make all genes visible on a single plot even if some have two orders of magnitude higher counts, for example. You can make this new data matrix in R or Excel and save it as a .txt file. Now open the R script Heatmapping.R. Run the lines of the script sequentially as you did in the script above. Note that you will need to change the value of n in line 12. You can save the heat map image from Quartz in R though the quality is poor or you can press apple+shift+4 to capture a nice image of your heat map (it will save to the desktop by default). Or you can write a script in R to save a higher quality image.
5) Test for functional enrichment across your gene expression data using ErmineJ and your outputs from DESeq.
ErmineJ requires 4 input files:
GeneScores.txt – usually either p-values or "fold-change," file format must be 1st column ContigName, 2nd column ContigScore, one header line
GeneExpressionProfiles.txt – This is optional but useful for visualizing the data. 1st column – ContigName, subsequent columns are the normalized counts data for each of your samples output from DESeq.
GeneAnnotations.txt – The 1st and 2nd columns are the ContigName, 3rd column is annotation description of the gene (from blastx to nr or uniprot), and the 4th column is a delimited list (comma, space or pipe) of GO identifiers (e.g. GO:00494494,GO:00345678). These identifiers are generated during the annotation process in Section 3.
Gene Ontology XML file – downloaded from the GO consortium website http://archive.geneontology.org/latest-termdb/go_daily-termdb.rdf-xml.gz
An updated link is provided on the ErmineJ website.
Create files 1-3 using Excel from your DESeq outputs and your outputs from the annotations you did in Section 3.
Download the ErmineJ java applet or open it if you downloaded it in the Set up your computer section (http://www.bioinformatics.ubc.ca/ermineJ/webstart/ermineJ.jnlp)
- Start the ErmineJ application;
- Load the GO XML and GeneAnnotation files
The table and tree show two views of the number of genes you have in each functional category.
- Run a Gene score resampling analysis (uses the whole score distribution rather than a cutoff): Analysis -> Run Analysis -> Gene score resampling
- Select your GeneScore and GeneExpressionProfile files
- Click Next, unless you have created custom gene sets to include
- Consider just the Biological Process (at least initially as this is usually most interesting)
- Maximum gene set size: 100, Min: 10, functional categories with more that 100 genes are too general and ubiquitous while categories with less that 10 genes will be statistically inaccurate.
- Take the negative log of gene scores if dealing with p-values; Larger scores are better only if your GeneScores are fold differences (ie. greater differences between treatments are more interesting); choose 10,000 iterations to explore the data and 200,000 iterations and full resampling to publish. Now click "Finish."
- You should see if there are any functional categories that are enriched for differentially expressed genes (after correcting for multiple tests - green). You can 1) explore the results in the table and tree views, 2) double click on a category to see the genes and expression profiles of the gene members of a category, 3) save your results under Analysis -> Save Analysis
NOTE: You can run functional enrichment analyses in ErmineJ with some metric from your SNP data (e.g. FST as the metric in the GeneScore file rather than p-value or fold-change). To do this you need to change your GeneAnnotation file so that it has an entry for each SNP rather than just each gene (because you are likely to have more that one SNP in many of your genes). The GeneExpressionProfile file can have allele frequency data for each sample, however it is not necessary to include these in the analyses. Your GeneScore and GeneExpressionProfile files should also have the same number of rows as you have SNPs.
In this section you went from individual mapped reads files (SAM) to hopefully understanding the variability in gene expression among your samples at the individual gene level and at the broader functional categorization level. You did this by counting the number of reads that mapped uniquely to each contig or gene in your reference, combining those into a single data file, then analyzing those gene expression data for differences among your treatments using DEseq. You then combined your gene and functional annotations from Section 3 with the p-values and/or fold-change in expression you generated using DEseq to identify any transcriptome wide patterns in functional enrichment! Hopefully you have some exciting results!