BLAST comparison to known sequence databases and functional annotation
Establishing links between observed sequence variation and gene function is a major challenge when analyzing transcriptome data from non-model organisms. Here, the Basic Local Alignment Search Tool (BLAST) is used to compare your de novo assembled contigs to sequence databases in order to annotate them with similarity to known genes/proteins/functions. BLAST is a toolkit developed by the National Center for Biotechnology Information (NCBI), the US-based organization responsible for archiving and databasing the world's genetic sequence information.
By querying three major databases, GenBank's non-redundant protein database (NR) and Uniprot's Swiss-Prot and TrEMBL protein databases, we will identify the most similar known sequences for each of our contigs. The available information about these matching sequences will be used to annotate our contigs with likely functional properties. For the significant matches in the database, we will extract both gene names, general descriptions, and Gene Ontology (GO) categories (specific categorical classifications grouping genes based on cellular and molecular function, e.g. "cellular response to protein unfolding" or "calcium homeostasis"), along with additional information from the Uniprot knowledge database. GO categories will also be used in step 5 for the functional enrichment analysis.
|A note on E-values
To determine whether matches to the databases are "significant", we use a threshold E-value. The E-value describes the number of hits one can expect to see by chance when searching a database of a particular size. The lower the E-value, the more "significant" a match to a database sequence is (i.e. there is a smaller probability of finding a match just by chance). However, the quality of a match also depends on the length of the alignment and the percentage similarity, so these statistics may also be considered when evaluating the significance of a match. In the following steps there will be two thresholds employed: the first is a liberal threshold that is used during the BLAST search itself (generally 0.001) which serves to eliminate very bad matches from the raw BLAST output, the second is a more stringent threshold to identify very good matches (generally 1E-4 or below) that is used to parse out the top "hits" from the raw BLAST output for further annotation. Neither of these thresholds are set in stone; however, an E-value of 1E-4 or below seems to be commonly acceptable in the literature for diagnosing a "good" match.
The objectives of this section are to 1) retrieve the best matches in the NR and Uniprot databases for our assembled contigs, and 2) create a metatable that summarizes the available annotation information for each contig.
BLAST: Basic Local Alignment Search Tool.
These tools should have been installed on your computer during the "How to set up your computer section"
The BLAST executables comprise multiple programs to compare a set of query sequences (in the FASTA file format) to a pre-formatted database on you computer. Here we will use the blastx tool that searches a nucleotide query, dynamically translated in all six reading frames, against a protein database.
A wide variety of searchable databases are publicly available on the web, but here we will limit our search to three major repositories:
- NR: NCBI's non-redundant protein sequence database
- The Uniprot protein knowledgebase, which consists of two sections:
- Swiss-Prot, which is manually annotated and reviewed.
- TrEMBL, which is automatically annotated and is not reviewed.
Generally, Swiss-Prot should provide the most conservative, verified, and up-to-date sequence information and functional role identification. If a given sequence finds no significant match in Swiss-Prot, then the TrEMBL and NR hits may be used to infer functional annotation, but entries in these more broad databases will usually be less informative.
For more information on GenBank and Uniprot databases, see
A note about BLASTing against very large databases
BLAST comparisons against very large databases (e.g. NR or TrEMBL) can often take ~30sec to 1min per query sequence. When you have 100,000 query sequences, this can add up to multiple days, weeks or even months of run-time, even when you employ the multiple-core option of BLAST. An alternative approach is to use a computer cluster to run multiple BLAST jobs at the same time. For example, say we have a de novo assembly of 100,000 contigs. If we run 1 BLAST job against NR it could take as long as 50,000 minutes/35 days!! (30sec/query sequence), however if we split this job into subsets of 5,000 sequences and ran 20 jobs in "parallel" on a cluster, our total run-time is reduced to only 41 hours. There are commercially available clusters that have a fee-for computation type structure (e.g. the amazon cloud http://aws.amazon.com/ec2/) and many universities have clusters available in-house as well. You'll need a simple Terminal-based capacity and may need to install the BLAST toolkit (just like you did on your desktop) onto the cluster computer. Alternatively, there are a few web-based BLAST services that provide parallel BLASTing for free for users in academia (e.g. the University of Oslo bioportal http://www.bioportal.uio.no/) and can generally run a large BLAST to NR in a couple of weeks. Smaller BLAST searches are most efficient when run locally on your desktop.
De novo assemblies generated with the test files (containing 458 contigs of mean length 421 bp) supplied with this guide should take ~20 min to BLAST against Swiss-Prot and about 10 hours for TrEMBL and NR when run on a standard desktop Mac.
Part 3.1: BLAST contigs against the three databases
- Download and format the databases.
- Create a folder for each of your reference databases. You can either run all your BLAST jobs from within these folders (which will require you to copy all your query files into the same folder) or you can specify the full path to these databases in every run.
- Download the the NR database in .fasta format from ftp://ftp.ncbi.nih.gov/blast/db/FASTA/ and the Swiss-Prot and TrEMBL databases in .fasta format from http://www.uniprot.org/downloads (note that the uncompressed NR and TrEMBL databases are very large (7-8 GB) and that BLAST searches to these take a long time. See Box 3.1 for alternative ways to process large BLAST jobs).
- In Terminal, change your working directory to the folder where you saved the three .fasta database files (cd FOLDER PATH) and format them into BLAST protein databases using the program makeblastdb by typing:
makeblastdb -in DATABASENAME.fasta -dbtype prot
Replace the words in capital letters with the relevant database names. The –in parameter specifies the name of the .fasta file that contains the sequences you want to format, e.g. uniprot_trembl.fasta; -dbtype specifies whether you are dealing with a nucleotide or protein database (here we only work with protein databases); and -out specifies the name you want to use for your formatted database.
2) Familiarize yourself with the Blastx tool.
The -query is the input file name, e.g. testfile_assembly.fasta; -db is the name you gave to your formatted BLAST database in step 1, e.g. NR (note that you should not specify the extensions for the formatted database file names (.phr, .pin and .psq) only the given name of the file); -out is the name you want for your output file, e.g. testfile_blastx2NR; -outfmt specifies the format for how the program outputs the results (for further processing in this pipeline we need option 5, the xml format; –evalue is the expectation value (E) threshold for saving hits (see the description in the beginning of this chapter; your choice should depend on how stringently you want to exclude potential chance matches); -gapopen, –gapextend, word_size, and –matrix specify parameters for the alignment algorithm (see http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml for details); -num_descriptions and -num_alignments are the number of matching sequences in the database to show one-line descriptions and alignments for, respectively (you can choose any number up to 500 and 250, respectively, but in part 3 here we will only extract the top hit); -num_threads is the number of threads the BLAST search should use (depends on how many processors/cores your computer runs on). In the Terminal, the command uname –a will indicate the number of threads at your disposal.
For many of the parameters, we use the default values from the NCBI web-based BLAST searches. However, to keep track of analysis settings, we have found it helpful to specify these parameters with every search as different versions of the stand-alone BLAST toolkit may have different default parameters. Feel free to modify these, but the parameters specified here should provide a good starting point.
Blast the contigs from your de novo assembly against Swissprot and Trembl.
- Follow the instructions in step 3, but this time selecting the formatted Swissprot and Trembl databases (one at the time) and choosing output format 7 which corresponds to a tabular format that we will need for further processing:
blastx -query ASSEMBLY.fasta -db DBNAME \
-out ASSEMBLY_BLASTX2DBNAME -outfmt 7 -evalue 0.0001 \
-gapopen 11 -gapextend 1 -word_size 3 -matrix BLOSUM62 \
-num_descriptions 20 -num_alignments 20 -num_threads 4
Parse the results from your BLAST against NR
In Terminal, make sure you are working from the directory that contains the .xml formatted output from your BLAST against NR
- Parse the output to tabular format (one line for each hit) with the script parse_blast.py. Replace and names in capital letters with your own file names and type in Terminal:
parse_blast.py YOURASSEMBLY_PARSEDblastx2nr.txt \ YOURASSEMBLY_blastx2nr
where the first argument following the script name is the name you want for the parsed output file and the second argument is the name of your .xml-formatted BLAST output. This generates a list for all the hits found for each contig (given the restrictions you specified in step 1). On rare occasion, the parse_blast.py script may not function correctly. If this is the case we have included an alternative parsing script for backup use (BlastParse.pl). Usage for this script is as follows:
BlastParse.pl YOURASSEMBLY_blastx2nr.xml >> \ YOUR_ASSEMBLY_PARSEDblastx2nr.txt
Part 3.2: Annotation of sequences and metatable generation
In part 1, you identified significant matches between your contig sequences and known proteins in the nr and Uniprot databases. In this part, we have developed an automated pipeline which will: (a) combine the blast results from the three databases, (b) download the associated Uniprot flatfiles from the uniprot.org website (you will need an active internet connection for this script to work), (c) extract additional information about each of these matching proteins, including a description of their function and their associated GO categories, and (d) combine all available annotation information into a master annotation metatable.
Make sure that your query sequence .fasta file (your de novo assembly) and the output files for the Swissprot, TrEMBL, and parsed NR BLAST searches are in the same folder and make this your working directory in Terminal.
- In this step, we have automated a lot of different processes, thus, pay careful attention to the specific order and the specifications for each input file. You will need to generate a short companion file called "nrcolumnheadersandbadwords.txt" to specify the names of the description and evalue columns generated during parsing and a customizable list of "bad words" that you can exclude during final combining of the nr results. Due to the size of the nr database, many top hits for non-model systems will be uninformative (e.g. "predicted protein"), so we have built in a second column in the meta table that includes a lower hit when possible for these contigs that excludes the "bad words". Execute the totalannotation.py script in Terminal by replacing the words in capitals with your own file names and typing:
totalannotation.py YOURASSEMBLY.fasta \ YOURASSEMBLY_Parsedblastx2nr.txt \ nrcolumnheadersandbadwords.txt \ YOURASSEMBLY_blastx2sprot.txt \
1E-4(or other evalue cutoff) UniProt_flatfiles \ YOURASSEMBLY_annotated.txt
YOURASSEMBLY.fasta is your de novo assembly contig file, e.g. testfile_assembly.fasta; YOURASSEMBLY_Parsedblastx2nr.txt is the parsed nr .xml file with each hit on one line; nrcolumnheadersandbadwords.txt are the column names corresponding to the match description and evalue on one line (e.g. "HitDescription" and "eValue") and a list of "bad words" that you want to skip over to find an informative nr match (e.g. "hypothetical", "Predicted", "unknown") all separated by tabs; YOURASSEMBLY_blastx2sprot.txt is the output file from the BLAST of your contigs against Swiss-Prot, e.g. testfile_blastx2sprot; YOURASSEMBLY_blastx2trembl.txt is the .txt output file from the BLAST of your contigs against TrEMBL, e.g. testfile_blastx2trembl ; 1E-4 is the threshold specifying that only BLAST hits with an E-value below this will be annotated; UniProt_flatfiles is the name for the folder that will contain the flat files for the matches in Swiss-Prot and TrEMBL; -and YOURASSEMBLY_annotated is the name you want for your output file.
You should now have a complete metatable summarizing annotation information for those of your contigs that matched known proteins in the queried databases. This provides a convenient overview of the top hits from the different databases, summarizes functional annotation information, and generates some of the input required in subsequent analyses (e.g. the GO functional enrichment analysis in step 5). While this annotation table is a good summary, since many contigs can have multiple BLAST matches, it can also be useful to look at the full list of BLAST results for certain contigs. Additionally, if a particular contig of interest does not result in any siginificant matches in the annotation process, hand-curated BLAST-ing via the NCBI web-interface may also provide additional sequence identification information.