Simple Fools Guide
 


Computer Setup

How to set up your computer to run smoothly through this protocol

Overview
Stop! Do not go any further before you have set up your computer. This will allow you to move through the pipeline without stuttering the whole way trying to install programs, packages and dependencies. It may take a day, but just commit yourself to it!

In this text, we are walking through the installation process on a new Mac with an Intel processor and the Lion (OS X version 10.7) or Snow Leopard (OS X version 10.6) operating system. This does not mean that a Mac is required, but Mac OS X is currently one of the most commonly used operating systems for bioinformatics, along with various versions of the Linux operating system family.

See Software Table - a list of all software used throughout this protocol

A note on operating systems

Mac OS X and Linux are both rooted in an operating system called "UNIX", which has a long-standing tradition of open source. This means that many software packages can be compiled on Mac OS X, Linux, and BSD systems alike. Linux can be a very useful operating system, as it is free in most of its versions, and works on both Macs and PCs. It can nevertheless be perceived as having a steep learning curve. We have used Ubuntu Linux in our lab, as it is easy to install and use both on a PC running Windows and on a Mac. With minor modifications (see Appendix 1 of Haddock and Dunn (2011) for useful tips), it is possible to run this pipeline on a Windows PC (without installing Linux). This requires that a "UNIX environment portal" is installed. We have tried Cygwin, which we found to work well. The present protocol, however, assumes that the reader is using a relatively new Mac with an Intel processor and Mac OS X "Lion" or "Snow Leopard" installed.

There is often more than one way of installing a program on your computer. On Macs, the simplest is usually to download an installer .dmg package, which installs itself when you click on it. Sometimes, however, that is not available, making it necessary to download executable files and copy them manually to your hard drive. In some cases, you might even have to download the source code for programs and build them on your computer, for which you'll need special compiler software. The advantage of compiling your own software is that the software will be "custom-built" for your computer setup, and will run more smoothly than a version compiled somewhere else might. A list of all the software discussed in this chapter, along with information on where to find it, can be found in Table 1.

Before getting started with the process below, we highly recommend reading through chapters 4, 5, and 6 from Haddock and Dunn's (2011) "Practical Computing for Biologists" to gain comfort and familiarity with how UNIX-based operating systems such as Mac OS X and Linux are organized and how they process information. This reading will also make you comfortable with working through the command line in the Terminal window environment, which is essential to most of this protocol. Terminal is an application that allows you to interact with your computer through the command line. Using the command line, you can navigate around your computer as you do in Finder. You can open and manage files and folders and execute programs. The default shell, the program that displays the command line (the prompt and cursor), in Mac OS X is called "bash". A short summary of some of the most useful bash commands can be found at the end of this section.

When naming files and folders, there are a few simplifying rules to follow. Avoid spaces and special characters. Make file names as informative as possible without making them too long; abbreviations help. If you are planning to run through the protocol several times, it is helpful to add the date and time to a file name (do not just call the file "new"). If you are planning to share files, adding your initials at the end can also be useful. Note also that many of the files generated in this protocol are huge, so make sure that there is enough hard-drive space before starting.
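As a small illustration of these naming rules, the sketch below builds a file name from a short description, a date-and-time stamp, and initials, and then reports free disk space. The description "trimmed_reads", the initials "PDW", and the .fastq extension are placeholders chosen for this example, not names used by the protocol.

```shell
#!/bin/bash
# Hypothetical values for illustration only; substitute your own.
sample="trimmed_reads"
initials="PDW"

# Date and time stamp without spaces or special characters (YYYYMMDD_HHMM).
stamp=$(date +%Y%m%d_%H%M)

outfile="${sample}_${stamp}_${initials}.fastq"
echo "$outfile"

# Show free space on the disk that holds the current directory.
df -h .
```

Putting the year first in the stamp means that an alphabetical file listing is also a chronological one.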

Throughout this protocol, we refer to "bash scripts", "scripts" and "programs". A bash script consists of a list of commands that you could just as well type directly into the Terminal. The advantage of using a bash script is that it facilitates the "batch" processing of many different files at the same time, since you can execute a series of commands one right after another, without having to enter each one manually. A bash script normally starts with #!/bin/bash and the filename extension is .sh. The bash scripts used in this protocol were written by us, and in general you will have to open them in a text editor and modify input and output file names. "Scripts" are programs that we have written in an interpreted (non-compiled) high-level programming language, such as Perl (.pl), Python (.py) or R (.r). You will not need to open the scripts used in this protocol unless you want to study exactly how they work or modify them in some form. Within most bash scripts and scripts there are lines that begin with #; these are comment lines for your benefit and are not used by the computer. Reading these lines will help you understand what the script does. "Programs", as used herein, refers to executable files that have been precompiled to binary code, and thus cannot be opened in a text editor in any meaningful way. The programs used here were not written by us, nor should you try to change any of their contents; how to acquire them is the focus of the rest of this section.
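To make the idea of batch processing concrete, here is a minimal sketch of a bash script that runs the same command on several files in turn. It is not one of the scripts from the protocol: the folder, the file names, and the line-counting step are stand-ins, and the input files are created by the script itself so that it can be run as-is.

```shell
#!/bin/bash
# Minimal batch-processing sketch. The folder and file names are hypothetical
# and are created here only so that the example runs on its own.
workdir=$(mktemp -d)
printf 'line1\nline2\n' > "$workdir/sampleA.txt"
printf 'line1\nline2\nline3\n' > "$workdir/sampleB.txt"

# Run the same command on every matching file, one after another.
for infile in "$workdir"/*.txt; do
    outfile="${infile%.txt}_linecount.out"   # derive the output name from the input name
    wc -l < "$infile" > "$outfile"           # stand-in for a real processing step
    echo "processed $(basename "$infile")"
done
```

Deriving each output name from its input name, as done here, is what you would typically change when adapting a script to your own files.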
The scripts provided in the "scripts" repository (http://sfg.stanford.edu) are free software: you can redistribute them and/or modify them under the terms of the GNU General Public License as published by the Free Software Foundation, version 3 of the License. These programs are distributed in the hope that they will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with these programs (gpl.txt). If not, see http://www.gnu.org/licenses/.

Below, you will find one way to set up your Mac computer. If you have problems installing any of the software, you can also try using a package manager such as MacPorts (http://www.macports.org/) or Fink (http://www.finkproject.org/).

Objectives

1) Set up your computer so that it understands where to find the programs and scripts (PATH), and how to interpret the programming languages that we use in this pipeline.
2) Install the software that we will use generally and within the different sections of the protocol.

Resources

Haddock and Dunn (2011). Practical Computing for Biologists. Sinauer Associates, Inc. Sunderland, MA, USA.
Table 1

Process
Part 1 – Basic setup: PATH, compilers, programming languages and modules

  1. Create a "scripts" and a "programs" folder in your home directory.
  2. Set up PATH. This will tell your computer where to look for programs and files. From your home directory (see instructions for how to move between directories at the end of this section) in Terminal, type:

nano .bash_profile

or, if TextWrangler and its command-line tools are already installed (see Part 2),

edit .bash_profile

This will create a file if it doesn't exist already. Type in the file:


PATH="$HOME/scripts:$HOME/programs:${PATH}"
export PATH

Now, save and exit the file.
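The following sketch shows one way to check the result of steps 1 and 2 in the current Terminal session. It assumes the folder names "scripts" and "programs" from step 1 and writes the home directory as $HOME, which points to the same folder as ~ but is also expanded inside double quotes. A saved .bash_profile only takes effect in new Terminal windows, or after typing source ~/.bash_profile.

```shell
# Step 1: create the two folders (-p: no error if they already exist).
mkdir -p "$HOME/scripts" "$HOME/programs"

# Add both folders to the front of PATH for the current session.
PATH="$HOME/scripts:$HOME/programs:${PATH}"
export PATH

# Print PATH one entry per line; the two folders should be at the top.
echo "$PATH" | tr ':' '\n' | head -n 2
```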

  3. Download the Xcode package from the App Store and install. If installing Xcode 3, make sure to include the optional 10.4 SDK tools, which are needed for Biopython below.
  4. Download and install gcc (Fortran compiler) from http://hpc.sourceforge.net/ (we downloaded gcc-lion.tar.gz). In Terminal, move into the "downloads" directory and, if your computer has not automatically unzipped the file and removed the ".gz" file extension, type:

gunzip gcc-lion.tar.gz

Then, to install the compiler into the /usr/local folder (where your computer will be able to find it), type:

sudo tar -xvf gcc-lion.tar -C /

The "sudo" command runs the command that follows it with administrator (root) privileges, which is needed to write to a folder outside of your home directory. You will need administrator access (and password) in order to do this.
Also download the g77 compiler (g77-intel-bin.tar.gz). From the "downloads" folder in Terminal, type (as above):

gunzip g77-intel-bin.tar.gz
sudo tar -xvf g77-intel-bin.tar -C /

Once the process is finished, you can delete the downloaded installation files.
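If you would like to see what a tar command will do before running it with sudo, the sketch below may help. It builds a tiny stand-in archive (not the real compiler download) whose contents sit under usr/local, lists it with -t, and extracts it relative to a scratch folder with -C. All file and folder names here are made up for the demonstration.

```shell
# Build a tiny stand-in archive (hypothetical contents) so the example is self-contained.
demo=$(mktemp -d)
mkdir -p "$demo/src/usr/local/bin" "$demo/root"
echo "placeholder" > "$demo/src/usr/local/bin/demo-tool"
tar -cf "$demo/demo.tar" -C "$demo/src" usr

# List the archive contents without extracting anything.
tar -tf "$demo/demo.tar"

# Extract relative to a chosen root folder; with "-C /" the same paths would land in /usr/local.
tar -xvf "$demo/demo.tar" -C "$demo/root"
ls "$demo/root/usr/local/bin"
```

Running tar -tf on a downloaded archive in the same way shows where its files will end up before anything is written.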

  5. Download and install Git from http://git-scm.com/download. Git is a version control program; here we use it to download the source code of other programs from within Terminal (Git v1.7.6 built for Snow Leopard seems to work with Lion. Choose the x86_64 version if working on a 64-bit Intel Mac).
  6. Check that Python is installed. Open a Terminal window and type:

python

The first line should print the version (we're using version 2.7.1).

  7. Make sure that numpy is installed. In the Python interpreter in Terminal (you should be there if you typed python in point 6), type:

import numpy

If it is installed, you should just get a new prompt line; if not, you will get an error. It should already be installed if you're working with Python version 2.7. If you're working with an older version, you can install numpy using Git (or download it with a web browser; the link is given in Table 1). Open a new Terminal window or quit Python with quit() (so that you're no longer in the Python interpreter) and move into your programs folder. Then type:

git clone https://github.com/numpy/numpy.git numpy

Then move into the new numpy folder and type:

sudo python setup.py install

Install scipy: From your programs folder, type:

git clone https://github.com/scipy/scipy.git scipy

Then move into the new scipy folder and type:

sudo python setup.py install

  8. Download the latest version of Biopython (we're installing the biopython-1.57 source tarball) and unzip it by double-clicking on the file in a Finder window. Then, in the Terminal, move into the biopython folder that you unzipped and type:

python setup.py build
python setup.py test
sudo python setup.py install

Once these packages (numpy, scipy and biopython) are installed you can delete the downloaded installation files.
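A quick way to confirm that all three packages can be imported is sketched below. The helper function check_module and the PYTHON variable are our own illustrative names; the function simply asks the Python interpreter to import each module and reports the result. "Bio" is the name under which Biopython is imported.

```shell
# Report whether a Python module can be imported.
# PYTHON defaults to "python", the interpreter name used in this protocol.
PYTHON="${PYTHON:-python}"

check_module() {
    if "$PYTHON" -c "import $1" 2>/dev/null; then
        echo "$1: ok"
    else
        echo "$1: MISSING"
    fi
}

for module in numpy scipy Bio; do   # "Bio" is the import name of Biopython
    check_module "$module"
done
```

A module reported as MISSING either is not installed or was installed for a different Python interpreter than the one being called.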

  9. Check that Perl is installed. Open a Terminal window and type:

perl --version

The first line should print the version (we're using version 5.12.3). Perl should be installed on all Mac OS X machines.

  10. Download and install BioPerl (BioPerl is only used once in this protocol, to parse the output from a BLAST to the nr database. If you already have access to an annotated reference transcriptome, you do not need to go through this step). Go to http://www.bioperl.org/wiki/installing_Bioperl_for_Unix and follow instructions for "preparing to install" and "installing Bioperl." Be prepared: this is a long and involved process. You'll be prompted to answer questions when you finally get to the installation step. We followed, and can recommend, the "Easy way using CPAN" installation pipeline, but either way should work fine.
  11. Make sure that you have a recent version of Java installed. We have used Java version 1.6.0. From Terminal, type:

java -version

If you need to install or update your version, go to your Applications folder, Utilities, Java Preferences.

  12. Test if ant is installed by typing:

ant -version

If it is not, download and install the latest version of the Java library called Apache Ant (http://ant.apache.org/bindownload.cgi). Move the folder into your programs folder.

  13. R: Download the latest version from http://www.r-project.org/. Follow its installation instructions.
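Once Part 1 is complete, a short loop like the one below can report which of the command-line tools are reachable from Terminal. The helper function "have" is our own name for this sketch; it relies only on the shell built-in command -v. A tool reported as not found may still be installed in a folder that is not listed in PATH.

```shell
# Report whether each command-line tool from Part 1 can be found on PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

for tool in gcc git python perl java ant R; do
    if have "$tool"; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: not found"
    fi
done
```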

Part 2 - Software installation
Here we will download and install all the bioinformatics programs utilized in this pipeline. We install them in the same order that they are used in the pipeline. You do not need to install all programs if you are not going to perform all steps of the pipeline.

  1. Download the scripts.zip file from the SFG repository (http://sfg.stanford.edu), unzip it, and copy all the script files into the scripts folder in your home directory.
  2. Download the text editor TextWrangler (http://www.barebones.com/products/textwrangler/download.html), and drag the icon to your Applications folder to install.
  3. FASTX-Toolkit (Section 1 – Data post-processing): Download the precompiled binary for Mac OS X from the website (http://hannonlab.cshl.edu/fastx_toolkit/download.html). Unzip the folder and copy the individual files into the programs folder in your home directory.
  4. CLC Genomics Workbench (Section 2 – de novo assembly) (This is the only software package in the pipeline that is not available for free. See section 2 for alternative options). If you have a license for CLC Genomics Workbench or if you will use the free trial version, download and install from the website (http://www.clcbio.com). Follow step-by-step instructions.
  5. BLAST (Section 3 – Gene annotation): Go to ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ to download the latest BLAST+ executables in the archive ending in universal-macosx.tar.gz. The BLAST executables are precompiled, so you can just unpack the archive and copy them from the bin folder into the programs folder in your home directory.
  6. BWA (Section 4 – Mapping): Download the latest version from the sourceforge page (http://sourceforge.net/projects/bio-bwa/files/). We are using version 0.5.9. Unzip in your downloads folder. From Terminal, move into downloads and into the bwa folder. To build the bwa executable file, type:

    make

Then copy the executable file called "bwa" into your programs folder. You can delete the rest of the downloaded files.
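The copy step can also be done from the command line. The sketch below uses a stand-in executable called demo-tool and a scratch folder in place of your real programs folder, so that it can be run without touching anything; in real use the source would be the freshly built file and the destination would be $HOME/programs.

```shell
# Stand-in for a freshly built executable (hypothetical; the real one is produced by make).
build=$(mktemp -d)
printf '#!/bin/bash\necho "demo tool running"\n' > "$build/demo-tool"

# Scratch folder standing in for the programs folder in your home directory.
programs="$build/programs"
mkdir -p "$programs"

# Copy the executable and make sure it has execute permission.
cp "$build/demo-tool" "$programs/"
chmod u+x "$programs/demo-tool"

# With the folder on PATH, the program can be run from any directory.
PATH="$programs:$PATH"
command -v demo-tool
demo-tool
```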

  7. DESeq (Section 5 – Gene expression analysis). Open R and type:

    source("http://www.bioconductor.org/biocLite.R")
    biocLite("DESeq")

  8. ErmineJ (Section 5 – Gene Expression analysis). Go to http://www.chibi.ubc.ca/ermineJ/, where you can either download and install the program or run it from the web as a Java application; follow the links accordingly.
  9. SAMTools (Section 6 – SNP detection): Download the latest version from the sourceforge page (http://sourceforge.net/projects/samtools/files/samtools/). We are using version 0.1.17. Unzip in your downloads folder. In Terminal, move into downloads and into the samtools folder. To build the samtools executable file, type:

    make

Then copy the executable file called "samtools" into your programs folder. Though we do not use it in this pipeline, you may also want to copy the "bcftools" executable into your programs folder in case you want to use samtools/bcftools for SNP detection. You can delete the rest of the downloaded files.

  10. Picard Tools (Section 6 – SNP detection): Download the latest version from the sourceforge page (http://sourceforge.net/projects/picard/files/). We are using version 1.50. Unzip in your downloads folder. Then copy all of the .jar files into your programs folder. You can delete the rest of the downloaded files. Note that the absolute path must be specified in your scripts when you call PicardTools (e.g. ~/programs/MarkDuplicates.jar).
  11. GATK v.1.0 (Section 6 – SNP detection): This is an older version of the GATK, which can be downloaded from ftp://ftp.broadinstitute.org/pub/gsa/GenomeAnalysisTK/. Choose the newest version of v.1.0 (look at the last modified date). The commands in the SNP detection section are formatted for this version. Feel free to use the latest version, but beware that some scripts in this pipeline might not work properly with it.

Unzip the downloaded file by clicking on it in Finder and copy GenomeAnalysisTK.jar to the programs folder.
Note that the absolute path must be specified in your scripts when you call GATK (e.g. ~/programs/GenomeAnalysisTK.jar).
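Because Java does not search PATH for .jar files, a script has to spell out where the file is. A minimal sketch of how a script might do this is given below; the variable name GATK_JAR is our own choice. Writing the home directory as $HOME rather than ~ matters here, since ~ is not expanded inside double quotes.

```shell
# Build an absolute path to the .jar file; $HOME expands to the full home directory path.
GATK_JAR="$HOME/programs/GenomeAnalysisTK.jar"

if [ -f "$GATK_JAR" ]; then
    echo "found: $GATK_JAR"
    # A script would then call it as: java -jar "$GATK_JAR" followed by the tool's arguments.
else
    echo "not found: $GATK_JAR"
fi
```

The same pattern applies to the Picard .jar files.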

  12. EigenSoft (Section 6 – SNP analysis): The version available online is made for Linux, so it might be difficult to compile it on Mac OS X. Therefore, we are providing pre-built versions of the programs we will use from this package within the scripts folder. Move the three files smartPCA, twstats and twtable into your programs folder. To run, these programs require the Fortran g77 compiler that you should have installed already (see part 1.4).
  13. BayeScan (Section 6 – SNP analysis): Download from http://cmpg.unibe.ch/software/bayescan/download.html. Find the BayeScan2.0_macos64bits file in the "binaries" folder and the plot_R.r file in the "R functions" folder; move these files into the programs folder in your home directory (save the manual somewhere, too).

Summary

We have now set up the computer so that it can interpret Perl, Python, R and Java and the bioinformatics modules for these programming languages and we have told the computer that our scripts and programs are located in the ~/scripts and ~/programs folders, respectively. We have also installed all the bioinformatics software that we will need to proceed through the rest of this protocol.

Using the Command Line – "the Terminal is your friend"

Below are some useful bash commands that allow you to view and modify files. For a more detailed list of useful commands, please see Appendix 3 in Haddock and Dunn's "Practical Computing for Biologists".

Useful commands:
cd           change directory
cd ..        moves one step up in the file system hierarchy ("parent")
ls           list files in a directory
ls -l        list files and details in a directory
pwd          print working directory
mkdir        make new directory (folder)
cp           copy file or folder
chmod u+x    change permissions to make a specified file executable

Useful programs:
edit         opens a file
head         prints the first 10 lines of a file
tail         prints the last 10 lines of a file
less         view file contents
man          shows the manual for a program
history      prints the history of commands
PROGRAM -h   prints a short help menu for most command-line programs
cat          concatenates several text files into one
grep         prints lines containing a specified argument to the screen

Shortcuts:
Up arrow    moves back through your previous command history
Tab            auto-completion button
*               wildcard
Esc, then .    repeat last word from last command given

  From within a program such as man or less (nano is closed with Ctrl+X instead):
q              quit viewing
space        next page
b              back a page

TIP: Instead of typing out the path of a file or folder, you can drag the little folder icon at the top of a Finder window into the Terminal window.
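To try several of these commands together, the short session below can be typed line by line or pasted into Terminal. It works in a throwaway folder, and the file names and contents are made up for the exercise.

```shell
# Work in a throwaway folder so nothing in your home directory is touched.
cd "$(mktemp -d)"
pwd                                   # print the working directory

mkdir demo_folder                     # make a new directory
cd demo_folder
printf 'apple\nbanana\ncherry\n' > fruits.txt
cp fruits.txt fruits_copy.txt         # copy a file
ls -l                                 # list files with details

cat fruits.txt fruits_copy.txt > all_fruits.txt   # concatenate two files into one
head -n 2 all_fruits.txt              # first two lines
tail -n 1 all_fruits.txt              # last line
grep "an" all_fruits.txt              # lines containing "an"
cd ..                                 # back up to the parent directory
```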



De Wit P, Pespeni MH, Ladner JT, Barshis DJ, Seneca F, Jaris H, Overgaard Therkildsen N, Morikawa M and Palumbi SR (2012) The simple fool's guide to population genomics via RNA-Seq: an introduction to high-throughput sequencing data analysis. Molecular Ecology Resources 12, 1058-1067.