How to set up your computer to run smoothly through this protocol
In this text, we are walking through the installation process on a new Mac with an Intel processor and the Lion (OS X version 10.7) or Snow Leopard (OS X version 10.6) operating system. This does not mean that a Mac is required, but Mac OS X is currently one of the most commonly used operating systems for bioinformatics, along with various versions of the Linux operating system family.
See Software Table - a list of all software used throughout this protocol
A note on operating systems
Mac OS X and Linux are both rooted in an operating system called "UNIX", which has a long-standing tradition of open source. This means that many software packages can be compiled on Mac OS X, Linux, and BSD systems alike. Linux can be a very useful operating system, as it is free in most of its versions, and works on both Macs and PCs. It can nevertheless be perceived as having a steep learning curve. We have used Ubuntu Linux in our lab, as it is easy to install and use both on a PC running Windows and on a Mac. With minor modifications (see Appendix 1 of Haddock and Dunn (2011) for useful tips), it is possible to run this pipeline on a Windows PC (without installing Linux). This requires that a "UNIX environment portal" is installed. We have tried Cygwin, which we found to work well. The present protocol, however, assumes that the reader is using a relatively new Mac with an Intel processor and Mac OS X "Lion" or "Snow Leopard" installed.
There is often more than one way of installing a program on your computer. On Macs, the simplest is usually to download an installer .dmg package, which installs itself when you click on it. Sometimes, however, that is not available, making it necessary to download executable files and copy them manually to your hard drive. In some cases, you might even have to download the source code for programs and build them on your computer, for which you'll need special compiler software. The advantage of compiling your own software is that the software will be "custom-built" for your computer setup, and will run more smoothly than a version compiled somewhere else might. A list of all the software discussed in this chapter, along with information on where to find it, can be found in Table 1.
Before getting started with the process below, we highly recommend reading through chapters 4, 5, and 6 from Haddock and Dunn's (2011) "Practical Computing for Biologists" to gain comfort and familiarity with how UNIX-based operating systems such as Mac OS X and Linux are organized and how they process information. This reading will also make you comfortable with working through the command line in the Terminal window environment, which is essential to most of this protocol. Terminal is an application that allows you to interact with your computer through the command line. Using the command line, you can navigate around your computer as you do in Finder. You can open and manage files and folders and execute programs. The default shell, the program that displays the command line (the prompt and cursor), in Mac OS X is called "bash". A short summary of some of the most useful bash commands can be found at the end of this section.
When naming files and folders, there are a few simplifying rules to follow. Avoid spaces and special characters. Make file names as informative as possible without making them too long; abbreviations help here. If you are planning to run through the protocol several times, it is helpful to add the date and time to a filename (do not just call the file "new"). If you are planning to share files, adding your initials at the end can also be useful. Note also that many of the files generated in this protocol are huge, so make sure that there is enough hard-drive space before starting.
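The naming rules above can be sketched in bash. This is only an illustration: the helper name make_filename, the label "bwa_aln" and the initials "PDW" are hypothetical and not part of the protocol.

```bash
#!/bin/bash
# Hypothetical helper: build a filename from a short label, a date/time stamp,
# and the user's initials, with no spaces or special characters.
make_filename() {
    local label="$1" initials="$2" ext="$3"
    local stamp
    stamp="$(date +%Y%m%d_%H%M)"    # e.g. 20120315_1430
    printf '%s_%s_%s.%s\n' "$label" "$stamp" "$initials" "$ext"
}

# Example call with made-up values
make_filename bwa_aln PDW sam

# Check the free space on the current drive before generating large files
df -h .
```

The year-month-day order of the stamp means that an alphabetical file listing is also a chronological one.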
Throughout this protocol, we refer to "bash scripts", "scripts" and "programs". A bash script consists of a list of commands that you could just as well type directly into the Terminal. The advantage of using a bash script is that it facilitates the "batch" processing of many different files at the same time, since you can execute a series of commands one right after another, without having to enter each one manually. A bash script normally starts with #!/bin/bash and the filename extension is .sh. The bash scripts used in this protocol were written by us, and in general you will have to open them in a text editor and modify input and output file names. "Scripts" are programs that we have written in an interpreted (non-compiled) high-level programming language, such as Perl (.pl), Python (.py) or R (.r). You will not need to open the scripts used in this protocol unless you want to study exactly how they work or modify them in some form. Within most bash scripts and scripts there are lines that begin with #; these are comment lines for your benefit and are not used by the computer. Reading these lines will help you understand what the script does. "Programs", as used herein, refers to executable files that have been precompiled to binary code, and thus cannot be opened in a text editor in any meaningful way. The programs used here were not written by us, nor should you try to change any of their contents; how to acquire them is the focus of the rest of this section.
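To make the idea of a bash script concrete, here is a minimal sketch of a batch script. It is not one of the scripts distributed with this protocol; the function name count_lines and the choice of .txt files are assumptions made for illustration only.

```bash
#!/bin/bash
# Sketch of a batch script: report the number of lines in every .txt file
# in a folder. The folder is given as the first argument (default: current folder).
INDIR="${1:-.}"

count_lines() {
    # Print "<file name><TAB><line count>" for a single file
    local f="$1"
    printf '%s\t%s\n' "$(basename "$f")" "$(wc -l < "$f" | tr -d ' ')"
}

for f in "$INDIR"/*.txt; do
    [ -e "$f" ] || continue    # skip the loop body if no .txt files exist
    count_lines "$f"
done
```

Saved as, say, count_lines.sh, it could be run with "bash count_lines.sh myfolder". The lines starting with # are comments of the kind described above.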
Below, you will find one way to set up your Mac computer. If you have problems installing any of the software, you can also try using a package manager such as MacPorts (http://www.macports.org/) or Fink (http://www.finkproject.org/).
1) Set up your computer so that it understands where to find the programs and scripts (PATH), and how to interpret the programming languages that we use in this pipeline.
Haddock and Dunn (2011). Practical Computing for Biologists. Sinauer Associates, Inc. Sunderland, MA, USA.
This will create a file if it doesn't exist already. Type in the file:
Now, save and exit the file.
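As a rough sketch of what PATH lines in a bash startup file can look like, the snippet below appends two folders to PATH. It assumes the ~/scripts and ~/programs folders named in the summary at the end of this section; the helper add_to_path is hypothetical and is not the authors' exact text.

```bash
# Append a folder to PATH only if it is not already listed
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, do nothing
        *) PATH="$PATH:$1" ;;
    esac
}

# Assumed folder locations (see the summary at the end of this section)
add_to_path "$HOME/scripts"
add_to_path "$HOME/programs"
export PATH
```

After opening a new Terminal window, typing "echo $PATH" shows the folders that the shell searches for programs.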
Then, to install the compiler into the /usr/local folder (where your computer will be able to find it), type:
The "sudo" command overrides your computer's default security settings in order to write to a folder outside of your home directory. You will need administrator access (and password) in order to do this.
Once the process is finished, you can delete the downloaded installation files.
The first line should print the version (we're using version 2.7.1).
If it is installed, you should just get a new line; if not, you will get an error. It should already be installed if you're working with Python version 2.7. If you're working with an older version, you can install numpy using Git (or download it with a web browser; the link is given in Table 1). Open a new Terminal window or quit Python with quit() (so that you're no longer in the Python interpreter) and move into your programs folder. Then type:
Then move into the new numpy folder and type:
Install scipy: From your programs folder, type:
Then move into the new scipy folder and type:
Once these packages (numpy, scipy and biopython) are installed you can delete the downloaded installation files.
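Before deleting the installation files, it may be worth confirming from the Terminal that all three packages can be imported. The sketch below assumes the interpreter is started with the command "python"; the variable PYTHON and the function has_module are names introduced here for illustration. Biopython is imported under the module name Bio.

```bash
# PYTHON is an assumed variable naming the interpreter to test (default: python)
PYTHON="${PYTHON:-python}"

has_module() {
    # Succeeds (exit status 0) only if the module can be imported
    "$PYTHON" -c "import $1" >/dev/null 2>&1
}

for module in numpy scipy Bio; do
    if has_module "$module"; then
        echo "$module: installed"
    else
        echo "$module: not found"
    fi
done
```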
The first line should print the version (we're using version 5.12.3). Perl should be installed on all Mac OS X machines.
If you need to install or update your version, go to your Applications folder, Utilities, Java Preferences.
If not, then download and install the latest version of the Java library called Apache Ant (http://ant.apache.org/bindownload.cgi). Move the folder into your programs folder.
Part 2 - Software installation
Then copy the executable file called "bwa" into your programs folder. You can delete the rest of the downloaded files.
Then copy the executable file called "samtools" into your programs folder. Though we do not use it in this pipeline, you may also want to copy the "bcftools" executable into your programs folder in case you want to use samtools/bcftools for SNP detection. You can delete the rest of the downloaded files.
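Copying an executable into the programs folder can also be done from the command line. The helper below is a hypothetical sketch, not part of the protocol; it assumes the programs folder is ~/programs and that the executable has already been built in the current folder.

```bash
# Hypothetical helper: copy an executable into a destination folder
# (default: ~/programs) and make sure it is marked as executable.
install_exe() {
    local src="$1" dest="${2:-$HOME/programs}"
    mkdir -p "$dest"
    cp "$src" "$dest/" && chmod +x "$dest/$(basename "$src")"
}

# Example calls (uncomment after bwa and samtools have been built):
# install_exe bwa
# install_exe samtools
```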
Unzip the downloaded file by clicking on it in Finder and copy GenomeAnalysisTK.jar to the programs folder.
We have now set up the computer so that it can interpret Perl, Python, R and Java and the bioinformatics modules for these programming languages and we have told the computer that our scripts and programs are located in the ~/scripts and ~/programs folders, respectively. We have also installed all the bioinformatics software that we will need to proceed through the rest of this protocol.
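A quick way to confirm this setup is to ask the shell whether it can find each tool. The sketch below uses the standard "command -v" lookup; the function check_tool and the exact list of command names are assumptions for illustration, and GenomeAnalysisTK.jar is not included because it is a .jar file rather than a command on the PATH.

```bash
#!/bin/bash
# Report whether a command can be found on the PATH
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        printf '%-10s found at %s\n' "$1" "$(command -v "$1")"
    else
        printf '%-10s NOT FOUND\n' "$1"
        return 1
    fi
}

missing=0
for tool in python perl R java ant bwa samtools; do
    check_tool "$tool" || missing=$((missing + 1))
done
echo "Tools missing: $missing"
```

Any tool reported as NOT FOUND is either not installed or sits in a folder that is not listed in PATH.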
Using the Command Line – "the Terminal is your friend"
Below are some useful bash commands that allow you to view and modify files. For a more detailed list of useful commands, please see Appendix 3 in Haddock and Dunn's "Practical Computing for Biologists".
From within a program such as man or less or nano:
TIP: Instead of typing out the path of a file or folder, you can drag the little folder icon at the top of a Finder window into the Terminal window.
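As a short practice session, the commands below exercise a few common bash commands inside a throwaway folder, so that no real data is touched. The folder and file names are made up for illustration.

```bash
# Work in a temporary sandbox folder
sandbox="$(mktemp -d)"
cd "$sandbox"

pwd                                   # print the current folder
mkdir raw_data                        # make a new folder
printf 'line1\nline2\nline3\n' > raw_data/example.txt
ls -l raw_data                        # list the contents of a folder
cp raw_data/example.txt copy.txt      # copy a file
mv copy.txt renamed.txt               # rename (move) a file
head -n 2 renamed.txt                 # show the first two lines
wc -l renamed.txt                     # count the lines in a file
grep -c 'line' renamed.txt            # count the lines containing "line"
rm renamed.txt                        # delete a file (there is no undo)

cd - >/dev/null                       # return to the previous folder
```

Typing "man" followed by a command name (for example "man grep") opens the manual page for that command; press q to leave it.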
De Wit P, Pespeni MH, Ladner JT, Barshis DJ, Seneca F, Jaris H, Overgaard Therkildsen N, Morikawa M and Palumbi SR (2012) The simple fool's guide to population genomics via RNA-Seq: an introduction to high-throughput sequencing data analysis. Molecular Ecology Resources 12, 1058-1067.