Computation

Computers Are Your Friend, Don't Be Scared, Give It a Try, You Will Be Surprised How Easy It Is!!

Over the course of my career the datasets I've worked on have continued to increase in size and scope. This is exciting as a scientist, but eventually basic tools like Microsoft Excel cannot cope with the datasets and you need to learn some computational skills. This section of the website is dedicated to recording our experience learning how to deal with large datasets. The primary intention is to give my group quick access to our internal experience so we don't spend time on Google looking up the same things over and over again. Hopefully, it can be useful to others who are dealing with similar issues. It takes some time to get comfortable, but as long as you are not scared of learning I promise you will pick things up quickly. Oddly, most people think I'm a bioinformatician now, but I'm most comfortable in the lab.

Note:

  • Most of our recommendations are related to our environment in the Keats Lab at TGen.
    • I ensure everyone in my lab is on a Mac computer.
      • Macs are at their heart just fancy Unix computers, and most computational tools are designed for Unix/Linux.
      • This is not to say you can't do it on a PC, but there is some basic setup that you would need to sort out first.
    • Institutionally our compute resources all run CentOS version 6 (a free Linux distribution).
      • Most things should translate to any Linux distribution, but some subtle issues do exist.
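
Since some commands behave slightly differently between a Mac and a Linux server, it can help to check which system you are logged into. The following is a minimal sketch that assumes standard tools only; the variable name os_name is just for illustration.

```shell
# Print the kernel name: "Darwin" on a Mac, "Linux" on a Linux server
os_name=$(uname -s)
echo "Kernel: $os_name"

# Most recent Linux distributions describe themselves in /etc/os-release;
# older CentOS releases (such as version 6) use /etc/redhat-release instead
if [ -r /etc/os-release ]; then
    cat /etc/os-release
elif [ -r /etc/redhat-release ]; then
    cat /etc/redhat-release
fi
```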

STEP 1 - Finding the Terminal and Learning Some Basic Command Line

The best resource we have found for this comes from the Korf Lab at UC Davis. This online resource, called "Unix and Perl for Biologists", assumes you know absolutely nothing; it starts by showing you how to find the terminal on a Mac. They have even turned it into a book that I would highly recommend. It is this resource that started me down this track when my first RNAseq dataset landed in my lap as a post-doc and I had no choice but to learn. I spent two nights going through the Unix section after my kids were in bed (I have yet to do the Perl section) and realized I could have done my post-doc in half the time if I'd done this earlier. I now make everyone who joins my lab do at least the Unix section, regardless of whether they will be on the wet or dry side of the lab.

I hope at some point to create a similar primer that is more focused on next-generation sequencing, as some of their examples, though genetics-related, are not things we would do on a typical day. As a start, we maintain a "Unix Basics" section that contains a bunch of valuable highlights.
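
To give a flavor of what the basics look like, here is a short sketch of the navigation commands you would meet first. The folder and file names are hypothetical, and everything happens in a throwaway directory so nothing important is touched.

```shell
# Work in a throwaway directory (the name unix_demo is an assumption)
workdir=$(mktemp -d "${TMPDIR:-/tmp}/unix_demo.XXXXXX")
cd "$workdir"

pwd                                  # print the directory you are in
mkdir -p project/fastq               # make a folder and its parent in one go
touch project/fastq/sample1.fastq    # create an empty (hypothetical) file
ls -l project/fastq                  # list the folder contents with details
cd project/fastq                     # move into the folder
cd ../..                             # move back up two levels
```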

STEP 2 - Moving from Navigation to Computation

Once you get comfortable with a command line environment you will need to learn more advanced tools, in particular awk and R. We maintain dedicated sections for both of these tools. As a lab we are still deciding whether to focus on Perl or Python, but I lean toward Python as I can actually understand the code.
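
As a taste of why awk is worth learning, the sketch below summarizes a small tab-delimited table straight from the command line. The gene names and counts are made up for illustration, and the file is written to a temporary directory.

```shell
# Build a tiny tab-delimited table (hypothetical genes and read counts)
tmp=$(mktemp -d "${TMPDIR:-/tmp}/awk_demo.XXXXXX")
printf 'gene\tcount\nTP53\t120\nMYC\t340\nKRAS\t80\n' > "$tmp/counts.txt"

# Skip the header (NR > 1), sum column 2, and print the mean at the end
mean=$(awk -F'\t' 'NR > 1 { sum += $2; n++ } END { if (n > 0) print sum / n }' "$tmp/counts.txt")
echo "Mean count: $mean"

# Keep only the genes with more than 100 reads
high=$(awk -F'\t' 'NR > 1 && $2 > 100 { print $1 }' "$tmp/counts.txt")
echo "$high"
```

The same pattern (a condition followed by an action in braces) covers most day-to-day filtering and summarizing of column-based files.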

Inevitably you will need to start moving files between your local environment and public resources. We provide instructions on using FTP and ASCP.

Many of these downloaded files or files you might need to upload will be compressed. We provide instructions on using common compression and decompression methods in the compression section.
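
For orientation, the most common cases are single gzip files and gzipped tar archives. The sketch below runs through both using only standard gzip and tar options; the file and folder names are hypothetical.

```shell
# Work in a throwaway directory with a small example file
tmp=$(mktemp -d "${TMPDIR:-/tmp}/gzip_demo.XXXXXX")
cd "$tmp"
printf 'line one\nline two\n' > notes.txt

gzip notes.txt                        # compresses to notes.txt.gz and removes notes.txt
gzip -dc notes.txt.gz | head -n 1     # peek at the contents without unpacking
gunzip notes.txt.gz                   # restores notes.txt

mkdir results
mv notes.txt results/
tar -czf results.tar.gz results       # bundle a folder into a gzipped archive
tar -tzf results.tar.gz               # list what is inside the archive
mkdir unpacked
tar -xzf results.tar.gz -C unpacked   # extract into a different folder
```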

As you start writing code you will likely move from a basic text editor like nano to a more advanced editor like VIM. We provide a dedicated section to basic VIM commands.

STEP 3 - Making the Terminal Nice and Colorful

I like colors, and you can configure the terminal to use them to differentiate file types and folders. When using vim as a text editor, you can also color different elements. There are many ways of doing this, but the best and easiest one I've found is a color scheme called "solarized" by Ethan Schoonover. There are two versions, light and dark, and you can pick either.

The following steps will set this color scheme as the default on your Mac:

STEP A

#Open a Fresh Terminal Window
#Download the solarized package
git clone https://github.com/altercation/solarized.git
#Enter the downloaded package, make the required directory and copy the color scheme
cd solarized/vim-colors-solarized/colors
mkdir -p ~/.vim/colors
cp solarized.vim ~/.vim/colors/
## This will ensure the VIM text editor will use the solarized format
vim ~/.vimrc
#Add the following lines (see VIM section on how to use vim)
syntax enable
set background=dark

colorscheme solarized

## This will ensure the Terminal will use the solarized format when you use the unix "ls" command
vim ~/.profile
#Add the following line to your profile
export CLICOLOR=1

STEP B

Set up the Mac Terminal by double-clicking the solarized apps

Open the terminal and select the solarized scheme of your choice in the Terminal>Preferences window

Optional - Change the Selection/Highlight color for the scheme (I find the default hard to see when I'm attempting to copy text)
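
If you would rather not edit the files from STEP A by hand in vim, the same lines can be appended from the command line. This is only a sketch: add_line is a hypothetical helper, not part of solarized, and it assumes your shell reads ~/.profile (on newer Macs where zsh is the default shell, ~/.zshrc may be the file to use instead).

```shell
# Append a line to a file only if that exact line is not already present
add_line() {
    # $1 = line to add, $2 = file to add it to
    touch "$2"
    grep -qxF -- "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

# Settings for vim, matching STEP A
add_line 'syntax enable' "$HOME/.vimrc"
add_line 'set background=dark' "$HOME/.vimrc"
add_line 'colorscheme solarized' "$HOME/.vimrc"

# Color output for the "ls" command in the Mac Terminal
add_line 'export CLICOLOR=1' "$HOME/.profile"
```

Because the helper checks for the line first, running it more than once does not add duplicate entries.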