Exercise 2

Today we are going to play around with some bioinformatics tools on a UNIX computer.

I have set up a temporary account on a machine at this school for this class.

The first step is to open up a Command Prompt (MSDos) from the Start menu

Start >> Run: Command

Then type:
    telnet: pinehost.sci.ccny.cuny.edu

We all are going to use the same account and password this is NOT the way to do things in the real world
    login: rise
    Password: ris3111

Once you are logged in, the first thing to do is create a folder (directory) for yourself
    mkdir  yourname  
Move into your new directory and look around
    cd  yourname   (change to directory yourname)
    ls             (list files)
You will get nothing, because your new directory is empty. Just to get a feel for UNIX, make a dummy file like this:
    echo abc > test1     (writes abc into a file called test1)
    ls                   (list files)
    more test1           (write the contents of test1 onto the screen)
Now, we are going to analyze a set of DNA sequences that I loaded into a folder called Seqs

Move over there and have a look around
    cd ..      (this moves you up one level)
    ls         (list the files and folders here)
    cd Seqs    (move into directory Seqs)
    ls         (list the files in Seqs)
and look at one of the files

more dros1.seq    (write dros1.seq to the screen)
Ok, now it is time to learn something about a program called fasta We have a copy of this program in the directory Fasta

fasta compares sequences one to another or one to a whole set.

You can try it out like this (you have to type the whole path to the program) But first, use cd to move back to your own directory - if you get lost type pwd, that shows where you are in the file system.

Fasta will ask you for the name of a test sequence - use ../Seqs/dros1.seq
Then it will ask you for a library file (to compare against) - use ../Seqs/dros2.seq
Then it asks for a "ktup" - just hit return (take the default of 6)

Now it does the computing, then it will ask you to give a name for its results file, and it asks how many scores you want to see and if you want alignments. Give it a file name like "dros1.fasta", but you can just hit return to take the default answers for the other questions.

Now look at the output file
    more dros1.fasta
It would be useful if we could automate the fasta program so that it could compare a whole bunch of sequences to a database or to each other. But in order to do this, you will have to learn something about writing UNIX programs shell scripts to be specific.

First, you need to understand the concept of a loop: A program repeats some set of actions a specific number of times or until some goal is reached. In this case, we want to repeat the fasta search for every sequence in the Seqs directory.

Also, we need to make a special file that contains all of the sequences that we want to search against. I have already done this; it is called seq.lib

Here is how to make a loop using the Foreach command:
    1)  csh       
    2)  Foreach x (../Seqs/*.seq)
    3)  ? ../Fasta/fasta q n $x seq.lib > $x.fasta
    4)  ? end
What this does is:

1) Set C-shell as your programming environment (because I know it best)

2) set a variable x to represent all of the sequences in ../Seqs and repeat for each one

2) run the fasta program (in directory ../Fasta) using the options q and n, use each *.seq file as the query sequence and compare it to the sequences in the special seq.lib file, also write the output of each search into a file with the same name as the *.seq file, but add .fasta at the end.

3) go back and repeat the loop until all the files have been run through the fasta program

One more point. We have been typing in commands on the command line. Once you are done, your commands are lost. If you make a mistake, you have to start all over. You can write these commands into a text file - which is then called a shell script. Then you can modify and re-use your program whenever you want to.

Now , we have to figure out how to get the results out of all of these .fasta files. There is a UNIX utility called "grep" that can find a word in a file and display the lines containing that word. We can make another loop that uses grep to scan all of the fasta results files and writes a summary of the matches to one new summary file.

    Foreach x (*.fasta)
    ? grep '>>' $x |cat fasta.log
    ? end
That should give you a file called fasta.log that contains all of the matches for each sequence. Now you can easily see what matches what.

This approach could be really useful if you were starting with thousands of sequences rather than just 16.

Stuart Brown - RCR
Last modified: Tue Oct 22 09:56:42 EDT 2002