Workshop on Genomics (Notes, Day 1) – Introduction to Linux

Lecture slides:

Linux tutorial:

Bioinformatics analysis, computational biology requires knowing UNIX/Linux.

Try to work in 2 terminals: one to keep track of files, one to execute commands

Slides, practice commands, navigating through paths

tab completion: auto-complete

up-arrow: history

variants to ls:

ls -l
ls -la
ls -lh

Four ways to view text file:


1. Move to the directory /etc
• What is the first line of the files ‘hosts’ in the directory /etc? localhost
• What is the relative file path to get to /var/log from here?
• What is the absolute path? localhost
2. Move to the directory /var/log/
• What is the contents on line 73 of the dmesg file?
ubuntu@ip-10-234-15-248:/var/log$ sed -n '73,73p' dmesg
[5536105.882829] No AGP bridge found
•  Without changing directories, what is the second line of the cpuinfo file in the proc directory?
ubuntu@ip-10-234-15-248:/var/log$ sed -n '2,2p' ../../proc/cpuinfo
vendor_id    : GenuineIntel
• What is the command to read this file with a relative path?
ubuntu@ip-10-234-15-248:/var/log$ cat ../../proc/cpuinfo
• An absolute path?
3. Move back to the root, what directories do you see?
ubuntu@ip-10-234-15-248:~$ ls
assembly             include   shell
bin                  install   shotgun_metagenomics
build                lib       software
conf                 libexec   stacks  logs      Templates
Desktop              Music     tmp
Documents            nxsetup   transcriptomics
Downloads            Pictures  tutorial_materials
etc                  Public    var
genomics_tutorial    qc        Videos
html                 sbin
igv                  share
4. Move back home, what are three ways to move home from the root?
This is moving from home to root:
ubuntu@ip-10-234-15-248:~$ cd ../..

To move from root to home:

ubuntu@ip-10-234-15-248:/$ cd
ubuntu@ip-10-234-15-248:/$ cd ~/
ubuntu@ip-10-234-15-248:/$ cd ~

Mixed, ‘Sequencing on illumina’ slide, Phred Quality Score is a measure of how clean peaks are . Q=-10(log10)p
Phred scores are not magical. can use to get rid of worst data, but hard to tell correctness
Translated into probability
10=1 in 10,
20=1 in 100,
30=1 in 1000

quality score series of letters, use ASCII code (8 bits = 2^8 combinations = 256)

Wiki on FASTQ is really good.
IonTorrent is Phred+33
grep -c is counting everything with ‘@’
grep -c -v is counting everything except ‘@’
wc is “word count”
wc -l is line

important commands, ^C and ‘man gzip’ (displays manual)

Pipes, think of water moving to different steps: program1|program2|program3

‘Cut’ will let you take data from specific columns:

cut -f 10 batch_1.genotypes_1.loc

Cut, capture the output:

cut -f 1-10 batch_1.genotypes_1.loc > genos

cut, pipe the output to grep

cut -f 2 batch_1.genotypes_1.loc | grep -c "nnxnp"
cut -f 1-10,15,17 batch_1.genotypes_1.loc|grep "nnxnp" > genos2

Examine a marker, translating the output

cat batch_1.genotypes_1.loc|tr "     " "," | grep "^96053"

Ctl-v then ‘Tab’ will tell shell to override actual keyboard Tab command and read Tab from file

Useful exercise, can be written in one line with | :


1. Decompress the file
2. Count the number of raw reads (250,000)
3. Count the number of reads with barcode: CGATA
4. Capture all FASTQ records for ACCAT into a file called
sample_01.fq (you should get 18352 records, 73408 lines)
5. Determine the count of all barcodes in the file
7900 TCAGA
10659 ACTGC
10931 TGACC
11536 GAGAT
11871 CTGAA
14409 CGGCG
14508 TGGTT
18226 GAAGC
18352 ACCAT
18375 TCGAG
19501 CGATA
23012 AATTT
26336 GCATT
31136 CTAGG

1. Use head when building a command, cat once the command is working
2. Look at the -n option for the head command, the -l
option for wc
3. The “^” character means “must occur at beginning of line” in a grep search
4. Look at the grep options: -c, -v, -A, -B
5. Read the man pages forsort and uniq to learn how to combine them

Ion Torrent is Phred+33:

1. grep out readlines (only the 2nd line) -> pipe
2. cut first 5 characters ->pipe
3. sort (automatically) alphabetically ->pipe
4. use uniq function with count, tells you how many times it counts > to file answer.txt
5. open file

head -n 100 s_1_sequence.txt | grep -A 1 '^@' | grep -v "^@" | grep -v "^--" | cut -c 1-5

About Lisa Cohen

PhD student at UC Davis.
This entry was posted in Genomics Workshop, Linux. Bookmark the permalink.

One Response to Workshop on Genomics (Notes, Day 1) – Introduction to Linux

  1. Pingback: Unix Ninja-ry…Baby Steps | Evolution and Genomics

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s