Lisa K. Johnson

Workshop on Genomics (Notes, Day 2) – Advanced UNIX

Posted on January 14, 2014 by Lisa Johnson

UNIX tutorial instructions: http://evomics.org/learning/unix-tutorial/
Exercises, part 2: http://evomicsorg.wpengine.netdna-cdn.com/wp-content/uploads/2014/01/2014_genomics_IntroUnixPart2.pdf

Common UNIX commands: http://mally.stanford.edu/~sr/computing/basic-unix.html

Homework assignment 1: http://evomicsorg.wpengine.netdna-cdn.com/wp-content/uploads/2014/01/2014_genomics_unix_hw_1.pdf
Homework 2: http://evomicsorg.wpengine.netdna-cdn.com/wp-content/uploads/2014/01/2014_genomics_unix_hw_2.pdf

Yesterday, take tab-separated file, manipulate data in columns, filter by value
Take gene-pop file, manipulate with shell program sed

Today:
1. Review of pipes
2. Regular Expressions
3. sed
4. edit fastq file headers
5. Shell loops
6. Shell scripts

Yesterday, pipes:

cat seqs.fa | grep '^ACGT' output

Only things that meet conditions of grep will make it through to output.

Today, Regular Expressions – finding patters in text

Simplest, ‘.’
Character class, match anything one or more of the letters: [ACGT]
One or more of the numbers 0-9: [0-9]+
Find someone’s first and last name (lower case): [a-z]+ [a-z]+
Phone number: [0-9]{3}\-[0-9]{3}\-[0-9]{4}
\makes literal hyphen
Find America date format before the 10th, e.g. June 3, 1978: [a-zA-Z]+ [0-9], [0-9]{4}

Command-line tricks:
Ctrl-a beginning of line
Ctrl-e end of line
Ctrl-d delete in place
Ctrl-v tab (literal tab)

ubuntu@domU-12-31-39-0B-68-22:~/shell$ cp /unix_data/record.tsv.gz record.tsv.gz
ubuntu@domU-12-31-39-0B-68-22:~/shell$ gunzip record.tsv.gz

To highlight first and last name:

ubuntu@domU-12-31-39-0B-68-22:~/shell$ grep -E "[a-z]+ [a-z]+" record.tsv
341341	julian catchen	541-485-5128	June 3, 1978
1243	rodger voelker 541-234-4732		January 12, 1981 
99999	andy berglund  541-498-9999		August 03, 2000
37916	william cresko (541) 234-4522		Mar 7, 1977
222	john letaw	123-455-7834	September 1996

To highlight phone numbers:

ubuntu@domU-12-31-39-0B-68-22:~/shell$ grep -E "[0-9]{3}\-[0-9]{3}\-[0-9]{4}" record.tsv
341341	julian catchen	541-485-5128	June 3, 1978
1243	rodger voelker 541-234-4732		January 12, 1981 
99999	andy berglund  541-498-9999		August 03, 2000
222	john letaw	123-455-7834	September 1996

To match capital and lowercase words, character class order does not matter:

ubuntu@domU-12-31-39-0B-68-22:~/shell$ grep -E "[A-Za-z]+" record.tsv

sed, a stream editor, ctd.

s/pattern/replace/

Search and replace:

ubuntu@domU-12-31-39-0B-68-22:~/shell$ cat record.tsv |sed -E 's/[a-z]+ [a-z]+/foo/'
341341	foo	541-485-5128	June 3, 1978
1243	foo 541-234-4732		January 12, 1981 
99999	foo  541-498-9999		August 03, 2000
37916	foo (541) 234-4522		Mar 7, 1977
222	foo	123-455-7834	September 1996

ubuntu@domU-12-31-39-0B-68-22:~/shell$ cat record.tsv |sed -E 's/[a-z]+) [a-z]+/\1/'
sed: -e expression #1, char 20: Unmatched ) or \)
ubuntu@domU-12-31-39-0B-68-22:~/shell$ cat record.tsv |sed -E 's/([a-z]+) [a-z]+/\1/'
341341	julian	541-485-5128	June 3, 1978
1243	rodger 541-234-4732		January 12, 1981 
99999	andy  541-498-9999		August 03, 2000
37916	william (541) 234-4522		Mar 7, 1977
222	john	123-455-7834	September 1996

ubuntu@domU-12-31-39-0B-68-22:~/shell$ cat record.tsv |sed -E 's/[0-9]+//'
	julian catchen	541-485-5128	June 3, 1978
	rodger voelker 541-234-4732		January 12, 1981 
	andy berglund  541-498-9999		August 03, 2000
	william cresko (541) 234-4522		Mar 7, 1977
	john letaw	123-455-7834	September 1996

ubuntu@domU-12-31-39-0B-68-22:~/shell$ cat record.tsv |sed -E 's/[0-9]+//g'
	julian catchen	--	June , 
	rodger voelker --		January ,  
	andy berglund  --		August , 
	william cresko () -		Mar , 
	john letaw	--	September

ubuntu@domU-12-31-39-0B-68-22:~/shell$ cat record.tsv |sed -E 's/[0-9]+         //'
341341	julian catchen	541-485-5128	June 3, 1978
1243	rodger voelker 541-234-January 12, 1981 
99999	andy berglund  541-498-August 03, 2000
37916	william cresko (541) 234-Mar 7, 1977
222	john letaw	123-455-7834	September 1996

Use sed to rename series of many files.

Use this to pipe files into pipe one line at a time:

ls -1

Posted in Genomics Workshop, Linux | 1 Comment

Workshop on Genomics (Notes, Day 2) – Sequencing Technologies

Posted on January 14, 2014 by Lisa Johnson

http://www.flickr.com/photos/lpcohen/11945854246/in/photostream
Dr. Konrad Paszkiewicz, University of Exeter

Lecture slides: http://evomicsorg.wpengine.netdna-cdn.com/wp-content/uploads/2014/01/Cesky-Krumlov-Lecture-1-ver6.pdf

From Sanger -> human genome -> 2nd gen -> 3rd gen -> Nanopore (not yet released)

Recommended videos on Sanger sequencing:

Illumina – Least expensive and most commonly used

Advantages:
– short run times
– straight-forward sample prep
– open source software

Disadvantages:
– short reads
– pooling samples
– sequence clusters on flow cell

Each cycle uses -> 1.) blocking, 2.) enzyme, 3.) fluorophore

* Paired-end sequencing is analyzed based on lane proximities, bridge amplification, each dot represents cluster, lane calling

Each cycle looks at color and position of dots:

Oversimplification of adding bases at each cycle: dye-labeled bases are added, then block, then block is washed away, laser excitation, the next cycle starts:

The resulting image of the flow cell, at each cycle, looks like this:

Algorithms take into consideration emission wavelengths at each cycle at each spot, and the reads are constructed from these. Each cycle looks at color and position of the dots.

36-150 nucleotides per read, 300 Gb data per run

Quality scores decrease as run progresses, error rate of enzyme increases

2 sets of amplifications – before (library prep, enough sample), during (to get enough signal)

Issues:
– cluster merging (Krueger et al. 2011)
– inverted repeats and GGC motif

454 Sequencing: longer read lengths, bead-based adapters at end of bead, pyrosequencing, phased out by Roche in 2012

SOliD: abandoned by Life Technologies in favor of Ion Torrent recently

Benchtop Sequencing:

Ion Torrent: no optics, pH sensor detects nucleotide by release of H+ (different for each base), no incorporation or termination, 400 bp reads, 2 hr runtime, various chip types

Ion Proton: 200 bp reads, single end, PGM vs. Proton

Useful benchtop technology review paper, Loman et al. (2012): http://www.nature.com/nbt/journal/v30/n5/full/nbt.2198.html

3rd gen: single molecule sequencing – Pac Bio only one on market at the moment (RSII)
– Zero Mode wave (ZMW) guide on an SMRT flow cell (looks like a microarray with fixed position
– camera detection
– library prep required
– 100,000-150,000 per run, few hours, $500/run
– 300-500Mb
– distribution of read lengths
– good for detecting epigenetic changes
– error rates, but algorithms to correct,
– look at materials on PacBio github: https://github.com/PacificBiosciences/Bioinformatics-Training/wiki
– Cicular consensus sequencing can be used

Nanopore = “very small hole”
– measures current changes as polymer passes through hole
– exonuclease to chop DNA
– cyclodextrin to measure charge across membrane

Strand sequencing, MinIon:
– 512 pore chip
– 100 Mb-1 Gb per run
– disposable after 6 hr run
– 4% error rate in trials
– requires library prep – introduces loops for feeding to Nanopore
– more for detection
– error profiles can be handled with different pore types with complimentary error
– cost is high

Illumina still least expensive of all technologies.

***Take-home message: Technology is becoming easier to acquire massive amounts of data. Bottleneck is now always at bioinformatics level.

Posted in Bioinformatics, Genomics Workshop, Sequencing | Leave a comment

Workshop on Genomics (Notes, Day 1) – Introduction to Linux

Posted on January 13, 2014 by Lisa Johnson

http://www.flickr.com/photos/lpcohen/11932547005/

Lecture slides: http://evomicsorg.wpengine.netdna-cdn.com/wp-content/uploads/2014/01/2014_genomics_IntroUnixPart1.pdf

Linux tutorial: http://evomics.org/learning/unix-tutorial/

Bioinformatics analysis, computational biology requires knowing UNIX/Linux.

Try to work in 2 terminals: one to keep track of files, one to execute commands

Slides, practice commands, navigating through paths

tab completion: auto-complete

up-arrow: history

variants to ls:

ls -l
ls -la
ls -lh

Four ways to view text file:

Assignment:

1. Move to the directory /etc

• What is the first line of the files ‘hosts’ in the directory /etc?

127.0.0.1 localhost

• What is the relative file path to get to /var/log from here?

../var/log

• What is the absolute path?

127.0.0.1 localhost

2. Move to the directory /var/log/

• What is the contents on line 73 of the dmesg file?

ubuntu@ip-10-234-15-248:/var/log$ sed -n '73,73p' dmesg
[5536105.882829] No AGP bridge found

• Without changing directories, what is the second line of the cpuinfo file in the proc directory?

ubuntu@ip-10-234-15-248:/var/log$ sed -n '2,2p' ../../proc/cpuinfo
vendor_id    : GenuineIntel

• What is the command to read this file with a relative path?

ubuntu@ip-10-234-15-248:/var/log$ cat ../../proc/cpuinfo

• An absolute path?

/proc/cpuinfo

3. Move back to the root, what directories do you see?

ubuntu@ip-10-234-15-248:~$ ls
assembly             include   shell
bin                  install   shotgun_metagenomics
build                lib       software
conf                 libexec   stacks
configure_freenx.sh  logs      Templates
Desktop              Music     tmp
Documents            nxsetup   transcriptomics
Downloads            Pictures  tutorial_materials
etc                  Public    var
genomics_tutorial    qc        Videos
html                 sbin
igv                  share

4. Move back home, what are three ways to move home from the root?

This is moving from home to root:

ubuntu@ip-10-234-15-248:~$ cd ../..

To move from root to home:

ubuntu@ip-10-234-15-248:/$ cd
ubuntu@ip-10-234-15-248:/$ cd ~/
ubuntu@ip-10-234-15-248:/$ cd ~

http://www.flickr.com/photos/lpcohen/11934544713/

Mixed, ‘Sequencing on illumina’ slide, Phred Quality Score is a measure of how clean peaks are . Q=-10(log10)p
Phred scores are not magical. can use to get rid of worst data, but hard to tell correctness
Translated into probability
10=1 in 10,
20=1 in 100,
30=1 in 1000

FASTQ,
quality score series of letters, use ASCII code (8 bits = 2^8 combinations = 256)

Wiki on FASTQ is really good.
IonTorrent is Phred+33
grep -c is counting everything with ‘@’
grep -c -v is counting everything except ‘@’
wc is “word count”
wc -l is line

important commands, ^C and ‘man gzip’ (displays manual)

Pipes, think of water moving to different steps: program1|program2|program3

‘Cut’ will let you take data from specific columns:

cut -f 10 batch_1.genotypes_1.loc

Cut, capture the output:

cut -f 1-10 batch_1.genotypes_1.loc > genos

cut, pipe the output to grep

cut -f 2 batch_1.genotypes_1.loc | grep -c "nnxnp"
cut -f 1-10,15,17 batch_1.genotypes_1.loc|grep "nnxnp" > genos2

Examine a marker, translating the output

cat batch_1.genotypes_1.loc|tr "     " "," | grep "^96053"

Ctl-v then ‘Tab’ will tell shell to override actual keyboard Tab command and read Tab from file

Useful exercise, can be written in one line with | :

s_1_sequence.txt.gz

1. Decompress the file
2. Count the number of raw reads (250,000)
3. Count the number of reads with barcode: CGATA
(19,501)
4. Capture all FASTQ records for ACCAT into a file called
sample_01.fq (you should get 18352 records, 73408 lines)
5. Determine the count of all barcodes in the file
286 CTAGT
7900 TCAGA
10659 ACTGC
10931 TGACC
11536 GAGAT
11871 CTGAA
14409 CGGCG
14508 TGGTT
18226 GAAGC
18352 ACCAT
18375 TCGAG
19501 CGATA
23012 AATTT
26336 GCATT
31136 CTAGG

Hints:
1. Use head when building a command, cat once the command is working
2. Look at the -n option for the head command, the -l
option for wc
3. The “^” character means “must occur at beginning of line” in a grep search
4. Look at the grep options: -c, -v, -A, -B
5. Read the man pages forsort and uniq to learn how to combine them

Ion Torrent is Phred+33: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217/?tool=pubmed

First:
1. grep out readlines (only the 2nd line) -> pipe
2. cut first 5 characters ->pipe
3. sort (automatically) alphabetically ->pipe
4. use uniq function with count, tells you how many times it counts > to file answer.txt
5. open file

head -n 100 s_1_sequence.txt | grep -A 1 '^@' | grep -v "^@" | grep -v "^--" | cut -c 1-5

Posted in Genomics Workshop, Linux | 1 Comment

Workshop on Genomics (Notes) – Intro Lab Setup

Posted on January 13, 2014 by Lisa Johnson

“Cloud” computing, essentially storage services (Hotmail, Gmail, flickr, iTunes, Netflix, Dropbox, etc.).

Article on ‘Wired’ about Google infrastructure. Data centers located all over world, water cooling required.

We’ll use Amazon Web Services (AWS).

Create virtual image. Can run all operating systems. Started with software written for gaming emulators. Terminals (secure shell) are essentially communicating devices, emulating old terminals that communicated with mainframes.

puTTY (text) or Gnome (graphical terminal with mouse coordinates), communication with cloud computing services for workshop

Username: student, Password: evomics

EC2, Elastic Computer Cloud. Create instance. Safe to explore, can’t mess anything up.

Click to access InstructionsforAmazonEC2.pdf

We use this for gnome to access AWS EC2: https://www.nomachine.com/

Posted in Genomics Workshop, Ubuntu | Leave a comment

High Performance Computing

Posted on December 12, 2013 by Lisa Johnson

“When I care how fast I get an answer”

Cluster user setup at work this week.

HPC has a lot of jargon and acronyms.

Intro: According to XSEDE (reference?), 31% of HPC users are in Molecular Biosciences

Concerned with number of Floating Point Operations Per Second (FLOP/s). The number the environment can provide is a concern. Delivers large amounts of processing capacity over long periods of time. Sustained vs. peak cycles – as close to 100% as possible. For example, “K” supercomputer >10PetaFLOP/s at 93%

Bioinformatics requires many task computing, completing large number of relatively short jobs. Need to effectively manage. We are task parallel vs. other applications more data parallel

Cluster: one or more networks, nodes connected. Software (MPI, open source) allows nodes to communicate, allows users to reserve resources. Log on to one node. 1 master, 4 slaves. Interconnect network.

Infiniband “fabric” with multicore parallel workstation

RDMA- Infiniband mediates interconnections

UMA- memory for all

NUMA- pools of memory

1 node has 64 (or 32?) cores

AMD vs. Intel processors, we have AMD

Cannot send data faster than latency and bandwidth of network (microseconds), want low latency.

Build RAM disk, put files directly on memory of computer, just like having solid state disk on laptop

Find code for GPU, co-processors (?)

Running faster, cache memory- order in which numbers are stored changes depending on what language you use. Efficient code will not waste space. Writing loops impacts the order in which data goes into CPU memory.

Running faster pipelines, n-stage pipeline, eliminate “if” loops with branch prediction, profile code if possible

Apparently FMA (floating multiplier add) not relevant to bioinformatics tasks?

Introduce checkpoints in code.

Posted in High Performance Computing | Leave a comment