Workshop on Genomics (Notes, Day 2) – Advanced UNIX

UNIX tutorial instructions: http://evomics.org/learning/unix-tutorial/
Exercises, part 2: http://evomicsorg.wpengine.netdna-cdn.com/wp-content/uploads/2014/01/2014_genomics_IntroUnixPart2.pdf

Common UNIX commands: http://mally.stanford.edu/~sr/computing/basic-unix.html

Homework assignment 1: http://evomicsorg.wpengine.netdna-cdn.com/wp-content/uploads/2014/01/2014_genomics_unix_hw_1.pdf
Homework 2: http://evomicsorg.wpengine.netdna-cdn.com/wp-content/uploads/2014/01/2014_genomics_unix_hw_2.pdf

Yesterday: took a tab-separated file, manipulated data in columns, and filtered rows by value.
Also took a GenePop file and manipulated it with the shell program sed.

Today:
1. Review of pipes
2. Regular Expressions
3. sed
4. edit fastq file headers
5. Shell loops
6. Shell scripts

Yesterday, pipes:

cat seqs.fa | grep '^ACGT' > output

Only things that meet conditions of grep will make it through to output.
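
A minimal sketch of the pipe above, assuming a tiny seqs.fa created on the spot (the sequences and the temporary directory are made up for illustration and are not from the workshop data):

```shell
# Create a small example FASTA file (hypothetical contents, for illustration only)
workdir=$(mktemp -d)
printf '>seq1\nACGTTTGA\n>seq2\nTTGACGTA\n>seq3\nACGTACGT\n' > "$workdir/seqs.fa"

# Send the file through grep; only lines that start with ACGT reach the output file
cat "$workdir/seqs.fa" | grep '^ACGT' > "$workdir/output"

# Count the lines that made it through the filter
matches=$(wc -l < "$workdir/output" | tr -d ' ')
cat "$workdir/output"
```

The header lines and the sequence that starts with TTGA are dropped because they do not satisfy the grep condition.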

Today, Regular Expressions – finding patterns in text

Simplest: ‘.’ matches any single character
Character class, matches any one of the listed letters: [ACGT]
One or more of the digits 0-9: [0-9]+
Find someone’s first and last name (lower case): [a-z]+ [a-z]+
Phone number: [0-9]{3}\-[0-9]{3}\-[0-9]{4}
\ makes the hyphen literal (outside a character class the hyphen is already literal, so the backslash is optional)
Find American date format before the 10th, e.g. June 3, 1978: [a-zA-Z]+ [0-9], [0-9]{4}
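
The patterns above can be tried out with grep -E. This is only a sketch: the sample lines are made up to resemble record.tsv, and the hyphens are written without the backslash, which should match the same text.

```shell
# Sample lines (made up for illustration)
sample=$(printf 'julian catchen\n541-485-5128\nJune 3, 1978\n(541) 234-4522\n')

# -E enables extended regular expressions; -o prints only the matched text
names=$(printf '%s\n' "$sample" | grep -oE '[a-z]+ [a-z]+')
phones=$(printf '%s\n' "$sample" | grep -oE '[0-9]{3}-[0-9]{3}-[0-9]{4}')
dates=$(printf '%s\n' "$sample" | grep -oE '[a-zA-Z]+ [0-9], [0-9]{4}')

echo "name:  $names"
echo "phone: $phones"
echo "date:  $dates"
```

The number written as (541) 234-4522 is not reported as a phone number, because the pattern requires three digits followed directly by a hyphen.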

Command-line tricks:
Ctrl-a: move to beginning of line
Ctrl-e: move to end of line
Ctrl-d: delete the character under the cursor
Ctrl-v then Tab: insert a literal tab
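
Ctrl-v only works at the keyboard. Inside a script, one possible equivalent (an assumption, not something shown in the workshop) is to build the tab character with printf and use it in the pattern:

```shell
# Build a literal tab without the keyboard shortcut
tab=$(printf '\t')

# Made-up two-line input: the first line contains a tab, the second only a space
tabbed=$(printf 'id\tname\nid name\n' | grep "$tab")
echo "$tabbed"
```

Only the line that contains a real tab is printed.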


ubuntu@domU-12-31-39-0B-68-22:~/shell$ cp /unix_data/record.tsv.gz record.tsv.gz
ubuntu@domU-12-31-39-0B-68-22:~/shell$ gunzip record.tsv.gz

To highlight first and last name:

ubuntu@domU-12-31-39-0B-68-22:~/shell$ grep -E "[a-z]+ [a-z]+" record.tsv
341341	julian catchen	541-485-5128	June 3, 1978
1243	rodger voelker 541-234-4732		January 12, 1981 
99999	andy berglund  541-498-9999		August 03, 2000
37916	william cresko (541) 234-4522		Mar 7, 1977
222	john letaw	123-455-7834	September 1996

To highlight phone numbers:

ubuntu@domU-12-31-39-0B-68-22:~/shell$ grep -E "[0-9]{3}\-[0-9]{3}\-[0-9]{4}" record.tsv
341341	julian catchen	541-485-5128	June 3, 1978
1243	rodger voelker 541-234-4732		January 12, 1981 
99999	andy berglund  541-498-9999		August 03, 2000
222	john letaw	123-455-7834	September 1996

To match capital and lowercase words, character class order does not matter:

ubuntu@domU-12-31-39-0B-68-22:~/shell$ grep -E "[A-Za-z]+" record.tsv
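
A quick check that the order of ranges inside a character class does not matter, using a made-up line and grep -o so that only the first matched word is compared:

```shell
# One sample line (made up, in the style of record.tsv)
line='Mar 7, 1977'

# The same class written in two different orders
upper_first=$(printf '%s\n' "$line" | grep -oE '[A-Za-z]+')
lower_first=$(printf '%s\n' "$line" | grep -oE '[a-zA-Z]+')

echo "$upper_first"
echo "$lower_first"
```

Both commands report the same word, since a character class is a set and not a sequence.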

sed, a stream editor, continued

s/pattern/replace/

Search and replace:

ubuntu@domU-12-31-39-0B-68-22:~/shell$ cat record.tsv |sed -E 's/[a-z]+ [a-z]+/foo/'
341341	foo	541-485-5128	June 3, 1978
1243	foo 541-234-4732		January 12, 1981 
99999	foo  541-498-9999		August 03, 2000
37916	foo (541) 234-4522		Mar 7, 1977
222	foo	123-455-7834	September 1996
ubuntu@domU-12-31-39-0B-68-22:~/shell$ cat record.tsv |sed -E 's/[a-z]+) [a-z]+/\1/'
sed: -e expression #1, char 20: Unmatched ) or \)
ubuntu@domU-12-31-39-0B-68-22:~/shell$ cat record.tsv |sed -E 's/([a-z]+) [a-z]+/\1/'
341341	julian	541-485-5128	June 3, 1978
1243	rodger 541-234-4732		January 12, 1981 
99999	andy  541-498-9999		August 03, 2000
37916	william (541) 234-4522		Mar 7, 1977
222	john	123-455-7834	September 1996
ubuntu@domU-12-31-39-0B-68-22:~/shell$ cat record.tsv |sed -E 's/[0-9]+//'
	julian catchen	541-485-5128	June 3, 1978
	rodger voelker 541-234-4732		January 12, 1981 
	andy berglund  541-498-9999		August 03, 2000
	william cresko (541) 234-4522		Mar 7, 1977
	john letaw	123-455-7834	September 1996
ubuntu@domU-12-31-39-0B-68-22:~/shell$ cat record.tsv |sed -E 's/[0-9]+//g'
	julian catchen	--	June , 
	rodger voelker --		January ,  
	andy berglund  --		August , 
	william cresko () -		Mar , 
	john letaw	--	September 
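
To remove digits that are followed by two tabs (the tabs are written as \t here; at the prompt they can be typed with Ctrl-v then Tab):

ubuntu@domU-12-31-39-0B-68-22:~/shell$ cat record.tsv |sed -E 's/[0-9]+\t\t//'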
341341	julian catchen	541-485-5128	June 3, 1978
1243	rodger voelker 541-234-January 12, 1981 
99999	andy berglund  541-498-August 03, 2000
37916	william cresko (541) 234-Mar 7, 1977
222	john letaw	123-455-7834	September 1996
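
The capture-group and g-flag behaviour in the transcript can be sketched on a single made-up record. The second capture group and the "last, first" output format are additions for illustration and were not part of the session:

```shell
# One tab-separated record (made up, in the style of record.tsv)
record=$(printf '341341\tjulian catchen\t541-485-5128')

# \1 and \2 refer back to the first and second parenthesised groups
swapped=$(printf '%s\n' "$record" | sed -E 's/([a-z]+) ([a-z]+)/\2, \1/')

# Without g only the first run of digits on the line is removed
first_only=$(printf '%s\n' "$record" | sed -E 's/[0-9]+//')

# With g every run of digits on the line is removed
all_digits=$(printf '%s\n' "$record" | sed -E 's/[0-9]+//g')

printf '%s\n' "$swapped" "$first_only" "$all_digits"
```

With g, the phone number collapses to two hyphens, just as in the transcript above.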

sed can also be used to rename a series of many files.

To feed file names into a pipe one line at a time, use:

ls -1
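
One way the renaming could look, as a sketch only: the file names, the .txt to .tsv change, and the temporary directory are hypothetical, and the loop assumes file names without spaces or newlines.

```shell
# Set up a scratch directory with a few example files (hypothetical names)
workdir=$(mktemp -d)
touch "$workdir/sample_1.txt" "$workdir/sample_2.txt" "$workdir/sample_3.txt"

# ls -1 prints one name per line; the loop reads each name in turn
ls -1 "$workdir" | while read -r old; do
    # sed computes the new name by replacing the .txt extension with .tsv
    new=$(printf '%s\n' "$old" | sed -E 's/\.txt$/.tsv/')
    mv "$workdir/$old" "$workdir/$new"
done

ls -1 "$workdir"
```

Reading names from ls is convenient for simple cases like this one; for unusual file names a shell glob in a for loop is generally the safer choice.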

About Lisa Johnson

PhD candidate at UC Davis in Molecular, Cellular, and Integrative Physiology