Great experience participating in genome assembly tutorial by Dr. Lex Nederbragt, Norwegian Sequencing Centre:
https://github.com/lexnederbragt
http://flavors.me/flxlex
https://flxlexblog.wordpress.com/
https://www.flickr.com/photos/lpcohen/20850074396/
Notes below, feel free to comment and correct!
Atlantic cod genome assembly! http://www.nature.com/nature/journal/v477/n7363/full/nature10342.html
Ensembl annotation:
http://www.ensembl.org/Gadus_morhua/Info/Index
http://www.aquagenome.uio.no/
Experience switching to computational work from wet lab work, 2013: http://www.nature.com/naturejobs/science/articles/10.1038/nj7479-319a
http://genome.cshlp.org/content/20/9/1165.long
Focus on nonmodel organisms!
Visualization of developments in highthroughput sequencing:
https://flxlexblog.wordpress.com/2013/10/01/developments-in-next-generation-sequencing-october-2013-edition/
Velvet is good tool for teaching assembly. Not the best to use practically, but good for teaching.
http://ivory.idyll.org/blog/the-assembly-exercise.html
de Bruijn graphs described for limited alphabets. Suffix trees. One of three types of graphs, (read, overlap, de Bruijn). Schatz et al. 2010.
Start tutorial.
Live coding – do not give students tutorial, follow along as he goes. Different style from previous tutorials where students are given tutorial pre-written then follow along. The idea is for collaborative interaction during tutorial.
Launch EC2 instance, Ubuntu 14.04 m4.xlarge:
sudo apt-get update sudo apt-get install -y g++ gcc make git zlib1g-dev python
How do you remember that? Printed paper with notes!
Etherpad with typed commands:
https://etherpad.mozilla.org/ngs2015-week3
Print history for class to see.
wget https://www.ebi.ac.uk/~zerbino/velvet/velvet_1.2.10.tgz tar -xvzf velvet_1.2.10.tgz
cd velvet_1.2.10/ make 'MAXKMERLENGTH=127'
“If you couldn’t figure out if it did anything, re-run the commands.”
Stickies up and down. (Software Carpentry has a teaching method with colored post-it notes. Ours are pink and blue. http://software-carpentry.org/workshops/operations.html)
(Do not put spaces in PATH!!!)
Velvet installed. One hour to get here.
Now we need data.
mkdir velvet cd velvet mkdir data cd data wget https://www.dropbox.com/s/kopguhd9z2ffbf6/MiSeq_Ecoli_MG1655_50x_R1.fastq && wget https://www.dropbox.com/s/i99h7dnaq61hrrc/MiSeq_Ecoli_MG1655_50x_R2.fastq
Questions to ask students (not doing here):
- Are reads in the same order?
- What does /1 and /2 mean?
cd mkdir assembly cd assembly
We’re going to run the assembly and record results in a spreadsheet. We will each pick a kmer size. Here is the spreadsheet link to pick/assign ourselves a kmer:
https://docs.google.com/spreadsheets/d/1MejHg68VCicqs-8saAr267BN6sKoaK2LbiP-P7mLitI/edit#gid=0
I picked 99.
Why do kmers need be odd numbers? Can’t be even numbers unless self complementary kmers. Need figure. Titus: In theory simplifies, but doesn’t actually matter. Will: Palendromic kmers.
Build command for velvet. Explain each parameter.
velveth k99 99 -short -separate -fastq \ /home/ubuntu/velvet/data/MiSeq_Ecoli_MG1655_50x_R1.fastq \ /home/ubuntu/velvet/data/MiSeq_Ecoli_MG1655_50x_R2.fastq
Put in own numbers. These are the flags we will use. See manual for other options:
http://genome.cshlp.org/content/18/5/821.full
(Note, this software is no longer in development. We are using this for teaching purposes.)‘k99’ is name of folder
99 is parameter kmer size
-short (not sanger)
-separate (not interleaved)
-fastq (not fasta)
full path of files, including pwd (remind students to use tab complete)This will create a directory in the current working directory called ‘k99’. Then run the velvetg command:
velvetg k99/
Program will run, there will be some output stored in /k99/Log. Results:
Final graph has 583 nodes and n50 of 28687, max 100861, total 4564264, using 0/1546558 reads
Nodes are # nodes in the graph. N50 = greater than 50% of reads are >X length. Put numbers in spreadsheet.
Then, pick another number for k and run assembly again. I picked 45.
velveth k45 45 -short -separate -fastq \ /home/ubuntu/velvet/data/MiSeq_Ecoli_MG1655_50x_R1.fastq \ /home/ubuntu/velvet/data/MiSeq_Ecoli_MG1655_50x_R2.fastq velvetg k45
Results:
Final graph has 9427 nodes and n50 of 2363, max 12216, total 4727570, using 0/1546558 reads
This is a really cool learning experience to take part in a class of people all assembling data then populating table with results to make an x-y graph of N50 vs. k! Optimal kmer size for these data can be determined:
Doesn’t mean that at k81 all contigs are correct. Just an indication of lengths. But, good approximation. That is what we will use.
Quiz:
Find neighbor and discuss answer. Multiple choice question. Discuss which is correct and why. Link to Google form to enter answer. Then we discuss distribution of answers. Go back, discuss if we want to change our opinion based on what the class is thinking. Then vote again. Then look at what class thinks.
https://docs.google.com/forms/d/1KxKRpSvrZ1UzV_Ye7AUbtl3uaoSDQA0ndgk7PIVaGvk/viewform
We voted 4. Length of reads are 150. We know that error drops out as the read gets longer.
This is what the class thought:
2 min re-discussion, time on phone – rings. Class >80% still thinks #3 is correct.
Explanation: Higher kmer sizes, lose overlaps.
One sequence with ATTCG at end. Another sequence has same at beginning. Share stretch of kmers. Need long kmers to resolve repeats.
Titus: Does this change when there are more data? More errors?
Errors lead to more wrong kmers. Don’t have enough coverage to correct. Therefore #3 is correct. Not effective. If you have, e.g. 300x coverage with 150b reads, pushes kmer higher because excessive data some of which can be overlapped. Lower coverage, lower kmer size.
If you had perfect reads, you could have longer k. Errors are reducing effect of coverage.
Reading to do:
Zhang et al. 2014: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0101271
Chikhi and Medvedev: http://arxiv.org/abs/1304.5665
Click to access 2013-july-20-hitseq.pdf
ubuntu@ip-172-31-49-141:~/velvet/assembly/k81$ column -t stats.txt | less -S
Lesson in ‘less’!! This will allow you to scroll during less using the right arrow. How do you page 55 lines forward?
55f
55 lines back?
55b
Stats on all nodes. Length. Coverage.
Flipped kmer coverage distribution:
Two and a half hours to get here.
Then, we can make modifications and tell velvet we only want kmers above certain size. Run velvetg again (no need to run velveth again):
velvetg k81 -exp_cov 18 -cov_cutoff 4
Results:
Final graph has 393 nodes and n50 of 59759, max 156412, total 4564976, using 1409228/1546558 reads
If we did this practically, use Spades. Megahit with prokaryotic. Celera assembler good luck. This assembles with different kmer sizes and merges assemblies. Try with half the read-leangth.
velvetg k81 -exp_cov auto -cov_cutoff auto
Results:
Final graph has 361 nodes and n50 of 59759, max 156412, total 4562066, using 1408871/1546558 reads
In a course, you can have a discussion with students about:
- Assembly quality
- We didn’t use pairing information. We have paired reads, but we just told velvet ‘reads’. If we had, this gives the assembler more information, could potentially lead to better assemblies.
- Map to reference if there is one available. Cortex will allow reference based assembly.
- sequencing strategy for assembly application
Pingback: Week 3 – NGS2015 | Lisa Johnson Cohen
Pingback: Active learning strategies for bioinformatics teaching | In between lines of code