Q&A at Porecamp TAMU, June 2017: From the Ebola Outbreak to Sequencing on Mars

As part of my PhD project at UC Davis with Dr. C. Titus Brown and Dr. Andrew Whitehead, I am assessing the feasibility of using Oxford Nanopore Technologies (ONT) MinION device to generate long reads of DNA sequencing data for de novo genome assemblies of several species of freshwater Fundulidae killifish. The final assemblies for each freshwater species will be used to compare genomic regions to their sister marine species, Fundulus heteroclitus.

The MinION is a portable (plugs into the USB port of a laptop!), real-time sequencing technology that can potentially yield reads >100 kb (distributed read lengths >10 kb). This is so super cool! Below is our MinION device chugging along last week with data from Fundulus nottii.


From June 5-9, 2017, I attended Porecamp USA at Texas A&M, hosted by the TAMU Agrilife sequencing facility, to get organized instruction and hands-on experience with the MinION. This was great for several reasons. Since ONT is a relatively new sequencing technology, troubleshooting efforts have been challenging. Updates to the platform occur rapidly. Attending Porecamp allowed me the opportunity to meet experts in person as well as other researchers who are also excited about this new technology. Twitter and email are great for asking questions, but there is no substitute to asking someone a question or having a discussion in person. Making connections and being part of the community of MinION users was an invaluable part of the Porecamp experience. I felt lucky to attend Porecamp, took an overwhelming amount of notes and wanted to share here so others can hopefully benefit. This will likely be just one in a series of blogposts to come!

The first night, there was a very informative Q&A session with the panel of instructors for the workshop, who are leading experts in the field: Nick LomanJosh QuickMatt LooseJohn Tyson, and Mick Watson.

Below is a transcript from my notes. Any mis-interpretation or misinformation is error on my part. Please let me know!

Q: For species with small body sizes or low yields, how feasible are whole genome assemblies from multiple individuals using reads from only ONT and/or ONT and other platforms?

A: Same for any genome assembly. What you want for a reference genome assembly is to get a haploid genome assembly from tons of DNA. There are some diploid assembly tools, which try to separate out haplotypes, but to be honest they’re not very good. As soon as you put multiple individuals into a pool for sequencing, you get a worse assembly because all the structural variation is unique to each individual and the assembly graph kind of falls apart where they don’t agree with one another. But it’s perfectly possible, people have been doing it for years with nematode worms and all sorts of other stuff. Heterozygosity is a massive problem for all genome assemblies all the way back to when we were doing them with Sanger. If you can develop some kind of cell lines and generate buckets of DNA, that’s often a good way to go.

Q: How much time should one allow for ordering flowcells? (Backstory: Ordering to delivery has taken from 1 week to 1 month and has been difficult to plan for fresh DNA extractions, library prep and sequencing library and it’s been challenging.)

A: I think that’s quite an interesting question actually. We’re a bit biased because we’re in the UK. And I don’t know what effect being in the UK has on flowcell supply. In 2014 we all got involved in Nanopore sequencing and it would be fair to say that there have been periods of time when flowcell supply has been restrictive, but now it seems to be very robust and very reliable. We do have Michael Micorescu from ONT here to help answer these questions at another time.

Q: What would you say is the shelf life of a flowcell?

A: The shelf life of a flowcell now, 8 weeks is absolutely fine. We’ve got flowcells that have been sitting in the fridge for longer than that and I’m still looking forward to using them. So, there’s no problem. 8 weeks is fine. It used to be that you wanted to use them as soon as they arrived, but that’s not the case now. They’re pretty stable.

Since they’re more stable than they were initially, and I think the flowcells are much more consistent and more reliable than they have ever been, I think. You need to get the temperature right, they can’t be frozen.

I can confirm from a postdoc who did received flowcells and put them in the -80 degC and that doesn’t work. We did call Nanopore and they asked what had happened. It was quite an interesting response was, we don’t know what will happen, can you try running them anyway. The answer is they don’t work. So, don’t put them at -80.

Q: What level of sequencing coverage is necessary to correct for sequencing error?

A: The short answer is we don’t really have a good answer for that question. It depends on a number of factors, including are you doing 1D or 1D^2 and what is the context GC homopolymeric content of your gene-level region of interest. I’d say as a rule of thumb, it is diminishing returns as you get above 100x coverage. You’ll probably got most of the benefit from consensus calling quite early on, probably around 25x. We found that when we get systematic assessments in the Ebola project, actually 25x of 2D and 50x of 1D was sufficient to get us the correct genotype. If you got to 100X you might get a little bit more, but what’s notable is you don’t really get the perfect consensus at any point because of residual homopolymeric problems. And that’s an open problem with the platform. So it’s not worth it to go to thousands of x coverage, it just makes everything slower an doesn’t really improve anything.

It’s also important to use the signal. We’ll talk about that later [in the workshop] when we talk about bioinformatics, the importance of using the signal, not just the basecalls. The signal is where there’s extra information that can be utilized to get a really good result. And that’s often not used. Typically, in Illumina sequencing, you don’t go back to the signal. But it’s important in Nanopore that you do look at the signal information.

Q: Are there batch effects from one flowcell to another. How wary should we be of combining data from multiple flowcells?

A: Within a single flowcell type, we’re on R9.5 at the moment, I don’t think there are any issues with combining data across flowcells. If you take exactly the same library and split it across 2 different flowcells it will perform equivalently. In fact, we did a really nice experiment just last week showing that and looking at the read length distributions and they’re identical across 2 flowcells. You might get slightly different yield. And we’ll talk a lot about the nitty gritty of this on the running days, the number of pores you have available for sequencing. It can get complicated if you are starting to merge data from different flowcell types. So, that’s always worth remembering which flowcell you’re looking at.

There is potentially a benefit from doing that? Didn’t Jared [Simpson] once say he could get good results combining R9 and R7.3. But they will have slightly different error profiles, though. Different pore types have slightly different error profiles.

I think there’s another point about that we often don’t have the same starting DNA from different samples in an experiment, if you try to do a controlled experiment. With the high input requirements, it makes it quite tempting to go and do another extraction. Just bear in mind that those are going to be technical replicates, or biological replicates. They can also be completely different conditions depending on if you gone back and regrown organisms. So, you could be called out if you’re trying to do a bunch of different experiments and combine them toegether, especially for something like transcription.

Q: What is the minimum DNA requirement for the Tagmentation protocol?

A: According to the protocol, it is 400 ng for R9.5. If you want to target very long reads, input would be 2 or 3 or 4 ug. (laughter) A very broad minimum range.

Q: What are the factors that contribute to how much data one flowcell can yield.

A: Ultimately it comes down to the quality of the DNA you put in, in terms of the number of the ability to generate enough active ends with the motor to bind to the flow cell. In some senses, need to keep the pores fed so you need to keep the concentration high enough so enough sequence goes through the pore and when it’s finished it needs to bind to another fragment. So you want to keep that process as efficient as possible so you maximize the usage of the pores as well. Every flowcell you have will have a different number of pores. Although they are very consistent in terms of the activity of what’s there. The difference is that the more pores you have, the more sequence you’re going to get, because there’s more active processing.

I think it’s fair to say there’s a fair amount of variability. You can have a library that looks good and has run well before and sometimes it will be disappointing and sometimes it will be incredible. Some flowcells just don’t seem to ever give up and will be extreme outliers. And I think it’s fair to say that no one really understands why that is. You can see some runs on the MinION now out to 10, 15, possibly even 17 Gb. But that’s not typical. Typical for us would be more like 5 Gb. It’s not particularly well understood why, otherwise they would make it consistent in the manufacturing.

Q: How are the newest chemistries solving error issues?

A: One of the biggest improvements recently wasn’t the chemistry but the software engineering with the recent version of albacore basecalling, which solves a lot of the homopolymer errors. So, it’s not just the chemistry. The R9 flowcells have been wonderful in terms of quality, but the biggest improvement has been the software.

To say something about the pore or the shape of the pore. I think the understanding is that the different pores have different reading edge shapes. And that reading edge makes some context easier to read and in some contexts more difficult to read. The net effect of that when they moved from the R7 pore, which no one knows exactly what the biological molecule was, but the R9 we do (the csgG protein from E. coli). That was a very clear improvement from one pore to the other and it certainly improved the ability to solve high GC, which was something that the R7 pore struggled with. And that had to do with the shape of the aperture and the effect that has on the electrical current signal. And we’ll talk about that a bit more.

But I think too that the software improvements are just as relevant to getting better results at the moment. Moving from quite simple models of signal using Hidden Markov Models to the start of using neural networks, recurrent neural networks and deep learning approaches which can use much more long range context, that can understand, for example, the effect of what sequence is in the motor protein quite far upstream of the pore, say 20 bases upstream, what effect that sequence has on things like movement time and downstream bases that are actually in the pore. Seems like factoring in that conditionality gives a lot of improvements.

I think it’s quite exciting that the improvements have been driven by bioinformatics actually. Makes me think that someone can come along and make it even better going forward. In fact, I think, for example, Oxford Nanopore is usually quite closed in their software releases until recently, but in the last few days have open sourced their based caller, scrappie for the first time, so that is quite exciting. That means that the community can have a go at trying to improve on their current best. And I think that part of the reason why they’ve done that is to encourage the community to get those last percent or so of residual error sorted out.

Q: How can I get a PromethION?

A: I’d quite like to know the answer to that myself. The simple answer is you pay some money to Oxford Nanopore and they send you one. But I guess that’s not the answer that was requested. I think they hand build them to a certain rate. They’re not being rolled out as fast as they could be because there are still some improvements in flowcell chemistry that need to be sorted out before they’re ready for prime time.

And they’re constantly improving the compute as well. I think people underestimate how big of a machine the PromethION is, actually. If it delivers 8 Gb from 48 flowcells, you think managing the data from one MinION is bad. Wait until you start playing with a PromethION. They’re coming. They’re out there. But the infrastructure needs to be setup.

Large-scale sequencing facilities are looking to use the PromethION. There is the idea that if you get accuracy up to a certain level with the long reads, you can move away from re-sequencing the human genome to more de novo and large-scale structural variant analysis on the human genome on the Nanopore platform, but there needs to be a cost benefit before people would consider switching routinely.

Q: Can you multiplex RNA on the MinION?

A: On the direct RNA protocol you can’t with barcodes. The cDNA protocol you could. What you can do surprisingly well is multiplex samples without barcodes, particularly if they’re sufficiently distinct. We do that with bacterial genomes, like Staph, Strep, Pseudomonas, although you could barcode. But just throw them all in together and save yourself a little lab time, because they will assembly out as long as the read length is sufficiently long, probably 10kb is more than adequate. They’ll just fall out.

Long amplicon cDNA, can do that as well. As long as they’re different. Goal is targeted sequencing, can just mix them in there.

You could also make your own barcodes.

Q: What would be your recommendation for a lab that is considering switching from Illumina to ONT sequencing?

A: I don’t think they’re equivalent. I’d think you’d want to keep both so you can have a broader range of applications. They’re complimentary technology. Particularly the wgs, you want the long reads to get all the structural variation to get the assembly in the first place, but it’s still useful to have Illumina to polish out the errors. For RNAseq, you can get full-length cDNA with the long reads, but there probably aren’t enough reads to be truly quantitative and do counting so you’re going to need Illumina to do that. I wouldn’t recommend doing 16S anyway.

If the question was would you recommend switching from PacBio to Nanopore, then yes. For smaller genome, like C. elegans genomes that are ~100Mb with structural variations in them, we can span those now. We have the benefit of being able to detect those now. You can use Illumina data to polish those out. It’s not like you have to have one or the other. The cost is getting so cheap that it’s silly not to use it to improve. One can do very well with one or the other. But why not combine? Which is what we’re doing at the moment.

It’s worth knowing that there’s a lot of unused Illumina capacity in the world. There are people who have HiSeqX that are not on full, and loads more people are going to buy NovaSeqs, which won’t be full. So, you don’t have to run your own Illumina. Just buy the data from any number of service providers that will give it to you rather cheaply.

Q: Can we link SmidgeION to cloud computing to compute real-time basecalling from mobile devices? How would this revolutionize genomics in the field?

A: My understanding for the intention of the SmidgeION is that you won’t be basecalling in the cloud, you’ll be basecalling on the phone. Which is really quite cool. So, it would be local, hand-held. Bear in mind that the SmidgeION is the smaller through-put device. So, I don’t think that you’ll need to use the cloud for basecalling. It’s worth saying that cloud basecalling is gone. It used to be that you would send your reads into the Amazon cloud. They’d be base-called there, and you would get your called data back. That’s gone. It’s now all local basecalling anyways.

In terms of how that will revolutionize genomics, it’s a subject we’re very passionate about, which is really pathogen detection and diagnosis in the field. The key thing for the SmidgeION is not that it’s necessarily on a mobile device, but that the cost of the flowcell should be much, much lower than the current cost of the flowcells (individually, $900/flowcell to get 2-10 Gbases). And that’s really critical for diagnostics. A $500 diagnostic is too much for many applications. (flowcells are $500 each if you buy bulk packs of 48 for $24,000, $475 for a pack of 300 for $142,500). A $20 diagnostic, if you can get discovery of a whole bunch of different pathogens in one assay, and typing and antibiotic resistance information, quickly, for things like TB, HIV, Malaria, Ebola. All of those are going to be incredible applications that you can have in the field. It will be a change to diagnostics. Right now diagnostics is pretty much focused on targeted, small numbers of pathogens that you assay at a time for things that you expect to find. So, replace that with an open-ended diagnostic where you can discover things that you might not expect to find. The classic example for us from work that we’ve done is during the West African Ebola epidemic, that wasn’t recognized as Ebola for a good 2-3 months after the first case, and that was because Ebola had never been seen in that part of Africa before. You don’t want to test for everything you can imagine it could be. You want to get a definitive diagnosis quickly. That means you can put the right measures in to stop small outbreaks from becoming larger-scale outbreaks and eventually epidemics. So, that’s what we think is going to be the really exciting thing about an even more portable sequencing diagnostic. But you could also imaging similar things for human genetics as well. Some peopel think that people will bring sequencing home for personal sequencing, like your fitbit, you’ll have your personal transcriptome measuring device. Whether that’s going to happen is going to be interesting. (It’ll happen in my house. But that’s just because I really like sequencing.)

I think the MinION has already revolutionized genomics. Then with the SmidgeION, it’s only incremental on top of that.

Q: Is the platform appropriate for potentially fragmented or degraded DNA, particularly things like ancient DNA?

A: I think it’s going to depend on if you can generate a piece of DNA. You’re not going to get really long pieces. But it’s like any system, if you can amplify, then you can generate sequence. You’re not necessarily going to get long pieces. The MinION can sequence your pieces, you just won’t get the benefit of long reads, which happens to be the main benefit of the platform at the moment. Justin O’Grady has this protocol where he extracts DNA from sputum and he uses the rapid PCR kit and regularly does that. I think the idea of looking at ancient DNA on the MinION would be very exciting in the sense that contamination would be the major issue. As well as DNA damage. You have to remember with MinION and Nanopore, you are sequencing the native molecule. So, we’re actually looking at the molecule itself, not a secondary copy of it [like you are with the Illumina platform and sequencing-by-synthesis technology]. So, you are able to look at what is the damage that’s there. What are the changes that happen over time to the DNA molecule. That’s going to be really exciting.

We think it’s going to be huge in the direct RNA sequencing. We think it’s going to be the major draws. We’ve seen this in the little bit we have done so far. There are certainly modifications you can spot just on the fact that it throws the basecaller off compared to what the reference sequence should be. So, you can have a reference sequence from the exact same sample and you don’t see base changes, but the basecaller from the MinION is showing a change. That may be indicative of some direct RNA modification to hte base that’s going on there. It’s a power in being able to actually see what is there. Being able to interpret that requires the ability to train and recognize those different chemistries but I think that will come with time for sure.

We were just down at ASM the last few days in New Orleans, hearing Astronaut Dr. Kate Rubins on sequencing on the ISS. The idea was that NASA wanted to get a sequencing capacity sorted out for the first trips to Mars with the idea that if there are similar macromolecules to biological ones, something like a Nanopore sequencer would be the correct way of assaying them. A little bit science fiction, but sequencing in space on something like the ISS would have been regarded as science fiction just 3 or 4 years ago. And it’s happened. So, maybe we should keep an open mind about the opportunity for sequencing alien life on Mars.


(Slide from Matt Loose)

Posted in biotech, workshops | 1 Comment

ASLO Aquatic Sciences meeting – Honolulu, HI


Last week I gave a talk on re-assembling de novo transcriptomes for the MMETSP at the Association for the Sciences of Limnology & Oceanography (ASLO) Aquatic Sciences meeting in Honolulu, HI held at the Hawaii Convention Center from February 27 to March 3, 2017.

Here are links to the latest version of files:

Re-assembly .fasta files for individual MMETSP samples

Repository with pipeline code


Annotation, expression quantification, and peptide translations are also available. About 100 annotations are still incomplete because the dammit pipeline did not finish properly on our HPC, so I’m in the process of re-doing annotations for these.

It was a good experience presenting to a diverse audience of oceanographers, tool users and developers in the session on Advances in Aquatic Meta-Omics: Creating Tools for More Accurate Characterization of Microbial Communities, organized by Brooke Nunn and Emma Timmins-Shiffman. Some good questions and issues came up about the extra content in our re-assemblies compared to NCGR, contamination, and the need to combine multiple samples from the same strain into one assembly.

Special thank you to Harriet Alexander, a postdoc in our lab who was attending the meeting and provided positive feedback and encouragement before my presentation; my advisor, C. Titus Brown for supporting the trip to the meeting; to the DIB lab for helpful feedback prior to the meeting; and, of course, thank you to the Moore Foundation for supporting this work!

Relevant to pushing limits of institutional high performance computing clusters (HPC) with this MMETSP data set (1 TB raw data), while I was away, Amanda Charbonneau wrote this wonderful poem:

ASLO_HIconventionctr.png  LisaCohen_ASLOtalk.png

Overall, I enjoyed the ASLO meeting (click here for #ASLOmtg tweets). I thought that it was well-organized and welcoming to a diversity of participants. In addition to the full schedule of scientific talks, there were many inspiring messages delivered to the aquatic sciences community related to science communication, teamwork, environmental resource sustainability, and gender issues. In a time of political uncertainty regarding environmental protection and the sustainability of public funding for scientific research, it was refreshing to hear senior scientists in leadership positions speak about these topics in a way that left me optimistic for the future.

I was filled with joy during the opening plenary to hear Kalani Quiocho speak about traditional natural resources management practices in Hawai’i and other Pacific islands. He patiently explained to the audience the etymology of several Hawai’ian words, including Ahupua’a (watershed areas), which literally means “cairn” + “pig” because offerings were traditionally made in payment to the chiefs, who were once the watershed resource managers. After the meeting was over and I drove around the island, I noticed road signs denoting different Ahupua’a areas all over the island. This partitioning of land and water resources, local ownership and management by villages and chiefs is common to many Pacific islands. From my time as a Peace Corps Volunteer in Yap (which Kalani showed a picture of in his presentation!), I learned that island cultures and economies are deeply dependent and tied to their land and water resources. With increasing globalization and changing climate, islands are some of the most vulnerable and special environments in the world, capable of serving as miner’s canaries for civilization.

The science presented at this meeting is all contributing towards a better understanding and conservation of all aquatic environments. The technology and -omics tools being developed to analyze a growing volume of data are essential to be able to interpret signals of ecological and biogeochemical significance in our changing environments. I learned about many interesting papers, got some new ideas ideas, and met new people. I do agree that one of the main benefits of attending conferences is the energy derived from networking with people and exchanging ideas.

What a beautiful location to have a meeting.


I also saw and learned the name of the state fish! Humuhumunukunukuāpua‘a, the Hawaaiin Triggerfish.



Posted in MMETSP, science | Leave a comment

DIB Lab Retreat to Yosemite, February 2017

This past weekend, all of us in Titus Brown’s Data Intensive Biology (DIB) lab went to Yosemite Bug, which is just outside Yosemite National Park, for our first (annual?) lab retreat. We had a great time! I personally found it inspiring to gather thoughts on the direction of research in the lab, ask questions about what everyone else is working on, and think about how my research goals fit into the larger picture of the lab.

Here are some notes from the weekend in case anyone is interested. Please comment and ask questions. Further discussion is welcome!

Photos by Harriet Alexander (left – Camille Scott looking far yonder) and Lisa Cohen (right)


In October, about 4 months prior, we all agreed on location, date and began planning (gathering info, booked rooms and conference space from the resort). About one week prior, we had a brainstorming meeting about the schedule and what we would discuss.

Everyone drove up (~3.5 hrs from Davis) to Yosemite Bug on Friday. We discussed lab business on Sat and Sunday with some time in the middle to enjoy the outdoors and discuss with each other in an informal setting. The idea was to stimulate discussions about the lab (e.g. research, culture and career development) in a context outside the lab. We wanted it to be different than a conference. A retreat would just be our group, more broad/casual than regular lab meetings, to discuss the big picture of the lab’s research direction.

Saturday morning, presentations

To open, Titus identified major themes in the lab right now:
* Expect many big samples continuously arriving,
* Sketch data structures and online/streaming algorithms are good,
* Pre-filtering is good, especially when each step has low false negative rate
* Decentralized is good

Throughout the morning, there were presentations from the major projects in the lab. Presentations were 10 min each with 5 min discussions. Some hot topics bled over to be ~30 min each. These were informal talks with markers and flip chart only (no slides or projector allowed). The internet in the resort was patchy, so luckily the goal of the retreat was not to work on anything requiring an internet connection.

People gave a broad outline of what they are doing, followed by one or two things we’re excited about (enables X and Y, or Z is an opportunity), then 5 minutes of questions.

* Camille Scott / Streaming the RNAseq
* Luiz Irber / Architecture of all the buzzwords (amazing basic-level explanation of the internet for those of us who are unfamiliar)
* Taylor Reiter / sourmash RNAseq
* Daniel Standage / kevlar
* Harriet Alexander and Lisa Cohen / MMETSP and challenges of multi-species data analysis
* Tamer Mansour / Progress and opportunities in vet genetics

Sat afternoon, free time!

Weather was great, sunny! We had anticipated not-so-great weather with just-above-freezing rain. But, this was not the case. In the afternoon, we all piled into cars for exploration of Yosemite National Park! Shannon Joslin did an amazing job of summarizing available social activities into this list:


Sat evening, social time!

Jessica Mizzi, who takes fun VERY seriously, coordinated games and activities:


I participated in a few heated games of Settlers of Catan and Pictionary. It turns out there are several members of our lab who are relentless resource emperors and that there are varying degrees of artistic abilities. 🙂

pictionary image-uploaded-from-ios-1
Photos by Camille Scott (left) and Daniel Standage (right)

Sunday morning

Two postdoc lab members recently attended the Moore DDD early career workshop and brought back suggestions for continued discussion on the field of ‘data science’. We found it useful to discuss the larger context of how we market ourselves, develop our careers, and fit ourselves into biological research. Data-intensive biology is a large field. In our lab alone, we represent diverse disciplines, e.g. Software Engineering, Genomics, Biological Oceanography, Comparative Physiology, Medicine, Mathematics, just to name a few. We cannot each have a deep understanding of all of these peripherally-related topics. Yet, our collective knowledge is great. How can we better extract overlapping skills from each other to solve hard problems?

We broke out into 3 groups of 6 people at a mixture of career levels, e.g. beginning grad student, mid-level grad students, postdocs, post-PhD industry-bound to address these specific questions:

* How does Person x learn y topic?
* What works?
* How do we teach Davis community about y topics? (Especially if when we might not necessarily know these things ourselves.)

Discussion points

The following is an approach I’m trying based on some helpful blogging advice: choosing words and phrases explaining what has worked for us (or me, specifically) rather than telling people who read this what they should be doing. This is because I am more apt to listen to someone else’s wisdom gained from their own experiences.

– Learning topics has depended on why we want to learn.

– Up to the learner for finding motivation, not necessarily a list of what others think you need to know. Although, we acknowledged that it is hard to figure out what you need to know, if you don’t know. Some base level knowledge is required.

– We have been told that skills in bioinformatics are required for successful future careers. However, there is no institutional-level plans for how to disseminate these skills to learners.

– Beginning learners can feel overwhelmed because of the interdisciplinary nature of bioinformatics, sometimes requiring a combination of knowledge and skills in computer programming, statistics, cell and molecular biology, etc.

– It has helped many of us to take a project-based learning approach.

– Three motivating scenarios were identified for developing a working knowledge of bioinformatics skills:

  1. Biologist generating data, e.g. RNAseq for differential expression. In the long-term it doesn’t seem to make sense to rely on a sequencing facility to analyze data because decisions made during analysis affect the results. Making these decisions requires revisiting the question of why the data were generated in the first place, which is not necessarily within the scope of an independently contracted analyst to be familiar with. It has been our experience that data are best analyzed by people who know the projects very well.
  2. Data analyst understanding many projects simultaneously and advising those generating and analyzing their own data what is the best way to approach analysis based on their own experiences, consensus in the field and benchmark testing.
  3. Data Scientist at a senior level guiding the direction of a research, training program, and developing new methods for processing data.

– Our lab has representation from all three of these categories.

– Some combination of internet-learning, buddy system, participating in a community are all key aspects of learning bioinformatics skills that seem to work for all of us.

– Buddy system. Many of us have found that forming connections with a person or a community of people at a knowledgeable level to answer questions has been necessary for our learning process. Community and personal connections can be fostered via workshops, classes conferences, social events, friendships.

– We have found good luck with using opportunities to collaborate, asking for advice from experts when we meet them. The great thing about this lab and knowing Titus is being able to take advantage of his far-reaching network of collaborators.

– Internet-learning by Google searching. Stackoverflow is our friend

– Some of us have chosen a good book, e.g. Practical Computing for Biologists by Haddock and Dunn, to read and go through the exercises on a regular basis together with a group of people

– We’ve found it helpful to join a community to ask/answer questions. We are actively working towards fostering such a community at UC Davis via the DIB lab! See our website for training workshop schedule and to sign-up for the email list: http://dib-training.readthedocs.io/en/pub/

– It has been our experience that significant investment of personal time is required to learn.

Here are our flip chart notes from this discussion:

Sunday afternoon

The last afternoon discussion centered around lab culture hacking, i.e. what are we doing well, what needs improvement. A motivational speech from Titus: there are always going to be various things the lab can do better, but generally, we’re in a good place! The lab is a set of opportunities. Choose your own adventure. If we’re not doing something, we can provide resources to accomplish goals. Overall, his expectations are for us to us do wonderful and unexpected things. Preferably multiple wonderful things!

Then Titus left for an hour and a half while we discussed the lab. Topics included more frequent journal club, more frequent project reporting and scrum at every lab meeting (rather than one designated presenter each meeting only presenting slides on their own research the whole time), lab communication on Slack vs. email vs. Google calendar for scheduling. The common theme was that while our projects are all very different, we are all connected and the onus is on us to take more initiative to communicate with one another. We talked about positive and negative aspects of the lab. But generally concluded that our lab is awesome, because of our strong community and diverse backgrounds of our lab members. The meeting adjourned, with some of us returning to watch the Super Bowl while others of us stayed on to play more Settlers of Catan and Pictionary!

Thank you to Yosemite Bug, for the quiet, cozy, accommodating place for our group to stay and be productive this weekend. It was a perfect, small venue for this retreat.

Thank you, Titus for bringing us on this retreat. Thank you, everyone in the DIB lab for being fun people. And thank you, Moore Foundation for funding!

Photo by James Word

Photo by Shannon Joslin

Posted in Bioinformatics | Leave a comment

MMETSP re-assemblies

*** Update (9/27/2016): 719 assemblies now available at the links below. All FINISHED!!!

*** Update (9/22/2016): 715 assemblies now available at the links below.

I have *almost* finished re-assembling de novo transcriptomes from the Marine Microbial Eukaryotic Transcriptome Sequencing Project (Keeling et al. 2014). This is an effort that I have been working on over the past year as a PhD student in Titus Brown’s lab at UC Davis. The lab has developed the Eel pond khmer protocol for assembling RNAseq data de novo from any species. I adapted and automated this protocol for Illumina PEx50 data from the MMETSP.

Here are 659 assemblies:


We’re not completely finished yet, but in the meantime we wanted to make these available in case they are useful to you. If you do use, (per academic protocol) please cite us:

Cohen, Lisa; Alexander, Harriet; Brown, C. Titus (2016): Marine Microbial Eukaryotic Transcriptome Sequencing Project, re-assemblies. figshare. https://dx.doi.org/10.6084/m9.figshare.3840153.v3

Let us know if you have any questions or difficulties. As a caveat: we aren’t guaranteeing that these assemblies are better than the NCGR assemblies. They are different as a whole. Here is some quality information and spreadsheets to look up your specific MMETSP ID of interest:

BUSCO scores

Transrate scores

Automated Scripts


I’m still working on makings these scripts more accessible and useful for people. Thanks especially to the DIB lab code review session, I will be working on these over the next month. Stay tuned!

The pipeline (flowchart below created with draw.io) is as follows: 1.) getata.py, 2.) trim_qc.py, 3.) diginorm_mmetsp, 4.) assembly.py. All of the Python scripts are controlled by the SraRunInfo.csv metadata spreadsheet downloaded from NCBI that contains the url and sample ID information.


This pipeline could be used with SraRunInfo.csv obtained for any collection of samples on NCBI:


Is there value in re-assembling what has already been done?

The NCGR already performed assemblies with their pipeline in 2014 when these data were sequenced (more information about their pipeline). Why are we interested in doing it again?

Software and best practices for RNAseq de novo transcriptome assembly are changing rapidly, even since 2014. We found preliminary evidence that our assemblies are different than NCGR. Taking a look at some quality metrics for 589 new assemblies, here are the proportion of contigs from our new DIB assemblies aligning with assemblies from NCGR as reference (left), compared to the reverse where the NCGR assemblies are aligned to our new DIB assemblies (right).


This confirms our preliminary evidence that we have assembled nearly everything the NCGR did, plus extra. Still not sure what the extra stuff is, if it’s useful.

There are some samples that have performed well (shown below BUSCO scores – left – and transrate scores – right), while others that have not.


The effects of one pipeline vs another – which software programs are used and decisions made during the pipeline are not well understood. Are there biological consequences, i.e. are there more protein coding regions detected, between pipelines? Ours may or may not be better, but they’re different so we’re exploring how and why.

If our results are different for these assemblies, chances are high that results from other people’s assemblies are dependent on the methods they are using to do their assemblies. Every month, there are ~a dozen publications announcing a new de novo transcriptome of a eukaryotic species. Just in 2016, there have been 743 publications (as of 9/13/2016) on “de novo transcriptome assembly”. When new software programs are developed, updated versions are released, decisions are made about which software programs and what pipeline/workflow to use. What are the effects of these decisions on the results being produced by the scientific community?

This is the first time – as far as we are aware – that anyone has looked carefully at a mass of assemblies like this from a diversity of organisms. This is a really great data set for a number of reasons: 1) a diversity of species, 2) library prep and sequencing was all done by the same facility. This is such a large number of samples that we can apply a statistical analysis to examine the distribution of evaluation metrics, e.g. transrate scores, BUSCO scores, proportion of contigs aligning to the reference, etc.

We’re really excited to be working with these data! And very grateful that these samples have been collected for this project and that the raw data and assemblies done by NCGR are available to the public!

Scripts for this project were developed on different systems:

The mystery of the extra samples:

This is more of an issue related to managing data from a project with this number of samples rather than a scientific problem. But it required some forensic files investigation. The number of samples in the SRA Bioproject is different than the number of samples on imicrobe. There are 678 samples approved by the MMETSP:


Whereas there are 719 Experiments in the SRA:



[1] “MMETSP0132_2” “MMETSP0196” “MMETSP0398” “MMETSP0451_2” “MMETSP0922” “MMETSP0924”
> OLlist$Venn_List$imicrobe
[1] “MMETSP0132_2C” “MMETSP0196C” “MMETSP0398C” “MMETSP0419_2” “MMETSP0451_2C”
[6] “MMETSP0922C” “MMETSP0924C”

It turns out that these “extras” on either side of the venn diagram were just ID with the letter “C” on the end.

For some reason, MMETSP0419_2 (Tetraselmis sp. GSL018) has its own BioProject accession: PRJNA245816. So, PRJNA248394 has 718 samples. Total, there are 719 MMETSP samples in the SRA: PRJNA231566. In the SraRunInfo.csv, the “SampleName” column for the extra sample (SRR1264963) says “Tetraselmis sp. GSL018” rather than MMETSP0419_2.

The next thing to figure out was that there are more IDs in SRA than in imicrobe because some MMETSP ID have multiple, separate Run ID in SRA:

> length(unique(MMETSP_id_NCBI))
[1] 677
> length(unique(ncbi$Run))
[1] 718
> length(unique(MMETSP_id_imicrobe))
[1] 678

These are the samples that have multiple MMETSP ID in SRA:


Samples were assembled individually by each SRR id.

Naming files is hard.

It took me >a day, plus some time thinking about this problem beforehand, to sort out what to name these assembly files for sharing. There are 2 useful ID for each sample: MMTESP id (MMETSPXXXX) and NCBI-SRA id (SRRXXXXXXX). Since files were downloaded from NCBI and I’m assembling samples individually, it made sense to organize by unique SRA id since there can be multiple MMETSP for some SRR. IDs are not recognizeable by humans reading them, though. So, in naming these files I wanted to include scientific names (Genus_species_strain) as well as the ids:


Upon further examination, it turns out that some of the Genus and species names were different between the metadata from the SRA and imicrobe. For example:

Different imicrobe: Nitzschia_punctata SRA: Tryblionella_compressa
Different imicrobe: Chrysochromulina_polylepis SRA: Prymnesium_polylepis
Different imicrobe: Crustomastix_stigmata SRA: Crustomastix_stigmatica
Different imicrobe: Chrysochromulina_polylepis SRA: Prymnesium_polylepis
Different imicrobe: Compsopogon_coeruleus SRA: Compsopogon_caeruleus
Different imicrobe: Lingulodinium_polyedra SRA: Lingulodinium_polyedrum

There were >100 of these differences. Some were spelling errors, but others, like the Lingulodinium polyedra vs. Lingulodinium polyedrum or Copsopogon coeruleus vs. Compsopogon caeruleus examples above, I wasn’t sure if this was a spelling preference or an actual difference. The scientific naming in the SRA is linked to the NCBI taxonomy convention, but it could be possible that the names assigned by experts in this field (thus making their way into the metadata hosted by imicrobe) are further ahead in its naming than NCBI. So, in these cases, I included both SRA and imicrobe names.


It was also necessary to clean the metadata to remove special, illegal characters like “)(%\/?’. Some of the assembly file names now have multiple “_” and “-” because characters had to be stripped out. OpenRefine is a fantastic program to automatically do this task. Anyone can freely use it. Those who manage projects with metadata input by a consortium of people individually entering data by hand should especially use OpenRefine! It will even cluster similar spellings and help to catch data entry typos! Data Carpentry has a fantastic lesson to walk you through using OpenRefine. Use it. It will make your life easier.

Uploading files

It turns out that version-controlling file storage and sharing for hundreds of files is not straight-forward yet. We explored figshare, Box, Dropbox, S3, ftp server hosting, and/or institutional server storage. For reasons, we chose figshare (for one .zip file) and Box cloud storage for individual files. As we get the last of the assemblies, we’ll move files and update the links at the top of this blog post.

Downloading files

We chose to use the NCBI version of these files. ENA numbers were not as easy as SRA to locate a metadata spreadsheet with a predictable url address for each sample. The imicrobe project hosts these files, but the files do not follow a predictable pattern to facilitate downloading all of the data. So, we downloaded the fastq from NCBI.

When we wanted to compare our assemblies to NCGR assemblies, Luiz Irber wrote this wonderful script for downloading the NCGR assemblies for all of the samples from the imicrobe ftp server.


# from Luiz Irber
# sudo apt-get install parallel

seq -w 000 147 | parallel -t -j 5 "wget --spider -r \\
ftp://ftp.imicrobe.us/projects/104/transcriptomes/MMETSP{} 2>&1 \\
| grep --line-buffered -o -E 'ftp\:\/\/.*\.cds\.fa\.gz' > urls{}.txt"
cat urls* | sort -u > download.txt
cat download.txt | parallel -j 30 -t "wget -c {}"

Loose ends:

Most assembles are done. Some aren’t.



  • Complete the remaining 59 assemblies. Assess what happened to these 59 samples.
  • Assess which samples are performing poorly according to evaluation metrics.
  • Look at Transrate output to help answer, “What is the extra stuff?”
  • Make scripts more accessible.
  • Partition a set of samples for benchmarking additional software. (With bioboxes?)
  • Update khmer protocols.


Special thanks to Moore Foundation for funding, Titus Brown, Harriet Alexander and everyone in the DIB lab for guidance and information. Thanks also to Matt MacManes‘ suggestions and helpful tutorial and installation instructions and Torsten Seeman‘s and Rob Patro‘s helpful assistance at NGS2016 with asking questions and getting BUSCO and Transrate running.

Posted in Bioinformatics, cluster, Data Analyses, MMETSP, reproducibility, science | Leave a comment

Adventures with ONT MinION at MBL’s Microbial Diversity Course

My time here at the Microbial Diversity Course at MBL has come to an end after visiting for the past 2 weeks with our lab’s MinION to sequence genomes from bacterial isolates that students collected from the Trunk River in Woods Hole, MA. (Photo by Jared Leadbetter of ‘Ectocooler’ Tenacibaculum sp. isolated by Rebecca Mickol).

13782048_10207768876873119_9151405278597781237_n (1)

In Titus Brown’s DIB lab, we’ve been pretty excited about the Oxford Nanopore Technologies MinION sequencer for several reasons:

1) It’s small and portable. This make the MinION a great teaching tool! You can take it to workshops. Students can collect samples, extract DNA, library prep, sequence, and learn to use assembly and annotation bioinformatics software tools within the period of 1 week.

2.) We’re interested in developing streaming software that’s compatible with the sequencing ->  find what you want -> stop sequencing workflow.

3.) Long reads can be used in a similar way as PacBio data to resolve genome and transcriptome (!) assemblies from existing  Illumina data.

Working with any new technology, especially from a new company, requires troubleshooting. While Twitter posts are cool, they tend to make it seem very easy. There is a MAP community for MinION users, but a login is required and public searching is not possible. In comparison to Illumina sequencing, there is not that much experience out there yet.


These are usually saved for the end, but since this is a long blog post, thought I would front-load on the gratitude.

I have really benefitted from blog posts from Keith Robison and lonelyjoeparker. Nick Loman and Josh Quick’s experiences have also been beneficial.

There is no match, though, for having people in person to talk to about technical challenges. Megan Dennis and Maika Malig at UC Davis have provided amazing supportive guidance for us in the past few months with lab space and sharing their own experiences with the MinION. I’m very grateful to be at UC Davis and working with Megan.

This trip was made possible by support from my PI, Titus Brown, who provided funding for my trip and all the flowcells and reagents for the MinION sequencing. It was necessary to have this 2 week block of time to focus on nothing else but getting the MinION to work, ask questions, and figure out what works (and what doesn’t).

Special thanks to Rebecca Mickol and Kirsten Grond in the Microbial Diversity course for isolating and culturing the super cool bacterial samples. Scott Dawson at UC Davis (faculty at the Microbial Diversity course) was instrumental in helping with DNA extractions. Jessica Mizzi assisted with library prep protocol development and troubleshooting. Harriet Alexander assisted with the assembly, library prep and showing me around Woods Hole, which is a lovely place to visit. Thank you also to the MBL Microbial Diversity Course, Hilary Morrison and the Bay Paul Center for hosting lab space for this work to take place.


Files: https://github.com/ljcohen/dib_ONP_MinION

Presentation slides: https://docs.google.com/presentation/d/1Zqd1ayumdZqYc5e8bfeul8-57trKGwGdOFUdPc2-mIU/edit?usp=sharing

Immediately following the Woods Hole visit at MBL, I went to the MSU Kellogg Biological Station as a TA for NGS 2016 course and wrote a tutorial for analyzing ONP data:


Purchasing and Shipping

Advice: allow 2-3 months for ordering. We ordered one month in advance. While ONP customer service probably worked overtime to send our flowcells and Mk1B after several emails, chats, and calling in special favors, in the future it is unclear whether we can count on a scheduled delivery with students. Communication required ~dozen emails and we could never get confirmation that flowcells would arrive in time for the course. It turns out that our order had been shipped and arrived on time, however we did not know about it because a tracking number was not sent to us. It took about a day of emailing and waiting to track the boxes down. Thankfully, the boxes were stored properly in the MBL shipping warehouse.

Communicate with ONP constantly. Stay on top of shipments, ask for tracking numbers and confirmation of shipment. Find out where the shipment is being delivered, as the address you’ve entered may not be the one on the shipping box and your order will be delivered to the wrong place.


QC the flowcells immediately. Bubbles are bad:


We ordered 7 flowcells (5 + 2 that came with the starter pack).The flow cells seemed to have inconsistent pore numbers and some arrived with bubbles. One flow cell had zero pores. They sent us a replacement for this flowcell within days, which was very helpful. However, for the flowcells that had bubbles, I was given instructions by ONP technical staff to draw back a small 15 ul vol of fluid to try to remove the bubble, then QC again. This did not work. The performance of these flowcells did not meet our expectations.

In communicating with the company, we were told that there was no warranty on the flowcells.

DNA Extractions

The ONP protocol says that at least  500-1000 ng clean DNA is required for successful library prep. Try to aim for more than this. Try to get as much DNA, as high molecular weight as possible. Be careful with your DNA. Do not mix liquids by pipetting. For the bacterial isolates from liquid culture, Scott Dawson recommended using Qiagen size exclusion columns to purify, and this worked really well for us. We started with ~2000 ug and used the FFPE repair step.

The ONP protocol includes shearing with the Covaris gtube to 8kb. When I eliminated this step to preserve longer strands, there was little to no yield and samples with adequate yield had poor sequencing results. In communicating with ONP about this, we suspected that the strands were shearing on their own somewhere during the multiple reactions, then either getting washed away during the bead cleanup steps, or the tether and hairpin adapters were sheared off so the strands were not being recognized by the pores.

We sequenced all three sets of DNA below (ladder 1-10kb). The Maxwell prep (gel below on the left) had a decent library quantity but the sequencing read lengths were not as long as we would have liked, which makes sense given the small smeary bands seen. (poretools stats report)


Library prep

When we first started troubleshooting the MinION, the protocols available through the MAP were difficult to follow in the lab. We needed a sheet to just print out and follow from the bench, so we created this:


A few months ago, ONP came out with a pdf checklist for library prep, which is great:


The library prep is pretty straight forward. One important thing I learned about the NEB Blunt/TA Master Mix:

Library prep and loading samples onto the flowcell can be tricky and nerve wracking for those who are not comfortable with lab work. I have >4 yrs of molecular lab experience knowing how to treat reagents, quick spins, pipetting small volumes, how to be careful not to waste reagents. One important point to convey to those who do not do molecular lab work often, is the viscous, sticky enzyme mixes that come in glycerol. You think you’re sucking up a certain volume, but an equal amount is often stuck to the outside of your pipette tip. You have to wipe it on the side of the tube to get it off so you don’t add this to your rxn volume, changing the optimal concentration and (probably the most important) also wasting reagent.

Other misc. advice:

  • The calculation: M1V1 = M2V2 is your friend.
  • Don’t mix by pipetting.
  • Instead, tap or flick the tube with care.
  • Quick spin your tubes often to ensure liquid is collected down at the bottom.
  • Bead cleanups require patience and care while pipetting.
  • Be really organized with your tubes (since there are a handful of reagent tubes that all look the same). Use a checklist and cross off each time you have added a reagent.

These are the things I take for granted when I’m doing lab work on a regular basis. It takes a while to remember when I’m in the lab again after taking a hiatus to work on computationally-focused projects.


Computer Hardware

In October 2015 last year when we were ordering everything to get set up, the computer hardware requirements for the MinION were: 8GB RAM and 128 SSD harddrive with i7 CPU. This is what we ended up ordering (which took several weeks to special order from the UC Davis computer tech center):

DH Part#: F1M35UT Manufacturer: HP Mfr #: F1M35UT#ABA HP ZBook 15 G2 15.6″ LED Mobile Workstation ­ Intel Core i7 i7­4810MQ Quad­core (4 Core) 2.80 GHz 8 GB DDR3L SDRAM RAM ­ 256 GB SSD ­ DVD­Writer ­ NVIDIA Quadro K1100M 2 GB ­ Windows 7 Professional 64­bit (English) upgradable to Windows 8.1 Pro ­ 1920 x 1080 16:9 Display ­ Bluetooth ­ English Keyboard ­ Wireless LAN ­ Webcam ­ 4 x Total USB Ports ­ 3 x USB 3.0 Ports ­ Network (RJ­45) ­ Headphone/Microphone Combo Port

One run requires around 30-50 GB, depending on the quality of the run. The .fast5 files are large, even though the resulting .fastq are small (<1 GB). The hard-drive on our MinION laptop is 256 GB, which can fill up fast. We bought a 2 TB external hard-drive, which we can configure Metrichor to download the reads to after basecalling, saving space on the laptop hard-drive.
Software and Data
  • Windows sucks
  • There’s a new GUI (graphical user interface) for MinKnow in the past months. It’s annoying to get used to this, but in general not too bad.
  • The poretools software to convert .fast5 to .fastq is buggy on Windows and does not play well with MinKnow. There’s probably a way to get them both to work, but I’ve already spent ~2-4 hrs of troubleshooting this issue, so am done with this for now. Instead, we’ve been uploading .fast5 to a Linux server, then running poretools on there.
  • MinKnow python scripts crash sometimes during the run! You can open the MinKnow software again, start the script again, and it should start the run from where it left off.
  • Use the 48 hr MinKnow script for sequencing.
  • Our flow of data goes from raw signal from the MinION (laptop) -> upload to Metrichor server for basecalling -> download to external hard-drive (“pass” or “fail” depending on the Metrichor workflow chosen, e.g. 1D or 2D or barcoding) -> plug external hard-drive to Linux or Linux laptop (for some reason this is easier on Linux laptop rather than Windows…) for transfer to Linux server -> on the Linux server, run poretools software to convert to fastq/fasta -> analysis
  • This all seems kind of ridiculous. If there is a better way, please let us know!

13900322_10104181243686033_7995956997637711388_n (1)


In a future workshop setting, where students are doing this for the first time but we have more experience now, a potential schedule could go something like this:

Day 1: Collect sample, culture

Day 2: Extract DNA, run on gel, quantify

Day 3: Library prep, sequence (this will be a long day)

Day 4: Get sequences, upload, assess reads, start assembly

Day 5: Evaluate assembly, Annotate

This is similar to the schedule arranged for Pore Camp, run by Nick Loman at the University of Birmingham in the UK. They have some great materials and experiences to share:



  • Still unknown what the cost is per sample.
  • Cost of troubleshooting?

I’ve put together a quick ONP MinION purchasing sheet:


Generally, these are the items to purchase:

  • Mk1B starter pack  came with 2 flowcells
  • computer
  • ONP reagents
  • third-party reagents (NEB)

Getting Help

  • MAP community has some answers
  • There is no phone number to call ONP. In contrast, Illumina has a fantastic customer service phone line, with well-trained technicians on the other end to answer emergency phone calls. Reagents and flowcells are expensive. When you’re in the lab and there is a problem, like a bubble on the flowcell or a low pore number after QC, it is often necessary to call and talk to a person on the phone to ask question so you don’t waste time or money.
  • I’ve had many good email conversations with ONP tech support, but there is no substitute to calling someone on the phone and discussing a problem. Often, there are things to work on after the email and it is difficult to follow up by going back and forth with email.
  • LiveChatting feature on the ONP website is great! (During UK business hours, there is a feature at the bottom of the store website that says “Do you have a question?”. During off hours it says “Leave a message”.

I realized through this process that I had lots questions and few answers. The MAP has lots of forum questions but few manuals. Phrase searching sucks. If you search for a phrase in quotes, it will still search for individual words. For example:


Remaining Questions:

1. Why does the number of flow cell pores fluctuate? What is the optimal pore number for a flow cell?

2. What is the effect of 1D reads on the assembly? Can we use the “failed” reads for anything? 

3. How long will a run take? 

4. How much hard-disk space is required for one run?

5. When are the reads “passing” and when are they “failing”? Is there value to the failing reads? 

6. How can we get the most out of the flow cells? There seem to be a lot of unknowns related to the efficiency of the flowcells. We tried re-using a washed flow cell. There were >400 pores in the flow cell during the initial QC. After we loaded the library and started the run, the pore numbers were in the 80s-100s. 2 hrs later, this number dropped down to ~30s. I added more library, and the pore numbers never increased again. Is this a result of the pore quality degrading? The next morning, loaded more library again. Not much change. Decided to switch flowcells and try a new one.

7. Are there batch effects of library prep and/or flowcells? Should we be wary of combining reads from multiple flowcells?


In the future, the aim is to move away worrying about the technology details and focus on the data analysis and what the data mean. The goal should be to focus on the biology and why we’re interested in sequencing anything and everything. What can we do with all of this information, now that we can sequence a genome of a new bacterial species in a week?

Feel free to comment and contact!


Posted in biotech, Sequencing | 8 Comments