New Blog!

This blog has moved! New posts will now be here:

Thank you for your continued interest.

For more information on how to blog with Pelican, a static site generator powered by Python, see this new blog post.




Q&A at Porecamp TAMU, June 2017: From the Ebola Outbreak to Sequencing on Mars

As part of my PhD project at UC Davis with Dr. C. Titus Brown and Dr. Andrew Whitehead, I am assessing the feasibility of using Oxford Nanopore Technologies (ONT) MinION device to generate long reads of DNA sequencing data for de novo genome assemblies of several species of freshwater Fundulidae killifish. The final assemblies for each freshwater species will be used to compare genomic regions to their sister marine species, Fundulus heteroclitus.

The MinION is a portable (plugs into the USB port of a laptop!), real-time sequencing technology that can potentially yield reads >100 kb (distributed read lengths >10 kb). This is so super cool! Below is our MinION device chugging along last week with data from Fundulus nottii.


From June 5-9, 2017, I attended Porecamp USA at Texas A&M, hosted by the TAMU Agrilife sequencing facility, to get organized instruction and hands-on experience with the MinION. This was great for several reasons. Since ONT is a relatively new sequencing technology, troubleshooting efforts have been challenging, and updates to the platform occur rapidly. Attending Porecamp gave me the opportunity to meet experts in person, as well as other researchers who are excited about this new technology. Twitter and email are great for asking questions, but there is no substitute for asking someone a question or having a discussion in person. Making connections and being part of the community of MinION users was an invaluable part of the Porecamp experience. I felt lucky to attend Porecamp, took an overwhelming number of notes, and wanted to share them here so others can hopefully benefit. This will likely be just one in a series of blog posts to come!

The first night, there was a very informative Q&A session with the panel of instructors for the workshop, who are leading experts in the field: Nick Loman, Josh Quick, Matt Loose, John Tyson, and Mick Watson.

Below is a transcript from my notes. Any misinterpretation or misinformation is an error on my part. Please let me know!

Q: For species with small body sizes or low yields, how feasible are whole genome assemblies from multiple individuals using reads from only ONT and/or ONT and other platforms?

A: Same for any genome assembly. What you want for a reference genome assembly is to get a haploid genome assembly from tons of DNA. There are some diploid assembly tools, which try to separate out haplotypes, but to be honest they’re not very good. As soon as you put multiple individuals into a pool for sequencing, you get a worse assembly because all the structural variation is unique to each individual and the assembly graph kind of falls apart where they don’t agree with one another. But it’s perfectly possible, people have been doing it for years with nematode worms and all sorts of other stuff. Heterozygosity is a massive problem for all genome assemblies all the way back to when we were doing them with Sanger. If you can develop some kind of cell lines and generate buckets of DNA, that’s often a good way to go.

Q: How much time should one allow for ordering flowcells? (Backstory: ordering to delivery has taken anywhere from 1 week to 1 month, which has made it difficult to plan fresh DNA extractions, library prep, and sequencing.)

A: I think that’s quite an interesting question actually. We’re a bit biased because we’re in the UK. And I don’t know what effect being in the UK has on flowcell supply. In 2014 we all got involved in Nanopore sequencing and it would be fair to say that there have been periods of time when flowcell supply has been restrictive, but now it seems to be very robust and very reliable. We do have Michael Micorescu from ONT here to help answer these questions at another time.

Q: What would you say is the shelf life of a flowcell?

A: The shelf life of a flowcell now, 8 weeks is absolutely fine. We’ve got flowcells that have been sitting in the fridge for longer than that and I’m still looking forward to using them. So, there’s no problem. 8 weeks is fine. It used to be that you wanted to use them as soon as they arrived, but that’s not the case now. They’re pretty stable.

They’re more stable than they were initially, and I think the flowcells are much more consistent and more reliable than they have ever been. You do need to get the temperature right, though; they can’t be frozen.

I can confirm, from a postdoc who received flowcells and put them in the -80°C freezer, that that doesn’t work. We did call Nanopore, and they asked what had happened. Their response was quite interesting: we don’t know what will happen, can you try running them anyway? The answer is they don’t work. So, don’t put them at -80.

Q: What level of sequencing coverage is necessary to correct for sequencing error?

A: The short answer is that we don’t really have a good answer for that question. It depends on a number of factors, including whether you are doing 1D or 1D^2 and what the GC and homopolymer content of your region of interest is. As a rule of thumb, there are diminishing returns as you get above 100x coverage. You’ll probably get most of the benefit from consensus calling quite early on, probably around 25x. When we did systematic assessments in the Ebola project, we found that 25x of 2D and 50x of 1D was actually sufficient to get us the correct genotype. If you go to 100x you might get a little bit more, but what’s notable is that you never really get a perfect consensus at any point, because of residual homopolymer problems. That’s an open problem with the platform. So it’s not worth going to thousands of x coverage; it just makes everything slower and doesn’t really improve anything.

It’s also important to use the signal. We’ll talk about that later [in the workshop] when we talk about bioinformatics, the importance of using the signal, not just the basecalls. The signal is where there’s extra information that can be utilized to get a really good result. And that’s often not used. Typically, in Illumina sequencing, you don’t go back to the signal. But it’s important in Nanopore that you do look at the signal information.
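As a quick sanity check on the coverage figures in this answer (25x, 50x, 100x): coverage is just total sequenced bases divided by genome size. The 1 Gb genome below is a hypothetical example of mine, not from the Q&A:

```python
def coverage(total_bases, genome_size):
    """Average depth of coverage = total sequenced bases / genome size."""
    return total_bases / genome_size

# Hypothetical example: a 1 Gb genome.
genome_size = 1e9
target_depth = 25  # the ~25x consensus threshold mentioned above

bases_needed = target_depth * genome_size
print(f"Bases needed for {target_depth}x on a 1 Gb genome: {bases_needed / 1e9:.0f} Gb")

# Conversely, a run yielding 5 Gb of reads on that genome gives:
print(f"5 Gb of reads -> {coverage(5e9, genome_size):.0f}x coverage")
```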

Q: Are there batch effects from one flowcell to another? How wary should we be of combining data from multiple flowcells?

A: Within a single flowcell type, we’re on R9.5 at the moment, I don’t think there are any issues with combining data across flowcells. If you take exactly the same library and split it across 2 different flowcells it will perform equivalently. In fact, we did a really nice experiment just last week showing that and looking at the read length distributions and they’re identical across 2 flowcells. You might get slightly different yield. And we’ll talk a lot about the nitty gritty of this on the running days, the number of pores you have available for sequencing. It can get complicated if you are starting to merge data from different flowcell types. So, that’s always worth remembering which flowcell you’re looking at.

There is potentially a benefit from doing that. Didn’t Jared [Simpson] once say he could get good results combining R9 and R7.3? They will have slightly different error profiles, though; different pore types have slightly different error profiles.

I think there’s another point: we often don’t have the same starting DNA from different samples in an experiment, if you’re trying to do a controlled experiment. With the high input requirements, it becomes quite tempting to go and do another extraction. Just bear in mind whether those are going to be technical replicates or biological replicates. They can also be completely different conditions, depending on whether you’ve gone back and regrown organisms. So you could be called out if you’re trying to do a bunch of different experiments and combine them together, especially for something like transcription.

Q: What is the minimum DNA requirement for the Tagmentation protocol?

A: According to the protocol, it is 400 ng for R9.5. If you want to target very long reads, input would be 2 or 3 or 4 ug. (laughter) A very broad minimum range.

Q: What are the factors that contribute to how much data one flowcell can yield?

A: Ultimately it comes down to the quality of the DNA you put in, in terms of the ability to generate enough active ends with the motor protein to bind to the flowcell. In some sense, you need to keep the pores fed: you need to keep the concentration high enough that when a fragment finishes going through a pore, another fragment can bind right away. You want to keep that process as efficient as possible so you maximize the usage of the pores. Every flowcell you have will have a different number of pores, although they are very consistent in terms of the activity of what’s there. The difference is that the more pores you have, the more sequence you’re going to get, because there’s more active processing.

I think it’s fair to say there’s a fair amount of variability. You can have a library that looks good and has run well before and sometimes it will be disappointing and sometimes it will be incredible. Some flowcells just don’t seem to ever give up and will be extreme outliers. And I think it’s fair to say that no one really understands why that is. You can see some runs on the MinION now out to 10, 15, possibly even 17 Gb. But that’s not typical. Typical for us would be more like 5 Gb. It’s not particularly well understood why, otherwise they would make it consistent in the manufacturing.
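To put those yield numbers in perspective, here is a back-of-the-envelope sketch for run planning. The typical (~5 Gb) and exceptional (15 Gb) yields come from the answer above; the genome size and target depth are hypothetical examples of mine:

```python
import math

def flowcells_needed(genome_size, target_depth, yield_per_flowcell=5e9):
    """Number of flowcells to reach a target depth, at a given yield per flowcell."""
    total_bases = genome_size * target_depth
    return math.ceil(total_bases / yield_per_flowcell)

# e.g. 50x depth on a hypothetical 1 Gb genome at a typical ~5 Gb/flowcell:
print(flowcells_needed(1e9, 50))        # 10 flowcells
# The same target assuming an exceptional 15 Gb run:
print(flowcells_needed(1e9, 50, 15e9))  # 4 flowcells
```

Given the run-to-run variability described above, these estimates are best treated as a lower bound.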

Q: How are the newest chemistries solving error issues?

A: One of the biggest improvements recently wasn’t the chemistry but the software engineering with the recent version of albacore basecalling, which solves a lot of the homopolymer errors. So, it’s not just the chemistry. The R9 flowcells have been wonderful in terms of quality, but the biggest improvement has been the software.

To say something about the pore, or the shape of the pore: I think the understanding is that different pores have different reading-edge shapes, and that reading edge makes some contexts easier to read and some contexts more difficult. There was a net effect when they moved from the R7 pore, whose biological molecule no one knows exactly, to the R9 pore, which we do know (the CsgG protein from E. coli). That was a very clear improvement from one pore to the other, and it certainly improved the ability to resolve high-GC sequence, which was something the R7 pore struggled with. That has to do with the shape of the aperture and the effect it has on the electrical current signal. And we’ll talk about that a bit more.

But I think, too, that the software improvements are just as relevant to getting better results at the moment. We’re moving from quite simple models of the signal using Hidden Markov Models to recurrent neural networks and deep learning approaches, which can use much more long-range context. They can understand, for example, the effect that the sequence sitting in the motor protein quite far upstream of the pore, say 20 bases upstream, has on things like movement time and the downstream bases that are actually in the pore. Factoring in that conditionality seems to give a lot of improvements.

I think it’s quite exciting that the improvements have been driven by bioinformatics, actually. It makes me think that someone can come along and make it even better going forward. For example, Oxford Nanopore has usually been quite closed in their software releases, but in the last few days they open-sourced their base caller, Scrappie, for the first time, which is quite exciting. That means the community can have a go at trying to improve on their current best. And I think part of the reason they’ve done that is to encourage the community to get those last percent or so of residual error sorted out.

Q: How can I get a PromethION?

A: I’d quite like to know the answer to that myself. The simple answer is you pay some money to Oxford Nanopore and they send you one. But I guess that’s not the answer that was requested. I think they hand-build them at a certain rate. They’re not being rolled out as fast as they could be, because there are still some improvements in flowcell chemistry that need to be sorted out before they’re ready for prime time.

And they’re constantly improving the compute as well. I think people underestimate how big a machine the PromethION actually is. If it delivers 8 Gb from 48 flowcells, and you think managing the data from one MinION is bad, wait until you start playing with a PromethION. They’re coming. They’re out there. But the infrastructure needs to be set up.

Large-scale sequencing facilities are looking to use the PromethION. There is the idea that if you get accuracy up to a certain level with the long reads, you can move away from re-sequencing the human genome to more de novo and large-scale structural variant analysis on the human genome on the Nanopore platform, but there needs to be a cost benefit before people would consider switching routinely.

Q: Can you multiplex RNA on the MinION?

A: On the direct RNA protocol you can’t, with barcodes. With the cDNA protocol you could. What you can do surprisingly well is multiplex samples without barcodes, particularly if they’re sufficiently distinct. We do that with bacterial genomes, like Staph, Strep, and Pseudomonas, although you could barcode. Just throw them all in together and save yourself a little lab time, because they will assemble out as long as the read length is sufficiently long; probably 10 kb is more than adequate. They’ll just fall out.

Long amplicon cDNA, can do that as well. As long as they’re different. Goal is targeted sequencing, can just mix them in there.

You could also make your own barcodes.

Q: What would be your recommendation for a lab that is considering switching from Illumina to ONT sequencing?

A: I don’t think they’re equivalent. I’d think you’d want to keep both so you can have a broader range of applications; they’re complementary technologies. Particularly for whole-genome sequencing, you want the long reads to get all the structural variation and to get the assembly in the first place, but it’s still useful to have Illumina to polish out the errors. For RNAseq, you can get full-length cDNA with the long reads, but there probably aren’t enough reads to be truly quantitative and do counting, so you’re going to need Illumina for that. I wouldn’t recommend doing 16S anyway.

If the question was would you recommend switching from PacBio to Nanopore, then yes. For smaller genomes, like C. elegans genomes that are ~100 Mb with structural variations in them, we can span those now; we have the benefit of being able to detect them now. You can use Illumina data to polish the errors out. It’s not like you have to have one or the other. The cost is getting so cheap that it’s silly not to use it to improve things. One can do very well with one or the other, but why not combine? Which is what we’re doing at the moment.

It’s worth knowing that there’s a lot of unused Illumina capacity in the world. There are people who have HiSeqX that are not on full, and loads more people are going to buy NovaSeqs, which won’t be full. So, you don’t have to run your own Illumina. Just buy the data from any number of service providers that will give it to you rather cheaply.

Q: Can we link SmidgeION to cloud computing to compute real-time basecalling from mobile devices? How would this revolutionize genomics in the field?

A: My understanding for the intention of the SmidgeION is that you won’t be basecalling in the cloud, you’ll be basecalling on the phone. Which is really quite cool. So, it would be local, hand-held. Bear in mind that the SmidgeION is the smaller through-put device. So, I don’t think that you’ll need to use the cloud for basecalling. It’s worth saying that cloud basecalling is gone. It used to be that you would send your reads into the Amazon cloud. They’d be base-called there, and you would get your called data back. That’s gone. It’s now all local basecalling anyways.

In terms of how that will revolutionize genomics, it’s a subject we’re very passionate about, which is really pathogen detection and diagnosis in the field. The key thing for the SmidgeION is not that it’s necessarily on a mobile device, but that the cost of the flowcell should be much, much lower than the current cost of the flowcells (individually, $900/flowcell to get 2-10 Gbases). And that’s really critical for diagnostics. A $500 diagnostic is too much for many applications. (Flowcells are $500 each if you buy bulk packs of 48 for $24,000, or $475 each in a pack of 300 for $142,500.) A $20 diagnostic, if you can get discovery of a whole bunch of different pathogens in one assay, plus typing and antibiotic resistance information, quickly, for things like TB, HIV, malaria, and Ebola; all of those are going to be incredible applications that you can have in the field. It will be a change to diagnostics. Right now diagnostics is pretty much focused on targeted, small numbers of pathogens that you assay at a time for things that you expect to find. So, replace that with an open-ended diagnostic where you can discover things that you might not expect to find. The classic example for us, from work that we’ve done, is the West African Ebola epidemic, which wasn’t recognized as Ebola for a good 2-3 months after the first case, because Ebola had never been seen in that part of Africa before. You don’t want to test for everything you can imagine it could be. You want to get a definitive diagnosis quickly. That means you can put the right measures in to stop small outbreaks from becoming larger-scale outbreaks and eventually epidemics. So, that’s what we think is going to be the really exciting thing about an even more portable sequencing diagnostic. But you could also imagine similar things for human genetics as well.
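The per-flowcell prices quoted here are just division over the bulk-pack prices (as stated in 2017); a quick check:

```python
def per_flowcell(pack_price, pack_size):
    """Cost per flowcell when buying a bulk pack."""
    return pack_price / pack_size

print(per_flowcell(24_000, 48))    # 500.0  -> $500 each in packs of 48
print(per_flowcell(142_500, 300))  # 475.0  -> $475 each in packs of 300
```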
Some people think that sequencing will come into the home for personal use: like your Fitbit, you’ll have your personal transcriptome-measuring device. Whether that’s going to happen will be interesting. (It’ll happen in my house. But that’s just because I really like sequencing.)

I think the MinION has already revolutionized genomics. Then with the SmidgeION, it’s only incremental on top of that.

Q: Is the platform appropriate for potentially fragmented or degraded DNA, particularly things like ancient DNA?

A: I think it’s going to depend on whether you can generate a piece of DNA. You’re not going to get really long pieces, but like any system, if you can amplify, then you can generate sequence. The MinION can sequence your pieces; you just won’t get the benefit of long reads, which happens to be the main benefit of the platform at the moment. Justin O’Grady has a protocol where he extracts DNA from sputum and uses the rapid PCR kit, and he regularly does that. I think the idea of looking at ancient DNA on the MinION would be very exciting, in the sense that contamination would be the major issue, as well as DNA damage. You have to remember that with the MinION and Nanopore, you are sequencing the native molecule. We’re actually looking at the molecule itself, not a secondary copy of it [like you are with the Illumina platform and sequencing-by-synthesis technology]. So you are able to look at what damage is there, at what changes happen over time to the DNA molecule. That’s going to be really exciting.

We think it’s going to be huge in direct RNA sequencing; we think it’s going to be one of the major draws. We’ve seen this in the little bit we have done so far. There are certainly modifications you can spot just from the fact that they throw the basecaller off compared to what the reference sequence should be. So you can have a reference sequence from the exact same sample where you don’t see base changes, but the basecaller from the MinION shows a change. That may be indicative of some RNA modification to the base that’s going on there. There’s power in being able to actually see what is there. Interpreting that requires the ability to train on and recognize those different chemistries, but I think that will come with time for sure.

We were just down at ASM the last few days in New Orleans, hearing Astronaut Dr. Kate Rubins on sequencing on the ISS. The idea was that NASA wanted to get a sequencing capacity sorted out for the first trips to Mars with the idea that if there are similar macromolecules to biological ones, something like a Nanopore sequencer would be the correct way of assaying them. A little bit science fiction, but sequencing in space on something like the ISS would have been regarded as science fiction just 3 or 4 years ago. And it’s happened. So, maybe we should keep an open mind about the opportunity for sequencing alien life on Mars.


(Slide from Matt Loose)


ASLO Aquatic Sciences meeting – Honolulu, HI


Last week I gave a talk on re-assembling de novo transcriptomes for the MMETSP at the Association for the Sciences of Limnology & Oceanography (ASLO) Aquatic Sciences meeting in Honolulu, HI held at the Hawaii Convention Center from February 27 to March 3, 2017.

Here are links to the latest version of files:

Re-assembly .fasta files for individual MMETSP samples

Repository with pipeline code


Annotation, expression quantification, and peptide translations are also available. About 100 annotations are still incomplete because the dammit pipeline did not finish properly on our HPC, so I’m in the process of re-doing annotations for these.

It was a good experience presenting to a diverse audience of oceanographers, tool users and developers in the session on Advances in Aquatic Meta-Omics: Creating Tools for More Accurate Characterization of Microbial Communities, organized by Brooke Nunn and Emma Timmins-Shiffman. Some good questions and issues came up about the extra content in our re-assemblies compared to NCGR, contamination, and the need to combine multiple samples from the same strain into one assembly.

Special thank you to Harriet Alexander, a postdoc in our lab who was attending the meeting and provided positive feedback and encouragement before my presentation; my advisor, C. Titus Brown for supporting the trip to the meeting; to the DIB lab for helpful feedback prior to the meeting; and, of course, thank you to the Moore Foundation for supporting this work!

Relevant to pushing limits of institutional high performance computing clusters (HPC) with this MMETSP data set (1 TB raw data), while I was away, Amanda Charbonneau wrote this wonderful poem:


Overall, I enjoyed the ASLO meeting (click here for #ASLOmtg tweets). I thought that it was well-organized and welcoming to a diversity of participants. In addition to the full schedule of scientific talks, there were many inspiring messages delivered to the aquatic sciences community related to science communication, teamwork, environmental resource sustainability, and gender issues. In a time of political uncertainty regarding environmental protection and the sustainability of public funding for scientific research, it was refreshing to hear senior scientists in leadership positions speak about these topics in a way that left me optimistic for the future.

I was filled with joy during the opening plenary to hear Kalani Quiocho speak about traditional natural resources management practices in Hawai’i and other Pacific islands. He patiently explained to the audience the etymology of several Hawai’ian words, including Ahupua’a (watershed areas), which literally means “cairn” + “pig” because offerings were traditionally made in payment to the chiefs, who were once the watershed resource managers. After the meeting was over and I drove around the island, I noticed road signs denoting different Ahupua’a areas all over the island. This partitioning of land and water resources, local ownership and management by villages and chiefs is common to many Pacific islands. From my time as a Peace Corps Volunteer in Yap (which Kalani showed a picture of in his presentation!), I learned that island cultures and economies are deeply dependent and tied to their land and water resources. With increasing globalization and changing climate, islands are some of the most vulnerable and special environments in the world, capable of serving as miner’s canaries for civilization.

The science presented at this meeting is all contributing towards a better understanding and conservation of all aquatic environments. The technology and -omics tools being developed to analyze a growing volume of data are essential to be able to interpret signals of ecological and biogeochemical significance in our changing environments. I learned about many interesting papers, got some new ideas, and met new people. I do agree that one of the main benefits of attending conferences is the energy derived from networking with people and exchanging ideas.

What a beautiful location to have a meeting.


I also saw and learned the name of the state fish! Humuhumunukunukuāpua‘a, the Hawaiian triggerfish.




DIB Lab Retreat to Yosemite, February 2017

This past weekend, all of us in Titus Brown’s Data Intensive Biology (DIB) lab went to Yosemite Bug, which is just outside Yosemite National Park, for our first (annual?) lab retreat. We had a great time! I personally found it inspiring to gather thoughts on the direction of research in the lab, ask questions about what everyone else is working on, and think about how my research goals fit into the larger picture of the lab.

Here are some notes from the weekend in case anyone is interested. Please comment and ask questions. Further discussion is welcome!

Photos by Harriet Alexander (left – Camille Scott looking far yonder) and Lisa Cohen (right)


In October, about 4 months prior, we all agreed on a location and date and began planning (gathering info, booking rooms and conference space at the resort). About one week prior, we had a brainstorming meeting about the schedule and what we would discuss.

Everyone drove up (~3.5 hrs from Davis) to Yosemite Bug on Friday. We discussed lab business on Sat and Sunday with some time in the middle to enjoy the outdoors and discuss with each other in an informal setting. The idea was to stimulate discussions about the lab (e.g. research, culture and career development) in a context outside the lab. We wanted it to be different than a conference. A retreat would just be our group, more broad/casual than regular lab meetings, to discuss the big picture of the lab’s research direction.

Saturday morning, presentations

To open, Titus identified major themes in the lab right now:
* Expect many big samples continuously arriving
* Sketch data structures and online/streaming algorithms are good
* Pre-filtering is good, especially when each step has a low false negative rate
* Decentralized is good

Throughout the morning, there were presentations from the major projects in the lab. Presentations were 10 min each with 5 min discussions. Some hot topics bled over to be ~30 min each. These were informal talks with markers and flip chart only (no slides or projector allowed). The internet in the resort was patchy, so luckily the goal of the retreat was not to work on anything requiring an internet connection.

People gave a broad outline of what they are doing, followed by one or two things we’re excited about (enables X and Y, or Z is an opportunity), then 5 minutes of questions.

* Camille Scott / Streaming the RNAseq
* Luiz Irber / Architecture of all the buzzwords (amazing basic-level explanation of the internet for those of us who are unfamiliar)
* Taylor Reiter / sourmash RNAseq
* Daniel Standage / kevlar
* Harriet Alexander and Lisa Cohen / MMETSP and challenges of multi-species data analysis
* Tamer Mansour / Progress and opportunities in vet genetics

Sat afternoon, free time!

Weather was great, sunny! We had anticipated not-so-great weather with just-above-freezing rain. But, this was not the case. In the afternoon, we all piled into cars for exploration of Yosemite National Park! Shannon Joslin did an amazing job of summarizing available social activities into this list:

Sat evening, social time!

Jessica Mizzi, who takes fun VERY seriously, coordinated games and activities: 

I participated in a few heated games of Settlers of Catan and Pictionary. It turns out there are several members of our lab who are relentless resource emperors and that there are varying degrees of artistic abilities. 🙂

Photos by Camille Scott (left) and Daniel Standage (right)

Sunday morning

Two postdoc lab members recently attended the Moore DDD early career workshop and brought back suggestions for continued discussion on the field of ‘data science’. We found it useful to discuss the larger context of how we market ourselves, develop our careers, and fit ourselves into biological research. Data-intensive biology is a large field. In our lab alone, we represent diverse disciplines, e.g. Software Engineering, Genomics, Biological Oceanography, Comparative Physiology, Medicine, Mathematics, just to name a few. We cannot each have a deep understanding of all of these peripherally-related topics. Yet, our collective knowledge is great. How can we better extract overlapping skills from each other to solve hard problems?

We broke out into 3 groups of 6 people at a mixture of career levels, e.g. beginning grad student, mid-level grad students, postdocs, post-PhD industry-bound to address these specific questions:

* How does Person x learn y topic?
* What works?
* How do we teach the Davis community about y topics? (Especially when we might not necessarily know these things ourselves.)

Discussion points

The following is an approach I’m trying based on some helpful blogging advice: choosing words and phrases explaining what has worked for us (or me, specifically) rather than telling people who read this what they should be doing. This is because I am more apt to listen to someone else’s wisdom gained from their own experiences.

– Learning topics has depended on why we want to learn.

– It is up to the learner to find motivation, not necessarily to follow a list of what others think you need to know. Although, we acknowledged that it is hard to figure out what you need to know if you don’t already know it; some base level of knowledge is required.

– We have been told that skills in bioinformatics are required for successful future careers. However, there are no institutional-level plans for how to disseminate these skills to learners.

– Beginning learners can feel overwhelmed because of the interdisciplinary nature of bioinformatics, sometimes requiring a combination of knowledge and skills in computer programming, statistics, cell and molecular biology, etc.

– It has helped many of us to take a project-based learning approach.

– Three motivating scenarios were identified for developing a working knowledge of bioinformatics skills:

  1. Biologist generating data, e.g. RNAseq for differential expression. In the long term it doesn’t make sense to rely on a sequencing facility to analyze data, because decisions made during analysis affect the results. Making these decisions requires revisiting the question of why the data were generated in the first place, which is not necessarily something an independently contracted analyst can be familiar with. In our experience, data are best analyzed by people who know the projects very well.
  2. Data analyst who understands many projects simultaneously and advises those generating and analyzing their own data on the best way to approach analysis, based on their own experience, consensus in the field, and benchmark testing.
  3. Data scientist at a senior level who guides the direction of research and training programs and develops new methods for processing data.

– Our lab has representation from all three of these categories.

– Some combination of internet-learning, buddy system, participating in a community are all key aspects of learning bioinformatics skills that seem to work for all of us.

– Buddy system. Many of us have found that forming connections with a person or a community of people knowledgeable enough to answer questions has been necessary for our learning process. Community and personal connections can be fostered via workshops, classes, conferences, social events, and friendships.

– We have had good luck with seizing opportunities to collaborate and asking experts for advice when we meet them. The great thing about this lab and knowing Titus is being able to take advantage of his far-reaching network of collaborators.

– Internet-learning by Google searching. Stack Overflow is our friend.

– Some of us have chosen a good book, e.g. Practical Computing for Biologists by Haddock and Dunn, to read and work through the exercises on a regular basis together with a group of people.

– We’ve found it helpful to join a community to ask/answer questions. We are actively working towards fostering such a community at UC Davis via the DIB lab! See our website for the training workshop schedule and to sign up for the email list:

– It has been our experience that significant investment of personal time is required to learn.

Here are our flip chart notes from this discussion:

Sunday afternoon

The last afternoon discussion centered around lab culture hacking, i.e. what we are doing well and what needs improvement. A motivational speech from Titus: there are always going to be various things the lab can do better, but generally, we’re in a good place! The lab is a set of opportunities. Choose your own adventure. If we’re not doing something, we can provide resources to accomplish goals. Overall, his expectations are for us to do wonderful and unexpected things. Preferably multiple wonderful things!

Then Titus left for an hour and a half while we discussed the lab. Topics included more frequent journal club, more frequent project reporting, a scrum at every lab meeting (rather than one designated presenter each meeting only presenting slides on their own research the whole time), and lab communication on Slack vs. email vs. Google calendar for scheduling. The common theme was that while our projects are all very different, we are all connected, and the onus is on us to take more initiative to communicate with one another. We talked about positive and negative aspects of the lab, but generally concluded that our lab is awesome because of our strong community and the diverse backgrounds of our lab members. The meeting adjourned, with some of us returning to watch the Super Bowl while others stayed on to play more Settlers of Catan and Pictionary!

Thank you to Yosemite Bug, for the quiet, cozy, accommodating place for our group to stay and be productive this weekend. It was a perfect, small venue for this retreat.

Thank you, Titus for bringing us on this retreat. Thank you, everyone in the DIB lab for being fun people. And thank you, Moore Foundation for funding!

Photo by James Word

Photo by Shannon Joslin

Posted in Bioinformatics | Leave a comment

MMETSP re-assemblies

*** Update (9/27/2016): 719 assemblies now available at the links below. All FINISHED!!!

*** Update (9/22/2016): 715 assemblies now available at the links below.

I have *almost* finished re-assembling de novo transcriptomes from the Marine Microbial Eukaryotic Transcriptome Sequencing Project (Keeling et al. 2014). This is an effort that I have been working on over the past year as a PhD student in Titus Brown’s lab at UC Davis. The lab has developed the Eel pond khmer protocol for assembling RNAseq data de novo from any species. I adapted and automated this protocol for Illumina PEx50 data from the MMETSP.

Here are 659 assemblies:


We’re not completely finished yet, but in the meantime we wanted to make these available in case they are useful to you. If you do use them, please cite us (per academic protocol):

Cohen, Lisa; Alexander, Harriet; Brown, C. Titus (2016): Marine Microbial Eukaryotic Transcriptome Sequencing Project, re-assemblies. figshare.

Let us know if you have any questions or difficulties. As a caveat: we aren’t guaranteeing that these assemblies are better than the NCGR assemblies. They are different as a whole. Here is some quality information and spreadsheets to look up your specific MMETSP ID of interest:

BUSCO scores

Transrate scores

Automated Scripts

I’m still working on making these scripts more accessible and useful for people. Thanks especially to the DIB lab code review session, I will be working on these over the next month. Stay tuned!

The pipeline (flowchart below) is as follows: 1) …, 2) …, 3) diginorm_mmetsp, 4) …. All of the Python scripts are controlled by the SraRunInfo.csv metadata spreadsheet downloaded from NCBI, which contains the url and sample ID information for each sample.
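As a rough sketch of how a metadata-driven pipeline like this can work, the snippet below parses a RunInfo-style CSV and yields the per-sample fields the scripts need. The `Run`, `download_path`, and `SampleName` columns exist in NCBI’s SraRunInfo.csv export, but the sample row here is a made-up placeholder, and this is illustrative rather than our exact code:

```python
import csv
import io

def parse_runinfo(csv_text):
    """Yield (run_id, url, sample_name) tuples from RunInfo-style CSV text.

    Assumes the standard SRA RunInfo columns 'Run', 'download_path',
    and 'SampleName'; adjust if your export differs.
    """
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield row["Run"], row["download_path"], row["SampleName"]

# Tiny inline example standing in for the real SraRunInfo.csv download;
# the run id and url are placeholders, not real accessions
example = """Run,download_path,SampleName
SRR0000001,https://example.org/SRR0000001,MMETSP0000
"""
samples = list(parse_runinfo(example))
```

Each downstream step (download, trim, diginorm, assembly) can then loop over `samples` rather than hard-coding accessions.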


This pipeline could be used with SraRunInfo.csv obtained for any collection of samples on NCBI:


Is there value in re-assembling what has already been done?

The NCGR already performed assemblies with their pipeline in 2014 when these data were sequenced (more information about their pipeline). Why are we interested in doing it again?

Software and best practices for de novo RNAseq transcriptome assembly are changing rapidly, even since 2014. We found preliminary evidence that our assemblies are different from NCGR’s. Looking at some quality metrics for 589 new assemblies, below are the proportions of contigs from our new DIB assemblies aligning against the NCGR assemblies as reference (left), compared to the reverse, where the NCGR assemblies are aligned to our new DIB assemblies (right).


This confirms our preliminary evidence that we have assembled nearly everything the NCGR did, plus extra. We’re still not sure what the extra content is, or whether it’s useful.

Some samples have performed well (BUSCO scores shown below left, transrate scores right), while others have not.


The effects of one pipeline vs. another – which software programs are used and which decisions are made along the way – are not well understood. Are there biological consequences between pipelines, e.g. are more protein-coding regions detected with one than another? Ours may or may not be better, but they’re different, so we’re exploring how and why.

If our results are different for these assemblies, chances are high that other people’s assembly results also depend on the methods they used. Every month, there are ~a dozen publications announcing a new de novo transcriptome of a eukaryotic species. In 2016 alone, there have been 743 publications (as of 9/13/2016) on “de novo transcriptome assembly”. As new software programs are developed and updated versions are released, decisions are made about which programs and which pipeline/workflow to use. What are the effects of these decisions on the results being produced by the scientific community?

This is the first time – as far as we are aware – that anyone has looked carefully at a mass of assemblies like this from a diversity of organisms. This is a really great data set for a number of reasons: 1) a diversity of species, and 2) library prep and sequencing were all done by the same facility. The number of samples is large enough that we can apply statistical analysis to the distributions of evaluation metrics, e.g. transrate scores, BUSCO scores, proportion of contigs aligning to the reference, etc.

We’re really excited to be working with these data! And very grateful that these samples have been collected for this project and that the raw data and assemblies done by NCGR are available to the public!

Scripts for this project were developed on different systems:

The mystery of the extra samples:

This is more of a data-management issue for a project with this many samples than a scientific problem, but it required some forensic file investigation. The number of samples in the SRA BioProject is different from the number of samples on imicrobe. There are 678 samples approved by the MMETSP:


Whereas there are 719 Experiments in the SRA:



> OLlist$Venn_List$ncbi
[1] "MMETSP0132_2" "MMETSP0196"   "MMETSP0398"   "MMETSP0451_2" "MMETSP0922"   "MMETSP0924"
> OLlist$Venn_List$imicrobe
[1] "MMETSP0132_2C" "MMETSP0196C"  "MMETSP0398C"  "MMETSP0419_2" "MMETSP0451_2C"
[6] "MMETSP0922C"   "MMETSP0924C"

It turns out that these “extras” on either side of the Venn diagram were mostly the same IDs, just with the letter “C” on the end on the imicrobe side.
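Once spotted, this kind of suffix mismatch is easy to check programmatically. The sketch below (using ids from the listings above) strips the trailing “C” before comparing the two sides:

```python
def normalize(mmetsp_id):
    """Strip the trailing 'C' that appears on some imicrobe MMETSP ids."""
    return mmetsp_id[:-1] if mmetsp_id.endswith("C") else mmetsp_id

# Subset of the ids from the Venn comparison above
ncbi_only = {"MMETSP0132_2", "MMETSP0196", "MMETSP0398"}
imicrobe_only = {"MMETSP0132_2C", "MMETSP0196C", "MMETSP0398C"}

# After normalizing, the two "extra" sets collapse into the same ids
matched = {normalize(i) for i in imicrobe_only} == ncbi_only
```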

For some reason, MMETSP0419_2 (Tetraselmis sp. GSL018) has its own BioProject accession, PRJNA245816, so PRJNA248394 has 718 samples. In total, there are 719 MMETSP samples in the SRA: PRJNA231566. In SraRunInfo.csv, the “SampleName” column for the extra sample (SRR1264963) says “Tetraselmis sp. GSL018” rather than MMETSP0419_2.

The next thing to figure out was why there are more IDs in SRA than in imicrobe: some MMETSP ids have multiple, separate Run ids in SRA:

> length(unique(MMETSP_id_NCBI))
[1] 677
> length(unique(ncbi$Run))
[1] 718
> length(unique(MMETSP_id_imicrobe))
[1] 678

These are the samples that have multiple MMETSP ID in SRA:


Samples were assembled individually by each SRR id.
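Because of this one-to-many MMETSP-to-SRR mapping, grouping runs by MMETSP id is a useful sanity check before assembling. A minimal sketch, with hypothetical id pairs standing in for the real metadata:

```python
from collections import defaultdict

# (MMETSP id, SRR run id) pairs as they would come out of SraRunInfo.csv;
# the ids here are hypothetical placeholders
pairs = [
    ("MMETSP0000", "SRR0000001"),
    ("MMETSP0000", "SRR0000002"),
    ("MMETSP0001", "SRR0000003"),
]

runs_by_mmetsp = defaultdict(list)
for mmetsp_id, srr_id in pairs:
    runs_by_mmetsp[mmetsp_id].append(srr_id)

# MMETSP ids with more than one SRA run; each run gets its own assembly
multi_run = {k: v for k, v in runs_by_mmetsp.items() if len(v) > 1}
```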

Naming files is hard.

It took me more than a day, plus some time thinking about the problem beforehand, to sort out what to name these assembly files for sharing. There are two useful ids for each sample: the MMETSP id (MMETSPXXXX) and the NCBI-SRA id (SRRXXXXXXX). Since files were downloaded from NCBI and I’m assembling samples individually, it made sense to organize by the unique SRA id, since there can be multiple MMETSP ids for some SRR. These ids are not recognizable to the humans reading them, though. So, in naming these files I wanted to include scientific names (Genus_species_strain) as well as the ids:
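A minimal sketch of such a naming scheme (the exact filename format, extension, and the SRR id below are illustrative, not the scheme used for the published files):

```python
def assembly_filename(sra_id, mmetsp_id, scientific_name):
    """Combine a human-readable scientific name (Genus_species_strain)
    with both ids, so the filename is unique (via the SRA run id) and
    still recognizable to a person browsing the files."""
    return "{}_{}_{}.fasta".format(scientific_name, mmetsp_id, sra_id)

# SRR id here is a placeholder, not the real accession for this sample
name = assembly_filename("SRR0000001", "MMETSP0419_2", "Tetraselmis_sp_GSL018")
```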


Upon further examination, it turned out that some of the genus and species names differed between the SRA and imicrobe metadata. For example:

Different imicrobe: Nitzschia_punctata SRA: Tryblionella_compressa
Different imicrobe: Chrysochromulina_polylepis SRA: Prymnesium_polylepis
Different imicrobe: Crustomastix_stigmata SRA: Crustomastix_stigmatica
Different imicrobe: Chrysochromulina_polylepis SRA: Prymnesium_polylepis
Different imicrobe: Compsopogon_coeruleus SRA: Compsopogon_caeruleus
Different imicrobe: Lingulodinium_polyedra SRA: Lingulodinium_polyedrum

There were >100 of these differences. Some were spelling errors, but for others, like Lingulodinium polyedra vs. Lingulodinium polyedrum or Compsopogon coeruleus vs. Compsopogon caeruleus above, I wasn’t sure whether this was a spelling preference or an actual difference. The scientific naming in the SRA follows the NCBI taxonomy convention, but it is possible that the names assigned by experts in this field (which made their way into the metadata hosted by imicrobe) are further ahead than NCBI’s. So, in these cases, I included both the SRA and imicrobe names.


It was also necessary to clean the metadata to remove special, illegal characters like “)(%\/?’. Some of the assembly file names now have multiple “_” and “-” because characters had to be stripped out. OpenRefine is a fantastic, free program that automates this task. Those who manage projects with metadata entered by hand by a consortium of people should especially use OpenRefine! It will even cluster similar spellings and help catch data-entry typos. Data Carpentry has a fantastic lesson to walk you through using OpenRefine. Use it. It will make your life easier.
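For a lighter-weight, scripted complement to OpenRefine, a small regex pass can do the basic character stripping. This sketch is illustrative, not the exact cleaning we applied:

```python
import re

def sanitize(name):
    """Replace characters unsafe in filenames with '_', then collapse
    runs of '_' and trim them from the ends."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    return re.sub(r"_+", "_", cleaned).strip("_")

clean = sanitize("Tetraselmis sp. (GSL018)")  # -> "Tetraselmis_sp._GSL018"
```

Unlike OpenRefine, this won’t cluster similar spellings or flag typos, so it works best as a final pass after the metadata has been reconciled by hand.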

Uploading files

It turns out that version-controlled file storage and sharing for hundreds of files is not straightforward yet. We explored figshare, Box, Dropbox, S3, ftp server hosting, and institutional server storage. For various reasons, we chose figshare (for one .zip file) and Box cloud storage for individual files. As we finish the last of the assemblies, we’ll move files and update the links at the top of this blog post.

Downloading files

We chose to use the NCBI version of these files. With ENA accessions, it was not as easy as with the SRA to locate a metadata spreadsheet with a predictable url for each sample. The imicrobe project also hosts these files, but they do not follow a predictable url pattern that would facilitate downloading all of the data. So, we downloaded the fastq files from NCBI.

When we wanted to compare our assemblies to NCGR assemblies, Luiz Irber wrote this wonderful script for downloading the NCGR assemblies for all of the samples from the imicrobe ftp server.


# from Luiz Irber
# sudo apt-get install parallel

seq -w 000 147 | parallel -t -j 5 "wget --spider -r {} 2>&1 \
| grep --line-buffered -o -E 'ftp://.*\.cds\.fa\.gz' > urls{}.txt"
cat urls* | sort -u > download.txt
cat download.txt | parallel -j 30 -t "wget -c {}"

Loose ends:

Most assemblies are done. Some aren’t.



  • Complete the remaining 59 assemblies. Assess what happened to these 59 samples.
  • Assess which samples are performing poorly according to evaluation metrics.
  • Look at Transrate output to help answer, “What is the extra stuff?”
  • Make scripts more accessible.
  • Partition a set of samples for benchmarking additional software. (With bioboxes?)
  • Update khmer protocols.


Special thanks to the Moore Foundation for funding, and to Titus Brown, Harriet Alexander, and everyone in the DIB lab for guidance and information. Thanks also to Matt MacManes’ suggestions and helpful tutorial and installation instructions, and to Torsten Seemann’s and Rob Patro’s helpful assistance at NGS2016 with asking questions and getting BUSCO and Transrate running.

Posted in Bioinformatics, cluster, Data Analyses, MMETSP, reproducibility, science | Leave a comment