ASLO Aquatic Sciences meeting – Honolulu, HI


Last week I gave a talk on re-assembling de novo transcriptomes for the MMETSP at the Association for the Sciences of Limnology & Oceanography (ASLO) Aquatic Sciences meeting in Honolulu, HI, held at the Hawaii Convention Center from February 27 to March 3, 2017.

Here are links to the latest version of files:

Re-assembly .fasta files for individual MMETSP samples

Repository with pipeline code


Annotation, expression quantification, and peptide translations are also available. About 100 annotations are still incomplete because the dammit pipeline did not finish properly on our HPC, so I’m in the process of re-doing annotations for these.

It was a good experience presenting to a diverse audience of oceanographers, tool users and developers in the session on Advances in Aquatic Meta-Omics: Creating Tools for More Accurate Characterization of Microbial Communities, organized by Brooke Nunn and Emma Timmins-Shiffman. Some good questions and issues came up about the extra content in our re-assemblies compared to NCGR, contamination, and the need to combine multiple samples from the same strain into one assembly.

Special thank you to Harriet Alexander, a postdoc in our lab who was attending the meeting and provided positive feedback and encouragement before my presentation; my advisor, C. Titus Brown for supporting the trip to the meeting; to the DIB lab for helpful feedback prior to the meeting; and, of course, thank you to the Moore Foundation for supporting this work!

Relevant to pushing the limits of institutional high-performance computing (HPC) clusters with this MMETSP data set (1 TB of raw data): while I was away, Amanda Charbonneau wrote this wonderful poem:


Overall, I enjoyed the ASLO meeting (click here for #ASLOmtg tweets). I thought that it was well-organized and welcoming to a diversity of participants. In addition to the full schedule of scientific talks, there were many inspiring messages delivered to the aquatic sciences community related to science communication, teamwork, environmental resource sustainability, and gender issues. In a time of political uncertainty regarding environmental protection and the sustainability of public funding for scientific research, it was refreshing to hear senior scientists in leadership positions speak about these topics in a way that left me optimistic for the future.

I was filled with joy during the opening plenary to hear Kalani Quiocho speak about traditional natural resources management practices in Hawai’i and other Pacific islands. He patiently explained to the audience the etymology of several Hawai’ian words, including Ahupua’a (watershed areas), which literally means “cairn” + “pig” because offerings were traditionally made in payment to the chiefs, who were once the watershed resource managers. After the meeting was over and I drove around the island, I noticed road signs denoting different Ahupua’a areas all over the island. This partitioning of land and water resources, local ownership and management by villages and chiefs is common to many Pacific islands. From my time as a Peace Corps Volunteer in Yap (which Kalani showed a picture of in his presentation!), I learned that island cultures and economies are deeply dependent and tied to their land and water resources. With increasing globalization and changing climate, islands are some of the most vulnerable and special environments in the world, capable of serving as miner’s canaries for civilization.

The science presented at this meeting all contributes towards a better understanding and conservation of aquatic environments. The technology and -omics tools being developed to analyze a growing volume of data are essential for interpreting signals of ecological and biogeochemical significance in our changing environments. I learned about many interesting papers, got some new ideas, and met new people. I do agree that one of the main benefits of attending conferences is the energy derived from networking with people and exchanging ideas.

What a beautiful location to have a meeting.


I also saw and learned the name of the state fish! Humuhumunukunukuāpua‘a, the Hawaiian triggerfish.



Posted in MMETSP, science

DIB Lab Retreat to Yosemite, February 2017

This past weekend, all of us in Titus Brown’s Data Intensive Biology (DIB) lab went to Yosemite Bug, which is just outside Yosemite National Park, for our first (annual?) lab retreat. We had a great time! I personally found it inspiring to gather thoughts on the direction of research in the lab, ask questions about what everyone else is working on, and think about how my research goals fit into the larger picture of the lab.

Here are some notes from the weekend in case anyone is interested. Please comment and ask questions. Further discussion is welcome!

Photos by Harriet Alexander (left – Camille Scott looking far yonder) and Lisa Cohen (right)


In October, about four months prior, we all agreed on a location and date and began planning (gathering info, booking rooms and conference space at the resort). About one week prior, we had a brainstorming meeting about the schedule and what we would discuss.

Everyone drove up (~3.5 hrs from Davis) to Yosemite Bug on Friday. We discussed lab business on Saturday and Sunday, with some time in the middle to enjoy the outdoors and talk with each other in an informal setting. The idea was to stimulate discussions about the lab (e.g. research, culture, and career development) in a context outside the lab. We wanted it to be different from a conference: a retreat would be just our group, broader and more casual than regular lab meetings, to discuss the big picture of the lab's research direction.

Saturday morning, presentations

To open, Titus identified major themes in the lab right now:
* Expect many big samples arriving continuously
* Sketch data structures and online/streaming algorithms are good
* Pre-filtering is good, especially when each step has a low false-negative rate
* Decentralized is good

Throughout the morning, there were presentations from the major projects in the lab. Presentations were 10 min each with 5 min discussions. Some hot topics bled over to be ~30 min each. These were informal talks with markers and flip chart only (no slides or projector allowed). The internet in the resort was patchy, so luckily the goal of the retreat was not to work on anything requiring an internet connection.

People gave a broad outline of what they were doing, followed by one or two things they were excited about (this enables X and Y, or Z is an opportunity), then 5 minutes of questions.

* Camille Scott / Streaming the RNAseq
* Luiz Irber / Architecture of all the buzzwords (amazing basic-level explanation of the internet for those of us who are unfamiliar)
* Taylor Reiter / sourmash RNAseq
* Daniel Standage / kevlar
* Harriet Alexander and Lisa Cohen / MMETSP and challenges of multi-species data analysis
* Tamer Mansour / Progress and opportunities in vet genetics

Sat afternoon, free time!

The weather was great: sunny! We had anticipated not-so-great weather with just-above-freezing rain, but this was not the case. In the afternoon, we all piled into cars to explore Yosemite National Park! Shannon Joslin did an amazing job of summarizing available social activities into this list:

Sat evening, social time!

Jessica Mizzi, who takes fun VERY seriously, coordinated games and activities: 

I participated in a few heated games of Settlers of Catan and Pictionary. It turns out there are several members of our lab who are relentless resource emperors and that there are varying degrees of artistic abilities. 🙂

Photos by Camille Scott (left) and Daniel Standage (right)

Sunday morning

Two postdoc lab members recently attended the Moore DDD early career workshop and brought back suggestions for continued discussion on the field of ‘data science’. We found it useful to discuss the larger context of how we market ourselves, develop our careers, and fit ourselves into biological research. Data-intensive biology is a large field. In our lab alone, we represent diverse disciplines, e.g. Software Engineering, Genomics, Biological Oceanography, Comparative Physiology, Medicine, Mathematics, just to name a few. We cannot each have a deep understanding of all of these peripherally-related topics. Yet, our collective knowledge is great. How can we better extract overlapping skills from each other to solve hard problems?

We broke out into 3 groups of 6 people at a mixture of career levels (e.g. beginning grad student, mid-level grad students, postdocs, post-PhD and industry-bound) to address these specific questions:

* How does Person x learn y topic?
* What works?
* How do we teach the Davis community about y topics? (Especially when we might not necessarily know these things ourselves.)

Discussion points

The following is an approach I'm trying based on some helpful blogging advice: choosing words and phrases that explain what has worked for us (or me, specifically) rather than telling readers what they should be doing. This is because I am more apt to listen to someone else's wisdom gained from their own experiences.

– Learning a topic has depended on why we want to learn it.

– It is up to the learner to find motivation, not necessarily to follow a list of what others think you need to know. Although, we acknowledged that it is hard to figure out what you need to know if you don't yet know it. Some base level of knowledge is required.

– We have been told that skills in bioinformatics are required for successful future careers. However, there are no institutional-level plans for how to disseminate these skills to learners.

– Beginning learners can feel overwhelmed because of the interdisciplinary nature of bioinformatics, sometimes requiring a combination of knowledge and skills in computer programming, statistics, cell and molecular biology, etc.

– It has helped many of us to take a project-based learning approach.

– Three motivating scenarios were identified for developing a working knowledge of bioinformatics skills:

  1. Biologist generating data, e.g. RNAseq for differential expression. In the long term it doesn't seem to make sense to rely on a sequencing facility to analyze data, because decisions made during analysis affect the results. Making these decisions requires revisiting the question of why the data were generated in the first place, which an independently contracted analyst is not necessarily familiar with. It has been our experience that data are best analyzed by people who know the projects very well.
  2. Data analyst, understanding many projects simultaneously and advising those generating and analyzing their own data on the best way to approach analysis, based on their own experiences, consensus in the field, and benchmark testing.
  3. Data scientist at a senior level, guiding the direction of a research and training program and developing new methods for processing data.

– Our lab has representation from all three of these categories.

– Some combination of internet learning, a buddy system, and participating in a community are all key aspects of learning bioinformatics skills that seem to work for all of us.

– Buddy system. Many of us have found that forming connections with a person or a community of people at a knowledgeable level to answer questions has been necessary for our learning process. Community and personal connections can be fostered via workshops, classes, conferences, social events, and friendships.

– We have had good luck using opportunities to collaborate and asking for advice from experts when we meet them. The great thing about this lab and knowing Titus is being able to take advantage of his far-reaching network of collaborators.

– Internet learning by Google searching: Stack Overflow is our friend.

– Some of us have chosen a good book, e.g. Practical Computing for Biologists by Haddock and Dunn, to read and work through the exercises on a regular basis together with a group of people.

– We’ve found it helpful to join a community to ask/answer questions. We are actively working towards fostering such a community at UC Davis via the DIB lab! See our website for training workshop schedule and to sign-up for the email list:

– It has been our experience that significant investment of personal time is required to learn.

Here are our flip chart notes from this discussion:

Sunday afternoon

The last afternoon discussion centered around lab culture hacking, i.e. what are we doing well and what needs improvement. A motivational speech from Titus: there are always going to be various things the lab can do better, but generally, we're in a good place! The lab is a set of opportunities. Choose your own adventure. If we're not doing something, we can provide resources to accomplish goals. Overall, his expectation is for us to do wonderful and unexpected things. Preferably multiple wonderful things!

Then Titus left for an hour and a half while we discussed the lab. Topics included more frequent journal club, more frequent project reporting and scrum at every lab meeting (rather than one designated presenter each meeting presenting slides on their own research the whole time), and lab communication on Slack vs. email vs. Google Calendar for scheduling. The common theme was that while our projects are all very different, we are all connected, and the onus is on us to take more initiative to communicate with one another. We talked about positive and negative aspects of the lab, but generally concluded that our lab is awesome because of our strong community and the diverse backgrounds of our lab members. The meeting adjourned, with some of us returning to watch the Super Bowl while others stayed on to play more Settlers of Catan and Pictionary!

Thank you to Yosemite Bug, for the quiet, cozy, accommodating place for our group to stay and be productive this weekend. It was a perfect, small venue for this retreat.

Thank you, Titus for bringing us on this retreat. Thank you, everyone in the DIB lab for being fun people. And thank you, Moore Foundation for funding!

Photo by James Word

Photo by Shannon Joslin

Posted in Bioinformatics

MMETSP re-assemblies

*** Update (9/27/2016): 719 assemblies now available at the links below. All FINISHED!!!

*** Update (9/22/2016): 715 assemblies now available at the links below.

I have *almost* finished re-assembling de novo transcriptomes from the Marine Microbial Eukaryotic Transcriptome Sequencing Project (Keeling et al. 2014). This is an effort that I have been working on over the past year as a PhD student in Titus Brown’s lab at UC Davis. The lab has developed the Eel pond khmer protocol for assembling RNAseq data de novo from any species. I adapted and automated this protocol for Illumina PEx50 data from the MMETSP.

Here are 659 assemblies:


We're not completely finished yet, but in the meantime we wanted to make these available in case they are useful to you. If you do use them, please cite us (per academic protocol):

Cohen, Lisa; Alexander, Harriet; Brown, C. Titus (2016): Marine Microbial Eukaryotic Transcriptome Sequencing Project, re-assemblies. figshare.

Let us know if you have any questions or difficulties. As a caveat: we aren’t guaranteeing that these assemblies are better than the NCGR assemblies. They are different as a whole. Here is some quality information and spreadsheets to look up your specific MMETSP ID of interest:

BUSCO scores

Transrate scores

Automated Scripts

I'm still working on making these scripts more accessible and useful for people. Thanks especially to the DIB lab code review session; I will be working on these over the next month. Stay tuned!

The pipeline (flowchart below) is as follows: 1.), 2.), 3.) diginorm_mmetsp, 4.). All of the Python scripts are controlled by the SraRunInfo.csv metadata spreadsheet downloaded from NCBI, which contains the URL and sample ID information.
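As a sketch of what "controlled by SraRunInfo.csv" means in practice, a driver script can map each Run accession to its download URL and sample name. The `Run`, `download_path`, and `SampleName` column names are standard in NCBI's SraRunInfo.csv export, but verify them against your own copy:

```python
import csv

def read_run_info(path):
    """Map each SRA Run accession to its (download URL, sample name) pair."""
    runs = {}
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            runs[row["Run"]] = (row["download_path"], row["SampleName"])
    return runs
```

Each downstream step (download, trim, diginorm, assembly) can then loop over the returned dictionary, so the spreadsheet stays the single source of truth.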


This pipeline could be used with SraRunInfo.csv obtained for any collection of samples on NCBI:


Is there value in re-assembling what has already been done?

The NCGR already performed assemblies with their pipeline in 2014 when these data were sequenced (more information about their pipeline). Why are we interested in doing it again?

Software and best practices for RNAseq de novo transcriptome assembly are changing rapidly, even since 2014. We found preliminary evidence that our assemblies are different from NCGR's. Taking a look at some quality metrics for 589 new assemblies, here is the proportion of contigs from our new DIB assemblies aligning with the NCGR assemblies as reference (left), compared to the reverse, where the NCGR assemblies are aligned to our new DIB assemblies (right).


This confirms our preliminary evidence that we have assembled nearly everything the NCGR did, plus extra. We're still not sure what the extra content is, or whether it's useful.

Some samples performed well (BUSCO scores shown below, left, and Transrate scores, right), while others did not.


The effects of one pipeline vs. another (which software programs are used and which decisions are made along the way) are not well understood. Are there biological consequences, i.e. are more protein-coding regions detected with one pipeline than another? Ours may or may not be better, but they're different, so we're exploring how and why.

If our results are different for these assemblies, chances are high that other people's results depend on the methods they used for their assemblies. Every month, there are ~a dozen publications announcing a new de novo transcriptome of a eukaryotic species. Just in 2016, there have been 743 publications (as of 9/13/2016) on "de novo transcriptome assembly". When new software programs are developed and updated versions are released, decisions are made about which software programs and what pipeline/workflow to use. What are the effects of these decisions on the results being produced by the scientific community?

This is the first time – as far as we are aware – that anyone has looked carefully at a mass of assemblies like this from a diversity of organisms. This is a really great data set for a number of reasons: 1) a diversity of species, 2) library prep and sequencing was all done by the same facility. This is such a large number of samples that we can apply a statistical analysis to examine the distribution of evaluation metrics, e.g. transrate scores, BUSCO scores, proportion of contigs aligning to the reference, etc.

We’re really excited to be working with these data! And very grateful that these samples have been collected for this project and that the raw data and assemblies done by NCGR are available to the public!

Scripts for this project were developed on different systems:

The mystery of the extra samples:

This is more of a data-management issue for a project with this number of samples than a scientific problem, but it required some forensic files investigation. The number of samples in the SRA BioProject is different from the number of samples on imicrobe. There are 678 samples approved by the MMETSP:


Whereas there are 719 Experiments in the SRA:



> OLlist$Venn_List$ncbi
[1] "MMETSP0132_2"  "MMETSP0196"    "MMETSP0398"    "MMETSP0451_2"  "MMETSP0922"    "MMETSP0924"
> OLlist$Venn_List$imicrobe
[1] "MMETSP0132_2C" "MMETSP0196C"   "MMETSP0398C"   "MMETSP0419_2"  "MMETSP0451_2C"
[6] "MMETSP0922C"   "MMETSP0924C"

It turns out that these "extras" on either side of the venn diagram were just IDs with the letter "C" appended.
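A minimal sketch of that reconciliation in Python, using a few of the IDs from the listing above (treating the trailing "C" as an imicrobe-only suffix is my reading of the metadata, not documented behavior):

```python
def strip_c(sample_id):
    """Drop the trailing 'C' that imicrobe appends to some MMETSP IDs."""
    return sample_id[:-1] if sample_id.endswith("C") else sample_id

ncbi = {"MMETSP0132_2", "MMETSP0196", "MMETSP0398"}
imicrobe = {"MMETSP0132_2C", "MMETSP0196C", "MMETSP0398C", "MMETSP0419_2"}

# After normalizing, the only imicrobe-only ID left is the odd sample out.
print({strip_c(i) for i in imicrobe} - ncbi)  # -> {'MMETSP0419_2'}
```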

For some reason, MMETSP0419_2 (Tetraselmis sp. GSL018) has its own BioProject accession: PRJNA245816. So, PRJNA248394 has 718 samples. Total, there are 719 MMETSP samples in the SRA: PRJNA231566. In the SraRunInfo.csv, the “SampleName” column for the extra sample (SRR1264963) says “Tetraselmis sp. GSL018” rather than MMETSP0419_2.

The next thing to figure out was that there are more IDs in SRA than in imicrobe because some MMETSP IDs have multiple, separate Run IDs in SRA:

> length(unique(MMETSP_id_NCBI))
[1] 677
> length(unique(ncbi$Run))
[1] 718
> length(unique(MMETSP_id_imicrobe))
[1] 678

These are the samples that have multiple MMETSP ID in SRA:


Samples were assembled individually by each SRR id.

Naming files is hard.

It took me more than a day, plus some time thinking about this problem beforehand, to sort out what to name these assembly files for sharing. There are two useful IDs for each sample: the MMETSP id (MMETSPXXXX) and the NCBI-SRA id (SRRXXXXXXX). Since files were downloaded from NCBI and I'm assembling samples individually, it made sense to organize by unique SRA id, since there can be multiple MMETSP ids for some SRR. These IDs are not recognizable by humans reading them, though. So, in naming these files I wanted to include scientific names (Genus_species_strain) as well as the ids:
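The scheme ended up looking roughly like this (the field order and the .fasta extension here are illustrative, not the exact published file names):

```python
def assembly_name(genus_species_strain, srr_id, mmetsp_id):
    """Human-readable but unique: scientific name plus both sample IDs."""
    return "{}_{}_{}.fasta".format(genus_species_strain, srr_id, mmetsp_id)

assembly_name("Tetraselmis_sp_GSL018", "SRR1264963", "MMETSP0419_2")
```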


Upon further examination, it turns out that some of the Genus and species names were different between the metadata from the SRA and imicrobe. For example:

Different imicrobe: Nitzschia_punctata SRA: Tryblionella_compressa
Different imicrobe: Chrysochromulina_polylepis SRA: Prymnesium_polylepis
Different imicrobe: Crustomastix_stigmata SRA: Crustomastix_stigmatica
Different imicrobe: Chrysochromulina_polylepis SRA: Prymnesium_polylepis
Different imicrobe: Compsopogon_coeruleus SRA: Compsopogon_caeruleus
Different imicrobe: Lingulodinium_polyedra SRA: Lingulodinium_polyedrum

There were >100 of these differences. Some were spelling errors, but for others, like the Lingulodinium polyedra vs. Lingulodinium polyedrum or Compsopogon coeruleus vs. Compsopogon caeruleus examples above, I wasn't sure whether this was a spelling preference or an actual difference. The scientific naming in the SRA is linked to the NCBI taxonomy convention, but it is possible that the names assigned by experts in this field (thus making their way into the metadata hosted by imicrobe) are further ahead than NCBI's. So, in these cases, I included both SRA and imicrobe names.


It was also necessary to clean the metadata to remove special, illegal characters like “)(%\/?’. Some of the assembly file names now have multiple “_” and “-” because characters had to be stripped out. OpenRefine is a fantastic program to automatically do this task. Anyone can freely use it. Those who manage projects with metadata input by a consortium of people individually entering data by hand should especially use OpenRefine! It will even cluster similar spellings and help to catch data entry typos! Data Carpentry has a fantastic lesson to walk you through using OpenRefine. Use it. It will make your life easier.
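For a one-off script, the same kind of cleanup OpenRefine does can be sketched in a few lines of Python (the character set and the underscore policy here are approximations, not the exact rules we applied):

```python
import re

def clean_name(raw):
    """Strip illegal filename characters and turn whitespace into underscores."""
    cleaned = re.sub(r"[\"'()%\\/?]", "", raw)
    return re.sub(r"\s+", "_", cleaned.strip())

clean_name("Tetraselmis sp. (GSL018)")  # -> 'Tetraselmis_sp._GSL018'
```

OpenRefine is still the better tool for interactive cleanup, because its clustering catches near-duplicate spellings a fixed regex never will.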

Uploading files

It turns out that version-controlling file storage and sharing for hundreds of files is not yet straightforward. We explored figshare, Box, Dropbox, S3, ftp server hosting, and institutional server storage. For reasons, we chose figshare (for one .zip file) and Box cloud storage for individual files. As we finish the last of the assemblies, we'll move files and update the links at the top of this blog post.

Downloading files

We chose to use the NCBI version of these files. With ENA, it was not as easy as with the SRA to locate a metadata spreadsheet with a predictable URL for each sample. The imicrobe project hosts these files, but they do not follow a predictable pattern that would facilitate downloading all of the data. So, we downloaded the fastq files from NCBI.

When we wanted to compare our assemblies to NCGR assemblies, Luiz Irber wrote this wonderful script for downloading the NCGR assemblies for all of the samples from the imicrobe ftp server.


# from Luiz Irber
# sudo apt-get install parallel

seq -w 000 147 | parallel -t -j 5 "wget --spider -r \\{} 2>&1 \\
| grep --line-buffered -o -E 'ftp\:\/\/.*\.cds\.fa\.gz' > urls{}.txt"
cat urls* | sort -u > download.txt
cat download.txt | parallel -j 30 -t "wget -c {}"

Loose ends:

Most assemblies are done. Some aren't.



  • Complete the remaining 59 assemblies. Assess what happened to these 59 samples.
  • Assess which samples are performing poorly according to evaluation metrics.
  • Look at Transrate output to help answer, “What is the extra stuff?”
  • Make scripts more accessible.
  • Partition a set of samples for benchmarking additional software. (With bioboxes?)
  • Update khmer protocols.


Special thanks to the Moore Foundation for funding, and to Titus Brown, Harriet Alexander, and everyone in the DIB lab for guidance and information. Thanks also to Matt MacManes' suggestions and helpful tutorial and installation instructions, and to Torsten Seemann's and Rob Patro's helpful assistance at NGS2016 with asking questions and getting BUSCO and Transrate running.

Posted in Bioinformatics, cluster, Data Analyses, MMETSP, reproducibility, science

Adventures with ONT MinION at MBL’s Microbial Diversity Course

My time here at the Microbial Diversity Course at MBL has come to an end, after visiting for the past two weeks with our lab's MinION to sequence genomes from bacterial isolates that students collected from the Trunk River in Woods Hole, MA. (Photo by Jared Leadbetter of 'Ectocooler' Tenacibaculum sp. isolated by Rebecca Mickol.)


In Titus Brown’s DIB lab, we’ve been pretty excited about the Oxford Nanopore Technologies MinION sequencer for several reasons:

1) It's small and portable. This makes the MinION a great teaching tool! You can take it to workshops. Students can collect samples, extract DNA, do library prep, sequence, and learn to use assembly and annotation bioinformatics software tools within the space of one week.

2) We're interested in developing streaming software that's compatible with the sequencing -> find what you want -> stop sequencing workflow.

3) Long reads can be used in a similar way as PacBio data to resolve genome and transcriptome (!) assemblies from existing Illumina data.

Working with any new technology, especially from a new company, requires troubleshooting. While Twitter posts are cool, they tend to make it seem very easy. There is a MAP community for MinION users, but a login is required and public searching is not possible. In comparison to Illumina sequencing, there is not that much experience out there yet.


Acknowledgements are usually saved for the end, but since this is a long blog post, I thought I would front-load the gratitude.

I have really benefitted from blog posts from Keith Robison and lonelyjoeparker. Nick Loman and Josh Quick’s experiences have also been beneficial.

There is no match, though, for having people in person to talk to about technical challenges. Megan Dennis and Maika Malig at UC Davis have provided amazing supportive guidance for us in the past few months with lab space and sharing their own experiences with the MinION. I’m very grateful to be at UC Davis and working with Megan.

This trip was made possible by support from my PI, Titus Brown, who provided funding for my trip and all the flowcells and reagents for the MinION sequencing. It was necessary to have this 2 week block of time to focus on nothing else but getting the MinION to work, ask questions, and figure out what works (and what doesn’t).

Special thanks to Rebecca Mickol and Kirsten Grond in the Microbial Diversity course for isolating and culturing the super cool bacterial samples. Scott Dawson at UC Davis (faculty at the Microbial Diversity course) was instrumental in helping with DNA extractions. Jessica Mizzi assisted with library prep protocol development and troubleshooting. Harriet Alexander assisted with the assembly, library prep and showing me around Woods Hole, which is a lovely place to visit. Thank you also to the MBL Microbial Diversity Course, Hilary Morrison and the Bay Paul Center for hosting lab space for this work to take place.


Presentation slides:

Immediately following the Woods Hole visit at MBL, I went to the MSU Kellogg Biological Station as a TA for NGS 2016 course and wrote a tutorial for analyzing ONP data:

Purchasing and Shipping

Advice: allow 2-3 months for ordering. We ordered one month in advance. While ONP customer service probably worked overtime to send our flowcells and Mk1B after several emails, chats, and calling in special favors, in the future it is unclear whether we can count on a scheduled delivery with students. Communication required about a dozen emails, and we could never get confirmation that flowcells would arrive in time for the course. It turns out that our order had been shipped and arrived on time; however, we did not know about it because a tracking number was not sent to us. It took about a day of emailing and waiting to track the boxes down. Thankfully, the boxes were stored properly in the MBL shipping warehouse.

Communicate with ONP constantly. Stay on top of shipments; ask for tracking numbers and confirmation of shipment. Find out where the shipment is being delivered, as the address you entered may not be the one on the shipping box, and your order may be delivered to the wrong place.


QC the flowcells immediately. Bubbles are bad:

We ordered 7 flowcells (5, plus 2 that came with the starter pack). The flowcells seemed to have inconsistent pore numbers, and some arrived with bubbles. One flowcell had zero pores. They sent us a replacement for this flowcell within days, which was very helpful. However, for the flowcells that had bubbles, I was given instructions by ONP technical staff to draw back a small 15 ul volume of fluid to try to remove the bubble, then QC again. This did not work. The performance of these flowcells did not meet our expectations.

In communicating with the company, we were told that there was no warranty on the flowcells.

DNA Extractions

The ONP protocol says that at least 500-1000 ng of clean DNA is required for successful library prep. Try to aim for more than this: get as much DNA, at as high molecular weight, as possible. Be careful with your DNA, and do not mix liquids by pipetting. For the bacterial isolates from liquid culture, Scott Dawson recommended using Qiagen size-exclusion columns for purification, and this worked really well for us. We started with ~2000 ug and used the FFPE repair step.

The ONP protocol includes shearing with the Covaris gtube to 8kb. When I eliminated this step to preserve longer strands, there was little to no yield and samples with adequate yield had poor sequencing results. In communicating with ONP about this, we suspected that the strands were shearing on their own somewhere during the multiple reactions, then either getting washed away during the bead cleanup steps, or the tether and hairpin adapters were sheared off so the strands were not being recognized by the pores.

We sequenced all three sets of DNA below (ladder 1-10kb). The Maxwell prep (gel below on the left) had a decent library quantity but the sequencing read lengths were not as long as we would have liked, which makes sense given the small smeary bands seen. (poretools stats report)


Library prep

When we first started troubleshooting the MinION, the protocols available through the MAP were difficult to follow in the lab. We needed a sheet to just print out and follow from the bench, so we created this:

A few months ago, ONP came out with a pdf checklist for library prep, which is great:

The library prep is pretty straight forward. One important thing I learned about the NEB Blunt/TA Master Mix:

Library prep and loading samples onto the flowcell can be tricky and nerve-wracking for those who are not comfortable with lab work. I have >4 yrs of molecular lab experience: knowing how to treat reagents, quick spins, pipetting small volumes, and being careful not to waste reagents. One important point to convey to those who do not do molecular lab work often concerns the viscous, sticky enzyme mixes that come in glycerol. You think you're sucking up a certain volume, but an equal amount is often stuck to the outside of your pipette tip. You have to wipe it on the side of the tube so you don't add this to your reaction volume, changing the optimal concentration and (probably most important) wasting reagent.

Other misc. advice:

  • The calculation: M1V1 = M2V2 is your friend.
  • Don’t mix by pipetting.
  • Instead, tap or flick the tube with care.
  • Quick spin your tubes often to ensure liquid is collected down at the bottom.
  • Bead cleanups require patience and care while pipetting.
  • Be really organized with your tubes (since there are a handful of reagent tubes that all look the same). Use a checklist and cross off each time you have added a reagent.
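The M1V1 = M2V2 calculation above is easy to automate so you never botch it at the bench. Here is a minimal sketch; the function name and example concentrations are my own, not from any ONP protocol.

```python
# C1*V1 = C2*V2, rearranged to find how much stock to pipette.
def stock_volume_needed(c_stock, c_target, v_final):
    """Return (stock volume, diluent volume), in the same units as v_final."""
    if c_target > c_stock:
        raise ValueError("target concentration exceeds stock concentration")
    v_stock = c_target * v_final / c_stock
    return v_stock, v_final - v_stock

# e.g. dilute a 150 ng/uL stock to 20 ng/uL in a final volume of 50 uL:
v_stock, v_diluent = stock_volume_needed(150, 20, 50)
```

Running this before you pick up a pipette tells you exactly how much stock and how much diluent to combine, which is one less thing to re-derive while juggling tubes.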

These are the things I take for granted when I’m doing lab work on a regular basis. It takes a while to remember when I’m in the lab again after taking a hiatus to work on computationally-focused projects.

Computer Hardware

In October 2015, when we were ordering everything to get set up, the computer hardware requirements for the MinION were: 8 GB RAM and a 128 GB SSD with an i7 CPU. This is what we ended up ordering (which took several weeks to special-order from the UC Davis computer tech center):

DH Part#: F1M35UT; Manufacturer: HP; Mfr #: F1M35UT#ABA. HP ZBook 15 G2 15.6″ LED Mobile Workstation: Intel Core i7-4810MQ quad-core (4 core) 2.80 GHz, 8 GB DDR3L SDRAM, 256 GB SSD, DVD-writer, NVIDIA Quadro K1100M 2 GB, Windows 7 Professional 64-bit (English) upgradable to Windows 8.1 Pro, 1920 x 1080 16:9 display, Bluetooth, English keyboard, wireless LAN, webcam, 4 total USB ports (3 x USB 3.0), network (RJ-45), headphone/microphone combo port.

One run requires around 30-50 GB, depending on the quality of the run. The .fast5 files are large, even though the resulting .fastq are small (<1 GB). The hard-drive on our MinION laptop is 256 GB, which can fill up fast. We bought a 2 TB external hard-drive, which we can configure Metrichor to download the reads to after basecalling, saving space on the laptop hard-drive.
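Since a run can eat 30-50 GB of .fast5 files on a 256 GB drive, it is worth checking how much space a run folder is consuming before transferring it. A minimal sketch (the path in the comment is illustrative, not our actual layout):

```python
import os

# Total on-disk size of everything under a run folder, in GB,
# so you can see how close the laptop drive is to filling up.
def dir_size_gb(path):
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e9

# e.g. dir_size_gb("C:/data/reads") on the MinION laptop
```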
Software and Data
  • Windows sucks
  • MinKnow got a new GUI (graphical user interface) in the past few months. It’s annoying to get used to, but in general not too bad.
  • The poretools software to convert .fast5 to .fastq is buggy on Windows and does not play well with MinKnow. There’s probably a way to get them both to work, but I’ve already spent ~2-4 hrs troubleshooting this issue, so I’m done with it for now. Instead, we’ve been uploading the .fast5 files to a Linux server and running poretools there.
  • MinKnow python scripts crash sometimes during the run! You can open the MinKnow software again, start the script again, and it should start the run from where it left off.
  • Use the 48 hr MinKnow script for sequencing.
  • Our flow of data goes from raw signal from the MinION (laptop) -> upload to Metrichor server for basecalling -> download to external hard-drive (“pass” or “fail” depending on the Metrichor workflow chosen, e.g. 1D or 2D or barcoding) -> plug external hard-drive to Linux or Linux laptop (for some reason this is easier on Linux laptop rather than Windows…) for transfer to Linux server -> on the Linux server, run poretools software to convert to fastq/fasta -> analysis
  • This all seems kind of ridiculous. If there is a better way, please let us know!



In a future workshop setting, where students are doing this for the first time but we have more experience now, a potential schedule could go something like this:

Day 1: Collect sample, culture

Day 2: Extract DNA, run on gel, quantify

Day 3: Library prep, sequence (this will be a long day)

Day 4: Get sequences, upload, assess reads, start assembly

Day 5: Evaluate assembly, Annotate

This is similar to the schedule arranged for Pore Camp, run by Nick Loman at the University of Birmingham in the UK. They have some great materials and experiences to share:


  • Still unknown what the cost is per sample.
  • Cost of troubleshooting?

I’ve put together a quick ONP MinION purchasing sheet:

Generally, these are the items to purchase:

  • Mk1B starter pack (ours came with 2 flowcells)
  • computer
  • ONP reagents
  • third-party reagents (NEB)

Getting Help

  • MAP community has some answers
  • There is no phone number to call ONP. In contrast, Illumina has a fantastic customer service phone line, with well-trained technicians on the other end to answer emergency calls. Reagents and flowcells are expensive, and when you’re in the lab and there is a problem, like a bubble on the flowcell or a low pore number after QC, it is often necessary to talk to a person on the phone so you don’t waste time or money.
  • I’ve had many good email conversations with ONP tech support, but there is no substitute for discussing a problem with someone on the phone. Often there are things to work on after an email, and following up by going back and forth over email is difficult.
  • The LiveChat feature on the ONP website is great! (During UK business hours, there is a widget at the bottom of the store website that says “Do you have a question?”; during off hours it says “Leave a message”.)

I realized through this process that I had lots of questions and few answers. The MAP has lots of forum questions but few manuals. Phrase searching sucks: if you search for a phrase in quotes, it will still search for the individual words. For example:


Remaining Questions:

1. Why does the number of flow cell pores fluctuate? What is the optimal pore number for a flow cell?

2. What is the effect of 1D reads on the assembly? Can we use the “failed” reads for anything? 

3. How long will a run take? 

4. How much hard-disk space is required for one run?

5. When are the reads “passing” and when are they “failing”? Is there value to the failing reads? 

6. How can we get the most out of the flow cells? There seem to be a lot of unknowns related to flow cell efficiency. We tried re-using a washed flow cell: there were >400 pores during the initial QC, but after we loaded the library and started the run, the pore numbers were in the 80s-100s. Two hours later, the number had dropped to the ~30s. I added more library, and the pore numbers never increased again. Is this a result of the pore quality degrading? The next morning, we loaded more library; not much change, so we decided to switch flowcells and try a new one.

7. Are there batch effects of library prep and/or flowcells? Should we be wary of combining reads from multiple flowcells?


In the future, the aim is to move away from worrying about the technology details and focus on the data analysis and what the data mean. The goal should be to focus on the biology and why we’re interested in sequencing anything and everything. What can we do with all of this information, now that we can sequence the genome of a new bacterial species in a week?

Feel free to comment and contact!

Posted in biotech, Sequencing | 7 Comments

Computing Workflows for Biologists – Dr. Tracy Teal

I’m so excited to be visiting the Microbial Diversity Course at the Marine Biological Lab in Woods Hole, Massachusetts right now. I’m really enjoying talking to students and faculty working on projects related to microbial communities, microbial metabolism, genomics, and transcriptomics. I’m here with our lab’s MinION to sequence genomes from cultured microorganisms isolated by students during the course. (More about this in a future blog post!)

View of Eel Pond from MBL St.


Each day of the course, there are lectures in the morning on a variety of interesting topics relevant to microbial diversity. For those not familiar, this field is rapidly accumulating and analyzing large collections of data. For example, see Raza and Luheshi (2016).

Dr. Tracy Teal, Executive Director of Data Carpentry, gave us an inspiring talk this morning on data analysis, reproducibility, and sharing.


Read her paper, which summarizes these topics:

Shade and Teal. 2015. Computing Workflows for Biologists: A Roadmap. PLoS Biology. doi:10.1371/journal.pbio.1002303

She raises a number of interesting points and gives good advice relevant to the growing amount of data in biology, so I wanted to write them down to share here.

Dr. Teal opens with the question: “How many people use computers for your work?” Everyone in the room raised their hand.

We all use our computers for some aspect of our research.

The reasons for using good practices for data management and computer usage are not just for the greater good, but for you. And your sanity. We all appreciate how much data even one project can generate. This is not going to change in the future. There is an upward trend of data production over time. Thinking about this and planning now will help the future you. Even if you rely on others for the bulk of the data analysis.

“How many of you work with other people?” Everyone works in a team in a lab, and sometimes with outside collaborators. There is generally a need to communicate with others about data analyses so that someone else besides you can understand what you did. Paper reviewers and readers should be able to understand. But first, there are the people in your lab. This is called the “leaving science forever” test, where you ask yourself whether what you are doing could be followed by someone else if you were to suddenly leave. Have you ever taken over a project from someone and found that the files, samples, and notebooks were not descriptive enough for you to just pick up where they left off? Don’t wait until this happens. The more transparent and vigilant you are about this on a regular basis, the happier you will be in the future.

What knowledge and elements are necessary for these good practices?


  1. How were data generated?
  2. Where are raw data located? (e.g. HPLC files, *.txt files, *.fastq sequence files, microarray *.cel files, etc)
  3. What were the data cleaning steps? (e.g. formatting steps between raw data and doing something interesting with software. This is actually a HUGE part of data analysis pipelines and can be >80% of your work. If you can automate these steps, the better off you will be in the future.)
  4. Steps of the data analysis: exact parameters used, software versions
  5. Final plots and charts: This is the least important. If you keep track of the other steps, you should be able to recreate the exact plots very easily.

Let’s talk about data.

Keep raw data files raw. Make copies of raw files before you start to work with the data. Post these files somewhere public, in a place where they will not be deleted. Why not make them public? If you don’t want to do that, put them in a safe lockbox, but where someone else can access them if needed.

How many people have a data management plan? If a lab has a policy where data have to be placed, besides someone’s personal hard-drive, the information will have a greater chance of surviving past the time when people leave the lab.

Let’s talk about spreadsheets.

Have you ever done something in an Excel spreadsheet that made you sad? We all have. Single columns get re-sorted rather than the whole sheet. Autocorrected spelling changes gene names. Dates get mangled. MS Excel makes these formatting mistakes, and Google Sheets makes the same ones.

Train yourself to think like a computer.

There are rules for using Excel. This may seem silly, but following these rules will actually save you and collaborators much time. People know spreadsheets. Many biologists use spreadsheets in a way that is time-consuming in the long-run, e.g. laying out information to be read for humans, with color-coding and notes.

Follow these simple rules:

  • Put each variable into a separate column
  • Do not use color to convey information. Instead, add a column (e.g. “calibrated”) with a one- or two-word code, e.g. YES or NO, EXTRACTED or NOT.
  • Do not use Excel data files to write out long metadata notes about your file; these are best saved in a separate README file.
  • Leave raw data raw. If you’re going to transform data or perform a calculation, create a new file or a separate column(s)
  • Break data down into the finest scale resolution to give you the most options. Don’t combine multiple types of information into one column, e.g. Species-Sex, Month-Year. One simple trick to avoid the annoying auto-formatting of dates in Excel: use three separate columns for month, date, year. This will allow you to look at date ranges, e.g. only fall, easily pull out years, or 15th of every month. Gives more flexibility!
  • Export your .xls into a .csv to avoid errors in downstream analyses
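Once a sheet follows these rules and is exported to .csv, downstream analysis becomes trivial. A minimal sketch of why split month/day/year columns pay off (the data and column names here are made up for illustration):

```python
import csv
import io
from datetime import date

# An illustrative tidy sheet: one variable per column,
# month/day/year kept separate to dodge Excel's date auto-formatting.
raw = """species,sex,month,day,year,length_mm
DM,F,7,16,2013,36
DO,M,10,3,2013,33
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Rebuild real dates at analysis time, never inside the spreadsheet:
dates = [date(int(r["year"]), int(r["month"]), int(r["day"])) for r in rows]

# Split columns make range queries trivial, e.g. all July records:
july = [r for r in rows if int(r["month"]) == 7]
```

Because each variable lives in its own column, pulling out "only fall" or "the 15th of every month" is a one-line filter instead of a string-parsing exercise.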

If you need more motivation for why it’s a good idea to train yourself to follow these Excel rules, this is a great list of all the common errors that spreadsheets can make:

Proceeding with analysis:

Good data organization is the foundation for any project. Without this, none of the actual meaningful aspects of the project will be easy or efficient and data analyses will drag on and on.

  1. What is your motivation, overarching goal of analysis? To test hypotheses? Exploratory?
  2. Adopt automation techniques, i.e. iterative patterns that don’t rely on human input, to reduce errors
  3. Reproducibility checkpoints
  4. Taking good notes
  5. Sharing responsibility, team approach


Hopefully your experimental design was set up to motivate different strategies, hypothesis-testing vs. exploratory. Write out each step of the workflow by hand. Just asking yourself, “What am I going to do now?” can help to guide a workflow.

Reproducibility checkpoints, scrutinizing integrity of analyses:

Modularize your workflow and set up checkpoints at certain points to make sure you have what you expect. Does it actually work? Is the outcome consistent? (Some programs have a stochastic element.) Do the results make biological sense?
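One concrete way to handle the stochastic-element problem is to fix the random seed and confirm that two runs of the same step agree. A minimal sketch, where the subsampling function stands in for any stochastic tool in a workflow (it is illustrative, not a real program):

```python
import random

# A stand-in for a stochastic workflow step, e.g. read subsampling.
# Fixing the seed makes the step repeatable, so a checkpoint can
# compare two runs and flag any drift.
def subsample_reads(reads, n, seed=42):
    rng = random.Random(seed)
    return rng.sample(reads, n)

reads = [f"read_{i}" for i in range(1000)]
run1 = subsample_reads(reads, 10)
run2 = subsample_reads(reads, 10)
assert run1 == run2  # same seed, same result: the checkpoint passes
```

Recording the seed alongside the other parameters in your notes means anyone, including future you, can regenerate the exact same intermediate files.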

Examples of negative consequences for having problems with code and research that is not reproducible:

fMRI results:

Clinical genetics:

Unfortunately, there are probably many other examples… (I’m interested in these, so please feel free to comment and share.)

Reproducibility and data management plans are now score-able in grant reviews and peer review. This is starting to be valued more in the research community.

This is difficult. No one is perfect. You get to decide what your values are. We have opportunities to set norms in our communities for what we see.

Take good notes

Include this information:

  • Software version
  • Description of what software is doing/goal
  • What are the default options?
  • Brief notes on deviations from default options
  • Workflows: Include a progression using different software (e.g. PANDAseq -> QIIME –> R). See Figure 1 from Shade and Teal (2015).
  • ALL formatting steps required to move between tools. (Write a tutorial for others. This is a good example.) Avoid manually formatting data. Ideally, a script will be written and made available to automatically re-format data.
  • Anything else that will help you remember what you did
  • Most important person to explain your process to is you in 6 months. Unfortunately, you from 6 months ago will not answer email. If you need to re-do something, you need to remember what you did.
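The note-taking steps above, especially recording software versions, can be partly automated so they happen at the start of every analysis. A minimal sketch; the tool list and log path are illustrative, and the `--version` flag convention will not hold for every program:

```python
import datetime
import subprocess
import sys

# Append the Python version and each tool's reported version to a
# running log, so 6-months-from-now you can reconstruct the run.
def log_versions(tools, path="analysis_log.txt"):
    with open(path, "a") as log:
        log.write(f"run started: {datetime.datetime.now().isoformat()}\n")
        log.write(f"python: {sys.version.split()[0]}\n")
        for tool in tools:
            try:
                out = subprocess.run([tool, "--version"],
                                     capture_output=True, text=True)
                lines = (out.stdout or out.stderr).strip().splitlines()
                version = lines[0] if lines else "unknown"
            except FileNotFoundError:
                version = "NOT FOUND"
            log.write(f"{tool}: {version}\n")
```

Calling something like `log_versions(["bwa", "samtools"])` at the top of a pipeline script captures the exact versions without relying on anyone remembering to write them down.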

When writing a paper, go through your workflow again. Start from the beginning and make sure you can do again what you thought you did. Make sure you can reproduce. We rarely have the opportunity to do this with lab work because it’s too expensive. But we can do this with computational analyses!

These things take time. It’s easy to fling data everywhere. Being organized takes time and is less easy. Value this.

Shared responsibility

Shared responsibility enhances reproducible workflows. Holding each other accountable for high-quality results, confidence in results promotes a strong sense of collaboration. Some general advice:

  1. Shared storage and workspace can facilitate access to all group data. Within a lab group, it is VERY common for each member to work on a different computer. Institutional shared drives are maintained by administrators and occasionally need to be purged to preserve space.
  2. No one is perfect. Not backing up files, or not knowing where files or code are, are common mistakes. It happens. It’s easy to throw your hands up and complain, or to shame each other’s work habits related to the topics discussed here, but shame is less productive than learning from mistakes and growing and discussing as a group. Use these opportunities to grow together productively. Few people have malicious intent. We’re all people; work together to make productive, positive changes.
  3. Talk to data librarians at your institution. (Advocate for creating such a position if it does not exist.)
  4. Share data. Dr. C. Titus Brown advocates for publishing all pieces of data publicly on figshare. Half of people’s problems with data stem from the desire to keep data private until publishing, which is usually >3 yrs from the time of collection. By then you can’t find the data, or you spend too much time trying to make it “perfect”. Publish the data as soon as you collect it; you can always go back and improve the annotations. When you do a “data dump”, your name is associated with those data, and cases of people maliciously stealing data are almost unheard of. (If you have examples, it would be interesting to hear them.) There is almost never a reason NOT to publish data as soon as it’s collected, and doing so is a great way to advertise what you are doing so others can collaborate, or avoid the same avenue if it was unproductive.
  5. Join data working groups
  6. Use version control repositories for code and data analyses (e.g. GitHub)
  7. Set expectations for ‘reproducibility checkpoints’ with team “hackathons” or open-computer group meetings dedicated to analysis
  8. Lab paper reviews focused on data reproducibility
  9. Look for help/support outside the lab, e.g. bioinformatics or user group office hours, Stack Overflow, BioStars. You are not alone in wanting to learn these things. We can never know everything, so talk to people.

Bioinformatics resources:

If you see a typo or problem with tutorials, please let people know. 🙂

Here is an exercise to try!

View of Eel Pond from Water St.


Posted in Data Analyses, reproducibility, science, talks, workshops | Leave a comment