Marine Microbes! What to do with all the data?

UPDATE: Check out Titus’ blog post, Bashing on monstrous sequencing collections.

Since September 2015, I’ve been a PhD student in C. Titus Brown’s lab at UC Davis, working with data from the Moore Foundation’s Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP). I would like to share some progress on that front from the past six months. Comments welcome!

MMETSP is a uniquely valuable data set consisting of 678 cultured samples from 306 species representing more than 40 phyla (Keeling et al. 2014). It is public and available on NCBI. The MMETSP data set consists entirely of cultured samples submitted by a large consortium of PIs to the same sequencing facility. All samples were sequenced as 100 bp paired-end reads on an Illumina HiSeq 2000 instrument; a few samples were run on a GAIIx.

For many species in this set, this is the only sequence data available, because reference genomes do not exist. The figure below from Keeling et al. 2014 shows the diverse relationships between samples represented in the MMETSP. The dashed lines indicate groups without a reference genome, whereas the solid lines indicate groups with one.


Here are a few stars (Micromonas pusilla – left, Thalassiosira pseudonana – right):


It’s worth emphasizing that this is one of the largest (if not the largest) public, standardized RNAseq datasets available from such a diversity of species. Related to this cool dataset, I’m really grateful for a number of things: a.) the MMETSP community, who took the initiative to put this sequencing dataset together, b.) the Moore Data Driven Discovery program for funding, c.) working with a great PI who is willing and able to focus efforts on these data, d.) being in a time when working with a coordinated nucleic acid sequencing data set from such a large number of species is even possible.

Automated De Novo Transcriptome Assembly Pipeline

The NCGR has already put together de novo transcriptome assemblies of all samples in this data set with their own pipeline. Part of the reason we decided to build our own pipeline was that we were curious to see whether our assemblies would be different. Also, because I’m a new student, developing and automating this pipeline has been a great way for me to learn about automating pipeline scripts, de novo transcriptome assembly evaluation, and the lab’s khmer software for digital normalization. We’ve assembled just a subset of 56 samples so far, not the entire data set. It turns out that our assemblies are different from NCGR’s. (More on this at the bottom of this post.)

All scripts and info that I’ve been working with are available on github. The pipeline is a modification of the first three steps of the Eel Pond mRNAseq Protocol to run on an AWS instance. (I’m aware that these are not user-friendly scripts right now, sorry. Working on that. My focus thus far has been on getting them functional.)

  1. download raw reads from NCBI
  2. trim raw reads, check quality
  3. digital normalization with khmer
  4. de novo transcriptome assembly with Trinity
  5. compare new assemblies to existing assemblies done by NCGR

The script takes a metadata file downloaded from NCBI (SraRunInfo.csv); see the screenshot below for how to obtain this file:


The metadata file contains info such as run (ID), download_path, ScientificName, and SampleName. These are fed into a simple Python dictionary data structure, which allows for looping and indexing to easily access and run individual processes on these files in an automated and high-throughput way (subset of the dictionary data structure shown below):
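As a rough sketch of how such a dictionary can be built from SraRunInfo.csv (this is my illustration, not the actual pipeline code; the column names follow NCBI's file and the function name is mine):

```python
import csv

def build_sra_dict(sra_run_info_csv):
    """Map each SRA Run ID to the metadata fields the pipeline needs."""
    sra_dict = {}
    with open(sra_run_info_csv) as f:
        for row in csv.DictReader(f):
            sra_dict[row["Run"]] = {
                "download_path": row["download_path"],
                "ScientificName": row["ScientificName"],
                "SampleName": row["SampleName"],
            }
    return sra_dict
```

Each run ID then indexes straight into its download URL, species, and MMETSP sample name.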


Each subsequent script uses this dictionary structure to loop through the samples and run commands for different software (Trimmomatic, FastQC, khmer, Trinity, etc.). Assemblies were run separately for each sample, regardless of how the samples were named. This way, we will be able to see how closely assemblies cluster together or apart, agnostic of scientific naming.
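A minimal sketch of that looping pattern (my illustration with a simplified command template, not the real scripts’ exact commands):

```python
import subprocess

def run_for_each_sample(sra_dict, command_template, dry_run=True):
    """Fill in the command template for every SRA run ID in the dictionary.

    With dry_run=True (the default) the commands are only collected,
    not executed, which makes it easy to inspect them first.
    """
    commands = []
    for run_id in sra_dict:
        cmd = command_template.format(run=run_id)
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, shell=True, check=True)
    return commands

# e.g. quality-check every sample's paired reads:
# run_for_each_sample(sra_dict, "fastqc {run}_1.fastq {run}_2.fastq")
```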


There have been several challenges so far in working with this public data set.

First, it might seem simple in retrospect, but it actually took me a long time to figure out how to grab the sequencing files, what to call them, and how to connect sample names and codes. The SraRunInfo.csv file available from NCBI let us translate SRA IDs to MMETSP IDs and scientific names, but figuring this out required some poking around and emailing people.

Second, a note for anyone who is in charge of naming samples in the future: small deviations from the naming convention, e.g. an extra “_” suffix after the sample name, can mess up automated scripts. For example,


had to be split with the following lines of code:

test_mmetsp = sample_name.split("_")
if len(test_mmetsp) > 1:
    print(test_mmetsp)

Resulting in this:

['MMETSP0251', '2']
['MMETSP0229', '2']

Then I grabbed the first entry of the list so that these IDs matched the rest of the MMETSP IDs, without the “_”. Not really a big deal, but it was a separate problem that required some figuring out. My advice is to pick one naming convention and then name all of the files with the same exact structure.
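A one-liner version of that fix, as a sketch (the helper name is mine, not from the pipeline scripts):

```python
def normalize_mmetsp_id(sample_name):
    """Strip any '_' suffix so that e.g. 'MMETSP0251_2' becomes 'MMETSP0251'."""
    return sample_name.split("_")[0]

print(normalize_mmetsp_id("MMETSP0251_2"))  # MMETSP0251
print(normalize_mmetsp_id("MMETSP0229"))    # MMETSP0229 (unchanged)
```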

Lastly, several of the records in SraRunInfo.csv were not found on the NCBI server, which required emailing the SRA.


The people affiliated with SRA who responded were incredibly helpful and restored the links.

Assembly Comparisons

I used the transrate software for de novo transcriptome assembly quality analysis to compare our assemblies with the NCGR assemblies (*.cds.fa.gz files). Below are frequency distributions of the proportions of reference contigs with Conditional Reciprocal Best-hit BLAST (CRBB) matches, described in Aubry et al. 2014. The left histogram below shows our “DIB” contigs compared to the NCGR assemblies; on the right are NCGR contigs compared to DIB contigs. Together, these suggest that we have assembled almost everything in their assemblies, plus some extra content!


Here is the same metric shown in a different way:


We’re not sure whether the extra content we’ve assembled is real, so we plan to ground-truth a few assemblies against available reference genomes to see.

For further exploration of these contig metrics, here are some static notebooks:

If you would like to explore these data yourself, here is an interactive binder link that lets you run the actual graph production (thanks to Titus for creating this link!):


Based on these findings, several questions have come up about de novo transcriptome assemblies. Why do different pipelines give different results? Are these differences significant? What can this tell us about other de novo transcriptome assemblies? Having distributions of quality metrics across many assemblies is an unusual situation. Usually, assemblies are produced by one pipeline for just one species at a time, so means and standard deviations are not available. More and more de novo transcriptome assemblies are being published by different groups worldwide for species x, y, z, yet evaluating the quality of these assemblies is not straightforward. Is it worth developing a prioritized set of metrics that would allow any de novo assembly to be evaluated in a standard way?

Moving forward, the plan is to:

  • keep working on this assembly evaluation problem,
  • assemble the rest of the samples in this data set,
  • make these data and pipeline scripts more user-friendly and available,
  • standardize annotations across species to enable meaningful cross-species analyses and comparisons.

Questions for the Community

What analyses would be useful to you?

What processed files are useful to you? (assembly fasta files, trimmed reads, diginorm reads)

The idea is to make this data-intensive analysis process easier for the consortium of PIs who put these data together, so that discoveries can be proportional to the data collected. In doing so, we want to make sure we’re on the right track. If you’re reading this and are interested in using these data, we’d like to hear from you!

Special thanks to C. Titus Brown, Harriet Alexander and Richard Smith-Unna for guidance, insightful comments, and plots!

Also, the cat:

Posted in Grad School, MMETSP, science | 3 Comments

Intro git – Lab meeting

By the end of this meeting, you will be able to:

  • Create a new repository
  • Edit a markdown file (markdown lesson from Reid Brennan)
  • clone to local desktop
  • make changes
  • commit and push changes

1. Create a new repository

Click on the “New” button to create a new repository:


Name the repository whatever you would like. Examples: test, data, lab protocols, awesome killifish RNA extractions, significant genes lists, abalone data files, etc. The idea is that this will be your repository/directory of version-controlled files that you will pull/push back and forth between your computer and github. Check the “Initialize this repository with a README” box:


You have created a new repository!


2. Edit the markdown file (Reid)

3. Clone the repository to your local desktop

To copy the url, click on the clipboard-like icon next to the web address for this repository (see below). Sidenote: this is the same web address you can use to share this repository with colleagues. You can also just copy the url from the web address in your browser.


Open your terminal, navigate to a directory where you would like to put the new repository. Type this command to “clone” the repository:

git clone <repository-URL>


You should see the “Cloning into ___” like in the screenshot above. Use the ‘ls’ command to list the contents of the current working directory to make sure it’s there. It is!


4. Make changes to the git directory

Now, we can make changes to this directory and they will be tracked. First, change directories into the one you just created:

cd super_awesome_killifish_data

Let’s copy a file into this directory. (This is a small text file I had in one directory up from the current one, so I use ../ to indicate where it will be found then . to indicate that I want to copy it to the current directory.)

cp ../cluster_sizes.txt .

5. Commit and push changes

Now that you have made a change in this directory, you want to make sure it is saved to github. The following commands are standard for staging and pushing changes to a github repository:

git status
git add --all
git commit -m "added cluster_sizes.txt for A_xenica"
git push

(Type in your github user name and password. The letters you type in might not show up on the screen, but they are getting typed in, don’t worry!)


Now, you can go to the web github and see the changes made:



Posted in github | Leave a comment

Fish Gill Cell Culture System, MCB263 science communication assignment

My tweet for this week’s MCB263 science communication assignment is about a new technique from researchers at King’s College London, published in Nature Protocols, that uses cultured freshwater gill tissue as a proxy for in vivo fish gill tissue, which is commonly used in toxicity testing to assess whether compounds have negative biological effects. Since gill tissue filters water and extracts oxygen and osmolytes for the fish, it is well suited to measuring the toxicity of waterborne compounds, which become trapped in the tissue. In this fish gill cell culture system by Schnell et al. (2016), whole fish do not have to be sacrificed; only the gill tissue is used to assess potential toxicity or bioaccumulation, or for environmental monitoring. The system is resource-efficient and portable, requiring only a thin layer of tissue and some reagents to maintain, and it does not require raising and maintaining whole animals. The elegant features of gill tissue can be used without waste.

This video demonstrates the technique:

My group’s tweets:

Laura Perilla-Henao: market resistance to synthetic malaria drug

Ryan Kawakita: E. coli used to generate morphine precursor

Stephanie Fung: researchers publish in Cell exploring microbiome evolution from hunter-gatherer to modern western society

Prema Karunanithi: release of real-time data on Zika virus infection study in monkeys

Posted in Articles, science | Leave a comment

Downloading masses of files, useful ftp commands

You have finished a sequencing project and now your sequencing facility is sending you lots of files! You’re eager to see your results. Your facility sends you a link with files displayed on a website. Now what are you supposed to do?

First, log in to your hpc server (AWS or HPC cluster) where you will work with the files.

Depending on the type of server where your files have been made available (ftp or http), try these commands:

If ftp server, with user and password required:

wget -r --user XXX --password XXX ftp://1234.5678.10

You can also log in to the ftp server, navigate and change directories, then use mget to copy multiple files from the remote server to your local one. (Although this will sometimes take longer than wget for some reason.) (The -P 2121 port option is optional; leave it out if you don’t know the port.)

ncftp -u XXX -p XXX -P 2121 ftp://1234.5678.10
cd /path/to/files
mget *

If http or https server, with user and password: the -r is recursive, -l1 limits recursion to one level, and --no-parent means don’t ascend to the parent directory. (These options stop wget from automatically downloading files linked from the files being downloaded, which can happen with websites.)

wget -r -l1 --no-parent --user=user --password=password

With no user or password required:

wget -r -l1 --no-parent --no-check-certificate

Most facilities are friendly and will help you download your files. In some cases, they will provide temporary ssh access, but other times not. It depends on the facility. If you don’t know the name of the ftp or http server, or the directory where the files are located, write to them and ask.

Thanks to Luiz Irber and Dragos Scarlet for the help and motivation for working through this problem. 🙂

Posted in Linux | Leave a comment

How to install linux software without root privileges

HPC cluster users rarely have root privileges for installing new software. The default installation location is usually /usr/local/bin/, but our user directories are located somewhere else, like /home/ljcohen/ or /mnt/home/ljcohen/. So, what are you supposed to do when you need to use a new software package? You can contact your cluster sysadmins to have them install the package cluster-wide. In general, it’s a good idea to be in contact with them. (Usually they’re nice, helpful, and like to hear from their users to learn more about how you’re using the cluster.) But sometimes you just want to try something out to see if it’s useful. Here is how.

I wanted to install samstat. (btw, I don’t mean to single out samstat. This is just what I happened to be trying out today. And it was a relatively easy fix!)

The first step is to download the installation file by grabbing the latest version available through the software’s website:

But notice how this web address is longer than it needs to be, because the download site appends extra stuff at the end that you don’t need. You could first download this file to your local hard drive and then transfer it to your cluster account, but the easiest thing is to use the wget command to download the file directly into your linux cluster environment.

Right-click on the “Direct Link”, select “Copy Link Address” from the menu, then manually delete everything after the filename:


Create and keep a directory where you will download and install all of your local software. I keep software in ~/.local. Navigate to that directory and download the file. Use this command:


If you follow the installation instructions exactly, you will get a few errors:



Looking at the README file, I found slightly different instructions compared to the website:


After decompressing the .tar.gz file, the next command, ./configure, results in the “self tests run under valgrind” check not passing:


I decided to keep going and see what would happen. Running the next command ‘make’ appears fine:


Even ‘make check’ passes all the tests:


The executable file ‘samstat’ is in the appropriate directory. It appears everything is fine:


However, the final command ‘make install’ has a problem:


There is a problem creating a file in /usr/local/bin/samstat with “Permission denied”. This directory does not belong to me, and I don’t have privileges to write there.

From experience, I know that I need to tell the installation process where to install files so it does not try in the default location.

But how do I know this?

When faced with errors, knowing what to use as a Google search string can be a challenge. When I searched for the exact error “cannot create regular file”, the results were not very helpful (at least for me).

This is what I used: “make install without root privileges”.

The most helpful page for me was a post on the Biostars forum.

To undo the failed installation attempt, all you have to do is delete the directory with rm -rf. But remember to keep the compressed tar file!


Then start over. Finally, I managed to get it right! Here are the working commands I used:

tar -zxvf samstat-1.5.1.tar.gz
cd samstat-1.5.1/
./configure --prefix=/home/ljcohen/.local
make check
make install

And it works!


Don’t forget to add the directory containing the executable (here, /home/ljcohen/.local/bin) to your $PATH.


When you get something to work, it feels good. 🙂 While all of your commands are saved in your history, don’t forget to keep a log of what you did and how you fixed it, so that your future self will remember when this happens again.


Posted in cluster, High Performance Computing, Linux, software | 2 Comments