MMETSP re-assemblies

*** Update (9/27/2016): 719 assemblies now available at the links below. All FINISHED!!!

*** Update (9/22/2016): 715 assemblies now available at the links below.

I have *almost* finished re-assembling de novo transcriptomes from the Marine Microbial Eukaryotic Transcriptome Sequencing Project (Keeling et al. 2014). This is an effort that I have been working on over the past year as a PhD student in Titus Brown’s lab at UC Davis. The lab has developed the Eel pond khmer protocol for assembling RNAseq data de novo from any species. I adapted and automated this protocol for Illumina PEx50 data from the MMETSP.

Here are 659 assemblies:


We’re not completely finished yet, but in the meantime we wanted to make these available in case they are useful to you. If you do use, (per academic protocol) please cite us:

Cohen, Lisa; Alexander, Harriet; Brown, C. Titus (2016): Marine Microbial Eukaryotic Transcriptome Sequencing Project, re-assemblies. figshare.

Let us know if you have any questions or difficulties. As a caveat: we aren’t guaranteeing that these assemblies are better than the NCGR assemblies. They are different as a whole. Here is some quality information and spreadsheets to look up your specific MMETSP ID of interest:

BUSCO scores

Transrate scores

Automated Scripts

I’m still working on makings these scripts more accessible and useful for people. Thanks especially to the DIB lab code review session, I will be working on these over the next month. Stay tuned!

The pipeline (flowchart below created with is as follows: 1.), 2.), 3.) diginorm_mmetsp, 4.) All of the Python scripts are controlled by the SraRunInfo.csv metadata spreadsheet downloaded from NCBI that contains the url and sample ID information.


This pipeline could be used with SraRunInfo.csv obtained for any collection of samples on NCBI:


Is there value in re-assembling what has already been done?

The NCGR already performed assemblies with their pipeline in 2014 when these data were sequenced (more information about their pipeline). Why are we interested in doing it again?

Software and best practices for RNAseq de novo transcriptome assembly are changing rapidly, even since 2014. We found preliminary evidence that our assemblies are different than NCGR. Taking a look at some quality metrics for 589 new assemblies, here are the proportion of contigs from our new DIB assemblies aligning with assemblies from NCGR as reference (left), compared to the reverse where the NCGR assemblies are aligned to our new DIB assemblies (right).


This confirms our preliminary evidence that we have assembled nearly everything the NCGR did, plus extra. Still not sure what the extra stuff is, if it’s useful.

There are some samples that have performed well (shown below BUSCO scores – left – and transrate scores – right), while others that have not.


The effects of one pipeline vs another – which software programs are used and decisions made during the pipeline are not well understood. Are there biological consequences, i.e. are there more protein coding regions detected, between pipelines? Ours may or may not be better, but they’re different so we’re exploring how and why.

If our results are different for these assemblies, chances are high that results from other people’s assemblies are dependent on the methods they are using to do their assemblies. Every month, there are ~a dozen publications announcing a new de novo transcriptome of a eukaryotic species. Just in 2016, there have been 743 publications (as of 9/13/2016) on “de novo transcriptome assembly”. When new software programs are developed, updated versions are released, decisions are made about which software programs and what pipeline/workflow to use. What are the effects of these decisions on the results being produced by the scientific community?

This is the first time – as far as we are aware – that anyone has looked carefully at a mass of assemblies like this from a diversity of organisms. This is a really great data set for a number of reasons: 1) a diversity of species, 2) library prep and sequencing was all done by the same facility. This is such a large number of samples that we can apply a statistical analysis to examine the distribution of evaluation metrics, e.g. transrate scores, BUSCO scores, proportion of contigs aligning to the reference, etc.

We’re really excited to be working with these data! And very grateful that these samples have been collected for this project and that the raw data and assemblies done by NCGR are available to the public!

Scripts for this project were developed on different systems:

The mystery of the extra samples:

This is more of an issue related to managing data from a project with this number of samples rather than a scientific problem. But it required some forensic files investigation. The number of samples in the SRA Bioproject is different than the number of samples on imicrobe. There are 678 samples approved by the MMETSP:


Whereas there are 719 Experiments in the SRA:



[1] “MMETSP0132_2” “MMETSP0196” “MMETSP0398” “MMETSP0451_2” “MMETSP0922” “MMETSP0924”
> OLlist$Venn_List$imicrobe
[1] “MMETSP0132_2C” “MMETSP0196C” “MMETSP0398C” “MMETSP0419_2” “MMETSP0451_2C”
[6] “MMETSP0922C” “MMETSP0924C”

It turns out that these “extras” on either side of the venn diagram were just ID with the letter “C” on the end.

For some reason, MMETSP0419_2 (Tetraselmis sp. GSL018) has its own BioProject accession: PRJNA245816. So, PRJNA248394 has 718 samples. Total, there are 719 MMETSP samples in the SRA: PRJNA231566. In the SraRunInfo.csv, the “SampleName” column for the extra sample (SRR1264963) says “Tetraselmis sp. GSL018” rather than MMETSP0419_2.

The next thing to figure out was that there are more IDs in SRA than in imicrobe because some MMETSP ID have multiple, separate Run ID in SRA:

> length(unique(MMETSP_id_NCBI))
[1] 677
> length(unique(ncbi$Run))
[1] 718
> length(unique(MMETSP_id_imicrobe))
[1] 678

These are the samples that have multiple MMETSP ID in SRA:


Samples were assembled individually by each SRR id.

Naming files is hard.

It took me >a day, plus some time thinking about this problem beforehand, to sort out what to name these assembly files for sharing. There are 2 useful ID for each sample: MMTESP id (MMETSPXXXX) and NCBI-SRA id (SRRXXXXXXX). Since files were downloaded from NCBI and I’m assembling samples individually, it made sense to organize by unique SRA id since there can be multiple MMETSP for some SRR. IDs are not recognizeable by humans reading them, though. So, in naming these files I wanted to include scientific names (Genus_species_strain) as well as the ids:


Upon further examination, it turns out that some of the Genus and species names were different between the metadata from the SRA and imicrobe. For example:

Different imicrobe: Nitzschia_punctata SRA: Tryblionella_compressa
Different imicrobe: Chrysochromulina_polylepis SRA: Prymnesium_polylepis
Different imicrobe: Crustomastix_stigmata SRA: Crustomastix_stigmatica
Different imicrobe: Chrysochromulina_polylepis SRA: Prymnesium_polylepis
Different imicrobe: Compsopogon_coeruleus SRA: Compsopogon_caeruleus
Different imicrobe: Lingulodinium_polyedra SRA: Lingulodinium_polyedrum

There were >100 of these differences. Some were spelling errors, but others, like the Lingulodinium polyedra vs. Lingulodinium polyedrum or Copsopogon coeruleus vs. Compsopogon caeruleus examples above, I wasn’t sure if this was a spelling preference or an actual difference. The scientific naming in the SRA is linked to the NCBI taxonomy convention, but it could be possible that the names assigned by experts in this field (thus making their way into the metadata hosted by imicrobe) are further ahead in its naming than NCBI. So, in these cases, I included both SRA and imicrobe names.


It was also necessary to clean the metadata to remove special, illegal characters like “)(%\/?’. Some of the assembly file names now have multiple “_” and “-” because characters had to be stripped out. OpenRefine is a fantastic program to automatically do this task. Anyone can freely use it. Those who manage projects with metadata input by a consortium of people individually entering data by hand should especially use OpenRefine! It will even cluster similar spellings and help to catch data entry typos! Data Carpentry has a fantastic lesson to walk you through using OpenRefine. Use it. It will make your life easier.

Uploading files

It turns out that version-controlling file storage and sharing for hundreds of files is not straight-forward yet. We explored figshare, Box, Dropbox, S3, ftp server hosting, and/or institutional server storage. For reasons, we chose figshare (for one .zip file) and Box cloud storage for individual files. As we get the last of the assemblies, we’ll move files and update the links at the top of this blog post.

Downloading files

We chose to use the NCBI version of these files. ENA numbers were not as easy as SRA to locate a metadata spreadsheet with a predictable url address for each sample. The imicrobe project hosts these files, but the files do not follow a predictable pattern to facilitate downloading all of the data. So, we downloaded the fastq from NCBI.

When we wanted to compare our assemblies to NCGR assemblies, Luiz Irber wrote this wonderful script for downloading the NCGR assemblies for all of the samples from the imicrobe ftp server.


# from Luiz Irber
# sudo apt-get install parallel

seq -w 000 147 | parallel -t -j 5 "wget --spider -r \\{} 2>&1 \\
| grep --line-buffered -o -E 'ftp\:\/\/.*\.cds\.fa\.gz' > urls{}.txt"
cat urls* | sort -u > download.txt
cat download.txt | parallel -j 30 -t "wget -c {}"

Loose ends:

Most assembles are done. Some aren’t.



  • Complete the remaining 59 assemblies. Assess what happened to these 59 samples.
  • Assess which samples are performing poorly according to evaluation metrics.
  • Look at Transrate output to help answer, “What is the extra stuff?”
  • Make scripts more accessible.
  • Partition a set of samples for benchmarking additional software. (With bioboxes?)
  • Update khmer protocols.


Special thanks to Moore Foundation for funding, Titus Brown, Harriet Alexander and everyone in the DIB lab for guidance and information. Thanks also to Matt MacManes‘ suggestions and helpful tutorial and installation instructions and Torsten Seeman‘s and Rob Patro‘s helpful assistance at NGS2016 with asking questions and getting BUSCO and Transrate running.


About Lisa Johnson

PhD candidate at UC Davis in Molecular, Cellular, and Integrative Physiology
This entry was posted in Bioinformatics, cluster, Data Analyses, MMETSP, reproducibility, science. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s