UPDATE: Check out Titus’ blog post, Bashing on monstrous sequencing collections.
Since Sept 2015, I’ve been a PhD student in C. Titus Brown’s lab at UC Davis working with data from Moore’s Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP). I would like to share some progress on that front from the past 6 months. Comments welcome!
MMETSP is a really unique and valuable data set consisting of 678 cultured samples covering 306 species and more than 40 phyla (Keeling et al. 2014). It is public and available on NCBI. The MMETSP data set consists entirely of cultured samples submitted by a large consortium of PIs to the same sequencing facility. Nearly all samples were sequenced as paired-end 100 bp reads on an Illumina HiSeq 2000; a few were run on a GAIIx.
For many species in this set, this is the only sequence data available, because reference genomes do not exist for them. The figure below from Keeling et al. 2014 shows the diverse relationships between samples represented in the MMETSP. Dashed lines indicate groups without a reference genome, whereas solid lines indicate groups with one.
Here are a few stars (Micromonas pusilla – left, Thalassiosira pseudonana – right):
It’s worth emphasizing that this is one of the largest (if not THE largest) public, standardized RNAseq datasets available from such a diversity of species. Related to this cool dataset, I’m really grateful for a number of things: a.) the MMETSP community, which took the initiative to put this sequencing dataset together, b.) the Moore Data Driven Discovery program for funding, c.) a great PI who is willing and able to focus efforts on these data, and d.) being in a time when working with a coordinated nucleic acid sequencing data set from such a large number of species is even possible.
Automated De Novo Transcriptome Assembly Pipeline
The NCGR has already put together de novo transcriptome assemblies of all samples in this data set with their own pipeline. Part of the reason we decided to build our own pipeline was curiosity about whether our assemblies would come out different. Also, as a new student, developing and automating this pipeline has been a great way for me to learn about automating pipeline scripts, about de novo transcriptome assembly evaluation, and about the lab's khmer software for digital normalization. We've assembled just a subset of 56 samples so far, not the entire data set. It turns out that our assemblies are different from NCGR's. (More on this at the bottom of this post.)
All scripts and info that I've been working with are available on GitHub. The pipeline is a modification of the first three steps of the Eel Pond mRNAseq Protocol to run on an AWS instance. (I'm aware that these scripts are not user-friendly right now, sorry; working on that. My focus thus far has been on getting them to be functional.)
- download raw reads from NCBI
- trim raw reads, check quality
- digital normalization with khmer
- de novo transcriptome assembly with Trinity
- compare new assemblies to existing assemblies done by NCGR
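As a rough sketch, the first four steps above map onto one command per tool for each sample. The file names and parameters below are hypothetical placeholders, not the exact ones in the repo scripts:

```python
def pipeline_cmds(run_id):
    """Illustrative commands for one SRA run.

    File names, output names, and parameter values are hypothetical
    placeholders, not copied from the actual pipeline scripts.
    """
    left = f"{run_id}_1.fastq.gz"
    right = f"{run_id}_2.fastq.gz"
    return [
        # 1. download raw reads from NCBI SRA
        ["fastq-dump", "--split-files", "--gzip", run_id],
        # 2. adapter/quality trimming with Trimmomatic
        ["trimmomatic", "PE", left, right,
         "left.trim.fq.gz", "left.orphans.fq.gz",
         "right.trim.fq.gz", "right.orphans.fq.gz",
         "LEADING:2", "TRAILING:2", "MINLEN:25"],
        # 3. digital normalization with khmer (paired, interleaved input)
        ["normalize-by-median.py", "-p", "-k", "20", "-C", "20",
         "interleaved.trim.fq.gz"],
        # 4. de novo assembly with Trinity
        ["Trinity", "--seqType", "fq",
         "--left", "left.trim.fq.gz", "--right", "right.trim.fq.gz"],
    ]
```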
The script getdata.py takes a metadata file downloaded from NCBI (SraRunInfo.csv); see the screenshot below for how to obtain this file:
The metadata file contains info such as run (ID), download_path, ScientificName, and SampleName. These are fed into a simple Python dictionary data structure, which allows for looping and indexing to easily access and run individual processes on these files in an automated and high-throughput way (subset of the dictionary data structure shown below):
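As a minimal sketch of how such a dictionary can be built with Python's csv module (the column names match SraRunInfo.csv; the helper name and example values are mine, not from the repo):

```python
import csv

def load_run_info(path):
    """Map each SRA run id to the metadata fields the pipeline needs.

    Expects an SraRunInfo.csv file with (at least) the columns
    Run, download_path, ScientificName, and SampleName.
    """
    runs = {}
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            runs[row["Run"]] = {
                "download_path": row["download_path"],
                "ScientificName": row["ScientificName"],
                "SampleName": row["SampleName"],
            }
    return runs
```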
Each subsequent script (trim_qc.py, diginorm_mmetsp.py, assembly.py, report.py, salmon.py) uses this dictionary structure to loop through the samples and run commands for different software (Trimmomatic, FastQC, khmer, Trinity, etc.). Assemblies were done separately for each sample, regardless of how samples were named. This way, we will be able to see how closely assemblies cluster together or apart, agnostic of scientific naming.
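Each of those scripts shares the same basic skeleton; a hedged sketch, with make_cmd standing in for whatever tool a given script wraps:

```python
import subprocess

def run_all(runs, make_cmd):
    """Loop over every sample and run one tool command per sample.

    runs: dict mapping SRA run id -> metadata (as built from SraRunInfo.csv)
    make_cmd: function taking a run id and returning a command list
    """
    for run_id in sorted(runs):
        cmd = make_cmd(run_id)
        print("running:", " ".join(cmd))
        subprocess.check_call(cmd)
```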
There have been several challenges so far in working with this public data set.
It might seem simple in retrospect, but it actually took me a long time to figure out how to grab the sequencing files, what to call them, and how to connect sample names with their id codes. The SraRunInfo.csv file available from NCBI helped us translate SRA ids to MMETSP ids and scientific names, but figuring this out required some poking around and emailing people.
Second, for anyone in the future who is in charge of naming samples: small deviations from a naming convention, e.g. an extra “_” after the sample name, can mess up automated scripts. For example, ids like MMETSP0251_2 had to be split with the following lines of code:
```python
mmetsp = line_data[position_mmetsp]
test_mmetsp = mmetsp.split("_")
if len(test_mmetsp) > 1:
    mmetsp_id = test_mmetsp
```
Resulting in this:
```
['MMETSP0251', '2']
['MMETSP0229', '2']
```
Then I grabbed the first entry of the list so that these ids looked like the rest of the MMETSP ids, without the “_”. Not really a big deal, but it created a separate problem that required some figuring out. My advice: pick one naming convention, then name all of the files with exactly the same structure.
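In code, that fix amounts to a one-liner (the helper name is mine):

```python
def clean_mmetsp_id(raw_id):
    """Normalize sample ids like 'MMETSP0251_2' to 'MMETSP0251'.

    Ids without an underscore pass through unchanged.
    """
    return raw_id.split("_")[0]
```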
Lastly, several of the records in the SraRunInfo.csv were not found on the NCBI server, which required emailing with SRA.
The people affiliated with SRA who responded were incredibly helpful and restored the links.
I used the transrate software for de novo transcriptome assembly quality analysis to compare our assemblies with the NCGR assemblies (*.cds.fa.gz files). Below are frequency distributions of the proportion of reference contigs with a Conditional Reciprocal Best-hits BLAST (CRBB) hit, described in Aubry et al. 2014. The left histogram shows our “DIB” contigs compared to the NCGR assemblies as reference; on the right, NCGR contigs compared to DIB assemblies as reference. This suggests that we have assembled almost everything in their assemblies, plus some extra stuff!
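For context, each reference-based transrate comparison boils down to one command per pair of assemblies. A sketch that builds such a command (the file names are placeholders, and the helper is mine):

```python
def transrate_cmd(assembly, reference, out_dir):
    """Build a transrate command comparing one assembly against a
    reference assembly (which yields the CRBB-based reference metrics).

    assembly, reference, out_dir: placeholder paths, not real files.
    """
    return ["transrate",
            "--assembly", assembly,
            "--reference", reference,
            "--output", out_dir]
```

Run once with DIB contigs as the assembly and NCGR as the reference, then again with the roles swapped, to get both histograms above.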
Here is the same metric shown in a different way:
We’re not sure whether the extra stuff we’ve assembled is real, so we plan to ground truth a few assemblies with a few reference genomes to see.
For further exploration of these contig metrics, here are some static notebooks:
If you would like to explore these data yourself, here is an interactive binder link that lets you run the actual graph production (thanks to Titus for creating this link!):
Based on these findings, several questions have come up about de novo transcriptome assemblies. Why do different pipelines produce different results? Are these differences significant? What can this tell us about other de novo transcriptome assemblies? Having distributions of quality metrics across many assemblies is an unusual situation: usually, assemblies are done by one pipeline for just one species at a time, so means and standard deviations are not available. More and more new de novo transcriptome assemblies are being done and published by different groups worldwide for species x, y, z, yet evaluating the quality of these assemblies is not straightforward. Is it worth developing a prioritized set of metrics that would allow any de novo assembly to be evaluated in a standard way?
Moving forward, the plan is to:
- keep working on this assembly evaluation problem,
- assemble the rest of the samples in this data set,
- make these data and pipeline scripts more user-friendly and available,
- standardize annotations across species to enable meaningful cross-species analyses and comparisons.
Questions for the Community
What analyses would be useful to you?
What processed files are useful to you? (assembly fasta files, trimmed reads, diginorm reads)
The idea is to make this data-intensive analysis process easier for the consortium of PIs who put these data together, so that discoveries can be proportional to the data collected. In doing so, we want to make sure we're on the right track. If you're reading this and are interested in using these data, we'd like to hear from you!
Also, the cat: