Erich Schwarz, talk on conserved unknowns

UC Davis, Genome Center
9/17/2015
“Using C. elegans to discover functions of conserved unknown human genes.”
http://ivory.idyll.org/lab/talks/2015-09-17-conserved-unknowns.html

Titus: Erich is “not content with just a description of a genome”, goes deep into the conservation of gene families

How much of the human genome is deeply conserved across metazoans? Are there genes with unknown functions that are doing something important that we’re not aware of?

  • What genes are conserved?
  • What genes are unknown?

We need practical definitions of both sets to answer these questions.

******* Evolution is giving us a glimpse of what is important.

Summary of literature going into this (summarized on one slide, not many studies).

Pandey et al. 2014, the “Ignorome”: genes that show quantitatively significant activity but are not well studied. Genes are well studied when they have more papers and more people are studying them! (basically no relationship between how well studied a gene is and its biological importance…)

Categories of ways to characterize gene functions (no perfect method):

  • sequence similarity
  • guilt by association
  • metabolic modeling (orphan enzymes)
  • chemical proteomics

(bold to indicate importance in future as technologies become easier)

C. elegans as model metazoan. Under 1,000 cells.

Methods for characterizing unknowns:

Used existing human-worm homolog sets: PFAM (domain oriented), PantherDB (precomputed mappings, HMM adaptability), TreeFam (trees)

Surprising findings with deep divergences.

Gene nomenclature is erratic and hard to map. Protein databases have unique and stable protein IDs. “You would think genes would be this easy, but they’re not.” The solution is to get your genes to connect to UniProt.

Quantifying unknowns. Define “known” by the sum of annotation densities per gene (or per family) = annotation density.

Most characterized set of genes is protein kinases. Least is something with no names.
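The notes do not record exactly how annotation density was computed in the talk, so the following is only a minimal sketch of the idea, assuming each gene has some count of annotations and that a family's density is the plain sum over its genes. All gene names, family names, and counts here are hypothetical.

```python
from collections import defaultdict

# Hypothetical annotation counts per gene (made-up values for illustration).
annotation_counts = {
    "kinase_A": 120,
    "kinase_B": 85,
    "unknown_1": 0,
    "unknown_2": 1,
}

# Hypothetical gene -> family assignments.
gene_family = {
    "kinase_A": "protein_kinase",
    "kinase_B": "protein_kinase",
    "unknown_1": "family_X",
    "unknown_2": "family_X",
}


def family_annotation_density(counts, families):
    """Sum per-gene annotation counts within each family (assumed metric)."""
    totals = defaultdict(int)
    for gene, family in families.items():
        totals[family] += counts.get(gene, 0)
    return dict(totals)


def rank_least_known(density):
    """Return family names ordered from least to most annotated."""
    return sorted(density, key=density.get)


density = family_annotation_density(annotation_counts, gene_family)
least_known_first = rank_least_known(density)
print(density)
print(least_known_first)
```

Families at the front of `least_known_first` would be the candidates for "conserved unknowns", provided they are also conserved across species.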

If there is an unknown in humans, the trick is to see whether there are any mentions of it in any other organisms.

Provides richness and expansion to what people think they already know.

“Cinderella” genes start out in obscurity, then become famous.

They’re there. They’ve been there since the Cambrian. What are they??????????

Found ~30 gene families in humans conserved and unknown. What are these proteins?

TM sequence (transmembrane), ligands, cell surface, coiled-coils, etc.

Expression patterns scattered across worm. Specific phenotypes elicit specific unknowns.

Behavioral and other assays to discover functions of some unknowns in C. elegans by knocking out unknown genes. Striking phenotypes vs. hours of exposure. Could knock in human sequence to see if phenotype is rescued?

Discussion about using conserved unknown methods, “annotation density” metric when annotating new nonmodel organism genome.

Posted in talks | Leave a comment

Week 3 – NGS2015

Week 3 of NGS 2015 comes to a close. Discussion on what attendees and instructors have gained from this week: most important things learned, theory/big picture. Teaching and pedagogy especially. We went through 10 different workflows in 1 week! Analyzed complex data sets. This is a very new field. These methods are in development. Instructors are at the top of their fields, sharing methods for data analysis. Always something new to learn. Impossible to be expert in everything. But, surrounding ourselves with a community of people who are experts, learning from everyone. Reproducibility, automating workflows that lead to figure generation are important.

Cool teaching tools learned:

https://kahoot.it/#/
Google poll, live coding, and a Google Docs spreadsheet graph updated with input from students, from Lex Nederbragt’s assembly lesson.
Data wrangling from Tiffany Timbers’ lesson on GWAS
https://www.flickr.com/photos/lpcohen/20330634694/in/dateposted-public/

Thank you! (power pose)

[Image: group2]


GitHub, Pull Requests, and ReadTheDocs – NGS2015

“We’re good enough and deserve github.”

https://www.flickr.com/photos/lpcohen/20330046134/

Dr. C. Titus Brown shares methods for using readthedocs, which he uses for classes including NGS2015, as well as GitHub, forking, and pull requests. Sphinx is Python-based; readthedocs is a web-based method for publishing stuff written with Sphinx. Learning goals go at the top of lesson pages. We’re going to go through the steps on the web all together. Screenshots become out of date too quickly.

http://angus.readthedocs.org/en/2015/week3/CTB_github_editing.html

Readthedocs will take some version controlled project from somewhere (github or bitbucket) and format it for you. GitHub webhook activated. Readthedocs will sync and automatically rebuild.

[Image: readthedocs]

This is my version of the readthedocs:

http://angus-ljcohen.readthedocs.org/en/stable/

Edit in github, this will update:

http://angus.readthedocs.org/en/2015/week3/merge-demo.html

Forking one repository with groups of people.

https://www.flickr.com/photos/lpcohen/20764676629/in/dateposted-public/

Titus makes changes in file. Pull in changes made in central repository. My repo is behind:

[Image: behind_repo]

Pull requests: One of top useful things Titus has learned! Goal is to keep track of changes, see progression.

Merge the pull request, and now all changes are updated. Click on “compare”; if there are any changes they will be highlighted. Once they are merged, there will not be any more changes. Sometimes you need to switch the head fork and base fork accounts. Branches are very useful for years of courses, versions of software, etc.

[Image: compare_across_forks]

Leigh: What are best practices with a group of pull requests? What if one or a few people are making tons and tons of changes? Should we pull? Master branch. You can ignore their changes until they are ready to be merged. They will tell you when they want a pull request, but they will have to reconcile with the one true branch. The person who is making changes has to deal with everyone else’s changes, so everyone else doesn’t get behind. Software lines of code so “bombs” do not mess up everyone else’s code.

The one true branch.

https://www.flickr.com/photos/lpcohen/20329357194/in/dateposted-public/

Amanda: Where do people who have write access to the one true master branch work? Do they work in their own fork? Those people can make their own fork, then merge and reconcile conflicts with the one true branch. There are 2 commits: one to merge the pull (fetch) from the original, and a second to put the change in.

Now, we make changes to files. Add names to attendees list:

https://github.com/ljcohen/angus/blob/2015/week3.rst

Now compare.

[Images: compare_changes, commit_code]

Can search for commits, issues, and pull requests associated with this code.

Merge conflicts occur when the computer can’t resolve the differences on its own.

Win!

[Image: git_commit]


Differential expression and dosage compensation in RNAseq – NGS2015

Dr. Chris Hamm, University of Kansas, side effects of sexual reproduction in Lepidoptera!

https://www.flickr.com/photos/lpcohen/20920007072/in/dateposted-public/

Explanation of ESA, species status dependent # of self-sustaining populations, Mitchell’s satyr butterfly endangered. Suggests Butterfly People.

K-means cluster plots, each vertical bar individual. Michigan different than other populations.

Diversification of Lepidoptera, Mutanen et al. 2010:

https://www.flickr.com/photos/lpcohen/20307587874/

Evolution of sex chromosomes: female heterogamety, “chock-a-block full of transposable elements”. Sex chromosomes present a gene dosage problem: no matter how many chromosomes you have, you still have the same gene expression. Female mammals inactivate the extra X chromosome. Drosophila males double expression to deal with the balance.

Tutorial

https://angus.readthedocs.org/en/2015/_static/SLDC-code.html
https://github.com/ngs-docs/angus/blob/2015/week3/SLDC/SLDC-code.Rmd

Simulated Data

[Image: plot_Code]

MA

Practice Data

Bmori.dat <- read.csv("Bmori-data.csv", header = TRUE)

[Image: 4plots]

Plotting in R – Awesome Digression!

Multiple plots on one space, add plots, text, arrows, whatever on top with base R.
Reset with:

par(mfrow=c(1,1))

RPKM
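As a reminder of what RPKM measures (reads per kilobase of transcript per million mapped reads), here is a small sketch of the standard formula. It is written in Python rather than the tutorial's R, and the function name and example numbers are illustrative assumptions, not taken from the tutorial code.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads.

    RPKM = reads * 1e9 / (gene length in bp * total mapped reads)
    """
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical example: 100 reads on a 1,000 bp gene in a library
# of 1,000,000 mapped reads.
example = rpkm(100, 1000, 1_000_000)
print(example)
```

Dividing by gene length and library size is what makes values comparable between genes and between samples.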

Looking at expression per chromosome:

[Images: expr, zchrom]
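One common way to summarize expression per chromosome when thinking about dosage compensation is the ratio of median Z-linked expression to median autosomal expression. The sketch below is in Python, not the tutorial's R, and is not the tutorial's code; the chromosome labels and expression values are hypothetical. Roughly, a ratio near 1 would be consistent with compensation and a ratio near 0.5 with its absence, though real data need more careful treatment.

```python
from statistics import median

# Hypothetical (chromosome, expression) records for one sample.
records = [
    ("Z", 5.0), ("Z", 10.0), ("Z", 15.0),
    ("chr1", 10.0), ("chr2", 20.0), ("chr3", 30.0),
]


def z_to_autosome_ratio(records, z_name="Z"):
    """Median Z-linked expression divided by median autosomal expression."""
    z_values = [expr for chrom, expr in records if chrom == z_name]
    autosomal = [expr for chrom, expr in records if chrom != z_name]
    return median(z_values) / median(autosomal)


ratio = z_to_autosome_ratio(records)
print(ratio)
```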

MA plots

http://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/limma/html/plotma.html
I posted a question on the Bioconductor listserv:

https://support.bioconductor.org/p/71562/
http://nar.oxfordjournals.org/content/43/7/e47

limma::plotMA

[Image: weird_MAplot]

plot(log2(res1_filtered$baseMean), res1_filtered$log2FoldChange, col=ifelse(res1_filtered$padj < 0.25, "red", "gray67"), main="In vivo, ApoE vs. WT (padj<0.25)", xlim=c(1,20), ylim=c(-10,10), pch=20, cex=1)
abline(h=c(-1, 1), col="blue")
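For reference, the classic definition behind an MA plot is M (the log2 ratio of two expression values) against A (the average of their log2 values). The R call above plots log2 of the mean expression on the x axis and log2 fold change on the y axis, which is a close variant. The Python sketch below only illustrates the classic M and A quantities; the gene names and counts are hypothetical, and counts must be positive for the logarithm to be defined.

```python
import math


def ma_values(count_a, count_b):
    """Return (M, A) for two positive expression values.

    M = log2(a) - log2(b)        (log fold change)
    A = (log2(a) + log2(b)) / 2  (mean log expression)
    """
    log_a = math.log2(count_a)
    log_b = math.log2(count_b)
    return log_a - log_b, (log_a + log_b) / 2


# Hypothetical counts for two genes in two conditions.
pairs = {"gene1": (8, 2), "gene2": (16, 16)}
ma = {gene: ma_values(a, b) for gene, (a, b) in pairs.items()}
print(ma)
```

A gene with M = 0 sits on the horizontal midline; the blue lines in the plot above mark a two-fold change in either direction (M = 1 and M = -1).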

[Image: MAplot_invivo_DKOvWT_June2015]


Reproducibility with AWS – NGS2015

Leigh Sheneman, PhD student at MSU in CS. Evolution and learning with digital organisms, applying to real world organisms in the future!

http://angus.readthedocs.org/en/2015/week3/AWS-tips.html

https://www.flickr.com/photos/lpcohen/20739813479/

Start EC2, medium-sized m3.medium is fine. Log in, update and install stuff. We need the packages for the software we will run with the eel-pond protocols: https://khmer-protocols.readthedocs.org/en/v0.8.4/

Discussion about an interesting package name:

apt-get -y install libncurses5-dev

Titus: text window graphics from 70s games, likely needed for samtools tview? Everyone: Ahhhhh (understanding)

We’re going to make public AMI, for times if we wanted to share and distribute to colleagues for collaboration.

Go to EC2 console.

[Images: create_image, create, AMI]

An AMI is an efficient way to capture the OS and software: you can terminate the instance, keep the AMI, and get charged only a fraction of the cost (about $0.10 per GB per month) rather than keeping the instance running. A snapshot is for a volume of data rather than an image, which is the OS filesystem.
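To make the cost comparison concrete, here is a rough back-of-the-envelope sketch. Only the $0.10 per GB per month storage rate comes from these notes; the image size and the hourly instance price are hypothetical placeholders, so check current AWS pricing before relying on any of these numbers.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


def monthly_storage_cost(size_gb, rate_per_gb_month=0.10):
    """Monthly cost of storing an image or snapshot of a given size."""
    return size_gb * rate_per_gb_month


def monthly_instance_cost(hourly_price, hours=HOURS_PER_MONTH):
    """Monthly cost of leaving an instance running all the time."""
    return hourly_price * hours


ami_cost = monthly_storage_cost(8)          # hypothetical 8 GB image
running_cost = monthly_instance_cost(0.07)  # hypothetical hourly price
print(f"stored image: ${ami_cost:.2f}/month, running instance: ${running_cost:.2f}/month")
```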

Change permissions so you are not the only owner, since we want to make it public.

[Images: public_images, lookatpublic]

Takes some time to make this public. So, wait a bit before sharing AMI-ID.

Important, this image was created in the ‘N. Virginia’ region. This image is only visible in the ‘N. Virginia’ region. There are other ways to share between regions.

Class discussion about costs for hosting images and sharing images associated with publications. Who pays? If reviewers of papers will need the images, how does that work? It is easy to share data and software associated with analyses for studies. We can provide all the instructions and data and software we want. But no one has figured out a realistic and sustainable management framework for computing resources for scientific studies.

Reproducibility is of concern, but there are no incentives for scientists to provide data and transparent analyses via methods like AWS AMI to demonstrate reproducibility. If this were required for publication, there would likely be more funding resources available and everyone would do this instead of a select few. Now, people can provide stuff like this, but who is really going out and checking other people’s data and code and software, besides reviewers and a few colleagues?

Create Volume

Make sure the availability zone (e.g. us-east-1e) matches the instance. If not, pull down menu and select:

[Images: public_ami, volume, volume_avail]

Then attach a new 100GB volume to the instance. Log out of ssh, then log back in. Run the commands to format and mount the disk:

[Image: mount]

In the above list /dev/xvda1 is system disk, we attached /dev/xvdf

See the Amazon Web Services manual for Elastic Compute Cloud (EC2): AMI, Volume, Snapshot, and Instances.

https://www.flickr.com/photos/lpcohen/20738536090/

If creating an image for someone else, you would do the above where we took an image of an OS and a snapshot of a volume.

https://www.flickr.com/photos/lpcohen/20738795578/in/dateposted-public/

Now, (power pose) we will load someone else’s snapshot (it’s really our snapshot, but same idea). First, we have to Launch an AMI instance, m3.medium is fine:

[Image: createAMI]

Then, create a volume from the snapshot to add to the running instance.

[Image: create_snapshot_thenvolume]

The volume is available to attach:

[Image: volumes]

Under “Actions”, attach volume and select the running instance (should pop up once you start typing).

Log in, then mount volume (do not format new volume because this contains the data!), and it is there!!

[Image: mount_xvdf]

Creating a bucket to share files, S3

If you wanted to host files for others to download, $0.10/GB per month.

[Image: S3]

Then, you can get the link for people to download:

curl -O https://s3.amazonaws.com/lisangs2015/bigwig.py