Multivariate Tests with NGS Data and Visualization in R – Week 3 NGS 2015

Dr. Ryan Williams, postdoc at Iowa State leads tutorial on R visualizations with multivariate statistical approaches for RNAseq data. Traditionally, multivariate stats approaches used for community ecology.

20150825_115542

Following are my patchy notes from the tutorial. Please feel free to comment and correct!

The tutorial, part 1:

https://github.com/ngs-docs/angus/blob/2015/week3/visualizations/multivariate-tests/tests.md

Using RStudio, load dataset from URL.

# Possible solution for ubuntu RCurl
# In the terminal:
sudo apt-get install libcurl4-openssl-dev

Head of data, transcript ID from htseq-count:

head_fly_htseq_data

A major assumption of MANOVA is that variables within each treatment level come from a multivariate normal distribution.

What’s the difference between a multivariate normal distribution and just a normal distribution? Generalization of one-dimensional (univariate) normal distribution to higher dimensions. Setup linear model first for univariate data, then set nested functions together.

https://stat.ethz.ch/R-manual/R-devel/library/stats/html/summary.manova.html

Look at dataset transcript counts. Lots of zeros. If we ran the model, it might crash and give errors because of all these zeros. Let’s look at one transcript.

# just insert one of the transcripts,
# color by fly type "fly"
# set alpha to 0.5
ggplot(dataset)+geom_density(aes(x=FBgn0000022,fill=fly,alpha=0.5))

single_transcript_densityplot_histogram_smoothed_by_fly

Data are not normally distributed. Y axis is proportion of data. The above graphic looks different with the same data if you look at it with histogram function:

ggplot(dataset)+geom_histogram(aes(x=FBgn0000022,fill=fly,alpha=0.5))

histogram

Bins of counts for this particular transcript. Change to another transcript:

ggplot(dataset)+geom_histogram(aes(x=FBgn0000056,fill=fly,alpha=0.5),binwidth=100)

different_transcript

See info on ggplot graphics in References section below. First, add data to ggplot function. Then add ‘aes’ layers. The ‘aes’ means aesthetic which adds features to the graphic.

Run a MANOVA demonstrating Wilk’s lambda test, multivariate generalization of F-distribution.

Permanova, permutations of F stastistic. ?adonis Think of data like cloud, shuffles data around, gets p values, asks how often data are close within distance of alpha.

Challenge:

adonis(decostand(dataset[,-c(1:3)],method="pa")~dataset$fly*dataset$type,method="jaccard")

Tutorial, part 2:

https://github.com/ngs-docs/angus/blob/2015/week3/visualizations/multivariate-viz/visualizations.md

Samples are unlabeled, can we figure out what groups they belong to based on similarity?

Difference between MDS and PCA: With your cloud of data, rotate on axis – orthogonal axis 2nd, look at 3rd, 4th, 5th, etc. PCA looks at all coordinates possible. MDS collapses to smaller, user-defined dimensions, e.g. k=3 or k=2. Can play with this to see how stress statistical values can be minimized.

Here is MDS:

ggplot(sc)+geom_point(aes(x=NMDS1,y=NMDS2,colour=info,shape=type),size=3)

mds
Here is PCA:

pca<-capscale(dataset[,-c(1:4)]~1)
pca
eigs<-eigenvals(pca)
eigs/sum(eigs)
eigenvals(pca)
sum(eigs)
sc<-data.frame(scores(pca)$sites,dataset[,1:4])
ggplot(sc)+geom_point(aes(x=MDS1,y=MDS2,colour=info,shape=type))+labs(x="PC1 (75.9% of variation explained)",y="PC2 (5.8% of variation explained)")


PCA

This is how you get the first 2 eigenvalues for PC1 and PC2:

values_PC1_PC2

PCoA:

pcoa<-capscale(decostand(dataset[,-c(1:4)],"total")~dataset$fly,distance="bray")
scores(pcoa)$centroids
ggplot(sc)+geom_point(aes(x=MDS1,y=MDS2,colour=info,shape=type))+labs(x="MDS1 (33.0% of variation explained)",y="MDS1 (8.2% of variation explained)")+annotate("text",x=c(-.157,-.077,.17),y=c(-.004,.670,.156),label=c("HYB","ORE","SAM"))

Hardcoded coordinates of text of group names from pcoa$centroids:

centroids

PCoA

Discussion:

What parts of data are responsible for underlying differences? How to see how screwed up data are? If we only look at certain genes in NGS data example, how does that change variation? Can use this data? Which treatments interested in? Pairing tests appropriately with visualizations. Adonis package can rank based on ‘simper’ within adonis or ‘anosim’ or ‘envfit’ package to give explanatory components to see what is contributing to your PCA. Try ‘taxdist’ or ‘unifrac’, to look at taxonomic distances to add in weighting variables. See what happens when you chop off, reduce. Same pattern? Now we have these clusters. Can we figure out why our data is noisy, useless or useful?

References

Grammar of graphics by author of ggplot

R cookbook with examples

Stephen Turner’s (UVA) advanced visualization

https://en.wikipedia.org/wiki/Jaccard_index

Advertisements

About Lisa Johnson

PhD candidate at UC Davis.
This entry was posted in Genomics Workshop. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s