Dr. Ryan Williams, postdoc at Iowa State leads tutorial on R visualizations with multivariate statistical approaches for RNAseq data. Traditionally, multivariate stats approaches used for community ecology.
Following are my patchy notes from the tutorial. Please feel free to comment and correct!
The tutorial, part 1:
Using RStudio, load dataset from URL.
sudo apt-get install libcurl4-openssl-dev
Head of data, transcript ID from htseq-count:
A major assumption of MANOVA is that variables within each treatment level come from a multivariate normal distribution.
What’s the difference between a multivariate normal distribution and just a normal distribution? Generalization of one-dimensional (univariate) normal distribution to higher dimensions. Setup linear model first for univariate data, then set nested functions together.
Look at dataset transcript counts. Lots of zeros. If we ran the model, it might crash and give errors because of all these zeros. Let’s look at one transcript.
# just insert one of the transcripts, # color by fly type "fly" # set alpha to 0.5 ggplot(dataset)+geom_density(aes(x=FBgn0000022,fill=fly,alpha=0.5))
Data are not normally distributed. Y axis is proportion of data. The above graphic looks different with the same data if you look at it with histogram function:
Bins of counts for this particular transcript. Change to another transcript:
See info on ggplot graphics in References section below. First, add data to ggplot function. Then add ‘aes’ layers. The ‘aes’ means aesthetic which adds features to the graphic.
Run a MANOVA demonstrating Wilk’s lambda test, multivariate generalization of F-distribution.
Permanova, permutations of F stastistic. ?adonis Think of data like cloud, shuffles data around, gets p values, asks how often data are close within distance of alpha.
Tutorial, part 2:
Samples are unlabeled, can we figure out what groups they belong to based on similarity?
Difference between MDS and PCA: With your cloud of data, rotate on axis – orthogonal axis 2nd, look at 3rd, 4th, 5th, etc. PCA looks at all coordinates possible. MDS collapses to smaller, user-defined dimensions, e.g. k=3 or k=2. Can play with this to see how stress statistical values can be minimized.
Here is MDS:
pca<-capscale(dataset[,-c(1:4)]~1) pca eigs<-eigenvals(pca) eigs/sum(eigs) eigenvals(pca) sum(eigs) sc<-data.frame(scores(pca)$sites,dataset[,1:4]) ggplot(sc)+geom_point(aes(x=MDS1,y=MDS2,colour=info,shape=type))+labs(x="PC1 (75.9% of variation explained)",y="PC2 (5.8% of variation explained)")
This is how you get the first 2 eigenvalues for PC1 and PC2:
pcoa<-capscale(decostand(dataset[,-c(1:4)],"total")~dataset$fly,distance="bray") scores(pcoa)$centroids ggplot(sc)+geom_point(aes(x=MDS1,y=MDS2,colour=info,shape=type))+labs(x="MDS1 (33.0% of variation explained)",y="MDS1 (8.2% of variation explained)")+annotate("text",x=c(-.157,-.077,.17),y=c(-.004,.670,.156),label=c("HYB","ORE","SAM"))
Hardcoded coordinates of text of group names from pcoa$centroids:
What parts of data are responsible for underlying differences? How to see how screwed up data are? If we only look at certain genes in NGS data example, how does that change variation? Can use this data? Which treatments interested in? Pairing tests appropriately with visualizations. Adonis package can rank based on ‘simper’ within adonis or ‘anosim’ or ‘envfit’ package to give explanatory components to see what is contributing to your PCA. Try ‘taxdist’ or ‘unifrac’, to look at taxonomic distances to add in weighting variables. See what happens when you chop off, reduce. Same pattern? Now we have these clusters. Can we figure out why our data is noisy, useless or useful?
Grammar of graphics by author of ggplot