Today’s adventures in Venn Diagrams. Seems like a simple task, right? Two (or more) circles with numbers in the middle. I want to use this visualization graphic to show the number of genes differentially expressed between 2 sets of groups.
Fold-change differences in gene expression were calculated for each of 2 groups, relative to the control.
The question: How many genes are differentially expressed in one comparison vs. the other? Which genes do the two comparisons have in common?
Since these are three groups, two of which are being compared to a control, it is helpful to know what genes are expressed in common between the two treatments and which are different. This way, the researchers can narrow in on what those same or different genes are doing, then target treatment, therapy, or possibly use a suite of these as biomarkers for diagnosis.
I’m using R. How much I dislike R is a topic for another post.
Venn diagrams are a bit arbitrary, but they can give you a ballpark idea. This is just a methodical process of filtering, and the cutoff for what counts as “differentially expressed” depends on the results. What is a reasonable cutoff? We’re aiming for significant, so a p-value cutoff (alpha) of 0.05 is usually used: if there were really no difference, a result at least this extreme would turn up by chance less than 5% of the time. We also have a q value, which is an adjusted p value and more stringent. (The mathematical details of these particular tests are for another day’s post.) Since we have tens of thousands of genes, using q<0.25 narrows our results down to lists with hundreds of “significantly” different genes, which makes the task of identifying the functions of these genes more manageable.

Because the cutoff is something we assign, the groups end up with slightly different numbers of significantly different genes compared to the control, and a different cutoff would shift the counts in the diagram. Still, narrowing tens of thousands of genes down to hundreds shows that there are differences and identifies genes to investigate further.
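To make the filtering step concrete, here is a minimal sketch on a made-up results table. The column names, the p values, and the choice of the Benjamini-Hochberg adjustment are all assumptions for illustration; the adjustment behind the q values in this analysis may be a different one.

```r
# Toy results table; gene names and p values are invented for illustration.
results <- data.frame(
  gene = c("geneA", "geneB", "geneC", "geneD", "geneE"),
  p    = c(0.001, 0.02, 0.04, 0.30, 0.80),
  stringsAsFactors = FALSE
)

# Adjust the p values for multiple testing.
# Benjamini-Hochberg is an assumption here, not necessarily the method used above.
results$q <- p.adjust(results$p, method = "BH")

# Keep only the genes that pass the chosen cutoff.
q_cutoff  <- 0.25
sig_genes <- results$gene[results$q < q_cutoff]
print(sig_genes)
```

Changing `q_cutoff` changes the length of `sig_genes`, which is exactly why the counts in the final diagram depend on the cutoff.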
It would be more interesting if we had more comparisons to make, but we are only making 2 today.

The first challenge to making graphics in R, or doing anything in R really, is getting your data in the right format. There are data.frames, matrices, tables, vectors, etc. Column names have to match up between matrices being compared, otherwise you are comparing the wrong values.

There is a general graphics package in R called ‘gplots’. This will give you all kinds of good graphing tools, such as histograms, heatmaps, plots, and Venn diagrams. This was the most logical choice to look up when setting out to make a Venn diagram. In doing an internet search, there seem to be quite a few Venn diagram packages, introducing color, ease of coding, and other features. At first I was hung up on wanting color. After realizing how complicated this is, I became less interested in this feature. The limma Bioconductor package also has a function for the Venn diagram. Also, no color.

The Venn diagram is just counts, so the basic idea is that you have to create an array with 0 or 1 for each gene for each comparison to indicate whether the gene is present or absent in that group. These are the basic steps:
1. Use the ‘union’ function in R to get a list of all unique genes that appear in either group.
2. For each group, make a list of all 0 the same length as the union list.
3. Match the differentially-expressed gene list against the union list and set the matching positions in the list of all 0 from step 2 to 1. (This will give you a list of 0 and 1, depending on whether the gene was present or not.) Do this for all groups.
4. cbind the groups together.
5. Make the Venn diagram of counts.
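Before the real data, here is the same recipe on a toy example using only base R. The gene names are invented, and `%in%` is used as a shortcut for the fill-with-0-then-match steps; it is a sketch of the idea, not the code I ran on the actual lists.

```r
# Toy gene lists; the names are invented for illustration.
list_a <- c("gene1", "gene2", "gene3")
list_b <- c("gene2", "gene3", "gene4", "gene5")

# Every unique gene that appears in either list.
all_genes <- union(list_a, list_b)

# %in% gives TRUE/FALSE per gene; as.integer turns that into 1/0.
in_a <- as.integer(all_genes %in% list_a)
in_b <- as.integer(all_genes %in% list_b)

# One row per gene, one column per comparison.
membership <- cbind(A = in_a, B = in_b)
rownames(membership) <- all_genes
print(membership)
```

A matrix of this shape (genes in rows, 0/1 columns per comparison) is what gets counted up for the diagram.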
```r
# with help from:
# https://insilicodb.com/compare-deg-signatures/
# http://cran.r-project.org/web/packages/colorfulVennPlot/colorfulVennPlot.pdf
library(limma)
library(gplots)

# Read the two lists of significant genes
raw_ApoE_WT <- read.csv("sig_genes_WT_ApoE.csv", sep = ",")
raw_DKO_WT <- read.csv("sig_genes_WT_DKO.csv", sep = ",")
gene.ApoE_WT <- raw_ApoE_WT$gene
gene.DKO_WT <- raw_DKO_WT$gene

# All unique genes found in either list
gene_union <- union(gene.ApoE_WT, gene.DKO_WT)

# One 0/1 indicator per gene in the union, for each comparison
ApoE_WT <- array(0, dim = c(length(gene_union)))
DKO_WT <- array(0, dim = c(length(gene_union)))
ApoE_WT[match(gene.ApoE_WT, gene_union)] <- 1
DKO_WT[match(gene.DKO_WT, gene_union)] <- 1

# limma package vennDiagram
venncounts_all <- cbind(ApoE_WT, DKO_WT)
venncounts <- vennCounts(venncounts_all)
vennDiagram(venncounts)

# gplots venn
combined <- cbind(ApoE_WT, DKO_WT)
combined_data.frame <- as.data.frame(combined)
venn(combined_data.frame)
```
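The diagram only shows counts, so to see which genes are actually shared, and to double-check the numbers in each region, base R's set functions are enough. This is a sketch with invented stand-in values for the two gene vectors; with the real data you would skip the first two assignments and use the vectors read from the CSV files.

```r
# Stand-ins for the two gene vectors; these values are invented for illustration.
gene.ApoE_WT <- c("geneA", "geneB", "geneC", "geneD")
gene.DKO_WT  <- c("geneB", "geneD", "geneE")

shared    <- intersect(gene.ApoE_WT, gene.DKO_WT)  # in both comparisons
only_ApoE <- setdiff(gene.ApoE_WT, gene.DKO_WT)    # only in the first list
only_DKO  <- setdiff(gene.DKO_WT, gene.ApoE_WT)    # only in the second list

# These three counts should agree with the numbers in the Venn diagram regions.
region_counts <- c(ApoE_only = length(only_ApoE),
                   both      = length(shared),
                   DKO_only  = length(only_DKO))
print(region_counts)
print(shared)
```

If the counts here disagree with the plotted ones, duplicated gene names in the input lists are a likely place to look first, since the set functions drop duplicates.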
My colleague was disappointed that I couldn’t get colorful Venn diagrams, so this will be a project for another day when I have more time to spend on this. The two packages that seem promising are venneuler and colorfulVennPlot. Here are some helpful references for them: