Pathway Analysis for RNAseq Data – NGS2015

Presentation by Dr. Asela Wijeratne, Ohio State University at Molecular and Celluar Imaging Center. Research interests in co-expression networks during pathogen attack in agricultural plant species to identify regulatory genes.

Today: Different approaches for pathway analysis, statistical approaches, example using GAGE, challenges for pathway analysis, gene set enrichment analysis.

Pathways: metabolic (e.g. Krebs), gene regulatory, signal transduction, etc.

Why do we do pathway analysis? What are activated, deactivated under certain conditions, how mutations affect? Functional descriptions of genes. Measure molecular responses somehow. Connect to statistical test, network reconstruction. Set of functions.

How do we get functional annotations? Gene ontology terms grouped by functional terms. GO, KEGG, BioCarta databases, transcription targets (tools) Great for model organisms but not great for nonmodel. Each term describes. Directed acyclic graph (DAG) du Plessis L et al. Brief Bioinform. 2001 to construct database of terms that represent functions for purposes and connectivity. Controlled by GO consortium with vocabulary for these terms. Examples annotation is given with evidence. Experimental evidence codes, computational analysis, author statement, curator statement, automatically-assigned evidence codes inferred from electronic annotation (be careful), obsolete evidence codes.

This is what I think is really interesting:

Just because you have a functional annotation, doesn’t mean it is real! Model systems have these annotations. Newly assembled transcriptomes and genomes have inferred annotations. Problems: KEGG pathways, not flexible to nonmodel species. User working with species with metabolic pathways different than species already in databases cannot re-draw. Evidence codes can be automatically-assigned.

For purposes of learning today, assume you are working with a model system, e.g. human or mouse. Which categories are the most common in databases you are using?

After RNAseq analysis, I have the gene list. Now what? Over or under-represented GO terms? Discuss.

Problem of small size. How to decide what is over- or under-represented? For example, hypergeometric test: 100/500 genes involved in cell cycle. 25/500 involved in cell cycle. Is cell cycle enriched? Use r function phyper(25,500,4500,100,lower.tail=FALSE) = 3.9e-06 Correct for multiple hypothesis testing: probability of saying something is being differentially expressed when it is not. Bonferroni (Google ‘Bonferroni), more stringent and (FDR) Benjamini-Hochberg method can be used, although FDR is less stringent. Rank list of GO categories by P-values. Then adjust (multiply) by total number of categories/rank to get FDR.

Functional Class Scoring (FCS) assumes coordinated changes in related set of genes can have signifiant effect. No threshold. Analysis define the “gene set” Enrichment is for an entire set. GSEA, for model species.

We will use the GAGE package to see how it works.

Flexible, generally applicable to nonmodel species. Separate gene sets into two categories, two different entities. For example, curated gene set and pathway information, e.g. KEGG and combine. Can use for microarray and RNAseq. 1-on-1 comparison, e.g. Control 1 to Exp 1, Control 1 to Exp 2, etc. Summarize and meta-test, correct for FDR. Fold change for each comparison and get global mean, also get global mean for entire set. Limitations of FCS: treats pathways independently, etc.

Pathway Topology (PT) based approaches: provides additional information such as gene interactions, how they interact, location of the interaction.

Example tools, packages used in R: SPIA, NetGSA, ScorePAGE

Discuss challenges: hard to compare different methods. Lacking benchmark datasets to put in all packages to see how they do. Inability to model dynamic responses, e.g. time-series. Inability to model effects of external stimuli. Main challenge, however is incomplete knowledge base for annotations. Often no annotations for all transcripts. Or inaccurate annotations. More GO annotations are inferred electronic annotations. Missing conditions, cell-specific information. Divergence in genes, still associated with pathway. Not known whether still performs same function. Khatri et al. 2012 paper, 10 years of challenges:


Install packages, see notes on Etherpad:


Here are Asela’s tutorial instructions:

From commandline:

mkdir pathway_analysis
cd pathway_analysis/
git clone

In RSTudio, set working directory, download dataset and import from current directory:

rnaseq_URL <- ""
download.file(rnaseq_URL, destfile = "rnaseq_data.txt")


Find short code for oraganism. Here it is:

korg[species, 1, drop = F]

If you need to find code for another organism, go here:

If your organism is not on the list of >3,000 species, do KEGG ortholog mapping. (instructions to follow)

Gene identifiers from kegg:


There are two ways to use GAGE:
1. Native: normalization/variance stabilization >> GAGE
2. Combine: EdgeR >> GAGE

We will do “Native”.

1. Remove zero counts:


2. Normalization library counts


3. Variance stabilization transformation, 8 is added from KEGG manual, to avoid -inf during log

plot MDS and MA plot to see processed data variances. Here is MDS plot, MA plot would not work for me.



4. Define reference and experiment samples
5. Enrichment analysis
6. Some outputs,


Quickly went through edgeR to combine with GAGE:

1. Make the study design
2. Create DGElist object
3. Remove lowly expressed genes
4. Normalize the data
5. multidimentional scaling plot
6. Estimate the dispersion
7. Differential gene expression
8. Making the data suitable for pathway analysis
9. GAGE analysis

Overrepresentation analysis with GAGE should include all normalized genes, not just the differentially expressed ones. GAGE determines pathways that are significant. Individual genes that are differentially expressed come from the edgeR (or DESeq2, limma, etc. packages)

Visualization is Key:

From last command in tutorial, here are genes differentially expressed in DNA replication pathway. The pathway was determined by GAGE. There are 2 genes differentially expressed (identified by edgeR) in this pathway. This gives biological meaning to the differential gene expression analysis. Not only were the genes differentially expressed, but this pathway was significantly enriched.



About Lisa Johnson

PhD candidate at UC Davis in Molecular, Cellular, and Integrative Physiology
This entry was posted in Bioinformatics, Genomics Workshop. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s