GWAS for NGS data – NGS2015

Dr. Tiffany Timbers, postdoc at Simon Fraser University. Became interested in automating and programming from Multiworm tracker, automated image analyses for worm behavioral studies.

2015-08-26_07-36-48

Today we will use a dataset from 2,007 individuals, 700,000 variants, 20,000 genes potentially contain disruptive alleles.

GWAS vs. SNP microarrays. Differences and similarities? Example of question typically asked with GWAS analyses, “Do I rescue the mutant by knocking out a region with significant SNP activity?”

Really cool website, facilitating classroom discussion with student input-typed answers:
Code: F321E8CC
https://b.socrative.com/student/#take-quiz/q/1

Power analyses to consider with GWAS from NGS vs. microarray, difficult to estimate effect sizes. Work needs to be done on how to do a power analysis for this type of study. Haplotype matrix, not a lot of good data available to generate this in non-human populations. Have to come up with own methods for determining robust design. Positives of GWAS for NGS data include being able to call rare variants. Take region with groupings, gene level rather than on variant-level. What are you doing with intergenic data? Very new type of analysis. How important is this? When severe mutations are of interest in isolated population vs. natural population. Indels vs. copy number variants.

Sequence kernel association test (SKAT): linear model
disease/phenotype = alpha + (alpha1)covariants (e.g. age, sex) + (beta)variantsets ( e.g. genes) + epsilon

SKAT R package documentation:
https://cran.r-project.org/web/packages/SKAT/SKAT.pdf

SKAT paper, Wu et al. Am. J. Hum. Genet. 2011:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3135811/pdf/main.pdf

Tutorial

(Proceed with amazing, well-worked out install instructions. All the code instructions for this whole tutorial are great – solid and work. Really appreciate having location instructions at beginning of each code block. 🙂
https://github.com/ttimbers/SKAT_NGS-2015/blob/master/Setup-for-gwas-on-EC2.md

https://github.com/ttimbers/SKAT_NGS-2015/blob/master/NGS_GWAS_via_SKAT.md

Tiffany: “Ignore all warnings, say ‘yes’ to everything, and it will work.” If it doesn’t work, kill the machine, start over. Computers are fun. Tweeting lots of quotes, #ngs2015. 🙂 Future for Docker. Tiffany: dplyr is worth it. Titus: R did something better than Python. See Jenny Bryan, stat545:

https://stat545-ubc.github.io/bit001_dplyr-cheatsheet.html

Let’s look at data. There are some .vcf files, there are a lot. Here are the first 5. we need to put them all together into a massive .vcf file to look at whole population. We’re only going to look at chromosome 4 because it makes the data more manageable for this tutorial.

Here is .vcf format:

first5vcffiles

vcf

vcf_info

We will use vcf-merge, takes a minute:

vcf-merge data/*.vcf.gz | bgzip -c > MMP_vcf-merge.vcf.gz

Here is the merged file, 1.3 M:

merged_vcf

VCF-tools is silly and puts “.” in output. We need “0/0” to move on. Tiffany figured out this awk command to use to change that.

awk

This worked and is amazing:

awk_works

Enthusiasm from class at 8:45pm.

No best practices in field right now. Decisions that follow are investigator-specific. We’re going to only look at coding regions. Throwing out everything, e.g. silent mutations, intergenic regions, etc. With clunky but amazing grep command:

grep_cds

Get data in R:
r_table

Overheard: Don’t have to understand everything as long as the desired output comes out.

Create pheno_file and fam_file:

pheno_file

Join and drop files, rename colnames:
phenotypes

Now association analysis, gene by gene. The more SNPs you have, the longer it takes. Lex: Suggestion Jones et al. 2012 stickleback sliding window approach.

Set seed to 100, so p values are reproducible. Run SKAT.SSD.ALL, but we won’t do that. Open saved data file with output, as this takes too long to perform during class. Compares with null model.

Apply multiple testing correction with library(fdrtool). Sort. Cutoff.

R is cromulent.

SNPs:

SNPs

Shouldn’t just take this information and run with it. Columns 4 and 5 (integers) indicate number of times this SNP ID occurs, some are 2 and 3 in entire dataset. You will have to figure out how many times you will tolerate mutations to call it significant. Plot histogram, what are the minor allele count. Could use cluster and pathway analyses, gene adjacency to define SNP sets. Be careful with GO terms and pathway analyses for SNP analyses. These are pitfalls. Really cool, new software! But, more people need to be talking about this, developing these tools further. And know how it works. Great opportunity with this tutorial to review this workflow.

Frequency of allele counts for # genes:

histo

This is where “p hacking” comes in to determine what is reasonable variance to be in SNP set. No answer.

Throw SSID into R, throw away everything less than 5 variants to keep in data set. From histogram above, all small allele counts (largest number of genes) will be thrown away. Commence R joining plyr and dplyr genes-to-include magic…and voila:

reduced_SSID

Null model for SKAT doesn’t change. Run, this takes a long time so we will open reduced data output file. Now SKAT reduced results look more reasonable to use for further interpretation.

reduced_SKAT

Advertisements

About Lisa Johnson

PhD candidate at UC Davis.
This entry was posted in Genomics Workshop. Bookmark the permalink.

One Response to GWAS for NGS data – NGS2015

  1. Pingback: Week 3 – NGS2015 | Lisa Johnson Cohen

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s