Dr. Tiffany Timbers, postdoc at Simon Fraser University. Became interested in automating and programming from Multiworm tracker, automated image analyses for worm behavioral studies.
Today we will use a dataset from 2,007 individuals, 700,000 variants, 20,000 genes potentially contain disruptive alleles.
GWAS vs. SNP microarrays. Differences and similarities? Example of question typically asked with GWAS analyses, “Do I rescue the mutant by knocking out a region with significant SNP activity?”
Really cool website, facilitating classroom discussion with student input-typed answers:
Power analyses to consider with GWAS from NGS vs. microarray, difficult to estimate effect sizes. Work needs to be done on how to do a power analysis for this type of study. Haplotype matrix, not a lot of good data available to generate this in non-human populations. Have to come up with own methods for determining robust design. Positives of GWAS for NGS data include being able to call rare variants. Take region with groupings, gene level rather than on variant-level. What are you doing with intergenic data? Very new type of analysis. How important is this? When severe mutations are of interest in isolated population vs. natural population. Indels vs. copy number variants.
Sequence kernel association test (SKAT): linear model
disease/phenotype = alpha + (alpha1)covariants (e.g. age, sex) + (beta)variantsets ( e.g. genes) + epsilon
SKAT R package documentation:
SKAT paper, Wu et al. Am. J. Hum. Genet. 2011:
(Proceed with amazing, well-worked out install instructions. All the code instructions for this whole tutorial are great – solid and work. Really appreciate having location instructions at beginning of each code block. 🙂
Tiffany: “Ignore all warnings, say ‘yes’ to everything, and it will work.” If it doesn’t work, kill the machine, start over. Computers are fun. Tweeting lots of quotes, #ngs2015. 🙂 Future for Docker. Tiffany: dplyr is worth it. Titus: R did something better than Python. See Jenny Bryan, stat545:
Let’s look at data. There are some .vcf files, there are a lot. Here are the first 5. we need to put them all together into a massive .vcf file to look at whole population. We’re only going to look at chromosome 4 because it makes the data more manageable for this tutorial.
Here is .vcf format:
We will use vcf-merge, takes a minute:
vcf-merge data/*.vcf.gz | bgzip -c > MMP_vcf-merge.vcf.gz
Here is the merged file, 1.3 M:
VCF-tools is silly and puts “.” in output. We need “0/0” to move on. Tiffany figured out this awk command to use to change that.
This worked and is amazing:
Enthusiasm from class at 8:45pm.
No best practices in field right now. Decisions that follow are investigator-specific. We’re going to only look at coding regions. Throwing out everything, e.g. silent mutations, intergenic regions, etc. With clunky but amazing grep command:
Overheard: Don’t have to understand everything as long as the desired output comes out.
Create pheno_file and fam_file:
Now association analysis, gene by gene. The more SNPs you have, the longer it takes. Lex: Suggestion Jones et al. 2012 stickleback sliding window approach.
Set seed to 100, so p values are reproducible. Run SKAT.SSD.ALL, but we won’t do that. Open saved data file with output, as this takes too long to perform during class. Compares with null model.
Apply multiple testing correction with library(fdrtool). Sort. Cutoff.
R is cromulent.
Shouldn’t just take this information and run with it. Columns 4 and 5 (integers) indicate number of times this SNP ID occurs, some are 2 and 3 in entire dataset. You will have to figure out how many times you will tolerate mutations to call it significant. Plot histogram, what are the minor allele count. Could use cluster and pathway analyses, gene adjacency to define SNP sets. Be careful with GO terms and pathway analyses for SNP analyses. These are pitfalls. Really cool, new software! But, more people need to be talking about this, developing these tools further. And know how it works. Great opportunity with this tutorial to review this workflow.
Frequency of allele counts for # genes:
This is where “p hacking” comes in to determine what is reasonable variance to be in SNP set. No answer.
Throw SSID into R, throw away everything less than 5 variants to keep in data set. From histogram above, all small allele counts (largest number of genes) will be thrown away. Commence R joining plyr and dplyr genes-to-include magic…and voila:
Null model for SKAT doesn’t change. Run, this takes a long time so we will open reduced data output file. Now SKAT reduced results look more reasonable to use for further interpretation.