Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
VQSR in non-human
I have been running an exome/RNAseq analysis and wanted to clarify a few things.
1: I have 16 exomes, 8 from two conditions. I have individually HaplotypeCaller'd them (using -L with exome BED) so have a GVCF for each. I then 'joint genotyped' GVCFs for all, cond1 and cond2 samples. I believe this to be the new way of doing things, correct (?). I made a training set from my all.GVCF by running LD-pruning in Plink (indep-pairwise with r-squared 0.25). The idea was to remove variants in LD which is a known issue in cow which I work on. I know you do not support Plink, this is just for completeness. I take the resulting ~20,000 variants and use them as my training set for VQSR on cond1 and cond2.GVCFs, prior of Q12. I also use the available dbSNP flat file (~2.9m SNPs) as 'known', prior of Q6. Is this a correct approach? FWIW I use MDS-plots to determine if the LD-pruned SNPs can reassemble my pedigree and they do a great job of that. Anecdotal but I like it!
2: Following VQSR on my cond1, cond2.GVCF I got the attached tranche, r_script plots. I am happy enough, I know I don't have any FP as I have so many 'known' in my dbSNP file.The 99% gives ~6000 SNPs to investigate, and Ti/Tv seems high enough. R_script are a bit messy but cluster around 0 on MQRankSum /ReadPosRankSum which I have taken as a good sign from reading threads here.
3: When running ApplyRecalibrator I use TS 99; TS is a bit of an enigma in my set, seeing as I made my own training set I have no idea if they are real. Any ideas on how one might validate this without further genotype data (hard to get hold of...). My annotations have turned up some interesting stuff, so I am happy with the analysis. Would you say it is on the right track?
4: I mentioned RNAseq, so I have a same animals as exome is from. Can I use my training set on this data too? As it is time-point data with same animals, would you break them into 'replicates'? Is 16 samples too small of a set? I know there is not a lot of consensus but being able to compare to exome should help get 'better' SNPs in RNAseq, righhhttt??
Thanks for all your help and the continual updating of the package,