How can I increase the sensitivity of my initial GATK exome variant calling?
I am following the GATK Best Practices for exome variant calling, but my first-pass variant calling is not casting a wide enough net to catch all validated SNPs.
As a test, I downloaded the NA12878 GIAB exome read files from the 1000 Genomes consortium. I then align them to g37, remove duplicates, analyze patterns of covariation in the sequence dataset (BaseRecalibrator), do a second pass to analyze the covariation remaining after recalibration (BaseRecalibrator), apply the recalibration to my sequence data (PrintReads), and call variants in my sequence data (HaplotypeCaller).
I create a GVCF from my experimental exome, combine it with 50 other GVCFs I made from phase 3 1000 Genomes exome data, and follow the GATK Best Practices through VQSR, etc. However, directly after the initial variant calling, my VCF file does not contain all of the SNPs found in NA12878. I compared my raw VCF file (I have run the pipeline in both VCF mode and GVCF mode) to the high-confidence NA12878 VCF, considering only sites that fall within the exome BED file supplied by the 1000 Genomes consortium, and I am picking up only 90% of the true positives.
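For anyone who wants to reproduce the 90% figure, the comparison described above amounts to roughly the following. This is only a minimal sketch, not the exact script I ran: the function names and file names are made up for illustration, it compares SNP alleles by site only (it ignores genotypes), and it assumes both VCFs and the BED file use the same chromosome naming.

```python
import bisect

def load_bed(lines):
    """Parse BED lines into {chrom: (starts, ends)} with overlapping intervals merged.
    BED coordinates are 0-based and half-open."""
    raw = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw.setdefault(chrom, []).append((int(start), int(end)))
    regions = {}
    for chrom, intervals in raw.items():
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        regions[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return regions

def in_regions(regions, chrom, pos):
    """True if the 1-based VCF position falls inside a BED interval."""
    if chrom not in regions:
        return False
    starts, ends = regions[chrom]
    pos0 = pos - 1  # convert to 0-based to match BED
    idx = bisect.bisect_right(starts, pos0) - 1
    return idx >= 0 and pos0 < ends[idx]

def load_snps(lines, regions):
    """Collect (chrom, pos, ref, alt) for single-base substitutions inside the regions.
    Symbolic alleles such as <NON_REF> and all indels are skipped."""
    snps = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.split()[:5]
        pos = int(pos)
        if len(ref) != 1 or not in_regions(regions, chrom, pos):
            continue
        for allele in alt.split(","):
            if len(allele) == 1 and allele in "ACGT":
                snps.add((chrom, pos, ref, allele))
    return snps

def sensitivity(truth, calls):
    """Fraction of truth SNPs that also appear in the call set: TP / (TP + FN)."""
    if not truth:
        raise ValueError("truth set is empty")
    return len(truth & calls) / len(truth)

# Hypothetical usage (file names are placeholders):
# with open("exome_targets.bed") as bed:
#     regions = load_bed(bed)
# with open("giab_truth.vcf") as t, open("raw_variants.vcf") as c:
#     print(sensitivity(load_snps(t, regions), load_snps(c, regions)))
```

Note that this matches alleles by exact position and representation, so differences in how the two VCFs normalize variants would count as misses.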
For completeness, my initial variant calling command is as follows:
java -Xmx192g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R $indexed_genome -I recal_reads.bam --genotyping_mode DISCOVERY --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o raw_variants.g.vcf -nct 64
Of course, more than half of the raw variants in my VCF are false positives, but after batch genotyping with the 50 exomes and VQSR filtering, I end up with about 90% true positive SNPs and 10% false positive SNPs. I am most concerned with increasing my 90% to closer to 99%. I assume this could be accomplished by being more liberal with my initial alignment and variant calling, in order to cast a wider net and pull in more possible variants (higher sensitivity).
Can someone please help me understand how I can increase the sensitivity of my instance of GATK?