How can I increase the sensitivity of my initial GATK exome variant calling?

I am following the best practices for GATK exome variant calling, but am failing to cast a wide net and catch all validated SNPs on my first pass variant calling.

As a test, I have downloaded the NA12878 GIAB exome read files from the 1000 Genomes Consortium. I then align them to g37, remove duplicates, analyze patterns of covariation in the sequence dataset (BaseRecalibrator), do a second pass to analyze the covariation remaining after recalibration (BaseRecalibrator), apply the recalibration to my sequence data (PrintReads), and call variants in my sequence data (HaplotypeCaller).

I'm creating a GVCF from my experimental exome, combining it with 50 other GVCFs I've made from phase 3 1000 Genomes Consortium exome data, and then following the GATK best practices through VQSR, etc. However, directly after my initial variant calling, I have found that my VCF file does not contain all of the SNPs found in NA12878. I've compared my experimental raw VCF file (I've run the pipeline in both VCF mode and GVCF mode) to the high confidence NA12878 VCF, considering only those sites that fall within the exome BED file supplied by the 1000 Genomes Consortium, and I am only picking up 90% of the true positives.
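The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual method: it assumes uncompressed, tab-delimited VCF text, matches SNPs on chromosome, position, REF and ALT, ignores genotypes, and uses hypothetical helper names (`load_bed`, `snp_sites`, `sensitivity`). A dedicated comparison tool would handle more edge cases.

```python
from bisect import bisect_right


def load_bed(lines):
    """Parse BED lines into {chrom: merged, sorted (start, end)}; BED is 0-based, half-open."""
    raw = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or line.startswith(("#", "track", "browser")):
            continue
        raw.setdefault(fields[0], []).append((int(fields[1]), int(fields[2])))
    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            last_start, last_end = out[-1]
            if start <= last_end:  # overlapping or adjacent: extend the previous interval
                out[-1] = (last_start, max(last_end, end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def in_regions(regions, chrom, pos):
    """Return True if a 1-based VCF position lies inside any BED interval."""
    intervals = regions.get(chrom, [])
    p0 = pos - 1  # convert to the 0-based coordinate used by BED
    i = bisect_right(intervals, (p0, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= p0 < intervals[i][1]


def snp_sites(vcf_lines, regions):
    """Collect (chrom, pos, ref, alt) for single-base substitutions inside the regions."""
    sites = set()
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        for alt in fields[4].split(","):
            # Skips indels, "." and symbolic alleles such as <NON_REF>.
            if len(ref) == 1 and len(alt) == 1 and alt in "ACGT":
                if in_regions(regions, chrom, pos):
                    sites.add((chrom, pos, ref, alt))
    return sites


def sensitivity(called, truth):
    """Fraction of truth SNPs that also appear in the call set."""
    if not truth:
        raise ValueError("truth set is empty inside the target regions")
    return len(called & truth) / len(truth)
```

With real files, each function would be given an open file handle (for example `open("exome.bed")`, a hypothetical file name); gzipped VCFs would need `gzip.open(path, "rt")` instead.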

For completeness, my initial variant calling command is as follows:

    java -Xmx192g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R $indexed_genome -I recal_reads.bam --genotyping_mode DISCOVERY --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o raw_variants.g.vcf -nct 64

Of course, more than half of my raw VCF consists of false positives, but after batch genotyping with 50 exomes and VQSR filtering, I end up with about 90% true positive SNPs and 10% false positive SNPs. I am most concerned with increasing my 90% to something closer to 99%. I assume this could be accomplished by being more liberal with my initial alignment and variant calling, in order to cast a wider net and pull in more possible variants (higher sensitivity).

Can someone please help me understand how I can increase the sensitivity of my instance of GATK?

Answers

  • jtshreve (Member)

    Any thoughts?

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @jtshreve
    Hi,

    Are you using the high confidence sites from the GIAB website to determine the sensitivity? I think they have two files; only one contains high confidence sites.

    As for increasing sensitivity, you can try lowering the value of --standard_min_confidence_threshold_for_calling. Also, have you done some quality control to ensure the data is usable?

    -Sheila

  • jtshreve (Member)

    Hi Sheila,

    Thanks for your response. Yes, I am using the high confidence sites from GIAB. Also, as a test run, I am using the GIAB-supplied exome read files, from which I should be able to find more than 90% of the SNPs. I will try lowering --standard_min_confidence_threshold_for_calling and see if it makes a difference for me.

    I spoke with Geraldine in another forum posting, and she indicated that when using GIAB exome reads for GATK and subsequent SNP calling, finding only 90% of the true positive SNPs prior to any filtering is considered low. Ideally, I would like to find something like 98-99% of the true positives, even if that brings in many false positives (since they can be filtered out using VQSR). Besides --standard_min_confidence_threshold_for_calling, do you know of any other parameters within either BWA or GATK that I can adjust to increase my sensitivity?

    Thank you again.

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @jtshreve
    Hi,

    I think that lowering --standard_min_confidence_threshold_for_calling should help quite a bit. Let us know if it does not. You could also try lowering the value for base qualities, but that should not be necessary.

    -Sheila
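
As a rough illustration of the suggestion in this thread, the sketch below assembles a HaplotypeCaller command line (plain VCF mode) with an explicit --standard_min_confidence_threshold_for_calling value. The helper name `haplotypecaller_command`, the file names, and the threshold of 10.0 are assumptions chosen for illustration only, not recommended settings; the memory, threading, and GVCF options from the original command are left out for brevity. The script only prints the command and does not run GATK.

```python
import shlex


def haplotypecaller_command(reference, bam, out_vcf, stand_call_conf=10.0):
    """Build a GATK 3 style HaplotypeCaller argument list with an explicit calling threshold.

    stand_call_conf is an illustrative value, not a recommendation.
    """
    return [
        "java", "-jar", "GenomeAnalysisTK.jar",
        "-T", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "--genotyping_mode", "DISCOVERY",
        "--standard_min_confidence_threshold_for_calling", str(stand_call_conf),
        "-o", out_vcf,
    ]


# Hypothetical file names; substitute your own reference, BAM and output paths.
cmd = haplotypecaller_command("reference.fasta", "recal_reads.bam", "raw_variants.vcf")
print(shlex.join(cmd))
```

Re-running the sensitivity comparison against the high confidence NA12878 sites at a few different threshold values would show whether the parameter actually recovers the missing SNPs.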
