To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

"Overcalling" deletion in UnifiedGenotyper GENOTYPE_GIVEN_ALLELES mode

I am observing "overcalling" in a deletion when running UnifiedGenotyper in GENOTYPE_GIVEN_ALLELES mode. I am supplying the following variants (along with some additional variants):

7:117199640TATC>T
7:117199644ATCT>A

I am expecting a homozygous reference call for the former, and a heterozygous call for the latter. However, I am getting heterozygous calls for both. The output I get from UnifiedGenotyper is below (and the command header). I attached a screenshot of the pileup.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample
7       117199635       .       G       T       0       LowQual AC=0;AF=0.00;AN=2;DP=250;Dels=0.00;FS=0.000;HaplotypeScore=117.5772;MLEAC=0;MLEAF=0.00;MQ=60.00;MQ0=0;SOR=0.677        GT:AD:DP:GQ:PL  0/0:250,0:250:99:0,722,9772
7       117199640       .       TATC    T       1372.73 .       AC=1;AF=0.500;AN=2;BaseQRankSum=1.738;DP=250;FS=6.976;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-0.364;QD=5.49;RPA=2,1;RU=ATC;ReadPosRankSum=2.343;SOR=1.077;STR    GT:AD:DP:GQ:PL  0/1:138,108:250:99:1410,0,7336
7       117199641       .       A       G       0       LowQual AC=0;AF=0.00;AN=2;DP=249;Dels=0.00;FS=0.000;HaplotypeScore=315.9359;MLEAC=0;MLEAF=0.00;MQ=60.00;MQ0=0;SOR=0.539        GT:AD:DP:GQ:PL  0/0:249,0:249:99:0,725,9808
7       117199644       .       ATCT    A,GTCT  5999.74 .       AC=1,0;AF=0.500,0.00;AN=2;DP=250;FS=0.000;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;QD=24.00;SOR=1.049 GT:AD:DP:GQ:PL  0/1:0,109,0:250:0:6034,0,7823,6034,0,6034
7       117199648       .       T       G       0       LowQual AC=0;AF=0.00;AN=2;DP=250;Dels=0.00;FS=0.000;HaplotypeScore=173.9715;MLEAC=0;MLEAF=0.00;MQ=60.00;MQ0=0;SOR=0.507        GT:AD:DP:GQ:PL  0/0:250,0:250:99:0,725,9763
__7       117199652       .       TG      T       0       LowQual AC=0;AF=0.00;AN=2;DP=249;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;MQ0=0;SOR=0.595   GT:AD:DP:GQ:PL 0/0:249,0:249:99:0,750,11299

INFO 10:42:22,173 HelpFormatter - ---------------------------------------------------------------------------------
INFO 10:42:22,176 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-82-ge3b4844, Compiled 2015/02/24 07:15:40
INFO 10:42:22,176 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 10:42:22,177 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 10:42:22,183 HelpFormatter - Program Args: -T UnifiedGenotyper -I results/r7-48.20150210XC_AJ933.dedup.bam -L 7:117199622-117199662 -glm BOTH -alleles ../../../forcecall/forcecalling_variants.b37.vcf -out_mode EMIT_ALL_SITES -gt_mode GENOTYPE_GIVEN_ALLELES -et NO_ET -K /hpc/packages/minerva-common/gatk/eugene.fluder_mssm.edu.key -R /sc/orga/projects/ngs/resources/gatk/2.8/human_g1k_v37.fasta
INFO 10:42:22,188 HelpFormatter - Executing as lindem03@interactive1 on Linux 2.6.32-358.6.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_03-b04.
INFO 10:42:22,189 HelpFormatter - Date/Time: 2015/02/25 10:42:22
INFO 10:42:22,189 HelpFormatter - ---------------------------------------------------------------------------------

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @mlinderm,

    We're aware of some issues with GGA mode, especially in HC but it's possible that UG may suffer from them in GATK 3.x versions (since some modifications were made in the common genotyping engine that they share). Unfortunately we don't have resources to spare to look into this at the moment, so I can't offer any resolution at this time. My recommendation will (uncharacteristically) be to try running an older version (maybe 2.8) and see if that produces a reasonable result.

  • Thanks @Geraldine_VdAuwera

    Of interest, running HC in GGA mode produces the correct result. However, we chose to use UG for GGA because it is more accurate for other variants, particularly complex indels. Do you have any suggestions on how to improve the accuracy of the one or the other?

  • tommycarstensentommycarstensen United KingdomMember

    @mlinderm I follow this thread with interest. Thanks for posting it. Have you by any chance calculated your homRef, het and homAlt genotype discordance per sample (assuming your original calls are the truth) for SNPs and indels for UG GGA and HC GGA? I would be very interested in seeing those numbers to gauge the severity of this issue.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @mlinderm, I'm surprised to hear that UG is giving you better indel results. Does that apply to GGA only or discovery mode in general?

    I think applying the HC GVCF-based workflow where you include a cohort that possesses your alleles of interest should enable you to genotype those alleles most accurately. That does assume you have such a cohort available, as opposed to just a VCF of calls from a past analysis. We're moving toward a future in which that is enabled at a large scale but in the interim we may have a gap in functionality. If enough people find that problematic I may be able to press the devs for some GGA improvements.

  • @Geraldine_VdAuwera said:
    mlinderm, I'm surprised to hear that UG is giving you better indel results. Does that apply to GGA only or discovery mode in general?

    Thanks for the followup @Geraldine_VdAuwera

    Just GGA. The larger context is a project to replace a genotyping assay with a targeted sequencing panel. Some of those variants are complex indels. An example is 6:51890845GGG>GC.

    With UG in GGA mode (producing the expected result):

    6 51890845 . GGG GC 8246.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.142;DP=250;FS=2.856;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-0.568;QD=32.99;RPA=2,-2;RU=G;ReadPosRankSum=1.519;SOR=0.529;STR GT:AD:DP:GQ:PL 0/1:114,129:250:99:8284,0,7025

    With HC in GGA mode (not calling the expected variant):

    6 51890845 . GGG GC 0 LowQual AC=0;AF=0.00;AN=2;DP=252;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;MQ0=0;SOR=0.654 GT:AD:DP:GQ:PL 0/0:251,0:251:99:0,819,2147483647

    What makes this an interesting variant is that it is a known pathogenic mutation that is readily called in discovery mode, but as two separate variants (below). The resulting haplotype is correct, it is just not the most parsimonious explanation (or the one that connects back to the existing knowledge base about this variant).

    6:51890841TG>T
    6:51890847G>C

    Of note, if I supply the "decomposed" representation, i.e. the two variants above, as --alleles to both UG and HC I get good calls at the decomposed site. I am trying to work out the best way to clean up the haplotype. Any suggestions?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Oh, that's interesting. That seems like it might be due to a quirk of the HC's internal representation. I'm not sure how to address it at this time, to be honest. There was an attempt to merge such cases into complex variants but it caused more problems than it solved for downstream analysis. If you provide us with some test snippets I can ask the devs to look at the data and evaluate whether this can be tackled in the near future or not.

  • tommycarstensentommycarstensen United KingdomMember

    @mlinderm said:
    With UG in GGA mode (producing the expected result):

    Is this UG2.8 or UG3.3 producing the expected result? Did UG2.8 call the deletion 7:117199640:TATC:T correctly? I might have to revert from UG3.3 to UG2.8 myself based on your findings. I'm watching this thread with hawk eyes and I'm very grateful for any results you share. Thanks.

Sign In or Register to comment.