Genotype calls in PhaseByTransmission output VCF

Hello, I ran PhaseByTransmission (GATK ver 3.6) on trios to get a list of Mendelian violations. In the first run, we used BAMs at average coverage of ~100x and in the second run, we used down-sampled BAMs (created with Picard) to get an average coverage of 20x.

When I compared results, I noticed cases (example below with POS changed) where genotype call in one of the parents was converted from hom-ref to het in PhaseByTransmission output VCF and therefore these were not identified as MIEs anymore. The number of reads supporting REF and ALT alleles are switched in the output AD tag. I see the same calls with DeNovoPrior at default or 1.0E-6. I'm trying to understand what's causing this and would appreciate any feedback.

Original BAM
Input VCF:
1 100 . A T 962.13 . AC=1;AF=0.167;AN=6;BaseQRankSum=0.077;ClippingRankSum=0.000;DP=218;ExcessHet=3.0103;FS=2.261;MLEAC=1;MLEAF=0.167;MQ=60.00;MQRankSum=0.000;QD=11.45;ReadPosRankSum=2.468;SOR=0.905 GT:AD:DP:GQ:PL 0/0:75,0:75:99:0,225,2547 0/1:46,38:84:99:993,0,1300 0/0:58,0:58:99:0,174,1929

PhaseByTransmission output VCF:
1 100 . A T 962.13 . AC=1;AF=0.167;AN=6;BaseQRankSum=0.077;ClippingRankSum=0.000;DP=218;ExcessHet=3.0103;FS=2.261;MLEAC=1;MLEAF=0.167;MQ=60.00;MQRankSum=0.000;QD=11.45;ReadPosRankSum=2.468;SOR=0.905 GT:AD:DP:GQ:PL:TP 0/0:75,0:75:99:0,225,2547:93 0/1:46,38:84:99:993,0,1300:93 0/0:58,0:58:99:0,174,1929:93

Down-sampled BAM
Input VCF:
1 100 . A T 137.13 . AC=1;AF=0.167;AN=6;BaseQRankSum=-1.111;ClippingRankSum=0.000;DP=55;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.167;MQ=60.00;MQRankSum=0.000;QD=7.62;ReadPosRankSum=1.634;SOR=0.675 GT:AD:DP:GQ:PL 0/0:17,0:17:51:0,51,583 0/1:11,7:18:99:168,0,309 0/0:20,0:20:60:0,60,680

PhaseByTransmission output VCF:
1 100 . A T 137.13 . AC=1;AF=0.167;AN=6;BaseQRankSum=-1.111;ClippingRankSum=0.000;DP=55;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.167;MQ=60.00;MQRankSum=0.000;QD=7.62;ReadPosRankSum=1.634;SOR=0.675 GT:AD:DP:GQ:PL:TP 1|0:17,0:17:0:0,51,583:9 1|0:11,7:18:99:168,0,309:9 0|0:20,0:20:60:0,60,680:9

Thanks,
Prachi

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @pkothiyal
    Hi Prachi,

    Basically, it is easier for the tools to change genotypes in sites where coverage is low and the genotype quality is low. Notice the GQ for the lower coverage site is 51, but for the higher coverage site, it is 99.

    What is the exact command you ran? Also, you may be interested in the Genotype Refinement Workflow.

    -Sheila

  • pkothiyalpkothiyal VAMember

    Thanks @Sheila . Here's the command:
    java -Xmx16g -jar GenomeAnalysisTK.jar -R $REFFILE -T PhaseByTransmission -V raw_variants.vcf -ped trio.ped -o pt.vcf -mvf mie.pt.txt
    I have a few related questions:

    • If we didn't have 100x data to compare against, 20x coverage is not horrible generally speaking. Is a call with QD>7, GQ>50 and DP>15 and passing other hard-filtering recommendations considered low quality?
    • What evidence/logic is used in converting maternal genotype call from GT:AD:GQ:PL 0/0:17,0:51:0,51,583 in the input to GT:AD:GQ:PL 1|0:17,0:0:0,51,583 in the output? It translates into that because an MIE was observed, maternal GT can be converted to het (1|0) and assigned a GQ of 0 so Mendelian consistency is restored. It's a toss up between a Mendelian consistent call with GQ==0 for one of the trio members vs. a Mendelian inconsistent call with GQ values at 51, 99 and 60.
    • Is GQ always the difference between the lowest and 2nd lowest PL value? If yes, how does it get to be 0 in the output VCF when PL is 0,51,583?
    • Could it have something to do with relaxing the DeNovoPrior? I tried default and 1.0E-6 and see the same call. Does it need to be higher at lower coverage to allow MIE detection?

    Thanks,
    Prachi

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @pkothiyal
    Hi Prachi,

    Sorry for the delay.

    1) No, it is not "bad quality" but the way the math works out with the new priors (from the ped file), it is easier to change a low GQ (like 50) compared to a high GQ of 99.

    2) The ped file and Mendelian violations file are used. I think if a site is a Mendelian violation, the likelihood of changing it to not be, is greater (especially at low quality sites).

    3) It should be the difference between second lowest and lowest PL, however, it looks like the tool calculated a new GQ (and PLs) but did not output the new PL field. Can you try the GenotypeRefinement workflow and check if this still happens? PhaseByTransmission is being phased out, so any bugs will not be fixed.

    4) You can try relaxing the de novo prior, but I suspect it has to be a lot larger than what you are trying. I think it will have to be larger at high coverage to make a big difference. However, I have not tested this myself so cannot definitely tell you. My best recommendation is to try the Genotype Refinement workflow and see if you get better results.

    -Sheila

  • pkothiyalpkothiyal VAMember

    Thanks Sheila. I will try the GenotypeRefinement workflow and check the results.

Sign In or Register to comment.