Unified genotyper: GT does not match PL, causes problems with PhaseByTransmission

kmsquirekmsquire Posts: 9Member

I'm analyzing seven trio exomes right now with the latest GATK (version 2.7-4-g6f46d11), and was surprised to find a large number of mendelian violations reported by PhaseByTransmission, even after eliminating low/no coverage events. Tracking down the problem, it seems that CombineVariants occasionally propagates the PL field to the new vcf file incorrectly, sometimes in a way which causes GT not to correspond to the lowest PL.

Here's an example, showing just the GT, AD, and PL columns for a few positions in one trio. For each position, the first line contains the genotypes from the original vcf file, and the second shows the genotypes from the merged file.

#CHROM  POS       ID           REF  ALT  100403001-1           100403001-1A           100403001-1B        
                                                                                                          
1       5933530   rs905469     A    G    0/0:37,0:0,99,1192    0/0:35,0:0,90,1101     0/0:44,0:0,117,1412 
1       5933530   rs905469     A    G    0/0:37,0:189,15,1192  0/0:35,0:0,90,1101     0/0:44,0:0,117,1412 
                                                                                                          
1       10412636  rs4846215    A    T    0/0:119,0:0,358,4297  0/0:113,0:0,337,4060   0/0:102,0:0,304,3622
1       10412636  rs4846215    A    T    0/0:119,0:110,9,0     0/0:113,0:0,337,4060   0/0:102,0:0,304,3622
                                                                                                          
1       11729035  rs79974326   G    C    0/0:50,0:0,141,1709   0/0:53,0:0,150,1788    0/0:71,0:0,187,2246 
1       11729035  rs79974326   G    C    0/0:50,0:1930,0,3851  0/0:53,0:0,150,1788    0/0:71,0:0,187,2246 
                                                                                                          
1       16735764  rs182873855  G    A    0/0:54,0:0,138,1691   0/0:57,0:0,153,1841    0/0:47,0:0,120,1441 
1       16735764  rs182873855  G    A    0/0:54,0:174,0,1691   0/0:57,0:0,153,1841    0/0:47,0:0,120,1441 
                                                                                                          
1       17316577  rs77880760   G    T    0/0:42,0:0,123,1470   0/0:38,0:0,111,1317    0/0:53,0:0,153,1817 
1       17316577  rs77880760   G    T    0/0:42,0:233,17,1470  0/0:38,0:0,111,1317    0/0:225,25:0,153,181
                                                                                                          
1       28116000  rs2294229    A    G    0/0:37,0:0,105,1291   0/0:37,0:0,111,1379    0/0:30,0:0,87,1066  
1       28116000  rs2294229    A    G    0/0:37,0:0,105,1291   0/0:37,0:0,111,1379    0/0:30,0:1844,159,0 
                                                                                                          
1       31740706  rs3753373    A    G    0/0:123,0:0,349,4173  0/0:110,0:0,319,3793   0/0:111,0:0,328,3885
1       31740706  rs3753373    A    G    0/0:123,0:117,6,0     0/0:110,0:0,319,3793   0/0:111,0:0,328,3885

Most genotypes are propagated correctly, and in fact, which a propagated incorrectly changes from run to run.

In my case, I'm merging files from disjoint regions, so I can work around the problem, but it would be nice if this were fixed.

Thanks, Kevin

Answers

  • LaurentLaurent Posts: 36Member, GSA Collaborator

    Hi Kevin,

    Reading your post, it seems like this is really more a CombineVariants issue than PhaseByTransmission issue, is it correct? PhaseByTransmission relies solely on the PLs and does not use the GT field at all, so if provided with incorrect PLs it will produce erroneous results.

    Cheers, Laurent

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,672Administrator, GATK Developer admin

    Hi Kevin,

    Could you please post your command-line, and a specific example at one site showing the exact input records, and the exact subsequent output record that CombineVariants produces?

    @Laurent, this does look like the issue is strictly at the CombineVariants level, so I don't think you need to worry about PBT.

    Geraldine Van der Auwera, PhD

  • kmsquirekmsquire Posts: 9Member

    Laurent, my apologies. I wrote the title before I originally understood the true problem, and forgot to change the title afterward.

    Geraldine, thanks for your response. I'll be able to post an example in a couple of hours.

    Cheers, Kevin

  • kmsquirekmsquire Posts: 9Member

    Here is the (partial) command line:

    /home/kmsquire/local/lib/jdk1.7.0_40/bin/java -Xmx6G -Djava.io.tmpdir=/scratch1/kmsquire/Qstudy/tmp \
          -jar /home/kmsquire/src/GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
          -T CombineVariants \
          --variant /scratch1/kmsquire/Qstudy/variants/151_10:000000001-017999674/analysis.10:1-17999674.raw.vcf \
          --variant /scratch1/kmsquire/Qstudy/variants/163_10:125894472-128691068/analysis.10:125894472-128691068.raw.vcf \
          --variant /scratch1/kmsquire/Qstudy/variants/164_10:128691069-133406403/analysis.10:128691069-133406403.raw.vcf \
          --variant /scratch1/kmsquire/Qstudy/variants/165_10:133406404-133702526/analysis.10:133406404-133702526.raw.vcf \
          --variant /scratch1/kmsquire/Qstudy/variants/166_10:133702527-135534747/analysis.10:133702527-135534747.raw.vcf \
    ...
          --variant /scratch1/kmsquire/Qstudy/variants/267_Y:009266322-011604552/analysis.Y:9266322-11604552.raw.vcf \
          --variant /scratch1/kmsquire/Qstudy/variants/339_hs37d5:000000001-035477943/analysis.hs37d5:1-35477943.raw.vcf \
          -o /scratch1/tmp/kmsquire/Studies/Qstudy/variants/analysis.raw.vcf \
          -nt 2 \
          -R /scratch0/ref/hs37d5/hs37d5.fa \
          --validation_strictness SILENT \
          --log_to_file /scratch1/kmsquire/Qstudy/variants/analysis.raw.vcf.log
    

    There are 339 inputs, all disjoint. It might be better if they were sorted numerically (they're sorted by the base filename), but in theory it shouldn't matter.

    Here's a full example record. First line is the input, second line is the output. Note that the actual records which have problems are few, and they change from run to run.

    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  100403001-1 100403001-1A    100403001-1B    100403001-4 100403001-4A    100403001-4B    100403001-6 100403001-6A    100403001-6B    100403001-7 100403001-7A    100403001-7B    100403001-8 100403001-8A    100403001-8B    11001664-7  11001664-7A 11001664-7B 11001664-8  11001664-8A 11001664-8B
    1   5933530 rs905469    A   G   899.81  .   AC=3;AF=0.071;AN=42;BaseQRankSum=-3.482;DB;DP=737;Dels=0.00;FS=8.030;HaplotypeScore=0.7758;InbreedingCoeff=-0.0769;MLEAC=3;MLEAF=0.071;MQ=69.92;MQ0=1;MQRankSum=0.778;QD=10.00;ReadPosRankSum=-1.500    GT:AD:DP:GQ:PL  0/0:37,0:37:99:0,99,1192    0/0:35,0:36:90:0,90,1101    0/0:44,0:45:99:0,117,1412   0/0:35,0:35:99:0,99,1191    0/0:26,0:26:75:0,75,899 0/1:15,9:24:99:111,0,449    0/0:34,0:34:84:0,84,1017    0/0:56,0:57:99:0,138,1655   0/0:32,0:32:90:0,90,1093    0/1:17,18:35:99:472,0,477   0/0:39,0:39:99:0,108,1299   0/1:18,13:31:99:372,0,487   0/0:42,0:42:99:0,108,1299   0/0:48,0:48:99:0,123,1458   0/0:41,0:41:96:0,96,1149    0/0:34,0:34:87:0,87,1066    0/0:31,0:31:63:0,63,763 0/0:35,0:35:87:0,87,1052    0/0:23,0:23:57:0,57,673 0/0:20,0:20:57:0,57,684 0/0:32,0:32:93:0,93,1091
    1   5933530 rs905469    A   G   899.81  .   AC=3;AF=0.071;AN=42;BaseQRankSum=-3.482;DB;DP=737;Dels=0.00;FS=8.030;HaplotypeScore=0.7758;InbreedingCoeff=-0.0769;MLEAC=3;MLEAF=0.071;MQ=69.92;MQ0=1;MQRankSum=0.778;QD=10.00;ReadPosRankSum=-1.500;set=variant108 GT:AD:DP:GQ:PL  0/0:37,0:37:99:189,15,1192  0/0:35,0:36:90:0,90,1101    0/0:44,0:45:99:0,117,1412   0/0:35,0:35:99:0,99,1191    0/0:26,0:26:75:0,75,899 0/1:15,9:24:99:111,0,449    0/0:34,0:34:84:0,84,1017    0/0:56,0:57:99:0,138,1655   0/0:32,0:32:90:0,90,1093    0/1:17,18:35:99:472,0,477   0/0:39,0:39:99:0,108,1299   0/1:18,13:31:99:372,0,487   0/0:42,0:42:99:0,108,1299   0/0:48,0:48:99:0,123,1458   0/0:41,0:41:96:0,96,1149    0/0:34,0:34:87:0,87,1066    0/0:31,0:31:63:0,63,763 0/0:35,0:35:87:0,87,1052    0/0:23,0:23:57:0,57,673 0/0:20,0:20:57:0,57,684 0/0:32,0:32:93:0,93,1091
    

    Hope this helps. Let me know if you need anything else.

    Kevin

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,672Administrator, GATK Developer admin

    Hi Kevin,

    Can you tell me if this still happens if you don't use multithreading (nt)?

    Geraldine Van der Auwera, PhD

  • kmsquirekmsquire Posts: 9Member

    Can you tell me if this still happens if you don't use multithreading (nt)?

    It does not happen if I don't use multithreading (-nt), so it would seem that that is the culprit.

    Thanks! Kevin

  • kshakirkshakir Posts: 21GATK Developer mod

    Hi @kmsguire,

    I'm trying to reproduce the issue with -nt, but in our test cases I'm not having luck recreating different outputs with and without -nt. At the moment, I'm still looking for a test case so we can further diagnose your reported bug.

    If you can share your input vcfs that produce errors, along with log files produced with -log log_file.txt, that would help us immensely in terms of narrowing down the issue. After you upload the files, if you could email kshakir [at] broadinstitute.org with the names of files to look for, I'll go through the log files to find and run the commands on our servers and get back to you asap.

    See this post for how to upload data to our write-only ftp directory: http://gatkforums.broadinstitute.org/discussion/1215/how-can-i-access-the-gsa-public-ftp-server

    Thanks!

Sign In or Register to comment.