Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Mutect2 output when calling variants for PON has genotype 0/1 for homozygous SNPs

I'm using Mutect2 v4.0.4.0 to call variants for the purpose of making a panel-of-normals using the recommended workflow. I observe many heterozygous variants in the output VCF that have genotype 0/1 but have AD allele depths of 0 for the reference allele (and dozens to hundreds of alternate allele reads). The genotype should be 1/1 should it not?

If necessary I can provide the input data.

The command line is:

java8 -Xmx8g -jar $GATK4 Mutect2 -R $REF -I $BAM -tumor $SAMPID -O out.vcf.gz --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter

Below is the start of the log output.

12:06:22.792 WARN  GATKReadFilterPluginDescriptor - Disabled filter (MateOnSameContigOrNoMappedMateReadFilter) is not enabled by this tool
12:06:22.917 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/share/carvajal-archive/PACKAGES/src/GATK/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
12:06:23.320 INFO  Mutect2 - ------------------------------------------------------------
12:06:23.320 INFO  Mutect2 - The Genome Analysis Toolkit (GATK) v4.0.4.0
12:06:23.321 INFO  Mutect2 - For support and documentation go to https://software.broadinstitute.org/gatk/
12:06:23.321 INFO  Mutect2 - Executing as [email protected] on Linux v4.4.0-109-generic amd64
12:06:23.321 INFO  Mutect2 - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_152-b16
12:06:23.322 INFO  Mutect2 - Start Date/Time: April 26, 2018 12:06:22 PM PDT
12:06:23.322 INFO  Mutect2 - ------------------------------------------------------------
12:06:23.322 INFO  Mutect2 - ------------------------------------------------------------
12:06:23.323 INFO  Mutect2 - HTSJDK Version: 2.14.3
12:06:23.323 INFO  Mutect2 - Picard Version: 2.18.2
12:06:23.323 INFO  Mutect2 - HTSJDK Defaults.COMPRESSION_LEVEL : 1
12:06:23.323 INFO  Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
12:06:23.323 INFO  Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
12:06:23.323 INFO  Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
12:06:23.323 INFO  Mutect2 - Deflater: IntelDeflater
12:06:23.323 INFO  Mutect2 - Inflater: IntelInflater
12:06:23.324 INFO  Mutect2 - GCS max retries/reopens: 20
12:06:23.324 INFO  Mutect2 - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
12:06:23.324 INFO  Mutect2 - Initializing engine
12:06:25.847 INFO  Mutect2 - Done initializing engine
12:06:28.279 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/share/carvajal-archive/PACKAGES/src/GATK/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
12:06:28.363 INFO  NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/share/carvajal-archive/PACKAGES/src/GATK/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
12:06:28.701 WARN  NativeLibraryLoader - Unable to load libgkl_pairhmm_omp.so from native/libgkl_pairhmm_omp.so (/share/carvajal-archive/tmp/twtoal/libgkl_pairhmm_omp7409537124124025621.so: /usr/lib/x86_64-linux-gnu/libgomp.so.1: version `GOMP_4.0' not found (required by /share/carvajal-archive/tmp/twtoal/libgkl_pairhmm_omp7409537124124025621.so))
12:06:28.702 INFO  PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
12:06:28.702 INFO  NativeLibraryLoader - Loading libgkl_pairhmm.so from jar:file:/share/carvajal-archive/PACKAGES/src/GATK/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm.so
12:06:29.534 WARN  IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
12:06:29.535 WARN  IntelPairHmm - Ignoring request for 4 threads; not using OpenMP implementation
12:06:29.536 INFO  PairHMM - Using the AVX-accelerated native PairHMM implementation
12:06:29.976 INFO  ProgressMeter - Starting traversal

Issue · Github
by Sheila

Issue Number
3079
State
closed
Last Updated
Assignee
Array
Closed By
chandrans

Best Answers

  • tedtoaltedtoal
    Accepted Answer

    Sorry, I was posting a lot so that it might help other people bypass some of the time it took me to understand better what is going on.

    I don't think I need any help any more. The only thing I would ask is that you guys improve documentation of the GT field in the Mutect2 VCF file, and possibly remove the field from the VCF file. Documentation about it shouldn't be buried in a blog or tech support thread. I personally think you are using the GT values incorrectly, and have tried to get the maintainers of the VCF spec to more clearly define use of the GT field in tumors, but I don't think they will.

    Thanks.

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @tedtoal
    Hi,

    I need to check with the developer. In the meantime, can you post some example records?

    Thanks,
    Sheila

  • tedtoaltedtoal Member

    Yes, here are a few. Perhaps when run in PON mode, it was felt that the genotype calls didn't need to be accurate??

    chr1    1754601 .       G       T       .       .       DP=80;ECNT=1;POP_AF=0.001;P_GERMLINE=-0.0002169;TLOD=300.32     GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB     0/1:0,75:0.969:0,38:0,37:36:0,192:60:20:0.99,0.99,1:0.028,0.026,0.946
    chr1    1755504 .       T       C       .       .       DP=54;ECNT=2;POP_AF=0.001;P_GERMLINE=-0.0002169;TLOD=201.9      GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:PGT:PID:SA_MAP_AF:SA_POST_PROB     0/1:2,49:0.935:2,14:0,35:26:254,224:60:20:0|1:1755504_T_C:0.96,0.96,0.961:0.021,0.036,0.943
    chr1    1755523 .       C       T       .       .       DP=46;ECNT=2;POP_AF=0.001;P_GERMLINE=-0.0002169;TLOD=182.97     GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:PGT:PID:SA_MAP_AF:SA_POST_PROB     0/1:2,42:0.93:2,13:0,29:37:254,224:60:20:0|1:1755504_T_C:0.949,0.949,0.955:0.021,0.036,0.943
    chr1    1756186 .       T       C       .       .       DP=38;ECNT=1;POP_AF=0.001;P_GERMLINE=-0.0002169;TLOD=125.51     GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB     0/1:0,32:0.921:0,16:0,16:36:0,232:60:20:0.99,0.99,1:0.029,0.025,0.946
    chr1    1756753 .       A       G       .       .       DP=41;ECNT=1;POP_AF=0.001;P_GERMLINE=-0.0002169;TLOD=149.83     GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB     0/1:0,41:1:0,20:0,21:35:0,211:60:20:0.99,0.99,1:0.026,0.028,0.946
    chr1    1757030 .       C       G       .       .       DP=11;ECNT=2;POP_AF=0.001;P_GERMLINE=-0.0002169;TLOD=33.65      GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB     0/1:0,10:0.945:0,5:0,5:34:0,255:60:20:0.99,0.99,1:0.026,0.028,0.946
    chr1    1757074 .       G       T       .       .       DP=6;ECNT=2;POP_AF=0.001;P_GERMLINE=-0.0002169;TLOD=22.74       GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB     0/1:0,6:1:0,4:0,2:35:0,299:60:19:0.99,0.99,1:0.025,0.029,0.946
    chr1    1758323 .       T       C       .       .       DP=39;ECNT=1;POP_AF=0.001;P_GERMLINE=-0.0002169;TLOD=141.28     GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB     0/1:0,36:0.961:0,7:0,29:36:0,220:60:20:0.97,0.97,0.973:0.032,0.023,0.945
    chr1    1758687 .       A       AAC     .       .       DP=6;ECNT=1;POP_AF=0.001;P_GERMLINE=-0.0002169;RPA=1,2;RU=AC;STR;TLOD=21.37     GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB     0/1:0,5:0.916:0,3:0,2:35:0,214:60:29:0.99,0,1:0.025,0.03,0.945
    chr1    1765220 .       A       G       .       .       DP=25;ECNT=1;POP_AF=0.001;P_GERMLINE=-0.0002169;TLOD=73.6       GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB     0/1:0,24:0.975:0,7:0,17:26:0,243:60:20:0.99,0.99,1:0.028,0.026,0.946
    
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @tedtoal,

    This is expected. There is no genotyping going on. Mutect2 uses the same logic for PoN samples it does for somatic analysis. You can see similar example results in section 6 of Tutorial#11136 .

    Please see here for a list of other helpful articles on somatic calling.

  • tedtoaltedtoal Member

    If that is the case, the GT field in the VCF file should be "./." rather than "0/1".

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    @tedtoal,

    I recommend you review this article that explains somatic calling. In particular, I'd like to point out the section titled:

    A somatic caller should detect low fraction alleles, can make no explicit ploidy assumption and omits genotyping in the traditional sense.

  • tedtoaltedtoal Member

    Mutect2 didn't omit genotyping, it included erroneous genotyping.

    That document says Mutect2 omits genotyping "in the traditional sense." But it never says what sense it actual DOES genotyping.

    At the very least, it should say something about how it arrives at the genotype it actually does call.

    I could see it perhaps calling "0/1" if the read AD values are, say, "0,3", where there are insufficient reads to know whether or not the locus might actually be heterozygous, but in that case, the call quality should be low. When the AD is 0,75 as in the first instance above, I can't imagine what reason Mutect2 would have for calling it 0/1?

    It might be wise to add a flag to Mutect2 that tells it it is being used with a normal sample for creation of a PON. When that flag is set, it might do something different with the genotype field in the VCF.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    @tedtoal,

    Please see the last section of the the same article that explains the role of the PoN.

  • tedtoaltedtoal Member

    Yes, I've read that article more than once, and that section. It's all well and good, but doesn't say anything about how one might interpret the FORMAT/GT field in the VCF file created by Mutect2 when run in this mode.

    It says the PON is created by first running a normal sample through Mutect2 as if it is a tumor sample, obtaining variant calls made with the sensitivity of a somatic caller and thus producting output that cannot be assumed to be reliable germline variant calls. That's fine, but one would still presume that the VCF file that is generated is reasonable. A file with GT=0/1 when AD=0,75 is not reasonable.

    Perhaps the intention is that the GT field should be assumed to be invalid and should be ignored. If so, it should state that.

    Sorry to belabor this point. I'm just trying to give a suggestion that might help other users.

  • tedtoaltedtoal Member

    Ok, now I'm beginning to understand. I think you are saying that AD (and other annotation fields) are keyed to the GT field? But why wouldn't they instead be tied to the ALT column? In other words, if there are two ALT alleles and GT is 1/2, I think you are saying that AD would have the depths for the two alt alleles, rather than having the depths for all three alleles, REF,ALT1,ALT2?

    I am using the Mutect2 output for input to the PureCN R package that calls copy number variation in panel data. PureCN was written using the old Mutect1 output, but the author is working on making it work with Mutect2, and I'm attempting to use Mutect2 with it, since I use Mutect2 (in GATK4) for somatic variant calling in my project. The problem is that Mutect1 used to have both somatic and germline variants available if you gave it the right options, but Mutect2 does not. But, when Mutect2 is run in PON mode, it produces variants that, if filtered heavily, can be taken as germline variants, and that's what I'm doing. I'm combining the Mutect2 PON germline variant calls with the Mutect2 somatic variant calls. I'm filtering on depth and VAF to remove a lot of the false positive germline calls.

  • tedtoaltedtoal Member

    Okay, maybe I should just redo my pipeline to incorporate that. Thanks.

  • tedtoaltedtoal Member
    edited April 5

    I find myself revisiting this issue. I see no 1/1 genotype calls emitted by Mutect2 in two different projects of mine. From the above, I think it is true that to call a genotype of a tumor sample at a locus, one CANNOT use the GT values of the VCF file produced by Mutect2. For example, here are the GT:AD values of a VCF file entry for a sample:

    0/1:0,20

    In this case, the genotype might be said to be 1/1 or homozygous alternate.

    However, this case is not described anywhere in your documentation that I can find. The closest I see to a description of Mutect2 GT field is at:

    https://gatkforums.broadinstitute.org/gatk/discussion/9183/how-to-call-somatic-snvs-and-indels-using-mutect2

    This shows several examples but none like the above.

    I do see some good description in the article you mentioned above:

    https://software.broadinstitute.org/gatk/documentation/article?id=11127

    That article did not come up in my searches for Mutect2 lacking homozygous genotype calls.

    After further consideration here, and reading the above, I'm now seeing where you are coming from. Since cancer genomes are so chaotic, with so much copy number variation and normal contamination, how do you really define genotype? If there are 10 copies of a locus, when do you call it heterozygous and when homozygous? Given this, I need to go back and see where I make use of the GT field, and question whether I should even be doing that.

    On the other hand, people speak of LOH in cancer. How would that be defined when amplification is present? Does LOH mean that previously there were 1 or more copies of both wild type allele, and now there are 0 copies of it? How would one infer this from the Mutect2 VCF file? If AD were, say, 1,100, then probably one would assume the 1 WT read is likely an error, and the best call would be LOH. But it is a statistical thing, and Mutect2 should be dealing with those statistics, and putting something in the VCF to indicate that it has called the locus as LOH. Is there such a value output by Mutect2?

    It seems to me that if you are going to use the GT key differently than the common conception of it, you should be more consistent with it, in the following way: either list GT numbers for alleles with non-zero read count ONLY, OR list GT numbers for ALL alleles and include read counts for ALL alleles even if 0. It makes no sense to say GT:AD=0/1:0,20 if the entire concept of assuming two copies of the allele are present has been discarded. Instead, use GT:AD=1:20, or, suppose ref=A, alt=C,T, then use for example GT:AD=0/1/2:0,20,0 where you show all alternate allele read counts. You mentioned something about constraints by other tools. My understanding now is that the VCF spec does not require the GT value to be genotype, although it doesn't explicitly say this (and the GT name doesn't help). If so, then I would think you should use the GT field in the most logical and useful way possible (which IMHO would be to list the read counts of ALL alleles including reference, even if 0). If other tools misinterpret it, they are probably misinterpreting it ANYWAY, regardless of how you use it.

    And definitely you should include clear descriptions of how the GT and AD fields work in the documentation for Mutect2, and clearly indicate that the GT field should NOT be interpreted as genotype, and briefly state why.

    Post edited by tedtoal on
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited April 9

    HI @tedtoal

    Can you please in brief explain how I can help you. That is a lot of information and if you could please break it down for me and provide me with a gist of your question I will be able to get to it quicker.

  • tedtoaltedtoal Member
    Accepted Answer

    Sorry, I was posting a lot so that it might help other people bypass some of the time it took me to understand better what is going on.

    I don't think I need any help any more. The only thing I would ask is that you guys improve documentation of the GT field in the Mutect2 VCF file, and possibly remove the field from the VCF file. Documentation about it shouldn't be buried in a blog or tech support thread. I personally think you are using the GT values incorrectly, and have tried to get the maintainers of the VCF spec to more clearly define use of the GT field in tumors, but I don't think they will.

    Thanks.

Sign In or Register to comment.