Download the latest Picard release at https://github.com/broadinstitute/picard/releases.
GATK version 4.beta.6 is out. See the GATK4 beta page for download and details.

[GATK 4 beta] clustered_events in Mutect2/FilterMutectCalls

Hi,

I have a question about filtering Mutect2 calls. A well-characterized SNV (vcf records below 17:7577120) is filtered out by clustered_events filter. It appears that an artificial haplotype is assembled to have the SNV along with other variations nearby. But those variations don't appear in the same read and it seems unlikely to me that those coexist in one haplotype, so I am hesitant to configure a higher value for --maxEventsInHaplotype.
Do you think it is reasonable? If so, I wonder how I can get the variant (17:7577120) not-filtered (and variants on the left filtered).
Thank you!

17      7577079 .       CTTCCT  C       .       clustered_events        DP=592;ECNT=5;NLOD=95.58;N_ART_LOD=-2.510e+00;POP_AF=1.000e-03;P_GERMLINE=-9.228e+01;TLOD=18.63 GT:AD:AF:MBQ:MCL:MFRL:MMQ:MPOS:OBAM:OBAMRC:PGT:PID:SA_MAP_AF:SA_POST_PROB       0/0:316,0:0.019:41,0:0,0:413,0:60,0:41,0:false:false:0|1:7577079_CTTCCT_C       0/1:265,8:0.042:41,41:0,0:410,416:60,60:37,59:false:false:0|1:7577079_CTTCCT_C:0.030,0.020,0.029:4.069e-03,3.816e-03,0.992
17      7577087 .       GT      G       .       clipping;clustered_events;read_position DP=570;ECNT=5;NLOD=93.62;N_ART_LOD=-2.494e+00;POP_AF=1.000e-03;P_GERMLINE=-9.031e+01;TLOD=18.71 GT:AD:AF:MBQ:MCL:MFRL:MMQ:MPOS:OBAM:OBAMRC:PGT:PID:SA_MAP_AF:SA_POST_PROB       0/0:302,0:3.807e-05:41,0:0,0:416,0:60,0:40,0:false:false:0|1:7577079_CTTCCT_C   0/1:260,8:0.029:41,0:0,0:416,416:60,60:40,0:false:false:0|1:7577079_CTTCCT_C:0.030,0.020,0.030:4.483e-03,3.633e-03,0.992
17      7577089 .       G       A       .       clipping;clustered_events;read_position DP=566;ECNT=5;NLOD=91.51;N_ART_LOD=-2.484e+00;POP_AF=1.000e-03;P_GERMLINE=-8.821e+01;TLOD=18.76 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:MBQ:MCL:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:PGT:PID:REF_F1R2:REF_F2R1:SA_MAP_AF:SA_POST_PROB   0/0:297,0:3.822e-05:0:0:NaN:41,0:0,0:413,0:60,0:40,0:false:false:.:.:.:.:0|1:7577079_CTTCCT_C:154:143   0/1:261,8:0.030:2:6:0.250:41,0:0,0:418,416:60,60:39,0:false:true:0.750:0.038:44.92:100.00:0|1:7577079_CTTCCT_C:133:128:0.030,0.020,0.030:4.595e-03,3.543e-03,0.992
17      7577090 .       CGCCGGT C       .       clipping;clustered_events;read_position DP=579;ECNT=5;NLOD=93.27;N_ART_LOD=-2.494e+00;POP_AF=1.000e-03;P_GERMLINE=-8.997e+01;TLOD=18.63 GT:AD:AF:MBQ:MCL:MFRL:MMQ:MPOS:OBAM:OBAMRC:PGT:PID:SA_MAP_AF:SA_POST_PROB       0/0:305,0:4.990e-03:41,0:0,0:416,0:60,0:40,0:false:false:0|1:7577079_CTTCCT_C   0/1:266,8:0.030:41,0:0,0:417,416:60,60:39,0:false:false:0|1:7577079_CTTCCT_C:0.030,0.020,0.029:4.692e-03,3.389e-03,0.992
17      7577120 .       C       G       .       clustered_events        DP=520;ECNT=5;NLOD=82.01;N_ART_LOD=-2.443e+00;POP_AF=2.462e-05;P_GERMLINE=-8.032e+01;TLOD=188.97        GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:MBQ:MCL:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBQ:OBQRC:REF_F1R2:REF_F2R1:SA_MAP_AF:SA_POST_PROB   0/0:273,0:0.016:0:0:NaN:41,0:0,0:415,0:60,0:39,0:false:false:.:.:135:138        0/1:191,56:0.233:26:30:0.536:41,41:0,0:418,388:60,60:37,44:false:false:54.12:100.00:102:89:0.222,0.202,0.227:9.482e-03,0.012,0.978

image

Issue · Github
by shlee

Issue Number
2302
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
sooheelee

Best Answer

Answers

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited July 18

    Hi again @dayzcool,

    Thanks for presenting your question in such an organized manner. It's my understanding that somatic analyses require much manual review. I've attached a PDF of the original GATK3-MuTect2 hands-on tutorial worksheet that I wrote back in January/February and that was presented by others in Leuven, Belgium (also back in February). Although the attached tutorial is about the previous implementation of M2, and does not use GATK4-M2, the concepts it discusses are still germane. In section 5, I count the instances for the example data that a filter is the sole reason for a site not passing.

    This simplistic comparison is meant to show how deriving a good somatic callset involves comparing callsets, e.g. from different callers, manually reviewing passing and filtered calls and, as alluded to, additional filtering.

    So I think the best bet for your case would be to manually remove the filter for your site of interest at 17:7577120.

    P.S. An updated hands-on tutorial that uses GATK4-Mutect2 has been written by another teammate and should be shared in a blogpost soon.

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited July 18

    @dayzcool,
    It appears the default setting for --maxEventsInHaplotype is 2. I'm curious if you set this to 3, how much of an impact it will have on your passing calls. GATK4 splits the previous M2 functionality to Mutect2 and FilterMutectCalls is so that you can easily iterate the latter tool over your data while tweaking the parameters to obtain better callsets. I believe FilterMutectCalls will run fairly quickly.

  • dayzcooldayzcool Member

    @shlee, thanks for your help! Here's what I learned from following your advice.
    First, I experimented with --maxEventsInHaplotype argument. Although setting a bigger value for maxEventsInHaplotype didn't dramatically increase the number of PASS variants, the variant wouldn't pass the filter until maxEventsInHaplotype is set to 5+, 3 bigger than the default value and ended up passing 46% more number of variants. FYI, Default value (2) gave 503 PASS variants, maxEventsInHaplotype of 3 did 634, maxEventsInHaplotype of 4 did 733, maxEventsInHaplotype of 5 did 784, and maxEventsInHaplotype of 6 did 815.

    In addition, it appears that the local assembly may contribute to increase the number of events by realignment and make complex indels filtered out almost every time. The variant (complex indel?) on the left in the IGV screenshot (17:7577079-17:7577090) could be treated like one event instead of four events, because it is fairly close and phased.

  • dayzcooldayzcool Member
    edited July 20

    Second, increasing -kmerSize seems to be really promising and help phase better. But the variant is still filtered out by clustered_events. It is because artificial haplotypes are a lot longer in Mutect2 local assembly in comparison to GATK's. Longer haplotypes are more likely to have more number of variants so the filter becomes more aggressive; the well-characterized variant (17:7577120) was NOT assembled along with the complex indel left, but it is phased with other variants further away and filtered out.
    I am not sure what I can do not to filter the variants out without sample-specific filters.

    Additionally, I'd like to report two observations in this experiment.
    Mutect2 local assembly bams have a large number of artificial haplotypes even with --bamWriterType CALLED_HAPLOTYPES. -kmerSize of 10 gave 127 artificial haplotypes, 40 did 40 artificial haplotypes, and 100 did 16 artificial haplotypes. I wondered if --bamWriterType CALLED_HAPLOTYPES really works.
    Also, -kmerSize of 100 didn't phase the variants properly like -kmerSize of 40 did for the variants. Do you have any idea why?
    image

  • dayzcooldayzcool Member

    Oh, I'd like to add that the script from the developer worked great except for the indexing part below. I didn't have to reindex actually.

    #re-index the vcf
    gatk-launch IndexFeatureFile -F redo.vcf
    

    That command exits with this message:
    A USER ERROR has occurred: The IndexFeatureFile tool is temporarily disabled.

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited August 29

    Hi @dayzcool,

    Thanks for these tests. Our developer finds the poor phasing at k=100 strange as well. To delve deeper into this, you can run Mutect2 on this locus with -debugGraphTransformations and graphOutput. We just released GATK4.beta.2 yesterday, fyi, in case these arguments are not in v4.beta.1, they should be in v4.beta.2. You can view the graphs with graphViz. Our developer would be glad to partake in interpreting the graphs if you post them.

    And here let me convey some additional thoughts from our developer on the topic of mapping artifacts.

    Assembling fewer false haplotypes would be useful, but the fundamental problem is that Mutect has to be conservative and treat significant variation as evidence of a mapping artifact. Most of the filters are just proxies for mapping errors....[In the future] we [hope to] have a more direct way of getting mapping errors and will be able to ease up on the proxies, which hurt sensitivity. Basically, it's a[n] issue that for now is only solved with manual review.

    We really appreciate your contribution here.

    Post edited by shlee on
  • dayzcooldayzcool Member

    @shlee, I see. Mutect focuses more on reducing false positives than reducing false negatives?
    Attached is an archive of dot files with kmerSize of 100, and the following message is seen in the log. Thank you for taking a look at it!

    ReadThreadingAssembler - Skipping writing of graph with kmersize 100

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @dayzcool It's more like Mutect2 seeks the optimal balance of achieving high sensitivity without allowing an explosion of false positives

  • dayzcooldayzcool Member

    @Geraldine_VdAuwera, understood. Apparently, I think Mutect2 is doing a great job.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @dayzcool

    Hi,

    It seems the team is preparing to update the assembly engine code. This change may fix your issue, however, it will take some time to implement, so for now you will need to use manual review.

    -Sheila

  • @Sheila, thank you for the updates. I look forward to using the assembly engine that phases better!

  • kekekeke Member

    @shlee, as @dayzcool said:

    In addition, it appears that the local assembly may contribute to increase the number of events by realignment and make complex indels filtered out almost every time. The variant (complex indel?) on the left in the IGV screenshot (17:7577079-17:7577090) could be treated like one event instead of four events, because it is fairly close and phased.

    I am confused by the same question, how to merge three or four close and phased events into one event or will the updated program realize it?

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited December 11

    Hi @keke,

    Variants are left-aligned as much as possible. The reassembly has its own logic to minimize penalty scores, e.g. gaps. As you say, the variant in the IGV screenshot could indeed be treated as one event instead of multiple. However, currently, our callers produce individual records for these variants and connect them via physical phasing. You see for the variants on the same phase the pipe | and an identifier that points to the first variant in the series that are phased:

    :0|1:7577079_CTTCCT_C

    I hope this helps clarify your question.

Sign In or Register to comment.