To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

GATK4 MuTect2 passing germline events?

kmegqkmegq BroadMember, Broadie

Dear GATK team,

I have been experimenting with calling somatic variants in my tumor/normal WES data using GATK4 MuTect2. After calling and filtering, I found many sites (mostly indels and MNP indels) passing filters that were also clearly present in the normal, and even were listed as having a high AF in the normal. This really surprised me, because I thought that by default, MuTect2 applied:

 --max_alt_allele_in_normal_fraction 0.03
--max_alt_alleles_in_normal_count 1

I went back and investigated the header of my VCF files, and I don't see those options anywhere. Should these be being applied by default, or do I need to pass them in by hand in the new version? I also looked at FilterMutectCalls, but didn't see anything for filtering sites present in the normal.

Thanks for your help!
Best,
Kate

Answers

  • kmegqkmegq BroadMember, Broadie

    Further investigation reveals that

    --max_alt_allele_in_normal_fraction and --max_alt_alleles_in_normal_count
    

    no longer exist in the new version of MuTect2. However, I found in the MuTect2 BETA documentation that there is now a flag called

    --minNormalVariantFraction
    

    which is "Minimum fraction of variant reads in normal for a pileup to be considered inactive." By default, this is set at 0.1. However, when I try to modify this option to make it more stringent, I get:

    A USER ERROR has occurred: minNormalVariantFraction is not a recognized option
    

    Is there any way to modify this behavior in MuTect2?

    Best,
    Kate

  • SheilaSheila Broad InstituteMember, Broadie, Moderator
    edited October 2017

    @kmegq
    Hi Kate,

    Can you post an example VCF record of a site in both the normal and tumor which passes filtering? Are you using the default filters in FilterMutectCalls? Can you also post the IGV screenshot of the original BAM files and bamout file?

    Some of the flags/arguments are still being hooked up in the beta version, so they may not work yet. You may find the Mutect2 tutorials/presentations helpful for understanding the filters. They are in the presentations section.

    Thanks,
    Sheila

  • kmegqkmegq BroadMember, Broadie

    Hi Sheila,

    Yes, sorry for the delay. I am attaching several lines from my VCFs, an IGV screenshot from the bamout and an IGV screenshot from the original input bam files. As a little bit of background, this is dog cancer data, and I have run these analyses using the --dontUseSoftClippedBases argument, because I was seeing a large number of artifactual insertion calls that were actually the ends of soft-clipped reads being called as insertions without any other support.


    1) The first thing I am seeing is instances of indels that seem to be present in both the tumor and the normal. I have seen this in both the GATK3 and GATK4 versions of MuTect2.

    Here is a GATK3 example:

    chr28   4157936 .       ACATACCT        A       .       PASS    ECNT=1;HCNT=2;MAX_ED=.;MIN_ED=.;NLOD=11.08;TLOD=8.15    GT:AD:AF:ALT_F1R2:ALT_F2R1:QSS:REF_F1R2:REF_F2R1        0/0:39,0:0.00:0:0:1127,0:15:24  0/1:75,4:0.054:4:0:2011,98:44:31
    

    And the matching location in the input T and N bams (normal on top), with no variant at that location:


    Here is an example from my GATK4 output of a T insertion that is present in both the tumor and the normal:

    chr2    69660054        .       GT      GTT     .       PASS    DP=213;ECNT=1;NLOD=17.61;N_ART_LOD=-7.012e-01;POP_AF=1.000e-03;P_GERMLINE=-1.478e+01;RPA=9,10;RU=T;STR;TLOD=5.72        GT:AD:AF:MBQ:MCL:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB   0/0:67,1:0.146:34,14:0,0:273,397:60,60:24,21    0/1:94,7:0.145:31,31:0,0:269,275:60,60:28,17:0.061,0.061,0.070:0.049,4.501e-03,0.947
    

    The bamout file:

    And the matching location in the input bams shows that again the T insertion is present in both:

    Another example from GATK4:

    chr29   27190001        .       AT      A       .       PASS    DP=97;ECNT=1;NLOD=5.00;N_ART_LOD=-4.251e-01;POP_AF=1.000e-03;P_GERMLINE=-1.871e+00;RPA=14,13;RU=T;STR;TLOD=6.33 GT:AD:AF:MBQ:MCL:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB   0/0:22,1:0.246:30,30:0,0:266,174:60,60:24,21    0/1:39,6:0.280:28,27:0,0:229,209:60,60:31,16:0.111,0.131,0.146:0.019,0.013,0.968
    

    The bamout:

    And the input bams, which seem to show the T deletion in the tumor, and one TT deletion in the normal:


    2) The second scenario is that there is a location with a deletion called in both the T and the N, but the starting site varies somewhat, and one variant of this is being called as a somatic variant.

    chr12   67843872        .       ATTT    A       .       PASS    DP=72;ECNT=2;NLOD=5.76;N_ART_LOD=-1.173e+00;POP_AF=1.000e-03;P_GERMLINE=-3.132e+00;RPA=15,12;RU=T;STR;TLOD=5.43 GT:AD:AF:MBQ:MCL:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB   0/0:22,1:0.245:30,0:0,0:250,0:60,0:20,0 0/1:18,4:0.288:27,28:0,0:222,260:60,60:23,23:0.081,0.131,0.143:0.056,9.053e-03,0.935
    

    Bamout:

    Input bams:


    3) The third scenario, which I find confusing, is that some variants appear to be true somatic variants, but have a very high allelic fraction in the normal. For example, here the T insertion only seems to be in one read in the normal when viewed in IGV, but the normal AF is 0.136

    chr17   7169355 .       CT      CTT     .       PASS    DP=87;ECNT=1;NLOD=5.42;N_ART_LOD=-4.555e-02;POP_AF=1.000e-03;P_GERMLINE=-2.235e+00;RPA=11,12;RU=T;STR;TLOD=6.52 GT:AD:AF:MBQ:MCL:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB   0/0:26,1:0.136:33,18:0,0:214,361:60,60:20,47    0/1:44,5:0.154:31,28:0,0:239,178:60,60:25,14:0.091,0.081,0.102:7.132e-03,0.038,0.955
    

    Bamout:

    In these cases, what is the ALT allelic fraction in the normal referring to? Is it a second, unlisted ALT allele, as I think I saw mentioned as a possibility in the forums? If so, why doesn't MuTect2 list the second ALT allele? Does it only list ALT alleles it calls as somatic?


    4) Finally, a related question - many of these variants appear to be written as multi-nucleotide polymorphism indels. I don't remember seeing them in the GATK3 MuTect2 output. I don't quite understand the notation...in the record I list below, wouldn't REF: CCTCT ALT: TCTCT be the same as REF: C ALT: T at the first position? I thought that MNPs had to differ at each position.

    chr31   19677805        .       CCTCT   TCTCT   .       PASS    DP=56;ECNT=1;NLOD=5.19;N_ART_LOD=-7.595e-01;POP_AF=1.000e-03;P_GERMLINE=-2.431e+00;TLOD=5.60    GT:AD:AF:MBQ:MCL:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB   0/0:18,0:0.054:31,0:0,0:364,0:60,0:41,0 0/1:29,4:0.169:30,28:0,0:278,334:60,36:36,36:0.081,0.081,0.100:0.011,0.025,0.965
    

    A number of these calls seem to occur where insertions and deletions appear in both the tumor and the normal reads. I wonder if having a filter for variants near an indel would help resolve some of these issues?

    Sorry this is so long. Thank you again for all your help!

    Best,
    Kate

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @kmegq
    Hi Kate,

    Sorry for the delay. I will get back to you soon.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @kmegq
    Hi Kate,

    I have run these analyses using the --dontUseSoftClippedBases argument, because I was seeing a large number of artifactual insertion calls that were actually the ends of soft-clipped reads being called as insertions without any other support.

    Can you clarify what you mean by that? Is there no support even in the bamout? Are the calls filtered out after VQSR/hard filtering? We recommend using the soft clipped bases because the reassembly step could unveil indels in those regions. Do you know for sure if the indels are false positives?

    1,2) It is possible for sites with small amounts of evidence of the variant in the tumor to be called as variant (as long as the evidence in the tumor is greater than the evidence in the normal). Have a look at the Mutect2 presentation (in the presentations section). Are you inputting a PoN and germline resource as well as normal sample? Are those calls passing after FilterMutectCalls?

    3) Have a look at this thread.

    4) Hmm. That is interesting. Are there other variants close to that site?

    Thanks,
    Sheila

  • kmegqkmegq BroadMember, Broadie

    Dear Sheila,

    Thank you for your reply!

    Regarding the soft-clipped bases - when I first ran my data through (GATK3) MuTect2, I seemed to have two groups of samples - one with a low fraction of indels (~5-10%) and one with an extremely high fraction of indels (up to 91%), which were mostly insertions of 5-20 bp. Further inspection revealed that the proportion of indels almost perfectly divided my data by sample type - those with a low fraction of indels were prepared from frozen tissue, and those with the high fraction of insertions were prepared from FFPE tissue. This made us very suspicious that we were seeing an artifact.

    Upon further inspection, we found that these insertions appeared only at the ends of reads, which had been soft-clipped in the original input bam:

    In addition, when we BLASTed the sequence against the dog genome, the insertions usually mapped a few hundred to a few thousand bp away from the insertion site, on the same chromosome. I consulted with some researchers in the Cancer Program, and they were also suspicious that this was an artifact, and theorized that it was possibly caused by DNA fragmentation and cross-linking due to formalin fixation. I have since found this post on the GATK forums, and wondered if this is the same or a similar artifact. I would like to hear your thoughts, as well!

    That is why I decided to rerun everything using the -dontUseSoftClippedBases option. I have since tried running my data through the GATK4 version of MuTect2 + FilterMutectCalls, (using soft-clipped bases) and I think it does a better job of filtering the insertions out, but some still make it through:

    Interestingly, even after using the -dontUseSoftClippedBases argument, I still see a lot of insertions being called where the only support was at the end of the reads (in both GATK3 and GATK4). In fact, after calculating my top significantly mutated genes in my GATK3 data using Genome MuSiC, many of the top genes (including the top gene) were supported only by indels that all occur near the ends of reads:

    I was able to filter these out by adjusting the --minMedianReadPosition filter in FilterMutectCalls in the GATK4 data. However, because the GATK4 data seems to have some variants that appear in the normal, and there are MNP and mixed indels called that don't appear to be MNPs, and I was also seeing indels at the ends of reads in the GATK3 version, I thought I would take the consensus calls between the GATK3 and GATK4 versions. Do you think this is a reasonable approach?

    Thank you again for your help!

    Best,
    Kate

    Issue · Github
    by Sheila

    Issue Number
    2613
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @kmegq
    Hi Kate,

    Thanks for the detailed posts :smile: I actually have to check with the Mutect2 team on a few things. I will get back to you when I have some concrete answers.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @kmegq,

    Our development philosophy for Mutect2 in GATK4 is to maximize sensitivity, so we expect the program to call quite lot of false positives, including artifacts that stem from tissue preservation techniques. Rather than build filters directly into Mutect2 itself, we provide additional tools to perform filtering on various criteria including sequencing context and external resources. This provides filtering functionality in a way that is far more flexible and controllable for the individual analyst. We don't yet have full documentation on this topic but you can look up the presentation and tutorial materials from our most recent workshop (linked here) for more details.

Sign In or Register to comment.