[GATK] joint calling for Mutect2?


I am interested in inferring clonal evolution using somatic variants called by Mutect2. One way to infer is by tracking VAF (variant allele fraction) of somatic variants in multiple time points and clustering.
One challenge in using Mutect2 calls is its difficulty to compute VAFs especially for indel, because some variants are called using local assembly in a subset of time points. Allele counting in time points where the variant is not called is tricky. Thus, I usually limit variants to SNPs which is less hard to count. But some cohorts don't carry many somatic variants and I believe it would be helped by joint calling. Does it make sense?
Would you consider implementing joint calling for Mutect2 like Haplotypecaller?

(Image credit: https://github.com/chrisamiller/fishplot)

Best Answers


  • dayzcooldayzcool Member

    @Sheila and @shlee,
    Thank you so much for referring me to the informative article and discussion!
    I can't say I fully understand the technical difficulties. But I understood it is nontrivial to implement and joint somatic variant calling should be different from the joint calling of haplotypecaller. I still think it might be worthwhile for Mutect2 to be able to call somatic variants jointly on multiple tumor samples from an individual. It would help track somatic variants of a person over time.

    By the way, the article mentioned that Mutect2 would run without a matched normal. I wonder if Mutect2 now supports the tumor only mode. I remember no variant passed filters in tumor only mode for an older version. (One the other hand, I now think tumor only calling with high false positives would be a privacy threat..)

    Image credit: Libbrecht lab

    Issue · Github
    by shlee

    Issue Number
    Last Updated
    Closed By
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    I still think it might be worthwhile for Mutect2 to be able to call somatic variants jointly on multiple tumor samples from an individual.

    That's a great idea @dayzcool and I'll put in a feature request to that effect.

    Yes, Mutect2 has a tumor-only mode, which refers to analyzing a sample BAM (labeled with -tumor without a matched normal BAM (labeled with -normal). The mode is available to (i) call on normal samples towards panel of normals generation and (ii) for tumor analysis without a matched normal, in which case the expectation is that the germline resource will aid in filtering germline variants. The matched normal BAM is traditional, of course, in somatic calling. An alternative is to provide the matched normal's calls as a VCF using the germline resource argument. The privacy threat you mention is something we expect researchers to contend with. We just provide useful tool levers and general recommendations.

  • dayzcooldayzcool Member

    @shlee, thanks again for your kind help!!

  • dayzcooldayzcool Member

    Awesome - thank you for the update, @shlee!
    It is a great time for me to be able to use it, as I will work on clonal evolution.

    Happy Lunar New Year!!

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Happy New Year to you too @dayzcool!

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    Thanks @shlee for getting the word out about this new feature and thanks @dayzcool for trying it out!

  • dayzcooldayzcool Member

    Thanks again, @shlee and @davidben!

    At first glance, I feel I need a user guide for the multi-sample calling feature. Some things aren't straightforward to me. For instances, assuming that multiple samples are specimens from a biological being, how do multiple normal samples work? How would filtering be changed? And, do you plan to implement it in reference pipeline in wdl? I would love to be able to run multiple sample calling in mutect2.wdl pipeline.. It's very possible I might have missed docs/discussions though.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @dayzcool Multiple normal samples are pooled together as, effectively, a single normal. In the vcf they are separate, but what I mean is that as far as somatic calls are concerned they are just aggregated.

    We made a reasonable effort to refactor the filtering algorithm to extend to multiple samples, but it is somewhat heuristic and is probably the most "beta" aspect of the new feature. Essentially it's a vote among tumor samples, weighted by alt read counts, as to whether an allele is somatic or not.

    We plan to make a multi-sample wdl, but it is not written yet.

    The command lines for calling and filtering are (leaving out the pon and gnomad for brevity):

    gatk Mutect2 -R $ref \
        -I tumor1.bam -I tumor2.bam \
        -I normal1.bam -I normal2.bam \
        -normal normal1 -normal normal2 \
        -O calls.vcf
    gatk FilterMutectCalls -V calls.vcf -O filtered.vcf

    where normal1 is the sample name in the header of normal1.bam.

  • dayzcooldayzcool Member
    edited February 28

    Thanks again for implementing this, @davidben and @shlee. I am still evaluating it but it am happy with the multi-tumor calling results for clonal evolution study. I also realized the brilliance of the multi-normal calling; it seems to be potentially useful in calling variants on transplant cases. (If not, I appreciate any guidance on using mutect2 on transplant cases.)

    Minor issues in using and multi sample calling were

    • Computational performance
      It took 30+ hours in running mutect2 on typical WGS sample with 100X+. It spent considerable time on some steps like CollectF1R2Counts and FilterAlignmentArtifacts, which I couldn't find any good way to accelerate. And, I couldn't find #cpu argument for LearnReadOrientationModel which often uses more than 1 cpu. Do you have any suggestions on running those (optional) tools faster, or budgeting resources better?

    • P_GERMLINE is replaced by GERMQ?
      Not about multi sample calling, but I just noticed and wondered why the change is made, Anyhow, it looks that filtering still uses --max-germline-posterior, not GERMQ, and Mutect2 documentation might need to be updated to replace P_GERMLINE:

    • Multi sample
      I was able to modify mutect2.wdl for multi samples mostly. One case I wasn't sure how to implement is multi-normal calling, because I couldn't run CalculateContamination with multiple normal samples. Do I need to run CalculateContamination for all possible tumor-normal pairs then run FilterMutectCalls with those? Then, for a tumor sample, it would have multiple CalculateContamination outputs. For example, if running on tumor0, tumor1, normal0, normal1, I would have to run CalculateContamination on tumor0-normal0, tumor0-normal1, tumor1-normal0, tumor1-normal1?
      Funcotator doesn't look to be ready for multi samples. But I can understand it's not straightforward since it's unclear whether a variant is "called" for one specific tumor among tumors called together.

    • Evaluation
      I wonder if you could share how you do end-to-end testing.

  • dayzcooldayzcool Member

    @davidben Thank you for your very helpful explanations!
    So, I understood that no truth set you use are publicly available. I'd like to have more truth sets for validation, so please let me know if I misunderstood. (DREAM challenge synthetic data wasn't available for download for non-participants)
    Truth set using lineage sequencing approach sounds wonderful. If you plan to share it with the world, please advise how I can get updates.

Sign In or Register to comment.