Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!

Using Mutect2 to call the same variants in multiple tumor samples

FPBarthelFPBarthel HoustonMember ✭✭
edited August 2018 in Ask the GATK team

Hi,

I want to use M2 to call variants in multiple tumor samples (from the same individual) and have followed the great guide by @shlee here. Let's assume I have one normal control N and three tumor samples T1, T2 and T3. After calling and filtering variants I am left with three VCF files, eg.

T1-N.filtered.vcf
T2-N.filtered.vcf
T3-N.filtered.vcf

I want to merge these three VCF files so that each variant is genotyped in each tumor sample and the normal sample. The resulting merged VCF file should have a header like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  N    T1    T2    T3

Is there a recommended workflow for this?

Thanks!
Floris

Tagged:

Best Answers

  • SheilaSheila Broad Institute admin
    Accepted Answer

    @FPBarthel
    Hi,

    I know this is on the team's radar, but you can keep track of it here. Perhaps posting that you would like it there will also help :smile:

    -Sheila

Answers

  • SheilaSheila Broad InstituteMember, Broadie admin

    @FPBarthel
    Hi Floris,

    Actually, the team is working on a workflow for this right now, but in the meantime, you can simply combine the VCFs using CombineVariants.

    -Sheila

  • FPBarthelFPBarthel HoustonMember ✭✭

    Thanks @Sheila . Is there any way to fill in empty GT:AD:DP:GQ:PL for variants not called when combining VCFs?

  • SheilaSheila Broad InstituteMember, Broadie admin

    @FPBarthel
    Hi Floris,

    You can try adding --never_trim_vcf_format_field, but I am not sure if that will help.

    -Sheila

  • cbaocbao Member, Broadie ✭✭

    Hi @FPBarthel,

    You can try my in-house pipeline. I have tested them for many times.

    1. cbao/GATK3_CombineVariants
    2. cbao/GATK4_VcfToIntervalList
    3. cbao/GATK4_SplitIntervals
    4. cbao/GATK4_Mutect2_scatter_gga (Use Call Caching)

    All of them have configurations.

    Thanks,
    Chunyang

  • cbaocbao Member, Broadie ✭✭

    They are workflows in Firecloud.

  • FPBarthelFPBarthel HoustonMember ✭✭

    Thanks @cbao I will look into this. Meanwhile curious on the status of the official workflow for this, is there any idea when this will become available @Sheila ?

  • SheilaSheila Broad InstituteMember, Broadie admin
    Accepted Answer

    @FPBarthel
    Hi,

    I know this is on the team's radar, but you can keep track of it here. Perhaps posting that you would like it there will also help :smile:

    -Sheila

  • FPBarthelFPBarthel HoustonMember ✭✭

    Thanks I've posted there :)

  • FPBarthelFPBarthel HoustonMember ✭✭

    Great, thanks @davidben I will be testing this out in the coming months and update you on my findings.

  • FPBarthelFPBarthel HoustonMember ✭✭

    @davidben I will be starting on this today and wondering whether this works with a mix of whole genome and whole exome samples, and whether it would work with exome samples using different capture kits.

    eg. we have one subject with 8 samples:

    • 2 tumor WES restricted exome capture kit (~40ish Mb capture)
    • 1 tumor WES expanded exome kit including UTRs (90ish Mb capture)
    • 3 tumor WGS
    • 1 normal WES restricted exome capture
    • 1 normal WGS
  • FPBarthelFPBarthel HoustonMember ✭✭

    Another thing: is it possible to supply a list of variants/sites to make calls at in addition to the ones automatically called.

    In practice this would be of interest for certain frequent hotspot variants, eg. TERT promoter mutations, and one would want to have genotyped these variants regardless of coverage and whether or not it can be found in any of the samples.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    eg. we have one subject with 8 samples. . .

    @FPBarthel I think that this will work with your mix of exomes and genomes. I would probably implement it by merging the panels of normals for each capture into a single pon vcf. Please let us know how it goes. We have a lot to learn about this new mode and this is an extremely interesting use case.

    Another thing: is it possible to supply a list of variants/sites to make calls at in addition to the ones automatically called.

    Sure, this is called GGA mode. Just add --genotyping-mode GENOTYPE_GIVEN_ALLELES --alleles <vcf of alleles you want to force Mutect2 to genotype> and Mutect2 will genotype these variants in addition to anything it would otherwise discover.

  • FPBarthelFPBarthel HoustonMember ✭✭
    edited February 11

    Thanks @davidben, great point I did not consider the PON then. Seeing your answer, I am taking it's not possible to supply multiple PONs (yet?). I will proceed with a merged PON as you suggested.

    Sure, this is called GGA mode.

    Thanks for this, the documentation is a bit unclear here since it suggests GENOTYPE_GIVEN_ALLELES profiles ONLY the specified alleles.

    from: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_mutect_Mutect2.php#--alleles

    When the caller is put into GENOTYPE_GIVEN_ALLELES mode it will genotype the samples using only the alleles provide in this rod binding

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    Thanks for this, the documentation is a bit unclear here since it suggests GENOTYPE_GIVEN_ALLELES profiles ONLY the specified alleles.

    @FPBarthel You are quite right -- the documentation describes how HaplotypeCaller uses the mode. Confusingly, Mutect2 shares the parameter, along with the inherited javadoc, but uses it differently.

    Seeing your answer, I am taking it's not possible to supply multiple PONs (yet?

    Not yet. I can't say we're even working on it at the moment, but as we understand more about users' needs for multi-sample mode it may become a priority. For example, if you find that using a single merged PoN seems to hurt sensitivity. In fact, everything about multi-sample mode will depend on users' suggestions. Our own sense of how it will or should be used are hazy at best.

  • FPBarthelFPBarthel HoustonMember ✭✭

    How do I run CalculateContamination for multi-sample M2?

    Previously I would:

    • Run Mutect2 providing T and N as arguments (= 1 process)
    • Run GetPileupSummaries individually for T and N (= 2 parallel processes)
    • Run CalculateContamination providing T and N pileup summaries (= 1 process)
    • Run FilterMutect providing output from CalculateContamination (=1 process)

    Can I provide CalculateContamination with multiple T and multiple N samples?

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @FPBarthel You need to

    • run GetPileupSummaries for each tumor and one normal.
    • Run CalculateContamination for each tumor providing its pileup summary and the normal's pileup summary.
    • Specify --tumor-segmentation multiple times on the FilterMutectCalls command line, once for each tumor output from CalculateContamination (optional but helpful for germline filtering when there are CNVs).
    • Specify --contamination-table multiple times on the FilterMutectCalls command line, once for each tumor's output from CalculateContamination.
  • FPBarthelFPBarthel HoustonMember ✭✭

    Thanks!! Same question for FilterByOrientationBias.

    Previously I would:

    • Run CollectSequencingArtifactMetrics on the single T bamfile
    • Run FilterByOrientationBias on the FilterMutectCalls vcffile given a single artifact metrics file

    Can I supply FilterByOrientationBias with multiple pre_adapter_detail_metrics files for each tumor bam?

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @FPBarthel We have a new orientation bias tool, and FilterByOrientationBias is semi-deprecated. The old version usually works fine, but it misses artifacts on the new Illumina Novaseq machines. Therefore, we chose not to update it for the new multi-sample workflow.

    To use the new tool, you need to:

    1. Run CollectF1R2Counts on each tumor bam that you suspect of orientation bias artifacts. If in doubt it can't hurt to run it on all tumor bams.
     gatk CollectF1R2Counts \
      -R GRCh38.fasta \
      -I tumor.bam \
      -alt-table tumor-alt.tsv \
      -ref-hist tumor-ref.metrics \
      -alt-hist tumor-alt.metrics
    
    1. Run LearnReadOrientationModel on each tumor's output from Step 1.
     * gatk LearnReadOrientationModel \
      -alt-table tumor-alt.tsv \
      -ref-hist tumor-ref.metrics \
      -alt-hist tumor-alt.metrics \
      -O my-tumor-sample-artifact-prior.tsv
    
    1. Specify --orientation-bias-artifact-priors multiple times, once for each output of LearnReadOrientationModel, on the Mutect2 command line (note: not the FilterMutectCalls comand line). Mutect2 uses the artifact priors to annotate variant calls, and FilterMutectCalls picks up on this annotation.
  • FPBarthelFPBarthel HoustonMember ✭✭

    Thanks @davidben ! This was able to run several hundred samples in ~24 hours or so. Really cool to be one of the first to test this without any official tutorial. I will be generating data on more samples and doing some comparison with my previous pipeline which used freebayes to force genotype related samples in the coming days.

    For FilterMutectCalls, why are there two separate arguments --tumor-segmentation and --contamination-table that take the same files as parameters, why not have a single parameter to take the files and a second boolean parameter to determine whether tumor segmentation should be performed?

    Actually, in my tests, giving the --tumor-segmentation parameter the same output from CalculateContamination, gives the following error:

    java.lang.IllegalArgumentException: there is no such column: contig
            at org.broadinstitute.hellbender.utils.tsv.DataLine.columnIndex(DataLine.java:431)
            at org.broadinstitute.hellbender.utils.tsv.DataLine.get(DataLine.java:400)
            at org.broadinstitute.hellbender.utils.tsv.DataLine.get(DataLine.java:529)
            at org.broadinstitute.hellbender.tools.walkers.contamination.MinorAlleleFractionRecord$MinorAlleleFractionTableReader.createRecord(MinorAlleleFractionRecord.java:85)
            at org.broadinstitute.hellbender.tools.walkers.contamination.MinorAlleleFractionRecord$MinorAlleleFractionTableReader.createRecord(MinorAlleleFractionRecord.java:78)
            at org.broadinstitute.hellbender.utils.tsv.TableReader.fetchNextRecord(TableReader.java:357)
            at org.broadinstitute.hellbender.utils.tsv.TableReader.access$200(TableReader.java:94)
            at org.broadinstitute.hellbender.utils.tsv.TableReader$1.hasNext(TableReader.java:465)
            at java.util.Iterator.forEachRemaining(Iterator.java:115)
            at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
            at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
            at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
            at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
            at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
            at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
            at org.broadinstitute.hellbender.utils.tsv.TableReader.toList(TableReader.java:525)
            at org.broadinstitute.hellbender.tools.walkers.contamination.MinorAlleleFractionRecord.readFromFile(MinorAlleleFractionRecord.java:55)
            at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
            at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
            at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
            at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
            at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
            at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
            at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
            at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2FilteringEngine.<init>(Mutect2FilteringEngine.java:42)
            at org.broadinstitute.hellbender.tools.walkers.mutect.FilterMutectCalls.onTraversalStart(FilterMutectCalls.java:119)
            at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:964)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
            at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
            at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
            at org.broadinstitute.hellbender.Main.main(Main.java:291)
    
  • FPBarthelFPBarthel HoustonMember ✭✭

    Too late to edit this but figured out what I did wrong here.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @FPBarthel Apologies for not being clear: what I meant to say is that you should have CalculateContamination output the segmentation via --tumor-segmentation segments.table and then pass this as an input to FilterMutectCalls via --tumor-segmentation segments.table. This file contains a rough segmentation of the genome into CNV segments and is distinct from the contamination table, which is just a single number, the contamination.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    And thanks very much @FPBarthel for being an early adopter!

  • FPBarthelFPBarthel HoustonMember ✭✭

    @davidben it looks like Funcotator set to produce MAF runs on multi-sample VCF FilterMutectCalls output without errors, however the sample IDs are not properly shown in the columns, and judging from the header only a single tumor sample is shown. For the Funcotator VCF output it looks like multiple samples are properly output.

    Also, is there a way to split multi-allelic M2 output across multiple lines? I've used bcftools for this in the past. I did not find a M2 parameter in the documentation that could do this.

    Multi-allelic events are less intuitive to compare across VCF files (in my opinion), eg..:

    If one wanted to compare vcf a

    1   100001  .   C   T,A
    

    to vcf b

    1   100001  .   C   T
    

    would compare easier if vcf a was written as follows:

    1   100001  .   C   T
    1   100001  .   C   A
    
  • FPBarthelFPBarthel HoustonMember ✭✭

    @davidben how to determine if a variant is "called" in a tumor sample? It seems that the genotypes are never set to 0/0 for tumor samples, and that multiple tumor samples always share the exact same genotypes?

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @FPBarthel Mutect2 genotypes are just an afterthought that we put in to make the vcf look normal. The only meaningful call, at least for now, is on the variant as a whole. For now we leave it to the user to decide, for example, whether an AD of 100,1, for example, in a post-chemotherapy sample at what was a real somatic variant pre-treatment, is real or not. I expect that this problem is quite hard but that the overall signature of ADs can yield a reliable estimate of post-treatment somatic burden.

    All this is really a cop-out to say that we just don't have a model for it yet. One step at a time. However, we are very open to suggestions.

    As for splitting multiallelics, could you open a new thread for this? I don't know how to triage this feature and would like to see if others want it, which won't be possible if it's buried in this discussion.

  • FPBarthelFPBarthel HoustonMember ✭✭
    edited February 18

    Got it!

    Another question. How does multisample M2 calculate some of the parameters that were previously based on a single sample, eg SAAF and SAPP? They are part of the INFO and not FORMAT fields.

    And to add to that, do you think that mixing different input (eg. WGS and WXS, WXS from different experimental procedures) would likely have a big impact on post-M2 filtering?

    I went and checked some of the WGS + WXS mix samples and finding that there are a lot of PASS calls with 0,0 AD in the WXS but positive AD in the WGS, which is a good sign I suppose.

    Post edited by FPBarthel on
  • FPBarthelFPBarthel HoustonMember ✭✭

    Just to add, I think a boolean flag for each tumor sample flagging a variant whether it would have been called in a simple single-tumor-sample mutect2 run would definitely be helpful. I think not having this flag makes something like mutation frequency calculation very difficult. Sure you can create arbitrary thresholds based on AD values but I would certainly not be very comfortable doing so.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    Most of the annotations involving read counts simply aggregate them over all tumor samples. The idea is that if some thing is real, it's identical by descent in all samples, and if something is an artifact, it's qualitatively the same artifact whenever it appears. I suppose we're also assuming that variants don't show up as real somatic mutations in some samples and artifacts in others.

    Our guess at the moment is that mixing different protocols ought to be okay for this reason -- if it's an artifact in WXS and absent in WGS, for example, then aggregating will conclude that it's an artifact wherever it appears. But I'm sure that experience will prove some of the assumptions about multi-sample mode in 4.1.0.0 to be naive.

  • FPBarthelFPBarthel HoustonMember ✭✭
    edited February 18

    Makes sense and from my initial testing that seems right, @davidben. I got a bit excited and double posted here so I'll just reiterate the point I made in my second post with some extra arguments.

    One issue that has been troubling me is the effect of multi-sample calling on determining mutation rate.

    Take patient A with a two tumor samples (pre- and post- treatment) and patient B with ten different tumor samples, including multi-sector samples from various relapses.

    For previous versions of M2 I would call variants individually for each tumor sample against a single normal control. To determine mutation frequency I would calculate the number of bases with n x coverage (lets say 30x) for each tumor sample and count all the M2 calls (with at least n x coverage) to get a coverage adjusted mutation rate.

    For this iteration of M2 I am a bit at a loss how to do this, since there are no M2 calls at a single-tumor-sample level. I could set arbitrary AD thresholds (eg. >0 alt reads, >3 alt reads ,>5% VAF, etc etc), but even with this approach I would assume that the individual mutation frequencies for all ten patient B samples will be biased towards a higher mutation frequency because there are more tumor samples to call variants from compared to the two patient A samples. Also it becomes more difficult to adjust for coverage in a single tumor sample since variant calls were made using a combination of tumor samples and not for a single tumor sample.

    For this reason, I think a boolean flag that's part of FORMAT for each tumor sample flagging a variant whether it would have been called in a simple single-tumor-sample mutect2 run would be helpful.

    For my dataset for now I will simply run single-tumor/normal M2 soly to determine mutation frequency, while using the multi-sample calls for other analyses.

  • FPBarthelFPBarthel HoustonMember ✭✭
    edited February 19

    @davidben sorry for all the double posting, if you have a better place for me to post my findings happy to do so. I've started posting some bug reports on the github for things that I'm certain are bugs.

    Either way, it seems that multi-sample M2 is very adamant about filtering. I am finding that canonical IDH1 R132H mutations are almost always getting filtered out in multi-sample M2, that are not filtered in single-sample M2 on the same BAM file (at least using ~ GATK 4.0.x filters). If you can share a private contact method or PGP pub key I can share a gist with some VCF data.

    I'm also seeing the same for canonical C228T TERT promoter mutations.

    For IDH1 R132H the most common filters are mappinq_quality and read_position and it seems that the mapping_quality filter is associated with having MMQ=60,0,60. In this case I did force calling on the A allele so the VCF line looks like 2 209113112 . C A,T. The MMQ of 0 I guess is from the A allele (R132L variant) that was force called and since there are zero reads here a mapping quality filter is applied?? Not really sure about the read_position filter (yet)

    For TERT C228T the most common filters are (in order): clustered_events and germline_risk. The sample with germline_risk filter has a GERMQ of 0 (others are 35+).

    Post edited by FPBarthel on
  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @FPBarthel The interaction of force-calling with the mapping quality filter seems like an unanticipated bug to me. Could you file a gatk issue and tag me (@davidbenjamin) if you haven't already done so? I would bet read position is the same thing.

    If you have a vcf to share with as many dubious filters as you wish, please send it to me: at broadinstitute.org.

    Thanks a lot!

    PS Your suggestion about single-sample "calls" within multi-sample mode is intriguing. We'll consider implementing it.

  • FPBarthelFPBarthel HoustonMember ✭✭
    edited February 27

    Sure thing. Happy to open "feature request" github posts if that helps. Right now I have the following feature requests:

    • Single-sample calls within multi-sample calling mode (to get accurate mutation counts at a single sample level). In the more distant future a sample-level call/genotype that takes into account variation across multiple samples would be a nice feature to have but it would not replace the single-sample call.
    • Allow multiple --panel-of-normals specification to M2 call, for cases with eg. mixed WGS/WXS or mixed capture WXS experiments, mixed NovaSeq/HiSeq samples etc etc

    P.s. I retract my feature request for split multi-allelic sites since LeftAlignAndTrimVariants was ported in 4.1.x

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @FPBarthel Thanks so much for the ideas! We will do our best.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    And, yes, please open github issues liberally.

  • xiuczxiucz Member

    hi, @davidben
    The GGA mode seems not working for me, because it starts to run from the first variant position in the bam. Here is my command, and my GATK verison is 4.1.0.0.
    Could you give me some comments?

    gatk Mutect2 \
    -R ~/hg19/gatk_bundle/ucsc.hg19.fasta \
    -I ~/sample.recalibrated.bam \
    -alleles force-call-alleles.vcf \
    --genotyping-mode GENOTYPE_GIVEN_ALLELES \
    -O single_sample.vcf
    

    Thank you.

  • xiuczxiucz Member

    Hi, @bhanuGandham
    Can you give me some help?

  • amjaddamjadd FinlandMember ✭✭
    edited July 3

    @davidben said:
    @FPBarthel We have a new orientation bias tool, and FilterByOrientationBias is semi-deprecated. The old version usually works fine, but it misses artifacts on the new Illumina Novaseq machines. Therefore, we chose not to update it for the new multi-sample workflow.

    To use the new tool, you need to:

    1. Run CollectF1R2Counts on each tumor bam that you suspect of orientation bias artifacts. If in doubt it can't hurt to run it on all tumor bams.
     gatk CollectF1R2Counts \
      -R GRCh38.fasta \
      -I tumor.bam \
      -alt-table tumor-alt.tsv \
      -ref-hist tumor-ref.metrics \
      -alt-hist tumor-alt.metrics
    
    1. Run LearnReadOrientationModel on each tumor's output from Step 1.
     * gatk LearnReadOrientationModel \
      -alt-table tumor-alt.tsv \
      -ref-hist tumor-ref.metrics \
      -alt-hist tumor-alt.metrics \
      -O my-tumor-sample-artifact-prior.tsv
    
    1. Specify --orientation-bias-artifact-priors multiple times, once for each output of LearnReadOrientationModel, on the Mutect2 command line (note: not the FilterMutectCalls comand line). Mutect2 uses the artifact priors to annotate variant calls, and FilterMutectCalls picks up on this annotation.

    I am puzzled here. I thought we get the F1R2 counts from Mutect directly using the --f1r2-tar-gz argument, and then we can run LearnReadOrientationModel and pass the output file to FilterMutectCalls through the -ob-priors input.

  • amjaddamjadd FinlandMember ✭✭

    @FPBarthel said:
    Sure thing. Happy to open "feature request" github posts if that helps. Right now I have the following feature requests:

    • Single-sample calls within multi-sample calling mode (to get accurate mutation counts at a single sample level). In the more distant future a sample-level call/genotype that takes into account variation across multiple samples would be a nice feature to have but it would not replace the single-sample call.
    • Allow multiple --panel-of-normals specification to M2 call, for cases with eg. mixed WGS/WXS or mixed capture WXS experiments, mixed NovaSeq/HiSeq samples etc etc

    P.s. I retract my feature request for split multi-allelic sites since LeftAlignAndTrimVariants was ported in 4.1.x

    +1 to the first suggestion here. I am trying to do that currently but I need a way to compute the mismatch rate after mutect filters. My related question is in this post

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    I am puzzled here. I thought we get the F1R2 counts from Mutect directly using the --f1r2-tar-gz argument, and then we can run LearnReadOrientationModel and pass the output file to FilterMutectCalls through the -ob-priors input.

    @amjadd You're correct. We added that feature very recently. My advice about using ColectF1R2Counts was correct at the time. It gives identical results to the new method but it's much slower.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    The GGA mode seems not working for me, because it starts to run from the first variant position in the bam.

    @xiucz I'm not sure I understand the problem. Are you saying that it skips some of the given alleles? As of 4.1.2 Mutect2 will only skip a given allele when there is zero coverage, as far as we know. While not ideal it would be very cumbersome to change that behavior within the GATK. If any other given allele is being skipped it's a mistake on our part.

  • xiuczxiucz Member

    @davidben
    Thanks for replying, I have a tumor sample with its normal matched one.
    First, I used GATK4.1.0.0 Mutect2 tumoronly mode to call variants in tumor sample(the matched normal excluded), and I got this:

    chr5    170837543   .   C   CTCTG   .   PASS    CONTQ=93;DP=13;ECNT=2;GERMQ=13;MBQ=43,43;MFRL=240,248;MMQ=60,60;MPOS=33;POPAF=7.30;RPA=1,2;RU=TCTG;SAAF=0.667,0.616,0.667;SAPP=0.025,0.029,0.946;STR;TLOD=32.17 GT:AD:AF:DP:F1R2:F2R1:OBAM:OBAMRC   0/1:4,8:0.643:12:2,3:2,5:false:false
    

    I used GATK4.1.0.0 Mutect2 tumor-normal mode to deal with the tumor and normal sample, and the chr5:170837543 variant disappeared, not in the vcf any more. And bamout igv showed:

    Then I used the following command(GGA mode) to get the coverage information in the NORMAL sample at the position chr5:170837543, but the output file(single_sample.vcf) was empty except the annotation header.

    gatk Mutect2 \
    -R ~/hg19/gatk_bundle/ucsc.hg19.fasta \
    -I ~/sample.recalibrated.bam \
    -alleles force-call-alleles.vcf \
    -L force-call-alleles.vcf \
    -O single_sample.vcf
    

    As you see from the igv shot, the coverage of chr5:170837543 in the NORMAL sample is non-zero.

    Why Mutect2's GGA mode cannot report every site's information just like Haplotypecall's -ERC BP_RESOLUTION mode?

    Thank you.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @xiucz Could you try using GATK 4.1.2.0 and share your command lines for tumor-only and tumor-normal mode?

Sign In or Register to comment.