Holiday Notice:
The Frontline Support team will be slow to respond December 17-18 due to an institute-wide retreat and offline December 22- January 1, while the institute is closed. Thank you for your patience during these next few weeks. Happy Holidays!

Mutect2

SystemSystem Administrator admin
This discussion was created from comments split from: New to the forum? Ask your questions here!.

Comments

  • stevebsteveb Member

    We want to try using MuTect2 on FFPE genomes. Should the panel of normals be made of up FFPE normals?

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin

    Hi @steveb

    Yes that would be ideal. The more closely obtained and sequenced tumor and normal samples, the better it is for the tool to remove technical biases.

    Regards
    Bhanu

  • davidbendavidben BostonMember, Broadie, Dev ✭✭

    And be sure to run the orientation bias filter. For FFPE samples this is more important than the panel of normals.

  • manbamanba Member

    @davidben said:
    And be sure to run the orientation bias filter. For FFPE samples this is more important than the panel of normals.

    I am not quite understand what you said.

    and in one post about difference between gatk3 and 4 mutect2, it said gatk3 filter in the command mutect2, but gatk4 has a separate tool,
    but there can still has alt_allele in normal sample, means both normal and tumor different from reference, so I do not think gatk3 mutect2 automatic filtering , but the post said atomatic filtering

    thanks a lot.

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin

    Hi @manba

    Which post are you referring to? Please try and give us the most amount of information you can so that we can help you better.

    Regards
    Bhanu

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin

    Hi @manba

    I think i know the document you are referring to.

    still has alt_allele in normal sample

    Would you please post some examples supporting this observation.

    Thank you.

    Regards
    Bhanu

  • manbamanba Member
    edited December 5


    the red box , gatk3.7,
    thanks for your kind help.

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin

    Hi @manba

    This other user question and this document explains alt_allele_in_normal flag.

    Regards
    Bhanu

  • davidbendavidben BostonMember, Broadie, Dev ✭✭

    @manba GATK 4 has a separate tool, FilterMutectCalls, for filtering, but in addition to that there is also a separate tool for filtering by orientation bias, which is relevant in FFPE samples.

  • manbamanba Member

    thanks for your help.
    I have another question is about > @bhanuGandham said:

    Hi @manba

    This other user question and this document explains alt_allele_in_normal flag.

    Regards
    Bhanu

    thanks a lot

  • stevebsteveb Member
    Do you mean ?
    --orientation-bias-artifact-priors
    null table of prior artifact probabilities for the read orientation filter model
  • stevebsteveb Member
    Continuing about FFPE and PanelOfNormal,
    would it be better to create a Panel with the loads of FFPE Tumour samples and treat it as a panelOfNormal? Or to use nothing at all? We generally do not have FFPE tumour lib.s.
    Still investigating --orientation-bias-artifact-priors switch.
    Can anyone give more info on why we would use that with the FFPE samples?
  • davidbendavidben BostonMember, Broadie, Dev ✭✭

    @steveb Creating a panel of normals using tumor samples is probably not a good idea. A panel of normals serves mainly to catch mapping errors and so any panel of normals will perform decently, especially if the read and fragment sizes, and the alignment algorithm, are comparable to your tumors.

    The reason to do orientation bias filtering is that this catches artifacts due to chemical changes in DNA that occur on a single strand during sample prep, including sitting on a shelf in FFPE for a few years. The canonical reference for this sort of thing is https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3616734/.

    A panel of normals, even an FFPE panel, is not a good substitute for FFPE filtering because the characteristic deamination artifact might only occurs maybe once in every 10,000 contexts where it is prevalent. Although this translates to a huge error rate of 1000's of false positives per genome, at any given site you probably won't get a sample in your panel. The panel of normals is not smart -- it just memorizes every site in isolation -- whereas the orientation bias filter learns that certain sequence contexts are error-prone by considering every place those contexts occur.

    The workflow for our orientation bias filter is:

    collect raw data

    1) gatk CollectF1R2Counts -R ref.fasta -I tumor.bam \
    -O tumor-artifact-prior-table.tsv \
    -alt-table tumor-alt.tsv \
    -ref-hist tumor-ref.metrics \
    -alt-hist tumor-alt.metrics

    learn the model

    2) gatk LearnReadOrientationModel \
    -alt-table my-tumor-sample-alt.tsv \
    -ref-hist my-tumor-sample-ref.metrics \
    -alt-hist my-tumor-sample-alt-depth1.metrics \
    -O my-tumor-sample-artifact-prior.tsv

    call

    3) Run Mutect2 with the argument --orientation-bias-artifact-priors my-tumor-sample-artifact-prior.tsv

    filter

    4) FilterMutectCalls automatically picks up on the relevant VCF annotations emitted by Mutect2 and filters accordingly.

  • shanbioshanbio Member

    @bhanuGandham @Geraldine_VdAuwera

    Congrats to the entire GATK4 team to make it public :)

    Recently I thought of updating to GATK4 and I tried Mutect2

    Below are my observations and questions:

    1.There is no QUAL or Filter information in the VCF file.

    This is explained in one of the thread as GATK gives less weightage for making somatic calls. (Reasonable)

    2.There is no "DP" tag in the FORMAT column where traditionally VCF files from samtools or Freebayes has. However it's present in INFO column which is less used in downstream analysis.

    3.There is a mismatch (huge) in the total read depth of "DP" and AD (Ref+Alt)

    This is also explained in one of the thread, but the difference is significant

    4._ [Both Mutect2 & Haplotype Caller]_ I wonder why the "Case" column-10 comes first followed by the "Control" column-11, whereas all the other variant callers provides control first followed by case column. This is creating a lot of issue in downstream pipeline.

    Example VCF:
    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT pat120_case pat120_control
    chr1 6185847 . C A . . DP=194;ECNT=1;NLOD=25.38;N_ART_LOD=0.375;POP_AF=1.000e-06;REF_BASES=GGCCTCTCTTCAAAGTCCTCA;TLOD=3.04 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB 0/1:84,2:0.050:43,1:41,1:39,38:202,284:60:32:0.00,0.020,0.023:0.012,4.088e-03,0.984 0/0:98,1:0.030:48,0:50,1:39,39:204,149:60:6

    5.There are differences in the actual variant allele frequency and the VAF provided by GATK.

    This is also explained in one of the thread, but it's not clear.

    Please provide a solution to above queries.

    Command Used:

    GATKCommandLine=<ID=Mutect2,CommandLine="Mutect2 --tumor-sample pat120_case --normal-sample pat120_control --output pat120_somatic_gatk.vcf --intervals TruSeq_exome_targeted_regions.hg19.bed --input pat120_control.bam --input pat120_case.bam --reference hs37d5.fa --genotype-pon-sites false --genotype-germline-sites false --af-of-alleles-not-in-resource -1.0 --tumor-lod-to-emit 3.0 --initial-tumor-lod 2.0 --initial-pcr-qual 40 --max-population-af 0.01 --downsampling-stride 1 --max-suspicious-reads-per-alignment-start 0 --normal-lod 2.2 --max-mnp-distance 1 --ignore-itr-artifacts false --dont-trim-active-regions false --max-disc-ar-extension 25 --max-gga-ar-extension 300 --padding-around-indels 150 --padding-around-snps 20 --kmer-size 10 --kmer-size 25 --dont-increase-kmer-sizes-for-cycles false --allow-non-unique-kmers-in-ref false --num-pruning-samples 1 --recover-dangling-heads false --do-not-recover-dangling-branches false --min-dangling-branch-length 4 --consensus false --max-num-haplotypes-in-population 128 --error-correct-kmers false --min-pruning 2 --debug-graph-transformations false --kmer-length-for-read-error-correction 25 --min-observations-for-kmer-to-be-solid 20 --likelihood-calculation-engine PairHMM --base-quality-score-threshold 18 --pair-hmm-gap-continuation-penalty 10 --pair-hmm-implementation FASTEST_AVAILABLE --pcr-indel-model CONSERVATIVE --phred-scaled-global-read-mismapping-rate 45 --native-pair-hmm-threads 4 --native-pair-hmm-use-double-precision false --debug false --use-filtered-reads-for-annotations false --bam-writer-type CALLED_HAPLOTYPES --dont-use-soft-clipped-bases false --capture-assembly-failure-bam false --error-correct-reads false --do-not-run-physical-phasing false --min-base-quality-score 10 --smith-waterman JAVA --use-new-qual-calculator false --annotate-with-num-discovered-alleles false --heterozygosity 0.001 --indel-heterozygosity 1.25E-4 --heterozygosity-stdev 0.01 --standard-min-confidence-threshold-for-calling 10.0 --max-alternate-alleles 6 --max-genotype-count 1024 --sample-ploidy 2 --num-reference-samples-if-no-call 0 --genotyping-mode DISCOVERY --genotype-filtered-alleles false --contamination-fraction-to-filter 0.0 --output-mode EMIT_VARIANTS_ONLY --all-site-pls false --min-assembly-region-size 50 --max-assembly-region-size 300 --assembly-region-padding 100 --max-reads-per-alignment-start 50 --active-probability-threshold 0.002 --max-prob-propagation-distance 50 --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --disable-tool-default-read-filters false --minimum-mapping-quality 20 --max-read-length 2147483647 --min-read-length 30 --disable-tool-default-annotations false --enable-all-annotations false",Version=4.0.8.1,Date="December 12, 2018 8:00:55 AM UTC">

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    To your question 4: the sample/genotype columns are emitted in alphabetical order iirc. Given that the VCF specification does not provide a rule for column ordering, it is a bad idea for any downstream analysis to rely on arbitrary ordering expectations — the software should check what is the actual order every time.

  • shanbioshanbio Member

    @Geraldine_VdAuwera Thank you very much, will do as suggested and let you know.

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin

    HI @shanbio

    Have all your questions been answered? It seems like you already have justifications/clarifications for most of the points you made, so I am not sure what your questions are. If there were any questions you need help with please let me know.

    Regards
    Bhanu

  • davidbendavidben BostonMember, Broadie, Dev ✭✭

    @shanbio

    There is no QUAL or Filter information in the vcf file.

    Mutect2 emits a tumor log odds (the TLOD info field), which is the somatic equivalent of a QUAL. To filter, you must pass the output of Mutect2 to FilterMutectCalls.

    There is no "DP" tag in the FORMAT column.

    That's true, but the sum of the ADs in the genotype fields gives you the total sample depth.

    There is a mismatch (huge) in the total read depth of "DP" and AD (Ref+Alt).

    DP is an INFO field for the variant site as a whole aggregated over all samples, while AD is a sample-level field. In your example, the DP is 194 and the sample ADs are 84,2 and 98,1. The difference between the summed ADs 84 + 2 + 98 + 1 = 185 and the DP 194 can be accounted for partly by the fact that the ADs neglect uninformative reads that can't distinguish between alleles, and partly by the fact that overlapping read pairs are not counted twice.

    There are differences in the actual variant allele frequency and the VAF provided by GATK.

    Do you mean that AF != AD/DP? It's true that this is not what Mutect2 emits for the AF, which is a probabilisitc estimate that accounts for uncertainty. To see what I mean, suppose there are 100 total reads, 90 of which are ref and 10 of which each have a 60% chance of supporting the alt allele. As far as AD is concerned this is 10 alt reads, but as far as probability is concerned this is an expected 6 alt reads.

    whereas all the other variant callers provides control first followed by case column

    Mutect emits #tumor_sample and #normal_sample lines in the VCF header for downstream convenience.

Sign In or Register to comment.