Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Mutect2

SystemSystem Administrator admin
This discussion was created from comments split from: New to the forum? Ask your questions here!.

Comments

  • stevebsteveb Member

    We want to try using MuTect2 on FFPE genomes. Should the panel of normals be made of up FFPE normals?

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @steveb

    Yes that would be ideal. The more closely obtained and sequenced tumor and normal samples, the better it is for the tool to remove technical biases.

    Regards
    Bhanu

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    And be sure to run the orientation bias filter. For FFPE samples this is more important than the panel of normals.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @manba

    Which post are you referring to? Please try and give us the most amount of information you can so that we can help you better.

    Regards
    Bhanu

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @manba

    I think i know the document you are referring to.

    still has alt_allele in normal sample

    Would you please post some examples supporting this observation.

    Thank you.

    Regards
    Bhanu

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @manba

    This other user question and this document explains alt_allele_in_normal flag.

    Regards
    Bhanu

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @manba GATK 4 has a separate tool, FilterMutectCalls, for filtering, but in addition to that there is also a separate tool for filtering by orientation bias, which is relevant in FFPE samples.

  • stevebsteveb Member
    Do you mean ?
    --orientation-bias-artifact-priors
    null table of prior artifact probabilities for the read orientation filter model
  • stevebsteveb Member
    Continuing about FFPE and PanelOfNormal,
    would it be better to create a Panel with the loads of FFPE Tumour samples and treat it as a panelOfNormal? Or to use nothing at all? We generally do not have FFPE tumour lib.s.
    Still investigating --orientation-bias-artifact-priors switch.
    Can anyone give more info on why we would use that with the FFPE samples?
  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @steveb Creating a panel of normals using tumor samples is probably not a good idea. A panel of normals serves mainly to catch mapping errors and so any panel of normals will perform decently, especially if the read and fragment sizes, and the alignment algorithm, are comparable to your tumors.

    The reason to do orientation bias filtering is that this catches artifacts due to chemical changes in DNA that occur on a single strand during sample prep, including sitting on a shelf in FFPE for a few years. The canonical reference for this sort of thing is https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3616734/.

    A panel of normals, even an FFPE panel, is not a good substitute for FFPE filtering because the characteristic deamination artifact might only occurs maybe once in every 10,000 contexts where it is prevalent. Although this translates to a huge error rate of 1000's of false positives per genome, at any given site you probably won't get a sample in your panel. The panel of normals is not smart -- it just memorizes every site in isolation -- whereas the orientation bias filter learns that certain sequence contexts are error-prone by considering every place those contexts occur.

    The workflow for our orientation bias filter is:

    collect raw data

    1) gatk CollectF1R2Counts -R ref.fasta -I tumor.bam \
    -O tumor-artifact-prior-table.tsv \
    -alt-table tumor-alt.tsv \
    -ref-hist tumor-ref.metrics \
    -alt-hist tumor-alt.metrics

    learn the model

    2) gatk LearnReadOrientationModel \
    -alt-table my-tumor-sample-alt.tsv \
    -ref-hist my-tumor-sample-ref.metrics \
    -alt-hist my-tumor-sample-alt-depth1.metrics \
    -O my-tumor-sample-artifact-prior.tsv

    call

    3) Run Mutect2 with the argument --orientation-bias-artifact-priors my-tumor-sample-artifact-prior.tsv

    filter

    4) FilterMutectCalls automatically picks up on the relevant VCF annotations emitted by Mutect2 and filters accordingly.

  • shanbioshanbio Member

    @bhanuGandham @Geraldine_VdAuwera

    Congrats to the entire GATK4 team to make it public :)

    Recently I thought of updating to GATK4 and I tried Mutect2

    Below are my observations and questions:

    1.There is no QUAL or Filter information in the VCF file.

    This is explained in one of the thread as GATK gives less weightage for making somatic calls. (Reasonable)

    2.There is no "DP" tag in the FORMAT column where traditionally VCF files from samtools or Freebayes has. However it's present in INFO column which is less used in downstream analysis.

    3.There is a mismatch (huge) in the total read depth of "DP" and AD (Ref+Alt)

    This is also explained in one of the thread, but the difference is significant

    4._ [Both Mutect2 & Haplotype Caller]_ I wonder why the "Case" column-10 comes first followed by the "Control" column-11, whereas all the other variant callers provides control first followed by case column. This is creating a lot of issue in downstream pipeline.

    Example VCF:
    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT pat120_case pat120_control
    chr1 6185847 . C A . . DP=194;ECNT=1;NLOD=25.38;N_ART_LOD=0.375;POP_AF=1.000e-06;REF_BASES=GGCCTCTCTTCAAAGTCCTCA;TLOD=3.04 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB 0/1:84,2:0.050:43,1:41,1:39,38:202,284:60:32:0.00,0.020,0.023:0.012,4.088e-03,0.984 0/0:98,1:0.030:48,0:50,1:39,39:204,149:60:6

    5.There are differences in the actual variant allele frequency and the VAF provided by GATK.

    This is also explained in one of the thread, but it's not clear.

    Please provide a solution to above queries.

    Command Used:

    GATKCommandLine=<ID=Mutect2,CommandLine="Mutect2 --tumor-sample pat120_case --normal-sample pat120_control --output pat120_somatic_gatk.vcf --intervals TruSeq_exome_targeted_regions.hg19.bed --input pat120_control.bam --input pat120_case.bam --reference hs37d5.fa --genotype-pon-sites false --genotype-germline-sites false --af-of-alleles-not-in-resource -1.0 --tumor-lod-to-emit 3.0 --initial-tumor-lod 2.0 --initial-pcr-qual 40 --max-population-af 0.01 --downsampling-stride 1 --max-suspicious-reads-per-alignment-start 0 --normal-lod 2.2 --max-mnp-distance 1 --ignore-itr-artifacts false --dont-trim-active-regions false --max-disc-ar-extension 25 --max-gga-ar-extension 300 --padding-around-indels 150 --padding-around-snps 20 --kmer-size 10 --kmer-size 25 --dont-increase-kmer-sizes-for-cycles false --allow-non-unique-kmers-in-ref false --num-pruning-samples 1 --recover-dangling-heads false --do-not-recover-dangling-branches false --min-dangling-branch-length 4 --consensus false --max-num-haplotypes-in-population 128 --error-correct-kmers false --min-pruning 2 --debug-graph-transformations false --kmer-length-for-read-error-correction 25 --min-observations-for-kmer-to-be-solid 20 --likelihood-calculation-engine PairHMM --base-quality-score-threshold 18 --pair-hmm-gap-continuation-penalty 10 --pair-hmm-implementation FASTEST_AVAILABLE --pcr-indel-model CONSERVATIVE --phred-scaled-global-read-mismapping-rate 45 --native-pair-hmm-threads 4 --native-pair-hmm-use-double-precision false --debug false --use-filtered-reads-for-annotations false --bam-writer-type CALLED_HAPLOTYPES --dont-use-soft-clipped-bases false --capture-assembly-failure-bam false --error-correct-reads false --do-not-run-physical-phasing false --min-base-quality-score 10 --smith-waterman JAVA --use-new-qual-calculator false --annotate-with-num-discovered-alleles false --heterozygosity 0.001 --indel-heterozygosity 1.25E-4 --heterozygosity-stdev 0.01 --standard-min-confidence-threshold-for-calling 10.0 --max-alternate-alleles 6 --max-genotype-count 1024 --sample-ploidy 2 --num-reference-samples-if-no-call 0 --genotyping-mode DISCOVERY --genotype-filtered-alleles false --contamination-fraction-to-filter 0.0 --output-mode EMIT_VARIANTS_ONLY --all-site-pls false --min-assembly-region-size 50 --max-assembly-region-size 300 --assembly-region-padding 100 --max-reads-per-alignment-start 50 --active-probability-threshold 0.002 --max-prob-propagation-distance 50 --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --disable-tool-default-read-filters false --minimum-mapping-quality 20 --max-read-length 2147483647 --min-read-length 30 --disable-tool-default-annotations false --enable-all-annotations false",Version=4.0.8.1,Date="December 12, 2018 8:00:55 AM UTC">

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    To your question 4: the sample/genotype columns are emitted in alphabetical order iirc. Given that the VCF specification does not provide a rule for column ordering, it is a bad idea for any downstream analysis to rely on arbitrary ordering expectations — the software should check what is the actual order every time.

  • shanbioshanbio Member

    @Geraldine_VdAuwera Thank you very much, will do as suggested and let you know.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    HI @shanbio

    Have all your questions been answered? It seems like you already have justifications/clarifications for most of the points you made, so I am not sure what your questions are. If there were any questions you need help with please let me know.

    Regards
    Bhanu

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @shanbio

    There is no QUAL or Filter information in the vcf file.

    Mutect2 emits a tumor log odds (the TLOD info field), which is the somatic equivalent of a QUAL. To filter, you must pass the output of Mutect2 to FilterMutectCalls.

    There is no "DP" tag in the FORMAT column.

    That's true, but the sum of the ADs in the genotype fields gives you the total sample depth.

    There is a mismatch (huge) in the total read depth of "DP" and AD (Ref+Alt).

    DP is an INFO field for the variant site as a whole aggregated over all samples, while AD is a sample-level field. In your example, the DP is 194 and the sample ADs are 84,2 and 98,1. The difference between the summed ADs 84 + 2 + 98 + 1 = 185 and the DP 194 can be accounted for partly by the fact that the ADs neglect uninformative reads that can't distinguish between alleles, and partly by the fact that overlapping read pairs are not counted twice.

    There are differences in the actual variant allele frequency and the VAF provided by GATK.

    Do you mean that AF != AD/DP? It's true that this is not what Mutect2 emits for the AF, which is a probabilisitc estimate that accounts for uncertainty. To see what I mean, suppose there are 100 total reads, 90 of which are ref and 10 of which each have a 60% chance of supporting the alt allele. As far as AD is concerned this is 10 alt reads, but as far as probability is concerned this is an expected 6 alt reads.

    whereas all the other variant callers provides control first followed by case column

    Mutect emits #tumor_sample and #normal_sample lines in the VCF header for downstream convenience.

  • stevebsteveb Member
    Returning after awhile...trying to create Oreintation Bias filter as in workflow presented by davidben above on this board. Please provide some clarificaton. You give the first two steps to use CollectF1R2Counts and LearnReadOrientationModel as in:
    The workflow for our orientation bias filter is:
    collect raw data

    1) gatk CollectF1R2Counts -R ref.fasta -I tumor.bam \
    -O tumor-artifact-prior-table.tsv \
    -alt-table tumor-alt.tsv \
    -ref-hist tumor-ref.metrics \
    -alt-hist tumor-alt.metrics

    learn the model

    2) gatk LearnReadOrientationModel \
    -alt-table my-tumor-sample-alt.tsv \
    -ref-hist my-tumor-sample-ref.metrics \
    -alt-hist my-tumor-sample-alt-depth1.metrics \
    -O my-tumor-sample-artifact-prior.tsv

    The CollectF1R2Counts tool doesn't seem to have a OUTPUT (-O) option. So I am running without waiting to see what output is.
    Did I miss something or is there no --output for that tool?
  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @steveb Good catch. -O tumor-artifact-prior-table.tsv somehow crept into the javadoc for the wrong tool, and I pasted it onto this thread unthinkingly. The only outputs of CollectF1R2Counts are the alt table, ref histogram, and alt histogram.

  • stevebsteveb Member
    OKay good to know. The 'collect raw data' stage is complete and I have the 3 files from the output of CollectF1R2Counts.
    I am using them as inputs to LearnReadOrientationModel and waiting for output now.

    Hopefully running Mutect2 with the argument --orientation-bias-artifact-priors will speed things up with the FFPE sample, because without, the run is painfully slow at 675 Regions/Minute
  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @steveb I'm afraid it won't do anything for speed because the learned artifact model is used for variant filtering after calling. We're working on supplementing that for highly-degraded FFPE samples with a read filter (or perhaps read transformer) that removes bad reads before they slow down assembly and realignment.

    In the meantime you could try experimenting with lowering --max-num-haplotypes-in-population from its default of 128. The name is outdated, being based on pre-GVCF HaplotypeCaller days, but it really just means the maximum number of haplotypes to assemble per assembly region. You could also try making graph pruning stricter by increasing --pruning-lod-threshold from its default of 1.0. This lod has roughly the same meaning as the M2 tumor lod, so you could set it to somewhere in the 3 - 7 range.

    I wish I could give more specific guidance but we just don't have FFPE best practices yet. The traditional approach has been to hope that a study had a few relatively good samples and ignore the rest. We would like to improve this in the next several months.

  • stevebsteveb Member
    Okay. I am seeing that --orientationBias has not sped things up. However I am trying the two other suggested parametrs but Mutect2 does not have a --pruning-lod-threshold. And not sure if you meant instead one of:
    --minPruning
    2 Minimum support to not prune paths in the graph
    --numPruningSamples
    1 Number of samples that must pass the minPruning threshold

    Will try the first at 5 for all the samples.
    will see if this speeds things up.
  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @steveb The Mutect2 method in Firecloud is behind the GATK repo but should be updated within a couple of days for the 4.1 release. That version has the parameter I mentioned, which is the knob for the new adaptive assembly graph pruning algorithm. If you don't want to wait you can set gatk_docker to us.gcr.io/broad-gatk/gatk:4.1.0.0.

  • stevebsteveb Member
    But isn't parameter --minPruning the same as the one you mentioned? Setting up from the default of 2 to 5 has sped things up sixfold. ((Unless it was the other parameter --max-num-haplotypes-in-population which I also changed))
  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @steveb min-pruning is a heuristic way to accomplish what adaptive pruning does. To illustrate the difference, suppose you assemble an exon target so that the end of your assembly region has much lower depth than the middle. You don't want to prune an alt path with 5 reads when the ref path has 20, whereas you may want to prune an alt path with 5 reads when the reference has 1000. The probabilistic model for adaptive pruning handles this naturally, so that Mutect2 handles varying depth over a bam file and even within an assembly region.

    By the way, this change is a big reason why our mitochondria group adopted Mutect2.

Sign In or Register to comment.