
A clear interpretation of the FILTER column in a Mutect2 VCF

Hi, I have read the docs (mathematical notes on mutect.pdf and the latest mutect.pdf) and the header of the VCF,
but I am still confused when reading variants in the VCF file, so I would like to go through the filters with you carefully. I hope you can give me some guidance. The questions are quite detailed and may take a lot of your time; sorry about that, and thanks a lot.


min-median-base-quality is the minimum median base quality of bases supporting a SNV,
but the docs do not state the concrete threshold value for it.



min-median-mapping-quality: likewise, the docs do not state the concrete threshold value of the minimum median mapping quality.


FILTER=<ID=clustered_events,Description="Clustered events observed in the tumor">: the docs describe this as the maximum allowable number of called variants co-occurring in a single assembly region; if the number of called variants exceeds this, they will all be filtered. How should I understand "a single assembly region"?


FILTER=<ID=bad_haplotype,Description="Variant near filtered variant on same haplotype.">: how should I understand this? For example, here is a site:

chr1 144854528 . A G . bad_haplotype;clustered_events DP=971;ECNT=5;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.537e+01;TLOD=1291.33 GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:ORIGINAL_CONTIG_MISMATCH:PGT:PID:SA_MAP_AF:SA_POST_PROB 0/1:607,356:0.370:963:289,183:318,173:29,29:175,187:60:39:false:false:.:.:50.60:100.00:0:0|1:144854528_A_G:0.364,0.364,0.370:5.314e-03,0.015,0.980


FILTER=<ID=chimeric_original_alignment,Description="NuMT variant with too many ALT reads originally from autosome">: how should I understand this?



max-germline-posterior is the maximum posterior probability, as determined by the above germline probability model, that a variant is a germline event, but the docs do not state the concrete threshold value.

FILTER=<ID=low_avg_alt_quality,Description="Low average alt quality">: is this the average base quality over all reads supporting the alt allele at the site? This filter seems to come up very rarely; why?



max-alt-allele-count is the maximum allowable number of alt alleles at a site. By default only biallelic
variants pass the filter. Does this mean a site can have at most three possible bases (the ref and two possible alts)?


FILTER=<ID=n_ratio,Description="Ratio of N to alt exceeds specified ratio">: this filter seems to come up very rarely; why?


FILTER=<ID=orientation_bias,Description="Orientation bias (in one of the specified artifact mode(s) or complement) seen in one or more samples.">: does this mean the reads come from only the positive or only the negative strand?

Here is an example site:
chr3 181496339 . G T . orientation_bias DP=450;ECNT=1;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.164e+02;TLOD=5.93 GT:AD:AF:DP:F1R2:F2R1:FT:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:ORIGINAL_CONTIG_MISMATCH:SA_MAP_AF:SA_POST_PROB 0/1:428,7:0.018:435:198,6:230,1:orientation_bias:35,28:178,195:60:27:true:false:0.857:0.249:33.02:100.00:0:0.010,0.020,0.016:3.913e-03,1.666e-03,0.994
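
To make the per-sample fields of the orientation_bias example site easier to read, here is a minimal Python sketch that pairs each FORMAT key with its value and compares how the alt reads split between F1R2 and F2R1. It assumes that F1R2 and F2R1 list the ref count first and the alt count second, like AD; the variable names are only illustrative.

```python
# FORMAT and sample columns copied from the orientation_bias example site.
format_keys = ("GT:AD:AF:DP:F1R2:F2R1:FT:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:"
               "OBQ:OBQRC:ORIGINAL_CONTIG_MISMATCH:SA_MAP_AF:SA_POST_PROB")
sample_values = ("0/1:428,7:0.018:435:198,6:230,1:orientation_bias:35,28:178,195:"
                 "60:27:true:false:0.857:0.249:33.02:100.00:0:"
                 "0.010,0.020,0.016:3.913e-03,1.666e-03,0.994")

# Pair each FORMAT key with the corresponding sample value.
fields = dict(zip(format_keys.split(":"), sample_values.split(":")))

# Assumption: counts are ordered ref,alt (the same order as AD).
ref_f1r2, alt_f1r2 = map(int, fields["F1R2"].split(","))
ref_f2r1, alt_f2r1 = map(int, fields["F2R1"].split(","))

# Fraction of reads in the F1R2 orientation, separately for ref and alt.
ref_frac_f1r2 = ref_f1r2 / (ref_f1r2 + ref_f2r1)
alt_frac_f1r2 = alt_f1r2 / (alt_f1r2 + alt_f2r1)

print(f"ref reads in F1R2: {ref_frac_f1r2:.3f}")
print(f"alt reads in F1R2: {alt_frac_f1r2:.3f}")
```

For this site the ref reads are split roughly evenly between the two orientations, while 6 of the 7 alt reads are F1R2. That alt fraction agrees with the OBF value (0.857) in the same record.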



min-median-read-position is the minimum median length of bases supporting an allele from the closest end
of the read. Indel positions are measured by the end farthest from the end of the read. But the docs do not state the concrete threshold value.
Here is an example site:
chr1 16258280 . C CTCTAAATCTTCA . read_position DP=618;ECNT=1;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.682e+02;TLOD=5.82 GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:ORIGINAL_CONTIG_MISMATCH:SA_MAP_AF:SA_POST_PROB 0/1:586,4:8.350e-03:590:272,4:314,0:29,33:176,140:60:0:false:false:0:0.010,0.010,6.780e-03:1.216e-03,1.579e-03,0.997



max-strand-artifact-probability is the posterior probability of a strand artifact, as determined by the
model described above, required to apply the strand artifact filter. This is necessary but not sufficient – we also
require the estimated max a posteriori allele fraction to be less than min-strand-artifact-allele-fraction.
The second condition prevents filtering real variants that also have significant strand bias, i.e. a true variant
that also has some artifactual reads.
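
The two conditions in the passage above can be written as a single boolean check. This is only a sketch of the logic as the docs describe it, not the GATK implementation; the function name and the threshold values in the usage lines are hypothetical, and the real defaults should be looked up in the FilterMutectCalls arguments.

```python
def strand_artifact_filter_applies(artifact_posterior: float,
                                   map_allele_fraction: float,
                                   max_strand_artifact_probability: float,
                                   min_strand_artifact_allele_fraction: float) -> bool:
    """Return True only when BOTH conditions from the docs hold.

    1. The posterior probability of a strand artifact exceeds
       max-strand-artifact-probability (necessary, but not sufficient).
    2. The max a posteriori allele fraction is below
       min-strand-artifact-allele-fraction.
    """
    high_posterior = artifact_posterior > max_strand_artifact_probability
    low_allele_fraction = map_allele_fraction < min_strand_artifact_allele_fraction
    return high_posterior and low_allele_fraction


# Hypothetical thresholds, chosen only for illustration.
MAX_PROB = 0.99
MIN_AF = 0.01

# High artifact posterior and tiny allele fraction: filtered.
print(strand_artifact_filter_applies(0.995, 0.005, MAX_PROB, MIN_AF))
# High artifact posterior but substantial allele fraction: kept, because a
# real variant may also carry some artifactual reads.
print(strand_artifact_filter_applies(0.995, 0.30, MAX_PROB, MIN_AF))
```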

How should I understand "Evidence for alt allele comes from one read direction only"? How many read directions are there? Could you draw what it looks like when there is one read direction and when there are two? And do F1R2 and F2R1 stand for this?


FILTER=<ID=strict_strand_bias,Description="Evidence for alt allele is not represented in both directions">

This filter comes up very rarely; why?

Another question: when I view reads in IGV and sort them by sample, there can be many reads with 'Sample=HC'. It seems these are excluded from the DP count in the VCF even though they appear in bamout.bam.


Answers

  • 2904359495 Member

    Thanks a lot.
    Can you point out F1R2 and F2R1 in the picture you supplied, @AdelaideR? (https://gatkforums.broadinstitute.org/gatk/discussion/23467/filterbyorientationbias-ffpe-artifacts)
    A read pair is F1R2 (forward 1st, reverse 2nd) if the sequence of
    bases in Read 1 maps to the forward strand of the reference (F1), and the sequence of Read 2 to the reverse strand
    of the reference (R2). F2R1 is defined similarly.
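
As a rough illustration of the definition quoted above, the sketch below classifies a single read by its SAM FLAG bits. The function name is hypothetical, and it assumes a properly oriented pair in which the two mates align to opposite strands, so the orientation of the pair can be inferred from one read.

```python
# Standard SAM FLAG bits.
READ_REVERSE_STRAND = 0x10
FIRST_IN_PAIR = 0x40
SECOND_IN_PAIR = 0x80


def pair_orientation(flag: int) -> str:
    """Return 'F1R2' or 'F2R1' for a read from a properly oriented pair."""
    is_reverse = bool(flag & READ_REVERSE_STRAND)
    is_first = bool(flag & FIRST_IN_PAIR)
    is_second = bool(flag & SECOND_IN_PAIR)
    if is_first == is_second:
        raise ValueError("flag must mark exactly one of first/second in pair")
    # F1R2: Read 1 maps to the forward strand, so Read 2 maps to the reverse.
    if (is_first and not is_reverse) or (is_second and is_reverse):
        return "F1R2"
    # Otherwise Read 2 is the forward read and Read 1 the reverse one.
    return "F2R1"


for flag in (99, 147, 83, 163):
    print(flag, pair_orientation(flag))
```

Flags 99 and 147 are the two mates of one F1R2 pair; 83 and 163 are the two mates of an F2R1 pair.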

  • 2904359495 Member

    :D Can anyone help? Thanks a lot.

  • xiucz Member

    I, or rather we, also need this kind of interpretation :)

  • 2904359495 Member

    ;) no one helped

  • bhanuGandham Cambridge MA Member, Administrator, Broadie, Moderator admin

    @2904359495 and @xiucz

    Our support team is looking into this and will get back to you shortly.

  • 2904359495 Member

    @bhanuGandham Thanks a lot. I have come to understand some of them with the help of your GATK team. Take your time if you are busy with other things; I am not that anxious waiting.

  • 2904359495 Member

    In GATK4, HaplotypeCaller uses the following statistical tests to assess bias:

    INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">

    INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">

    This is also different from the tumor-only mode of Mutect2; is there any reason? Thanks a lot.

  • 2904359495 Member

    About the orientation:
    it is calculated like this

    INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">

    So what is the Z-score from the Wilcoxon rank sum test? And in Mutect2 the corresponding quantities are F1R2 and F2R1, am I right?
    Thanks a lot.

  • bhanuGandham Cambridge MA Member, Administrator, Broadie, Moderator admin

    @2904359495

    Thank you for your patience.

  • davidben Boston Member, Broadie, Dev ✭✭✭

    @2904359495 Please use normal size fonts on the forum. Large fonts make it hard to scroll through the thread. You can use text blocks and code blocks to separate code and output from the rest of the question.

    min-median-mapping-quality also does not say the concrete value of the minimum median base quality

    You can find defaults in the GATK source code. However, if someone is enough of an expert to worry about the defaults then that person is also enough of an expert to choose suitable non-default values.

    how to understand the called "a single assembly region".

    This means all variation assembled from a single local de Bruijn graph.

    Variant near filtered variant on same haplotype.">, how to understand this?

    This means the variant was phased with another filtered variant, in which case we guess that the entire haplotype was some sort of technical or mapping error.

    it seems comes out very rare, why?

    Some filters, such as those tailored to the mitochondria and cfDNA pipelines, are not turned on by default.

    whether it means the reads just comes from positive or negative strand?

    Orientation bias is not the same thing as strand bias.

    so what is Z-score from Wilcoxon rank sum test, and in mutect2

    https://en.wikipedia.org/wiki/Mann–Whitney_U_test

    Z scores are part of all frequentist tests.

    so what is Z-score from Wilcoxon rank sum test, and in mutect2, the corresponding is F1R2 and F2R1 ,am I right?

    F1R2 and F2R1 are read counts and have nothing to do with a rank sum test.
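
For readers wondering what a rank sum Z-score looks like in practice, here is a small self-contained sketch of the Mann-Whitney U statistic with the usual normal approximation. It is a textbook version with average ranks for ties and no tie correction to the variance, not GATK's ReadPosRankSum implementation; the read positions in the usage lines are made up.

```python
from math import sqrt


def rank_sum_z(alt, ref):
    """Z-score of the Mann-Whitney U statistic for alt vs. ref values."""
    # Pool both groups, remembering which group each value came from.
    pooled = sorted([(v, "alt") for v in alt] + [(v, "ref") for v in ref])

    # Sum the ranks of the alt values, giving tied values their average rank.
    alt_rank_sum = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        alt_rank_sum += average_rank * sum(
            1 for k in range(i, j) if pooled[k][1] == "alt")
        i = j

    n_alt, n_ref = len(alt), len(ref)
    u = alt_rank_sum - n_alt * (n_alt + 1) / 2
    mean_u = n_alt * n_ref / 2
    sd_u = sqrt(n_alt * n_ref * (n_alt + n_ref + 1) / 12)
    return (u - mean_u) / sd_u


# Made-up read positions: alt-supporting bases sit nearer the read ends.
alt_positions = [1, 2, 3]
ref_positions = [4, 5, 6]
print(rank_sum_z(alt_positions, ref_positions))  # negative: alt ranks lower
```

A Z-score near zero means the alt and ref values are interleaved in rank; a strongly negative or positive value means one group sits systematically lower or higher than the other.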

  • davidben Boston Member, Broadie, Dev ✭✭✭

    @xiucz Orientation bias is the signature of substitutions (errors, not mutations) that occur on only one strand of DNA prior to adding adapters. This strand bias before amplification does not cause strand bias after sequencing, because an error on one strand, when that strand serves as a template during PCR, yields a piece of DNA with the corresponding error copied onto the other strand. However, because the 3' and 5' adapters are different (that is, they are not reverse complements -- that's what "forked" adapters means), forward strand and reverse strand reads will have different adapters, which means error reads on one strand always occur on read 1 and error reads on the other strand always occur on read 2.

    To put it another way, suppose we represent things in binary: forward strand = 1, reverse strand = -1; read 1 = 1; read 2 = -1. Then we don't have a bias in strandedness, but we do have a bias in the product strand * read -- almost all evidence has the same sign of strand * read.
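
The binary encoding above can be tried on a toy set of alt-supporting reads. The counts here are invented purely for illustration: the reads are nearly balanced between strands, yet almost all of them share the same sign of strand * read.

```python
# Each alt read is (strand, read): forward = +1, reverse = -1;
# read 1 = +1, read 2 = -1. Invented counts for illustration only.
alt_reads = (
    [(+1, +1)] * 6    # forward strand, read 1
    + [(-1, -1)] * 5  # reverse strand, read 2
    + [(+1, -1)] * 1  # forward strand, read 2 (the odd one out)
)

# Strand bias: the mean of strand alone. Near 0 means no strand bias.
mean_strand = sum(strand for strand, _ in alt_reads) / len(alt_reads)

# Orientation bias: the mean of strand * read. Near +1 or -1 means
# almost all evidence comes from a single read-pair orientation.
mean_product = sum(strand * read for strand, read in alt_reads) / len(alt_reads)

print(f"mean strand:        {mean_strand:+.2f}")
print(f"mean strand * read: {mean_product:+.2f}")
```

With these counts the mean strand is close to zero while the mean of strand * read is close to +1, which is the pattern described above: no strand bias, but a clear orientation bias.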

  • 2904359495 Member
    edited August 1

    This means the variant was phased with another filtered variant, in which case we guess that the entire haplotype was some sort of technical or mapping error.

    it seems comes out very rare, why?

    Some filters, such as those tailored to the mitochondria and cfDNA pipelines, are not turned on by default.

    I found the mitochondrial mode, but what is the specific argument for the cfDNA pipeline? And when should I consider turning it on? To put it another way, how deeply do I need to consider this? Thanks a lot.

    You can find defaults in the GATK source code. However, if someone is enough of an expert to worry about the defaults then that person is also enough of an expert to choose suitable non-default values.

    I found most of them in the FilterMutectCalls-specific arguments; sorry for my silly question.

  • davidben Boston Member, Broadie, Dev ✭✭✭

    I found the Mitochondrial mode, but what is the specific argument for cfdna pipline ? thanks a lot, and when should I consider turn on that

    cfDNA-only filters and annotations are useful for cfDNA data. That being said, the defaults work pretty well for cfDNA, too, and we don't have a formal, public, cfDNA pipeline yet, although we're working on one.
