A clear interpretation of the FILTER column in a Mutect2 VCF
Hi, I have read the docs (mathematical notes on mutect.pdf and the latest mutect.pdf) and the header of the VCF,
but I am still confused when reading variants in the VCF file, so I would like to go through the filters with you carefully. I hope you can give me some guidance. The questions are quite detailed and may take a lot of your time; I am sorry about that, and thanks a lot.
min-median-base-quality is the minimum median base quality of bases supporting a SNV,
but the doc does not give the concrete threshold value of the minimum median base quality.
Likewise, for min-median-mapping-quality the doc does not give the concrete value of the minimum median mapping quality.
FILTER=<ID=clustered_events,Description="Clustered events observed in the tumor">: the doc describes the threshold as the maximum allowable number of called variants co-occurring in a single assembly region. If the number of called variants exceeds this, they will all be filtered. How should I understand "a single assembly region"?
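The rule quoted above can be sketched in a few lines of Python: count the calls per assembly region and, if a region holds more than the limit, filter every call in it. This is only an illustration of the sentence from the doc, not GATK's implementation; the region IDs, the call names, and the limit of 2 are hypothetical placeholders.

```python
from collections import Counter


def clustered_event_filter(calls, max_events):
    """Return the names of calls filtered as clustered events.

    calls: list of (region_id, call_name) pairs. region_id is a hypothetical
    stand-in for the assembly region a call was made in.
    """
    per_region = Counter(region for region, _ in calls)
    # Every call in an over-full region is filtered, not just the extras.
    return {name for region, name in calls if per_region[region] > max_events}


# Hypothetical calls: three in one region, one in another.
calls = [
    ("regionA", "chr1:100 A>G"),
    ("regionA", "chr1:130 C>T"),
    ("regionA", "chr1:151 G>A"),
    ("regionB", "chr1:900 T>C"),
]
filtered = clustered_event_filter(calls, max_events=2)  # 2 is a placeholder limit
print(sorted(filtered))
```

With these placeholder inputs, all three regionA calls are filtered and the single regionB call is not.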
FILTER=<ID=bad_haplotype,Description="Variant near filtered variant on same haplotype.">: how should I understand this? For example, here is a site:
chr1 144854528 . A G . bad_haplotype;clustered_events DP=971;ECNT=5;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.537e+01;TLOD=1291.33 GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:ORIGINAL_CONTIG_MISMATCH:PGT:PID:SA_MAP_AF:SA_POST_PROB 0/1:607,356:0.370:963:289,183:318,173:29,29:175,187:60:39:false:false:.:.:50.60:100.00:0:0|1:144854528_A_G:0.364,0.364,0.370:5.314e-03,0.015,0.980
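Records like this are easier to read once each FORMAT key is paired with its sample value. Here is a minimal Python sketch, assuming a single-sample record; `parse_record` is a hypothetical helper written for this post, not part of any GATK tooling.

```python
def parse_record(line):
    """Split one single-sample VCF data line into labelled parts."""
    # VCF is tab-delimited; split on any whitespace here because the
    # record pasted in this post lost its tabs.
    cols = line.split()
    chrom, pos, _id, ref, alt, _qual, filt, info, fmt, sample = cols[:10]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt,
        "filters": filt.split(";"),
        "info": dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv),
        "sample": dict(zip(fmt.split(":"), sample.split(":"))),
    }


record = (
    "chr1 144854528 . A G . bad_haplotype;clustered_events "
    "DP=971;ECNT=5;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.537e+01;TLOD=1291.33 "
    "GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:"
    "ORIGINAL_CONTIG_MISMATCH:PGT:PID:SA_MAP_AF:SA_POST_PROB "
    "0/1:607,356:0.370:963:289,183:318,173:29,29:175,187:60:39:false:false:.:.:"
    "50.60:100.00:0:0|1:144854528_A_G:0.364,0.364,0.370:5.314e-03,0.015,0.980"
)

rec = parse_record(record)
print(rec["filters"])
print("ECNT:", rec["info"]["ECNT"])
print("PGT/PID:", rec["sample"]["PGT"], rec["sample"]["PID"])
```

ECNT and PGT/PID are printed only because they look like the fields most related to the two filters on this site; that link is an assumption, not something the header states.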
FILTER=<ID=chimeric_original_alignment,Description="NuMT variant with too many ALT reads originally from autosome">: how should I understand this?
max-germline-posterior is the maximum posterior probability, as determined by the above germline probability model, that a variant is a germline event, but the doc does not give the concrete threshold value.
FILTER=<ID=low_avg_alt_quality,Description="Low average alt quality">: is this the average base quality over all reads supporting the alt allele at this site? This filter seems to appear very rarely. Why?
max-alt-allele-count is the maximum allowable number of alt alleles at a site. By default only biallelic variants pass the filter. Does this mean a site can have at most three possible bases (ref and two possible alts)?
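One way to make this question concrete is to count the alleles listed in the ALT column of a record. This is a minimal sketch; the default limit of 1 below is inferred only from the sentence "By default only biallelic variants pass the filter" (one ref plus one alt) and should be read as an assumption, not a confirmed GATK default.

```python
def alt_allele_count(alt_field):
    """Count the alt alleles listed in a VCF ALT column (comma-separated)."""
    if alt_field == ".":
        return 0
    return len(alt_field.split(","))


def exceeds_alt_limit(alt_field, max_alt_allele_count=1):
    """True if the site lists more alt alleles than the limit allows.

    The default of 1 is an assumption inferred from "only biallelic
    variants pass" (one ref allele plus one alt allele).
    """
    return alt_allele_count(alt_field) > max_alt_allele_count


print(exceeds_alt_limit("G"))    # one alt allele: biallelic site
print(exceeds_alt_limit("G,T"))  # two alt alleles: triallelic site
```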
FILTER=<ID=n_ratio,Description="Ratio of N to alt exceeds specified ratio">: this filter seems to appear very rarely. Why?
FILTER=<ID=orientation_bias,Description="Orientation bias (in one of the specified artifact mode(s) or complement) seen in one or more samples.">: does this mean the reads come from only the positive or only the negative strand?
Here is an example site:
chr3 181496339 . G T . orientation_bias DP=450;ECNT=1;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.164e+02;TLOD=5.93 GT:AD:AF:DP:F1R2:F2R1:FT:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:ORIGINAL_CONTIG_MISMATCH:SA_MAP_AF:SA_POST_PROB 0/1:428,7:0.018:435:198,6:230,1:orientation_bias:35,28:178,195:60:27:true:false:0.857:0.249:33.02:100.00:0:0.010,0.020,0.016:3.913e-03,1.666e-03,0.994
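The F1R2 and F2R1 values of this site can be compared directly. Here is a minimal Python sketch, assuming each field holds "ref,alt" read counts for one read-pair orientation; `orientation_summary` is a hypothetical helper and does not reproduce the orientation bias filter itself.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def orientation_summary(f1r2, f2r1):
    """Summarise alt support per read-pair orientation.

    f1r2, f2r1: 'ref,alt' count strings as they appear in the sample column
    (assumed layout: reference count first, alt count second).
    """
    ref1, alt1 = (int(x) for x in f1r2.split(","))
    ref2, alt2 = (int(x) for x in f2r1.split(","))
    return {
        "alt_frac_f1r2": safe_div(alt1, ref1 + alt1),
        "alt_frac_f2r1": safe_div(alt2, ref2 + alt2),
        # Share of all alt reads that sit in the F1R2 orientation.
        "alt_share_f1r2": safe_div(alt1, alt1 + alt2),
    }


# Values taken from the chr3:181496339 record above.
summary = orientation_summary("198,6", "230,1")
for key, value in summary.items():
    print(key, round(value, 3))
```

Here 6 of the 7 alt reads are in the F1R2 orientation (6/7 ≈ 0.857), which is the same number as the OBF value in the record; whether OBF is computed exactly this way is an assumption.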
min-median-read-position is the minimum median length of bases supporting an allele from the closest end
of the read. Indel positions are measured by the end farthest from the end of the read. But the doc does not give the concrete threshold value.
Here is an example site:
chr1 16258280 . C CTCTAAATCTTCA . read_position DP=618;ECNT=1;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.682e+02;TLOD=5.82 GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:ORIGINAL_CONTIG_MISMATCH:SA_MAP_AF:SA_POST_PROB 0/1:586,4:8.350e-03:590:272,4:314,0:29,33:176,140:60:0:false:false:0:0.010,0.010,6.780e-03:1.216e-03,1.579e-03,0.997
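The quantity described for min-median-read-position can be sketched as follows: for each alt-supporting read, take the distance from the variant to the nearest read end, then take the median over reads. This is a simplified illustration for SNV-like positions only (it ignores the separate indel rule quoted above); the read offsets, the read length of 151, and the threshold of 5 are hypothetical placeholders.

```python
from statistics import median


def median_distance_from_end(alt_reads):
    """Median distance of the variant from the closest read end.

    alt_reads: list of (offset, read_length) pairs, where offset is the
    0-based position of the variant within the read (hypothetical input).
    """
    return median(min(offset, read_len - 1 - offset) for offset, read_len in alt_reads)


def fails_read_position(alt_reads, min_median_read_position):
    """True if the median distance is below the threshold."""
    return median_distance_from_end(alt_reads) < min_median_read_position


# Four hypothetical alt reads, all with the variant at or next to a read end.
alt_reads = [(0, 151), (1, 151), (150, 151), (0, 151)]
print(median_distance_from_end(alt_reads))
print(fails_read_position(alt_reads, min_median_read_position=5))  # 5 is a placeholder
```

A median of 0 in this toy input is the kind of value that would correspond to MPOS=0 in the record above, if MPOS is computed along these lines.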
max-strand-artifact-probability is the posterior probability of a strand artifact, as determined by the
model described above, required to apply the strand artifact filter. This is necessary but not sufficient; we also
require the estimated max a posteriori allele fraction to be less than min-strand-artifact-allele-fraction.
The second condition prevents filtering real variants that also have significant strand bias, i.e. a true variant
that also has some artifactual reads.
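The two conditions in the passage above can be written as a small Python sketch: the filter applies only when the artifact posterior is high enough and the estimated allele fraction is low enough. The threshold values 0.99 and 0.01 are placeholders chosen for illustration, and whether the comparisons are strict or inclusive is an assumption.

```python
def strand_artifact_filtered(
    artifact_posterior,
    map_allele_fraction,
    max_strand_artifact_probability,
    min_strand_artifact_allele_fraction,
):
    """Apply both conditions described in the quoted passage."""
    # Condition 1 (necessary): posterior probability of a strand artifact is high.
    high_posterior = artifact_posterior > max_strand_artifact_probability
    # Condition 2: the max a posteriori allele fraction is small, so a real
    # variant with some strand-biased reads is not filtered.
    low_fraction = map_allele_fraction < min_strand_artifact_allele_fraction
    return high_posterior and low_fraction


# Placeholder thresholds, not confirmed defaults.
MAX_PROB, MIN_AF = 0.99, 0.01

print(strand_artifact_filtered(0.995, 0.005, MAX_PROB, MIN_AF))  # both conditions met
print(strand_artifact_filtered(0.995, 0.300, MAX_PROB, MIN_AF))  # high allele fraction
print(strand_artifact_filtered(0.500, 0.005, MAX_PROB, MIN_AF))  # low posterior
```

Only the first call returns True; the second is the case the passage calls a true variant that also has some artifactual reads.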
How should I understand "Evidence for alt allele comes from one read direction only"? How many read directions are there? Could you draw a diagram showing when it is one read direction and when it is two? And do F1R2 and F2R1 stand for this?
FILTER=<ID=strict_strand_bias,Description="Evidence for alt allele is not represented in both directions">: this filter appears very rarely. Why?
Another question: when I view reads in IGV and sort them by sample, there can be many reads with 'Sample=HC'. It seems these reads are not counted in the DP in the VCF, even though they appear in bamout.bam.