(howto) Apply hard filters to a call set

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,271Administrator, GATK Dev admin
edited March 30 in Tutorials

Objective

Apply hard filters to a variant callset that is too small for VQSR or for which truth/training sets are not available.

Prerequisites

  • TBD

Steps

  1. Extract the SNPs from the call set
  2. Determine parameters for filtering SNPs
  3. Apply the filter to the SNP call set
  4. Extract the Indels from the call set
  5. Determine parameters for filtering indels
  6. Apply the filter to the Indel call set

1. Extract the SNPs from the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T SelectVariants \ 
    -R reference.fa \ 
    -V raw_variants.vcf \ 
    -L 20 \ 
    -selectType SNP \ 
    -o raw_snps.vcf 

Expected Result

This creates a VCF file called raw_snps.vcf, containing just the SNPs from the original file of raw variants.


2. Determine parameters for filtering SNPs

SNPs matching any of these conditions will be considered bad and filtered out, i.e. marked FILTER in the output VCF file. The program will specify which parameter was chiefly responsible for the exclusion of the SNP using the culprit annotation. SNPs that do not match any of these conditions will be considered good and marked PASS in the output VCF file.

  • QualByDepth (QD) 2.0

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.

  • FisherStrand (FS) 60.0

Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.

  • RMSMappingQuality (MQ) 40.0

This is the Root Mean Square of the mapping quality of the reads across all samples.

  • MappingQualityRankSumTest (MQRankSum) 12.5

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for mapping qualities (reads with ref bases vs. those with the alternate allele). Note that the mapping quality rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.

  • ReadPosRankSumTest (ReadPosRankSum) 8.0

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.


3. Apply the filter to the SNP call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T VariantFiltration \ 
    -R reference.fa \ 
    -V raw_snps.vcf \ 
    --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \ 
    --filterName "my_snp_filter" \ 
    -o filtered_snps.vcf 

Expected Result

This creates a VCF file called filtered_snps.vcf, containing all the original SNPs from the raw_snps.vcf file, but now the SNPs are annotated with either PASS or FILTER depending on whether or not they passed the filters.

For SNPs that failed the filter, the variant annotation also includes the name of the filter. That way, if you apply several different filters (simultaneously or sequentially), you can keep track of which filter(s) each SNP failed, and later you can retrieve specific subsets of your calls using the SelectVariants tool. To learn more about composing different types of filtering expressions and retrieving subsets of variants using SelectVariants, please see the online GATK documentation.


4. Extract the Indels from the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T SelectVariants \ 
    -R reference.fa \ 
    -V raw_HC_variants.vcf \ 
    -L 20 \ 
    -selectType INDEL \ 
    -o raw_indels.vcf 

Expected Result

This creates a VCF file called raw_indels.vcf, containing just the Indels from the original file of raw variants.


5. Determine parameters for filtering Indels.

Indels matching any of these conditions will be considered bad and filtered out, i.e. marked FILTER in the output VCF file. The program will specify which parameter was chiefly responsible for the exclusion of the indel using the culprit annotation. Indels that do not match any of these conditions will be considered good and marked PASS in the output VCF file.

  • QualByDepth (QD) 2.0

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.

  • FisherStrand (FS) 200.0

Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.

  • ReadPosRankSumTest (ReadPosRankSum) 20.0

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.


6. Apply the filter to the Indel call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T VariantFiltration \ 
    -R reference.fa \ 
    -V raw_indels.vcf \ 
    --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \ 
    --filterName "my_indel_filter" \ 
    -o filtered_indels.vcf 

Expected Result

This creates a VCF file called filtered_indels.vcf, containing all the original Indels from the raw_indels.vcf file, but now the Indels are annotated with either PASS or FILTER depending on whether or not they passed the filters.

For Indels that failed the filter, the variant annotation also includes the name of the filter. That way, if you apply several different filters (simultaneously or sequentially), you can keep track of which filter(s) each Indel failed, and later you can retrieve specific subsets of your calls using the SelectVariants tool. To learn more about composing different types of filtering expressions and retrieving subsets of variants using SelectVariants, please see the online GATK documentation.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • jfbjfb Posts: 2Member

    Is there a clean way to avoid all the warning messages that arise when there are no alternate reads and one is using ReadPosRankSum or MQRankSum to filter variants? In a Dec2012 comment to @kurt (http://gatkforums.broadinstitute.org/discussion/1996/selectvariants-mqranksum-or-readposranksum), @ebanks suggested a work around using "isHomVar == 1 && ...", but I couldn't get that to work. Maybe another (better?) way would be to require more than two non-zero entries in BaseCounts--but I couldn't get that to work either. any suggestions?

  • jfbjfb Posts: 2Member

    Is there a clean way to avoid all the warning messages that arise when there are no alternate reads and one is using ReadPosRankSum or MQRankSum to filter variants? In a Dec2012 comment to @kurt (http://gatkforums.broadinstitute.org/discussion/1996/selectvariants-mqranksum-or-readposranksum), @ebanks suggested a work around using "isHomVar == 1 && ...", but I couldn't get that to work. Maybe another (better?) way would be to require more than two non-zero entries in BaseCounts--but I couldn't get that to work either. any suggestions?

  • volaviivolavii Posts: 17Member

    maybe i get this wrong, but why should i filter by haplotypeScore? It describes the consistency of a haplotype, so its more a filter for heterozygous sites. So when i would have in the reference A and the alternative allele is G, and the genotype is 1/1, but the haplotypescore is high (>13), this SNP would be filtered out, right?

  • ebanksebanks Broad InstitutePosts: 689Member, Administrator, GATK Dev, Broadie, Moderator, DSDE Dev, GP Member admin

    The Haplotype Score describes the likelihood that the reads at the site come from at most two haplotypes. So, yes, it does make sense even in the homozygous case.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • mjmaurermjmaurer Posts: 1Member

    Hello,

    I am unclear whether it is still recommended (since the new best practices guidelines for GATK 2.7) to use a max DP filter when performing hard filtering on variants called from whole genome data. Does the following still apply?

    "The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling but in practice it is more a Gaussian distribution. Regardless, the DP threshold should be set a 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts. Note that for exomes, a straight DP filter shouldn't be used because the relationship between misalignments and depth isn't clear for capture data."

    Also, it is ever recommended to use a minimum depth of coverage filter?

    Thanks,
    Matt

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,271Administrator, GATK Dev admin

    Hi Matt,

    Sorry for the delayed response, your question slipped through my net.

    Yes, that remark still applies, as indicated in the document on hard filtering. We don't normally recommend filtering on depth of coverage (see our default recommendations in the document linked above), but if you do you should keep those caveats in mind.

    Geraldine Van der Auwera, PhD

  • mmterpstrammterpstra NetherlandsPosts: 37Member
    edited October 2013

    @jfb / other people that think that the VariantSelection/Filteringtools tools spit out too many errors:
    the errors only give error on missing values an do not mean anything. Although if you want to troubleshoot you commandline you can tap into the underlying variantcontext in your jexl expression to do this(and many other things). Here are in some options I specified with the VariantFiltration Tool:
    --filterExpression "vc.hasAttribute('ReadPosRankSum') && ReadPosRankSum < -20.0" \
    --filterName "HRUNgt5" \
    --filterExpression "vc.hasAttribute('RPA') && RPA > 8" \
    --filterName "RPAgt8" \
    The "vc.hasAttribute('attribute')" part returns true if variable exists and false if the variable does not exists. When false the second statement is not evaluated. Good luck at troubleshoot!!!

    @everyone
    Also my question: What is the easiest way to select String variables (without using variantcontext ) I now use:
    "(vc.hasAttribute('SNPEFF_IMPACT') && vc.getAttribute('SNPEFF_IMPACT').equals('HIGH'))"
    This thing woks, but completely unfriendly to beginners....(Took a day or two to figure this one out....)

    Post edited by mmterpstra on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,271Administrator, GATK Dev admin

    Hi @mmterpstra, we realize this is a little rough, but we're planning to put together a cookbook of JEXL expressions to help new users get over that initial hump.

    Geraldine Van der Auwera, PhD

  • Bianca_TBianca_T StockholmPosts: 2Member

    Hi,

    I am working with a dataset from Ion Torrent (Ampliseq targeted resequencing on human samples). I have called the variants with HC (GATK 2.7.2) and would like now to apply hard filters. Seen the different kind of sequencing errors produced by Ion Torrent, how do you suggest I modify the filters? Applied as they are now, a huge number of indels are filtered out but those include my control known mutations.

    Thanks for your help.

    /Bianca

  • ebanksebanks Broad InstitutePosts: 689Member, Administrator, GATK Dev, Broadie, Moderator, DSDE Dev, GP Member admin

    Hi Bianca, we don't have much experience trying to get good calls out of Ion data. But I can say that from the little we've seen, we're doubtful that it's at all possible to call indels accurately from that data. If you do have success please post your findings here as we'd all love to know how it's done!

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • bwubbbwubb Posts: 51Member

    Would it be recommended to apply any hard filters to MULTIALLELIC_COMPLEX variants? (And is that GATK long form of MNP?). If so would it be treated more like an indel? Thank you.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,271Administrator, GATK Dev admin

    Hi @bwubb,

    Can you tell me where you encountered this name? I'm not familiar with it and I can't find it used anywhere in our codebase. The types of variants we work with are listed here:

    http://www.broadinstitute.org/gatk/guide/article?id=3682

    Geraldine Van der Auwera, PhD

  • bwubbbwubb Posts: 51Member

    Hi @Geraldine_VdAuwera

    Well this is in a vcf called by UG. I could provide the the settings I used, but the are more or less in line with the Updated Best Practices for 2.8-1, though the term is appears in the VariantType annotation. Here is an example (Sorry the post formatting gets weird):

    1  45787220  .  CT CTT,C   17247.46 PASS AC=20,6;AF=0.029,8.850e-03;AN=678;BaseQRankSum=-1.230;DP=59253;FS=11.224; InbreedingCoeff=-0.0093;LowMQ=0.0052,0.0093,59253;MLEAC=19,3;MLEAF=0.028,4.425e-03;MQ=59.43;MQ0=0;MQRankSum=-0.865; QD=4.76;RPA=12,13,11;RU=T;ReadPosRankSum=2.225;STR;VariantType=MULTIALLELIC_COMPLEX.Other GT:AD:DP:GQ:PL
    

    My guess would be Indel, and while this particular variant "PASSED", now based on your comment Im not certain if the the filters are actually considered. Thank you.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,271Administrator, GATK Dev admin

    Ah, found it. "MULTIALLELIC_COMPLEX" refers to a multiallelic variant that is neither a simple deletion nor a simple insertion (so it's a mix of both). I couldn't find it in the code because the complete string doesn't appear as such, it's a composite of "MULTIALLELIC_" + insertions, deletion or complex base on case.

    So yes, it's a sort of multiallelic indel. If you used a complex filtering expression, you might consider re-running it through each filter individually to make sure that it's getting evaluated properly on each property.

    Geraldine Van der Auwera, PhD

  • yeamanyeaman Posts: 23Member

    Hello,

    I was wondering if you have any specific suggestions for what kind of filtering criteria we should use when creating a new dbsnp for a non-model organism? I was assuming I would just filter based on the site quality score, but now that seems like it might be a coarse approach. I assume at this stage it is good to be very stringent? Would it be appropriate to use the values described above in the documentation for step #2?

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,271Administrator, GATK Dev admin

    Hi @yeaman,

    The values above should provide a good starting point (definitely better than just filtering on quals) but you'll need to experiment a little to find the values that are right for your data, and to produce the tradeoff between sensitivity and specificity that is right for your project. Good luck!

    Geraldine Van der Auwera, PhD

  • yeamanyeaman Posts: 23Member

    Thanks for the quick reply! I'll try that out.

  • davela3davela3 Posts: 5Member

    Hi @Geraldine_VdAuwera,

    Is there a reason why the examples have the --interval/-L argument set to "20"? I apologize if this question is a bit trivial, but what would determine whether we use the argument or not? In my case, I only used intervals during the ReAligner phase of the pipeline.

  • SheilaSheila Broad InstitutePosts: 1,938Member, GATK Dev, Broadie, Moderator, DSDE Dev admin
    edited May 2014

    @davela3

    Hi,

    In the example, this means to look only at chromosome 20. The -L argument, although an "interval" argument, can also be used to specify a chromosome to focus on.

    If running on a whole genome, you would not specify -L. If running on an exome, you can use it to provide a list of capture targets.

    I hope this answers your question.

    -Sheila

    Post edited by Sheila on
  • davela3davela3 Posts: 5Member

    Thanks so much, @Sheila!

  • davela3davela3 Posts: 5Member

    Hi,

    For other SNP filtering parameters, should we also consider the GT in conjunction with the GQ (ie. only want 0/1 or 1/1 with a GQ or 20 or more)?

    Similarly, are there are parameters we should consider for indels?

    I used HaplotypeCaller on human samples (single gene target) - but I don't seem to have HaplotypeScore in my filteredvcf files. Is this an assumed error on my part?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,271Administrator, GATK Dev admin

    @davela3,

    should we also consider the GT in conjunction with the GQ (ie. only want 0/1 or 1/1 with a GQ or 20 or more)?

    No, we do not make such a distinction, but feel free to experiment and see what works best in your data.

    are there are parameters we should consider for indels?

    Just the ones in the document above. Same comment as for SNPs: it's always a good idea to experiment and see fits what your dataset.

    HaplotypeCaller does not emit HaplotypeScore because that metric is already taken into account during the calling process.

    Geraldine Van der Auwera, PhD

  • davela3davela3 Posts: 5Member

    @Ger As usual, thank you for your response and insight!

  • noushin6noushin6 Baltimore, MDPosts: 15Member

    Hi @Geraldine_VdAuwera‌,

    I am using GATK 3.1, trying to apply the hard filter described here. I noticed that I get slightly different results using "MappingQualityRankSumTest" vs "MQRankSum" in filter definition, with the latter being more strict of a filter. My guess would be that if anything, the walker is respecting "MQRankSum" and not the other, but that is in contrast with usage suggested here. Does it recognize both?

    Thank you as always!
    Noushin

  • SheilaSheila Broad InstitutePosts: 1,938Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @noushin

    Hi Noushin,

    Those are two different names for the same filter.

    Please tell me what the command lines you used and what is the exact difference between the two results.

    Thanks,
    Sheila

  • leobokileoboki Atlanta, Georgia, USAPosts: 2Member

    Hi,

    I understand that MQRankSum and ReadPosRankSum are z-approximation from Mann-Whitney Rank Sum Test. Extreme values of these statistics indicate strong biases for mapping qualities and read positions. Intuitively, to set up thresholds for hard filtering, the criteria should be two-tailed in these cases, that is, MappingQualityRankSum < -12.5 || MappingQualityRankSum > 12.5 || ReadPosRankSum < -8.0 || ReadPosRankSum > 8.0. However, in the example above, the filtering criteria include only "MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0."

    Does the expression "MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0" implicitly imply that these thresholds are two-tailed?

    Thanks,
    Leo

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,271Administrator, GATK Dev admin

    Hi @leoboki,

    No, we only check whether the alt bases have lower scores than those of the ref bases.

    Geraldine Van der Auwera, PhD

  • wendywendy taiwanPosts: 12Member

    Hi all,
    After I use hard filtering.I got two data(filtered_snps.vcf and filtered_indels.vcf ). Should I merge these two data for the following steps?

    Thanks for help!

    Wendy

  • SheilaSheila Broad InstitutePosts: 1,938Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @wendy

    Hi Wendy,

    Yes, you can use CombineVariants for this step.

    -Sheila

  • sletortsletort francePosts: 40Member

    Hi,
    Just to mention it, space in my filter make the command fails with 'Invalid argument value'
    I'm using bash.

  • ZaagZaag Posts: 14Member

    There is a small error in step 5, should be indels i guess.

  • SheilaSheila Broad InstitutePosts: 1,938Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @Zaag

    Fixed :)

    Thanks,
    Sheila

  • metheusemetheuse ann arborPosts: 6Member

    I don't know why some variants were missing after I applied selectvariants to separate them into snps and indels. Like the following one:

    chr6 106965216 rs200196517 TACAC CACAC,T 12574.57 . AC=3,1;AF=0.375,0.125;AN=8;BaseQRankSum=2.23;ClippingRankSum=0.086;DB;DP=875;FS=0.820;GQ_MEAN=3385.00;GQ_STDDEV=1302.81;MLEAC=3,1;MLEAF=0.375,0.125;MQ=54.83;MQ0=0;MQRankSum=-8.372e+00;NCC=0;QD=15.30;ReadPosRankSum=-1.102e+00 GT:AD:DP:GQ:PL 0/1:127,120,0:247:99:5115,0,4980,5501,5375,10875 0/2:133,0,90:223:99:3404,3817,9566,0,5750,5473 0/1:120,97,0:217:99:3367,0,4846,3737,5138,8875 0/1:110,23,2:135:99:1789,0,4475,2096,4671,6758

    I used exactly the same commands listed on this page to select snps and indels. But the above variant is in neither of them.
    I think this variant could be either a SNP (for the allele "CACAC") or an INDEL (for the allele "T"). Is this why the program cannot decide its category so it discarded it?

    It's not clear to me how I should deal with this case. Any advice? Thanks.

  • KurtKurt Posts: 222Member ✭✭✭

    That would be -selectType MIXED I believe.

  • metheusemetheuse ann arborPosts: 6Member

    @Kurt said:
    That would be -selectType MIXED I believe.

    Thanks. Sorry I didn't notice that before. But what filtration I should do with these "mixed" variants, since SNP and INDEL have different filtration?
    Does it make sense to rewrite this kind of position into two lines (one SNP, and one INDEL) and then apply two filtration separately?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,271Administrator, GATK Dev admin

    We haven't examined this extensively so it's up to you, but FYI during VQSR the mixed type variants are processed with the indels. Best thing would be to try both approaches and compare results...

    Geraldine Van der Auwera, PhD

  • sibsccsibscc sibsPosts: 62Member

    Can VariantFiltration just filter the variant by depth?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,271Administrator, GATK Dev admin

    Sure, if you want to. We don't recommend it though, at least not as a primary way of filtering your callset.

    Geraldine Van der Auwera, PhD

  • sibsccsibscc sibsPosts: 62Member
    edited November 2014

    @Geraldine_VdAuwera said:
    Sure, if you want to. We don't recommend it though, at least not as a primary way of filtering your callset.

    Thank you.
    Is there a document including all the filters in "--filterExpression"?

    Post edited by sibscc on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,271Administrator, GATK Dev admin

    You can compose any filter you want using the site-level annotations that are listed in the Variant Annotations documentation.

    Geraldine Van der Auwera, PhD

  • sibsccsibscc sibsPosts: 62Member

    @Geraldine_VdAuwera
    Hi,
    I am wondering if I do not specify SNP or INDEL, it will filter both SNPs and INDELs.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,271Administrator, GATK Dev admin

    @sibscc That's correct, it will filter both. That's why we recommend separating them into different files.

    Geraldine Van der Auwera, PhD

  • GangpaoGangpao PekingPosts: 3Member

    Hi there!
    i meet an error message in the filters.
    commands as follows:

    add an annotation

    java -Xmx5g -jar $gatk -T VariantAnnotator -R $ref_dir/ucsc.hg19.fasta --variant $other_dir/test.recalibrated_snps_raw_indels.vcf -I $bam_dir/T-SZ-03-1.dedupped.realigned.recal.bam -o $other_dir/test.annotated.recalibrated_snps_raw_indels.vcf -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -L $other_dir/test.recalibrated_snps_raw_indels.vcf --dbsnp $ref_dir/dbsnp_138.hg19.vcf

    1. Extract the SNPs from the call set

    java -jar $gatk -R $ref_dir/ucsc.hg19.fasta -T SelectVariants -V $other_dir/test.annotated.recalibrated_snps_raw_indels.vcf -selectType SNP -o $other_dir/test.recalibrated_snps.vcf

    2. Apply the filter to the SNP call set

    java -jar $gatk -R $ref_dir/ucsc.hg19.fasta -T VariantFiltration -V $other_dir/test.recalibrated_snps.vcf --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "my_snp_filter" -o $other_dir/test.recalibrated.filtered_snps.vcf

    the warn as below:
    WARN 22:02:16,983 Interpreter - ![63,84]: 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0;' undefined variable MappingQualityRankSum

    so i have tried ”MappingQualityRankSum < -12.5” “MQRankSum < -12.5” “MappingQualityRankSumTest < -12.5” , none of them work.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,271Administrator, GATK Dev admin

    @Gangpao Sorry, the doc is a little out of date. HaplotypeScore should no longer be used, and MappingQualityRankSum should be replaced by MQRankSum. We'll update the doc asap.

    Geraldine Van der Auwera, PhD

  • agopalagopal CanadaPosts: 4Member

    Hi,

    Why is MQ not recommended for indel filtering, the way it is for SNP filtering?

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,271Administrator, GATK Dev admin

    Most annotations related to mapping quality have been removed from the filtering recommendations for indels because there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner. This covariation is not necessarily indicative of being an error in the same way that it is for SNPs.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.