Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

(howto) Apply hard filters to a call set

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,275Administrator, GATK Developer admin

Objective

Apply hard filters to a variant callset that is too small for VQSR or for which truth/training sets are not available.

Prerequisites

  • TBD

Steps

  1. Extract the SNPs from the call set
  2. Determine parameters for filtering SNPs
  3. Apply the filter to the SNP call set
  4. Extract the Indels from the call set
  5. Determine parameters for filtering SNPs
  6. Apply the filter to the Indel call set

1. Extract the SNPs from the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T SelectVariants \ 
    -R reference.fa \ 
    -V raw_variants.vcf \ 
    -L 20 \ 
    -selectType SNP \ 
    -o raw_snps.vcf 

Expected Result

This creates a VCF file called raw_snps.vcf, containing just the SNPs from the original file of raw variants.


2. Determine parameters for filtering SNPs

SNPs matching any of these conditions will be considered bad and filtered out, i.e. marked FILTER in the output VCF file. The program will specify which parameter was chiefly responsible for the exclusion of the SNP using the culprit annotation. SNPs that do not match any of these conditions will be considered good and marked PASS in the output VCF file.

  • QualByDepth (QD) 2.0

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.

  • FisherStrand (FS) 60.0

Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.

  • RMSMappingQuality (MQ) 40.0

This is the Root Mean Square of the mapping quality of the reads across all samples.

  • HaplotypeScore 13.0

This is the consistency of the site with two (and only two) segregating haplotypes. Note that this is not applicable for calls made using the UnifiedGenotyper on non-diploid organisms.

  • MappingQualityRankSumTest (MQRankSum) 12.5

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for mapping qualities (reads with ref bases vs. those with the alternate allele). Note that the mapping quality rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.

  • ReadPosRankSumTest (ReadPosRankSum) 8.0

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.


3. Apply the filter to the SNP call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T VariantFiltration \ 
    -R reference.fa \ 
    -V raw_snps.vcf \ 
    --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0" \ 
    --filterName "my_snp_filter" \ 
    -o filtered_snps.vcf 

Expected Result

This creates a VCF file called filtered_snps.vcf, containing all the original SNPs from the raw_snps.vcf file, but now the SNPs are annotated with either PASS or FILTER depending on whether or not they passed the filters.

For SNPs that failed the filter, the variant annotation also includes the name of the filter. That way, if you apply several different filters (simultaneously or sequentially), you can keep track of which filter(s) each SNP failed, and later you can retrieve specific subsets of your calls using the SelectVariants tool. To learn more about composing different types of filtering expressions and retrieving subsets of variants using SelectVariants, please see the online GATK documentation.


4. Extract the Indels from the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T SelectVariants \ 
    -R reference.fa \ 
    -V raw_HC_variants.vcf \ 
    -L 20 \ 
    -selectType INDEL \ 
    -o raw_indels.vcf 

Expected Result

This creates a VCF file called raw_indels.vcf, containing just the Indels from the original file of raw variants.


5. Determine parameters for filtering Indels.

Indels matching any of these conditions will be considered bad and filtered out, i.e. marked FILTER in the output VCF file. The program will specify which parameter was chiefly responsible for the exclusion of the indel using the culprit annotation. Indels that do not match any of these conditions will be considered good and marked PASS in the output VCF file.

  • QualByDepth (QD) 2.0

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.

  • FisherStrand (FS) 200.0

Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.

  • ReadPosRankSumTest (ReadPosRankSum) 20.0

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.


6. Apply the filter to the Indel call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T VariantFiltration \ 
    -R reference.fa \ 
    -V raw_indels.vcf \ 
    --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \ 
    --filterName "my_indel_filter" \ 
    -o filtered_indels.vcf 

Expected Result

This creates a VCF file called filtered_indels.vcf, containing all the original Indels from the raw_indels.vcf file, but now the Indels are annotated with either PASS or FILTER depending on whether or not they passed the filters.

For Indels that failed the filter, the variant annotation also includes the name of the filter. That way, if you apply several different filters (simultaneously or sequentially), you can keep track of which filter(s) each Indel failed, and later you can retrieve specific subsets of your calls using the SelectVariants tool. To learn more about composing different types of filtering expressions and retrieving subsets of variants using SelectVariants, please see the online GATK documentation.

Geraldine Van der Auwera, PhD

Comments

  • jfbjfb Posts: 2Member

    Is there a clean way to avoid all the warning messages that arise when there are no alternate reads and one is using ReadPosRankSum or MQRankSum to filter variants? In a Dec2012 comment to @kurt (http://gatkforums.broadinstitute.org/discussion/1996/selectvariants-mqranksum-or-readposranksum), @ebanks suggested a work around using "isHomVar == 1 && ...", but I couldn't get that to work. Maybe another (better?) way would be to require more than two non-zero entries in BaseCounts--but I couldn't get that to work either. any suggestions?

  • jfbjfb Posts: 2Member

    Is there a clean way to avoid all the warning messages that arise when there are no alternate reads and one is using ReadPosRankSum or MQRankSum to filter variants? In a Dec2012 comment to @kurt (http://gatkforums.broadinstitute.org/discussion/1996/selectvariants-mqranksum-or-readposranksum), @ebanks suggested a work around using "isHomVar == 1 && ...", but I couldn't get that to work. Maybe another (better?) way would be to require more than two non-zero entries in BaseCounts--but I couldn't get that to work either. any suggestions?

  • volaviivolavii Posts: 17Member

    maybe i get this wrong, but why should i filter by haplotypeScore? It describes the consistency of a haplotype, so its more a filter for heterozygous sites. So when i would have in the reference A and the alternative allele is G, and the genotype is 1/1, but the haplotypescore is high (>13), this SNP would be filtered out, right?

  • ebanksebanks Posts: 683GATK Developer mod

    The Haplotype Score describes the likelihood that the reads at the site come from at most two haplotypes. So, yes, it does make sense even in the homozygous case.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • mjmaurermjmaurer Posts: 1Member

    Hello,

    I am unclear whether it is still recommended (since the new best practices guidelines for GATK 2.7) to use a max DP filter when performing hard filtering on variants called from whole genome data. Does the following still apply?

    "The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling but in practice it is more a Gaussian distribution. Regardless, the DP threshold should be set a 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts. Note that for exomes, a straight DP filter shouldn't be used because the relationship between misalignments and depth isn't clear for capture data."

    Also, it is ever recommended to use a minimum depth of coverage filter?

    Thanks, Matt

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,275Administrator, GATK Developer admin

    Hi Matt,

    Sorry for the delayed response, your question slipped through my net.

    Yes, that remark still applies, as indicated in the document on hard filtering. We don't normally recommend filtering on depth of coverage (see our default recommendations in the document linked above), but if you do you should keep those caveats in mind.

    Geraldine Van der Auwera, PhD

  • mmterpstrammterpstra NetherlandsPosts: 29Member
    edited October 2013

    @jfb / other people that think that the VariantSelection/Filteringtools tools spit out too many errors: the errors only give error on missing values an do not mean anything. Although if you want to troubleshoot you commandline you can tap into the underlying variantcontext in your jexl expression to do this(and many other things). Here are in some options I specified with the VariantFiltration Tool: --filterExpression "vc.hasAttribute('ReadPosRankSum') && ReadPosRankSum < -20.0" \ --filterName "HRUNgt5" \ --filterExpression "vc.hasAttribute('RPA') && RPA > 8" \ --filterName "RPAgt8" \ The "vc.hasAttribute('attribute')" part returns true if variable exists and false if the variable does not exists. When false the second statement is not evaluated. Good luck at troubleshoot!!!

    @everyone Also my question: What is the easiest way to select String variables (without using variantcontext ) I now use: "(vc.hasAttribute('SNPEFF_IMPACT') && vc.getAttribute('SNPEFF_IMPACT').equals('HIGH'))" This thing woks, but completely unfriendly to beginners....(Took a day or two to figure this one out....)

    Post edited by mmterpstra on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,275Administrator, GATK Developer admin

    Hi @mmterpstra, we realize this is a little rough, but we're planning to put together a cookbook of JEXL expressions to help new users get over that initial hump.

    Geraldine Van der Auwera, PhD

  • Bianca_TBianca_T StockholmPosts: 1Member

    Hi,

    I am working with a dataset from Ion Torrent (Ampliseq targeted resequencing on human samples). I have called the variants with HC (GATK 2.7.2) and would like now to apply hard filters. Seen the different kind of sequencing errors produced by Ion Torrent, how do you suggest I modify the filters? Applied as they are now, a huge number of indels are filtered out but those include my control known mutations.

    Thanks for your help.

    /Bianca

  • ebanksebanks Posts: 683GATK Developer mod

    Hi Bianca, we don't have much experience trying to get good calls out of Ion data. But I can say that from the little we've seen, we're doubtful that it's at all possible to call indels accurately from that data. If you do have success please post your findings here as we'd all love to know how it's done!

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • bwubbbwubb Posts: 47Member

    Would it be recommended to apply any hard filters to MULTIALLELIC_COMPLEX variants? (And is that GATK long form of MNP?). If so would it be treated more like an indel? Thank you.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,275Administrator, GATK Developer admin

    Hi @bwubb,

    Can you tell me where you encountered this name? I'm not familiar with it and I can't find it used anywhere in our codebase. The types of variants we work with are listed here:

    http://www.broadinstitute.org/gatk/guide/article?id=3682

    Geraldine Van der Auwera, PhD

  • bwubbbwubb Posts: 47Member

    Hi @Geraldine_VdAuwera

    Well this is in a vcf called by UG. I could provide the the settings I used, but the are more or less in line with the Updated Best Practices for 2.8-1, though the term is appears in the VariantType annotation. Here is an example (Sorry the post formatting gets weird):

    1  45787220  .  CT CTT,C   17247.46 PASS AC=20,6;AF=0.029,8.850e-03;AN=678;BaseQRankSum=-1.230;DP=59253;FS=11.224; InbreedingCoeff=-0.0093;LowMQ=0.0052,0.0093,59253;MLEAC=19,3;MLEAF=0.028,4.425e-03;MQ=59.43;MQ0=0;MQRankSum=-0.865; QD=4.76;RPA=12,13,11;RU=T;ReadPosRankSum=2.225;STR;VariantType=MULTIALLELIC_COMPLEX.Other GT:AD:DP:GQ:PL
    

    My guess would be Indel, and while this particular variant "PASSED", now based on your comment Im not certain if the the filters are actually considered. Thank you.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,275Administrator, GATK Developer admin

    Ah, found it. "MULTIALLELIC_COMPLEX" refers to a multiallelic variant that is neither a simple deletion nor a simple insertion (so it's a mix of both). I couldn't find it in the code because the complete string doesn't appear as such, it's a composite of "MULTIALLELIC_" + insertions, deletion or complex base on case.

    So yes, it's a sort of multiallelic indel. If you used a complex filtering expression, you might consider re-running it through each filter individually to make sure that it's getting evaluated properly on each property.

    Geraldine Van der Auwera, PhD

  • yeamanyeaman Posts: 12Member

    Hello,

    I was wondering if you have any specific suggestions for what kind of filtering criteria we should use when creating a new dbsnp for a non-model organism? I was assuming I would just filter based on the site quality score, but now that seems like it might be a coarse approach. I assume at this stage it is good to be very stringent? Would it be appropriate to use the values described above in the documentation for step #2?

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,275Administrator, GATK Developer admin

    Hi @yeaman,

    The values above should provide a good starting point (definitely better than just filtering on quals) but you'll need to experiment a little to find the values that are right for your data, and to produce the tradeoff between sensitivity and specificity that is right for your project. Good luck!

    Geraldine Van der Auwera, PhD

  • yeamanyeaman Posts: 12Member

    Thanks for the quick reply! I'll try that out.

  • davela3davela3 Posts: 5Member

    Hi @Geraldine_VdAuwera,

    Is there a reason why the examples have the --interval/-L argument set to "20"? I apologize if this question is a bit trivial, but what would determine whether we use the argument or not? In my case, I only used intervals during the ReAligner phase of the pipeline.

  • SheilaSheila Broad InstitutePosts: 448Member, GATK Developer, Broadie, Moderator admin
    edited May 1

    @davela3

    Hi,

    In the example, this means to look only at chromosome 20. The -L argument, although an "interval" argument, can also be used to specify a chromosome to focus on.

    If running on a whole genome, you would not specify -L. If running on an exome, you can use it to provide a list of capture targets.

    I hope this answers your question.

    -Sheila

    Post edited by Sheila on
  • davela3davela3 Posts: 5Member

    Hi,

    For other SNP filtering parameters, should we also consider the GT in conjunction with the GQ (ie. only want 0/1 or 1/1 with a GQ or 20 or more)?

    Similarly, are there are parameters we should consider for indels?

    I used HaplotypeCaller on human samples (single gene target) - but I don't seem to have HaplotypeScore in my filteredvcf files. Is this an assumed error on my part?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,275Administrator, GATK Developer admin

    @davela3,

    should we also consider the GT in conjunction with the GQ (ie. only want 0/1 or 1/1 with a GQ or 20 or more)?

    No, we do not make such a distinction, but feel free to experiment and see what works best in your data.

    are there are parameters we should consider for indels?

    Just the ones in the document above. Same comment as for SNPs: it's always a good idea to experiment and see fits what your dataset.

    HaplotypeCaller does not emit HaplotypeScore because that metric is already taken into account during the calling process.

    Geraldine Van der Auwera, PhD

  • davela3davela3 Posts: 5Member

    @Ger As usual, thank you for your response and insight!

  • noushin6noushin6 Baltimore, MDPosts: 14Member

    Hi @Geraldine_VdAuwera‌,

    I am using GATK 3.1, trying to apply the hard filter described here. I noticed that I get slightly different results using "MappingQualityRankSumTest" vs "MQRankSum" in filter definition, with the latter being more strict of a filter. My guess would be that if anything, the walker is respecting "MQRankSum" and not the other, but that is in contrast with usage suggested here. Does it recognize both?

    Thank you as always! Noushin

  • SheilaSheila Broad InstitutePosts: 448Member, GATK Developer, Broadie, Moderator admin

    @noushin

    Hi Noushin,

    Those are two different names for the same filter.

    Please tell me what the command lines you used and what is the exact difference between the two results.

    Thanks, Sheila

  • leobokileoboki Atlanta, Georgia, USAPosts: 2Member

    Hi,

    I understand that MQRankSum and ReadPosRankSum are z-approximation from Mann-Whitney Rank Sum Test. Extreme values of these statistics indicate strong biases for mapping qualities and read positions. Intuitively, to set up thresholds for hard filtering, the criteria should be two-tailed in these cases, that is, MappingQualityRankSum < -12.5 || MappingQualityRankSum > 12.5 || ReadPosRankSum < -8.0 || ReadPosRankSum > 8.0. However, in the example above, the filtering criteria include only "MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0."

    Does the expression "MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0" implicitly imply that these thresholds are two-tailed?

    Thanks, Leo

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,275Administrator, GATK Developer admin

    Hi @leoboki,

    No, we only check whether the alt bases have lower scores than those of the ref bases.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.