VariantFiltration not filtering correctly

guidebortoliguidebortoli BrazilMember
edited December 2017 in Ask the GATK team

Hi,
I'm running the following command to hard-filtering some variants:

gatk -R /Users/debortoli/Doutorado/hg19/hg19.fa \ -T VariantFiltration -V vcf_no_indels.recode.vcf \ --filterExpression "ReadPosRankSum < -5.0 || MQRankSum < -4.0 || MQ < 40.0 || QD < 2.0 || FS > 60.0 || SOR > 2.0" \ --filterName "FAIL" \ -o vcf_no_indels_hard_filtered_test.vcf

The output show some strange things like:

chr15 28228924 . G A 68.73 PASS AC=2;AF=3.236e-03;AN=618;DP=3360;ExcessHet=0.0106;FS=0.000;InbreedingCoeff=0.0278;MLEAC=1;MLEAF=1.618e-03;MQ=60.00;QD=22.91;SOR=2.833 GT:AD:DP:GQ:PL 0/0:22,0:22:66:0,66,768 0/0:35,0:35:99:0,105,1186 0/0:11,0:11:33:0,33,378
chr15 28419695 rs149592795 T C 339531 MQRankSum AC=133;AF=0.196;AN=680;BaseQRankSum=-8.510e-01;ClippingRankSum=0.036;DP=154411;ExcessHet=73.3363;FS=0.623;InbreedingCoeff=-0.2431;MLEAC=133;MLEAF=0.196;MQ=59.11;MQRankSum=-9.913e+00;QD=3.03;ReadPosRankSum=3.09;SOR=0.610 GT:AD:DP:GQ:PGT:PID:PL 0/0:89,0:89:99:.:.:0,120,1800 0/0:1055,0:1055:99:.:.:0,120,1800 0/0:338,0:338:99:.:.:0,120,1800

I'm wondering why they are passing the filter when they shouldn't....there are more examples along the vcf that also pass one of the filters when they shouldn't...

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @guidebortoli
    Hi,

    Hmm. Are you using the latest version of GATK? Can you try running only the SOR filter and checking if that site is filtered?

    Thanks,
    Sheila

  • obigriffithobigriffith McDonnell Genome InstituteMember
    edited October 19

    I'm seeing something similar. I'm running VariantFiltration on a single sample VCF:

    gatk VariantFiltration -R ~/data/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa -V ~/data/germline_variants/WGS_Norm_HC_calls.snps.vcf --filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 3.0" --filter-name "basic_snp_filter" -O ~/data/germline_variants/WGS_Norm_HC_calls.snps.filtered.vcf

    But, I get lots of variants still marked as PASS but with SOR values > 3.0. The other 5 filters in the above expression are working as expected. Here are some (of many) examples:
    chr6 54228213 . A G 412.77 PASS AC=2;AF=1.00;AN=2;DP=12;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=52.55;QD=28.71;SOR=3.442 GT:AD:DP:GQ:PL 1/1:0,11:11:33:441,33,0
    chr6 54318686 . T G 527.77 PASS AC=2;AF=1.00;AN=2;DP=16;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=32.99;SOR=3.258 GT:AD:DP:GQ:PL 1/1:0,16:16:48:556,48,0
    chr6 54604934 . C A 400.77 PASS AC=2;AF=1.00;AN=2;DP=10;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=58.11;QD=32.87;SOR=4.804 GT:AD:DP:GQ:PL 1/1:0,10:10:30:429,30,0

    If I run the command with ONLY the SOR filter (--filter-expression "SOR > 3.0"), as recommended above, then it works as expected. But, it consistently does not work as part of a multi-filter expression.

    NOTE: I see exactly the same behavior with DP (site level filtering). Works on its own but not when included in a multi-filter expression.

    I am using the latest release of gatk (v4.0.10.1).

    Issue · Github
    by shlee

    Issue Number
    3164
    State
    open
    Last Updated
    Assignee
    Array
  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @obigriffith,

    Bhanu asked I look into your question. Thanks for your patience. I'll get back to you tomorrow.

  • julleejullee Member
    edited October 24

    I have the same problem with the VariantFiltration tool. It seems like everything is given a PASS but sites that fail have the filter name inserted in the info column in the output vcf. For example, here I've filtered (genotype) to DP >14 with the first few lines of the output vcf looking like:

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ANN0802
    1 1       .       A       <NON_REF>       .       PASS    .       GT:AD:DP:GQ:PL  0:13,0:13:99:0,465
    1 2       .       T       <NON_REF>       .       PASS    .       GT:AD:DP:GQ:PL  0:13,0:13:99:0,457
    1 3       .       C       <NON_REF>       .       PASS    .       GT:AD:DP:GQ:PL  0:14,0:14:99:0,563
    1 4       .       C       <NON_REF>       .       PASS    .       GT:AD:DP:FT:GQ:PL       0:15,0:15:DP_filter:99:0,569
    1 5       .       A       <NON_REF>       .       PASS    .       GT:AD:DP:FT:GQ:PL       0:15,0:15:DP_filter:99:0,541
    1 6       .       T       <NON_REF>       .       PASS    .       GT:AD:DP:FT:GQ:PL       0:15,0:15:DP_filter:99:0,564
    

    Clearly the first three sites fail the filter but the only indication of this is that "DP_filter" is not inserted in their format column

  • shleeshlee CambridgeMember, Broadie, Moderator admin
    edited October 25

    Hi @obigriffith,

    I am using GATK v4.0.11.0 and I also see what you are seeing. I've been taking an Android App development course since January (in my free time of course), and I've learned that with multiple expressions, sometimes the Java programming language needs help in parsing expressions. That is, we need to help the tool demarcate where an expression begins and ends.

    1. no filtering expected works as expected (but this is misleading)

    --filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRandSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 3.0"
    

    2. should be filtered based on SOR (at 0.608) but is not

    --filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRandSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 0.5" 
    

    3. Using parentheses around each expression allows SOR (and presumably other expressions) to be read correctly

    --filter-expression "(QD < 2.0) || (FS > 60.0) || (MQ < 40.0) || (MQRankSum < -12.5) || (ReadPosRankSum < -8.0) || (SOR > 0.5)"
    

    This will allow the tool to read the SOR expression unambiguously. Here are results from my testing:

    4. Providing each expression as a separate parameter also allows SOR (and others) to be read correctly and also provides additional insight
    Separate out each condition into individual filter expressions:

    --filter-expression "QD < 2.0" --filter-name "QDlessthan2" --filter-expression "FS > 60.0" --filter-name "FSgreaterthan60" --filter-expression "MQ < 90.0" --filter-name "MQlessthan90" --filter-expression "MQRankSum < -12.5" --filter-name "MQRankSumlessthannegative12.5" --filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSumlessthannegative8" --filter-expression "SOR > 0.5" --filter-name "SORgreaterthan0.5"
    

    This gives you additional resolution into what was the condition that triggered the filtering. Here are the results from my testing:

    So be sure to either use parentheses around each expression or to express conditions independently.

    Issue · Github
    by shlee

    Issue Number
    5362
    State
    open
    Last Updated
    Assignee
    Array
  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @jullee, I'll need to see the command you are using. Please post. Thanks.

  • julleejullee Member

    @shlee said:
    Hi @jullee, I'll need to see the command you are using. Please post. Thanks.

    My command is on a gvcf. Perhaps that is the problem?

    /Linux/GATK4/gatk --java-options "-Xmx4g" VariantFiltration -R organellar_ref.fasta -V output.0802snps.g.vcf -O filtered.output.vcf.gz --genotype-filter-expression "DP >14" --genotype-filter-name "DP_filter"
    
  • shleeshlee CambridgeMember, Broadie, Moderator admin
    edited October 30

    Hi @jullee,

    Your command looks fine and should filter at the genotype level. GVCF is a valid VCF and VariantFiltration should accept it. My testing shows that GATK v4.0.11.0 VariantFiltration works just fine.

    gatk40110 VariantFiltration -V mother.g.vcf -O mother_variantfiltration.vcf --genotype-filter-expression "DP > 17" --genotype-filter-name "DPgreaterthan17"
    

    gives

    So I wonder what is going on on your end. Can you tell us which version of GATK and Java you are using? Thanks.

    P.S. Tutorial#12340 goes over filtering at the genotype level.

  • obigriffithobigriffith McDonnell Genome InstituteMember

    Hi @shlee. Thanks for your help. I have switched to using separate filter expressions and putting each in parentheses. As I think someone pointed out in the git repo it would probably be a good idea to clarify this in docs and tutorials wherever possible. There are many widely accessed usage examples that are currently done as I originally started and it is not intuitive that this would not work.

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @obigriffith, yes, that was me making a suggestion in https://github.com/broadinstitute/gatk/issues/5362. I'm waiting to hear back from those in charge which course of action they want to take. Thanks for checking back.

  • sp580sp580 GermanyMember

    Maybe too simple-minded, but the fact that all entries are labeled as PASS could be because the filtering expression is written as "threshold1 OR threshold2 OR threshold3", so if a site passes at least one of those thresholds, then it is labeled as PASS.

    In which case a solution would be to write the filtering expression as "threshold1 AND threshold2 AND threshold3".

Sign In or Register to comment.