Possible bug in variant filtering?

I was pulling my hair out over this one.

I was applying a hard filter to a genotyped gVCF, using JEXL to access variant context attributes to decide which filter setting to apply. The filter was:

    vc.getGenotype('%sample%').isHomRef() ? vc.getGenotype('%sample%').getAD().size() == 1 ? DP < 10 : ( DP - MQ0 ) < 10 or ( MQ0 / (1.0 * DP) ) >= 0.1 or MQRankSum >= 3.2905 or ReadPosRankSum >= 3.2905 or BaseQRankSum >= 2.81 : false

In pseudocode it says:

    if ( isHomRef ) then
        if ( getAD().size() == 1 ) then
            DP < 10
        else
            ( DP - MQ0 ) < 10 or ( MQ0 / (1.0 * DP) ) >= 0.1 or MQRankSum >= 3.2905 or ReadPosRankSum >= 3.2905 or BaseQRankSum >= 2.81
    else
        ignore record
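
For reference, the filter is applied with VariantFiltration roughly like this (a sketch; file names are placeholders, and %sample% is substituted with each sample name before the run):

    java -jar GenomeAnalysisTK.jar \
        -T VariantFiltration \
        -R reference.fasta \
        -V sample.genotyped.g.vcf \
        --filterName "BAD_HOMREF" \
        --filterExpression "vc.getGenotype('%sample%').isHomRef() ? ... : false" \
        -o sample.filtered.vcf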

The idea is that for records where not all reads contain the reference allele, we filter out those positions where there is evidence that the reads supporting an alternate allele are of significantly better quality. However, running this filter I keep getting the following warning (snipped for clarity):

    WARN [SNIP]... MQRankSum >= 3.2905 [SNIP]... : false;' undefined variable MQRankSum

So I thought the filter was failing. However, just as a test, I changed the direction of the MQRankSum comparison from >= 3.2905 to <= 3.2905 (a bit nonsensical, since it should apply the filter to almost all HomRef positions that have any reads supporting an alternate allele).

I still get the warning, but I found the filter was applied to variant records as it should be. For example, the following went from PASS to BAD_HOMREF:

    Supercontig_1.1 87      .       C       A       .       BAD_HOMREF      BaseQRankSum=2.79;DP=40;MQ=39.95;MQ0=0;MQRankSum=-2.710e+00;ReadPosRankSum=0.819;VariantType=SNP       GT:AB:AD:DP     0/0:0.870:34,5:39

So the filter is being correctly applied, but I am not sure why all the warnings are being generated. Is this a bug? Have I done something wrong?

Comments

  • pdexheimer

    No, this is expected, if a bit annoying. The RankSum statistics aren't calculated for all variants, and the entire expression is forced to some value (I can't remember if it's true or false) when it references a missing annotation, so your console gets spammed with warnings.

    A handy trick to bypass the warnings is to check that the annotation exists first, so your expression would include ... or (vc.hasAttribute('MQRankSum') && MQRankSum <= 3.2905) or ... (thanks @mmterpstra).
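
    Applied to your expression, each potentially missing RankSum annotation would get its own guard, so the tail of the filter might become something like this (same thresholds and directions as the filter you ran):

        (vc.hasAttribute('MQRankSum') && MQRankSum <= 3.2905) or
        (vc.hasAttribute('ReadPosRankSum') && ReadPosRankSum >= 3.2905) or
        (vc.hasAttribute('BaseQRankSum') && BaseQRankSum >= 2.81)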

  • simono101 (London, UK; Member)
    edited January 2015

    @pdexheimer That is where the vc.getGenotype('%sample%').getAD().size() == 1 part of the expression comes in.

    As far as I can tell, HomRef positions in a gVCF from HaplotypeCaller always contain only a single number in the genotype AD field when all reads at that position carry the reference allele. When this occurs we cannot, as you say, calculate the RankSum stats, so I filter using only DP < 10. The more expressive filter is only evaluated when some reads are non-reference and the AD field therefore has more than a single value (we achieve this using the ternary operator, i.e. (test condition) ? (do this if condition true) : (do this if condition false)).

    Here's an example. Positions 103-106 should be evaluated by DP < 10 and position 107 by the more elaborate filter:

    103     .       A       .       .       .       DP=42;VariantType=NO_VARIATION  GT:AD:DP        0/0:42:42
    104     .       T       .       .       .       DP=42;VariantType=NO_VARIATION  GT:AD:DP        0/0:42:42
    105     .       A       .       .       .       DP=40;VariantType=NO_VARIATION  GT:AD:DP        0/0:40:40
    106     .       T       .       .       .       DP=40;VariantType=NO_VARIATION  GT:AD:DP        0/0:40:40
    107     .       G       A       .       .       DP=40;MQRankSum=-0.271;{SNIP}   GT:AB:AD:DP     0/0:0.870:34,5:39
    

    And this is what I see in the filtered file:

    8       .       T       .       .       BAD_HOMREF  DP=8;VariantType=NO_VARIATION   GT:AD:DP        0/0:8:8
    9       .       G       .       .       BAD_HOMREF  DP=8;VariantType=NO_VARIATION   GT:AD:DP        0/0:8:8
    10      .       A       .       .       PASS        DP=10;VariantType=NO_VARIATION  GT:AD:DP        0/0:10:10
    11      .       T       .       .       PASS        DP=11;VariantType=NO_VARIATION  GT:AD:DP        0/0:11:11
    ...
    86      .       C       .       .       PASS        DP=40;VariantType=NO_VARIATION  GT:AD:DP        0/0:40:40
    87      .       C       A       .       BAD_HOMREF  DP=40;MQRankSum=-0.271;{SNIP}   GT:AB:AD:DP  0/0:0.870:34,5:39
    88      .       A       .       .       PASS        DP=42;VariantType=NO_VARIATION  GT:AD:DP        0/0:42:42
    

    UPDATE....(!)

    I've just found records where, despite all reads containing the reference allele, the AD field is a size-two array and the mapping quality stats are calculated (but not the RankSum stats), e.g.:

    9085    .       A       .       .       .       DP=60;VariantType=NO_VARIATION  GT:AD:DP        0/0:60:60
    9086    .       G       .       .       .       DP=59;VariantType=NO_VARIATION  GT:AD:DP        0/0:59:59
    9087    .       C       T       .       .       DP=70;MQ=57.65;MQ0=0;VariantType=SNP    GT:AD:DP        0/0:59,0:59
    

    It looks like this occurs when some reads do not pass the various read filters, so the genotype-level DP is less than the raw (INFO) DP. Given this, the solution seems to be the one @pdexheimer suggests.

    Thanks!

  • pdexheimer

    Oh, I completely missed that this was on a gVCF. Why are you filtering a gVCF rather than the final, genotyped VCF?

    In general, getAD().size() isn't sufficient to tell whether alternate reads are present. It's perfectly legal and common to have an AD like 40,0, for instance. This might not be an issue in gVCFs.
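
    As an aside, if you wanted to test for alt-supporting reads directly rather than via the AD array length, something like this might work (an untested sketch; it assumes the JEXL version GATK bundles allows indexing into the array returned by getAD()):

        vc.getGenotype('%sample%').getAD().size() > 1 && vc.getGenotype('%sample%').getAD()[1] > 0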

    You have a point: the JEXL compiler clearly has some form of short-circuit evaluation, since the hasAttribute trick works. But something is clearly not working properly. I can think of several possibilities:
    1. Your assumption that getAD().size() > 1 is a good test for the presence of MQRankSum doesn't hold
    2. The JEXL compiler doesn't properly short circuit around the ternary operator
    3. You're running into order of operations issues. The fact that it can compile the statement at all probably rules this out (the obvious possibilities I can think of would all get tripped up by that dangling ": false" on the end), but it's always something to consider with statements that complex.

    You could probably narrow these down by finding a VCF subset that triggers the warning and doing a series of runs with modifications to the expression. But I would probably run your cohort through GenotypeGVCFs first and work on the final data.
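
    That workflow would look roughly like this (a sketch; file names are placeholders):

        # Joint genotyping across the whole cohort first
        java -jar GenomeAnalysisTK.jar \
            -T GenotypeGVCFs \
            -R reference.fasta \
            --variant sample1.g.vcf \
            --variant sample2.g.vcf \
            -o cohort.vcf

        # Then hard-filter the final, jointly genotyped VCF
        java -jar GenomeAnalysisTK.jar \
            -T VariantFiltration \
            -R reference.fasta \
            -V cohort.vcf \
            --filterName "BAD_HOMREF" \
            --filterExpression "..." \
            -o cohort.filtered.vcf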

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    You really shouldn't ever be filtering the GVCFs.

  • simono101 (London, UK; Member)

    @pdexheimer @Geraldine_VdAuwera I am filtering a genotyped gVCF. The reason I am filtering a gVCF rather than a VCF is that I want to use sites that are equally covered in all of my samples. I don't want to assume that sites not called variant in one sample are reference, which is why I am also filtering reference positions.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    @simono101, that's not how the tools are designed to work. You should first genotype all the samples jointly. Then afterward you can apply whatever filtering you want, including filtering on whether some samples have lower coverage than what you consider usable. Doing it any other way is setting yourself up for trouble.

  • simono101 (London, UK; Member)
    edited January 2015

    @Geraldine_VdAuwera One of the issues I have with that is that I don't know how best to define my cohorts. I have many (121) samples from locations around the world, all of a single species. Some samples are from the same geographic area at successive time points; others are from geographically isolated areas and are of deeply diverged lineages.

    I was thinking that I would be able to use individual-sample genotyping and variant filtering because my samples have relatively high average coverage (100-300X). Is this still a bad idea? Would it still be best to lump all these samples together and perform joint genotyping?

    Thanks for the continued discussion!

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    @simono101 I understand your concerns. It's difficult to answer categorically because this is a very different situation from what we validate the tools for, but I would still say your best bet is to lump everything together and follow the recommended workflow. I think it's the least bad way to proceed, in the sense that it has drawbacks, but fewer than any other approach. At the very least, you should jointly genotype any samples that you intend to compare to each other in any way.
