Undefined variable VariantFiltration

VMistry13VMistry13 Member
edited March 2013 in Ask the GATK team

Hello
I am filtering SNPs and indels for single sample targeted resequencing dataset. My command is:

my $bigfilter_snps="-filter \"QUAL<80.0\" -filterName vm_QUAL -filter \"DP<20\" -filterName vm_DP -filter \"MQ<40.0\" -filterName GATK_v3_MQ -filter \"QD<2.0\" -filterName GATK_v3_QD -filter \"MQRankSum<-12.5\" -filterName GATK_v3_MQRankSum -filter \"HRun>5\" -filterName GATK_v2_HRun ";

java -jar GenomeAnalysisTK.jar -R /software/GenomeAnalysisTK_support/human_g1k_v37.fasta -T UnifiedGenotyper -I ${sample}.sorted.realigned.recal.bam -o ${sample}.sorted.realigned.recal.snps.vcf --intervals /506amplicons.interval_list --min_base_quality_score 15 -stand_call_conf 30 --baq CALCULATE_AS_NECESSARY -glm SNP --baqGapOpenPenalty 65 --downsampling_type BY_SAMPLE --downsample_to_coverage 250 --output_mode EMIT_ALL_CONFIDENT_SITES"

I get the following error:

MQRankSum < -12.5;' undefined variable MQRankSum
WARN 11:04:58,217 Interpreter - ![0,4]: 'HRun > 5;' undefined variable HRun
WARN 11:04:58,217 Interpreter - ![0,2]: 'QD < 2.0;' undefined variable QD

also the filtered vcf file doesn't have the ref allele or quality score:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
1 158259801 . A . 31.15 vm_DP;vm_QUAL AN=2;DP=2;MQ=66.59;MQ0=0 GT:DP 0/0:2
1 158259802 . G . 34.23 vm_DP;vm_QUAL AN=2;DP=2;MQ=66.59;MQ0=0 GT:DP 0/0:2
1 158259803 . A . 37.23 vm_DP;vm_QUAL AN=2;DP=5;MQ=68.66;MQ0=0 GT:DP 0/0:5
1 158259804 . A . 40.23 vm_DP;vm_QUAL AN=2;DP=10;MQ=69.33;MQ0=0 GT:DP 0/0:10

Any help on why the filtering is not working is appreciated as I am quite new to this

Thanks

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi there, can you tell me what version you are using?

  • VMistry13VMistry13 Member
    edited March 2013

    the latest version GATk 2.3-9

    is it because MQ RankSum, HRun and QD only get output at non-ref hom sites. so hom sites won't have them?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Actually the latest version is 2.4-7, you should consider upgrading.

    No, missing annotations for some sites should just lead to evaluation as FALSE, and if the annotations are completely missing from your VCF the GATK would give you a specific error about that. The error looks more like there's a problem parsing your command line, it must be malformed somehow. You can test whether that's the case by passing the filters directly from the command lines instead of bundling them into a shell variable.

  • isp2isp2 Member

    Hello,

    I see the same problem. I'm using 2.4-9 version of GATK and my original script was:

    java -Xmx12g -jar ${GATKPATH}/GenomeAnalysisTK.jar \
    -R path_to_reference \
    -T VariantFiltration \
    -o output_f.vcf \
    --variant path_to_input.vcf \
    --clusterSize 3 \
    --clusterWindowSize 10 \
    --filterExpression "MQ < 40.00" \
    --filterName "MQ" \
    --filterExpression "QD < 2.00" \
    --filterName "QD" \
    --filterExpression "FS > 60.000" \
    --filterName "FS" \
    --filterExpression "HaplotypeScore > 13.0000" \
    --filterName "HaplotypeScore" \
    --filterExpression "MQRankSum < -12.500" \
    --filterName "MQRankSum" \
    --filterExpression "ReadPosRankSum < -8.000" \
    --filterName "ReadPosRankSum"
    

    Which gave me the following errors (note that for Chr12:27497172 there were no errors coming up):

    INFO  14:27:03,852 ProgressMeter -  Chr11:24333296        1.11e+07   31.0 m        2.8 m     91.5%        33.9 m     2.9 m 
    WARN  13:55:50,121 Interpreter - ![0,9]: 'MQRankSum < -12.500;' undefined variable MQRankSum 
    WARN  13:55:50,121 Interpreter - ![0,14]: 'ReadPosRankSum < -8.000;' undefined variable ReadPosRank
    Sum 
    WARN  13:55:50,447 Interpreter - ![0,9]: 'MQRankSum < -12.500;' undefined variable MQRankSum 
    WARN  13:55:50,448 Interpreter - ![0,14]: 'ReadPosRankSum < -8.000;' undefined variable ReadPosRank
    Sum 
    

    [...]
    INFO 13:55:52,665 ProgressMeter - Chr12:27497172 1.23e+07 16.5 m 80.7 s 100.0% 16.5 m 0.0 s
    INFO 13:55:58,224 ProgressMeter - done 1.23e+07 16.6 m 81.0 s 100.0% 16.6 m 0.0 s
    INFO 13:55:58,224 ProgressMeter - Total runtime 996.15 secs, 16.60 min, 0.28 hours
    INFO 13:55:59,116 GATKRunReport - Uploaded run statistics report to AWS S3
    ESTATUS [ VariantFiltration ]: 0

    So I tried using only:

    java -Xmx12g -jar /share/apps/gatk/2.4-9/GenomeAnalysisTK.jar \
    -R path_to_reference \
    -T VariantFiltration \
    -o output_f.vcf \
    --variant path_to_input.vcf \
    --filterExpression "MQRankSum < -12.5" \
    --filterName "MQRankSum"
    

    and I still get the same error:

    INFO  14:54:32,423 ProgressMeter -   Chr1:10706322        2.76e+05   30.0 s      109.0 s      2.9%        17.4 m    16.9 m 
    WARN  14:54:32,940 Interpreter - ![0,9]: 'MQRankSum < -12.5;' undefined variable MQRankSum 
    WARN  14:54:33,115 Interpreter - ![0,9]: 'MQRankSum < -12.5;' undefined variable MQRankSum
    

    Looking at the first few SNPs in my input vcf (which was generated using UnifiedGenotyper) I see MQRankSum values between -7 and 5, so the error can't be telling me that these values range from 0 to 9 (only if these are absolute values!). I also don't see a obvious difference between position Chr12:27497172 and other positions in the vcf file. And it doesn't seem to be a script problem because the error only occurred for MQRankSum and ReadPosRankSum in the original script.

    The output vcf is, however, generated. But I would like to have this tool working properly to be able to "play around" with the filters and select the best filtering option for my data.

    Has any of you figured out what this error means?

    Thank you,
    Ines.

  • The [0,9] part isn't actually informative - I think it means that the JEXL interpreter choked at line 0, column 9 of whatever expression was fed to it. It doesn't say anything about the data.

    Keep in mind that this filter is applied line-by-line in the VCF, not once over the whole file. So it's complaining that those values don't exist on particular lines. Which is entirely expected, since the RankSum metrics are only calculated on heterozygous sites. I would expect that the lines it complains about are homozygous for all of the samples in the file

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    What @pdexheimer said is correct.

    Or, this can also be due to multi-allelic sites.

  • isp2isp2 Member

    Working with a colleague I figured out that indeed some lines in the input vcf do not have MQRankSum:

    $ grep -v -P "^#" input.vcf | grep -c -v MQRankSum
    $ 7792

    and that these lines seem to be the ones causing the warnings:

    $ grep -c -P -i "MQRankSum" log_file
    $ 7793

    Looking at some of these positions I see that the genotypes are all homozygous for the alternative allele, for example:

    Chr1 11494869 . G C 2554.68 . AC=72;AF=1.00;AN=72;DP=149;Dels=0.00;FS=0.000;HaplotypeScore=0.0241;InbreedingCoeff=-0.1104;MLEAC=72;MLEAF=1.00;MQ=28.74;MQ0=0;QD=19.50;SNPEFF_EFFECT=INTRON;SNPEFF_FUNCTION
    AL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=mRNA;SNPEFF_GENE_NAME=conserved%20hypothetical%20protein;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=13101.m14870 GT:AD:DP:GQ:PL ./. 1/1:0,1:1:3:33,3,0 1/1:0,5:5:9:100,9,0 1/1:0,9:9:21:201,21,0 1/1:0,3:3:9:96,9,0 1/1:0,7:7:12:128,12,0 1/1:0,8:8:12:124,12,0 1/1:0,3:3:9:100,9,0 1/1:0,5:5:12:117,12,0 1/1:0,6:6:12:100,12,0 1/1:0,3:3:6:67,6,0 ./. 1/1:0,3:3:6:67,6,0 ./. 1/1:0,1:1:3:33,3,0
    ./. ./. 1/1:0,1:1:3:22,3,0 1/1:0,5:5:6:44,6,0 ./. 1/1:0,2:2:3:33,3,0 ./. ./.
    1/1:0,2:2:3:23,3,0 ./. 1/1:0,1:1:3:22,3,0 1/1:0,2:2:3:33,3,0 1/1:0,2:2:3:24,3,0 1/1:0,2:2:3:32,3,0 ./. ./. ./. 1/1:0,1:1:3:33,3,0 1/1:0,3:3:6:52,6,0 1/1:0,4:4:9:82,9,0 1/1:0,5:5:6:66,6,0 1/1:0,10:10:24:220,24,0 1/1:0,6:6:18:175,18,0 1/1:0,3:3:3:33,3,0 ./. 1/1:0,2:2:3:28,3,0 1/1:0,2:2:3:33,3,0 ./. 1/1:0,7:7:12:125,12,0 1/1:0,2:2:3:23,3,0 ./. 1/1:0,5:5:9:90,9,0 1/1:0,1:1:3:33,3,0 1/1:0,3:3:6:60,6,0 1/1:0,4:4:6:67,6,0 ./. 1/1:0,2:2:6:57,6,0

    So pdexheimer is completely right. And if you are curious to know what happens to these positions after filtering, they PASS filter if none of the other filter fails.

    Thank you for helping me figuring out what the warning is talking about!
    Ines.

  • bbimberbbimber HomeMember

    Hi - I'm hitting a similar issue with MQRankSum being undefined, and the VCF does in fact have lines without this value. What's the appropriate way to handle this in filtering? Should I write the filter like: -filter "ReadPosRankSum && ReadPosRankSum < -8.0" (i.e. test if variable is defined)? Thanks in advance.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator
  • jphekmanjphekman University of IllinoisMember

    Using VariantFiltration in GATK 3.5, getting the warning "undefined variable QD". Indeed in my vcf input there are lines without QD values. In this case, I'd expect the line to be filtered out. However, it isn't, and I'm not sure how to change my recipe to get the intended result.

    My recipe:

    -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0"

    Sample output:

    X 1517306 . C T 50.30 PASS AC=2;AF=1.00;AN=2;DP=3;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=0.00;SOR=0.693 GT:AD:DP:GQ:PGT:PID:PL [...]

    As you can see, filter FS didn't catch it (FS is less than 30 here), but I'd expect filter QD to catch it (no QD here). However, it is marked PASS. I believe a missing QD means the call is low quality and therefore I don't want it marked PASS (but I could be misunderstanding QD's meaning).

    How should I proceed?

    Thanks very much!
    Jessica

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @jphekman
    Hi Jessica,

    The WARN statements simply let you know that the annotation is missing at the site. There is no need to to worry about them, as sometimes an annotation cannot be calculated for a site. As for passing sites without the requested annotation, we do not fail sites unless the annotation values fail the filters. However, you can add this argument to fail those sites :smile:

    The reason your FS value of 0 is passing is because it is less than 30.0. Your filter is set to fail variants with an FS value of 30.0. To fail variants with FS value less than 30.0, you would have FS < 30.0. However, you probably don't want to do that! Have a look at this document for more information.

    -Sheila

  • jphekmanjphekman University of IllinoisMember

    Exactly what I needed -- thank you!

  • Will_GilksWill_Gilks University of Sussex, UKMember
    edited April 2016

    @Sheila I have a similar problem. Multiple of records with no MQRankSum or ReadPosRankSum value are passing filters.

    I see how records with missing values can be forced to fail filters by following the link previously provided
    (using --missingValuesInExpressionsShouldEvaluateAsFailing )

    However, some records without MQRankSum values have good quality. For example:
    chr3R 27495566 hc_smaro A T 146296.29 PASS AC=436;AF=1.00;AN=436;DP=3306;

    Perhaps the solution is to go into the records that didn't have MQRankSum (and ReadPosRankSum), and then extract out the good ones, and then merge them back in.

    Among other variables, I've tried to filter by allele number, like this:

       GenomeAnalysisTK -R ${refseq} \
            -T VariantFiltration \
            -V f0.${vcf} \
            --filterExpression "QD<2.0||FS>50.000||MQ<58.00||MQRankSum<-7.0||ReadPosRankSum<-5.0||DP>15000||AN<400" \
            --filterName "locusFilter" \
                -o marked.f0.${vcf}   
    
        GenomeAnalysisTK -R ${refseq} \
            -T SelectVariants \
            -V marked.f0.${vcf} \
            --excludeFiltered \
                -o f1.${vcf}
    

    This is one of the records from the resulting vcf:
    chrX 22911216 hc_nsode G A 69.49 PASS AC=4;AF=0.087;AN=46;DP=30;EFF=INTERGENIC(MODIFIER||||||||||A);FS=0.000;InbreedingCoeff=0.1074;MLEAC=3;MLEAF=0.065;MQ=94.00;QD=34.74;SOR=2.833;VRT=1;VariantType=SNP GT:AD:DP:GQ:PL 0/0:1,0:1:3:0,3,38
    ... which has an AN of 46 (which is less than the specified 400).

    I found this example when looking through records with no MQRankSum, so perhaps the fact that there is no MQRankSum value means that all filtering is disabled for that record, or perhaps there is another reason.

    Also is there any way in which I can invert SelectVariants - so I get get a file of all the variants I have removed ?

    Please help. I love posting errors up on the GATK forum but my grandchildren are growing beards.

    Issue · Github
    by Sheila

    Issue Number
    831
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • There used to be a doc about this somewhere, not sure if it's still around. But your interpretation is correct - if one of the clauses of a complex JEXL statement fails with an error, the record is considered to pass. The solution is to break up your compound test into multiple simple tests - --filterExpression "QD<2.0" --filterName QD --filterExpression "FS>50.0" --filterName FS ... This will then add filters for each subexpression that actually fails, and any subexpressions that error out will simply be ignored

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @Will_Gilks It sounds like your problem is that you're trying to filter on all annotations using a single filtering expression. I would recommend breaking it down into separate expressions (each in a separate filter) so that you can specify different behaviors for filters on annotations that are expected to be present in all records, vs those that may be absent but whose absence shouldn't be considered a dealbreaker. You can specify as many separate filters (each with their own name and expression) within the same command.

    And yes SelectVariants (in the latest version) has an argument to invert the selection criteria, have a look at the tool doc.

    Hopefully that will allow you to get done what you need so you can spend the rest of your time dandling your grandkids on your knee before they're so big that they crush your fragile old bones ;)

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Great minds think alike, @pdexheimer :)

  • @Geraldine_VdAuwera - and apparently spend time on forums while their colleagues are at lunch...

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Oh hah, well some just got back from the UK and were too jetlagged to realize it's lunch time...

  • RB_74RB_74 Member

    Hi Sheila,

    I encountered same problem.
    I have many like with first warning and few with second one.

    ${JAVA} -jar ${BIN_PATH}/GenomeAnalysisTK.jar -T VariantFiltration -R ${GENOME_FILE} -V ${RAWSNPS_OUTPUT} --filterExpression 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 4.0' --filterName "basic_snp_filter" -o ${FILTSNPS_OUTPUT}

    WARN 14:59:11,593 Interpreter - ![38,47]: 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 4.0;' undefined variable MQRankSum
    and
    WARN 15:43:21,737 Interpreter - ![59,73]: 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 4.0;' undefined variable ReadPosRankSum

    Do you think its also similar problem like @isp2?

    Thank you for helping me out!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @RB_74
    Hi,

    Yes, it is the same problem :smiley: You have nothing to worry about. @pdexheimer's response is correct.

    -Sheila

  • @Sheila : Surprisingly I am facing the same problem but in different way. When I use a combined expression for filters than there is a warning for each line of vcf file that undefined variable GQ but when I use individual filter expression for each filter than I got many warning for other variables also (while they were present in case of combined filter expression input). I also tried giving a separate expression for GQ and rest in a combined expression then again so many undefined variable warning is coming. And I have checked the vcf file, all of them are present because I have annotated the vcf file before filtration. I need your suggestions for this problem. Thanks.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @vivekruhela
    Hi,

    Can you please post the version of GATK you are using and some example WARN statements?

    Thanks,
    Sheila

  • @Sheila : Thanks for reply. Currently I am using The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836. My whole pipeline is working perfectly with this version till now. And here are few warning statement from log file.

    WARN 16:31:40,503 Interpreter - ![22,24]: 'QD < 2.0 || DP < 1 || GQ < 30.00 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || SB > -10 || ReadPosRankSum < -8.0;' undefined variable GQ

    WARN 16:31:40,504 Interpreter - ![22,24]: 'QD < 2.0 || DP < 1 || GQ < 30.00 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || SB > -10 || ReadPosRankSum < -8.0;' undefined variable GQ

    WARN 16:31:40,505 Interpreter - ![22,24]: 'QD < 2.0 || DP < 1 || GQ < 30.00 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || SB > -10 || ReadPosRankSum < -8.0;' undefined variable GQ

    WARN 16:31:40,506 Interpreter - ![22,24]: 'QD < 2.0 || DP < 1 || GQ < 30.00 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || SB > -10 || ReadPosRankSum < -8.0;' undefined variable GQ

    WARN 16:31:40,507 Interpreter - ![22,24]: 'QD < 2.0 || DP < 1 || GQ < 30.00 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || SB > -10 || ReadPosRankSum < -8.0;' undefined variable GQ

    WARN 16:31:40,508 Interpreter - ![22,24]: 'QD < 2.0 || DP < 1 || GQ < 30.00 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || SB > -10 || ReadPosRankSum < -8.0;' undefined variable GQ

    When I use combined expression then there is same warning for each line shown above. When I use separate expression for each filter then I got following warnings :

    WARN 04:28:04,244 Interpreter - ![0,2]: 'SB > -10;' undefined variable SB

    WARN 04:28:04,245 Interpreter - ![0,9]: 'MQRankSum < -12.5;' undefined variable MQRankSum

    WARN 04:28:04,245 Interpreter - ![0,2]: 'GQ < 30.00;' undefined variable GQ

    WARN 04:28:04,245 Interpreter - ![0,14]: 'ReadPosRankSum < -8.0;' undefined variable ReadPosRankSum

    WARN 04:28:04,246 Interpreter - ![0,2]: 'SB > -10;' undefined variable SB

    WARN 04:28:04,246 Interpreter - ![0,9]: 'MQRankSum < -12.5;' undefined variable MQRankSum

    WARN 04:28:04,246 Interpreter - ![0,2]: 'GQ < 30.00;' undefined variable GQ

    WARN 04:28:04,247 Interpreter - ![0,14]: 'ReadPosRankSum < -8.0;' undefined variable ReadPosRankSum

    WARN 04:28:04,247 Interpreter - ![0,2]: 'SB > -10;' undefined variable SB

    WARN 04:28:04,248 Interpreter - ![0,9]: 'MQRankSum < -12.5;' undefined variable MQRankSum

    WARN 04:28:04,248 Interpreter - ![0,2]: 'GQ < 30.00;' undefined variable GQ

    WARN 04:28:04,248 Interpreter - ![0,14]: 'ReadPosRankSum < -8.0;' undefined variable ReadPosRankSum

    WARN 04:28:04,248 Interpreter - ![0,2]: 'SB > -10;' undefined variable SB

    WARN 04:28:04,249 Interpreter - ![0,2]: 'GQ < 30.00;' undefined variable GQ

    Thanks

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @vivekruhela
    Hi,

    Those are nothing to worry about. The tool outputs only one annotation that is not present, so when you run altogether, it picks the first one that is not present. When you run separately, it only outputs the annotation in the filter that is not present. Sometimes annotations cannot be calculated for a site; for example the RankSumTest annotations need a mix of ref and alt alleles which are not present at hom-var sites.

    I hope this makes sense.

    -Sheila

  • vivekruhelavivekruhela Member
    edited April 5

    @Sheila : Thanks for reply. I agree with you that some annotations are canot be calculated for some sites. But,here, annotations like SB, GQ are very common and I think these two can be calculated for each site. I found majority of warnings are for GQ and SB. Is something wrong with gVCF file.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @vivekruhela
    Hi,

    To confirm, you mean VCF, not GVCF? You should not be running any analysis on your GVCF, as it is meant to be an intermediate file not used in analysis.

    If you are running on a VCF and not a GVCF, I have some questions.

    1) What do you mean by "majority"? Do you have an idea of exactly how many sites are missing the GQ field? Can you post some example records?

    2) For SB, there should be at least four values in the field, and you are requesting only 1. I am surprised the tool runs with no error. Have a look at this thread.

    -Sheila

Sign In or Register to comment.