To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

Undefined variable VariantFiltration

VMistry13VMistry13 Member
edited March 2013 in Ask the GATK team

Hello
I am filtering SNPs and indels for single sample targeted resequencing dataset. My command is:

my $bigfilter_snps="-filter \"QUAL<80.0\" -filterName vm_QUAL -filter \"DP<20\" -filterName vm_DP -filter \"MQ<40.0\" -filterName GATK_v3_MQ -filter \"QD<2.0\" -filterName GATK_v3_QD -filter \"MQRankSum<-12.5\" -filterName GATK_v3_MQRankSum -filter \"HRun>5\" -filterName GATK_v2_HRun ";

java -jar GenomeAnalysisTK.jar -R /software/GenomeAnalysisTK_support/human_g1k_v37.fasta -T UnifiedGenotyper -I ${sample}.sorted.realigned.recal.bam -o ${sample}.sorted.realigned.recal.snps.vcf --intervals /506amplicons.interval_list --min_base_quality_score 15 -stand_call_conf 30 --baq CALCULATE_AS_NECESSARY -glm SNP --baqGapOpenPenalty 65 --downsampling_type BY_SAMPLE --downsample_to_coverage 250 --output_mode EMIT_ALL_CONFIDENT_SITES"

I get the following error:

MQRankSum < -12.5;' undefined variable MQRankSum
WARN 11:04:58,217 Interpreter - ![0,4]: 'HRun > 5;' undefined variable HRun
WARN 11:04:58,217 Interpreter - ![0,2]: 'QD < 2.0;' undefined variable QD

also the filtered vcf file doesn't have the ref allele or quality score:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
1 158259801 . A . 31.15 vm_DP;vm_QUAL AN=2;DP=2;MQ=66.59;MQ0=0 GT:DP 0/0:2
1 158259802 . G . 34.23 vm_DP;vm_QUAL AN=2;DP=2;MQ=66.59;MQ0=0 GT:DP 0/0:2
1 158259803 . A . 37.23 vm_DP;vm_QUAL AN=2;DP=5;MQ=68.66;MQ0=0 GT:DP 0/0:5
1 158259804 . A . 40.23 vm_DP;vm_QUAL AN=2;DP=10;MQ=69.33;MQ0=0 GT:DP 0/0:10

Any help on why the filtering is not working is appreciated as I am quite new to this

Thanks

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi there, can you tell me what version you are using?

  • VMistry13VMistry13 Member
    edited March 2013

    the latest version GATk 2.3-9

    is it because MQ RankSum, HRun and QD only get output at non-ref hom sites. so hom sites won't have them?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Actually the latest version is 2.4-7, you should consider upgrading.

    No, missing annotations for some sites should just lead to evaluation as FALSE, and if the annotations are completely missing from your VCF the GATK would give you a specific error about that. The error looks more like there's a problem parsing your command line, it must be malformed somehow. You can test whether that's the case by passing the filters directly from the command lines instead of bundling them into a shell variable.

  • isp2isp2 Member

    Hello,

    I see the same problem. I'm using 2.4-9 version of GATK and my original script was:

    java -Xmx12g -jar ${GATKPATH}/GenomeAnalysisTK.jar \
    -R path_to_reference \
    -T VariantFiltration \
    -o output_f.vcf \
    --variant path_to_input.vcf \
    --clusterSize 3 \
    --clusterWindowSize 10 \
    --filterExpression "MQ < 40.00" \
    --filterName "MQ" \
    --filterExpression "QD < 2.00" \
    --filterName "QD" \
    --filterExpression "FS > 60.000" \
    --filterName "FS" \
    --filterExpression "HaplotypeScore > 13.0000" \
    --filterName "HaplotypeScore" \
    --filterExpression "MQRankSum < -12.500" \
    --filterName "MQRankSum" \
    --filterExpression "ReadPosRankSum < -8.000" \
    --filterName "ReadPosRankSum"
    

    Which gave me the following errors (note that for Chr12:27497172 there were no errors coming up):

    INFO  14:27:03,852 ProgressMeter -  Chr11:24333296        1.11e+07   31.0 m        2.8 m     91.5%        33.9 m     2.9 m 
    WARN  13:55:50,121 Interpreter - ![0,9]: 'MQRankSum < -12.500;' undefined variable MQRankSum 
    WARN  13:55:50,121 Interpreter - ![0,14]: 'ReadPosRankSum < -8.000;' undefined variable ReadPosRank
    Sum 
    WARN  13:55:50,447 Interpreter - ![0,9]: 'MQRankSum < -12.500;' undefined variable MQRankSum 
    WARN  13:55:50,448 Interpreter - ![0,14]: 'ReadPosRankSum < -8.000;' undefined variable ReadPosRank
    Sum 
    

    [...]
    INFO 13:55:52,665 ProgressMeter - Chr12:27497172 1.23e+07 16.5 m 80.7 s 100.0% 16.5 m 0.0 s
    INFO 13:55:58,224 ProgressMeter - done 1.23e+07 16.6 m 81.0 s 100.0% 16.6 m 0.0 s
    INFO 13:55:58,224 ProgressMeter - Total runtime 996.15 secs, 16.60 min, 0.28 hours
    INFO 13:55:59,116 GATKRunReport - Uploaded run statistics report to AWS S3
    ESTATUS [ VariantFiltration ]: 0

    So I tried using only:

    java -Xmx12g -jar /share/apps/gatk/2.4-9/GenomeAnalysisTK.jar \
    -R path_to_reference \
    -T VariantFiltration \
    -o output_f.vcf \
    --variant path_to_input.vcf \
    --filterExpression "MQRankSum < -12.5" \
    --filterName "MQRankSum"
    

    and I still get the same error:

    INFO  14:54:32,423 ProgressMeter -   Chr1:10706322        2.76e+05   30.0 s      109.0 s      2.9%        17.4 m    16.9 m 
    WARN  14:54:32,940 Interpreter - ![0,9]: 'MQRankSum < -12.5;' undefined variable MQRankSum 
    WARN  14:54:33,115 Interpreter - ![0,9]: 'MQRankSum < -12.5;' undefined variable MQRankSum
    

    Looking at the first few SNPs in my input vcf (which was generated using UnifiedGenotyper) I see MQRankSum values between -7 and 5, so the error can't be telling me that these values range from 0 to 9 (only if these are absolute values!). I also don't see a obvious difference between position Chr12:27497172 and other positions in the vcf file. And it doesn't seem to be a script problem because the error only occurred for MQRankSum and ReadPosRankSum in the original script.

    The output vcf is, however, generated. But I would like to have this tool working properly to be able to "play around" with the filters and select the best filtering option for my data.

    Has any of you figured out what this error means?

    Thank you,
    Ines.

  • The [0,9] part isn't actually informative - I think it means that the JEXL interpreter choked at line 0, column 9 of whatever expression was fed to it. It doesn't say anything about the data.

    Keep in mind that this filter is applied line-by-line in the VCF, not once over the whole file. So it's complaining that those values don't exist on particular lines. Which is entirely expected, since the RankSum metrics are only calculated on heterozygous sites. I would expect that the lines it complains about are homozygous for all of the samples in the file

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    What @pdexheimer said is correct.

    Or, this can also be due to multi-allelic sites.

  • isp2isp2 Member

    Working with a colleague I figured out that indeed some lines in the input vcf do not have MQRankSum:

    $ grep -v -P "^#" input.vcf | grep -c -v MQRankSum
    $ 7792

    and that these lines seem to be the ones causing the warnings:

    $ grep -c -P -i "MQRankSum" log_file
    $ 7793

    Looking at some of these positions I see that the genotypes are all homozygous for the alternative allele, for example:

    Chr1 11494869 . G C 2554.68 . AC=72;AF=1.00;AN=72;DP=149;Dels=0.00;FS=0.000;HaplotypeScore=0.0241;InbreedingCoeff=-0.1104;MLEAC=72;MLEAF=1.00;MQ=28.74;MQ0=0;QD=19.50;SNPEFF_EFFECT=INTRON;SNPEFF_FUNCTION
    AL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=mRNA;SNPEFF_GENE_NAME=conserved%20hypothetical%20protein;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=13101.m14870 GT:AD:DP:GQ:PL ./. 1/1:0,1:1:3:33,3,0 1/1:0,5:5:9:100,9,0 1/1:0,9:9:21:201,21,0 1/1:0,3:3:9:96,9,0 1/1:0,7:7:12:128,12,0 1/1:0,8:8:12:124,12,0 1/1:0,3:3:9:100,9,0 1/1:0,5:5:12:117,12,0 1/1:0,6:6:12:100,12,0 1/1:0,3:3:6:67,6,0 ./. 1/1:0,3:3:6:67,6,0 ./. 1/1:0,1:1:3:33,3,0
    ./. ./. 1/1:0,1:1:3:22,3,0 1/1:0,5:5:6:44,6,0 ./. 1/1:0,2:2:3:33,3,0 ./. ./.
    1/1:0,2:2:3:23,3,0 ./. 1/1:0,1:1:3:22,3,0 1/1:0,2:2:3:33,3,0 1/1:0,2:2:3:24,3,0 1/1:0,2:2:3:32,3,0 ./. ./. ./. 1/1:0,1:1:3:33,3,0 1/1:0,3:3:6:52,6,0 1/1:0,4:4:9:82,9,0 1/1:0,5:5:6:66,6,0 1/1:0,10:10:24:220,24,0 1/1:0,6:6:18:175,18,0 1/1:0,3:3:3:33,3,0 ./. 1/1:0,2:2:3:28,3,0 1/1:0,2:2:3:33,3,0 ./. 1/1:0,7:7:12:125,12,0 1/1:0,2:2:3:23,3,0 ./. 1/1:0,5:5:9:90,9,0 1/1:0,1:1:3:33,3,0 1/1:0,3:3:6:60,6,0 1/1:0,4:4:6:67,6,0 ./. 1/1:0,2:2:6:57,6,0

    So pdexheimer is completely right. And if you are curious to know what happens to these positions after filtering, they PASS filter if none of the other filter fails.

    Thank you for helping me figuring out what the warning is talking about!
    Ines.

  • bbimberbbimber HomeMember

    Hi - I'm hitting a similar issue with MQRankSum being undefined, and the VCF does in fact have lines without this value. What's the appropriate way to handle this in filtering? Should I write the filter like: -filter "ReadPosRankSum && ReadPosRankSum < -8.0" (i.e. test if variable is defined)? Thanks in advance.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator
  • jphekmanjphekman University of IllinoisMember

    Using VariantFiltration in GATK 3.5, getting the warning "undefined variable QD". Indeed in my vcf input there are lines without QD values. In this case, I'd expect the line to be filtered out. However, it isn't, and I'm not sure how to change my recipe to get the intended result.

    My recipe:

    -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0"

    Sample output:

    X 1517306 . C T 50.30 PASS AC=2;AF=1.00;AN=2;DP=3;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=0.00;SOR=0.693 GT:AD:DP:GQ:PGT:PID:PL [...]

    As you can see, filter FS didn't catch it (FS is less than 30 here), but I'd expect filter QD to catch it (no QD here). However, it is marked PASS. I believe a missing QD means the call is low quality and therefore I don't want it marked PASS (but I could be misunderstanding QD's meaning).

    How should I proceed?

    Thanks very much!
    Jessica

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @jphekman
    Hi Jessica,

    The WARN statements simply let you know that the annotation is missing at the site. There is no need to to worry about them, as sometimes an annotation cannot be calculated for a site. As for passing sites without the requested annotation, we do not fail sites unless the annotation values fail the filters. However, you can add this argument to fail those sites :smile:

    The reason your FS value of 0 is passing is because it is less than 30.0. Your filter is set to fail variants with an FS value of 30.0. To fail variants with FS value less than 30.0, you would have FS < 30.0. However, you probably don't want to do that! Have a look at this document for more information.

    -Sheila

  • jphekmanjphekman University of IllinoisMember

    Exactly what I needed -- thank you!

  • Will_GilksWill_Gilks University of Sussex, UKMember
    edited April 2016

    @Sheila I have a similar problem. Multiple of records with no MQRankSum or ReadPosRankSum value are passing filters.

    I see how records with missing values can be forced to fail filters by following the link previously provided
    (using --missingValuesInExpressionsShouldEvaluateAsFailing )

    However, some records without MQRankSum values have good quality. For example:
    chr3R 27495566 hc_smaro A T 146296.29 PASS AC=436;AF=1.00;AN=436;DP=3306;

    Perhaps the solution is to go into the records that didn't have MQRankSum (and ReadPosRankSum), and then extract out the good ones, and then merge them back in.

    Among other variables, I've tried to filter by allele number, like this:

       GenomeAnalysisTK -R ${refseq} \
            -T VariantFiltration \
            -V f0.${vcf} \
            --filterExpression "QD<2.0||FS>50.000||MQ<58.00||MQRankSum<-7.0||ReadPosRankSum<-5.0||DP>15000||AN<400" \
            --filterName "locusFilter" \
                -o marked.f0.${vcf}   
    
        GenomeAnalysisTK -R ${refseq} \
            -T SelectVariants \
            -V marked.f0.${vcf} \
            --excludeFiltered \
                -o f1.${vcf}
    

    This is one of the records from the resulting vcf:
    chrX 22911216 hc_nsode G A 69.49 PASS AC=4;AF=0.087;AN=46;DP=30;EFF=INTERGENIC(MODIFIER||||||||||A);FS=0.000;InbreedingCoeff=0.1074;MLEAC=3;MLEAF=0.065;MQ=94.00;QD=34.74;SOR=2.833;VRT=1;VariantType=SNP GT:AD:DP:GQ:PL 0/0:1,0:1:3:0,3,38
    ... which has an AN of 46 (which is less than the specified 400).

    I found this example when looking through records with no MQRankSum, so perhaps the fact that there is no MQRankSum value means that all filtering is disabled for that record, or perhaps there is another reason.

    Also is there any way in which I can invert SelectVariants - so I get get a file of all the variants I have removed ?

    Please help. I love posting errors up on the GATK forum but my grandchildren are growing beards.

    Issue · Github
    by Sheila

    Issue Number
    831
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • There used to be a doc about this somewhere, not sure if it's still around. But your interpretation is correct - if one of the clauses of a complex JEXL statement fails with an error, the record is considered to pass. The solution is to break up your compound test into multiple simple tests - --filterExpression "QD<2.0" --filterName QD --filterExpression "FS>50.0" --filterName FS ... This will then add filters for each subexpression that actually fails, and any subexpressions that error out will simply be ignored

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @Will_Gilks It sounds like your problem is that you're trying to filter on all annotations using a single filtering expression. I would recommend breaking it down into separate expressions (each in a separate filter) so that you can specify different behaviors for filters on annotations that are expected to be present in all records, vs those that may be absent but whose absence shouldn't be considered a dealbreaker. You can specify as many separate filters (each with their own name and expression) within the same command.

    And yes SelectVariants (in the latest version) has an argument to invert the selection criteria, have a look at the tool doc.

    Hopefully that will allow you to get done what you need so you can spend the rest of your time dandling your grandkids on your knee before they're so big that they crush your fragile old bones ;)

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Great minds think alike, @pdexheimer :)

  • @Geraldine_VdAuwera - and apparently spend time on forums while their colleagues are at lunch...

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Oh hah, well some just got back from the UK and were too jetlagged to realize it's lunch time...

  • RB_74RB_74 Member

    Hi Sheila,

    I encountered same problem.
    I have many like with first warning and few with second one.

    ${JAVA} -jar ${BIN_PATH}/GenomeAnalysisTK.jar -T VariantFiltration -R ${GENOME_FILE} -V ${RAWSNPS_OUTPUT} --filterExpression 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 4.0' --filterName "basic_snp_filter" -o ${FILTSNPS_OUTPUT}

    WARN 14:59:11,593 Interpreter - ![38,47]: 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 4.0;' undefined variable MQRankSum
    and
    WARN 15:43:21,737 Interpreter - ![59,73]: 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 4.0;' undefined variable ReadPosRankSum

    Do you think its also similar problem like @isp2?

    Thank you for helping me out!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @RB_74
    Hi,

    Yes, it is the same problem :smiley: You have nothing to worry about. @pdexheimer's response is correct.

    -Sheila

Sign In or Register to comment.