Variant filtration won't run

Hello, I'm trying to figure out what's wrong in my script for this variant filtration argument.
Here is what I ran:

java -jar tools/GenomeAnalysisTK.jar \
-T VariantFiltration \
-R ref/Taeniopygia_guttata.taeGut3.2.4.dna_rm.toplevel.fa \
-V vcftoolsextractGATK_snps.vcf \
-o GATK_darwinfinch_filter.vcf \
--filterExpression QD < 2 && FS > 60.0 && MQ < 50.0 && HaplotypeScore > 10.0 && MappingQualityRankSum < -4 && ReadPosRankSum < -2 \
--filterName “darwinfinchfilter”

I got the following error message: "2: No such file or directory."

However, I know my file path was set correctly as if i remove the last two arguments and only ran:
java -jar tools/GenomeAnalysisTK.jar \
-T VariantFiltration \
-R ref/Taeniopygia_guttata.taeGut3.2.4.dna_rm.toplevel.fa \
-V vcftoolsextractGATK_snps.vcf \
-o GATK_darwinfinch_filter.vcf

It would run.

I also tried Geraldine example filterExpression:

java -jar tools/GenomeAnalysisTK.jar \
-T VariantFiltration \
-R ref/Taeniopygia_guttata.taeGut3.2.4.dna_rm.toplevel.fa \
-V vcftoolsextractGATK_snps.vcf \
-o GATK_geraldinefilter.vcf \
--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filterName “Geraldine_snp_filter"

and I got the following error:
Unmatched ".

Comments

  • pdexheimerpdexheimer Member ✭✭✭✭

    The first error is because you didn't quote the expression, so it was interpreted but the shell. Specifically, it tried to run the command up to --filterExpression QD, and then supply the contents of the file called "2" as input. The error on your second command means that you have an extra quotation mark somewhere (or possibly a missing one), though I don't see it offhand.

  • setophagussetophagus Member

    @pdexheimer said:
    The first error is because you didn't quote the expression, so it was interpreted but the shell. Specifically, it tried to run the command up to --filterExpression QD, and then supply the contents of the file called "2" as input. The error on your second command means that you have an extra quotation mark somewhere (or possibly a missing one), though I don't see it offhand.

    Thanks! But after adding the quotation marks, I got this "2: No such file or directory." error message again. It still won't run... :(

    java -jar tools/GenomeAnalysisTK.jar -T VariantFiltration -R ref/Taeniopygia_guttata.taeGut3.2.4.dna_rm.toplevel.fa -V vcftoolsextractGATK_snps.vcf -o GATK_darwinfinch_filter.vcf --filterExpression “QD < 2 && FS > 60.0 && MQ < 50.0 && HaplotypeScore > 10.0 && MappingQualityRankSum < -4 && ReadPosRankSum < -2” --filterName “darwinfinchfilter”

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    In your second example, it's not working because the first quotation mark on the line “Geraldine_snp_filter" is a "smart" quotation mark, ie it's a different character than the one that is interpreted as a functional quotation mark by the shell. You can tell because they're not completely straight, they're a little curvy. This typically happens when you type your command in a word processor or a text editor with auto-formatting then copy it to the shell.

    In your latest attempt, all the quotation marks have been converted to smart ones, negating their effect, so that's why you get the first error type again.

  • setophagussetophagus Member

    @Geraldine_VdAuwera said:
    In your second example, it's not working because the first quotation mark on the line “Geraldine_snp_filter" is a "smart" quotation mark, ie it's a different character than the one that is interpreted as a functional quotation mark by the shell. You can tell because they're not completely straight, they're a little curvy. This typically happens when you type your command in a word processor or a text editor with auto-formatting then copy it to the shell.

    In your latest attempt, all the quotation marks have been converted to smart ones, negating their effect, so that's why you get the first error type again.

    Thanks so much! Now it works after I changed the quotation marks. :)
    However, I got lines of warning message as the following:

    WARN 21:25:39,047 Interpreter - ![0,2]: 'QD < 2 && FS > 60.0 && MQ < 50.0 && HaplotypeScore > 10.0 && MappingQualityRankSum < -4 && ReadPosRankSum < -2;' undefined variable QD
    WARN 21:25:39,160 Interpreter - ![0,2]: 'QD < 2 && FS > 60.0 && MQ < 50.0 && HaplotypeScore > 10.0 && MappingQualityRankSum < -4 && ReadPosRankSum < -2;' undefined variable QD
    WARN 21:25:39,187 Interpreter - ![0,2]: 'QD < 2 && FS > 60.0 && MQ < 50.0 && HaplotypeScore > 10.0 && MappingQualityRankSum < -4 && ReadPosRankSum < -2;' undefined variable QD

    I also got warning message for the Geraldine's filter example above. I got the warnings as the following for that:

    WARN 21:30:57,755 Interpreter - ![38,52]: 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0;' undefined variable HaplotypeScore

    I saw a previous previous post about this before, and it seemed that we could ignore the warnings as such. However, my output is identical set of SNPs as the input file (for both of the filter attempts).... It doesn't seem that this filters have worked...

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Variant filtering in GATK doesn't physically remove any variants from the callset; what it does is annotate variants' FILTER field. The idea is to do this processing in a way that is non lossy, meaning no information is lost. That way you are able to refine the filtering and rescue some variants if needed without having to go back to the original file.

    When you are ready to subset your passing variants, you do it with SelectVariants.

  • setophagussetophagus Member

    @Geraldine_VdAuwera said:
    Variant filtering in GATK doesn't physically remove any variants from the callset; what it does is annotate variants' FILTER field. The idea is to do this processing in a way that is non lossy, meaning no information is lost. That way you are able to refine the filtering and rescue some variants if needed without having to go back to the original file.

    When you are ready to subset your passing variants, you do it with SelectVariants.

    Thanks a lot! Are the warnings looking alright to you though?
    To subset the passing variants, should I use the following command with "-ef" option?

    java -Xmx2g -jar tools/GenomeAnalysisTK.jar -R ref/Taeniopygia_guttata.taeGut3.2.4.dna_rm.toplevel.fa -T SelectVariants --variant GATK_darwinfinch_filter.vcf -o GATK_darwinfinch_select.vcf -ef

    When I do this, the input and output still have the same number of SNPs...

  • setophagussetophagus Member

    java -Xmx2g -jar tools/GenomeAnalysisTK.jar -R ref/Taeniopygia_guttata.taeGut3.2.4.dna_rm.toplevel.fa -T SelectVariants --variant GATK_darwinfinch_filter.vcf -select 'vc.isNotFiltered()' -o GATK_darwinfinch_select.vcf

    I found the expression to filter SNPs that passed filtration. So I ran the above command. The output file is still having the same set of SNPs as the input. :(

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi there,

    Using the -ef flag should be sufficient. Did you check the filter field in your filtered VCF, to make sure that the records actually got annotated?

  • setophagussetophagus Member

    @Geraldine_VdAuwera said:
    Hi there,

    Using the -ef flag should be sufficient. Did you check the filter field in your filtered VCF, to make sure that the records actually got annotated?

    Hello Geraldine,
    Thanks!
    I'm copying one line here. Is the "PASS" annotation from VariantFiltration, as I don't see this in the input file of the filtration step? It seems that every row I looked at has PASS, could it be the the filter step skipped many criteria (associated with the warnings)?

    1 2418 . C T 816.80 PASS AC=12;AF=0.044;AN=272;BaseQRankSum=1.22;ClippingRankSum=1.23;DP=853;FS=0.000;GQ_MEAN=14.35;GQ_STDDEV=14.73;InbreedingCoeff=0.2115;MLEAC=10;MLEAF=0.037;MQ=35.00;MQ0=0;MQRankSum=-1.231e+00;NCC=54;QD=30.25;ReadPosRankSum=-3.580e-01;SOR=0.368 GT:AD:DP:GQ:PL ./.:4,0:4:12:0,12,141 ./.:2,0:2:6:0,6,50 ./.:0,0:0:.:. ./.:2,0:2:6:0,6,54 ./.:2,0:2:6:0,6,78 ./.:3,0:3:9:0,9,97 ./.:3,0:3:9:0,9,120 ./.:2,0:2:6:0,6,72 ./.:3,0:3:9:0,9,114 ./.:0,0:0:.:. ./.:0,0:0:.:. ./.:0,0:0:.:. ./.:2,0:2:6:0,6,53 ./.:2,0:2:6:0,6,58 ./.:3,0:3:9:0,9,111 ./.:4,0:4:12:0,12,143 ./.:0,0:0:.:. ./.:0,0:0:.:. ./.:2,0:2:6:0,6,75 ./.:0,0:0:.:. ./.:4,0:4:12:0,12,119 ./.:2,0:2:6:0,6,65 ./.:2,0:2:0:0,0,33 ./.:0,0:0:.:. ./.:0,0:0:.:. ./.:0,0:0:.:. ./.:3,0:3:9:0,9,120 ./.:4,0:4:12:0,12,168 ./.:4,0:4:12:0,12,160 ./.:0,0:0:.:. ./.:4,0:4:12:0,12,127 ./.:0,0:0:.:. ./.:0,0:0:.:. ./.:0,0:0:.:. ./.:0,0:0:.:. ./.:2,0:2:6:0,6,67 ./.:2,0:2:6:0,6,78 ./.:4,0:4:12:0,12,131 ./.:4,0:4:12:0,12,169 ./.:2,0:2:6:0,6,81 ./.:2,0:2:6:0,6,76 ./.:4,0:4:12:0,12,162 ./.:4,0:4:12:0,12,141 ./.:2,0:2:6:0,6,75 ./.:4,0:4:12:0,12,163 ./.:0,0:0:.:. ./.:4,0:4:12:0,12,173 ./.:2,0:2:6:0,6,87 ./.:2,0:2:6:0,6,74 ./.:0,0:0:.:. ./.:2,0:2:6:0,6,65 ./.:2,0:2:6:0,6,79 ./.:3,0:3:9:0,9,129 ./.:4,0:4:12:0,12,158 ./.:3,0:3:9:0,9,111 ./.:2,0:2:6:0,6,74 ./.:0,0:0:.:. ./.:0,0:0:.:. ./.:0,0:0:.:. ./.:4,0:4:12:0,12,157 ./.:0,0:0:.:. ./.:3,0:3:9:0,9,101 ./.:0,0:0:.:. ./.:3,0:3:9:0,9,111 ./.:4,0:4:12:0,12,134 ./.:2,0:2:6:0,6,69 ./.:0,0:0:.:. ./.:0,0:0:.:. ./.:0,0:0:.:. ./.:0,0:0:.:. ./.:2,0:2:6:0,6,65 ./.:2,0:2:6:0,6,59 0/0:9,0:9:27:0,27,327 ./.:0,0:0:.:. ./.:2,0:2:6:0,6,70 ./.:0,0:0:.:. ./.:3,0:3:9:0,9,121 ./.:2,0:2:6:0,6,74 ./.:0,0:0:.:. ./.:2,0:2:6:0,6,69 ./.:0,0:0:.:. ./.:2,0:2:6:0,6,65 ./.:0,0:0:.:. ./.:0,0:0:.:.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @setophagus Depending on which filter expression you ended up using, it's possible that the criteria were too lax and didn't filter anything. Try with just one annotation to start with, like in the most basic documented examples, and increase the complexity gradually (or use separate filter expressions). This will help you figure out what's going wrong. I would also recommend running on just a subset of your variants in order to rapidly test various expressions.

  • KurtKurt Member ✭✭✭

    If you are using the filter with the && statements, I wouldn't be surprised. A variant would have to be true for every one of your parameters; that is; for

    QD < 2 && FS > 60.0 && MQ < 50.0 && HaplotypeScore > 10.0 && MappingQualityRankSum < -4 && ReadPosRankSum < -2

    a variant site has to have QD < 2 AND FS > 60.0 AND MQ < 50, etc.

    If you really want a filter were it is applied if the site fails any of the criteria, switch && with II

    Or as @Geraldine_VdAuwera suggest break out each one as it's own argument (I prefer that level of distinction myself).

    Also HaplotypeScore if only applicable for UnifiedGenotyper (I don't know if you are using that or HaplotypeCaller).

  • pdexheimerpdexheimer Member ✭✭✭✭

    Hmm, this used to be documented somewhere, but I can't remember where. I think @setophagus has the right idea here, that it's related to the warnings.

    Basically, if the JEXL interpreter decides it can't process an expression (because of an undefined variable, for instance), the entire expression is rendered as false, i.e., not filtered. The solution, as mentioned by several others, is to break up each clause of the expression into its own filter

  • setophagussetophagus Member

    Thank you all! :) It works when I do each criteria separately. I still got a lot of warning messages for each step.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @setophagus, glad to hear it's working for you! The warning messages are thrown when you have multiallelic sites, because the filtering system cannot handle them appropriately. In future we may be able to add some functionality to handle multiallelics natively.

Sign In or Register to comment.