Removing variants based on the FORMAT field


I have been trying to remove the Variants in my Sample with a GQ lower than 20 as part of the Genotype Refinement workflow

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R ref.fa -V Genotype_refinement.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o Genotype_refinement_GQ.vcf

This gives me a vcf file with the Filter name lowGQ in the FORMAT field.

Now I do not know how to use SelectVariants to remove these filtered variants. I think my JEXL is wrong

java -jar GenomeAnalysisTK.jar -R ref.fa -T SelectVariants -V Genotype_refinement_GQ.vcf -o Genotype_refinement_GQ_filtered.vcf -xl_se 'lowGQ'

I still get the lowGQ samples when I check the file. Could you help me understand how to remove these?

Also I wanted to preferably do this in a single step but apparently GATK SelectVariants does not recognise G_filter

##### ERROR MESSAGE: Argument with name 'G_filter' isn't defined.



  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin


    The best thing to do is use --setFilteredGtToNocall. This will set the genotype to ./. for all samples that failed the GQ filter. You don't want to remove the entire site if most of the samples passed the GQ filter. You just want to know which of the samples had a bad GQ at the site.

    I hope this helps.


  • niranjanksniranjanks IndiaMember

    I used -setFilteredGtToNocall and then filtered the ./. genotypes using

    sed '/.\/./d'

  • niranjanksniranjanks IndiaMember
  • tedtoaltedtoal Member

    Selecting on values in the sample fields (FORMAT field) is a problem because there might be more than one sample, and which do you use? Require all of them to pass the filter? Any one of them? But, many people have only a single sample in the vcf file. Then it makes sense to treat the FORMAT field like the INFO field. You can't do that EASILY with SelectVariants, but it can be done with vc().getGenotype(0).<etc.>. I think the following would work for niranjanks, if his file has only one sample:

    java -jar GenomeAnalysisTK.jar -R ref.fa -T SelectVariants -V Genotype_refinement_GQ.vcf \ -o Genotype_refinement_GQ_filtered.vcf \ -select 'vc.getNSamples() == 1 && vc.getGenotype(0).getExtendedAttribute("GQ", 99) <= 20.0'"

    The 99 is the default value to use if the GQ field is missing.

    You could access other sample FORMAT fields too. Change getGenotype(0) to getGenotype(1) to access the second sample.

    I just figured out how to make this work for the GT field. I wanted to filter out all records with "./." in the GT field:

    bash -c " \
        $JAVA -Xmx16g$TMP_DIR -jar $GATK --analysis_type SelectVariants \
        --reference_sequence $REF --variant VarScan_Consensus_$1.vcf.gz \
        --out VarScan_Consensus_Filtered2_$1.vcf \
        -select 'vc.getNSamples() == 1 && vc.getGenotype(0).getGenotypeString() != \"./.\"'"

    The backslashes before the double quotes at the end are cuz the whole thing is also in double quotes to pass to bash.

    Issue · Github
    by Sheila

    Issue Number
    Last Updated
    Closed By
  • ronlevineronlevine Lexington, MAMember ✭✭

    @tedtoal vc.getGenotype(0).getExtendedAttribute("GQ", 99) can be changed to vc.getGenotype(0).getExtendedAttribute("GQ") and vc.getGenotype(0).getGenotypeString() != "./."'" to vc.getGenotype(0).isCalled().

  • Hi @Sheila,

    I use --setFilteredGtToNocall, but the genotype for the filtered variant is not set to "./.". How could this issue be resolved?

    The command I use is:

    java -jar /home/yin168/bin/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar -T VariantFiltration -R /scratch/snyder/y/yin168/gatk/sl_genome/gPmar100.fa.txt -V $input -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o $output --setFilteredGtToNocall


  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @XiaoshenYin,

    We have a GATK4 tutorial on filtering on GT and then setting the filtered GTs to no call at

  • Hi @shlee ,

    I tried exactly what the link you provided suggested, but it still does not work.

    Commands I use are:

    java -jar ~/bin/gatk- VariantFiltration -R /scratch/snyder/y/yin168/gatk/sl_genome_gatk4/gPmar100.fa.txt -V /scratch/snyder/y/yin168/gatk/sam/sam1/400.vcf -window 35 -cluster 3 --filter-name "FS" --filter-expression "FS > 30.0" --filter-name "QD" --filter-expression "QD < 2.0" -O 400_filter4.vcf

    java -jar ~/bin/gatk- Sele ctVariants -V 400_filter4.vcf --set-filtered-gt-to-nocall -O 400_filter4N.vcf

    One thing I notice is that the example used in the link is to filter heterozygous genotypes. Before filtering, the FORMAT field has "GT:AD:DP:GQ:PL", but, after filtering, the FORMAT field has "GT:AD:DP:FT:GQ:PL". "FT" is added to the FORMAT field.

    However, I filter FS and QD and, after filtering, nothing like "FS" and "QD" is added to the FORMAT field. Instead, "PASS", "QD", or "SnpCluster" is added to the FILTER field. In this case, how should I resolve this issue and remove variants that do not pass the filtering? I would greatly appreciate it if you could help fix this issue.


  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hi - I just wanted to let you know @shlee is traveling at the moment, but I have made her aware of your follow-up question.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited December 2018

    Hi @XiaoshenYin,

    Adelaide kindly notified me of your followup question. I had been traveling to give a GATK workshop in Taiwan last week and I am now taking a few days off due to a stomach bug.

    The tutorial I had linked previously illustrates how to annotate FORMAT level genotypes. Given you ask:

    However, I filter FS and QD and, after filtering, nothing like "FS" and "QD" is added to the FORMAT field.

    You are now asking how to filter on INFO level annotations. This you can perform with VariantFiltration. Here is an example command from the recent hard filtering workshop hands-on worksheet I had updated:

    gatk VariantFiltration \
    -V illumina_platinum/motherSNPs.vcf.gz \
    -filter "​QD < 2.0​" --filter-name "QD2" \
    -filter "​QUAL < 30.0​" --filter-name "QUAL30" \
    -filter "​SOR > 3.0​" --filter-name "SOR3" \
    -filter "​FS > 60.0​" --filter-name "FS60" \
    -filter "​MQ < 40.0​" --filter-name "MQ40" \
    -filter "​MQRankSum < -12.5​" --filter-name "MQRankSum-12.5" \ 
    -filter "​ReadPosRankSum < -8.0​" --filter-name "ReadPosRankSum-8" \ 
    -O sandbox/motherSNPsfilters.vcf.gz

    When you filter at the INFO level, the filter name appears in the FILTER field of the VCF. Here are some example INFO-level filtered records:

    20      691035  .       A       *       0.12    QD2;QUAL30;SOR3 AC=1;AF=0.500;AN=2;BaseQRankSum=-8.310e-01;ClippingRankSum=0.00;DP=18;ExcessHet=3.9794;FS=22.595;MQ=40.58;MQRankSum=0.761;QD=0.00;ReadPosRankSum=-2.292e+00;SOR=3.590   GT:AD:DP:GQ:PL  0/1:16,2:18:3:3,0,1156
    20      1097992 .       C       T       42.16   FS60;QD2;SOR3   AC=1;AF=0.500;AN=2;BaseQRankSum=-2.893e+00;ClippingRankSum=0.00;DP=33;ExcessHet=5.4407;FS=70.626;MQ=57.68;MQRankSum=1.10;QD=0.54;ReadPosRankSum=2.29;SOR=6.090  GT:AD:DP:GQ:PGT:PID:PL  0/1:23,10:33:23:1|0:1097976_C_T:23,0,650
    20      1160067 .       A       C       139.14  FS60;SOR3       AC=1;AF=0.500;AN=2;BaseQRankSum=1.43;ClippingRankSum=0.00;DP=14;ExcessHet=4.7712;FS=62.839;MQ=43.90;MQRankSum=-1.900e+00;QD=3.57;ReadPosRankSum=-1.006e+00;SOR=4.159    GT:AD:DP:GQ:PGT:PID:PL  0/1:12,2:14:42:0|1:1160059_G_GCCC:42,0,1450

    I hope this is helpful. If something doesn't make sense, please forgive me. I've been rather ill.

Sign In or Register to comment.