We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

(How to) Filter on genotype using VariantFiltration

shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
edited November 2018 in Tutorials

Before using VariantFiltration, please read the entirety of the discussion in https://github.com/broadinstitute/gatk/issues/5362 that describes VariantFiltration's unintuitive behavior when processing compound expressions.

This tutorial illustrates how to filter on genotype, e.g. heterozygous genotype call. The steps apply to either single-sample or multi-sample callsets.

First, the genotype is annotated with a filter expression using VariantFiltration. Then, the filtered genotypes are made into no-call (./.) genotypes with SelectVariants so that downstream tools may discount them.

We use example variant record FORMAT fields from trio.vcf to illustrate.

0/1:17,15:32:99:399,0,439       0/1:11,12:23:99:291,0,292       1/1:0,30:30:90:948,90,0

1. Annotate genotypes using VariantFiltration

If we want to filter heterozygous genotypes, we use VariantFiltration's --genotype-filter-expression "isHet == 1" option. We can specify the annotation value for the tool to label the heterozygous genotypes with with the --genotype-filter-name option. Here, this parameter's value is set to "isHetFilter".

gatk VariantFiltration \
-V trio.vcf \
-O trio_VF.vcf \
--genotype-filter-expression "isHet == 1" \
--genotype-filter-name "isHetFilter"

After filtering, in the resulting trio_VF.vcf, our example record adds an FT field and becomes:

0/1:17,15:32:isHetFilter:99:399,0,439   0/1:11,12:23:isHetFilter:99:291,0,292   1/1:0,30:30:PASS:90:948,90,0

We see that HET (0/1) genotype calls get a isHetFilter in the FT field and other genotype calls get a PASS in the genotype field.

The VariantFiltration tool document lists the various options to filter on the FORMAT (aka genotype call) field:

We have put in convenience methods so that one can now filter out hets ("isHet == 1"), refs ("isHomRef == 1"), or homs ("isHomVar == 1"). Also available are expressions isCalled, isNoCall, isMixed, and isAvailable, in accordance with the methods of the Genotype object.

2. Transform filtered genotypes to no call

Running SelectVariants with --set-filtered-gt-to-nocall will further transform the flagged genotypes with a null genotype call. This conversion is necessary because downstream tools do not parse the FORMAT-level filter field.

gatk SelectVariants \
-V trio_VF.vcf \
--set-filtered-gt-to-nocall \
-O trioGGVCF_VF_SV.vcf

The result is that the GT genotypes of the isHetFiltered genotype records become null or no call (./.) as follows.

./.:17,15:32:isHetFilter:99:399,0,439   ./.:11,12:23:isHetFilter:99:291,0,292   1/1:0,30:30:PASS:90:948,90,0

Post edited by shlee on


  • vytautasraskvytautasrask Member
    edited July 2018

    Very interesting tutorial. Where I can get trio.vcf to practice it?

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @vytautasrask,

    You can use trio data from the Workshop tutorial bundle. You can find a link to workshop materials on the Presentations page at https://software.broadinstitute.org/gatk/documentation/presentations.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Before using VariantFiltration, please read the entirety of the discussion in https://github.com/broadinstitute/gatk/issues/5362 that describes VariantFiltration's unintuitive behavior when processing compound expressions.

  • avrajitavrajit Member

    Hi I did this using gatk- VariantFiltration
    -R GRCh38.p12.genome.fasta -V GENOTYPE.vcf
    --filter-name "FS" --filter-expression "FS > 30.0" \
    --filter-name "QD" --filter-expression "QD < 3.0" \
    --filter-name "DP" --filter-expression "DP >= 20" \
    -O filtered.vcf

    But the terminal output sowing

    ProgressMeter - Starting traversal
    17:18:50.892 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
    17:18:52.328 INFO VariantFiltration - No variants filtered by: AllowAllVariantsVariantFilter
    17:18:52.329 INFO ProgressMeter - KI270850.1:223180 0.0 75222 3142980.5
    17:18:52.329 INFO ProgressMeter - Traversal complete. Processed 75222 total variants in 0.0 minutes.
    17:18:52.474 INFO VariantFiltration - Shutting down engine

    is that mean I am doing wrong somewhere?
    is there any problem input?
    for more information

    this is generated form RNA-seq data
    variant call done with haplotypeCaller
    combined vcfs by CombineGVCFs (control and treated)
    genotyping with GenotypeGVCFs

    please help

  • mernstermernster Member
    edited October 2019
    Hi, I have a multisample vcffile and would like to edit heterozygous individuals with a low or high allelic balance (AB).

    If the AB is below 0.25, I would like to change the genotype to 1/1 (homozygous for alternate). If the AB is above 0.75, I would like to change the genotype to 0/0 (homozygous for reference allele).

    Since I dont have the AB information for each genotype in my vcf, I am trying to use the genotype specific RO and DP instead. RO is the count of full observations of the reference haplotype and DP is the total number of reads.

    I have been trying to use this command to tag the variants that I want to convert:

    gatk VariantFiltration -R /ref.fasta -V input.vcf -O output.vcf --genotype-filter-expression 'isHet == 1 && RO / DP < 0.25 ' --genotype-filter-name 'lowAB'

    gatk VariantFiltration -R /ref.fasta -V input.vcf -O output.vcf --genotype-filter-expression 'isHet == 1 && RO / DP > 0.75 ' --genotype-filter-name 'highAB'

    But somehow this just adds an anotation to all heterozygous positions... Any ideas on what might be wrong?

    Also, how could I proceed afterwords? is there any way to change the genotype fields of lowAB to homozygous for alternate allele and the highAB to homozygousfor reference allele?
Sign In or Register to comment.