(How to) Filter on genotype using VariantFiltration

shleeshlee CambridgeMember, Broadie, Moderator admin
edited November 2018 in Tutorials

Before using VariantFiltration, please read the entirety of the discussion in https://github.com/broadinstitute/gatk/issues/5362 that describes VariantFiltration's unintuitive behavior when processing compound expressions.


This tutorial illustrates how to filter on genotype, e.g. heterozygous genotype call. The steps apply to either single-sample or multi-sample callsets.

First, the genotype is annotated with a filter expression using VariantFiltration. Then, the filtered genotypes are made into no-call (./.) genotypes with SelectVariants so that downstream tools may discount them.

We use example variant record FORMAT fields from trio.vcf to illustrate.

GT:AD:DP:GQ:PL  
0/1:17,15:32:99:399,0,439       0/1:11,12:23:99:291,0,292       1/1:0,30:30:90:948,90,0

1. Annotate genotypes using VariantFiltration

If we want to filter heterozygous genotypes, we use VariantFiltration's --genotype-filter-expression "isHet == 1" option. We can specify the annotation value for the tool to label the heterozygous genotypes with with the --genotype-filter-name option. Here, this parameter's value is set to "isHetFilter".

gatk VariantFiltration \
-V trio.vcf \
-O trio_VF.vcf \
--genotype-filter-expression "isHet == 1" \
--genotype-filter-name "isHetFilter"

After filtering, in the resulting trio_VF.vcf, our example record adds an FT field and becomes:

GT:AD:DP:FT:GQ:PL
0/1:17,15:32:isHetFilter:99:399,0,439   0/1:11,12:23:isHetFilter:99:291,0,292   1/1:0,30:30:PASS:90:948,90,0

We see that HET (0/1) genotype calls get a isHetFilter in the FT field and other genotype calls get a PASS in the genotype field.

The VariantFiltration tool document lists the various options to filter on the FORMAT (aka genotype call) field:

We have put in convenience methods so that one can now filter out hets ("isHet == 1"), refs ("isHomRef == 1"), or homs ("isHomVar == 1"). Also available are expressions isCalled, isNoCall, isMixed, and isAvailable, in accordance with the methods of the Genotype object.


2. Transform filtered genotypes to no call

Running SelectVariants with --set-filtered-gt-to-nocall will further transform the flagged genotypes with a null genotype call. This conversion is necessary because downstream tools do not parse the FORMAT-level filter field.

gatk SelectVariants \
-V trio_VF.vcf \
--set-filtered-gt-to-nocall \
-O trioGGVCF_VF_SV.vcf

The result is that the GT genotypes of the isHetFiltered genotype records become null or no call (./.) as follows.

GT:AD:DP:FT:GQ:PL
./.:17,15:32:isHetFilter:99:399,0,439   ./.:11,12:23:isHetFilter:99:291,0,292   1/1:0,30:30:PASS:90:948,90,0

Post edited by shlee on

Comments

  • vytautasraskvytautasrask Member
    edited July 2018

    Very interesting tutorial. Where I can get trio.vcf to practice it?

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @vytautasrask,

    You can use trio data from the Workshop tutorial bundle. You can find a link to workshop materials on the Presentations page at https://software.broadinstitute.org/gatk/documentation/presentations.

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Before using VariantFiltration, please read the entirety of the discussion in https://github.com/broadinstitute/gatk/issues/5362 that describes VariantFiltration's unintuitive behavior when processing compound expressions.

  • avrajitavrajit Member

    Hi I did this using gatk-4.0.11.0 VariantFiltration
    -R GRCh38.p12.genome.fasta -V GENOTYPE.vcf
    --filter-name "FS" --filter-expression "FS > 30.0" \
    --filter-name "QD" --filter-expression "QD < 3.0" \
    --filter-name "DP" --filter-expression "DP >= 20" \
    -O filtered.vcf

    But the terminal output sowing

    ProgressMeter - Starting traversal
    17:18:50.892 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
    17:18:52.328 INFO VariantFiltration - No variants filtered by: AllowAllVariantsVariantFilter
    17:18:52.329 INFO ProgressMeter - KI270850.1:223180 0.0 75222 3142980.5
    17:18:52.329 INFO ProgressMeter - Traversal complete. Processed 75222 total variants in 0.0 minutes.
    17:18:52.474 INFO VariantFiltration - Shutting down engine

    is that mean I am doing wrong somewhere?
    is there any problem input?
    for more information

    this is generated form RNA-seq data
    variant call done with haplotypeCaller
    combined vcfs by CombineGVCFs (control and treated)
    genotyping with GenotypeGVCFs

    please help

Sign In or Register to comment.