(How to) Filter on genotype using VariantFiltration
Before using VariantFiltration, please read the entirety of the discussion in https://github.com/broadinstitute/gatk/issues/5362 that describes VariantFiltration's unintuitive behavior when processing compound expressions.
This tutorial illustrates how to filter on genotype, e.g. heterozygous genotype call. The steps apply to either single-sample or multi-sample callsets.
First, the genotype is annotated with a filter expression using VariantFiltration. Then, the filtered genotypes are made into no-call (
./.) genotypes with SelectVariants so that downstream tools may discount them.
We use example variant record FORMAT fields from
trio.vcf to illustrate.
GT:AD:DP:GQ:PL 0/1:17,15:32:99:399,0,439 0/1:11,12:23:99:291,0,292 1/1:0,30:30:90:948,90,0
1. Annotate genotypes using VariantFiltration
If we want to filter heterozygous genotypes, we use VariantFiltration's
--genotype-filter-expression "isHet == 1" option. We can specify the annotation value for the tool to label the heterozygous genotypes with with the
--genotype-filter-name option. Here, this parameter's value is set to
gatk VariantFiltration \ -V trio.vcf \ -O trio_VF.vcf \ --genotype-filter-expression "isHet == 1" \ --genotype-filter-name "isHetFilter"
After filtering, in the resulting
trio_VF.vcf, our example record adds an
FT field and becomes:
GT:AD:DP:FT:GQ:PL 0/1:17,15:32:isHetFilter:99:399,0,439 0/1:11,12:23:isHetFilter:99:291,0,292 1/1:0,30:30:PASS:90:948,90,0
We see that HET (
0/1) genotype calls get a
isHetFilter in the
FT field and other genotype calls get a
PASS in the genotype field.
The VariantFiltration tool document lists the various options to filter on the FORMAT (aka genotype call) field:
We have put in convenience methods so that one can now filter out hets ("isHet == 1"), refs ("isHomRef == 1"), or homs ("isHomVar == 1"). Also available are expressions isCalled, isNoCall, isMixed, and isAvailable, in accordance with the methods of the Genotype object.
2. Transform filtered genotypes to no call
Running SelectVariants with
--set-filtered-gt-to-nocall will further transform the flagged genotypes with a null genotype call. This conversion is necessary because downstream tools do not parse the FORMAT-level filter field.
gatk SelectVariants \ -V trio_VF.vcf \ --set-filtered-gt-to-nocall \ -O trioGGVCF_VF_SV.vcf
The result is that the
GT genotypes of the
isHetFiltered genotype records become null or no call (
./.) as follows.
GT:AD:DP:FT:GQ:PL ./.:17,15:32:isHetFilter:99:399,0,439 ./.:11,12:23:isHetFilter:99:291,0,292 1/1:0,30:30:PASS:90:948,90,0