Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

(How to) Filter on genotype using VariantFiltration

shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
edited November 2018 in Tutorials

Before using VariantFiltration, please read the entirety of the discussion in https://github.com/broadinstitute/gatk/issues/5362 that describes VariantFiltration's unintuitive behavior when processing compound expressions.


This tutorial illustrates how to filter on genotype, e.g. heterozygous genotype call. The steps apply to either single-sample or multi-sample callsets.

First, the genotype is annotated with a filter expression using VariantFiltration. Then, the filtered genotypes are made into no-call (./.) genotypes with SelectVariants so that downstream tools may discount them.

We use example variant record FORMAT fields from trio.vcf to illustrate.

GT:AD:DP:GQ:PL  
0/1:17,15:32:99:399,0,439       0/1:11,12:23:99:291,0,292       1/1:0,30:30:90:948,90,0

1. Annotate genotypes using VariantFiltration

If we want to filter heterozygous genotypes, we use VariantFiltration's --genotype-filter-expression "isHet == 1" option. We can specify the annotation value for the tool to label the heterozygous genotypes with with the --genotype-filter-name option. Here, this parameter's value is set to "isHetFilter".

gatk VariantFiltration \
-V trio.vcf \
-O trio_VF.vcf \
--genotype-filter-expression "isHet == 1" \
--genotype-filter-name "isHetFilter"

After filtering, in the resulting trio_VF.vcf, our example record adds an FT field and becomes:

GT:AD:DP:FT:GQ:PL
0/1:17,15:32:isHetFilter:99:399,0,439   0/1:11,12:23:isHetFilter:99:291,0,292   1/1:0,30:30:PASS:90:948,90,0

We see that HET (0/1) genotype calls get a isHetFilter in the FT field and other genotype calls get a PASS in the genotype field.

The VariantFiltration tool document lists the various options to filter on the FORMAT (aka genotype call) field:

We have put in convenience methods so that one can now filter out hets ("isHet == 1"), refs ("isHomRef == 1"), or homs ("isHomVar == 1"). Also available are expressions isCalled, isNoCall, isMixed, and isAvailable, in accordance with the methods of the Genotype object.


2. Transform filtered genotypes to no call

Running SelectVariants with --set-filtered-gt-to-nocall will further transform the flagged genotypes with a null genotype call. This conversion is necessary because downstream tools do not parse the FORMAT-level filter field.

gatk SelectVariants \
-V trio_VF.vcf \
--set-filtered-gt-to-nocall \
-O trioGGVCF_VF_SV.vcf

The result is that the GT genotypes of the isHetFiltered genotype records become null or no call (./.) as follows.

GT:AD:DP:FT:GQ:PL
./.:17,15:32:isHetFilter:99:399,0,439   ./.:11,12:23:isHetFilter:99:291,0,292   1/1:0,30:30:PASS:90:948,90,0

Post edited by shlee on

Comments

  • vytautasraskvytautasrask Member
    edited July 2018

    Very interesting tutorial. Where I can get trio.vcf to practice it?

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @vytautasrask,

    You can use trio data from the Workshop tutorial bundle. You can find a link to workshop materials on the Presentations page at https://software.broadinstitute.org/gatk/documentation/presentations.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Before using VariantFiltration, please read the entirety of the discussion in https://github.com/broadinstitute/gatk/issues/5362 that describes VariantFiltration's unintuitive behavior when processing compound expressions.

  • avrajitavrajit Member

    Hi I did this using gatk-4.0.11.0 VariantFiltration
    -R GRCh38.p12.genome.fasta -V GENOTYPE.vcf
    --filter-name "FS" --filter-expression "FS > 30.0" \
    --filter-name "QD" --filter-expression "QD < 3.0" \
    --filter-name "DP" --filter-expression "DP >= 20" \
    -O filtered.vcf

    But the terminal output sowing

    ProgressMeter - Starting traversal
    17:18:50.892 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
    17:18:52.328 INFO VariantFiltration - No variants filtered by: AllowAllVariantsVariantFilter
    17:18:52.329 INFO ProgressMeter - KI270850.1:223180 0.0 75222 3142980.5
    17:18:52.329 INFO ProgressMeter - Traversal complete. Processed 75222 total variants in 0.0 minutes.
    17:18:52.474 INFO VariantFiltration - Shutting down engine

    is that mean I am doing wrong somewhere?
    is there any problem input?
    for more information

    this is generated form RNA-seq data
    variant call done with haplotypeCaller
    combined vcfs by CombineGVCFs (control and treated)
    genotyping with GenotypeGVCFs

    please help

Sign In or Register to comment.