Do GATK4 tools ignore VCF sites marked as filtered, or must they be removed from the file?

foxDie00 Member
edited October 2018 in Ask the GATK team

@Sheila said:
I think the GATK tools mostly ignore filtered sites (sites without PASS or .)...

Hi, @Sheila ,
Can you confirm that GATK4 tools ignore variants in a VCF that are marked as filtered (i.e. not PASS), but that are still present in the VCF file?

For example, I have a bootstrapped knownSites.hardFilter.vcf file that I produced with VariantFiltration with the recommended hard filter parameters. This VCF still includes the filtered variants in the file—it has only marked them as filtered in the FILTER column. My question is this: Would, for example, BaseRecalibrator—which takes the knownSites.hardFilter.vcf as input—ignore the variants in the VCF file that are marked with a filter instead of PASS in the FILTER column in the VCF? I need to know if tools like BaseRecalibrator are actually ignoring variants marked as filtered but that are still present in the VCF file, or if I need to physically remove them using SelectVariants. Please let me know, thanks.
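
To see what a file like this actually contains before deciding how to handle it, one option is to tally the values in the FILTER column. The following is only a minimal sketch in plain Python, not a GATK tool: it assumes an uncompressed, tab-delimited VCF, and the records and the filter label `QD2` are made up for illustration.

```python
from collections import Counter


def tally_filters(vcf_lines):
    """Count FILTER column values across the data lines of a VCF."""
    counts = Counter()
    for line in vcf_lines:
        # Skip blank lines and header lines (both ## meta lines and #CHROM).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # FILTER is the 7th fixed column in a VCF record (index 6).
        counts[fields[6]] += 1
    return counts


# Hypothetical records; "QD2" stands in for whatever filter name was
# passed to VariantFiltration.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n",
    "chr1\t200\t.\tC\tT\t12\tQD2\t.\n",
    "chr1\t300\t.\tG\tA\t40\t.\t.\n",
]

print(tally_filters(example))
```

Any key other than `PASS` or `.` in the result is a record that was marked as filtered but is still present in the file.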

Best Answers

  • shlee Cambridge Member, Broadie, Moderator admin
    Accepted Answer

    Hi @foxDie00,

    Sheila has moved on to green pastures. While our new support-specialist ramps up, I am helping out on the forum.

    The convention is to keep data that was costly to compute, e.g. via read reassembly, and to filter variant sites by labeling the FILTER column, typically with the reason for filtering. When you make a variant resource file, for example to use as a population resource, the convention seems to be to construct a sites-only VCF that removes sample-level annotations as well as any filtered variants.

    It has been my experience that some GATK4/Picard tools appropriately ignore the filtered variants while others are indifferent to FILTER status and use all sites/variants. This all depends on the tool and the context in which it is used. For example, the Mutect2 workflow uses all sites present in the panel of normals VCF and completely ignores the FILTER column status.

    Sorry, I do not know which camp BaseRecalibrator falls into. To be on the safe side, I would suggest you remove the filtered variants. Alternatively, you can test a small dataset to see if these FILTERed sites change the results you get.
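
The "remove the filtered variants" step can be sketched as follows. SelectVariants, mentioned in the question, has an `--exclude-filtered` option for this; check the documentation for your GATK version. The snippet below is not that tool, just a minimal plain-Python illustration of the same idea, assuming an uncompressed, tab-delimited VCF. It treats `PASS` and `.` as unfiltered, as in the quoted comment above, and the records and the filter label `QD2` are made up.

```python
def drop_filtered(vcf_lines):
    """Yield header lines plus records whose FILTER is PASS or '.'."""
    for line in vcf_lines:
        # Header lines are passed through unchanged.
        if line.startswith("#"):
            yield line
            continue
        # Ignore blank lines.
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # FILTER is the 7th fixed column (index 6); keep unfiltered records only.
        if len(fields) > 6 and fields[6] in ("PASS", "."):
            yield line


# Hypothetical input; "QD2" stands in for a user-chosen filter name.
records = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n",
    "chr1\t200\t.\tC\tT\t12\tQD2\t.\n",
    "chr1\t300\t.\tG\tA\t40\t.\t.\n",
]

kept = list(drop_filtered(records))
print("".join(kept), end="")
```

Running a downstream tool once on the original file and once on the reduced file, using a small dataset, is one way to check whether the FILTERed sites change the result.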

Answers

  • foxDie00 Member

    Thanks for the reply! I will now manually remove filtered sites for downstream tools to be on the safe side.
