Do GATK4 tools ignore VCF sites marked as filtered, or must they be removed from the file?

foxDie00 Member
edited October 2018 in Ask the GATK team

@Sheila said:
I think the GATK tools mostly ignore filtered sites (sites without PASS or .)...

Hi @Sheila,
Can you confirm that GATK4 tools ignore variants in a VCF that are marked as filtered (i.e. not PASS), but that are still present in the VCF file?

For example, I have a bootstrapped knownSites.hardFilter.vcf file that I produced with VariantFiltration with the recommended hard filter parameters. This VCF still includes the filtered variants in the file—it has only marked them as filtered in the FILTER column. My question is this: Would, for example, BaseRecalibrator—which takes the knownSites.hardFilter.vcf as input—ignore the variants in the VCF file that are marked with a filter instead of PASS in the FILTER column in the VCF? I need to know if tools like BaseRecalibrator are actually ignoring variants marked as filtered but that are still present in the VCF file, or if I need to physically remove them using SelectVariants. Please let me know, thanks.

Answers

  • shlee Cambridge Member, Broadie ✭✭✭✭✭
    Accepted Answer

    Hi @foxDie00,

    Sheila has moved on to green pastures. While our new support specialist ramps up, I am helping out on the forum.

    The convention is to keep data that was costly to compute, e.g. via read reassembly, and to filter variant sites by labeling the FILTER column, typically with the reason for filtering. When you later build a variant resource file, for example a population resource, the convention seems to be to construct a sites-only VCF that removes sample-level annotations as well as any filtered variants.

    It has been my experience that some GATK4/Picard tools appropriately ignore the filtered variants while others are indifferent to FILTER status and use all sites/variants. This all depends on the tool and the context in which it is used. For example, the Mutect2 workflow uses all sites present in the panel of normals VCF and completely ignores the FILTER column status.

    Sorry, I do not know which camp BaseRecalibrator falls into. To be on the safe side, I would suggest you remove the filtered variants. Alternatively, you can test a small dataset to see if these FILTERed sites change the results you get.
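
To make "remove the filtered variants" concrete, here is a minimal Python sketch of what that operation amounts to at the file level. It is not how any GATK tool is implemented. The function name `drop_filtered` and the example records are hypothetical, and treating `.` as unfiltered is an assumption that follows the wording quoted at the top of the thread ("sites without PASS or .").

```python
def drop_filtered(vcf_lines):
    """Yield header lines plus records whose FILTER is PASS or '.'.

    Assumption: '.' (no filters applied) counts as unfiltered.
    """
    for line in vcf_lines:
        # Meta-information and column-header lines are passed through.
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # FILTER is the 7th tab-separated column of a VCF record.
        if len(fields) > 6 and fields[6] in ("PASS", "."):
            yield line


# Hypothetical records for illustration only.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n",
    "chr1\t200\t.\tC\tT\t12\tQD2\t.\n",
    "chr1\t300\t.\tG\tA\t40\t.\t.\n",
]
kept = list(drop_filtered(example))
```

Running a downstream tool once on the original file and once on the reduced file, on a small dataset, is one way to carry out the comparison suggested above.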

  • foxDie00 Member

    Thanks for the reply! I will now manually remove filtered sites for downstream tools to be on the safe side.

  • daianagan Member

    Hi @shlee,
    I was wondering if --exclude-filtered also works for GATK4, as I was unable to make it work. Otherwise, do you know if there is an equivalent?
    Thank you!

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @daianagan

    Could you please provide some more information about what happened when it didn't work?

    What command was run?
    Which version of GATK4 are you using (we recently released an update)?
    And what error messages did you see?

    SelectVariants --exclude-filtered does work for GATK4, so having this information would help us troubleshoot your question.

  • daianagan Member

    Hello @AdelaideR, I must have mistyped the command, as it works now :smile: Thank you for your help!
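
For readers landing on this thread, the SelectVariants call discussed above could be assembled roughly as follows. This is a hedged sketch, not a command taken from any poster: the helper name `select_variants_command` and the output filename `knownSites.passOnly.vcf` are hypothetical, and it assumes the `gatk` wrapper script is on the PATH. The input filename comes from the original question, and `-V`, `-O` and `--exclude-filtered` are SelectVariants arguments.

```python
import shlex


def select_variants_command(in_vcf, out_vcf, gatk="gatk"):
    """Build an argument list for SelectVariants that drops filtered records."""
    return [
        gatk, "SelectVariants",
        "-V", in_vcf,            # input VCF with FILTER annotations
        "--exclude-filtered",    # skip records that failed a filter
        "-O", out_vcf,           # hypothetical output path
    ]


cmd = select_variants_command("knownSites.hardFilter.vcf",
                              "knownSites.passOnly.vcf")
# Print the command for copy-paste; subprocess.run(cmd, check=True)
# would execute it if GATK4 is installed.
print(shlex.join(cmd))
```

Building the command as a list avoids shell-quoting issues when paths contain spaces.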
