Exclude variants not genotyped in x out of y samples


I apologise if this has been answered already, I haven't managed to find the answer (I promise I tried). I have applied hard-filters to my HC-SNP data set as recommended by GATK as a starting point. However, I also want to filter by depth (I think this is no longer a popular option but variants with good coverage are surely still more reliable than variants with very low coverage?).
Ideally, what I want is a filtered vcf file that contains variants which are genotyped in the majority of my samples at a minimum depth. I think this is what the CoveredByNSampleSites Walker used to do, but it seems this walker has been removed?

So instead I worked around this using VariantFiltration at the sample level: --genotypeFilterExpression 'DP < 10' --genotypeFilterName DP-10 --setFilteredGtToNocall, so all samples with a depth of less than 10 for the any given variant were set to ./.. This seemed to work fine.

Now I want to exclude variants for which a large proportion of my samples have no genotype (./.). For this, I used:

SelectVariants --maxFilteredGenotypes 10.

However, when I check how many samples have no genotype using VariantsToTable -F NO-CALL, I still have many variants where 20-40 of my samples have no genotype! The --maxFilteredGenotypes option seems to work fine, as I definitely have far fewer variants with many missing genotypes. I believe this is due to the fact that these missing genotypes weren't 'filtered', but simply have not been genotyped in the HaplotypeCaller. I guess to exclude these, I would need an option along the lines of "--maxNOCALL" instead of the --maxFilteredGenotypes, but as far as I can tell, this does not exist?

I am sure there must be a way to do this and I simply haven't found it?

Many thanks for any pointers :)


Best Answers


  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    Hi Sarah,

    I think the easiest thing to do is use SelectVariants. For DP, you can use -select DP > "some number". For selecting sites in which most samples are variant and genotyped, you can use AF > "some number". The AF gives you the allele frequency of the alternate allele.

    I hope this helps.


  • SarahMSarahM JICMember

    Thanks for your answer. I had a look at the AF in my data set, and yes, in some cases the AF is higher when a large number of samples have no genotype (./.). However, I also have a larger AF in cases where a large number of samples are genotyped, but are homozygous (0/0)? I don't want to exclude these samples, so am not sure about using AF as a filter. I just want to make sure that in my final vcf, all variants listed are genotyped in for instance 60 out of 80 samples. I want to avoid using SNPs that have only been genotyped in 10 out of 80 samples for downstream analysis if that makes sense? Is there no direct way of doing so?

  • nkobmoonkobmoo ParisMember

    @Geraldine_VdAuwera said:
    FYI we have implemented the requested capability in development in the form of two new arguments: --maxNOCALLnumber and --maxNOCALLfraction which allow you to select sites where there is at most either a certain number or a certain fraction of no-call genotypes, respectively. This new functionality will be available in the nightly builds starting tomorrow and in the next official release (3.6).

    Hi @Geraldine_VdAuwera ,

    I'm also interested by these additional arguments. Do you know when the next official release will occur roughly ?

    Thank you very much in advances.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator


    I think we are hoping for a mid-May official 3.6 release. Have a look at Geraldine's April 8th response here :smile:


Sign In or Register to comment.