Exclude variants not genotyped in x out of y samples
I apologise if this has been answered already; I haven't managed to find the answer (I promise I tried). I have applied hard filters to my HC-SNP data set, as recommended by GATK as a starting point. However, I also want to filter by depth (I think this is no longer a popular option, but variants with good coverage are surely still more reliable than variants with very low coverage?).
Ideally, what I want is a filtered vcf file that contains variants which are genotyped in the majority of my samples at a minimum depth. I think this is what the CoveredByNSampleSites Walker used to do, but it seems this walker has been removed?
So instead I worked around this using VariantFiltration at the sample level: --genotypeFilterExpression 'DP < 10' --genotypeFilterName DP-10 --setFilteredGtToNocall, so that every sample with a depth of less than 10 at any given variant was set to no-call (./.). This seemed to work fine.
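To make the intended effect of that step concrete, here is a minimal Python sketch of the same idea applied to a single VCF data line. It is not how VariantFiltration is implemented; the function name `mask_low_depth` and the choice to leave samples with a missing DP untouched are assumptions made for illustration only.

```python
def mask_low_depth(vcf_line, min_dp=10):
    """Set GT to ./. for every sample whose per-sample DP is below min_dp.

    Hypothetical helper for illustration; assumes a tab-separated VCF data
    line with a FORMAT column (index 8) followed by one column per sample.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    fmt = fields[8].split(":")
    if "GT" not in fmt or "DP" not in fmt:
        return "\t".join(fields)  # nothing to mask on this line
    gt_i, dp_i = fmt.index("GT"), fmt.index("DP")
    for i in range(9, len(fields)):
        parts = fields[i].split(":")
        # Assumption: a missing or non-numeric DP is left as it is.
        if dp_i >= len(parts) or not parts[dp_i].isdigit():
            continue
        if int(parts[dp_i]) < min_dp:
            parts[gt_i] = "./."
            fields[i] = ":".join(parts)
    return "\t".join(fields)


example = "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:12\t0/0:3\t1/1:."
masked = mask_low_depth(example)
```

In this made-up example only the second sample (DP of 3) has its genotype replaced; the first sample passes the threshold and the third has no DP value to test.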
Now I want to exclude variants for which a large proportion of my samples have no genotype (./.). For this, I used:
SelectVariants --maxFilteredGenotypes 10.
However, when I check how many samples have no genotype using VariantsToTable -F NO-CALL, I still have many variants where 20-40 of my samples have no genotype! The --maxFilteredGenotypes option itself seems to work, as I definitely have far fewer variants with many missing genotypes than before. I believe the remaining ones are there because their missing genotypes weren't 'filtered' but were simply never genotyped by HaplotypeCaller. To exclude these, I guess I would need an option along the lines of "--maxNOCALL" instead of --maxFilteredGenotypes, but as far as I can tell, this does not exist?
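To spell out the filter I am after, here is a minimal Python sketch that counts no-call genotypes on each VCF data line, whatever their origin, and keeps only lines at or below a threshold. It is an illustration of the desired behaviour, not a GATK feature; the names `count_no_calls` and `keep_by_no_calls` are hypothetical, and the sketch assumes an uncompressed, tab-separated VCF with GT present in the FORMAT column.

```python
def count_no_calls(vcf_line):
    """Count samples whose genotype is entirely missing (., ./. or .|.)."""
    fields = vcf_line.rstrip("\n").split("\t")
    fmt = fields[8].split(":")
    gt_i = fmt.index("GT")
    no_calls = 0
    for sample in fields[9:]:
        parts = sample.split(":")
        # Trailing FORMAT fields may be dropped, so guard the index.
        gt = parts[gt_i] if gt_i < len(parts) else "."
        alleles = gt.replace("|", "/").split("/")
        if all(allele == "." for allele in alleles):
            no_calls += 1
    return no_calls


def keep_by_no_calls(lines, max_no_calls):
    """Yield header lines plus data lines with at most max_no_calls no-calls."""
    for line in lines:
        if line.startswith("#") or count_no_calls(line) <= max_no_calls:
            yield line


records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\tS4",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:12\t./.:3\t./.\t1|1:20",
    "chr1\t200\t.\tC\tT\t60\tPASS\t.\tGT:DP\t0/0:15\t0/1:11\t./.:2\t1/1:30",
]
kept = list(keep_by_no_calls(records, max_no_calls=1))
```

In this made-up example the first record has two no-call samples and is dropped, while the header and the second record (one no-call) are kept. The count does not distinguish genotypes set to ./. by filtering from those never called in the first place, which is the point.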
I am sure there must be a way to do this and I simply haven't found it?
Many thanks for any pointers.