Calling variants at known sites with HaplotypeCaller
Hi, all. We are calling variants on large numbers of dogs (WGS) of a variety of breeds using HaplotypeCaller followed by GenotypeGVCFs.
When we jointly call dogs of different breeds in one group, that goes well, but when we call a smaller number of dogs of a single breed, variants are missing from the final VCF. We suspect the missing variants are ones specific to other breeds. In other words, if we call variants on several Golden Retrievers, the pipeline keeps sites where those dogs differ from each other, as well as sites where they differ from the reference. However, German Shepherd-specific variants are lost, because they do not appear in the Goldens.
We would like to specify a list of variant sites of interest, built from the large number of sequenced samples we currently have available. We'd then use this list in our pipeline so that when we call a smaller number of dogs, all sites of interest are retained in that VCF, even if those samples are identical to each other and to the reference at a site. We would also retain variants new to these samples (so we would not be LIMITED to sites previously of interest).
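To make the desired behaviour concrete, here is a toy Python sketch of the output we are after. It is not GATK code and does not use any GATK parameters; the function name, the status labels and the coordinates are all made up for illustration.

```python
from typing import Dict, Set, Tuple

Site = Tuple[str, int]  # (chromosome, 1-based position); hypothetical coordinates


def sites_to_report(known_sites: Set[Site],
                    called_variants: Set[Site]) -> Dict[Site, str]:
    """Label every site we would want to see in the final VCF.

    known_sites     -- sites of interest from the large multi-breed call set
    called_variants -- sites that are variant in the small single-breed group
    """
    report: Dict[Site, str] = {}
    # Union: keep every known site AND every newly discovered site.
    for site in sorted(known_sites | called_variants):
        if site in known_sites and site in called_variants:
            report[site] = "known_variant"
        elif site in known_sites:
            # Invariant in this group, but we still want a record for it.
            report[site] = "known_invariant"
        else:
            report[site] = "novel_variant"
    return report


# Site list from the big cohort (assumed example data).
known = {("chr1", 100), ("chr1", 200)}
# Sites that are variant in a handful of Golden Retrievers (assumed example data).
goldens = {("chr1", 100), ("chr2", 50)}

report = sites_to_report(known, goldens)
for (chrom, pos), status in report.items():
    print(chrom, pos, status)
```

Here chr1:200 stands in for a German Shepherd-specific site: the Goldens are invariant there, but it is still reported, and chr2:50 is kept even though it was not on the original list.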
I've been struggling with the documentation and can't quite see how to do this, although there are a variety of parameters that are ALMOST what we want. What am I missing?
Currently using GATK 3.3, about to move to 4.0 and happy to find a 4.0 solution.