SelectVariants - Removing variants present in a sample reality check
I have a very small set non-model organism genomes, and have called SNPs following the Best Practices with
HaplotypeCaller. One of the genomes that I included is a re-sequenced version of the reference ancestral genome.
Surprisingly, all of the SNPs that were identified appeared to be present in the newly sequenced ancestor track as well; in fact looking at the joint-called VCF contig by contig I couldn't find a single SNP that wasn't present in the ancestor. A few of these make sense - for example looking at the bam alignment shows that some of the SNPs are almost exactly 50% split between two alternative bases, indicating that the reference may be heterozygous at that position. But before I accept that there are literally 0 new SNPs found I want to verify this with filtering; I wanted to select only variants that were not also called in the ancestor.
To do this, I created a new VCF containing only the ancestor, and then selected variants discordant from that ancestral VCF:
# First get only the ancestor java -jar $GATK -T SelectVariants \ -R $reference \ -V jointcalls.vcf \ -sn Ancestor \ -o jointcalls_AncestorOnly.vcf # Then get all variants that don't agree with ancestor (are in discordance) java -jar $GATK -T SelectVariants \ -R $reference \ -V jointcalls.vcf \ --discordance jointcalls_AncestorOnly.vcf \ -o jointcalls_NoAncestor.vcf
When I visualize this output w/ IGV and summarize it with VariantEval, both indicate that the file is empty - there are 0 calls. I think the interpretation here would be that literally every call made in any of the input genomes was also made in the ancestor. However, without even a single positive example to cross check, it's hard to trust that I didn't do something wrong o misunderstand the filtering.
Is this interpretation correct? Any sanity check would be appreciated. Thanks,