We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

SelectVariants - Removing variants present in a sample reality check

kastmankastman BostonMember


I have a very small set non-model organism genomes, and have called SNPs following the Best Practices with HaplotypeCaller. One of the genomes that I included is a re-sequenced version of the reference ancestral genome.

Surprisingly, all of the SNPs that were identified appeared to be present in the newly sequenced ancestor track as well; in fact looking at the joint-called VCF contig by contig I couldn't find a single SNP that wasn't present in the ancestor. A few of these make sense - for example looking at the bam alignment shows that some of the SNPs are almost exactly 50% split between two alternative bases, indicating that the reference may be heterozygous at that position. But before I accept that there are literally 0 new SNPs found I want to verify this with filtering; I wanted to select only variants that were not also called in the ancestor.

To do this, I created a new VCF containing only the ancestor, and then selected variants discordant from that ancestral VCF:

# First get only the ancestor
java -jar $GATK -T SelectVariants \
  -R $reference \
  -V jointcalls.vcf \
  -sn Ancestor \
  -o jointcalls_AncestorOnly.vcf

# Then get all variants that don't agree with ancestor (are in discordance)
java -jar $GATK -T SelectVariants \
  -R $reference \
  -V jointcalls.vcf \
  --discordance jointcalls_AncestorOnly.vcf \
  -o jointcalls_NoAncestor.vcf

When I visualize this output w/ IGV and summarize it with VariantEval, both indicate that the file is empty - there are 0 calls. I think the interpretation here would be that literally every call made in any of the input genomes was also made in the ancestor. However, without even a single positive example to cross check, it's hard to trust that I didn't do something wrong o misunderstand the filtering.

Is this interpretation correct? Any sanity check would be appreciated. Thanks,



Best Answers


  • kastmankastman BostonMember

    Hi Sheila,

    Thanks for the tip - I hadn't understood that hom_ref variants were still being included (makes sense now that they were still included as entries of no variance); adding -env was exactly what I needed. Now there are a few (although still not many) sites that are truly unique calls in the other strains.

    By the way, would you say this ancestral screen strategy is generally valid? Makes sense that I shouldn't be interpreting sites where the fasta reference is ambiguous, but are there other concerns I'm missing? Am I tossing out too much?


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    It depends if you care about genotypes or not (whether the newer individuals have a genotype different from the ancestor's for the same SNPs). If not then I think your procedure sounds fine.
  • kastmankastman BostonMember

    Could you elaborate just a bit on this point?

    I'm searching for SNPs that may be involved in mutation from an ancestral strain (in this case different phenotypes of pigmentation, spore production etc.), so I do want to be able to assess different strains on a SNP-by-SNP basis and use variation to support phenotypic change (or more likely, rule out genomic variation and provide support for the changes being epigenetic and not caused by SNPs in DNA).

    Seems like I do care about genotypes, but I think I should still be able to get this out of the suggested procedure. What limitations were you thinking of specifically? Thanks again!

  • kastmankastman BostonMember

    Ah, that makes sense - you're saying that I would need further filtering / analysis to find heterozygous vs. homozygous. Thanks for the clarification

Sign In or Register to comment.