It'd be nice if SelectVariants could update ALT to match the selected samples; for example, from ALT A,G to ALT A, or from ALT A to ALT . if the selected samples are hom-ref at that locus.
I'm pretty sure it does, if you specify -env
Thanks pdexheimer! But -env isn't what I'm asking for.
@blueskypy, what is your SelectVariants query, and what is your result?
Thanks Geraldine! Just a general question: if I select one sample from a VCF file of many samples, does SelectVariants update the ALT?
I believe it should. If you find that it doesn't, let me know.
We just tested this, and it turns out -env does the ALT cleanup in addition to dropping non-variants. We're going to look into the possibility of refactoring things to separate the two behaviors and give them separate flags.
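To make the "ALT cleanup" concrete, here is a minimal Python sketch of the idea, not GATK's actual implementation. The function name `trim_unused_alts` and the genotype representation (tuples of allele indices, 0 = REF, None = no-call) are assumptions made for illustration only.

```python
def trim_unused_alts(alts, genotypes):
    """Drop ALT alleles that no selected sample carries and renumber genotypes.

    alts:      list of ALT allele strings, e.g. ["A", "G"]
    genotypes: list of tuples of allele indices for the selected samples
               (0 = REF, 1.. = index into alts, None = no-call)
    Returns (new_alts, new_genotypes); new_alts is ["."] if no ALT remains.
    """
    # ALT indices that at least one selected sample actually carries
    used = sorted({a for gt in genotypes for a in gt if a is not None and a > 0})

    # Map old allele indices to new ones; REF and no-call stay unchanged
    remap = {0: 0, None: None}
    remap.update({old: new for new, old in enumerate(used, start=1)})

    new_alts = [alts[old - 1] for old in used]
    new_gts = [tuple(remap[a] for a in gt) for gt in genotypes]
    return (new_alts or ["."]), new_gts


# ALT A,G with one selected het sample carrying only A: G is dropped
print(trim_unused_alts(["A", "G"], [(0, 1)]))
# ALT A with one selected hom-ref sample: the site becomes non-variant
print(trim_unused_alts(["A"], [(0, 0)]))
```

With -env, a site like the second one would be dropped from the output entirely, since it is non-variant in the selected samples.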
Thanks, Geraldine! I really appreciate it!
A related question: I used to keep only variant sites in my VCF files by using -env. But then I realized that, for GenotypeConcordance to compute correctly, both files have to include non-variant sites as well. Is there a tool to add the non-variant sites, defined by a BED file, back to a VCF file?
No there isn't, sorry.
It seems nobody cares about those non-variant sites. Does that mean it's acceptable practice to use GenotypeConcordance without adding non-variant sites to the two compared files?
That is indeed the usual practice. Sites that are invariant in either the eval callset or the comp callset don't matter, since their "mismatchingness" is accounted for when counting the sensitivity and specificity metrics. By definition, we already know that their genotypes don't match. Genotype Concordance aims to determine whether the genotypes match or not for the sites that are identified as variant in both callsets. I am preparing some documentation to explain this, since it seems to be confusing to quite a few of our users.
But for the discrepancy table on this page, http://gatkforums.broadinstitute.org/discussion/48/using-varianteval
how do I get the values of cells 2, 3, 5, and 9 without the invariant sites in both files?
Well, that's a different analysis than GT Concordance, strictly speaking. One way to do it is to estimate those numbers based on the number of callable loci that aren't called in one, the other, or both.
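As a rough illustration of that estimate, the Python sketch below counts callable sites that are called in both callsets, in only one, or in neither. It is only a sketch under stated assumptions: the helper names are hypothetical, callable loci are assumed to come from BED-style 0-based half-open intervals, and called sites are assumed to be already extracted from each VCF as (chrom, pos) pairs. How these counts map onto specific cells of the discrepancy table is not specified here. Expanding intervals into individual positions is only practical for small regions.

```python
def callable_positions(bed_intervals):
    """Expand BED-style (chrom, start, end) intervals (0-based, half-open)
    into a set of 1-based (chrom, pos) sites, matching VCF coordinates."""
    return {
        (chrom, pos)
        for chrom, start, end in bed_intervals
        for pos in range(start + 1, end + 1)
    }


def site_overlap_counts(callable_sites, eval_sites, comp_sites):
    """Count callable sites called in both, one, or neither callset."""
    # Ignore calls that fall outside the callable regions
    eval_sites = eval_sites & callable_sites
    comp_sites = comp_sites & callable_sites
    return {
        "both": len(eval_sites & comp_sites),
        "eval_only": len(eval_sites - comp_sites),
        "comp_only": len(comp_sites - eval_sites),
        "neither": len(callable_sites - eval_sites - comp_sites),
    }


# Toy example: one 10 bp callable interval and two small callsets
callable_sites = callable_positions([("chr1", 0, 10)])
eval_sites = {("chr1", 2), ("chr1", 5)}
comp_sites = {("chr1", 5), ("chr1", 7)}
print(site_overlap_counts(callable_sites, eval_sites, comp_sites))
```

The "neither" count is the number of callable sites that are uncalled (presumed invariant) in both files, which is the quantity that cannot be read from the VCFs alone.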