Stratify comp rods by sample in VariantEval?

mlinderm Posts: 29 Member

Hi GATK Team,

I am a heavy user of the VariantEval evaluators (particularly GenotypeConcordance) and have tracked some unexpected results to the Sample stratification. What is the motivation for not stratifying the comp RODs by sample? This seems to be a very conscious choice, so I was hoping to understand the background behind it.

The relevant section is:

for ( final RodBinding<VariantContext> compRod : comps ) {
    // no sample stratification for comps
    final HashMap<String, Collection<VariantContext>> compSetHash = compVCs.get(compRod);
    final Collection<VariantContext> compSet = (compSetHash == null || compSetHash.size() == 0) ? Collections.<VariantContext>emptyList() : compVCs.get(compRod).values().iterator().next();

The effect for me is that many spurious genotypes get included in cases where there is a comp variant but no eval variant.




  • ebanks Broad Institute Posts: 698 Member, Administrator, Broadie, Moderator, Dev admin

    Just so we can understand and improve the code, could you please give us a concrete example of how this hurts your analysis and why you get spurious sites included? Keep in mind that GenotypeConcordance is no longer an evaluator module - it was pulled out and released as a standalone tool a while back - so please use an example with a different evaluator. Thanks!

    Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

  • mlinderm Posts: 29 Member

    Hi Eric,

    I am using a slightly customized copy of GenotypeConcordance that explicitly tracks genotypes from non-PASSing variants. I recognize that the evaluator has been deprecated (as an aside, what motivated the transition to a stand-alone walker?), and I understand if this falls outside "Ask the Team". But I thought it was worth asking to better understand the core VariantEval walker.

    I have been running the evaluator against many multi-sample "comp" and "eval" RODs that are all technical replicates (doing an all-pairs analysis), typically stratifying by some combination of filter, type and novelty. Just recently I needed to stratify by sample as well. I look at all sites in the comp and eval RODs. Because the VariantContext extracted from the comp ROD is not cut apart by sample when sample stratification is used, it retains all of its genotypes, which inflates the counts of comp genotypes not called in the eval ROD.

    An example with two samples:

    eval ROD:
    variant1 0/0 0/1

    comp ROD:
    variant2 0/0 0/1

    In this case, n_comp_HOM_REF_called_NO_CALL and n_comp_HET_called_NO_CALL would each be 1 for both sample1 and sample2, as opposed to 1 and 0 for sample1 and 0 and 1 for sample2.

    Does that example make sense? Basically, when there is no eval VariantContext and sample stratification is used, there are excess genotypes in the comp VariantContext.
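
    The two-sample example above can be reproduced with a small, self-contained sketch. It does not use the GATK classes: the comp record is modelled as a plain map from sample name to genotype class, and the class name CompOnlyCounts, the helpers compRecord and countForStratum, and the subsetToSample flag are all hypothetical names chosen for illustration. Only the counter names come from the example. The sketch assumes a site that has a comp record and no eval record.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; none of these names exist in GATK.
class CompOnlyCounts {

    // Hypothetical stand-in for the comp record "variant2": sample -> genotype class.
    static Map<String, String> compRecord() {
        Map<String, String> genotypes = new LinkedHashMap<>();
        genotypes.put("sample1", "HOM_REF"); // 0/0
        genotypes.put("sample2", "HET");     // 0/1
        return genotypes;
    }

    // Counts comp genotypes at a site with no eval record, for one sample stratum.
    // With subsetToSample == false every genotype in the comp record is counted
    // (the behaviour described in the post); with true, only the stratum's own sample.
    static Map<String, Integer> countForStratum(Map<String, String> compGenotypes,
                                                String stratumSample,
                                                boolean subsetToSample) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, String> entry : compGenotypes.entrySet()) {
            if (subsetToSample && !entry.getKey().equals(stratumSample)) {
                continue; // genotype belongs to a different sample's stratum
            }
            counts.merge("n_comp_" + entry.getValue() + "_called_NO_CALL", 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> comp = compRecord();
        for (String sample : comp.keySet()) {
            System.out.println(sample + " without subsetting: " + countForStratum(comp, sample, false));
            System.out.println(sample + " with subsetting:    " + countForStratum(comp, sample, true));
        }
    }
}
```

    Without subsetting, both strata report both counters as 1. With subsetting, each stratum reports only its own sample's genotype, and a counter that never fires is simply absent from the map (that is, 0).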


  • ebanks Broad Institute Posts: 698 Member, Administrator, Broadie, Moderator, Dev admin

    Hey Michael,

    The motivation for moving the GenotypeConcordance evaluator into its own standalone tool was precisely the one you bring up in this thread: it had adverse interactions with some of the stratifications in certain contexts/situations. So instead of managing that headache (poorly), we decided to be safe and move it out.

    I'm 99% sure that Chris (the author of the new and improved GenotypeConcordance tool) took cases like yours into account when implementing it (and a cursory review of the code seems to confirm it). Could you try rerunning your analysis with it instead of VariantEval? And please continue to give such helpful feedback.

    Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT
