HaplotypeCaller ERC option


I would like to know whether HaplotypeCaller can emit all sites when the input contains multiple samples. I tried the emitRefConfidence option but got an error, because this option can only be used with a single sample. Knowing the total number of invariant as well as variant sites is very important for calculating general population genetic parameters. Is there a way of doing this that I am missing?

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Hi Seanna,

    There is no longer any option to emit all sites in HaplotypeCaller, sorry. We consider that the number of invariant sites is simply the total number of sites visited in the analysis (which should be available to you based on the length of your reference, subset as appropriate to regions of interest) minus the sites called as variant.
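
    For illustration, the subtraction described above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the function name, the region tuples and the example coordinates are hypothetical and not part of GATK, and the regions are assumed to be 1-based, inclusive and non-overlapping.

```python
def count_invariant_sites(regions, variant_positions):
    """Estimate invariant sites as (sites visited) - (sites called variant).

    regions: list of (chrom, start, end), 1-based inclusive, assumed non-overlapping.
    variant_positions: set of (chrom, pos) for sites called as variant.
    Both inputs are hypothetical stand-ins for an interval list and a call set.
    """
    # Total number of sites visited across all regions of interest.
    total = sum(end - start + 1 for _, start, end in regions)
    # Only count variant calls that fall inside the regions of interest.
    variant = sum(
        1
        for chrom, pos in variant_positions
        if any(c == chrom and s <= pos <= e for c, s, e in regions)
    )
    return total - variant


regions = [("chr1", 1, 1000), ("chr2", 101, 200)]
variants = {("chr1", 10), ("chr1", 500), ("chr2", 150), ("chr2", 999)}
print(count_invariant_sites(regions, variants))  # 1097
```

    Note that this treats every visited site without a variant call as invariant, whether or not it had enough coverage to be called.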

  • DaphGenomes, Edinburgh (Member)

    Hi Geraldine;
    How disappointing that this option has been removed. The ERC option sounds like it would calculate what I am looking for, but only when there is a single sample. Are there plans to extend it to multiple samples?
    As you suggested, I had considered estimating the number of invariant sites by subtracting the number of variant sites from the total number of sites. However, I worry that this estimate ignores sites that lack sufficient depth or mapping/base quality to be called either way. Do you have any suggestions for addressing this (potential) problem?

  • rwness (Member)

    I have to agree with DaphGenomes. Anyone who needs to know the number of invariant sites for their analysis and wants to apply quality filters to these sites cannot use HaplotypeCaller. I think this includes a large part of the population genetics community. For many analyses, if you apply any filtering to variant sites and do not apply equivalent filters to invariant sites, your analysis is biased and probably wrong.
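
    To make the bias concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `Site` record, the depth and quality thresholds, and the five example sites are invented for illustration and do not correspond to any GATK output or recommended cutoff. It compares the naive estimate (all sites minus filtered variants) with a count in which the same filter is applied to both site classes.

```python
from dataclasses import dataclass


@dataclass
class Site:
    pos: int
    depth: int        # read depth at the site (hypothetical field)
    qual: float       # some per-site quality score (hypothetical field)
    is_variant: bool


def passes(site, min_depth=10, min_qual=30.0):
    # One filter, applied identically to variant and invariant sites.
    return site.depth >= min_depth and site.qual >= min_qual


def filtered_counts(sites, min_depth=10, min_qual=30.0):
    """Return (variant, invariant) counts among sites that pass the filter."""
    kept = [s for s in sites if passes(s, min_depth, min_qual)]
    n_variant = sum(s.is_variant for s in kept)
    return n_variant, len(kept) - n_variant


sites = [
    Site(1, 25, 50.0, False),
    Site(2, 3, 50.0, False),   # invariant, but too shallow to trust
    Site(3, 30, 60.0, True),
    Site(4, 40, 12.0, True),   # variant, but fails the quality cutoff
    Site(5, 18, 45.0, False),
]

n_variant, n_invariant = filtered_counts(sites)
naive_invariant = len(sites) - n_variant  # filters variants only

print(n_variant, n_invariant)  # 1 2
print(naive_invariant)         # 4
```

    In this toy case the naive estimate counts four invariant sites where only two are supported by the same evidence thresholds used for the variants.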

  • I also agree with DaphGenomes. The total number of invariant sites is crucial for accurate analysis.

  • ebanks, Broad Institute (Member, Broadie, Dev)

    Hi everyone,
    I think there may be some misunderstanding here. The ERC option in the HaplotypeCaller is far superior to the EMIT_ALL_SITES option in the UnifiedGenotyper. The UG version will not produce output records at positions with no usable coverage, while the HC version can produce records at every position of the genome, and its likelihoods are much more accurate (because they try to model all possible variation, not just SNPs).
    The only outstanding issue to date is that it hasn't been implemented for multiple samples at once. The reason is that we are moving away from the model of requiring all samples together for discovery and for generating genotype likelihoods, and the ERC option is a key component of that move. We are hoping to have the whole pipeline finished up in the next month or two.
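
    As a rough sketch of how single-sample ERC output could be used for counting, the Python below tallies confidently homozygous-reference sites. It rests on assumptions that should be checked against your own files: that the output follows the gVCF layout in which reference blocks carry `<NON_REF>` as the only ALT allele, an `END` key in INFO, and a `GQ` value in the sample column. The function name, the GQ threshold of 20 and the example records are hypothetical.

```python
def count_confident_ref_sites(gvcf_lines, min_gq=20):
    """Count positions covered by reference blocks with GQ >= min_gq.

    Assumes single-sample, tab-delimited gVCF-style records (see lead-in).
    """
    total = 0
    for line in gvcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        pos, alt, info, fmt, sample = (
            fields[1], fields[4], fields[7], fields[8], fields[9]
        )
        if alt != "<NON_REF>":
            continue  # a variant record, not a reference block
        # INFO may be "." or a ;-separated list of KEY=VALUE pairs.
        info_map = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
        end = int(info_map.get("END", pos))  # single-site record if END absent
        sample_map = dict(zip(fmt.split(":"), sample.split(":")))
        gq = sample_map.get("GQ", ".")
        if gq != "." and int(gq) >= min_gq:
            total += end - int(pos) + 1
    return total


example = [
    "##fileformat=VCFv4.1",
    "chr1\t1\t.\tA\t<NON_REF>\t.\t.\tEND=100\tGT:DP:GQ:MIN_DP:PL\t0/0:30:60:25:0,60,900",
    "chr1\t101\t.\tG\tT,<NON_REF>\t500\t.\t.\tGT:AD:DP:GQ:PL\t0/1:10,12,0:22:99:300,0,250,330,280,600",
    "chr1\t102\t.\tC\t<NON_REF>\t.\t.\tEND=150\tGT:DP:GQ:MIN_DP:PL\t0/0:2:3:1:0,3,30",
]
print(count_confident_ref_sites(example))  # 100
```

    Here the low-confidence block at positions 102 to 150 is excluded, so poorly covered sites are not silently counted as invariant.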

  • DaphGenomes, Edinburgh (Member)

    Dear Ebanks;
    Many thanks for this clarification. This is excellent news! Can't wait for the implementation.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    My apologies to anyone who was alarmed by my incomplete answer.

