Why is HaplotypeCaller's '--emitRefConfidence' outputting 'NON_REF' in addition to other ALTs?

CarlosBorrotoCarlosBorroto Posts: 33Member
edited March 18 in Ask the GATK team

Hi,

When using HaplotypeCaller's '--emitRefConfidence', with both GVCF and BP_RESOLUTION, we get lines like this one:

1  2337032 rs1129171   C   T,<NON_REF> 480.77  .   BaseQRankSum=0.218;ClippingRankSum=0.103;DB;DP=45;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-1.344;ReadPosRankSum=1.046   GT:AD:DP:GQ:PL:SB   0/1:19,26,0:45:99:509,0,330,565,407,97
2:9,10,13,13

The new 3.0 tool GenotypeGVCFs has no problem processing these lines. However CombineVariants fails to run with this message:

##### ERROR MESSAGE: Reference records should not have more than one alternate allele

By looking at the relevant code:

public/gatk-framework/src/main/java/org/broadinstitute/sting/utils/variant/GATKVariantContextUtils.java

if( vc.hasAllele(NON_REF_SYMBOLIC_ALLELE) ) {
                if( vc.getAlternateAlleles().size() > 1 ) { throw new IllegalStateException("Reference records should not have more than one alternate allele"); }

Sounds like the presence of <NON_REF> in lines with ALTs is triggering this exception.

I also looked at this forum entry and the example there doesn't have <NON_REF> in lines with ALTs.

"Using the reference confidence calculation mode & generating a GVCF"

Should I be able to run CombineVariants on files produced using '--emitRefConfidence'?

Thanks, Carlos PS: I tried to include links to github and GATK forum, but the spam filter seems not to like that.

Post edited by CarlosBorroto on

Comments

  • pdexheimerpdexheimer Posts: 360Member, GSA Collaborator ✭✭✭

    There's a specialized tool called CombineGVCFs - I suspect that you've found the biggest difference between it and CombineVariants

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    This particular issue is probably a bug, because this error shouldn't happen in principle, but the big picture is that you're not supposed to run CombineVariants on gVCFs produced by HaplotypeCaller anyway. As @pdexheimer points out, you should use the specialized tool called CombineGVCFs for this purpose. We'll try to clarify this in upcoming docs.

    Geraldine Van der Auwera, PhD

  • CarlosBorrotoCarlosBorroto Posts: 33Member
    edited March 18

    Ah, I don't know why I didn't think of trying CombineGVCFs immediately. I got that recommendation now from 3 different sources.

    I now tried it and I ran into this issue:

    ##### ERROR MESSAGE: Key END found in VariantContext field INFO at 1:2336230 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.

    Our downstream analysis doesn't currently support gVCF, we need to run HaplotypeCaller's '--emitRefConfident' with 'BP_RESOLUTION'. My follow up question is, does CombineGVCFs can run on files produced with 'BP_RESOLUTION' instead of 'GVCF'?.

    The odd thing here is I checked and no line has the 'INFO' key 'END'. If I have to guess CombineGVCFs is wrongly assuming every line has one and is complaining in advance.

    Thanks for your help!, Carlos

    Post edited by CarlosBorroto on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    You need to follow the instructions in this document: http://www.broadinstitute.org/gatk/guide/article?id=3893

    The GenotypeGVCFs tool in the next step will produce a normal VCF that you can use in your downstream analysis.

    Geraldine Van der Auwera, PhD

  • CarlosBorrotoCarlosBorroto Posts: 33Member

    Hi @Geraldine_VdAuwera‌ ,

    I think I didn't explain myself in my last comment. For our downstream purposes we cannot use '--emitRefConfidence GVCF' we need to use '--emitRefConfidence BP_RESOLUTION'. The tool CombineGVCFs(step 2 in the document you link) is failing to combine VCF files produced using '--emitRefConfidence BP_RESOLUTION'.

    We are in the process of modifying our downstream tools to support gVCF format. In the meantime we would like to still be able to use the new joint calling protocol. I believe the error I mentioned from CombineGVCFs is a bug. At a minimum the error message is misleading, as there is not END key in the INFO field referenced by CombineGVCFs.

    Thanks, Carlos

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Yes, the error message is misleading because we did not expect people would try to run CombineGVCFs on files not generated by GVCF mode, so the case is not being handled correctly. This is a bug we'll need to fix. But this is a separate issue from the requirement to run in GVCF mode. CombineGVCFs is specifically designed to handle files generated by HC in GVCF mode. It will not work on anything else. To put it another way, the new protocol will not work if you don't use the GVCF mode. I'm sorry if that is incompatible with your current workflow, but there's nothing we can do about that.

    Geraldine Van der Auwera, PhD

  • CarlosBorrotoCarlosBorroto Posts: 33Member

    Thanks.

    It is interesting you think CombineGVCFs can't deal with BP_RESOLUTION files, as adding the END INFO key(even when is not used in any line) to the header is all I need to use these files with it.

    In any case, we also notice 3.1-1 added the option "--convertToBasePairResolution" to CombineGVCFs. This option solves our need to have an expanded gVCF after the combine step.

    Thanks again.

    PS: How can I mark a question answered? I would like to mark @pdexheimer answer as correct for my initial question.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Then that is my mistake; these tools were released while I was away so I may not be completely up-to-date on their usage. My apologies for the misleading answer. I'm worried though that GenotypeGVCFs (the next step in the protocol) might have issues with the BP_RESOLUTION files. Or not. Can you please let me know either way?

    Geraldine Van der Auwera, PhD

  • CarlosBorrotoCarlosBorroto Posts: 33Member

    Hi @Geraldine_VdAuwera

    I can confirm I can use CombineGVCFs 3.1-1 option "--convertToBasePairResolution" and then run GenotypeGVCFs on the resulting file.

    For the record, as far as I can tell the only different between BP_RESOLUTION and GVCF files is the use of the END INFO key to define how many positions are represented by each line. Variant positions are always representing the amount of positions in the variant and they don't have/need an END key. I assumed, I think correctly, that any tool able to read GVCF should be able to read BP_RESOLUTION.

    Best, Carlos

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Great, thanks Carlos. That's right, GVCF and BP_RESOLUTION contain the same information, but the GVCF is banded (i.e. consecutive REF loci with similar values are collapsed into single records) for size and efficiency. We are pushing for people to use GVCF because they are better for performance, and we forgot to test BP_RESOLUTION in the new pipeline, hence the glitches you experienced. I jumped the gun when I told you BP_RESOLUTION wouldn't work; in principle it's fine, it's just not the recommended option. I'm going to blame jetlag ;)

    Geraldine Van der Auwera, PhD

  • dsenalikdsenalik Posts: 5Member

    I too had this problem, same error as message #3 above, and found this thread via Google. Thanks for the info. My meager contribution is, the current nightly build no longer has this bug.

  • HasaniHasani GermanyPosts: 12Member
    edited August 19

    Hi,

    for confirmation purposes, could you please agree that seeing line [1] means that the mutation of ID rs103 has one non-ref allele A of depth138 and other non-ref which are not listed of depth 0, while the ref-allele T has also depth 0?

    Thanks!

    [1]

    chr1 2376 rs103 T A,<NON_REF> 4149.77 . DB;DP=138;MLEAC=2,0;MLEAF=1.00,0.00;MQ=41.55;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,138,0:138:99:4183,416,0,4183,416,4183:0,0,0,0

    Post edited by Hasani on
  • SheilaSheila Broad InstitutePosts: 540Member, GATK Developer, Broadie, Moderator admin
Sign In or Register to comment.