The current GATK version is 3.4-46

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# Why is HaplotypeCaller's '--emitRefConfidence' outputting 'NON_REF' in addition to other ALTs?

Posts: 40Member
edited March 2014

Hi,

When using HaplotypeCaller's '--emitRefConfidence', with both GVCF and BP_RESOLUTION, we get lines like this one:

1  2337032 rs1129171   C   T,<NON_REF> 480.77  .   BaseQRankSum=0.218;ClippingRankSum=0.103;DB;DP=45;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-1.344;ReadPosRankSum=1.046   GT:AD:DP:GQ:PL:SB   0/1:19,26,0:45:99:509,0,330,565,407,97
2:9,10,13,13

The new 3.0 tool GenotypeGVCFs has no problem processing these lines. However CombineVariants fails to run with this message:

##### ERROR MESSAGE: Reference records should not have more than one alternate allele

By looking at the relevant code:

            if( vc.hasAllele(NON_REF_SYMBOLIC_ALLELE) ) {
if( vc.getAlternateAlleles().size() > 1 ) { throw new IllegalStateException("Reference records should not have more than one alternate allele"); }


Sounds like the presence of <NON_REF> in lines with ALTs is triggering this exception.

I also looked at this forum entry and the example there doesn't have <NON_REF> in lines with ALTs.

"Using the reference confidence calculation mode & generating a GVCF"

Should I be able to run CombineVariants on files produced using '--emitRefConfidence'?

Thanks,
Carlos
PS: I tried to include links to github and GATK forum, but the spam filter seems not to like that.

Post edited by CarlosBorroto on
Tagged:

• Posts: 502Member, GATK Dev, DSDE Dev mod

There's a specialized tool called CombineGVCFs - I suspect that you've found the biggest difference between it and CombineVariants

This particular issue is probably a bug, because this error shouldn't happen in principle, but the big picture is that you're not supposed to run CombineVariants on gVCFs produced by HaplotypeCaller anyway. As @pdexheimer points out, you should use the specialized tool called CombineGVCFs for this purpose. We'll try to clarify this in upcoming docs.

Geraldine Van der Auwera, PhD

• Posts: 40Member
edited March 2014

Ah, I don't know why I didn't think of trying CombineGVCFs immediately. I got that recommendation now from 3 different sources.

I now tried it and I ran into this issue:

##### ERROR MESSAGE: Key END found in VariantContext field INFO at 1:2336230 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.

Our downstream analysis doesn't currently support gVCF, we need to run HaplotypeCaller's '--emitRefConfident' with 'BP_RESOLUTION'. My follow up question is, does CombineGVCFs can run on files produced with 'BP_RESOLUTION' instead of 'GVCF'?.

The odd thing here is I checked and no line has the 'INFO' key 'END'. If I have to guess CombineGVCFs is wrongly assuming every line has one and is complaining in advance.

Carlos

Post edited by CarlosBorroto on

The GenotypeGVCFs tool in the next step will produce a normal VCF that you can use in your downstream analysis.

Geraldine Van der Auwera, PhD

• Posts: 40Member

I think I didn't explain myself in my last comment. For our downstream purposes we cannot use '--emitRefConfidence GVCF' we need to use '--emitRefConfidence BP_RESOLUTION'. The tool CombineGVCFs(step 2 in the document you link) is failing to combine VCF files produced using '--emitRefConfidence BP_RESOLUTION'.

We are in the process of modifying our downstream tools to support gVCF format. In the meantime we would like to still be able to use the new joint calling protocol. I believe the error I mentioned from CombineGVCFs is a bug. At a minimum the error message is misleading, as there is not END key in the INFO field referenced by CombineGVCFs.

Thanks,
Carlos

Yes, the error message is misleading because we did not expect people would try to run CombineGVCFs on files not generated by GVCF mode, so the case is not being handled correctly. This is a bug we'll need to fix. But this is a separate issue from the requirement to run in GVCF mode. CombineGVCFs is specifically designed to handle files generated by HC in GVCF mode. It will not work on anything else. To put it another way, the new protocol will not work if you don't use the GVCF mode. I'm sorry if that is incompatible with your current workflow, but there's nothing we can do about that.

Geraldine Van der Auwera, PhD

• Posts: 40Member

Thanks.

It is interesting you think CombineGVCFs can't deal with BP_RESOLUTION files, as adding the END INFO key(even when is not used in any line) to the header is all I need to use these files with it.

In any case, we also notice 3.1-1 added the option "--convertToBasePairResolution" to CombineGVCFs. This option solves our need to have an expanded gVCF after the combine step.

Thanks again.

PS: How can I mark a question answered? I would like to mark @pdexheimer answer as correct for my initial question.

Then that is my mistake; these tools were released while I was away so I may not be completely up-to-date on their usage. My apologies for the misleading answer. I'm worried though that GenotypeGVCFs (the next step in the protocol) might have issues with the BP_RESOLUTION files. Or not. Can you please let me know either way?

Geraldine Van der Auwera, PhD

• Posts: 40Member

I can confirm I can use CombineGVCFs 3.1-1 option "--convertToBasePairResolution" and then run GenotypeGVCFs on the resulting file.

For the record, as far as I can tell the only different between BP_RESOLUTION and GVCF files is the use of the END INFO key to define how many positions are represented by each line. Variant positions are always representing the amount of positions in the variant and they don't have/need an END key. I assumed, I think correctly, that any tool able to read GVCF should be able to read BP_RESOLUTION.

Best,
Carlos

Great, thanks Carlos. That's right, GVCF and BP_RESOLUTION contain the same information, but the GVCF is banded (i.e. consecutive REF loci with similar values are collapsed into single records) for size and efficiency. We are pushing for people to use GVCF because they are better for performance, and we forgot to test BP_RESOLUTION in the new pipeline, hence the glitches you experienced. I jumped the gun when I told you BP_RESOLUTION wouldn't work; in principle it's fine, it's just not the recommended option. I'm going to blame jetlag

Geraldine Van der Auwera, PhD

• Posts: 5Member

I too had this problem, same error as message #3 above, and found this thread via Google. Thanks for the info. My meager contribution is, the current nightly build no longer has this bug.

• GermanyPosts: 28Member
edited August 2014

Hi,

for confirmation purposes, could you please agree that seeing line [1] means that the mutation of ID rs103 has one non-ref allele A of depth138 and other non-ref which are not listed of depth 0, while the ref-allele T has also depth 0?

Thanks!

[1]

chr1 2376 rs103 T A,<NON_REF> 4149.77 . DB;DP=138;MLEAC=2,0;MLEAF=1.00,0.00;MQ=41.55;MQ0=0
GT:AD:DP:GQ:PL:SB 1/1:0,138,0:138:99:4183,416,0,4183,416,4183:0,0,0,0

Post edited by Hasani on