We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Homozygous reference genotype is called in native mode but uncalled in HaplotypeCaller ERC modes

Hi,
I'm having an issue with HaplotypeCaller in -ERC GVCF mode. Our pipeline uses GATK Best Practices (version 3.5) with GVCF mode and normally we don't have many issues, but with this particular dataset, I'm having an issue. I notice that a lot of genotypes that should be homozygous reference show up with ./.
(uncalled) genotype in the final VCF after the GenotypeGVCFs step. This is also reflected in the gVCF output of HaplotypeCaller, where with GQ = 0 for the band containing this variant.
Here is one example of a variant with lots of ./.
genotypes after GenotypeGVCFs (most variants appear this way):
1 949831 . C T 1292.70 PASS AC=3;AF=0.500;AN=6;BaseQRankSum=-6.980e-01;ClippingRankSum=-1.132e+00;DP=2023;ExcessHet=6.9897;FS=4.475;MLEAC=3;MLEAF=0.500;MQ=14.73;MQRankSum=-4.510e-01;POSITIVE_TRAIN_SITE;QD=10.60;ReadPosRankSum=-9.420e-01;SOR=0.375;VQSLOD=2.24;culprit=ReadPosRankSum GT:AD:DP:GQ:PL ./.:22,0:22 ./.:22,0:22 ./.:24,0:24 ./.:20,0:20 ./.:21,0:21 ./.:24,0:24 ./.:28,0:28 ./.:13,0:13 ./.:20,0:20./.:15,0:15 ./.:19,0:19 ./.:26,0:26 ./.:21,0:21 ./.:26,0:26 ./.:10,0:10 ./.:15,0:15 ./.:25,0:25 ./.:21,0:21 ./.:15,0:15 ./.:27,0:27 ./.:29,0:29 ./.:16,0:16 ./.:17,0:17 ./.:31,0:31./.:21,0:21 ./.:18,0:18 0/1:26,20:46:99:550,0,817 ./.:18,0:18 0/1:22,24:46:99:606,0,617 0/1:22,8:30:99:163,0,648 ./.:26,0:26 ./.:19,0:19 ./.:25,0:25 ./.:24,0:24 ./.:13,0:13 ./.:15,0:15./.:28,0:28 ./.:23,0:23 ./.:13,0:13 ./.:17,0:17 ./.:22,0:22 ./.:23,0:23 ./.:15,0:15 ./.:27,0:27 ./.:32,0:32 ./.:18,0:18 ./.:16,0:16 ./.:19,0:19 ./.:21,0:21 ./.:17,0:17 ./.:21,0:21./.:20,0:20 ./.:22,0:22 ./.:19,0:19 ./.:26,0:26 ./.:32,0:32 ./.:22,0:22 ./.:27,0:27 ./.:12,0:12 ./.:26,0:26 ./.:23,0:23 ./.:18,0:18 ./.:29,0:29 ./.:22,0:22 ./.:22,0:22 ./.:19,0:19./.:25,0:25 ./.:15,0:15 ./.:22,0:22 ./.:23,0:23 ./.:18,0:18 ./.:30,0:30 ./.:20,0:20 ./.:24,0:24 ./.:18,0:18 ./.:14,0:14 ./.:19,0:19 ./.:14,0:14 ./.:17,0:17 ./.:17,0:17 ./.:12,0:12./.:15,0:15 ./.:27,0:27 ./.:14,0:14 ./.:14,0:14 ./.:15,0:15 ./.:18,0:18 ./.:20,0:20 ./.:25,0:25 ./.:21,0:21 ./.:16,0:16 ./.:23,0:23 ./.:19,0:19 ./.:18,0:18 ./.:31,0:31
Some of these have relatively low depth, I admit, but some do have a decent depth, e.g., several with a depth of 31 or 32.
Here is the relevant line in the original gVCF for the first sample in the merged VCF (NIAID-1113):
1 949655 . C <NON_REF> . . END=949924 GT:DP:GQ:MIN_DP:PL 0/0:47:0:22:0,0,0
Because of the low MIN_DP for this region (22), and the low GQ, I this might be related to the -stand_call_conf
option (we set -stand_emit_conf 10 -stand_call_conf 30
in our pipeline), I reran the analysis setting -stand_call_conf
to 10 in both HaplotypeCaller and GenotypeGVCFs, but I got the same results.
I noticed there were some updates in GATK 3.7 related to -stand_call_conf
options, so I retried using GATK 3.7 (again with -stand_emit_conf
and -stand_call_conf
set to 10) and I got the same results. I tried adding -forceActive and I still see GQ = 0 in the gVCF file, though the band boundaries changed, as well as the MIN_DP:
1 949828 . A <NON_REF> . . END=949846 GT:DP:GQ:MIN_DP:PL 0/0:50:0:48:0,0,0
I also tried GATK 3.7 in -ERC BP_RESOLUTION mode (with -stand_emit_conf
and -stand_call_conf
set to 10 and not selecting -forceActive) to see if there was a problem with grouping variants into bands and I still see GQ = 0 and PL = 0,0,0 in the gVCF output for the first sample:
1 949831 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:46,2:48:0:0,0,0
And the genotype is uncalled in the final VCF after GenotypeGVCFs:
1 949831 . C T 1292.70 PASS AC=3;AF=0.500;AN=6;BaseQRankSum=-6.590e-01;ClippingRankSum=-1.132e+00;DP=2049;ExcessHet=6.9897;FS=4.475;MLEAC=3;MLEAF=0.500;MQ=14.64;MQRankSum=0.00;QD=10.60;ReadPosRankSum=-8.830e-01;SOR=0.375 GT:AD:DP:GQ:PL ./.:46,2:48:.:0,0,0 ./.:22,0:22:.:0,0,0 ./.:24,0:24:.:0,0,0 ./.:20,0:20:.:0,0,0 ./.:21,0:21:.:0,0,0 ./.:24,0:24:.:0,0,0 ./.:28,0:28:.:0,0,0 ./.:13,0:13:.:0,0,0 ./.:20,0:20:.:0,0,0 ./.:15,0:15:.:0,0,0 ./.:19,0:19:.:0,0,0 ./.:26,0:26:.:0,0,0 ./.:21,0:21:.:0,0,0 ./.:26,0:26:.:0,0,0 ./.:10,0:10:.:0,0,0 ./.:15,0:15:.:0,0,0 ./.:25,0:25:.:0,0,0 ./.:21,0:21:.:0,0,0 ./.:15,0:15:.:0,0,0 ./.:27,0:27:.:0,0,0 ./.:29,0:29:.:0,0,0 ./.:16,0:16:.:0,0,0 ./.:17,0:17:.:0,0,0 ./.:31,0:31:.:0,0,0 ./.:21,0:21:.:0,0,0 ./.:18,0:18:.:0,0,0 0/1:26,20:46:99:550,0,817 ./.:18,0:18:.:0,0,0 0/1:22,24:46:99:606,0,617 0/1:22,8:30:99:163,0,648 ./.:26,0:26:.:0,0,0 ./.:19,0:19:.:0,0,0 ./.:25,0:25:.:0,0,0 ./.:24,0:24:.:0,0,0 ./.:13,0:13:.:0,0,0 ./.:15,0:15:.:0,0,0 ./.:28,0:28:.:0,0,0 ./.:23,0:23:.:0,0,0 ./.:13,0:13:.:0,0,0 ./.:17,0:17:.:0,0,0 ./.:22,0:22:.:0,0,0 ./.:23,0:23:.:0,0,0 ./.:15,0:15:.:0,0,0 ./.:27,0:27:.:0,0,0 ./.:32,0:32:.:0,0,0 ./.:18,0:18:.:0,0,0 ./.:16,0:16:.:0,0,0 ./.:19,0:19:.:0,0,0 ./.:21,0:21:.:0,0,0 ./.:17,0:17:.:0,0,0 ./.:21,0:21:.:0,0,0 ./.:20,0:20:.:0,0,0 ./.:22,0:22:.:0,0,0 ./.:19,0:19:.:0,0,0 ./.:26,0:26:.:0,0,0 ./.:32,0:32:.:0,0,0 ./.:22,0:22:.:0,0,0 ./.:27,0:27:.:0,0,0 ./.:12,0:12:.:0,0,0 ./.:26,0:26:.:0,0,0 ./.:23,0:23:.:0,0,0 ./.:18,0:18:.:0,0,0 ./.:29,0:29:.:0,0,0 ./.:22,0:22:.:0,0,0 ./.:22,0:22:.:0,0,0 ./.:19,0:19:.:0,0,0 ./.:25,0:25:.:0,0,0 ./.:15,0:15:.:0,0,0 ./.:22,0:22:.:0,0,0 ./.:23,0:23:.:0,0,0 ./.:18,0:18:.:0,0,0 ./.:30,0:30:.:0,0,0 ./.:20,0:20:.:0,0,0 ./.:24,0:24:.:0,0,0 ./.:18,0:18:.:0,0,0 ./.:14,0:14:.:0,0,0 ./.:19,0:19:.:0,0,0 ./.:14,0:14:.:0,0,0 ./.:17,0:17:.:0,0,0 ./.:17,0:17:.:0,0,0 ./.:12,0:12:.:0,0,0 ./.:15,0:15:.:0,0,0 ./.:27,0:27:.:0,0,0 ./.:14,0:14:.:0,0,0 ./.:14,0:14:.:0,0,0 ./.:15,0:15:.:0,0,0 ./.:18,0:18:.:0,0,0 ./.:20,0:20:.:0,0,0 ./.:25,0:25:.:0,0,0 ./.:21,0:21:.:0,0,0 ./.:16,0:16:.:0,0,0 ./.:23,0:23:.:0,0,0 ./.:19,0:19:.:0,0,0 ./.:18,0:18:.:0,0,0 ./.:31,0:31:.:0,0,0
We also see from this that the depth for some of these samples is even higher than we expected -- the first ample has a depth of 48 instead of 22, and is still called as ./.
.
Then I decided to try HaplotypeCaller in native mode and here is the same line in the final VCF:
1 949831 rs116002608 C T 1248.20 PASS AC=3;AF=0.010;AN=190;BaseQRankSum=0.022;ClippingRankSum=0.193;DB;DP=4985;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=-0.0101;MLEAC=3;MLEAF=0.016;MQ=60.00;MQRankSum=-1.802;QD=10.32;ReadPosRankSum=-1.925;SOR=0.604 GT:AD:DP:GQ:PL 0/0:41,0:41:99:0,123,1381 0/0:44,0:44:99:0,132,1443 0/0:56,0:56:99:0,167,1885 0/0:45,0:45:99:0,135,1512 0/0:44,1:45:99:0,102,1469 0/0:57,0:57:99:0,170,1889 0/0:71,0:71:99:0,212,2306 0/0:44,0:44:99:0,132,1407 0/0:52,0:52:99:0,154,1623 0/0:46,0:46:99:0,138,1534 0/0:42,0:42:99:0,125,1313 0/0:52,0:52:99:0,156,1770 0/0:62,0:62:99:0,184,2012 0/0:53,0:53:99:0,158,1801 0/0:36,0:36:99:0,108,1188 0/0:51,0:51:99:0,151,1549 0/0:46,0:46:99:0,137,1479 0/0:66,0:66:99:0,197,2070 0/0:45,0:45:99:0,133,1376 0/0:56,0:56:99:0,167,1893 0/0:52,0:52:99:0,155,1644 0/0:53,0:53:99:0,158,1717 0/0:42,0:42:99:0,126,1381 0/0:46,0:46:99:0,138,1619 0/0:55,0:55:99:0,164,1756 0/0:62,0:62:99:0,185,1947 0/1:24,20:44:99:556,0,751 0/0:50,0:50:99:0,148,1571 0/1:22,24:46:99:606,0,617 0/1:23,8:31:99:160,0,690 0/0:56,0:56:99:0,167,1833 0/0:54,0:54:99:0,162,1774 0/0:88,0:88:99:0,262,2873 0/0:59,0:59:99:0,175,1880 0/0:53,0:53:99:0,159,1740 0/0:52,0:52:99:0,156,1702 0/0:53,0:53:99:0,159,1745 0/0:85,0:85:99:0,255,2834 0/0:35,0:35:99:0,105,1155 0/0:52,0:52:99:0,156,1666 0/0:53,0:53:99:0,159,1749 0/0:44,0:44:99:0,132,1535 0/0:49,0:49:99:0,146,1475 0/0:69,1:70:99:0,199,2268 0/0:63,0:63:99:0,194,2207 0/0:48,0:48:99:0,143,1562 0/0:48,0:48:99:0,144,1545 0/0:51,0:51:99:0,153,1726 0/0:50,0:50:99:0,149,1645 0/0:50,0:50:99:0,150,1732 0/0:58,0:58:99:0,173,1933 0/0:45,0:45:99:0,135,1519 0/0:63,0:63:99:0,188,2029 0/0:59,0:59:99:0,176,1910 0/0:64,0:64:99:0,191,2076 0/0:65,0:65:99:0,194,2092 0/0:48,0:48:99:0,144,1649 0/0:56,0:56:99:0,168,1886 0/0:54,0:54:99:0,161,1723 0/0:48,0:48:99:0,142,1551 0/0:51,0:51:99:0,153,1711 0/0:45,0:45:99:0,134,1506 0/0:41,0:41:99:0,123,1451 0/0:54,0:54:99:0,162,1783 0/0:67,0:67:99:0,200,2121 0/0:54,0:54:99:0,161,1734 0/0:56,0:56:99:0,167,1723 0/0:37,0:37:99:0,110,1183 0/0:37,0:37:99:0,111,1295 0/0:45,0:45:99:0,134,1474 0/0:48,0:48:99:0,143,1572 0/0:61,0:61:99:0,183,2112 0/0:66,0:66:99:0,198,2158 0/0:63,0:63:99:0,189,2207 0/0:42,0:42:99:0,126,1447 0/0:37,0:37:99:0,111,1262 0/0:50,0:50:99:0,149,1624 0/0:47,0:47:99:0,141,1585 0/0:53,0:53:99:0,159,1649 0/0:54,0:54:99:0,162,1698 0/0:58,0:58:99:0,173,1658 0/0:28,0:28:83:0,83,945 0/0:56,0:56:99:0,167,1817 0/0:49,0:49:99:0,146,1544 0/0:36,0:36:99:0,108,1186 0/0:45,0:45:99:0,135,1478 0/0:52,0:52:99:0,156,1727 0/0:66,0:66:99:0,198,2150 0/0:57,0:57:99:0,170,1894 0/0:56,0:56:99:0,168,1928 0/0:52,0:52:99:0,156,1648 0/0:83,0:83:99:0,249,2560 0/0:53,0:53:99:0,157,1654 0/0:65,0:65:99:0,194,2027 0/0:53,0:53:99:0,159,1837
In this case, it calls the genotype 0/0, with GQ = 99 for most of the samples. Is there any reason that HaplotypeCaller in native mode would call the correct genotype but not in -ERC GVCF or BP_RESOLUTION mode? I would like to use the GVCF mode since we often get our sample data in batches, but this problem with -ERC modes is a limitation for us.
I attached a screenshot from IGV showing the reads for the first sample. There are several non-duplicate reads flanking this nucleotide.
Thanks,
Andrew
Best Answer
-
Geraldine_VdAuwera Cambridge, MA admin
Hi @andrewo,
The
-ERCIS
parameter is used in modeling the confidence we have in the homozygous reference genotype when generating GVCFs. It is not utilized when you run HC in multi-sample mode, since in that mode we do not calculate reference confidence. It is only available as an argument in that mode due to an oversight -- we forgot to mark it as incompatible with normal mode. So that explains why you're seeing differences between the two.Regarding appropriate values, to be frank the default value of this parameter was derived largely empirically. The values you are giving it are unusually large and therefore are causing a heavy imbalance in the calculation of the reference confidence, excessively penalizing the hom-ref genotype case. Note that this parameter is not meant to be an upper bound to the size of indels you can detect -- you will be able to detect indels of that size even without adjusting that parameter. So I would strongly recommend that you allow the program to use the default setting.
Answers
@andrewo
Hi Andrew,
Can you please post the bamout file for the sample. Please also post a larger region surrounding the site. Perhaps include ~1000 bases before and after the site. Can you also post some variant records for the positions before and after the GQ 0 calls. Are there lots of GQ 0 calls in the region? Can you post some calls made with high GQ in the region?
Thanks,
Sheila
Hi @Shelia,
Here is a wider region of the gVCF file (also attached a 2kb region from IGV as you mentioned):
Here is what the gVCF looks like for the same region if I use GATK 3.7 and -forceActive
As you can see, most of the HET calls have a high GQ, and most of the 0/0 calls have low GQ. There are a few with 0/0 that have good GQ as well, e.g.,
These all have a few reads that support the alternate allele, but not enough to call it heterozygous. I don't see any examples where AD shows no reads supporting the alternate allele with good GQ though (also tested with the full gVCF file and it is true genome-wide). All bands where there are no reads supporting the alternate allele have GQ = 0.
I keep getting a
530 Sorry, max 20 users -- try again later
Error when I try to connect to the broad FTP server to post the bamout file from HaplotypeCaller, but you can download it from this dropbox link for now (let me know if you have any issues with the link):https://www.dropbox.com/s/bmy02jec15n2c4z/aoler_bamout.zip?dl=0
I also included this NIAID-1113.recal.g.vcf.txt file from the command above. Let me know if you need anything else.
Thanks,
Andrew
@andrewo
Hi Andrew,
Sorry for the delay. I am getting a little lost in the long post, so I think it is best if you submit a bug report so I can have a look locally. Instructions are here.
Thanks,
Sheila
Hi Sheila,
I posted the bug report on the FTP server. andrewo_bug_report_for_Sheila.zip
Thanks,
Andrew
@andrewo
Hi Andrew,
Thank you. I will get to this soon.
-Sheila
@andrewo
Hi Andrew,
I just had a look. I did not find any difference between the two modes. I ran HaplotypeCaller on the individual BAM files and with both BAM files in normal mode. Then, I ran HaplotypeCaller in GVCF mode on both BAM files, then GenotypeGVCFs. I get the exact same results from both modes.
I used the latest version and ran on chromosome 1 only.
-Sheila
@Sheila
Hi Sheila,
Thanks for looking at this. I'm not exactly sure why you see no difference between the two modes. I just double-checked to see if I was missing something. I used GATK 3.7 for my retest and I still see genotype
0/0
for individual NIAID-1113 at these positions in HaplotypeCaller normal mode with both BAM files and./.
in GVCF mode followed by GenotypeGVCFs using the commands I put in the bug report.It seems maybe you used the default settings rather than the parameters in the command of the bug report, is that right? With this in mind, I did some more testing and determined that the option that is causing the
./.
genotype in GVCF mode for me is-ERCIS 100
, which was also part of the command in the bug report. Did you run it with this option? Adding this parameter alone (otherwise using defaults) will reproduce the issue. When I run with-ERCIS 10
, or omit the option it gives genotype0/0
for the variants above in GVCF mode. In Haplotypecaller normal (multi-sample) mode, I get genotype0/0
whether I use -ERCIS 10 or -ERCIS 100. Why would the genotype of a single nucleotide variant be affected by increasing the indel size to eliminate? I thought that setting would only affect genotype calls for indels. And why would it always be0/0
in HaplotypeCaller multi-sample mode for both -ERCIS settings (10 and 100) but not GVCF mode? My reason for specifying -ERCIS 100 is to enable support for insertions with respect to the reference that are larger than 10bp, since it seems entirely possible to occur biologically and I also see many pathogenic variants in ClinVar that are at least 20bp long. Let me know if I'm misunderstanding the purpose of this parameter. I'm also wondering if there could be something about the reads in the dataset that makes it more sensitive to the -ERCIS parameter, since I don't see this problem with data from other sequencing vendors. Any ideas you have would be welcome.Thanks!
Andrew
As a follow-up, I tested different values for -ERCIS (--indelSizeToEliminateInRefModel) to see at what point the genotype switches from
0/0
to./.
in GVCF mode and I noticed that values up to 55 give me0/0
genotype; values from 60-70 give me0/0
at one position and./.
at the other position; values 75+ give me./.
at both position. The reads in this dataset are 76bp, whereas reads we are normally used to looking at are ~125 bp. Maybe this is the reason I see this effect in this dataset but haven't noticed it before. Can you explain how the -ERCIS parameter should be used, and how it might be sensitive to read length? Should I be using --activeRegionMaxSize instead?Thanks!
Andrew
Issue · Github
by Sheila
@andrewo
Hi Andrew,
Got it. I will have a look again and get back to you.
-Sheila
Hi @andrewo,
The
-ERCIS
parameter is used in modeling the confidence we have in the homozygous reference genotype when generating GVCFs. It is not utilized when you run HC in multi-sample mode, since in that mode we do not calculate reference confidence. It is only available as an argument in that mode due to an oversight -- we forgot to mark it as incompatible with normal mode. So that explains why you're seeing differences between the two.Regarding appropriate values, to be frank the default value of this parameter was derived largely empirically. The values you are giving it are unusually large and therefore are causing a heavy imbalance in the calculation of the reference confidence, excessively penalizing the hom-ref genotype case. Note that this parameter is not meant to be an upper bound to the size of indels you can detect -- you will be able to detect indels of that size even without adjusting that parameter. So I would strongly recommend that you allow the program to use the default setting.
@Geraldine_VdAuwera
Okay, I will use the default. Thanks for the explanation!