Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

CombineGVCFs - All alleles are missing

astrandastrand New YorkMember
edited January 2015 in Ask the GATK team


I am combining gcvf files into single gvcf files by chromosome, using CombineGVCFs, in order to run GenotypeGVCFs. When I checked the first gvcf file generated by CombineGVCFs, I noticed that at each position, all the alleles were missing.

For example, at position 16050036, this is what comes up in the final gvcf file:
22 16050036 . A C,<NON_REF> . . BaseQRankSum=-7.360e-01;ClippingRankSum=-7.360e-01;DP=4;MQ=27.00;MQ0=0;MQRankSum=-7.360e-01;ReadPosRankSum=0.736 GT:AD:DP:MIN_DP:PL:SB ./.:1,2,0:3:.:55,0,23,58,29,86:1,0,2,0 ./.:.:1:1:0,0,0,0,0,0 ./.:.:0:0:0,0,0,0,0,0

But if we just take one of the precursor gvcf files (one individual), we clearly see the genotype at that site:
22 16050036 . A C,<NON_REF> 26.80 . BaseQRankSum=-0.736;ClippingRankSum=-0.736;DP=3;MLEAC=1,0;MLEAF=0.500,0.00;MQ=27.00;MQ0=0;MQRankSum=-0.736;ReadPosRankSum=0.736 GT:AD:DP:GQ:PL:SB 0/1:1,2,0:3:23:55,0,23,58,29,86:1,0,2,0

The command I'm using to generate these files is:

java -Xmx1g -jar GenomeAnalysisTK.jar -T CombineGVCFs -R hs37d5.fa -V vcfs.chr${numchr}.new.list -o mergeGvcf_${numchr}.vcf -L ${numchr}
where numchr is a variable previously defined (indicating the chromosome number).

It seems that all the information is being taken into account except the actual genotypes. How do I solve this problem?


Best Answer


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    You mean that in the combined GVCF, the genotypes are missing, right? The alleles seem to be there.

    Which version are you using?

  • astrandastrand New YorkMember

    Yes, that's what I meant. I am using the latest version (3.3).

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    We've seen a few cases recently where GenotypeGVCFs assigns no-calls (./.) instead of a proper genotype when coverage is extremely low. I think that's what's happening here. It may take some time before we can develop a fix for this as it's fairly low priority (those sites aren't really usable anyway). If all your sites are like this you should really check your data QC. But if it's just the first sites in your files it's not necessarily unexpected -- the coverage tends to be very low at the beginning of contigs (in WGS) or capture targets (in WEx).

  • astrandastrand New YorkMember

    Hi Geraldine,

    What do you mean by "data QC"? The example I used in my post is a test run that I did for 3 individuals on 1 chromosome. I was able to generate a perfectly normal final vcf file using GenotypeGVCFs. However, because I have a lot more files for which I have to run this analysis (and for which, ideally, I should use CombineGVCFs), I decided to test CombineGVCFs on those 3 individuals. That's when I saw the missing genotypes. They are missing from every single site that gets called. Like I said, using GenotypeGVCFs produces perfectly acceptable output, but not CombineGVCFs, so not sure what is going on there.

    Thanks anyways for the information,

  • astrandastrand New YorkMember

    OK, thanks Sheila. Where does the information regarding the genotypes get stored for GenotypeGVCFs to use? I'll run the Combined files through GenotypeGVCFs and let you know if I experience any problems.


  • SheilaSheila Broad InstituteMember, Broadie admin

    Hi Alva,

    The information is in the other fields present in the CombineGVCFs file. GenotypeGVCFs uses the AD and PL fields for all the samples to determine a proper final genotype.


  • Kelly135Kelly135 KoreaMember

    Hi Sheila, I am experiencing the same issue and this post helped me. I understood that genotypes are missing after CombineGVCFs, and then they appear after GenotypeGVCF.

    But the final format I want to get is "combined gvcf" not "combined vcf". In this case, how should I create this file?

    I also tried to find the way of converting vcf to gvcf, but there was no such information as most people are interested in vcf, not gvcf.

    FYI, I am using the latest version (4.0.1) and my question is not related to a bug.

  • SheilaSheila Broad InstituteMember, Broadie admin


    There is no way to get a final combined GVCF, as GVCFs are intermediate files not to be used in final analysis.

    Why do you need the final combined GVCF?


Sign In or Register to comment.