Missing chr in gVCF


I generated a gVCF file using GATK 3.7, and the run completed without any errors. However, I see only a few chromosome names in the first column of the gVCF file:

 java -Xmx10G -jar GenomeAnalysisTK.jar -R species.fa -T HaplotypeCaller -I Input.bam -stand_call_conf 30 -ERC GVCF --min_base_quality_score 20 --variant_index_parameter 128000 --variant_index_type LINEAR --genotyping_mode DISCOVERY -o GATK.g.vcf.gz

  less GATK.g.vcf.gz | grep -v "##" | cut -f1| sort | uniq

Does this mean that the run is incomplete, or am I missing some property of the gVCF format?


  • mehar (Member)

    The command also produces a .tbi file, which suggests that the run completed successfully. However, not all chromosomes are present in the first column.

  • bhanuGandham (Member, Administrator, Broadie, Moderator)

    Hi @mehar

    1) Would you please run 'samtools view' on the BAM file to check whether it contains all the chromosomes, and whether that is where the issue lies?
    2) If the BAM file does contain all the chromosomes, please send me a subset of the BAM file with one of the chromosomes that is 'not' in the output gVCF. That will help me find the reason for this discrepancy. To send me the sample BAM, please follow the steps in this link: https://software.broadinstitute.org/gatk/guide/article?id=1894

    Bhanu Gandham

  • mehar (Member)

    `samtools view -H` shows all the chromosomes. I then randomly chose chr22, which is not listed by the less command.

    `samtools view  AM374.recalibrated.bam chr22:1-61439934 | head`

    This shows the alignments for chr22 in the BAM. I have also run GenotypeGVCFs on the above gVCF file, and it produces variants on all the chromosomes. It is puzzling that the less command above does not list all the chromosomes while the VCF from GenotypeGVCFs has all of them. I also tried tabix on the same gVCF instead of the "less" command, and it shows chr22.

     $ tabix -p vcf GATK.g.vcf.gz chr22| head
    chr22   1   .   N   <NON_REF>   .   .   END=927 GT:DP:GQ:MIN_DP:PL  0/0:0:0:0:0,0,0
    chr22   928 .   A   <NON_REF>   .   .   END=1021    GT:DP:GQ:MIN_DP:PL  0/0:1:3:1:0,3,29
    chr22   1022    .   C   <NON_REF>   .   .   END=1029    GT:DP:GQ:MIN_DP:PL  0/0:2:6:2:0,6,66
    chr22   1030    .   A   <NON_REF>   .   .   END=1036    GT:DP:GQ:MIN_DP:PL  0/0:3:9:3:0,9,97
    chr22   1037    .   A   <NON_REF>   .   .   END=1055    GT:DP:GQ:MIN_DP:PL  0/0:4:12:4:0,12,128
    chr22   1056    .   C   <NON_REF>   .   .   END=1059    GT:DP:GQ:MIN_DP:PL  0/0:5:15:5:0,15,152
    chr22   1060    .   A   <NON_REF>   .   .   END=1066    GT:DP:GQ:MIN_DP:PL  0/0:6:18:6:0,18,197

    I suspect the less command hits some maximum limit when reading the gVCF file, so that it lists only the chromosomes it has read and omits the rest. Could this be the issue, or should I send the subset of the sample BAM?

  • bhanuGandham (Member, Administrator, Broadie, Moderator)

    Hi mehar,

    It's possible that this is a limitation of the less command, since 'less' is primarily meant for paging through a file. To see whether that is indeed the case, please try:
    zcat GATK.g.vcf.gz | grep -v "##" | cut -f1| sort | uniq
    If this shows all the chromosomes then problem solved!

    Let me know if this helps.
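
For readers with the same symptom, here is a minimal sketch of the check described above: read the compressed gVCF through an explicit decompressor instead of a pager. The file `demo.g.vcf.gz` and the function name `list_contigs` are made up for illustration, and the tiny file stands in for a real gVCF. `gzip -dc` is used as a portable equivalent of `zcat`, and it should also read bgzip-compressed files because they are valid gzip.

```shell
# Build a tiny stand-in for a gVCF (hypothetical file, illustration only).
tmpdir=$(mktemp -d)
demo="$tmpdir/demo.g.vcf.gz"
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr1\t1\t.\tN\t<NON_REF>\t.\t.\tEND=10\nchr1\t11\t.\tA\t<NON_REF>\t.\t.\tEND=20\nchr22\t1\t.\tN\t<NON_REF>\t.\t.\tEND=927\n' \
    | gzip -c > "$demo"

# Print the distinct contig names found in the records of a gzipped VCF.
# grep -v '^#' drops every header line, including the #CHROM column line.
list_contigs() {
    gzip -dc "$1" | grep -v '^#' | cut -f1 | sort -u
}

contigs=$(list_contigs "$demo")
printf '%s\n' "$contigs"
```

On a real file the call would be `list_contigs GATK.g.vcf.gz`. Note that the original `grep -v "##"` keeps the `#CHROM` line, so `#CHROM` shows up in its output as well.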


  • mehar (Member)

    Yes, it turned out to be a funny post! zcat showed all the chromosomes. It was just a moment of panic. Thanks for your time.
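
A related sanity check, sketched here under assumptions rather than taken from the thread: compare the contigs declared in the `##contig` header lines with the contigs that actually have records. The demo file, its contig names, and its lengths are invented for illustration. A real gVCF may declare contigs that legitimately have no records, so an entry in the result is a prompt to look closer and not proof of a failed run.

```shell
# Use one collation for sort and comm so their orderings agree.
export LC_ALL=C

# Hypothetical demo file: three declared contigs, records on only two.
tmpdir=$(mktemp -d)
demo="$tmpdir/demo.g.vcf.gz"
printf '##fileformat=VCFv4.2\n##contig=<ID=chr1,length=1000>\n##contig=<ID=chr2,length=900>\n##contig=<ID=chr22,length=800>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr1\t1\t.\tN\t<NON_REF>\t.\t.\tEND=10\nchr22\t1\t.\tN\t<NON_REF>\t.\t.\tEND=927\n' \
    | gzip -c > "$demo"

# Contigs declared in the header.
gzip -dc "$demo" | sed -n 's/^##contig=<ID=\([^,>]*\).*/\1/p' | sort -u > "$tmpdir/declared.txt"

# Contigs that appear in at least one record.
gzip -dc "$demo" | grep -v '^#' | cut -f1 | sort -u > "$tmpdir/observed.txt"

# Lines only in the first file: declared contigs without any record.
missing=$(comm -23 "$tmpdir/declared.txt" "$tmpdir/observed.txt")
printf 'declared but without records: %s\n' "$missing"
```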
