Discrepancy in GVCF file size between BAM files of similar size


I am performing WGS analysis according to the GATK Best Practices guidelines (aligning to hg38). I call variants on each sample using HaplotypeCaller in GVCF mode, running one job per chromosome so that the jobs can run in parallel. For each sample I then concatenate the per-chromosome GVCF files using CatVariants.
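For reference, the per-chromosome workflow looks roughly like the sketch below. It only builds and prints the command lines; it does not run anything. It assumes GATK 3.x-style arguments (CatVariants is a GATK 3 tool) and the primary hg38 contigs in reference order. The jar path, file names, and helper function names are placeholders for illustration, not my actual paths.

```python
# Sketch only: builds the per-chromosome HaplotypeCaller commands and the
# CatVariants command as argument lists. Paths and names are hypothetical.

# Primary hg38 contigs, in the order they appear in the reference dictionary
# (assumption: alt/decoy contigs are handled separately or skipped).
CHROMS = ["chr%d" % i for i in range(1, 23)] + ["chrX", "chrY", "chrM"]


def haplotypecaller_cmd(gatk_jar, reference, bam, chrom, out_gvcf):
    """GATK 3.x HaplotypeCaller in GVCF mode, restricted to one chromosome."""
    return [
        "java", "-jar", gatk_jar,
        "-T", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-L", chrom,
        "--emitRefConfidence", "GVCF",
        "-o", out_gvcf,
    ]


def catvariants_cmd(gatk_jar, reference, gvcfs, out_gvcf):
    """GATK 3.x CatVariants joining per-chromosome GVCFs in the given order."""
    cmd = ["java", "-cp", gatk_jar,
           "org.broadinstitute.gatk.tools.CatVariants",
           "-R", reference]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]
    cmd += ["-out", out_gvcf, "-assumeSorted"]
    return cmd


def build_commands(gatk_jar, reference, bam, sample):
    """Return (list of per-chromosome calling commands, one join command)."""
    parts = ["%s.%s.g.vcf" % (sample, chrom) for chrom in CHROMS]
    calls = [haplotypecaller_cmd(gatk_jar, reference, bam, chrom, part)
             for chrom, part in zip(CHROMS, parts)]
    join = catvariants_cmd(gatk_jar, reference, parts, sample + ".g.vcf")
    return calls, join


if __name__ == "__main__":
    calls, join = build_commands("GenomeAnalysisTK.jar", "hg38.fasta",
                                 "sample1.bam", "sample1")
    for cmd in calls + [join]:
        print(" ".join(cmd))
```

The same commands are used for the 30X and the 60X samples, so the size difference should not come from different calling options.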

For the majority of my samples (all at 30X coverage), the BAM files are around 175G and the GVCF files around 55G. However, two of my samples have 60X coverage, with BAM files of 221G and 198G. What confuses me is that the GVCF files from these two 60X samples are much smaller than all the rest (7G and 6G).

I have checked the format of both the large and small GVCF files and they are consistent with the GVCF file format specification. Interestingly, both have a similar number of variant records, so the difference in file size must relate to the number of non-variant records.
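A minimal sketch of how the two record types can be counted is shown below. It assumes that a non-variant (reference block) record is one whose ALT column contains only `<NON_REF>` (or `.`), and that any record with a real alternate allele is a variant record. The function names and the file name in the usage note are hypothetical.

```python
import gzip
from collections import Counter


def open_text(path):
    """Open a plain or gzip-compressed GVCF for reading as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def classify_record(line):
    """Return 'non_variant' for reference-block records, else 'variant'.

    Assumption: a reference block has only <NON_REF> (or '.') in ALT.
    """
    fields = line.rstrip("\n").split("\t")
    alts = fields[4].split(",")
    if alts == ["<NON_REF>"] or alts == ["."]:
        return "non_variant"
    return "variant"


def count_records(lines):
    """Count variant and non-variant records, skipping header lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        counts[classify_record(line)] += 1
    return counts


def count_gvcf(path):
    """Count record types in the GVCF file at the given path."""
    with open_text(path) as handle:
        return count_records(handle)
```

For example, `count_gvcf("sample1.g.vcf")` returns a Counter with `variant` and `non_variant` keys, which can be compared between a 30X and a 60X sample.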

Is there any simple explanation for the discrepancy in GVCF file sizes between the 30X and 60X coverage samples?

Kind regards,
