Compressed output file size from GatherVcfs is different when input files are compressed vs not

I have produced gVCF files for different intervals from the same sample (using the -L flag) and now want to merge those smaller VCFs together, to create one single large gVCF per sample, using GatherVcfs.

The total size of my gzipped input gVCF files is ~10 GB. When I feed those into GatherVcfs, the total size of the output file -- once manually gzipped -- is ~10 GB. That's roughly the same, and thus as expected.

Because GatherVcfs cannot validate position non-overlap of files and create an index when input files are compressed, I did a separate run in which I gunzipped the input files first. The total size of the output file from GatherVcfs, after I manually gzipped it, is only ~6 GB.

Can anyone explain where the missing 4 GB of data went? Each input file covers a separate interval, so GatherVcfs shouldn't be removing any data -- there shouldn't be duplicates between intervals. I would prefer to gunzip my input files first, so that GatherVcfs can create an index while it runs, but I'm concerned that something is going horribly wrong if I'm losing so much data.

Any suggestions would be much appreciated -- thank you!


  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator)
    edited June 7

    Hi @bramblepuss

    That is weird. Try running a diff between the 10 GB GatherVcfs output and the 6 GB GatherVcfs output. Also view the regions that differ between these gVCFs in IGV. After that, please post the IGV screenshots and the differing records here so we can see what's wrong.
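A minimal sketch of that comparison in Python, in case a plain `diff` on files this large is awkward. It assumes both outputs are ordinary VCF text (header lines start with `#`), either uncompressed or gzip/bgzip-compressed. The helper names `open_vcf`, `records`, `count_records`, and `first_difference` are made up for illustration, and the two file paths come from the command line. It counts the records in each file, then reports the first record at which the two files disagree, which gives a position to look at in IGV.

```python
import gzip
import sys
from itertools import zip_longest


def open_vcf(path):
    """Open a VCF as text, detecting gzip/bgzip by its magic bytes."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def records(path):
    """Yield each non-header line (one variant/reference-block record)."""
    with open_vcf(path) as fh:
        for line in fh:
            if not line.startswith("#"):
                yield line.rstrip("\n")


def count_records(path):
    """Count non-header records without loading the file into memory."""
    return sum(1 for _ in records(path))


def first_difference(path_a, path_b):
    """Return (index, record_a, record_b) for the first mismatch, else None.

    A record that is missing because one file ended early is reported as None.
    """
    pairs = zip_longest(records(path_a), records(path_b))
    for index, (rec_a, rec_b) in enumerate(pairs):
        if rec_a != rec_b:
            return index, rec_a, rec_b
    return None


if __name__ == "__main__" and len(sys.argv) == 3:
    file_a, file_b = sys.argv[1], sys.argv[2]
    print(file_a, count_records(file_a), "records")
    print(file_b, count_records(file_b), "records")
    print("first difference:", first_difference(file_a, file_b))
```

If the record counts match and `first_difference` returns `None`, the two outputs hold the same records and the size gap comes from how the files were compressed rather than from lost data. This only compares records in order; it does not check the headers.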

