Compressed output file size from GatherVcfs is different when input files are compressed vs not

I have produced gVCF files for different intervals from the same sample (using the -L flag) and now want to merge those smaller VCFs together, to create one single large gVCF per sample, using GatherVcfs.

The total size of my gzipped input gVCF files is ~10 GB. When I feed those into GatherVcfs, the total size of the output file -- once manually gzipped -- is ~10 GB. That's roughly the same, and thus as expected.

Because GatherVcfs cannot check that the input files do not overlap in position, or create an index, when the input files are compressed, I did a separate run in which I gunzipped the input files first. The total size of the output file from that run, after I manually gzipped it, is only ~6 GB.

Can anyone explain where the missing 4 GB of data went? Each input file covers a separate interval, so GatherVcfs shouldn't be removing any data -- there shouldn't be duplicates between intervals. I would prefer to gunzip my input files first, so that an index is created while running GatherVcfs, but I'm concerned that something is going horribly wrong if I'm losing so much data.

Any suggestions would be much appreciated -- thank you!
Answers

  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator)
    edited June 7

    Hi @bramblepuss

    That is weird. Try doing a diff between the 10 GB GatherVcfs output and the 6 GB GatherVcfs output. Also view the regions that differ between these gVCFs in IGV. After that, please post the IGV screenshots and the differing records here so we can see what's wrong.
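
One way to run that comparison, as a rough sketch only: compressed file size depends on the compressor and its settings, so counting and comparing the actual records is a more direct check for lost data than comparing gzipped sizes. The Python below is not part of GATK. The function names (`open_vcf`, `records`, `compare_vcfs`) and the file names in the usage note are hypothetical, and it assumes both outputs are plain text or gzip/bgzip-compressed VCFs with records in the same order.

```python
import gzip
from itertools import zip_longest


def open_vcf(path):
    """Open a plain or gzip/bgzip-compressed VCF as text (decided by the .gz suffix)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def records(path):
    """Yield data lines only, skipping header lines that start with '#'."""
    with open_vcf(path) as handle:
        for line in handle:
            if not line.startswith("#"):
                yield line.rstrip("\n")


def compare_vcfs(path_a, path_b, max_report=10):
    """Compare records position by position.

    Returns (count_a, count_b, mismatches), where mismatches holds up to
    max_report tuples of (record_number, record_a, record_b). A record that
    is missing from one file shows up as None on that side.
    """
    count_a = count_b = 0
    mismatches = []
    pairs = zip_longest(records(path_a), records(path_b))
    for number, (rec_a, rec_b) in enumerate(pairs, start=1):
        if rec_a is not None:
            count_a += 1
        if rec_b is not None:
            count_b += 1
        if rec_a != rec_b and len(mismatches) < max_report:
            mismatches.append((number, rec_a, rec_b))
    return count_a, count_b, mismatches
```

For example, `compare_vcfs("from_gz_inputs.g.vcf.gz", "from_plain_inputs.g.vcf.gz")` (hypothetical file names) returns the two record counts and the first few differing records. Because the comparison is positional, one missing record makes every later pair differ, so the first reported mismatch is the one to look up in IGV. If the counts match and no mismatches are reported, the size difference comes from compression rather than from lost records.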
