Correct GATK4 tools to use for combining scattered gVCFs and VCFs from multiple calls

I am running GATK on non-human data and am trying to follow the best practices as much as possible. I've now hit two separate roadblocks, both involving similar issues:

1) Combining scattered gVCFs from the same sample

I have been parallelizing variant calling, running HaplotypeCaller per chromosome and producing a separate gVCF file for each chromosome. I now need to concatenate those per-chromosome files into a single gVCF file per individual, before using CombineGVCFs to combine the gVCFs from multiple samples and passing the output to GenotypeGVCFs.

Previously, I would have used CatVariants, as described in a previous thread (which I can't link to as a new user). It does not appear that CatVariants exists in GATK4, but there is GatherVcfs (Picard). Is this the appropriate tool to use? The documentation does not suggest it can be used for gVCF files, only VCF files, but it seems to run -- albeit slowly.

2) Combining hard-filtered VCF files into a single VCF to use as known sites in BQSR

My dataset comprises numerous individuals from five distinct species within the same genus. I ran HaplotypeCaller, then used CombineGVCFs to make six gVCF files: five that contain only the individuals from each distinct species, plus a sixth containing all the individuals. The logic behind the sixth was to capture low-frequency genus-wide alleles. After GenotypeGVCFs and hard filtering, I now have six VCF files, and I'd like to combine these into a single VCF to use as known sites in BQSR.

I had planned to use CombineVariants with the UNIQUIFY (merge) option, but this was not ported to GATK4. I cannot use MergeVcfs (Picard) because not all the files contain the same samples: the genus file contains them all, but the five species-specific files contain only the animals from those species. Can anyone suggest the best approach to do this?

I also wonder if it's necessary to start with six VCFs. I had presumed that one call across the whole genus would have been enough, but I am following methods from a previously published paper that took the six-VCF approach.

Any insight would be much appreciated. Thank you!

Answers

  • bhanuGandham (Cambridge, MA; Member, Administrator, Broadie, Moderator)

    @bramblepuss

    1) Correct, GatherVcfs is the tool to use.
    2) CombineVariants with the UNIQUIFY (merge) option works for your case; you should use it from GATK3.
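
For point 1, here is a minimal Python sketch of how the GatherVcfs call could be assembled, assuming one gVCF per chromosome. GatherVcfs expects its inputs in genomic order, so the helper orders them by a contig list that you supply. The function name, file names, and contig names are hypothetical; `-I` and `-O` are the tool's input and output arguments.

```python
def build_gather_command(gvcfs_by_contig, contig_order, output_path):
    """Return a GatherVcfs argument list with inputs in reference order.

    gvcfs_by_contig: dict mapping contig name -> per-contig gVCF path
    contig_order:    contig names in the order they appear in the reference
    output_path:     path of the gathered gVCF (hypothetical name)
    """
    # Fail early if a chromosome is missing, so the output is not silently partial.
    missing = [c for c in contig_order if c not in gvcfs_by_contig]
    if missing:
        raise ValueError("no gVCF supplied for contigs: " + ", ".join(missing))

    cmd = ["gatk", "GatherVcfs"]
    # One -I per input file, in the same order as the reference contigs.
    for contig in contig_order:
        cmd += ["-I", gvcfs_by_contig[contig]]
    cmd += ["-O", output_path]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` or joined into a single line for a job script.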

  • bramblepuss (Member)
    edited May 7
    @bhanuGandham

    Thanks so much for your help.

    I gunzipped the interval gVCF files from one individual and used GatherVcfs to merge them. I then manually gzipped the output gVCF. The total size of the compressed input files was ~10 GB, but the compressed output file was only ~6 GB.

    If I don't gunzip the input files and manually gzip the output file, and instead input the .gz and output a .gz, the tool produces an output file of roughly the same size as the input files (~10 GB).

    The advantage of gunzipping first was to validate position non-overlap of files and to create an index. Neither of those options is supported when gathering compressed files. But I can't explain where ~4 GB of data are disappearing to. Can you offer any insight here?
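
Compressed size depends on the compressor and its settings, so a size difference alone does not show whether any records were lost. One way to check is to compare record counts before and after gathering. The sketch below is a minimal Python version of that check; the function names are hypothetical, and it assumes that every non-header line is one record and that each input's records should appear exactly once in the output.

```python
import gzip


def open_vcf(path):
    """Open a plain or gzip/bgzip-compressed VCF as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_records(path):
    """Count non-header, non-empty lines (one per VCF record)."""
    with open_vcf(path) as handle:
        return sum(
            1 for line in handle
            if line.strip() and not line.startswith("#")
        )


def records_preserved(input_paths, gathered_path):
    """True if the gathered file has as many records as all inputs combined."""
    total_in = sum(count_records(p) for p in input_paths)
    return total_in == count_records(gathered_path)
```

If `records_preserved` returns True for both the ~6 GB and the ~10 GB outputs, the difference is in how the data were compressed rather than in what they contain.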