CombineGVCFs assigns incorrect reference allele in GATK3.7 and GATK4

maddie, Los Angeles, CA, Member

I'm analyzing 284 exomes using GATK 3.7-0-gcfedb67. My workflow is to run HaplotypeCaller on each individual exome, with the target intervals split across 32 BED files. After HaplotypeCaller, I combine the 32 per-interval g.vcf files with CombineGVCFs to create a single g.vcf per individual, and then genotype all individuals together with GenotypeGVCFs.

In my last BED file, I’ve run into an issue: for some individuals, a site that is missing from the post-HaplotypeCaller g.vcf appears in the combined g.vcf with the wrong reference allele. For other individuals, the site is present in the post-HaplotypeCaller g.vcf, and it keeps the correct reference allele after CombineGVCFs. This discordance causes problems downstream, because GenotypeGVCFs sees multiple reference alleles at the site and throws an error.

I used the same reference genome for all steps of the pipeline, and I haven’t been able to find a discussion on the GATK forum that solves this issue. I’m using GATK 3.7, but I also tried running CombineGVCFs in GATK 4 and the output file had the same issue. Any help on this would be appreciated. Thanks!
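For anyone trying to locate sites like this before GenotypeGVCFs fails, here is a minimal Python sketch that scans the per-individual g.vcf records and reports positions where individuals disagree on the reference base. It is not part of GATK; the function names and the sample labels are hypothetical, and it assumes uncompressed, tab-delimited VCF text. Only the first base of REF is compared, since REF alleles of different lengths at the same position (for example around deletions) can be legitimate.

```python
from collections import defaultdict


def first_ref_bases(lines):
    """Yield ((chrom, pos), first base of REF) for every record line.

    Header lines (starting with '#') and blank lines are skipped.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        yield (chrom, pos), ref[0].upper()


def find_ref_conflicts(gvcfs):
    """Return sites where samples disagree on the first REF base.

    `gvcfs` maps a sample label to an iterable of VCF text lines
    (an open file handle works). The result maps (chrom, pos) to
    {base: [samples reporting that base]}.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for sample, lines in gvcfs.items():
        for site, base in first_ref_bases(lines):
            seen[site][base].append(sample)
    return {
        site: dict(bases)
        for site, bases in seen.items()
        if len(bases) > 1
    }
```

In practice each value in `gvcfs` could be an open file handle for one individual's combined g.vcf. Only records that start at the same position are compared, so a site hidden inside another individual's reference block would not be flagged by this sketch.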

Answers

  • Sheila, Broad Institute, Member, Broadie, Moderator

    @maddie
    Hi,

    CombineGVCFs is meant for hierarchically combining GVCFs from multiple samples before feeding them into GenotypeGVCFs, which reduces the compute needed in GenotypeGVCFs. To merge the per-interval GVCFs of a single sample into one file, you should use GatherVcfs instead.

    You can also try running GenotypeGVCFs on the sample GVCFs for each interval separately, then merging the final per-interval VCFs into one with GatherVcfs.

    -Sheila
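
As a rough illustration of the GatherVcfs suggestion above, the following Python sketch assembles a Picard GatherVcfs command for one sample's per-interval files. The file names and the `picard.jar` path are placeholders, and the helper function is hypothetical. It assumes the shards are passed in genomic order and do not overlap, which GatherVcfs requires. The command is only built here, not executed.

```python
def build_gather_command(shard_paths, output_path, picard_jar="picard.jar"):
    """Build a Picard GatherVcfs command as an argument list.

    `shard_paths` must already be in genomic order; this helper does
    not sort or validate the files. All paths are placeholders.
    """
    if not shard_paths:
        raise ValueError("at least one input shard is required")
    cmd = ["java", "-jar", picard_jar, "GatherVcfs"]
    # One I= argument per per-interval file, in the order given.
    cmd += [f"I={path}" for path in shard_paths]
    cmd.append(f"O={output_path}")
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)` once the paths point at real files.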
