Merging Genome VCFs for the Same Individual

Hi, this is essentially a feature request for something I think would be useful. I mentioned it briefly at the Brussels workshop, and it seems like it might be possible.

In a couple of projects I'm involved in, we have done low-coverage (2-10x whole genome) exploratory sequencing for a large number of individuals (similar to 1K genomes; around 1,200 individuals between the two projects). We have recently processed these individuals using the new N+1 pipeline, generating gVCFs.

Going forward, we are adding additional sequence for a decent number of these individuals (from the same PCR-free library) to improve genome coverage and genotype accuracy (target 30x) in individuals and trios of interest. We therefore want to combine the new sequence (20x) with the older sequence (6-10x) to get as much coverage as possible.
As I understand it, to do this I currently need to:
1. Rerun the GATK HaplotypeCaller on the old and new BAMs together, generating a new gVCF.
2. Track down those individuals in our previous combined gVCFs and remove them.
3. Run GenotypeGVCFs on the old combined gVCFs (minus the low-coverage samples) plus the new gVCFs.
With this process I have to reprocess the old data multiple times and subset the old combined gVCF files whenever new data comes in, which is rather painful and computationally wasteful.
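For concreteness, the current workflow could be sketched roughly as below. This is a minimal Python sketch that only assembles the command lines rather than running them, assuming GATK 3.x-style arguments (`-T HaplotypeCaller`, `-I`, `--emitRefConfidence GVCF`, `-V`, `-o`). The jar path and all file names are hypothetical placeholders, the old and new BAMs are assumed to carry the same SM read-group tag so they are called as one sample, and some GATK 3 releases needed extra index arguments that are omitted here.

```python
from typing import List


def haplotypecaller_cmd(reference: str, bams: List[str], out_gvcf: str,
                        gatk_jar: str = "GenomeAnalysisTK.jar") -> List[str]:
    """Build a GATK 3-style HaplotypeCaller command that calls one sample
    from all of its BAMs (old and new) in gVCF mode."""
    cmd = ["java", "-jar", gatk_jar, "-T", "HaplotypeCaller", "-R", reference]
    for bam in bams:
        # Every BAM for the sample is passed again, including the old ones.
        cmd += ["-I", bam]
    cmd += ["--emitRefConfidence", "GVCF", "-o", out_gvcf]
    return cmd


def genotypegvcfs_cmd(reference: str, gvcfs: List[str], out_vcf: str,
                      gatk_jar: str = "GenomeAnalysisTK.jar") -> List[str]:
    """Build a GATK 3-style GenotypeGVCFs command over a list of gVCFs."""
    cmd = ["java", "-jar", gatk_jar, "-T", "GenotypeGVCFs", "-R", reference]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]
    cmd += ["-o", out_vcf]
    return cmd


# Hypothetical file names, for illustration only.
recall = haplotypecaller_cmd(
    "ref.fasta",
    ["sampleA.old_6x.bam", "sampleA.new_20x.bam"],
    "sampleA.combined.g.vcf",
)
joint = genotypegvcfs_cmd(
    "ref.fasta",
    ["cohort.minus_sampleA.g.vcf", "sampleA.combined.g.vcf"],
    "cohort.vcf",
)
print(" ".join(recall))
print(" ".join(joint))
```

The cost I'm complaining about is visible in the first command: the old BAM has to be kept around and re-called every time a new lane arrives for that sample.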

Ideally, it would instead be possible to run GATK HaplotypeCaller on just the new sequence, generating a second gVCF that only has data for the new 20x coverage, and then combine it with the old gVCF. The merge would use the data from both the old and new gVCFs and produce a single final VCF record for the sample. I guess this could either be a separate tool that merges old and new gVCFs, or be done automatically by the GenotypeGVCFs tool.

This would also be useful from a workflow point of view. We have limited computational resources and storage, so we prefer to process data through to the gVCF stage as soon as it comes off the sequencer. That saves space and lets us archive the BAM files while keeping the gVCFs for when we run GenotypeGVCFs on all the current data. At the moment I have to keep the BAMs for an individual in working space until I'm sure I've got all the sequence for that individual (and, as mentioned above, that can change in the future), and only then generate the gVCFs. Being able to flow sequence data through the cluster to the gVCF stage as soon as it becomes available, and then later merge the gVCFs when additional lanes are completed, would make things a lot simpler from a resource management and pipeline design perspective.

If this is possible, I would greatly appreciate it being implemented.

Best Answer


  • dkolbe (Iowa, Member)

    I'm in a similar position to the original poster - we want to collect GVCFs for each of our samples as we go for an ongoing record of observed variants. We regularly re-sequence the same sample for a variety of reasons, and we don't want a repeat of a sample (possibly different libraries, though) to count as two individuals when outputting variants and calculating frequency. Ideally, we'd be able to combine the information from multiple GVCFs with matching sample tags. Given it's been two years - any change in the answer here? Best recommendations for handling this scenario? Thanks.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Still feasible (though not trivial), but still not on our roadmap, I'm afraid. It's not something our immediate collaborators need, and you're only the second person in two years to express interest in this... We'd be happy to take a patch if someone wants to figure it out, though.
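Since no merging tool exists, one stopgap for the repeated-sample scenario raised in this thread is to at least detect when several gVCFs carry the same sample name before they reach joint genotyping. The following is a minimal Python sketch under that assumption: it reads the sample columns from each file's `#CHROM` header line (columns 10 onward in the VCF format) and reports names that appear in more than one file. The function names are hypothetical, and the sketch only flags repeats; it does not merge any genotype data.

```python
import gzip
from collections import defaultdict
from typing import Dict, Iterable, List


def sample_names(vcf_path: str) -> List[str]:
    """Return the sample names from the #CHROM header line of a (g)VCF."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # Fixed columns are CHROM..INFO plus FORMAT; samples follow.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break  # reached records without seeing a header line
    raise ValueError(f"no #CHROM header line found in {vcf_path}")


def gvcfs_by_sample(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Map each sample name to the list of files that contain it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        for name in sample_names(path):
            groups[name].append(path)
    return dict(groups)


def repeated_samples(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Keep only samples that appear in more than one file."""
    return {name: files
            for name, files in gvcfs_by_sample(paths).items()
            if len(files) > 1}
```

Calling `repeated_samples` on the list of per-run gVCFs gives the samples whose BAMs would need to be re-called together, or whose duplicate entries would need to be excluded, before frequencies are calculated.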
