Most efficient way to combine multiple GVCF files

sheilazt, Valencia (Member)
edited January 2017 in Ask the GATK team

Dear GATK team,
To calculate variant frequencies, all samples from different runs must be in the same GVCF. For each run, I currently run the CombineGVCFs tool, and then I use the GenotypeGVCFs tool to combine the different runs and calculate variant frequencies. When a new run arrives, I run GenotypeGVCFs again, adding the GVCF for the new run to the list of input files, so that I end up with a single file for all runs. I wonder if there is a more efficient way, with lower computational cost, to join the GVCF files produced by CombineGVCFs into one file that I can then use as input for GenotypeGVCFs along with the new GVCFs that come from new runs.


Now I work this way:

gatk --analysis_type CombineGVCFs --variant sample_1.g.vcf --variant sample_2.g.vcf -R GRCh38.fasta -o run1.g.vcf

gatk --analysis_type CombineGVCFs --variant sample_3.g.vcf --variant sample_4.g.vcf -R GRCh38.fasta -o run2.g.vcf

Combine Run_1 and Run_2
gatk --analysis_type GenotypeGVCFs --variant run1.g.vcf --variant run2.g.vcf -R GRCh38.fasta -o run1_run2.g.vcf --includeNonVariantSites

I wonder if there is a better way to do the following:

gatk --analysis_type CombineGVCFs --variant sample_5.g.vcf --variant sample_6.g.vcf -R GRCh38.fasta -o run3.g.vcf

Combine the GVCF from Run_1+Run_2 and the new Run_3
gatk --analysis_type GenotypeGVCFs --variant run1_run2.g.vcf --variant run3.g.vcf -R GRCh38.fasta -o run1_run2_run3.g.vcf --includeNonVariantSites

I've tried different ways with no success.

Thanks very much in advance.




  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)
    Actually, the most efficient way is to use a special kind of database called TileDB that was developed by our Intel collaborators. Whenever you generate a GVCF, you add its contents to this database, then whenever you want you can just run GenotypeGVCFs on the database to joint-call all samples. Unfortunately this is rather difficult to set up so we are not yet able to provide it to the general public. We hope to make this solution widely available when we release GATK4 later this year.

    In the meantime, the approach you are using (batch-combining samples) is reasonable if you're working with many samples. If you have few samples, be aware that it is not necessary to combine them first, since GenotypeGVCFs can take multiple inputs. We do the combining when there are many samples in order to keep the number of concurrently open files low (below 200).
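As a rough illustration of the advice above, the sketch below assembles one GenotypeGVCFs call that takes every run-level GVCF as a separate input, using the same GATK3-style arguments as the commands in the question. The helper name build_genotype_cmd and the output name all_runs.vcf are hypothetical, and the script only prints the command so it can be checked before it is run.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print a GATK3-style GenotypeGVCFs command.
# Arguments: reference FASTA, output VCF, then one or more GVCF inputs.
build_genotype_cmd() {
  ref="$1"
  out="$2"
  shift 2
  cmd="gatk --analysis_type GenotypeGVCFs -R $ref"
  # Add one --variant argument per input GVCF.
  for gvcf in "$@"; do
    cmd="$cmd --variant $gvcf"
  done
  printf '%s\n' "$cmd -o $out"
}

# Joint-call all run-level GVCFs produced by CombineGVCFs in one step.
build_genotype_cmd GRCh38.fasta all_runs.vcf run1.g.vcf run2.g.vcf run3.g.vcf
```

When a new run arrives, its combined GVCF would simply be appended to the argument list and the command re-run on the full set of run-level GVCFs.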
  • sheilazt, Valencia (Member)

    Dear Geraldine,
    I work for one of the main public hospitals in my country, and the number of samples is increasing very fast. If there is a way to gain early access to TileDB, we would be happy to collaborate in its development and testing.

    Thanks very much for your quick reply.



  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Hi @sheilazt,

    The code to set this up is already publicly available under the name GenomicsDB. You can have a look at it and ask its developers whether they could help you set it up.

    Good luck!

  • Is GenomicsDB the same as "ImportGenomicsDB" in GATK4?

  • shlee, Cambridge (Member, Broadie)

    Hi @pearlpearl,

    GenomicsDB refers both to the database that the GenomicsDBImport tool generates and to the functionality it embodies.
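To make the relationship between the tool and the database concrete, here is a minimal sketch of how a GenomicsDB workspace might be created and then joint-called in GATK4. It assumes a released GATK4 version in which GenomicsDBImport accepts --genomicsdb-workspace-path and GenotypeGVCFs reads a workspace through a gendb:// path; the workspace name cohort_db, the interval chr20, the output name cohort.vcf.gz, and both helper functions are hypothetical. The script only prints the commands rather than running them.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print a GenomicsDBImport command.
# Arguments: workspace directory, genomic interval, then one or more GVCFs.
build_import_cmd() {
  workspace="$1"
  interval="$2"
  shift 2
  cmd="gatk GenomicsDBImport --genomicsdb-workspace-path $workspace -L $interval"
  # Add one -V argument per input GVCF.
  for gvcf in "$@"; do
    cmd="$cmd -V $gvcf"
  done
  printf '%s\n' "$cmd"
}

# Hypothetical helper: print a GenotypeGVCFs command that reads the workspace.
# Arguments: reference FASTA, workspace directory, output VCF.
build_joint_call_cmd() {
  printf 'gatk GenotypeGVCFs -R %s -V gendb://%s -O %s\n' "$1" "$2" "$3"
}

build_import_cmd cohort_db chr20 sample_1.g.vcf sample_2.g.vcf
build_joint_call_cmd GRCh38.fasta cohort_db cohort.vcf.gz
```

In this arrangement the workspace plays the role that the combined per-run GVCFs play in the GATK3 workflow from the original question.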
