Most efficient way to combine multiple GVCF files

sheilazt Valencia Member
edited January 2017 in Ask the GATK team

Dear GATK team,
In order to calculate variant frequencies, all samples from different runs must be in the same GVCF. For each run, I am currently running the CombineGVCFs tool, and then I use the GenotypeGVCFs tool to combine the different runs and calculate variant frequencies. When a new run arrives, I run GenotypeGVCFs again, adding the new run's GVCF to the list of input files, so that I end up with a single file for all runs. I wonder if there is a more efficient way, with lower computational cost, to join the GVCF files produced by CombineGVCFs into one, so that I can then use this single file as input for GenotypeGVCFs along with the new GVCFs that come from new runs.

Example:

Now I work this way:

Run_1
gatk --analysis_type CombineGVCFs --variant sample_1.g.vcf --variant sample_2.g.vcf -R GRCh38.fasta -o run1.g.vcf

Run_2
gatk --analysis_type CombineGVCFs --variant sample_3.g.vcf --variant sample_4.g.vcf -R GRCh38.fasta -o run2.g.vcf

Combine Run_1 and Run_2
gatk --analysis_type GenotypeGVCFs --variant run1.g.vcf --variant run2.g.vcf -R GRCh38.fasta -o run1_run2.g.vcf --includeNonVariantSites

I wonder if there is a better way to do the following:

Run_3
gatk --analysis_type CombineGVCFs --variant sample_5.g.vcf --variant sample_6.g.vcf -R GRCh38.fasta -o run3.g.vcf

Combine the GVCF from Run_1+Run_2 and the new Run_3
gatk --analysis_type GenotypeGVCFs --variant run1_run2.g.vcf --variant run3.g.vcf -R GRCh38.fasta -o run1_run2_run3.g.vcf --includeNonVariantSites

I've tried different ways with no success.

Thanks very much in advance.

Regards,

Sheila

Comments

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin
    Actually, the most efficient way is to use a special kind of database called TileDB that was developed by our Intel collaborators. Whenever you generate a GVCF, you add its contents to this database, then whenever you want you can just run GenotypeGVCFs on the database to joint-call all samples. Unfortunately this is rather difficult to set up so we are not yet able to provide it to the general public. We hope to make this solution widely available when we release GATK4 later this year.

    In the meantime, the approach you are using (batch-combining samples) is reasonable if you're working with many samples. If you have few samples, be aware that it is not necessary to combine them first, since GenotypeGVCFs can take multiple GVCF inputs. We do the combining when there are many samples to keep the number of concurrently open files low (below 200).
  • sheilazt Valencia Member

    Dear Geraldine,
    I work for one of the main public hospitals in my country and the number of samples is increasing very fast. If there is a way to gain early access to the TileDB we would be happy to collaborate in its development/testing.

    Thanks very much for your quick reply.

    Regards,

    Sheila

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi @sheilazt,

    The code to set this up is already publicly available under the name GenomicsDB here. You can have a look there and ask the developers if they could help you set it up.

    Good luck!

  • Is GenomicDB the same as "ImportGenomicsDB" in GATK4?

  • shlee Cambridge Member, Broadie, Moderator admin

    Hi @pearlpearl,

    GenomicsDB refers both to the database that the GenomicsDBImport tool generates and to the functionality it embodies.
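
For readers on GATK4, the database workflow described in the comments above might look roughly like the sketch below. It is not taken from the thread: the workspace name, the interval, and the output file name are hypothetical, and the incremental-import option (--genomicsdb-update-workspace-path) is assumed to be available, which is only the case in more recent GATK4 releases. The script only prints the command lines, so it can be read and adjusted before anything is actually run.

```shell
#!/bin/sh
# Dry-run sketch: prints GATK4 command lines instead of executing them.
# WORKSPACE, INTERVAL and the output name are illustrative assumptions.

WORKSPACE="cohort_db"      # hypothetical GenomicsDB workspace directory
REFERENCE="GRCh38.fasta"   # reference file name used in the original post
INTERVAL="chr20"           # hypothetical; a new import needs an interval

# Turn a list of per-sample GVCFs into "-V a -V b ..." arguments.
variant_args() {
    args=""
    for gvcf in "$@"; do
        args="$args -V $gvcf"
    done
    printf '%s' "${args# }"
}

# First import: create the workspace from the samples available so far.
import_new() {
    echo "gatk GenomicsDBImport $(variant_args "$@") --genomicsdb-workspace-path $WORKSPACE -L $INTERVAL"
}

# Later runs: add new samples to the existing workspace
# (assumes a GATK4 version that supports incremental import).
import_update() {
    echo "gatk GenomicsDBImport $(variant_args "$@") --genomicsdb-update-workspace-path $WORKSPACE"
}

# Joint-call every sample currently stored in the workspace.
joint_call() {
    echo "gatk GenotypeGVCFs -R $REFERENCE -V gendb://$WORKSPACE -O $1"
}

import_new sample_1.g.vcf sample_2.g.vcf sample_3.g.vcf sample_4.g.vcf
import_update sample_5.g.vcf sample_6.g.vcf
joint_call all_runs.vcf
```

In this arrangement the per-run CombineGVCFs step is replaced by an import into the database, and GenotypeGVCFs is pointed at the workspace rather than at a growing list of GVCF files. Back up the workspace before updating it, and check the tool documentation for the GATK version in use before relying on any of these options.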
