Holiday Notice:
The Frontline Support team will be slow to respond December 17-18 due to an institute-wide retreat and offline December 22- January 1, while the institute is closed. Thank you for your patience during these next few weeks. Happy Holidays!

(How to) Consolidate GVCFs for joint calling with GenotypeGVCFs

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
edited May 18 in Tutorials

In GATK4, the GenotypeGVCFs tool can only take a single input, so if you have GVCFs from multiple samples (which is usually the case) you will need to combine them before feeding them to GenotypeGVCFs. Although there are several tools in the GATK and Picard toolkits that provide some type of VCF or GVCF merging functionality, for this use case only two of them can do the GVCF consolidation step correctly: GenomicsDBImport and CombineGVCFs.

GenomicsDBImport is the preferred tool (see detailed instructions below); CombineGVCFs is provided only as a backup solution for people who cannot use GenomicsDBImport, which has a few limitations (for example it can only run on diploid data at the moment). We know CombineGVCFs is quite inefficient and typically requires a lot of memory, so we encourage you to try GenomicsDBImport first and only fall back on CombineGVCFs if you experience issues that we are unable to help you solve (ask us for help in the forum!).


UsingGenomicsDBImport in practice

The GenomicsDBImport tool takes in one or more single-sample GVCFs and imports data over a single interval, and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs can then read from the created GenomicsDB directly and output the final multi-sample VCF.

So if you have a trio of GVCFs your GenomicsDBImportcommand would look like this, assuming you're running per chromosome (here we're showing the tool running on chromosome 20):

gatk GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    --genomicsdb-workspace-path my_database \
    --intervals chr20

That generates a directory called my_database containing the combined GVCF data for chromosome 20. The contents of the directory are not really human-readable; see further down for tips to deal with that.

Then you run joint genotyping; note the gendb:// prefix to the database input directory path.

gatk GenotypeGVCFs \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -G StandardAnnotation -newQual \
    -O test_output.vcf 

And that's all there is to it.


Important limitations:

  1. You can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples. For very large numbers of samples, there are some batching options.

  2. At the moment you can only run GenomicsDBImport on a single genomic interval (ie max one contig) at a time. Down the road this will change (the work is tentatively scheduled for the second quarter of 2018), because we want to make it possible to run on one multiple intervals in one go. But for now you need to run on each interval separately. We recommend scripting this of course.

  3. GenomicsDBImport cannot accept multiple GVCFs for the same sample, so if for example you generated separate GVCFs per chromosome for each sample, you'll need to either concatenate the chromosome GVCFs to produce a single GVCF per sample (using CatVariants) or scatter the following steps by chromosome as well.

**If you can't use GenomicsDBImport for whatever reason, fall back to CombineGVCFs instead. It is slower but will allow you to combine GVCFs the old-fashioned way. **


Addendum: extracting GVCF data from the GenomicsDB

If you want to generate a flat multisample GVCF file from the GenomicsDB you created, you can do so with SelectVariants as follows:

gatk SelectVariants \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -O combined.g.vcf
Post edited by Sheila on

Comments

  • RosmaninhoRosmaninho Member

    If we want to run GenomicsDBImport in all chromosomes how do we specify it?
    Is it possible to specificy samples using the following command?
    -V $path/*.g.vcf

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Rosmaninho
    Hi,

    Have a look at Geraldine's last answer in this thread.

    Is it possible to specificy samples using the following command?

    -V $path/*.g.vcf
    That is the way to do it :smile:

    -Sheila

  • EliseGEliseG Member

    Hi,

    So if I understand correctly, the best way is to use GenomicsDBImport when it's possible (as a BestPractice).

    But is-it possible to use -V ${sep = "- V" GVCFs} like in the following tutorial: https://software.broadinstitute.org/wdl/documentation/article?id=7614 ?

    Elise

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @EliseG
    Hi Elise,

    No, GenotypeGVCFs in GATK4 only accepts 1 input GVCF. Have a look at this article.

    -Sheila

  • EliseGEliseG Member

    Thanks Sheila!

    Elise

  • biojiangkebiojiangke Member ✭✭

    Hi,

    I have a question about the behavior of the interval option in CombineGVCF: I understand it could take standard samtools format chr:start-end, and BED format, but it also could take the format of chr:pos, as I tried. I would think GATK processes one genomic position in this situation, but instead, I'm getting results up to 5bp from this specified position. Would anyone provide more information about this behavior?

    The application behind this is that sometimes we use this type of operation to fetch genotypes across samples with WGS data and compare with results from other genotyping platforms such as SNP chips and amplicons. In this case, the sites to be checked are discrete and scattered across the genome and I had to supply GATK with multiple intervals.

    Thanks!

    Ke

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @biojiangke
    Hi Ke,

    I'm getting results up to 5bp from this specified position.

    I am not sure what you mean. Can you post some example records?

    Thanks,
    Sheila

  • biojiangkebiojiangke Member ✭✭

    For example, I supply a position to the GenotypGVCF operation in this format: 2:42889589 with the -L option. I would expect genotypes from this location, but instead, I got variant information at 2:42889592, which is 3bp away from the expected location.

  • biojiangkebiojiangke Member ✭✭

    Sometimes I can see this is an InDel issue, but sometimes it is a SNP, but the position is still off.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @biojiangke
    Hi Ke,

    I see. Can you post the exact command you ran? Are you only using one position in the interval?

    Thanks,
    Sheila

  • biojiangkebiojiangke Member ✭✭

    sites.list (GTAK format but only 1bp)

    2:121169112
    2:35031950
    2:42889592
    2:4365346

    Combine VCF step:

    gatk CombineGVCFs -R ref.fa -L sites.list -O Combined.vcf -V gVCF.list

    Joint call step:

    gatk GenotypeGVCFs -O GTed.vcf -R ref.fa -V Combined.vcf

    The positions in GTed.vcf are:

    2 4365345
    2 35031949
    2 42889589
    2 121169110

    After inspecting the results, I seem to know the reason. As long as there is an additional alternative allele, even from a few reads, supporting InDels that overlap with the queried location, the sites involving these alleles would be written into output. I think this makes sense, as any sites, even far away from the queried site, may have alleles affecting the status at the queried site.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
    edited June 29

    @biojiangke
    Hi,

    As long as there is an additional alternative allele, even from a few reads, supporting InDels that overlap with the queried location, the sites involving these alleles would be written into output. I think this makes sense, as any sites, even far away from the queried site, may have alleles affecting the status at the queried site.

    Yes, that is correct. Thanks for posting.

    -Sheila

  • QuinnCQuinnC Member

    I am currently using GATK 4.0.5.1 and looking to analyse SNPs from a target capture dataset done on a non-model organism without linkage or chromosome information. There are 2900 targeted regions/contigs each across 45 individuals. I have run HaplotypeCaller using -ERC mode and now have 45 g.vcf files (about 400-700mb each) that I will like to combine for GenotypeGVCFs.
    Does the GenomicsDBImport for this version or newer versions support multiple intervals? I will prefer not to have 2900 separate output files after this step.
    I have tried combineGVCFs across the 45 g.vcf files without using the -L option but always ran into "GC threads out of memory error" even with -Xmx32G.
    Previously we used GATK 3.7 without any problem as GenotypeGVCFs accepted multiple g.vcf files as input.
    I'd appreciate your suggestion on how I can move forward with my analyses.
    Thanks in advance,
    Quinn

  • QuinnCQuinnC Member
    edited August 3

    Thank you, Mauro! Will upgrade to 4.0.6.0 and give it a try!

Sign In or Register to comment.