(How to) Consolidate GVCFs for joint calling with GenotypeGVCFs

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
edited January 23 in Tutorials

In GATK4, the GenotypeGVCFs tool can only take a single input i.e., 1) a single single-sample GVCF 2) a single multi-sample GVCF created by CombineGVCFs or 3) a GenomicsDB workspace created by GenomicsDBImport. If you have GVCFs from multiple samples (which is usually the case) you will need to combine them before feeding them to GenotypeGVCFs. The input samples must possess genotype likelihoods containing the allele produced by HaplotypeCaller with -ERC GVCF or -ERC BP_RESOLUTION.

Although there are several tools in the GATK and Picard toolkits that provide some type of VCF merging functionality, for this use case ONLY two of them can do the GVCF consolidation step correctly: GenomicsDBImport and CombineGVCFs.

GenomicsDBImport is the preferred tool (see detailed instructions below); CombineGVCFs is provided only as a backup solution for people who cannot use GenomicsDBImport. We know CombineGVCFs is quite inefficient and typically requires a lot of memory, so we encourage you to try GenomicsDBImport first and only fall back on CombineGVCFs if you experience issues that we are unable to help you solve (ask us for help in the forum!).


UsingGenomicsDBImport in practice

The GenomicsDBImport tool takes in one or more single-sample GVCFs and imports data over at least one genomics interval (this feature is available in v4.0.6.0 and later and stable in v4.0.8.0 and later), and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs can then read from the created GenomicsDB directly and output the final multi-sample VCF.

So if you have a trio of GVCFs your GenomicsDBImport command would look like this, assuming you're running per chromosome (here we're showing the tool running on chromosome 20 and chromosome 21):

gatk GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    --genomicsdb-workspace-path my_database \
    --intervals chr20,chr21

That generates a directory called my_database containing the combined GVCF data for chromosome 20 and 21. (The contents of the directory are not really human-readable; see “extracting GVCF data from a GenomicsDB” to evaluate the combined, pre-genotyped data. Also note that the log will contain a series of messages like Buffer resized from 178298bytes to 262033 -- this is expected.) For larger cohort sizes, we recommend specifying a batch size of 50 for improved memory usage. A sample map file can also be specified when enumerating the GVCFs individually as above becomes arduous.

Then you run joint genotyping; note the gendb:// prefix to the database input directory path. Note that this step requires a reference, even though the import can be run without one.

gatk GenotypeGVCFs \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -newQual \
    -O test_output.vcf 

And that's all there is to it.


Important limitations and Common “Gotchas”:

  1. You can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples. For very large numbers of samples, there are some batching options.

  2. At least one interval must be provided when using GenomicsDBImport.

  3. Input GVCFs cannot contain multiple entries for a single genomic position

  4. GenomicsDBImport cannot accept multiple GVCFs for the same sample, so if for example you generated separate GVCFs per chromosome for each sample, you'll need to either concatenate the chromosome GVCFs to produce a single GVCF per sample (using GatherVcfs) or scatter the following steps by chromosome as well.

  5. The annotation counts specified in the header MUST BE VALID! If not, you may see an error like A fatal error has been detected by the Java Runtime Environment [...] SIGSEGV with mention of a core dump (which may or may not be output depending on your system configuration.) You can check you annotation headers with vcf-validator from VCFtools [https://github.com/vcftools/vcftools]

  6. GenomicsDB will not overwrite an existing workspace. To rerun an import, you will have to manually delete the workspace before running the command again.

  7. If you’re working on a POSIX filesystem (e.g. Lustre, NFS, xfs, ext4 etc), you must set the environment variable TILEDB_DISABLE_FILE_LOCKING=1 before running any GenomicsDB tool. If you don’t, you will likely see an error like Could not open array genomicsdb_array at workspace:[...]

  8. HaplotypeCaller output containing MNPs cannot be merged with CombineGVCFs or GenotypeGVCFs. For phasing nearby variants in multi-sample callsets, MNPs can be inferred from the phase set (PS) tag in the FORMAT field.

  9. There are a few other, rare bugs we’re in the process of working out. If you run into problems, you can check the open github issues [https://github.com/broadinstitute/gatk/issues?utf8=✓&q=is:issue+is:open+genomicsdb] to see if a fix is in in progress.

If you can't use GenomicsDBImport for whatever reason, fall back to CombineGVCFs instead. It is slower but will allow you to combine GVCFs the old-fashioned way.


Addendum: extracting GVCF data from the GenomicsDB

If you want to generate a flat multisample GVCF file from the GenomicsDB you created, you can do so with SelectVariants as follows:

gatk SelectVariants \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -O combined.g.vcf

Bells and Whistles

GenomicsDB now supports allele-specific annotations [ https://software.broadinstitute.org/gatk/documentation/article?id=9622 ], which have become standard in our Broad exome production pipeline.

GenomicsDB can now import directly from a Google cloud path (i.e. gs://) using NIO.

Post edited by bhanuGandham on

Comments

  • RosmaninhoRosmaninho Member

    If we want to run GenomicsDBImport in all chromosomes how do we specify it?
    Is it possible to specificy samples using the following command?
    -V $path/*.g.vcf

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Rosmaninho
    Hi,

    Have a look at Geraldine's last answer in this thread.

    Is it possible to specificy samples using the following command?

    -V $path/*.g.vcf
    That is the way to do it :smile:

    -Sheila

  • EliseGEliseG Member

    Hi,

    So if I understand correctly, the best way is to use GenomicsDBImport when it's possible (as a BestPractice).

    But is-it possible to use -V ${sep = "- V" GVCFs} like in the following tutorial: https://software.broadinstitute.org/wdl/documentation/article?id=7614 ?

    Elise

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @EliseG
    Hi Elise,

    No, GenotypeGVCFs in GATK4 only accepts 1 input GVCF. Have a look at this article.

    -Sheila

  • EliseGEliseG Member

    Thanks Sheila!

    Elise

  • biojiangkebiojiangke Member ✭✭

    Hi,

    I have a question about the behavior of the interval option in CombineGVCF: I understand it could take standard samtools format chr:start-end, and BED format, but it also could take the format of chr:pos, as I tried. I would think GATK processes one genomic position in this situation, but instead, I'm getting results up to 5bp from this specified position. Would anyone provide more information about this behavior?

    The application behind this is that sometimes we use this type of operation to fetch genotypes across samples with WGS data and compare with results from other genotyping platforms such as SNP chips and amplicons. In this case, the sites to be checked are discrete and scattered across the genome and I had to supply GATK with multiple intervals.

    Thanks!

    Ke

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @biojiangke
    Hi Ke,

    I'm getting results up to 5bp from this specified position.

    I am not sure what you mean. Can you post some example records?

    Thanks,
    Sheila

  • biojiangkebiojiangke Member ✭✭

    For example, I supply a position to the GenotypGVCF operation in this format: 2:42889589 with the -L option. I would expect genotypes from this location, but instead, I got variant information at 2:42889592, which is 3bp away from the expected location.

  • biojiangkebiojiangke Member ✭✭

    Sometimes I can see this is an InDel issue, but sometimes it is a SNP, but the position is still off.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @biojiangke
    Hi Ke,

    I see. Can you post the exact command you ran? Are you only using one position in the interval?

    Thanks,
    Sheila

  • biojiangkebiojiangke Member ✭✭

    sites.list (GTAK format but only 1bp)

    2:121169112
    2:35031950
    2:42889592
    2:4365346

    Combine VCF step:

    gatk CombineGVCFs -R ref.fa -L sites.list -O Combined.vcf -V gVCF.list

    Joint call step:

    gatk GenotypeGVCFs -O GTed.vcf -R ref.fa -V Combined.vcf

    The positions in GTed.vcf are:

    2 4365345
    2 35031949
    2 42889589
    2 121169110

    After inspecting the results, I seem to know the reason. As long as there is an additional alternative allele, even from a few reads, supporting InDels that overlap with the queried location, the sites involving these alleles would be written into output. I think this makes sense, as any sites, even far away from the queried site, may have alleles affecting the status at the queried site.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
    edited June 2018

    @biojiangke
    Hi,

    As long as there is an additional alternative allele, even from a few reads, supporting InDels that overlap with the queried location, the sites involving these alleles would be written into output. I think this makes sense, as any sites, even far away from the queried site, may have alleles affecting the status at the queried site.

    Yes, that is correct. Thanks for posting.

    -Sheila

  • QuinnCQuinnC Member

    I am currently using GATK 4.0.5.1 and looking to analyse SNPs from a target capture dataset done on a non-model organism without linkage or chromosome information. There are 2900 targeted regions/contigs each across 45 individuals. I have run HaplotypeCaller using -ERC mode and now have 45 g.vcf files (about 400-700mb each) that I will like to combine for GenotypeGVCFs.
    Does the GenomicsDBImport for this version or newer versions support multiple intervals? I will prefer not to have 2900 separate output files after this step.
    I have tried combineGVCFs across the 45 g.vcf files without using the -L option but always ran into "GC threads out of memory error" even with -Xmx32G.
    Previously we used GATK 3.7 without any problem as GenotypeGVCFs accepted multiple g.vcf files as input.
    I'd appreciate your suggestion on how I can move forward with my analyses.
    Thanks in advance,
    Quinn

  • QuinnCQuinnC Member
    edited August 2018

    Thank you, Mauro! Will upgrade to 4.0.6.0 and give it a try!

  • MehulSMehulS Member

    Do I need to install a separate GenomicsDB tool for this command to work ? The GitHub link seems to suggest so.

  • hdeteringhdetering Vigo, SpainMember

    Using GenotypeGVCFs from GATK 4.0.10.0, the option -newQual gave me an error:

    A USER ERROR has occurred: n is not a recognized option
    

    Trying --newQual was unsuccessful, too:

    A USER ERROR has occurred: newQual is not a recognized option
    

    Am I right in assuming that this was an upward-compatibility option in GATK3?

  • leshwillleshwill HoustonMember

    Hello, I have 520 WES samples. I'm trying to run GenomicsDBImport per chromosome.
    I used the --intervals chr1 as per the example but got "Query interval "chr1" is not valid for this input".
    Is it possible to run GenomincsDBImport per chromosome?

    My plan is to create joint called VCFs for each chromosome then combine the chromosome VCFs to get a final VCF file. I've been running CombineGVCFs for about 12days and so far the progress meter indicates it's on chromosome 5. I need help. How can I make this process go faster?

  • danangcrysnantodanangcrysnanto EdinburghMember
    Hi

    I parallelize genomicsdbimport tools to 2000 interval across genome with sample size ~400 and run about 100 jobs simultaneously. I got complain from my cluster admin that my jobs overloaded the Lustre filesystems by having lots I/O (I already set --batch-size 50). Alternatively, I could use the SSD attached to the computing nodes but I need to move the output files whenever the jobs finished as it will be immediately deleted.

    Is it now possible to move the genomic database output after creation? Do you recommend to use CombineGVCFs instead (and 400 samples are realistics to be put in a single big gvcf files)?

    Thank you
  • lauren_whitelauren_white Member
    Hello,
    I am trying to combine 350 exome-capture GVCFs using GenomicDBImport. I have started with chr21 (-L chr21) as a test run, but the program seems to be stuck. The last output showed the headings for the progress meter, but it has not moved on in nearly an hour.

    I have only recently switched to gatk4. When using gatk3 GenotypeGVCFs (without first having to combine the gvcfs) on chr21 alone, the process usually took less than an hour. I have tried switching back to this version, but the single gvcfs now seem to contain header information that gatk3 can not deal with (Unable to parse header with error: For input string: "R").

    I appreciate any advice or help you can provide.
    Thankyou
    -Lauren
  • EvanKristiansenEvanKristiansen Member
    edited March 13
    I'm curious if this is still the most common thread for GenomicsDBImport? I've yet to see anyone answer the question of whether it's possible to run GenomicsDBImport with all contigs. I'm running gatk/4.0.7.0 on an SCC.

    Basically, I have 20 individuals aligned and called with a not so great reference genome; so I have ~300 scaffolds. Is it possible to run GenomicsDBImport without having to manually input a list of 300 scaffolds?

    Thanks!
    Evan
  • jmedaromjmedarom Member
    Hi,

    I have a question about this Geraldine's comment :

    "Hi @sam0per, in our pipelines we run through GenomicsDBImport and GenotypeGVCFs per interval, then we combine the per-interval VCFs after that. We don't combine workspaces; that is not possible. However there are a few tweaks you can apply."

    In my case, I ran GenomicsDBImport per interval (per each chromosome as -L 1 -L 2 -L 3...), but I did not do the same with GenotypeVCFs and thus did not combine the per-interval VCFs. is this correct? or is it necessary running both tools per interval?.

    Thanks,

    Jenn
  • vaa2011vaa2011 Member
    Hi,

    I would like to combine a cohort of samples from WGS. For this I followed the indications in this post (on a small subset of my cohort) and was trying to run GenomicsDBImport on all of the chromosomes at the same time. As the command forces you to specify at least one interval, I listed all the chromosomes and I get this error:

    A USER ERROR has occurred: Badly formed genome unclippedLoc: Query interval "chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX" is not valid for this input

    Has anyone been able to successfully run this command for the whole genome? I have no issues running CombineGVCFs except for it being slow.
Sign In or Register to comment.