Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

(How to) Consolidate GVCFs for joint calling with GenotypeGVCFs

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
edited January 23 in Tutorials

In GATK4, the GenotypeGVCFs tool can only take a single input i.e., 1) a single single-sample GVCF 2) a single multi-sample GVCF created by CombineGVCFs or 3) a GenomicsDB workspace created by GenomicsDBImport. If you have GVCFs from multiple samples (which is usually the case) you will need to combine them before feeding them to GenotypeGVCFs. The input samples must possess genotype likelihoods containing the allele produced by HaplotypeCaller with -ERC GVCF or -ERC BP_RESOLUTION.

Although there are several tools in the GATK and Picard toolkits that provide some type of VCF merging functionality, for this use case ONLY two of them can do the GVCF consolidation step correctly: GenomicsDBImport and CombineGVCFs.

GenomicsDBImport is the preferred tool (see detailed instructions below); CombineGVCFs is provided only as a backup solution for people who cannot use GenomicsDBImport. We know CombineGVCFs is quite inefficient and typically requires a lot of memory, so we encourage you to try GenomicsDBImport first and only fall back on CombineGVCFs if you experience issues that we are unable to help you solve (ask us for help in the forum!).


UsingGenomicsDBImport in practice

The GenomicsDBImport tool takes in one or more single-sample GVCFs and imports data over at least one genomics interval (this feature is available in v4.0.6.0 and later and stable in v4.0.8.0 and later), and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs can then read from the created GenomicsDB directly and output the final multi-sample VCF.

So if you have a trio of GVCFs your GenomicsDBImport command would look like this, assuming you're running per chromosome (here we're showing the tool running on chromosome 20 and chromosome 21):

gatk GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    --genomicsdb-workspace-path my_database \
    --intervals chr20,chr21

That generates a directory called my_database containing the combined GVCF data for chromosome 20 and 21. (The contents of the directory are not really human-readable; see “extracting GVCF data from a GenomicsDB” to evaluate the combined, pre-genotyped data. Also note that the log will contain a series of messages like Buffer resized from 178298bytes to 262033 -- this is expected.) For larger cohort sizes, we recommend specifying a batch size of 50 for improved memory usage. A sample map file can also be specified when enumerating the GVCFs individually as above becomes arduous.

Then you run joint genotyping; note the gendb:// prefix to the database input directory path. Note that this step requires a reference, even though the import can be run without one.

gatk GenotypeGVCFs \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -newQual \
    -O test_output.vcf 

And that's all there is to it.


Important limitations and Common “Gotchas”:

  1. You can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples. For very large numbers of samples, there are some batching options.

  2. At least one interval must be provided when using GenomicsDBImport.

  3. Input GVCFs cannot contain multiple entries for a single genomic position

  4. GenomicsDBImport cannot accept multiple GVCFs for the same sample, so if for example you generated separate GVCFs per chromosome for each sample, you'll need to either concatenate the chromosome GVCFs to produce a single GVCF per sample (using GatherVcfs) or scatter the following steps by chromosome as well.

  5. The annotation counts specified in the header MUST BE VALID! If not, you may see an error like A fatal error has been detected by the Java Runtime Environment [...] SIGSEGV with mention of a core dump (which may or may not be output depending on your system configuration.) You can check you annotation headers with vcf-validator from VCFtools [https://github.com/vcftools/vcftools]

  6. GenomicsDB will not overwrite an existing workspace. To rerun an import, you will have to manually delete the workspace before running the command again.

  7. If you’re working on a POSIX filesystem (e.g. Lustre, NFS, xfs, ext4 etc), you must set the environment variable TILEDB_DISABLE_FILE_LOCKING=1 before running any GenomicsDB tool. If you don’t, you will likely see an error like Could not open array genomicsdb_array at workspace:[...]

  8. HaplotypeCaller output containing MNPs cannot be merged with CombineGVCFs or GenotypeGVCFs. For phasing nearby variants in multi-sample callsets, MNPs can be inferred from the phase set (PS) tag in the FORMAT field.

  9. There are a few other, rare bugs we’re in the process of working out. If you run into problems, you can check the open github issues [https://github.com/broadinstitute/gatk/issues?utf8=✓&q=is:issue+is:open+genomicsdb] to see if a fix is in in progress.

If you can't use GenomicsDBImport for whatever reason, fall back to CombineGVCFs instead. It is slower but will allow you to combine GVCFs the old-fashioned way.


Addendum: extracting GVCF data from the GenomicsDB

If you want to generate a flat multisample GVCF file from the GenomicsDB you created, you can do so with SelectVariants as follows:

gatk SelectVariants \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -O combined.g.vcf

Bells and Whistles

GenomicsDB now supports allele-specific annotations [ https://software.broadinstitute.org/gatk/documentation/article?id=9622 ], which have become standard in our Broad exome production pipeline.

GenomicsDB can now import directly from a Google cloud path (i.e. gs://) using NIO.

Post edited by bhanuGandham on

Comments

  • RosmaninhoRosmaninho Member

    If we want to run GenomicsDBImport in all chromosomes how do we specify it?
    Is it possible to specificy samples using the following command?
    -V $path/*.g.vcf

  • SheilaSheila Broad InstituteMember, Broadie admin

    @Rosmaninho
    Hi,

    Have a look at Geraldine's last answer in this thread.

    Is it possible to specificy samples using the following command?

    -V $path/*.g.vcf
    That is the way to do it :smile:

    -Sheila

  • EliseGEliseG Member

    Hi,

    So if I understand correctly, the best way is to use GenomicsDBImport when it's possible (as a BestPractice).

    But is-it possible to use -V ${sep = "- V" GVCFs} like in the following tutorial: https://software.broadinstitute.org/wdl/documentation/article?id=7614 ?

    Elise

  • SheilaSheila Broad InstituteMember, Broadie admin

    @EliseG
    Hi Elise,

    No, GenotypeGVCFs in GATK4 only accepts 1 input GVCF. Have a look at this article.

    -Sheila

  • EliseGEliseG Member

    Thanks Sheila!

    Elise

  • biojiangkebiojiangke Member ✭✭

    Hi,

    I have a question about the behavior of the interval option in CombineGVCF: I understand it could take standard samtools format chr:start-end, and BED format, but it also could take the format of chr:pos, as I tried. I would think GATK processes one genomic position in this situation, but instead, I'm getting results up to 5bp from this specified position. Would anyone provide more information about this behavior?

    The application behind this is that sometimes we use this type of operation to fetch genotypes across samples with WGS data and compare with results from other genotyping platforms such as SNP chips and amplicons. In this case, the sites to be checked are discrete and scattered across the genome and I had to supply GATK with multiple intervals.

    Thanks!

    Ke

  • SheilaSheila Broad InstituteMember, Broadie admin

    @biojiangke
    Hi Ke,

    I'm getting results up to 5bp from this specified position.

    I am not sure what you mean. Can you post some example records?

    Thanks,
    Sheila

  • biojiangkebiojiangke Member ✭✭

    For example, I supply a position to the GenotypGVCF operation in this format: 2:42889589 with the -L option. I would expect genotypes from this location, but instead, I got variant information at 2:42889592, which is 3bp away from the expected location.

  • biojiangkebiojiangke Member ✭✭

    Sometimes I can see this is an InDel issue, but sometimes it is a SNP, but the position is still off.

  • SheilaSheila Broad InstituteMember, Broadie admin

    @biojiangke
    Hi Ke,

    I see. Can you post the exact command you ran? Are you only using one position in the interval?

    Thanks,
    Sheila

  • biojiangkebiojiangke Member ✭✭

    sites.list (GTAK format but only 1bp)

    2:121169112
    2:35031950
    2:42889592
    2:4365346

    Combine VCF step:

    gatk CombineGVCFs -R ref.fa -L sites.list -O Combined.vcf -V gVCF.list

    Joint call step:

    gatk GenotypeGVCFs -O GTed.vcf -R ref.fa -V Combined.vcf

    The positions in GTed.vcf are:

    2 4365345
    2 35031949
    2 42889589
    2 121169110

    After inspecting the results, I seem to know the reason. As long as there is an additional alternative allele, even from a few reads, supporting InDels that overlap with the queried location, the sites involving these alleles would be written into output. I think this makes sense, as any sites, even far away from the queried site, may have alleles affecting the status at the queried site.

  • SheilaSheila Broad InstituteMember, Broadie admin
    edited June 2018

    @biojiangke
    Hi,

    As long as there is an additional alternative allele, even from a few reads, supporting InDels that overlap with the queried location, the sites involving these alleles would be written into output. I think this makes sense, as any sites, even far away from the queried site, may have alleles affecting the status at the queried site.

    Yes, that is correct. Thanks for posting.

    -Sheila

  • QuinnCQuinnC Member

    I am currently using GATK 4.0.5.1 and looking to analyse SNPs from a target capture dataset done on a non-model organism without linkage or chromosome information. There are 2900 targeted regions/contigs each across 45 individuals. I have run HaplotypeCaller using -ERC mode and now have 45 g.vcf files (about 400-700mb each) that I will like to combine for GenotypeGVCFs.
    Does the GenomicsDBImport for this version or newer versions support multiple intervals? I will prefer not to have 2900 separate output files after this step.
    I have tried combineGVCFs across the 45 g.vcf files without using the -L option but always ran into "GC threads out of memory error" even with -Xmx32G.
    Previously we used GATK 3.7 without any problem as GenotypeGVCFs accepted multiple g.vcf files as input.
    I'd appreciate your suggestion on how I can move forward with my analyses.
    Thanks in advance,
    Quinn

  • QuinnCQuinnC Member
    edited August 2018

    Thank you, Mauro! Will upgrade to 4.0.6.0 and give it a try!

  • MehulSMehulS Member

    Do I need to install a separate GenomicsDB tool for this command to work ? The GitHub link seems to suggest so.

  • hdeteringhdetering Vigo, SpainMember

    Using GenotypeGVCFs from GATK 4.0.10.0, the option -newQual gave me an error:

    A USER ERROR has occurred: n is not a recognized option
    

    Trying --newQual was unsuccessful, too:

    A USER ERROR has occurred: newQual is not a recognized option
    

    Am I right in assuming that this was an upward-compatibility option in GATK3?

  • leshwillleshwill HoustonMember

    Hello, I have 520 WES samples. I'm trying to run GenomicsDBImport per chromosome.
    I used the --intervals chr1 as per the example but got "Query interval "chr1" is not valid for this input".
    Is it possible to run GenomincsDBImport per chromosome?

    My plan is to create joint called VCFs for each chromosome then combine the chromosome VCFs to get a final VCF file. I've been running CombineGVCFs for about 12days and so far the progress meter indicates it's on chromosome 5. I need help. How can I make this process go faster?

  • danangcrysnantodanangcrysnanto EdinburghMember
    Hi

    I parallelize genomicsdbimport tools to 2000 interval across genome with sample size ~400 and run about 100 jobs simultaneously. I got complain from my cluster admin that my jobs overloaded the Lustre filesystems by having lots I/O (I already set --batch-size 50). Alternatively, I could use the SSD attached to the computing nodes but I need to move the output files whenever the jobs finished as it will be immediately deleted.

    Is it now possible to move the genomic database output after creation? Do you recommend to use CombineGVCFs instead (and 400 samples are realistics to be put in a single big gvcf files)?

    Thank you
  • lauren_whitelauren_white Member
    Hello,
    I am trying to combine 350 exome-capture GVCFs using GenomicDBImport. I have started with chr21 (-L chr21) as a test run, but the program seems to be stuck. The last output showed the headings for the progress meter, but it has not moved on in nearly an hour.

    I have only recently switched to gatk4. When using gatk3 GenotypeGVCFs (without first having to combine the gvcfs) on chr21 alone, the process usually took less than an hour. I have tried switching back to this version, but the single gvcfs now seem to contain header information that gatk3 can not deal with (Unable to parse header with error: For input string: "R").

    I appreciate any advice or help you can provide.
    Thankyou
    -Lauren
  • EvanKristiansenEvanKristiansen Member
    edited March 13
    I'm curious if this is still the most common thread for GenomicsDBImport? I've yet to see anyone answer the question of whether it's possible to run GenomicsDBImport with all contigs. I'm running gatk/4.0.7.0 on an SCC.

    Basically, I have 20 individuals aligned and called with a not so great reference genome; so I have ~300 scaffolds. Is it possible to run GenomicsDBImport without having to manually input a list of 300 scaffolds?

    Thanks!
    Evan
  • jmedaromjmedarom Member
    Hi,

    I have a question about this Geraldine's comment :

    "Hi @sam0per, in our pipelines we run through GenomicsDBImport and GenotypeGVCFs per interval, then we combine the per-interval VCFs after that. We don't combine workspaces; that is not possible. However there are a few tweaks you can apply."

    In my case, I ran GenomicsDBImport per interval (per each chromosome as -L 1 -L 2 -L 3...), but I did not do the same with GenotypeVCFs and thus did not combine the per-interval VCFs. is this correct? or is it necessary running both tools per interval?.

    Thanks,

    Jenn
  • vaa2011vaa2011 Member
    Hi,

    I would like to combine a cohort of samples from WGS. For this I followed the indications in this post (on a small subset of my cohort) and was trying to run GenomicsDBImport on all of the chromosomes at the same time. As the command forces you to specify at least one interval, I listed all the chromosomes and I get this error:

    A USER ERROR has occurred: Badly formed genome unclippedLoc: Query interval "chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX" is not valid for this input

    Has anyone been able to successfully run this command for the whole genome? I have no issues running CombineGVCFs except for it being slow.
  • YifangYifang Member
    edited March 22
    Hello,
    I hit the same wall recently.
    The format for options -L and --intervals are still unclear to me. I tried to include all the chromosomes at once, but not successful. The interval seems taking entries separated by comma as I did not get error without -L option:
    --intervals N1,N2,N3,N4,N5,N6,N7,N8,N9,N10,N11,N12,N13,N14,N15,N16,N17,N18,N19
    as "ERROR -L option is needed ......"
    I then added -L option but got error with "-L options unrecognized N2 etc..." (sorry, did not have screenshot record!)
    ................
    The two options seem mutual exclusive to each other as I could not get them run at the same time even I only provide a single chromosome. (-L N1 and --interval N1). Bang bang bang!!!

    It seems only one interval is allowed with the -L option.

    Spent hours testing this and finally got one successful case. Here is my script:

    gDB=/path/to/genomeDBImport #Must exist, and will be created by the program
    gatk4_jar=/storage/yifang/download-software/anaconda3/envs/exome/share/gatk4-4.1.0.0-0/gatk-package-4.1.0.0-local.jar

    for chr in N1 N2 N3 N4 N5 N6 N7 N8 N9 N10 N11 N12 N13 N14 N15 N16 N17 N18 N19; do
    java -DGATK_STACKTRACE_ON_USER_EXCEPTION=true \
    -jar ${gatk4_jar} GenomicsDBImport \
    --genomicsdb-workspace-path ${gDB}_${chr} \
    --batch-size 16 \
    -L ${chr} \
    --sample-name-map /storage/yifang/20190225/data1/cohort_samples.map \
    --tmp-dir=/storage/ppl/yifang/20190225/data1/tmp \
    --reader-threads 8 &
    done

    I put here for a reference. Hope it can become a good starter for expert to give better example with deeper insight.
    Gook luck to all!

    Yifang
  • danat_yermakovichdanat_yermakovich Member
    edited April 9

    @Yifang
    @vaa2011
    hi guys
    I'm a little bit annoyed of this examples of commands on the top cause nor one of them are not working.
    I tried my old 4.0.6.0 and the latest 4.1.1.0 gatk

    The first one fall down on this line:
    gatk GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    --genomicsdb-workspace-path my_database \

    --intervals chr20,chr21

    after spending some time, i thought up syntax that allowed - you need put -L flag to each chr
    like
    ./gatk-4.1.1.0/gatk --java-options "-Xmx8g -Xms6g" GenomicsDBImport \
    --genomicsdb-workspace-path $Path/genomic_db \
    --sample-name-map cohort.sample_map \
    --reader-threads 2 \

    -L chr20 -L chr21 -L chr22 -L chrX -L chr2

    weird not fixed shi__t

    @hdetering
    and the second one:

    gatk GenotypeGVCFs \
    -R data/ref/ref.fasta \
    -V gendb://my_database \

    -newQual \

    -O test_output.vcf

    so

    $ gatk-4.1.1.0/gatk GenotypeGVCFs
    after this command you have a help message, and there is no flag --newQual
    it really was in gatk3 GenomeAnalysisTK.jar
    but u can find the same by sense:

    --use-new-qual-calculator,-new-qual:Boolean
    Use the new AF model instead of the so-called exact model Default value: true.
    Possible values: {true, false}

    as you can see, the default value is true - u don't need to clearly indicates this option

    i'm really disappointed - it's not the best practise to mix up options from old and new versions
    p.s. sorry for not excellent English, but i hope it was understandable

  • AlvaAlva SwitzerlandMember ✭✭
    edited July 23

    Hello @Geraldine_VdAuwera @Sheila ,
    I have a question regarding using GenomicsDBImport over CombineGVCFs, I have been using CombineGVCFs for joining 320 gvf files using following memory options,

        • --java-options '-Xmx350G -XX:+UseParallelGC -XX:ParallelGCThreads=40'  [ERROR]
        • --java-options -Xms454m -Xmx3181m -XX:+UseSerialGC [Error]
        • --java-options '-Xmx650G -XX:+UseParallelGC -XX:ParallelGCThreads=70'  [ERROR]
    --java-options '-Xmx900G -XX:+UseParallelGC -XX:ParallelGCThreads=80' [ERROR]
    

    In all above cases I got the following error,
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    # SIGSEGV (0xb) at pc=0x00002ad5d784802e, pid=31158, tid=0x00002ad5d6327280
    #
    # JRE version: Java(TM) SE Runtime Environment (8.0_172-b11) (build 1.8.0_172-b11)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.172-b11 mixed mode linux-amd64 )
    # Problematic frame:
    # V [libjvm.so+0x92c02e] SR_handler(int, siginfo*, ucontext*)+0x3e
    #
    # Core dump written. Default location: /cluster/work/grlab/projects/tmp_alva/test_wxs_aml/scripts/core or core.31158
    #
    # An error report file with more information is saved as:
    # /cluster/work/grlab/projects/tmp_alva/test_wxs_aml/scripts/hs_err_pid31158.log
    #
    # If you would like to submit a bug report, please visit:
    # http://bugreport.java.com/bugreport/crash.jsp

    Therefore, I thought of using GenomicsDBImport tool over the Combinegvcfs,. However, in my case I have 1824 contigs from within humnan and viral sequences for eachgvcf files. I would like to whether the intervals parameters can be used efficentevly in my case?

  • wbsimeywbsimey California Academy of SciencesMember ✭✭
    edited July 31

    Hello,
    We are using the latest GATK4 conda install on Ubuntu 16 with 1TB RAM.
    We are trying to use the GenomicsDBImport tool to combine dozens of gVCFs from WGS individuals. Our reference genome has thousands of scaffolds (40,232), 25 of which we have assigned as primary chromosomes.
    We have tried many different interval options with no luck. The example provided above with commas (chr20,chr21) does not work. We tried multiple -L commands for each scaffold, this also does not work.
    We have tried to use a list, but I think we must be formatting the list incorrectly (I tried to find an example .list, but could not. Sorry in advance if there is one).
    What are we doing wrong?
    I am not posting the error messages, because they are the same as others have posted here.

    The following are a couple examples of what we have tried:
    Example 1:
    gatk --java-options "-Xmx200G" GenomicsDBImport \
    -V TP30046_sorted_dedup_rg.gVCF \
    -V TP29957_sorted_dedup_rg.gVCF \
    ...
    -V TP29969_sorted_dedup_rg.gVCF \
    -V AHP1168_sorted_dedup_rg.gVCF \
    --genomicsdb-workspace-path ../Genomics_DB-all \
    --intervals scaffold_1,scaffold_2,scaffold_3,...,scaffold_25

    Example 2:
    gatk --java-options "-Xmx200G" GenomicsDBImport \
    -V TP30046_sorted_dedup_rg.gVCF \
    -V TP29957_sorted_dedup_rg.gVCF \
    ...
    -V TP29969_sorted_dedup_rg.gVCF \
    -V AHP1168_sorted_dedup_rg.gVCF \
    --genomicsdb-workspace-path ../Genomics_DB-all \
    -L scaffold_1
    -L scaffold_2
    -L scaffold_3
    ...
    -L scaffold_25

  • jmedaromjmedarom Member
    Hi @Sheila ,

    I have been waiting for this answer so long. Could someone from gatk team give an answer to this post please?.

    Thank you for your help.


    > @jmedarom said:
    > Hi,
    >
    > I have a question about this Geraldine's comment :
    >
    > "Hi @sam0per, in our pipelines we run through GenomicsDBImport and GenotypeGVCFs per interval, then we combine the per-interval VCFs after that. We don't combine workspaces; that is not possible. However there are a few tweaks you can apply."
    >
    > In my case, I ran GenomicsDBImport per interval (per each chromosome as -L 1 -L 2 -L 3...), but I did not do the same with GenotypeVCFs and thus did not combine the per-interval VCFs. is this correct? or is it necessary running both tools per interval?.
    >
    > Thanks,
    >
    > Jenn
  • lidialidia Member
    Hi @wbsimey ,

    I am having the same problem that you have and I can´t seem to find an answer in the documentation, nor in the forums. Were you able to find a solution?

    If so, I would very much appreciate your help.

    Lidia
  • wbsimeywbsimey California Academy of SciencesMember ✭✭

    Hi @lidia, we were never able to get GenomicsDBImport to accept more than one interval. But our workflow includes 43 individuals (2.3 Gbp each) and 29 chromosomes/scaffolds/intervals, which, by our estimation would require about 2-3 months for HaplotypeCaller to finish. So, I broke up HaplotypeCaller analyses into individual chromosomes and ran many instances simultaneously (I have access to large multi threaded Xeon servers). My plan was to then use CombineGVCFs to merge the chromosomes of each individual into a single gVCF file, then use GenomicsDBImport to create a single database for all samples and chromosomes. But, as the GATK documentation points out, CombineGVCFs is very slow; it required more than 48 hours for one of our smallest alignments. So, I am planning to use GenomicsDBImport to create a unique database for each chromosome (29 databases).

  • fengtaofengtao Member

    @lidia @wbsimey

    I use gatk4.0.8, and it works with multiple --intervals arguments:

        gatk --java-options "-Xmx200G" GenomicsDBImport \
        --sample-name-map cohort.sample_map \
        --genomicsdb-workspace-path  output_DB\
        --intervals Chr1 \
        --intervals Chr2 \
        --intervals Chr3 \
        --intervals Chr4 \
        --intervals Chr5
    
  • wbsimeywbsimey California Academy of SciencesMember ✭✭

    Thanks @fengtao ! This did indeed work for me on 4.1.4 conda install.

Sign In or Register to comment.