Hi GATK Users,

Happy Thanksgiving!
Our staff will be observing the holiday and will be unavailable from 22nd to 25th November. This will cause a delay in reaching out to you and answering your questions immediately. Rest assured we will get back to it on Monday November 26th. We are grateful for your support and patience.
Have a great holiday everyone!!!

Regards
GATK Staff

Using GenomicsDBImport to consolidate GVCFs for input to GenotypeGVCFs in GATK4

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
edited April 16 in GATK 4 Beta

This article is out of date and has been replaced by https://software.broadinstitute.org/gatk/documentation/article?id=11813


In GATK4, the GenotypeGVCFs tool can only take a single input, so if you have GVCFs from multiple samples (which is usually the case) you will need to combine them before feeding them to GenotypeGVCFs. Although there are several tools in the GATK and Picard toolkits that provide some type of VCF or GVCF merging functionality, for this use case there is only one valid way to do it: with GenomicsDBImport.

The GenomicsDBImport tool takes in one or more single-sample GVCFs and imports data over a single interval, and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs can then read from the created GenomicsDB directly and output a VCF. Note that GenomicsDBImport does not take two or more same sample GVCFs. You will need to create one GVCF per-sample before running the tool.

Here are example commands to use it:

gatk-launch GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    --genomicsDBWorkspace my_database \
    --intervals 20

That generates a directory called my_database containing the combined gvcf data.

Then you run joint genotyping; note the gendb:// prefix to the database input directory path.

gatk-launch GenotypeGVCFs \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -G StandardAnnotation -newQual \
    -O test_output.vcf 

And that's all there is to it.

There are three caveats:

  1. You can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples. For very large numbers of samples, there are some batching options.

  2. At the moment you can only run GenomicsDBImport on a single genomic interval (ie max one contig) at a time. Down the road this will change (the work is tentatively scheduled for the second quarter of 2018), because we want to make it possible to run on one multiple intervals in one go. But for now you need to run on each interval separately. We recommend scripting this of course.

  3. At the moment GenomicsDB only supports diploid data. The developers of GenomicsDB are working on implementing support for non-diploid data.

If you can't use GenomicsDB for whatever reason, fall back to CombineGVCFs instead. It is slower but will allow you to combine GVCFs the old-fashioned way.


Addendum: extracting data from the GenomicsDB

If you want to generate a flat multisample GVCF file from the GenomicsDB you created, you can do so with SelectVariants as follows:

gatk-launch SelectVariants \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -O combined.g.vcf

Caveat: cannot move database after creation

Currently the GenomicsDB internal code uses the absolute path of the location of the database as part of the data encoding. As a consequence, you cannot move the database to a different location before running GenotypeGVCFs on it. If you do, it will no longer work. This is obviously not desirable, and the development team is looking at options to remediate this.

Post edited by Geraldine_VdAuwera on
Tagged:

Comments

  • chapmanbchapmanb Boston, MAMember

    Geraldine;
    Thanks much for this tutorial. I'm working on implementing joint calling in bcbio with GATK4 beta3 and running into issues getting data in and out of GenomicsDB. I appear to lose genotypes when importing and as a result GenotypeGVCFs outputs are empty. I put together a self contained reproducible test case here:

    https://s3.amazonaws.com/chapmanb/testcases/gatk4_genotypegvcfs.tar.gz

    which consists of running an import, and then a GenotypeGVCFs and SelectVariants to test export:

    gatk-launch GenomicsDBImport --genomicsDBWorkspace test_genomicsdb -L chrM:1-1000 --variant Test1.vcf.gz --variant Test2.vcf.gz
    gatk-launch GenotypeGVCFs --variant gendb://test_genomicsdb -R hg19.fa --output joint.vcf.gz -L chrM:1-1000
    gatk-launch SelectVariants -R hg19.fa -V gendb://test_genomicsdb -O extract.vcf.gz
    

    The input VCFs look have calls within the chrM region I'm testing:

    Test1.vcf.gz:chrM       150     .       T       C,<NON_REF>     49406.77        .       DP=1084;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;MQ0=0;RAW_MQ=3902400.00    GT:AD:DP:GQ:MCL:MMQ:PGT:PID:PL:SB       1/1:0,1081,0:1081:99:0,0,0:0,60,0:0|1:150_T_C:49435,3362,0,49435,3362,49435:0,0,483,598
    

    but are empty in the GenotypeGVCFs output. Running SelectVariants shows the sample but with empty genotype calls:

    extract.vcf.gz:chrM     150     .       T       C,<NON_REF>     .       .       DP=2168;ExcessHet=3.01;MQ0=0;RAW_MQ=7804800.00        GT:AD:DP:GQ:MCL ./.:0,1081,0:1081:99:0.00,0.00  ./.:0,1081,0:1081:99:0.00,0.00
    

    I'm stuck on what I'm doing wrong. Could anyone offer any pointers on ways to fix/tweak this to get correct genotyping? Thanks much.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Brad, when you say the outputs are empty in GenotypeGVCFs, do you mean the files are empty, or some positions are missing?

  • chapmanbchapmanb Boston, MAMember

    Geraldine;
    Thanks for getting back with me so quickly. GenotypeGVCFs produces an VCF file with just the header and no variant positions called. This is a small test dataset we use a lot so expect a number of calls like at the chrM 150 position I show above. Thanks again for looking at this.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hmm, someone else reported something like this recently. We'll look into it.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Issue created here in case you want to track progress

  • chapmanbchapmanb Boston, MAMember

    Brilliant, thanks Geraldine. I subscribed and happy to help there with any other information needed.

  • jjmmiijjmmii Member
    edited August 2017

    Hi Geraldine,

    May you please show me how to combine per-chromosome
    I am unable to pass the intervals file I used in the previous steps of the Best Practices workflow, namely the exome intervals, with the error "More than one interval specified. The tool takes only one". But when I don't pass any intervals, the error becomes "Argument intervals was missing: Argument 'intervals' is required." May you please show me the correct way to specify the exome intervals? Thank you very much.

    EDIT: I just read that I can only specify one interval at a time. If I specify per-chromosome intervals, then how do I merge them together to form a whole-genome VCF file? Thank you.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
    edited August 2017

    @jjmmii
    Hi,

    Sorry for the delay. Let me check with the team and get back to you.

    -Sheila

    EDIT: I think you can use MergeVcfs.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @jjmmii
    Hi,

    I just heard back from the developers. If you want to merge disjoint shards after a split operation you should use GatherVCFs; if you want to to merge VCFs that can overlap each other you should use MergeVariants.

    -Sheila

  • bbimberbbimber HomeMember

    Hello - this is tangential to this thread, but I have a question on GenomicsDB in general. If I understand this, in the example you're using GenomicsDB to aggregate files to run computation that in turn produces a VCF output. I can see the utility of that. In your example, the GenomicsDB workspace is essentially just a transient tool to support the generation of the actual VCF (though I suppose technically it could stick around).

    GenomicsDB seems like it would also make sense as a long term data store for production genotype calls. I see that GenomicsDB's import will work for any VCF and there are hooks available to issue queries. From the outside, it appear this functionality might be a little less developed than your joint genotyping use case? Are there any public projects or examples you know about using GenomicsDB for data storage? Thanks.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @bbimber
    Hi,

    Keep in mind, the tools in GATK4 are still in beta.

    That said, right now, I am not aware of any public projects that use it. I think that is one of the goals of the tool, however. I think the team would like it to be a persistent place for vcfs to live and to make quick callsets.

    I hope that helps. We should have more documentation and information in the next few months :smile:

    -Sheila

  • Hi Geraldine,

    I found the same problem with missing variant positions in GATK4.b6 and followed the bug report you linked to. The issue was marked as closed a month ago but the issue was not fixed in GATK4.b6. Is it known when will the fix be available?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
    edited November 2017

    @berkel
    Hi,

    Are you referring to the empty VCF when using GenomicsDBImport? That should be fixed in beta 6. Can you post the exact commands you ran?

    Thanks,
    Sheila

  • eykim909eykim909 chicagoMember

    Hi,
    GenomicsDBImport in GATK4, I used -L my_exome_bed_file and -ip 100 for interval, and got the error "More than one interval specified. The tool takes only one". Without -ip 100 option, I still get the same error. The -L with my_exome_bed has been used for all steps. I have ~1000 g.vcf files, so I would like to use GenomicsDBImport than CombineGVCFs per 1000 single g.vcf files. How could I proceed this?
    Thanks!

  • jfarrelljfarrell Member ✭✭

    Is there a way to specify a file containing a list of the gVCFs to merge? Repeating -V for each vcf makes for a long command-line. I would just like to pass it a file with a list of gVCFs.

    Previously with CombineGVCF, the -V parameter would take a filename (-V mygvcf.list). GenomicsDBImport does not use this convention for the -V arguement.

  • pallevillesenpallevillesen DenmarkMember

    @jfarrell said:
    Is there a way to specify a file containing a list of the gVCFs to merge? Repeating -V for each vcf makes for a long command-line. I would just like to pass it a file with a list of gVCFs.

    Previously with CombineGVCF, the -V parameter would take a filename (-V mygvcf.list). GenomicsDBImport does not use this convention for the -V arguement.

    --sample-name-map FILE.list

    must be sample TAB vcf_file_name

  • jfarrelljfarrell Member ✭✭

    Thanks! Just what I was looking for.

  • alphahmedalphahmed JAPANMember

    Hi,

    Congratulations for the release of this great version of GATK.

    It is still not clear to me on how to call all GVCFs without specifying a specific interval. I need to run GenotypeGVCFs which requires the GenomicsDBImport output, but when I enter the intervals for the whole genome sequencing (WGS_intervals), I get a message that only one interval is accepted.

    Thanks!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @alphahmed
    Hi,

    Yes, GenomicsDBImport does not accept more than one interval right now. I think there are plans to change that, but I need to check with the developers on the timeline. Keep an eye on this thread for updates.

    -Sheila

  • alphahmedalphahmed JAPANMember
    edited January 23

    Thank you Sheila...
    For the time being... Can you suggest any other way of Genotyping multiple WGS gVCFs using GATK4?

    Post edited by alphahmed on

    Issue · Github
    by Sheila

    Issue Number
    2876
    State
    closed
    Last Updated
    Assignee
    Array
    Closed By
    vdauwera
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @eykim909
    Hi,

    Right now, GenomicsDBImport only accepts 1 interval at a time. There are plans to change that, and you can keep track of the issue here.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @alphahmed
    Hi,

    Let me confirm with the team and get back to you.

    -Sheila

  • @Sheila said:
    @jjmmii
    Hi,

    I just heard back from the developers. If you want to merge disjoint shards after a split operation you should use GatherVCFs; if you want to to merge VCFs that can overlap each other you should use MergeVariants.

    -Sheila

    Hi,sheila
    I am a bit confused. I process the data for one sample according to Germline-SNPs-Indels-GATK4-hg38 (https://portal.firecloud.org/#workspaces/help-gatk/Germline-SNPs-Indels-GATK4-hg38/monitor ); In the WDL, the HaplotypeCaller step is to process sampletest.gh38.bam with 50 intervals files generated by the ScatterIntervalsByNs tool, and then get a sampletest.hg38.g.vcf.gz using MergeVcfs.
    But then (also JointGenotyping step), I can not do a GenomicsDBImport on sampletest.hg38.g.vcf.gz with 50 intervals files generated by the ScatterIntervalsByNs tool, just as @jjmmii does.Can you tell me what to do next?
    BTW, in the Germline-SNPs-Indels-GATK4-hg38, JointGenotyping step use a diff sample from HaplotypeCallerGvcf_GATK4, and the --intervals is diff too.
    Thanks in advance. ;) ;)

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @wxhyih
    Hi,

    I am a little confused about what you are asking. Are you asking why you cannot run GenomicsDBImport on more than 1 interval?

    Thanks,
    Sheila

  • @Sheila said:
    @wxhyih
    Hi,

    I am a little confused about what you are asking. Are you asking why you cannot run GenomicsDBImport on more than 1 interval?

    Thanks,
    Sheila

    Hi
    I am sorry for the unclear description.

    In JointGenotyping step(https://console.cloud.google.com/storage/browser/fc-347d972a-6a14-4199-8061-56f3940ee0da/23a01750-76de-4df2-9a66-ef6c881945ff/JointGenotyping/13864639-6cb9-4e9b-ad00-e4abb870ae8e/?pli=1), the command GenomicsDBImport step for one sample is like this :
    _/gatk/gatk-launch --javaOptions "-Xmx4g -Xms4g" \
    GenomicsDBImport \
    --genomicsDBWorkspace genomicsdb \
    --batchSize 50 \
    **-L chr19:1-58617616 **
    --sampleNameMap inputs.list \
    --readerThreads 5 \
    -ip 500_

    but, in HaplotypeCaller step (https://accounts.google.com/AccountChooser?continue=https://console.cloud.google.com/storage/browser/fc-347d972a-6a14-4199-8061-56f3940ee0da/237359f1-0a97-4325-9b20-230908313f9c/HaplotypeCallerGvcf_GATK4/817d8cb8-368c-425d-93f7-4141406e322d/), the command HaplotypeCaller step for one sample is like this :

    _cd /cromwell_root
    /gatk/gatk-launch --javaOptions -Xms8000m \
    HaplotypeCaller \
    -R /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta \
    -I /cromwell_root/gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bam \
    -O NA12878_24RG_small.hg38.g.vcf.gz \
    -L /cromwell_root/genomics-public-data/resources/broad/hg38/v0/scattered_calling_intervals/temp_0019_of_50/scattered.interval_list \
    -ip 100 \
    -contamination 0 \
    --max_alternate_alleles 3 \
    -ERC GVCF _

    As you can see, the values of -L parameters are set differently.In HaplotypeCaller, the scattered.interval_list have more than 1 interval and it can not used by GenomicsDBImport for now.So, I would like to ask you how to do the GenomicsDBImport step after I finished the HaplotypeCaller step with the scattered.interval_list.That is, how to combine the haplotypecaller-gvcf-gatk4 and joint-discovery-gatk4 steps (like this: https://portal.firecloud.org/#workspaces/help-gatk/Germline-SNPs-Indels-GATK4-hg38/monitor) together?

    Thanks && Happy Chinese New Year :)
    wxhyih.

  • bwraybwray Member
    edited February 16

    Hey GATK team,

    I'm running GenomicsDBImport in GATK 4.0.1.0 and I'm experiencing the same error presented at the top of the forums here, namely that all of the resulting vcf file contains a header and nothing else. I've looked in my post-haplotypecaller vcf files and confirmed that there are indeed variants being called within the region I'm specifying. Here is the command I'm running:
    gatk GenomicsDBImport -V sample1.g.vcf.gz -V sample2.g.vcf.gz --genomicsdb-workspace-path $gatkProjectDir/genomicsdb_workspace/interval1 --L chr1:1735000-1832000

    I have tried running this with just a single sample and end up with the same result.

    I should also note that I then tried running CombineGVCFs within GATK 4.0.1.0 by using the following command:
    gatk --java-options "-Xmx80G" CombineGVCFs -R bundle/hg19.fa --variant sample1.g.vcf.gz --variant sample2.g.vcf.gz -O samples1_2_combined.g.vcf.gz

    and I end up with the following error:

    htsjdk.samtools.util.RuntimeIOException: sample1.g.vcf.gz has invalid uncompressedLength: -1290556380

    My VCF files are huge, like 60gb after uncompressing. I don't get this error from GenomicsDBImport, for what that's worth.

    Thanks!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @wxhyih
    Hi wxhyih,

    Ah, I see. Thanks for the clarification. So, HaplotypeCaller can run on multiple intervals which is what is specified in that -L. GenomicsDBImport can only take one interval at a time, so the -L has a small interval (probably the first in the interval list provided to HaplotypeCaller). Have a look at this thread. The team recommends running on ~1 interval per sample in your cohort so GenomicsDBImport runs faster.

    I hope this helps.

    You may also be interested in the gatk-workflows github repository.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @bwray
    Hi,

    That is odd. Can you try with the very latest version and let us know if that helps? I may need you to submit a bug report.

    -Sheila

  • bwraybwray Member
    edited February 22

    @Sheila

    Thanks for responding. I'll give the very latest GATK version a try.

    I ran ValidateVariants on my original huge post-HaplotypeCaller VCF files, and I had one VCF file fail (with the same "invalid uncompressedLength" error), but the other one was successfully validated. I'm not sure what the point of running a single VCF file through GenomicsDBImport would be, but I tried it with the one valid VCF file and again generated a set of VCF files with only headers.

    For now, what I've done is rerun HaplotypeCaller, this time passing my list of target intervals (which I hadn't passed the first time I ran HapltypeCaller) to create much smaller GVCF files to pass downstream. I tried running GenomicsDBImport / GenotypeGVCFs on these new smaller GVCF files but ended up with the same set of header-only vcf files.

    I then tried GATK 4.0.1.0's CombineGVCFs on the smaller post-HaplotypeCaller VCF files, and managed to get a combined VCF file populated with variants that was then successfully processed by GenotypeGVCF.

    I'm now wading through VQSR docs trying to compose my next commands. I'll try the very latest GATK once I finish processing these samples.

    Thanks again!
    edit - ValidateVariants, not ValidateVCF

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @bwray
    Hi,

    To confirm, CombineGVCFs worked on both of your sample GVCFs, but GenomicsDBImport did not?

    If you can confirm this GenomicsDBImport issue still happens in the latest version, I will need you to submit a bug report.

    In the meantime, I am happy CombineGVCFs worked for you and you are able to move on :smile:

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @alphahmed See the workflow script we use for joint calling in https://github.com/gatk-workflows/gatk4-germline-snps-indels

    Alternatively, you can use CombineGVCFs as the combine step before GenotypeGVCFs, which follows the same paradigm as what we did in GATK3.

  • sam0persam0per Member

    Hi,
    After variant calling using HaplotypeCaller in GVCF mode, I am consolidating the .g.vcf.gz outputs with the tool GenomicsDBImport. I have a list of samples in the file "samples.txt" and a list of contigs to pass individually. This is the command:

    while read i; do java -Xmx34g -jar /proj/data9/samuel/modules/gatk-4.0.2.0/gatk-package-4.0.2.0-local.jar GenomicsDBImport --sample-name-map samples.txt --genomicsdb-workspace-path gVCF_database -L $i; done<contigs

    Howerver, a user error occurs because I am not supposed to save each output for a single contig interval under the same workspace.

    A USER ERROR has occurred: The workspace you're trying to create already exists. Writing into an existing workspace can cause data corruption. Please choose an output path that doesn't already exist.

    As far as I understand, each interval requires a separate workspace and I was wondering whether I could merge the workspaces of each contig interval before joint genotyping. For example, I would like to combine gVCF_database_contig0 with gVCF_database_contig4 (I have thousands of contigs).

    I assume that I could iterate joint genotyping over the thousands of workspaces but I am still quite unsure how to merge the final output.

    Can somebody suggest me at which step it is more convenient to combine the multiple workspaces for each contig interval? before or after joint genotyping?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @sam0per, in our pipelines we run through GenomicsDBImport and GenotypeGVCFs per interval, then we combine the per-interval VCFs after that. We don't combine workspaces; that is not possible. However there are a few tweaks you can apply.

    1. We don't have as many contigs as you do but we run on WGS interval lists with a lot of intervals, and to keep things reasonable we merge intervals to produce a 2.5:1 ratio of intervals to samples, which you can find implemented here: https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/joint-discovery-gatk4.wdl (which unfortunately has some additional complexity due to the inclusion of filtering steps + some cloud-specific optimizations).

    2. Because we're running on cloud, each separate job is run on a different machine, so for us it doesn't matter that the workspace name is the same in all cases. In your case however I would recommend making the workspace name/path a variable that includes eg an interval identifier, to avoid collisions.

    Note that I'm going to close this thread because the article is out of date and has been replaced by https://gatkforums.broadinstitute.org/gatk/discussion/11813/how-to-consolidate-gvcfs-for-joint-calling-with-genotypegvcfs; if you still have questions please feel free to post them on that article or just create a new discussion thread.

This discussion has been closed.