Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

GenomicsDBImport--do multiple samples need individual databases?

I am trying to do variant calling on a reference transcriptome that I've produced, but I have some questions about functionality of GenomicsDBImport, and downstream in SelectVariants. I know that you can only look at one genomic interval per go, but do they all need individual databases?

I've begun running this command from bash, with $contigs as the list of contigs to go through, $path gives the absolute path, and files.txt representing all my samples.

for i in $contigs do gatk GenomicsDBImport \ $(cat files.txt) \ --genomicsdb-workspace-path $path/my_database \ --intervals $i done

I get this error after running it A USER ERROR has occurred: The workspace you're trying to create already exists. ( /gatk/my_data/my_database ) Writing into an existing workspace can cause data corruption. Please choose an output path that doesn't already exist. Does this mean that I need to create an individual result for each contig? And how does this influence the downstream SelectVariants command?

Answers

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @adam_s,

    Please be sure to use GATK v4.0.6.0+, as many bugs were fixed for GenomicsDBImport in v4.0.6.0. You can peruse the release notes here. For example, you can specify multiple intervals per invocation instead of being limited to one.

    Have you seen the tutorial on using GenomicsDBImport? It is Tutorial#10061. This tutorial lists three caveats, one of which is as follows:

    You can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples. For very large numbers of samples, there are some batching options.

    Also, the tool documentation gives example commands that you may find helpful when handling multiple samples in an import. In this tool doc we see:

    The --genomicsdb-workspace-path must point to a non-existent or empty directory.

    Given the above, I believe you need to import all the samples for a genomic interval together simultaneously, e.g. with:

    gatk GenomicsDBImport \
    -V gvcfs/mother.g.vcf \
    -V gvcfs/father.g.vcf \
    -V gvcfs/son.g.vcf \ --genomicsdb-workspace-path sandbox/trio \ --intervals 20:10,000,000-10,200,000
    

    To add additional samples to a database then requires re-importing all the GVCFs together.

    If you feel that a feature of the tool should be on-the-fly modification of existing records to accommodate additional sample imports to an existing database, then do let us know. I can put in a feature request on your behalf.

  • adam_sadam_s Member

    Hi @shlee,

    Thanks for the answers. I am running this on a transcriptome, so there is a pretty serious time/computational component to go through thousands of contigs. As I understood it from the tutorial, this function only allows me to produce results for one contig at a time:

    At the moment you can only run GenomicsDBImport on a single genomic interval (ie max one contig) at a time. Down the road this will change (the work is tentatively scheduled for the second quarter of 2018), because we want to make it possible to run on one multiple intervals in one go. But for now you need to run on each interval separately. We recommend scripting this of course.

    Does your comment mean that currently using --intervals with a comma delimited list of contigs will produce a combined list for every contig listed in one go? I am trying to look at SNPs for every contig in a de novo transcriptome, which seems to be computationally intractable unless I can do every contig at once, which I have a comment about here. The final caveat is that I will need to be examine SNPs on a per-contig basis after the fact, if that wasn't clear.

    Thanks,
    Adam

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @adam_s,

    My understanding is that the new features 4.0.6.0+ allow you to input multiple intervals, even across contigs, to the same input statement. The intervals do not need to be contiguous. https://github.com/broadinstitute/gatk/pull/4645 details this particular new feature.

    You do not comma delimit. Here's a command I just tested to showcase:

    gatk4090 GenomicsDBImport -V papa.g.vcf -V paalt.g.vcf --genomicsdb-workspace-path db3 --intervals chr19:1-100 --intervals chr19:401-500 --intervals chr19_KI270866v1_alt
    

    You can define an interval at the contig-level, e.g. -L chr1. Also, perhaps it will interest you to know the -L/--intervals parameter can take a list of intervals, where each line defines an interval, e.g. one contig per line.

    If the memory/time issue you comment on in your other thread is still the case when you update to the latest version v4.0.9.0, please do let us know. Our developers will be keen to solve this.

    I hope this is helpful.

  • adam_sadam_s Member

    Hi @shlee,

    I just posted in the other thread, but still the same import issue (> 10 minutes import time). Unfortunately, since I'm working with a de novo transcriptome (we don't have a genome) I can't look across multiple intervals the way it is currently written as they are not contiguous. I attempted this last night but it did not work.

    It isn't terribly helpful in this case, but just to demonstrate this, here is my error:

    00:45:38.506 INFO  GenomicsDBImport - Importing batch 1 with 45 samples
    00:55:51.510 INFO  GenomicsDBImport - Starting batch input file preload
    00:55:58.416 INFO  GenomicsDBImport - Finished batch preload
    00:55:58.416 INFO  GenomicsDBImport - Importing batch 1 with 45 samples
    00:55:59.047 INFO  GenomicsDBImport - Starting batch input file preload
    00:55:59.047 INFO  GenomicsDBImport - Shutting down engine
    00:55:59.059 INFO  GenomicsDBImport - Starting batch input file preload
    [October 4, 2018 12:55:59 AM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 12.24 minutes.
    Runtime.totalMemory()=15154020352
    00:55:59.070 INFO  GenomicsDBImport - Starting batch input file preload
    00:55:59.071 INFO  GenomicsDBImport - Starting batch input file preload
    java.util.concurrent.CompletionException: org.broadinstitute.hellbender.exceptions.GATKException: Cannot call query with different interval, expected:Transcript_1:1-895 queried with: Transcript_4:1-1612
            at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
            at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
            at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1592)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            at java.lang.Thread.run(Thread.java:748)
    Caused by: org.broadinstitute.hellbender.exceptions.GATKException: Cannot call query with different interval, expected:Transcript_1:1-895 queried with: Transcript_4:1-1612
            at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport$InitializedQueryWrapper.query(GenomicsDBImport.java:769)
            at com.intel.genomicsdb.importer.GenomicsDBImporter.<init>(GenomicsDBImporter.java:165)
            at com.intel.genomicsdb.importer.GenomicsDBImporter.lambda$null$2(GenomicsDBImporter.java:604)
            at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
            ... 3 more
    00:55:59.072 INFO  GenomicsDBImport - Starting batch input file preload
    
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Thanks for reporting back @adam_s. @bhanuGandham is going to start owning GenomicsDB questions, so she will follow up. We will bring up these questions to the developers. If you can think of anything else that would help us in our investigation as to why the import takes so long, please let us know these details. Thanks.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    HI @adam_s

    Sorry for the delay but I am looking into this.

    Regards
    Bhanu

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @adam_s

    We have reported this issue as a bug and our developers are working on it. Thank you for bringing this to our notice. You can keep a track of its progress here.

    Regards
    Bhanu

  • yingchen69yingchen69 nanjingMember

    I had the same issue. Apparently this is a multi-thread bug. I changed --reader-threads from 5 as suggested to 1, and the issue is gone. Now it's slow but at least it works.

    Best,

    Ying

Sign In or Register to comment.