How many GVCFs can I combine using GenomicsDBImport (GATK4)?

Hello,
I am about to start combining low-coverage GVCFs for many individuals (>4K). In the previous pipeline I had the individuals split into cohorts of ~200, which I then combined hierarchically, first within and then across cohorts, using CombineGVCFs.
I have tried GenomicsDBImport on a couple of samples, and I liked it. However, I can't seem to find any information on how many samples I can provide in the sample map. Would I be able to combine them all in one run?
If not, what would be the alternative? Is it possible to do hierarchical merging like with CombineGVCFs?
As the sample map requires the sample names next to the paths leading to the GVCFs, I can't figure out whether it is possible to merge GenomicsDB workspaces for multiple cohorts.
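
For readers following along, here is a minimal sketch of how the sample map mentioned above could be generated, together with a helper that splits samples into cohorts of ~200 as in the previous pipeline. The map format is one tab-separated `sample<TAB>path` line per GVCF. The function names and the rule for deriving a sample name from the file name are assumptions made for illustration, not anything from the original post, and the name written to the map should correspond to the sample in that GVCF.

```python
from pathlib import Path


def write_sample_map(gvcf_paths, out_path):
    """Write a tab-separated sample map: one 'sample<TAB>path' line per GVCF.

    Assumption: the sample name is the file name up to the first dot,
    e.g. 'animal_001.g.vcf.gz' -> 'animal_001'.
    """
    lines = []
    for p in sorted(Path(x) for x in gvcf_paths):
        sample = p.name.split(".")[0]
        lines.append(f"{sample}\t{p}")
    Path(out_path).write_text("\n".join(lines) + "\n")
    return len(lines)


def chunk(items, size):
    """Split a list into consecutive cohorts of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Calling `chunk(all_gvcfs, 200)` and then `write_sample_map` on each cohort would give one map per cohort. Whether the resulting per-cohort workspaces can be merged afterwards is exactly the open question in this post, so the sketch does not assume an answer to it.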

Answers

  • Great, thank you! I was planning on splitting the job per chromosome; hopefully that will be small enough.

  • jilska (Member)
    edited March 2018

    Thank you very much @shlee!
    I am trying to run it now, and while I had to play around with the memory settings a bit, it seems to be working. It's taking a long time, though, even though I started from the smallest chromosome! I will try SplitIntervals now!

  • @shlee, I'm afraid I have more questions. I am struggling to set the memory requirements to run the job successfully. The first subset of data I am trying to merge contains 300 individuals sequenced at 30x; their GVCFs take up ~900G. When I run GenomicsDBImport with 300G set in the Java options, I get the following error:
    Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

    Any idea why this could be?

  • shlee (Cambridge; Member, Broadie)

    Hi @jilska,

    It looks like the JVM ran out of memory: this error means it was spending almost all of its time in garbage collection (GC) while reclaiming very little heap. See this StackOverflow thread for an explanation.

  • That is the error message I am getting, but it doesn't make much sense, given that I also got it on smaller datasets with large amounts of memory (larger than some of the specifications listed on this forum for GenomicsDBImport). Our IT people, after observing the job, seem to think this is a Java-related bug, as the process itself doesn't use anywhere near the memory specified. We have installed a new version of Java and I will be re-running the analysis to see if this solves the issue.

    I'm not sure if I should be creating a new thread for this, but I do have a general comment about GenomicsDBImport. The project I am involved with is in partnership with an industrial partner, who sequences a number of animals every few weeks. In the pipeline using GATK 3.6, the newly sequenced animals were combined using CombineGVCFs, and multiple GVCFs were then fed into GenotypeGVCFs.

    Unless I am missing something, the current setup in GATK 4.0 is not ideal for routine sequencing. First, I need to combine all animals every time a new batch of data is added (rather than adding a batch to an existing database). Second, if I decide to use CombineGVCFs in GATK 4.0, I have to run it twice: first to combine the new cohort, then to combine the new cohort with the older animals, so that I have one file to feed to GenotypeGVCFs.

    As such, I am now reverting to GATK 3.6. It would be very nice if GenomicsDBImport allowed adding new data to an existing database, and/or GenotypeGVCFs accepted multiple GVCFs.

    GitHub issue #4667, filed by Sheila (state: closed; closed by chandrans)
  • In addition to the above, as there is a bug with combining metaVCFs as shown here, I can't run it at all now.

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)
    edited April 2018

    @jilska
    Hi,

    Sorry you are having so many issues. It is a new tool and some kinks are still being worked out. I will make a note for the team to allow for adding in GVCFs to a previously existing database. Have a look at this article.

    Also, this thread may help.

    -Sheila
    EDIT: https://github.com/broadinstitute/gatk/issues/4467 and https://github.com/broadinstitute/gatk/issues/2641 may help as well.

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @jilska
    Hi again,

    The best thing to do in this case is wait until the CombineGVCFs bug is fixed. The developer is on it, and the fix should be available soon.

    -Sheila
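
The per-chromosome splitting and the memory tuning discussed in this thread can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names, the 8 GB reserve, the batch size of 50, and the workspace layout are all hypothetical, and the idea of keeping the Java heap below the machine's total memory rests on the assumption that part of the tool's memory use happens outside the Java heap (which would fit the observation above that the process did not use anywhere near the memory specified). The flags themselves (`--java-options`, `--sample-name-map`, `--genomicsdb-workspace-path`, `-L`, `--batch-size`) are GATK4 options.

```python
def java_heap_gb(total_gb, native_reserve_gb=8):
    """Heap size in GB that leaves `native_reserve_gb` outside the JVM.

    The default reserve of 8 GB is an assumed, illustrative value.
    """
    if total_gb <= native_reserve_gb:
        raise ValueError("not enough memory to reserve native headroom")
    return total_gb - native_reserve_gb


def import_command(sample_map, interval, workspace_dir, total_gb, batch_size=50):
    """Return the argument list for one GenomicsDBImport run on one interval."""
    heap = java_heap_gb(total_gb)
    return [
        "gatk", "--java-options", f"-Xmx{heap}g",
        "GenomicsDBImport",
        "--sample-name-map", sample_map,
        # One workspace per interval (a hypothetical layout).
        "--genomicsdb-workspace-path", f"{workspace_dir}/{interval}",
        "-L", interval,
        # Import the samples in batches rather than opening all files at once.
        "--batch-size", str(batch_size),
    ]
```

Each returned list can be passed to `subprocess.run`, for example once per chromosome name in a loop, so that every chromosome becomes its own job with its own workspace.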
