
Parameters for running GenomicsDB import

I have a system with about 8 GB of RAM. I've run HaplotypeCaller (-ERC GVCF) on specific genes of interest using a .list file and have 109 .g.vcf.gz files of about 5-10 GB each. What would be the best way to run GenomicsDBImport on these samples for joint calling? Will I need to further subset these files into specific intervals or set a batch size?

GATK version: 4.0.11; Java version: 1.8

Optimal = avoid errors, maximise input samples, minimise computational load, and minimise time, in that order.



  • bhanuGandham Cambridge MA Member, Administrator, Broadie, Moderator admin
    edited December 2018

    Hi @MehulS

    We apologize that we were unable to get to your question; our team is on holiday until Jan 2nd, 2019. We will come back and get to your question as soon as possible.
    Merry Christmas and Happy New Year!

  • AdelaideR Member admin

    @MehulS, I am still working on getting an answer to your question.

  • AdelaideR Member admin

    Hello @MehulS - I have heard back from the development team.

    This is the advice:

    1.) Add the argument --merge-input-intervals to the command.
    2.) Use a batch size of 50; that seems to be the right amount for batching files with this command because it keeps the memory below 8GB.
    3.) Give GATK's Java process less than the full 8GB (maybe 7GB), because GenomicsDB (GDB) takes up memory outside of Java.
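For readers who want to see the three suggestions combined, here is a minimal Python sketch that only assembles the command line (it does not run GATK). The file names `cohort.sample_map`, `genes.list`, and `genes_gdb` are hypothetical placeholders, and passing the GVCFs through `--sample-name-map` is an assumption of this sketch, not part of the advice above. `--merge-input-intervals` is described later in this thread as a new option, so check the tool docs for your GATK release before relying on it.

```python
import shlex
from typing import List


def build_import_command(
    sample_map: str,
    intervals: str,
    workspace: str,
    java_heap_gb: int = 7,   # advice 3: leave headroom outside the Java heap
    batch_size: int = 50,    # advice 2: import the GVCFs in batches of 50
) -> List[str]:
    """Assemble a GenomicsDBImport argument list; nothing is executed."""
    return [
        "gatk",
        "--java-options", f"-Xmx{java_heap_gb}g",
        "GenomicsDBImport",
        "--sample-name-map", sample_map,          # assumption: inputs given as a map file
        "--genomicsdb-workspace-path", workspace,
        "-L", intervals,
        "--batch-size", str(batch_size),
        "--merge-input-intervals",                # advice 1
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_import_command("cohort.sample_map", "genes.list", "genes_gdb")
    print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the command before running it makes it easy to check the heap size and batch size against the memory actually available on the machine.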

  • MehulS Member

    Thank you for your responses. I noticed that specifying an interval or an intervals .list file is compulsory in the command. But I already specified this when I ran HaplotypeCaller, and my GVCFs are already interval-specific.

    Also, the GATK blog here says: "However the --intervals argument value must be a single interval, not a list,"
    so I'm a bit confused about whether I'll be able to specify my .list file, which contains multiple intervals (assuming I need to).

  • AdelaideR Member admin

    Hi @MehulS, the complete paragraph from that section of the blog is:

    Note that the GVCFs can also be passed in as a list or map instead of being enumerated in the command. However the --intervals argument value must be a single interval, not a list, because this functionality was designed from the start to be used from within a script that scatters execution over multiple intervals. We'd like to enable running on one more intervals in one go, but we might not get to that for awhile, so for now you need to run on each interval separately.

    So, the answer seems to be you might need to run each interval separately for now.

    @gauthier Has this changed?
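The quoted note that GVCFs can be passed in as a map can be illustrated with a small Python sketch that builds the lines of a tab-separated sample map (sample name, then path). Deriving the sample name from the file name by stripping `.g.vcf.gz` is an assumption made here for illustration, and the paths in the usage example are hypothetical; use whatever sample names fit your own data.

```python
from pathlib import Path
from typing import Iterable, List


def sample_map_lines(gvcf_paths: Iterable[str], suffix: str = ".g.vcf.gz") -> List[str]:
    """Return one 'sample<TAB>path' line per GVCF, sorted by path."""
    lines = []
    for path in sorted(gvcf_paths):
        name = Path(path).name
        # Assumption: the file name minus the suffix is the sample name.
        sample = name[: -len(suffix)] if name.endswith(suffix) else name
        lines.append(f"{sample}\t{path}")
    return lines


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for line in sample_map_lines(["/data/s2.g.vcf.gz", "/data/s1.g.vcf.gz"]):
        print(line)
```

Writing these lines to a text file gives something that could be handed to the tool in place of enumerating all 109 GVCFs on the command line.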

  • gauthier Member, Broadie, Dev ✭✭✭

    Sorry, but that document is out of date. I'll try to get it updated this week. The functionality to run GenomicsDBImport over multiple intervals was added over the summer with the most stable versions in and later (https://github.com/broadinstitute/gatk/releases). You do still have to specify at least some interval, though.

  • MehulS Member

    So, is it OK to use the interval-specific GVCF files that I generated with HaplotypeCaller and also specify the same .list file that I used to create them?

    Also, pardon me if I'm digressing, but I'm curious to know why GenomicsDBImport requires intervals in the first place.

  • gauthier Member, Broadie, Dev ✭✭✭

    Yes, you can use the same .list file you used to call the GVCFs to import them, but it could be faster if you merge the intervals. That can be plan B if your tasks are taking too long.

    I'm not 100% sure why GDB needs intervals, but it doesn't require a reference, so I'm guessing it can't traverse its data structure without a contig to query.

  • MehulS Member
    edited January 2019

    I should have asked this earlier, but what does it mean to merge intervals anyway? Does it mean turning the discontinuous intervals into a single continuous interval?

  • gauthier Member, Broadie, Dev ✭✭✭

    Exactly. For each contig, make a single new interval by taking the beginning of the first interval and the end of the last interval. Since you called using the gene list, there's no data in between, so the extra territory won't cost you anything, and it will save you a lot of time instantiating the data structure for each additional interval.

  • MehulS Member

    Thanks. Why would I need a tool or script for that, though? I could just manually create a single interval (per chromosome, say) covering all my desired genes from start to end.

    If I have, say, these intervals:
    chr1: 1-10
    chr1: 40-50
    chr5: 70-90
    chr5: 100-1000

    I could manually create a new .list containing chr1: 1-50 and chr5: 70-1000.
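For longer gene lists, the per-contig merge described above (first start to last end on each contig) can be sketched in a few lines of Python. This is an illustrative helper, not a GATK tool; the function name `span_per_contig` is made up here, and it assumes each line looks like `contig:start-end`, with optional spaces as in the example.

```python
from typing import Dict, Iterable, List, Tuple


def span_per_contig(intervals: Iterable[str]) -> List[str]:
    """Collapse all intervals on a contig into one spanning interval."""
    spans: Dict[str, Tuple[int, int]] = {}  # dicts keep first-seen contig order
    for line in intervals:
        text = line.replace(" ", "").strip()
        if not text:
            continue  # skip blank lines
        # Split on the last ':' so contig names containing ':' still parse.
        contig, _, coords = text.rpartition(":")
        start_s, _, end_s = coords.partition("-")
        start, end = int(start_s), int(end_s)
        if contig in spans:
            lo, hi = spans[contig]
            spans[contig] = (min(lo, start), max(hi, end))
        else:
            spans[contig] = (start, end)
    return [f"{contig}:{lo}-{hi}" for contig, (lo, hi) in spans.items()]


if __name__ == "__main__":
    example = ["chr1: 1-10", "chr1: 40-50", "chr5: 70-90", "chr5: 100-1000"]
    print(span_per_contig(example))  # ['chr1:1-50', 'chr5:70-1000']
```

The output matches the hand-made list in the post, so for a handful of genes doing it manually is equivalent.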

  • gauthier Member, Broadie, Dev ✭✭✭

    Depends on how many ways you end up scattering. ;-) I had a job that I scattered 50 ways and I didn't want to make 50 interval lists by hand.
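To give a sense of why a script helps when scattering, here is a minimal Python sketch that splits one interval list into a chosen number of shards of near-equal size. It is only an illustration of the idea, with a hypothetical helper name `scatter_intervals`; it is not how the job mentioned above was actually scattered.

```python
from typing import List, Sequence


def scatter_intervals(intervals: Sequence[str], n_shards: int) -> List[List[str]]:
    """Split intervals into at most n_shards contiguous, near-equal chunks."""
    if n_shards < 1:
        raise ValueError("n_shards must be at least 1")
    n = min(n_shards, len(intervals))  # never produce empty shards
    if n == 0:
        return []
    base, extra = divmod(len(intervals), n)
    shards: List[List[str]] = []
    start = 0
    for i in range(n):
        size = base + (1 if i < extra else 0)  # first 'extra' shards get one more
        shards.append(list(intervals[start:start + size]))
        start += size
    return shards


if __name__ == "__main__":
    genes = ["chr1:1-10", "chr1:40-50", "chr5:70-90", "chr5:100-1000", "chr7:5-20"]
    for shard in scatter_intervals(genes, 2):
        print(shard)
```

Each shard could then be written to its own .list file and imported as a separate job.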

  • MehulS Member

    That seems fine. I'm not familiar with the scattering process anyway since I run everything locally via the command line rather than using WDL/Cromwell. So I won't be using scattering in all likelihood. Or would you deem that a necessary component of running this process ?

  • MehulS Member

    Also, is there any chance that this has changed: "You can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples."?

  • bhanuGandham Cambridge MA Member, Administrator, Broadie, Moderator admin

    Hi @MehulS

    This is an addition we plan on working on in the future. Here is the git issue associated with it: https://github.com/broadinstitute/gatk/issues/4773
    You can get updates on the issue by following it.

  • olavur Member

    I just want to note that the --merge-input-intervals option is not described anywhere. I expected to find it in the GenomicsDBImport documentation, but even after googling I only found it properly discussed here.

  • gauthier Member, Broadie, Dev ✭✭✭

    Sorry about that @olavur -- that's a new option that went in right after we updated the GenomicsDB article. It is in the tool docs, but maybe @shlee can put in a quick update for the article?

  • shlee Cambridge Member, Broadie ✭✭✭✭✭

    Absolutely @gauthier. I've proposed updates in a google doc I've shared with you for your review.

  • shlee Cambridge Member, Broadie ✭✭✭✭✭

    Actually, @gauthier, there appear to be duplicate sources of information for GenomicsDB. These look pretty similar to me:

    1. Article#11091 linked by @MehulS above: https://gatkforums.broadinstitute.org/gatk/discussion/11091/genomicsdb posted December 2017 by GATK_Team (Geraldine is the originator). This falls in the Dictionary section of the forum.
    2. Tutorial#11813 by @Geraldine_VdAuwera: https://gatkforums.broadinstitute.org/gatk/discussion/11813/how-to-consolidate-gvcfs-for-joint-calling-with-genotypegvcfs#latest posted April 2018 and last edited January 2019. This falls in the Tutorials section of the forum.

    I made draft updates to the former. I think there is some need to deduplicate these two sources.
