Parameters for running GenomicsDB import

I have a system with about 8GB RAM. I've run HaplotypeCaller (-ERC GVCF) on specific genes of my interest using a .list file and have 109 **.g.vcf.gzs **of about 5-10 GB each. What would be the most optimal way to run GenomicsDBImport on these samples for Joint Calling ? Will I need to further subset these files into specific intervals or set a batch size ?

GATK version - 4.0.11, Java version-1.8

Optimal = Avoid errors, Maximise input samples, minimise computational load and minimise time in that order.

Answers

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin
    edited December 2018

    Hi @MehulS

    We apologize we were unable to get to your question and our team is on a holiday until Jan 2nd 2019. We will come back and get to your question asap.
    Merry Christmas and Happy New Year!

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @mehulS, I am still working on getting an answer to your question.

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hello @MehulS - I have heard back from the development team.

    This is the advice:

    1.) add the argument --merge-input-intervals to the command.
    2.) a batch size of 50 seems to be the right amount for batching files with this command because it keeps the memory below 8GB.
    3.) give GATK less memory (maybe 7GB) because GDB takes up memory outside of Java

  • gauthiergauthier Member, Broadie, Moderator, Dev admin
    edited January 7

    Hi @MehulS,

    This should be doable with the amount of memory you have. We use a batch size of 50 in production, which typically keeps memory in a very reasonable range, especially for the number of samples you have. I added a new argument recently to help with importing intervals (--merge-input-intervals). I'd suggest you run the command as follows:

    gatk --java-options "-Xmx6g -Xms6g"
    GenomicsDBImport \
    --genomicsdb-workspace-path my_database \
    --batch-size 50 \
    -L genesOfInterest.list \
    --merge-input-intervals \
    --sample-name-map cohort.sample_map \
    --tmp-dir=/path/to/large/tmp \
    --reader-threads 1

    I'm suggesting you give GATK less than the amount of memory you have available because GenomicsDB needs extra memory outside of Java and we have to set some aside.

    How many megabases is your gene list? For our ~35Mb exome if we scatter 10 ways then the jobs finish in around an hour for a cohort of 60 samples. (I'd suggest doing the scatter with Cromwell -- we can't help you debug any shell script solutions, for example.) . It's not exactly linear, but close enough that you can extrapolate from there.

    Unfortunately that arg hasn't made it into a release yet, but in theory you can get the nightly from dockerhub: https://hub.docker.com/r/broadinstitute/gatk-nightly/tags That repo looks broken, though. While I follow up on that, you can still run the import, it will just run a lot slower without --merge-input-intervals. The other option is to merge the intervals yourself. For each import, instead of your gene list pass in an interval list such that there is one interval per contig that contains all of your intervals. I don't think bedtools can do this, but it's a quick bash or python script.

  • MehulSMehulS Member

    Thank you for your responses. I noticed that specifying an intervals list or a list file is compulsory in the command. But I have specified this when I ran haplotypecaller and my GVCFs are already interval-specific.

    Also, the GATK blog here says> "However the --intervals argument value must be a single interval, not a list,"
    so I'm a bit confused if I'll be able to specify my.list file which contains multiple intervals (assuming I need to)

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    HI @MehulS The complete paragraph from that section of the blog is:

    Note that the GVCFs can also be passed in as a list or map instead of being enumerated in the command. However the --intervals argument value must be a single interval, not a list, because this functionality was designed from the start to be used from within a script that scatters execution over multiple intervals. We'd like to enable running on one more intervals in one go, but we might not get to that for awhile, so for now you need to run on each interval separately.

    So, the answer seems to be you might need to run each interval separately for now.

    @gauthier Has this changed?

  • gauthiergauthier Member, Broadie, Moderator, Dev admin

    Sorry, but that document is out of date. I'll try to get it updated this week. The functionality to run GenomicsDBImport over multiple intervals was added over the summer with the most stable versions in 4.0.8.0 and later (https://github.com/broadinstitute/gatk/releases). You do still have to specify at least some interval, though.

  • MehulSMehulS Member

    So, is it OK to use my interval-specific GVCF files that I generated from haplotypecaller along with specifying the .list file that I used to create them ? (Same .list file that I used in haplotype caller).

    Also, pardon me if I'm diverting but I'm also curious to know why GDBimport requires intervals necessarily in the first place.

  • gauthiergauthier Member, Broadie, Moderator, Dev admin

    Yes, you can use the same .list file you used to call the GVCFs to import them, but it could be faster if you merge the intervals. That can be plan B if your tasks are taking too long.

    I'm not 100% sure why GDB needs intervals, but it doesn't require a reference, so I'm guessing it can't traverse its data structure without a contig to query.

  • MehulSMehulS Member
    edited January 10

    I should have asked this earlier, but what does it mean to merge intervals anyway ? Make the discontinuous intervals into a single continuous interval ?

  • gauthiergauthier Member, Broadie, Moderator, Dev admin

    Exactly. For each contig, make a single new interval by taking the beginning of the first interval and the end of the last interval. Given that there's no data in between since you called using the gene list the extra territory won't cost you anything and it will save you a lot of time instantiating the data structure for each additional interval.

  • MehulSMehulS Member

    Thanks. Why would I need a tool or script for that though ? I could just manually create a single interval from start to end (per chromosome, say) covering all my desired genes from start to end.

    If I have, say, intervals : chr1: 1-10
    chr1: 40-50
    chr5: 70-90
    chr5: 100-1000

    I could manually create a new .list of chr1: 1-50 chr5: 70-1000.

  • gauthiergauthier Member, Broadie, Moderator, Dev admin

    Depends on how many ways you end up scattering. ;-) I had a job that I scattered 50 ways and I didn't want to make 50 interval lists by hand.

  • MehulSMehulS Member

    That seems fine. I'm not familiar with the scattering process anyway since I run everything locally via the command line rather than using WDL/Cromwell. So I won't be using scattering in all likelihood. Or would you deem that a necessary component of running this process ?

  • MehulSMehulS Member

    Also, any chance this > "You can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples." has changed ?

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin

    Hi @MehulS

    This is an addition we plan on working on in the future. Here is the git issue associated with it: https://github.com/broadinstitute/gatk/issues/4773
    You can get updates on the issue by following it.

Sign In or Register to comment.