GenomicsDBImport "RunConfigException" Error when scale is high

Kousik, Germany, Member

Hi,
I am having trouble running GenomicsDBImport successfully with ~10K samples. Most of the time I get the error shown below; the run succeeded for only a few genomic intervals.

terminate called after throwing an instance of 'RunConfigException'
  what():  RunConfigException : ifs.is_open()

Could you please help me understand the error message?

Note: I have a fairly high ulimit set (~65K), so a large number of file descriptors can be open at the same time.

The command I run (GATK version 4.0.0.0):

java -d64 -Xmx64g -Xms64g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -jar gatk.4.0.0.0.jar GenomicsDBImport \
    --genomicsdb-workspace-path /some/path/genomicsdb-workspace \
    --interval-padding 500 \
    --batch-size 50 \
    --intervals chr1:257667-297968 \
    --reader-threads 5 \
    --TMP_DIR /some/path/tmp \
    --variant /some/path/sample1.vcf.gz \
    --variant /some/path/sample2.vcf.gz \
    .......
    --variant /some/path/sample10000.vcf.gz

Thank you,

Kousik

Answers

  • Kousik, Germany, Member

    I am running on a computer cluster (LSF)

  • bhanuGandham, Member, Administrator, Broadie, Moderator

    Hi @Kousik

    I noticed you are using GATK 4.0.0.0. This version is a few months old, and this error could be due to one of several bugs that have since been fixed. Would you please install the latest version and try again? That should resolve the issue. If not, please reach out to us and we will try to reproduce the error on our end.

    Regards
    Bhanu

  • Kousik, Germany, Member

    Hi Bhanu,
    Thank you very much for your message. Since all the variants were called using gatk-4.0.0.0, we wanted to keep the same version for the rest of the pipeline. However, as you recommended, I will try the new version and update you on the outcome.
    One more thing I wanted to ask: is there a recommended interval list (for --intervals) that I should use, or would any arbitrary intervals work? At the moment we have >550 genome-wide intervals of varying length (from 500 bp to 14 million bp) and with different numbers of variants. As you can guess, large intervals take a long time on the cluster. Thank you.

    Best regards,
    Kousik

  • Kousik, Germany, Member

    Hi @bhanuGandham
    Just wanted to update you that with the latest GATK version (v4.0.10.1), I am no longer getting that error. The jobs look stable and are running without any problem.

    However, I need some more help regarding the genomic intervals. As you probably already know, the "Broad suggested intervals" vary widely in length (500 bp to >10 million bp), and if I use the same intervals to run GATK GenomicsDBImport + GenotypeGVCFs on Lustre storage in a computer cluster, it takes a significant amount of time.
    I was wondering whether I can chunk the intervals further so that the maximum chunk length is around 500 kb (I make sure that no SNPs/indels across my samples (N=10K) fall within ±50 bp of a chunk boundary). Could you tell me whether chunking the intervals would be technically wrong?
    In short, if I use different intervals for HaplotypeCaller and for GenomicsDBImport + GenotypeGVCFs, what would go wrong? Note that I am only interested in SNPs and short indels.

    I would really appreciate, if you could help me in this matter. Thank you very much.

    Best regards,
    Kousik

  • bhanuGandham, Member, Administrator, Broadie, Moderator

    Hi @Kousik

    You provide GATK tools with intervals or lists of intervals when you want to restrict them to operating on a subset of genomic regions. So if you use different regions in HaplotypeCaller and GenomicsDBImport, you are doing the analysis on different regions, which doesn't make sense. But if you want to break up the same intervals used in HaplotypeCaller into smaller subintervals for GenomicsDBImport, that should be fine.
    By default the engine will merge any intervals that abut (i.e. they are contiguous, they touch without overlapping) or overlap into a single interval. This behavior can be modified by specifying an alternate interval merging rule (see --interval-merging-rule in the Tool Docs).

    I hope this helps.

    Regards
    Bhanu
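
The merging behaviour described above can be illustrated with a small Python sketch. This is not GATK's implementation; `merge_intervals` and its `merge_abutting` switch are hypothetical names used only to show the difference between merging abutting-plus-overlapping intervals and merging overlapping intervals only. Coordinates are assumed to be 1-based and inclusive, as in strings like chr1:257667-297968.

```python
def merge_intervals(intervals, merge_abutting=True):
    """Merge (contig, start, end) intervals given in 1-based inclusive coordinates.

    With merge_abutting=True, intervals that touch (end + 1 == next start)
    are merged as well as intervals that overlap. With merge_abutting=False,
    only overlapping intervals are merged.
    """
    merged = []
    # Hypothetical simplification: contigs are sorted lexicographically here.
    for contig, start, end in sorted(intervals):
        if merged:
            last_contig, last_start, last_end = merged[-1]
            gap = 1 if merge_abutting else 0
            if contig == last_contig and start <= last_end + gap:
                # Extend the previous interval instead of adding a new one.
                merged[-1] = (last_contig, last_start, max(last_end, end))
                continue
        merged.append((contig, start, end))
    return merged


example = [("chr1", 1, 100), ("chr1", 101, 200), ("chr1", 150, 300), ("chr1", 400, 500)]
print(merge_intervals(example))
print(merge_intervals(example, merge_abutting=False))
```

One consequence worth keeping in mind: if abutting chunks of one original interval are passed together to a single run, merging of abutting intervals would join them back into one interval, whereas separate jobs with one chunk each are unaffected.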

  • Kousik, Germany, Member
    edited October 2018

    Hi @bhanuGandham
    Thank you very much for your reply. Yes, I want to break the same intervals used in HaplotypeCaller into smaller subsets for (GenomicsDBImport + GenotypeGVCFs) to reduce the runtime.

    I am using only intervals (i.e., --intervals chr1:257667-297968) and NOT lists of intervals.

    For example -
    If chr1:13004385-16799163 is used in HaplotypeCaller, I would like to break it down into three intervals, i.e.,
    chr1:13004385-14799163,
    chr1:14799164-15799163, and
    chr1:15799164-16799163

    and then run GenomicsDBImport as 3 separate jobs, one per interval.
    Will this be a problem?
    Or should I use lists of intervals? Would that make the process much faster internally (as it would still be a single job)?

    Best regards,
    Kousik
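
A split like the one described above can be generated programmatically. The following is a minimal Python sketch, assuming 1-based inclusive coordinates; `split_interval` and `max_len` are hypothetical names, and the sketch cuts at fixed lengths without checking whether a variant lies near a cut point.

```python
def split_interval(contig, start, end, max_len):
    """Split contig:start-end (1-based, inclusive) into consecutive,
    non-overlapping chunks of at most max_len bases each."""
    if start > end or max_len < 1:
        raise ValueError("need start <= end and max_len >= 1")
    chunks = []
    pos = start
    while pos <= end:
        chunk_end = min(pos + max_len - 1, end)
        chunks.append(f"{contig}:{pos}-{chunk_end}")
        pos = chunk_end + 1  # next chunk starts right after this one
    return chunks


# One interval string per job, e.g. passed as a single --intervals argument.
for interval in split_interval("chr1", 13004385, 16799163, 1_000_000):
    print(interval)
```

Each printed string covers a distinct stretch of the original interval, so the chunks together cover it exactly once.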

  • bhanuGandham, Member, Administrator, Broadie, Moderator

    Hi @Kousik

    Yes, that is fine.
    GenomicsDBImport handles individual intervals and interval lists in the same way. If you have more than one interval, then yes, using an interval list makes more sense.

    Regards
    Bhanu

  • Kousik, Germany, Member

    Thank you very much @bhanuGandham
    I think I will opt for breaking the intervals into subsets and running them as separate jobs instead of using interval lists.
    The reason is that I have tested a few intervals and observed that separate jobs are much faster than interval lists, and also easier to control on the cluster.

    Basically, I was only worried that GenomicsDBImport might not correctly recalculate annotations for gVCF reference bands that cross an interval boundary (we had this problem with CombineGVCFs in GATK 3), although I break the intervals at points where no variants are present nearby (±50 bp) in any sample.

    Thanks a ton for your help. Much appreciated.
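
The ±50 bp boundary check mentioned above could look roughly like the Python sketch below. It assumes the variant positions for one contig have already been collected across all samples into a sorted list; `variants_near_boundary`, `window`, and the example positions are hypothetical and only for illustration.

```python
from bisect import bisect_left, bisect_right


def variants_near_boundary(sorted_positions, boundary, window=50):
    """Return the variant positions within +/- window bp of a chunk boundary.

    sorted_positions must be sorted in ascending order (1-based positions
    on a single contig). An empty result means the boundary is clear.
    """
    lo = bisect_left(sorted_positions, boundary - window)
    hi = bisect_right(sorted_positions, boundary + window)
    return sorted_positions[lo:hi]


positions = [100, 500, 1049, 1051, 2000]  # made-up example positions
print(variants_near_boundary(positions, 1000))  # boundary too close to a variant
print(variants_near_boundary(positions, 1500))  # boundary is clear
```

A candidate cut point would be accepted only when the returned list is empty; otherwise the boundary can be shifted and checked again.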

  • bhanuGandham, Member, Administrator, Broadie, Moderator

    @Kousik

    Yes, you want to make sure there are no variants at the boundaries. --interval-padding would be a good option to use to provide padding.

    Regards
    Bhanu

  • Kousik, Germany, Member
    edited November 2018

    Yes, I am already using --interval-padding.
    Thank you very much.
