GATK4 joint genotyping for an exome pipeline: CombineGVCFs or GenomicsDBImport?

Tintest (France) Member
edited October 2018 in Ask the GATK team

Hello,

I want to use 386 exomes as a normalization group for joint genotyping in an exome diagnostic pipeline. Until now this was done with a “giant combined GVCF” split per chromosome, but I wanted to give GenomicsDBImport a try.

So I did, and I’m quite disappointed. I think I might be doing something wrong, or maybe GenomicsDBImport is not yet suited to my purpose. So I have some questions.

Building a GenomicsDB workspace takes longer than a traditional CombineGVCFs per chromosome. That wouldn’t be a problem if I could build the database once and for all, then give it plus the patient samples’ .g.vcfs to GenotypeGVCFs, or add new samples to the existing database. Do you plan to add this feature?

Because you can’t add a new sample to an already-built GenomicsDB, I would have to rebuild it with the new samples at every single pipeline execution. So I don’t see why I should use GenomicsDB, or should I perhaps use the Intel library directly? It seems to add an unwanted extra layer of complexity, and I don’t know whether it is worth it.
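For context, here is a sketch of the incremental pattern I had in mind with the “giant combined GVCF” approach: CombineGVCFs accepts an already-combined GVCF as one of its -V inputs, so the cohort file can be extended with a new sample without recombining all 386 exomes (all paths and sample names below are hypothetical):

```shell
# Hypothetical sketch: merge one new exome into an existing
# per-chromosome cohort GVCF instead of recombining everything.
# CombineGVCFs accepts a previously combined GVCF as a -V input.
gatk CombineGVCFs \
    -R hg19.fasta \
    -V cohort_chr11.g.vcf.gz \
    -V new_patient.g.vcf.gz \
    -L 11 \
    -O cohort_chr11.updated.g.vcf.gz
```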

Am I missing something?

Thank you.

Answers

  • bhanuGandham Member, Administrator, Broadie, Moderator admin

    Hi @Tintest

    I am sorry you are having trouble with GenomicsDBImport.

    1) GenomicsDBImport comes with a few limitations; for example, you can't add samples to an existing database. We are not currently working on this feature on our end, but we look forward to third-party contributions on this issue.
    2) It is surprising that GenomicsDBImport is slower than CombineGVCFs per chromosome. The latest GATK release adds new functionality to GenomicsDBImport: support for multiple intervals. This might help speed things up for your purpose.
    Example:
    gatk --java-options "-Xmx4g -Xms4g" GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf.gz \
    -V data/gvcfs/father.g.vcf.gz \
    -V data/gvcfs/son.g.vcf.gz \
    --genomicsdb-workspace-path my_database \
    --tmp-dir=/path/to/large/tmp \
    -L 18 \
    -L 19 \
    -L 20
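    With hundreds of samples, the --batch-size option may also help: it limits how many GVCF readers are held open at once, trading some speed for lower memory and file-handle pressure (a sketch with hypothetical paths; --batch-size and --sample-name-map are documented GenomicsDBImport options):

    ```shell
    # Sketch (hypothetical paths): import samples in batches of 50
    # so only 50 GVCF readers are open at a time, which reduces
    # memory and file-handle pressure for large cohorts.
    gatk --java-options "-Xmx4g" GenomicsDBImport \
        --sample-name-map cohort.sample_map \
        --genomicsdb-workspace-path my_database \
        --batch-size 50 \
        -L 11
    ```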

    Please let me know if this helps.
    For more information please follow this link to the tool docs: https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.9.0/org_broadinstitute_hellbender_tools_genomicsdb_GenomicsDBImport.php#--intervals

    Regards
    Bhanu

  • Tintest (France) Member

    Hello @bhanuGandham ,

    Thank you for your prompt answer. I'm using the official GATK 4.0.9.0 Docker container "converted" into a Singularity container for my tests with GenomicsDBImport and CombineGVCFs. Maybe that is the problem; I know I had some issues in the past because the "homemade" Singularity container was not officially supported by the GATK team. But today everything seems to work fine.

    Anyway, following our previous discussion: for example, it took 2 hours to run CombineGVCFs for chr11, while GenomicsDBImport ran for more than 24 hours and was then killed by my batch scheduler (it's not supposed to take that long).

    Here is my command for GenomicsDBImport for every chromosome. You will notice some weird ${arguments}; they come from Nextflow, my workflow manager:

    gatk GenomicsDBImport --java-options "-Djava.io.tmpdir=${params.resultDir}/tmp/ -Xmx128g" --sample-name-map ${normalization_list} --genomicsdb-workspace-path ${params.resultDir}/vcf/${params.sampleID}_${params.genomVers}_chr${chr} --intervals ${chr} --reader-threads ${task.cpus}
    

    But wait, there is more: for some chromosomes (the smallest ones), it manages to build the database in under 24 hours. But then GenotypeGVCFs gets killed after 24 hours by my batch scheduler, or dies earlier with "Java heap space" or "Java Garbage Collector full" errors. So, a memory problem... I could give it more than 128 GB but, come on, that's already overkill.

    And of course GenotypeGVCFs works fine with a CombineGVCFs output per chromosome. Here is my command for GenotypeGVCFs for every chromosome:

    gatk GenotypeGVCFs --java-options "-Djava.io.tmpdir=${params.resultDir}/tmp/ -Xmx128g" -R ${params.genomeRef} -V gendb://${params.resultDir}/vcf/${params.sampleID}_${params.genomVers}_chr${chr} -O cohort_${params.sampleID}_${chr}_genotyped.vcf.gz -G AS_StandardAnnotation -G StandardAnnotation
    
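    One thing I suspect: GenomicsDB is a native (C++) library that allocates its memory outside the Java heap, so setting -Xmx to my scheduler's full 128 GB limit leaves it no headroom. A variant I could try (hypothetical paths, same documented GenotypeGVCFs options as above):

    ```shell
    # Sketch (hypothetical paths): set -Xmx below the scheduler's
    # 128 GB limit so the native GenomicsDB library, which allocates
    # memory outside the Java heap, has room to work.
    gatk GenotypeGVCFs \
        --java-options "-Xmx100g -Djava.io.tmpdir=/large/tmp" \
        -R hg19.fasta \
        -V gendb://my_database_chr11 \
        -G StandardAnnotation -G AS_StandardAnnotation \
        -O cohort_chr11_genotyped.vcf.gz
    ```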

    So, do you maybe have some parameters to help me "tune" this? Otherwise I think I will stick with the old CombineGVCFs, even if it's not the most elegant solution.

    Thank you.

  • bhanuGandham Member, Administrator, Broadie, Moderator admin

    Hi @Tintest

    Would you please post the entire error log here, so I can ask the developers to look into it?

    Thank you.

    Regards
    Bhanu

  • Tintest (France) Member

    Sorry, I don't have the logs anymore. Anyway, if I have to rebuild the GenomicsDB for each pipeline execution, I'll stick with the good old CombineGVCFs.

    Thank you.

  • bhanuGandham Member, Administrator, Broadie, Moderator admin

    Hi @Tintest

    In the future, if you could please post the exact commands and error logs, that would be very helpful. User input helps us make GATK better, and we appreciate your feedback.
    Please get back to us if you have any other questions.

    Regards
    Bhanu
