How to combine the intervals after GenomicsDBImport

vang Member ✭✭

I am trying to build a GenomicsDB database using GenomicsDBImport on 6000 samples. Running on all chromosomes in one go seems to take many weeks of processing time. If I use small intervals, how do I then combine the different GenomicsDB folders afterwards? Or is there a different strategy?

I use GATK 4.1.1, and this is my command for all chromosomes, which takes forever:

gatk --java-options "-Xmx4g -Xms4g" \
GenomicsDBImport \
--genomicsdb-workspace-path genomicsdb \
--batch-size 50 \
-L all_chromosomes.bed \
--sample-name-map sample_map \
--tmp-dir=./tmp \
--reader-threads 5

Answers

  • vang Member ✭✭

    I am now rerunning with 64 GB of memory and v4.1.2. The speed looks the same.

  • bhanuGandham Cambridge MA, Member, Administrator, Broadie, Moderator admin
    edited April 26

    Hi @vang

    If you want to run joint genotyping, you do not need to combine the different GenomicsDB folders afterwards. You can run GenotypeGVCFs directly on a GenomicsDB workspace, as shown in the usage example in these docs: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_GenotypeGVCFs.php
    https://software.broadinstitute.org/gatk/documentation/article?id=11813
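
A minimal sketch of what this could look like with one workspace per interval, assuming each chromosome is imported separately and then genotyped from its own workspace. The reference `ref.fasta`, the chromosome names, and the workspace and output file names are hypothetical; the import options are copied from the command in the question. The script only prints the commands unless `GATK=gatk` is set.

```shell
#!/usr/bin/env bash
# Sketch: one GenomicsDB workspace per chromosome, genotyped independently.
# Hypothetical names: ref.fasta, chr1/chr2, genomicsdb_<chrom>, genotyped_<chrom>.vcf.gz.
set -euo pipefail

# Dry run by default: commands are echoed, not executed.
# Run with GATK=gatk in the environment to execute them for real.
GATK="${GATK:-echo gatk}"

per_interval_commands() {
    local chrom
    for chrom in "$@"; do
        # Import all samples for this interval into its own workspace.
        $GATK GenomicsDBImport \
            --genomicsdb-workspace-path "genomicsdb_${chrom}" \
            --batch-size 50 \
            -L "${chrom}" \
            --sample-name-map sample_map \
            --tmp-dir ./tmp \
            --reader-threads 5
        # Joint-genotype straight from that workspace via the gendb:// prefix.
        $GATK GenotypeGVCFs \
            -R ref.fasta \
            -V "gendb://genomicsdb_${chrom}" \
            -O "genotyped_${chrom}.vcf.gz"
    done
}

per_interval_commands chr1 chr2
```

Each interval is independent, so the loop iterations could be submitted as separate cluster jobs. The per-interval VCFs would still need to be concatenated afterwards (for example with a tool such as GatherVcfs), which is a step this thread does not cover.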

  • micknudsen Denmark, Member ✭✭

    Hi @bhanuGandham

    I also find the new GenomicsDB approach rather confusing. I would like to replicate the "solving the N+1 problem" solution from GATK3.

    Suppose that I have N samples. I would then:

    1) Run HaplotypeCaller on the N samples in GVCF mode.
    2) Run CombineGVCFs on all GVCF files to get a single GVCF file with N samples.

    This GVCF file would then be used to better genotype new samples:

    3) When a new sample (N+1) arrives, run HaplotypeCaller on this sample and then GenotypeGVCFs on the output GVCF together with the N others.
    4) Recalibrate variants and run SelectVariants to get only the recalibrated variants for sample N+1.

    In GATK4, the role of CombineGVCFs has been taken over by GenomicsDBImport. However, GenotypeGVCFs now only accepts a single -V input, so one cannot use both the DB (N samples) and the new (N+1) sample as input. Also, it is not possible to add GVCFs to an existing GenomicsDB.

    Does this mean that if one wants to replicate the GATK3 solution, one must first run CombineGVCFs on the (huge) N-sample GVCF file and the new (N+1) GVCF file and use that as input to GenotypeGVCFs? Alternatively, one must create a new GenomicsDB each time? Both solutions will require several days of computation, so I am pretty sure that I am missing something obvious here.

    Thanks,
    Michael

  • bhanuGandham Cambridge MA, Member, Administrator, Broadie, Moderator admin

    Hi @micknudsen

    Does this mean that if one wants to replicate the GATK3 solution, one must first run CombineGVCFs on the (huge) N-sample GVCF file and the new (N+1) GVCF file and use that as input to GenotypeGVCFs? Alternatively, one must create a new GenomicsDB each time?

    That is correct. I know this is a huge bother and we are trying to improve this process, but this is the best we have for now.
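
For illustration, the CombineGVCFs route confirmed above might be sketched as follows. This is only an outline under assumptions: `ref.fasta`, `cohort_N.g.vcf.gz`, `new_sample.bam`, and the output names are hypothetical placeholders, and the script prints the commands rather than running them unless `GATK=gatk` is set.

```shell
#!/usr/bin/env bash
# Sketch: add sample N+1 to an existing N-sample GVCF, then joint-genotype.
# Hypothetical names: ref.fasta, cohort_N.g.vcf.gz, new_sample.bam, output files.
set -euo pipefail

# Dry run by default: commands are echoed, not executed.
GATK="${GATK:-echo gatk}"

add_sample_commands() {
    local cohort_gvcf="$1" new_bam="$2" new_name="$3"
    # Call the new sample in GVCF mode.
    $GATK HaplotypeCaller \
        -R ref.fasta \
        -I "${new_bam}" \
        -O "${new_name}.g.vcf.gz" \
        -ERC GVCF
    # Merge the existing N-sample GVCF with the new sample's GVCF.
    $GATK CombineGVCFs \
        -R ref.fasta \
        -V "${cohort_gvcf}" \
        -V "${new_name}.g.vcf.gz" \
        -O "cohort_plus_${new_name}.g.vcf.gz"
    # GenotypeGVCFs takes the single combined GVCF as its only -V input.
    $GATK GenotypeGVCFs \
        -R ref.fasta \
        -V "cohort_plus_${new_name}.g.vcf.gz" \
        -O "cohort_plus_${new_name}.vcf.gz"
}

add_sample_commands cohort_N.g.vcf.gz new_sample.bam new_sample
```

The alternative mentioned in the question, rebuilding the GenomicsDB with all N+1 GVCFs, would replace the CombineGVCFs step with a fresh GenomicsDBImport run.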

  • vang Member ✭✭

    Thanks for your reply.
    Is that what you do at The Broad in production? Build a new GenomicsDB for each new sample? And how many intervals do you split the genome into then?

    Thanks.
