GenomicsDBImport runs slowly with multiple samples

Since I took the GATK4 training earlier this year, I have switched my resequencing analysis from GATK 3.6 to GATK 4.0. Everything worked fine except the GVCF consolidation step: GenomicsDBImport takes much longer than the traditional CombineGVCFs. On a pilot run of three samples, it took 40 hours while CombineGVCFs took only one hour.
Now I am expanding to 140 samples. I gave GenomicsDBImport a test run on a 1-Mb region of one chromosome: it took 5 hours, while CombineGVCFs took 10 minutes. After a few runs I realized that the running time scales linearly with my sample size and interval size. Therefore, I tried a few parameters as suggested:
1. simply run gatk --java-options "-Xmx4g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" GenomicsDBImport --genomicsdb-workspace-path my_db --intervals chr2:1-10000000 -R ref.fa -V gvcf.list
2. use --batch-size 10, which took 5% longer than the first solution
3. use --batch-size 10 --reader-threads 5, no difference from solution 2
4. use -L intervals.list, which split the 10 Mb into 10 intervals but took the same time as solution 2
It turns out that GenomicsDBImport without any extra parameters runs the fastest, although it is still too slow. I have used CombineGVCFs to generate a consolidated GVCF file for GenotypeGVCFs, but I noticed that GenomicsDBImport and CombineGVCFs produced different variants. I would like to compare the variants side by side.

So my questions are:
1. What are the criteria for splitting one chromosome into a few dozen intervals without causing problems? For example, if a gene is split across two intervals, does it affect the following steps?
2. What is the procedure after GenomicsDBImport finishes on consecutive intervals, given that GenotypeGVCFs runs on only one database at a time? I'm testing this now, so I should have an answer this week.
In any case, I didn't expect GenomicsDBImport to run so slowly, because I had read in the GATK4 forum that CombineGVCFs is the slower tool. I would appreciate a fix for its slowness; otherwise, I have to split chromosomes into small intervals.

Answers

  • SkyWarrior (Turkey, Member ✭✭✭)

    You may skip the consolidate step in GenomicsDBImport unless you have gazillions of batches to import. I am working with ~1500 samples in 50-sample batches, and 30 batches do not really disturb the system during genotyping. You may skip consolidation until you have more than 100 or maybe 200 batches in a single import.

  • TonyWei (Member)

    @SkyWarrior said:
    You may skip the consolidate step in GenomicsDBImport unless you have gazillions of batches to import. I am working with ~1500 samples in 50-sample batches, and 30 batches do not really disturb the system during genotyping. You may skip consolidation until you have more than 100 or maybe 200 batches in a single import.

    Thank you for the quick reply. Do you mean that I can run GenotypeGVCFs on multiple GVCF files unless I have something like a 5000-sample cohort?

  • TonyWei (Member)

    I agree that I don't need --batch-size for my 140 samples. The real problem is how to run GenomicsDBImport in a reasonable time, say a few days. It took 5 hours for 1 Mb, which means at least 6 weeks for a 200-Mb chromosome.

  • SkyWarrior (Turkey, Member ✭✭✭)

    Sorry, I think I was confused. I was wondering whether you were using the consolidate option at the GenomicsDBImport stage; I think that is unnecessary unless you have too many samples. GenomicsDBImport is still needed, and you can import your GVCFs in smaller batches to make the process memory-efficient and faster. My import of ~1500 WES samples takes about 16 hours in total, and I parallelize the imports using a Perl script. Then I process them in parallel with GenotypeGVCFs using another Perl script.

    CombineGVCFs is the older approach and usually cumbersome in the later stages.

    I can share my scripts here if anyone is interested. (Indeed, I did this before, but I cannot find the topic.)

  • TonyWei (Member)

    @SkyWarrior
    Hi, how did you parallelize the GVCF imports into a DB using a Perl script? Are you saying that I can split 140 samples into 14 ten-sample batches and run GenomicsDBImport into the same DB in order to parallelize this step? Something like,
    gatk GenomicsDBImport --genomicsdb-workspace-path my_db -V s1.g.vcf.gz -V s2.g.vcf.gz ...
    gatk GenomicsDBImport --genomicsdb-workspace-path my_db -V s11.g.vcf.gz -V s12.g.vcf.gz ...
    gatk GenomicsDBImport --genomicsdb-workspace-path my_db -V s21.g.vcf.gz -V s22.g.vcf.gz ...
    ...

  • SkyWarrior (Turkey, Member ✭✭✭)

    Below is the script for the import. I run it in 4 separate instances, which could also be automated, but I am lazy. I may restructure this for better automation.

    #!/usr/bin/perl
    use warnings;
    use strict;
    # Base import command; $HG19FULL is an environment variable holding the
    # reference path, expanded by the shell (not Perl) when system() runs.
    my $gatkcommand = "gatk --java-options '-Xmx24g -Djava.io.tmpdir=./tmp' GenomicsDBImport -R \$HG19FULL --sample-name-map sn.txt --reader-threads 2 --batch-size 50 ";
    my $contigfile = 'contigs.list';
    open my $contig, '<', $contigfile or die "Could not open $contigfile: $!";
    # One GenomicsDB workspace per contig, named after the contig itself.
    while ( my $line = <$contig> ) {
        chomp($line);
        my $gatkfinalcommand = $gatkcommand . "--genomicsdb-workspace-path " . $line . " -L " . $line;
        system($gatkfinalcommand);
    }
    close $contig;
    

    Below is the script for genotyping:

    #!/usr/bin/perl
    use warnings;
    use strict;
    # Base genotyping command; -new-qual enables the newer quality model.
    my $gatkcommand = "gatk --java-options '-Xmx24g -Djava.io.tmpdir=./tmp' GenotypeGVCFs -R \$HG19FULL -new-qual";
    my $contigfile = 'contigs.list';
    open my $contig, '<', $contigfile or die "Could not open $contigfile: $!";
    # Genotype each per-contig workspace into its own VCF.
    while ( my $line = <$contig> ) {
        chomp($line);
        my $gatkfinalcommand = $gatkcommand . " --variant gendb://" . $line . " -O All_" . $line . ".vcf.gz";
        system($gatkfinalcommand);
    }
    close $contig;
    

    You need to create two files:

    1- sn.txt, a tab-separated file that contains sample names and sample paths
    2- contigs.list, a file defining your own intervals. I am importing per chromosome, so my file looks like:
    chr1
    chr2
    etc.

    Since I am running these in 4 separate instances, my contigs.list files contain separate contigs, sorted by length to distribute the workload evenly across the 4 instances.
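    As a rough illustration of the two helper files, here is a sketch; the sample names, paths, and contig names below are placeholders, not the poster's actual data.

```shell
# Sketch of the two helper files the scripts read.
# Sample names, paths, and contigs are placeholders.

# sn.txt: one sample per line, tab-separated <sample name><TAB><gvcf path>.
printf '%s\t%s\n' \
    s1 gvcfs/s1.g.vcf.gz \
    s2 gvcfs/s2.g.vcf.gz > sn.txt

# contigs.list: one contig per line; the import script uses each entry
# both as the -L interval and as the workspace directory name.
printf '%s\n' chr1 chr2 chr3 > contigs.list
```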

  • TonyWei (Member)

    @SkyWarrior
    Thank you for sharing the code. I did the same as you, running GenomicsDBImport on each chromosome.
    However, each chromosome will take 42 days (based on 5 hours per Mb). So my situation is that GenomicsDBImport takes too long even with individual chromosomes. :(

  • SkyWarrior (Turkey, Member ✭✭✭)

    Are you working with whole-genome or whole-exome samples?

  • TonyWei (Member)

    Whole genome resequencing. Genome size 3 Gb; nine chromosomes.

  • SkyWarrior (Turkey, Member ✭✭✭)

    You may need to divide your chromosomes into more manageable fragments. For instance, you can modify the contigs.list file and run both Perl scripts again. By the way, no one can claim that this process will finish in a couple of hours: you are working with whole-genome sequencing and you have 140 samples, which is comparable to 4200 human WES samples.
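    One way to build such a fragment list is to cut each chromosome into fixed-size windows. A minimal sketch, assuming a hypothetical 200-Mb chromosome and 20-Mb fragments (the earlier question about splitting at masked regions still applies; this only shows the arithmetic):

```shell
# Sketch: emit fixed-size -L fragments for one chromosome.
# CHR, LEN and STEP are placeholders for your own genome.
CHR=chr1
LEN=200000000      # chromosome length in bp
STEP=20000000      # fragment size: 20 Mb
start=1
while [ "$start" -le "$LEN" ]; do
    end=$(( start + STEP - 1 ))
    if [ "$end" -gt "$LEN" ]; then end=$LEN; fi
    echo "${CHR}:${start}-${end}"
    start=$(( end + 1 ))
done > fragments.list
```

    Each line of fragments.list is a GATK-style interval (chr:start-end) that can replace a whole-chromosome entry in contigs.list.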

  • TonyWei (Member)

    Couldn't agree more. I'm testing -L with smaller intervals right now; ~20 Mb takes one week, which I can accept.
    However, I need to verify the results before moving on, as I don't know how much difference the splitting causes. When I compared CombineGVCFs with GenomicsDBImport, the variants called from the resulting GVCFs showed 40% differences, which is too much.

  • Sheila (Broad Institute, Moderator, admin)

    @TonyWei
    Hi,

    Have a look at this thread as well.

    -Sheila

  • TonyWei (Member)

    @Sheila

    I read this thread before I posted. Here I quote:
    "The existing tool does scale to tens of thousands of samples successfully, but you need to use smaller intervals (not a whole chromosome at once), and use the --batchSize argument to control how many samples are processed at once when there are many samples."

    I'm splitting one chromosome into ~20-Mb intervals, and the running time is reasonable. My real question is: what is the right way to split a chromosome into intervals without causing problems? I assume that splitting at masked regions should be fine, but does it make any difference to split one chromosome into 10 intervals versus 1000? I'm running some small-scale tests. I hope you can give me some advice.
    Best,
    Tony

  • TonyWei (Member)

    Thanks, Sheila. This is what I need.
    Best,
    Tony

  • Alexandra (Italy, Member)
    Dear all,
    I'm running GenomicsDBImport for 30 samples. It takes so much time that after 3 days the job was killed for exceeding the walltime limit. Is there a way to make it faster?
    I thought of running a cycle of separate jobs for each chromosome... is that possible? In that case, should GenomicsDBImport create a different folder for each chromosome? And at that point, can GenotypeGVCFs open all the different chromosome folders, or should I merge them?
    Thanks!!!
  • TonyWei (Member)
    In my case, CombineGVCFs runs faster than GenomicsDBImport regardless of sample size or chromosome size.
    I'm using CombineGVCFs in my pipeline now, although I still don't know why GenomicsDBImport is slow in my hands when it's the tool recommended in the Best Practices.

    > @Alexandra said:
    > Dear all,
    > I'm running GenomicsDBImport for 30 samples. It takes so much time that after 3 days the job was killed for exceeding the walltime limit. Is there a way to make it faster?
    > I thought of running a cycle of separate jobs for each chromosome... is that possible? In that case, should GenomicsDBImport create a different folder for each chromosome? And at that point, can GenotypeGVCFs open all the different chromosome folders, or should I merge them?
    > Thanks!!!
  • SkyWarrior (Turkey, Member ✭✭✭)

    CombineGVCFs is not an optimal solution when it comes to thousands of samples. Believe me, combining more than 1500 WES samples is not a trivial task for CombineGVCFs.

  • Alexandra (Italy, Member)
    My next step will be to analyze 200 samples. I read, as @SkyWarrior wrote, that this is not a trivial task for CombineGVCFs.
  • evetc (Member)
    At what sample size is it best to run GenomicsDBImport over CombineGVCFs? I will have ~525 samples.
  • SkyWarrior (Turkey, Member ✭✭✭)

    I would say go with GenomicsDBImport.

  • evetc (Member)

    OK, thank you @SkyWarrior, I will do that.

    Also, thank you for your Perl scripts. I have never used Perl before. Can I confirm that I am understanding them correctly?
    Your GenomicsDBImport script creates separate databases per contig listed in your contigs.list, and then you genotype using GenotypeGVCFs across all of those databases to produce one final .vcf file?

    I am currently running this on a test set of 96 individuals:

        time ~/bin/gatk/gatk --java-options "-Xmx4g -Xms4g" GenomicsDBImport \
           --genomicsdb-workspace-path ../4.GVCFs/my_database \
           --batch-size 48 \
           --reader-threads 10 \
           --intervals ~/Pararge_aegeria/PA_genome_CW/index/Pararge_aegeria_v2.softmasked.fa.bed \
           --sample-name-map ../files/cohort_sample_map
    

    It uses the whole genome as a BED file; it's a non-model organism, so it has loads of scaffolds. It has been running non-stop since around 17:00 yesterday and is still importing the first batch of 48. This seems slow to me.

    I am not sure I understand what the --reader-threads option does. I thought it increased the number of threads used, but it looks to still be running on one.

    Sorry for all my questions and thank you for your help!

  • evetc (Member)
    edited July 5

    As a follow-up, I have tried running your GenomicsDBImport script and receive this error:
    A USER ERROR has occurred: Failure working with the tmp directory /pub37/eve/Pararge_aegeria/1.trimmed/Index_1-501_701/practice/perl/./tmp. Try changing the tmp dir with with --tmp-dir on the command line. Exact error was should exist and have read/write access
    What does --java-options '-Xmx24g -Djava.io.tmpdir=./tmp' do?

    Thanks

  • SkyWarrior (Turkey, Member ✭✭✭)

    The --reader-threads option will surely have ins and outs depending on the resources of your compute environment. I am working on a local server, so I limit --reader-threads to 2 to be on the safe side. Also, my script works fine for a known organism (human), but I cannot guarantee a perfect run for non-model organisms with lots of contigs; it can be tweaked to work, though.

    My Perl script is a fairly straightforward one: it generates one DB per contig. Once genotyped, you receive one VCF per contig, and from there you need to collect all of those VCFs into a single file.
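    That collection step can be done with a tool such as GatherVcfs (bundled with GATK4), which expects its inputs in reference order. A sketch that builds the command from the same contigs.list the scripts read; the contig names are placeholders, and the command is printed rather than executed here:

```shell
# Placeholder contigs.list for illustration.
printf '%s\n' chr1 chr2 > contigs.list

# Build one GatherVcfs invocation over the per-contig outputs, keeping
# them in the reference order given by contigs.list.
args=""
while read -r contig; do
    args="$args -I All_${contig}.vcf.gz"
done < contigs.list
echo "gatk GatherVcfs$args -O All_combined.vcf.gz" > gather.cmd
cat gather.cmd
```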

    The tmp-dir behavior seems to have changed in recent versions of GATK4, so it is better to stick with the recommended switch for changing it rather than forcing it through Java options.
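    Per the "should exist and have read/write access" message quoted above, a missing directory is the usual cause. A sketch of the fix, creating the directory up front and using the --tmp-dir switch the error message itself recommends (the path is a placeholder, and the gatk command is printed rather than executed):

```shell
# Create the tmp directory before the run; the "should exist and have
# read/write access" error appears when it is missing.
mkdir -p ./tmp

# Pass it with --tmp-dir (the switch the error message recommends)
# instead of -Djava.io.tmpdir. Printed here rather than executed.
echo "gatk --java-options '-Xmx24g' GenomicsDBImport --tmp-dir ./tmp ..." > import.cmd
cat import.cmd
```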

  • evetc (Member)

    @SkyWarrior - amazing, thank you so much for your help.
    I am currently running your Perl script and will see what happens and how long it takes - fingers crossed!
