Attention:
The frontline support team will be unavailable to answer questions until May 27th, 2019. We will be back soon after. Thank you for your patience, and we apologize for any inconvenience!

GenomicsDBImport runs slowly with multiple samples

Since taking the GATK4 training earlier this year, I have switched my resequencing analysis from GATK 3.6 to GATK 4.0. Everything worked fine except the GVCF consolidation step: GenomicsDBImport takes much longer than the traditional CombineGVCFs. On a pilot run of three samples, it took 40 hours while CombineGVCFs took only one hour.
Now I am expanding to 140 samples. I gave GenomicsDBImport a test run on a 1-Mb region of one chromosome; it took 5 hours, while CombineGVCFs took 10 minutes. After a few runs I realized that the running time scales linearly with both sample size and interval size. I therefore tried a few parameters as suggested, including:
1. Simply run gatk --java-options "-Xmx4g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" GenomicsDBImport --genomicsdb-workspace-path my_db --intervals chr2:1-10000000 -R ref.fa -V gvcf.list
2. Use --batch-size 10, which took 5% longer than option 1.
3. Use --batch-size 10 --reader-threads 5, which made no difference from option 2.
4. Use -L intervals.list, which split the 10 Mb into 10 intervals but took the same time as option 2.
It turns out that GenomicsDBImport without any extra parameters runs the fastest, although it is still too slow. I have used CombineGVCFs to generate a consolidated GVCF file for GenotypeGVCFs, but I noticed that GenomicsDBImport and CombineGVCFs produced different variants. I would like to compare the variants side by side.

So my questions are,
1. What are the criteria for splitting one chromosome into a few dozen intervals without causing potential problems? For example, if a gene is split across two intervals, does that affect the following steps?
2. What is the procedure after GenomicsDBImport finishes on consecutive intervals, given that GenotypeGVCFs only runs on an individual database? I'm testing this now, so I should have an answer this week.
After all, I didn't expect GenomicsDBImport to run so slowly, because I had read in the GATK4 forum that CombineGVCFs is the slower tool. I would appreciate it if this slowness could be fixed; otherwise, I will have to split chromosomes into small intervals.
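For reference, the fixed-size splitting I have in mind can be sketched like this (a hypothetical Python illustration, not anything from the GATK toolkit; the chromosome name and length are made up):

```python
def split_chromosome(name, length, chunk_size=20_000_000):
    """Split a chromosome into 1-based, inclusive intervals of at most chunk_size bp."""
    intervals = []
    start = 1
    while start <= length:
        end = min(start + chunk_size - 1, length)
        intervals.append(f"{name}:{start}-{end}")
        start = end + 1
    return intervals

# A hypothetical 200-Mb chromosome in 20-Mb chunks gives 10 intervals,
# which could be written to a .list file and passed to -L.
for interval in split_chromosome("chr2", 200_000_000):
    print(interval)
```

Each line printed is in the chr:start-end form that -L accepts, so the output could be redirected straight into an intervals file.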

Answers

  • SkyWarrior (Turkey) Member ✭✭✭

    You may skip the consolidate step in GenomicsDBImport unless you have gazillions of batches to import. I am working with ~1500 samples in 50-sample batches, and 30 batches do not really disturb the system during genotyping. You may skip consolidation until you have more than 100, or maybe 200, batches in a single import.

  • TonyWei Member

    @SkyWarrior said:
    You may skip the consolidate step in GenomicsDBImport unless you have gazillions of batches to import. I am working with ~1500 samples in 50-sample batches, and 30 batches do not really disturb the system during genotyping. You may skip consolidation until you have more than 100, or maybe 200, batches in a single import.

    Thank you for the quick reply. Do you mean that I can run GenotypeGVCFs with multiple GVCF files unless I have a 5000-sample cohort?

  • TonyWei Member

    I agree with you that I don't need --batch-size for my 140 samples. The real problem is how to run GenomicsDBImport in a reasonable time, say a few days. It took 5 hours for 1 Mb, which means at least 6 weeks for a 200-Mb chromosome.

  • SkyWarrior (Turkey) Member ✭✭✭

    Sorry, I was confused, I guess. I was asking whether you were using the consolidate option at the GenomicsDBImport stage; that, I think, is unnecessary unless you have too many samples. GenomicsDBImport is still needed, and you can import your GVCFs in smaller batches to make the process memory-efficient and faster. My import of 1500 WES samples takes about 16 hours total, and I parallelize the imports using a Perl script. I then process the results in parallel with GenotypeGVCFs using the same approach.

    CombineGVCFs is old and deprecated, and usually cumbersome in the later stages.

    I can share my scripts here if anyone is interested. (Indeed, I did this before, but I cannot find the topic.)

  • TonyWei Member

    @SkyWarrior
    Hi, how did you parallelize GVCF imports into a DB using a Perl script? Are you saying that I can split 140 samples into 14 ten-sample batches and run GenomicsDBImport into the same DB in order to parallelize this step? Something like,
    gatk GenomicsDBImport --genomicsdb-workspace-path my_db -V s1.g.vcf.gz -V s2.g.vcf.gz ...
    gatk GenomicsDBImport --genomicsdb-workspace-path my_db -V s11.g.vcf.gz -V s12.g.vcf.gz ...
    gatk GenomicsDBImport --genomicsdb-workspace-path my_db -V s21.g.vcf.gz -V s22.g.vcf.gz ...
    ...

  • SkyWarrior (Turkey) Member ✭✭✭

    Below is the script for import. I run this in 4 separate instances, which could still be automated, but I am lazy. I may restructure this for better automation.

    #!/usr/bin/perl
    use warnings;
    use strict;

    # Base GenomicsDBImport command; $HG19FULL is expanded by the shell when the command runs.
    my $gatkcommand = "gatk --java-options '-Xmx24g -Djava.io.tmpdir=./tmp' GenomicsDBImport -R \$HG19FULL --sample-name-map sn.txt --reader-threads 2 --batch-size 50 ";
    my $contigfile = 'contigs.list';
    open my $contig, '<', $contigfile or die "Could not open $contigfile: $!";
    while (my $line = <$contig>) {
        chomp($line);
        # One workspace per contig, named after the contig itself.
        my $gatkfinalcommand = $gatkcommand . "--genomicsdb-workspace-path " . $line . " -L " . $line;
        system($gatkfinalcommand);
    }
    close $contig;
    

    Below is the script for genotyping:

    #!/usr/bin/perl
    use warnings;
    use strict;

    # Base GenotypeGVCFs command; $HG19FULL is expanded by the shell when the command runs.
    my $gatkcommand = "gatk --java-options '-Xmx24g -Djava.io.tmpdir=./tmp' GenotypeGVCFs -R \$HG19FULL -new-qual";
    my $contigfile = 'contigs.list';
    open my $contig, '<', $contigfile or die "Could not open $contigfile: $!";
    while (my $line = <$contig>) {
        chomp($line);
        # Read each per-contig workspace via the gendb:// scheme.
        my $gatkfinalcommand = $gatkcommand . " --variant gendb://" . $line . " -O All_" . $line . ".vcf.gz";
        system($gatkfinalcommand);
    }
    close $contig;
    

    You need to create 2 files.

    1- sn.txt, a tab-separated file that contains sample names and sample paths, one "samplename<TAB>/path/to/sample.g.vcf.gz" per line
    2- contigs.list, a file to create your own intervals. I am importing per chromosome, so my file looks like
    chr1
    chr2
    etc....

    Since I am running these in 4 separate instances, my contigs.list files contain separate contigs sorted by length to distribute the workload evenly across the 4 instances.
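The length-sorted split across instances can be sketched as a greedy longest-first assignment (a Python illustration of the idea only, not my actual Perl script; the contig lengths, in Mb, are hypothetical):

```python
def distribute_contigs(contig_lengths, n_instances=4):
    """Greedy longest-first assignment of contigs to instances, balancing total length."""
    bins = [{"total": 0, "contigs": []} for _ in range(n_instances)]
    for name, length in sorted(contig_lengths.items(), key=lambda kv: -kv[1]):
        target = min(bins, key=lambda b: b["total"])  # least-loaded instance so far
        target["contigs"].append(name)
        target["total"] += length
    return [b["contigs"] for b in bins]

# Hypothetical lengths in Mb; each sub-list would become one instance's contigs.list.
lengths = {"chr1": 249, "chr2": 243, "chr3": 198, "chr4": 190,
           "chr5": 182, "chr6": 171, "chr7": 159, "chr8": 145}
print(distribute_contigs(lengths, n_instances=4))
```

Each returned sub-list could then be written to its own contigs.list file and fed to one instance of the import script.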

  • TonyWei Member

    @SkyWarrior
    Thank you for sharing the code. I did the same as you, running GenomicsDBImport on each chromosome separately.
    However, each chromosome will take about 42 days (based on 5 hours per 1 Mb). So my situation is that GenomicsDBImport takes too long even on individual chromosomes. :(

  • SkyWarrior (Turkey) Member ✭✭✭

    Are you working with whole genome or whole exome samples?

  • TonyWei Member

    Whole genome resequencing. Genome size 3 Gb; nine chromosomes.

  • SkyWarrior (Turkey) Member ✭✭✭

    You may need to divide your chromosomes into more manageable fragments for this purpose. For instance, you can modify the contigs.list file and run the Perl scripts again. BTW, no one can claim that this process will finish in a couple of hours: you are working with whole-genome sequencing and you have 140 samples. Compared to human WES, that is like working with 4200 WES samples.

  • TonyWei Member

    Couldn't agree more. I'm testing -L smaller.intervals right now; ~20 Mb takes one week, which I can accept.
    However, I need to verify the results before moving on, since I don't know how much difference the splitting causes. When I compared CombineGVCFs with GenomicsDBImport, the variants called from the resulting GVCFs showed 40% differences, which is too much.

  • Sheila (Broad Institute) Member, Broadie, Moderator, admin

    @TonyWei
    Hi,

    Have a look at this thread as well.

    -Sheila

  • TonyWei Member

    @Sheila

    I had read this thread before I posted. Here I quote:
    "The existing tool does scale to tens of thousands of samples successfully, but you need to use smaller intervals (not a whole chromosome at once), and use the --batchSize argument to control how many samples are processed at once when there are many samples."

    I'm splitting one chromosome into ~20-Mb intervals, and the running time is reasonable. My real question is: what is the right way to split one chromosome into intervals without causing potential problems? I assume that choosing masked regions as split points should be fine. But does it make any difference to split one chromosome into 10 intervals compared with 1000 intervals? I'm running some small-scale tests. I hope you can give me some advice.
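To be concrete about splitting at masked regions: given a list of masked (e.g. N-gap) regions, the unmasked stretches in between would become the intervals, so no split point lands inside callable sequence. A hypothetical Python sketch (the coordinates are made up, purely to illustrate the idea):

```python
def intervals_between_gaps(chrom, length, gaps):
    """Given 1-based, inclusive masked regions, return the unmasked intervals."""
    intervals = []
    pos = 1
    for g_start, g_end in sorted(gaps):
        if g_start > pos:
            intervals.append(f"{chrom}:{pos}-{g_start - 1}")
        pos = max(pos, g_end + 1)  # tolerate overlapping or adjacent gaps
    if pos <= length:
        intervals.append(f"{chrom}:{pos}-{length}")
    return intervals

# A hypothetical 1-Mb chromosome with two masked gaps yields three intervals.
print(intervals_between_gaps("chr2", 1_000_000,
                             [(300_001, 310_000), (700_001, 705_000)]))
```

The number of intervals then simply follows from how many gaps you split at, which is the 10-versus-1000 question above.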
    Best,
    Tony

  • TonyWei Member

    Thanks, Sheila. This is what I need.
    Best,
    Tony
