Very long import time heavily compounded across a transcriptome in GenomicsDBImport

Hi all,

I'm running GATK4 on a transcriptome, but run time is a problem because of the time spent importing sample VCFs. It isn't a huge dataset (N = 45), but importing these VCFs takes ~10.5 minutes per run. Doing this for every individual contig in the transcriptome quickly becomes functionally impossible in wall-clock terms (back-of-the-envelope calculation: about 1 year without parallelization). If it were possible to start the engine once and then iterate through a whole series of intervals/contigs without re-importing the samples each time, that would be a huge improvement. This may already be built in, but I haven't been able to get it to work even with the -L flag, I think because there is no overlap or sensible way to combine the contigs.
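
For a rough sense of scale, the estimate above can be reproduced with a few lines of shell. The 10.5 minutes comes from the timing quoted above; the contig count of 50,000 is an assumed placeholder, chosen only because it gives roughly one year at that rate, and is not a figure stated in this post.

```shell
# Back-of-the-envelope serial wall-clock estimate for one import per contig.
minutes_per_import=10.5   # observed import time per GenomicsDBImport run
n_contigs=50000           # ASSUMPTION: placeholder, substitute the real contig count

# awk handles the floating-point arithmetic; result is in days.
total_days=$(awk -v m="$minutes_per_import" -v n="$n_contigs" \
    'BEGIN { printf "%.1f", m * n / 60 / 24 }')

echo "Estimated serial run time: ${total_days} days"
```

Dividing that figure by the number of jobs the cluster can run at once gives the parallel estimate.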

I'm running this on a Linux cluster in a Docker environment. I've already run HaplotypeCaller to produce the genomic VCFs (GVCFs). Here is what I am running, with the time issues:

for i in $contigs; do    # $contigs holds every contig name in the transcriptome
    # files.txt holds the inputs as "-V /path_to_file/sample1 -V /path_to_file/sample2 ..."
    gatk GenomicsDBImport \
        $(cat files.txt) \
        --genomicsdb-workspace-path "$path/my_database/$i" \
        --intervals "$i" \
        --reader-threads 5
done
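
For comparison, the single-invocation form described above might look roughly like the sketch below. It is a dry run that only writes an interval list and prints the command, and it rests on several assumptions: the contig names, `contigs.list`, `samples.map`, and the `my_database` workspace path are all hypothetical placeholders, and whether `--merge-input-intervals` and `--sample-name-map` are available and behave well for a transcriptome depends on the GATK4 version in use.

```shell
# Dry-run sketch: one GenomicsDBImport call covering every contig,
# instead of one call (and one sample import) per contig.
contigs="contig_1 contig_2 contig_3"   # ASSUMPTION: placeholder contig names

# Write one contig name per line; GATK reads -L files ending in .list this way.
# $contigs is deliberately unquoted so it splits on whitespace.
printf '%s\n' $contigs > contigs.list

# samples.map is a hypothetical tab-separated file of sample name and GVCF path.
cmd="gatk GenomicsDBImport --sample-name-map samples.map --genomicsdb-workspace-path my_database -L contigs.list --merge-input-intervals --reader-threads 5"

# Print rather than execute, so the command can be checked before a real run.
echo "$cmd"
```

With this layout all contigs would share a single workspace rather than one database directory per contig, so any downstream step that expects `$path/my_database/$i` would need adjusting.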

Thanks in advance,

