GenomicsDBImport: long per-contig import time compounds across a transcriptome
I'm running GATK4 on a transcriptome, but the time spent importing sample VCFs is making the run time prohibitive. It isn't a huge dataset (N = 45), yet importing the VCFs takes ~10.5 minutes per import. Repeating that across every individual contig in a transcriptome quickly becomes functionally impossible in wall-clock time (back-of-the-envelope estimate: ~1 year without parallelization). If it were possible to start the engine once and then iterate through a whole series of intervals/contigs without re-importing the samples each time, that would be a huge improvement. Perhaps this is already built in, but I haven't been able to get it to work even with the -L flag, I think because there is no overlap/sensible way to combine contigs.
I'm running this on a Linux cluster in a Docker environment. I've already run HaplotypeCaller to produce per-sample genomic VCFs (GVCFs). Here is what I'm running, with the time issue:
# files.txt holds the sample arguments, i.e. lines of the form:
# "-V /path_to_file/sample1 -V /path_to_file/sample2 ..."
for i in $contigs    # $contigs is a variable with all of the contigs in the transcriptome
do
    gatk GenomicsDBImport \
        $(cat files.txt) \
        --genomicsdb-workspace-path $path/my_database/$i \
        --intervals $i \
        --reader-threads 5
done
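For what it's worth, one workaround I've been considering is a single import over all contigs at once, so the samples are only read one time. The sketch below is untested on my data and makes a few assumptions: the file names (transcriptome.fa.fai, contigs.list) are illustrative, the toy .fai content stands in for a real samtools faidx index, and the --merge-contigs-into-num-partitions flag is, as I understand it, only available in more recent GATK4 releases. The gatk command is only echoed here, not executed.

```shell
# Derive a one-contig-per-line interval list from the FASTA index.
# (Toy index for illustration; column 1 of a .fai file is the contig name.)
printf 'contig_1\t512\t22\t60\t61\ncontig_2\t300\t570\t60\t61\n' > transcriptome.fa.fai
cut -f1 transcriptome.fa.fai > contigs.list   # GATK accepts a .list file as -L input

# Hypothetical single invocation (printed, not run): pass all contigs via -L
# and let GATK pack the many small contigs into fewer GenomicsDB partitions,
# so the 45 samples are imported once instead of once per contig.
echo 'gatk GenomicsDBImport \
  $(cat files.txt) \
  --genomicsdb-workspace-path $path/my_database \
  -L contigs.list \
  --merge-contigs-into-num-partitions 25 \
  --reader-threads 5'
```

If that flag isn't available in my GATK version, is there another supported way to get this one-workspace, many-intervals behavior?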
Thanks in advance,