GATK4 or 3.8 for a new pipeline?
I am currently designing a pipeline for exome sequencing to detect SNPs, CNVs and structural variants for clinical diagnosis of orphan diseases.
I already have an "old" pipeline built on GATK 3.8, but I want to modernize it, in particular by moving to a workflow manager such as Nextflow or Snakemake.
My pipeline will be used first in research and then, in the very short term, in production. So I am wondering which version of GATK I should choose and, if GATK4, which flavor of the tools: the "classic" ones (without Spark) or the Spark versions.
I started developing my new pipeline with the classic tools of the GATK4 suite (without Spark), but running on a single thread, the pipeline is very slow.
Here are the tools used: MarkDuplicates, BaseRecalibrator, ApplyBQSR, HaplotypeCaller, CombineGVCFs, GenomicsDBImport, GenotypeGVCFs, FilterVCF, left normalization ... Basically your Best Practices recommendations.
I know you were advising on the forum not to use the Spark tools in production, but most of those messages are from September 2017, when, if I remember correctly, all of GATK4 was still in beta and not just the Spark tools. There are also big red text boxes on some of the tools ...
So what should I do? Do I have to go back to 3.8 while waiting for the official release? Are there tools that I can keep in their classic version without slowing down the process too much? Have you identified any tools that do not differ greatly from their Spark counterparts? Do I have to use a mix of tools from version 3.8 and GATK4?
My plan for now is to use the Spark tools with local execution and then benchmark my new pipeline against the old one on reference datasets. But I would prefer not to have to develop my pipeline twice, so I am really looking forward to your suggestions.