GATK4 or 3.8 for a new pipeline?

Tintest (France, Member)

Hello,

I am currently designing a pipeline for exome sequencing to detect SNPs, CNVs and structural variants for clinical diagnosis of orphan diseases.

I already have an "old" pipeline built on GATK 3.8, but I want to modernize it, in particular by moving to a workflow manager such as Nextflow or Snakemake.

My pipeline will be used first in research and then, in the very short term, in production. So I am wondering which version of GATK I should choose and, if GATK4, which flavor of the tools: the "classic" ones (without Spark) or the Spark versions.
I started developing my new pipeline with the classic tools of the GATK4 suite (without Spark), but running on a single thread, the pipeline is very slow.

Here are the tools used: MarkDuplicates, BaseRecalibrator, ApplyBQSR, HaplotypeCaller, CombineGVCFs, GenomicsDBImport, GenotypeGVCFs, FilterVCF, left normalization ... Basically your Best Practices recommendations.

I know you advised on the forum not to use the Spark tools in production, but most of those messages are from September 2017, when, if I remember correctly, all of GATK4 was still in beta and not just the Spark tools. There are also some big red text boxes on these tools ... :smile:

So what should I do? Do I have to go back to 3.8 while waiting for the official release? Are there tools that I can keep in their classic version without slowing down the process too much? Have you identified tools that do not differ greatly from their Spark counterparts? Do I have to use a mix of tools from versions 3.8 and 4.0.2.0?

My plan for now is to use the Spark tools with local execution and then benchmark my new pipeline against the old one on reference datasets. But I would prefer not to have to develop my pipeline twice, so I am really looking forward to your suggestions.
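
To make the benchmarking idea concrete, here is a minimal sketch of a call-set comparison between the old and new pipelines. It assumes plain-text VCF records and matches variants on (chromosome, position, REF, ALT) only; the function names `variant_keys` and `compare_callsets` and the toy records are hypothetical. A real comparison would also need consistent variant normalization and a truth set, which this sketch does not attempt.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF text lines.

    Header lines (starting with '#') and blank lines are skipped.
    Multi-allelic records are split into one key per ALT allele.
    """
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alts = fields[:5]
        for alt in alts.split(","):
            keys.add((chrom, int(pos), ref, alt))
    return keys


def compare_callsets(old_lines, new_lines):
    """Summarize the overlap between two call sets."""
    old = variant_keys(old_lines)
    new = variant_keys(new_lines)
    shared = old & new
    union = old | new
    return {
        "shared": len(shared),
        "only_old": len(old - new),
        "only_new": len(new - old),
        # Jaccard index: shared calls divided by all distinct calls.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


if __name__ == "__main__":
    # Toy records standing in for the GATK 3.8 and GATK4 outputs.
    old_vcf = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t100\t.\tA\tG\t50\tPASS\t.",
        "chr1\t200\t.\tC\tT\t50\tPASS\t.",
    ]
    new_vcf = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t100\t.\tA\tG\t60\tPASS\t.",
        "chr1\t300\t.\tG\tA,C\t60\tPASS\t.",
    ]
    print(compare_callsets(old_vcf, new_vcf))
```

Running the same comparison on a reference dataset after each tool swap would show whether a given Spark tool changes the calls compared with its classic counterpart.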

Best regards.

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator admin)

    @Tintest
    Hi,

    "I am currently designing a pipeline for exome sequencing to detect SNPs, CNVs and structural variants for clinical diagnosis of orphan diseases."

    GATK3 lacks many of the tools for detecting CNVs and SVs, so for that reason I would stick with GATK4. It is true that the Spark tools are still a work in progress, but they are being actively developed. For now, you can design your pipeline with the non-Spark tools and switch to the Spark versions once they are more fully supported.
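
One common way to speed up a pipeline built on the non-Spark tools is to scatter a step over genomic intervals and run the shards in parallel. The sketch below is only an illustration of that idea, not an official recommendation: it builds one HaplotypeCaller command per interval (using the standard `-R`, `-I`, `-L`, `-ERC GVCF` and `-O` arguments) and, by default, only prints them. The file names, the interval list and the helper names (`haplotypecaller_cmd`, `build_scatter_commands`, `run_scattered`) are all assumptions made for the example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def haplotypecaller_cmd(reference, bam, interval, out_dir):
    """Build one non-Spark HaplotypeCaller command restricted to one interval."""
    # Make the interval safe to use in a file name (e.g. chr1:1-1000).
    safe = interval.replace(":", "_").replace("-", "_")
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-L", interval,
        "-ERC", "GVCF",
        "-O", f"{out_dir}/{safe}.g.vcf.gz",
    ]


def build_scatter_commands(reference, bam, intervals, out_dir):
    """One command per interval; each shard is independent of the others."""
    return [haplotypecaller_cmd(reference, bam, iv, out_dir) for iv in intervals]


def run_scattered(commands, max_workers=4, dry_run=True):
    """Run the shards concurrently, or just return them as strings if dry_run."""
    if dry_run:
        return [" ".join(cmd) for cmd in commands]
    # Threads are enough here: each worker only waits on an external process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
    return [r.returncode for r in results]


if __name__ == "__main__":
    # Hypothetical inputs; replace with real paths and an exome interval list.
    cmds = build_scatter_commands(
        "ref.fasta", "sample.bam", ["chr1", "chr2", "chr3"], "gvcf_shards"
    )
    for line in run_scattered(cmds):
        print(line)
```

The per-interval GVCFs would still have to be merged afterwards, and a workflow manager such as Nextflow or Snakemake can express the same scatter step natively instead of a hand-written thread pool.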

    Also, if you are interested in running the Spark tools locally, this article may help. We hope to have more documentation on Spark tools soon.

    -Sheila