VCF from Germline SNPs + INDELs Best Practice Workflow

MilanD, Belgrade, Member

Dear Firecloud team,

I just wanted to check a couple of things with you regarding the Germline SNPs + INDELs production WDL/workflow. I am planning to run this workflow on single samples only, with no joint calling in mind after the initial variant-calling step. With that in mind:

  • Would outputting a VCF from HaplotypeCaller (HC) instead of a gVCF break anything? That is, would it be a considerable deviation from the best practices?
  • Should VQSR be skipped when working with a single sample?
  • Last one, just as a sanity check: is this the WDL that is considered to be the production one? I was a bit confused by the other one on the GitHub page.

Thanks a lot!

Milan

Answers

  • SChaluvadi (admin) Member, Broadie, Moderator

    @MilanD
    If you are using single samples and do not need the GVCF file, outputting a regular VCF is fine. Here is a document that explains in greater detail how HaplotypeCaller works for both GVCF and VCF outputs. It also has some example commands for your use case.
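A minimal sketch of the difference between the two output modes, assuming a GATK4 install where `gatk` is on the PATH. The helper name `haplotypecaller_cmd` and the file names are hypothetical, used only for illustration; `-R`, `-I`, `-O`, `-L` and `-ERC GVCF` are standard HaplotypeCaller arguments. The helper only builds the command line and does not run GATK.

```python
import shlex
from typing import List, Optional


def haplotypecaller_cmd(
    reference: str,
    bam: str,
    output: str,
    gvcf: bool = False,
    interval: Optional[str] = None,
) -> List[str]:
    """Build a HaplotypeCaller argument list (hypothetical helper)."""
    cmd = ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam, "-O", output]
    if interval is not None:
        # Restrict calling to one interval, e.g. for a quick test run.
        cmd += ["-L", interval]
    if gvcf:
        # GVCF mode is only needed when joint genotyping follows.
        cmd += ["-ERC", "GVCF"]
    return cmd


if __name__ == "__main__":
    # Single-sample run with no joint calling planned: plain VCF output.
    vcf_cmd = haplotypecaller_cmd("ref.fasta", "sample.bam", "sample.vcf.gz")
    # Same sample in GVCF mode, for comparison.
    gvcf_cmd = haplotypecaller_cmd(
        "ref.fasta", "sample.bam", "sample.g.vcf.gz", gvcf=True
    )
    for cmd in (vcf_cmd, gvcf_cmd):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

The only difference between the two command lines is the `-ERC GVCF` argument (and, by convention, the `.g.vcf.gz` output suffix).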

    Since VQSR works better when run on calls from multiple samples, simply because having more data yields more accurate models, I think that you can skip this step if you would like.

    Yes, I would use the one that has the gatk4.0 label listed in the name of the WDL!

  • john156 Member
    edited March 26
    Hey, I have a follow-up question: since some of these WDLs were updated, the production WDL should now be the "PairedEndSingleSampleWf-fc-hg38.wdl" one, correct?

    Also, is it OK to use Docker images with GATK version 4.1.0.0 for all steps in the WDL, instead of the combination of GATK 4.0.4.0 and Genomes In The Cloud images? With proper changes, of course (regarding some executables, for example, and similar).
  • AdelaideR (admin) Unconfirmed, Member, Broadie, Moderator

    Hi @john156

    It seems that there are still steps that point at the Genomes in the Cloud (GOTC) Docker image. I went to look at the current WDL file on GitHub:


    String? gatk_docker_override
    String gatk_docker = select_first([gatk_docker_override, "broadinstitute/gatk:4.1.0.0"])
    String? gatk_path_override
    String gatk_path = select_first([gatk_path_override, "/gatk/gatk"])
    String? gotc_docker_override
    String gotc_docker = select_first([gotc_docker_override, "broadinstitute/genomes-in-the-cloud:2.3.1-1512499786"])
    String? gotc_path_override
    String gotc_path = select_first([gotc_path_override, "/usr/gitc/"])
    String? python_docker_override
    String python_docker = select_first([python_docker_override, "python:2.7"])

    So, it looks like it is still pointing at a gotc docker.
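To make the override pattern in those declarations concrete, here is a rough Python sketch of how `select_first` resolves each image: it returns the first defined value in the list, so the hard-coded default applies only when no override is supplied. The Python functions and the `my-registry.example/gatk:4.1.1.0` image name are hypothetical stand-ins for illustration; the default strings are copied from the WDL snippet.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def select_first(values: Sequence[Optional[T]]) -> T:
    """Mimic WDL's select_first: return the first defined (non-None) value."""
    for value in values:
        if value is not None:
            return value
    raise ValueError("select_first: no defined value in list")


def resolve_images(
    gatk_docker_override: Optional[str] = None,
    gotc_docker_override: Optional[str] = None,
) -> dict:
    """Resolve the two Docker images the way the WDL declarations do."""
    return {
        "gatk_docker": select_first(
            [gatk_docker_override, "broadinstitute/gatk:4.1.0.0"]
        ),
        "gotc_docker": select_first(
            [
                gotc_docker_override,
                "broadinstitute/genomes-in-the-cloud:2.3.1-1512499786",
            ]
        ),
    }


if __name__ == "__main__":
    # No overrides: both defaults from the WDL are used.
    print(resolve_images())
    # Hypothetical private image: only the GATK image changes, GOTC stays.
    print(resolve_images(gatk_docker_override="my-registry.example/gatk:4.1.1.0"))
```

Under this reading, overriding only the GATK image leaves the GOTC image in place for any task that still refers to `gotc_docker`.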

    May I ask where you are looking at your WDL? Sometimes different locations have different settings. For example, if it is on AWS versus GCP versus on a local machine.

  • john156 Member
    Ahh, thank you, sorry, I might have been a bit vague.

    Basically, I'm looking at the two WDL files:
    1. Prod* germline short variant joint genotyping
    2. Generic germline short variant per-sample calling

    In the Prod pipeline, GATK 4.0.11.0 was used, and that's what I was referring to in my question: if I only want to use the GATK 4.1.0.0 image (for some particular reason), could I replace the Docker image with the 4.1.0.0 one everywhere it appears in the WDL file, hoping that the pipeline will work just as well?

    Also, are there any significant differences between the Generic version and the Prod version?
    In the generic version, GATK 4.1.0.0 is used almost everywhere (except for one tool), so if there aren't any major differences, I suppose both pipeline versions would work with GATK 4.1.0.0?

    I hope I'm not too confusing :smile:
    In short, I want to run the germline best practice pipeline with GATK 4.1.0.0 only (if possible), so this is where my dilemmas come from :smile:
  • AdelaideR (admin) Unconfirmed, Member, Broadie, Moderator

    @john156 I believe that 4.1.0.0 is the alpha version [the more stable version]. Production may be using the new version before pushing it out as a new release.

    I believe the new release will have some changes to increase computational efficiency using GenomicsDBImport. I have not seen the release notes so I cannot be more specific.

    So, I would hazard a guess that your strategy is a good one because you will be referring to the most recent version of the stable release.

  • john156 Member
    Hm, just to clear it up:
    The Prod* version uses GATK 4.0.11.0 + GenomesInTheCloud docker images.
    The Generic version uses GATK 4.1.0.0 (i.e. the newest release) in almost all steps except one (CRAM to BAM, where the GITC docker image is used).

    Until this moment, I was looking at the Prod* version of the pipeline, wondering if I can exchange the 4.0.11.0 and the GITC docker images with the 4.1.0.0 docker image (the newest one).

    Now I'm looking at the Generic version, and wondering if it's ok to use that pipeline as my main reference point, since it uses 4.1.0.0.
    Does the Generic version differ much from the Prod* version?
    I'm still not allowed to post links, so I can't link both WDLs, but I hope I'm clear enough!

    * By the way, this is what the GATK page says about the Prod version - * Prod refers to the Broad Institute's Data Sciences Platform production pipelines, which are used to process sequence data produced by the Broad's Genomic Sequencing Platform facility.
  • AdelaideR (admin) Unconfirmed, Member, Broadie, Moderator

    @john156 Of course, the minute I post that, the new release notes come out. Here is the link

  • john156 Member
    Ahh great :)
    So, since the Generic WDL uses 4.1.0.0 now, do I stay with that one, or do I edit the WDL to use the GATK 4.1.1.0 image (which I can push to a private Docker repo)?
    Any major things I should worry about?
    Or is it ok to call the WDL with GATK 4.1.1.0 (or even with 4.1.0.0 tools) 'BROAD Best Practice', even though the Prod pipeline uses 4.0.11.0 for example?
    Thanks for the answers by the way!
  • AdelaideR (admin) Unconfirmed, Member, Broadie, Moderator
    edited March 30

    @john156 Because this is not a major release, the basic commands remain the same. However, a few parameters may have been added. A good idea with a new release is to walk through the commands and read the help text from the command line.
    Alternatively, run a small data set so that any errors show up quickly.
