VQSR on SNP and Indel

My impression is that the old recommendation from the GATK team was to run VQSR on snp.vcf and indel.vcf separately and in parallel, but the current pipeline and example run it sequentially on all.variants.vcf.

Is the new approach meant to keep variants that are neither SNPs nor indels?

Answers

  • bhanuGandham, Cambridge MA (Member, Administrator, Broadie, Moderator)

    @blueskypy

    The current recommendation is still to run VQSR on SNPs and indels separately.

  • blueskypy (Member)

    Hi @bhanuGandham, thanks for the reply! Why, then, do the current pipeline and this example run them sequentially on all.variants.vcf?

    For example, here is an excerpt from the current pipeline:

    task ApplyRecalibration {
      ...

      command {
        set -e

        # Pass 1: apply the INDEL recalibration to the full input VCF.
        /usr/gitc/gatk --java-options "-Xmx5g -Xms5g" \
          ApplyVQSR \
          -O tmp.indel.recalibrated.vcf \
          -V ${input_vcf} \
          ...
          -mode INDEL

        # Pass 2: apply the SNP recalibration to the output of pass 1.
        /usr/gitc/gatk --java-options "-Xmx5g -Xms5g" \
          ApplyVQSR \
          -O ${recalibrated_vcf_filename} \
          -V tmp.indel.recalibrated.vcf \
          ...
          -mode SNP
      }
      runtime {
        ...
      }
      output {
        File recalibrated_vcf = "${recalibrated_vcf_filename}"
        File recalibrated_vcf_index = "${recalibrated_vcf_filename}.tbi"
      }
    }
    
  • bhanuGandham, Cambridge MA (Member, Administrator, Broadie, Moderator)
    edited March 12

    @blueskypy

    Sorry, I wasn't very clear. In the recommended workflow, you first run VQSR on the VCF file in SNP mode, which recalibrates SNPs but doesn't touch indels. Then you run it in INDEL mode on the output of the first step, which does the same for indels without touching SNPs. This produces the final output, with all SNPs and indels appropriately recalibrated in a single file. When you recalibrate one type, the variants of the other type are emitted to the output file without modification.
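
The two-pass workflow described in this reply can be sketched as a small shell script. This is only an illustration under stated assumptions: the file names (`indels.recal`, `snps.tranches`, and so on) and the filter levels are hypothetical placeholders, and the `run` helper with its `DRY_RUN` switch is an invented convenience that prints the commands instead of executing them. The pipeline excerpt earlier in the thread applies INDEL mode first, while this reply describes SNP mode first. Since each pass is said to leave the other variant type untouched, the sketch simply follows the order used in the pipeline excerpt.

```shell
#!/bin/sh
# Sketch only: all file names and filter levels are hypothetical placeholders.
set -eu

# With DRY_RUN=1 (the default here) each command is printed, not executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1 = input VCF with all variants, $2 = final recalibrated VCF
apply_vqsr_sequential() {
  # Pass 1: apply the INDEL model; SNP records are written through unchanged.
  run gatk ApplyVQSR \
    -V "$1" \
    --recal-file indels.recal \
    --tranches-file indels.tranches \
    --truth-sensitivity-filter-level 99.0 \
    -mode INDEL \
    -O tmp.indel.recalibrated.vcf

  # Pass 2: apply the SNP model to the output of pass 1; indels are left as they are.
  run gatk ApplyVQSR \
    -V tmp.indel.recalibrated.vcf \
    --recal-file snps.recal \
    --tranches-file snps.tranches \
    --truth-sensitivity-filter-level 99.7 \
    -mode SNP \
    -O "$2"
}

apply_vqsr_sequential all.variants.vcf all.variants.recalibrated.vcf
```

Setting `DRY_RUN=0` would run the commands for real, which assumes `gatk` is on the `PATH` and that the recalibration and tranches files were already produced by VariantRecalibrator in the matching mode.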

  • blueskypy (Member)

    Hi @bhanuGandham, thanks for the reply!
    I understand what the sequential order does. My question is still why it is done sequentially on all.variants.vcf instead of in parallel on snp.vcf and indel.vcf separately. Is the sequential processing meant to keep variants that are neither SNPs nor indels?

  • blueskypy (Member)
    edited March 12

    Does GenotypeGVCFs produce other types of variants besides SNPs and indels? If it does, I'll lose those variants when I use SelectVariants to create snp.vcf and indel.vcf for running VQSR in parallel. So I wonder whether the sequential processing is meant to avoid that loss.
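
One way to check this concern on a given callset is to pull out every record that is neither a SNP nor an indel and see whether anything is left. The sketch below is an assumption-laden illustration, not a statement of what the GATK team recommends: the file names are hypothetical, and the `run` helper with its `DRY_RUN` switch is an invented convenience that prints the command instead of executing it. It relies on the SelectVariants `--select-type-to-exclude` argument, which can be repeated.

```shell
#!/bin/sh
# Sketch only: file names are hypothetical placeholders.
set -eu

# With DRY_RUN=1 (the default here) the command is printed, not executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1 = input VCF with all variants, $2 = VCF to receive the leftover records
select_other_types() {
  # Drop SNP and INDEL records; whatever remains (for example MIXED or MNP
  # sites) is what a split into snp.vcf and indel.vcf would not carry along.
  run gatk SelectVariants \
    -V "$1" \
    --select-type-to-exclude SNP \
    --select-type-to-exclude INDEL \
    -O "$2"
}

select_other_types all.variants.vcf other.types.vcf
```

If the resulting file has no records below its header, splitting by type would lose nothing for that callset. Any records that do appear there would be missing from both snp.vcf and indel.vcf.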

  • blueskypy (Member)
    Accepted Answer

    Thanks so much, @Geraldine_VdAuwera! Your answer is right on point! Really appreciated!
