Workflow for fungal WGS

simono101simono101 London, UKMember

Hi all,
I'm in a bit of a daze going through all the documentation and I wanted to do a sanity check on my workflow with the experts. I have ~120 WGS of a ~24Mb fungal pathogen. The end-product of my GATK workflow would be a high quality call set of SNPs, restricted to the sites for which we have confidence in the call across all samples (so sites which are not covered by sufficient high quality reads in one or more samples will be eliminated).

Therefore my workflow (starting from a sorted indexed BAM file of reads from a single sample, mapped to reference with bwa mem) is this:

  • 01- BAM: Local INDEL realignment (RealignerTargetCreator/IndelRealigner)
  • 02- BAM: MarkDuplicates
  • 03- BAM: Local INDEL realignment second pass (RealignerTargetCreator/IndelRealigner)
  • 04- BAM: Calling variants using HaplotypeCaller
  • 05- VCF: Hard filter variants for truth set for BQSR (there is no known variant site databases so we can use our best variants from each VCF file for this). The filter settings are: "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" and we also filter out heterozygous positions using "isHet == 1".
  • 06- VCF: Calculate BQSR table using the high quality hard-filtered variants from step 05.
  • 07- BAM: Apply BQSR recalibration from previous step to BAM file from step 04.
  • 08- BAM: Calling variants on recalibrated BAM file from previous step using HaplotypeCaller, also emitting reference sites using --output_mode EMIT_ALL_SITES \ and --emitRefConfidence GVCF \

Does this sound like a reasonable thing to do? What options should I use in step 8 in order for HC to tell me how confident it is, site-by-site about it's calls, including those that are homozygous reference? I notice that when using --output_mode EMIT_ALL_CONFIDENT_SITES \ and --emitRefConfidence GVCF \ I am missing a lot of the annotation I get when just outputting variant sites (e.g. QD).

Comments

  • simono101simono101 London, UKMember

    BTW I forgot to mention that I am applying these tools to each sample individually. The samples cannot be assumed to come from the same population and I want to make some inferences about the relatedness of the different samples. It is more important to me to be highly specific about true SNPs rather than very sensitive and find all SNPs (for now, if that makes sense).

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    This looks mostly good, but it's incomplete. Once you have the individual GVCFs, you need to run them all together through GenotypeGVCFs. You can use the -allSites argument to output hom-ref sites in the final VCF.

    One question: why are you filtering out heterozygous calls? If your organism is haploid, you can set -ploidy 1 so the tools will handle them appropriately and produce haploid calls.

  • simono101simono101 London, UKMember

    Hi @Geraldine_VdAuwera‌ thanks for your response. The fungus I am working on displays aneuploidy. Eventually I will include het positions too, but for now I am more interested in looking at homozygous variant positions to study population phylogeny.

Sign In or Register to comment.