"GVCFs produced by HaplotypeCaller" vs "VCFs produced by UnifiedGenotyper (EMIT_ALL_SITES option)"

Hi GATK team, I would like your opinion on the workflow that best fits my data.
I have been exploring both variant calling algorithms, UnifiedGenotyper and HaplotypeCaller, and I would prefer to use UnifiedGenotyper because of its sensitivity and shorter runtime.
Because my experimental cohort grows over time, I opted for single-sample calling followed by joint analysis with CombineVariants instead of multi-sample variant calling. However, this approach has a drawback (the issue has been discussed in several forums). For a particular SNP locus, we cannot tell whether the "./." reported for some samples means that no reads cover the locus, that the site failed some criterion during the earlier variant calling, or that the sample is homozygous-reference (this is my main concern, and I cannot afford to lose this information).

Then I found the gVCF format, which could potentially solve my problem (thanks to the GATK team for always understanding researchers' needs)!
I would still prefer UnifiedGenotyper over HaplotypeCaller for generating the gVCF. From reading the forum at https://www.broadinstitute.org/gatk/guide/tagged?tag=gvcf, I assumed that VCFs produced by "UnifiedGenotyper with --output_mode EMIT_ALL_SITES" would be similar to the gVCF files produced by HaplotypeCaller. However, I could not join them using either "CombineVariants" or "CombineGVCFs", most probably because "UnifiedGenotyper with --output_mode EMIT_ALL_SITES" does not generate the gVCF format.

Can you please advise on the approach that best fits my needs with minimum runtime (UnifiedGenotyper would be my first choice)? Is there any method for joining the ALL_SITES VCF files produced by UnifiedGenotyper that I might have missed on the GATK site?
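
The "./." ambiguity described above can be illustrated with a toy example. This is only a sketch with made-up sample names and positions, not how CombineVariants is implemented: when variant-only call sets are merged, a sample with no record at a site can only be filled in as "./.", whatever the underlying reason was.

```python
from typing import Dict

# position -> genotype string such as "0/1" (hypothetical toy data)
Calls = Dict[int, str]


def merge_variant_only(per_sample: Dict[str, Calls]) -> Dict[int, Dict[str, str]]:
    """Merge per-sample variant-only calls; samples with no record get './.'."""
    positions = sorted({pos for calls in per_sample.values() for pos in calls})
    merged = {}
    for pos in positions:
        merged[pos] = {
            sample: calls.get(pos, "./.") for sample, calls in per_sample.items()
        }
    return merged


samples = {
    "sampleA": {100: "0/1", 250: "1/1"},
    "sampleB": {100: "0/1"},  # no record at 250: hom-ref, filtered, or uncovered?
}
merged = merge_variant_only(samples)
print(merged[250])
```

At position 250 the merged record cannot say whether sampleB is homozygous-reference, failed a filter, or had no coverage, because that information was never written to the per-sample file.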

Comments

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Hi Nancy,

    If you care at all about indels, you should absolutely use HaplotypeCaller with the gvcf-based workflow.

    As you suppose correctly, UnifiedGenotyper cannot output gVCFs, and all-sites VCFs are not an acceptable substitute for gVCFs. The only way to approximate the joint calling method with UG is to give it all samples in the cohort together, which does not scale very well for large cohorts. Also, UG is very bad at making indel calls, like all other locus-based callers.

    HC has the best sensitivity over both SNPs and indels so it is better, even though the runtime is longer. It's up to you to decide if you care more about runtime or about sensitivity (especially for indels).
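
The gVCF-based workflow mentioned above could be scripted roughly as follows. This is a sketch that assumes a GATK 3.x jar; the jar path, file names, and helper function names are hypothetical, and the commands are only built and printed, not executed. Some GATK 3 releases need additional indexing arguments for gVCF output, so check the documentation for your version.

```python
from typing import List

GATK_JAR = "GenomeAnalysisTK.jar"  # assumed path to a GATK 3.x jar


def haplotypecaller_gvcf_cmd(reference: str, bam: str, out_gvcf: str) -> List[str]:
    """Build a per-sample HaplotypeCaller command that emits a gVCF."""
    return [
        "java", "-jar", GATK_JAR,
        "-T", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "--emitRefConfidence", "GVCF",
        "-o", out_gvcf,
    ]


def genotype_gvcfs_cmd(reference: str, gvcfs: List[str], out_vcf: str) -> List[str]:
    """Build a GenotypeGVCFs command that jointly genotypes per-sample gVCFs."""
    cmd = ["java", "-jar", GATK_JAR, "-T", "GenotypeGVCFs", "-R", reference]
    for gvcf in gvcfs:
        cmd += ["--variant", gvcf]
    cmd += ["-o", out_vcf]
    return cmd


# Hypothetical file names, for illustration only.
per_sample = [
    haplotypecaller_gvcf_cmd("ref.fasta", name + ".bam", name + ".g.vcf")
    for name in ("sample1", "sample2")
]
joint = genotype_gvcfs_cmd("ref.fasta", ["sample1.g.vcf", "sample2.g.vcf"], "cohort.vcf")

for cmd in per_sample + [joint]:
    print(" ".join(cmd))
```

When a new sample joins the cohort, only its own HaplotypeCaller step has to be run before GenotypeGVCFs is repeated on the full list of gVCFs.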

  • nancySEE, Malaysia (Member)

    Hi Geraldine,

    Thanks for your recommendation. SNPs are my priority, whereas indels would be useful additional information for my research.

    Comparing the SNP detection sensitivity of the UG and HC algorithms, my results suggest that UG does a better job (is more sensitive) than HC; however, I believe this might be caused by a large number of false-positive SNPs from UG.

    Just to reconfirm: to date, is there no GATK tool that can combine the ALL_SITES VCF files from all samples into a joint VCF file?

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @nancySEE

    Hi,

    Yes, you are correct that there is no tool to combine ALL_SITES VCF files into a joint VCF. As Geraldine stated above, the only way to do joint calling with UnifiedGenotyper is to input all samples together.

    -Sheila

  • nancySEE, Malaysia (Member)

    Hi Sheila, thank you.

  • nancySEE, Malaysia (Member)

    Hi GATK team, when I combine the gVCFs produced by HaplotypeCaller into a joint file using either CombineGVCFs or GenotypeGVCFs, neither tool offers the option that CombineVariants has (--minimumN: Combine variants and output site only if the variant is present in at least N input files). Does GATK provide this feature elsewhere (in other tools, perhaps)? =)

  • tommycarstensen, United Kingdom (Member)

    After GenotypeGVCFs, try SelectVariants with a select expression. Otherwise, try bcftools or awk.
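
As a scripted alternative in the same spirit as the awk suggestion, a minimum-N filter can be sketched in plain Python. This is an approximation rather than the --minimumN implementation: it counts samples carrying at least one non-reference allele in a multi-sample VCF, and it assumes GT is the first FORMAT key. The function names and the toy records are hypothetical.

```python
from typing import Iterable, Iterator


def is_variant_gt(sample_field: str) -> bool:
    """Return True if the GT subfield has at least one non-reference allele."""
    gt = sample_field.split(":", 1)[0]          # GT assumed to be the first key
    alleles = gt.replace("|", "/").split("/")   # handle phased and unphased GTs
    return any(allele not in ("0", ".") for allele in alleles)


def filter_min_n(vcf_lines: Iterable[str], min_n: int) -> Iterator[str]:
    """Yield header lines plus records with a variant in at least min_n samples."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        n_carriers = sum(is_variant_gt(field) for field in fields[9:])
        if n_carriers >= min_n:
            yield line


# Toy records for illustration only.
toy_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0\t1/1",
    "1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t0/0\t./.\t0/1",
]
kept = list(filter_min_n(toy_vcf, min_n=2))
for record in kept:
    print(record)
```

With min_n=2, the site at position 100 is kept (two carriers) and the site at position 200 is dropped (one carrier, one hom-ref, one no-call).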
