Multisample vs Single sample(Paired sample MUT &WT separate BAM files)

sv91707sv91707 TampereMember
edited March 2017 in Ask the GATK team

Hello,
I'm currently working with zebrafish mutants, and compare phenotypically wild-type and mutant siblings for mutations. I have 3 different mutants, like 3 pairs of different Mutant and Wild type siblings. e.g Mut A, WT A and Mut B, WT B and so on.

I called variants in Single sample mode, i.e. WT separate and Mutant Separate (GVCF haplotypecaller) for all the pairs (separate GVCF for each BAM file).

Then I combine all the samples to call Genotypes, to use all the coverage from other samples to get correct Genotype. So in the end I have a 6 sample VCF file.

Then just to check, I called Variants directly from Haplotypecaller with using just pairs, i.e Mut (A) and WT (A) together to call variants, only for one pair.

But, After comparing at some positions in same sample, say A, I'm seeing difference in genotypes, as compared in 6 sample VCF and my 2 sample VCF (of same mutant+sibling vcf). Even the DP is different in some cases.

I want to ask, which one is the better method ? I though multisample variant calling would be better.

thanks

Best Answer

Answers

  • sv91707sv91707 TampereMember
    edited March 2017

    Here are two examples, first (2samp_HC) is the 2 samples (Coverage ~ 16X total combined) with variant called directly with Haplotypecaller with default settings.
    Second (all_GVCF), is the 2 samples extracted with GATK SelectSample after calling variants first singularly in GVCF mode and combined (total 6 samples ~ > 30X total combined)
    The variant number 700 and 686 for example have different genotype.
    I'm working in finding linked region to the mutation, and I get considerable differences in the homozygosity mapped over the genome, which I think is related to changed genotype when Variant calling method is changed.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @sv91707
    Hi,

    Which version of GATK are you using? Can you try using the latest version if you are not? If you use the latest version, most of the calls in the GVCF workflow that are different should be ./. (no-calls). Note that the differences occur in questionable sites. For example, position 700 has only 5 reads. That is not enough evidence to make a confident call in the GVCF workflow (the GQ is 0). What happens if you use all 6 samples in the GVCF workflow? When you add more samples in the GVCF workflow that have the variant, the more likely the questionable samples will actually get called as variant. Have a look at this thread.

    -Sheila

    P.S. This thread may also be useful.

  • sv91707sv91707 TampereMember

    Thank you for the answer.
    I'm at the moment using GATK 3.5-0-g36282e4
    Yes, its true number 700 has only 5 reads and GQ is 0 so its possible that it has 2 genotypes.

    Using all 6 samples in the GVCF workflow ? Mean, should I make gvcf files using not single sample but all 6 samples ?
    I have not tried that.

    What I did, as I have mentioned, I called single sample GVCF and combined all 6 samples to make on VCF file with 6 samples, and then extracted two Samples from that 6 sample VCF.
    Anyways, May be the difference in the genotype is for very low quality calls. Also, the coverage I'm working is not high, usally 8X per sample.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @sv91707
    Hi,

    I see. I thought you had run GenotypeGVCFs on only 2 GVCFs. Indeed, the issue with the GVCF workflow is when you have low quality calls. In your case, it may just be best to use HaplotypeCaller in normal joint-calling mode. However, when you continue to add samples, and the runtime increases, I would encourage you to test out the GVCF workflow again. I suspect the GVCF workflow will be able to save those low quality calls when you have more samples that have good evidence for the variant at the site.

    -Sheila

Sign In or Register to comment.