unified genotyper vs. haplotype caller w/ pedigree study, major discrepancies


I've been exploring de novo mutation identification in a pedigree of trios. I ran the UnifiedGenotyper (UG) on all the BAM files for ~25 trios, and it identifies a set of de novo mutations. When I run the HaplotypeCaller (HC) pipeline (first generating a gVCF file for each individual, then using the merged gVCF files along with the pedigree for genotype refinement and de novo mutation calling), it also finds a number of mutations annotated as high-confidence de novo mutations. When I compare the UG de novo mutations to the high-confidence HC list, there's very little overlap. Many of the UG high-confidence de novo variants are called by HC but listed as low-confidence de novo variants. From looking at a few examples, it appears that HC has assigned lower genotype confidence to the parental (non-mutated, reference) genotypes. Could it be that because the gVCF files aren't storing position-specific information for the reference (non-mutated) positions in the genome, the pedigree-based de novo mutation calling is not as accurate as it could be? Should I be generating gVCFs that include position-specific information?

Many thanks for any insights. If it would help, I could post some examples.

Best Answers


  • Sheila (Broad Institute; Member, Broadie, Moderator)


    Yes, I think some example records will help.

    Can you also tell us the exact commands you ran for UnifiedGenotyper and HaplotypeCaller to get the VCFs you are inputting to the GenotypeRefinement workflow?


  • bhimm (Cambridge; Member)

    Many thanks for your response.

    Before delving further into the UG vs. HC comparison, I've focused more specifically on tracking the variant calls through the HC pipeline.

    It seems that CalculateGenotypePosteriors is taking a rather large toll on the initial genotype qualities for my trios. Here's the command I ran to compute the genotype posteriors using a trio.ped file:

    java -jar GenomeAnalysisTK-2014.3-17-g0583018/GenomeAnalysisTK.jar -T CalculateGenotypePosteriors -ped trio.ped -V gatk-HC-VQSR-annotated.final.vcf -o gatk-HC-VQSR-annotated.final.postCGP.vcf -R /human_g1k_v37.fasta --supporting 1000G_phase3_v4_20130502.sites.vcf

    Here are the records for one of the trios in the input file, gatk-HC-VQSR-annotated.final.vcf:

    A 0/0:21,0:21:60:0,60,900 # parent A
    B 0/0:27,0:27:75:0,75,1055 # parent B
    C 0/1:13,8:21:99:243,0,411 # affected child

    and the given entry in the trio.ped file:
    (python27)-bash-4.2$ grep Family_20 trio.ped

    Family_20 20A.bwa_mem 0 0 0 1
    Family_20 20B.bwa_mem 0 0 0 1
    Family_20 20C.bwa_mem 20B.bwa_mem 20A.bwa_mem 0 2
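As a quick sanity check on a pedigree file like the one above, here is a minimal Python sketch. It assumes the standard six-column PED layout (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) and whitespace-separated fields. The helper names `parse_ped` and `find_trios` are made up for illustration and are not part of GATK.

```python
from collections import namedtuple

# Standard six-column PED layout (assumed): family, individual,
# paternal ID, maternal ID, sex, phenotype.
PedRecord = namedtuple("PedRecord", "family individual father mother sex phenotype")

def parse_ped(lines):
    """Parse whitespace-separated PED lines into PedRecord tuples."""
    records = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        if len(fields) < 6:
            raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
        records.append(PedRecord(*fields[:6]))
    return records

def find_trios(records):
    """Return (child, father, mother, missing_parents) for each child with two parents listed."""
    ids = {r.individual for r in records}
    trios = []
    for r in records:
        if r.father != "0" and r.mother != "0":
            missing = [p for p in (r.father, r.mother) if p not in ids]
            trios.append((r.individual, r.father, r.mother, missing))
    return trios

# The three Family_20 lines quoted in this post.
ped_lines = [
    "Family_20 20A.bwa_mem 0 0 0 1",
    "Family_20 20B.bwa_mem 0 0 0 1",
    "Family_20 20C.bwa_mem 20B.bwa_mem 20A.bwa_mem 0 2",
]
records = parse_ped(ped_lines)
trios = find_trios(records)
# In the PED convention, sex is 1 (male) or 2 (female); anything else is unknown.
unknown_sex = [r.individual for r in records if r.sex not in ("1", "2")]

print(trios)
print(unknown_sex)
```

On these three lines the sketch finds one complete trio (child 20C, father 20B, mother 20A) and reports that all three individuals have sex coded as 0, i.e. unknown. Whether that affects the pedigree-aware steps is not established in this thread; the check only shows what the file says.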

    After the CalculateGenotypePosteriors step, we end up with the following:

    A 0/0:21,0:21:0:0,60,900:0,0,840:2 # genotype quality changed from 60 to zero
    B 0/0:27,0:27:18:0,75,1055:0,18,998:2 # genotype quality changed from 75 to 18
    C 0/1:13,8:21:99:243,0,411:186,0,471:2 # genotype quality remains at 99
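To see where the new GQ values come from, here is a small Python sketch. It makes two assumptions that are not stated in the post: that the sample fields are ordered GT:AD:DP:GQ:PL:PP followed by one more field (the FORMAT column is not shown), and that GQ is the difference between the two smallest Phred-scaled values, capped at 99. The helper names are hypothetical.

```python
def gq_from_phred(values, cap=99):
    """GQ as the gap between the two smallest Phred-scaled values, capped (assumed convention)."""
    ordered = sorted(values)
    return min(ordered[1] - ordered[0], cap)

def parse_sample(field):
    """Split one sample field, assuming the order GT:AD:DP:GQ:PL:PP[:...]."""
    parts = field.split(":")
    pl = [int(x) for x in parts[4].split(",")]
    pp = [int(x) for x in parts[5].split(",")] if len(parts) > 5 else None
    return {"GT": parts[0], "GQ": int(parts[3]), "PL": pl, "PP": pp}

# Post-CalculateGenotypePosteriors records quoted above.
samples = {
    "A": "0/0:21,0:21:0:0,60,900:0,0,840:2",
    "B": "0/0:27,0:27:18:0,75,1055:0,18,998:2",
    "C": "0/1:13,8:21:99:243,0,411:186,0,471:2",
}

results = {}
for name, field in samples.items():
    s = parse_sample(field)
    # (GQ implied by PL, GQ implied by PP, GQ reported in the record)
    results[name] = (gq_from_phred(s["PL"]), gq_from_phred(s["PP"]), s["GQ"])
    print(name, results[name])
```

Under those assumptions, the GQ implied by the sixth field matches the reported GQ for all three samples (0, 18, and 99), while the GQ implied by the PL field matches the original values (60, 75, and 99). That would be consistent with the refinement step recomputing GQ from the posteriors rather than from the raw likelihoods.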

    Is there anything obvious that might have caused this, such as a problem with my trio.ped file? In trio.ped, I used the sample column names from the VCF as the individual IDs.

    Many thanks for your assistance!

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi @bhimm,

    It's hard to say with certainty what might be happening here, but here is a possible explanation. Clearly the two parental genotypes have mediocre confidence, while the child genotype confidence is really great. And of course the trio configuration is in a state of Mendelian violation. My guess would be that the population frequency resource suggests that the variant is fairly common, and therefore it is more likely that one of the parental genotypes is wrong than that the child has a de novo mutation. You can check the site call in the 1000G file and see if that supports my supposition. You can also check the raw data at this site and see if the parental calls make sense, or if there seems to be some strand bias or other source of error.
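One way to do the 1000G check suggested above is sketched below in Python. It assumes the supporting file is a standard tab-delimited sites VCF whose INFO column carries an `AF` key; the function names, the example records, and the coordinates are made up for illustration and are not taken from the real resource.

```python
import gzip

def open_vcf(path):
    """Open a plain or gzip-compressed VCF as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")

def parse_info(info):
    """Turn 'AC=5;AF=0.001;DB' into a dict; flag keys map to an empty string."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out

def lookup_site(lines, chrom, pos):
    """Linear scan for records at chrom:pos; returns REF, ALT and AF (if present)."""
    hits = []
    for line in lines:
        if line.startswith("#"):
            continue  # header lines
        f = line.rstrip("\n").split("\t")
        if f[0] == chrom and int(f[1]) == pos:
            info = parse_info(f[7])
            hits.append({"ref": f[3], "alt": f[4], "AF": info.get("AF")})
    return hits

# Made-up records, for illustration only (not real 1000G data).
example_lines = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t1000\t.\tA\tG\t100\tPASS\tAC=500;AF=0.1;AN=5000\n",
    "1\t2000\t.\tC\tT\t100\tPASS\tAC=1;AF=0.0002;AN=5000\n",
]
hits = lookup_site(example_lines, "1", 1000)
print(hits)
```

With a real file, the same function could be called as `lookup_site(open_vcf(path), chrom, pos)`. A linear scan over a whole-genome sites file is slow, so an indexed lookup would be the better choice for checking many sites. An empty result means the site is absent from the resource.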

  • bhimm (Cambridge; Member)

    Thanks, Geraldine. There are so many examples of these kinds of problems that I suspect something else might be going on. I'll continue to explore it with the very latest GATK software and see how it goes.

    Also, is it possible for the forum system to send emails when there's been a response in a thread? If it's already supposed to be doing this, I haven't been receiving any emails and have needed to revisit the forum from time to time to see if there's been any activity.

    Thx again!

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    @bhimm Yes, there's a doc that describes how to set up forum notifications: https://www.broadinstitute.org/gatk/guide/article?id=27

  • bhimm (Cambridge; Member)

    Thanks! I'd recommend setting up the defaults to auto-email and have folks turn it off as needed.

    Another question for this thread: is there a test data set I could run through the pedigree caller to ensure that I've set it up correctly, and so I can be more confident about the results I'm getting from my own data set?

  • bhimm (Cambridge; Member)
    edited February 2016

    Thanks! I'll look into this. One last question here, hopefully - are there other test data sets with expected outputs one might use for validating GATK installation and functionality for general variant calling (separate from pedigree analysis), again, to ensure proper setup?
