Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

Should I use UnifiedGenotyper or HaplotypeCaller to call variants on my data?

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,398Administrator, GATK Developer admin

The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper. Its ability to call SNPs is equivalent to that of the UnifiedGenotyper, and its ability to call indels is far superior. We recommend using HaplotypeCaller in all cases, with only a few exceptions:

  • If you want to analyze more than 100 samples at a time (for performance reasons)
  • If you are working with non-diploid organisms (UG can handle different levels of ploidy while HC cannot)
  • If you are working with pooled samples (also due to the HC’s limitation regarding ploidy)

In those cases, we recommend using UnifiedGenotyper instead of HaplotypeCaller.

Geraldine Van der Auwera, PhD

Comments

  • DavidRiesDavidRies Posts: 11Member

    Dear Geraldine, I have to different setups:

    1. I have a pool of individuals of a F2 generation. These individuals all decend from two diploid parents. So I still end up with a diploid pool and would use HC, right?
    2. I have a pool of non related diploid individuals, so I have to use UG?

    Best regards,

    David

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,398Administrator, GATK Developer admin

    Hi David,

    If you are doing true pooled calling, you need to use UG and set the ploidy to (real ploidy per individual) x (number of individuals).

    Geraldine Van der Auwera, PhD

  • jeramiahsmithjeramiahsmith University of KentuckyPosts: 3Member

    Is HaplotypeCaller appropriate for making genotype calls from haploid progeny (individuals, not pools) from a diploid parent?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,398Administrator, GATK Developer admin

    HaplotypeCaller can only genotype diploid individuals. For haploids, please use UnifiedGenotyper.

    Geraldine Van der Auwera, PhD

  • jeramiahsmithjeramiahsmith University of KentuckyPosts: 3Member

    hmm, not the answer I expected. Is there some kind of intrinsic assumption about levels of herterozygosity? which program would one use to genotype isogenic dipliod lines?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,398Administrator, GATK Developer admin

    HaplotypeCaller assumes that the organism is diploid and therefore looks for allele frequencies around 0, 0.5, 1. With UnifiedGenotyper you can set ploidy to whatever you want and the model will be adapted accordingly. Neither makes any assumptions about levels of heterozygosity if by that you mean the frequency at which you find het sites in your organism.

    Keep in mind that we work exclusively on human genomes, so we may not be aware of the particularities of other types of organisms. If you need information that is tailored to those peculiarities, you need to clarify in more detail what you are working with and what are the concerns/constraints.

    Geraldine Van der Auwera, PhD

  • jeramiahsmithjeramiahsmith University of KentuckyPosts: 3Member
  • crojocrojo CaliforniaPosts: 8Member
    edited April 23

    Hi Geraldine,

    My understanding is that for larger cohorts (e.g. over 100 samples) HC is now also recommended. However, the FAQs page still recommends UG in this case so just wanted to mention that. Looks like the website is undergoing a lot of updates so this particular update to the recommendations might already be in the works but just wanted to mention it in case it is helpful.

    Post edited by crojo on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,398Administrator, GATK Developer admin

    @crojo, thanks for pointing this out! Indeed, we're in the middle of updating everything so a few pages are still lagging, but as a matter of fact I had forgotten about that FAQ article. I'll update this asap to avoid confusion.

    Geraldine Van der Auwera, PhD

  • mprasadmprasad FrancePosts: 13Member

    Hi Geraldine,

    I recently analysed some targeted gene-sequencing data using HC with Best Practices 3.x using the gvcf option. The aim was to identify rare mutations. Although it worked well for all variants, there was one that it wasn't able to detect. A sequencing platform analysed the same fastq files and called variants using UG, pileup, mpileup, and HC with an earlier version of Best Practices and they found an indel with UG and pileup that HC missed. I verified the indel by Sanger sequencing, so it is real and indeed the one causing the disease. Thought it may be useful for the team to see why HC missed it. The region where the indel lies is highly repetitive and polymoprhic, so that may explain it, but I did find the same indel by visual inspection, and it is present in ~30% of the reads. I'd like to share the BAM and VCFs with you if you are interested in taking a look. Can I upload this onto the ftp? Are there any other files/info you will need? I can give you my command lines but not those of the sequencing platform.

    Thanks, Megana

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,398Administrator, GATK Developer admin

    Hi Megana,

    This is indeed interesting -- could you please post your command lines? Ideally we'd want the center's commands too to find out what they did differently. But we can start by looking at yours and see if we can identify any potential reasons why the indel would be missed. A screenshot would be good too, please. Depending on what we find on first examination we'll decide whether we need to see the actual data. Thanks for volunteering this information!

    Geraldine Van der Auwera, PhD

  • mprasadmprasad FrancePosts: 13Member

    Hi Geraldine,

    Sure, here are my command lines to generate the raw calls.

    java -jar /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/human_g1k_v37.fasta -I $ARGV[$n] -L $L --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -o ${dir}/${filename}.gvcf

    java -jar /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/GATK/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/human_g1k_v37.fasta @variant -o ${dir}/${project_name}.rawvariants.vcf

    I have also attached a screenshot of the BAM file I generated with the deletion clearly visible and the VCF generated by the sequencing centre with the variant.

    Let me know if you need anything else.

    Thanks, Megana !

    Screen Shot 2014-10-02 at 13.54.35.png
    1078 x 692 - 56K
    Screen Shot 2014-10-02 at 13.55.38.png
    998 x 42 - 27K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,398Administrator, GATK Developer admin

    Thanks -- this does look like something HC should call. One more thing to try -- can you please run HC on this region with -ERC BP_RESOLUTION and the -bamout argument to output the reassembled reads? This will tell us what HC is seeing once it has performed reassembly. You can compare the output bam to the original and see if reads have moved. It can also show if HC didn't actually process parts of the region of interest, which happens in rare cases. Please show the IGV screenshot and the relevant lines from the GVCF (in full).

    Incidentally (not causing the problem but FYI), you don't need to pass confidence threshold parameters to HC when running in ERC mode, because they are set to 0 internally. You can pass them to GenotypeGVCFs, however.

    Geraldine Van der Auwera, PhD

  • mprasadmprasad FrancePosts: 13Member

    Hi Geraldine,

    Thanks for the pointer re: the confidence threshold paramaters. So I ran HC in BP_RESOLUTION mode and have attached a screenshot of the resulting bam (top) in comparison with the old bam (bottom). Indeed, there weren't any reads assembled in the region of interest, which explains the lack of variant calling! Why is that? I have also attached a screenshot of the resulting gvcf with the deleted base in yellow.

    Thanks, Megana

    Screenshot from 2014-10-03 17:07:13.png
    1920 x 1080 - 150K
    Screenshot from 2014-10-03 17:16:35.png
    1920 x 1080 - 439K
    Screenshot from 2014-10-03 17:07:13.png
    1920 x 1080 - 150K
  • SheilaSheila Broad InstitutePosts: 520Member, GATK Developer, Broadie, Moderator admin

    @mprasad

    Hi Megana,

    Looking at the original bam files, it seems like there are a lot of indels and SNPs around your site of interest. This is usually a sign of bad mapping.

    Because the reads are not present in the bamout file, the region is not being called as an active region. Have you checked the mapping qualities of the reads in the region, as well as the base qualities around the site? Active regions may not be triggered because of low quality scores.

    Please have a look at these documents to learn more about how Haplotype Caller works: https://www.broadinstitute.org/gatk/guide/article?id=4147

    https://www.broadinstitute.org/gatk/guide/article?id=4146

    This article may also be of interest to you: https://www.broadinstitute.org/gatk/guide/article?id=1235

    I hope these help!

    -Sheila

  • SheilaSheila Broad InstitutePosts: 520Member, GATK Developer, Broadie, Moderator admin

    @‌mprasad

    Hi again,

    Can you upload a snippet of the region so we can replicate the issue? Instruction on how to do so are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks, Sheila

  • mprasadmprasad FrancePosts: 13Member

    Hi Sheila,

    Sorry for the slow response. Thanks for the links. Indeed, the reads in the region have low MQ scores and BQ scores because it is highly repetitive and also polymorphic. I was just curious why UG called it but not HC, but that may have to also do with the specific command line used by the sequencing centre. I have uploaded the snippet file of the bam onto the ftp (prasad.snippet.bam) and the corresponding index file. Please do let me know if you need something else.

    Thanks, Megana

  • SheilaSheila Broad InstitutePosts: 520Member, GATK Developer, Broadie, Moderator admin

    @mprasad

    Hi Megana,

    I just submitted a bug report. I am not sure when this will be fixed, but thank you for sharing this! We are looking for test cases that are verified variants that are not picked up by Haplotype Caller.

    I will keep you updated on the status.

    -Sheila

  • mprasadmprasad FrancePosts: 13Member

    @Geraldine_VdAuwera

    I took your suggestion and ran Haplotype caller without the confidence threshold parameters but then tried to use them with the GenotypeGVCF tool. But it is throwing up an error. Here is my command line and the error

    java -jar /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/GATK/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/human_g1k_v37.fasta -stand_emit_conf 10 -stand_call_conf 30 --variant '/media/newdrive/REANALYSIS/Plaquettes/Plaq-M.aligned.sorted.dedup.realign.recal.gvcf' --variant '/media/newdrive/REANALYSIS/Plaquettes/Plaq-J.aligned.sorted.dedup.realign.recal.gvcf' --variant '/media/newdrive/REANALYSIS/Plaquettes/Plaq-H.aligned.sorted.dedup.realign.recal.gvcf' --variant '/media/newdrive/REANALYSIS/Plaquettes/Plaq-F.aligned.sorted.dedup.realign.recal.gvcf' --variant '/media/newdrive/REANALYSIS/Plaquettes/Plaq-CH.aligned.sorted.dedup.realign.recal.gvcf' --variant '/media/newdrive/REANALYSIS/Plaquettes/Plaq-CA.aligned.sorted.dedup.realign.recal.gvcf' -o '/media/newdrive/REANALYSIS/Plaquettes/plaquettes.rawvariants.vcf'

    ERROR ------------------------------------------------------------------------------------------

    ERROR A USER ERROR has occurred (version 3.1-1-g07a4bf8):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: Argument with name 'stand_emit_conf' isn't defined.

    I checked to make sure that the argument was available in GenotypeGVCFs. Am I missing something obvious here?

    Thanks, Megana

  • pdexheimerpdexheimer Posts: 360Member, GSA Collaborator ✭✭✭

    @mprasad‌ - It's just a version issue. -stand_emit_conf was added to GGVCFs in version 3.2

Sign In or Register to comment.