Questions about using UG vs. HC (out of date)

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
This discussion was created from comments split from: Should I use UnifiedGenotyper or HaplotypeCaller to call variants on my data?.

Comments

  • DavidRiesDavidRies Member ✭✭

    Dear Geraldine,
    I have to different setups:

    1. I have a pool of individuals of a F2 generation. These individuals all decend from two diploid parents. So I still end up with a diploid pool and would use HC, right?
    2. I have a pool of non related diploid individuals, so I have to use UG?

    Best regards,

    David

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi David,

    If you are doing true pooled calling, you need to use UG and set the ploidy to (real ploidy per individual) x (number of individuals).

  • jeramiahsmithjeramiahsmith University of KentuckyMember

    Is HaplotypeCaller appropriate for making genotype calls from haploid progeny (individuals, not pools) from a diploid parent?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    HaplotypeCaller can only genotype diploid individuals. For haploids, please use UnifiedGenotyper.

  • jeramiahsmithjeramiahsmith University of KentuckyMember

    hmm, not the answer I expected. Is there some kind of intrinsic assumption about levels of herterozygosity? which program would one use to genotype isogenic dipliod lines?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    HaplotypeCaller assumes that the organism is diploid and therefore looks for allele frequencies around 0, 0.5, 1. With UnifiedGenotyper you can set ploidy to whatever you want and the model will be adapted accordingly. Neither makes any assumptions about levels of heterozygosity if by that you mean the frequency at which you find het sites in your organism.

    Keep in mind that we work exclusively on human genomes, so we may not be aware of the particularities of other types of organisms. If you need information that is tailored to those peculiarities, you need to clarify in more detail what you are working with and what are the concerns/constraints.

  • jeramiahsmithjeramiahsmith University of KentuckyMember

    OK, that makes sense!

  • crojocrojo CaliforniaMember
    edited April 2014

    Hi Geraldine,

    My understanding is that for larger cohorts (e.g. over 100 samples) HC is now also recommended. However, the FAQs page still recommends UG in this case so just wanted to mention that. Looks like the website is undergoing a lot of updates so this particular update to the recommendations might already be in the works but just wanted to mention it in case it is helpful.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @crojo, thanks for pointing this out! Indeed, we're in the middle of updating everything so a few pages are still lagging, but as a matter of fact I had forgotten about that FAQ article. I'll update this asap to avoid confusion.

  • mprasadmprasad FranceMember

    Hi Geraldine,

    I recently analysed some targeted gene-sequencing data using HC with Best Practices 3.x using the gvcf option. The aim was to identify rare mutations. Although it worked well for all variants, there was one that it wasn't able to detect. A sequencing platform analysed the same fastq files and called variants using UG, pileup, mpileup, and HC with an earlier version of Best Practices and they found an indel with UG and pileup that HC missed. I verified the indel by Sanger sequencing, so it is real and indeed the one causing the disease. Thought it may be useful for the team to see why HC missed it. The region where the indel lies is highly repetitive and polymoprhic, so that may explain it, but I did find the same indel by visual inspection, and it is present in ~30% of the reads. I'd like to share the BAM and VCFs with you if you are interested in taking a look. Can I upload this onto the ftp? Are there any other files/info you will need? I can give you my command lines but not those of the sequencing platform.

    Thanks,
    Megana

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Megana,

    This is indeed interesting -- could you please post your command lines? Ideally we'd want the center's commands too to find out what they did differently. But we can start by looking at yours and see if we can identify any potential reasons why the indel would be missed. A screenshot would be good too, please. Depending on what we find on first examination we'll decide whether we need to see the actual data. Thanks for volunteering this information!

  • mprasadmprasad FranceMember

    Hi Geraldine,

    Sure, here are my command lines to generate the raw calls.

    java -jar /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/human_g1k_v37.fasta -I $ARGV[$n] -L $L --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -o ${dir}/${filename}.gvcf

    java -jar /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/GATK/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/human_g1k_v37.fasta @variant -o ${dir}/${project_name}.rawvariants.vcf

    I have also attached a screenshot of the BAM file I generated with the deletion clearly visible and the VCF generated by the sequencing centre with the variant.

    Let me know if you need anything else.

    Thanks,
    Megana !

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Thanks -- this does look like something HC should call. One more thing to try -- can you please run HC on this region with -ERC BP_RESOLUTION and the -bamout argument to output the reassembled reads? This will tell us what HC is seeing once it has performed reassembly. You can compare the output bam to the original and see if reads have moved. It can also show if HC didn't actually process parts of the region of interest, which happens in rare cases. Please show the IGV screenshot and the relevant lines from the GVCF (in full).

    Incidentally (not causing the problem but FYI), you don't need to pass confidence threshold parameters to HC when running in ERC mode, because they are set to 0 internally. You can pass them to GenotypeGVCFs, however.

  • mprasadmprasad FranceMember

    Hi Geraldine,

    Thanks for the pointer re: the confidence threshold paramaters.
    So I ran HC in BP_RESOLUTION mode and have attached a screenshot of the resulting bam (top) in comparison with the old bam (bottom). Indeed, there weren't any reads assembled in the region of interest, which explains the lack of variant calling! Why is that? I have also attached a screenshot of the resulting gvcf with the deleted base in yellow.

    Thanks,
    Megana

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @mprasad‌

    Hi Megana,

    Looking at the original bam files, it seems like there are a lot of indels and SNPs around your site of interest. This is usually a sign of bad mapping.

    Because the reads are not present in the bamout file, the region is not being called as an active region. Have you checked the mapping qualities of the reads in the region, as well as the base qualities around the site? Active regions may not be triggered because of low quality scores.

    Please have a look at these documents to learn more about how Haplotype Caller works: https://www.broadinstitute.org/gatk/guide/article?id=4147

    https://www.broadinstitute.org/gatk/guide/article?id=4146

    This article may also be of interest to you: https://www.broadinstitute.org/gatk/guide/article?id=1235

    I hope these help!

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @‌mprasad

    Hi again,

    Can you upload a snippet of the region so we can replicate the issue? Instruction on how to do so are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks,
    Sheila

  • mprasadmprasad FranceMember

    Hi Sheila,

    Sorry for the slow response. Thanks for the links. Indeed, the reads in the region have low MQ scores and BQ scores because it is highly repetitive and also polymorphic. I was just curious why UG called it but not HC, but that may have to also do with the specific command line used by the sequencing centre.
    I have uploaded the snippet file of the bam onto the ftp (prasad.snippet.bam) and the corresponding index file. Please do let me know if you need something else.

    Thanks,
    Megana

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @mprasad‌

    Hi Megana,

    I just submitted a bug report. I am not sure when this will be fixed, but thank you for sharing this! We are looking for test cases that are verified variants that are not picked up by Haplotype Caller.

    I will keep you updated on the status.

    -Sheila

  • mprasadmprasad FranceMember

    @Geraldine_VdAuwera

    I took your suggestion and ran Haplotype caller without the confidence threshold parameters but then tried to use them with the GenotypeGVCF tool. But it is throwing up an error. Here is my command line and the error

    java -jar /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/GATK/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/GRCh37/human_g1k_v37.fasta -stand_emit_conf 10 -stand_call_conf 30 --variant '/media/newdrive/REANALYSIS/Plaquettes/Plaq-M.aligned.sorted.dedup.realign.recal.gvcf' --variant '/media/newdrive/REANALYSIS/Plaquettes/Plaq-J.aligned.sorted.dedup.realign.recal.gvcf' --variant '/media/newdrive/REANALYSIS/Plaquettes/Plaq-H.aligned.sorted.dedup.realign.recal.gvcf' --variant '/media/newdrive/REANALYSIS/Plaquettes/Plaq-F.aligned.sorted.dedup.realign.recal.gvcf' --variant '/media/newdrive/REANALYSIS/Plaquettes/Plaq-CH.aligned.sorted.dedup.realign.recal.gvcf' --variant '/media/newdrive/REANALYSIS/Plaquettes/Plaq-CA.aligned.sorted.dedup.realign.recal.gvcf' -o '/media/newdrive/REANALYSIS/Plaquettes/plaquettes.rawvariants.vcf'

    ERROR ------------------------------------------------------------------------------------------

    ERROR A USER ERROR has occurred (version 3.1-1-g07a4bf8):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: Argument with name 'stand_emit_conf' isn't defined.

    I checked to make sure that the argument was available in GenotypeGVCFs. Am I missing something obvious here?

    Thanks,
    Megana

  • pdexheimerpdexheimer Member ✭✭✭✭

    @mprasad‌ - It's just a version issue. -stand_emit_conf was added to GGVCFs in version 3.2

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @mprasad‌

    Hi Megana,

    I was just updated on your bug. Unfortunately, it is very difficult to call this type of variant, as it may better be represented as a structural variant. Can you give me some more information about how this variant is sanger validated? If you can provide us with the exact sequence in the region, that would be great. Having the Sanger sequence will help us determine whether the events in the region are within the scope of what we can call, or whether there's a structural variant here that require different tools (like GenomeSTRiP).

    Thanks,
    Sheila

This discussion has been closed.