Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Why is haplotype caller ignoring some reads and calling 1/1 variants instead of 0/1?

VerónicaVerónica BarcelonaMember

Hello,

I compared the bam output from BWA-MEM and the bam output from HaplotypeCaller and I could see that for a specific active region GATK ignored some reads. Because of that, some variants were not called and others were called as homozygous instead of heterozygous. Could you please help me to understand why during the realignment step GATK does not consider the reads? I have checked mapping quality and alignment score and everything is ok. I am using GATK v4.

Thanks in advance.

Best

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Véronica
    Hi,

    Can you please post the VCF records for those sites?

    Thanks,
    Sheila

  • VerónicaVerónica BarcelonaMember

    Hello,

    Here they are, only two variants in this region (I am also sending the variant before and the one after):

    chr2_B 773399 . T C 1745.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-0.477;ClippingRankSum=0.000;DP=103;ExcessHet=3.0103;FS=4. 64;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=16.95;ReadPosRankSum=0.306;SOR=0.374 GT:AD:DP:FT:GQ:PL 0/1:54,49:103:heterozygous:99:1774,0,1984
    chr2_B 774393 . T C 2343.77 PASS AC=2;AF=1.00;AN=2;DP=54;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=29.50;SOR=1.022 GT:AD:DP:GQ:PL 1/1:0,54:54:99:2372,162,0
    chr2_B 774408 . G T 2508.77 PASS AC=2;AF=1.00;AN=2;DP=57;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=24.69;SOR=0.960 GT:AD:DP:GQ:PL 1/1:0,57:57:99:2537,174,0

    chr2_B 775128 . T A 331.77 SnpCluster AC=1;AF=0.500;AN=2;BaseQRankSum=0.682;ClippingRankSum=0.000;DP=74;ExcessHet=3.0103;FS=6.241;MLEAC=1;MLEAF=0.500;MQ=57.65;MQRankSum=1.016;QD=4.48;ReadPosRankSum=-3.130;SOR=0.719 GT:AD:DP:FT:GQ:PL 0/1:60,14:74:heterozygous:99:360,0,2611

    The command I ran for variant calling (GATK v4.0.2.1):
    gatk HaplotypeCaller --bam-output B_SC5314.GATK.bam --standard-min-confidence-threshold-for-calling 30.0 --sample-ploidy 2 --genotyping-mode DISCOVERY --output B_SC5314.vcf --input B_SC5314.mkd.bam --reference ref.fasta

    Thanks for your help!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Verónica
    Hi,

    I think the VCF records for those two sites are correct. It looks like all the reads support the variant allele.

    Can you check the base qualities at the other two sites? Can you also post some zoomed out IGV screenshots? It would be good to see ~300 bases before and after the sites of interest.

    Thanks,
    Sheila

  • VerónicaVerónica BarcelonaMember

    @Sheila
    Hi,

    Thanks for your answer.

    However, what I do not understand is in the VCF, as in the BAM output from HC, DP = 54 (or 57 depending on the variant), but in my original BAM I have roughly the double of reads mapping at these positions.

    The reads that were excluded are exactly the ones supporting the reference in this positions and presenting SNPs in other ones. So, if all the reads I had initially were considered, I would have a 0/1 genotype and not 1/1 (plus other SNPs that the excluded reads contain). I would understand if the reads were excluded because of quality reasons, but all the bases have quality above 30 and the mapping quality varies between 30 and 60.

    Could you please tell me other factors that can have contributed to the "exclusion" of half of the reads?

    Thank you very much!!
    Verónica

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Verónica
    Hi Verónica,

    I suspect these reference reads are being mapped elsewhere after the reassembly step. Unfortunately, we used to have a way to view those reads in the bamout in GATK3, but it has not yet been ported to GATK4. Sorry for the hassle, but in this case, it may be worth running GATK3 on just that region and checking if that is indeed the case. You will need to add --emitDroppedReads to your command.

    -Sheila

  • VerónicaVerónica BarcelonaMember

    @Sheila
    Hello Sheila,

    I ran GATK3 in this region as you suggested. Indeed, the reads I am missing have the tag "Failed realignment".

    Thanks for your help!!

    Verónica

  • mpmachadompmachado LisboaMember

    Hi everyone,
    I’m facing a similar problem, but in my case HaplotypeCaller disregarded all reads, and did not call anything. When I asked to emit all dropped reads with -edr no reads were recovered. And I obtained the same result without the interval list (just to ensure it was not a problem of intervals list).
    I run it with GATK v4.1.2.0 (using broadinstitute/gatk:4.1.2.0 Docker image) and then with v3.8 (using broadinstitute/gatk3:3.8-1 Docker image).

    Commands:
    GATK v4.1.2.0
    gatk HaplotypeCaller --input /mnt/align/$(basename $bam_file) --output /mnt/align/PAT4-789871A.manually_01.vcf.gz --reference /mnt/ref/Human.fa --annotation-group StandardAnnotation --annotation-group StandardHCAnnotation --annotation TandemRepeat --contamination-fraction-to-filter 0 --sample-ploidy 2 --activity-profile-out /mnt/align/activity_profile_igv.manually_01.tab --assembly-region-out /mnt/align/assembly_region_igv.manually_01.tab --bam-output /mnt/align/bam_out_igv.manually_01.bam --disable-optimizations true --intervals /mnt/target/TrusightCancer_Manifest_UMO_24_genes_50-100_Julho2018.bed --interval-padding 100
    GATK v3.8
    java -jar /usr/GenomeAnalysisTK.jar -T HaplotypeCaller -I /mnt/align/$(basename $bam_file) -o /mnt/align/PAT4-789871A.manually_v3_8_01.vcf.gz -R /mnt/ref/Human.fa -G StandardAnnotation -G StandardHCAnnotation -A TandemRepeatAnnotator -contamination 0 -ploidy 2 -bamout /mnt/align/bam_out_igv.manually_v3_8_01.bam -disableOptimizations -L /mnt/target/TrusightCancer_Manifest_UMO_24_genes_50-100_Julho2018.bed –edr

    Image:
    Upper alignment panel: original BAM file
    Lower alignment panel: GATK v3.8 -bamout file

    And this happed for other regions as well.
    I hope you can help me understanding why is HaplotypeCaller ignoring all the reads.
    Thank you in advance.
    Best regards,
    Miguel

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited September 9

    Hi @mpmachado

    First off, we do not support GATK v3 anymore. Would you please share the same image for the GATK v4 instead please?

    If you want to get a better understanding of bamout you may try running through one of tutorials here which walks you through the IGV steps of exploring a bamout file. Note: this uses GATK3 but this shouldn't stop you from running the IGV steps and getting a good understanding of the concept.

  • mpmachadompmachado LisboaMember

    Hi @bhanuGandham,
    GATK v4 produced exactly the same result of GATK v3 (the previous image): no reads! I just run GATK v3 to see the problematic reads with --emitDroppedReads as suggested above by @Sheila.
    However, after following your advice and read the tutorial, I run everyting again but now with --force-active and got some reads. Only reference reads were present and I couldn't find reads with the alternative allele, even taged as "Failed realignment" (in the case of GATK v3).
    I still don't get why is HaplotypeCaller only considering reference reads. What do you suggest me to do to try to desintangle this situation?
    Thanks once more.
    Best regards,
    Miguel

    Commands:
    GATK v4.1.2.0
    gatk HaplotypeCaller --input /mnt/align/$(basename $bam_file) --output /mnt/align/PAT4-789871A.manually_09.vcf.gz --reference /mnt/ref/Human.fa --annotation-group StandardAnnotation --annotation-group StandardHCAnnotation --annotation TandemRepeat --contamination-fraction-to-filter 0 --sample-ploidy 2 --activity-profile-out /mnt/align/activity_profile_igv.manually_09.tab --assembly-region-out /mnt/align/assembly_region_igv.manually_09.tab --bam-output /mnt/align/bam_out_igv.manually_09.bam --disable-optimizations true --intervals /mnt/target/TrusightCancer_Manifest_UMO_24_genes_50-100_Julho2018.bed --interval-padding 100 --force-active true
    GATK v3.8
    java -jar /usr/GenomeAnalysisTK.jar -T HaplotypeCaller -I /mnt/align/$(basename $bam_file) -o /mnt/align/PAT4-789871A.manually_v3_8_03.vcf.gz -R /mnt/ref/Human.fa -bamout /mnt/align/bam_out_igv.manually_v3_8_03.bam -disableOptimizations -L /mnt/target/TrusightCancer_Manifest_UMO_24_genes_50-100_Julho2018.bed -forceActive -ip 100 -edr

    Image:
    Upper alignment panel: original BAM file
    Middle alignment panel: GATK v4.1.2.0 --bam-output file with --force-active
    Lower alignment panel: GATK v3.8 -bamout file with -forceActive

  • mpmachadompmachado LisboaMember
    edited September 11

    Hi again,
    I went back to the same data and colour it by read group. I noticed that there are two Artificial Haplotypes overlapping the SNV region. Do you think this might be the reason for HaplotypeCaller being ignoring the SNV and the reads supporting it?
    Thanks again,
    Miguel

    Image:
    Coloured by read group
    * Upper alignment panel: original BAM file
    * Middle alignment panel: GATK v4.1.2.0 --bam-output file with --force-active; ArtificalHaplotypeRG in green
    * Lower alignment panel: GATK v3.8 -bamout file with -forceActive; ArtificalHaplotypeRG in pink

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @mpmachado

    The reason you do not see any variant in this region is because after local assembly Haplotypecaller did not find any evidence of a variant and you see that in the bamout and vcf. Take a look at these two docs to understand how Haplotypecaller does local assembly: https://software.broadinstitute.org/gatk/documentation/article?id=11068
    https://software.broadinstitute.org/gatk/documentation/article?id=11076

  • mpmachadompmachado LisboaMember
    edited September 20

    Hi @bhanuGandham,
    Is it possible that HaplotypeCaller did not find any evidence for a variant because it is excluding the reads with non-reference bases? Although it targeted the region as an active region, which indicates it took in consideration the missing reads.
    If I understood right the documentations you pointed out, the new k-mer from the reads with non-reference bases should have been included in the assembler graph and the node containing the variant should not have been removed because there were around 40 reads supporting the variant (with a balanced strand representation). However, at the end, there aren't any reads containing the non-reference base, and, of course, no variant is called.
    This phenomenon happened several times in other positions.
    If you think the data would help us understanding what is going on, I can share it with you.
    Thanks,
    Miguel

Sign In or Register to comment.