Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!

Haplotype caller calling heterozygote with bad quality when it is Homozygote

Hi all,
Second question of the day, I am sorry.

I have been facing an issue that I saw many times on the forum but each time it seemed that you fixed it with new versions. Problem is I am using the very last version of GATK = 3.4 (downloaded yesterday).

I am using HaplotypeCaller single sample mode with all parameters by default as indicated in the "DNASeq" example in the HaplotypeCaller documentation page.

I have a problem with one variant rs206076 on BRCA2 chromosome 13 position 32915005. I have Sanger validation for this variant in all my samples and Unified Genotyper was calling it correctly.

Using HaplotypeCaller, there are some variants where the HOM call is made correctly and with very high quality (22 000) and others where HC calls the position as heterozygote with very low quality when it is obvious from IGV that the position is homozygote.

I put in the IGV screenshot on individual that fails (up) and the other that is ok (down).

I do have the same problem with another variant close to this one (not shown) whereas a couple of variants in between are called correctly in all samples.

Any idea about this issue ?
thanks a lot
Manon

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie admin

    @manon_sourdeix
    Hi Manon,

    Can you please post the bamout screenshot of the region? I assuming those screenshots you posted are of the original bam file.

    Also, please post the VCF record for one position of interest.

    Thanks,
    Sheila

  • manon_sourdeixmanon_sourdeix FranceMember

    @tommycarstensen I did read your thread but aren't you having an issue with low coverage data ? Here we have pretty high coverage data (5000x, not necessarily need that much I agree).
    However I did test with --minPruning 1. The variant shows up but as mutliallelic. See IGV plots in my other answer. Still having issues.

  • manon_sourdeixmanon_sourdeix FranceMember

    @Sheila

    Here are all the IGV screenshots for
    1) the given position = chr13:32915005
    first line is the original bam file
    second is bamout without minPruning
    third is bamout with minPruning set to 1
    2) And the same for another variant within the same individual that is found = chr13:32913055.

    It seems that quite a lot of reads completely disappear or are misaligned I do not know.

    Vcf lines for the minPruning set to 1
    Variant not found or low quality
    chr13 32915005 rs206076 G A,C 704.77 . AC=1,1;AF=0.500,0.500;AN=2;DB;DP=436;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=60.00;QD=1.62;SOR=5.610 GT:AD:DP:GQ:PL 1/2:0,2,31:33:83:733,651,645,83,0,296
    Variant found correctly
    chr13 32913055 rs206075 A G 77259.77 . AC=2;AF=1.00;AN=2;DB;DP=2034;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=29.21;SOR=2.365 GT:AD:DP:GQ:PL 1/1:0,2034:2034:99:77288,6110,0

    Please tell me if there is anything else you need to help me.
    Thanks a lot.

    Manon

  • manon_sourdeixmanon_sourdeix FranceMember

    Hi again guys,
    I add here the IGV plot and vcf line when adding --minDanglingBranchLength 1 as suggested by Geraldine in @tommycarstensen post.

    The variant is found, monoallelic but called heterozygote. On IGV, it's quite clerly homozygote though (last line).

    Vcf line
    chr13 32915005 rs206076 G C 503.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.516;ClippingRankSum=0.494;DB;DP=440;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=-0.710;QD=1.14;ReadPosRankSum=0.927;SOR=1.559 GT:AD:DP:GQ:PL 0/1:15,30:45:99:532,0,268

    Any thoughts ?
    Thanks again.
    Manon

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Um... "obvious from IGV" is rarely an informative phrase. Based on the ADs in that last call it looks like most reads are getting thrown out -- probably due to low qualities; HC has some more stringent filters than other tools. Try turning on the IGV option that fades mismatches depending on base quality. Also, are these duplicate-marked, and do you have duplicates showing in IGV? When I hear 5000x, I think PCR dupes...

  • manon_sourdeixmanon_sourdeix FranceMember

    Hey. This is amplicon-based analysis so they are by design all PCR duplicates. Sorry for not having mentioned.

    Problem is that I know this SNP is homozygote. This was confirmed by Sanger, UG calls it HOM with very high quality as well. Which is why I am looking for where the issue with my HC call is.
    Also because most of the other calls behave very well, especially the indels !

    The shading was already on "between 0 and 50". I put it up to 80 and the Cs start to shade. Is there any option I can "play" on to call these variants. Since we are in a diagnosis context, we cannot miss any variant. On the other side, getting a bit more false positives is not an issue.

    Any suggestion that I could try running ?

    thanks a lot Sheila and Géraldine.

    Manon

  • manon_sourdeixmanon_sourdeix FranceMember

    Hi all,
    I forgot to output the line of this same variant called HOM by UG and where the quality is really good.

    chr13 32915005 rs206076 G C 22359.77 KEEP AC=2;AF=1.00;AN=2;DB;DP=558;Dels=0.00;FS=0.000;HaplotypeScore=34.1418;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=30.35;SOR=12.642;ANN=C|synonymous_variant|LOW|BRCA2|BRCA2|transcript|NM_000059.3|Coding|11/27|c.6513G>C|p.Val2171Val|6740/11386|6513/10257|2171/3418||;SNP;HOM;VARTYPE=SNP GT:AD:DP:GQ:PL 1/1:0,555:558:99:22388,1664,0

    Any idea why I get such a low quality or even no call with HC ?

    Thanks
    Manon

  • manon_sourdeixmanon_sourdeix FranceMember

    Hi,
    Just as a matter of transparency linked to my last post about weird strand bias, for these tests I was using nct = 1 and not nct 4. Therefore I am still completely blocked in my workflow regarding the non-calling of such variants.
    Please tell me if you need any other info / example.

    Thanks a lot
    Manon

  • manon_sourdeixmanon_sourdeix FranceMember

    Hi again.
    I am so sorry that I post so many comments but I am trying to find the reason with HC acts like that and want to provide you with as much detail as possible.

    Looking at the position in the original file, the C (alternate) base has a quality of around 38, which is equivalent to most of the other bases, including at other positions on the reads.
    When looking at the vcf file line (HC call, minPruning 1, minDangling 1), AD is 15,30 when DP is 440, feeling like many reads have not been considered, all carrying the alternate allele. When looking in the bamout for this call, the C (alt) allele, has a base quality around 33, the G allele as well.

    I cannot figure out why so many reads carring a good quality C allele are not considered in the call.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Oh, hmm. The problem with amplification data is that it gives you a false sense of security regarding depth of coverage... There's a lot of reads but you have no good way to distinguish what portion is actually representative of the underlying truth. That's really not an ideal data type for making diagnostic decisions, afaik. But if that's what you have to work with...

    So, my guess is that either the reads are getting filtered out up front due to low quality (which you can test for by running some QC tools on the bamout) or they are being discounted as uninformative later in the process, which is a bit more complicated to test for. If it's the former, there are some parameters you can play with to lower HC's requirements, iirc. If it's the latter there's not much to be done except flag uncertain calls as needing manual review. You can have HC emit reference GQ (RGQ) to assist in flagging these calls.

  • manon_sourdeixmanon_sourdeix FranceMember

    Hi Geraldine, thank you lot for this complete answer. At that specific position it is not ultra deep (see UG call with dt NONE has DP=558. This is still testing data (not yet diagnosis as you can guess) and this is the reason why there is so much coverage in some regions.

    I will have a look at the QC of the bamout file to see if there are many reads with low quality. However if that was the case, wouldn't UG detect that mutation with low quality as well ?

    In case this info may help, I did not perform base recalibration as this is such a low coverage (in the sense horizontal coverage !).

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    UG applies fewer filters and is more easily convinced than HC. It's also less "aware" of haplotype ambiguity (when there is ambiguity about which haplotype a read supports, ie a read is uninformative as to which haplotype explains the data better).

    BQSR is dependent on number of reads total, not how spread they are across the genome, so you may still be able to use it -- but it may or may not make a difference here.

    (I think we would call this experiment type "narrow coverage" rather than low coverage, to avoid that ambiguity.)

  • manon_sourdeixmanon_sourdeix FranceMember

    I am still looking at the reads in this region since I do not get why so many are not considered, but I will test BQSR as well, as a try.

    Thanks !

  • manon_sourdeixmanon_sourdeix FranceMember

    All reads supporting this mutation have a mapping quality of 60. All bases at this position have a base quality around 30 from what I can see.
    As previously mentionned, this is happening in more than 1 individual (although for some others, the mutation at the exact same position is called by HC). For all individuals, this has been confirmed by Sanger, and call from the same fastQ with a non-opensource software as well.
    Also, for the other 20ish other mutations, HC is performing perfectly well, which makes me think it is not the amplicon design the culprit.

    So this makes me a bit worried and wondering whether I can really trust HC output, at least for SNVs..

    I will do base recalibration right now and come back to you.

  • manon_sourdeixmanon_sourdeix FranceMember

    Hello @Geraldine_VdAuwera

    Here we go I tested runnning BQSR on my original bam file (that stopped GATK worflow righ tbefore the BQSR process) and recalled the variants using HC. My rs206076 variant does appear, homozygous, with very high quality.

    It's a bit too early to close that thread and say it is resolve. I will run the recaliibration step on all my samples tonight or tomorrow, and see whether it solves the issue entirely and come back to you about that.

    I kind of based myself on this, https://www.biostars.org/p/1742/ citing your recommendations. I probably shouldn't have. Is there any covariates that I should / should not include regarding the fact that I am analysing amplicon based sequencing data ?

    Thanks again for your help, this is really appreciated.

    Manon

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hmm, interesting. Let us know how it goes. We don't really have experience with amplicon sequencing data so I can't say much about tweaking BQSR covariates. But I can say the caveat cited on Biostars mostly applies to very small experiments where people sequence only a handful of genes. Since you mentioned having upward of 22K variants in your original post, I'm assuming you gene panel is significantly larger than that, so you may be in the clear. Ultimately there's nothing quite like running some tests and running a comparative analysis of the results :)

  • manon_sourdeixmanon_sourdeix FranceMember

    Hi Geraldine,
    I do not have all results yet. 22K was the quality of the call, not the number of variants. Right now I have only 2 genes. We are moving slowly to a larger panel of around 30 genes though.

    From the results that have run, it's getting better but still missing quite a few variants. So the BQSR does not solve it all.

    Could you please tell me which parameter I should be looking at to try and lower the stringency of HC ?

    The other very last possibility will be to stick to call with both HC and UG and at the end try and group the 2 calls into 1. I don't really like that, especially when you really recommend to use HC now.
    One of the 2 variants that are making me so much trouble is actually homozygote in ALL individuals. And HC fails to identify a few of them. Do you think trying to call them all (42 of them) together would improve that ?

    Thanks again for the support !

    Manon

  • manon_sourdeixmanon_sourdeix FranceMember

    Hi Geraldine,
    Ok so BQSR did solve some of my samples. For this rs that I talked about, I had 7 individuals with the mutation missing. With BQSR I resolve for 4 of them, but get 2 more, for a total of 5 individuals with the mutation missing.

    By adding minPruning 1, 3 of these 5 get called with very good quality (QD around 30 !).

    Leaving 2 missed ones. I ran HC and noted that for the 3 that get called, HC says "XX reads were filtered out during the traversal out of approximately 135268 total reads (0.20%)" --> total of around 100000 reads.
    But for the 2 that are missing the mutation, HC says "out of approximately 53713" reads.

    I did samtools flagstat on all 5 recalibrated bams and they all have well mapped, well paired 100 000 reads, 50 000 read1 and 50 000 read2. So for my 2 individuals with the missing mutations, it seems that HC is looking at only half the reads, probably all read1 or all read2.

    I think we are getting closer with this issue but can't think of a reason why HC would do that. Any suggestion/comment ?
    Thanks a lot
    Manon

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Can you post the section of the output log that shows the breakdown of the filtering?

  • manon_sourdeixmanon_sourdeix FranceMember

    Sure. Sorry about the time for answering, it's just that I am in France so our days do not overlap much !

    Here is the section, not many reads are filtered out it seems. But it feels like only half are looked at from the beginning.

    Command line
    java -Djava.io.tmpdir=/tmp/tmp_pipeline_bioinfo/ -Xmx20G -jar /home/lom/bin/softwares/gatk/current/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/lom/bin/references/ucsc.hg19.fasta -I 4249.recal.bam --dbsnp /home/lom/bin/databases/dbsnp_138.hg19.vcf -L intervals.bed -stand_call_conf 30 -stand_emit_conf 10 -minPruning 1 -bamout 4249_bamout.bam -o 4249_out.vcf

    Section of output log
    INFO 08:50:21,009 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
    INFO 08:50:21,270 IntervalUtils - Processing 14185 bp from intervals
    INFO 08:50:21,352 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
    INFO 08:50:21,380 GenomeAnalysisEngine - Done preparing for traversal
    INFO 08:50:21,381 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 08:50:21,381 ProgressMeter - | processed | time | per 1M | | total | remaining
    INFO 08:50:21,382 ProgressMeter - Location | active regions | elapsed | active regions | completed | runtime | runtime
    INFO 08:50:21,382 HaplotypeCaller - Disabling physical phasing, which is supported only for reference-model confidence output
    INFO 08:50:21,492 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values.
    INFO 08:50:21,493 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values.
    INFO 08:50:21,514 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units
    Using SSE4.1 accelerated implementation of PairHMM
    INFO 08:50:24,465 VectorLoglessPairHMM - libVectorLoglessPairHMM unpacked successfully from GATK jar file
    INFO 08:50:24,465 VectorLoglessPairHMM - Using vectorized implementation of PairHMM
    WARN 08:50:33,088 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper
    WARN 08:50:33,089 InbreedingCoeff - Annotation will not be calculated, must provide a valid PED file (-ped) from the command line.
    INFO 08:50:51,385 ProgressMeter - chr13:32906351 7531.0 30.0 s 66.4 m 15.2% 3.3 m 2.8 m
    INFO 08:51:21,387 ProgressMeter - chr13:32910328 10914.0 60.0 s 91.6 m 23.8% 4.2 m 3.2 m
    WARN 08:51:21,859 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at chr13:32907535 has 7 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument
    INFO 08:51:51,388 ProgressMeter - chr13:32971015 160205.0 90.0 s 9.4 m 93.5% 96.0 s 6.0 s
    INFO 08:51:56,650 VectorLoglessPairHMM - Time spent in setup for JNI call : 0.037732179000000005
    INFO 08:51:56,651 PairHMM - Total compute time in PairHMM computeLikelihoods() : 76.34947205
    INFO 08:51:56,651 HaplotypeCaller - Ran local assembly on 8 active regions
    INFO 08:51:57,020 ProgressMeter - done 187877.0 95.0 s 8.5 m 100.0% 95.0 s 0.0 s
    INFO 08:51:57,021 ProgressMeter - Total runtime 95.64 secs, 1.59 min, 0.03 hours
    INFO 08:51:57,021 MicroScheduler - 151 reads were filtered out during the traversal out of approximately 57728 total reads (0.26%)
    INFO 08:51:57,022 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
    INFO 08:51:57,022 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter
    INFO 08:51:57,023 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
    INFO 08:51:57,023 MicroScheduler - -> 146 reads (0.25% of total) failing HCMappingQualityFilter
    INFO 08:51:57,024 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
    INFO 08:51:57,024 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
    INFO 08:51:57,025 MicroScheduler - -> 5 reads (0.01% of total) failing NotPrimaryAlignmentFilter
    INFO 08:51:57,025 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
    INFO 08:51:58,444 GATKRunReport - Uploaded run statistics report to AWS S3

    Flagstat recal bam
    119821 + 0 in total (QC-passed reads + QC-failed reads)
    87 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    118709 + 0 mapped (99.07%:-nan%)
    119734 + 0 paired in sequencing
    59867 + 0 read1
    59867 + 0 read2
    118226 + 0 properly paired (98.74%:-nan%)
    118316 + 0 with itself and mate mapped
    306 + 0 singletons (0.26%:-nan%)
    78 + 0 with mate mapped to a different chr
    42 + 0 with mate mapped to a different chr (mapQ>=5)

    Flagstat bamout
    8841 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    8841 + 0 mapped (100.00%:-nan%)
    8256 + 0 paired in sequencing
    4206 + 0 read1
    4050 + 0 read2
    8236 + 0 properly paired (99.76%:-nan%)
    8236 + 0 with itself and mate mapped
    20 + 0 singletons (0.24%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

  • manon_sourdeixmanon_sourdeix FranceMember

    Hey
    I may have been misleading you sorry.

    Let me try to summarize a little bit.

    Individual 5967 and 4249 have been sequenced for 2 genes using amplicon sequencing. 5967 has the variant called if minPruning is set to 1 (not otherwise). 4249 does not even with minPruning.

    In the original bams, 5967 has 100 000 reads for each of the 2 chromosomes, idem in the recalibrated bam.
    In the original bams, 4249 has 50 000 reads for each of the 2 chromosomes, idem in the recalibrated bam.

    In the bamout, 5967 has 13173 reads on the chromosome of interest (bed file)
    In the bamout, 4249 has 8841 reads on the chromosome of interest (bed file)

    Basically from the individuals I have tested, 2 for which the variants is missing have a total of 6900 and 8841 reads in the bamout for that chromosome, and for 3 of them that do show the variant, they have 9274, 10653 and 13173 reads in the bamout. Seems like the difference is not THAT big.

    For information, this variant is a very commun homozygote variant (CEU population) so it should be present in the homozygous state in all individuals. I really have to resolve this error somehow but can't figure out what I should try next..

  • manon_sourdeixmanon_sourdeix FranceMember

    Hey
    One last info for today

    I have gone a bit deeper in the bamout files of individuals 4249 (missing the variant) and 5967 (having the variant).

    (samtools view bamout chr13:32915005)
    Individual 4249 has 1162 reads covering the position, all with MQ = 60. Flags : 625 with '147', 46 with '16', 2 with '73' and 489 with '99'.
    Individual 5967 has 4045 reads covering the position, all with MQ = 60. Flags : 1865 with '147', 192 with '16', 10 with '73' and 1978 with '99'.

    Anything striking you from these statistics ?

    Thanks again.

  • manon_sourdeixmanon_sourdeix FranceMember
    • One last info so that I hope you have everything you need to try and diagnose my problem.

    When I try EMIT_ALL_SITES for my individual ID 4249 I get at my desired position
    chr13 32915005 . G . 605.23 . AN=2;DP=337;MQ=60.00 GT:AD:DP 0/0:337:337
    (should be homozygote CC, G being the ref allele).

  • manon_sourdeixmanon_sourdeix FranceMember

    Hey GATK team !
    Any chance I could send you some examples of 1 that show the variant, the other that does not.
    I have tried so many different things and still cannot figure out why this variant is not called and how I should try and change the parameters for this true variation to be detected.

    Thanks a lot
    Manon

  • manon_sourdeixmanon_sourdeix FranceMember

    Hi
    I know this discussion starts to be long, but I already tried minPruning and minDanglingBranchLength set to 1 :) The minPruning part did recover some but not all.

    I will definitely try the kmer sizes. (I also do not forget that I owe you a test for the other thread ! will do asap)

    I will get back to you for this test.

  • manon_sourdeixmanon_sourdeix FranceMember

    Hi @Sheila !

    This kmerSize set to 10 and 19 completely solved the problem ! Thank you SO much !
    Any idea why this is ?

    Also, one important question is : do you think this should be added in complement with minPruning 1 (even without minPrunig set to 1, it does call the confirmed variant) ? + do you think this is needed only for Amplicon sequencing ? (I am currently also analyzing a capture panel of 30ish genes) ?

    Thanks a lot !
    Manon

  • SheilaSheila Broad InstituteMember, Broadie admin

    @manon_sourdeix
    Hi Manon,

    I am going to have a ticket to have the kmerSize in amplicon sequencing checked out. For now, it seems like only amplicon sequencing needs the kmer size to be changed, but I will have to get back to you. I am happy the kmer size solved the problem for you!

    -Sheila

  • manon_sourdeixmanon_sourdeix FranceMember

    Hey !
    There was one other variant not behaving correctly in the same run. I will re-launch calling for all individuals and tell you if that completely solved the problem (+ close that thread if that's the case).

    So you would advise not to put minPruning to 1 and kmerSizes for targeted capture sequencing ?

    Thanks again !
    Manon

  • manon_sourdeixmanon_sourdeix FranceMember

    Hey !
    I would like to confirm that for my amplicon sequencing run, adding the kMer parameters did resolve all my problems.

    I have a run a capture of a 30ish genes panel without the kMer parameters and all went fine although it's only 10 individuals and I have Sanger sequencing data for only 2 genes in the panel.

    Thanks a lot for your help.
    Manon

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @manon_sourdeix What was you exact command line with regard to the kmersize flag?

  • manon_sourdeixmanon_sourdeix FranceMember

    @tommycarstensen
    Apparently by default 2 Kmers are examined which are 10 and 25.
    In the command line you need to add "-kmerSize 12 -kmerSize 19"

    Is this what you wanted to know or do you need anything else ?

Sign In or Register to comment.