Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

curious zero quality reference calls around indels

Hello:

I have been looking the HC output of my "haploid" Drosophila genomes. Mostly great. But there is one pattern that I neither understand or can even rationalize in the context of the many discussions of the complexity the gatk BP pipeline. Here is a typical call (including a bamout):

| DPGP3b_ZI99_GATKBP @ rooted3 (chuck) | => gatk \ | => -T HaplotypeCaller \ | => -R /home/chuck/shrd/reference_genomes/D_mel_RELEASE6_Sue/dmel_r6.fasta \ | => -I DPGP3b_ZI99.sorted.dm.rg.recal.bam \ | => --genotyping_mode DISCOVERY \ | => --emitRefConfidence BP_RESOLUTION \ | => -hets 0.006 \ | => --indel_heterozygosity 0.0006 \ | => --minReadsPerAlignmentStart 5 \ | => --output_mode EMIT_ALL_CONFIDENT_SITES \ | => --sample_ploidy 1 \ | => -kmerSize 10 \ | => -kmerSize 25 \ | => --doNotRunPhysicalPhasing \ | => --dontTrimActiveRegions \ | => --variant_index_type LINEAR \ | => --variant_index_parameter 128000 \ | => -o DPGP3b_ZI99.2L12-13M.raw.snps.indels.g.vcf \ | => -L 2L:12000000-13000000 \ | => -bamout DPGP3b_ZI99.2L12-13M.sorted.dm.rg.recal.bamout.bam &

And here is a small example

2L 12020924 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0:13,0:13:99:0,135 2L 12020925 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0:13,0:13:99:0,135 2L 12020926 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:13,0:13:89:0,90 2L 12020927 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0:13,0:13:89:0,90 2L 12020928 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0:13,0:13:89:0,90 2L 12020929 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:13,0:13:89:0,90 2L 12020930 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0:13,0:13:89:0,90 2L 12020931 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0:13,0:13:0:0,0 2L 12020932 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:13,0:13:0:0,0 2L 12020933 . A AT,<NON_REF> 219.78 . DP=13;MLEAC=1,0;MLEAF=1.00,0.00;MQ=60.00 GT:AD:DP:GQ:PL:SB 1:0,13,0:13:99:252,0,252:0,0,8,5 2L 12020934 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:3,11:14:0:0,0 2L 12020935 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:14,0:14:99:0,391 2L 12020936 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:14,0:14:99:0,389 2L 12020937 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0:17,0:17:99:0,479

A number of other examples are in the attached file along with the std output to the terminal from HC.

Many of these cases of GQ == 0 seem arbitrary and surprising, but I know there is a lot going on under the hood.

Is this to be expected? If not, what can I do to improve the high quality calling.

Let me know if you would like to have the input bam files.

Cheers,
Chuck

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @chlangley
    Hi Chuck,

    Can you please confirm that the base qualities and mapping qualities in the bam file are good in the region? Also, please post the original bam file and bamout file for the region you posted above.
    Are you using the latest version of GATK?

    Thanks,
    Sheila

  • chlangleychlangley UCDMember

    Sorry for posting the typo riddled text. It should have read: "I have been looking at the HC output from my "haploid" Drosophila genomes. Mostly great. But there is one pattern that I neither understand nor can I even rationalize it in the context of the many discussions of the complexity the gatk BP pipeline."
    ....
    "A number of other examples are in the attached file along with the std output to the terminal from HC. Many of these cases of GQ == 0 seem arbitrary and surprising, but I know there is a lot going on under the hood. Is this to be expected? If not, what can I do to improve the high quality calling?"

    Oh! And here is the attachment.

    Cheers,
    Chuck

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @chlangley

    Hi Chuck,

    Can you post IGV screenshots of the original bam file and bamout file in the region? https://www.broadinstitute.org/gatk/guide/article?id=5484

    Thanks,
    Sheila

  • chlangleychlangley UCDMember

    Hello Shiela:

    Thanks for looking at my issue.

    I am using
    | => gatk --version
    3.4-46-gbc02625

    I checked 20 reads covering several of the 'curious' GQ==0 calls. All seem to be 'Mapping quality = 60 .

    I placed a file, fromChuck.zip in ftp.broadinstitute.org/ . It contains the relevant bam and bamout files:
    DPGP3b_ZI99.2L12-13M.sorted.dm.rg.recal.bam
    DPGP3b_ZI99.2L12-13M.sorted.dm.rg.recal.bamout1.bam

    Thanks for any help.

    Cheers,
    Chuck

  • chlangleychlangley UCDMember

    @Sheila said:
    @chlangley

    Hi Chuck,

    Can you post IGV screenshots of the original bam file and bamout file in the region? https://www.broadinstitute.org/gatk/guide/article?id=5484

    Thanks,
    Sheila

    Hello:

    I posted two screenshots on the ftp site.

    2L12020933.pdf
    2L12022181.pdf
    in screenshots_CHL.zip

    Cheers,
    Chuck

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hey @chlangley, when you post screenshots, can you please just attach them as images in your comments? Having to retrieve them from a zipped file on the ftp takes us more time, we'd prefer to be able to just glance at them in the forum.

  • chlangleychlangley UCDMember

    @Geraldine_VdAuwera said:
    Hey @chlangley, when you post screenshots, can you please just attach them as images in your comments? Having to retrieve them from a zipped file on the ftp takes us more time, we'd prefer to be able to just glance at them in the forum.

    Sure.
    They are: 2L12020933.pdf 2L12022181.pdf

    Issue · Github
    by Sheila

    Issue Number
    373
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Huh, that is weird. You're saying this occurs around many variant calls?

  • chlangleychlangley UCDMember

    Hello:

    Such zero quality reference calls near high quality indels seems to be surprisingly common, certainly greater than 10% of indels.

    We have been looking at these in many haploid Drosophila genomes for months, thinking we must be doing something wrong.

    If you can suggest a promising analysis that could shed light on the source of these I shall try to doing it promptly.

    I imagine that the original recalibrated Illumina quality is not making it into the local de nova assembled sequence (and back to the bp). Perhaps the original bwa mapping quality is entering into calculation of the base quality after local assembly.

    Thanks,
    Chuck

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hmm, I don't think that's the case. Although the original BWA misalignments will admittedly cause mismatches that will be counted as sequencing errors by BQSR, there wouldn't be all that many of them overall compared to the number of bases in your dataset, and there wouldn't be a strong systematic signal that would cause bases at those positions to be so systematically downgraded. You can check the base quals in the pileups at those positions in the bamout but I don't expect you'll see anything striking.

    What version are you using?

  • chlangleychlangley UCDMember

    I am using
    | => gatk --version
    3.4-46-gbc02625

  • chlangleychlangley UCDMember

    Hello:

    @Geraldine_VdAuwera said:
    You can check the base quals in the pileups at those positions in the bamout but I don't expect you'll see anything striking.

    I checked a couple the zero quality bp all have a number of reads in the bamout pileup, all of which have Phred base qualities between 22 and 29. None are near zero.

    What version are you using?

    I am using
    | => gatk --version
    3.4-46-gbc02625

    Thanks,
    Chuck

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ok, thanks. Can you please run a test on a subset of these sites with the latest release and default parameters (i.e. not including unnecessary arguments like indel heterozygosity) and let us know if you still observe this effect?

  • chlangleychlangley UCDMember

    Per your suggestion I call HC in what I could understand is the default mode in this situation.

    gatk \ -T HaplotypeCaller \ -R /home/chuck/shrd/reference_genomes/D_mel_RELEASE6_Sue/dmel_r6.fasta \ -I DPGP3b_ZI99.sorted.dm.rg.recal.bam \ --genotyping_mode DISCOVERY \ --emitRefConfidence BP_RESOLUTION \ --output_mode EMIT_ALL_CONFIDENT_SITES \ --sample_ploidy 1 \ -o DPGP3b_ZI99.2L12-13M.raw.snps.indels.g.dfltParam.vcf \ -L 2L:12000000-13000000 \ -bamout DPGP3b_ZI99.2L12-13M.sorted.dm.rg.recal.dfltParam.bamout.bam &

    The vcf output is almost identical to that of the previous runs.
    The zero-quality reference calls are all there at the same sites.

    Are the parameters sufficiently close to default?

    Cheers,
    Chuck

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @chlangley
    Hi Chuck,

    Can you submit a bug report so we can have a look? Instructions are here.

    Thanks,
    Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @chlangley By default parameters I meant, don't specify any parameter that is not required, except for the ploidy. So basically your command would become:

    gatk \
        -T HaplotypeCaller \
        -R /home/chuck/shrd/reference_genomes/D_mel_RELEASE6_Sue/dmel_r6.fasta \
        -I DPGP3b_ZI99.sorted.dm.rg.recal.bam \
        --sample_ploidy 1 \
        -o DPGP3b_ZI99.2L12-13M.raw.snps.indels.g.dfltParam.vcf \
        -L 2L:12000000-13000000 \
        -bamout DPGP3b_ZI99.2L12-13M.sorted.dm.rg.recal.dfltParam.bamout.bam
    
  • chlangleychlangley UCDMember

    Hello Sheila and Geraldine:

    I just upload two zipped files: chl_Gdefault_call_output.zip AND chl_default_call_output.zip .

    Each contains the suggestion files and info:
    1. Gdefault_2Lnarrow1_call_output.txt AND 2Lnarrow1_call_output.txt contain the commandline followed by the full output.
    2. the input snippet bam files
    3. the input snippet bai files
    4. the output vcf files
    4. the output vcf.idx files
    6. the output bamout.bam files
    7. the output bamout.bai files

    The chl_Gdefault_2Lnarrow1 uses the call proposed by Geraldine above.
    While chl_default_call_output uses my proposed "default" call (above) which does output the problematic zero quality flanking reference calls.

    Cheers,
    Chuck

  • chlangleychlangley UCDMember

    Hello:

    One further check - the bug report:

    We installed the gatk 3.5 update.

    I ran the "2Lnarrow1_call" command line (as submitted to the ftp site in "2Lnarrow1_call_output.txt".

    The output vcf looks like the one from 3.46 - still has the zero-quality reference call around indels.

    Cheers,
    Chuck

    Issue · Github
    by Sheila

    Issue Number
    426
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @chlangley
    Hi Chuck,

    I have received your bug report and will look into it soon.

    Thanks,
    Sheila

  • chlangleychlangley UCDMember

    Hello Sheila:

    The Github and Issue Number links are dead.

    Is there a passive way for me to follow the path of this bug report?

    Cheers,
    Chuck

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Chuck, unfortunately we can't give you access to the private github repo where the issues live. Once the bug report has been verified by our team and entered as a ticket in the developers' queue, it will be visible in the Bug Tracker. Usually this transition happens much faster but the end of year / winter break period is a bad time because we're short-handed and pressed to wrap up higher priority tickets. It may take another week for us to get caught up and finally process your report. Our apologies for the inconvenience.

  • chlangleychlangley UCDMember

    Hello Geraldine and Sheila:

    Nothing in Big Tracker yet. Can you offer any update?

    Cheers,
    Chuck

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @chlangley
    Hi Chuck,

    Sorry for the delay. Can you please upload your reference you used? As soon as you do, I will process the bug report.

    -Sheila

  • chlangleychlangley UCDMember

    Hello:

    I just uploaded a folder:

    | => cd chl/ total 41128 -rw-r--r-- 1 chuck staff 18K Nov 30 2014 dmel_r6.fasta.fai.gz -rw-r--r-- 1 chuck staff 40M Nov 29 2014 dmel_r6.fasta.gz

    onto the GSA ftp server.

    Thanks,
    Chuck

    Issue · Github
    by Sheila

    Issue Number
    1271
    State
    closed
    Last Updated
    Assignee
    Array
    Closed By
    vdauwera
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @chlangley
    Hi Chuck

    Sorry for the delay in processing this. It indeed is a curious case! There were a few odd cases that I found, but for now, I stuck to your original request to determine why the reference calls have GQ 0 when the mapping and base qualities are fine.

    I will update the issue to be in the Public Bug Tracker when I get the greenlight from Geraldine. Also, I will make a push to have someone look at this asap at our next meeting.

    -Sheila

  • chlangleychlangley UCDMember

    Hello Sheila and Geraldine:

    Just checking on my "bug".
    To avoid becoming a pest, maybe you can tell me the etiquette and reasonable expectations.
    You two have the bar pretty high on the support side.

    Thanks,
    Chuck

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @chlangley
    Hi Chuck,

    Unfortunately, I don't have a timeline for you. The developers are swamped with other high priority items, but I will make a push for this to get fixed. As soon as I know something, I will post here.

    Thanks for being patient.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    You can see the issue tracker queue here: https://www.broadinstitute.org/gatk/guide/issue-tracker

    Right now it doesn't show progress but we're planning to add some indicators of whether an issue has been assigned, is being worked on etc.

    It's hard to give any estimates on expectations for turnaround because it's highly dependent on developer bandwidth, which these days are focused on some high-impact projects that will affect a large proportion of users -- to the unavoidable detriment of isolated "weird calls" cases like yours, I'm afraid. I realize it's been a couple of months -- having the holidays in the middle hasn't helped.

    That said we reviewed your issue today; we still don't have a clear idea of what might be happening beyond "it's a messy region" so we'll need a developer to take a closer look at your data. We're hoping to get someone assigned at the next weekly dev review meeting on Thursday.

  • chlangleychlangley UCDMember

    Thanks for the info.
    Cheers,
    Chuck

  • chlangleychlangley UCDMember

    Hello Geraldine:

    Did my bug get assigned to a developer today?

    Thanks,
    Chuck

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @chlangley and @Redmar_van_den_Berg

    Someone is actively looking into this, yes.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @chlangley, the developer assigned to this found that there is most probably a large insertion (big enough to be considered a structural variant) in the problem region. It's beyond HC's ability to call so we have to classify this as a known limitation for now. The devs are going to think of ways to put in some logic to detect this sort of event, but unfortunately at this time there's no way to improve those calls.

    I'd like to write up a short case study article for the docs to document how to diagnose this sort of problem. Do you mind if we use IGV screenshots that were generated during the debugging process in the document? We can redact file names from the screenshot if it helps.

  • chlangleychlangley UCDMember

    Hello Geraldine:

    I am disappointed to hear that nothing can be done to improve these calls. We had a lot riding on the quality of these 'assemblies'.

    Yes, of course, you may use our data and derivatives to document the issue and how to detect it.

    No need to redact names as far as I am concerned.

    BTW: I am somewhat skeptical of the "probably large insertion" explanation and hope to test that and other possibilities with simulated reference sequences and variously generated read data from the dmel ref genome sequence. The frequency of tracks of these zero quality reference calls around most small indels seems inconsistent with the familiar evidence about drosophila genomics (large insertions being rare).

    BTW2: Here is a related algorithmic question about HC that I can not find documented or discussed. Most assemblers rely on voting to 'vote out' rare molecular chimeras (say between two random regions) in the short insert PE data. How does HC deal with such read pairs? And could these be a source of the "large insertions"? As it turns out we have examined these in our sequencing protocol and have a handle on them. Is there by any chance a parameter that effects the way such voting/filtering works?

    Cheers,
    Chuck

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Thanks Chuck. Are these data derived from some new genome assemblies? What we're seeing in your case would actually be more consistent with assembly problems than with a single large insertion (which is what we found in Redmar's case and are treating as a similar problem because from the point of view of HC it's hitting the same logic flaw).

    HC currently doesn't do anything with read pair information at all, except for some logic that prevents overlapping reads from the same pair from counting as double evidence (since they're non-independent observations). Obviously that's kind of a pity since there is clearly some interesting and useful information contained there, and in future I expect we'll add some logic to access that info. In a separate development effort we're working on CNV and SV calling, and once we have a good handle on that we'll hopefully be able to apply some of the lessons we'll have learned working with SVs. The ideal caller would of course take all classes of variation into account at the same time, and I think that is the future -- but at the moment that future is still very much a work in progress!

  • chlangleychlangley UCDMember

    The raw data are 150bp PE Illumina reads from a library prepared from the whole-genome-amplified DNA of a single haploid Drosophila melanogaster embryo. Our comparisons of this protocol to libraries prepared from unamplified DNA showed a very low background level of chimeric molecules (as predicted from experiments in the literature) and an increase in the variance of coverage (both as predicted from experiments in the literature). Based on simply filtered bwa alignments we determined a sufficient depth to give high overall coverage and routine 'vote out' the rare chimeric reads and read pairs.

    What confuses me is the phenomenon of zero quality reference calls occurs near (within 15 bp) of almost all small inserts and deletions. There is no obvious association with read depth (thus interaction with the larger variance in depth). And such zero quality reference calls are not seen around SNPs.

    We are planning are few explorations to test these speculations.

    Cheers,
    Chuck

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ok, let us know what you find.

  • chlangleychlangley UCDMember

    Hello Geraldine:

    In an attempt to both understand better the source of these mysterious ‘zero quality reference calls flanking small indel’ (here after ZQRCs) and hopefully stumble onto a workaround, I have been doing a few experiments that I trust will eliminate potential sources of this ‘known limitation’ and hopefully focus attention on a narrower set options.

    The connection between Redmar’s issue and these annoying ZQRCs is not so obvious to me. His issue may indeed be the result of misassembly. But I am skeptical that the hundreds of thousands of ZQRCs through out every gatk-HC resequencing assembly (>200 independent genomes) are attributable to assembly errors in the sixth generation, gold standard Drosophila genome (dmel6) or caused by ubiquitous undocumented euchromatic structural variants in the natural populations from which the genomes were sampled. Significant efforts have been made to discovering and documenting such things; they are reported to be rare relative to ZQRCs.

    I did two experiments to better understand the source of ZQRCs. The first addresses the potential role of unknown genomic polymorphism. Using sequencing data from libraries made the reference-genome-source stock, ycnbwsp using the our standard haploid embryo protocol, I ran several parameterizations of the BP gatk pipeline (including HC 3.5) with a dwgsim-mutated reference genome. So under these conditions there should be no genomic variation save the SNPs and indels introduced by dwgsim (and the 200 or so known mutations in our stock derived from that used to create dmel6 a few decade back). Alas the familiar ZQRCs flanking the mostly single-bp indels are everywhere in the euchromatic sequence. Thus I conclude these ZQRCs are not due to unknown (structural or otherwise) variation in the target genome or misassembly of dmel6.

    Next I decided to do a little experiment that I imagine must have been done in the development and testing of HC; but maybe not? The idea is to simply map simulated reads back to a reference genome with known SNPs and indels. Note the dwgsim generated reads (40X of 2x146bp) from dmel6 had by design low error rates (10^-3), and neither chimeras nor PCR duplicates. Thus we have a idealized synthetic haploid data that must fit the fundamental assumptions and most primitive goals of the gatk-HC pipeline. Essentially every SNP and indel in the euchromatic genome is discovered and called with high confidence. But alas flanking each indel are ZQRCs.

    Two examples: simple, isolated situations such as this

    2L 12051992 . T . . . GT:AD:DP:GQ:PL 0:29,0:29:99:0,180 2L 12051993 . C . . . GT:AD:DP:GQ:PL 0:29,0:29:99:0,180 2L 12051994 . C . . . GT:AD:DP:GQ:PL 0:30,0:30:99:0,180 2L 12051995 . C . . . GT:AD:DP:GQ:PL 0:31,0:31:99:0,180 2L 12051996 . T . . . GT:AD:DP:GQ:PL 0:31,0:31:99:0,180 2L 12051997 . G . . . GT:AD:DP:GQ:PL 0:31,0:31:89:0,90 2L 12051998 . C . . . GT:AD:DP:GQ:PL 0:32,0:32:44:0,45 2L 12051999 . G . . . GT:AD:DP:GQ:PL 0:32,0:32:0:0,0 2L 12052000 . T . . . GT:AD:DP:GQ:PL 0:32,0:32:0:0,0 2L 12052001 . G . . . GT:AD:DP:GQ:PL 0:34,0:34:0:0,0 2L 12052002 . A . . . GT:AD:DP:GQ:PL 0:34,0:34:0:0,0 2L 12052003 . C . . . GT:AD:DP:GQ:PL 0:34,0:34:0:0,0 2L 12052004 . A AT, 1176.78 . DP=34;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQ=122400.00 GT:AD:DP:GQ:PL:SB 1:0,34,0:34:99:1209,0,1209:0,0,21,13 2L 12052005 . T . . . GT:AD:DP:GQ:PL 0:1,33:34:0:0,0 2L 12052006 . T . . . GT:AD:DP:GQ:PL 0:34,0:34:99:0,1172 2L 12052007 . G . . . GT:AD:DP:GQ:PL 0:34,1:35:99:0,1144 2L 12052008 . G . . . GT:AD:DP:GQ:PL 0:35,0:35:99:0,1213 2L 12052009 . G . . . GT:AD:DP:GQ:PL 0:35,0:35:99:0,1216 2L 12052010 . C . . . GT:AD:DP:GQ:PL 0:35,0:35:99:0,1213 2L 12052011 . C . . . GT:AD:DP:GQ:PL 0:36,0:36:99:0,1244 2L 12052012 . C . . . GT:AD:DP:GQ:PL 0:36,0:36:99:0,1263

    and more complex regions such as this

    2L 12050903 . C . . . GT:AD:DP:GQ:PL 0:48,0:48:99:0,540 2L 12050904 . C . . . GT:AD:DP:GQ:PL 0:48,0:48:99:0,540 2L 12050905 . T . . . GT:AD:DP:GQ:PL 0:46,0:46:99:0,540 2L 12050906 . T . . . GT:AD:DP:GQ:PL 0:46,0:46:99:0,450 2L 12050907 . A G, 1481.81 . DP=43;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQ=154800.00 GT:AD:DP:GQ:PL:SB 1:0,43,0:43:99:1504,0,1504:0,0,20,23 2L 12050908 . C . . . GT:AD:DP:GQ:PL 0:45,0:45:99:0,315 2L 12050909 . A . . . GT:AD:DP:GQ:PL 0:45,0:45:99:0,315 2L 12050910 . C . . . GT:AD:DP:GQ:PL 0:46,0:46:99:0,225 2L 12050911 . T . . . GT:AD:DP:GQ:PL 0:46,0:46:99:0,180 2L 12050912 . G . . . GT:AD:DP:GQ:PL 0:47,0:47:99:0,135 2L 12050913 . G . . . GT:AD:DP:GQ:PL 0:48,0:48:99:0,135 2L 12050914 . T . . . GT:AD:DP:GQ:PL 0:48,0:48:0:0,0 2L 12050915 . G . . . GT:AD:DP:GQ:PL 0:48,0:48:0:0,0 2L 12050916 . C . . . GT:AD:DP:GQ:PL 0:48,0:48:0:0,0 2L 12050917 . T . . . GT:AD:DP:GQ:PL 0:45,0:45:0:0,0 2L 12050918 . G . . . GT:AD:DP:GQ:PL 0:44,0:44:0:0,0 2L 12050919 . C . . . GT:AD:DP:GQ:PL 0:44,0:44:0:0,0 2L 12050920 . T . . . GT:AD:DP:GQ:PL 0:43,0:43:0:0,0 2L 12050921 . G . . . GT:AD:DP:GQ:PL 0:43,0:43:0:0,0 2L 12050922 . C . . . GT:AD:DP:GQ:PL 0:43,0:43:0:0,0 2L 12050923 . T TC, 1530.78 . DP=40;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQ=144000.00 GT:AD:DP:GQ:PL:SB 1:0,40,0:40:99:1563,0,1563:0,0,22,18 2L 12050924 . C . . . GT:AD:DP:GQ:PL 0:2,42:44:0:0,0 2L 12050925 . T . . . GT:AD:DP:GQ:PL 0:44,1:45:99:0,450 2L 12050926 . C . . . GT:AD:DP:GQ:PL 0:46,0:46:99:0,315 2L 12050927 . G . . . GT:AD:DP:GQ:PL 0:46,0:46:99:0,315 2L 12050928 . C . . . GT:AD:DP:GQ:PL 0:46,0:46:99:0,270 2L 12050929 . C . . . GT:AD:DP:GQ:PL 0:46,0:46:99:0,270 2L 12050930 . T . . . GT:AD:DP:GQ:PL 0:47,0:47:99:0,180 2L 12050931 . C . . . GT:AD:DP:GQ:PL 0:47,0:47:0:0,0 2L 12050932 . C . . . GT:AD:DP:GQ:PL 0:48,0:48:0:0,0 2L 12050933 . T . . . GT:AD:DP:GQ:PL 0:47,1:48:0:0,0 2L 12050934 . G . . . GT:AD:DP:GQ:PL 0:48,0:48:0:0,0 2L 12050935 . G . . . GT:AD:DP:GQ:PL 0:48,1:49:0:0,0 2L 12050936 . A . . . GT:AD:DP:GQ:PL 0:46,4:50:0:0,0 2L 12050937 . A C, 2116.81 . DP=46;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQ=165600.00 GT:AD:DP:GQ:PL:SB 1:0,46,0:46:99:2139,0,2139:0,0,24,22 2L 12050938 . G . . . GT:AD:DP:GQ:PL 0:45,3:48:0:0,0 2L 12050939 . A . . . GT:AD:DP:GQ:PL 0:46,4:50:0:0,0 2L 12050940 . A ATG, 2139.78 . DP=48;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQ=172800.00 GT:AD:DP:GQ:PL:SB 1:0,48,0:48:99:2172,0,2172:0,0,25,23 2L 12050941 . A . . . GT:AD:DP:GQ:PL 0:5,45:50:0:0,0 2L 12050942 . G . . . GT:AD:DP:GQ:PL 0:49,0:49:99:0,1125 2L 12050943 . G . . . GT:AD:DP:GQ:PL 0:48,1:49:99:0,1125 2L 12050944 . G . . . GT:AD:DP:GQ:PL 0:49,0:49:99:0,990 2L 12050945 . A . . . GT:AD:DP:GQ:PL 0:48,0:48:99:0,990 2L 12050946 . C . . . GT:AD:DP:GQ:PL 0:49,0:49:99:0,855 2L 12050947 . G . . . GT:AD:DP:GQ:PL 0:50,0:50:99:0,855

    Note the well called SNP (2L 12050937) in the middle of a ZQRC.

    One could increase the read depth in this situation to say 100X and reduce the error rate even further, but it seems likely the ZQRCs would persist. Thus I conclude that the source of ZQRCs is likely a limitation of the HC implementation (perhaps the design).

    I hope this little effort will encourage the gatk developers to consider chasing down the the source of ZQRCs and either squash the offending bug or implement the needed extension or kindly guide me back to the proper use of gatk.

    As you can tell, I am enthusiastic about using gatk-HC to study theoretical predictions about the patterns of genomic variation among Drosophila genomes. Except for these annoying ZQRCs that prospect seems achievable and near. So please take this little offering as constructive and supportive. I look forward to hearing more.

    Cheers,
    Chuck

    Issue · Github
    by Sheila

    Issue Number
    614
    State
    closed
    Last Updated
    Milestone
    Array
    Closed By
    vdauwera
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @chlangley
    Hi Chuck,

    I really appreciate this insight! I will see what I can do to get this prioritized.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Chuck, thanks for this very thorough look at this ZQRC phenomenon. From our end what we see is that this happens in regions where there are a lot of soft-clips that get unclipped by HaplotypeCaller in the course of the reassembly and calling. These are typically generated in the presence of SVs or assembly issues, but it's quite possible that something in the nature of your data or alignment tools leads to even more soft-clips than we expect to see around indels, without necessarily being due to SVs or assembly. The soft-clips are typically resolved by HaplotypeCaller when the real indels are discovered, and it does look like the indel calls you're getting are reasonable -- is that right?

    That leaves us with the problem of dealing with those flanking regions. You are absolutely right that it stems from a limitation of the tool as it works currently. The underlying cause of the problem you observed is known; as I touched on briefly in the issue resolution notes (in the tracker) we realized that when we run the reference confidence evaluation, there is some inconsistency in how the regions with soft-clips are handled (main problem is that instead of looking at the alignment post-reassembly like it does for the variant call, HC evaluates support for the ref call sites in part as they looked before reassembly happened), which leads HC to emit a ref confidence value that is not fully meaningful at those sites. The more soft-clipped reads there are at a site, the more likely you are to end up with these RGQ=0 values.

    This is obviously sub-optimal and is something we're planning to address in the future, within the scope of what is shaping up to be GATK4. HaplotypeCaller is a very complex tool that has evolved a lot since its original design, and hindsight is 20/20 -- so work is underway right now to re-implement HaplotypeCaller in a more deliberately engineered form that will not suffer from this problem. I realize that when you're trying to get work done right now it's cold comfort to hear that a future version will be better -- but we have to prioritize our efforts, and in this case, rather than patching up the existing version (considering the current version's variant calls are reliable, and only a subset of reference calls suffer from this flaw) we'd rather invest in rebuilding to produce a more robust tool entirely.

  • chlangleychlangley UCDMember

    Hello:

    "The soft-clips are typically resolved by HaplotypeCaller when the real indels are discovered, and it does look like the indel calls you're getting are reasonable -- is that right?"

    Yes. With purely simulated data all the variants are being called correctly. The problem seems to be that the soft-clipped reads generated by actual indel are leading HC to declared positions covered by those reads as GQ=0. The handling of the soft-clipped reads HC should take the called indel into account, since when the reads are aligned to the discovered haplotype there are not unaligned bp(s).

    Where can I find the best description of the presently implemented step that yield these ZQRCs?

    Cheers,
    Chuck

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Unfortunately there is no formal documentation on that step. We can point you to the code if you're interested, but it's not easy to ready.

  • chlangleychlangley UCDMember

    Hello:

    Well I shall wander off a bit in search "warm comfort" in an ad hoc hack-filter.

    But as a last comment, we also noted a consistent pattern of 5' bias in the ZQRCs.

    If one looks above at my post on 23.February the is apparent in the two example regions.

    Again without a clear model of the source of ZQRCs I am surprised by this behavior since it does not seem to depend on strand bias in the covering reads.

    Cheers,
    Chuck

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Is there maybe a bias in which strand reads have soft clips? If many of the zqrcs are near repeat regions/str/homopolymers, and considering the indel will be left aligned relative to the repeats, you could end up with more soft clips on one side than the other, with a fairly systematic signature.

  • chlangleychlangley UCDMember

    hello Geraldine:
    Yes, the source of the polarization in the ZQRCs flanking small indels is likely and interaction with the 'left alignment' convention of the local assembler but I doubt it depends much on homopolymers or simple sequences since such polarized ZQRCs are pervasive in the gatk-HC output based on totally simulated fake-mute genome (randomly placed indels of random sequence) and randomly simulated pe reads. We may never know what process yields these ZQRCs flanking small indels if the gatk v4 has a completely reengineered and upgraded HC - no need and no time. In the meantime backer to adhoc filters.
    cheers,
    Chuck

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    As long as they go away, who cares, right? ;)

    Good luck with the filters, let us know how it goes.

  • chlangleychlangley UCDMember

    Hello:

    I just wanted to check back in to ask if it can be confirmed that gatk v4.0 will address the 'limitation' in HC identified in this thread?

    Cheers,
    Chuck

  • chlangleychlangley UCDMember

    Hello Geraldine:

    We are getting ready to start a new project and are wondering if this bug was ever actually squashed in V4.
    Could affirm that it was squashed (which version) or indicate that it is still on the to_do bug list.
    Thanks,
    Chuck

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @chlangley
    Hi Chuck,

    No, it looks like this is still an issue. Not sure if you ever got a blog post draft to Geraldine? Sorry Chuck, but it really is an issue of limited resources.

    -Sheila

Sign In or Register to comment.