GATK licensing moves to direct-through-Broad model -- read about it on the GATK blog

Significant difference in VQSR or VariantAnnotation between v1.6 and v2.2-10

kyasunokyasuno Posts: 14Member
edited November 2012 in Ask the GATK team

Hi,

I observed a significant difference of the variant call sets from the same exomes between v1.6 and v2.2(-10).
In fact, I observed a significant decrease in the overall novel TiTv in the latter call sets from around 2.6 to 2.1 at TruthSensitivity threshold at 99.0.
When I looked at a sample to compare variant sites using VariantEval, it showed that

Filter JexlExpression Novelty nTi nTv tiTvRatio
called Intersection known 14624 4563 3.2
called Intersection novel 856 312 2.74
called filterIngatk22-gatk16 known 264 132 2
called filterIngatk22-gatk16 novel 28 18 1.56
called gatk16 known 3 1 3
called gatk16 novel 1 1 1
called gatk22-filterIngatk16 known 258 94 2.74
called gatk22-filterIngatk16 novel 144 425 0.34
called gatk22 known 2 2 1
called gatk22 novel 17 30 0.57
filtered FilteredInAll known 1344 649 2.07
filtered FilteredInAll novel 1076 1642 0.66

The novel TiTv of new calls in v2.2 not found in v1.6 or called in v2.2 but filtered in v1.6 demonstrated novel TiTv around 0.5. So I suspect that VQSLOD scoring (or ranking) of SNPs was changed substantially in somewhat an unfavorable way.

The major updates in v2.2 affecting my result were BQSRv2, ReduceReads, UG and VariantAnnotation. (Too many things to pin-point the culprit...)
The previous BAM processing and variant calls were made using v1.6.
For the new call set, I used v2.1-9 (so after serious bug fix in ReduceReads, thank you for the fix) for BQSRv2 and ReduceReads and v2.2-10 for UG and VQSR.

As a first clue, I found that distribution of FS values changed dramatically from the v1.6 (please see attached plots). Although I recognized that FS value calculations were recently updated, the distribution of previous FS values (please see attached) makes more sense for me because the current FS values do not seem to provide us information to classify true positives and false positives.

Thanks in advance.
Katsuhito

comp_v1.6_v2.2_FS.png
1500 x 529 - 242K
Post edited by kyasuno on

Best Answers

Answers

  • kyasunokyasuno Posts: 14Member

    Let me try it. Thanks.

  • kyasunokyasuno Posts: 14Member

    Hi Geraldine,

    I've just realized that I also have a variant call set that were called and filtered by using GATK v2.1-9 even though this set included other samples as well. I'm attaching a figure that compares 3 versions of FS vs QD plots. I think it suggests that ReduceReads is fine but FS calculations are different between v2.1-9 and v2.2-10.

    Thanks

    comp_v1.6_v2.2_v2.1.png
    1500 x 1110 - 374K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    OK, that makes sense -- we recently made some changes to how the FS is calculated; seems to be a likely culprit. We'll take a closer look and get back to you once we have a clear idea of the problem.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    Hi Katsuhito, could you please submit a detailed bug report as explained here? Thanks!

    Geraldine Van der Auwera, PhD

  • kyasunokyasuno Posts: 14Member

    Hi Geraldine,

    For this issue, it's ambiguous which region I should select because I didn't see any error messages and FS values differ across many (if not all) regions as can be seen in the figure attached previously. Do you think that some portion of VCF files for the sites shared between callsets v1.6 and v2.1-10 would be enough to see the difference or do you really need snippets of BAMs? Let me know.

    Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    Ah, fair point. We're definitely going to need BAM snippets, preferably containing regions where you find some different sites and some shared sites, if you can identify a region of reasonable length that contains all that.

    Geraldine Van der Auwera, PhD

  • kyasunokyasuno Posts: 14Member

    Hi Geraldine,

    I've uploaded the requested BAM snippets, named bam_for_FS_check.tar.gz.
    I confirmed that v2.1-9 and v2.2-10 UGs produce very different FS values.
    Hope it helps to solve the issue.
    Thanks

  • ebanksebanks Broad InstitutePosts: 684Member, Administrator, GATK Dev, Broadie, Moderator, DSDE Member, GP Member admin

    It looks like you've uploaded an invalid bam file.

    samtools view bam_for_FS_check.bam
       [bam_header_read] invalid BAM binary header (this is not a BAM file).
       [main_samview] fail to read the header from "bam_for_FS_check.bam".
    

    Can you please upload a valid bam file? Also, if you are not using a standard reference from our bundle, I'll need you to upload the reference files (fasta, fai, and dict) too.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • kyasunokyasuno Posts: 14Member

    Let me check it again. I could run variant calling for the bam file with the same name.
    Thanks (PS. I'm using humang_g1k_v37_decoy.fasta in the bundle.)

  • kyasunokyasuno Posts: 14Member

    Sorry, no, I uploaded a wrong one.

  • kyasunokyasuno Posts: 14Member

    I've uploaded the file (with the same name, overwritten). Thanks

  • kyasunokyasuno Posts: 14Member

    Ok, in that case, another questions come to my mind: can the FisherStrand annotation from v2.1-9 be reliable? Does this issue affect only to v2.2 series? Thanks

  • ebanksebanks Broad InstitutePosts: 684Member, Administrator, GATK Dev, Broadie, Moderator, DSDE Member, GP Member admin

    Yes. You should be able to use the VariantAnnotator (with -A FisherStrand) to re-annotate that value using the older version.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • kyasunokyasuno Posts: 14Member

    That sounds good. Thank you very much!

Sign In or Register to comment.