The current GATK version is 3.3-0

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Powered by Vanilla. Made with Bootstrap.

# Significant difference in VQSR or VariantAnnotation between v1.6 and v2.2-10

Posts: 14Member
edited November 2012

Hi,

I observed a significant difference of the variant call sets from the same exomes between v1.6 and v2.2(-10). In fact, I observed a significant decrease in the overall novel TiTv in the latter call sets from around 2.6 to 2.1 at TruthSensitivity threshold at 99.0. When I looked at a sample to compare variant sites using VariantEval, it showed that

Filter JexlExpression Novelty nTi nTv tiTvRatio
called Intersection known 14624 4563 3.2
called Intersection novel 856 312 2.74
called filterIngatk22-gatk16 known 264 132 2
called filterIngatk22-gatk16 novel 28 18 1.56
called gatk16 known 3 1 3
called gatk16 novel 1 1 1
called gatk22-filterIngatk16 known 258 94 2.74
called gatk22-filterIngatk16 novel 144 425 0.34
called gatk22 known 2 2 1
called gatk22 novel 17 30 0.57
filtered FilteredInAll known 1344 649 2.07
filtered FilteredInAll novel 1076 1642 0.66

The novel TiTv of new calls in v2.2 not found in v1.6 or called in v2.2 but filtered in v1.6 demonstrated novel TiTv around 0.5. So I suspect that VQSLOD scoring (or ranking) of SNPs was changed substantially in somewhat an unfavorable way.

The major updates in v2.2 affecting my result were BQSRv2, ReduceReads, UG and VariantAnnotation. (Too many things to pin-point the culprit...) The previous BAM processing and variant calls were made using v1.6. For the new call set, I used v2.1-9 (so after serious bug fix in ReduceReads, thank you for the fix) for BQSRv2 and ReduceReads and v2.2-10 for UG and VQSR.

As a first clue, I found that distribution of FS values changed dramatically from the v1.6 (please see attached plots). Although I recognized that FS value calculations were recently updated, the distribution of previous FS values (please see attached) makes more sense for me because the current FS values do not seem to provide us information to classify true positives and false positives.

Thanks in advance. Katsuhito

Post edited by kyasuno on
Tagged:

## Answers

• Posts: 14Member

Let me try it. Thanks.

• Posts: 14Member

Hi Geraldine,

I've just realized that I also have a variant call set that were called and filtered by using GATK v2.1-9 even though this set included other samples as well. I'm attaching a figure that compares 3 versions of FS vs QD plots. I think it suggests that ReduceReads is fine but FS calculations are different between v2.1-9 and v2.2-10.

Thanks

• Posts: 6,672Administrator, GATK Developer admin

OK, that makes sense -- we recently made some changes to how the FS is calculated; seems to be a likely culprit. We'll take a closer look and get back to you once we have a clear idea of the problem.

Geraldine Van der Auwera, PhD

• Posts: 6,672Administrator, GATK Developer admin

Hi Katsuhito, could you please submit a detailed bug report as explained here? Thanks!

Geraldine Van der Auwera, PhD

• Posts: 14Member

Hi Geraldine,

For this issue, it's ambiguous which region I should select because I didn't see any error messages and FS values differ across many (if not all) regions as can be seen in the figure attached previously. Do you think that some portion of VCF files for the sites shared between callsets v1.6 and v2.1-10 would be enough to see the difference or do you really need snippets of BAMs? Let me know.

Thanks

• Posts: 6,672Administrator, GATK Developer admin

Ah, fair point. We're definitely going to need BAM snippets, preferably containing regions where you find some different sites and some shared sites, if you can identify a region of reasonable length that contains all that.

Geraldine Van der Auwera, PhD

• Posts: 14Member

Hi Geraldine,

I've uploaded the requested BAM snippets, named bam_for_FS_check.tar.gz. I confirmed that v2.1-9 and v2.2-10 UGs produce very different FS values. Hope it helps to solve the issue. Thanks

• Posts: 684GATK Developer mod

It looks like you've uploaded an invalid bam file.

samtools view bam_for_FS_check.bam
[bam_header_read] invalid BAM binary header (this is not a BAM file).
[main_samview] fail to read the header from "bam_for_FS_check.bam".


Can you please upload a valid bam file? Also, if you are not using a standard reference from our bundle, I'll need you to upload the reference files (fasta, fai, and dict) too.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 14Member

Let me check it again. I could run variant calling for the bam file with the same name. Thanks (PS. I'm using humang_g1k_v37_decoy.fasta in the bundle.)

• Posts: 14Member

Sorry, no, I uploaded a wrong one.

• Posts: 14Member

I've uploaded the file (with the same name, overwritten). Thanks

• Posts: 14Member

Ok, in that case, another questions come to my mind: can the FisherStrand annotation from v2.1-9 be reliable? Does this issue affect only to v2.2 series? Thanks

• Posts: 684GATK Developer mod

Yes. You should be able to use the VariantAnnotator (with -A FisherStrand) to re-annotate that value using the older version.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 14Member

That sounds good. Thank you very much!

Sign In or Register to comment.