The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

#### ☞ Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks (  ) each to make a code block as demonstrated here.

GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

# Significant difference in VQSR or VariantAnnotation between v1.6 and v2.2-10

Posts: 14
edited November 2012

Hi,

I observed a significant difference of the variant call sets from the same exomes between v1.6 and v2.2(-10).
In fact, I observed a significant decrease in the overall novel TiTv in the latter call sets from around 2.6 to 2.1 at TruthSensitivity threshold at 99.0.
When I looked at a sample to compare variant sites using VariantEval, it showed that

Filter JexlExpression Novelty nTi nTv tiTvRatio
called Intersection known 14624 4563 3.2
called Intersection novel 856 312 2.74
called filterIngatk22-gatk16 known 264 132 2
called filterIngatk22-gatk16 novel 28 18 1.56
called gatk16 known 3 1 3
called gatk16 novel 1 1 1
called gatk22-filterIngatk16 known 258 94 2.74
called gatk22-filterIngatk16 novel 144 425 0.34
called gatk22 known 2 2 1
called gatk22 novel 17 30 0.57
filtered FilteredInAll known 1344 649 2.07
filtered FilteredInAll novel 1076 1642 0.66

The novel TiTv of new calls in v2.2 not found in v1.6 or called in v2.2 but filtered in v1.6 demonstrated novel TiTv around 0.5. So I suspect that VQSLOD scoring (or ranking) of SNPs was changed substantially in somewhat an unfavorable way.

The major updates in v2.2 affecting my result were BQSRv2, ReduceReads, UG and VariantAnnotation. (Too many things to pin-point the culprit...)
The previous BAM processing and variant calls were made using v1.6.
For the new call set, I used v2.1-9 (so after serious bug fix in ReduceReads, thank you for the fix) for BQSRv2 and ReduceReads and v2.2-10 for UG and VQSR.

As a first clue, I found that distribution of FS values changed dramatically from the v1.6 (please see attached plots). Although I recognized that FS value calculations were recently updated, the distribution of previous FS values (please see attached) makes more sense for me because the current FS values do not seem to provide us information to classify true positives and false positives.

Katsuhito

Tagged:

• Posts: 14

Let me try it. Thanks.

• Posts: 14

Hi Geraldine,

I've just realized that I also have a variant call set that were called and filtered by using GATK v2.1-9 even though this set included other samples as well. I'm attaching a figure that compares 3 versions of FS vs QD plots. I think it suggests that ReduceReads is fine but FS calculations are different between v2.1-9 and v2.2-10.

Thanks

OK, that makes sense -- we recently made some changes to how the FS is calculated; seems to be a likely culprit. We'll take a closer look and get back to you once we have a clear idea of the problem.

Geraldine Van der Auwera, PhD

Hi Katsuhito, could you please submit a detailed bug report as explained here? Thanks!

Geraldine Van der Auwera, PhD

• Posts: 14

Hi Geraldine,

For this issue, it's ambiguous which region I should select because I didn't see any error messages and FS values differ across many (if not all) regions as can be seen in the figure attached previously. Do you think that some portion of VCF files for the sites shared between callsets v1.6 and v2.1-10 would be enough to see the difference or do you really need snippets of BAMs? Let me know.

Thanks

Ah, fair point. We're definitely going to need BAM snippets, preferably containing regions where you find some different sites and some shared sites, if you can identify a region of reasonable length that contains all that.

Geraldine Van der Auwera, PhD

• Posts: 14

Hi Geraldine,

I've uploaded the requested BAM snippets, named bam_for_FS_check.tar.gz.
I confirmed that v2.1-9 and v2.2-10 UGs produce very different FS values.
Hope it helps to solve the issue.
Thanks

It looks like you've uploaded an invalid bam file.

samtools view bam_for_FS_check.bam
`

Can you please upload a valid bam file? Also, if you are not using a standard reference from our bundle, I'll need you to upload the reference files (fasta, fai, and dict) too.

Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

• Posts: 14

Let me check it again. I could run variant calling for the bam file with the same name.
Thanks (PS. I'm using humang_g1k_v37_decoy.fasta in the bundle.)

• Posts: 14

Sorry, no, I uploaded a wrong one.

• Posts: 14

I've uploaded the file (with the same name, overwritten). Thanks

• Posts: 14

Ok, in that case, another questions come to my mind: can the FisherStrand annotation from v2.1-9 be reliable? Does this issue affect only to v2.2 series? Thanks