
VariantRecalibration, numBadVariants, and size of the data set

pdexheimer Member, Dev Posts: 543 ✭✭✭✭

I'm somewhat struggling with the new negative training model in 2.7. Specifically, this paragraph in the FAQ causes me trouble:

Finally, please be advised that while the default recommendation for --numBadVariants is 1000, this value is geared for smaller datasets. This is the number of the worst scoring variants to use when building the model of bad variants. If you have a dataset that's on the large side, you may need to increase this value considerably, especially for SNPs.

And so I keep thinking about how to scale it with my dataset, and I keep wanting to just make it a percentage of the total variants, which is of course the behavior that was removed! In the Version History for 2.7, you say:

Because of how relative amounts of good and bad variants tend to scale differently with call set size, we also realized it was a bad idea to have the selection of bad variants be based on a percentage (as it has been until now) and instead switched it to a hard number

Can you comment a little further about how it scales? I'm assuming it's non-linear, and my intuition would be that smaller sets have proportionally more bad variants. Is that what you've seen? Do you have any other observations that could help guide selection of that parameter?
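
To make the trade-off concrete, here is a rough sketch of the two selection strategies being compared: a fixed number of worst-scoring variants versus a percentage of the call set. It is only an illustration, not GATK's actual code. The function names, the 3% figure, and the call-set sizes are made up for the example, and the scores are placeholders rather than real model scores. Only the value 1000 comes from the FAQ text quoted above.

```python
# Illustrative only: this is NOT how GATK implements negative training set selection.

def worst_by_count(scores, num_bad):
    """Return the num_bad lowest scores (a fixed-size set of 'bad' variants)."""
    return sorted(scores)[:num_bad]


def worst_by_percent(scores, percent):
    """Return the lowest `percent` percent of scores (size grows with the call set)."""
    n = len(scores) * percent // 100  # integer arithmetic avoids float rounding
    return sorted(scores)[:n]


# Hypothetical call-set sizes; range() stands in for real per-variant scores.
for total in (10_000, 100_000, 1_000_000):
    scores = range(total)
    fixed = len(worst_by_count(scores, 1000))    # 1000 is the quoted default
    scaled = len(worst_by_percent(scores, 3))    # 3 percent is an assumed value
    share = 100 * fixed / total
    print(f"{total:>9} variants: fixed={fixed} ({share:.2f}% of set), percent-based={scaled}")
```

With a fixed count, the share of the call set used for the bad model shrinks as the call set grows, which is presumably why the FAQ suggests raising --numBadVariants for large datasets. How far to raise it is exactly the open question here.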

Best Answer


  • pdexheimer Member, Dev Posts: 543 ✭✭✭✭

    Ahh, that would be cool. I'm just now moving my production pipelines to 2.7, maybe I'll hold off for a bit longer. I should be able to scare up a dataset or two I could test with, please do let me know when it appears in the nightlies. Thanks!
