We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Mutect2 4.1.4.0 stats file with a negative number

ElenaGrassiElenaGrassi Member
edited November 2019 in Ask the GATK team

Hello, I've just adapted my pipeline to the new filtering strategies, while looking at the files I noticed that for a WGS run I obtained a stats file with a negative number:
[[email protected] biodiversa]>cat mutect/CRC1307LMO.vcf.gz.stats
statistic value
callable -1.538687311E9

Looking around about the meaning of the number I found https://gatkforums.broadinstitute.org/gatk/discussion/24496/regenerating-mutect2-stats-file, so I'm wondering if I should be worried by having a negative number of callable sites :/
What's more puzzling is that FilterMutectCalls after ran without any error.

Before running mutect I used the usual best practices pipeline, then:

gatk Mutect2 -tumor CRC1307LMO -R /archive/home/egrassi/bit/task/annotations/dataset/gnomad/GRCh38.d1.vd1.fa -I align/realigned_CRC1307LMO.bam -O mutect/CRC1307LMO.vcf.gz --germline-resource /archive/home/egrassi/bit/task/annotations/dataset/gnomad/af-only-gnomad.hg38.vcf.gz --f1r2-tar-gz mutect/CRC1307LMO_f1r2.tar.gz --independent-mates 2> mutect/CRC1307LMO.vcf.gz.log

gatk CalculateContamination -I mutect/CRC1307LMO.pileup.table -O mutect/CRC1307LMO.contamination.table --tumor-segmentation mutect/CRC1307LMO.tum.seg 2> mutect/CRC1307LMO.contamination.table.log

gatk LearnReadOrientationModel -I mutect/CRC1307LMO_f1r2.tar.gz -O mutect/CRC1307LMO_read-orientation-model.tar.gz 2> mutect/CRC1307LMO_read-orientation-model.tar.gz.log

gatk FilterMutectCalls -V mutect/CRC1307LMO.vcf.gz -O mutect/CRC1307LMO.filtered.vcf.gz -R /archive/home/egrassi/bit/task/annotations/dataset/gnomad/GRCh38.d1.vd1.fa --stats mutect/CRC1307LMO.vcf.gz.stats --contamination-table mutect/CRC1307LMO.contamination.table --tumor-segmentation=mutect/CRC1307LMO.tum.seg --filtering-stats mutect/CRC1307LMO_filtering_stats.tsv --ob-priors mutect/CRC1307LMO_read-orientation-model.tar.gz 2> mutect/CRC1307LMO_filtering_stats.tsv.log
Tagged:

Issue · Github
by bhanuGandham

Issue Number
6302
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
davidbenjamin

Best Answers

Answers

  • I can confirm that the same happens with a matched normal. I will soon compare calls made with a previous version of gatk (4.1.0.0) to check what changes at that level (the calls comparing 4.1.4.0 without a matched with 4.1.0.0 with a matched were veeery different). In january I will receive a large number of WGS samples and would like to know if I need to re-roll all my pipelines to previous version (and which one) to be on the safe side or not...

  • Ok, number of pass variants with the matched normal:
    4.1.0.0: 31151
    4.1.4.0: 25221
    common: 24906

    The comparison between 4.1.4.0 without matched normal and 4.1.0.0 is scary though:
    4.1.4.0: 43339
    common: 2197
    I was expecting without the matched normal to find more variants (all the germline that the gnomad resource won't be able to filter) but not less variants, does anyone have any idea about why this happens?
    Should I avoid 4.1.4.0 considering the negative number?

  • Hello Bhanu, thank you very much for your answer.

    The negative number is probably due to integer overflow and is potentially a bug. I have created a issue ticket for it here: https://github.com/broadinstitute/gatk/issues/6302. You can track its progress there. Our Mutect2 developer @davidben is trying to get you a jar file with this fix as soon as possible.

    I'll follow up in my answer to his post, thanks!

    Extra info: 4.1.4.0 is more conservative in its calls, however, you can get the old behaviour back by setting the argument --defaultAF to 0. Please note, this change would result in increased false positives.
    Since v4.1.1.0, Mutect2 can detect germline variants even if those variants are not in gnomad, based on the variant's AF. If a matched normal is not provided Mutect2 will look for AF 0.5 to detect a variant as germline even if it isn't in the germline resource.

    Thank you for the insight! The second one is of particular interest to me since we have an experimental design that will end up in many somatic variants at AF ~0.5.
    I was initially wondering if for this reason it would be better for me to use HaplotypeCaller but then while studying the best practices I noticed that the CNNScoreVariants would make it not useful for us so I decided to use Mutect2 (both with matched normal and not), but I did not see this new filter! Does it filter out also AF==1?
    Is it possible to disable it? Otherwise I won't be able to use Mutect2 without a matched normal (I was preferring it to the matched version again for our experimental design), I fear.
    i am looking at Mutect2 options but am not spotting the right one.

  • @davidben said:
    @ElenaGrassi As @bhanuGandham said, we believe the negative value is due to integer overflow. (Since the Java spec does not prescribe a fixed number of bits in an integer, this error only occurs on some machines. I'm guessing that your overflow occurred at 2^31 = 2147483648.) If you want results quickly, you can edit the mutect stats by hand and add 2^32 to the negative value to get 2756279985, which is a reasonable number of callable sites in a WGS sample.

    Yep, I'll try this right away and get back here with the results.
    I'm using GATK docker images on a Centos with Intel® Xeon® Processor E5-2680 v3, but I had the idea that the Java MAX int is not platform dependent (and have stopped using Java some time ago so was not sure about it going smooth over overflows or not)...anyhow you are right:

    System.out.println("max: "+ Integer.MAX_VALUE);

    max: 2147483647

    A pull request for the fix is here: https://github.com/broadinstitute/gatk/pull/6303

    Additionally, you may download a patched GATK jar here: gs://broad-dsde-methods-davidben/gatk-builds/gatk-4.1.4.0-with-overflow-patch.jar

    Please let us know if this does not resolve the issue.

    Thank you very much for the quick fix!
    I'll plug it in a new docker image in the right place and re-run everything and will get back with the result, I haven't the pipeline ready to split the genome (due to memory-n cores constraints in the HPC system I am using I am not sure that this will be beneficial), therefore depending on the queue times this will take most of the weekend.

  • Ok first quick result writing 2147483648 manually in the stats file as @davidben suggests:
    22411 in common between the 155236 (!!!) found with Mutect 4.1.4.0 without a matched normal and the 31151 with the matched normal and Mutect 4.1.0.0, much more reasonable overlap, even if it makes me think about the real need for a matched normal :)

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @ElenaGrassi It is inevitable to get a lot of false positive germline events without a matched normal. The reason for this is that even if one aggressively filters every germline variant that occurs even once in the 150,000 or so exomes in gnomAD, an average person will still exhibit tens of thousands of private germline mutations not seen in gnomAD. Since gnomAD only has about 16,000 genomes, the situation is worse for WGS.

    By default FilterMutectCalls tries to reduce this overwhelming amount of false positives at the expense of sensitivity by trying to model germline events via a diploid het-like allele fraction (ie 1/2 in the absence of CNV). It does not automatically filter these, but rather looks for somatic variation at other allele fractions. If it detects enough somatic variants at lower allele fraction it figures that the tumor purity is low enough that the higher allele fractions are germline. Or at least, more likely to be germline that the low-AF variants.

    This is the right behavior if you want to balance sensitivity and precision. If, however, you want a call set with maximum sensitivity, you can turn this off with -default-af 0 in Mutect2. This tells the tool that any variant not in gnomAD can't possibly be a germline event.

  • @davidben I see, thanks. Basically I'll be obtaining calls (and then obtain differences of sets of calls done on different samples for the same "clonal" tumor, which is why I don't care much about having a correct germline filter) on samples where I expect somatic variants (out of CNV) to have AF 0.5, will using -default-af 0avoid the filter that @bhanuGandham was writing about, too?
    I'm also performing some preliminary experiments where I mix normal samples reads to get some "fake" somatic variants at AF 0.5, could you please comment on your recent preprint (congrats! was waiting for it) sentence: "for the in vitro mixtures weturn off the germline filter and the panel of normals because the true “somatic” mutations are really common germline variants".
    How do you turn off the germline filter? Keeping both "PASS" and "germline" calls after filtering? Using the new experimental argument "--genotype-germline-sites"?
    Or maybe using a fake "germline resource" without any variants and -default-af 0?
    My initial plan was to skip the --germline-resource but keep the PON, to remove technical artifacts, am I wrong?

    (I enqueued the patched jar jor, hopefully can report you about the fix for the overflow issue early next week).

  • Ok, I can confirm that the jar that with the bugfix fixes the issue :)

    [[email protected] biodiversa]>cat mutect/CRC1307LMO.vcf.gz.stats
    statistic       value
    callable        2.756279985E9
    

    Total calls 155236, common with paired old version: 22268

    @davidben could you tell me how to turn off filters and why the PON should be off?

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @ElenaGrassi The fake somatic variants were obtained from mixing 5, 10, and 20 samples with known germline variants in vitro, and then sequencing. We only use this validation to measure sensitivity, although there's no fundamental reason it can't be used for specificity as well.

    We turn off the germline filter by omitting a --germline-resource and by setting --default-af 0. We omit the panel of normals as well because until a few months ago CreateSomaticPanelOfNormals emitted germline variants along with technical artifacts. Once we test the new, germline-excluding version more we will be able to keep the panel of normals in this validation.

    By the way, we recently developed a new somatic truth data set derived from real somatic variants: https://www.biorxiv.org/content/10.1101/825042v1. The data are publicly available.

  • Ok, thank you very much!

Sign In or Register to comment.