The current GATK version is 3.8-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Get notifications!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

Got a problem?

1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

Did we ask for a bug report?

Then follow instructions in Article#1894.

Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block as demonstrated here.

Jump to another community
Download the latest Picard release at
GATK version 4.beta.3 (i.e. the third beta release) is out. See the GATK4 beta page for download and details.

Bug UnifiedGenotyper for INDELs - last line from the indels vcf file is ignored

ClaraClara UKMember
edited October 2013 in Ask the GATK team

I think there is a bug somewhere, I am running GATK - UnifiedGenotyper with the following parameters:
java GenomeAnalysisTK.jar -T UnifiedGenotyper -glm INDEL -R ../path/to/hs37d5.fa -I /chr17-pooled.list --alleles /path/17:81000001-81195210.indels.vcf.gz -L 17:81000001-81195210 -U LENIENT_VCF_PROCESSING -baq CALCULATE_AS_NECESSARY -gt_mode GENOTYPE_GIVEN_ALLELES -out_mode EMIT_ALL_SITES --standard_min_confidence_threshold_for_calling 4.0 --standard_min_confidence_threshold_for_emitting 4.0 -l INFO -A QualByDepth -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A InbreedingCoeff -A Coverage -o /path/17:81000001-81195210.aindels.vcf.gz.part.vcf.gz

I am getting the following error at the next step, when I try to apply the recalibration:

java -jar /nfs/users/nfs_m/mercury/src/GenomeAnalysisTK-2.7-2-g6bda569/GenomeAnalysisTK.jar -T ApplyRecalibration -U LENIENT_VCF_PROCESSING --ts_filter_level 90.00 --mode INDEL -R /path/to/hs37d5.fa --input /path/to/somevcf.gz -recalFile /path/to/vqsr.sites.indels.vcf.gz -tranchesFile /path/to/vqsr.sites.indels.tranches -o /path/to/17.vqsr.vcf.gz.part.vcf.gz

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.7-2-g6bda569):
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR MESSAGE: Encountered input variant which isn't found in the input recal file. Please make sure VariantRecalibrator and ApplyRecalibration were run on the same set of input variants. First seen at: [VC input @ 17:81195149-81195151 Q51.40 of type=INDEL alleles=[GTT*, GTTT] attr={AC=1, AN=14, DP=27, DP4=[6, 12, 0, 3], HWE=1.000000e+00, INDEL=true, IS=[3, 1.000000], MDV=3, MQ=28, PV4=[0.53, 1, 1, 0.0027], QBD=2.447886e+00, VDB=1.224300e-05} GT=[[EGAN00001096474 ./. GQ 0 DP 0 PL 0,0,0 {DV=0, SP=0}],[EGAN00001096477 ./.

Though the actual problem seems to be a step before this, when running the unified genotyper (as stated above) - and I actually managed to reproduce it:

The very last lines in the (--allele) file /path/17:81000001-81195210.indels.vcf.gz are:

17 81195128 . AGGG AGGGG 58.6 . AN=12;AC=2;INDEL;IS=3,1;DP=57;VDB=0.188545;PV4=0.025,1,1,0.037;DP4=3,12,3,0;MQ=25;QBD=3.25359;MDV=3;HWE=0.0909091 17 81195149 . GTT GTTT 51.4 . AN=14;AC=1;INDEL;IS=3,1;DP=27;VDB=1.2243e-05;PV4=0.53,1,1,0.0027;DP4=6,12,0,3;MQ=28;QBD=2.44789;MDV=3;HWE=1

The very last lines of the output file (/path/17:81000001-81195210.aindels.vcf.gz.part.vcf.gz) are:

17 81195128 . AGGG AGGGG 53.96 . AC=2;AF=0.091;AN=22;BaseQRankSum=0.532;DP=96;FS=24.358;InbreedingCoeff=0.5200;MLEAC=1;MLEAF=0.045;MQ=18.64;MQ0=0;MQRankSum=2.599;QD=17.99;RPA=3,4;RU=G;ReadPosRankSum=-1.910;STR GT:AD:DP:GQ:PL ./. ./. ./. 0/0:3,0:3:9:0,9,102 ./. ./. ./. 0/0:3,0:3:9:0,9,105 ./. ./. ./. ./. ./. ./. 0/0:3,0:3:9:0,9,101 ./. ./. .............

Therefore the last location within this region on chromosome 17 is ignored, hence my error.
This happened to me only on this chromosome, only in this region. I am running the UnifiedGenotyper for the whole genome in the same way, so I assume this might be a bug in GATK somewhere, or is there something that I am missing/doing wrong?

Thank you in advance!

Post edited by Clara on


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi Clara, can you please post the entire output (from when the GATK job started)?

  • ClaraClara UKMember

    Ok, I edited my question, and posted the entire output. So the error appears at the ApplyRecalibration step, but the cause of it - in my opinion - is while running the UnifiedGenotyper. Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Alright, let me first take a moment to discuss how you originally posted your question. I understand what you were trying to do, but that's just not how this works. You can't just say that you ran one command line then post the output of something else. Fortunately I could tell that that was probably what you were doing because the error mentions a recal file. But in other cases it may not be so obvious, and I might spend time puzzling over the question and trying to figure out what went wrong with the program. That would be a huge waste of my time, and meanwhile many other people need help. So please don't do that again. If there is a multi-step process and you think an earlier step caused a problem in a later step, you should write out the whole thing; all command lines involved and a summary of results. Feel free to speculate about what might be wrong, but don't just present truncated information. It is not helpful. Your edited version: much better, although you're still skipping the BaseRecalibrator step.

    Now, could you please post, in order, the actual command lines that you ran at each step (UG, BaseRecalibrator, ApplyRecalibration)? I need to see exactly how the inputs and outputs match up.

    The reason is that while I believe you're right that the last variant in your alleles file is not getting output by the UG, I don't think that is actually the cause of your error you're experiencing in ApplyRecalibration. I think you may have mixed up some inputs and outputs. Part of the reason is that the variant recalibration tools should never even see the missing site anywhere, so would have no reason to complain about it, unless your inputs aren't matching up. The other part is that you have -recalFile /path/to/vqsr.sites.indels.vcf.gz in your cmd line which doesn't make sense; a recal file should not be a VCF, it's a text file containing tables, and typically has .recal or .grp as extension, if you've followed our conventions properly. So I suspect that there is an error in your command lines.

  • ClaraClara UKMember

    Hi Geraldine,
    First of all, I apologise for confusing you and posting truncated information. It was not my intention to waste your time and I am really sorry if I did so.
    So there are 2 different things going on here and it is my mistake that I mixed them up. The reason for my post has to do only with the UnifiedGenotyper, and not with the recalibration step - I don't seem to have the permission to edit the post any more, otherwise I would have deleted the recalibration part from it and rewrote from scratch my question.

    So let's just ignore the recalibration step entirely for now and focus only on the input and output of the UnifiedGenotyper please. I managed to isolate the problem and I have created a simple test case - it seems that the issue has to do with this specific indel that doesn't get called and not with the fact that it was the last one in the file. According to my understanding, when running GATK UnifiedGenotyper in the given allele mode for all sites, there should be called variants for all the sites given in the alleles file. This doesn't seem to happen for this particular indel, which is present in the alleles.vcf file but doesn't appear in the output file. I have attached an archive containing the bam file with the sequence, the alleles.vcf and the result.vcf file, obtained after running the following command:

    java -jar /path/GenomeAnalysisTK.jar -T UnifiedGenotyper -glm INDEL -R /path/1000Genomes_hs37d5/hs37d5.fa -I /path/pooled.list --alleles /path/alleles.vcf -L 17:81000001-81195210 -U LENIENT_VCF_PROCESSING -baq CALCULATE_AS_NECESSARY -gt_mode GENOTYPE_GIVEN_ALLELES -out_mode EMIT_ALL_SITES --standard_min_confidence_threshold_for_calling 4.0 --standard_min_confidence_threshold_for_emitting 4.0 -l INFO -A QualByDepth -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A InbreedingCoeff -A Coverage -o /path/result.vcf


    • pooled.list is a file containing only one line - the path to the bam file used for testing.

    • the alleles.vcf file containes the problematic site -- 81195149 :

    #CHROM POS ID REF ALT QUAL FILTER INFO 17 81195149 . GTT GTTT 51.4 . AN=14;AC=1;INDEL;IS=3,1;DP=27;VDB=1.2243e-05;PV4=0.53,1,1,0.0027;DP4=6,12,0,3;MQ=28;QBD=2.44789;MDV=3;HWE=1 17 81195152 . A T 10 . AN=123;AC=1
    and another site which I added for checking that the variant calling actually happens -- 81195152.

    • the result.vcf file contains the output of the calling:

    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT EGA.... 17 81195152 . A T . . DP=58;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MQ=16.61;MQ0=37 GT ./. ./. ...

    and you can see it here that the 81195149 location is missing from this result.vcf.

    I hope this time I explained my problem better, and once again, I apologise!
    Thanks a lot for your help!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hmm, what happens if you delete the SNP at 17:81195152 from your alleles file, and run again on the same snippet? I wonder if it's an issue of one of them masking out the other.

  • ClaraClara UKMember

    I tried that as well, the output vcf doesn't contain any site.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    I think this may be a quirk of how the UnifiedGenotyper processes indels in GGA/emit all sites mode. The HaplotypeCaller should handle this better -- have you tried running HC on this?

Sign In or Register to comment.