The current GATK version is 3.4-46

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# Bug UnifiedGenotyper for INDELs - last line from the indels vcf file is ignored

UKPosts: 4Member
edited October 2013

I think there is a bug somewhere, I am running GATK - UnifiedGenotyper with the following parameters:

java GenomeAnalysisTK.jar
-T UnifiedGenotyper
-glm INDEL
-R ../path/to/hs37d5.fa
-I /chr17-pooled.list
--alleles /path/17:81000001-81195210.indels.vcf.gz
-L 17:81000001-81195210
-U LENIENT_VCF_PROCESSING
-baq CALCULATE_AS_NECESSARY
-gt_mode GENOTYPE_GIVEN_ALLELES
-out_mode EMIT_ALL_SITES
--standard_min_confidence_threshold_for_calling 4.0
--standard_min_confidence_threshold_for_emitting 4.0
-l INFO
-A QualByDepth
-A HaplotypeScore
-A MappingQualityRankSumTest
-A FisherStrand
-A InbreedingCoeff
-A Coverage
-o /path/17:81000001-81195210.aindels.vcf.gz.part.vcf.gz

I am getting the following error at the next step, when I try to apply the recalibration:

java -jar /nfs/users/nfs_m/mercury/src/GenomeAnalysisTK-2.7-2-g6bda569/GenomeAnalysisTK.jar
-T ApplyRecalibration
-U LENIENT_VCF_PROCESSING
--ts_filter_level 90.00
--mode INDEL
-R /path/to/hs37d5.fa
--input /path/to/somevcf.gz
-recalFile /path/to/vqsr.sites.indels.vcf.gz
-tranchesFile /path/to/vqsr.sites.indels.tranches
-o /path/to/17.vqsr.vcf.gz.part.vcf.gz
##### ERROR MESSAGE: Encountered input variant which isn't found in the input recal file. Please make sure VariantRecalibrator and ApplyRecalibration were run on the same set of input variants. First seen at: [VC input @ 17:81195149-81195151 Q51.40 of type=INDEL alleles=[GTT*, GTTT] attr={AC=1, AN=14, DP=27, DP4=[6, 12, 0, 3], HWE=1.000000e+00, INDEL=true, IS=[3, 1.000000], MDV=3, MQ=28, PV4=[0.53, 1, 1, 0.0027], QBD=2.447886e+00, VDB=1.224300e-05} GT=[[EGAN00001096474 ./. GQ 0 DP 0 PL 0,0,0 {DV=0, SP=0}],[EGAN00001096477 ./.

Though the actual problem seems to be a step before this, when running the unified genotyper (as stated above) - and I actually managed to reproduce it:

The very last lines in the (--allele) file /path/17:81000001-81195210.indels.vcf.gz are:

17      81195128        .       AGGG    AGGGG   58.6    .       AN=12;AC=2;INDEL;IS=3,1;DP=57;VDB=0.188545;PV4=0.025,1,1,0.037;DP4=3,12,3,0;MQ=25;QBD=3.25359;MDV=3;HWE=0.0909091
17      81195149        .       GTT     GTTT    51.4    .       AN=14;AC=1;INDEL;IS=3,1;DP=27;VDB=1.2243e-05;PV4=0.53,1,1,0.0027;DP4=6,12,0,3;MQ=28;QBD=2.44789;MDV=3;HWE=1

The very last lines of the output file (/path/17:81000001-81195210.aindels.vcf.gz.part.vcf.gz) are:

17 81195128 . AGGG AGGGG 53.96 . AC=2;AF=0.091;AN=22;BaseQRankSum=0.532;DP=96;FS=24.358;InbreedingCoeff=0.5200;MLEAC=1;MLEAF=0.045;MQ=18.64;MQ0=0;MQRankSum=2.599;QD=17.99;RPA=3,4;RU=G;ReadPosRankSum=-1.910;STR GT:AD:DP:GQ:PL ./. ./. ./. 0/0:3,0:3:9:0,9,102 ./. ./. ./. 0/0:3,0:3:9:0,9,105 ./. ./. ./. ./. ./. ./. 0/0:3,0:3:9:0,9,101 ./. ./. .............

Therefore the last location within this region on chromosome 17 is ignored, hence my error.
This happened to me only on this chromosome, only in this region. I am running the UnifiedGenotyper for the whole genome in the same way, so I assume this might be a bug in GATK somewhere, or is there something that I am missing/doing wrong?

Post edited by Clara on
Tagged:

Hi Clara, can you please post the entire output (from when the GATK job started)?

Geraldine Van der Auwera, PhD

• UKPosts: 4Member

Ok, I edited my question, and posted the entire output. So the error appears at the ApplyRecalibration step, but the cause of it - in my opinion - is while running the UnifiedGenotyper. Thanks!

Alright, let me first take a moment to discuss how you originally posted your question. I understand what you were trying to do, but that's just not how this works. You can't just say that you ran one command line then post the output of something else. Fortunately I could tell that that was probably what you were doing because the error mentions a recal file. But in other cases it may not be so obvious, and I might spend time puzzling over the question and trying to figure out what went wrong with the program. That would be a huge waste of my time, and meanwhile many other people need help. So please don't do that again. If there is a multi-step process and you think an earlier step caused a problem in a later step, you should write out the whole thing; all command lines involved and a summary of results. Feel free to speculate about what might be wrong, but don't just present truncated information. It is not helpful. Your edited version: much better, although you're still skipping the BaseRecalibrator step.

Now, could you please post, in order, the actual command lines that you ran at each step (UG, BaseRecalibrator, ApplyRecalibration)? I need to see exactly how the inputs and outputs match up.

The reason is that while I believe you're right that the last variant in your alleles file is not getting output by the UG, I don't think that is actually the cause of your error you're experiencing in ApplyRecalibration. I think you may have mixed up some inputs and outputs. Part of the reason is that the variant recalibration tools should never even see the missing site anywhere, so would have no reason to complain about it, unless your inputs aren't matching up. The other part is that you have -recalFile /path/to/vqsr.sites.indels.vcf.gz in your cmd line which doesn't make sense; a recal file should not be a VCF, it's a text file containing tables, and typically has .recal or .grp as extension, if you've followed our conventions properly. So I suspect that there is an error in your command lines.

Geraldine Van der Auwera, PhD

• UKPosts: 4Member

Hi Geraldine,
First of all, I apologise for confusing you and posting truncated information. It was not my intention to waste your time and I am really sorry if I did so.
So there are 2 different things going on here and it is my mistake that I mixed them up. The reason for my post has to do only with the UnifiedGenotyper, and not with the recalibration step - I don't seem to have the permission to edit the post any more, otherwise I would have deleted the recalibration part from it and rewrote from scratch my question.

So let's just ignore the recalibration step entirely for now and focus only on the input and output of the UnifiedGenotyper please. I managed to isolate the problem and I have created a simple test case - it seems that the issue has to do with this specific indel that doesn't get called and not with the fact that it was the last one in the file. According to my understanding, when running GATK UnifiedGenotyper in the given allele mode for all sites, there should be called variants for all the sites given in the alleles file. This doesn't seem to happen for this particular indel, which is present in the alleles.vcf file but doesn't appear in the output file. I have attached an archive containing the bam file with the sequence, the alleles.vcf and the result.vcf file, obtained after running the following command:

java -jar /path/GenomeAnalysisTK.jar -T UnifiedGenotyper
-glm INDEL
-R /path/1000Genomes_hs37d5/hs37d5.fa
-I /path/pooled.list
--alleles /path/alleles.vcf
-L 17:81000001-81195210
-U LENIENT_VCF_PROCESSING
-baq CALCULATE_AS_NECESSARY
-gt_mode GENOTYPE_GIVEN_ALLELES
-out_mode EMIT_ALL_SITES
--standard_min_confidence_threshold_for_calling 4.0
--standard_min_confidence_threshold_for_emitting 4.0
-l INFO
-A QualByDepth
-A HaplotypeScore
-A MappingQualityRankSumTest
-A FisherStrand
-A InbreedingCoeff -A Coverage -o /path/result.vcf

where:

• pooled.list is a file containing only one line - the path to the bam file used for testing.

• the alleles.vcf file containes the problematic site -- 81195149 :

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
17      81195149        .       GTT     GTTT    51.4    .       AN=14;AC=1;INDEL;IS=3,1;DP=27;VDB=1.2243e-05;PV4=0.53,1,1,0.0027;DP4=6,12,0,3;MQ=28;QBD=2.44789;MDV=3;HWE=1
17      81195152        .       A       T       10      .       AN=123;AC=1

and another site which I added for checking that the variant calling actually happens -- 81195152.
• the result.vcf file contains the output of the calling:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  EGA....
17      81195152        .       A       T       .       .       DP=58;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MQ=16.61;MQ0=37  GT      ./.     ./.  ...

and you can see it here that the 81195149 location is missing from this result.vcf.

I hope this time I explained my problem better, and once again, I apologise!
Thanks a lot for your help!

Hmm, what happens if you delete the SNP at 17:81195152 from your alleles file, and run again on the same snippet? I wonder if it's an issue of one of them masking out the other.

Geraldine Van der Auwera, PhD

• UKPosts: 4Member

I tried that as well, the output vcf doesn't contain any site.