The current GATK version is 3.3-0


# BaseRecalibrator Error

Posts: 9Member
edited October 2012

Dear GATK-Team

When running BaseRecalibrator with my own selected SNPs, I got the following stack trace error:

##### ERROR A GATK RUNTIME ERROR has occurred (version 2.1-6-g6a46042):

Do I have to filter out all LowQuality SNPs in the known.vcf file before feeding it into BaseRecalibrator, or is this a software error? Thank you! Michel

• Posts: 122GATK Developer mod

Unfortunately I wasn't able to run the GATK on your sam file (I think it is missing its header?), but I was able to fix another problem in the BaseRecalibrator related to your reads. Hopefully this will fix your issue. Version 2.1-13 should appear on the website later today.

Thanks for all your help with this,

You should definitely not be feeding low-quality SNPs as known sites to the BaseRecalibrator; that can lead to bad recalibration results.
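For anyone who wants to drop flagged sites before using a call set as known sites, here is a minimal Python sketch. It assumes a plain-text VCF in which low-quality calls carry a value such as `LowQual` in the FILTER column (column 7), and it keeps records whose FILTER is `PASS` or `.` (no filters applied). The function name and the example records are made up for illustration; this is not a GATK tool.

```python
def drop_filtered_records(vcf_lines):
    """Keep header lines and records whose FILTER column is PASS or '.'."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            # Meta-information and column header lines are kept as-is.
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 7 (index 6) of a VCF record is FILTER.
        if len(fields) > 6 and fields[6] in ("PASS", "."):
            kept.append(line)
    return kept


# Hypothetical records, for illustration only.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "Scf1\t100\t.\tA\tG\t812.5\tPASS\tDP=40\n",
    "Scf1\t250\t.\tC\tT\t12.3\tLowQual\tDP=3\n",
]
filtered = drop_filtered_records(example)
print(len(filtered))  # 2 header lines + 1 passing record
```

Whether to also apply a stricter QUAL or depth cutoff depends on your data; the sketch only removes records that were already flagged.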

However that may not be the cause of your problem here -- this looks like a bug that has been fixed. Can you please upgrade to the latest version (2.1-11) and try again?

Geraldine Van der Auwera, PhD

• Posts: 9Member

Thank you for the fast answer, I changed to the latest version, but the error unfortunately remained the same.

##### ERROR

Do you have any other suggestions?

OK, we've looked into it -- it's a new bug and we think we have a solution. Can you send us an excerpt from your file that contains the offending region so we can test our fix?

Geraldine Van der Auwera, PhD

• Posts: 9Member
edited October 2012

Hi, I am not sure I got the right region, because I don't know exactly where GATK hit the error.

INFO  17:24:05,103 TraversalEngine - Peaxi.v0.1.1.Scf2073:7571
2.04e+07    2.0 h        6.0 m     23.3%         8.8 h     6.7 h
INFO  17:24:40,247 TraversalEngine - Peaxi.v0.1.1.Scf2104:33179
2.05e+07    2.0 h        6.0 m     23.3%         8.8 h     6.7 h
INFO  17:25:11,322 TraversalEngine - Peaxi.v0.1.1.Scf2113:75472
2.05e+07    2.1 h        6.0 m     23.4%         8.8 h     6.7 h
INFO  17:25:30,527 GATKRunReport - Uploaded run statistics report to AWS S3


The output ended at Scf2113, but I assume the error is in the scaffolds that follow. I created a SAM file of scaffolds 2113 to 2123 using samtools view. I hope that's what you are looking for. Thank you for testing!

• Posts: 9Member

Sorry, I couldn't upload the file via this site, so here is the Dropbox link to the file: https://www.dropbox.com/s/ctuenzjnbcc6sga/realigned-bwa-error-region.sam

Geraldine Van der Auwera, PhD

• Posts: 9Member
• Posts: 122GATK Developer mod

Hi Michel,

We believe this is fixed in the latest version of the GATK available on the website. Thank you for providing the files to help us track this down.

Cheers,

• Posts: 9Member

Thank you for your effort! I reran the files with the newest version. The error persists in the one file I sent you (same error, same place):

##### ERROR A GATK RUNTIME ERROR has occurred (version 2.1-12-ga99c19d):

but 3 other alignment files worked well. So I am starting to suspect it's an error due to the file, not the program. Unfortunately, I need the failing file to create a reference SNP set. Do you have other suggestions for the cause of the error?

Cheers

• Posts: 122GATK Developer mod

That's actually a different error. We'll take a look at it.

Thanks,

• Posts: 9Member

BaseRecalibrator worked perfectly on the files with version 2.1-13. That is some really good software support you provide here! Kudos to you

• Posts: 122GATK Developer mod

Thanks! I'm glad we were able to help. Cheers,

• Posts: 9Member

Hi,

I have a problem similar to the one described in this thread, but I can't find a solution here. When I run BaseRecalibrator with a VCF file produced from my own BAM file, it fails with a "stack trace" error (full error message shown below). Because my study species is a non-model species, I do not have known SNP sites, so I have to repeat UnifiedGenotyper, BaseRecalibrator, and PrintReads iteratively, starting from the original BAM file. BaseRecalibrator ran fine in the first round but failed in the second. The GATK version is 2.3-0. I encountered the same problem with two different BAM files.

##### ERROR ------------------------------------------------------------------------------------------

Is there a solution available now?

Thanks

• Posts: 683GATK Developer mod

Hi ymw,

Thanks for reporting this. The problem seems to be that you either have: 1) a mixture of well-encoded and mis-encoded reads in your file, or 2) base qualities that are extremely poorly calibrated and that span too large a range. I will add a patch (that will be available in version 2.4) that exits more gracefully with a better error message, but it's not going to help you unfortunately. You need to go back and fix this at the source because there's just something wrong with your data. Good luck and sorry to be the bearer of bad news.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 9Member

Hi Eric,

I figured out the problem, and maybe other users will be interested to know. I had mixed two versions of GATK when analyzing this data set: GATK 2.1 for local realignment and GATK 2.3 (once it became available) for base quality recalibration. When I redid all the analyses with GATK 2.3, the problem was solved. Best,

Chih-Ming

edited January 2013


Geraldine Van der Auwera, PhD

• Posts: 19Member

Hi there,

I'm getting a similar error using the base recalibrator on 1000 Genomes SOLiD data, but in my case I'm processing all of it with the same version of GATK 2.3. My command is:

java -Xmx15g -jar ~/software/GenomeAnalysisTKLite-2.3-4-gb8f1308/GenomeAnalysisTKLite.jar -T BaseRecalibrator -l INFO -R human_g1k_v37.fasta --knownSites 00-All-build135.vcf -I NA12814.mapped.SOLID.bfast.CEU.realigned.bam -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov ContextCovariate --out NA12814.recalibration.grp -solid_nocall_strategy LEAVE_READ_UNRECALIBRATED --disable_indel_quals -nct 8 --filter_mismatching_base_and_quals


On the other hand I'm getting a different index out of range, not sure if that gives you any info:

java.lang.ArrayIndexOutOfBoundsException: -6


I have observed in the 1000 genomes supplementary information that GATK was only employed to detect variants on Illumina data. Is it just a coincidence or did you have any issues with g1k SOLiD data?

Hi Pablo, could you please upload a BAM snippet for us to test?

Geraldine Van der Auwera, PhD

• Posts: 19Member

OK, I took a snippet for chromosome 22 and I could reproduce the same error (well the index out of bounds now was 11...). Everything is uploaded in ftp.broadinstitute.org/priesgo.NA12842.22.zip

By the way, I was able to process this very same data with the old GATK 1.6.5.

Thanks Geraldine! Pablo.

• Posts: 683GATK Developer mod

@priesgo: I'm not sure where you got this BAM file but it is completely invalid and malformed. In the future, please run Picard's ValidateSamFile on your BAMs before sending us bug reports.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 19Member

As you suggested, I ran Picard's ValidateSamFile, and it gives an error:

Exception in thread "main" net.sf.picard.PicardException: Value was put into PairInfoMap more than once.  -1: SRR097794.39113769


This is because the aligner may report more than one possible mapping position for each read. But is this incorrect? This data comes from the 1000 Genomes Project, and to rule out that my own manipulations introduced an error, I reproduced the error again with the raw data. Is there any other malformation?

java -Xmx15g -jar ~/software/GenomeAnalysisTKLite-2.3-4-gb8f1308/GenomeAnalysisTKLite.jar -T BaseRecalibrator -l INFO -R human_g1k_v37.fasta --knownSites 00-All-build135.vcf -I NA12814.mapped.SOLID.bfast.CEU.22.bam -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov ContextCovariate --out NA12814.recalibration.grp -solid_nocall_strategy LEAVE_READ_UNRECALIBRATED --disable_indel_quals -nct 8 --filter_mismatching_base_and_quals --fix_misencoded_quality_scores


This is the output:

java.lang.ArrayIndexOutOfBoundsException: -8


The data comes from the 1000 Genomes Project repository and I just selected the chromosome 22 to deal with a smaller file.

• Posts: 683GATK Developer mod

You should go back to the 1000 Genomes Project then and make sure you are pulling down the correct file, because all of the base qualities were mis-encoded in the file you uploaded to us. The minimum value is ASCII33 but you had values that were lower than that. At this point, the problem is not with the GATK so there's really nothing else we can do to help here. Good luck!
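A quick way to check a single quality string against the ASCII 33 floor mentioned above is sketched below in Python. The function name and the sample string are invented for illustration; in practice you would feed it the quality field (column 11) of the SAM records.

```python
def chars_below_phred33_floor(qual_string):
    """Return (position, ASCII code) for every character below ASCII 33 ('!')."""
    return [(i, ord(c)) for i, c in enumerate(qual_string) if ord(c) < 33]


# Hypothetical quality string containing one byte (ASCII 31) below the floor.
bad = chars_below_phred33_floor("II\x1fI")
print(bad)
```

An empty list means every character is at or above the Phred+33 minimum; it does not by itself prove that the scores are sensibly calibrated.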

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 19Member

Yes, sorry for the malformed file. I saw the mis-encoded base call qualities but did not want to distract from the main point; now I see it is related. In fact, the base call qualities were mis-encoded by GATK's "--fix_misencoded_quality_scores" option, which I used wrongly. This may be causing the error shown above.
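To illustrate why re-encoding can go wrong, here is a small Python sketch. It assumes that converting from Phred+64 to Phred+33 amounts to subtracting 31 (64 minus 33) from every score; whether the GATK option does exactly this internally is my assumption, and the function name is hypothetical. Applied to scores that were already Phred+33, the shift pushes low values below zero.

```python
PHRED33_OFFSET = 33
ASSUMED_SHIFT = 64 - PHRED33_OFFSET  # 31, an assumption about what the fix applies


def shift_already_correct_scores(qual_string):
    """Decode as Phred+33, then apply the Phred+64 -> Phred+33 shift anyway."""
    return [ord(c) - PHRED33_OFFSET - ASSUMED_SHIFT for c in qual_string]


# Characters taken from the start of a Phred+33 string: '"' = 1, '3' = 18, 'I' = 40.
shifted = shift_already_correct_scores('"3I')
print(shifted)
```

Negative scores have no valid ASCII representation at offset 33, which would be consistent with values below ASCII 33 ending up in the output file.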

But why did I call it? Because when running without this parameter I got the following message:

##### ERROR MESSAGE: SAM/BAM file
SAMFileReader{/home/priesgo/data/sequences/1000G_releases/20110521/NA12814/exome_alignment/NA12814.22.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 63; please see the GATK --help documentation for options related to this error


And I found this as a possible solution by @ymv in this entry, but it does not seem to apply for my case

So, let's see this base call qualities encoded string:

""3'"IUC34;U\FMI5I\]]_L<FYZY_^_\@


The "`" character (ASCII 96) corresponds to 63 on the Phred scale, and translating some of these characters gives us:

1 1 18 6 1 40 52 ... 60 60 62 63 43 ... 62 63 59 31
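The translation above can be reproduced with a few lines of Python, assuming the standard Phred+33 (Sanger) encoding in which the score is the ASCII code minus 33. The function name is just for illustration.

```python
def decode_phred33(qual_string):
    """Convert a Phred+33 quality string into a list of integer scores."""
    return [ord(c) - 33 for c in qual_string]


# First seven characters of the quality string quoted above.
prefix_scores = decode_phred33("\"\"3'\"IU")
print(prefix_scores)  # [1, 1, 18, 6, 1, 40, 52]
```

Under the same encoding, a score of 63 is the character with ASCII code 96.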


We can see that the values are correctly distributed in the range from 1 up to 63. Let me ask, is there a way to compress this base call quality range in GATK?

Sorry for the long post and thanks again. By the way this might be better in another post...

Pablo.

• Posts: 683GATK Developer mod

No, but you can have the GATK process the file with suspicious quals with the --allow_potentially_misencoded_quality_scores argument.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 19Member

Thanks! It worked. What a mess with the qualities...