How to fix "mis-encoded base qualities"? Is there a recommended program?

GATK BaseRecalibrator and UnifiedGenotyper both gave the error message below. Basically, GATK won't accept my BAM file because of "mis-encoded base qualities". What can I do about the base qualities, which are assigned by the sequencing machine? Does anybody know how to fix this, or is there a recommended package?

    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A USER ERROR has occurred (version 2.4-7-g5e89f01): 
    ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ##### ERROR Please do not post this error to the GATK forum
    ##### ERROR
    ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ##### ERROR Visit our website and forum for extensive documentation and answers to 
    ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR MESSAGE: Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool
    ##### ERROR ------------------------------------------------------------------------------------------

Command:
... -T BaseRecalibrator -R hg18.fa -knownSites dbsnp_137.hg18.srt.pl.vcf -I bwa.fxmt.srt.dup.realign.bam -o bwa.fxmt.srt.dup.realign.brecal.grp -nct 8 -fixMisencodedQuals

Screen Output:
INFO 06:34:58,952 HelpFormatter - --------------------------------------------------------------------------------
INFO 06:34:58,955 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.4-7-g5e89f01, Compiled 2013/03/06 01:01:28
INFO 06:34:58,955 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 06:34:58,955 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 06:34:58,961 HelpFormatter - Program Args: -T BaseRecalibrator -R hg18.fa -knownSites dbsnp_137.hg18.srt.pl.vcf -I bwa.fxmt.srt.dup.realign.bam -o bwa.fxmt.srt.dup.realign.brecal.grp -nct 8 -fixMisencodedQuals
INFO 06:34:58,961 HelpFormatter - Date/Time: 2013/03/13 06:34:58
INFO 06:34:58,961 HelpFormatter - --------------------------------------------------------------------------------
INFO 06:34:58,961 HelpFormatter - --------------------------------------------------------------------------------
INFO 06:34:58,975 ArgumentTypeDescriptor - Dynamically determined type of dbsnp_137.hg18.srt.pl.vcf to be VCF
INFO 06:34:59,031 GenomeAnalysisEngine - Strictness is SILENT
INFO 06:34:59,189 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO 06:34:59,197 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 06:34:59,217 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02
INFO 06:34:59,229 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg18.srt.pl.vcf
INFO 06:34:59,458 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 8 CPU thread(s) for each of 1 data thread(s), of 8 processors available on this machine
INFO 06:34:59,513 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files
INFO 06:34:59,519 GenomeAnalysisEngine - Done creating shard strategy
INFO 06:34:59,519 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 06:34:59,519 ProgressMeter - Location processed.reads runtime per.1M.reads completed total.runtime remaining
INFO 06:34:59,771 BaseRecalibrator - The covariates being used here:
INFO 06:34:59,771 BaseRecalibrator - ReadGroupCovariate
INFO 06:34:59,771 BaseRecalibrator - QualityScoreCovariate
INFO 06:34:59,772 BaseRecalibrator - ContextCovariate
INFO 06:34:59,772 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3
INFO 06:34:59,772 BaseRecalibrator - CycleCovariate
INFO 06:34:59,774 ReadShardBalancer$1 - Loading BAM index data for next contig
INFO 06:34:59,775 ReadShardBalancer$1 - Done loading BAM index data for next contig
WARN 06:35:00,571 RestStorageService - Error Response: PUT '/GATK_Run_Reports/mDvQnd49La42Ayf3DFfHS8Ose8FujpOH.report.xml.gz' -- ResponseCode: 403, ResponseStatus: Forbidden, Request Headers: [Content-Length: 990, Content-MD5: GFEY2WWQuTxIc70mtCHarQ==, Content-Type: application/octet-stream, x-amz-meta-md5-hash: 185118d96590b93c4873bd26b421daad, Date: Wed, 13 Mar 2013 10:35:00 GMT, Authorization: AWS AKIAIMHBU7X642TCHQ2A:HxKd88gSkYOfLDMyi6c7xzvDuOE=, User-Agent: JetS3t/0.8.1 (Linux/2.6.32-279.el6.x86_64; amd64; en; JVM 1.6.0_24), Host: s3.amazonaws.com, Expect: 100-continue], Response Headers: [x-amz-request-id: F8042D1C5589AAC2, x-amz-id-2: 17re9WKZk/mRbhpsgkYzTVTqTLH7WkqoC4tmu4hid5pNVNazL4IdcO3tNoMJt+gO, Content-Type: application/xml, Transfer-Encoding: chunked, Date: Wed, 13 Mar 2013 16:07:00 GMT, Connection: close, Server: AmazonS3]
WARN 06:35:00,639 RestStorageService - Adjusted time offset in response to RequestTimeTooSkewed error. Local machine and S3 server disagree on the time by approximately 19920 seconds. Retrying connection.
INFO 06:35:00,738 GATKRunReport - Uploaded run statistics report to AWS S3

Command:
... -T UnifiedGenotyper -glm BOTH -R hg18.fa -I bwa.fxmt.srt.dup.realign.brecal.bam -o bwa.fxmt.srt.dup.realign.brecal.dbsnp137.vcf -D dbsnp_137.hg18.srt.pl.vcf -L Exome_ROI_hg18.bed -mbq 30 -l INFO -nt 8 -fixMisencodedQuals

Screen Output:
INFO 06:54:53,611 ArgumentTypeDescriptor - Dynamically determined type of Exome_ROI_hg18.bed to be BED
INFO 06:54:53,682 HelpFormatter - --------------------------------------------------------------------------------
INFO 06:54:53,683 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.4-7-g5e89f01, Compiled 2013/03/06 01:01:28
INFO 06:54:53,683 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 06:54:53,683 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 06:54:53,688 HelpFormatter - Program Args: -T UnifiedGenotyper -glm BOTH -R hg18.fa -I bwa.fxmt.srt.dup.realign.brecal.bam -o bwa.fxmt.srt.dup.realign.brecal.dbsnp137.vcf -D dbsnp_137.hg18.srt.pl.vcf -L Exome_ROI_hg18.bed -mbq 30 -l INFO -nt 8 -fixMisencodedQuals
INFO 06:54:53,689 HelpFormatter - Date/Time: 2013/03/13 06:54:53
INFO 06:54:53,689 HelpFormatter - --------------------------------------------------------------------------------
INFO 06:54:53,689 HelpFormatter - --------------------------------------------------------------------------------
INFO 06:54:53,732 ArgumentTypeDescriptor - Dynamically determined type of dbsnp_137.hg18.srt.pl.vcf to be VCF
INFO 06:54:53,789 GenomeAnalysisEngine - Strictness is SILENT
INFO 06:54:53,957 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO 06:54:53,965 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 06:54:53,985 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02
INFO 06:54:53,998 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg18.srt.pl.vcf
INFO 06:54:55,443 IntervalUtils - Processing 37696441 bp from intervals
INFO 06:54:55,471 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 1 CPU thread(s) for each of 8 data thread(s), of 8 processors available on this machine
INFO 06:54:55,533 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files
INFO 06:54:56,300 GenomeAnalysisEngine - Done creating shard strategy
INFO 06:54:56,301 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 06:54:56,301 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 06:54:56,561 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 06:54:56,565 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
...
INFO 06:54:56,970 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 06:54:56,975 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
INFO 06:54:56,977 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg18.srt.pl.vcf
INFO 06:54:56,978 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 06:54:56,982 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
WARN 06:54:57,516 RestStorageService - Error Response: PUT '/GATK_Run_Reports/JJeSoA0KUKzXaGwh2VLiJR094sU0dZJ3.report.xml.gz' -- ResponseCode: 403, ResponseStatus: Forbidden, Request Headers: [Content-Length: 1040, Content-MD5: Chn1nsvvxgbiQUn/EZnW6Q==, Content-Type: application/octet-stream, x-amz-meta-md5-hash: 0a19f59ecbefc606e24149ff1199d6e9, Date: Wed, 13 Mar 2013 10:54:57 GMT, Authorization: AWS AKIAIMHBU7X642TCHQ2A:VugqHWfaPbShOYNCK0UHqCr92kc=, User-Agent: JetS3t/0.8.1 (Linux/2.6.32-279.el6.x86_64; amd64; en; JVM 1.6.0_24), Host: s3.amazonaws.com, Expect: 100-continue], Response Headers: [x-amz-request-id: F06D38C29086232A, x-amz-id-2: 9fKv7j3pL5t/447MYU+752XKU86UDgYsIwtLfTxpKR8tyVM8ez4FzfwMeXjK8QQ4, Content-Type: application/xml, Transfer-Encoding: chunked, Date: Wed, 13 Mar 2013 16:26:58 GMT, Connection: close, Server: AmazonS3]
WARN 06:54:57,572 RestStorageService - Adjusted time offset in response to RequestTimeTooSkewed error. Local machine and S3 server disagree on the time by approximately 19920 seconds. Retrying connection.
INFO 06:54:57,695 GATKRunReport - Uploaded run statistics report to AWS S3

If I don't use "-fixMisencodedQuals", I get the following results:

Command:
... -T UnifiedGenotyper -glm BOTH -R hg18.fa -I bwa.fxmt.srt.dup.realign.brecal.bam -o bwa.fxmt.srt.dup.realign.brecal.dbsnp137.vcf -D dbsnp_137.hg18.srt.pl.vcf -L Exome_ROI_hg18.bed -mbq 30 -l INFO -nt 8

Error Message:

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.4-7-g5e89f01):
ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
ERROR Please do not post this error to the GATK forum
ERROR
ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Bad input: We encountered a non-standard non-IUPAC base in the provided reference: '10'
ERROR ------------------------------------------------------------------------------------------

Screen Output:
INFO 07:01:58,097 ArgumentTypeDescriptor - Dynamically determined type of Exome_ROI_hg18.bed to be BED
INFO 07:01:58,162 HelpFormatter - --------------------------------------------------------------------------------
INFO 07:01:58,162 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.4-7-g5e89f01, Compiled 2013/03/06 01:01:28
INFO 07:01:58,162 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 07:01:58,162 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 07:01:58,168 HelpFormatter - Program Args: -T UnifiedGenotyper -glm BOTH -R hg18.fa -I bwa.fxmt.srt.dup.realign.brecal.bam -o bwa.fxmt.srt.dup.realign.brecal.dbsnp137.vcf -D dbsnp_137.hg18.srt.pl.vcf -L Exome_ROI_hg18.bed -mbq 30 -l INFO -nt 8
INFO 07:01:58,168 HelpFormatter - Date/Time: 2013/03/13 07:01:58
INFO 07:01:58,168 HelpFormatter - --------------------------------------------------------------------------------
INFO 07:01:58,168 HelpFormatter - --------------------------------------------------------------------------------
INFO 07:01:58,211 ArgumentTypeDescriptor - Dynamically determined type of dbsnp_137.hg18.srt.pl.vcf to be VCF
INFO 07:01:58,268 GenomeAnalysisEngine - Strictness is SILENT
INFO 07:01:58,430 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO 07:01:58,438 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 07:01:58,458 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02
INFO 07:01:58,471 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg18.srt.pl.vcf
INFO 07:01:59,869 IntervalUtils - Processing 37696441 bp from intervals
INFO 07:01:59,896 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 1 CPU thread(s) for each of 8 data thread(s), of 8 processors available on this machine
INFO 07:01:59,958 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files
INFO 07:02:00,713 GenomeAnalysisEngine - Done creating shard strategy
INFO 07:02:00,713 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 07:02:00,713 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 07:02:00,973 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 07:02:00,984 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
...
INFO 07:02:01,011 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg18.srt.pl.vcf
INFO 07:02:01,013 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO 07:02:01,023 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
...
INFO 07:02:01,050 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 07:02:01,054 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
INFO 07:02:01,316 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg18.srt.pl.vcf
...
INFO 07:02:03,335 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg18.srt.pl.vcf
INFO 07:02:04,716 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 07:02:04,720 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
INFO 07:02:04,776 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg18.srt.pl.vcf
WARN 07:02:05,328 RestStorageService - Error Response: PUT '/GATK_Run_Reports/d2TWrFuY9SaFFlQl0BD9kkOUUZ9GwXaw.report.xml.gz' -- ResponseCode: 403, ResponseStatus: Forbidden, Request Headers: [Content-Length: 908, Content-MD5: lE3e5sLHp0XzDrDQZMolDQ==, Content-Type: application/octet-stream, x-amz-meta-md5-hash: 944ddee6c2c7a745f30eb0d064ca250d, Date: Wed, 13 Mar 2013 11:02:04 GMT, Authorization: AWS AKIAIMHBU7X642TCHQ2A:M7sDkJlHmahlg1wpOg+EQpaa/X0=, User-Agent: JetS3t/0.8.1 (Linux/2.6.32-279.el6.x86_64; amd64; en; JVM 1.6.0_24), Host: s3.amazonaws.com, Expect: 100-continue], Response Headers: [x-amz-request-id: 1010373FFF5ABCBA, x-amz-id-2: QuiZK4x7f/z0TxVpi3NPt7yDwLsyhxNabXwG+k4F37gEogYqa2AndiOnIED8Efeg, Content-Type: application/xml, Transfer-Encoding: chunked, Date: Wed, 13 Mar 2013 16:34:05 GMT, Connection: close, Server: AmazonS3]
WARN 07:02:05,397 RestStorageService - Adjusted time offset in response to RequestTimeTooSkewed error. Local machine and S3 server disagree on the time by approximately 19920 seconds. Retrying connection.
INFO 07:02:05,486 GATKRunReport - Uploaded run statistics report to AWS S3

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Actually, the error you get when you're not using the -fixMisencodedQuals flag has nothing to do with quals, it has to do with your reference.

    As for that error -- are you running on Windows, by any chance?

  • Which reference are you referring to? Genome ref: hg18.fa, dbSNP: dbsnp_137.hg18.srt.pl.vcf or BED file Exome_ROI_hg18.bed?
    I am using CentOS release 6.3 (Final). Thank you for your information.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Oh, when we talk about the reference, in GATK-speak, it always means the reference genome fasta file.

    We've had issues in the past with gracefully handling non-IUPAC bases in the reference (GATK pretty much refuses to look at anything that's not A, T, G or C, but references usually do include some other characters here and there) but that's supposed to be fixed now. It's also been recently suggested that the specific '10' error may be related to how Windows filesystems handle newlines -- but if you're on CentOS that shouldn't apply to your case AFAIK.

    I will look into this and get back to you shortly.

  • Thank you so much, I will wait for your response to move on. Otherwise, I cannot run any of the walkers in GATK.

  • pdexheimer (Member ✭✭✭✭)

    Unless hg18.fa was created in Windows, in which case it would still carry the Windows (CRLF) line endings.

    @flyingflyers, how did you create the bam file that's triggering the misencoded quals problem? From the name, I can guess that BWA and Picard's FixMateInformation were involved; anything else? Which BWA module did you use?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Fair point again, @pdexheimer.

    @flyingflyers, you can try downloading our version of the reference from our resource bundle and running again.

  • I built the BWA index of hg18.fa:
    bwa index hg18.fa
    Then I aligned the 75 bp paired-end reads using BWA:
    bwa aln hg18.fa 1.fq > 1.sai
    bwa aln hg18.fa 2.fq > 2.sai
    bwa sampe -r '@RG\tID:...' -f bwa.sam hg18.fa 1.sai 2.sai 1.fq 2.fq

    The hg18 sequences are stored by chromosome on a Windows platform in our lab. I downloaded them to CentOS and concatenated them into hg18.fa with cat. Perhaps I have to replace every Windows end-of-line (EOL) sequence (0x0D0A, \r\n) with the line feed (LF) character (0x0A, \n). But why doesn't BWA report an error?

    Otherwise, I will try the hg18 from the resource bundle. Thank you both for your answers.

  • pdexheimer (Member ✭✭✭✭)

    Yes, pulling them off the Windows server is likely your problem. The safest approach is probably to download from the resource bundle; you could also strip the carriage returns yourself (I like perl for this, and there's also a dos2unix tool on many distros). BWA might actually be supported on Windows (and so handle those line endings), I'm not sure. It could also just be more permissive than GATK.

    Regarding the "misencoded quals" - you describe a very standard alignment pipeline that should encode everything properly. So my suggestion would be to leave off that argument and fix your reference, and all should be fine.
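
The line-ending conversion discussed in this exchange can be sketched in Python. This is only an illustration of the idea, not part of GATK or dos2unix; the function name `strip_carriage_returns` and the file names in the demo are hypothetical. It copies a FASTA file while turning each CRLF line ending into a plain LF.

```python
import os
import tempfile


def strip_carriage_returns(src_path, dst_path):
    """Copy src_path to dst_path, converting CRLF line endings to LF.

    Returns the number of lines whose ending was converted.
    """
    converted = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:  # in binary mode, lines are split on b"\n" only
            if line.endswith(b"\r\n"):
                line = line[:-2] + b"\n"
                converted += 1
            dst.write(line)
    return converted


if __name__ == "__main__":
    # Demo on a tiny made-up FASTA with Windows line endings.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "windows.fa")
        dst = os.path.join(tmp, "unix.fa")
        with open(src, "wb") as fh:
            fh.write(b">chr1\r\nACGT\r\nNNNN\r\n")
        print(strip_carriage_returns(src, dst), "line endings converted")
```

Because the byte offsets of the sequence change, any index built from the old file (for example a .fai index or the BWA index) would presumably need to be rebuilt after the conversion.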

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    I agree with all of @pdexheimer's points.

    "It could also just be more permissive than GATK."

    GATK is famously picky about what it will accept as input :)

    Seriously though, your lab should consider replacing that Windows server.

  • bd5fh2 (Member)

    Hi,
    I ran into the same error as flyingflyers: "##### ERROR MESSAGE: Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool", which led me to this conversation. Here is my command line: "java64 -Xmx24g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R refFile -I inputFile --dbsnp dbsnpFileFromGatkBundle -fixMisencodedQuals -stand_call_conf 50.0 -stand_emit_conf 10.0 -o outputFile"

    flyingflyers actually did use the -fixMisencodedQuals option. That's why I am a bit confused by Geraldine's comment "Actually, the error you get when you're not using the -fixMisencodedQuals flag has nothing to do with quals, it has to do with your reference.". Is there possibly another explanation for flyingflyers' error?

    In my case, my reference files have all been used before for UnifiedGenotyper. The error came up after I used -fixMisencodedQuals in an attempt to fix the initial misencoding error (qual > 63). I think what happens now is that a base qual is < 31, which triggers the "mixture of reads" error. If it is indeed a mixture of reads with different encodings, I wonder why it was not a problem for UG. The BAM file has actually been recalibrated using BQSR.

    Thanks in advance.
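
The arithmetic behind the "qual < 31" observation can be sketched as follows. This is a rough model of the behaviour described in this post, not GATK's actual code: the function names are hypothetical, and the only hard fact used is that the Phred+64 and Phred+33 offsets differ by 31. A read stored with the +64 offset but decoded as +33 shows scores that are 31 too high, so the fix subtracts 31; a read that was already correct can then end up below zero.

```python
OFFSET_GAP = 64 - 33  # 31: distance between the Phred+64 and Phred+33 offsets


def decode(quality_string, offset=33):
    """Decode an ASCII quality string into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def shift_down(scores, gap=OFFSET_GAP):
    """Subtract the offset gap; refuse if any score would become negative."""
    fixed = [q - gap for q in scores]
    if any(q < 0 for q in fixed):
        raise ValueError("score below %d: read looks correctly encoded" % gap)
    return fixed


# A Phred+64 read decoded with the wrong (+33) offset: scores are 31 too high.
misencoded = decode("hhhh")     # 'h' is ASCII 104, so each score reads as 71
print(shift_down(misencoded))   # [40, 40, 40, 40]

# A genuine Phred+33 read in the same file cannot be shifted.
correct = decode("5555")        # '5' is ASCII 53, so each score is 20
try:
    shift_down(correct)
except ValueError as err:
    print("rejected:", err)
```

In this simplified model, a correctly encoded read whose scores are all 31 or higher would pass through silently with its qualities lowered, which is one reason to check the encoding before applying any shift.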

  • Hi bd5fh2,
    I haven't tried HaplotypeCaller. The solution in my situation was to run dos2unix on the reference genome on the Linux platform. I tried the Windows version, which doesn't work.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    @bd5fh2, what happened to flyingflyers was that there were two separate errors. One was due to a reference formatting issue, while the other was the quality encoding issue.

    The quality encoding check was added very recently to GATK, so if you used an older version to run UG on those files, it is normal that you didn't get the error.

    Do you know if several sequenced samples were combined in the bam that is having the error? If they were encoded differently that would explain the error.

  • bd5fh2 (Member)

    Thank you both for the speedy replies! Geraldine, yes, my input is a bam merged from two individual bams, which have reads from the same sequencer and were mapped and processed in a similar way. Is it possible that sequencers of the same type produce different quality encodings? In other words, is the quality encoding of the reads subject to user settings? If the answer is yes, I will need to investigate the history of the reads further. If the answer is no, I'm still lost on this issue. Thanks again.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    The same sequencer should be using the same settings, but the best way to find out would be to look at the range of qualities in each of the original bams.
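
One way to look at the range of qualities is to scan the raw quality strings and report the lowest and highest ASCII codes. The sketch below is a heuristic under stated assumptions, not a GATK feature: the function names are hypothetical, the input is assumed to be quality strings already extracted from each original file, and the cut-offs (59 as the lowest character in the +64 family, 74 as the usual top of unrecalibrated Phred+33 data) are rules of thumb that recalibrated data may not follow.

```python
def quality_range(quality_strings):
    """Return (lowest, highest) ASCII code seen across all quality strings."""
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        raise ValueError("no quality characters supplied")
    return min(codes), max(codes)


def guess_offset(quality_strings):
    """Guess the encoding offset: 33, 64, or None when the range is ambiguous."""
    lo, hi = quality_range(quality_strings)
    if lo < 59:
        return 33      # characters this low do not occur in the +64 family
    if lo >= 64 and hi > 74:
        return 64      # nothing low, and values above the usual +33 ceiling
    return None        # not enough evidence either way


# Made-up quality strings standing in for reads from two original files.
print(guess_offset(["IIIIHHH#", "IIIGGG##"]))  # 33
print(guess_offset(["hhhhgggB", "hhhffeBB"]))  # 64
```

Running this separately on each original file, rather than on the merged one, would show whether the two inputs disagree with each other.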

  • smk_84 (Member)
    edited June 2013

    Hi, I am getting the same problem, but I am not running anything on a Windows server. The BAM files I have were generated by an older version of GATK and are base recalibrated, and whenever I try to run them they throw this error. I wanted to run DepthOfCoverage on them, but I have deleted the other files on the server due to space constraints and I am only left with the base-recalibrated files, which I can't use because of this and other errors.

    java -jar /u1/tools/public/GenomeAnalysisTK_2.4/GenomeAnalysisTK.jar -R ../Soybean_ref_genome.fasta -T DepthOfCoverage -o depth_of_coverage_SRS079352_w05.txt -I wild.list --fix_misencoded_quality_scores -fixMisencodedQuals
    INFO 11:04:42,556 HelpFormatter - --------------------------------------------------------------------------------
    INFO 11:04:42,558 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.4-3-g2a7af43, Compiled 2013/02/27 12:18:19
    INFO 11:04:42,559 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO 11:04:42,559 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 11:04:42,562 HelpFormatter - Program Args: -R ../Soybean_ref_genome.fasta -T DepthOfCoverage -o depth_of_coverage_SRS079352_w05.txt -I wild.list --fix_misencoded_quality_scores -fixMisencodedQuals
    INFO 11:04:42,563 HelpFormatter - Date/Time: 2013/06/05 11:04:42
    INFO 11:04:42,563 HelpFormatter - --------------------------------------------------------------------------------
    INFO 11:04:42,563 HelpFormatter - --------------------------------------------------------------------------------
    INFO 11:04:42,619 GenomeAnalysisEngine - Strictness is SILENT
    INFO 11:04:42,858 GenomeAnalysisEngine - Downsampling Settings: No downsampling
    INFO 11:04:42,863 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 11:04:42,983 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.12
    INFO 11:04:43,084 GenomeAnalysisEngine - Creating shard strategy for 4 BAM files
    INFO 11:04:44,069 GenomeAnalysisEngine - Done creating shard strategy
    INFO 11:04:44,069 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 11:04:44,069 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
    INFO 11:04:44,829 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.4-3-g2a7af43):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool
    ERROR ------------------------------------------------------------------------------------------

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi @smk_84,

    I see that you are feeding in the bams via a list file. You should check the encodings of individual files to see if the problem is that some of your bams have different encodings (which is easy to fix) or if there are actually mixed encodings within individual files (which is more tricky to deal with).

  • smk_84 (Member)

    It seems to throw the same error with every file that I have in the list; I have tried running them individually.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hmm, that's not good news. Just to be clear, you did run without the -fixMisencodedQuals argument first, right? And you're using it because there was an alert about misencoded quals? If so, did that happen for all the files if you ran them individually?

  • smk_84 (Member)
    edited June 2013

    Yes, that's true. I haven't tried every file, but it seems to be happening with the ones that I have tried.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Then I'm afraid there's not much I can do for you. There is an option to process files without fixing the quals, but that's dangerous and can lead to bad results. Ideally you should try to find out where the issue comes from originally.

  • smk_84 (Member)

    What could the possible reasons be? Earlier I ran the analysis with a previous version of GATK (prior to the gatk 3 I use now), and at that time it did not throw any such errors. I followed the best practices in the documentation and in the videos, but I still get this error. I don't have the time to run the analysis all over again. Is there anything I can try?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Earlier versions of GATK did not have the check for quality encodings in place, so the problem would have gone through unnoticed. The original cause of the problem is often that the sequencing vendor did not clearly communicate what encoding scale was used for the data, and files with different encoding scales were merged together without being made compatible first. Other reasons can include processing problems, though if you used our best practices that should not be the case for you. One way to find out for sure is to look at the original files that the sequencing vendor provided (generally FastQ).

    It's up to you to decide whether you want to proceed with your analysis of the data as it is (this is possible with the option I mentioned), or whether it is worthwhile to reprocess everything from the beginning.

    The main consequence of ignoring the error and proceeding with misencoded data is that the callers (UnifiedGenotyper or HaplotypeCaller) will be working with flawed information about how good (or bad) the base calls are. So the call confidence will be biased -- confidence scores will be higher for the part of the data that is encoded with the higher value scale, and lower for the part encoded with the lower value scale.

    We strongly encourage people to do the safe thing and start over with clean data. It may take time, but you will be more confident that you can trust the data. Otherwise, any analysis work you do downstream with the potentially flawed data will be tainted with uncertainty. But it's your choice.
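
The size of the bias described in this reply can be worked out from the Phred definition, Q = -10 * log10(p). The sketch below uses hypothetical function names and an arbitrary example score; it only illustrates how far off the error probability is when a Phred+64 character is read with the +33 offset.

```python
def error_probability(q):
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def apparent_quality(true_q, true_offset, assumed_offset):
    """Score obtained when a character is decoded with the wrong offset."""
    return true_q + true_offset - assumed_offset


true_q = 10  # a poor base call: 1 in 10 chance of being wrong
seen_q = apparent_quality(true_q, true_offset=64, assumed_offset=33)
inflation = error_probability(true_q) / error_probability(seen_q)
print(seen_q, round(inflation))  # 41 1259
```

In this example a base with a 10% chance of error is presented to the caller as if its error rate were more than a thousand times lower, which is the kind of overconfidence the reply warns about.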

  • golharam (Member ✭✭✭)

    I'm getting this error as well. I'm using the 1kG reference, all on CentOS (no Windows). Here's the error I'm getting. I'm also using the '--fix_misencoded_quality_scores' parameter, but that doesn't seem to help. How do I correct for this?

    INFO 00:19:36,058 HelpFormatter - --------------------------------------------------------------------------------
    INFO 00:19:36,080 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.4-9-g532efad, Compiled 2013/03/19 07:35:36
    INFO 00:19:36,080 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO 00:19:36,080 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 00:19:36,085 HelpFormatter - Program Args: -T RealignerTargetCreator -rf BadCigar -nt 4 -R /data/results/projects/lifescope/pcgc/reference/b37/human_g1k_v37.fasta --fix_misencoded_quality_scores -I 1-00489-02.dedup.bam -o 1-00489-02.dedup.indelrealigner.list

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.4-9-g532efad):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool

    ERROR ------------------------------------------------------------------------------------------

    Encountered error running indel realigner target creator

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    @golharam, what do you get if you run without the --fix_misencoded_quality_scores flag?

  • golharam (Member ✭✭✭)

    Odd. It seems to run fine. There were no errors. I am using that parameter in my pipeline because the issue came up previously with another sample.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    @golharam, it's best not to use that parameter by default unless you are really sure that it will be applicable to all of your samples. Otherwise it will cause problems like it did here...

  • sxg501 (Case Western Reserve University; Member)

    @Geraldine_VdAuwera said:
    Then I'm afraid there's not much I can do for you. There is an option to process files without fixing the quals, but that's dangerous and can lead to bad results. Ideally you should try to find out where the issue comes from originally.

    The -fixMisencodedQuals option does not always work properly, and it may be fixing some quals but not others -- which is why running the GATK pipeline with this option in place creates a new error about having "some properly encoded base qualities" and a mix.

    The cleanest way to approach this is to not rely on GATK to fix these misencoded quals. I use FastQC to determine the quality encoding and then correct the older Illumina-encoded files using seqtk:

    seqtk seq -Q64 -V "${2}_1.fastq" > "${2}_sanger_1.fastq"

    Then I run GATK without adding any fixing options and it works well!
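
For readers who want to see what such a conversion does to a file, here is a minimal Python sketch. It is not a replacement for seqtk and does not reproduce its behaviour in detail; the function names are hypothetical, and it assumes plain four-line FASTQ records whose quality lines are all Phred+64. Each quality character is moved down by 31 so that it lands in the Phred+33 (Sanger) range.

```python
import io


def phred64_to_phred33(quality_string):
    """Re-encode one quality line from the +64 offset to the +33 offset."""
    shifted = []
    for ch in quality_string:
        code = ord(ch) - 31
        if code < 33:
            # Below '!' after shifting: the input was not Phred+64 after all.
            raise ValueError("%r is too low to be Phred+64" % ch)
        shifted.append(chr(code))
    return "".join(shifted)


def convert_fastq(src, dst):
    """Copy four-line FASTQ records from src to dst, re-encoding line 4."""
    for i, line in enumerate(src):
        if i % 4 == 3:
            line = phred64_to_phred33(line.rstrip("\r\n")) + "\n"
        dst.write(line)


# Demo on a single made-up record held in memory.
src = io.StringIO("@read1\nACGT\n+\nhhhB\n")
dst = io.StringIO()
convert_fastq(src, dst)
print(dst.getvalue(), end="")
```

The explicit check means a file that is already Phred+33 tends to fail loudly instead of being shifted a second time, as long as it contains at least one low-quality base.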

  • Sheila (Broad Institute; Member, Broadie admin)

    @sxg501
    Hi,

    Thank you for reporting your workaround! I hope it will be useful to other users.

    -Sheila

  • priyatama (CA; Member)

    Hi,
    I received the error below when running RealignerTargetCreator:

    ERROR MESSAGE: SAM/BAM/CRAM file [email protected]6e5c19f appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 66.

    I checked my sequence file with FastQC and found Illumina 1.5 encoding. Then I added the -fixMisencodedQuals option to my command.
    I have added this option to RealignerTargetCreator, IndelRealigner, BaseRecalibrator, AnalyzeCovariates and PrintReads.

    Now I am getting the following error at the BaseRecalibrator step:

    ERROR MESSAGE: Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool.

    Please suggest what I should do.

  • Sheila (Broad Institute; Member, Broadie admin)

    @priyatama
    Hi,

    Have a look at this thread and this thread for help.

    -Sheila
