Reg. Unified Genotyper error: BAQ tag error: the BAQ value is larger than the base quality

sswaminathansswaminathan USAPosts: 16Member
edited September 2013 in Ask the GATK team

Hello

We are working with canine whole genome and exome sequence data that has been aligned to canFam3.1. When we run the Unified Genotyper tool with the following option:

-T UnifiedGenotyper -baq CALCULATE_AS_NECESSARY -glm BOTH -nt 16 -R /scratch/sswaminathan/canine_genomes_canfam3.1/canFam3.1/canFam3.1.fa -S SILENT -D /scratch/sswaminathan/canine_genomes_canfam3.1/canFam3.1/canFam3.1.dbSNP.ens72.vcf -I AF23.jir.rc.bam -l INFO -o AF23.vcf -metrics AF23.gakt.metrics

we get the error message:

ERROR MESSAGE: SAM/BAM file SAMFileReader{/scratch/sswaminathan/canine_genomes_canfam3.1/alignments/AF23/AF23.jir.rc.bam} is malformed: BAQ tag error: the BAQ value is larger than the base quality

We also tried the following options:

-T UnifiedGenotyper -baq CALCULATE_AS_NECESSARY -fixMisencodedQuals -glm BOTH -nt 16 -R /scratch/sswaminathan/canine_genomes_canfam3.1/canFam3.1/canFam3.1.fa -S SILENT -D /scratch/sswaminathan/canine_genomes_canfam3.1/canFam3.1/canFam3.1.dbSNP.ens72.vcf -I AF23.jir.rc.bam -l INFO -o AF23.vcf -metrics AF23.gakt.metrics

and

-T UnifiedGenotyper -baq RECALCULATE -fixMisencodedQuals -glm BOTH -nt 16 -R /scratch/sswaminathan/canine_genomes_canfam3.1/canFam3.1/canFam3.1.fa -S SILENT -D /scratch/sswaminathan/canine_genomes_canfam3.1/canFam3.1/canFam3.1.dbSNP.ens72.vcf -I AF23.jir.rc.bam -l INFO -o AF23.vcf -metrics AF23.gakt.metrics

but we receive the error message:

ERROR MESSAGE: Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool

We are running The Genome Analysis Toolkit (GATK) v2.7-2-g6bda569, Compiled 2013/08/28 16:30:29.

Could you please tell us what is the cause of this error and how we can rectify it?

Thank you Yours sincerely Shanker Swaminathan

Post edited by Geraldine_VdAuwera on

Answers

  • ebanksebanks Posts: 683GATK Developer mod

    Hi Shanker,

    It looks like there might be a bad interaction between the BAQ calculation and your needing to fix the misencoded base qualities on the fly. I'd recommend either going back and reprocessing your data to fix the problematic quality encoding at its source (recommended) or trying to create intermediate bam files with fixed encoding using Print Reads and then pass those on to the Genotyper.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • sswaminathansswaminathan USAPosts: 16Member

    Hi Eric

    Thank you for the quick reply. We performed the following steps prior to the Unified Genotyper:

    -T IndelRealigner --maxReadsInMemory 750000 -S SILENT -R /scratch/sswaminathan/canine_genomes_canfam3.1/canFam3.1/canFam3.1.fa -I /scratch/sswaminathan/canine_genomes_c
    anfam3.1/alignments/AF23/AF23.prmdup.indel.rc.bam -targetIntervals /scratch/sswaminathan/canine_genomes_canfam3.1/jointindelrealign/Joint_Indel_Realign_Step1.NoKnowns.indel.intervals -compress 0 -o AF23.jir.bam
    -T BaseRecalibrator -l INFO -R /scratch/sswaminathan/canine_genomes_canfam3.1/canFam3.1/canFam3.1.fa -I AF23.jir.bam -knownSites /scratch/sswaminathan/canine_genomes_ca
    nfam3.1/canFam3.1/canFam3.1.dbSNP.ens72.vcf -o AF23.recal_data.grp
    -T PrintReads -l INFO -R /scratch/sswaminathan/canine_genomes_canfam3.1/canFam3.1/canFam3.1.fa -I AF23.jir.bam -BQSR AF23.recal_data.grp -o AF23.jir.rc.bam
    net.sf.picard.sam.BuildBamIndex INPUT=AF23.jir.rc.bam TMP_DIR=[/scratch/tmp] VALIDATION_STRINGENCY=LENIENT MAX_RECORDS_IN_RAM=6000000    VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 CR
    EATE_INDEX=false CREATE_MD5_FILE=false

    Could you please tell us where the problem might be and what you would suggest we do?

    Thanks Shanker

  • ebanksebanks Posts: 683GATK Developer mod

    Why were you running the Unified Genotyper with -fixMisencodedQuals?

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • sswaminathansswaminathan USAPosts: 16Member

    We first ran the Unified Genotyper without the -fixMisencodedQuals option. But we received the error message ERROR MESSAGE: SAM/BAM file SAMFileReader{/scratch/sswaminathan/canine_genomes_canfam3.1/alignments/AF23/AF23.jir.rc.bam} is malformed: BAQ tag error: the BAQ value is larger than the base quality. Then we decided to try the -fixMisencodedQuals option to see if that might work, but it did not.

    Thanks Shanker

  • ebanksebanks Posts: 683GATK Developer mod

    Can you please run your BAM file through Picard's ValidateSAMFile to confirm that it is well-formed? If it is okay, then we will ask you to upload a small snippet of your data so that we can test locally.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • sswaminathansswaminathan USAPosts: 16Member

    We ran the BAM file through Picard ValidateSAMfile and here is the output:

    [Tue Sep 03 14:50:01 MST 2013] net.sf.picard.sam.ValidateSamFile INPUT=AF23.jir.rc.bam OUTPUT=AF23.jir.rc.bam.Validate MAX_RECORDS_IN_RAM=5000000    MODE=VERBOSE MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false
    [Tue Sep 03 14:50:01 MST 2013] Executing as sswaminathan@pnap-pe5-s25 on Linux 2.6.32-220.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_25-b15; Picard version: 1.95(1496)
    INFO    2013-09-03 14:51:55 SamFileValidator    Validated Read    10,000,000 records.  Elapsed time: 00:01:53s.  Time for last 10,000,000:  113s.  Last read position: 1:121,576,379
    INFO    2013-09-03 14:53:50 SamFileValidator    Validated Read    20,000,000 records.  Elapsed time: 00:03:48s.  Time for last 10,000,000:  115s.  Last read position: 3:34,020,578
    INFO    2013-09-03 14:55:36 SamFileValidator    Validated Read    30,000,000 records.  Elapsed time: 00:05:34s.  Time for last 10,000,000:  106s.  Last read position: 4:62,693,535
    INFO    2013-09-03 14:57:33 SamFileValidator    Validated Read    40,000,000 records.  Elapsed time: 00:07:31s.  Time for last 10,000,000:  117s.  Last read position: 6:20,732,731
    INFO    2013-09-03 14:59:23 SamFileValidator    Validated Read    50,000,000 records.  Elapsed time: 00:09:21s.  Time for last 10,000,000:  110s.  Last read position: 7:61,096,350
    INFO    2013-09-03 15:01:14 SamFileValidator    Validated Read    60,000,000 records.  Elapsed time: 00:11:12s.  Time for last 10,000,000:  110s.  Last read position: 9:29,549,155
    INFO    2013-09-03 15:03:06 SamFileValidator    Validated Read    70,000,000 records.  Elapsed time: 00:13:04s.  Time for last 10,000,000:  112s.  Last read position: 11:31,149,037
    INFO    2013-09-03 15:04:56 SamFileValidator    Validated Read    80,000,000 records.  Elapsed time: 00:14:54s.  Time for last 10,000,000:  110s.  Last read position: 12:64,937,846
    INFO    2013-09-03 15:06:51 SamFileValidator    Validated Read    90,000,000 records.  Elapsed time: 00:16:49s.  Time for last 10,000,000:  114s.  Last read position: 14:37,696,674
    INFO    2013-09-03 15:08:44 SamFileValidator    Validated Read   100,000,000 records.  Elapsed time: 00:18:42s.  Time for last 10,000,000:  112s.  Last read position: 16:29,702,030
    INFO    2013-09-03 15:10:32 SamFileValidator    Validated Read   110,000,000 records.  Elapsed time: 00:20:30s.  Time for last 10,000,000:  108s.  Last read position: 18:22,343,372
    INFO    2013-09-03 15:12:21 SamFileValidator    Validated Read   120,000,000 records.  Elapsed time: 00:22:19s.  Time for last 10,000,000:  108s.  Last read position: 20:29,040,445
    INFO    2013-09-03 15:14:05 SamFileValidator    Validated Read   130,000,000 records.  Elapsed time: 00:24:03s.  Time for last 10,000,000:  104s.  Last read position: 22:37,780,829
    INFO    2013-09-03 15:15:55 SamFileValidator    Validated Read   140,000,000 records.  Elapsed time: 00:25:53s.  Time for last 10,000,000:  109s.  Last read position: 24:41,010,308
    INFO    2013-09-03 15:17:43 SamFileValidator    Validated Read   150,000,000 records.  Elapsed time: 00:27:41s.  Time for last 10,000,000:  107s.  Last read position: 27:33,125,399
    INFO    2013-09-03 15:19:32 SamFileValidator    Validated Read   160,000,000 records.  Elapsed time: 00:29:30s.  Time for last 10,000,000:  109s.  Last read position: 30:20,744,627
    INFO    2013-09-03 15:21:16 SamFileValidator    Validated Read   170,000,000 records.  Elapsed time: 00:31:14s.  Time for last 10,000,000:  104s.  Last read position: 33:10,542,438
    INFO    2013-09-03 15:22:58 SamFileValidator    Validated Read   180,000,000 records.  Elapsed time: 00:32:56s.  Time for last 10,000,000:  102s.  Last read position: 36:23,417,850
    [Tue Sep 03 15:23:54 MST 2013] net.sf.picard.sam.ValidateSamFile done. Elapsed time: 33.89 minutes.
    Runtime.totalMemory()=728104960
    To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp

    On another note, we ran the same pipeline for 12 samples including this one and all of them had the same error. Please let us know how you would like us to proceed.

    Thanks Shanker

  • sswaminathansswaminathan USAPosts: 16Member

    I am sorry for the previous message, but the Picard output file had 2113 lines of ERROR. On viewing the file, we identified three classes of errors:

    1. MAPQ should be 0 for unmapped read.
     2. Mate negative strand flag does not match read negative strand flag of mate
     3. Mate unmapped flag does not match read unmapped flag of mate
    Please let us know your thoughts and suggestions. Thanks Shanker
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,462Administrator, GATK Developer admin

    Hi Shanker,

    We'll need to take a closer look at this to confirm where the problem lies. Could you please upload test files to our FTP server? Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

    Geraldine Van der Auwera, PhD

  • sswaminathansswaminathan USAPosts: 16Member

    We are working to get the files ready and will let you know as soon as we upload them. Could you please clarify if you would like the canine reference files? The reference files are canFam3.1 downloaded from top level FASTA file from Ensembl and indexed with BWA6. We are planning to send a test BAM that contains the ERROR reads along with a subset of good reads without the ERROR from chromosome 1 (~3000000 reads). Would this be a good set for you to test?

    Thanks Shanker

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,462Administrator, GATK Developer admin

    Yes, we will need your reference, please. Those test files sound good.

    Geraldine Van der Auwera, PhD

  • sswaminathansswaminathan USAPosts: 16Member

    Hi Geraldine We have uploaded the test set (AF23_bad_dog_to_broad.tar.gz) on ftp.broadinstitute.org using the instructions provided under Uploading. Please let us know if you have any questions. Thank you very much for your help. Shanker

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,462Administrator, GATK Developer admin

    Thanks Shanker, I'll have a look at these first thing tomorrow and let you know what I find.

    Geraldine Van der Auwera, PhD

  • sswaminathansswaminathan USAPosts: 16Member

    Hello I would like to follow up regarding the "Reg. Unified Genotyper error: BAQ tag error: the BAQ value is larger than the base quality" question. Could you please tell us if you were able to identify the problem and advise us on the steps we need to take? Thank you very much for looking into this. Shanker

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,462Administrator, GATK Developer admin

    Hi Shanker, sorry for the delay in following up on this, it's been a busier time than I expected.

    I've isolated the problem to a specific position (1:203344), although I think it occurs at a few others in your bam. At that position, the issue is related to a single problem read:

    HWI-ST388-W7D:5:1208:16203:138818#ACAGTG    163 1   203289  29  4S86M   =   203671  473 TGTAGTTATTCGGTATGTTGATCAATTTAACACCCATATAAAATAAATTTATTTAAAAAACAGATCGAACAACCAAATCTATACACTACC  -8/*4=:>9><4-7;51?&24+8<4,5&:>=?75':<>+=>>@?0;)<4>?<>>?6<@9"*74:3=%=:=>@<(=8;?############  BD:Z:EEFHIIGEFFEGGGGEHEFFHGGHGEF?EEFFEIDIIEFEE??FEE?FF?EFF?EE????FFIDHHHHFGGIGJJI@GGGGGHHIIEEEE MD:Z:16G6C6A3A21A5C6T4A1T2T3G2  RG:Z:AF23_aln_09-04-12  XG:i:0  BI:Z:FFGHFGEEEDCFGFGEFFEEGFEGFFD@EEDEDFDHGEEEE@@FEE@FD@EED@FFAAAAEFHFGFGHGEFGEHJHBHGHHHHIJJFFFF AM:i:29 NM:i:11 SM:i:29 XM:i:11 XO:i:0  BQ:Z:@@@@OWWYRZWNIRUNMZCLPFSWOGRCVZXZSPDTWYEXZZ[ZKWDWPZZWZZZPX[VBEPPSNX@XVXY[WEXTWZ>>>>>>>>>>>> MQ:i:29 OQ:Z:+A2,ACFIEHGG33C89H+22*1:1*:*?FHJ@6):BG*?EEHG<D*B<FHBFFH8=F@''.7=3?(;BEDCB(;=BC############ XT:A:M
    

    At this point I'm passing this on to the devs to figure out why it's causing the BAQ calculation to blow up and how to fix it or work around it.

    Geraldine Van der Auwera, PhD

  • ebanksebanks Posts: 683GATK Developer mod

    Hi Shanker, sorry for the delay on this. I've taken a closer look and there's something strange with your bam file. It seems that your reads have the BAQ tag already included, but that shouldn't be possible given the commands you posted above that you ran prior to the Unified Genotyper. When I take a look at the header of your BAM file I see that you've run both TableRecalibration (from GATK v1.6) and PrintReads (from GATK v2.7), so it looks like you may have run base quality score recalibration twice - and maybe somewhere along the way the BAQ tag was included.

    The easiest thing to do would be to run the UnifiedGenotyper with "-baq OFF" or "-baq RECALCULATE" since the current BAQ values are all messed up. But I am worried about the integrity of your files given that they appear to have gone through multiple (and inconsistent) processing steps. My recommendation would be to start reprocessing from scratch if at all possible.

    I hope this helps!

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • sswaminathansswaminathan USAPosts: 16Member

    Hi Eric, Thank you very much for looking into this! We checked and yes, the files were run through TableRecalibration in an older version of GATK which may be causing the problem. We agree that it might be better to start from scratch to get fresh files and will start working on it. This is of great help, thanks a lot! Shanker

  • caddymobcaddymob Posts: 10Member

    Hi Eric,

    To follow up a little on this one - the files were run through GATKv1 previously by another researcher, and some new samples were added at later times, so that is why they appear inconsistent. This error cropped up as we were simply hoping to 'freshen-up' the bams by adding the new samples and starting from a large joint indel-realignment with GATK2.7 (they were previously indel realigned individually GATK1.6), BSQR and have up-to-date GATK data following current best-practices.

    This is an issue that must be commonplace - perhaps you can point us to a best-practices for this type of operation? Specifically, what's the best way to update old bams to current best practices? As in this example, new samples come in >1 year since 1st analysis, we want to run this and do a fresh joint indel realignment onwards.

    Could GATK2 be made aware of GATK1 processing and revert to original qualities? We can use picard to do this but lose alignments. We can go all the way back to FASTQs but that just seems like a waste of cpu hours. We'd also lose PG tags too that capture all the work done on said bam. Sure we can we take from the 1st header, then reheader final bam.... just clunky and time consuming.

    We welcome your suggestions - thanks for your time and attention on this everyone!

  • ebanksebanks Posts: 683GATK Developer mod

    Yes, use can easily revert to the original qualities. The --useOriginalQualities argument tells the GATK to use the BAM's original quals when available (and they are in your file); you'll want to use that when running BaseRecalibrator and PrintReads on your files.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

Sign In or Register to comment.