Complete this survey about your research needs and be entered to win an Amazon gift card or FireCloud credit.
Read more about it here!
Download the latest Picard release at https://github.com/broadinstitute/picard/releases.
GATK version 4.beta.6 is out. See the GATK4 beta page for download and details.

UnifiedGenotyper -rf BadCigar

rwnessrwness Member
edited April 2013 in Ask the GATK team

Hello,
I am using GATK v2.4-9 and when I use the UnifiedGenotyper I get the following error:

ERROR MESSAGE: SAM/BAM file SAMFileReader{my_bam} is malformed: read starts with deletion. Cigar: 3D3M4I93M. Although the SAM spec technically permits such reads, this is often indicative of malformed files. If you are sure you want to use this file, re-run your analysis with the extra option: -rf BadCigar

This problem seems to be occur very commonly throughout the reads (either starts or ends with deletion) and I have never had it before I updated recently to 2.4-9. The BAM was created by IndelRealigner, using multiple samples simultaneously and finished with no errors. Although I could add the flag "-rf BadCigar" to stop receiving the error, that won't help me to know that the BAM is not malformed and why this happened.

By the way, I to aln the reads I used BWA Version: 0.7.4-r385 (released 23/04/2013)

Thanks - Rob

Best Answer

Answers

  • rwnessrwness Member
    edited April 2013
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi Rob,

    Are you sure about the BWA version you used? The "CIGAR with leading D" issue was a bug that popped up in a recent version of BWA, and 0.7.4 was specifically meant to fix that bug (plus a few others). From the release notes for BWA 0.7.4:

    Release 0.7.4 (23 April, 2013)
    ~~~~~~

    >

    This is a bugfix release. Most of bugs are considered to be minor which only
     very rarely.
     ccur

    >

     * Bugfix: wrong CIGAR when a query sequence bridges three or more target
       sequences. This only happens when aligning reads to short assembly contigs.

    >

     * Bugfix: leading "D" operator in CIGAR.

    >

     * Extend more seeds for better alignment around tandem repeats. This is also
       a cause of the leading "D" operator in CIGAR.

    >

     * Bugfix: SSE2-SSW may occasionally find incorrect query starting position
       around tandem repeat. This will lead to a suboptimal CIGAR in BWA-MEM and
       a wrong CIGAR in BWA.

    >

     * Bugfix: clipping penalty does not work as is intended when there is a gap
       towards the end of a read.

    >

     * Fixed an issue caused by a bug in the libc from Mac/Darwin. In Darwin,
       fread() is unable to read a data block longer than 2GB due to an integer
      overflow bug in its implementation.

    >

    Since version 0.7.4, BWA-MEM is considered to reach similar stability to
    BWA-backtrack for short-read mapping.

    >

    (0.7.4: 23 April, r385)

  • Now that I think of it - Its possible that one sample was done with with a slightly earlier version. I will look into it and write back.

  • rwnessrwness Member
    edited April 2013

    I looked into it by running the UnifiedGenotyper on only those samples that were run with the newest version of BWA. Unfortunately the same BadCigar error occurred. The version of BWA is recorded in the BAM:

    samtools view -H my_bam.bam|grep "VN:"
    "@PG     ID:bwa  PN:bwa  VN:0.7.4-r385"
    

    Do you know whether this means the BAM is full of errors and should be trashed or can I continue using the "-rf BadCigar" flag?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hmm, that's weird - but since you posted I have heard a similar report about BWA 0.7.4 from another user (via twitter). Perhaps the issue persists in certain cases.

    If so you should report this to the BWA developer. However first we should make sure that this is indeed bwa's doing and not GATK's.

    You mentioned the bam was created by IndelRrealigner. We need to find out if the pre-realignment bam file also had these bad cigars. If you have the ID of one read that has the issue you can Grep for that, or you can run a simple walker like CountReads with -rf BadCigar. The run summary should tell you how many reads were filtered out because of the cigar issue, if any.

    You can also run the same thing on the post-realignment file to check if there's any difference. If the proportion of affected reads is low you're probably ok to proceed with analysis using the filter. If not then you should consider trashing the bam and starting again from the top.

  • So I picked one of the offending alignments and I went back to the first BAM and GREP'd for the bad CIGAR. A number of bad reads came back with the same CIGAR string but I believe this is the one that raised the error:

    samtools view my_bam.bam |grep "94M3I3M2D"
    FCD1DAFACXX:6:1207:9423:74756#ATCCAAGC  163     chromosome_1    441146  8       94M3I3M2D       =       441500  453    AGGTGTGGCCCTTGCCGACCCCTGCCTCATGCTCATGCACACGCCTGCCACTGAGCAGACGAATACACTACTTCGCCACAAGTGCACATGTGCATGTGCA    @CCDDFFFHHHHHJJJJJJJJJJJJJIJJJJJJIIJJIJJIIJJJJJJJJGIJJJHHHHFFFDDDDDDDDCDDDDDDDDDDBACDDDDDDCCDEEDDCAC  X0:i:2  X1:i:0  XA:Z:chromosome_1,+441146,94M1D6M,5;    MD:Z:64G32^CG0  RG:Z:CC2937_2   XG:i:1  AM:i:0  NM:i:6  SM:i:0  XM:i:4  XO:i:1  MQ:i:37 XT:A:R
    

    I can also use some bash to count that there are 250 reads with CIGAR fields that end with a deletion

    samtools view my_bam.bam |awk '{print substr($6,length($6),1)}' |grep D |wc -l
    240
    

    I then ran CountReads on the same bam and got nothing:

    java -jar GenomeAnalysisTK.jar -T CountReads  -I my_bam --reference_sequence my_ref.fasta -rf BadCigar
    
    INFO  11:52:52,362 Walker - [REDUCE RESULT] Traversal result is: 59131409
    INFO  11:52:52,363 ProgressMeter -            done        5.91e+07    9.0 m        9.0 s    100.0%         9.0 m     0.0 s
    INFO  11:52:52,363 ProgressMeter - Total runtime 537.69 secs, 8.96 min, 0.15 hours
    INFO  11:52:52,452 MicroScheduler - 0 reads were filtered out during traversal out of 200010 total (0.00%)
    INFO  11:52:53,511 GATKRunReport - Uploaded run statistics report to AWS S3
    

    I am confused why CountReads didn't identify it but UnifiedGenotyper did. The only reason I can imagine for this bug persisting in BWA v7.4 is that I used bwa sampe rather than the new bwa-mem

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    If you only have 250 reads with the bad cigars then it's what, 0.1% of your total read count? That's not too big a deal -- so you're probably okay carrying on with this bam using the filter as opposed to remapping and reprocessing from scratch, if you're in a hurry. But I would still check exactly how many reads get filtered out (including both leading and trailing Ds), just to be safe.

    Maybe with CountReads the filtering didn't get activated because it's a read walker -- unlike the UG which is a locus walker. I didn't expect that would make a difference, but the GATK always manages to surprise me. Can you please try that again with CountLoci instead? That one's a locus walker so it should behave like the UG in this respect.

  • You're right about the difference between CountReads and CountLoci - there isn't a huge number of these reads in my bams and I could probably carry on.

    java -jar GenomeAnalysisTK.jar -T CountLoci  -I my_bam.bam --reference_sequence my_ref.fasta -rf BadCigar
    INFO  17:46:08,564 MicroScheduler - 1641213 reads were filtered out during traversal out of 47212161 total (3.48%)
    INFO  17:46:08,564 MicroScheduler -   -> 622 reads (0.00% of total) failing BadCigarFilter
    INFO  17:46:08,564 MicroScheduler -   -> 1640591 reads (3.47% of total) failing UnmappedReadFilter
    

    It is likely a remaining problem with BWA and maybe just with ALN/SAMPE . I will see if I can find where to report the issue.

    Thanks for your help

  • yingzhangyingzhang minneapolisMember

    I had the same issue, and I am not using BWA. The GATK version I am using is version 2.5-2-gf57256b.

    Since my data is RNA-Seq, I used tophat/bowtie to map my reads to the reference human genome.

    I searched online. It seems people reported the bug since Oct. 2012, but in my personal opinion, the bug has not been fixed in any way.

    Here is my detailed error message:

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.5-2-gf57256b):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: SAM/BAM file SAMFileReader{/home/lambav0/shared/RISS_analysis/GATK/output_phaseI/T040227.realn.recal.red.bam} is malformed: read starts with deletion. Cigar: 13H1106N1D37M. Although the SAM spec technically permits such reads, this is often indicative of malformed files. If you are sure you want to use this file, re-run your analysis with the extra option: -rf BadCigar
    ERROR ------------------------------------------------------------------------------------------

    Again, -rf BadCigar option didn't help at all.

    Please give me any hint.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @yingzhang, do you mean that you ran with -rf BadCigar but still got the same error? Can you please post the console output?

    Also, can you tell me if the error still occurs with version 2.6?

  • yingzhangyingzhang minneapolisMember

    @Geraldine_VdAuwera said:
    yingzhang, do you mean that you ran with -rf BadCigar but still got the same error? Can you please post the console output?

    Also, can you tell me if the error still occurs with version 2.6?

    Hi, Geraldine,

    Thank you for your prompt reply.

    The issue was not solved in version 2.6. I attached my stderr below:

    java -Xmx16g -jar $gatk -T HaplotypeCaller -R $ref -D $dbvcf/dbsnp_137.hg19.vcf -o raw.snps.vcf -I $bamfiles/T040227.realn.recal.red.bam -U ALLOW_N_CIGAR_READS -rf BadCigar
    Picked up _JAVA_OPTIONS: -Xmx2730666k
    INFO 12:48:08,048 HelpFormatter - --------------------------------------------------------------------------------
    INFO 12:48:08,051 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.6-4-g3e5ff60, Compiled 2013/06/24 14:48:56
    INFO 12:48:08,051 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO 12:48:08,051 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 12:48:08,056 HelpFormatter - Program Args: -T HaplotypeCaller -R /project/db/genomes/Hsapiens/hg19_canonical/seq/hg19_canonical.fa -D /home/riss/shared/RISS_database/broad_bundle_hg19_v2.5/dbsnp_137.hg19.vcf -o raw.snps.vcf -I /home/lambav0/shared/RISS_analysis/GATK/output_phaseI/T040227.realn.recal.red.bam -U ALLOW_N_CIGAR_READS -rf BadCigar
    INFO 12:48:08,056 HelpFormatter - Date/Time: 2013/07/09 12:48:08
    INFO 12:48:08,056 HelpFormatter - --------------------------------------------------------------------------------
    INFO 12:48:08,056 HelpFormatter - --------------------------------------------------------------------------------
    INFO 12:48:08,085 ArgumentTypeDescriptor - Dynamically determined type of /home/riss/shared/RISS_database/broad_bundle_hg19_v2.5/dbsnp_137.hg19.vcf to be VCF
    INFO 12:48:08,151 GenomeAnalysisEngine - Strictness is SILENT
    INFO 12:48:08,307 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
    INFO 12:48:08,315 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 12:48:08,338 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02
    INFO 12:48:08,346 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
    INFO 12:48:08,357 RMDTrackBuilder - Loading Tribble index from disk for file /home/riss/shared/RISS_database/broad_bundle_hg19_v2.5/dbsnp_137.hg19.vcf
    INFO 12:48:08,738 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
    INFO 12:48:09,148 GenomeAnalysisEngine - Done preparing for traversal
    INFO 12:48:09,148 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 12:48:09,149 ProgressMeter - Location processed.active regions runtime per.1M.active regions completed total.runtime remaining
    INFO 12:48:09,253 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units
    INFO 12:48:39,151 ProgressMeter - chr1:7842561 1.66e+04 30.0 s 30.2 m 0.3% 3.3 h 3.3 h
    .....................................
    INFO 13:08:59,220 ProgressMeter - chr3:186938096 4.92e+08 20.8 m 2.0 s 21.9% 94.9 m 74.1 m
    INFO 13:09:25,929 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.6-4-g3e5ff60):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: SAM/BAM file SAMFileReader{/home/lambav0/shared/RISS_analysis/GATK/output_phaseI/T040227.realn.recal.red.bam} is malformed: read starts with deletion. Cigar: 13H1106N1D37M. Although the SAM spec technically permits such reads, this is often indicative of malformed files. If you are sure you want to use this file, re-run your analysis with the extra option: -rf BadCigar
    ERROR ------------------------------------------------------------------------------------------

    Before you look into details for this error, I would like to provide you with more details of my analysis. I am working on a RNA-Seq data with single end 50bp reads. So far, I have done the analysis for phase I, and created the reduced reads for HaplotypeCaller. I have 100 samples in total. I must say, I have run the same commands (without -rf BadCigar option) on several other data files, which were done successfully. This is the first one in my pipeline to report such an error.

    Any input on how to handle this issue will be appreciated.

    Best,
    Ying

Sign In or Register to comment.