catch22 for filtering reads using FilterSamReads

Hi, trying to prepare some BAM files using picard/MarkDuplicates I got a SAM validation error (Padding operator not between real operators in CIGAR) for a few reads. I figured I'd remove them using FilterSamReads. I built a list of reads by running ValidateSamFile, then putting the read names into a text file to use as read list file. However, if I run the filtering step, FilterSamReads will stop with the same exception as MarkDuplicates, complaining about the very reads I'm trying to remove. I'm probably missing out on something here, but I'm stuck right now. Picard version used is: 2.16.0-1-g763d98e-SNAPSHOT, Java is: OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-1~bpo8+1-b11, command is:
java -jar /home/picard/build/libs/picard.jar FilterSamReads I=input.bam FILTER=excludeReadList O=output.bam USE_JDK_INFLATER=true USE_JDK_DEFLATER=true RLF=input.bam.brokenreads

Exception thrown is:
ERROR 2017-12-12 14:48:43 FilterSamReads Failed to filter 1724-0121-WholeExome_S1_L001_R1_001_paired.bam
htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Read name NS500396:228:HKL5MBGX2:1:12210:18362:13274_2:N:0:TAAGGCGA, Padding operator not between real operators in CIGAR
at htsjdk.samtools.SAMUtils.processValidationErrors(
at htsjdk.samtools.BAMRecord.getCigar(
at htsjdk.samtools.SAMRecord.getAlignmentEnd(
at htsjdk.samtools.SAMRecord.computeIndexingBin(
at htsjdk.samtools.SAMRecord.isValid(
at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(
at htsjdk.samtools.BAMFileReader$
at htsjdk.samtools.BAMFileReader$
at htsjdk.samtools.SamReader$
at htsjdk.samtools.SamReader$
at picard.sam.FilterSamReads.writeReadsFile(
at picard.sam.FilterSamReads.doWork(
at picard.cmdline.CommandLineProgram.instanceMain(
at picard.cmdline.PicardCommandLine.instanceMain(
at picard.cmdline.PicardCommandLine.main(

input.bam.brokenreads looks like this:

Thanks for any help!

Best Answers


  • EADGEADG KielMember

    as a quick and dirty solution grep -v "NS500396:228:HKL5MBGX2:1:13203:21100:6035_1" would be your friend.

    Just invert grep and exclude the lines/reads you don't want.

    Greets EADG

  • kogrokogro Member
    edited December 2017

    Hi EADG,
    thanks for your comment.
    I've thought about combining grep with samtools and I agree it would work.
    I've got about 40 large files, however, where I'd need to filter between 3 and 300 lines each.
    grep would go through each file for each line separately, so I figured using picard's FilterSamReads functionality might be the better/quicker option...

    @EADG said:
    as a quick and dirty solution grep -v "NS500396:228:HKL5MBGX2:1:13203:21100:6035_1" would be your friend.

  • kogrokogro Member
    Accepted Answer

    Hi @EADG,
    thanks again, gnu parallel would have been a good solution. Hadn't thought of that.
    Turns out, that setting the VALIDATION_STRINGENCY=LENIENT parameter will allow me to filter the reads using picard. So the pipeline's happily working again. Still, I find it a bit awkward that I can't use the filtering tool to filter out reads that are known to be broken using the default settings. IMO it should kick them out, without even validating them...
    Thanks again, EADG!

  • EADGEADG KielMember

    Hi @kogro,

    you are welcome, using FilterSamReads with VALIDATION_STRINGENCY=LENIENT is the far more elegant way ;)

  • SheilaSheila Broad InstituteMember, Broadie, Moderator


    What kind of reference are you working with? Can you post the CIGAR string for some of those malformed reads?


  • Hi @Sheila,
    Here are some of the CIGAR strings. They certainly are odd with soft clipping, padding and insertion at the beginning, etc.


    The software that has been used is clcgenomics server and we got the already aligned BAM files from a partner in a research project and I didn't want to re-align everything. Only a very tiny fraction of reads (1-399 at most from 160-230 million) show that "behaviour".
    Though the chromosome ids in the files look like hg19, they have been aligned to the b37 reference (I asked them to send me a dict file, so I could check the md5 sums).

    I was only puzzled as I assumed providing a list of reads to exclude to FilterSamReads would directly remove them and not first run the ValidationCheck on them and then fail. I was a bit slow to realize that I needed to set VALIDATION_STRINGENCY to LENIENT to get that result, or I wouldn't even have bothered anyone with my question. ;-)

    Issue · Github
    by Sheila

    Issue Number
    Last Updated
    Closed By
  • SkyWarriorSkyWarrior TurkeyMember

    I would have sworn that those reads are ion torrent reads but not illumina.. Such a weird CIGAR...

  • Ah, I'd have thought that the software would be to blame. Why would you have thought ION Torrent?

  • SkyWarriorSkyWarrior TurkeyMember
    edited December 2017

    Way too many deletions and insertions. :D But on the other hand. Could that be possible that they used a wrong reference genome to map ?

  • Well, as I said already, these are very few reads, less than 20 for most files a few hundred for about 3. I'll have a look at their location over the weekend.
    For most of the reads the CIGAR looks more like "151M" :)
    Nevertheless MarkDuplicates didn't like the other ones...

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @kogro,

    The error:

    Padding operator not between real operators in CIGAR

    appears to be complaining about the softclip and insertion that surround the P. I'm unfamiliar with the contexts that generate a P CIGAR operator. However, if it is similar to insertions, softclips and hardclips (ISH ~ P), then how did your aligner/processing decide between S, P and I? My guess is that what makes these reads malformed is that these operators are adjacent to each other. A way to test this is to consolidate the adjacent SPI bases into any of the three--S, P, or I--and see if ValidateSamFile accepts such reads. I suspect that these will pass given these are CIGAR operators that are part of the SAM Specification.

    From the discussion in this thread, it sounds like you find having to set VALIDATION_STRINGENCY to LENIENT for FilterSamReads counter-intuitive. However, FilterSamReads is a versatile tool used for a variety of contexts where it would not necessarily expect malformed reads, e.g. subsetting read pairs to generate mate-complete snippets of data.

    All this being said, my guess is that such reads represent some sort of contamination. It may be interesting to BLAST the reads to confirm.

Sign In or Register to comment.