catch22 for filtering reads using FilterSamReads

Hi, trying to prepare some BAM files using picard/MarkDuplicates I got a SAM validation error (Padding operator not between real operators in CIGAR) for a few reads. I figured I'd remove them using FilterSamReads. I built a list of reads by running ValidateSamFile, then putting the read names into a text file to use as read list file. However, if I run the filtering step, FilterSamReads will stop with the same exception as MarkDuplicates, complaining about the very reads I'm trying to remove. I'm probably missing out on something here, but I'm stuck right now. Picard version used is: 2.16.0-1-g763d98e-SNAPSHOT, Java is: OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-1~bpo8+1-b11, command is:
java -jar /home/picard/build/libs/picard.jar FilterSamReads I=input.bam FILTER=excludeReadList O=output.bam USE_JDK_INFLATER=true USE_JDK_DEFLATER=true RLF=input.bam.brokenreads

Exception thrown is:
ERROR 2017-12-12 14:48:43 FilterSamReads Failed to filter 1724-0121-WholeExome_S1_L001_R1_001_paired.bam
htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Read name NS500396:228:HKL5MBGX2:1:12210:18362:13274_2:N:0:TAAGGCGA, Padding operator not between real operators in CIGAR
at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:454)
at htsjdk.samtools.BAMRecord.getCigar(BAMRecord.java:253)
at htsjdk.samtools.SAMRecord.getAlignmentEnd(SAMRecord.java:606)
at htsjdk.samtools.SAMRecord.computeIndexingBin(SAMRecord.java:1575)
at htsjdk.samtools.SAMRecord.isValid(SAMRecord.java:2087)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:811)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:797)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:765)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:576)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:548)
at picard.sam.FilterSamReads.writeReadsFile(FilterSamReads.java:193)
at picard.sam.FilterSamReads.doWork(FilterSamReads.java:213)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:268)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)

input.bam.brokenreads looks like this:
NS500396:228:HKL5MBGX2:1:12210:18362:13274_2:N:0:TAAGGCGA
NS500396:228:HKL5MBGX2:4:23505:12020:5193_2:N:0:TAAGGCGA
NS500396:228:HKL5MBGX2:1:13203:21100:6035_1:N:0:TAAGGCGA
etc.

Thanks for any help!

Best Answers

Answers

  • EADGEADG KielMember ✭✭✭

    Hi,
    as a quick and dirty solution grep -v "NS500396:228:HKL5MBGX2:1:13203:21100:6035_1" would be your friend.

    Just invert grep and exclude the lines/reads you don't want.

    Greets EADG

  • kogrokogro Member
    edited December 2017

    Hi EADG,
    thanks for your comment.
    I've thought about combining grep with samtools and I agree it would work.
    I've got about 40 large files, however, where I'd need to filter between 3 and 300 lines each.
    grep would go through each file for each line separately, so I figured using picard's FilterSamReads functionality might be the better/quicker option...

    @EADG said:
    Hi,
    as a quick and dirty solution grep -v "NS500396:228:HKL5MBGX2:1:13203:21100:6035_1" would be your friend.

  • kogrokogro Member
    Accepted Answer

    Hi @EADG,
    thanks again, gnu parallel would have been a good solution. Hadn't thought of that.
    Turns out, that setting the VALIDATION_STRINGENCY=LENIENT parameter will allow me to filter the reads using picard. So the pipeline's happily working again. Still, I find it a bit awkward that I can't use the filtering tool to filter out reads that are known to be broken using the default settings. IMO it should kick them out, without even validating them...
    Thanks again, EADG!

  • EADGEADG KielMember ✭✭✭

    Hi @kogro,

    you are welcome, using FilterSamReads with VALIDATION_STRINGENCY=LENIENT is the far more elegant way ;)

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @kogro
    Hi,

    What kind of reference are you working with? Can you post the CIGAR string for some of those malformed reads?

    Thanks,
    Sheila

  • Hi @Sheila,
    Here are some of the CIGAR strings. They certainly are odd with soft clipping, padding and insertion at the beginning, etc.

    2S10P2I1M6I2M2D14M1D10M1D46M1I35M1D3M1D1M1I7M1D1M1D1M1D13M1I4M
    3S4P1I2M1I3M2D9M3D1M3D8M1D10M1D70M2I11M1D3M1D8M1D1M1D1M1D14M1I2M
    3S3P3I2M1I1M2D3M1I6M1I3M1D10M1D74M1I7M1D3M1D8M1D1M1D1M1D6M1I1M1I3M1I2M2I3M1I1M
    

    The software that has been used is clcgenomics server and we got the already aligned BAM files from a partner in a research project and I didn't want to re-align everything. Only a very tiny fraction of reads (1-399 at most from 160-230 million) show that "behaviour".
    Though the chromosome ids in the files look like hg19, they have been aligned to the b37 reference (I asked them to send me a dict file, so I could check the md5 sums).

    I was only puzzled as I assumed providing a list of reads to exclude to FilterSamReads would directly remove them and not first run the ValidationCheck on them and then fail. I was a bit slow to realize that I needed to set VALIDATION_STRINGENCY to LENIENT to get that result, or I wouldn't even have bothered anyone with my question. ;-)

    Issue · Github
    by Sheila

    Issue Number
    2783
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    sooheelee
  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    I would have sworn that those reads are ion torrent reads but not illumina.. Such a weird CIGAR...

  • Ah, I'd have thought that the software would be to blame. Why would you have thought ION Torrent?

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭
    edited December 2017

    Way too many deletions and insertions. :D But on the other hand. Could that be possible that they used a wrong reference genome to map ?

  • Well, as I said already, these are very few reads, less than 20 for most files a few hundred for about 3. I'll have a look at their location over the weekend.
    For most of the reads the CIGAR looks more like "151M" :)
    Nevertheless MarkDuplicates didn't like the other ones...

  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator admin

    Hi @kogro,

    The error:

    Padding operator not between real operators in CIGAR

    appears to be complaining about the softclip and insertion that surround the P. I'm unfamiliar with the contexts that generate a P CIGAR operator. However, if it is similar to insertions, softclips and hardclips (ISH ~ P), then how did your aligner/processing decide between S, P and I? My guess is that what makes these reads malformed is that these operators are adjacent to each other. A way to test this is to consolidate the adjacent SPI bases into any of the three--S, P, or I--and see if ValidateSamFile accepts such reads. I suspect that these will pass given these are CIGAR operators that are part of the SAM Specification.

    From the discussion in this thread, it sounds like you find having to set VALIDATION_STRINGENCY to LENIENT for FilterSamReads counter-intuitive. However, FilterSamReads is a versatile tool used for a variety of contexts where it would not necessarily expect malformed reads, e.g. subsetting read pairs to generate mate-complete snippets of data.

    All this being said, my guess is that such reads represent some sort of contamination. It may be interesting to BLAST the reads to confirm.

Sign In or Register to comment.