`MalformedReadFilter` seemingly overly aggressive?

I recently came across (blog post) a scenario in which a subset of my reads triggered the MalformedReadFilter during indel realignment. I'm curious about the definition of "malformed": whilst they are invalid, I wouldn't have described incorrect mate names and alignment positions for unmapped reads as "malformed" or as capable of causing job crashes.

Not a bug - I'm just curious to know of any implementation details or reasoning about the MalformedReadFilter.

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @samnicholls
    Hi,

    This page describes what the filter does: https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_engine_filters_MalformedReadFilter.php

    If you are interested in any other filters, you can have a look at this page: https://www.broadinstitute.org/gatk/guide/tooldocs/ (click on "Read Filters")

    -Sheila

  • samnicholls (Wales, UK; Member)

    Hi Sheila,
    I link that help page from the blog but don't see anything describing why it would filter out the reads in question.
    None of the reads had any of the three properties listed:

    • Filter out reads with no stored bases (i.e. '*' where the sequence should be), instead of failing with an error
    • Filter out reads with mismatching numbers of bases and base qualities, instead of failing with an error
    • Filter out reads with CIGAR containing the N operator, instead of failing with an error

    I was looking for an explanation of why the MalformedReadFilter chooses to filter reads that have an invalid mate name or alignment when the read is unmapped, which does not appear to be a "gross malformation" that would cause jobs to crash.
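
The three documented properties quoted above can be sketched as checks on a single SAM text record. This is only an illustration of what the manual page describes, not GATK's actual implementation; the function name `documented_malformations` is hypothetical, and the sketch assumes a tab-separated SAM line with the eleven mandatory fields.

```python
import re

# One CIGAR element: a length followed by an operator defined in the SAM format.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def documented_malformations(sam_line):
    """Return which of the three documented conditions a SAM record meets.

    Hypothetical helper for illustration; it mirrors the bullet list on the
    manual page, not the filter's real code.
    """
    fields = sam_line.rstrip("\n").split("\t")
    cigar, seq, qual = fields[5], fields[9], fields[10]
    reasons = []
    if seq == "*":
        # No stored bases: '*' where the sequence should be.
        reasons.append("no stored bases")
    elif qual != "*" and len(seq) != len(qual):
        # Number of bases differs from number of base qualities.
        reasons.append("base/quality length mismatch")
    if any(op == "N" for _, op in CIGAR_OP.findall(cigar)):
        # CIGAR contains the N (skipped region) operator.
        reasons.append("CIGAR contains N")
    return reasons


record = "r2\t0\tchr1\t100\t60\t2M10N2M\t*\t0\t0\tACGT\tIII"
print(documented_malformations(record))
# ['base/quality length mismatch', 'CIGAR contains N']
```

A record with a mis-set mate name or position on an unmapped read returns an empty list here, which is the gap between the documentation and the observed behaviour that this thread is about.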

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @samnicholls
    Hello Sam,

    I just read your entire page and I like the charm you write with. Now, after all of those fixes you made, did the bam run fine through GATK? After ValidateSamFile produced no errors, did you still get the reads failing the MalformedReadFilter?

    -Sheila

  • samnicholls (Wales, UK; Member)

    Hi Sheila,
    Thanks! And yes, after I went back once more and fixed the data for the unmapped mate pairs (set the starting alignment position to 0 and the mate name to *), the reads no longer failed the MalformedReadFilter. This is what I find confusing: the MalformedReadFilter guide makes no mention that it fails reads with these properties (invalid mate name/position for unmapped reads). I don't see how they are considered to be "grossly malformed". I get that they are technically invalid, but unmapped reads with incorrect mate names or non-zero positions don't appear to do much harm in other pipelines.

    Sam
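
The kind of fix described above (position set to 0, mate name set to *) can be sketched on SAM text records as follows. This is one reading of that description, not the script actually used: the function name `reset_unmapped_pair_fields` is hypothetical, and the sketch assumes the fix applies only to paired reads where both the read and its mate are flagged unmapped, resetting the reference name and position fields for both.

```python
# SAM flag bits, as defined in the SAM format specification.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def reset_unmapped_pair_fields(sam_line):
    """Clear reference names and positions on a fully unmapped pair.

    Hypothetical helper for illustration. Assumption: only records with the
    paired, read-unmapped and mate-unmapped flags all set are touched.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    wanted = PAIRED | UNMAPPED | MATE_UNMAPPED
    if flag & wanted == wanted:
        fields[2], fields[3] = "*", "0"  # RNAME, POS
        fields[6], fields[7] = "*", "0"  # RNEXT (mate reference name), PNEXT
    return "\t".join(fields)


before = "r1\t77\tchr1\t100\t0\t*\tchr1\t100\t0\tACGT\tIIII"
print(reset_unmapped_pair_fields(before))
```

Records where either end is mapped are returned unchanged, since their coordinates may be meaningful. Running ValidateSamFile afterwards, as mentioned earlier in the thread, is the way to confirm the result is valid.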

  • samnicholls (Wales, UK; Member)

    I will add to the documentation these additional things Malformed Read Filter looks for.

    Excellent! Thank you. I'd be curious to know what sort of problems led the developers to add this in the first place, but a full list of what the MalformedReadFilter does on its manual page would be brilliant :)

  • samnicholls (Wales, UK; Member)

    @Sheila
    Any progress on this? I don't have permission to view the GitHub issue linked above (but I see it has been updated recently!)

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)
    Accepted Answer

    We're going to try to include it in the upcoming release (since those docs are generated at build time). Now that you've blogged about it we can't go on ignoring it ;)

    Issue · GitHub: #1244, opened by Sheila, state: open

  • shlee (Cambridge; Member, Broadie, Moderator)
    • Filter out reads with CIGAR containing the N operator, instead of failing with an error

    Isn't N used in RNA-Seq data to denote a skipped region from the reference? Should we be filtering these?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    We filter them by default, but when working with RNAseq data we have an argument that lets them through where appropriate. This is described in the RNAseq best practices.
