OverclippedReadFilter doesn't filter anything

Hello,

I have 51bp reads and am trying to filter out those where about 20-30 bases have been soft-clipped. The command I have been using is:

java -Xmx8g -jar $GATK_JAR -R $REFERENCE -T PrintReads -rf OverclippedRead --filter_is_too_short_value 40 -o /dev/stdout --disable_bam_indexing -I snippet.bam

Using GATK version v3.4-46, human reference hg19.

The problem I'm encountering is that this doesn't actually filter out any reads. I'm looking at many reads with 27 soft-clipped bases and 24 matches, but those are included in the output. Also the log is unambiguous:

-> 0 reads (0.00% of total) failing OverclippedReadFilter
MicroScheduler - 0 reads were filtered out during the traversal out of approximately 2474 total reads (0.00%)

Thanks.

Answers

  • Sheila (Broad Institute; Member, Broadie admin)

    @Jacob
    Hi Jacob!

    I just saw your fix on GitHub. Thanks for making the change.

    -Sheila

  • Jacob (Member)

    For the benefit of others:

    The current version only filters a read if it has a soft-clipped block on both ends. My reads were 25S25M (soft-clipped on one end only) and hence were not being filtered.

    The pull request I submitted adds an option to set the minimum number of soft-clipped blocks. The default is 2, for backwards compatibility. You can also set it to 1 or 0 to filter on (read length - soft-clipped bases) without requiring soft clips on both ends. In my case I set --min_softclip_blocks 1 and got the behavior I wanted.
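For readers who want to check their own CIGAR strings against the behavior described above, here is a minimal Python sketch. It is not GATK's code. The names `is_overclipped`, `too_short` and `min_softclip_blocks` are illustrative stand-ins for `--filter_is_too_short_value` and `--min_softclip_blocks`. The sketch also assumes that "aligned bases" means read-consuming CIGAR operators other than soft clips, and that the comparison is strictly "less than".

```python
import re

# Tokenize a CIGAR string, e.g. "25S25M" -> [(25, "S"), (25, "M")]
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# Operators that consume read bases, excluding soft clips (SAM spec).
READ_CONSUMING = {"M", "I", "=", "X"}


def parse_cigar(cigar):
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def is_overclipped(cigar, too_short, min_softclip_blocks=2):
    """Illustrative restatement of the filter as described in this thread.

    Assumption: a read is dropped when it has fewer than `too_short`
    aligned bases AND at least `min_softclip_blocks` soft-clipped ends.
    """
    # Hard clips (H) can sit outside soft clips, so ignore them when
    # looking at the ends of the read.
    core = [e for e in parse_cigar(cigar) if e[1] != "H"]
    leading = bool(core) and core[0][1] == "S"
    trailing = len(core) > 1 and core[-1][1] == "S"
    softclip_blocks = int(leading) + int(trailing)
    aligned = sum(n for n, op in core if op in READ_CONSUMING)
    return aligned < too_short and softclip_blocks >= min_softclip_blocks
```

With the example from this thread, `is_overclipped("25S25M", 40)` is `False` under the default of 2 blocks, while `is_overclipped("25S25M", 40, min_softclip_blocks=1)` is `True`.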

  • JensR (SF; Member)

    Hi,

    Do you have an estimate of when Jacob's fix will make it into a GATK release?

    We noticed a lot of false positives (FP) due to bacterial reads in our saliva samples. While the OverclippedReadFilter should already help a lot, Jacob's fix looks like a more robust solution, and we would like to try it out.

    -Jens

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Pretty sure it got merged some time this summer. Can't look it up now but try the latest nightly build (downloads page) and see if that helps.

  • Angie (Member)

    Hello,

    I would just like to ask whether I'm correctly interpreting the difference between the OverclippedReadFilter and --dontUseSoftClippedBases.

    When using -T HaplotypeCaller, I believe OverclippedRead just discards reads that have fewer than "--filter_is_too_short_value" bases aligned. So if a 100 bp read has 40 soft-clipped bases and I use -rf OverclippedRead --filter_is_too_short_value 65 on the command line, this read would be discarded.
    Whereas if I use the flag --dontUseSoftClippedBases, this same read (60 aligned bases + 40 soft-clipped, 60M40S) wouldn't be discarded, and its 60 aligned bases would still be used for variant calling. Is this right?

    If so, could you please help me identify the pros and cons of using one or the other for variant calling? I know that OverclippedRead was developed to deal with foreign DNA in human samples, such as bacterial DNA. In my case, however, I'm working with a non-model species, and I found that many of the mapped reads have a considerable proportion of soft-clipped (SC) bases. I would like to avoid false-positive calls from those SC bases, but I don't know which is more appropriate: discarding the whole read or just not using the SC bases.

    Another issue is that many of the ref-allele calls fall at the ends of reads, which makes me worry that those calls are false positives (although it would be worse if they were alternative-allele calls). I think this problem, if those calls are indeed supported by SC bases, would be solved by dealing with either the soft-clipped bases or the entire overclipped read.

    Many many thanks in advance for your help and comments!!

    best,

    Angelica

    Issue · GitHub (by Sheila): issue number 2190, state closed, closed by vdauwera.

  • Sheila (Broad Institute; Member, Broadie admin)

    @Angie
    Hi Angelica,

    Sorry for the delay. My team will get back to you soon about this.

    -Sheila

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi @Angie, your interpretation of what the OverclippedReadFilter and the --dontUseSoftClippedBases argument do is correct. The main reason we have both is that the OverclippedReadFilter allows us to get rid of reads that originate from foreign DNA (e.g. a pathogen contaminant) while retaining the ability to use the soft-clipped regions of "good" reads. These often hold valuable information that is crucial to calling large indels confidently, so we wouldn't want to lose them.

    I would recommend trying to find out why you're getting so much soft-clipping. I believe in some non-model organisms you might experience that kind of problem if the individuals you're sequencing are genetically distant enough from the reference to cause mapping issues. If so you would probably want to allow the program to use the soft-clipped bases. If that seems unlikely then the quality of the sequencing data may be in doubt, which is a bigger concern.
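To make the trade-off discussed above concrete, here is a small Python sketch of how many bases of a single read remain available under each option. It is a deliberate simplification and not how HaplotypeCaller is implemented: `aligned_and_clipped` and `bases_used` are hypothetical helper names, and the caller decides whether the read was filtered. Note that, per Jacob's explanation earlier in this thread, a read clipped on one end only (such as 60M40S) is dropped by the filter only when the minimum number of soft-clipped blocks is set to 1 or 0.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_and_clipped(cigar):
    """Return (aligned_bases, soft_clipped_bases) for a CIGAR string."""
    aligned = clipped = 0
    for n, op in CIGAR_RE.findall(cigar):
        if op == "S":
            clipped += int(n)
        elif op in "MI=X":  # read-consuming operators other than soft clips
            aligned += int(n)
    return aligned, clipped


def bases_used(cigar, filtered_out, dont_use_soft_clipped_bases):
    """Bases of this read that remain available to the caller (simplified)."""
    if filtered_out:
        # A read filter removes the whole read, aligned part included.
        return 0
    aligned, clipped = aligned_and_clipped(cigar)
    # Ignoring soft clips keeps the read but only its aligned portion.
    return aligned if dont_use_soft_clipped_bases else aligned + clipped


for label, filtered, ignore_sc in [
    ("read filtered out", True, False),
    ("soft clips ignored", False, True),
    ("soft clips used", False, False),
]:
    print(label, bases_used("60M40S", filtered, ignore_sc))
```

For the 60M40S read, this gives 0 bases when the read is filtered, 60 when soft clips are ignored, and 100 when soft clips are used.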

  • Angie (Member)

    Thanks a lot @Sheila and @Geraldine_VdAuwera for your help.

    It's possible that the problem is caused by the reference genome. I'm working with a hybrid species, and the reference genome used to map the reads is from one of the parental species, so there will be some regions of the hybrid genome where I'd be mapping to a sister-species genome rather than a "same species" genome. I could run a trial with the different methods and compare which is more suitable; however, either way there will be some limitations in the variants I obtain.

    best,

    Angelica
