High proportion of reads excluded after Base Recalibration

Hi GATK team,

Currently, I am doing RNA-seq variant calling following your Best Practices, except that I am using HISAT2 for alignment.
After the BaseRecalibrator step, I noticed in the output that many reads are excluded, mainly by NotPrimaryAlignmentFilter. Does this mean my reads have a high number of secondary alignments? If so, will that be a problem?

Below is the complete output message:

INFO 15:33:06,313 BaseRecalibrator - BaseRecalibrator was able to recalibrate 63343819 reads
INFO 15:33:06,318 ProgressMeter - done 6.3343875E7 38.8 m 36.0 s 99.7% 38.9 m 6.0 s
INFO 15:33:06,319 ProgressMeter - Total runtime 2325.16 secs, 38.75 min, 0.65 hours
INFO 15:33:06,319 MicroScheduler - 292530923 reads were filtered out during the traversal out of approximately 355874798 total reads (82.20%)
INFO 15:33:06,320 MicroScheduler - -> 753 reads (0.00% of total) failing BadCigarFilter
INFO 15:33:06,320 MicroScheduler - -> 50087851 reads (14.07% of total) failing DuplicateReadFilter
INFO 15:33:06,320 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 15:33:06,320 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 15:33:06,320 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 15:33:06,320 MicroScheduler - -> 9331702 reads (2.62% of total) failing MappingQualityZeroFilter
INFO 15:33:06,320 MicroScheduler - -> 233110617 reads (65.50% of total) failing NotPrimaryAlignmentFilter
INFO 15:33:06,321 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter


Done. ------------------------------------------------------------------------------------------

I would appreciate your suggestions. Thank you.
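
The filter summary in the log above can be checked with a few lines of Python. This is only a sketch for readers who want to tabulate such summaries: the regular expressions are written against the exact wording of the MicroScheduler lines quoted in this post (timestamps trimmed) and are an assumption about the log format in general, and `parse_filter_summary` is a hypothetical helper, not part of GATK.

```python
import re

# MicroScheduler lines copied from the log above, with the "INFO <time>" prefix trimmed.
LOG = """\
MicroScheduler - 292530923 reads were filtered out during the traversal out of approximately 355874798 total reads (82.20%)
MicroScheduler - -> 753 reads (0.00% of total) failing BadCigarFilter
MicroScheduler - -> 50087851 reads (14.07% of total) failing DuplicateReadFilter
MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
MicroScheduler - -> 9331702 reads (2.62% of total) failing MappingQualityZeroFilter
MicroScheduler - -> 233110617 reads (65.50% of total) failing NotPrimaryAlignmentFilter
MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
"""

# Patterns assumed from the wording of the lines above.
TOTAL_RE = re.compile(r"(\d+) reads were filtered out .* approximately (\d+) total reads")
FILTER_RE = re.compile(r"-> (\d+) reads \([\d.]+% of total\) failing (\w+)")


def parse_filter_summary(log_text):
    """Return (filtered, total, {filter_name: count}) from a filter summary."""
    filtered = total = None
    per_filter = {}
    for line in log_text.splitlines():
        m = TOTAL_RE.search(line)
        if m:
            filtered, total = int(m.group(1)), int(m.group(2))
            continue
        m = FILTER_RE.search(line)
        if m:
            per_filter[m.group(2)] = int(m.group(1))
    return filtered, total, per_filter


filtered, total, per_filter = parse_filter_summary(LOG)

# Largest contributors first, with the share of all reads recomputed from the counts.
for name, count in sorted(per_filter.items(), key=lambda kv: -kv[1]):
    print(f"{name:35s} {count:>12d} {100 * count / total:6.2f}%")

print("sum of per-filter counts:", sum(per_filter.values()))
print("reads left after filtering:", total - filtered)
```

On these numbers the per-filter counts add up exactly to the 292530923 filtered reads, and the 63343875 reads left over match the count shown by the ProgressMeter line (6.3343875E7).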

Answers

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @rkendar
    Hi,

    Yes, such a large proportion of failing reads will cause issues downstream. It also means money wasted on sequence data that cannot be used. We recommend STAR for alignment. Any chance you can try STAR and compare?

    -Sheila

  • Hi @Sheila,

    Thanks for your reply.
    Hm, I'll try STAR and compare. Meanwhile, is it possible to disable the NotPrimaryAlignmentFilter in GATK?
    Also, in your experience, is it normal to have many secondary alignments in RNA-seq data?

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @rkendar
    Hi,

    Let us know if STAR helps. As for the primary-alignment filter, you should be able to disable it using -drf PrimaryLineReadFilter. However, I would not do that until we can figure out why so many reads are failing.

    What are you using as a reference? I have not heard of so many secondary alignments in RNA data, but perhaps @shlee can jump in here with more insight.

    -Sheila
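
One way to look into why so many reads fail the primary-alignment filter is to count alignment records by SAM flag before running GATK. The sketch below assumes the filter keys on the secondary-alignment bit (0x100) defined in the SAM specification; `classify_alignments` is a hypothetical helper, and the five records are made-up examples rather than data from this thread.

```python
from collections import Counter

# FLAG bits from the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def classify_alignments(sam_lines):
    """Count SAM records as unmapped, secondary, supplementary, or primary."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t", 2)[1])  # FLAG is the second column
        if flag & UNMAPPED:
            counts["unmapped"] += 1
        elif flag & SECONDARY:
            counts["secondary"] += 1
        elif flag & SUPPLEMENTARY:
            counts["supplementary"] += 1
        else:
            counts["primary"] += 1
    return counts


# Made-up records for illustration: read1 has one primary and two secondary hits.
demo = [
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read1\t256\tchr2\t500\t1\t4M\t*\t0\t0\tACGT\tIIII",
    "read1\t272\tchr3\t900\t1\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t16\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]

counts = classify_alignments(demo)
n_records = sum(counts.values())
for kind, n in counts.most_common():
    print(f"{kind:14s} {n:6d} {100 * n / n_records:6.2f}%")
```

On a real file, the same function could be fed the text records of the BAM (for example the output of `samtools view`); `samtools flagstat` also reports a secondary count directly. A secondary fraction close to the 65.50% seen in the log would point at the aligner reporting multiple hits per read rather than at a problem in BaseRecalibrator itself.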