50bp Reads

Hi,

I am running 50 bp rnaseq reads through the RNASEQ best practice pipeline. 100% of my reads are being filtered.

INFO 00:45:55,543 MicroScheduler - -> 50520046 reads (53.97% of total) failing DuplicateReadFilter INFO 00:45:55,543 MicroScheduler - -> 20934382 reads (22.36% of total) failing FailsVendorQualityCheckFilter INFO 00:45:55,543 MicroScheduler - -> 22159150 reads (23.67% of total) failing HCMappingQualityFilter

I have run the same pipeline, with same reference with two sets of reads of lengths 100bp and 125 bp and both work fine. I am trying to disable the filters using (below) but I am getting the following errors. I get a similar error for using both HCMappingQuality and FailsVendorQualityCheck.

-drf DuplicateRead \ -drf HCMappingQuality

`Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/vlsci/SG0009/bshaban/gatk/chimpanzee/cromwell-executions/rnaSeq/a6c7e739-6394-48ff-ace4-60bae7fdb3d8/call-variantCalling/execution/tmp.nVAlSz

ERROR --
ERROR stack trace

java.lang.IllegalStateException: org.broadinstitute.gatk.tools.walkers.haplotypecaller.HCMappingQualityFilter@40239b34 cannot be disabled
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.createFilters(GenomeAnalysisEngine.java:383)
at org.broadinstitute.gatk.engine.CommandLineExecutable.getArgumentSources(CommandLineExecutable.java:169)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:214)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.8-0-ge9d806836):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: org.broadinstitute.gatk.tools.walkers.haplotypecaller.HCMappingQualityFilter@40239b34 cannot be disabled
ERROR ------------------------------------------------------------------------------------------

`

I went to check [https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_engine_filters_FailsVendorQualityCheckFilter.php]
(https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_engine_filters_FailsVendorQualityCheckFilter.php "https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_engine_filters_FailsVendorQualityCheckFilter.php") but there isn't much there

and https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HCMappingQualityFilter.php

from which I changed the min quality to 10 but to no effect.

Any help would be greatly appreciated.
Thanks,
Bobbie

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @babakshaban
    Hi Bobbie,

    Does anything change if you set -mmq 0? I don't think the DuplicateReadFilter can be disabled for any tools except for ASEReadCounter.

    -Sheila

  • Hi Sheila,

    I changed the -mmq to 0 and it's still the same. I know the quality of the files are very good as I have checked them in fastqc. I run the same pipeline with "poorer" quality reads and they work fine. They are 75 and 125 bp length reads. Could this have anything to do with the minimum active region needing to be 50 bp in length and some trimming somewhere down the line cutting them to less than 50 bp?

    Thanks,
    Bobbie

    Issue · Github
    by Sheila

    Issue Number
    2991
    State
    closed
    Last Updated
    Assignee
    Array
    Closed By
    chandrans
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @babakshaban
    Hi Bobbie,

    I am not sure if that could be the cause. Let me check with the team and get back to you.

    -Sheila

  • That would be great, Thanks!
    Bobbie

  • Hi, I should add that the samples I used were from here:

    https://www.ncbi.nlm.nih.gov/sra/SRX690960[accn]
    and
    https://www.ncbi.nlm.nih.gov/sra/SRX690946[accn]

    and the reference from the link below.

    http://hgdownload.soe.ucsc.edu/goldenPath/panTro4/bigZips/

    I'm thinking the encodings on these "older" rnaseq reads may be solex1.3 or similar. I'm not sure if that would make a difference, I have however, checked using a number of available scripts and the encoding always comes up as illumina 1.9.

    Thanks,
    Bobbie.

  • Hi,

    OK. So I was able to get it "working" but I still don't really understand why. I checked, double, triple checked the encoding. I was told they were solexa1.3 encoding but every check was coming up as illumina 1.9.

    I then converted the fastq files to fasta and used --defaultBaseQualities 30 at the SplitNCigarsReads and I was finally able to obtain output.

    I still don't have an answer to why, but I will update if I find out what exactly it is. I just thought someone might now a simple reason for this and be able to save me some time.
    Thanks,
    Bobbie.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @babakshaban
    Hi Bobbie,

    So, you are saying this is an issue with low base quality scores? Now that you set the base quality scores to 30, you get variants in the final VCF? If so, what were the original base quality scores that did not produce variants in the VCF?

    Thanks,
    Sheila

  • Hi Sheila,

    It's not an issue with low quality scores, it's an issue with the quality scores. The fastqc revealed a graph with no quality score under Q30. I tried looking for any documentation that mentions issues with different types of encoding and I couldn't find any. Do you have any ideas on that?

    I also tried using the allow potentially erroneous flag but that didn't work either. It only worked when I removed all quality scores and added defaults using the --defaultBaseQualities flag.

  • Hi, shlee.

    I tried the commands in that post and I still received an error. I performed a QualityDistribution with picard and I attached the pdf below. Does the Quality score distribution chart account for scores above 40?

    I used both the -fixMisencodedQuals, --fix_misencoded_quality_scores

    And I received the following error.

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 3.8-0-ge9d806836):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions https://software.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool
    ERROR ------------------------------------------------------------------------------------------

    So I then used allowPotentiallyMisencodedQuals

    ERROR MESSAGE: Argument with name 'allowPotentiallyMisencodedQuals' isn't defined.

    So then I used --allow_potentially_misencoded_quality_scores and it seems to have worked, i.e. not thrown an error but the vcf is now empty (attached)

    I then go to stderr for the haplotype section and the MicroScheduler then gives me the following.

    INFO 22:29:33,227 HaplotypeCaller - Ran local assembly on 0 active regions
    INFO 22:29:33,231 ProgressMeter - done 3.146680125E9 13.1 m 0.0 s 100.0% 13.1 m 0.0 s
    INFO 22:29:33,231 ProgressMeter - Total runtime 783.77 secs, 13.06 min, 0.22 hours
    INFO 22:29:33,231 MicroScheduler - 10100760 reads were filtered out during the traversal out of approximately 10100760 total reads (100.00%)
    INFO 22:29:33,232 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
    INFO 22:29:33,232 MicroScheduler - -> 2914677 reads (28.86% of total) failing DuplicateReadFilter
    INFO 22:29:33,232 MicroScheduler - -> 4721331 reads (46.74% of total) failing FailsVendorQualityCheckFilter
    INFO 22:29:33,232 MicroScheduler - -> 0 reads (0.00% of total) failing HCMappingQualityFilter
    INFO 22:29:33,232 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
    INFO 22:29:33,233 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
    INFO 22:29:33,233 MicroScheduler - -> 2464752 reads (24.40% of total) failing NotPrimaryAlignmentFilter

    INFO 22:29:33,233 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

    Done. ------------------------------------------------------------------------------------------

    I think it's an issue with the encoding as I previously mentioned but I am uncertain as to what it is.

    I really appreciate your help.
    Bobbie

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited March 20

    Hi @babakshaban,

    Please bear with me as I'm new to this error. Looks like there are a number of discussion threads that discuss handling misencoded quals. Here are two that appear to have found the source of the error:

    From here we see that GATK expects the standard expected format of Q0 == ASCII 33. Finally, from this wiki entry we see that there are a variety of scales used by different platforms for base qualities. Again, to use the terminology from the article, GATK expects Sanger Phred+33 scale and NOT Solexa+64 scale base qualities.

    Your error message:

    ERROR MESSAGE: Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool

    implies there are reads from the two different scales in your BAM. What I suggest you do is confirm that this is so. You can post your BAM header here and we can help you figure it out if you need. Specifically, we are looking to see multiple read groups in the header, which you can pull out with samtools view -H xyz.bam | grep '@RG'. The read group platform PL tag will indicate the sequencing tech used in generating the data. Then, you want to pull out some example reads from each of the read groups to confirm that their base qualities are on different scales, e.g. with samtools view xyz.bam | grep 'RG:Z:<fill with each read group ID> | head -n 2.

    One of our developers suggests that you generate the quality score distribution plot you attached before for each read group.

    If we confirm that there are reads mixed from different scales, you will have to separate them into different files, e.g. using RevertSam OUTPUT_BY_READGROUP and REMOVE_ALIGNMENT_INFORMATION false etc. (be sure to study all the options and set them appropriately). Please do let us know if this is the case as we will want to improve the error message folks get to be more helpful.

    If it is not the case your reads are from mixed scales, then, well, we'll have to take it from there.

    P.S. Instead of RevertSam, I see we have a tool SplitReads that can organize reads by read group.
    P.P.S. You may find the solution presented in this discussion of interest.

    Post edited by shlee on
  • shleeshlee CambridgeMember, Broadie, Moderator

    One more thing @babakshaban. I noticed that ~half your reads are QCFAIL (0x200 SAM flag).

    INFO 22:29:33,232 MicroScheduler - -> 4721331 reads (46.74% of total) failing FailsVendorQualityCheckFilter

    This means they do not pass quality controls, FYI.

Sign In or Register to comment.