Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office on November 11th and 13th 2019, due to the U.S. holiday(Veteran's day) and due to a team event(Nov 13th). We will return to monitoring the GATK forum on November 12th and 14th respectively. Thank you for your patience.

SplitNCigarReads complains about N in CIGAR

I am working on RNAseq data. I expect SplitNCigarReads can handel spliced reads, but it complains about N in CIGAR. I just wonder whether SplitNCigarReads is ready to handle spliced reads or still under construction. I am using version 3.1-1-g07a4bf8.

The command line is "java -jar $GATKjar -T SplitNCigarReads -I a.bam -R $REFfa -o a.splitN.bam".

The error message is as follows:
MESSAGE: Unsupported CIGAR operator N in read DHT4KXP1:231:C4DW4ACXX:3:1301:3574:77410 at chrM:673. Perhaps you are trying to use RNA-Seq data? While we are currently actively working to support this data type unfortunately the GATK cannot be used with this data in its current form. You have the option of either filtering out all reads with operator N in their CIGAR string (please add --filter_reads_with_N_cigar to your command line) or assume the risk of processing those reads as they are including the pertinent unsafe flag (please add -U ALLOW_N_CIGAR_READS to your command line).

According to the error message, it seems SplitNCigarReads still cannot handle spliced reads. Or, I missed something? Thanks!

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Yes, but it's easy to miss: you need to add the -U ALLOW_N_CIGAR_READS argument. In future it won't be necessary, but for now it is.

  • srivas31srivas31 Member

    Hi Geraldine,

    I am trying to run GATK RNA-Seq workflow on mouse genome to call the variants. However, I am getting the error in SplitNCigarReads tool for few samples; the command I am using is below

    java -Xmx4g -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R GRCm38_68_chr.fa -I .rg.bam -o mq.reassign.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

    The error is:

    ERROR stack trace

    java.lang.NullPointerException
    at org.broadinstitute.sting.utils.GenomeLoc.disjointP(GenomeLoc.java:133)
    at org.broadinstitute.sting.utils.GenomeLoc.overlapsP(GenomeLoc.java:143)
    at org.broadinstitute.sting.gatk.walkers.rnaseq.OverhangFixingManager.fixSplit(OverhangFixingManager.java:235)
    at org.broadinstitute.sting.gatk.walkers.rnaseq.OverhangFixingManager.addRead(OverhangFixingManager.java:209)
    at org.broadinstitute.sting.gatk.walkers.rnaseq.SplitNCigarReads.splitNCigarRead(SplitNCigarReads.java:202)
    at org.broadinstitute.sting.gatk.walkers.rnaseq.SplitNCigarReads.reduce(SplitNCigarReads.java:167)
    at org.broadinstitute.sting.gatk.walkers.rnaseq.SplitNCigarReads.reduce(SplitNCigarReads.java:87)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsReduce.apply(TraverseReadsNano.java:251)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsReduce.apply(TraverseReadsNano.java:240)
    at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:279)
    at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:102)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:56)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:108)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:121)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:107)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):

    It is running fine for many samples but giving error for few bams. It is working fine If I use the option the --filter_reads_with_N_cigar.
    Could you please let me know if there is any way-around this problem.

    Thanks !

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @srivas31,

    It looks like we have a bug in SplitNCigarReads, as three other users have reported similar issues this week. We're hoping to process this very soon, and we'll make an announcement when we have a fix. Sorry for the inconvenience!

  • corlagoncorlagon germanyMember
    edited November 2014

    Just got the same error as DZhao using version 3.3.0.
    The whole point of the program is to split reads according to N's, isn't it? At least that's the piece you get from the "help"

    rnaseq
    SplitNCigarReads Splits reads that contain Ns in their cigar string (e.g.

    So what's the "error message" good for?
    And what is it doing when you set "-U ALLOW_N_CIGAR_READS" in its extremely long running time?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    -U ALLOW_N_CIGAR_READS tells the GATK engine not to freak out on the reads with N cigars. This is a leftover of pre-RNAseq times; eventually we'll have some smarter logic that sets this automatically for RNAseq processing tools, but in the meantime you need to set it in your command, as documented.

  • corlagoncorlagon germanyMember

    Dear Geraldine,
    Thanks for the fast reply! So this tool has had a different function in former times? Could you then please tell me what it is doing with my reads? Is it splitting them? Is it doing something else or additionally?

    I normally read the documentation/help before asking that many questions, but what I quoted before is everything the help provides for the tool. Furthermore, it is really confusing when the only tool specifically listed under "rnaseq" tells you that it is not intended to be used with this kind of data type... And just turning off warning messages is normally not what I'm trying first ;) Maybe adding a small sentence like "For now, this tool has to be used with the "-U ALLOW_N_CIGAR_READS" flag" will be a faster and less time consuming "fix" then setting it automatically
    Best, c

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    This tool is a relatively new one; the filtering out of N cigar reads is a legacy utility in the GATK engine, which for a long time was needed to ensure data integrity. We need to make the engine catch up to the tool itself. This is all documented in the online documentation (under "Guide" in the menu at the top of the page) about the RNA variant calling workflow. We do recommend users read the workflow documentation in addition to individual tool docs for best results.

  • HasaniHasani GermanyMember
    edited December 2014

    Hello GATK,

    according to RNA-seq workflow, SplitNCigarReads splits reads into exon segments and hard-clip any sequences overhanging into the intronic regions. If this step already done, and I correct the mapping scores as in 1, is there any remaining need to provide the exon list to the tool?

    Thank you in advance
    H.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Providing a list of intervals of interest will speed up the process by allowing GATK to ignore other regions.

  • fcruzfcruz SpainMember

    Dear GATK Team,

    I am calling variants in RNAseq using the recommended pipeline: https://www.broadinstitute.org/gatk/guide/article?id=3891

    I have validated the BAMs with Picard v1.60 after the SplitNRead step with GATK v3.6. For all the BAMs I've got the "INVALID_CIGAR" error. A more verbose look at them states "Read name SRR442036.1427131, No real operator (M|I|D|N) in CIGAR". That's what I should expect from the "-U ALLOW_N_CIGAR_READS" option, Right?

    However, there are another errors that concern me a bit, in which cases should I worry?

    ERROR:INVALID_VERSION_NUMBER 1
    ERROR:MATES_ARE_SAME_END 8271472
    ERROR:MATE_NOT_FOUND 6157122
    ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 16542944
    ERROR:MISMATCH_MATE_ALIGNMENT_START 17972754

    Thanks in advance,
    Fernando

  • SheilaSheila Broad InstituteMember, Broadie admin

    @fcruz
    Hi Fernando,

    It looks like something may have gone wrong during the alignment step. Did you use STAR aligner in the same way we recommend?

    -Sheila

  • fcruzfcruz SpainMember

    Hi Sheila,

    I have to say that after submitting my enquiry I do realized that there were problems generating some bams because I did not write in temporary directory (the computing nodes of our cluster complained about disk space).

    After doing things properly and realignment around indels. The picard a set of summary reports just say: ERROR:INVALID_VERSION_NUMBER but other samples say:

    HISTOGRAM java.lang.String

    Error Type Count
    ERROR:INVALID_VERSION_NUMBER 1
    ERROR:MATES_ARE_SAME_END 6361954
    ERROR:MATE_NOT_FOUND 4926843
    ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 6029245
    ERROR:MISMATCH_MATE_ALIGNMENT_START 7487777

    For such set of samples I realized (after the mapping) that they were produced using Phred+64. Therefore I have to include option --fix_misencoded_quality_scores during the SpliNCigarReads step.

    We have produced the alignments using STAR version 2.5.1b following the GATK guidelines but we did not have consider the base quality encoding at all. Do you think this could be the source of the problem?

    Thanks in advance,
    Fernando

    Issue · Github
    by Sheila

    Issue Number
    1030
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • fcruzfcruz SpainMember

    Hello again Sheila,

    Our STAR alignments also differ from the recommended by GATK. We have added an option to remove non-canonical junctions.

    STAR --genomeDir ../../genome --readFilesIn Gill1_l1_1.fq Gill1_l1_2.fq --runThreadN 4 --outFileNamePrefix Gill1_l1 --outFilterIntronMotifs RemoveNoncanonical

    Hope this helps to spot and fix the issues.

    Thanks,
    Fernando

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @fcruz That sounds like something you should ask the developer of STAR. He has a mailing list and is very responsive, so I would recommend you ask your question there. Good luck.

  • ArielPaulsonArielPaulson Kansas City MOMember

    Hi GATK team,

    I also get the error "No real operator (M|I|D|N) in CIGAR" while running Picard's BuildBamIndex after running SplitNCigarReads.

    In this case a 1-pass STAR alignment was used, with the genome (danRer10) pre-indexed along with splice junctions from Ensembl 84 models. The alignment was fine and SplitNCigarReads completed 24 BAMs, 1.5 - 2.0GB each, with no errors.

    An example of the problem is this input cigar 12S3M317N31M3783N30M, which SplitNCigarReads broke into 3 parts: 9S4167H, 332H31M3813H, and 4146H30M. The first is not really an alignment, and it is not clear why it was generated.

    Short of filtering the bams to remove all alignments with no "M" in the cigar, how do I suppress these? I don't see a lot of parameters for SplitNCigarReads.

    Thanks,
    Ariel

    Issue · Github
    by Sheila

    Issue Number
    1227
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @ArielPaulson, We're aware that SplitNCigarReads causes some invalid read issues; this has been addressed in the GATK4 alpha. Pending that being released, I would advise that you use the BadCigar filter to filter out these problematic reads. Be sure to check the filtering results to assess how many reads are getting filtered our because of this.

Sign In or Register to comment.