
Errors on the road to GATK

I am using the Best Practices RNA-seq pipeline for 6 libraries. Four have completed without any problem. Two (from the same project) have gotten snagged, with errors at AddOrReplaceReadGroups and at MarkDuplicates:

Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: Read name HWI-D00273:94:C6GFHANXX:8:1312:12804:32959, CIGAR M operator maps off end of reference

and

Exception in thread "main" net.sf.samtools.SAMFormatException: Did not inflate expected amount

I know Picard tools is not part of GATK, but I wondered if anyone has thoughts about what's going on. I have tried starting from scratch with trimmed reads, running CleanSam, checking that all pairs are intact... nothing helps. I'm especially puzzled that the other libraries have no issues.

Answers

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    We're actually providing support for Picard too now so you're in the right place :)

    The CIGAR error looks to me like a mistake by the mapper. What program did you use? Does the file immediately output by the mapper pass validation?

    The second error looks like a possible file corruption problem. I'd recommend redoing the previous operation to regenerate the file.
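To illustrate what the validator is complaining about, here is a minimal sketch (not Picard's actual code) of how "CIGAR M operator maps off end of reference" can be detected: per the SAM spec, the reference-consuming CIGAR operators are M, D, N, =, and X, and the alignment's end position (POS plus their summed lengths) must not exceed the contig length.

```python
import re

# Reference-consuming CIGAR operators per the SAM specification.
REF_CONSUMING = set("MDN=X")

def reference_end(pos, cigar):
    """1-based inclusive end position of the alignment on the reference."""
    span = sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in REF_CONSUMING)
    return pos + span - 1

def maps_off_reference(pos, cigar, contig_length):
    """True if the alignment extends past the end of its contig."""
    return reference_end(pos, cigar) > contig_length

# Example: a 100M read placed 50 bases before the end of a contig
# extends past it and would trigger the validation error.
```

A read like the one in the error message would fail this check when the aligner emitted a position/CIGAR combination that runs past the contig end.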

  • saraopp nyc Member

    I'm using STAR. The log file concludes with "all done!" and the Log.final.out file reads:
    Started job on | Oct 04 12:14:44
    Started mapping on | Oct 04 12:14:53
    Finished on | Oct 04 13:21:38
    Mapping speed, Million of reads per hour | 50.89

                          Number of input reads |   56610576
                      Average input read length |   244
                                    UNIQUE READS:
                   Uniquely mapped reads number |   11749664
                        Uniquely mapped reads % |   20.76%
                          Average mapped length |   220.34
                       Number of splices: Total |   53778
            Number of splices: Annotated (sjdb) |   47481
                       Number of splices: GT/AG |   36504
                       Number of splices: GC/AG |   5814
                       Number of splices: AT/AC |   164
               Number of splices: Non-canonical |   11296
                      Mismatch rate per base, % |   0.48%
                         Deletion rate per base |   0.01%
                        Deletion average length |   1.63
                        Insertion rate per base |   0.00%
                       Insertion average length |   2.52
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   5750899
             % of reads mapped to multiple loci |   10.16%
        Number of reads mapped to too many loci |   215233
             % of reads mapped to too many loci |   0.38%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |   0.00%
                 % of reads unmapped: too short |   68.70%
                     % of reads unmapped: other |   0.00%
    

    Which seems okay???

    I will try rerunning to fix the second error.

  • saraopp nyc Member

    When I reran the second step, I still got the error. Here is the complete error text:

    Exception in thread "main" net.sf.samtools.SAMFormatException: Did not inflate expected amount
    at net.sf.samtools.util.BlockGunzipper.unzipBlock(BlockGunzipper.java:98)
    at net.sf.samtools.util.BlockCompressedInputStream.inflateBlock(BlockCompressedInputStream.java:379)
    at net.sf.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:361)
    at net.sf.samtools.util.BlockCompressedInputStream.available(BlockCompressedInputStream.java:109)
    at net.sf.samtools.util.BlockCompressedInputStream.read(BlockCompressedInputStream.java:234)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at net.sf.samtools.util.BinaryCodec.readBytesOrFewer(BinaryCodec.java:394)
    at net.sf.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:371)
    at net.sf.samtools.util.BinaryCodec.readString(BinaryCodec.java:449)
    at net.sf.samtools.BAMFileReader.readSequenceRecord(BAMFileReader.java:435)
    at net.sf.samtools.BAMFileReader.readHeader(BAMFileReader.java:405)
    at net.sf.samtools.BAMFileReader.<init>(BAMFileReader.java:146)
    at net.sf.samtools.BAMFileReader.<init>(BAMFileReader.java:114)
    at net.sf.samtools.SAMFileReader.init(SAMFileReader.java:514)
    at net.sf.samtools.SAMFileReader.<init>(SAMFileReader.java:167)
    at net.sf.samtools.SAMFileReader.<init>(SAMFileReader.java:122)
    at net.sf.picard.sam.MarkDuplicates.openInputs(MarkDuplicates.java:272)
    at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:322)
    at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:122)
    at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
    at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:106)

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    For the first error (CIGAR), what does Picard ValidateSamFile say on the file produced by STAR?

    For the second, I mean you need to rerun the step that generated the file which is then getting an error in MarkDuplicates, not rerun MarkDuplicates itself.

  • saraopp nyc Member

    ValidateSamFile says the following:
    WARNING: Record 1, Read name HWI-D00273:94:C6GFHANXX:8:1101:20146:19326, NM tag (nucleotide differences) is missing
    ......
    Maximum output of [100] errors reached.

    For the second issue, I tried going back to the first step. Everything works until MarkDuplicates, then I get the same error again. When I ran ValidateSamFile on the input SAM file, I got the same errors as above.
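When ValidateSamFile caps out at its error limit like this, it can help to tally its output by severity to see whether the file has one systematic issue (like the missing NM tags here) or many distinct ones. A small sketch that counts ERROR and WARNING lines in the tool's text output:

```python
from collections import Counter

def tally_validation(lines):
    """Count ERROR and WARNING lines in ValidateSamFile text output."""
    counts = Counter()
    for line in lines:
        stripped = line.strip()
        for level in ("ERROR", "WARNING"):
            if stripped.startswith(level + ":"):
                counts[level] += 1
    return counts
```

(ValidateSamFile also has a MODE=SUMMARY option that aggregates counts per error type, which is usually the more convenient route.)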

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Oy, I just noticed that 68.70% of your reads are unmapped because of being too short. That seems problematic. Have you done any QC on your data?

  • saraopp nyc Member

    Yes: trimmed for quality and adapter content, rRNA filtering, Kraken filtering, using only good pairs. What does "too short" mean in this context? And again, these are 2 out of 6 files all prepared, sequenced, and processed in a single pipeline, so it's not clear to me why some have problems and others don't. Sorry, I am grouchy about why nothing is ever easy! I appreciate your input and thoughts about what may be happening.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    You'll have to ask the developer of STAR what his program considers too short for mapping. But this sounds like the data quality of those samples was inferior to the others. That happens; sometimes things go wrong in the lab during prep, or during sequencing. That's why we do systematic QC. By QC I don't mean processing like trimming; I mean collecting statistics on the dataset and flagging samples for further investigation if the numbers fall below certain thresholds. This helps us avoid wasting computing resources on sequence datasets that are not good enough to generate usable call data.

    If you applied quality trimming and a lot of the data was low quality, that would explain why you end up with a lot of reads being too short.

    I would recommend going over those samples with whoever provided the sequence data. If the data is failing quality thresholds they may need to redo the sequencing.
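The threshold-based QC gating described above can be sketched as follows; the statistics match those in the STAR log earlier in the thread, but the threshold values here are illustrative assumptions, not GATK recommendations.

```python
# Illustrative QC thresholds (assumptions for this sketch, not official values).
THRESHOLDS = {
    "uniquely_mapped_pct": 60.0,     # flag if below
    "unmapped_too_short_pct": 20.0,  # flag if above
}

def qc_flags(stats):
    """Return a list of QC flags for one sample's mapping statistics."""
    flags = []
    if stats["uniquely_mapped_pct"] < THRESHOLDS["uniquely_mapped_pct"]:
        flags.append("low unique mapping rate")
    if stats["unmapped_too_short_pct"] > THRESHOLDS["unmapped_too_short_pct"]:
        flags.append("many reads unmapped: too short")
    return flags

# The problem sample from the STAR log above:
print(qc_flags({"uniquely_mapped_pct": 20.76, "unmapped_too_short_pct": 68.70}))
# → ['low unique mapping rate', 'many reads unmapped: too short']
```

A sample like the four clean libraries would come back with an empty flag list and proceed through the pipeline.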

  • saraopp nyc Member

    Here are fastqc screenshots for my reads. The "clean" files are my trimmed, cleaned read set (the ones I'm trying to use for mapping and variant calling). To me they all look perfectly respectable...?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Do you know your distribution of read length before/after trimming?

  • saraopp nyc Member

    Read length (125bp) did not change. Total number of reads was reduced from 58,002,580 to 34,597,412 and from 62,996,559 to 48,750,864.
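One way to check the full read-length distribution (rather than just the nominal 125 bp) is to count reads per length directly from the FASTQ; FastQC reports this too, under "Sequence Length Distribution". A minimal sketch, assuming an uncompressed four-line-per-record FASTQ:

```python
from collections import Counter

def length_distribution(fastq_lines):
    """Count reads per length from an iterable of FASTQ lines
    (four lines per record; the sequence is the second line)."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # sequence line of each record
            counts[len(line.strip())] += 1
    return counts

# Usage sketch:
#   with open("sample_clean.fastq") as fh:
#       print(length_distribution(fh))
```

If trimming removed whole reads (or whole pairs) rather than shortening them, this distribution would indeed stay at a single length while the total count drops, consistent with the numbers reported above.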

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Read length is unchanged after trimming? How does that work?

    In any case you should really ask the STAR developer, Alex Dobin, what his program considers "too short for mapping". There's obviously something going on there and we have no way to help you with it. Alex has a mailing list and is very responsive, so I encourage you to get in touch with him.

  • saraopp nyc Member

    Apparently the sequences were high quality with no adapter content, and what I removed was putative contamination? I will contact Alex and look over my read data in more detail. I agree that unchanged length is odd, especially if reads are being called "too short" for mapping.
