Coordinate errors

gtazgtaz NetherlandsMember
edited September 2016 in GenomeSTRiP

I'm trying to SVpreprocess some bam files but I keep getting errors stating that the bam files are not properly ordered:

RCCache: Read coordinate chr22:23242120 precedes earliest allowed read coordinate chr22:23243001 - input bam files are out of order?

This actually happens on every chromosome at different positions.

I've used Picard ResortSam and coordinate sorting with both samtools and picard for preparing the bam files, but I keep getting these errors. Does anyone have any more ideas on this?

Best Answer

Answers

  • KateNKateN Cambridge, MAMember, Broadie, Moderator

    Thank you for your question @gtaz. I have moved it to our GATK-specific forum, as it appears to be a general question about GATK, rather than one about WDL. If you think this is a mistake, please let me know and I can take another look. Otherwise, I leave you in the very capable hands of my collegue, @Sheila.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @gtaz
    Hi,

    Sorry for the late response. Have you re-generated the BAM index file? Can you try running ValidateSamFile on your files?

    Thanks,
    Sheila

  • gtazgtaz NetherlandsMember
    edited October 2016

    My prep pipeline is now (because of the errors after my normal SortSam & MarkDup):
    1) ReorderSam
    2) SortSam (coordinate)
    3) MarkDuplicates (no removal)
    4) BuildBamIndex (But I've indexed at every step)

    Output ValidateSamFile: "No errors found"

    No errors in the logs of step 1-4 either....but still the same errors in GS (at every chromosome).

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @gtaz Moving your question to the Genomestrip category, which is supported by a different team; apologies for bouncing you around. This looks to me like it could be a thread safety issue but @bhandsaker may be of more help to you.

  • bhandsakerbhandsaker Member, Broadie, Moderator

    Can you post a stack trace from one of the errors?

  • gtazgtaz NetherlandsMember

    Sure! Here is the entire Log/error file of Preprocessing on chr1:

  • bhandsakerbhandsaker Member, Broadie, Moderator
    edited October 2016

    On the face of it, I'm suspicious that your bam file is not, in fact, properly sorted.
    Can you post your reference .fai file ?
    Let's also check to make sure one of the input bam files is in fact sorted in the range identified in the error message.
    Perhaps something like:

    samtools view LP6008176-DNA_B12.reheader.reorder.sorted.dedup.depth.bam \ | cut -f 3,4 | grep -n . | awk '$2 > 8980000 && $2 < 8985000' | grep -w chr1

    and let's especially check for reads with start coordinate 8980976.

  • gtazgtaz NetherlandsMember

    That was my first idea as well, but sorting with samtools/picard/sambamba all led to the same error.
    Here are the chr 1 regions for the three bam files I'm using for troubleshooting (the final set will be much larger ofcourse) and the fasta.fai to which I reordered (but should also be the reference of the alignment).

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    By the way, did you try regenerating the index file explicitly after sorting the bam? We've seen cases where chucking the index file and making a new one resolved sorting complaints.

  • gtazgtaz NetherlandsMember

    Yes; I used Picard BuildBamIndex as my final step before starting GenomeSTRiP (and previously, before the errors started, I used samtools sort followed by index).

    Additional information that might be of use: My pipeline used to work fine one year ago, the only things that changed are that this is a new batch of bam files (generated by Illumina, Hiseq X) and there has been a system update (CentOS6 to CentOS7). It was until after the errors that I've upgraded to the latest GenomeSTRiP version.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
    Any chance you can test the updated pipeline with an old batch of files that ran well previously?
  • gtazgtaz NetherlandsMember
    edited October 2016

    You are probably right!

    Checking the problematic region in one of the bam files with the 8980976 read, I found the following:

    "HKLH2CCXX:2:2118:3909378:0 161 chr1 8980900 37 1M6440D149M"

    I removed this read and rerun the pipeline hoping that it would skip the first error, but I still get the same errors, so there are probably more of these in there.

    The aligner that was used for generation of our bam files is Isaac's aligner, which was done by our service provider; we were thinking of realigning anyway and this might be another argument to do so (but it's on a lot of bam files, we don't have the fastq).

    Running the pipeline on the old bam files (from the same service provider) works perfectly fine. So it's not a system issue. However, these in these bam files Picard reported thousands of errors while marking duplicates, such as:

    "Ignoring SAM validation error: ERROR: Record 440277, Read name C2C06ACXX_1:7:1116:2980013:0, bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned"

    Probably they have changed things to the aligner and now these errors are gone.

    Can you think of any way of removing/ignoring these reads instead of realigning?

  • bhandsakerbhandsaker Member, Broadie, Moderator

    Yes, this is almost certainly what is causing the problem.
    That is a bizarre alignment - this is WGS data, yes?
    Before having seen this example, I would have guessed that these were legitimate split reads but using a different representation than we usually see with bwa, but in this case it just looks to me like a bug in the aligner.

Sign In or Register to comment.