I have two bams with equivalent sams; one yields a validation error and the other does not.

Ahh_gust, University of North Texas Health Sciences Center, Member

Hi! I have two BAM files whose SAM equivalents are identical, as in:

diff <(samtools view -h small.bam) <(samtools view -h smalltest.bam)

yields nothing. Yet when I run HaplotypeCaller on one of the files, I get this error for every read:

Ignoring SAM validation error: ERROR: Record 1, Read name RSRS1, bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned

and no SNPs are generated, while the other file processes just fine.

Needless to say, the bin fields are the same.

To be clear: I generated one of the files, and it triggered the error. When I round-tripped it (bam->sam->bam), GATK processed the result correctly.
Ideas?

I'm using gatk-4.0.11.0 and samtools version 1.7 (htslib 1.9).

-August

Answers

  • AdelaideR (Unconfirmed, Member, Broadie, Moderator, admin)

    Hello @Ahh_gust

    I am curious whether this command line method is truly indicating identical sam files.

    It might be worthwhile checking if that is the case using a different method, such as bamhash

    This might reveal the issue.

    It seems that the diff tool, if I am reading it correctly, is comparing the sam files based on identical coordinates. So, if some syntax is off in the coordinates, it may not find the differences between the lines in the sam.

  • Ahh_gust, University of North Texas Health Sciences Center, Member

    To be clear, this is the Unix utility diff; I get the same result using the Unix utility cmp.

    By hand it's pretty easy to see that the reported error message is just wrong with respect to the SAM file (the sequences, CIGAR strings, positions, etc. are the same). That's not to say that there isn't something else wrong (I suspect that there is!).
    Diff just makes this statement more concrete. If I had to wager, it's something about the parsing of the BAM file. The BAMs are in fact different (they differ in size by 31 bytes), but this could be due to differences in compression.
    I'll try bamhash just to make absolutely sure, but the error is tripped even when there's a single read in both files (easy to check by hand).
    -August
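
One thing worth checking here: SAM text has no column for the bin, which is stored only in the binary BAM record, so identical `samtools view -h` output cannot confirm that the stored bin fields match. Below is a minimal Python sketch, assuming the BAM record layout given in the SAM/BAM specification, that reads the stored bin for the first few records of a file. `read_bins` is a hypothetical helper name, not part of samtools or GATK.

```python
import gzip
import struct


def read_bins(path, limit=10):
    """Return (read_name, pos_1based, stored_bin) for the first `limit` BAM records."""
    out = []
    # BGZF is a series of concatenated gzip members, which gzip.open can read.
    with gzip.open(path, "rb") as fh:
        if fh.read(4) != b"BAM\x01":
            raise ValueError("not a BAM file")
        (l_text,) = struct.unpack("<i", fh.read(4))
        fh.read(l_text)                        # skip the plain-text header
        (n_ref,) = struct.unpack("<i", fh.read(4))
        for _ in range(n_ref):                 # skip the reference dictionary
            (l_name,) = struct.unpack("<i", fh.read(4))
            fh.read(l_name + 4)                # reference name, then l_ref
        while len(out) < limit:
            raw = fh.read(4)
            if len(raw) < 4:                   # end of file
                break
            (block_size,) = struct.unpack("<i", raw)
            block = fh.read(block_size)
            # Fixed-size core fields: refID, pos, l_read_name, mapq, bin, n_cigar_op, flag
            ref_id, pos, l_read_name, mapq, bin_, n_cigar, flag = struct.unpack_from(
                "<iiBBHHH", block, 0
            )
            # The read name starts after the 32-byte core and is NUL-terminated.
            name = block[32:32 + l_read_name - 1].decode("ascii")
            out.append((name, pos + 1, bin_))
    return out


# Example usage with the two files from the post (paths as given above):
# for path in ("small.bam", "smalltest.bam"):
#     print(path, read_bins(path, limit=1))
```

If the two files report different bin values for the same read, that would explain why one passes validation and the other does not, even though their SAM text is identical.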

  • bhanuGandham (Member, Administrator, Broadie, Moderator, admin)
    edited January 11

    Hi @Ahh_gust

    Here is a document explaining a quick and easy way to examine errors in your BAM file: https://gatkforums.broadinstitute.org/gatk/discussion/7571/errors-in-sam-bam-files-can-be-diagnosed-with-validatesamfile

    This should help find why that might be.

  • Ahh_gust, University of North Texas Health Sciences Center, Member

    Hmm. This just gives me the same error messages as GATK. Again, this cannot be the error that it claims to be (wrong bin); otherwise the diff results would not be as they are. To motivate this, it's an alignment to the human mitochondrial genome. The first record looks like:
    RSRS 0 chrM 1 60 16569M
    and the length of the sequence is 16569 (making the bin calculation trivial).
    This suggests to me that the difference comes from how samtools reads a BAM versus how GATK does. Do they both use htslib, and are the versions the same?
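
For reference, the bin arithmetic for this record can be worked through with the `reg2bin` function from the SAM specification, transcribed here into Python. This is a sketch: it assumes the 1-based POS of 1 and the 16569-base reference span from the record above, converted to the 0-based, half-open interval the function expects.

```python
def reg2bin(beg: int, end: int) -> int:
    """Bin for a 0-based, half-open interval [beg, end), per the SAM specification."""
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0


# Record from the post: POS 1 (1-based), CIGAR 16569M.
pos_1based = 1
ref_span = 16569                 # bases of reference consumed by 16569M
beg = pos_1based - 1             # convert to 0-based start
end = beg + ref_span             # exclusive end
expected_bin = reg2bin(beg, end)
print(expected_bin)              # 585
```

The alignment crosses the 16384-base boundary, so it does not fit in a single smallest-level (16 kb) bin and lands in the first 128 kb bin, 585. A validator recomputing the bin for this record would expect that value in the BAM record.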

  • AdelaideR (Unconfirmed, Member, Broadie, Moderator, admin)

    @Ahh_gust I guess my comment about the "diff" method you describe is that it may not be comparing apples to apples the way it is written. If the coordinates are read differently for sam file 1 versus sam file 2, it doesn't find any differences between the two files.

    It is better to use the output from ValidateSamFile to diagnose the errors. If it is saying that there is a "wrong bin", it is probably a hidden syntax error. For example, the error:

    INVALID_INDEXING_BIN Indexing bin set on SAMRecord does not agree with computed value

    can give you some indication about what needs to be fixed.

    It would be helpful to have the complete error output from the ValidateSamFile tool for both files.

    Without further information to assist, this may be due to some internal spacing, punctuation or other syntax error that is misreading one or both of the two files.

  • bhanuGandham (Member, Administrator, Broadie, Moderator, admin)
    edited January 14

    Hi @Ahh_gust

    Run an md5sum check on both files. If they match, then please share the files with us using the following instructions, and we will look into this for you. If they do not match, then you have your answer: the files are not the same.
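
A caveat when interpreting the result: `md5sum` hashes the compressed bytes on disk, so two BAMs written with different compression settings can produce different checksums even when every record is the same. Below is a minimal Python sketch, assuming only the standard library, that hashes both the raw file and the decompressed BGZF stream so the two cases can be told apart. `md5_raw` and `md5_decompressed` are hypothetical helper names.

```python
import gzip
import hashlib


def md5_raw(path):
    """MD5 of the file exactly as stored on disk (what md5sum reports)."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def md5_decompressed(path):
    """MD5 of the uncompressed BAM stream, independent of compression settings."""
    h = hashlib.md5()
    # BGZF is a series of gzip members, so gzip.open can decompress it.
    with gzip.open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


# Example usage with the two files from the post (paths as given above):
# for path in ("small.bam", "smalltest.bam"):
#     print(path, md5_raw(path), md5_decompressed(path))
```

If the raw digests differ but the decompressed digests match, the files differ only in compression. If the decompressed digests also differ while the SAM text is identical, the difference most likely sits in something the SAM text does not show, such as the stored bin value.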
