CRAM support in GATK 3.7 is broken

I have not been able to get GATK 3.7 HaplotypeCaller to work with CRAM files at all (it has a 100% failure rate so far with our whole genome CRAMs). Based on my analysis of the problem, I don't think GATK 3.7 will work with any CRAM files containing IUPAC ambiguity codes other than 'N' (including GRCh37/hs37d5 and GRCh38/HS38DH).

The error I get is:

ERROR   2017-01-05 02:18:59     Slice   Reference MD5 mismatch for slice 2:60825966-60861215, ATCTTTCATG...CTCTCCCATT
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.7-0-gcfedb67):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: SAM/BAM/CRAM file /keep/46909b690725869e1d9bfbc1da4a1398+19932/20657_7.cram is malformed. Please see https://software.broadinstitute.org/gatk/documentation/article?id=1317for more
##### ERROR ------------------------------------------------------------------------------------------

This error occurs for 100% of my CRAM files, which can be read by samtools, scramble, or previous versions of GATK (including 3.6) without any issues, so the error message is incorrect and the CRAM files are not malformed.

The CRAM slice in question is on chromosome 3 of hs37d5 (3:60825966-60861215). We can verify externally that the FASTA reference we are passing into GATK with -R does have the md5 that GATK reports it is expecting:

$ samtools faidx /keep/d527a0b11143ebf18be6c52ff6c09552+2163/hs37d5.fa 3:60825966-60861215 | grep -v '^>' | tr -d '\012' | md5sum
0e0ff678755616cba9ac362f15b851cc  -

And the sequence starts and ends with the bases that htsjdk reports:

$ samtools faidx /keep/d527a0b11143ebf18be6c52ff6c09552+2163/hs37d5.fa 3:60825966-60861215 | grep -v '^>' | tr -d '\012' | cut -c1-10
ATCTTTCATG
$ samtools faidx /keep/d527a0b11143ebf18be6c52ff6c09552+2163/hs37d5.fa 3:60825966-60861215 | grep -v '^>' | tr -d '\012' | cut -c35241-
CTCTCCCATT

I ended up having to recompile GATK and htsjdk from source and added some print debugging to htsjdk to dump the whole sequence from which the md5 was being calculated. It seems the sequence that cause problems are regions of the reference with IUPAC ambiguity codes other than 'N' (in this case a slice of chromosome 3 that contains an 'M' and two 'R's). In GATK 3.7 (built with htsjdk 2.8.1), the reference which is used to calculate the md5 for the slice has had all ambiguity codes converted to 'N'. The md5 it calculates for this slice (according to my print debugging) is: 5d820b3624e78202f503796f7330d8d9

I have verified that this is the md5 we would get from converting the IUPAC codes in this slice to N's:

$ samtools faidx /keep/d527a0b11143ebf18be6c52ff6c09552+2163/hs37d5.fa 3:60825966-60861215 | grep -v '^>' | tr -d '\012' | tr RYMKWSBDHV NNNNNNNNNN | md5sum
5d820b3624e78202f503796f7330d8d9  -

I have tried in vain to figure out where in GATK and/or htsjdk the ambiguous reference bases are being converted to 'N's. I initially thought that it was in the CachingIndexedFastaSequenceFile call to BaseUtils.convertIUPACtoN (when preserveIUPAC is false, although I didn't find any code path that could set it to true). However, after recompiling with preserveIUPAC manually set to true, the problem persisted. I guess there must be some other place where the bases are remapped. I'll leave it to you guys to figure out how to get an unmodified view on the reference for htsjdk to use for CRAM decoding.

There is, however, no mystery as to why this problem has suddenly appeared in GATK 3.7. The slice md5 validation code in htsjdk was only added in July 2016 (https://github.com/samtools/htsjdk/commit/a781afa9597dcdbcde0020bfe464abee269b3b2e). The first release version it appears in is version 2.7.0. Prior to that, it seems CRAM slice md5's were not validated in htsjdk, so this error would not have occurred.

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    Man, you're really having a bad time with 3.7, aren't you? Sorry about all the trouble, @jrandall. These are all bits that we don't use ourselves, so I'm not surprised they're insufficiently tested -- but embarrassed nonetheless by the crop of bugs you're finding.

    This is probably an htsjdk problem -- we'll let the CRAM devs know about this and see what they think.
  • jrandalljrandall Member

    I think you are right - the issues do seem to be with htsjdk. James Bonfield has looked into this a bit as well and just opened a related issue on the htsjdk github: https://github.com/samtools/htsjdk/issues/778

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    OK, I saw that ticket pop up over there and hoped it meant we're not on the hook for this one ;)

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭
    edited September 2017

    Guys, sorry about joining the choir, but I have the same problem with 3.8. @jrandall did you revert to 3.6 or how did you solve it? Did you convert to bam? Thanks.

    Issue · Github
    by Sheila

    Issue Number
    2526
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    I should add that HaplotypeCaller3.8 yields a stream of DEBUG log entries, when calling from crams. Nothing is written to the output gVCF file, while these DEBUT log entries are written to stderr. My problem is similar to the one described here:
    https://github.com/samtools/htsjdk/issues/649

    Here example of stderr:

    DEBUG   2017-09-20 15:54:28 ContainerParser Adding external data: 7172947
    DEBUG   2017-09-20 15:54:28 ContainerParser Slice records read time: 32
    DEBUG   2017-09-20 15:54:30 CompressionHeader   FOUND ENCODING: BF_BitFlags, EXTERNAL, [15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
    
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @tommycarstensen
    Hi Tommy,

    I am happy to see you back here, but too bad you have a complaint and not a question :neutral:

    I will check with the team and get back to you.

    -Sheila

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @Sheila

    True! I've been gone for ages! I got caught up in the realm of RNAseq.

    In the end I decided to just convert from cram to bam and call the variants from the bam files with version 3.8-0-ge9d806836. I'm still waiting to see that it worked. I will know today.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @tommycarstensen
    Hi Tommy,

    Yes, it seems the recommendation is to simply convert to BAM for now.

    -Sheila

  • jrandalljrandall Member
    edited October 2017

    It looks like 3.8-0 is built using htsjdk 2.10.1 whereas the fixes to htsjdk for CRAMs with IUPAC codes (such as https://github.com/samtools/htsjdk/pull/889/files#diff-7cd217a7a5f3143d33ff0ed843c71ddc) did not land until 2.11.0 (although they were committed to htsjdk master 9 days before GATK 3.8-0 was released).

    If you were to build a new version (3.8-1, perhaps?) against a newer htsjdk, that would have a chance of fixing CRAM support (unless there are also other issues). Seems like it would be worth a try!

  • jrandalljrandall Member

    @jrandall said:
    It looks like 3.8-0 is built using htsjdk 2.10.1 whereas the fixes to htsjdk for CRAMs with IUPAC codes (such as https://github.com/samtools/htsjdk/pull/889/files#diff-7cd217a7a5f3143d33ff0ed843c71ddc) did not land until 2.11.0 (although they were committed to htsjdk master 9 days before GATK 3.8-0 was released).

    If you were to build a new version (3.8-1, perhaps?) against a newer htsjdk, that would have a chance of fixing CRAM support (unless there are also other issues). Seems like it would be worth a try!

    We have confirmed that if we build GATK 3.8 from source against htsjdk 2.11.0, human CRAMs appear to work without issue. Please could you release a new GATK built against a working htsjdk?

  • jrandalljrandall Member

    In case anyone else wants to build this from source for themselves prior to a new release, the only thing that appears to need changing is one line in public/gatk-root/pom.xml:

    # git diff public/gatk-root/pom.xml
    diff --git a/public/gatk-root/pom.xml b/public/gatk-root/pom.xml
    index 21acfb4e0..a4a17470f 100644
    --- a/public/gatk-root/pom.xml
    +++ b/public/gatk-root/pom.xml
    @@ -44,7 +44,7 @@
             <test.listeners>org.testng.reporters.FailedReporter,org.testng.reporters.JUnitXMLReporter,org.broadinstitute.gatk.utils.TestNGTestTransformer,org.broadinstitute.gatk.utils.GATKTextReporter,org.uncommons.reportng.HTMLReporter</test.listeners>
    
             <!-- Version numbers for picard and htsjdk -->
    -        <htsjdk.version>2.10.1</htsjdk.version>
    +        <htsjdk.version>2.11.0</htsjdk.version>
             <picard.version>2.10.2</picard.version>
         </properties>
    

    Issue · Github
    by Sheila

    Issue Number
    2782
    State
    open
    Last Updated
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @jrandall
    Hi,

    Thanks for the posts. I will make a note of it for the team. I think there should be a patch release coming out soon.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    We're planning a patch release soon, I'll make sure it takes the latest htsjdk.

Sign In or Register to comment.