We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!
CRAM support in GATK 3.7 is broken

I have not been able to get GATK 3.7 HaplotypeCaller to work with CRAM files at all (it has a 100% failure rate so far with our whole genome CRAMs). Based on my analysis of the problem, I don't think GATK 3.7 will work with any CRAM files containing IUPAC ambiguity codes other than 'N' (including GRCh37/hs37d5 and GRCh38/HS38DH).
The error I get is:
ERROR 2017-01-05 02:18:59 Slice Reference MD5 mismatch for slice 2:60825966-60861215, ATCTTTCATG...CTCTCCCATT ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A USER ERROR has occurred (version 3.7-0-gcfedb67): ##### ERROR ##### ERROR This means that one or more arguments or inputs in your command are incorrect. ##### ERROR The error message below tells you what is the problem. ##### ERROR ##### ERROR If the problem is an invalid argument, please check the online documentation guide ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. ##### ERROR ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions https://software.broadinstitute.org/gatk ##### ERROR ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ##### ERROR ##### ERROR MESSAGE: SAM/BAM/CRAM file /keep/46909b690725869e1d9bfbc1da4a1398+19932/20657_7.cram is malformed. Please see https://software.broadinstitute.org/gatk/documentation/article?id=1317for more ##### ERROR ------------------------------------------------------------------------------------------
This error occurs for 100% of my CRAM files, which can be read by samtools, scramble, or previous versions of GATK (including 3.6) without any issues, so the error message is incorrect and the CRAM files are not malformed.
The CRAM slice in question is on chromosome 3 of hs37d5 (3:60825966-60861215). We can verify externally that the FASTA reference we are passing into GATK with -R
does have the md5 that GATK reports it is expecting:
$ samtools faidx /keep/d527a0b11143ebf18be6c52ff6c09552+2163/hs37d5.fa 3:60825966-60861215 | grep -v '^>' | tr -d '\012' | md5sum 0e0ff678755616cba9ac362f15b851cc -
And the sequence starts and ends with the bases that htsjdk reports:
$ samtools faidx /keep/d527a0b11143ebf18be6c52ff6c09552+2163/hs37d5.fa 3:60825966-60861215 | grep -v '^>' | tr -d '\012' | cut -c1-10 ATCTTTCATG $ samtools faidx /keep/d527a0b11143ebf18be6c52ff6c09552+2163/hs37d5.fa 3:60825966-60861215 | grep -v '^>' | tr -d '\012' | cut -c35241- CTCTCCCATT
I ended up having to recompile GATK and htsjdk from source and added some print debugging to htsjdk to dump the whole sequence from which the md5 was being calculated. It seems the sequence that cause problems are regions of the reference with IUPAC ambiguity codes other than 'N' (in this case a slice of chromosome 3 that contains an 'M' and two 'R's). In GATK 3.7 (built with htsjdk 2.8.1), the reference which is used to calculate the md5 for the slice has had all ambiguity codes converted to 'N'. The md5 it calculates for this slice (according to my print debugging) is: 5d820b3624e78202f503796f7330d8d9
I have verified that this is the md5 we would get from converting the IUPAC codes in this slice to N's:
$ samtools faidx /keep/d527a0b11143ebf18be6c52ff6c09552+2163/hs37d5.fa 3:60825966-60861215 | grep -v '^>' | tr -d '\012' | tr RYMKWSBDHV NNNNNNNNNN | md5sum 5d820b3624e78202f503796f7330d8d9 -
I have tried in vain to figure out where in GATK and/or htsjdk the ambiguous reference bases are being converted to 'N's. I initially thought that it was in the CachingIndexedFastaSequenceFile call to BaseUtils.convertIUPACtoN (when preserveIUPAC
is false, although I didn't find any code path that could set it to true). However, after recompiling with preserveIUPAC
manually set to true, the problem persisted. I guess there must be some other place where the bases are remapped. I'll leave it to you guys to figure out how to get an unmodified view on the reference for htsjdk to use for CRAM decoding.
There is, however, no mystery as to why this problem has suddenly appeared in GATK 3.7. The slice md5 validation code in htsjdk was only added in July 2016 (https://github.com/samtools/htsjdk/commit/a781afa9597dcdbcde0020bfe464abee269b3b2e). The first release version it appears in is version 2.7.0. Prior to that, it seems CRAM slice md5's were not validated in htsjdk, so this error would not have occurred.
Answers
This is probably an htsjdk problem -- we'll let the CRAM devs know about this and see what they think.
I think you are right - the issues do seem to be with htsjdk. James Bonfield has looked into this a bit as well and just opened a related issue on the htsjdk github: https://github.com/samtools/htsjdk/issues/778
OK, I saw that ticket pop up over there and hoped it meant we're not on the hook for this one
Guys, sorry about joining the choir, but I have the same problem with 3.8. @jrandall did you revert to 3.6 or how did you solve it? Did you convert to bam? Thanks.
Issue · Github
by Sheila
I should add that HaplotypeCaller3.8 yields a stream of DEBUG log entries, when calling from crams. Nothing is written to the output gVCF file, while these DEBUT log entries are written to stderr. My problem is similar to the one described here:
https://github.com/samtools/htsjdk/issues/649
Here example of stderr:
@tommycarstensen
Hi Tommy,
I am happy to see you back here, but too bad you have a complaint and not a question
I will check with the team and get back to you.
-Sheila
@Sheila
True! I've been gone for ages! I got caught up in the realm of RNAseq.
In the end I decided to just convert from cram to bam and call the variants from the bam files with version 3.8-0-ge9d806836. I'm still waiting to see that it worked. I will know today.
@tommycarstensen
Hi Tommy,
Yes, it seems the recommendation is to simply convert to BAM for now.
-Sheila
It looks like 3.8-0 is built using htsjdk 2.10.1 whereas the fixes to htsjdk for CRAMs with IUPAC codes (such as https://github.com/samtools/htsjdk/pull/889/files#diff-7cd217a7a5f3143d33ff0ed843c71ddc) did not land until 2.11.0 (although they were committed to htsjdk master 9 days before GATK 3.8-0 was released).
If you were to build a new version (3.8-1, perhaps?) against a newer htsjdk, that would have a chance of fixing CRAM support (unless there are also other issues). Seems like it would be worth a try!
We have confirmed that if we build GATK 3.8 from source against htsjdk 2.11.0, human CRAMs appear to work without issue. Please could you release a new GATK built against a working htsjdk?
In case anyone else wants to build this from source for themselves prior to a new release, the only thing that appears to need changing is one line in
public/gatk-root/pom.xml
:Issue · Github
by Sheila
@jrandall
Hi,
Thanks for the posts. I will make a note of it for the team. I think there should be a patch release coming out soon.
-Sheila
We're planning a patch release soon, I'll make sure it takes the latest htsjdk.