Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

CRAM support timeline

TechnicalVaultTechnicalVault Cambridge, UKMember ✭✭✭

Hi,

With the recent release of improvements to sequencing my poor disks are threatened with an influx of more data than ever and I've been trying to figure out ways of reducing the amount of space each sample takes up on our fast compute disks (as well as the time it has to spend there). One of the approaches I am exploring is the use of CRAM (in lossless mode initially) in place of BAM as our primary binary reads format as it seems likely to become the preferred format of the ENA/EGA the SRA where we archive our BAMs.

A samtools version that will support CRAM will be released in a few weeks; CRAMtools hacks support into Picard but at the moment I'm stuck using BAM for steps with involve GATK. I was wondering if this was likely to change in the near future, if so whether it's likely to be weeks, months or undefined as yet wish list?

Thanks

Tagged:

Best Answers

  • ebanksebanks Broad Institute ✭✭✭✭
    Accepted Answer

    Yes, it's definitely in our plans. We already started work on generalizing out some of the pieces of the GATK engine to make it work with arbitrary input sources (BAMs, CRAMs, streams). I'd pin the ETA for this on the order of months (because it's not weeks and not years). Does that help?

  • kshakirkshakir ✭✭
    Accepted Answer

    The GATK 3.4 release added .cram read/write support using HTSJDK 1.132. Via the HTSJDK defaults, the GATK 3.4 writes .cram.bai. However, GATK 3.4 does not require these files upon reading, as it does not use the index, and instead always seeks into .cram from the start of the file.

    The GATK 3.5 pre-releases already use HTSJDK 1.138 to read .cram.bai, and use that to index-seek within .cram files. As with .bam files, the index is now required. However, support for reading .cram.crai is available in that version of the HTSJDK. Unless the .cram was generated using an HTSJDK-based tool, one must use CRAMTools 3.0 to create missing .cram.bai index, or the pre-release of the GATK 3.5 will not work with the .cram.

    There is an internal GATK 3.5 pull request in review that uses Vadim's latest changes to a pre-release of HTSJDK 1.139. In addition to .cram.bai, this build can read .cram.crai files generated by non-HTSJDK tools. Via the HTSJDK defaults, the GATK still writes .cram.bai, not .cram.crai.

    TL;DR The GATK 3.4+ reads and writes .cram. Within the next couple days after this post, the GATK 3.5 nightlies should be able to read and use .cram.crai or .cram.bai while using the HTSJDK defaults to write .cram.bai.

Answers

  • ebanksebanks Broad InstituteMember, Broadie, Dev ✭✭✭✭
    Accepted Answer

    Yes, it's definitely in our plans. We already started work on generalizing out some of the pieces of the GATK engine to make it work with arbitrary input sources (BAMs, CRAMs, streams). I'd pin the ETA for this on the order of months (because it's not weeks and not years). Does that help?

  • TechnicalVaultTechnicalVault Cambridge, UKMember ✭✭✭

    Thanks Eric, that's brilliant as it means it's probably more likely to happen before we start using any new tech in earnest (I'm more worried about next year than this year). I'll focus on testing the rest of it and stub out + push the GATK bits to the end of the schedule.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @ebanks can cram support be pinned down to weeks or months at this point in time? Thanks.

  • TechnicalVaultTechnicalVault Cambridge, UKMember ✭✭✭

    The CRAM pull request from Vadim has yet to be integrated into HTSJDK (last I checked owing to some spec ambiguities regarding secondary read representation in CRAM) so this is a blocker. Once that's done we'll need to wait for a version of GATK which has been converted over to fully use HTSJDK to read CRAM and had the supporting command line switches added.

  • ebanksebanks Broad InstituteMember, Broadie, Dev ✭✭✭✭

    That's exactly right. We've actually have a branch started to incorporate the changes in hts-jdk, but until it has all the functionality we need we are blocking on that.

  • gleesoninformaticsgleesoninformatics Rockefeller University, NYCMember

    Just wondering if I could get a status update on CRAM support for GATK.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Based on the latest github chatter it looks like we're very close. I'll see if we can get one of the engineers to give a more detailed sit rep.

  • kshakirkshakir Broadie, Dev ✭✭
    Accepted Answer

    The GATK 3.4 release added .cram read/write support using HTSJDK 1.132. Via the HTSJDK defaults, the GATK 3.4 writes .cram.bai. However, GATK 3.4 does not require these files upon reading, as it does not use the index, and instead always seeks into .cram from the start of the file.

    The GATK 3.5 pre-releases already use HTSJDK 1.138 to read .cram.bai, and use that to index-seek within .cram files. As with .bam files, the index is now required. However, support for reading .cram.crai is available in that version of the HTSJDK. Unless the .cram was generated using an HTSJDK-based tool, one must use CRAMTools 3.0 to create missing .cram.bai index, or the pre-release of the GATK 3.5 will not work with the .cram.

    There is an internal GATK 3.5 pull request in review that uses Vadim's latest changes to a pre-release of HTSJDK 1.139. In addition to .cram.bai, this build can read .cram.crai files generated by non-HTSJDK tools. Via the HTSJDK defaults, the GATK still writes .cram.bai, not .cram.crai.

    TL;DR The GATK 3.4+ reads and writes .cram. Within the next couple days after this post, the GATK 3.5 nightlies should be able to read and use .cram.crai or .cram.bai while using the HTSJDK defaults to write .cram.bai.

  • TechnicalVaultTechnicalVault Cambridge, UKMember ✭✭✭

    Hi @kshakir

    Could you send this to the GA4GH file formats mailing list? Vadim gave us an brief update but it would be good to hear from the Broad themselves. If you could also let us know when that pull request lands because I'd like to integration test it with the upcoming samtools/htslib 1.3 release.

  • TechnicalVaultTechnicalVault Cambridge, UKMember ✭✭✭
    edited September 2015

    I've given the current nightly a whirl and it chokes on the following line from our standard test samtools test suite (view.001.sam converted to CRAM v3 by samtools):

    ref1_grp1_p003  99  ref1    9   4   10M =   33  34  GTACCCGGGG  %%%%%%%%%%  fa:f:1.66e-27   MD:Z:10 NM:i:0  RG:Z:grp1
    

    with:

    failed at record 5. Here is the previously read record: [ref1_grp1_p003; flags=99; alignmentOffset=2; mateOffset=11; mappingQuality=4; scores: ]
    INFO  14:26:47,269 GATKRunReport - Uploaded run statistics report to AWS S3 
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A USER ERROR has occurred (version nightly-2015-09-04-gbed9adc): 
    ##### ERROR
    ##### ERROR This means that one or more arguments or inputs in your command are incorrect.
    ##### ERROR The error message below tells you what is the problem.
    ##### ERROR
    ##### ERROR If the problem is an invalid argument, please check the online documentation guide
    ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ##### ERROR
    ##### ERROR Visit our website and forum for extensive documentation and answers to 
    ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ##### ERROR
    ##### ERROR MESSAGE: SAM/BAM/CRAM file view.001.cram is malformed: java.lang.RuntimeException: java.lang.RuntimeException: Tag value is too large to store as signed integer.
    ##### ERROR ------------------------------------------------------------------------------------------
    

    I will investigate whether this is a bug in what samtools is producing or how htsjdk is parsing it.

    Issue · Github
    by Sheila

    Issue Number
    162
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • TechnicalVaultTechnicalVault Cambridge, UKMember ✭✭✭
    edited September 2015

    So it's probably actually the line after:

    ref1_grp2_p003  99  ref1    11  5   10M =   35  34  ACCCGGGGAT  &&&&&&&&&&  ia:i:4294967295 MD:Z:10 NM:i:0  RG:Z:grp2
    

    For some reason GATK can parse integers larger than signed int32 as BAM but not as CRAM. The BAM specification explicitly specifies uint32_t as a permitted type (which would be converted from i to I when converting from SAM to BAM).

  • kshakirkshakir Broadie, Dev ✭✭

    @TechnicalVault The exception you mentioned is originating from within HTSJDK.

    https://github.com/samtools/htsjdk/blob/3d13016bef282595328d4eb6eed9c71d885a0347/src/java/htsjdk/samtools/cram/structure/ReadTag.java#L380

    The debug discussion should move over to an HTSJDK GitHub issue where it can be better tracked and course of action planned.

  • TechnicalVaultTechnicalVault Cambridge, UKMember ✭✭✭
  • TechnicalVaultTechnicalVault Cambridge, UKMember ✭✭✭

    Do we have a timeline for the release of GATK 3.5? Once pull requests 339 and 355 has been merged CRAM support should be done and there will be beers headed in your direction. We are about to embark on a major project and I would rather use a named version rather than a nightly for this.

    Issue · Github
    by Sheila

    Issue Number
    258
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Martin,

    The 3.5 release will definitely happen before our upcoming Nov 18 workshop (to be announced this pm); and in fact I'd like to push it out asap. It could happen as soon as end of next week if the CRAM PRs get sorted out in that time. Looking forward to the beer ;)

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    Nov18 only 7 days from now. This is like counting down to Christmas Eve, when I was 5 years old :) I've been gone from the GATK forum for a few months. Now back!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Welcome back @tommycarstensen, we've missed you :)

  • TechnicalVaultTechnicalVault Cambridge, UKMember ✭✭✭

    I just retested on the latest nightly and no dice, I the latest HTSJDK is yet to be integrated?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ah we haven't revved htsjdk yet, that might be it. Will check.

  • TechnicalVaultTechnicalVault Cambridge, UKMember ✭✭✭

    Current nightly build seems to mostly work. Well done! Does look like HTSJDK validation is not taking it's setting from GATK's -S however. The BAM is slightly invalid but the error message should be a warning not a fail unless we are set to -S STRICT

    INFO  13:33:09,162 GenomeAnalysisEngine - Strictness is SILENT 
    
    ##### ERROR MESSAGE: SAM/BAM/CRAM file mpileup.1.cram is malformed. Please see http://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-input-files-for-sequence-read-data-bam-cramfor more information. Error details: SAM validation error: ERROR: Record 8, Read name ERR162872.5127873, Mate Alignment start should be 0 because reference name = *.
    

    Issue · Github
    by Sheila

    Issue Number
    359
    State
    closed
    Last Updated
    Assignee
    Array
    Closed By
    chandrans
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @TechnicalVault
    Hi Martin!

    Can you please submit a bug report so we can debug locally? Instructions are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks,
    Sheila

Sign In or Register to comment.