Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!

CRAM vs BAM difference in output of HaplotypeCaller

Hi,
I have been investigating CRAM as a long term storage format. I took a paired fastq dataset through processing to 'analysis ready BAM'; https://github.com/gatk-workflows/gatk4-data-processing

I then converted that file to cram using scramble.
scramble -r hs37d5x.fa NA12878.hs37d5x.bam NA12878.hs37d5x.bam.scramble.cram

I beleive that this conversion is mostly lossless, it preserves names, bases, and qualities. However it does reorder the 'samtags', and discard some (NM, MD at least, because they are trivially re-calculable).

I then ran both the original bam and the cram through best practices part 2: https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/haplotypecaller-gvcf-gatk4.wdl

There are some small differences in the output .g.vcf (interleaved, first the cram, then the bam output):

1   143405790   .   GC  G,<NON_REF> 0   .   BaseQRankSum=0.48;ClippingRankSum=0;DP=70;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0,0;MQRankSum=-1.485;RAW_MQ=87127;ReadPosRankSum=-0.741  GT:AD:DP:GQ:PL:SB   0/0:66,2,0:68:99:0,121,3162,205,3168,3252:28,38,2,0
1   143405790   .   GC  G,<NON_REF> 0   .   BaseQRankSum=0.48;ClippingRankSum=0;DP=70;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0,0;MQRankSum=-1.485;RAW_MQ=87127;ReadPosRankSum=-0.741  GT:AD:DP:GQ:PL:SB   0/0:66,2,0:68:99:0,121,3130,205,3136,3220:28,38,2,0
1   143405791   .   C   T,<NON_REF> 0   .   BaseQRankSum=0.52;ClippingRankSum=0;DP=70;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0,0;MQRankSum=-0.045;RAW_MQ=87127;ReadPosRankSum=1.37    GT:AD:DP:GQ:PL:SB   0/0:67,3,0:70:99:0,114,3025,204,3034,3124:31,36,1,2
1   143405791   .   C   T,<NON_REF> 0   .   BaseQRankSum=0.52;ClippingRankSum=0;DP=70;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0,0;MQRankSum=-0.045;RAW_MQ=87127;ReadPosRankSum=1.37    GT:AD:DP:GQ:PL:SB   0/0:67,3,0:70:99:0,114,2991,204,3000,3090:31,36,1,2
1   144920661   .   G   A,<NON_REF> 378.77  .   BaseQRankSum=1.196;ClippingRankSum=0;DP=41;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=-1.184;RAW_MQ=144373;ReadPosRankSum=-0.807  GT:AD:DP:GQ:PL:SB   0/1:23,15,0:38:99:407,0,897,477,942,1419:10,13,11,4
1   144920661   .   G   A,<NON_REF> 378.77  .   BaseQRankSum=1.196;ClippingRankSum=0;DP=41;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=-1.184;RAW_MQ=144373;ReadPosRankSum=-0.807  GT:AD:DP:GQ:PL:SB   0/1:23,15,0:38:99:407,0,897,476,942,1418:10,13,11,4
1   144920668   .   C   T,<NON_REF> 612.77  .   BaseQRankSum=-1.145;ClippingRankSum=0;DP=40;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=1.266;RAW_MQ=140773;ReadPosRankSum=1.3 GT:AD:DP:GQ:PL:SB   0/1:15,22,0:37:99:641,0,564,686,630,1316:11,4,10,12
1   144920668   .   C   T,<NON_REF> 611.77  .   BaseQRankSum=-1.145;ClippingRankSum=0;DP=40;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=1.266;RAW_MQ=140773;ReadPosRankSum=1.3 GT:AD:DP:GQ:PL:SB   0/1:15,22,0:37:99:640,0,564,685,630,1315:11,4,10,12
1   144920703   .   CA  C,CAA,CAAA,<NON_REF>    0.2 .   BaseQRankSum=-0.776;ClippingRankSum=0;DP=35;ExcessHet=3.0103;MLEAC=1,0,0,0;MLEAF=0.5,0,0,0;MQRankSum=0;RAW_MQ=122773;ReadPosRankSum=-1.378  GT:AD:DP:GQ:PL:SB   0/1:9,4,1,2,0:16:14:24,0,188,23,109,216,14,137,215,319,61,173,215,246,267:9,0,6,1
1   144920703   .   CA  C,CAA,CAAA,<NON_REF>    0.2 .   BaseQRankSum=-0.776;ClippingRankSum=0;DP=35;ExcessHet=3.0103;MLEAC=1,0,0,0;MLEAF=0.5,0,0,0;MQRankSum=0;RAW_MQ=122773;ReadPosRankSum=-1.378  GT:AD:DP:GQ:PL:SB   0/1:9,4,1,2,0:16:14:24,0,188,23,105,190,14,137,204,319,61,171,202,239,258:9,0,6,1
1   144920754   .   GA  G,<NON_REF> 224.73  .   BaseQRankSum=1.247;ClippingRankSum=0;DP=39;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=0;RAW_MQ=138564;ReadPosRankSum=0.893    GT:AD:DP:GQ:PL:SB   0/1:22,11,0:33:99:262,0,668,328,701,1030:16,6,6,5
1   144920754   .   GA  G,<NON_REF> 205.73  .   BaseQRankSum=1.247;ClippingRankSum=0;DP=39;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=0;RAW_MQ=138564;ReadPosRankSum=0.893    GT:AD:DP:GQ:PL:SB   0/1:22,11,0:33:99:243,0,669,309,701,1010:16,6,6,5

The 'QUAL' field is slightly different in some cases, but very different in the last (224.73 vs 205.73).
The 'PL' Genotype field is slightly different. 'PL: phred-scaled genotype likelihoods rounded to the closest integer'

Can anyone help me explain these differences? I am not sure if these differences are numerically significant, but I would not have expected them regardless. I do need to double check that each HaplotypeCaller run used the same version of GATK4.

Thanks

Answers

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    I am working on a GenotypeConcordance with BAM vs CRAM samples.

    My initial findings

    • As long as compression is lossless and made with a recent version of Samtools concordance rate is above %99.99.
    • Minor changes are only at quality and phred calculation level which may not really affect anything.

    I am moving all my archival data to CRAM format so that's why I am testing this.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi again @Evan_Benn,

    You can use GATK4 PrintReads to convert BAM to CRAM and back again. For example, here is a command I am actually running right now:

    gatk4040 PrintReads \
    -I NA20885.alt_bwamem_GRCh38DH.20150718.GIH.low_coverage.cram \
    -O NA20885_lc.bam \
    -R ~/Documents/ref/hg38/GRCh38_full_analysis_set_plus_decoy_hla.fa
    

    If you have issues with the GATK conversion, then please let us know.

  • Evan_BennEvan_Benn Member

    Hi @shlee,
    Thanks for the suggestion, I had not tried PrintReads yet.

    The question I have is this: Should Bam and Cram input to HaplotypeCaller produce the same results?

    I show some differences in the results above when I tested this. I do not know if these differences are expected, or significant.

    Thanks

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    Here is a snapshot of one of my trials

    Same sample. Truth set is from BAM file from GATK 4.0.4.0 and Call set is from the same bam file compressed lossless with samtools and called with GATK 4.0.4.0.

    There is one variant difference which I don't think comes from the actual BAM CRAM difference.

  • SheilaSheila Broad InstituteMember, Broadie admin

    @Evan_Benn
    Hi,

    Can you take a look at the CRAM file in the sites where there are differences? You can use Samtools or PrintReads. Also, can you try using PrintReads to produce the CRAM and see if there are differences?

    -Sheila

  • Evan_BennEvan_Benn Member

    Hi @Sheila,
    I have inspected the bam and cram file at the site of the first difference. There is a difference, which I mentioned in my first post: The metadata fields are reordered, and the MD and NM fields are missing.

    I have attached the two files filtered to bam with PrintReads. (Actually I got the message 'the file failed to upload'). https://cloudstor.aarnet.edu.au/plus/s/mpC8QaAERRMvprR

    The remainder of the data is the same, I used too bamUtils to check:

    bam diff --in1 NA12878.hs37d5x.bam --in2 NA12878.hs37d5x.scramble.bam --flag --mapQual --mate --isize --seq --baseQual --tags AS:i,MC:Z,MQ:i,PG:Z,RG:Z,UQ:i
    

    Thanks

  • Evan_BennEvan_Benn Member

    Apologies, I forgot to update after my previous message: 'I do need to double check that each HaplotypeCaller run used the same version of GATK4.' I had indeed used a different GATK4 version (4.0.2.1 vs 4.0.4.0). My immediate preceding post references the two files 'HaplotypeCaller'ed with the same GATK version (4.0.4.0).

    Using the up to date version made different differences, if that makes sense:

    1   142689598   .   C   T,<NON_REF> 60.77   .   BaseQRankSum=0.389;ClippingRankSum=0;DP=60;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=-2.199;RAW_MQ=130389;ReadPosRankSum=-1.112  GT:AD:DP:GQ:PGT:PID:PL:SB   0/1:54,6,0:60:89:0|1:142689598_C_T:89,0,2262,252,2280,2532:30,24,4,2
    1   142689598   .   C   T,<NON_REF> 60.77   .   BaseQRankSum=0.389;ClippingRankSum=0;DP=60;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=-2.199;RAW_MQ=130389;ReadPosRankSum=-1.112  GT:AD:DP:GQ:PGT:PID:PL:SB   0/1:54,6,0:60:89:0|1:142689598_C_T:89,0,2280,252,2298,2550:30,24,4,2
    1   142689600   .   G   A,<NON_REF> 0   .   BaseQRankSum=-2.862;ClippingRankSum=0;DP=59;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0,0;MQRankSum=-1.348;RAW_MQ=129660;ReadPosRankSum=-0.741   GT:AD:DP:GQ:PL:SB   0/0:56,3,0:59:99:0,101,2481,172,2490,2560:32,24,2,1
    1   142689600   .   G   A,<NON_REF> 0   .   BaseQRankSum=-2.862;ClippingRankSum=0;DP=59;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0,0;MQRankSum=-1.348;RAW_MQ=129660;ReadPosRankSum=-0.741   GT:AD:DP:GQ:PL:SB   0/0:56,3,0:59:99:0,101,2479,172,2488,2558:32,24,2,1
    1   144920562   .   A   G,<NON_REF> 951.77  .   BaseQRankSum=-0.299;ClippingRankSum=0;DP=61;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=0.26;RAW_MQ=217010;ReadPosRankSum=1.496    GT:AD:DP:GQ:PL:SB   0/1:26,35,0:61:99:980,0,986,1061,1092,2153:17,9,14,21
    1   144920562   .   A   G,<NON_REF> 951.77  .   BaseQRankSum=-0.299;ClippingRankSum=0;DP=61;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=0.26;RAW_MQ=217010;ReadPosRankSum=1.496    GT:AD:DP:GQ:PGT:PID:PL:SB   0/1:26,35,0:61:99:0|1:144920562_A_G:980,0,975,1061,1080,2142:17,9,14,21
    1   144920569   .   C   T,<NON_REF> 588.77  .   BaseQRankSum=-1.081;ClippingRankSum=0;DP=59;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=-0.271;RAW_MQ=209810;ReadPosRankSum=-1.918 GT:AD:DP:GQ:PL:SB   0/1:35,24,0:59:99:617,0,1398,722,1469,2191:14,21,17,7
    1   144920569   .   C   T,<NON_REF> 746.77  .   BaseQRankSum=-1.081;ClippingRankSum=0;DP=59;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=-0.271;RAW_MQ=209810;ReadPosRankSum=-1.918 GT:AD:DP:GQ:PGT:PID:PL:SB   0/1:35,24,0:59:99:1|0:144920562_A_G:775,0,1394,880,1469,2349:14,21,17,7
    11  51591444    .   C   G,T,<NON_REF>   1615.77 .   BaseQRankSum=-0.048;ClippingRankSum=0;DP=152;ExcessHet=3.0103;MLEAC=0,1,0;MLEAF=0,0.5,0;MQRankSum=-7.12;RAW_MQ=450410;ReadPosRankSum=1.05   GT:AD:DP:GQ:PL:SB   0/2:71,7,61,0:139:99:1644,1733,4399,0,2070,2031,1912,4222,2253,4342:44,27,29,39
    11  51591444    .   C   G,T,<NON_REF>   1751.77 .   BaseQRankSum=-0.343;ClippingRankSum=0;DP=157;ExcessHet=3.0103;MLEAC=0,1,0;MLEAF=0,0.5,0;MQRankSum=-7.438;RAW_MQ=454999;ReadPosRankSum=0.965 GT:AD:DP:GQ:PL:SB   0/2:71,7,66,0:144:99:1780,1883,4610,0,2069,2016,2059,4406,2253,4514:44,27,31,42
    11  51591478    .   C   T,<NON_REF> 21.8    .   BaseQRankSum=1.24;ClippingRankSum=0;DP=160;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=-0.804;RAW_MQ=491041;ReadPosRankSum=2.005   GT:AD:DP:GQ:PL:SB   0/1:139,15,0:154:50:50,0,4613,471,4660,5131:68,71,6,9
    11  51591478    .   C   T,<NON_REF> 3.41    .   BaseQRankSum=1.207;ClippingRankSum=0;DP=163;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=-0.471;RAW_MQ=489940;ReadPosRankSum=1.912  GT:AD:DP:GQ:PL:SB   0/1:146,15,0:161:29:29,0,4885,471,4932,5403:74,72,6,9
    11  51591485    .   T   <NON_REF>   .   .   END=51591485    GT:DP:GQ:MIN_DP:PL  0/0:187:0:187:0,0,4701
    11  51591485    .   T   C,<NON_REF> 0   .   BaseQRankSum=0.824;ClippingRankSum=0;DP=161;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0,0;MQRankSum=-6.111;RAW_MQ=482710;ReadPosRankSum=0.3  GT:AD:DP:GQ:PL:SB   0/0:143,10,0:153:99:0,112,4998,430,5028,5347:78,65,3,7
    11  51591542    .   C   A,<NON_REF> 584.77  .   BaseQRankSum=1.4;ClippingRankSum=0;DP=143;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=4.86;RAW_MQ=366818;ReadPosRankSum=0.836  GT:AD:DP:GQ:PL:SB   0/1:113,30,0:143:99:613,0,3720,953,3810,4763:61,52,15,15
    11  51591542    .   C   A,<NON_REF> 587.77  .   BaseQRankSum=1.406;ClippingRankSum=0;DP=142;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=4.853;RAW_MQ=362569;ReadPosRankSum=0.992   GT:AD:DP:GQ:PL:SB   0/1:112,30,0:142:99:616,0,3689,953,3779,4732:59,53,15,15
    
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @Evan_Benn,

    The metadata fields are reordered, and the MD and NM fields are missing.

    scramble is not a tool we support. If this external tool is losing data while decompressing CRAMs to BAMS, and unless this loss of data is something you desire, then do not use the tool! Do you see a loss of information when you decompress the original CRAM to BAMwith samtools or GATK4 PrintReads?

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    cramtools and scramble are tools that are not being updated regularly as far as I can tell. samtools and GATK on the other hand uses the latest htslib for CRAM support and should be the defacto for BAM - CRAM - BAM conversion.

  • Evan_BennEvan_Benn Member
    edited June 2018

    I have not completed my analysis of this, but for anyone else who comes to this thread the scramble -P flag preserves the GATK NM and MD (and RG). There are also flags for samtools to write and not recreate those flags. The sam format will shortly standardise the meaning of the NM, MD tags, which should prevent this issue and mean you wont require -P (or the extra space in the file to store those). https://github.com/samtools/hts-specs/pull/308

    I was using scramble as it was ~8x faster than samtools (in terms of cpu-seconds). Samtools in turn was ~5x faster than gatk.

    I think that scramble is being actively developed, it is the library written by one of the authors of the cram format: https://github.com/jkbonfield/io_lib

  • Evan_BennEvan_Benn Member

    I have recently come back to this, this time I used
    samtools -C --output-fmt-option store_nm=1 --output-fmt-option store_md=1
    to create a cram from the bam. And I have updated to gatk 4.0.6.0

    I confirmed this cram was similar using bamutil diff : https://github.com/statgen/bamUtil .
    There were some differences, all in reads marked as 'CO:Z:Cross-species contamination'. Those reads had their CIGAR wiped.

    The gvcf output still differs, similar to above.

    I see that the newer best practice pipeline uses samtools to convert cram to bam first: https://github.com/gatk-workflows/gatk4-germline-snps-indels/pull/23~~~~

  • Evan_BennEvan_Benn Member

    I have also run the gvcf -> vcf pipe.

    bam -> gvcf -> vcf (control).
    bam -> cram -> gvcf (differs) -> vcf (differs)
    bam -> cram -> bam -> gvcf (matches control) -> vcf (currently running)

    the bam->cram->bam produces differences as above, only in the cross species contam. However running that bam through haplotypecaller produces an identical gvcf to the original bam. This indicates to me that there is a difference in GATK haplotypecaller when the input is cram.

  • Evan_BennEvan_Benn Member

    The final stage is now complete:
    bam -> cram -> bam -> gvcf (matches control) -> vcf (matches control)

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @Evan_Benn,

    Thanks for reporting your observations.

    there is a difference in GATK haplotypecaller when the input is cram

    This particular observation is concerning. Would it be possible for you to enumerate (i) how the CRAM is made, e.g. with Samtools and (ii) the extent of the differences between the GVCFs? We have been recommending that researchers use CRAM for storage and that you convert to BAM for analyses. The reason for this recommendation is mostly due to how efficiently GATK can process CRAM vs. BAM. For example, the CRAM may be suitable for a one-off analysis, e.g. verifying this is the sample of interest, while the BAM is recommended if the sample alignments will be accessed more than once.

    Because you are noticing scientific differences, we will want to look more into this. Would it be possible for you to submit a bug report? Instructions are at https://software.broadinstitute.org/gatk/guide/article?id=1894. Many thanks.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
Sign In or Register to comment.