Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

picard tool for MarkDuplicates for cram

Hi picarders,

I tried to use picard tool's MarkDuplicates to run against cram files, but I failed! Here is the error message:
Exception in thread "main" java.lang.IllegalStateException: A valid CRAM reference was not supplied and one cannot be acquired via the property settings reference_fasta or use_cram_ref_download
at htsjdk.samtools.cram.ref.ReferenceSource.getDefaultCRAMReferenceSource(ReferenceSource.java:108)
at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:386)
at picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram.openInputs(AbstractMarkDuplicatesCommandLineProgram.java:211)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:470)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:228)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:228)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)

Here is the command I ran:
java -jar /software/picard.jar MarkDuplicates TMP_DIR=/space1/tmp AS=TRUE M=/dev/null VALIDATION_STRINGENCY=SILENT I=H2MMFCCXY-2.hgv.cram.20x.ds.cram O=H2MMFCCXY-2.hgv.cram.20x.ds.Marked.cram R=/next-gen/Illumina/bwa_references/g/GRCh38_1000Genomes/GRCh38_full.fa

the java version is jdk1.8.0_74

The Picard version is: 2.10.6

Thanks!

Peter

Answers

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @peter_huang,

    Exception in thread "main" java.lang.IllegalStateException: A valid CRAM reference was not supplied and one cannot be acquired via the property settings reference_fasta or use_cram_ref_download

    Do the MD5 hashes in your @SQ lines in your CRAMs match those of your reference? The error message implies that they do not match. If you can post some of (i) your @SQ header lines and (ii) the corresponding contig reference dictionary lines, we can take a look. You can derive the dictionary afresh for the reference with Picard CreateSequenceDictionary.

    I see in your post from last Friday that you are rather discouraged:

    I submitted the question "picard tool for MarkDuplicates for cram" to the board, but no answer so far. I don't think I will get the working answer within this week.

    I think it's rather unfair of you to expect support over the weekend, if this is indeed your expectation. I'm sorry that you are in such a bind for your work. I will try to help you as best I can.

  • Thanks @shlee for the kind reply!

    Here are the items you suggested.
    @SQ SN:chr1 LN:248956422 M5:6aef897c3d6ff0c78aff06ac189178dd UR:/next-gen/Illumina/bwa_references/g/GRCh38_1000Genomes/GRCh38_full_analysis_set_plus_decoy_hla.fa
    @SQ SN:chr2 LN:242193529 M5:f98db672eb0993dcfdabafe2a882905c UR:/next-gen/Illumina/bwa_references/g/GRCh38_1000Genomes/GRCh38_full_analysis_set_plus_decoy_hla.fa

    E00380:406:H2MMFCCXY:2:2208:22181:71910 147 chr1 10400 0 111S39M = 10008 -431 CCCCAACCCCCCCCCCCCCCCCGCACCCCCGCTCCCCCCCAAACCCCCCTCTTT
    TAACGATACGGCGCCCACCGACATCTACACTCTCTCCCCCCACGACGCTCTTCCGCTCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA 5'55?++'&+&'?&&&&&'&'5&++&5&5#&'5&'''55++++&&5&&'+'5555++$??
    ++++'$'5??5555++++5++5+55?5'?555&55?$++5555555'+?'??????????5????????????????????????????? MC:Z:38M112S PG:Z:MarkDuplicates MQ:i:0 AS:i:39 XS:i:42 MD
    :Z:39 NM:i:0 RG:Z:0
    E00380:406:H2MMFCCXY:2:1205:1489:18203 1107 chr1 10400 0 111S38M1S = 10014 -424 ACCCCTCCCTCCCCACCCCTATCCAATACCACATCTCTACCCCTAT
    TTTCAAGCACAAGCCGGCACCCGAGATCACAAGGTGACTGGAGCTCAGACGCGTGCTCTTCCGCTCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTN 5''++5+5'555'+5'''+5555++?5?'+?+?5+5???''???5??5??55
    +??5???555++5&'+?5555+5???+'?5?555+?5+?+???5+55???5???+5'+?+5+5??????????????????????????????5???# MC:Z:38M112S PG:Z:MarkDuplicates MQ:i:0 AS:i:38 XS
    :i:41 MD:Z:38 NM:i:0 RG:Z:0

    Thanks!

    Peter

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @peter_huang,

    Thanks for the BAM @SQ lines. Can you also post the corresponding contig lines from your reference dictionary?

  • Thanks @shlee!
    Here they are:

    chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRCh38

    chr2 AC:CM000664.2 gi:568336022 LN:242193529 rl:Chromosome M5:f98db672eb0993dcfdabafe2a882905c AS:GRCh38

    Thanks!

    Peter

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    @peter_huang These match for chr1 and chr2. Now, you should check that all of your contigs are present in both sets and that these hashes match for each of them. I suggest you first count if you have the same number of contigs and if you do, then see that the MD5s match.

    For example, for your BAM:

    samtools view xyz.bam | grep '@SQ' | wc -l
    

    And your dictionary:

    cat ref.dict | grep '^chr' | wc -l
    

    Do you get the same number?

  • Hi @shlee,

    Here the results:

    /hgsc_software/samtools/samtools-1.3.1/samtools view -H H2MMFCCXY-2.hgv.cram | grep '^@SQ' | wc -l
    3366

    cat GRCh38_full_analysis_set_plus_decoy_hla.fa | grep '^>' | wc -l
    3366

    Thanks!

    Peter

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    @peter_huang,

    In your second command above, you are counting contigs directly from the reference FASTA. I believe Picard tools will use the dictionary you provide in the same folder, in your case named GRCh38_full_analysis_set_plus_decoy_hla.dict. The contents of this dictionary file must match the @SQ lines, if the file exists. Can you check this?

    If the number of contigs match, then check that the contigs are the same and then whether the MD5 values match between the two.

  • Hi @shlee,

    Since it happened to the chr1 for my case, so I checked the MD5 values both in .dict and .fa files. They do match!

    I am really curious, how does htsjdk calculate the MD5 values of sliced sequences? Based on the whole contig or the sliced sequence? I search the web, it only mentioned that htsjdk used uppercase sequence (non-normalized). But it didn't tell if it was from a whole contig or a sliced sequence.

    Thanks!

    Peter

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    @peter_huang,

    What do you mean by sliced sequence? Your reference contigs should be invariate, so are you referring to a subset of BAM records?

  • Hi @shlee,

    Here is the error message from another posted question of mine:

    ERROR 2017-08-01 15:22:35 Slice Reference MD5 mismatch for slice 0:248733152-248765543, CACAGATTCT...TGATTATCTC
    Exception in thread "main" htsjdk.samtools.cram.CRAMException: Reference sequence MD5 mismatch for slice: sequence id 0, start 248733152, span 32392, expected MD5 83849184708b9131cb455f52e70da52d

    Therefore, sliced sequence = sliced reference in my previous question.

    Thanks!

    Peter

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited August 2017

    Hi @peter_huang,

    If you search the web with "Slice Reference MD5 mismatch for slice", you will come across:

    The discussions within these issues appear related to your error. My best guess is that slice may be used loosely to refer to a contig, or in the case of IGV the interval which it loads into memory. I think there are cases where we should not take stdout and stderr descriptions literally. These discussions and the related code are beyond the scope of my understanding.

    For you to move forward, I think you need to reexamine the reference your data was made into a CRAM from and the reference you are using to read the CRAM from. They must match exactly in every respect. Unfortunately, as I mentioned to you in a different conversation, CRAM support on our end is spotty. I highly recommend you convert to BAM before processing your data.

    If you are creating your own reference contigs by cutting up full contigs, and this is the reference to which your reads were aligned, then this is the reference you must use to read the CRAM.

  • Thanks @shlee for the kind updates!

    I will double check our reference again at our site as you suggested!

    Thanks again for all your time and kind efforts!

    Best,

    Peter

Sign In or Register to comment.