
Error running CollectAlignmentSummaryMetrics on a bam generated from .maf file

SDFfASF Member Posts: 5

Hello,
Recently I ran an alignment with the LAST tool (http://last.cbrc.jp/, a FASTA aligner for long-read alignment). It produces a .maf file, which I converted to SAM (with http://last.cbrc.jp/doc/maf-convert.html) and then to BAM (with Picard). Up to that point everything looked fine, but when I ran Picard CollectAlignmentSummaryMetrics it threw this error:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
at picard.analysis.AlignmentSummaryMetricsCollector$GroupAlignmentSummaryMetricsPerUnitMetricCollector$IndividualAlignmentSummaryMetricsCollector.collectQualityData(AlignmentSummaryMetricsCollector.java:329)
at picard.analysis.AlignmentSummaryMetricsCollector$GroupAlignmentSummaryMetricsPerUnitMetricCollector$IndividualAlignmentSummaryMetricsCollector.addRecord(AlignmentSummaryMetricsCollector.java:195)
at picard.analysis.AlignmentSummaryMetricsCollector$GroupAlignmentSummaryMetricsPerUnitMetricCollector.acceptRecord(AlignmentSummaryMetricsCollector.java:127)
at picard.analysis.AlignmentSummaryMetricsCollector$GroupAlignmentSummaryMetricsPerUnitMetricCollector.acceptRecord(AlignmentSummaryMetricsCollector.java:93)
at picard.metrics.MultiLevelCollector$AllReadsDistributor.acceptRecord(MultiLevelCollector.java:192)
at picard.metrics.MultiLevelCollector.acceptRecord(MultiLevelCollector.java:315)
at picard.analysis.AlignmentSummaryMetricsCollector.acceptRecord(AlignmentSummaryMetricsCollector.java:89)
at picard.analysis.CollectAlignmentSummaryMetrics.acceptRead(CollectAlignmentSummaryMetrics.java:147)
at picard.analysis.SinglePassSamProgram.makeItSo(SinglePassSamProgram.java:138)
at picard.analysis.SinglePassSamProgram.doWork(SinglePassSamProgram.java:77)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:208)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

I am adding the head of the bam file:

0034a196-edbc-429f-89c4-b5280a486760_Basecall_2D_2d 0 burn-in 1 100 4H21=1D18=.....1D17=2D1X6=6D1X11=2I1X31=19H * 0 0 GGGCGGCGACCTCGCGGGT.....AGCATGCCACG * NM:i:152 AS:i:10909
06c0ff36-09df-4bb3-b952-146fca6f60ae_Basecall_2D_2d 0 burn-in 1 100 8H21=1D3=......2D1=1I57=2D68=1D1=2D42=29H * 0 0 GGGCGGCGACCTCGCGGG...........GCAAGCGTGA * NM:i:402 AS:i:33419

I deleted values in the middle of SEQ and CIGAR strings because they are very long.

Running ValidateSamFile on this BAM file shows no relevant problems:

HISTOGRAM java.lang.String

Error Type Count
ERROR:MISSING_READ_GROUP 1
WARNING:RECORD_MISSING_READ_GROUP 2441

For the same sequencing run I also had FASTQ files, which I aligned with bwa. When I ran CollectAlignmentSummaryMetrics on the BAM file from that workflow, it worked fine. Here is the head of that BAM (alignment with bwa using FASTQ):

0034a196-edbc-429f-89c4-b5280a486760_Basecall_2D_2d 0 burn-in 1 60 4S18M1D1....M6D32M19S * 0 0 TGCTGG...TGTTTGA /)6-,(-.../9/)0,*, MD:Z:18^T..A11G31 NM:i:138 AS:i:1920 XS:i:0
06c0ff36-09df-4bb3-b952-146fca6f60ae_Basecall_2D_2d 0 burn-in 1 60 8S18M1D1...D1M2D42M29S * 0 0 GTATTGC...ATGTGTTTC =.01-)**)./....'-.+*+ MD:Z:18^.^A1^AA42 NM:i:371 AS:i:5836 XS:i:0

Same as before, I removed the characters in the middle of the long strings.

Hope you could help me with my problems.

Thanks and have a great day.

Answers

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie Posts: 11,388 admin
    The ArrayIndexOutOfBoundsException error suggests that you may have some malformed reads where the alignment information does not make sense, e.g. a read that maps off the end of a contig or something like that. That could be a bug in the aligner you're using. This seems especially likely considering the BWA alignment appears to be healthy.

    Geraldine Van der Auwera, PhD

  • SDFfASF Member Posts: 5
    edited December 2016

    Well, it seems that the only difference between the two BAM files (one from aligning the FASTQ and one from aligning the FASTA; two example records from each are in the first post) is that one file has Phred scores in the quality field and the other has a single "*" in that place.
    I'm trying to figure out how to work with that, but if anybody has a suggestion I will try it.

    BTW, I'm using the latest version of picard (2.8.1)
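
One quick way to confirm how many records carry "*" in the quality column is to scan the SAM text directly. What follows is a minimal Python sketch, assuming tab-separated SAM text (the records quoted in this thread show spaces only because of how they were pasted). The helper name `count_missing_quals` and the example records are made up for illustration.

```python
def count_missing_quals(sam_lines):
    """Return (records_without_quals, total_records) for an iterable of SAM text lines."""
    missing = 0
    total = 0
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record (11 mandatory columns)
        total += 1
        if fields[10] == "*":  # column 11 is QUAL; "*" means qualities are not stored
            missing += 1
    return missing, total


# Hypothetical records, one without and one with base qualities.
example = [
    "@HD\tVN:1.0\tSO:unsorted",
    "read1\t0\tburn-in\t1\t100\t4M\t*\t0\t0\tACGT\t*\tNM:i:0",
    "read2\t0\tburn-in\t1\t60\t4M\t*\t0\t0\tACGT\t/)6-\tNM:i:0",
]
print(count_missing_quals(example))
```

If every record from the LAST workflow is counted as missing and none from the bwa workflow is, that supports the asterisk being the relevant difference.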

  • shlee Cambridge Member, Broadie, Moderator Posts: 494 admin

    Hi @SDFfASF,

    If what you posted is indeed the top of the BAM file, then you are missing an actual header. Also, your error messages are saying that the BAM is missing read group information, which also indicates a missing header.

    To start, take a look at this FAQ to see what a BAM header should look like. To add such a header you can use, for example, Picard's ReplaceSamHeader.
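
As a rough pre-check before running Picard, the header lines can be inspected for read group information. This is a minimal Python sketch, assuming tab-separated header lines. The function name `read_group_problems` is hypothetical, and the SM check is included only because Picard complains about a missing SM tag later in this thread; the example header mirrors the one the original poster shares further down.

```python
def read_group_problems(header_lines):
    """List problems found in the @RG lines of a SAM header."""
    problems = []
    read_groups = [line for line in header_lines if line.startswith("@RG")]
    if not read_groups:
        problems.append("no @RG line")
    for line in read_groups:
        # Header fields after the record type look like TAG:value.
        fields = line.rstrip("\n").split("\t")[1:]
        tags = dict(field.split(":", 1) for field in fields if ":" in field)
        for required in ("ID", "SM"):
            if required not in tags:
                problems.append(f"@RG {tags.get('ID', '?')} missing {required}")
    return problems


header = [
    "@HD\tVN:1.0\tSO:unsorted",
    "@SQ\tSN:burn-in\tLN:48502",
    "@RG\tID:1",
    "@PG\tID:6\tPN:minialign",
]
print(read_group_problems(header))
```

An empty list only means these two tags are present; it is not a substitute for running ValidateSamFile.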

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie Posts: 11,388 admin
    I'm pretty sure the problem is because of that asterisk. What program generated your alignments?

    Geraldine Van der Auwera, PhD

  • SDFfASF Member Posts: 5

    @shlee It had a header; I just left it out of the post by mistake. I will read the FAQ for sure, thanks.

    @Geraldine_VdAuwera I'm pretty sure too, and when I put some Phred values in place of the asterisk, Picard worked fine. But I thought I saw somewhere in the documentation that Picard does not require qscores in the BAM/SAM and can work with files where they are replaced with a "*". The tool was LAST; I linked to it in the first post.

    Issue · Github, by Sheila. Issue Number: 1625. State: closed. Closed By: vdauwera.
  • shlee Cambridge Member, Broadie, Moderator Posts: 494 admin
    edited January 13

    @SDFfASF,

    Thanks for the feedback. I examined your two sets of records carefully and noticed one interesting difference. The first set (that gives you problems with CollectAlignmentSummaryMetrics) uses extended CIGAR nomenclature (1D17=2D1X6=6D1X11=2I1X31=19H), while the second set (that works fine) does not (M6D32M19S). Would it be possible for you to attach a file of 100 such extended CIGAR SAM records in a valid BAM file, i.e. with header, so that we can test whether this is the problem or if something else is causing the issue? Can you make sure this snippet still gives you the error before attaching it here in this thread? Thanks.
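
To make the difference concrete: extended CIGAR writes matches as "=" and mismatches as "X", where the classic form uses "M" for both. Below is a minimal Python sketch that rewrites one form into the other, which could help test whether the CIGAR style matters. The function name `collapse_extended_cigar` is hypothetical, and the input is the CIGAR fragment quoted above, not a complete CIGAR from the file.

```python
import re

# One CIGAR element: a length followed by a single operation character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def collapse_extended_cigar(cigar):
    """Rewrite '=' and 'X' as 'M', merging adjacent runs of the same operation."""
    if cigar == "*":
        return cigar  # CIGAR unavailable, nothing to rewrite
    merged = []  # list of [length, op] pairs
    for length, op in CIGAR_OP.findall(cigar):
        if op in "=X":
            op = "M"
        if merged and merged[-1][1] == op:
            merged[-1][0] += int(length)
        else:
            merged.append([int(length), op])
    return "".join(f"{n}{op}" for n, op in merged)


print(collapse_extended_cigar("1D17=2D1X6=6D1X11=2I1X31=19H"))
# prints 1D17M2D7M6D12M2I32M19H
```

Note that this rewrite discards the match/mismatch distinction, so it is only useful as a diagnostic.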

  • SDFfASF Member Posts: 5
    edited January 15

    Lately I've been using another tool whose output had a similar problem with Picard. It doesn't use extended CIGAR and still hits the problem with the asterisk, so I will post some test file snippets from this tool's output. The attached files were originally .sam files; I renamed them to .txt in order to upload them.

    Important to note: Picard throws "Error parsing SAM header. @RG line missing SM tag" with these files, but I read that the SM tag is not essential for Picard and that you can ignore this error by adding "VALIDATION_STRINGENCY=SILENT", which I did.

    First file [testWithAsterisk.txt] - Original sam file which had asterisk instead of qscore:
    @HD VN:1.0 SO:unsorted
    @SQ SN:burn-in LN:48502
    @RG ID:1
    @PG ID:6 PN:minialign
    8915e658-528c-4677-88a8-c2eba6c58fc5_Basecall_2D_2d 16 burn-in
    8915e658-528c-4677-88a8-c2eba6c58fc5_Basecall_2D_template 4 * 0 0 * * 0 0 TTGGCAGATAACATATTTTATCTTTTGCTCACCAGTTCGATGATTAACGGAAGTTCATCTGCTTTATGGG * RG:Z:1
    8da715a9-3717-4f04-9667-e7e0c2792104_Basecall_2D_2d 16 burn-in

    And the command and the error it produced (it didn't output any file):

    $ java -jar ~/tools/picard.jar CollectAlignmentSummaryMetrics R=../LambdaRefGenome.fa I=test2.sam O=testSummary4.txt VALIDATION_STRINGENCY=SILENT
    [Sun Jan 15 10:53:12 IST 2017] picard.analysis.CollectAlignmentSummaryMetrics REFERENCE_SEQUENCE=../LambdaRefGenome.fa INPUT=test2.sam OUTPUT=testSummary4.txt VALIDATION_STRINGENCY=SILENT MAX_INSERT_SIZE=100000 EXPECTED_PAIR_ORIENTATIONS=[FR] ADAPTER_SEQUENCE=[AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG] METRIC_ACCUMULATION_LEVEL=[ALL_READS] IS_BISULFITE_SEQUENCED=false ASSUME_SORTED=true STOP_AFTER=0 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
    [Sun Jan 15 10:53:12 IST 2017] Executing as artemd@nshomron.tau.ac.il on Linux 2.6.32-642.1.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17; Picard version: 2.8.1-SNAPSHOT
    WARNING 2017-01-15 10:53:12 SinglePassSamProgram File reports sort order 'unsorted', assuming it's coordinate sorted anyway.
    [Sun Jan 15 10:53:12 IST 2017] picard.analysis.CollectAlignmentSummaryMetrics done. Elapsed time: 0.00 minutes.
    Runtime.totalMemory()=504889344
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
    at picard.analysis.AlignmentSummaryMetricsCollector$GroupAlignmentSummaryMetricsPerUnitMetricCollector$IndividualAlignmentSummaryMetricsCollector.collectQualityData(AlignmentSummaryMetricsCollector.java:323)
    at picard.analysis.AlignmentSummaryMetricsCollector$GroupAlignmentSummaryMetricsPerUnitMetricCollector$IndividualAlignmentSummaryMetricsCollector.addRecord(AlignmentSummaryMetricsCollector.java:189)
    at picard.analysis.AlignmentSummaryMetricsCollector$GroupAlignmentSummaryMetricsPerUnitMetricCollector.acceptRecord(AlignmentSummaryMetricsCollector.java:121)
    at picard.analysis.AlignmentSummaryMetricsCollector$GroupAlignmentSummaryMetricsPerUnitMetricCollector.acceptRecord(AlignmentSummaryMetricsCollector.java:87)
    at picard.metrics.MultiLevelCollector$AllReadsDistributor.acceptRecord(MultiLevelCollector.java:192)
    at picard.metrics.MultiLevelCollector.acceptRecord(MultiLevelCollector.java:315)
    at picard.analysis.AlignmentSummaryMetricsCollector.acceptRecord(AlignmentSummaryMetricsCollector.java:83)
    at picard.analysis.CollectAlignmentSummaryMetrics.acceptRead(CollectAlignmentSummaryMetrics.java:147)
    at picard.analysis.SinglePassSamProgram.makeItSo(SinglePassSamProgram.java:138)
    at picard.analysis.SinglePassSamProgram.doWork(SinglePassSamProgram.java:77)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:208)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

    Now the second file [testWithQscore.txt], where the only change is (fake) qscore values added in place of the asterisk:
    @HD VN:1.0 SO:unsorted
    @SQ SN:burn-in LN:48502
    @RG ID:1
    @PG ID:6 PN:minialign
    8915e658-528c-4677-88a8-c2eba6c58fc5_Basecall_2D_2d 16 burn-in
    8915e658-528c-4677-88a8-c2eba6c58fc5_Basecall_2D_template 4 * 0 0 * * 0 0 TTGGCAGATAACATATTTTATCTTTTGCTCACCAGTTCGATGATTAACGGAAGTTCATCTGCTTTATGGG 1111111111111111111111111111111111111111111111111111111111111111111111 RG:Z:1
    8da715a9-3717-4f04-9667-e7e0c2792104_Basecall_2D_2d 16 burn-in

    And the command for this one is: (it produced a normal AlignmentSummaryMetrics file)

    $ java -jar ~/tools/picard.jar CollectAlignmentSummaryMetrics R=../LambdaRefGenome.fa I=test.sam O=testSummary2.txt VALIDATION_STRINGENCY=SILENT
    [Sun Jan 15 10:53:01 IST 2017] picard.analysis.CollectAlignmentSummaryMetrics REFERENCE_SEQUENCE=../LambdaRefGenome.fa INPUT=test.sam OUTPUT=testSummary2.txt VALIDATION_STRINGENCY=SILENT MAX_INSERT_SIZE=100000 EXPECTED_PAIR_ORIENTATIONS=[FR] ADAPTER_SEQUENCE=[AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG] METRIC_ACCUMULATION_LEVEL=[ALL_READS] IS_BISULFITE_SEQUENCED=false ASSUME_SORTED=true STOP_AFTER=0 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
    [Sun Jan 15 10:53:01 IST 2017] Executing as artemd@nshomron.tau.ac.il on Linux 2.6.32-642.1.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17; Picard version: 2.8.1-SNAPSHOT
    WARNING 2017-01-15 10:53:01 SinglePassSamProgram File reports sort order 'unsorted', assuming it's coordinate sorted anyway.
    [Sun Jan 15 10:53:01 IST 2017] picard.analysis.CollectAlignmentSummaryMetrics done. Elapsed time: 0.00 minutes.
    Runtime.totalMemory()=504889344

    Hope this helps.

    ArtemD.

    Attachments: testWithQscore.txt (12K), testWithAsterisk.txt (7K)
  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie Posts: 11,388 admin

    Hi @SDFfASF,

    I can confirm it's the asterisk that causes a problem. The error stack trace shows that this is the function that's choking on your read:

    IndividualAlignmentSummaryMetricsCollector.collectQualityData
    

    This function looks up the quality scores by the index position of the corresponding base, so if the array is just a single asterisk, the function will error out for any base after the first. That's why you get an ArrayIndexOutOfBounds as explained here.

    The tricky thing is that many Picard tools have requirements that are different from the majority of tools and are often not documented. The metrics collection tools tend to have the most exhaustive requirements for records being complete, because they access most if not all of the properties of the data. We'll try to document these things more clearly in future.

    Geraldine Van der Auwera, PhD
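
The failure mode described above can be illustrated with a small Python analogue. This is not Picard's code, only a sketch of the same pattern: qualities are looked up by base position, so a quality array shorter than the read fails. The function names and the threshold of 20 are assumptions made for the example. The index 0 in the reported exception would be consistent with an empty quality array rather than a one-element one, but that is an inference from the message, not something checked against the Picard source.

```python
def count_high_quality_bases(bases, quals, threshold=20):
    """Count bases whose Phred score is at least `threshold`, indexing quals by base position."""
    count = 0
    for i in range(len(bases)):
        if quals[i] >= threshold:  # raises IndexError when quals is shorter than bases
            count += 1
    return count


def safe_count_high_quality_bases(bases, quals, threshold=20):
    """Same count, but return None when qualities are absent or of the wrong length."""
    if len(quals) != len(bases):
        return None
    return count_high_quality_bases(bases, quals, threshold)


print(count_high_quality_bases("ACGT", [30, 10, 25, 20]))
print(safe_count_high_quality_bases("ACGT", []))
```

The guarded variant shows one way a collector could skip quality-dependent metrics for reads without qualities instead of crashing; whether Picard does or should behave that way is outside what this thread establishes.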

  • SDFfASF Member Posts: 5

    OK, thanks @Geraldine_VdAuwera, that clears up a whole lot of confusion. I now understand that CollectAlignmentSummaryMetrics requires the qscore values in order to calculate a few metrics "for high quality bases". Can I somehow turn this off so that Picard collects all the other metrics that are not related to quality? Or can I ask Picard to assume all bases have the same qscore?

    If there is no solution on Picard's end, I guess I would need to either loop over each read in the SAM file and "fake" qscore values of the same length as the read or (what might be more troublesome to write) for each read go back to the original FASTQ file and copy its qscore values onto the corresponding bases in the SAM file.
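
The first option (flat fake qualities) can be sketched in a few lines of Python. This assumes tab-separated SAM text and Phred+33 encoding; the function name `fill_missing_quals` and the default value of 20 are arbitrary choices for illustration, and the record is modelled on the unmapped read in the attached test file with a shortened sequence.

```python
def fill_missing_quals(sam_line, phred=20):
    """Return the SAM record with QUAL '*' replaced by a flat quality string."""
    line = sam_line.rstrip("\n")
    fields = line.split("\t")
    if line.startswith("@") or len(fields) < 11:
        return line  # header or incomplete line: leave unchanged
    seq, qual = fields[9], fields[10]
    if qual == "*" and seq != "*":
        # Phred+33: one ASCII character per base, same length as SEQ.
        fields[10] = chr(phred + 33) * len(seq)
    return "\t".join(fields)


record = "read1\t4\t*\t0\t0\t*\t*\t0\t0\tTTGGCAGATA\t*\tRG:Z:1"
print(fill_missing_quals(record))
```

Any quality-based metrics computed afterwards reflect the fake value, not the data, so they should be ignored.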

  • elcinchu27 Broad Member, Broadie Posts: 19
    edited March 9

    Hello @Geraldine_VdAuwera

    I am not sure but I think that I have the "asterisk problem" with CollectMultipleMetrics in the "MutationCalling_QC_v1-1_BETA_cfg" pipeline:

    -INFO 2017-03-09 17:42:01 SinglePassSamProgram Processed 104,000,000 records. Elapsed time: 00:28:44s. Time for last 1,000,000: 15s. Last read position: X:153,045,203
    -[Thu Mar 09 17:42:10 UTC 2017] picard.analysis.CollectMultipleMetrics done. Elapsed time: 30.98 minutes.
    -Runtime.totalMemory()=1025507328
    -To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 287

    I am not sure whether this is the problem, because other people who launched the same analysis with unmapped reads didn't get this kind of error. The Picard version is 2.1.0 and the symbol for this type of read is " * / * ". Do you know if I could do something to fix it?

    Thank you for the help,

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie Posts: 11,388 admin

    Hi @elcinchu27, if your data doesn't have qscores you need to add a flat default; see my "accepted" answer at the top of the thread.

    Geraldine Van der Auwera, PhD

  • elcinchu27 Broad Member, Broadie Posts: 19

    Hello again @Geraldine_VdAuwera,

    As you said, I tried to add flat default values for the qscores but unfortunately I still have the same problem with the pipeline:

    java -jar /usr/local/bin/GenomeAnalysisTK.jar \
    -T PrintReads \
    -R reference.genome \
    -I input_bam \
    -DBQ 0 \
    -o qscores.bam

    I wrote "-DBQ 0" because the default value is -1, and there was an error message saying that it is not possible to use negative values. Do you have any idea what the problem is? Thanks for the help.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie Posts: 11,388 admin

    The default -1 value is a special-cased value that disables the use of default quals. I'm not sure about 0 but that might be special-cased too. I would recommend using a more realistic qual value, like 20 or 30 instead.

    That being said I don't know whether this will actually solve your problem; I'm assuming that your problem is the same as the original poster's, but if it's a different problem then this won't be sufficient. Did you run ValidateSamFile on this data?

    Geraldine Van der Auwera, PhD

  • elcinchu27 Broad Member, Broadie Posts: 19

    Yes, I looked for errors and warnings but the output shows "No errors found":

    /usr/lib/jvm/java-1.8.0-openjdk-amd64/bin/java -Xmx2g -jar /opt/picard-tools/picard.jar ValidateSamFile \
    I=input_bam \
    OUTPUT=output_errors.list \
    MODE=VERBOSE \
    IGNORE_WARNINGS=true

    /usr/lib/jvm/java-1.8.0-openjdk-amd64/bin/java -Xmx2g -jar /opt/picard-tools/picard.jar ValidateSamFile \
    I=input_bam \
    OUTPUT=output_warnings_errors.list \
    MODE=VERBOSE

    I could try the same with those other values and see if it works or not.

  • elcinchu27 Broad Member, Broadie Posts: 19
    edited March 14

    I just checked the input and output files of "PrintReads" with "-DBQ 20" and I don't see any difference between them: the asterisk is still in the new BAM generated by the program. I don't know why nothing changed between the two files.

    1) input.bam:

    HWI-ST731_18:2:1101:10003:49500#8@0 77 * 0 0 * * 0 0 TTTTCCATAATAGACGCAACGCGAGCAGTAGACTCATTCTGTTGATAAGCAAGCATCTCATTTTGTGCATATACTT
    ????II???I??I?I5???I?I????+55?+?+5?+5+?????I???I############################
    PG:Z:MarkDuplicates RG:Z:1271ND
    @DDDBDBF<;FF?G::@FG>;GB?@?)00B* B*/9)/)8==BCG;;FH############################

    2) qscores.bam:
    HWI-ST731_18:2:1101:10003:49500#8@0 77 * 0 0 * * 0 0 TTTTCCATAATAGACGCAACGCGAGCAGTAGACTCATTCTGTTGATAAGCAAGCATCTCATTTTGTGCATATACTT
    ????II???I??I?I5???I?I????+55?+?+5?+5+?????I???I############################
    PG:Z:MarkDuplicates RG:Z:1271ND
    @DDDBDBF<;FF?G::@FG>;GB?@?)00B* B*/9)/)8==BCG;;FH############################

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie Posts: 11,388 admin

    Ah, I think I was partly wrong and the -DBQ argument only injects base quals for on the fly computation, but the values aren't actually replaced in the sam records.

    More to the point though, it looks like your reads do have base qualities. I thought from your original post that they didn't -- that was the original poster's problem. So your problem is actually that you're missing mapping qualities. I'm not up to speed on all the requirements of CollectMultipleMetrics but I wouldn't be surprised if it also required mapping qualities. This brings us back to my recommendation to ask the authors of the pipeline to state the data requirements for the pipeline to work.

    Geraldine Van der Auwera, PhD
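
For context on the record shown above, its FLAG value is 77, and decoding the bits helps explain the asterisks in the mapping columns. This is a minimal Python sketch; the bit meanings follow the SAM specification, while the short names in `SAM_FLAGS` and the function name `decode_flag` are chosen here for illustration.

```python
# FLAG bits as defined in the SAM specification; the labels are informal.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


print(decode_flag(77))
# prints ['paired', 'unmapped', 'mate_unmapped', 'first_in_pair']
```

So the record is an unmapped read whose mate is also unmapped, for which "*" as the reference name and 0 as position and mapping quality are expected. That alone does not show which record triggers the exception in CollectMultipleMetrics.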
