Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

How to make MergeBamAlignment work?

CharlesDavidCharlesDavid New ZealandMember

Hi guys, I think I am missing something with this process (obviously...):
Using latest version of Picard Tools.
Reads are Illumina 100bp single end.
Pre-processing includes de-multiplexing, FastQC and Trimmomatic
I have the uBAMs with read groups added, verified no errors, sorted by query name.
I have the aligned BAMs from BWA-MEM verified with only error "no read groups", sorted by query name.
The header of the uBAM has only the @HD and @RG tags.
The header of the BAM has @HD, multiple @SQ, and @PG tags. No @RG.
I have tried about a dozen times to make this fly but the program ALWAYS throws an ArrayIndexOutOfBoundsException.
Here is the error file from one example:

[Tue May 23 13:41:22 NZST 2017] picard.sam.MergeBamAlignment UNMAPPED_BAM=/project1_APEK1/021.MIA/sorted/C_F1_1.bam ALIGNED_BAM=[/project1_APEK1/022.BWA/sorted/C_F1_1.bam] OUTPUT=/project1_APEK1/022.BWA/MBA/C_F1_1.bam REFERENCE_SEQUENCE=/genome/Peaxi162/Petunia_3000.fasta CLIP_ADAPTERS=false MAX_INSERTIONS_OR_DELETIONS=-1 ATTRIBUTES_TO_RETAIN=[XS] SORT_ORDER=coordinate PRIMARY_ALIGNMENT_STRATEGY=BestMapq CLIP_OVERLAPPING_READS=true INCLUDE_SECONDARY_ALIGNMENTS=true ADD_MATE_CIGAR=true TMP_DIR=[/project1_APEK1/tmp] VALIDATION_STRINGENCY=SILENT MAX_RECORDS_IN_RAM=5000000 CREATE_INDEX=true PAIRED_RUN=true IS_BISULFITE_SEQUENCE=false ALIGNED_READS_ONLY=false ATTRIBUTES_TO_REVERSE=[OQ, U2] ATTRIBUTES_TO_REVERSE_COMPLEMENT=[E2, SQ] READ1_TRIM=0 READ2_TRIM=0 ALIGNER_PROPER_PAIR_FLAGS=false UNMAP_CONTAMINANT_READS=false MIN_UNCLIPPED_BASES=32 MATCHING_DICTIONARY_TAGS=[M5, LN] UNMAPPED_READ_STRATEGY=DO_NOT_CHANGE VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Tue May 23 13:41:22 NZST 2017] Executing as [email protected] on Linux 3.10.0-514.16.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_131-b11; Picard version: 2.9.2-SNAPSHOT
INFO 2017-05-23 13:41:22 SamAlignmentMerger Processing SAM file(s): [/project1_APEK1/022.BWA/sorted/C_F1_1.bam]
INFO 2017-05-23 13:41:39 AbstractAlignmentMerger Merged 1,000,000 records. Elapsed time: 00:00:17s. Time for last 1,000,000: 16s. Last read position: Peaxi162Scf00133:585,031
INFO 2017-05-23 13:42:01 AbstractAlignmentMerger Merged 2,000,000 records. Elapsed time: 00:00:39s. Time for last 1,000,000: 21s. Last read position: Peaxi162Scf00376:259,458
INFO 2017-05-23 13:42:14 AbstractAlignmentMerger Merged 3,000,000 records. Elapsed time: 00:00:52s. Time for last 1,000,000: 13s. Last read position: Peaxi162Scf00566:712,679
[Tue May 23 13:42:27 NZST 2017] picard.sam.MergeBamAlignment done. Elapsed time: 1.10 minutes.
Runtime.totalMemory()=3898081280
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 96
at htsjdk.samtools.util.SequenceUtil.calculateMdAndNmTags(SequenceUtil.java:910)
at picard.sam.AbstractAlignmentMerger.fixNmMdAndUq(AbstractAlignmentMerger.java:558)
at picard.sam.AbstractAlignmentMerger.mergeAlignment(AbstractAlignmentMerger.java:533)
at picard.sam.SamAlignmentMerger.mergeAlignment(SamAlignmentMerger.java:181)
at picard.sam.MergeBamAlignment.doWork(MergeBamAlignment.java:282)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)

I would be grateful if you could point me in the right direction to get this going.

Thanks! :-)

Answers

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @CharlesDavid,

    Your uBAM, is it from before all your preprocessing or the BAM after Trimmomatic (and before alignment)? Also, can you double-check that both the uBAM and aligned BAM are in the same sort order?

    To make sense of your error--ArrayIndexOutOfBoundsException--see this post.

    My first guess is that there may be a mismatch in read lengths from the trimming. This will cause an error. My second guess would be that the aligned BAM is in coordinate order while the uBAM remains queryname-grouped. This will also cause an error.

    Our Genomics Platform has a pipelined workflow that includes a MergeBamAlignment step. You can see an older version of it explained here. It uses a newer feature of MergeBamAlignment where it merges queryname-grouped BAMs then adds a few more steps to finish preprocessing, e.g. fixing tags. It also uses a tool called SetNmAndUqTags that has since been replaced with SetNmMdAndUqTags.

    If the solution doesn't present itself, come back and let us know.

  • CharlesDavidCharlesDavid New ZealandMember

    Hi @shlee thanks for getting back to me.

    The answer to your two questions are:

    1. The uBAMs were constructed AFTER de-multiplexing and trimming, so the read lengths range from 60 to 100 bp
      (No trimming done after constructing uBAM)

    2. Both are verified and sorted by queryname.

    The workflow is a GBS protocol, and the summary is:

    raw fastq --> de-multiplex --> verify restriction enzyme site at start --> Trimmomatic for adapters,etc --> FastqToSam (add read group metadata) --> MarkIlluminaAdapters --> SamSort (by queryname) --> SamToFastq --> BWA-MEM (aligned BAMs) --> SamSort (by queryname) --> MergeBamAlignment (FAIL)

    My goal was to implement the piped version of the last three steps here, that you described in another post quite nicely... :-)

    Charles

  • CharlesDavidCharlesDavid New ZealandMember

    Hi, Have not heard back from you... Could you please confirm if my workflow as described above is compatible with the use of the MergeBamAlignment tool? It it is, maybe you could suggest some de-bugging?? Thanks!

  • CharlesDavidCharlesDavid New ZealandMember

    Hello @shlee I have some additional info that may help diagnose this issue: I am able to run the MergeBamAlignment tool, but ONLY if I do not include secondary alignments... very strange.
    As soon as the command is changed to allow them, PT crashes: See error files.
    Perhaps there is some issue with the CIGAR string in the secondary reads...
    All files were verified for errors and sort-order. So to get to the bottom of this, I am attaching some files containing samples of the uBAM and BAM, as well as the error files. Also included is the command lines used, and a sample of the merged bam header and body.
    Clearly I am still missing something, hopefully another set of eyes will catch it...

    Issue · Github
    by shlee

    Issue Number
    2112
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    sooheelee
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @CharlesDavid

    Thanks for the additional info. I will see what I find from your uploaded datasets. In the meanwhile, check out this GBS thread.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @CharlesDavid ,

    When you say:

    Both are verified

    Do you mean by our tool ValidateSamFile?

  • CharlesDavidCharlesDavid New ZealandMember
  • CharlesDavidCharlesDavid New ZealandMember

    Hey @shlee I have some additional diagnostics for you: While ValidateSamFIle verifies the BAMS that are to be merged as OK, if I ask MergeBamAlignment to clip adapters, I get some additional errors. It looks like the read length information is being munted by one of the processes in the workflow...not sure which one... but seems consistent with the JAVA error we are getting. Please see the attached file for the errors.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hey @CharlesDavid,

    Nothing pops out as suspicious when I glance at your reads. What I do have to say is that I noticed you have single ended reads for which many produce secondary alignments. Also, your reference genome has 1000 or some large number of contigs.

    MergeBamAlignment should handle this situation so long as both your uBAM and aligned BAM are both queryname grouped and the uBAM contains all the reads that the aligned BAM contains. I recommend using Picard SortSam for sorting as there are minor differences in sorting between different programs. For the latter comment, see this thread.

    You do realize that GATK tools mostly exclude secondary alignments from consideration? That is, you won't be calling variants on these.

    I took a look at this abstract and see:

    The genomes reveal that the Petunia lineage has experienced at least two rounds of hexaploidization

    If your reference genome represents the same gene locus over multiple contigs, then I think you may be interested in the new standard in reference genomes that use alternate (alt) contigs. The example we use in our forum is the human GRCh38. This blogpost and this tutorial outline how the latest programs allow for alt-aware mapping.

    If MergeBamAlignment still fails for your carefully sorted reads, then please follow the bug report protocol we give a link to in the left menu.

  • CharlesDavidCharlesDavid New ZealandMember

    Hi @shlee thanks for the reply. Yes, the petunia reference has over 86,000 contigs, of which I use 'only' about 4,000 that are longer than 3kb. And yes, the SE reads are expected to have many secondary alignments...which is why we use (currently) 3 variant calling workflows, GATK being one of those.

    You said:

    "MergeBamAlignment should handle this situation so long as both your uBAM and aligned BAM are both queryname grouped and the uBAM contains all the reads that the aligned BAM contains. I recommend using Picard SortSam for sorting as there are minor differences in sorting between different programs."

    I agree and that is the method I used for sorting:

    COMMAND="${JAVA} ${PICARD} SortSam \
    MAX_RECORDS_IN_RAM=5000000 \
    VALIDATION_STRINGENCY=STRICT \
    TMP_DIR=${PROJECT}/tmp \
    I=${bam} \
    O=${OUT}/${sampleID}.bam \
    SORT_ORDER=queryname"

    What I do not understand is:

    1. Why does the CigarUtil throws errors when the option to clip adapters is set to true.
      Specifically, why is there confusion about the read length? and
      Why is the read name being confused with a CIGAR string?

      ERROR CigarUtil 96 read length does not = cigar length 74 oldCigar
      700666F:231:C9P2BANXX:6:1101:10640:18956 96b aligned read. cigar:64M2I10M96S

    2. Why does MergeBamAlignment work fine as long as the option to output secondary alignments is set to false?

    The last one is a big one for this project given the points you made above...

    But the problem or bug is probably related to the first one...

    I really appreciate your time and insight with this! :smile:

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited June 2017

    Hi @CharlesDavid,

    So you are saying that after switching from SamSort to Picard SortSam, you are still encountering merging issues? Did you check that all the reads that are in the alignment are present in your uBAM?

    For the clip adapters setting in MergeBamAlignment, please see my reply to Tim (@timk) in this thread. I thought from our previous discussion that all of your trimming was done pre-uBAM. If your read lengths in the uBAM and in the aligned BAM mismatch beyond the hard-clipping that aligners do (and that they then note in the CIGAR string), then there will be errors related to resolving the CIGAR string.

    I just noticed from your last attached error stack-trace that you are using OpenJDK:

    OpenJDK 64-Bit Server VM warning: ignoring option UseStringCache; support was removed in 8.0
    

    We don't officially support OpenJDK. I believe certain settings within tools and program versions should run fine in Open JDK, e.g. those given in this workflow (there is a MergeBamAlignment step). All other program versions and tool options are not guaranteed. So my suggestion is to try running the tools using Oracle JDK and see if this allows you to retain secondary alignments.

    Please remember that using MergeBamAlignment in workflows is one of multiple acceptable ways to pre-process your alignments. You can also add back metadata to your alignments using Picard's AddOrReplaceReadGroups. Really, for your SE reads, the benefit of using MergeBamAlignment is less than if you had PE reads so I don't see any additional benefit than with using, e.g. AddOrReplaceReadGroups.

    If errors persist with Oracle JDK, then please file a bug report. For the file snippets, be sure to use one of our tools, e.g. PrintReads, so that file extensions and SAM format elements, e.g. the invisible marker at the end that marks the end of a file, are retained.

    Finally, I must ask. Is Petunia's large number of contigs a limitation of the genome assembly or truly representative of the species? I'm curious to know if this is a scenario our tools should consider for performance or if your assembly will improve going forward with say use of long read techs.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hey @CharlesDavid,

    Did you figure out if this is an OpenJDK issue? We're very keen to find out because we hope to support OpenJDK going forward.

  • CharlesDavidCharlesDavid New ZealandMember

    Hi @shlee I have verified that this is NOT a Java issue: The same errors occur in Oracle as in OpenJDK. Neither is the version of Picard Tools: 2.9.4 vs 3.10... I think I have narrowed the problem to a specific part of the code: SequenceUtil.calculateMdAndNmTags and possibly to CigarUtil. I am prepared to submit a bug report with my test data that includes a down-sampled unaligned bam file, the resulting aligned bam file and the clean bam file as processed by MergeBamAlignment (when secondary alignments are not included). I have the ERROR files for the case that works and the case that fails: when asking for secondary alignments to be included. I also have text files showing the reads cited by CigarUtil as errors, along with the erroneous cigar that it reports: These are in the aligned and cleaned bam for comparison. Also worth noting that the protocol works without errors on paired-end data.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited July 2017

    Hi @CharlesDavid,

    Sure, let's get your Bug Report. I think I gave you a link with instructions in a previous post.

    Post edited by shlee on
  • CharlesDavidCharlesDavid New ZealandMember

    Hi @shlee I have verified that this is NOT an OpenJDK Java issue: The same errors occur in Oracle as in OpenJDK. Same for the version of Picard Tools: 2.9.4 vs 2.10.3 I think I have narrowed the problem to a specific part of the code: SequenceUtil.calculateMdAndNmTags (and possibly to CigarUtil). I am prepared to submit a bug report with my test data that includes a down-sampled unaligned bam file, the resulting aligned bam file and the clean bam file as processed by MergeBamAlignment (when secondary alignments are not included). I have the ERROR files for the case that works and the case that fails: when asking for secondary alignments to be included. I also have text files showing the reads cited by CigarUtil as errors, along with the erroneous cigar that it reports: These are in the aligned and cleaned bam for comparison.

  • CharlesDavidCharlesDavid New ZealandMember

    OK, I have the ftp site but am not able to connect using WINSCP. I can connect form command line, but not able to upload the tarball (get 550 errors no matter what). Could you verify settings for the FTP client?

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    @CharlesDavid,

    Can you try again? Our server was down over the weekend for routine maintenance.

  • CharlesDavidCharlesDavid New ZealandMember

    @shlee OK. Done ...
    The name of the archive is: 'GATK-discussion-9632-how-to-make-mergebamalignment-work.tar.gz'
    Thanks :-)

    Issue · Github
    by shlee

    Issue Number
    2332
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    sooheelee
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited July 2017

    Hi @CharlesDavid,

    It appears you've sent us text files with records (with extension .bam.txt) instead of actual SAM or BAM files. We don't have the bandwidth to reformat your text files. Can you please use PrintReads to create workable data that meets SAM/BAM specifications? Thanks.

    P.S. To pull out mates, you can follow the workflow implied at the bottom of this blogpost.

  • CharlesDavidCharlesDavid New ZealandMember

    @shlee oops, forgot to include the bams! Will upload again: 'GATK-discussion-9632-how-to-make-mergebamalignment-work2.tar.gz'. Please see the README that describes the procedure and included files. The outputs document the exact commands run and the errors produced, and I have extracted from the bam files the exact reads that throw the errors. Note the the down-sampled bam file is very small for testing purposes... only 419kb.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    I've received the new tarball @CharlesDavid. Thanks. I will see if I can recapitulate the error on our end and then get back to you.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hey @CharlesDavid, do you mind also giving us your reference fasta, fai index and dictionary? MergeBamAlignment requires the reference sequence and I don't see it in the files you sent over.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited July 2017

    Please disregard my previous request. I've generated your subset reference using the much larger Petunia_axillaris_v1.6.2_genome.fasta reference I found online at ftp://ftp.solgenomics.net/genomes/Petunia_axillaris/assembly/. Since this works with the tool, I assume it is the correct base reference.

    I can recapitulate the oddities you observe:
    [1] For the successful MBA run that excludes secondary alignments, the run completes with cryptic error messages about read length does not = cigar length.
    [2] An ArrayIndexOutOfBoundsException: 96 when you try to include secondary alignments in the MBA run.

    I'll be tracing these back to the code and getting developer feedback.

    Post edited by shlee on
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    I have found the source of your error @CharlesDavid. Your secondary alignments are malformed in a manner that bypasses our filters as well as ValidateSamFile. These records actually fail to meet SAM specifications. For example, records have CIGAR strings and other metadata indicating alignment but are missing actual bases and base qualities in the record.

    You can view the reads yourself with

    samtools view -f 256 test.aligned.bam | less 
    

    And I hope you don't mind I post these secondary alignments as an example for the wider community:

    700666F:231:C9P2BANXX:6:1101:11129:79671        256     Peaxi162Scf00636        247238  0       34H50M12H       *       0       0       *       *       NM:i:0  MD:Z:50 AS:i:50
    700666F:231:C9P2BANXX:6:1101:11129:79671        256     Peaxi162Scf00568        309467  47      44M52H  *       0       0       CTGCAATCCTTCCCTCACATCGGATTCACTTGTCCCACCCTCGC    CGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG    NM:i:1  MD:Z:20T23      AS:i:39 XS:i:0  SA:Z:Peaxi162Scf00235,524895,+,31S65M,47,0;
    700666F:231:C9P2BANXX:6:1101:16521:73953        256     Peaxi162Scf00727        4831    0       96M     *       0       0       *       *       NM:i:0  MD:Z:96 AS:i:96
    700666F:231:C9P2BANXX:6:1101:16521:73953        256     Peaxi162Scf00727        9866    0       96M     *       0       0       *       *       NM:i:1  MD:Z:94A1       AS:i:94
    700666F:231:C9P2BANXX:6:1101:16521:73953        272     Peaxi162Scf00347        5595    0       23H73M  *       0       0       *       *       NM:i:0  MD:Z:73 AS:i:73
    700666F:231:C9P2BANXX:6:1101:17055:36510        256     Peaxi162Scf00132        52003   0       48H48M  *       0       0       *       *       NM:i:0  MD:Z:48 AS:i:48
    700666F:231:C9P2BANXX:6:1101:17244:84588        256     Peaxi162Scf00911        214419  0       57M16H  *       0       0       *       *       NM:i:2  MD:Z:14A9A32    AS:i:47
    700666F:231:C9P2BANXX:6:1101:19188:85979        272     Peaxi162Scf00206        1588234 0       96M     *       0       0       *       *       NM:i:4  MD:Z:8T63G3A16A2        AS:i:78
    700666F:231:C9P2BANXX:6:1101:20787:7531 256     Peaxi162Scf00217        137270  0       96M     *       0       0       *       *       NM:i:1  MD:Z:80T15      AS:i:91
    700666F:231:C9P2BANXX:6:1101:20787:7531 272     Peaxi162Scf00211        245255  0       96M     *       0       0       *       *       NM:i:3  MD:Z:3C15G4G71  AS:i:82
    

    I believe in a previous conversation with you I mentioned that GATK tools ignore secondary alignments and that your research would benefit from using supplementary alignments instead and possibly benefit from alt-aware alignment with a better organized reference genome.

    Given you previously mentioned that you need these secondary alignments, I suggest you redo your alignment with an aligner or alignment mode that produces alignment records that meet SAM specs. Also, if you will use GATK tools downstream to analyze these currently secondary alignments, I strongly suggest you use supplementary markings instead of secondary. See if your aligner has some mode to switch between these.

    As for the odd ERROR/WARN for your successful MergeBamAlignment run that excludes secondary alignments, since the tool successfully runs despite these, unless you find downstream processes choking on these same reads, I think we can ignore these. Let us know otherwise.

  • CharlesDavidCharlesDavid New ZealandMember

    Hi @shlee thanks for your reply. However, still confused. Here's why: I am implementing your workflow, using the aligner you used in your article: https://gatkforums.broadinstitute.org/gatk/discussion/6483/how-to-map-and-clean-up-short-read-sequence-data-efficiently. You say "Your secondary alignments are malformed in a manner that bypasses our filters as well as ValidateSamFile. These records actually fail to meet SAM specifications.", the fact is that these were produced by BWA mem. If there are issues with BWA, could you elaborate on what needs to be modified? Also, if the reads are malformed, why are they bypassing both filters and the ValidateSamFile tool that is designed just for that purpose? In my original remarks, I mentioned that I am applying your workflow to SE reads, and I asked if the GATK tools were compatible with that. I believe that this is still an open question. Let's focus on why the GATK/Picard tools are choking on the outputs of BWA mem (and STAR). Perhaps there are parameters that must be adjusted for use. In any case, I am still hoping that your workflow can be adapted to SE reads and for situations where secondary alignments are needed. Clearly, you intended this in your article where this is set: INCLUDE_SECONDARY_ALIGNMENTS=true \ #default. Now that we know the source of the error, what remains is to determine how to achieve compatibility.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited July 2017

    @CharlesDavid,

    What is the version of BWA you are using and can you post your related commands? We recommend specific modes of the aligner for our workflows.

    Working with SE reads should be fine.

  • CharlesDavidCharlesDavid New ZealandMember

    @shlee OK:
    We are using bwa-0.7.15
    The bwa mem command is:
    bwa mem -a -S -P -M -t 16 ${INDEX} /dev/stdin where $INDEX is the bwa index
    We are using Picard-Tools 2.10.1
    The full piped command is:
    ${JAVA} ${PICARD} SamToFastq \
    I=${uBAM} \
    FASTQ=/dev/stdout \
    CLIPPING_ATTRIBUTE=XT \
    CLIPPING_ACTION=2 \
    INTERLEAVE=false \
    NON_PF=true \
    TMP_DIR=${TEMP} | \
    ${BWA} | \
    ${JAVA} ${PICARD} MergeBamAlignment \
    ALIGNED_BAM=/dev/stdin \
    UNMAPPED_BAM=${uBAM} \
    OUTPUT=${OUT}/${sampleID}.bam \
    R=${GENOME} \
    ADD_MATE_CIGAR=false \
    ALIGNED_READS_ONLY=true \
    CLIP_ADAPTERS=true \
    CLIP_OVERLAPPING_READS=false \
    INCLUDE_SECONDARY_ALIGNMENTS=true \
    MAX_INSERTIONS_OR_DELETIONS=-1 \
    PRIMARY_ALIGNMENT_STRATEGY=BestMapq \
    ATTRIBUTES_TO_RETAIN=XS \
    MAX_RECORDS_IN_RAM=10000000 \
    VALIDATION_STRINGENCY=STRICT \
    SORT_ORDER=coordinate \
    CREATE_INDEX=true \
    TMP_DIR=${TEMP}
    Where $GENOME is the reference, and
    $JAVA="java -jar -Xms8G -Xmx64G \
    -Djava.io.tmpdir=${TEMP} \
    -Dsamjdk.buffer_size=131072 \
    -Dsamjdk.use_async_io=true \
    -Dsamjdk.compression_level=1 \
    -Dsamjdk.USE_ASYNC_IO_READ_FOR_SAMTOOLS=true \
    -Dsamjdk.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS=true \
    -XX:GCTimeLimit=50 \
    -XX:GCHeapFreeLimit=10"
    Let's see if we can find the 'happy mode' for these commands :-)

    Issue · Github
    by shlee

    Issue Number
    2354
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    sooheelee
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    @CharlesDavid, before I spend time on this, can we make sure you can reproduce the odd reads? Can you run BWA in isolation just on the FASTQ records that end up as secondary alignments (from your snippet) and see if these reproduce as problematic? Please do not pipe any step.

    For the tutorial you link, I believe I used an older version of BWA (version circa November 2015), which I did not note in the tutorial. Sorry about that. I was new to the field at the time and that was one of the first tutorials I wrote. I used BWA v0.7.12 back then. BWA v0.7.15 is the version I use in the GRCh38 alt-aware alignment tutorial. Actually, it appears that there is a new version of BWA (v0.7.16) that was released 16 hours ago. I suggest you try both the 0.7.12 version and the latest (15 or 16) version. Thanks.

  • CharlesDavidCharlesDavid New ZealandMember

    @shlee Yes, the errors are fully reproducible. I just re-ran the whole workflow starting with a down sampled fastq. Changing versions of bwa to latest had no affect. Have you looked over the run parameters to see if anything stands out as a known issue? I admit that I am confused by the outputs in the aligned bam compared to the merged bam produced by MBA when it works, especially when trying to make sense of the error messages... Really hoping you can tease this apart or optimistically, identify incompatible parameters.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @CharlesDavid,

    You say the errors are reproducible for your entire workflow. Did you test removing all the piping? Piping can cause odd errors.

    BWA isn't one of our tools and so we have limited capacity to help you with its many run parameters. Sorry, but you are on your own there. We can only recommend you use the parameters we present in our Best Practice Workflows for error free workflows. I understand that this may not be ideal for different organisms and their available references. I recommend you search to see if others have had similar problems at BioStars, SeqAnswers and the BWA repo (https://github.com/lh3/bwa). If you cannot find explanations, then you should post your question to these sites. I'm sorry we cannot help additionally. We are a small team.

  • CharlesDavidCharlesDavid New ZealandMember

    Hi @shlee Yes, the errors are persistent when no piping is done. I have included all files in the workflow. I still want to know what is causing the GATK specific errors. This debugging can only be done by you. FYI the piped command works fine with PE RNA seq data using the STAR aligner, outputting secondary alignments without errors. It seems to me that a unit test should be run on the reads that I provided to you. This should be very easy for the developer working on PT MergeBamAlignment. As a supporter of GATK, I would like to know what the limitations of the software are regarding SE reads and secondary alignments. Unanswered still is the question of whether these GATK tools are compatible with SE reads. Receiving errors that are uninformative and uncaught java problems like ArrayOutOfBounds exceptions is not productive, and we should determine the root cause. I believe it would be time well spent for you... the scope of the software will be better clarified, possible bugs identified, and a deeper knowledge base established. A win-win in my book :-)

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Kia ora @CharlesDavid,

    The GATK errors you encountered are due to invalid SAM record formats. Since you say you encounter these types of reads even when you remove piping, then the source of these bad records is your alignment process. Yes? GATK allows you to remove these ill-formatted records and these all happen to be secondary alignments, which you say you want to keep as they are important to your research. As I recommended before, you should check other forums and BWA documentation to see how to avoid producing these types of odd alignments. I would additionally recommend testing out different versions of BWA, perhaps a version that you saw others use with the same exact parameters you are using with success from a publication.

    The STAR aligner is a different aligner that doesn't produce invalid SAM records. So it is not surprising that these alignments would work fine in the parallel piped workflow. Again, the problem is not with GATK tools. They are with your alignment. And again, GATK supports SE libraries. There is no evidence to indicate otherwise in this situation. Did I miss something?

    We thank you for suggesting we spend time debugging your situation. However, our focus is mainly GATK and Picard functions that support our Best Practice Workflows that feature GATK functionalities. I agree it would be nice to have a MergeBamAlignment feature that corrects for your type of odd alignment records. It does seem an extension of MergeBamAlignment's functions. Picard tools are in the public domain with many collaborators across the world. So, for feature requests I would suggest you open a ticket in the Picard Github repository at https://github.com/broadinstitute/picard/issues and state your need. A developer may choose to pick it up.

Sign In or Register to comment.