picard SortSam | picard SetNmMdAndUqTags fails on post reference genome alignment

maartenSmaartenS AmsterdamMember
edited March 2017 in Ask the GATK team

I am trying to run a https://github.com/broadinstitute/wdl/tree/develop/scripts/broad_pipelines style pipeline. The picard SortSam | picard SetNmMdAndUqTags fails because BWA aligned some part of the reads beyond the end of the chromosome, which seems normal behaviour for BWA MEM (does it?) I checked the md5sum of the reference files used by bwa are the same as on
gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38*

Did I make a mistake in the pipeline and missing something obvious?

I provided the output of the commands and an minimal sam file to recreate the problem.

non al:

$picard SortSam TMP_DIR=. INPUT=minimal.bam OUTPUT=/dev/stdout SORT_ORDER="coordinate" |$picard SetNmMdAndUqTags INPUT=/dev/stdin OUTPUT=NA12878_sorted.bam  REFERENCE_SEQUENCE=/cvmfs/softdrive.nl/maartenk/project_mine/HG38/reference/Homo_sapiens_assembly38.fasta
[Mon Mar 13 12:02:58 CET 2017] picard.sam.SortSam INPUT=minimal.bam OUTPUT=/dev/stdout SORT_ORDER=coordinate TMP_DIR=[.]    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Mon Mar 13 12:02:58 CET 2017] picard.sam.SetNmMdAndUqTags INPUT=/dev/stdin OUTPUT=NA12878_sorted.bam REFERENCE_SEQUENCE=/cvmfs/softdrive.nl/maartenk/project_mine/HG38/reference/Homo_sapiens_assembly38.fasta    IS_BISULFITE_SEQUENCE=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Mon Mar 13 12:02:58 CET 2017] Executing as maartenk@ui on Linux 2.6.32-642.6.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_102-b14; Picard version: 2.9.0-1-gf5b9f50-SNAPSHOT
[Mon Mar 13 12:02:58 CET 2017] Executing as maartenk@ui on Linux 2.6.32-642.6.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_102-b14; Picard version: 2.9.0-1-gf5b9f50-SNAPSHOT
[Mon Mar 13 12:02:58 CET 2017] picard.sam.SortSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=251658240
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 1, Read name C4681ANXX:1:1109:2167753:0, Mate Alignment start (59024838) must be <= reference sequence length (57227415) on reference chrY
    at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:448)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:796)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.<init>(BAMFileReader.java:769)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.<init>(BAMFileReader.java:757)
    at htsjdk.samtools.BAMFileReader.getIterator(BAMFileReader.java:465)
    at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.iterator(SamReader.java:473)
    at picard.sam.SortSam.doWork(SortSam.java:99)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)
[Mon Mar 13 12:02:58 CET 2017] picard.sam.SetNmMdAndUqTags done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=251658240
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMException: Input must be coordinate-sorted for this program to run. Found: unsorted
    at picard.sam.SetNmMdAndUqTags.doWork(SetNmMdAndUqTags.java:96)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)

Increasing the VALIDATION_STRINGENCY to LENIENT does not help since it will crash in SetNmMdAndUqTags

[Mon Mar 13 12:04:50 CET 2017] picard.sam.SetNmMdAndUqTags done. Elapsed time: 0.21 minutes.
Runtime.totalMemory()=440926208
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 59024837
    at htsjdk.samtools.util.SequenceUtil.sumQualitiesOfMismatches(SequenceUtil.java:497)
    at picard.sam.AbstractAlignmentMerger.fixNmMdAndUq(AbstractAlignmentMerger.java:563)
    at picard.sam.SetNmMdAndUqTags.lambda$doWork$0(SetNmMdAndUqTags.java:106)
    at java.util.stream.ReferencePipeline$11$1.accept(ReferencePipeline.java:372)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
    at picard.sam.SetNmMdAndUqTags.doWork(SetNmMdAndUqTags.java:107)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)

a picard ValidateSamFile reports on the minimal exampl:

ERROR: Record 1, Read name C4681ANXX:1:1109:2167753:0, Mate Alignment start (59024838) must be <= reference sequence length (57227415) on reference chrY
ERROR: Record 1, Read name C4681ANXX:1:1109:2167753:0, Mate CIGAR M operator maps off end of reference
ERROR: Record 2, Read name C4681ANXX:1:1109:2167753:0, Alignment start (59024838) must be <= reference sequence length (57227415) on reference chrY
ERROR: Record 2, Read name C4681ANXX:1:1109:2167753:0, Read CIGAR M operator maps off end of reference

A minimal example file

@HD VN:1.5  GO:none SO:queryname
@SQ SN:chrY LN:57227415 M5:ce3e31103314a704255f3cd90369ecce UR:file:/scratch2/testingmicro_MergeBamAlignment_3/Homo_sapiens_assembly38.fasta
@SQ SN:chrY_KI270740v1_random   LN:37240    M5:69e42252aead509bf56f1ea6fda91405 UR:file:/scratch2/testingmicro_MergeBamAlignment_3/Homo_sapiens_assembly38.fasta
@RG ID:0    PL:ILLUMINA SM:NA12878  PU:C4681ANXX:1:none
@RG ID:1    PL:ILLUMINA SM:NA12878  PU:C4681ANXX:2:none
@RG ID:2    PL:ILLUMINA SM:NA12878  PU:C468BANXX:1:none
@PG ID:bwamem   VN:0.7.15-r1140 CL:bwa mem -K 100000000 -p -v 3 -t 8 /cvmfs/softdrive.nl/maartenk/project_mine/HG38/reference/Homo_sapiens_assembly38.fasta PN:bwamem
@PG ID:MarkDuplicates   VN:2.9.0-1-gf5b9f50-SNAPSHOT    CL:picard.sam.markduplicates.MarkDuplicates INPUT=[/data/scratch/10366498.batch.gina.sara.nl/NA12878_0.bam, /data/scratch/10366498.batch.gina.sara.nl/NA12878_1.bam, /data/scratch/10366498.batch.gina.sara.nl/NA12878_2.bam] OUTPUT=NA12878MarkDuplicates.bam METRICS_FILE=NA12878MarkDuplicates.metrics.txt ASSUME_SORT_ORDER=queryname OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 TMP_DIR=[.] VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=1    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> VERBOSITY=INFO QUIET=false MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json PN:MarkDuplicates   PP:bwamem
C4681ANXX:1:1109:2167753:0  99  chrY    56878349    0   94M31S  =   59024838    2146524 CTCGATTTCATCCAAAAGTTTGGGCAGTGATCCCATCCACACTANNNNNNNNNNNNNNNNNNNNNAGTATAACACCAGGACATCGAGNNNATGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN   BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF!!!!!!!!!!!!!!!!!!!!!<<<FFFFFFFFFFFFFFFFFFF!!!<<BF!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!   BC:Z:none   MC:Z:90S35M MD:Z:44A0A0A0A0A0G0G0A0A0G0A0G0A0T0C0A0T0A0C0A0G22A0T0C4    PG:Z:MarkDuplicates RG:Z:0  NM:i:24 SM:i:416    MQ:i:60 AS:i:46 ZX:i:1944   ZY:i:10212
C4681ANXX:1:1109:2167753:0  147 chrY    59024838    60  90S35M  =   56878349    -2146524    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGNTTTGTATATTTTACCANNA   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<!FFFFFFFFFFFFFF<<!!B   BC:Z:none   MC:Z:94M31S PG:Z:MarkDuplicates RG:Z:0  NM:i:17 SM:i:191    MQ:i:0  AS:i:705    ZX:i:1944   ZY:i:10212

Software used:

picard 2.9.0
jre1.8.0_102
BWA 0.7.15-r1140
CentOS 6.x

Issue · Github
by Sheila

Issue Number
1813
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
vdauwera

Comments

  • valentinvalentin Cambridge, MAMember, Dev
    edited March 2017

    Are you sure? it really seems like the bam was aligned against b37/hg19 where the chromosome Y is 59,373,566 bases long and then the tools were run using b38/hg20 where Y is only 57,227,415 bases long.

    I wonder whether the aligner used is the one that actually includes the reference information in the bam header or whether this is added or altered later by a separated tool.

  • valentinvalentin Cambridge, MAMember, Dev

    Regardless what is in the header the final proof is in the reads sequences....
    and this is rather odd.

    Despite all the Ns in the mate, GNTTTGTATATTTTACCANNA seems to align nicely with chrY:59,024,850 onwards in b37...

    Ok I though that that confirmed my thesis .... however the first in the pair actually aligns to b38 and not b37!!! (the latter contains only Ns around chrY:56878349.

    This is very strange, is as if the sample was aligned against some sort of chimeric reference.

    Btw, I'm using the UCSC genome browser.

  • valentinvalentin Cambridge, MAMember, Dev
    edited March 2017

    Perhaps this is the results of bug in some process that realigns/lifts-over a b37-aligned BAM against b38 and that might keep b37 alignments around by mistake?

  • maartenSmaartenS AmsterdamMember

    @valentin said:
    Are you sure? it really seems like the bam was aligned against b37/hg19 where the chromosome Y is 59,373,566 bases long and then the tools were run using b38/hg20 where Y is only 57,227,415 bases long.

    I do save all commands executed in a database and it does use b38/hg20 reference

    I wonder whether the aligner used is the one that actually includes the reference information in the bam header or whether this is added or altered later by a separated tool.

    I added this using MergeBamAlignment PROGRAM_GROUP* options

    This error occur 210 times for a full size GIAB sample

  • maartenSmaartenS AmsterdamMember

    @valentin said:
    Perhaps this is the results of bug in some process that realigns/lifts-over a b37-aligned BAM against b38 and that might keep b37 alignments around by mistake?

    The source was a Isaac aligned file which has been unaligned by RevertSam | MarkIlluminaAdapters.
    the options for RevertSam were:

    OUTPUT_BY_READGROUP_FILE_FORMAT=bam \
    OUTPUT_BY_READGROUP=true \
    SANITIZE=true \
    MAX_DISCARD_FRACTION=0.005 \
    ATTRIBUTE_TO_CLEAR=BC \
    ATTRIBUTE_TO_CLEAR=SM \
    ATTRIBUTE_TO_CLEAR=OC \
    ATTRIBUTE_TO_CLEAR=OP \
    ATTRIBUTE_TO_CLEAR=AS \
    RESTORE_ORIGINAL_QUALITIES=true \
    REMOVE_DUPLICATE_INFORMATION=true \
    REMOVE_ALIGNMENT_INFORMATION=true\
    SORT_ORDER=queryname \
    
  • valentinvalentin Cambridge, MAMember, Dev

    Do you have access to the BAM file that is produced by the RevertSam command or has it been lost? If you have it around could you print out those 2 reads' records?

  • maartenSmaartenS AmsterdamMember

    I will rerun this sample and extract the unwilling reads.

  • maartenSmaartenS AmsterdamMember

    First of all , thanks for the time and effort you spend on this topic.

    I ran a small sample to save some time: the errors are the same, only another chromosome

    The reads after bwa mem:

    bam header

    @SQ     SN:chr9 LN:138394717
    

    The reads

    D0KU2ACXX:4:1101:973490:0       73      chr9    136723247       60      34M66S  =       136723247       0       CAGGGTGCTCTCCAGCAGTGGCTGGGGCCTGCGTTGGCGTTGCGGGGCAAAATATCAAAAAGGCCTCTTGAGTTCGATGAAGGGGGCCTTGGCTGTCCCT    BBBBF0BBBBBFFFBBFB<BBFFF<BFFII70'0'7'707'7''07'0'077'07'0'0''''000'00<''00'0''0''0''''70<<0'''''0'''    NM:i:0  MD:Z:34 AS:i:34 XS:i:0
    D0KU2ACXX:4:1101:973490:0       133     chr9    136723247       0       *       =       136723247       0       GCGGAGGGCGGCGGGGGCGCCATGCCGCAGGTAGCCCCCAAAGACTGCGCACAACCTGGCCAGGATGGCAGGGGTGCGGTAGCCCTTGCACGTGCTCGTA    <<<<0<B0000070<0'''0<''0''''''77'''007'''''0'00'''''0'00'00''0'''0'0'''''''''''''''00B'000''0'''0''0    AS:i:0  XS:i:0
    

    After MarkDuplicates and MarkDuplicates

    bam header

    @SQ     SN:chr9 LN:138394717    AS:38   M5:6c198acf68b5af7b9d676dfdd531b5de     UR:/seq/references/Homo_sapiens_assembly38/v0/Homo_sapiens_assembly38.fasta     SP:Homo sapiens
    
    

    The reads

    D0KU2ACXX:4:1101:973490:0       99      chr9    136723247       60      34M66S  =       139618051       2894842 CAGGGTGCTCTCCAGCAGTGGCTGGGGCCTGCGTTGGCGTTGCGGGGCAAAATATCAAAAAGGCCTCTTGAGTTCGATGAAGGGGGCCTTGGCTGTCCCT    BBBBF0BBBBBFFFBBFB<BBFFF<BFFII70'0'7'707'7''07'0'077'07'0'0''''000'00<''00'0''0''0''''70<<0'''''0'''    BC:Z:none       MC:Z:62S38M     MD:Z:34 PG:Z:MarkDuplicates    RG:Z:3   NM:i:0  SM:i:417        MQ:i:60 AS:i:34 ZX:i:2132       ZY:i:5115
    D0KU2ACXX:4:1101:973490:0       147     chr9    139618051       60      62S38M  =       136723247       -2894842        TACGAGCACGTGCAAGGGCTACCGCACCCCTGCCATCCTGGCCAGGTTGTGCGCAGTCTTTGGGGGCTACCTGCGGCATGGCGCCCCCGCCGCCCTCCGC    0''0'''0''000'B00'''''''''''''''0'0'''0''00'00'0'''''00'0'''''700'''77''''''0''<0'''0<0700000B<0<<<<    BC:Z:none       MC:Z:34M66S     PG:Z:MarkDuplicates    RG:Z:3   NM:i:4  SM:i:370        MQ:i:60 AS:i:438        ZX:i:2132       ZY:i:5115
    
  • valentinvalentin Cambridge, MAMember, Dev
    edited March 2017

    Not sure I understand this.... the first extract is the bwa mem output and the the second would be the output from MarkDuplicates taking as input the bwa mem output, correct?

    I have to say I'm puzzled by this... so basically MarkDuplicates would be chaning the alignment? I think it only should change the flag values if it applies.

    Please make double sure that the input bam for MarkDuplicates is in fact the output bwa mem (b38) and not the Isaac's original bam (b37). This looks like a pipeline-script bug to me.

    Can you post your exact commands for "bwa mem" and "MarkDuplicates"? Including (absolute) input and output filenames.

  • valentinvalentin Cambridge, MAMember, Dev

    Also, is there no step between bwa mem and MarkDuplicates?

  • maartenSmaartenS AmsterdamMember

    That was a typical case of need more coffee/ need to get some sleep:

    It should mention: After MergeBamAlignment and MarkDuplicates (instead of twice MarkDuplicates)

    I am using a object store to save the intermediate results. The files are copied to the workernode before calling picard/bwa/samtools

    MergeBamAlignment

    /cvmfs/softdrive.nl/maartenk/project_mine/Haplotyper201607/jre1.8.0_102/bin/java -Xmx3000m -jar /cvmfs/softdrive.nl/maartenk/project_mine/Haplotyper201607/picard-2.9.0.jar MergeBamAlignment TMP_DIR=. VALIDATION_STRINGENCY=SILENT EXPECTED_ORIENTATIONS=FR ATTRIBUTES_TO_RETAIN=X0 ALIGNED_BAM=/data/scratch/10568459.batch.gina.sara.nl/speedytestgatkforum_1.bam UNMAPPED_BAM=/data/scratch/10568459.batch.gina.sara.nl/speedytestgatkforum_chunk_1_MarkIlluminaAdapters.bam OUTPUT=out.bam REFERENCE_SEQUENCE=/cvmfs/softdrive.nl/maartenk/project_mine/HG38/reference/Homo_sapiens_assembly38.fasta PAIRED_RUN=true SORT_ORDER="unsorted" IS_BISULFITE_SEQUENCE=false ALIGNED_READS_ONLY=false CLIP_ADAPTERS=false MAX_RECORDS_IN_RAM=2000000 ADD_MATE_CIGAR=true MAX_INSERTIONS_OR_DELETIONS=-1 PRIMARY_ALIGNMENT_STRATEGY=MostDistant PROGRAM_RECORD_ID="bwamem" PROGRAM_GROUP_VERSION="0.7.15-r1140" PROGRAM_GROUP_COMMAND_LINE="bwa mem -K 100000000 -p -v 3 -t 8 /cvmfs/softdrive.nl/maartenk/project_mine/HG38/reference/Homo_sapiens_assembly38.fasta" PROGRAM_GROUP_NAME="bwamem" UNMAP_CONTAMINANT_READS=true COMPRESSION_LEVEL=1
    

    /speedytestgatkforum_chunk_1_MarkIlluminaAdapters.bam is the input file of of SamToFastq|bwa|samtools

    MarkDuplicates

    cvmfs/softdrive.nl/maartenk/project_mine/Haplotyper201607/jre1.8.0_102/bin/java -Xmx4000m -jar /cvmfs/softdrive.nl/maartenk/project_mine/Haplotyper201607/picard-2.9.0.jar MarkDuplicates TMP_DIR=. INPUT=/data/scratch/10568458.batch.gina.sara.nl/speedytestgatkforum_0.bam INPUT=/data/scratch/10568458.batch.gina.sara.nl/speedytestgatkforum_1.bam INPUT=/data/scratch/10568458.batch.gina.sara.nl/speedytestgatkforum_2.bam INPUT=/data/scratch/10568458.batch.gina.sara.nl/speedytestgatkforum_3.bam OUTPUT=speedytestgatkforumMarkDuplicates.bam METRICS_FILE=speedytestgatkforumMarkDuplicates.metrics.txt VALIDATION_STRINGENCY=SILENT OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 ASSUME_SORT_ORDER="queryname" COMPRESSION_LEVEL=1
    

    /data/scratch/10568458.batch.gina.sara.nl/speedytestgatkforum_[0123].bam is the output of MergeBamAlignment

  • valentinvalentin Cambridge, MAMember, Dev

    This starts to be quite complicated, with object stores and all...

    Honestly, with all the information given, I think that this is scripting and or data-management error by the pipeline.

    Perhaps, just perhaps, a tool is failing to remove the original b37 alignment information and another tool downstream is reintroducing it into its output mixed with b38 alignments... For example... is the UNMAPPED_BAM argument of MergeBamAligment really unmapped?

    Anyway, for us to look into a potential bug in a tool, we would need a simple bug reproducing set-up where all the inputs are regular files (no object stores please), small ones, just containing the affected reads, the command line and the version number used.

    In order to spot the step/tool that is causing the issue, my advice is that you work your way back in the pipeline to determine the provenance of those b37 alignments.

    Perhaps If you have the cpu time to redo the work is just best to start from scratch making sure that you are suppressing the b37 alignment information at the very beginning:

    1. Generate reverted .fastq files from the b37 .bam files (relevant tutorial)
      a. either using only RevertSam but making double sure that all the alignment info is gone (e.g. comprehensive checks using commands like "samtools view ... | awk ...")
      b. RevertSam and SamToFastq (I beleive fastq cannot contain any alignment information).

    2. Throw those b37 .bam files out: i.e. store them in a safe place that is inaccessible (or won't ever be accessed) by the pipeline.

    3. Clear the content of your object stores (make double sure they are empty).
    4. Make sure that the pipeline only makes reference to the b38 reference.
    5. Re-run the pipeline out of the .fastq or the clean unaligned bam. If you decided to use .fastq and require an unaligned bam then use FastqToSam at the very beginning of your pipeline.
  • maartenSmaartenS AmsterdamMember

    I figured out where the problem was. Since we have Isaac aligned bam files we use RevertSam to unalign the bam files. This was also the point where the problem was introduced. I wanted to use the options

    REMOVE_DUPLICATE_INFORMATION=true REMOVE_ALIGNMENT_INFORMATION=true 
    

    but there was an typo(no space after the first true):

    REMOVE_DUPLICATE_INFORMATION=trueREMOVE_ALIGNMENT_INFORMATION=true 
    

    Picard did not raise an error! Since "trueREMOVE_ALIGNMENT_INFORMATION=true" does not look near a valid parameter for this option, it is expected that an error would be raised (as happens with SORT_ORDER for instance). Even worse, it evaluated REMOVE_DUPLICATE_INFORMATION to false. I think this is a nice suggestion to make picard a bit more user friendly and raise a error when the value is not as expected.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Oh, nice catch. We have some plans to improve the argument parsing that should resolve this sort of issue going forward.

  • valentinvalentin Cambridge, MAMember, Dev

    mmm... even if the revert step has fallen to do its job due to the parsing error, that is not the only reason why those alignments are present in the output.... e.g. is it the case that the Bam merging tool is taking these alignment using the UNALIGNED_BAM argument and put them in the output? that also would be a bug IMO.

Sign In or Register to comment.