
GATK Error SAM/BAM file SAMFileReader is malformed BAM header

dashti Posts: 1 Member
edited February 2013 in Ask the GATK team

Hi there,

I get an error when I try to run GATK with the following command:

java -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fa -I  merged_bam_files_indexed_markduplicate.bam -o reads.intervals

However I get this error:

SAM/BAM file SAMFileReader{/merged_bam_files_indexed_markduplicate.bam} is malformed: Read HWI-ST303_0093:5:5:13416:34802#0 is either missing the read group or its read group is not defined in the BAM header, both of which are required by the GATK.  Please use http://gatkforums.broadinstitute.org/discussion/59/companion-utilities-replacereadgroups to fix this problem

It suggests that it is a header issue; however, my BAM file does have a header:

samtools view -h merged_bam_files_indexed_markduplicate.bam | grep ^@RG
@RG     ID:test1      PL:Illumina     PU:HWI-ST303    LB:test     PI:75   SM:test   CN:japan
@RG     ID:test2      PL:Illumina     PU:HWI-ST303    LB:test     PI:75   SM:test    CN:japan

When I grep for the read named in the error:

HWI-ST303_0093:5:5:13416:34802#0        99      1       1090    29      23S60M17S       =       1150    160     TGTTTGGGTTGAAGATTGATACTGGAAGAAGATTAGAATTGTAGAAAGGGGAAAACGATGTTAGAAAGTTAATACGGCTTACTCCAGATCCTTGGATCTC        GGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGFGGGGGGGGGDGFGFGGGGGFEDFGEGGGDGEG?FGGDDGFFDGGEDDFFFFEDG?E        MD:Z:60 PG:Z:MarkDuplicates     RG:Z:test1      XG:i:0  AM:i:29 NM:i:0  SM:i:29 XM:i:0  XO:i:0  XT:A:M
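
To see how widespread this is, I can compare the @RG IDs declared in the header with the RG tags the reads actually carry. This is just a rough check with standard samtools and grep (nothing GATK-specific), using the same file as above:

# Read group IDs declared in the header
samtools view -H merged_bam_files_indexed_markduplicate.bam | grep ^@RG

# How many reads carry each RG:Z tag (reads with no RG tag will not appear here)
samtools view merged_bam_files_indexed_markduplicate.bam | grep -o 'RG:Z:[^[:space:]]*' | sort | uniq -c

# Total number of reads, to spot reads missing the tag entirely
samtools view -c merged_bam_files_indexed_markduplicate.bam

If the counts add up and every RG:Z ID also appears as an @RG ID in the header, the problem presumably lies elsewhere (for example in how the merged header was built).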

Following the Picard solution:

java -XX:MaxDirectMemorySize=4G -jar picard-tools-1.85/AddOrReplaceReadGroups.jar I=test.bam O=test.header.bam SORT_ORDER=coordinate RGID=test RGLB=test RGPL=Illumina RGSM=test RGPU=HWI-ST303 RGCN=japan CREATE_INDEX=True

I get this error after about 2 minutes:

Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: Record 12247781, Read name HWI-ST303_0093:5:26:10129:50409#0, MAPQ should be 0 for unmapped read.

Any recommendations on how to solve this issue?

My plan is to do the following to resolve the issue:

java -jar picard/MarkDuplicates.jar I=test.bam O=test_markduplicate.bam M=test.matrix AS=true VALIDATION_STRINGENCY=LENIENT
samtools  index test_markduplicate.bam

I see a lot of messages like the one below while the command is running, but it keeps going:

Ignoring SAM validation error: ERROR: Record (number), Read name HWI-ST303_0093:5:5:13416:34802#0, RG ID on SAMRecord not found in header: test1

Then I will try GATK RealignerTargetCreator again.

I have already tried the following:

java -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fa -I  merged_bam_files_indexed_markduplicate.bam -o reads.intervals --validation_strictness LENIENT

But I still got the same error.

N.B.: the same command ran with no issues with GATK version 1.2.

My pipeline in short: mapping the paired-end reads with

bwa aln -q 20 ref.fa read1 > file1.sai
bwa aln -q 20 ref.fa read2 > file2.sai
bwa sampe ref.fa file1.sai file2.sai read1 read2 > test1.sam
samtools view -bS test1.sam | samtools sort - test1
samtools index test1.bam
samtools merge -rh RG.txt test.bam test1.bam test2.bam

where RG.txt contains:

@RG     ID:test1      PL:Illumina     PU:HWI-ST303    LB:test     PI:75   SM:test   CN:japan
@RG     ID:test2      PL:Illumina     PU:HWI-ST303    LB:test     PI:75   SM:test    CN:japan

samtools  index test.bam
java -jar picard/MarkDuplicates.jar I=test.bam O=test_markduplicate.bam M=test.matrix AS=true VALIDATION_STRINGENCY=SILENT
samtools  index test_markduplicate.bam
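
If the samtools merge -rh step turns out to be what leaves reads pointing at read groups the header does not know about, an alternative would be to add the read groups per input BAM with Picard and merge with Picard instead. This is only a rough sketch: test1.rg.bam and test2.rg.bam are placeholder names, and the RG values are the same ones as in RG.txt above.

# Tag every read in each BAM with its own read group and rewrite the @RG header line
java -jar picard-tools-1.85/AddOrReplaceReadGroups.jar I=test1.bam O=test1.rg.bam RGID=test1 RGLB=test RGPL=Illumina RGPU=HWI-ST303 RGSM=test RGCN=japan SORT_ORDER=coordinate
java -jar picard-tools-1.85/AddOrReplaceReadGroups.jar I=test2.bam O=test2.rg.bam RGID=test2 RGLB=test RGPL=Illumina RGPU=HWI-ST303 RGSM=test RGCN=japan SORT_ORDER=coordinate

# Merge the two read-group-tagged BAMs and index the result
java -jar picard-tools-1.85/MergeSamFiles.jar INPUT=test1.rg.bam INPUT=test2.rg.bam OUTPUT=test.bam SORT_ORDER=coordinate
samtools index test.bam

Because AddOrReplaceReadGroups writes the @RG header line and tags every read with it in one pass, the header and the records should not be able to get out of sync this way.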

Answers

  • Geraldine_VdAuwera Posts: 6,682 Administrator, GATK Developer

    You need to fix your SAM file before you can proceed with any GATK analysis. I would recommend not using lenient validation with the Picard tools -- you should use strict validation and fix any problems that come up. Otherwise you're just going to get more problems later on.

    If you can't figure out how to fix your SAM file, I would recommend going back to the original data and reprocessing it. Validate your files at every step.
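
    For the validation itself, Picard's ValidateSamFile is the usual tool. A minimal sketch, reusing the jar path and file name from the post above (both are just placeholders for whatever BAM is being checked):

    java -jar picard-tools-1.85/ValidateSamFile.jar INPUT=test.bam MODE=SUMMARY

    MODE=VERBOSE will list each offending record instead of only a summary of error counts.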

    Geraldine Van der Auwera, PhD

  • Jack Posts: 36 Member

    see http://sourceforge.net/apps/mediawiki/picard/index.php?title=Main_Page#Q:Why_am_I_getting_errors_from_Picard_like.22MAPQ_should_be_0_for_unmapped_read.22_or_.22CIGAR_should_have_zero_elements_for_unmapped_read.3F.22
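
    If I remember the tooling right, one way to clean this up (rather than relaxing validation) is Picard's CleanSam, which among other things resets MAPQ to 0 on unmapped reads. A rough sketch, reusing the jar path and file names from the original post (test.clean.bam is just a placeholder):

    java -jar picard-tools-1.85/CleanSam.jar INPUT=test.bam OUTPUT=test.clean.bam VALIDATION_STRINGENCY=LENIENT
    java -jar picard-tools-1.85/ValidateSamFile.jar INPUT=test.clean.bam MODE=SUMMARY

    The LENIENT setting is only there so CleanSam can read past the offending records; the cleaned output should then pass validation on that particular error.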

  • redzengenoist Posts: 27 Member

    Hello @Geraldine_VdAuwera,

    May I ask whether you guys have grown any wiser about this type of error in the past months?

    I ask because it's my experience after working with 1000+ genomes that all of them seem to have a few reads that are unmapped - or at least I don't recall ever seeing a genome without a single one. I'm aligning with bwa 7.3, and even running fixmates in a vain attempt to fix this handful of bad reads.

    A typical Picard ValidateSamFile run looks like this:

    [Wed Sep 18 00:46:08 HKT 2013] net.sf.picard.sam.ValidateSamFile INPUT=xxx VALIDATE_INDEX=true MODE=VERBOSE MAX_OUTPUT=100 IGNORE_WARNINGS=false IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false

    [Wed Sep 18 00:46:08 HKT 2013] Executing as xxx@compute-2-4.local on Linux 2.6.18-194.el5 amd64; Java HotSpot(TM) 64-Bit Server VM 1.6.0_16-b01; Picard version: 1.90(1432)

    ERROR: Record 643, Read name FCC1LGYACXX:3:1314:10126:29246, MAPQ should be 0 for unmapped read.

    ERROR: Record 647, Read name FCC1LGYACXX:3:2105:20692:12733, MAPQ should be 0 for unmapped read.

    ERROR: Record 657, Read name FCC1LGYACXX:3:1102:7968:38734, MAPQ should be 0 for unmapped read.

    ERROR: Record 666, Read name FCC1LGYACXX:3:1114:11070:32068, MAPQ should be 0 for unmapped read.

    ERROR: Record 682, Read name FCC1LGYACXX:3:1205:16651:13950, MAPQ should be 0 for unmapped read.

    ERROR: Record 686, Read name FCC1LGYACXX:3:2101:9205:40305, MAPQ should be 0 for unmapped read.

    ERROR: Record 37673, Read name FCC1LGYACXX:3:2307:14733:95754, MAPQ should be 0 for unmapped read.

    ERROR: Record 37748, Read name FCC1LGYACXX:3:2104:6127:53592, MAPQ should be 0 for unmapped read.

    INFO 2013-09-18 00:47:24 SamFileValidator Validated Read 10,000,000 records. Elapsed time: 00:01:15s. Time for last 10,000,000: 75s. Last read position: chr2:146,390,510

    INFO 2013-09-18 00:48:39 SamFileValidator Validated Read 20,000,000 records. Elapsed time: 00:02:29s. Time for last 10,000,000: 74s. Last read position: chr4:99,062,115

    INFO 2013-09-18 00:49:52 SamFileValidator Validated Read 30,000,000 records. Elapsed time: 00:03:43s. Time for last 10,000,000: 73s. Last read position: chr6:122,666,091

    INFO 2013-09-18 00:51:07 SamFileValidator Validated Read 40,000,000 records. Elapsed time: 00:04:57s. Time for last 10,000,000: 74s. Last read position: chr9:33,518,557

    INFO 2013-09-18 00:52:21 SamFileValidator Validated Read 50,000,000 records. Elapsed time: 00:06:11s. Time for last 10,000,000: 73s. Last read position: chr12:23,616,307

    INFO 2013-09-18 00:53:33 SamFileValidator Validated Read 60,000,000 records. Elapsed time: 00:07:23s. Time for last 10,000,000: 72s. Last read position: chr16:18,056,214

    INFO 2013-09-18 00:54:46 SamFileValidator Validated Read 70,000,000 records. Elapsed time: 00:08:37s. Time for last 10,000,000: 73s. Last read position: chr22:29,023,696

    [Wed Sep 18 01:00:03 HKT 2013] net.sf.picard.sam.ValidateSamFile done. Elapsed time: 13.92 minutes.

    Runtime.totalMemory()=1527054336

    To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp

    There's always a handful or two of them in there, while the rest map OK.

    If I google for others who have had a similar problem, pretty much every answer out there suggests setting Picard to LENIENT, and I must say that I've done this myself, with no negative consequences so far (I'm well into variant calling, CNV, etc.). The few "solutions" beyond LENIENT that I've seen seem to revolve around editing the unmapped reads away (many suggest doing this manually! A nutty idea, if you have a lot of genomes). But it seems that virtually all of the "solutions" out there simply involve setting to LENIENT.

    But I worry that setting to LENIENT will cause me to miss some "real" errors elsewhere.

    Is Picard being stupid by treating this as a dealbreaker by default, so that everyone has to spend all this time figuring out and fretting over whether these 5-10 "MAPQ should be 0" reads in every genome can safely be ignored, and then setting stringency to lenient? Worse still, am I (and so many other people out there!) being overly incautious about these "MAPQ should be 0" errors?

  • Geraldine_VdAuwera Posts: 6,682 Administrator, GATK Developer

    Hi @redzengenoist,

    This is not something we've looked at very hard, to be honest. Mostly because the issue originates upstream of our domain, so to speak; as long as GATK works properly on fully valid data, we consider that it is not our fault, and therefore we shouldn't have to spend resources to fix or mitigate it. Admittedly this is not very helpful for users such as yourself who encounter the issue, but it is based on the need to prioritize our efforts.

    If you ask me whether it is okay to just use LENIENT validation, my first reflex is to say (as above) no, you should never use anything less than STRICT validation. The reason is that in general, as you rightly worry, being lenient can cause other issues to slip by unnoticed until much further downstream, and then it is a huge pain to identify and/or fix the issue (no one likes being told they have to reprocess a whole bunch of data). Ideally you'd want to be lenient on just those errors that are harmless for all practical purposes (and I do think the "MAPQ should be 0" ones are), and strict on the rest. But unfortunately you typically don't have that level of resolution in validation settings; afaik Picard doesn't offer that capability. So if you have this problem, you are stuck with only the LENIENT option for now.

    Geraldine Van der Auwera, PhD

  • pdexheimer Posts: 373 Member, GSA Collaborator ✭✭✭

    My two cents:

    Ultimately, this is a BWA issue - it's the program generating the offending alignments. But this particular error is pretty inoffensive - as long as the unmapped flag is set, the MAPQ is unimportant to just about any program out there. So in my view, using lenient validation in Picard to avoid this issue is okay (I don't believe GATK cares about this error). The danger, which you alluded to, is that setting Picard to lenient will mask any other errors that exist in your file.

    Since all of my data is generated by BWA, I just use lenient mode and let it go. But if you're getting data from multiple alignment sources, the correct move is probably to validate each file and check each error before deciding how to proceed. It's a pain, but it's probably the correct way. Note that my technique carries a non-zero risk as well - system glitches or interference from space aliens may cause a bad file to be generated, which I won't catch until much later (if at all).

    You may be able to avoid some of the pain in ValidateSamFile with an appropriate IGNORE flag - possibly INVALID_MAPPING_QUALITY? But even that may be too broad for this specific error. To answer your direct questions, I would say that Picard is rightly pedantic, though having an ignore flag for this case would be nice. And I think the risk of getting a true error from BWA is low enough that ignoring it is reasonable.

    By the way, this error is not because of unmapped reads per se - I think it stems from the situation described in the FAQ on the BWA page, which is why you only see a handful of them:

    Internally BWA concatenates all reference sequences into one long sequence. A read may be mapped to the junction of two adjacent reference sequences. In this case, BWA will flag the read as unmapped, but you will see position, CIGAR and all the tags. A better solution would be to choose an alternative position or trim the alignment out of the end, but this is quite complicated in programming and is not implemented at the moment.

  • redzengenoist Posts: 27 Member

    You're both lovely people, thanks. OK, for now I'm going to follow Picard's suggestion and upgrade my LENIENT to the less egregious IGNORE=INVALID_MAPPING_QUALITY.
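
    For reference, the invocation would look roughly like this (just a sketch; my_sample.bam and the jar path are placeholders, with the Picard 1.90 from the log above):

    java -jar picard-tools-1.90/ValidateSamFile.jar INPUT=my_sample.bam MODE=VERBOSE IGNORE=INVALID_MAPPING_QUALITY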

  • Geraldine_VdAuwera Posts: 6,682 Administrator, GATK Developer

    Great, thanks for reporting on that option!

    Geraldine Van der Auwera, PhD

  • georgeyue (Hong Kong) Posts: 1 Member

    @Geraldine_VdAuwera For the error above:

    java -XX:MaxDirectMemorySize=4G -jar picard-tools-1.85/AddOrReplaceReadGroups.jar I=test.bam O=test.header.bam SORT_ORDER=coordinate RGID=test RGLB=test RGPL=Illumina RGSM=test RGPU=HWI-ST303 RGCN=japan CREATE_INDEX=True

    I get this error after 2 min.:

    Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: Record 12247781, Read name HWI-ST303_0093:5:26:10129:50409#0, MAPQ should be 0 for unmapped read.

    I have tried your suggestion with the Picard tools -- 'you should use strict validation' -- but it didn't work. If I use lenient validation and add VALIDATION_STRINGENCY=LENIENT, everything looks fine. I don't know why!

  • Geraldine_VdAuwera Posts: 6,682 Administrator, GATK Developer

    @georgeyue This tells you that there are minor deviations from the specification in your file, but they are not important. You can add IGNORE=INVALID_MAPPING_QUALITY to make the Picard program ignore those minor problems.

    Geraldine Van der Auwera, PhD
