Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage forum members to get involved and help fellow researchers with their questions. The GATK forum is a community forum, and helping each other with GATK tools and research is the cornerstone of our success as a genomics research community. We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
GATK Error SAM/BAM file SAMFileReader is malformed BAM header

Hi there,
I get an error when I try to run GATK with the following command:
java -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fa -I merged_bam_files_indexed_markduplicate.bam -o reads.intervals
However I get this error:
SAM/BAM file SAMFileReader{/merged_bam_files_indexed_markduplicate.bam} is malformed: Read HWI-ST303_0093:5:5:13416:34802#0 is either missing the read group or its read group is not defined in the BAM header, both of which are required by the GATK. Please use http://gatkforums.broadinstitute.org/discussion/59/companion-utilities-replacereadgroups to fix this problem
It suggests that this is a header issue; however, my BAM file does have read groups in its header:
samtools view -h merged_bam_files_indexed_markduplicate.bam | grep ^@RG
@RG ID:test1 PL:Illumina PU:HWI-ST303 LB:test PI:75 SM:test CN:japan
@RG ID:test2 PL:Illumina PU:HWI-ST303 LB:test PI:75 SM:test CN:japan
When I grep for the read named in the error, I see:
HWI-ST303_0093:5:5:13416:34802#0 99 1 1090 29 23S60M17S = 1150 160 TGTTTGGGTTGAAGATTGATACTGGAAGAAGATTAGAATTGTAGAAAGGGGAAAACGATGTTAGAAAGTTAATACGGCTTACTCCAGATCCTTGGATCTC GGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGFGGGGGGGGGDGFGFGGGGGFEDFGEGGGDGEG?FGGDDGFFDGGEDDFFFFEDG?E MD:Z:60 PG:Z:MarkDuplicates RG:Z:test1 XG:i:0 AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 XT:A:M
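One quick way to cross-check this kind of mismatch (just a sketch, not an official GATK utility; the demo file below stands in for the output of `samtools view -h` on the real BAM) is to compare the RG IDs declared in @RG header lines against the RG:Z: tags actually used by reads:

```shell
# Build a tiny demo SAM: one @RG line, but one read tagged with an undeclared RG.
{
  printf '@HD\tVN:1.0\tSO:coordinate\n'
  printf '@RG\tID:test1\tPL:Illumina\tSM:test\n'
  printf 'r1\t99\tchr1\t100\t29\t10M\t=\t160\t60\tACGTACGTAC\tFFFFFFFFFF\tRG:Z:test1\n'
  printf 'r2\t99\tchr1\t200\t29\t10M\t=\t260\t60\tACGTACGTAC\tFFFFFFFFFF\tRG:Z:test2\n'
} > demo.sam

# Report every read-level RG ID that is missing from the header.
awk '
  /^@RG/ { for (i = 1; i <= NF; i++) if ($i ~ /^ID:/) declared[substr($i, 4)] = 1 }
  !/^@/  { for (i = 12; i <= NF; i++) if ($i ~ /^RG:Z:/) used[substr($i, 6)] = 1 }
  END    { for (id in used) if (!(id in declared)) print "RG not in header: " id }
' demo.sam > rg_check.txt
cat rg_check.txt
```

On real data you would replace demo.sam with `samtools view -h your.bam`; an empty report means every read's RG ID is declared in the header.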
Following the Picard solution:
java -XX:MaxDirectMemorySize=4G -jar picard-tools-1.85/AddOrReplaceReadGroups.jar I=test.bam O=test.header.bam SORT_ORDER=coordinate RGID=test RGLB=test RGPL=Illumina RGSM=test/ RGPU=HWI-ST303 RGCN=japan CREATE_INDEX=True
I get this error after about 2 minutes:
Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: Record 12247781, Read name HWI-ST303_0093:5:26:10129:50409#0, MAPQ should be 0 for unmapped read.
Any recommendations on how to solve this issue?
My plan is to do the following to resolve the issue:
picard/MarkDuplicates.jar I=test.bam O=test_markduplicate.bam M=test.matrix AS=true VALIDATION_STRINGENCY=LENIENT
samtools index test_markduplicate.bam
While this runs I see a lot of messages like the one below, but the command keeps going:
Ignoring SAM validation error: ERROR: Record (number), Read name HWI-ST303_0093:5:5:13416:34802#0, RG ID on SAMRecord not found in header: test1
and then try the GATK RealignerTargetCreator again.
I already tried the following:
java -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fa -I merged_bam_files_indexed_markduplicate.bam -o reads.intervals --validation_strictness LENIENT
But I still got the same error.
N.B.: the same command runs without issue under GATK version 1.2.
My pipeline in short:
mapping the paired-end reads with:
bwa aln -q 20 ref.fa read > files.sai
bwa sampe ref.fa file1.sai file2.sai read1 read2 > test1.sam
samtools view -bS test1.sam | samtools sort - test
samtools index test1.bam
samtools merge -rh RG.txt test test1.bam test2.bam
RG.txt
@RG ID:test1 PL:Illumina PU:HWI-ST303 LB:test PI:75 SM:test CN:japan
@RG ID:test2 PL:Illumina PU:HWI-ST303 LB:test PI:75 SM:test CN:japan
samtools index test.bam
picard/MarkDuplicates.jar I=test.bam O=test_markduplicate.bam M=test.matrix AS=true VALIDATION_STRINGENCY=SILENT
samtools index test_markduplicate.bam
Answers
You need to fix your SAM file before you can proceed with any GATK analysis. I would recommend not using lenient validation with the Picard tools -- you should use strict validation and fix any problems that come up. Otherwise you're just going to get more problems later on.
If you can't figure out how to fix your SAM file, I would recommend going back to the original data and reprocessing it. Validate your files at every step.
See the Picard FAQ: http://sourceforge.net/apps/mediawiki/picard/index.php?title=Main_Page#Q:Why_am_I_getting_errors_from_Picard_like.22MAPQ_should_be_0_for_unmapped_read.22_or_.22CIGAR_should_have_zero_elements_for_unmapped_read.3F.22
Hello @Geraldine_VdAuwera,
May I ask whether you guys have grown any wiser about this type of error in the past months?
I ask because it's my experience after working with 1000+ genomes that all of them seem to have a few reads that are unmapped - or at least I don't recall ever seeing a genome without a single one. I'm aligning with bwa 7.3, and even running fixmates in a vain attempt to fix this handful of bad reads.
A typical Picard validateSAM looks like this:
There's always a handful or two of them in there, while the rest map OK.
If I google for others who have had a similar problem, pretty much every answer out there suggests setting Picard to LENIENT, and I must say that I've done this myself with no negative consequences so far (I'm well into variant calling, CNV, etc.). The few "solutions" beyond LENIENT that I've seen seem to revolve around editing the unmapped reads away (many suggest doing this manually! A nutty idea if you have a lot of genomes). But it seems that virtually all of the "solutions" out there simply involve setting to LENIENT.
But I worry that setting to LENIENT will cause me to miss some "real" errors elsewhere.
Is Picard being too strict in making this a dealbreaker by default, so that everyone has to spend all this time figuring out and worrywarting over whether these 5-10 "MAPQ should be 0" reads in every genome should be ignored, and then setting stringency to lenient? Or worse, am I (and so many other people out there!) being overly incautious about these "MAPQ should be 0" errors?
Hi @redzengenoist,
This is not something we've looked at very hard, to be honest. Mostly because the issue originates upstream of our domain, so to speak; as long as GATK works properly on fully valid data, we consider that it is not our fault, and therefore we shouldn't have to spend resources to fix or mitigate it. Admittedly this is not very helpful for users such as yourself who encounter the issue, but it is based on the need to prioritize our efforts.
If you ask me, is it okay to just use LENIENT validation, my first reflex is to say (as above) no, you should never use anything less than STRICT validation. The reason is that in general, as you rightly worry, being lenient can cause other issues to slip by unnoticed until much further downstream, and then it is a huge pain to identify and/or fix the issue (no one likes being told they have to reprocess a whole bunch of data). Ideally you'd want to be lenient on just those errors that are harmless for all practical purposes (and I do think the "MAPQ should be 0" ones are), and strict on the rest. But unfortunately you typically don't have that level of resolution in validation settings; afaik Picard doesn't offer that capability. So if you have this problem, you are stuck with only the LENIENT option for now.
My two cents:
Ultimately, this is a BWA issue - it's the program generating the offending alignments. But this particular error is pretty inoffensive - as long as the unmapped flag is set, the MAPQ is unimportant to just about any program out there. So in my view, using lenient validation in Picard to avoid this issue is okay (I don't believe GATK cares about this error). The danger, which you alluded to, is that setting Picard to lenient will mask any other errors that exist in your file.
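To make that concrete, here is a sketch (demo records and hypothetical filenames, not real pipeline output) of the check the validator performs: flag bit 0x4 (column 2) marks a read as unmapped, and for such reads MAPQ (column 5) is expected to be 0:

```shell
# Two demo alignment records: one compliant unmapped read, and one with the
# "MAPQ should be 0 for unmapped read" problem (flag 4 set, MAPQ 29).
{
  printf 'readA\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF\n'
  printf 'readB\t4\tchr1\t100\t29\t10M\t*\t0\t0\tACGT\tFFFF\n'
} > demo_reads.sam

# Flag bit 0x4 = unmapped; int(flag / 4) % 2 extracts that bit portably.
awk '!/^@/ && int($2 / 4) % 2 == 1 && $5 + 0 != 0 { print "bad MAPQ: " $1 }' \
  demo_reads.sam > mapq_check.txt
cat mapq_check.txt
```

Only readB is flagged; readA is unmapped with MAPQ 0 and passes, which matches why only a handful of reads per genome trip this error.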
Since all of my data is generated by BWA, I just use lenient mode and let it go. But if you're getting data from multiple alignment sources, the correct move is probably to validate each file and check each error before deciding how to proceed. It's a pain, but it's probably the correct way. Note that my technique carries a non-zero risk as well - system glitches or interference from space aliens may cause a bad file to be generated, which I won't catch until much later (if at all).
You may be able to avoid some of the pain in ValidateSamFile with an appropriate IGNORE flag - possibly INVALID_MAPPING_QUALITY? But even that may be too broad for this specific error. To answer your direct questions, I would say that Picard is rightly pedantic, though having an ignore flag for this case would be nice. And I think the risk of getting a true error from BWA is low enough that ignoring it is reasonable.
By the way, this error is not because of unmapped reads per se - I think it stems from the situation described in the FAQ on the BWA page, which is why you only see a handful of them.
You're both lovely people, thanks. OK, for now I'm going to follow Picard's suggestion and upgrade my LENIENT to the less egregious **IGNORE=INVALID_MAPPING_QUALITY**.
Great, thanks for reporting on that option!
@Geraldine_VdAuwera
For this above error:
java -XX:MaxDirectMemorySize=4G -jar picard-tools-1.85/AddOrReplaceReadGroups.jar I=test.bam O=test.header.bam SORT_ORDER=coordinate RGID=test RGLB=test RGPL=Illumina RGSM=test/ RGPU=HWI-ST303 RGCN=japan CREATE_INDEX=True
I get this error after about 2 minutes:
Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: Record 12247781, Read name HWI-ST303_0093:5:26:10129:50409#0, MAPQ should be 0 for unmapped read.
I have tried your suggestion ("with the Picard tools -- you should use strict validation"), but it didn't work.
If I use lenient validation by adding VALIDATION_STRINGENCY=LENIENT, everything looks fine.
I don't know why!
@georgeyue This tells you that there are minor deviations from the specification in your file, but they are not important. You can add
IGNORE=INVALID_MAPPING_QUALITY
to make the Picard program ignore those minor problems.
To add more information to this thread, I was getting the error:
ERROR MESSAGE: SAM/BAM/CRAM file is malformed: Fasta index file should be found but is not /ngs/reference/hg19/hg19.fa.fai
The BAM file is okay and the fasta index is okay as well. It turns out 'ulimit -n' was reporting that only 1024 open files were allowed for my account. Once I increased that limit, this error went away. I wonder if it's possible for GATK to report more accurate errors? This was very misleading and took a few days to track down.
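For anyone hitting the same wall, a cheap guard (just a sketch; 4096 is an illustrative threshold, not an official GATK requirement) is to check the open-file limit in the launch script before starting GATK:

```shell
# Check the per-process open-file limit; try to raise it for this shell
# session if it looks too low.
current=$(ulimit -n)
echo "open-file limit: $current"
if [ "$current" != "unlimited" ] && [ "$current" -lt 4096 ]; then
  # Raising beyond the hard limit fails; fall back to a warning.
  ulimit -n 4096 2>/dev/null || echo "could not raise limit; ask an admin to increase it"
fi
```

Note that `ulimit -n` only affects the current shell and its children, so it belongs in the same script (or shell session) that launches the GATK command.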
@golharam
Apologies for the inconvenience; unfortunately this is tricky to resolve in the current GATK framework, but in the future GATK 4 version we'll try to make this sort of thing less likely to happen and easier to diagnose.
Hello GATK team and community
I was recently in the same situation as @redzengenoist concerning the "MAPQ should be 0 for unmapped read" warnings. I solved the issue by using the Picard CleanSam tool on my BAM set. As described in the manual,
you can use it on a BAM without SAM conversion. Works fine for me. Hope this is the right place to post this.
BTW, thanks for your great work and support, GATK team.
Very good point, @guitib, thanks for reporting this option.
FYI, since this discussion was last active I learned that we don't encounter this issue internally (in the Broad production pipeline, which is our main development target) because of our somewhat quirky uBAM workflow (recently documented in some workshop presentations). One of the Picard tools involved (MergeBamAlignments) applies various bam cleansing procedures internally including those performed as a standalone process by CleanSam. So all our BAMs are squeaky clean by the time they reach GATK.
Right, good to know - this cleaning step is not specified in the MergeBamAlignments manual. Thx