We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Errors in SAM/BAM files can be diagnosed with ValidateSamFile

The problem
You're trying to run a GATK or Picard tool that operates on a SAM or BAM file, and getting some cryptic error that doesn't clearly tell you what's wrong. Bits of the stack trace (the pile of lines in the output log that the program outputs when there is a problem) may contain the following: java.lang.String
, Error Type Count
, NullPointerException
-- or maybe something else that doesn't mean anything to you.
Why this happens
The most frequent cause of these unexplained problems is not a bug in the program -- it's an invalid or malformed SAM/BAM file. This means that there is something wrong either with the content of the file (something important is missing) or with its format (something is written the wrong way). Invalid SAM/BAM files generally have one or more errors in the following sections: the header tags, the alignment fields, or the optional alignment tags. In addition, the SAM/BAM index file can be a source of errors as well.
The source of these errors is usually introduced by upstream processing tools, such as the genome mapper/aligner or any other data processing tools you may have applied before feeding the data to Picard or GATK.
The solution
To fix these problems, you first have to know what's wrong. Fortunately there's a handy Picard tool that can test for (almost) all possible SAM/BAM format errors, called ValidateSamFile.
We recommend the workflow included below for diagnosing problems with ValidateSamFile. This workflow will help you tackle the problem efficiently and set priorities for dealing with multiple errors (which often happens). We also outline typical solutions for common errors, but note that this is not meant to be an exhaustive list -- there are too many possible problems to tackle all of them in this document. To be clear, here we focus on diagnostics, not treatment.
In some cases, it may not be possible to fix some problems that are too severe, and you may need to redo the genome alignment/mapping from scratch! Consider running ValidateSamFile proactively at all key steps of your analysis pipeline to catch errors early!
Workflow for diagnosing SAM/BAM file errors with ValidateSamFile

1. Generate summary of errors
First, run ValidateSamFile in SUMMARY
mode in order to get a summary of everything that is missing or improperly formatted in your input file. We set MODE=SUMMARY
explicitly because by default the tool would just emit details about the 100 first problems it finds then quit. If you have some minor formatting issues that don't really matter but affect every read record, you won't get to see more important problems that occur later in the file.
$ java -jar picard.jar ValidateSamFile \ I=input.bam \ MODE=SUMMARY
If this outputs No errors found
, then your SAM/BAM file is completely valid. If you were running this purely as a preventative measure, then you're good to go and proceed to the next step in your pipeline. If you were doing this to diagnose a problem, then you're back to square one -- but at least now you know it's not likely to be a SAM/BAM file format issue. One exception: some analysis tools require Read Group tags like SM
that not required by the format specification itself, so the input files will pass validation but the analysis tools will still error out. If that happens to you, check whether your files have SM
tags in the @RG
lines in their BAM header. That is the most common culprit.
However, if the command above outputs one or more of the 8 possible WARNING
or 48 possible ERROR
messages (see tables at the end of this document), you must proceed to the next step in the diagnostic workflow.
When run in SUMMARY
mode, ValidateSamFile outputs a table that differentiates between two levels of error: ERROR
proper and WARNING
, based on the severity of problems that they would cause in downstream analysis. All problems that fall in the ERROR
category must be addressed to in order to proceed with other Picard or GATK tools, while those that fall in the WARNING
category may often be ignored for some, if not all subsequent analyses.
Example of error summary
ValidateSamFile (SUMMARY) | Count |
---|---|
ERROR:MISSING_READ_GROUP | 1 |
ERROR:MISMATCH_MATE_ALIGNMENT_START | 4 |
ERROR:MATES_ARE_SAME_END | 894289 |
ERROR:CIGAR_MAPS_OFF_REFERENCE | 354 |
ERROR:MATE_NOT_FOUND | 1 |
ERROR:MISMATCH_FLAG_MATE_UNMAPPED | 46672 |
ERROR:MISMATCH_READ_LENGTH_AND_E2_LENGTH | 1 |
WARNING:RECORD_MISSING_READ_GROUP | 54 |
WARNING:MISSING_TAG_NM | 33 |
This table, generated by ValidateSamFile from a real BAM file, indicates that this file has a total of 1 MISSING_READ_GROUP
error, 4 MISMATCH_MATE_ALIGNMENT_START
errors, 894,289 MATES_ARE_SAME_END
errors, and so on. Moreover, this output also indicates that there are 54 RECORD_MISSING_READ_GROUP
warnings and 33 MISSING_TAG_NM
warnings.
2. Generate detailed list of ERROR records
Since ERRORs
are more severe than WARNINGs
, we focus on diagnosing and fixing them first. From the first step we only had a summary of errors, so now we generate a more detailed report with this command:
$ java -jar picard.jar ValidateSamFile \ I=input.bam \ IGNORE_WARNINGS=true \ MODE=VERBOSE
Note that we invoked the MODE=VERBOSE
and the IGNORE_WARNINGS=true
arguments.
The former is technically not necessary as VERBOSE
is the tool's default mode, but we specify it here to make it clear that that's the behavior we want. This produces a complete list of every problematic record, as well as a more descriptive explanation for each type of ERROR
than is given in the SUMMARY
output.
The IGNORE_WARNINGS
option enables us to specifically examine only the records with ERRORs
. When working with large files, this feature can be quite helpful, because there may be many records with WARNINGs
that are not immediately important, and we don't want them flooding the log output.
Example of VERBOSE report for ERRORs only
ValidateSamFile (VERBOSE) | Error Description |
---|---|
ERROR: Read groups is empty | Empty read group field for multiple records |
ERROR: Record 1, Read name 20FUKAAXX100202:6:27:4968:125377 | Mate alignment does not match alignment start of mate |
ERROR: Record 3, Read name 20FUKAAXX100202:6:27:4986:125375 | Both mates are marked as second of pair |
ERROR: Record 6, Read name 20GAVAAXX100126:4:47:18102:194445 | Read CIGAR M operator maps off end of reference |
ERROR: Read name 30PPJAAXX090125:1:60:1109:517#0 | Mate not found for paired read |
ERROR: Record 402, Read name 20GAVAAXX100126:3:44:17022:23968 | Mate unmapped flag does not match read unmapped flag of mate |
ERROR: Record 12, Read name HWI-ST1041:151:C7BJEACXX:1:1101:1128:82805 | Read length does not match quals length |
These ERRORs
are all problems that we must address before using this BAM file as input for further analysis. Most ERRORs
can typically be fixed using Picard tools to either correct the formatting or fill in missing information, although sometimes you may want to simply filter out malformed reads using Samtools.
For example, MISSING_READ_GROUP
errors can be solved by adding the read group information to your data using the AddOrReplaceReadGroups tool. Most mate pair information errors can be fixed with FixMateInformation.
Once you have attempted to fix the errors in your file, you should put your new SAM/BAM file through the first validation step in the workflow, running ValidateSamFile in SUMMARY
mode again. We do this to evaluate whether our attempted fix has solved the original ERRORs
, and/or any of the original WARNINGs
, and/or introduced any new ERRORs
or WARNINGs
(sadly, this does happen).
If you still have ERRORs
, you'll have to loop through this part of the workflow until no more ERRORs
are detected.
If you have no more ERRORs
, congratulations! It's time to look at the WARNINGs
(assuming there are still some -- if not, you're off to the races).
3. Generate detailed list of WARNING records
To obtain more detailed information about the warnings, we invoke the following command:
$ java -jar picard.jar ValidateSamFile \ I=input.bam \ IGNORE=type \ MODE=VERBOSE
At this time we often use the IGNORE
option to tell the program to ignore a specific type of WARNING
that we consider less important, in order to focus on the rest. In some cases we may even decide to not try to address some WARNINGs
at all because we know they are harmless (for example, MATE_NOT_FOUND
warnings are expected when working with a small snippet of data). But in general we do strongly recommend that you address all of them to avoid any downstream complications, unless you're sure you know what you're doing.
Example of VERBOSE report for WARNINGs only
ValidateSamFile (VERBOSE) | Warning Description |
---|---|
WARNING: Read name H0164ALXX140820:2:1204:13829:66057 | A record is missing a read group |
WARNING: Record 1, Read name HARMONIA-H16:1253:0:7:1208:15900:108776 | NM tag (nucleotide differences) is missing |
Here we see a read group-related WARNING
which would probably be fixed when we fix the MISSING_READ_GROUP
error we encountered earlier, hence the prioritization strategy of tackling ERRORs
first and WARNINGs
second.
We also see a WARNING
about missing NM
tags. This is an alignment tag that is added by some but not all genome aligners, and is not used by the downstream tools that we care about, so you may decide to ignore this warning by adding IGNORE=MISSING_TAG_NM
from now on when you run ValidateSamFile on this file.
Once you have attempted to fix all the WARNINGs
that you care about in your file, you put your new SAM/BAM file through the first validation step in the workflow again, running ValidateSamFile in SUMMARY
mode. Again, we check that no new ERRORs
have been introduced and that the only WARNINGs
that remain are the ones we feel comfortable ignoring. If that's not the case we run through the workflow again. If it's all good, we can proceed with our analysis.
Appendix: List of all WARNINGs and ERRORs emitted by ValidateSamFile
We are currently in the process of updating the Picard website to include the following two tables, describing WARNING
(Table I) and ERROR
(Table II) cases. Until that's done, you can find them here.
Table I | |
---|---|
WARNING | Description |
Header Issues | |
INVALID_DATE_STRING | Date string is not ISO-8601 |
INVALID_QUALITY_FORMAT | Quality encodings out of range; appear to be Solexa or Illumina when Phred expected. Avoid exception being thrown as a result of no qualities being read. |
General Alignment Record Issues | |
ADJACENT_INDEL_IN_CIGAR | CIGAR string contains an insertion (I) followed by deletion (D), or vice versa |
RECORD_MISSING_READ_GROUP | A SAMRecord is found with no read group id |
Mate Pair Issues | |
PAIRED_READ_NOT_MARKED_AS_FIRST_OR_SECOND | Pair flag set but not marked as first or second of pair |
Optional Alignment Tag Issues | |
MISSING_TAG_NM | The NM tag (nucleotide differences) is missing |
E2_BASE_EQUALS_PRIMARY_BASE | Secondary base calls should not be the same as primary, unless one or the other is N |
General File, Index or Sequence Dictionary Issues | |
BAM_FILE_MISSING_TERMINATOR_BLOCK | BAM appears to be healthy, but is an older file so doesn't have terminator block |
Table II | |
---|---|
ERROR | Description |
Header Issues | |
DUPLICATE_PROGRAM_GROUP_ID | Same program group id appears more than once |
DUPLICATE_READ_GROUP_ID | Same read group id appears more than once |
HEADER_RECORD_MISSING_REQUIRED_TAG | Header tag missing in header line |
HEADER_TAG_MULTIPLY_DEFINED | Header tag appears more than once in header line with different value |
INVALID_PLATFORM_VALUE | The read group has an invalid value set for its PL field |
INVALID_VERSION_NUMBER | Does not match any of the acceptable versions |
MISSING_HEADER | The SAM/BAM file is missing the header |
MISSING_PLATFORM_VALUE | The read group is missing its PL (platform unit) field |
MISSING_READ_GROUP | The header is missing read group information |
MISSING_SEQUENCE_DICTIONARY | There is no sequence dictionary in the header |
MISSING_VERSION_NUMBER | Header has no version number |
POORLY_FORMATTED_HEADER_TAG | Header tag does not have colon |
READ_GROUP_NOT_FOUND | A read group ID on a SAMRecord is not found in the header |
UNRECOGNIZED_HEADER_TYPE | Header record is not one of the standard types |
General Alignment Record Issues | |
CIGAR_MAPS_OFF_REFERENCE | Bases corresponding to M operator in CIGAR extend beyond reference |
INVALID_ALIGNMENT_START | Alignment start position is incorrect |
INVALID_CIGAR | CIGAR string error for either read or mate |
INVALID_FLAG_FIRST_OF_PAIR | First of pair flag set for unpaired read |
INVALID_FLAG_SECOND_OF_PAIR | Second of pair flag set for unpaired read |
INVALID_FLAG_PROPER_PAIR | Proper pair flag set for unpaired read |
INVALID_FLAG_MATE_NEG_STRAND | Mate negative strand flag set for unpaired read |
INVALID_FLAG_NOT_PRIM_ALIGNMENT | Not primary alignment flag set for unmapped read |
INVALID_FLAG_SUPPLEMENTARY_ALIGNMENT | Supplementary alignment flag set for unmapped read |
INVALID_FLAG_READ_UNMAPPED | Mapped read flat not set for mapped read |
INVALID_INSERT_SIZE | Inferred insert size is out of range |
INVALID_MAPPING_QUALITY | Mapping quality set for unmapped read or is >= 256 |
INVALID_PREDICTED_MEDIAN_INSERT_SIZE | PI tag value is not numeric |
MISMATCH_READ_LENGTH_AND_QUALS_LENGTH | Length of sequence string and length of base quality string do not match |
TAG_VALUE_TOO_LARGE | Unsigned integer tag value is deprecated in BAM. Template length |
Mate Pair Issues | |
INVALID_FLAG_MATE_UNMAPPED | Mate unmapped flag is incorrectly set |
MATE_NOT_FOUND | Read is marked as paired, but its pair was not found |
MATE_CIGAR_STRING_INVALID_PRESENCE | A cigar string for a read whose mate is NOT mapped |
MATE_FIELD_MISMATCH | Read alignment fields do not match its mate |
MATES_ARE_SAME_END | Both mates of a pair are marked either as first or second mates |
MISMATCH_FLAG_MATE_UNMAPPED | Mate unmapped flag does not match read unmapped flag of mate |
MISMATCH_FLAG_MATE_NEG_STRAND | Mate negative strand flag does not match read strand flag |
MISMATCH_MATE_ALIGNMENT_START | Mate alignment does not match alignment start of mate |
MISMATCH_MATE_CIGAR_STRING | The mate cigar tag does not match its mate's cigar string |
MISMATCH_MATE_REF_INDEX | Mate reference index (MRNM) does not match reference index of mate |
Optional Alignment Tag Issues | |
INVALID_MATE_REF_INDEX | Mate reference index (MRNM) set for unpaired read |
INVALID_TAG_NM | The NM tag (nucleotide differences) is incorrect |
MISMATCH_READ_LENGTH_AND_E2_LENGTH | Lengths of secondary base calls tag values and read should match |
MISMATCH_READ_LENGTH_AND_U2_LENGTH | Secondary base quals tag values should match read length |
EMPTY_READ | Indicates that a read corresponding to the first strand has a length of zero and/or lacks flow signal intensities (FZ) |
INVALID_INDEXING_BIN | Indexing bin set on SAMRecord does not agree with computed value |
General File, Index or Sequence Dictionary Issues | |
INVALID_INDEX_FILE_POINTER | Invalid virtualFilePointer in index |
INVALID_REFERENCE_INDEX | Reference index not found in sequence dictionary |
RECORD_OUT_OF_ORDER | The record is out of order |
TRUNCATED_FILE | BAM file does not have terminator block |
Comments
Hi @dekling,
I have run ValidateSamFile in a summary mode on my bam file and it outputs a "MATE_CIGAR_STRING_INVALID_PRESENCE" error for 59409 reads/count. The reads that picard points to as ERRORs have a mapping quality of 0, and I don't understand why picard throes this error ? Should I have removed reads with 0 mapping quality from my bam file before running the ValidateSamFile ?
[[email protected] bwa]$ java -jar ~/algorithms/picard/20151013/picard-tools-1.140/picard.jar ValidateSamFile I="bwaAlign_MDup.bam" MODE= SUMMARY
[Fri May 06 14:36:38 EDT 2016] picard.sam.ValidateSamFile INPUT=bwaAlign_MDup.bam MODE=SUMMARY MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPR ESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri May 06 14:36:38 EDT 2016] Executing as [email protected] on Linux 3.10.0-229.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_99-moc kbuild_2016_03_23_23_52-b00; Picard version: 1.140(a81bc82e781dae05c922d1dbcee737334612399f_1444244284) IntelDeflater
INFO 2016-05-06 14:37:35 SamFileValidator Validated Read 10,000,000 records. Elapsed time: 00:00:56s. Time for last 10,000,000: 56s. Last read position: 8:38,110,502
INFO 2016-05-06 14:38:29 SamFileValidator Validated Read 20,000,000 records. Elapsed time: 00:01:50s. Time for last 10,000,000: 54s. Last read position: 19:44,988,530
HISTOGRAM java.lang.String
Error Type Count
ERROR:MATE_CIGAR_STRING_INVALID_PRESENCE 59409
I re-ran the ValidateSamFile in verbose mode and got the list of reads with the above error. One of the reads with the above error is .
I did a grep search on my bam file and here are the records :
[[email protected] bwa]$ ~/algorithms/samtools/20150929/samtools-1.2/samtools view bwaAlign_MDup.bam | grep "NS500510:11:H2T5HAFXX:1:11102:4034:19007"
NS500510:11:H2T5HAFXX:1:11102:4034:19007 121 1 24559 0 151M = 24559 0 TCCCAGTAGTAAAAGCTTCCAAGTTGGGCTCTCACTTCAGCCCCTCCCACACAGGGAAGCCAGATGGGTTCCCCAGGACCGGGATTCCCCAAGGGGGCTGCTCCCAGAGGGTGTGTTGCTGGGATTGCCCAGGACAGGGATGGCCCTCTCA /EAA<</EEEAE<EEEEEAE<EEEAEA<A/AEAEEEEEEEAEEEEEEEEEEEEAE6EEE<EEAEEAAAEEEEEEA<AEEAE<AEEEEEEA6AEAEAAEAAEEEEEAEE/E<EAEAEEEEEEEEEEEAAEAEEEEEEEE/EA/EEEEAAAAA NM:i:1 MD:Z:6G144 AS:i:146 XS:i:146 RG:Z:771111_NS500510_0099_L001 XA:Z:9,-24672,151M,1;19,-66167,151M,1;2,+114346309,151M,1;15,+102506455,151M,1;12,+78968,151M,3; MC:Z:* MQ:i:0
NS500510:11:H2T5HAFXX:1:11102:4034:19007 181 1 24559 0 * = 24559 0 TTTGGTTGTTTTTTTTTTAAATATTTTTTTTTTCCTTCCATTTGGTGGGCCTATCCTTTATGTTTTTCCTGAACCAAAAAAATATTTCCCCTTTTTTGGTGGTTTGAACTCCTTGGTGTTATTCTCAATATACCCCCCCCCTTATACGCC ///////A/////////<//////<//////E/////E///<////E//////E///E////6/E////A/////EA///<///EE//////A////<//E////////////////<///////////////////////////A/A// AS:i:0 XS:i:0 RG:Z:771111_NS500510_0099_L001 MC:Z:151M MQ:i:0
Both the reads have a Mapping Quality 0, and one of the read is unmapped(flag 181) and other one is mapped(flag 121). This format seems to be genuine and I don't understand why Picard reports this as an error ?
Thanks:
mglclinical
Issue · Github
by Sheila
Hello @mglclinical:
Let us start with the easy stuff. First update your Picard to the most recent version. Then try running again and see what happens. Let me know.
-David
@mglclinical:
Your Samtools appears to be out of date as well.
Hello @dekling ,
thank you for quick response. I will ask our system administrator to install java 8 in order for me to use latest picard, and then see how the results look like with the new picard
Thanks
mglclinical
Hi @dekling
On our linux server, we have java 1.7, and not java 1.8. I learnt from my system administrator that if we update to java 1.8, then he may remove the existing java 1.7 as recommended by Oracle .
Most recent version of picard 2.2.4 can be run only with java 1.8, where as the most recent version of GATK 3.5 can be only run with java 1.7, hence I am in dilemma . Please advice .
I learnt that GATK_3.6 is going to be released soon which requires java 1.8, and not 1.7 . Any estimates on release dates for GATK3.6 ?
Thanks,
mglclinical
@mglclinical
Hi mglclinical,
Ah, I see the issue. We do have an article that helps users easily switch between Java versions, but it seems that will not help you
However, if you are able to use the latest nightly build, that requires Java 1.8. So, you can upgrade to Java 1.8 and use it for both GATK and Picard.
If you absolutely cannot use the latest nightly build, I suspect the next stable release (3.6) will be out soon (we are finalizing things). But, Geraldine @Geraldine_VdAuwera may be able to comment more
-Sheila
@Sheila
Thank you for the quick reply
Could you please point me to that article that helps "users easily switch between Java versions" ? I know it may not be completely useful, but I like to look into it.
Thanks,
mglclinical
@mglclinical
Hi mglclinical,
Here it is.
-Sheila
Thank you @Sheila
Hi @dekling
You have suggested me to upgrade picard and run it. I did upgrade picard and I got the same error "MATE_CIGAR_STRING_INVALID_PRESENCE" . All those reads with errors are the ones with mapping quality 0
C:\sgajjala_projects\test_validateBAM\picard_20160520\picard-tools-2.3.0>
C:\sgajjala_projects\test_validateBAM\picard_20160520\picard-tools-2.3.0>
C:\sgajjala_projects\test_validateBAM\picard_20160520\picard-tools-2.3.0>java -Xmx6G -jar picard.jar ValidateSamFile I="C:\sgajjala_projects\test_validateBAM\novoalign_bam\SampleReads_Final.bam" MODE=SUMMARY
[Fri May 20 17:02:57 EDT 2016] picard.sam.ValidateSamFile INPUT=C:\sgajjala_projects\test_validateBAM\novoalign_bam\SampleReads_Final.bam MODE=SUMMARY MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri May 20 17:02:57 EDT 2016] Executing as [email protected] on Windows 7 6.1 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_65-b17; Picard version: 2.3.0(9a00c87b7ffdb01cfb5a0d6e76556146196babb8_1463071327) JdkDeflate
INFO 2016-05-20 17:03:47 SamFileValidator Validated Read 10,000,000 records. Elapsed time: 00:00:50s. Time for last 10,000,000: 50s. Lastread position: 2:43,053,844
INFO 2016-05-20 17:04:37 SamFileValidator Validated Read 20,000,000 records. Elapsed time: 00:01:40s. Time for last 10,000,000: 49s. Lastread position: 3:169,343,211
INFO 2016-05-20 17:05:30 SamFileValidator Validated Read 30,000,000 records. Elapsed time: 00:02:33s. Time for last 10,000,000: 53s. Lastread position: 6:28,648,000
INFO 2016-05-20 17:06:23 SamFileValidator Validated Read 40,000,000 records. Elapsed time: 00:03:26s. Time for last 10,000,000: 52s. Lastread position: 8:61,156,723
INFO 2016-05-20 17:07:16 SamFileValidator Validated Read 50,000,000 records. Elapsed time: 00:04:18s. Time for last 10,000,000: 52s. Lastread position: 11:824,441
INFO 2016-05-20 17:08:06 SamFileValidator Validated Read 60,000,000 records. Elapsed time: 00:05:09s. Time for last 10,000,000: 50s. Lastread position: 13:78,221,271
INFO 2016-05-20 17:08:57 SamFileValidator Validated Read 70,000,000 records. Elapsed time: 00:06:00s. Time for last 10,000,000: 51s. Lastread position: 16:75,192,511
INFO 2016-05-20 17:09:50 SamFileValidator Validated Read 80,000,000 records. Elapsed time: 00:06:53s. Time for last 10,000,000: 52s. Lastread position: 19:57,065,257
HISTOGRAM java.lang.String
Error Type Count
ERROR:MATE_CIGAR_STRING_INVALID_PRESENCE 149067
[Fri May 20 17:16:05 EDT 2016] picard.sam.ValidateSamFile done. Elapsed time: 13.14 minutes.
Runtime.totalMemory()=82837504
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
C:\sgajjala_projects\test_validateBAM\picard_20160520\picard-tools-2.3.0>
C:\sgajjala_projects\test_validateBAM\picard_20160520\picard-tools-2.3.0>
C:\sgajjala_projects\test_validateBAM\picard_20160520\picard-tools-2.3.0>
Here are my grep results for one of the reads that caused "MATE_CIGAR_STRING_INVALID_PRESENCE" error by Picard :
[[email protected] novoalign]$
[[email protected] novoalign]$ ~/algorithms/samtools/20150929/samtools-1.2/samtools view SampleReads_Final.bam | grep "NS500510:11:H2T5HAFXX:4:11508:2247:4270"
NS500510:11:H2T5HAFXX:4:11508:2247:4270 77 * 0 0 * * 0 0 TGCCACCTCTACTGAACCCACAACCACAACAACAGCCAGCACTACAACTGAACAGTCTGCCACAACAACTGGAGGTACTTCAACTGGAGTTACAACCACAACAGCAACCACTAGCACCTCTACTGAACCCAGCACTACGACAA AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEE<EA/EE/EE/<<E/<E/A/E</<AE/E/AE//E//EAAE///EE/A/E//EEEA<AAE/6E///E<EE/<E PG:Z:novoalign.1 RG:Z:776666_NS500510_0099_L004 PU:Z:6_L004 LB:Z:Lib1 ZS:Z:NMMC:Z:* MQ:i:0
NS500510:11:H2T5HAFXX:4:11508:2247:4270 141 * 0 0 * * 0 0 GCCTGCAGTTGTGGTGGCAGACTGGTCAGTCACAGTGCCAATTGTCGTAGTGCTGGGTTCAGTAGAGGTGCTAGTGGTTGCTGTTGTGGTAGTAACTCCAGTACAAGTACCTCCAGTTGTTGTGGCAGACTGTTCAG AAAAAEEEEEEEEEEEEEEEEEEEAEEEEEAE/EEEEE/EE/EEE/EEEEEEAE/E/AE6E/6EAA/E/E/E/E<EAEEAAE/EE/E//A//EEAEE/6//<///E/EE</E/6///E/AE/A//E/E//EA/EE/E PG:Z:novoalign.1 RG:Z:776666_NS500510_0099_L004 PU:Z:6_L004 LB:Z:Lib1 ZS:Z:NM MC:Z:* MQ:i:0
[[email protected] novoalign]$
So, the above reads seems to be genuine and I don't understand why Picard calls them out as "MATE_CIGAR_STRING_INVALID_PRESENCE" ?
Hi @mglclinical,
I'm not familiar with this case figure so I looked up the test for this error in the Picard source code. This is the read pair in the test file, which is considered illegal by the SAM specification:
As far as I can tell this is the same case figure as what you have, ins't it? If so the tool is correct and these read pairs were formatted incorrectly by the aligner you used. My understanding is that the mapped read should not have an MC tag. I think Picard FixMate may be able to clean this up; I would recommend you give it a try on a snippet that contains some of the offending reads.
Sorry I meant FixMateInformation.
@Geraldine_VdAuwera
Thank you so much for a quick reply over the weekend.
In my current state of my pipeline, I run HaplotypeCaller on my previous bam file(unclean version) and my results are good enough (good sensitivity and specificity for NA12878 ; and also picked up positive expected variants from positive control test samples)
So, If I clean up my bam file with Picard's FixMateInformation, would it affect my results in vcf file when I run with HaplotypeCaller ?
Thanks,
mglclinical
Hi @mgclinical,
No, I wouldn't expect this to affect the results of HaplotypeCaller, as it doesn't use or even read in mate cigar information. This would only be a problem if you wanted to run some other tool that needs to access MC and can't handle this case figure.
@Geraldine_VdAuwera thank you
This is just an information for future Picard users
Source - https://sourceforge.net/p/picard/wiki/Main_Page/
Q: Why am I getting errors from Picard like "MAPQ should be 0 for unmapped read" or "CIGAR should have zero elements for unmapped read?"
A: BWA can produce SAM records that are marked as unmapped but have non-zero MAPQ and/or non-"*" CIGAR. Typically this is because BWA found an alignment for the read that hangs off the end of the reference sequence. Picard considers such input to be invalid. In general, this error can be suppressed in Picard programs by passing VALIDATION_STRINGENCY=LENIENT or VALIDATION_STRINGENCY=SILENT. For ValidateSamFile, you can pass the arguments IGNORE=INVALID_MAPPING_QUALITY IGNORE=INVALID_CIGAR.
This is what happened in my case, one of the reads had a valid CIGAR string when mapping quality was 0, and that is how BWA assigns the CIGAR strings for reads found at the end of the reference sequences.
Thanks,
mglclinical
Thank you @mglclinical.
@Geraldine_VdAuwera
Hi,
I run ValidateSamFile tool in summary mode:
java -Xmx10g -jar $PICARD ValidateSamFile \
I=$INPUT/Sample.modRG.bam \
O=$OUTPUT/Sample.modRG.bam.txt \
MODE=SUMMARY
It prompts error messages i.e.
Error Type Count
ERROR:INVALID_PLATFORM_VALUE 2
However, when i checked the RG tag in header section, the PL looks like this:
@RG ID:HS2000-1259_212.L2 LB:LP6005592-DNA_E05-PE PL:HiSeq2000 SM:LP6005592-DNA_E05 PU:E05_HS2000-1259_212.L2 CN:Illumina-ltd
@RG ID:HS2000-1259_212.L3 LB:LP6005592-DNA_E05-PE PL:HiSeq2000 SM:LP6005592-DNA_E05 PU:E05_HS2000-1259_212.L3 CN:Illumina-ltd
My questions are:
1. Here the PLATFORM_VALUE (PL) is "HiSeq2000", is it wrong or am i missing something??
2. I am curious either is due to change in PU value format or something else??
3. can i proceed without giving serious attention towards this error, if not, how to resolve it?
Thanks!
@MUHAMMADSOHAILRAZA: According to the SAM specification sheet, there are a limited number of PL values which include: CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, ONT, and PACBIO. HiSeq2000 is not one of them. Use Picard's AddOrReplaceReadGroups to fix your SAM/BAM files and then try ValidateSamFile again. Let us know if you have more questions.
@dekling
hi,
If i dont wanna change PL "HiSeq2000", is it okay for downstream analysis?
@MUHAMMADSOHAILRAZA: I do not know for sure. You are welcome to try it, but I strongly recommend against it.
@dekling
Actually, i received merged BAM file with multiple RGs, i think, if i am not missing anything, Picard's AddOrReplaeReadGroups tool work on single lane of sample.. Is there any way to change PL values for multiple RGs in merged SAM/BAM by picards. i.e. replacing only PL value below,
@RG ID:HS2000-1259_212.L2 LB:LP6005592-DNA_E05-PE PL:HiSeq2000 SM:LP6005592-DNA_E05 PU:E05_HS2000-1259_212.L2 CN:Illumina-ltd
@RG ID:HS2000-1259_212.L3 LB:LP6005592-DNA_E05-PE PL:HiSeq2000 SM:LP6005592-DNA_E05 PU:E05_HS2000-1259_212.L3 CN:Illumina-ltd
i know it can be done by text editing, but want to find any other good way:
@MUHAMMADSOHAILRAZA: I am not sure, but you can try experimenting with Picard's ReplaceSamHeader tool.
Thanks! @dekling
Hi All,
I am using picard-2.9.4. When I run ValidateSamFile \
I=input.bam \
MODE=SUMMARY
I got an error message: ERROR: Unrecognized option:
....
OPTIONS_FILE=File File of OPTION_NAME=value pairs. No positional parameters allowed. Unlike command-line
options, unrecognized options are ignored. A single-valued option set in an options file
may be overridden by a subsequent command-line option. A line starting with '#' is
considered a comment. Required.
is it true that a OPTIONS_FILE has to be referred under this version of picard to run this function?
Thanks for answering.
@junhou
Hi,
I just tried running this exact command:
java -jar picard.jar ValidateSamFile I=input.bam MODE=SUMMARY
and I get no errors.Can you post what exactly comes after
ERROR: Unrecognized option:
?Thanks,
Sheila
I've been trying to run Picard's tool AddorReplaceReadGroups but I keep getting the following message:
Code: java -jar $PICARD_JARS/AddOrReplaceReadGroups.jar INPUT=aligned_reads.merged.bam OUTPUT=aligned_reads.correct.bam RGID=1 RGLB=lib1 RGPL=illumina_nextra RGPU=unit1 RGSM=ERR1211180
Error message:
Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: Record 1, Read name ERR1211180.544729, RG ID on SAMRecord not found in header: group1
at net.sf.samtools.SAMUtils.processValidationErrors(SAMUtils.java:448)
at net.sf.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:540)
at net.sf.samtools.BAMFileReader$BAMFileIterator.(BAMFileReader.java:499)
at net.sf.samtools.BAMFileReader$BAMFileIterator.(BAMFileReader.java:487)
at net.sf.samtools.BAMFileReader.getIterator(BAMFileReader.java:289)
at net.sf.samtools.SAMFileReader.iterator(SAMFileReader.java:319)
at net.sf.samtools.SAMFileReader.iterator(SAMFileReader.java:37)
at net.sf.picard.sam.AddOrReplaceReadGroups.doWork(AddOrReplaceReadGroups.java:98)
at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
at net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:119)
at net.sf.picard.sam.AddOrReplaceReadGroups.main(AddOrReplaceReadGroups.java:66)
I'm not sure what I'm doing wrong. The only error message for this bam file is that the read group info is not found.
Thanks for any help!
@pschnepp
Hi,
It seems you already have read group tags in your individual reads and you may be trying to add conflicting read group information to the header. If you look at your individual reads, you should be able to determine the RGID and add that manually to your header (no need to use AddOrReplaceReadGroups).
-Sheila
P.S. You may be able to find some extra help if you do a google search for your error message
I was hoping there would be an easier way than manually changing the read group info. I've tried doing a google search but hadn't come up with any answers that solved my problem. Guess brute force and manually changing is the way to go. Thanks for the help.
Is there a way to get it to generate a 'clean' bam file containing only the records that pass validation?
@bhaas
Hi,
There are some Picard tools that can help, however you may need to use a few tools to clean your file. What are the errors that you are getting? Also, if you just have a few offending records, and you know which ones they are, you can manually remove them.
-Sheila
Thanks, Sheila. I ended up just running the validator first to identify all offending records, and then wrote a separate little tool to remove the offending records from the bam. That'll suffice for now. Ideally, there'd be a flag somewhere in the picard tools to tell it to output just the records that validate, though. Would have saved me some effort. ;-)
@bhaas
Hi,
Yes, I agree that would be nice. But, people may abuse that and move on with just 50% of their data when they could recover 100% of their data with a fix. We have to make it "kind of" tough for people to move on without proper fixes
Glad to hear you solved this on your own.
-Sheila
I appreciate the sentiment, but think it would still be better to make it easy to do this and just properly inform the user. You could add a parameter so that if less than some percent of reads are output as 'valid', then it would crash. Having just 0.0001% of your records flagged as invalid for whatever reason just makes it difficult to move on with the existing setup, and if the user isn't a programmer, it's of course difficult to impossible for them to move on.
cheers,
~brian
Issue · Github
by Sheila
@bhaas
Hi Brian,
Thanks for sharing. I will bring this up to the team, but something tells me this is never going to happen
-Sheila
when I looked at the code for this, I came to pretty much the same conclusion. ;-) We can dream.
Hi everyone,
Recently I ran ValidateSamFile and received the following errors for my bam files.
To fix the offending bam files, I ran them through AddOrReplaceReadGroups and samtools fixmate. But it appears that doing so introduced some new problems. Re-running the new bam files through the ValidateSamFile script revealed the following new errors:
Are there guidelines for how I might go about fixing these specific errors?
Many thanks,
Caroline
@CarolineJudy
Hi Caroline,
How did you align your reads? Using BWA? Did you use the very latest version of the aligner?
-Sheila
Hi @Sheila,
Thanks,
I used bwa version 0.7.17 which is the latest version, to my knowledge.
Run BWA mem
Here are some additional details of my troubleshooting attempts:
ValidateSamFile
output
AddOrReplaceReadGroups
Re-validate using ValidateSamFile
Run samtools fixmate
Re-validate using ValidateSamFile
Try sorting (by coordinates), then re-run validation
Seems most of the errors are generated using samtools fixmate. I'm using samtools vs. 1.6.
@CarolineJudy
Hi Caroline,
Okay. Let's go back through your steps. What tools have you run on your BAM file? Perhaps this thread may help.
-Sheila
Hi @Sheila,
Thanks for your response! I'm currently in the pre-processing steps for the GATK Best Practices for variant calling on RNAseq. After performing the alignments, I performed sort/deduplication using samtools and picard-tools.
After performing the sort/deduplication step, I ran ValidateSamFile, and then undertook the trouble shooting steps detailed in my comment from June 29th.
Hi,
I ran bwa mem on paired end fastq files to generate sam file. I get the following error when I run
FixMateInformation on it:
sing GATK jar /home/groups/euan/apps/gatk/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/groups/euan/apps/gatk/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar FixMateInformation -I /scratch/PI/euan/common/udn/externalUdnData/from_wu/bwa_DCM/06042018/dcm141/DCM141_USD16060084_HNNVYCCXX_L6.sam -O /scratch/PI/euan/common/udn/externalUdnData/from_wu/bwa_DCM/06042018/dcm141/DCM141_USD16060084_HNNVYCCXX_L6.bam --ADD_MATE_CIGAR true --ASSUME_SORTED false --IGNORE_MISSING_MATES true --SORT_ORDER coordinate
14:20:45.697 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/groups/euan/apps/gatk/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Fri Jul 06 14:20:46 PDT 2018] FixMateInformation --INPUT /scratch/PI/euan/common/udn/externalUdnData/from_wu/bwa_DCM/06042018/dcm141/DCM141_USD16060084_HNNVYCCXX_L6.sam --OUTPUT /scratch/PI/euan/common/udn/externalUdnData/from_wu/bwa_DCM/06042018/dcm141/DCM141_USD16060084_HNNVYCCXX_L6.bam --SORT_ORDER coordinate --ASSUME_SORTED false --ADD_MATE_CIGAR true --IGNORE_MISSING_MATES true --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Fri Jul 06 14:20:46 PDT 2018] Executing as [email protected] on Linux 3.10.0-862.3.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_144-b01; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.0.4.0
INFO 2018-07-06 14:20:46 FixMateInformation Sorting input into queryname order.
[Fri Jul 06 14:21:12 PDT 2018] picard.sam.FixMateInformation done. Elapsed time: 0.45 minutes.
Runtime.totalMemory()=4734320640
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
htsjdk.samtools.SAMFormatException: Error parsing text SAM file. length(QUAL) != length(SEQ); File /scratch/PI/euan/common/udn/externalUdnData/from_wu/bwa_DCM/06042018/dcm141/DCM141_USD16060084_HNNVYCCXX_L6.sam; Line 1604998
Line: E00489:44:HNNVYCCXX:6:1101:8389:51324 147 2 115826625 60 150M = 115826414 -361 TTGTGGTTTCATATGCTTTAGGATTTCTTTTGTATTTCTGTAAATAATACCATTGGATTTTTATGGGGCTTGCATTGATTCTGTATTTGTTTTGAATTTTTAACAATATAGATTCTTCCAATATATAAACATAGATTTTTTTGTTTATTT JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
at htsjdk.samtools.SAMLineParser.reportErrorParsingLine(SAMLineParser.java:457)
at htsjdk.samtools.SAMLineParser.parseLine(SAMLineParser.java:338)
at htsjdk.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:268)
at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:255)
at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:228)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:576)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:548)
at picard.sam.FixMateInformation.doWork(FixMateInformation.java:200)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
I get the following error when I run ValidateSamFile:
Error Type Count
ERROR:MISMATCH_READ_LENGTH_AND_QUALS_LENGTH 1
ERROR:MISMATCH_SEQ_QUAL_LENGTH 1
WARNING:MISSING_TAG_NM 1
WARNING:RECORD_MISSING_READ_GROUP 1
I understand that Length of sequence string and length of base quality string do not match in line 'E00489:44:HNNVYCCXX:6:1101:8389:51324' but how should I fix this?
I am using gatk4 tools.
Thanks,
Shruti
Hi @Sheila
Just checking back regarding my post from July 2. Any ideas on what might be going wrong?
@CarolineJudy Sorry for the lag; Sheila is on vacation so I'm looking through your issue now. To be frank, we don't have the bandwidth to provide further one-on-one support since we don't develop bwa, samtools and other such pre-processing tools. I would recommend you reach out to the samtools developers, who have their own mailing list, and see if they can help you. Good luck!
@mshruti I believe RevertSam has a SANITIZE option that should allow you to exclude these malformed reads.
Thanks @Geraldine_VdAuwera. I will check it out.
Okay, thanks @Geraldine_VdAuwera!
Hi @Geraldine_VDAuwera,
To correct a missing mate problem, I ran the picard tool FixMateInformation as recommended above. However, running SamFileValidation shows that over 1M reads are still missing mate information.
ERROR:MATE_NOT_FOUND 1457087
Is there a way to filter these 1457087 "malformed" reads using picard tools?
Many thanks in advance for any help you can offer,
Hi everyone - I resolved this problem - the missing mate issue was generated when I used MarkDuplicates with option "REMOVE_DUPLICATES=TRUE", which appears to have created orphan reads in my outputted bam file. A Biostars discussion hypothesizes that the picard tool MarkDuplicates leaves unmapped mates in the BAM for pairs where only one of the two paired reads map. I think this might be what happened to me.
@CarolineJ
Hi,
Thank you for posting your findings.
-Sheila
In terms of best practices, do you have any suggestions? Currently, I plan to proceed with variant calling without performing the deduplication step, but I'm concerned with the bias that may create.
Thanks for any input you can provide.
@CarolineJ
Hi,
Can you try not using
REMOVE_DUPLICATES=TRUE
in MarkDuplicates?Thanks,
Sheila
Hi Sheila,
That works! Thanks!
Greetings,
I have run into a similar problem as CarolineJ. When running ValidateSamFile on a specific .bam file, I get an extremely high number of MATE_NOT_FOUND errors (>80 milllion, >50% of the total number of reads). My protocol so far:
All the other data sets processed simultaneously using this protocol and these tools validated fine with no errors. I have not marked/removed duplicates yet. samtools flagstat suggests that the majority of reads in the problematic data set are mapped with a mate.
Thanks!
Hi @clayf,
ValidateSamFile's MATE_NOT_FOUND error is explicitly about not being able to find the mate's read physically in the data file itself. Is there any possibility the file was truncated, data was subset to certain contigs or the second of pair sequencing run was problematic?
Hi @shlee,
I'm not sure, but I think it's unlikely. The original alignment file produced by HISAT2 (before running FixMateInfo or AddOrReplaceReadGroups) looks fine, and as I said, samtools seems to think the read mates are mapped correctly. HISAT2 also seems to think that the majority of mates are mapping correctly. I did run ValidateSamFile on the original alignment file and got the same error. Is there some reason that samtools flagstat would think that the mates are okay, but ValidateSamFile would not?
Yes, @clayf, the validation stringency between samtools and GATK is different. I suspect this has something to do with pair/proper pair designations of secondary/supplementary alignments, which MergeBamAlignment solves for our workflows. Do you have some idea how HISAT2 assigns pair designations for secondary/supplementary alignments, e.g. reads split across introns? Are you by chance working on RNA-Seq data? In this case, our workflows require an additional processing step, SplitNCigarReads.
@shlee, ah yes this is indeed RNA-seq data, so it could definitely be related to that. However, I would then wonder why the rest of these RNA-seq samples would pass ValidateSamFile with no errors or problems as they are, but this particular data set fails.
Indeed that is odd @clayf. I've seen from other users certain pipelines do not handle single-end reads well in that the pipelines expect paired-end reads. In some cases, it turned out the preprocessing was done incorrectly and the SAM flags are misassigned. If you pull out some of the reads that cause the
MATE_NOT_FOUND
warning (with e.g.samtools view xyz.bam | grep '<read name>'
; you can get a list of reads with ValidateSamFile by changing theMODE
to be more verbose), do both pairs come out? Then if you examine their SAM flags (see here), do they make sense? My recommendation is to see if your caller of choice then also chokes on this particular sample or is able to call variants on it and if the variant calls are of the same quality as your other samples.