Errors in SAM/BAM files can be diagnosed with ValidateSamFile

deklingdekling Broad InstituteMember
edited May 2016 in Solutions to Problems

The problem

You're trying to run a GATK or Picard tool that operates on a SAM or BAM file, and getting some cryptic error that doesn't clearly tell you what's wrong. Bits of the stack trace (the pile of lines in the output log that the program outputs when there is a problem) may contain the following: java.lang.String, Error Type Count, NullPointerException -- or maybe something else that doesn't mean anything to you.

Why this happens

The most frequent cause of these unexplained problems is not a bug in the program -- it's an invalid or malformed SAM/BAM file. This means that there is something wrong either with the content of the file (something important is missing) or with its format (something is written the wrong way). Invalid SAM/BAM files generally have one or more errors in the following sections: the header tags, the alignment fields, or the optional alignment tags. In addition, the SAM/BAM index file can be a source of errors as well.

The source of these errors is usually introduced by upstream processing tools, such as the genome mapper/aligner or any other data processing tools you may have applied before feeding the data to Picard or GATK.

The solution

To fix these problems, you first have to know what's wrong. Fortunately there's a handy Picard tool that can test for (almost) all possible SAM/BAM format errors, called ValidateSamFile.

We recommend the workflow included below for diagnosing problems with ValidateSamFile. This workflow will help you tackle the problem efficiently and set priorities for dealing with multiple errors (which often happens). We also outline typical solutions for common errors, but note that this is not meant to be an exhaustive list -- there are too many possible problems to tackle all of them in this document. To be clear, here we focus on diagnostics, not treatment.

In some cases, it may not be possible to fix some problems that are too severe, and you may need to redo the genome alignment/mapping from scratch! Consider running ValidateSamFile proactively at all key steps of your analysis pipeline to catch errors early!


Workflow for diagnosing SAM/BAM file errors with ValidateSamFile

image

1. Generate summary of errors

First, run ValidateSamFile in SUMMARY mode in order to get a summary of everything that is missing or improperly formatted in your input file. We set MODE=SUMMARY explicitly because by default the tool would just emit details about the 100 first problems it finds then quit. If you have some minor formatting issues that don't really matter but affect every read record, you won't get to see more important problems that occur later in the file.

$ java -jar picard.jar ValidateSamFile \ 
        I=input.bam \ 
        MODE=SUMMARY 

If this outputs No errors found, then your SAM/BAM file is completely valid. If you were running this purely as a preventative measure, then you're good to go and proceed to the next step in your pipeline. If you were doing this to diagnose a problem, then you're back to square one -- but at least now you know it's not likely to be a SAM/BAM file format issue. One exception: some analysis tools require Read Group tags like SM that not required by the format specification itself, so the input files will pass validation but the analysis tools will still error out. If that happens to you, check whether your files have SM tags in the @RG lines in their BAM header. That is the most common culprit.

However, if the command above outputs one or more of the 8 possible WARNING or 48 possible ERROR messages (see tables at the end of this document), you must proceed to the next step in the diagnostic workflow.

When run in SUMMARY mode, ValidateSamFile outputs a table that differentiates between two levels of error: ERROR proper and WARNING, based on the severity of problems that they would cause in downstream analysis. All problems that fall in the ERROR category must be addressed to in order to proceed with other Picard or GATK tools, while those that fall in the WARNING category may often be ignored for some, if not all subsequent analyses.

Example of error summary

ValidateSamFile (SUMMARY) Count
ERROR:MISSING_READ_GROUP 1
ERROR:MISMATCH_MATE_ALIGNMENT_START 4
ERROR:MATES_ARE_SAME_END 894289
ERROR:CIGAR_MAPS_OFF_REFERENCE 354
ERROR:MATE_NOT_FOUND 1
ERROR:MISMATCH_FLAG_MATE_UNMAPPED 46672
ERROR:MISMATCH_READ_LENGTH_AND_E2_LENGTH 1
WARNING:RECORD_MISSING_READ_GROUP 54
WARNING:MISSING_TAG_NM 33

This table, generated by ValidateSamFile from a real BAM file, indicates that this file has a total of 1 MISSING_READ_GROUP error, 4 MISMATCH_MATE_ALIGNMENT_START errors, 894,289 MATES_ARE_SAME_END errors, and so on. Moreover, this output also indicates that there are 54 RECORD_MISSING_READ_GROUP warnings and 33 MISSING_TAG_NM warnings.

2. Generate detailed list of ERROR records

Since ERRORs are more severe than WARNINGs, we focus on diagnosing and fixing them first. From the first step we only had a summary of errors, so now we generate a more detailed report with this command:

$ java -jar picard.jar ValidateSamFile \ 
        I=input.bam \ 
        IGNORE_WARNINGS=true \
        MODE=VERBOSE 

Note that we invoked the MODE=VERBOSE and the IGNORE_WARNINGS=true arguments.

The former is technically not necessary as VERBOSE is the tool's default mode, but we specify it here to make it clear that that's the behavior we want. This produces a complete list of every problematic record, as well as a more descriptive explanation for each type of ERROR than is given in the SUMMARY output.

The IGNORE_WARNINGS option enables us to specifically examine only the records with ERRORs. When working with large files, this feature can be quite helpful, because there may be many records with WARNINGs that are not immediately important, and we don't want them flooding the log output.

Example of VERBOSE report for ERRORs only

ValidateSamFile (VERBOSE) Error Description
ERROR: Read groups is empty Empty read group field for multiple records
ERROR: Record 1, Read name 20FUKAAXX100202:6:27:4968:125377Mate alignment does not match alignment start of mate
ERROR: Record 3, Read name 20FUKAAXX100202:6:27:4986:125375 Both mates are marked as second of pair
ERROR: Record 6, Read name 20GAVAAXX100126:4:47:18102:194445 Read CIGAR M operator maps off end of reference
ERROR: Read name 30PPJAAXX090125:1:60:1109:517#0 Mate not found for paired read
ERROR: Record 402, Read name 20GAVAAXX100126:3:44:17022:23968 Mate unmapped flag does not match read unmapped flag of mate
ERROR: Record 12, Read name HWI-ST1041:151:C7BJEACXX:1:1101:1128:82805 Read length does not match quals length

These ERRORs are all problems that we must address before using this BAM file as input for further analysis. Most ERRORs can typically be fixed using Picard tools to either correct the formatting or fill in missing information, although sometimes you may want to simply filter out malformed reads using Samtools.

For example, MISSING_READ_GROUP errors can be solved by adding the read group information to your data using the AddOrReplaceReadGroups tool. Most mate pair information errors can be fixed with FixMateInformation.

Once you have attempted to fix the errors in your file, you should put your new SAM/BAM file through the first validation step in the workflow, running ValidateSamFile in SUMMARY mode again. We do this to evaluate whether our attempted fix has solved the original ERRORs, and/or any of the original WARNINGs, and/or introduced any new ERRORs or WARNINGs (sadly, this does happen).

If you still have ERRORs, you'll have to loop through this part of the workflow until no more ERRORs are detected.

If you have no more ERRORs, congratulations! It's time to look at the WARNINGs (assuming there are still some -- if not, you're off to the races).

3. Generate detailed list of WARNING records

To obtain more detailed information about the warnings, we invoke the following command:

$ java -jar picard.jar ValidateSamFile \ 
        I=input.bam \ 
        IGNORE=type \
        MODE=VERBOSE 

At this time we often use the IGNORE option to tell the program to ignore a specific type of WARNING that we consider less important, in order to focus on the rest. In some cases we may even decide to not try to address some WARNINGs at all because we know they are harmless (for example, MATE_NOT_FOUND warnings are expected when working with a small snippet of data). But in general we do strongly recommend that you address all of them to avoid any downstream complications, unless you're sure you know what you're doing.

Example of VERBOSE report for WARNINGs only

ValidateSamFile (VERBOSE) Warning Description
WARNING: Read name H0164ALXX140820:2:1204:13829:66057 A record is missing a read group
WARNING: Record 1, Read name HARMONIA-H16:1253:0:7:1208:15900:108776 NM tag (nucleotide differences) is missing

Here we see a read group-related WARNING which would probably be fixed when we fix the MISSING_READ_GROUP error we encountered earlier, hence the prioritization strategy of tackling ERRORs first and WARNINGs second.

We also see a WARNING about missing NM tags. This is an alignment tag that is added by some but not all genome aligners, and is not used by the downstream tools that we care about, so you may decide to ignore this warning by adding IGNORE=MISSING_TAG_NM from now on when you run ValidateSamFile on this file.

Once you have attempted to fix all the WARNINGs that you care about in your file, you put your new SAM/BAM file through the first validation step in the workflow again, running ValidateSamFile in SUMMARY mode. Again, we check that no new ERRORs have been introduced and that the only WARNINGs that remain are the ones we feel comfortable ignoring. If that's not the case we run through the workflow again. If it's all good, we can proceed with our analysis.


Appendix: List of all WARNINGs and ERRORs emitted by ValidateSamFile

We are currently in the process of updating the Picard website to include the following two tables, describing WARNING (Table I) and ERROR (Table II) cases. Until that's done, you can find them here.

Table I
WARNINGDescription
Header Issues
INVALID_DATE_STRINGDate string is not ISO-8601
INVALID_QUALITY_FORMAT Quality encodings out of range; appear to be Solexa or Illumina when Phred expected. Avoid exception being thrown as a result of no qualities being read.
General Alignment Record Issues
ADJACENT_INDEL_IN_CIGAR CIGAR string contains an insertion (I) followed by deletion (D), or vice versa
RECORD_MISSING_READ_GROUP A SAMRecord is found with no read group id
Mate Pair Issues
PAIRED_READ_NOT_MARKED_AS_FIRST_OR_SECOND Pair flag set but not marked as first or second of pair
Optional Alignment Tag Issues
MISSING_TAG_NM The NM tag (nucleotide differences) is missing
E2_BASE_EQUALS_PRIMARY_BASE Secondary base calls should not be the same as primary, unless one or the other is N
General File, Index or Sequence Dictionary Issues
BAM_FILE_MISSING_TERMINATOR_BLOCK BAM appears to be healthy, but is an older file so doesn't have terminator block
Table II
ERROR Description
Header Issues
DUPLICATE_PROGRAM_GROUP_ID Same program group id appears more than once
DUPLICATE_READ_GROUP_ID Same read group id appears more than once
HEADER_RECORD_MISSING_REQUIRED_TAG Header tag missing in header line
HEADER_TAG_MULTIPLY_DEFINED Header tag appears more than once in header line with different value
INVALID_PLATFORM_VALUE The read group has an invalid value set for its PL field
INVALID_VERSION_NUMBER Does not match any of the acceptable versions
MISSING_HEADER The SAM/BAM file is missing the header
MISSING_PLATFORM_VALUE The read group is missing its PL (platform unit) field
MISSING_READ_GROUP The header is missing read group information
MISSING_SEQUENCE_DICTIONARY There is no sequence dictionary in the header
MISSING_VERSION_NUMBER Header has no version number
POORLY_FORMATTED_HEADER_TAG Header tag does not have colon
READ_GROUP_NOT_FOUND A read group ID on a SAMRecord is not found in the header
UNRECOGNIZED_HEADER_TYPE Header record is not one of the standard types
General Alignment Record Issues
CIGAR_MAPS_OFF_REFERENCE Bases corresponding to M operator in CIGAR extend beyond reference
INVALID_ALIGNMENT_START Alignment start position is incorrect
INVALID_CIGAR CIGAR string error for either read or mate
INVALID_FLAG_FIRST_OF_PAIR First of pair flag set for unpaired read
INVALID_FLAG_SECOND_OF_PAIR Second of pair flag set for unpaired read
INVALID_FLAG_PROPER_PAIR Proper pair flag set for unpaired read
INVALID_FLAG_MATE_NEG_STRAND Mate negative strand flag set for unpaired read
INVALID_FLAG_NOT_PRIM_ALIGNMENT Not primary alignment flag set for unmapped read
INVALID_FLAG_SUPPLEMENTARY_ALIGNMENT Supplementary alignment flag set for unmapped read
INVALID_FLAG_READ_UNMAPPED Mapped read flat not set for mapped read
INVALID_INSERT_SIZE Inferred insert size is out of range
INVALID_MAPPING_QUALITY Mapping quality set for unmapped read or is >= 256
INVALID_PREDICTED_MEDIAN_INSERT_SIZE PI tag value is not numeric
MISMATCH_READ_LENGTH_AND_QUALS_LENGTH Length of sequence string and length of base quality string do not match
TAG_VALUE_TOO_LARGE Unsigned integer tag value is deprecated in BAM. Template length
Mate Pair Issues
INVALID_FLAG_MATE_UNMAPPED Mate unmapped flag is incorrectly set
MATE_NOT_FOUND Read is marked as paired, but its pair was not found
MATE_CIGAR_STRING_INVALID_PRESENCE A cigar string for a read whose mate is NOT mapped
MATE_FIELD_MISMATCH Read alignment fields do not match its mate
MATES_ARE_SAME_END Both mates of a pair are marked either as first or second mates
MISMATCH_FLAG_MATE_UNMAPPED Mate unmapped flag does not match read unmapped flag of mate
MISMATCH_FLAG_MATE_NEG_STRAND Mate negative strand flag does not match read strand flag
MISMATCH_MATE_ALIGNMENT_START Mate alignment does not match alignment start of mate
MISMATCH_MATE_CIGAR_STRING The mate cigar tag does not match its mate's cigar string
MISMATCH_MATE_REF_INDEX Mate reference index (MRNM) does not match reference index of mate
Optional Alignment Tag Issues
INVALID_MATE_REF_INDEX Mate reference index (MRNM) set for unpaired read
INVALID_TAG_NM The NM tag (nucleotide differences) is incorrect
MISMATCH_READ_LENGTH_AND_E2_LENGTH Lengths of secondary base calls tag values and read should match
MISMATCH_READ_LENGTH_AND_U2_LENGTH Secondary base quals tag values should match read length
EMPTY_READ Indicates that a read corresponding to the first strand has a length of zero and/or lacks flow signal intensities (FZ)
INVALID_INDEXING_BIN Indexing bin set on SAMRecord does not agree with computed value
General File, Index or Sequence Dictionary Issues
INVALID_INDEX_FILE_POINTER Invalid virtualFilePointer in index
INVALID_REFERENCE_INDEX Reference index not found in sequence dictionary
RECORD_OUT_OF_ORDER The record is out of order
TRUNCATED_FILE BAM file does not have terminator block
Post edited by dekling on

Comments

  • mglclinicalmglclinical USAMember

    Hi @dekling,

    I have run ValidateSamFile in a summary mode on my bam file and it outputs a "MATE_CIGAR_STRING_INVALID_PRESENCE" error for 59409 reads/count. The reads that picard points to as ERRORs have a mapping quality of 0, and I don't understand why picard throes this error ? Should I have removed reads with 0 mapping quality from my bam file before running the ValidateSamFile ?

    [[email protected] bwa]$ java -jar ~/algorithms/picard/20151013/picard-tools-1.140/picard.jar ValidateSamFile I="bwaAlign_MDup.bam" MODE= SUMMARY
    [Fri May 06 14:36:38 EDT 2016] picard.sam.ValidateSamFile INPUT=bwaAlign_MDup.bam MODE=SUMMARY MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPR ESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
    [Fri May 06 14:36:38 EDT 2016] Executing as [email protected] on Linux 3.10.0-229.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_99-moc kbuild_2016_03_23_23_52-b00; Picard version: 1.140(a81bc82e781dae05c922d1dbcee737334612399f_1444244284) IntelDeflater
    INFO 2016-05-06 14:37:35 SamFileValidator Validated Read 10,000,000 records. Elapsed time: 00:00:56s. Time for last 10,000,000: 56s. Last read position: 8:38,110,502
    INFO 2016-05-06 14:38:29 SamFileValidator Validated Read 20,000,000 records. Elapsed time: 00:01:50s. Time for last 10,000,000: 54s. Last read position: 19:44,988,530

    HISTOGRAM java.lang.String

    Error Type Count
    ERROR:MATE_CIGAR_STRING_INVALID_PRESENCE 59409

    I re-ran the ValidateSamFile in verbose mode and got the list of reads with the above error. One of the reads with the above error is .

    I did a grep search on my bam file and here are the records :

    [[email protected] bwa]$ ~/algorithms/samtools/20150929/samtools-1.2/samtools view bwaAlign_MDup.bam | grep "NS500510:11:H2T5HAFXX:1:11102:4034:19007"
    NS500510:11:H2T5HAFXX:1:11102:4034:19007 121 1 24559 0 151M = 24559 0 TCCCAGTAGTAAAAGCTTCCAAGTTGGGCTCTCACTTCAGCCCCTCCCACACAGGGAAGCCAGATGGGTTCCCCAGGACCGGGATTCCCCAAGGGGGCTGCTCCCAGAGGGTGTGTTGCTGGGATTGCCCAGGACAGGGATGGCCCTCTCA /EAA<</EEEAE<EEEEEAE<EEEAEA<A/AEAEEEEEEEAEEEEEEEEEEEEAE6EEE<EEAEEAAAEEEEEEA<AEEAE<AEEEEEEA6AEAEAAEAAEEEEEAEE/E<EAEAEEEEEEEEEEEAAEAEEEEEEEE/EA/EEEEAAAAA NM:i:1 MD:Z:6G144 AS:i:146 XS:i:146 RG:Z:771111_NS500510_0099_L001 XA:Z:9,-24672,151M,1;19,-66167,151M,1;2,+114346309,151M,1;15,+102506455,151M,1;12,+78968,151M,3; MC:Z:* MQ:i:0
    NS500510:11:H2T5HAFXX:1:11102:4034:19007 181 1 24559 0 * = 24559 0 TTTGGTTGTTTTTTTTTTAAATATTTTTTTTTTCCTTCCATTTGGTGGGCCTATCCTTTATGTTTTTCCTGAACCAAAAAAATATTTCCCCTTTTTTGGTGGTTTGAACTCCTTGGTGTTATTCTCAATATACCCCCCCCCTTATACGCC ///////A/////////<//////<//////E/////E///<////E//////E///E////6/E////A/////EA///<///EE//////A////<//E////////////////<///////////////////////////A/A// AS:i:0 XS:i:0 RG:Z:771111_NS500510_0099_L001 MC:Z:151M MQ:i:0

    Both the reads have a Mapping Quality 0, and one of the read is unmapped(flag 181) and other one is mapped(flag 121). This format seems to be genuine and I don't understand why Picard reports this as an error ?

    Thanks:
    mglclinical

    Issue · Github
    by Sheila

    Issue Number
    864
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    dekling
  • deklingdekling Broad InstituteMember

    Hello @mglclinical:

    Let us start with the easy stuff. First update your Picard to the most recent version. Then try running again and see what happens. Let me know.
    -David

  • deklingdekling Broad InstituteMember

    @mglclinical:
    Your Samtools appears to be out of date as well.

  • mglclinicalmglclinical USAMember

    Hello @dekling ,

    thank you for quick response. I will ask our system administrator to install java 8 in order for me to use latest picard, and then see how the results look like with the new picard

    Thanks
    mglclinical

  • mglclinicalmglclinical USAMember

    Hi @dekling

    On our linux server, we have java 1.7, and not java 1.8. I learnt from my system administrator that if we update to java 1.8, then he may remove the existing java 1.7 as recommended by Oracle .

    Most recent version of picard 2.2.4 can be run only with java 1.8, where as the most recent version of GATK 3.5 can be only run with java 1.7, hence I am in dilemma . Please advice .

    I learnt that GATK_3.6 is going to be released soon which requires java 1.8, and not 1.7 . Any estimates on release dates for GATK3.6 ?

    Thanks,
    mglclinical

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @mglclinical
    Hi mglclinical,

    Ah, I see the issue. We do have an article that helps users easily switch between Java versions, but it seems that will not help you :neutral:

    However, if you are able to use the latest nightly build, that requires Java 1.8. So, you can upgrade to Java 1.8 and use it for both GATK and Picard.

    If you absolutely cannot use the latest nightly build, I suspect the next stable release (3.6) will be out soon (we are finalizing things). But, Geraldine @Geraldine_VdAuwera may be able to comment more :smile:

    -Sheila

  • mglclinicalmglclinical USAMember

    @Sheila

    Thank you for the quick reply

    Could you please point me to that article that helps "users easily switch between Java versions" ? I know it may not be completely useful, but I like to look into it.

    Thanks,
    mglclinical

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @mglclinical
    Hi mglclinical,

    Here it is.

    -Sheila

  • mglclinicalmglclinical USAMember

    Hi @dekling

    You have suggested me to upgrade picard and run it. I did upgrade picard and I got the same error "MATE_CIGAR_STRING_INVALID_PRESENCE" . All those reads with errors are the ones with mapping quality 0

    C:\sgajjala_projects\test_validateBAM\picard_20160520\picard-tools-2.3.0>
    C:\sgajjala_projects\test_validateBAM\picard_20160520\picard-tools-2.3.0>
    C:\sgajjala_projects\test_validateBAM\picard_20160520\picard-tools-2.3.0>java -Xmx6G -jar picard.jar ValidateSamFile I="C:\sgajjala_projects\test_validateBAM\novoalign_bam\SampleReads_Final.bam" MODE=SUMMARY
    [Fri May 20 17:02:57 EDT 2016] picard.sam.ValidateSamFile INPUT=C:\sgajjala_projects\test_validateBAM\novoalign_bam\SampleReads_Final.bam MODE=SUMMARY MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
    [Fri May 20 17:02:57 EDT 2016] Executing as [email protected] on Windows 7 6.1 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_65-b17; Picard version: 2.3.0(9a00c87b7ffdb01cfb5a0d6e76556146196babb8_1463071327) JdkDeflate
    INFO 2016-05-20 17:03:47 SamFileValidator Validated Read 10,000,000 records. Elapsed time: 00:00:50s. Time for last 10,000,000: 50s. Lastread position: 2:43,053,844
    INFO 2016-05-20 17:04:37 SamFileValidator Validated Read 20,000,000 records. Elapsed time: 00:01:40s. Time for last 10,000,000: 49s. Lastread position: 3:169,343,211
    INFO 2016-05-20 17:05:30 SamFileValidator Validated Read 30,000,000 records. Elapsed time: 00:02:33s. Time for last 10,000,000: 53s. Lastread position: 6:28,648,000
    INFO 2016-05-20 17:06:23 SamFileValidator Validated Read 40,000,000 records. Elapsed time: 00:03:26s. Time for last 10,000,000: 52s. Lastread position: 8:61,156,723
    INFO 2016-05-20 17:07:16 SamFileValidator Validated Read 50,000,000 records. Elapsed time: 00:04:18s. Time for last 10,000,000: 52s. Lastread position: 11:824,441
    INFO 2016-05-20 17:08:06 SamFileValidator Validated Read 60,000,000 records. Elapsed time: 00:05:09s. Time for last 10,000,000: 50s. Lastread position: 13:78,221,271
    INFO 2016-05-20 17:08:57 SamFileValidator Validated Read 70,000,000 records. Elapsed time: 00:06:00s. Time for last 10,000,000: 51s. Lastread position: 16:75,192,511
    INFO 2016-05-20 17:09:50 SamFileValidator Validated Read 80,000,000 records. Elapsed time: 00:06:53s. Time for last 10,000,000: 52s. Lastread position: 19:57,065,257

    HISTOGRAM java.lang.String

    Error Type Count
    ERROR:MATE_CIGAR_STRING_INVALID_PRESENCE 149067

    [Fri May 20 17:16:05 EDT 2016] picard.sam.ValidateSamFile done. Elapsed time: 13.14 minutes.
    Runtime.totalMemory()=82837504
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

    C:\sgajjala_projects\test_validateBAM\picard_20160520\picard-tools-2.3.0>
    C:\sgajjala_projects\test_validateBAM\picard_20160520\picard-tools-2.3.0>
    C:\sgajjala_projects\test_validateBAM\picard_20160520\picard-tools-2.3.0>

    Here are my grep results for one of the reads that caused "MATE_CIGAR_STRING_INVALID_PRESENCE" error by Picard :

    [[email protected] novoalign]$
    [[email protected] novoalign]$ ~/algorithms/samtools/20150929/samtools-1.2/samtools view SampleReads_Final.bam | grep "NS500510:11:H2T5HAFXX:4:11508:2247:4270"
    NS500510:11:H2T5HAFXX:4:11508:2247:4270 77 * 0 0 * * 0 0 TGCCACCTCTACTGAACCCACAACCACAACAACAGCCAGCACTACAACTGAACAGTCTGCCACAACAACTGGAGGTACTTCAACTGGAGTTACAACCACAACAGCAACCACTAGCACCTCTACTGAACCCAGCACTACGACAA AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEE<EA/EE/EE/<<E/<E/A/E</<AE/E/AE//E//EAAE///EE/A/E//EEEA<AAE/6E///E<EE/<E PG:Z:novoalign.1 RG:Z:776666_NS500510_0099_L004 PU:Z:6_L004 LB:Z:Lib1 ZS:Z:NMMC:Z:* MQ:i:0
    NS500510:11:H2T5HAFXX:4:11508:2247:4270 141 * 0 0 * * 0 0 GCCTGCAGTTGTGGTGGCAGACTGGTCAGTCACAGTGCCAATTGTCGTAGTGCTGGGTTCAGTAGAGGTGCTAGTGGTTGCTGTTGTGGTAGTAACTCCAGTACAAGTACCTCCAGTTGTTGTGGCAGACTGTTCAG AAAAAEEEEEEEEEEEEEEEEEEEAEEEEEAE/EEEEE/EE/EEE/EEEEEEAE/E/AE6E/6EAA/E/E/E/E<EAEEAAE/EE/E//A//EEAEE/6//<///E/EE</E/6///E/AE/A//E/E//EA/EE/E PG:Z:novoalign.1 RG:Z:776666_NS500510_0099_L004 PU:Z:6_L004 LB:Z:Lib1 ZS:Z:NM MC:Z:* MQ:i:0
    [[email protected] novoalign]$

    So, the above reads seems to be genuine and I don't understand why Picard calls them out as "MATE_CIGAR_STRING_INVALID_PRESENCE" ?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @mglclinical,

    I'm not familiar with this case figure so I looked up the test for this error in the Picard source code. This is the read pair in the test file, which is considered illegal by the SAM specification:

    pair_read   73  chr7    3   9   6M  =   3   0   CAACAG  )'.*.+  MC:Z:*  PG:Z:0  RG:Z:0  NM:i:4  UQ:i:33
    pair_read   133 chr7    3   0   *   =   3   0   NCGCGG  &/1544  MC:Z:6M PG:Z:0  RG:Z:0
    

    As far as I can tell this is the same case figure as what you have, ins't it? If so the tool is correct and these read pairs were formatted incorrectly by the aligner you used. My understanding is that the mapped read should not have an MC tag. I think Picard FixMate may be able to clean this up; I would recommend you give it a try on a snippet that contains some of the offending reads.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
  • mglclinicalmglclinical USAMember

    @Geraldine_VdAuwera

    Thank you so much for a quick reply over the weekend.

    In my current state of my pipeline, I run HaplotypeCaller on my previous bam file(unclean version) and my results are good enough (good sensitivity and specificity for NA12878 ; and also picked up positive expected variants from positive control test samples)

    So, If I clean up my bam file with Picard's FixMateInformation, would it affect my results in vcf file when I run with HaplotypeCaller ?

    Thanks,
    mglclinical

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @mgclinical,

    No, I wouldn't expect this to affect the results of HaplotypeCaller, as it doesn't use or even read in mate cigar information. This would only be a problem if you wanted to run some other tool that needs to access MC and can't handle this case figure.

  • This is just an information for future Picard users

    Source - https://sourceforge.net/p/picard/wiki/Main_Page/

    Q: Why am I getting errors from Picard like "MAPQ should be 0 for unmapped read" or "CIGAR should have zero elements for unmapped read?"

    A: BWA can produce SAM records that are marked as unmapped but have non-zero MAPQ and/or non-"*" CIGAR. Typically this is because BWA found an alignment for the read that hangs off the end of the reference sequence. Picard considers such input to be invalid. In general, this error can be suppressed in Picard programs by passing VALIDATION_STRINGENCY=LENIENT or VALIDATION_STRINGENCY=SILENT. For ValidateSamFile, you can pass the arguments IGNORE=INVALID_MAPPING_QUALITY IGNORE=INVALID_CIGAR.

    This is what happened in my case, one of the reads had a valid CIGAR string when mapping quality was 0, and that is how BWA assigns the CIGAR strings for reads found at the end of the reference sequences.
    Thanks,
    mglclinical

  • deklingdekling Broad InstituteMember
  • MUHAMMADSOHAILRAZAMUHAMMADSOHAILRAZA Beijing Institute of Genomics, CASMember

    @Geraldine_VdAuwera
    Hi,

    I run ValidateSamFile tool in summary mode:

    java -Xmx10g -jar $PICARD ValidateSamFile \
    I=$INPUT/Sample.modRG.bam \
    O=$OUTPUT/Sample.modRG.bam.txt \
    MODE=SUMMARY

    It prompts error messages i.e.
    Error Type Count
    ERROR:INVALID_PLATFORM_VALUE 2

    However, when i checked the RG tag in header section, the PL looks like this:

    @RG ID:HS2000-1259_212.L2 LB:LP6005592-DNA_E05-PE PL:HiSeq2000 SM:LP6005592-DNA_E05 PU:E05_HS2000-1259_212.L2 CN:Illumina-ltd
    @RG ID:HS2000-1259_212.L3 LB:LP6005592-DNA_E05-PE PL:HiSeq2000 SM:LP6005592-DNA_E05 PU:E05_HS2000-1259_212.L3 CN:Illumina-ltd

    My questions are:
    1. Here the PLATFORM_VALUE (PL) is "HiSeq2000", is it wrong or am i missing something??
    2. I am curious either is due to change in PU value format or something else??
    3. can i proceed without giving serious attention towards this error, if not, how to resolve it?

    Thanks!

  • deklingdekling Broad InstituteMember

    @MUHAMMADSOHAILRAZA: According to the SAM specification sheet, there are a limited number of PL values which include: CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, ONT, and PACBIO. HiSeq2000 is not one of them. Use Picard's AddOrReplaceReadGroups to fix your SAM/BAM files and then try ValidateSamFile again. Let us know if you have more questions.

  • MUHAMMADSOHAILRAZAMUHAMMADSOHAILRAZA Beijing Institute of Genomics, CASMember

    @dekling
    hi,
    If i dont wanna change PL "HiSeq2000", is it okay for downstream analysis?

  • deklingdekling Broad InstituteMember

    @MUHAMMADSOHAILRAZA: I do not know for sure. You are welcome to try it, but I strongly recommend against it.

  • MUHAMMADSOHAILRAZAMUHAMMADSOHAILRAZA Beijing Institute of Genomics, CASMember

    @dekling

    Actually, i received merged BAM file with multiple RGs, i think, if i am not missing anything, Picard's AddOrReplaeReadGroups tool work on single lane of sample.. Is there any way to change PL values for multiple RGs in merged SAM/BAM by picards. i.e. replacing only PL value below,
    @RG ID:HS2000-1259_212.L2 LB:LP6005592-DNA_E05-PE PL:HiSeq2000 SM:LP6005592-DNA_E05 PU:E05_HS2000-1259_212.L2 CN:Illumina-ltd
    @RG ID:HS2000-1259_212.L3 LB:LP6005592-DNA_E05-PE PL:HiSeq2000 SM:LP6005592-DNA_E05 PU:E05_HS2000-1259_212.L3 CN:Illumina-ltd

    i know it can be done by text editing, but want to find any other good way:

  • deklingdekling Broad InstituteMember

    @MUHAMMADSOHAILRAZA: I am not sure, but you can try experimenting with Picard's ReplaceSamHeader tool.

  • MUHAMMADSOHAILRAZAMUHAMMADSOHAILRAZA Beijing Institute of Genomics, CASMember
  • junhoujunhou rotterdamMember

    Hi All,

    I am using picard-2.9.4. When I run ValidateSamFile \
    I=input.bam \
    MODE=SUMMARY
    I got an error message: ERROR: Unrecognized option:
    ....

    • followed by options, including

    OPTIONS_FILE=File File of OPTION_NAME=value pairs. No positional parameters allowed. Unlike command-line
    options, unrecognized options are ignored. A single-valued option set in an options file
    may be overridden by a subsequent command-line option. A line starting with '#' is
    considered a comment. Required.

    is it true that a OPTIONS_FILE has to be referred under this version of picard to run this function?

    Thanks for answering.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @junhou
    Hi,

    I just tried running this exact command:
    java -jar picard.jar ValidateSamFile I=input.bam MODE=SUMMARY and I get no errors.

    Can you post what exactly comes after ERROR: Unrecognized option:?

    Thanks,
    Sheila

  • I've been trying to run Picard's tool AddorReplaceReadGroups but I keep getting the following message:
    Code: java -jar $PICARD_JARS/AddOrReplaceReadGroups.jar INPUT=aligned_reads.merged.bam OUTPUT=aligned_reads.correct.bam RGID=1 RGLB=lib1 RGPL=illumina_nextra RGPU=unit1 RGSM=ERR1211180

    Error message:
    Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: Record 1, Read name ERR1211180.544729, RG ID on SAMRecord not found in header: group1
    at net.sf.samtools.SAMUtils.processValidationErrors(SAMUtils.java:448)
    at net.sf.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:540)
    at net.sf.samtools.BAMFileReader$BAMFileIterator.(BAMFileReader.java:499)
    at net.sf.samtools.BAMFileReader$BAMFileIterator.(BAMFileReader.java:487)
    at net.sf.samtools.BAMFileReader.getIterator(BAMFileReader.java:289)
    at net.sf.samtools.SAMFileReader.iterator(SAMFileReader.java:319)
    at net.sf.samtools.SAMFileReader.iterator(SAMFileReader.java:37)
    at net.sf.picard.sam.AddOrReplaceReadGroups.doWork(AddOrReplaceReadGroups.java:98)
    at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
    at net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:119)
    at net.sf.picard.sam.AddOrReplaceReadGroups.main(AddOrReplaceReadGroups.java:66)

    I'm not sure what I'm doing wrong. The only error message for this bam file is that the read group info is not found.

    Thanks for any help!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator
    edited October 2017

    @pschnepp
    Hi,

    It seems you already have read group tags in your individual reads and you may be trying to add conflicting read group information to the header. If you look at your individual reads, you should be able to determine the RGID and add that manually to your header (no need to use AddOrReplaceReadGroups).

    -Sheila

    P.S. You may be able to find some extra help if you do a google search for your error message :smiley:

  • I was hoping there would be an easier way than manually changing the read group info. I've tried doing a google search but hadn't come up with any answers that solved my problem. Guess brute force and manually changing is the way to go. Thanks for the help.

  • bhaasbhaas Broad InstituteMember, Broadie

    Is there a way to get it to generate a 'clean' bam file containing only the records that pass validation?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @bhaas
    Hi,

    There are some Picard tools that can help, however you may need to use a few tools to clean your file. What are the errors that you are getting? Also, if you just have a few offending records, and you know which ones they are, you can manually remove them.

    -Sheila

  • bhaasbhaas Broad InstituteMember, Broadie

    Thanks, Sheila. I ended up just running the validator first to identify all offending records, and then wrote a separate little tool to remove the offending records from the bam. That'll suffice for now. Ideally, there'd be a flag somewhere in the picard tools to tell it to output just the records that validate, though. Would have saved me some effort. ;-)

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @bhaas
    Hi,

    Yes, I agree that would be nice. But, people may abuse that and move on with just 50% of their data when they could recover 100% of their data with a fix. We have to make it "kind of" tough for people to move on without proper fixes :smiley: Glad to hear you solved this on your own.

    -Sheila

  • bhaasbhaas Broad InstituteMember, Broadie

    I appreciate the sentiment, but think it would still be better to make it easy to do this and just properly inform the user. You could add a parameter so that if less than some percent of reads are output as 'valid', then it would crash. Having just 0.0001% of your records flagged as invalid for whatever reason just makes it difficult to move on with the existing setup, and if the user isn't a programmer, it's of course difficult to impossible for them to move on.

    cheers,

    ~brian

    Issue · Github
    by Sheila

    Issue Number
    2895
    State
    open
    Last Updated
    Assignee
    Array
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @bhaas
    Hi Brian,

    Thanks for sharing. I will bring this up to the team, but something tells me this is never going to happen :smile:

    -Sheila

  • bhaasbhaas Broad InstituteMember, Broadie

    when I looked at the code for this, I came to pretty much the same conclusion. ;-) We can dream.

  • CarolineJudyCarolineJudy Member
    edited June 27

    Hi everyone,

    Recently I ran ValidateSamFile and received the following errors for my bam files.

    ERROR:MATE_NOT_FOUND    1457087
    ERROR:MISSING_READ_GROUP    1
    WARNING:RECORD_MISSING_READ_GROUP   32944969
    

    To fix the offending bam files, I ran them through AddOrReplaceReadGroups and samtools fixmate. But it appears that doing so introduced some new problems. Re-running the new bam files through the ValidateSamFile script revealed the following new errors:

    ERROR:INVALID_FLAG_FIRST_OF_PAIR    473730
    ERROR:INVALID_FLAG_SECOND_OF_PAIR   983357
    ERROR:RECORD_OUT_OF_ORDER   424595
    

    Are there guidelines for how I might go about fixing these specific errors?

    Many thanks,
    Caroline

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @CarolineJudy
    Hi Caroline,

    How did you align your reads? Using BWA? Did you use the very latest version of the aligner?

    -Sheila

  • CarolineJudyCarolineJudy Member
    edited June 29

    Hi @Sheila,

    Thanks,

    I used bwa version 0.7.17 which is the latest version, to my knowledge.

    Run BWA mem

    # /bin/sh
    # ----------------Parameters---------------------- #
    #$ -S /bin/sh
    #$ -pe mthread 14
    #$ -q mThC.q
    #$ -l mres=2G,h_data=2G,h_vmem=2G
    #$ -cwd
    #$ -j y
    #$ -N aln_de_novo
    #$ -o aln_de_novo.log
    #$ -M [email protected]
    #
    # ----------------Modules------------------------- #
    module load bioinformatics/bwa
    #
    # ----------------Your Commands------------------- #
    #
    echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
    echo + NSLOTS = $NSLOTS
    #
    bwa mem -M -t 14 /prepare_ref/Trinity.fasta /2993_L008_R1_001_val_1.fq /S2993_L008_R2_001_val_2.fq  > results/sam/2993_denovo.sam
    #
    echo = `date` job $JOB_NAME done
    
  • CarolineJudyCarolineJudy Member
    edited June 29

    Here are some additional details of my troubleshooting attempts:

    ValidateSamFile

    # /bin/sh
    # ----------------Parameters---------------------- #
    #$ -S /bin/sh
    #$ -q mThC.q
    #$ -cwd
    #$ -j y
    #$ -N val_max_value_set
    #$ -o val.max_value_set.log
    #
    # ----------------Modules------------------------- #
    module load bioinformatics/picard-tools
    #
    # ----------------Your Commands------------------- #
    #
    echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
    #
    runpicard ValidateSamFile \
          I=my.bam \
          MODE=SUMMARY MAX_OPEN_TEMP_FILES=3050
    #
    echo = `date` job $JOB_NAME done
    

    output

    ## HISTOGRAM    java.lang.String
    Error Type  Count
    ERROR:MATE_NOT_FOUND    1457087
    ERROR:MISSING_READ_GROUP    1
    WARNING:RECORD_MISSING_READ_GROUP   32944969
    

    AddOrReplaceReadGroups

    # /bin/sh
    # ----------------Parameters---------------------- #
    #$ -S /bin/sh
    #$ -q mThC.q
    #$ -l mres=6G,h_data=6G,h_vmem=6G
    #$ -cwd
    #$ -j y
    #$ -N addreadgroups
    #$ -o addreadgroups.log
    #$ -M [email protected]
    #
    # ----------------Modules------------------------- #
    module load bioinformatics/picard-tools
    #
    # ----------------Your Commands------------------- #
    #
    echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
    #
    runpicard AddOrReplaceReadGroups \
          I=/pool/genomics/judyc/variant_calling/results/bam/2993_denovo.cleaned_sorted_deduplicated.bam \
          O=2993_denovo.cleaned_sorted_deduplicated_read_groups_added.bam \
          RGID=HG5Y5BBXX8ACATTGGC \
          RGLB=CVD245 \
          RGPL=illumina \
          RGPU=HG5Y5BBXX8ACATTGGC \
          RGSM=CVD245
    #
    echo = `date` job $JOB_NAME done
    

    Re-validate using ValidateSamFile

    HISTOGRAM   java.lang.String
    Error Type  Count
    ERROR:MATE_NOT_FOUND    1457087
    

    Run samtools fixmate

    # /bin/sh
    # ----------------Parameters---------------------- #
    #$ -S /bin/sh
    #$ -q sThC.q
    #$ -l mres=2G,h_data=2G,h_vmem=2G
    #$ -cwd
    #$ -j y
    #$ -N matefix_2993
    #$ -o matefix_2993.log
    #
    # ----------------Modules------------------------- #
    module load bioinformatics/samtools
    #
    # ----------------Your Commands------------------- #
    #
    echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
    #
    samtools sort -n ../../results/bam/2993_denovo.cleaned_sorted_deduplicated_read_groups_added.bam -o ../../results/bam/2993_name_sorted.bam -O bam | samtools fixmate -O bam ../../results/bam/2993_name_sorted.bamø 2993_matefix.bam
    #
    echo = `date` job $JOB_NAME done
    

    Re-validate using ValidateSamFile

    ERROR:INVALID_FLAG_FIRST_OF_PAIR    473730
    ERROR:INVALID_FLAG_SECOND_OF_PAIR   983357
    ERROR:RECORD_OUT_OF_ORDER   424595
    

    Try sorting (by coordinates), then re-run validation

    ## HISTOGRAM    java.lang.String
    Error Type  Count
    ERROR:INVALID_FLAG_FIRST_OF_PAIR    378322
    ERROR:INVALID_FLAG_SECOND_OF_PAIR   786354
    
    Post edited by CarolineJudy on
  • Seems most of the errors are generated using samtools fixmate. I'm using samtools vs. 1.6.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @CarolineJudy
    Hi Caroline,

    Okay. Let's go back through your steps. What tools have you run on your BAM file? Perhaps this thread may help.

    -Sheila

  • CarolineJudyCarolineJudy Member
    edited July 2

    Hi @Sheila,

    Thanks for your response! I'm currently in the pre-processing steps for the GATK Best Practices for variant calling on RNAseq. After performing the alignments, I performed sort/deduplication using samtools and picard-tools.

    # /bin/sh
    # ----------------Parameters---------------------- #
    #$ -S /bin/sh
    #$ -q mThC.q
    #$ -l mres=6G,h_data=6G,h_vmem=6G
    #$ -cwd
    #$ -j y
    #$ -N sort_deduplicate_loop
    #$ -o sort_deduplicate_loop.log
    #$ -M [email protected]
    #
    # ----------------Modules------------------------- #
    module load bioinformatics/picard-tools
    module load bioinformatics/samtools
    #
    # ----------------Your Commands------------------- #
    #
    echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
    echo + NSLOTS = $NSLOTS
    #
    sam=$1
    set --""
        newfile="$(basename $sam .sam)"
        samtools view -bS $sam | samtools sort - -o $sam.sorted.bam
        runpicard MarkDuplicates INPUT=$sam.sorted.bam OUTPUT= ../bam/${newfile}.cleaned_sorted_deduplicated.bam  METRICS_FILE=${newfile}.metricsfile MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=250 ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=True
    #
    echo = `date` job $JOB_NAME done
    
    Post edited by CarolineJudy on
  • After performing the sort/deduplication step, I ran ValidateSamFile, and then undertook the trouble shooting steps detailed in my comment from June 29th.

  • mshrutimshruti Member
    edited July 6

    Hi,

    I ran bwa mem on paired end fastq files to generate sam file. I get the following error when I run
    FixMateInformation on it:
    sing GATK jar /home/groups/euan/apps/gatk/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/groups/euan/apps/gatk/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar FixMateInformation -I /scratch/PI/euan/common/udn/externalUdnData/from_wu/bwa_DCM/06042018/dcm141/DCM141_USD16060084_HNNVYCCXX_L6.sam -O /scratch/PI/euan/common/udn/externalUdnData/from_wu/bwa_DCM/06042018/dcm141/DCM141_USD16060084_HNNVYCCXX_L6.bam --ADD_MATE_CIGAR true --ASSUME_SORTED false --IGNORE_MISSING_MATES true --SORT_ORDER coordinate
    14:20:45.697 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/groups/euan/apps/gatk/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    [Fri Jul 06 14:20:46 PDT 2018] FixMateInformation --INPUT /scratch/PI/euan/common/udn/externalUdnData/from_wu/bwa_DCM/06042018/dcm141/DCM141_USD16060084_HNNVYCCXX_L6.sam --OUTPUT /scratch/PI/euan/common/udn/externalUdnData/from_wu/bwa_DCM/06042018/dcm141/DCM141_USD16060084_HNNVYCCXX_L6.bam --SORT_ORDER coordinate --ASSUME_SORTED false --ADD_MATE_CIGAR true --IGNORE_MISSING_MATES true --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
    [Fri Jul 06 14:20:46 PDT 2018] Executing as [email protected] on Linux 3.10.0-862.3.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_144-b01; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.0.4.0
    INFO 2018-07-06 14:20:46 FixMateInformation Sorting input into queryname order.
    [Fri Jul 06 14:21:12 PDT 2018] picard.sam.FixMateInformation done. Elapsed time: 0.45 minutes.
    Runtime.totalMemory()=4734320640
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    htsjdk.samtools.SAMFormatException: Error parsing text SAM file. length(QUAL) != length(SEQ); File /scratch/PI/euan/common/udn/externalUdnData/from_wu/bwa_DCM/06042018/dcm141/DCM141_USD16060084_HNNVYCCXX_L6.sam; Line 1604998
    Line: E00489:44:HNNVYCCXX:6:1101:8389:51324 147 2 115826625 60 150M = 115826414 -361 TTGTGGTTTCATATGCTTTAGGATTTCTTTTGTATTTCTGTAAATAATACCATTGGATTTTTATGGGGCTTGCATTGATTCTGTATTTGTTTTGAATTTTTAACAATATAGATTCTTCCAATATATAAACATAGATTTTTTTGTTTATTT JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
    at htsjdk.samtools.SAMLineParser.reportErrorParsingLine(SAMLineParser.java:457)
    at htsjdk.samtools.SAMLineParser.parseLine(SAMLineParser.java:338)
    at htsjdk.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:268)
    at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:255)
    at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:228)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:576)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:548)
    at picard.sam.FixMateInformation.doWork(FixMateInformation.java:200)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)

    I get the following error when I run ValidateSamFile:
    Error Type Count
    ERROR:MISMATCH_READ_LENGTH_AND_QUALS_LENGTH 1
    ERROR:MISMATCH_SEQ_QUAL_LENGTH 1
    WARNING:MISSING_TAG_NM 1
    WARNING:RECORD_MISSING_READ_GROUP 1

    I understand that Length of sequence string and length of base quality string do not match in line 'E00489:44:HNNVYCCXX:6:1101:8389:51324' but how should I fix this?
    I am using gatk4 tools.

    Thanks,
    Shruti

  • Hi @Sheila

    Just checking back regarding my post from July 2. Any ideas on what might be going wrong?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @CarolineJudy Sorry for the lag; Sheila is on vacation so I'm looking through your issue now. To be frank, we don't have the bandwidth to provide further one-on-one support since we don't develop bwa, samtools and other such pre-processing tools. I would recommend you reach out to the samtools developers, who have their own mailing list, and see if they can help you. Good luck!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @mshruti I believe RevertSam has a SANITIZE option that should allow you to exclude these malformed reads.

  • mshrutimshruti Member

    Thanks @Geraldine_VdAuwera. I will check it out.

  • Hi @Geraldine_VDAuwera,

    To correct a missing mate problem, I ran the picard tool FixMateInformation as recommended above. However, running SamFileValidation shows that over 1M reads are still missing mate information.

    ERROR:MATE_NOT_FOUND 1457087

    Is there a way to filter these 1457087 "malformed" reads using picard tools?

    Many thanks in advance for any help you can offer,

  • Hi everyone - I resolved this problem - the missing mate issue was generated when I used MarkDuplicates with option "REMOVE_DUPLICATES=TRUE", which appears to have created orphan reads in my outputted bam file. A Biostars discussion hypothesizes that the picard tool MarkDuplicates leaves unmapped mates in the BAM for pairs where only one of the two paired reads map. I think this might be what happened to me.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @CarolineJ
    Hi,

    Thank you for posting your findings.

    -Sheila

  • CarolineJCarolineJ Member
    edited July 18

    In terms of best practices, do you have any suggestions? Currently, I plan to proceed with variant calling without performing the deduplication step, but I'm concerned with the bias that may create.

    Thanks for any input you can provide.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @CarolineJ
    Hi,

    Can you try not using REMOVE_DUPLICATES=TRUE in MarkDuplicates?

    Thanks,
    Sheila

  • Hi Sheila,

    That works! Thanks!

Sign In or Register to comment.