We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Picard ValidateSamFile failing with INVALID_TAG_NM on hg38 HLA contigs

myourshawmyourshaw University of ColoradoMember ✭✭

Picard ValidateSamFile is failing with INVALID_TAG_NM on hg38 HLA contigs when MODE=VERBOSE. The first 100 HLA reads in my BAM file failed. I assume all would fail as there were a number of different contigs among the first 100. When I validated the same BAM file with IGNORE=INVALID_TAG_NM, it passes.

Oddly, when MODE=SUMMARY, I got 'No errors found'.

I am running Picard 2.9.2 and using the GATk bundle Homo_sapiens_assembly38.fasta* as the reference. The BAM file was produced by Novoalign and processed by SortSam and SetNmMdAndUqTags. I also tried the deprecated SetNmAndUqTags. Moreover, I manually checked a few of the failing reads in the original BAM file produced by Novoalign and the records, including the NM tags, were the same as after running SetNmMdAndUqTags. I looked at the subset of the failed reads where NM was 0 and compared them directly to the sequence in the fasta file; all were a perfect match at the expected position.

*This shouldn't matter, but I replaced the tab character with two spaces to separate the contig name field in the '>' lines of of the HLA records of Homo_sapiens_assembly38.fasta. All the other records have two spaces, and the tab character was causing problems for Novoalign (to be fixed in the next release).

Picard ValidateSamFile command/stderr:

Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/run/media/yoursham/MY_6Tb_1/germline/cromwell-executions/PairedEndSingleSampleWorkflow/d626fa7a-c423-4f0c-98b2-aa0274b658e3/call-ValidateReadGroupSamFile/shard-0/execution/tmp.Et4z6V
[Tue Jun 06 00:40:48 UTC 2017] picard.sam.ValidateSamFile INPUT=/run/media/yoursham/MY_6Tb_1/germline/cromwell-executions/PairedEndSingleSampleWorkflow/d626fa7a-c423-4f0c-98b2-aa0274b658e3/call-ValidateReadGroupSamFile/shard-0/inputs/run/media/yoursham/MY_6Tb_1/germline/cromwell-executions/PairedEndSingleSampleWorkflow/d626fa7a-c423-4f0c-98b2-aa0274b658e3/call-SortAndFixReadGroupBam/shard-0/execution/NIST7035_TAAGGCGA_L001.aligned.sorted.bam OUTPUT=NIST7035_TAAGGCGA_L001.validation_report MODE=VERBOSE IGNORE=[] MAX_OUTPUT=1000000000 IS_BISULFITE_SEQUENCED=false REFERENCE_SEQUENCE=/mnt/hdd/germline/resources/gatk_bundle/Homo_sapiens_assembly38.fasta    IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Tue Jun 06 00:40:48 UTC 2017] Executing as [email protected] on Linux 3.10.0-514.16.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_131-b12; Picard version: 2.9.2-SNAPSHOT
INFO    2017-06-06 00:41:45 SamFileValidator    Validated Read    10,000,000 records.  Elapsed time: 00:00:56s.  Time for last 10,000,000:   55s.  Last read position: chr5:75,080,505
INFO    2017-06-06 00:42:37 SamFileValidator    Validated Read    20,000,000 records.  Elapsed time: 00:01:48s.  Time for last 10,000,000:   52s.  Last read position: chr11:65,355,941
INFO    2017-06-06 00:43:31 SamFileValidator    Validated Read    30,000,000 records.  Elapsed time: 00:02:41s.  Time for last 10,000,000:   53s.  Last read position: chr19:1,918,167
INFO    2017-06-06 00:44:24 SamFileValidator    Validated Read    40,000,000 records.  Elapsed time: 00:03:34s.  Time for last 10,000,000:   52s.  Last read position: */*
[Tue Jun 06 00:45:14 UTC 2017] picard.sam.ValidateSamFile done. Elapsed time: 4.42 minutes.
Runtime.totalMemory()=1348468736
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

Typical errors:

ERROR: Record 38230964, Read name HWI-D00119:50:H7AP8ADXX:1:2203:14796:98239, NM tag (nucleotide differences) in file [0] does not match reality [75]
ERROR: Record 38230965, Read name HWI-D00119:50:H7AP8ADXX:1:1209:5003:56756, NM tag (nucleotide differences) in file [1] does not match reality [76]
ERROR: Record 38230966, Read name HWI-D00119:50:H7AP8ADXX:1:1108:3556:87908, NM tag (nucleotide differences) in file [1] does not match reality [72]
ERROR: Record 38230967, Read name HWI-D00119:50:H7AP8ADXX:1:1208:5673:42488, NM tag (nucleotide differences) in file [0] does not match reality [72]
ERROR: Record 38230968, Read name HWI-D00119:50:H7AP8ADXX:1:1211:16359:18440, NM tag (nucleotide differences) in file [0] does not match reality [72]

BAM records for the above errors:

HWI-D00119:50:H7AP8ADXX:1:2203:14796:98239  99  HLA-A*11:50Q    1091    30  101M    =   1111    121 CGCCTACGACGGCAAGGATTACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACATGGCAGCTCAGATCACCAAGCGCAAGTGGGAGGCGG   @?>?=>[email protected]>@@@@[email protected]@>>>>@>>@@@@@>@[email protected]@>@@>@@>@@@@>@@>@@>@@@@@[email protected]@@[email protected]>[email protected]>@@>@>@>>?>@?>>[email protected]@@>>@>@@@>@?>??   ZA:f:30 LB:Z:NIST7035_Nextera-Rapid-Capture-Exome-and-Expanded-Exome    MD:Z:101    PG:Z:novoalign  RG:Z:H7AP8ADXX_TAAGGCGA_1_NA12878   AM:i:2  NM:i:0  SM:i:2  PQ:i:1  UQ:i:0  AS:i:0  PU:Z:H7AP8ADXX_TAAGGCGA_1_NA12878
HWI-D00119:50:H7AP8ADXX:1:1209:5003:56756   163 HLA-A*11:50Q    1092    30  101M    =   1286    295 GCCTACGACGGCAAGGATTACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACATGGCAGCTCAGATCACCAAACGCAAGTGGGAGGCGGC   [email protected]@>[email protected]@[email protected]@@[email protected]<==<@9<@[email protected]@@>@>[email protected]@9??>@@[email protected]@@@>@>[email protected]@<[email protected]<?><??<?<>>:>.??==:>>>@=9>=>.>>[email protected]??9>>[email protected]   ZA:f:30 LB:Z:NIST7035_Nextera-Rapid-Capture-Exome-and-Expanded-Exome    MD:Z:83G17  PG:Z:novoalign  RG:Z:H7AP8ADXX_TAAGGCGA_1_NA12878   AM:i:2  NM:i:1  SM:i:2  PQ:i:26 UQ:i:13 AS:i:18 PU:Z:H7AP8ADXX_TAAGGCGA_1_NA12878
HWI-D00119:50:H7AP8ADXX:1:1108:3556:87908   163 HLA-A*11:50Q    1101    30  101M    =   1162    162 GGCAAGGATTACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACATGGCAGCCCAGATCACCAAGCGCAAGTGGGAGGCGGCCCGTCGGGC   ?>@;<@@>[email protected]@@@@[email protected]=>@[email protected]@>@@>@>@@[email protected]@[email protected][email protected]@@@@[email protected]@@[email protected]>@[email protected]?+<[email protected]<:[email protected]?/2<>9:/;;;@@@>=:[email protected]@@==>;>[email protected]@>   ZA:f:27 LB:Z:NIST7035_Nextera-Rapid-Capture-Exome-and-Expanded-Exome    MD:Z:62T38  PG:Z:novoalign  RG:Z:H7AP8ADXX_TAAGGCGA_1_NA12878   AM:i:1  NM:i:1  SM:i:3  PQ:i:13 UQ:i:10 AS:i:13 PU:Z:H7AP8ADXX_TAAGGCGA_1_NA12878
HWI-D00119:50:H7AP8ADXX:1:1208:5673:42488   163 HLA-A*11:50Q    1111    30  101M    =   1143    133 ACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACATGGCAGCTCAGATCACCAAGCGCAAGTGGGAGGCGGCCCGTCGGGCGGAGCAGCGG   ;@;>@@[email protected]@[email protected][email protected]@>@@>@@>@@@@>@@[email protected]@[email protected]@@@@@@@@>@>>@@@>@@[email protected]>@>>@[email protected]@>[email protected]@[email protected]=>?>[email protected]@>@@@@@@@@@>@[email protected]@[email protected]@>@?>@@=>   ZA:f:27 LB:Z:NIST7035_Nextera-Rapid-Capture-Exome-and-Expanded-Exome    MD:Z:101    PG:Z:novoalign  RG:Z:H7AP8ADXX_TAAGGCGA_1_NA12878   AM:i:1  NM:i:0  SM:i:25 PQ:i:0  UQ:i:0  AS:i:0  PU:Z:H7AP8ADXX_TAAGGCGA_1_NA12878
HWI-D00119:50:H7AP8ADXX:1:1211:16359:18440  163 HLA-A*11:50Q    1111    30  101M    =   1135    125 ACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACATGGCAGCTCAGATCACCAAGCGCAAGTGGGAGGCGGCCCGTCGGGCGGAGCAGCGG   ;@;>@@[email protected]@[email protected][email protected]@>@@>@@>@@@@>@@[email protected]@[email protected]@@@@@@@@>@>>[email protected]>@@[email protected]>@>>@>@@>>@[email protected]@>>@>@@@>@@@@@@@@@>?>@@?>[email protected]>@@@@   ZA:f:27 LB:Z:NIST7035_Nextera-Rapid-Capture-Exome-and-Expanded-Exome    MD:Z:101    PG:Z:novoalign  RG:Z:H7AP8ADXX_TAAGGCGA_1_NA12878   AM:i:1  NM:i:0  SM:i:25 PQ:i:1  UQ:i:0  AS:i:0  PU:Z:H7AP8ADXX_TAAGGCGA_1_NA12878

Issue · Github
by Sheila

Issue Number
2142
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
sooheelee

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭

    @myourshaw
    Hi,

    I am going to check with the team. Someone should get back to you soon.

    -Sheila

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited June 2017

    Hi @myourshaw,

    You are aligning with Novoalign to a reference with alternate haplotype contigs. I'm not familiar with Novoalign. Does it align in an alt-contig aware manner like BWA? I talk about this special mode of BWA alignment in this blogpost and this tutorial.

    HLA loci are highly variable across the population and so they have many alternate contigs in GRCh38. I see that the alignments you list are paired end and are flagged as properly paired to an alternate contig HLA-A*11:50Q. However, the mapping score of 30 that your five alignments show, well, it appears that for Novoalign this is a low score because Novoalign's max best score is 150. This is in line with alignments that could be multimappers, as should be the case for alignments to HLA loci. Have you taken a look at all of the alignments as a set for these reads you list, e.g. the mate and any supplementary and secondary alignments? Do the NM tags in these match the CIGAR string? You can grep the sets out with, e.g. samtools view bam | grep 'HWI-D00119:50:H7AP8ADXX:1:1211:16359:18440'.

    Also, I'd like a bit more clarification on your workflow. You are using the PairedEndSingleSampleWorkflow with an alignment type that we do not use ourselves. So I am wondering if there are any other changes to the pipeline, e.g. do you still use MergeBamAlignment?

    I suspect the reads you list by themselves are not the cause of the error and there is something going on with the alignment sets. I am not sure why there is a difference in the results when you change ValidateSamFile's MODE so I'd like to find out more about your data. I am wondering if the downstream GATK tools you use error on your data?

  • myourshawmyourshaw University of ColoradoMember ✭✭
    edited June 2017

    I think it is important that the NM errors claim reality is very high, usually >70, whereas the NMs in the file are mostly 0 or 1. In the easy-to-search case where NM is 0, I was able to confirm that the read indeed was a perfect match in the mapped contig and at the expected position. So whatever else may be going on, those reads represent correct alignments and I would not expect the best alignment to have an NM in the 70s.

    Is it possible that ValidateSamFile is looking at a different contig by mistake in order to come up with such a divergent NM?

    To answer your specific questions:

    I am familiar with your blog and tutorial on hg38, which were very helpful. I have been unable to implement BWA's alt mode because bwakit contained invalid hard-coded links to the reference.

    Novoalign does align in an alt-contig aware manner and I am using that option.

    I have not looked systematicaly at the supplementary and alternate alignments. I did grep a few of the reads as you suggested and only one pair of reads was in the BAM file for each ID.

    The NM tags match the CIGAR strings.

    My relevant workflow is: uBAM -> Novoalign -> SortSam -> SetNmMdAndUqTags -> ValidateSamFile -> MarkDuplicates -> BQSR, etc.
    - I have the same problem whether I go through SetNmMdAndUqTags, SetNmAndUqTags, or skip that step. I could not find SetNmMdAndUqTags in the GATK4 alpha; has it been deprecated?

    I do not use MergeBamAlignment. Novoalign does not drop tags from the uBAM. Initially, I tried MergeBamAlignment but it corrupted the data by such things as creating H CIGAR tags inside S tags, re-adding hard clipped regions and labelling them as S, and other issues.

    I'm currently using IGNORE=INVALID_TAG_NM to bypass the validation issue, and the BAM files are processed without error by all downstream tools such as BQSR, HaplotypeCaller, and the various Metrics tools, all the way to jointly called VCFs.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited June 2017

    Thanks for the update @myourshaw. My next guess is that Picard ValidateSamFile may not be parsing the new reference nomenclature correctly as was the case for GATK versions prior to v3.6. I talk about the issue in http://gatkforums.broadinstitute.org/gatk/discussion/7857/reference-genome-components. I'll test this possibility out.

  • myourshawmyourshaw University of ColoradoMember ✭✭

    Those darn colons :)

  • yfarjounyfarjoun Broad InstituteDev ✭✭✭

    can you show the header, in the sam-files you hand-crafted? I'm looking at the code and having a hard time understanding how the colons have anything to do with this....also, we validate all our files...so this can't be so simple.

Sign In or Register to comment.