# PICARD AlignmentSummaryMetrics

Member Posts: 42

Hi,
I used the following command to gather summary metrics for my BAM file generated via bowtie2 (TopHat, to be specific):

java -jar /usr/share/picard-tools-1.136/picard.jar CollectAlignmentSummaryMetrics INPUT=Sample_DY10.tophat.bam OUTPUT=tmpmetrics/alignmentmetrics R=/mnt/storage/ref_genome/Homo_sapiens/UCSC_hg19/UCSC/hg19/Sequence/Bowtie2Index/genome.fa

The output file is attached.
My question is about the PF_HQ_MEDIAN_MISMATCHES metric, which has a very high value (66). When I look at the NM tag in the BAM file, I see that the median is 1, with a maximum NM of 2.
I am wondering how Picard calculates this number.

Any help is appreciated.
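For readers who want to repeat the NM check described in this post, here is a minimal sketch of computing the median and maximum `NM` tag value from SAM-formatted text. It assumes the records are available as tab-separated SAM lines (for example, text produced by `samtools view`); the function names and the three toy records are hypothetical and are not taken from the poster's file.

```python
from statistics import median

def nm_values(sam_lines):
    """Yield the NM:i value of every alignment record that carries one."""
    for line in sam_lines:
        if line.startswith("@"):  # header lines start with '@'; skip them
            continue
        fields = line.rstrip("\n").split("\t")
        # Optional tags start at the 12th column (index 11).
        for tag in fields[11:]:
            if tag.startswith("NM:i:"):
                yield int(tag[len("NM:i:"):])
                break

def nm_summary(sam_lines):
    """Return (median NM, max NM) over all records that have an NM tag."""
    values = list(nm_values(sam_lines))
    return median(values), max(values)

def toy_record(name, nm):
    """Build a hypothetical minimal SAM record with the given NM value."""
    return "\t".join([name, "0", "chr1", "100", "50", "4M", "*", "0", "0",
                      "ACGT", "IIII", f"NM:i:{nm}"])

toy_sam = ["@HD\tVN:1.5"] + [toy_record(f"read{i}", i) for i in range(3)]
print(nm_summary(toy_sam))  # (1, 2)
```

Note that `NM` is the aligner's edit distance for the aligned portion of the read, so it is not necessarily computed the same way as the Picard metric.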

#### Issue · Github September 2016 by Sheila

Issue Number: 1315
State: closed
Closed By: sooheelee

@newbie16
Hi,

This article describes the metric as "The median number of mismatches versus the reference sequence in reads that were aligned to the reference at high quality (i.e. PF_HQ_ALIGNED READS)". However, I am not sure what an appropriate value for the metric is. I will check with the team and get back to you.

-Sheila

Hi @newbie16,

We've narrowed down your excessive PF_HQ_MEDIAN_MISMATCHES to three possibilities: CIGAR string S bases (soft clips) are counted towards mismatches, CIGAR string N bases (reference-skip bases, e.g. for intronic sequences) are counted, or both. Considering these types of bases in your alignment records, does your excessive median mismatch value make sense?
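To check how many S and N bases a record carries, one option is to total the CIGAR operations per type. The sketch below is only an illustration; the function name `cigar_op_lengths` is hypothetical, and the example CIGAR `31M4919N69M` is taken from a record posted later in this thread.

```python
import re
from collections import Counter

# One CIGAR element is a length followed by a single operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def cigar_op_lengths(cigar):
    """Return the total number of bases per CIGAR operation."""
    totals = Counter()
    for length, op in CIGAR_OP.findall(cigar):
        totals[op] += int(length)
    return totals

totals = cigar_op_lengths("31M4919N69M")
print(totals["M"], totals["N"], totals["S"])  # 100 4919 0
```

For this spliced read, the N operation spans far more reference bases than the read itself has aligned bases, and there are no soft clips.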

As for which reads this metric is calculated on, I believe they must have a MAPQ of at least 20 (and therefore must be aligned) and cannot be supplementary. The tool takes the alignment blocks in each record, as defined by the CIGAR string, and iterates over each block, adding to the mismatch count by directly comparing each read base to the reference base. Comparisons are case-insensitive.
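The block-by-block comparison described above can be sketched roughly as follows. This is not Picard's code; it is a simplified illustration that assumes alignment blocks correspond to the M, = and X operations, uses 0-based reference offsets, and works on a hypothetical toy reference and read. The names `alignment_blocks` and `count_mismatches` are made up for this example.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def alignment_blocks(cigar, ref_start=0):
    """Yield (read_offset, ref_offset, length) for each aligned block (M, =, X)."""
    read_pos, ref_pos = 0, ref_start
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":      # aligned bases: consume read and reference
            yield read_pos, ref_pos, n
            read_pos += n
            ref_pos += n
        elif op in "IS":     # insertion / soft clip: consume read only
            read_pos += n
        elif op in "DN":     # deletion / reference skip: consume reference only
            ref_pos += n
        # H and P consume neither read nor reference

def count_mismatches(read, cigar, ref, ref_start=0):
    """Count case-insensitive base mismatches within the aligned blocks."""
    mismatches = 0
    for read_off, ref_off, n in alignment_blocks(cigar, ref_start):
        for i in range(n):
            if read[read_off + i].upper() != ref[ref_off + i].upper():
                mismatches += 1
    return mismatches

ref = "ACGTnnnnACGT"   # toy reference; the lowercase n's stand in for an intron
read = "acgtACGA"      # lowercase bases still match because of .upper()

print(count_mismatches(read, "4M4N4M", ref))  # 1
print(count_mismatches(read, "8M", ref))      # 4
```

The second call treats the same read as one contiguous 8-base block, so the bases after the splice junction are compared against the wrong reference positions. It shows, on toy data only, how mishandling a reference skip could inflate a mismatch count.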

Someone from the team informs me that RNA samples typically have a PF_HQ_MEDIAN_MISMATCHES value of around 0-2, so what I wrote above may be wrong. Can you post some of your alignment records so we can take a look at the SAM flag values, CIGAR strings, etc.?

Hi,
Thanks for looking into it. I have uploaded a sample BAM to Google Drive at the link below. The PF_HQ_MEDIAN_MISMATCHES value for this file was 66.

Hi,
@DFFFDHDHFBFFG>EHHGJEGIFHGBGC@FH@GEGBFGHGGGGG@CGBCGGIEGG)=(=@=CG=C>EEEHBDECCBD?CDCCBBD>A>4:AC<?AA> AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100 NM:i:0 XS:A:- NH:i:1 RG:Z:Sample_14011001
HISEQ:262:C99J2ACXX:8:2206:7109:2528 83 chr1 4777618 50 31M4919N69M = 4776766 -5871 TATTAATTTTTGCTTGAAAAGTATCAGCACCCTCTTCAACCAGCTGGACTCCATAATCCCTCTTAAGCGGCTGGATGGTCACACCTCTCCCATTCACAAG @DCCC>DCA@A;;(CED@CCFCEAHGIIGIFF=HFJIIIHF:IEIFGGGBGIIIHFGDGIGHGHFCFCJIHGBFHFHFFFFFCCC AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100 NM:i:0 XS:A:- NH:i:1 RG:Z:Sample_14011001

@newbie16
Hi,

In this case, it is okay to use `-U ALLOW_N_CIGAR_READS`. We added a note to the article that the error message points you to.

-Sheila

Thanks @shlee and @Sheila.
Once I got rid of the N's, PF_HQ_MEDIAN_MISMATCHES looks OK, i.e. 0.