PICARD AlignmentSummaryMetrics

newbie16 Member Posts: 42

Hi
I used the following command to gather summary metrics for my bam file generated via bowtie2 (tophat to be specific):

java -jar /usr/share/picard-tools-1.136/picard.jar CollectAlignmentSummaryMetrics INPUT=Sample_DY10.tophat.bam OUTPUT=tmpmetrics/alignmentmetrics R=/mnt/storage/ref_genome/Homo_sapiens/UCSC_hg19/UCSC/hg19/Sequence/Bowtie2Index/genome.fa

The output file is attached.
My question is that the metric PF_HQ_MEDIAN_MISMATCHES has a very high value (66). When I look at the NM tag in the bam file, I see that the median is 1, with max NM = 2.
I am wondering how this number is calculated by PICARD.

Any help is appreciated.
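
For reference, the NM check described in the question can be reproduced with a short script. This is a minimal sketch, assuming the alignments are available as SAM text lines (for example from `samtools view -h`); the function name `nm_values` and the three toy records are hypothetical. Note that NM is the edit distance to the reference, so it counts inserted and deleted bases as well as mismatches, and it need not match Picard's mismatch count exactly.

```python
import statistics

def nm_values(sam_lines):
    """Yield the NM:i value of every alignment record that carries one."""
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        for tag in fields[11:]:   # optional tags start at column 12
            if tag.startswith("NM:i:"):
                yield int(tag[5:])
                break

# Toy records (hypothetical), not taken from the poster's BAM file.
records = [
    "@HD\tVN:1.4",
    "r1\t0\tchr1\t100\t50\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:0",
    "r2\t0\tchr1\t200\t50\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:1",
    "r3\t0\tchr1\t300\t50\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:2",
]

values = list(nm_values(records))
print(statistics.median(values), max(values))  # prints: 1 2
```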

Attachment: temp.xlsx (9K)

Issue · Github, by Sheila
Issue Number: 1315
State: closed
Closed By: sooheelee

Answers

  • Sheila Broad Institute Member, Broadie, Moderator Posts: 4,724 admin

    @newbie16
    Hi,

    It says the metric is "The median number of mismatches versus the reference sequence in reads that were aligned to the reference at high quality (i.e. PF_HQ_ALIGNED READS)" in this article. However, I am not sure what an appropriate number is for the metric. I will check with the team and get back to you.

    -Sheila

  • shlee Cambridge Member, Broadie, Moderator Posts: 528 admin

    Hi @newbie16,

    We've narrowed down your excessive PF_HQ_MEDIAN_MISMATCHES to three possibilities. Either CIGAR string S bases (softclips) are counted towards mismatches, or CIGAR string N bases (reference-skip bases, e.g. for intronic sequences) are, or both. Considering these types of bases in your alignment records, does your excessive median mismatch count make sense?

    As for the reads for which this metric is calculated, I believe they must have MAPQ > 20 (and therefore must be aligned) and cannot be supplementary. The tool takes the alignment blocks in the record, as defined by the CIGAR string, and iterates over each of them, adding to the mismatch count by directly comparing each base to the reference. Comparisons are case-insensitive.
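
The block-by-block comparison described above can be sketched as follows. This is an illustration only, not Picard's actual code: the helper names `alignment_blocks` and `count_mismatches` are hypothetical, the reference is assumed to be a plain string for one contig, and only M/=/X runs are treated as alignment blocks, so soft-clipped (S) and reference-skip (N) bases are never compared.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def alignment_blocks(pos, cigar):
    """Return (read_offset, ref_start, length) per M/=/X run; pos is 1-based."""
    blocks = []
    read_off, ref_pos = 0, pos
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":      # aligned bases: consume read and reference
            blocks.append((read_off, ref_pos, n))
            read_off += n
            ref_pos += n
        elif op in "IS":     # insertion / soft clip: consume read only
            read_off += n
        elif op in "DN":     # deletion / reference skip: consume reference only
            ref_pos += n
        # H and P consume neither read nor reference
    return blocks

def count_mismatches(seq, pos, cigar, reference):
    """Count case-insensitive base mismatches inside the alignment blocks."""
    mismatches = 0
    for read_off, ref_start, n in alignment_blocks(pos, cigar):
        for i in range(n):
            read_base = seq[read_off + i].upper()
            ref_base = reference[ref_start - 1 + i].upper()
            if read_base != ref_base:
                mismatches += 1
    return mismatches

# Toy contig (hypothetical): two 4-base "exons" around an 8-base "intron".
reference = "ACGT" + "TTTTTTTT" + "GGCC"
print(count_mismatches("acgaGGCC", 1, "4M8N4M", reference))  # prints: 1
print(count_mismatches("TTACGT", 1, "2S4M", reference))      # prints: 0
```

Under this reading, a spliced read such as the first example contributes only its true base mismatch, regardless of how long the N run is.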

  • shlee Cambridge Member, Broadie, Moderator Posts: 528 admin

    @newbie16,

    Someone from the team informs me that RNA samples typically have a PF_HQ_MEDIAN_MISMATCHES value of around 0-2. So what I wrote above may be wrong. Can you post some of your alignment records so we can take a look at the SAM flag values, CIGAR strings, etc.?

  • newbie16 Member Posts: 42

    Hi
    Thanks for looking into it. I have uploaded a sample bam to Google Drive at the link below. The PF_HQ_MEDIAN_MISMATCHES value for this file was 66.

    https://drive.google.com/open?id=0B2tk2ztVP7NIZHQ1V1pFNGxJeU0

  • newbie16 Member Posts: 42

    Hi
    @DFFFDHDHFBFFG>EHHGJEGIFHGBGC@FH@GEGBFGHGGGGG@CGBCGGIEGG)=(=@=CG=C>EEEHBDECCBD?CDCCBBD>A>4:AC<?AA> AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100 NM:i:0 XS:A:- NH:i:1 RG:Z:Sample_14011001
    HISEQ:262:C99J2ACXX:8:2206:7109:2528 83 chr1 4777618 50 31M4919N69M = 4776766 -5871 TATTAATTTTTGCTTGAAAAGTATCAGCACCCTCTTCAACCAGCTGGACTCCATAATCCCTCTTAAGCGGCTGGATGGTCACACCTCTCCCATTCACAAG @DCCC>DCA@A;;(CED@CCFCEAHGIIGIFF=HFJIIIHF:IEIFGGGBGIIIHFGDGIGHGHFCFCJIHGBFHFHFFFFFCCC AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100 NM:i:0 XS:A:- NH:i:1 RG:Z:Sample_14011001

    Could you please help?

  • Sheila Broad Institute Member, Broadie, Moderator Posts: 4,724 admin

    @newbie16
    Hi,

    In this case, it is okay to use -U ALLOW_N_CIGAR_READS. We added a note in that article the error message points you to :smile:

    -Sheila

  • newbie16 Member Posts: 42

    Thanks @shlee and @Sheila
    Once I got rid of the N's, PF_HQ_MEDIAN_MISMATCHES looks OK, i.e. 0.
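
The thread does not say how the N's were removed. As one hypothetical way to test the same idea, the sketch below separates SAM text records into those whose CIGAR contains a reference-skip (N) operation and those without one, so the metric can be compared on each subset. The names `has_reference_skip` and `split_by_splicing` are illustrative assumptions, and the toy records are not from the poster's file; only the CIGAR string `31M4919N69M` is taken from the record posted above.

```python
import re

def has_reference_skip(cigar):
    """True if the CIGAR string contains at least one N (reference skip) op."""
    return re.search(r"\d+N", cigar) is not None

def split_by_splicing(sam_lines):
    """Partition SAM text lines into (headers, unspliced, spliced) lists."""
    headers, unspliced, spliced = [], [], []
    for line in sam_lines:
        if line.startswith("@"):       # header lines are kept separately
            headers.append(line)
            continue
        cigar = line.split("\t")[5]    # CIGAR is the 6th SAM column
        if has_reference_skip(cigar):
            spliced.append(line)
        else:
            unspliced.append(line)
    return headers, unspliced, spliced

# Toy records (hypothetical) with shortened sequences for readability.
lines = [
    "@HD\tVN:1.4",
    "r1\t83\tchr1\t4777618\t50\t31M4919N69M\t=\t4776766\t-5871\t*\t*",
    "r2\t99\tchr1\t4776766\t50\t100M\t=\t4777618\t5871\t*\t*",
]
headers, unspliced, spliced = split_by_splicing(lines)
print(len(headers), len(unspliced), len(spliced))  # prints: 1 1 1
```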
