UG default filtering of improperly aligned pairs, v1.6.7

Hello,

I hoped you might help with questions about the version 1.6.7 UG's default read filters. I checked the help at the console for the v1.6.7 UG, but did not see answers there, and I believe the documentation for 1.6.7 has been retired. I am reviewing some UG runs done with that version.

While looking for improper paired-end alignments in a BAM file that I had given to the UG, I tested the FLAG value (2nd column of the SAM-formatted version of the file) for an unset 0x2 bit. As the SAM documentation indicates, the paired-end reads with that bit unset all look like improper paired-end alignments: some mates aligned to different chromosomes, others had insert sizes well outside a reasonable range. When I checked these "improperly aligned pairs" for the 0x4 (segment unmapped) flag, many did not have it set, so I guess the UG would not filter them as unmapped, if it uses that flag. Furthermore, many had mapping qualities > 20.
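The flag test described above can be sketched in a few lines of Python. This is only an illustration of the bit arithmetic: the helper names and the three sample records are made up, and the bit values are those given in the SAM specification.

```python
# SAM FLAG bits used below (values from the SAM specification).
PAIRED = 0x1        # template has multiple segments (paired-end read)
PROPER_PAIR = 0x2   # aligner considers the pair properly aligned
UNMAPPED = 0x4      # this segment is unmapped


def flag_of(sam_line):
    """Return the integer FLAG (2nd tab-separated column) of a SAM record."""
    return int(sam_line.split("\t")[1])


def is_improper_pair(flag):
    """True for a paired read whose 0x2 bit is unset."""
    return bool(flag & PAIRED) and not flag & PROPER_PAIR


def is_unmapped(flag):
    """True when the 0x4 (segment unmapped) bit is set."""
    return bool(flag & UNMAPPED)


# Made-up records for illustration only (hypothetical read names and positions).
records = [
    "r1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",      # proper pair
    "r2\t65\tchr1\t100\t37\t50M\tchr5\t900\t0\t*\t*",     # mate on another contig
    "r3\t77\t*\t0\t0\t*\t*\t0\t0\t*\t*",                  # read and mate unmapped
]

# Reads that are flagged improper but are still mapped themselves.
improper_but_mapped = [
    line.split("\t")[0]
    for line in records
    if is_improper_pair(flag_of(line)) and not is_unmapped(flag_of(line))
]
print(improper_but_mapped)
```

Reads such as `r2` are the case in question: the 0x2 bit is unset, but the 0x4 bit is unset too, so a filter that looks only at the unmapped bit would let them through.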

I hoped you might advise to what extent the 1.6.7 UG was filtering out these improperly aligned paired-end reads by default. Does it filter all paired reads with the 0x2 bit unset as "improperly aligned," or perhaps only paired reads whose mates aligned to different contigs (chromosomes), without filtering for improper insert size?

Perhaps relatedly, I noticed that the total number of reads reported in the UG console output (~13 million) was well under a line count of the total reads in the input BAM (~26 million). However, it was well over the total (~8 million) reads left in the BAM file after filtering it to include only reads overlapping the same BED intervals I had given the UG (with the "-L" option). (The UG also gave small totals (~141k, 12k) for reads failing BadMateFilter and UnmappedReadFilter.)

I wonder if UG's total was counting all reads, in or outside of the bed intervals, but after a prefiltering, including perhaps having filtered out the "improperly aligned" pairs as noted?

I also counted ~500k (vs the 12k reported by the UG) unmapped reads in the bed-filtered bam/sam file. This was a count of the entries with the 0x4 "segment unmapped" flag set. I was not sure if the UG was prefiltering for some majority of unmapped reads, then finding others during the genotyping.
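For reference, the kind of count described here can be sketched as below. It assumes the BAM has already been converted to SAM text; `tally_sam` and the sample lines are hypothetical, and header lines (starting with `@`) are skipped so they are not counted as reads.

```python
from collections import Counter


def tally_sam(lines):
    """Count reads, unmapped reads, and improper pairs in SAM text lines."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and header lines, which are not read records.
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        counts["total"] += 1
        if flag & 0x4:                      # segment unmapped
            counts["unmapped"] += 1
        if flag & 0x1 and not flag & 0x2:   # paired, proper-pair bit unset
            counts["improper_pair"] += 1
    return counts


# Made-up input for illustration only.
sample = [
    "@HD\tVN:1.0\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "r2\t65\tchr1\t100\t37\t50M\tchr5\t900\t0\t*\t*",
    "r3\t77\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
counts = tally_sam(sample)
print(dict(counts))
```

Note that a read with both mates unmapped (flag 77 here) falls into both the `unmapped` and `improper_pair` tallies, so the two counts are not disjoint.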

I apologize for the length of my inquiry, and want to thank the GATK team again for making this valuable forum available.

Comments

  • ebanks, Broad Institute (Member, Broadie, Dev) ✭✭✭✭

    Hi there,

    We generally don't support older versions of the GATK tools, but since I think I know the answer I will try to help. The Unified Genotyper ignores reads whose unmapped bit in the Flag value of the SAM record is set, which should be the 0x4 bit. In addition, it ignores reads whose mate maps to a different contig.

    As for the number of reads being printed out by the traversal summary, I'm fairly certain that you cannot trust those numbers exactly in the older versions of the tool. You should run the Flagstat tool with the latest version of the GATK to get more accurate counts.
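The two rules ebanks describes (unmapped bit set, or mate mapped to a different contig) can be written down as a small predicate. This is a sketch of the description in this reply only, not GATK's actual filter code; the function name is hypothetical, and `rname`/`rnext` stand for SAM columns 3 and 7.

```python
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def ignored_by_described_filters(flag, rname, rnext):
    """Apply the two rules from the reply above to one SAM record.

    Assumption: a mate counts as "on a different contig" only when it is
    mapped and RNEXT names a contig other than RNAME ("=" means same contig,
    "*" means no information).
    """
    if flag & UNMAPPED:
        return True
    mate_mapped = bool(flag & PAIRED) and not flag & MATE_UNMAPPED
    mate_elsewhere = rnext not in ("=", "*") and rnext != rname
    return mate_mapped and mate_elsewhere


print(ignored_by_described_filters(99, "chr1", "="))     # proper pair
print(ignored_by_described_filters(65, "chr1", "chr5"))  # mate on another contig
print(ignored_by_described_filters(77, "*", "*"))        # read unmapped
print(ignored_by_described_filters(65, "chr1", "="))     # improper, same contig
```

Under these two rules alone, an improper pair whose mates sit on the same contig (for example, one with a very large insert size) is not dropped, because neither rule looks at the 0x2 bit or at the insert size.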

  • fcuisine (Member)

    I very much appreciate your time and the information about the older version, especially since it's generally no longer supported.

    Just one last question: can I assume, as your answer indicates, that pairs whose alignment shows a very large insert size (e.g. ~1e5) would not have been filtered out, as long as the unmapped bit was not set?

    Thanks again.

  • ebanks, Broad Institute (Member, Broadie, Dev) ✭✭✭✭
  • Alex747 (Member)

    hi,
    Further to the above query by fcuisine, I'm wanting to know which bitwise flags are actually used, rather than filtered, by the GATK Unified Genotyper (v2.2-15). For example, by default, does GATK UG only use reads in a SAM alignment file that have the flags 99, 147, 83, 163 (most accurately mapped) OR are there others that are also used - e.g. 89, 121 (where one of the mates in a read pair is unmapped)?
    Thanks,
    Alex
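For readers unfamiliar with the numeric flags quoted in this question, here is a small decoder. The bit values follow the SAM specification; the short labels and the `decode_flag` helper are just for illustration.

```python
# (bit, label) pairs in ascending bit order, per the SAM specification.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]


for value in (99, 147, 83, 163, 89, 121):
    print(value, decode_flag(value))
```

Flags 99, 147, 83 and 163 all have the proper-pair bit (0x2) set. Flags 89 and 121 describe a mapped read (0x4 unset) whose mate is unmapped (0x8 set), so the proper-pair bit is unset for them.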

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Hi Alex,

    See this thread for a list of the read filters that the UG uses by default:

    http://gatkforums.broadinstitute.org/discussion/1457/recommended-filters-for-unifiedgenotyper

    Anything not filtered out will be used.
