UG default filtering of improperly aligned pairs, v1.6.7
I hoped you might help with questions about the version 1.6.7 UG's default read filters. I checked the help at the console for the v1.6.7 UG, but did not see answers there, and I believe the documentation for 1.6.7 has been retired. I am reviewing some UG runs done with that version.
While looking for improper paired-end alignments in a BAM file that I had given to the UG, I tested the FLAG value (2nd column of the SAM-formatted version of the file) for an unset 0x2 bit. As the SAM documentation indicates, the paired-end reads with that bit unset all look to be improper paired-end alignments: some were aligned to different chromosomes, others have insert sizes well outside a reasonable range. When I checked these "improperly aligned pairs" for the 0x4 (segment unmapped) flag, it was not set in many of them, so I guess the UG would not filter them as unmapped, if it relies on that flag. Further, many had mapping qualities > 20.
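For reference, here is a minimal Python sketch of the flag test described above. It only decodes the SAM FLAG bits and sorts improper pairs into rough categories; it does not reproduce whatever the UG does internally. The function name `classify_sam_line` and the `max_insert` cutoff of 1000 are my own illustrative assumptions, not anything taken from GATK.

```python
# SAM FLAG bits, as defined in the SAM specification.
PAIRED = 0x1         # template has multiple segments (paired-end read)
PROPER_PAIR = 0x2    # each segment properly aligned according to the aligner
UNMAPPED = 0x4       # segment unmapped
MATE_UNMAPPED = 0x8  # next segment in the template unmapped


def classify_sam_line(line, max_insert=1000):
    """Classify one SAM alignment line by its pairing status.

    max_insert is a hypothetical threshold for a "reasonable" insert size;
    choose a value that suits the library being examined.
    """
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])    # column 2: FLAG
    rname = fields[2]        # column 3: reference (contig) name
    rnext = fields[6]        # column 7: mate's contig, "=" if same as RNAME
    tlen = int(fields[8])    # column 9: observed template length

    if not flag & PAIRED:
        return "unpaired"
    if flag & PROPER_PAIR:
        return "proper"
    if flag & UNMAPPED:
        return "improper_read_unmapped"
    if flag & MATE_UNMAPPED:
        return "improper_mate_unmapped"
    if rnext not in ("=", rname):
        return "improper_different_contig"
    if abs(tlen) > max_insert:
        return "improper_large_insert"
    return "improper_other"
```

Reads landing in the "improper_different_contig" and "improper_large_insert" categories are the two groups asked about below.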
I hoped you might advise to what extent the 1.6.7 UG filters out these improperly aligned paired-end reads by default. Does it filter all paired reads with the 0x2 bit unset as "improperly aligned," or perhaps only paired reads whose mates aligned to different contigs (chromosomes), without filtering for improper insert size?
Perhaps relatedly, I noticed that the total number of reads reported in the UG console output (~13 million) was well under a line count of the reads in the input BAM (~26 million). However, it was well over the total (~8 million) reads remaining after I filtered the BAM to include only reads overlapping the same BED intervals I had given the UG with the "-L" option. (The UG also reported small totals, ~141k and 12k respectively, for reads failing BadMateFilter and UnmappedReadFilter.)
I wonder if UG's total was counting all reads, in or outside of the bed intervals, but after a prefiltering, including perhaps having filtered out the "improperly aligned" pairs as noted?
I also counted ~500k (vs the 12k reported by the UG) unmapped reads in the bed-filtered bam/sam file. This was a count of the entries with the 0x4 "segment unmapped" flag set. I was not sure if the UG was prefiltering for some majority of unmapped reads, then finding others during the genotyping.
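In case it helps anyone reproduce counts like the ones above, this is a rough sketch of a tally over SAM text lines. The flag bits are from the SAM specification; the helper name `tally_flags` and the choice of which bits to count are my own assumptions, and nothing here is meant to mirror the UG's actual filters.

```python
from collections import Counter

# Selected SAM FLAG bits from the SAM specification.
FLAG_BITS = {
    "paired": 0x1,
    "proper_pair": 0x2,
    "unmapped": 0x4,
    "mate_unmapped": 0x8,
    "secondary": 0x100,
    "qc_fail": 0x200,
    "duplicate": 0x400,
}


def tally_flags(sam_lines):
    """Count alignment records and how many have each flag bit set."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # column 2: FLAG
        counts["total"] += 1
        for name, bit in FLAG_BITS.items():
            if flag & bit:
                counts[name] += 1
    return counts
```

The function accepts any iterable of lines, so it could be fed an open SAM file or `sys.stdin` when piping from `samtools view`. Comparing `counts["total"]` with `counts["unmapped"]`, `counts["duplicate"]` and so on may help show which categories account for the gap between a raw line count and the total a tool reports.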
I apologize for the length of my inquiry, and want to thank the GATK team again for making this valuable forum available.