Ahoy mates, be ye unmapped?
If you follow along with the recently published Tutorial#6483 on mapping and cleaning short reads, you will find a passage noting two types of mate-unmapped records. We've now defined these more officially in our Dictionary.
Unofficially, I've been calling these two classes of mate-unmapped sets "wendy & peterpan" and "divorced" to further distinguish them from orphan reads. If we are going to anthropomorphize read records, then I say it's open seas and to each their own ship.
Wendy & Peterpan
The original mate-unmapped records: mapped reads with mates that are indeed unmapped. These arise from one-end anchored inserts. The unmapped mates in these sets were marked as such at alignment. They were lost to begin with, just like the Lost Boys of Peter Pan, and exist in an alternate plane in how they populate the BAM file. If you look at these records, the reference name and position fields (columns 3 and 4 of the record) indicate they align, in fact identically to the mapped mate. However, they have a MAPQ of zero and a Tinkerbell-esque asterisk (*) in the CIGAR field that together indicate the record is unmapped. In sorted BAMs, these peterpan records sort alongside their mapped mates, or wendys. This has the intended effect of keeping the pairs together through the thick and thin of file manipulation.
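To make the description above concrete, here is a minimal sketch that labels a single SAM text record as wendy or peterpan. It relies only on the flag bits defined in the SAM specification plus the MAPQ and CIGAR cues mentioned above. The function name and the two example records are hypothetical, and secondary alignments are deliberately left unclassified.

```python
# SAM flag bits, as defined in the SAM specification.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100


def classify_mate_unmapped(sam_line):
    """Return 'wendy', 'peterpan', or None for one tab-separated SAM record.

    Hypothetical helper for illustration; not part of any GATK or Picard tool.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    rname, pos = fields[2], int(fields[3])
    mapq, cigar = int(fields[4]), fields[5]

    # Only primary records from paired reads are classified here.
    if not flag & PAIRED or flag & SECONDARY:
        return None

    read_unmapped = bool(flag & UNMAPPED)
    mate_unmapped = bool(flag & MATE_UNMAPPED)

    # peterpan: unmapped itself, yet placed at its mapped mate's coordinates.
    if read_unmapped and not mate_unmapped:
        placed = rname != "*" and pos > 0
        if placed and mapq == 0 and cigar == "*":
            return "peterpan"
        return None

    # wendy: mapped, with a mate that is flagged as unmapped.
    if not read_unmapped and mate_unmapped:
        return "wendy"
    return None


# Made-up records for one pair: a mapped first read and its unmapped mate.
wendy = "pair1\t73\tchr1\t100\t60\t4M\t=\t100\t0\tACGT\tIIII"
peterpan = "pair1\t133\tchr1\t100\t0\t*\t=\t100\t0\tACGT\tIIII"

print(classify_mate_unmapped(wendy))
print(classify_mate_unmapped(peterpan))
```

Both example records carry the same reference name and position, which is why they stay next to each other in a coordinate-sorted file.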
Divorced
Mapped reads marked as mate-unmapped whose mates are actually mapped. This second type of mate-unmapped record has to do with multimapping read sets. In our pipeline, reads aligned using BWA-MEM's -M option are passed through MergeBamAlignment, which officiates the creation of divorced reads: MergeBamAlignment marks secondary records as mate-unmapped. This effectively minimizes the association between secondary records and their previous mates, much like in a divorce.
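As a rough illustration, the flag combination described above can be tested directly on a record's FLAG value. This sketch assumes that a divorced record is simply a mapped secondary alignment with the mate-unmapped bit set. The flag alone cannot confirm that the former mate really is mapped, and the function name is hypothetical.

```python
# SAM flag bits, as defined in the SAM specification.
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100


def is_divorced(flag):
    """True for a mapped secondary record that is flagged mate-unmapped.

    Assumption for illustration: this flag pattern is treated as 'divorced'.
    """
    return (
        not flag & UNMAPPED
        and bool(flag & SECONDARY)
        and bool(flag & MATE_UNMAPPED)
    )


# 329 = paired (1) + mate unmapped (8) + first in pair (64) + secondary (256)
print(is_divorced(329))
# 73 = paired (1) + mate unmapped (8) + first in pair (64): a primary record
print(is_divorced(73))
```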
How do tools treat them differently?
Let's take duplicate marking (published in Tutorial#6747) as an example. Because consideration for duplicate status requires mapping, duplicate marking tools ignore peterpans altogether. These tools also skip divorced records because they are secondary records. However, duplicate marking tools treat wendys as single-ended reads and will mark them if they are duplicates.
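The treatment described above can be summarized as a small decision function over the FLAG value. This is only a sketch of the reasoning in this post, not the logic of any particular duplicate marking tool, which applies further rules. The function name and return labels are hypothetical.

```python
# SAM flag bits, as defined in the SAM specification.
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100


def duplicate_marking_role(flag):
    """Summarize how a record is treated, following the description above."""
    if flag & UNMAPPED:
        # peterpans: duplicate status requires mapping, so these are ignored.
        return "ignored"
    if flag & SECONDARY:
        # divorced records are secondary, so they are skipped.
        return "skipped"
    if flag & MATE_UNMAPPED:
        # wendys: considered as single-ended reads.
        return "single-ended"
    # Anything else falls outside the three classes discussed here.
    return "other"


for flag in (133, 329, 73):
    print(flag, duplicate_marking_role(flag))
```

The order of the checks matters: a record is tested for being unmapped first, then for being secondary, so only mapped primary records with an unmapped mate reach the single-ended case.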
Peterpan clipart is from http://www.disneyclips.com/imagesnewb/peterpan4.html.