Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Some questions on this BAM file from UK Biobank

Hi,

Below is a screenshot for a few lines of one CRAM file that I downloaded from UKB. I am surprised to see that the sequencing reads is only 76bp instead of the usually used 150bp. For easy reading, I inserted a blank line between each of the 5 reads. Below, I will refer each of the 5 reads by numbers, such as read #1, read #2.

For the first line, read #1, my understanding is that this read starts at chr19 (column 3) and position 44908610 (column 4), and its pair read starts at 44908678 (column 8). Since the mate pair read’s starting position is 68bp more than the read #1 and the mate pair’s read is 76bp long, the distance between the OUTER points of these two reads are 68 + 76 = 144bp (column 9). So, the number on column 9 measures the distance between the two OUTER points of one pair and its mate pair? Can someone please confirm this?

I once see some sequencing data generated by Complete Genomics platform. The BAM file shows one read and its mate pair are on two different chromosomes. I thought that paired-end sequencing means sequencing the same segement from two ends, therefore one read and its pair are actually on the same DNA fragment. How could one read and its mate pair be on different chromosomes.

As highlighted in the attached screenshot, the position 44908678 showed up 5 times:

•   It is listed as a mate pair for **read #1** and **read #3** respectively. How is it possible that the same read could become the mate pair of two different reads?

•   It is listed as the starting position of both **read #4** and its mate pair. Does this mean this read is only 75bp, instead of 2*75bp?

•   It is listed as the starting position of **read #5**. Then it seems to me that read #4 and read #5 are duplicates. Therefore, one should be removed. However, these two reads do have different labels and different quality scores, therefore, they are not duplicates. How to identify duplicate reads? 

Below are my two important questions:

•   Now, if I want to find the mate pairs of all reads, can I use each reads’ starting position and its mate pair’s starting position as the matching field to merge the same SAM/BAM file. If so, either read #4 or read #5 will be merged into read #1 and read #3, depending on whether the merging algorithm keeps the first or the last occurrence of the same record. Is this correct? I think I could do this with a small Perl or Python script. Or is there a samtools command to quickly do this?

•   Since read #1 and read #4 are mate pairs, after I merged them together, can I assume that the DNA sequence on the merged line are on the same chromosome, i.e., they are in phase?

Anyone’s feedback on my above questions would be deeply appreciated!

Thank you very much & best regards,
Jie

Answers

  • mshandmshand Member, Broadie, Dev ✭✭

    Hi @jiehuang001 ,

    I'll try to address your questions one by one:

    So, the number on column 9 measures the distance between the two OUTER points of one pair and its mate pair? Can someone please confirm this?

    This number is filled in by the aligner. The SAM specification defines TLEN (the 9th column) as the observed template length (https://samtools.github.io/hts-specs/SAMv1.pdf), but ultimately it's the aligner that fills in this value. You can look at the documentation of the aligner used (you can usually find this information in the @PG line in the header) to see how this value is calculated.

    How could one read and its mate pair be on different chromosomes.

    While this is hopefully less common than the read and it's mate being close together, this can occur if there is a structural variation in your sample, or if there is a mapping artifact (the alignment is wrong for some reason).

    44908678 is listed as a mate pair for read #1 and read #3 respectively. How is it possible that the same read could become the mate pair of two different reads?

    Reads pairs are not determined by where they aligned. You can find a read and it's mate by looking at the readname (or queryname) which is the first column of the CRAM (so for read #1 in your example it's the string that ends in ":8343"). Note that based on this read #1 in your example is mates with read #5. The other reads happen to align to 44908678, but that doesn't mean they are mates.

    It is listed as the starting position of both read #4 and its mate pair. Does this mean this read is only 75bp, instead of 2*75bp?

    Read length is also not determined by alignment or where the mate is. Read length is how many bases were sequenced so you could find this by counting the number of bases in the read. I'd recommend opening these reads in IGV and taking a look to see what I mean.

    _How to identify duplicate reads? _

    We mark duplicates by looking at the unclipped start position of a read and it's mate. We don't look at quality scores or the bases to determine this. This is because a duplicate read was sequenced twice, but came from the same original molecule (either from PCR or another method of getting that molecule more than once). I'd recommend looking at this https://software.broadinstitute.org/gatk/documentation/article?id=6747#section3 or for more details on Marking Duplicates with Picard.

    Now, if I want to find the mate pairs of all reads, can I use each reads’ starting position and its mate pair’s starting position as the matching field to merge the same SAM/BAM file.

    No, you should find mates of all reads by looking at the read name. If you queryname sort the bam (using SortSam) all the mates will be right next to each other in your bam file, which is one way you can work with them easily.

    I'm also not sure I understand what you mean by "merge" the read and it's mate together, or why you would want to do that.

    Hope this helps!

  • jiehuang001jiehuang001 Member

    Thank you very much!

    Can I use "samtools sort" to sort my CRAM file? I did not see SortSam on our system. All the mates will be right next to each other in the new BAM file? That will be cool. I will give it a try.

    My other key questions is: if the sequence is AAACC... on one read, and the sequence is CCCTT... on its mate pair, can I assume that these two sequences are on the same chromosomes? I am trying to extract phasing information from my CRAM files.

    Best regards,
    Jie

  • mshandmshand Member, Broadie, Dev ✭✭

    SortSam is a Picard tool, but I'd imagine samtools can also produce a queryname sorted CRAM/BAM.

    The mate and it's pair should always be from the same original molecule (barring some kind of error mode). So you should be able to infer phasing based on the mate pair. I don't think you need to look at the actual bases to make this assumption, but maybe I'm misunderstanding your question.

Sign In or Register to comment.