Some questions on this BAM file from UK Biobank
Below is a screenshot of a few lines from a CRAM file that I downloaded from UKB. I am surprised to see that the sequencing reads are only 76bp long instead of the more commonly used 150bp. For easier reading, I inserted a blank line between each of the 5 reads. Below, I will refer to each of the 5 reads by number, such as read #1 and read #2.
For the first line, read #1, my understanding is that this read maps to chr19 (column 3) starting at position 44908610 (column 4), and its mate starts at 44908678 (column 8). Since the mate’s starting position is 68bp beyond that of read #1 and the mate is 76bp long, the distance between the OUTER ends of these two reads is 68 + 76 = 144bp (column 9). So, does the number in column 9 measure the distance between the two OUTER ends of a read and its mate? Can someone please confirm this?
I once saw some sequencing data generated on the Complete Genomics platform. The BAM file showed a read and its mate on two different chromosomes. I thought that paired-end sequencing means sequencing the same segment from both ends, so a read and its mate are actually on the same DNA fragment. How could a read and its mate be on different chromosomes?
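To make my arithmetic explicit, here is a minimal Python sketch of how I am computing that number. It assumes both reads are on the same chromosome and that each read aligns over its full 76 bases with no clipping or indels; the function name `outer_distance` is just something I made up for illustration.

```python
def outer_distance(read_start, mate_start, read_len, mate_len):
    """Number of bases from the leftmost to the rightmost aligned base, inclusive.

    Start positions are 1-based, as in columns 4 and 8 of a SAM record.
    Assumes each read covers exactly `read_len` / `mate_len` reference bases.
    """
    leftmost = min(read_start, mate_start)
    # Last aligned base of each read is start + length - 1.
    rightmost = max(read_start + read_len - 1, mate_start + mate_len - 1)
    return rightmost - leftmost + 1


# Read #1 from the screenshot: starts at 44908610, mate starts at 44908678.
print(outer_distance(44908610, 44908678, 76, 76))  # 144
```

If my understanding is right, this value should match the absolute value of column 9 for a pair like read #1 and its mate.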
As highlighted in the attached screenshot, the position 44908678 showed up 5 times:
• It is listed as a mate pair for **read #1** and **read #3** respectively. How is it possible that the same read could become the mate pair of two different reads?
• It is listed as the starting position of both **read #4** and its mate pair. Does this mean this read is only 75bp, instead of 2*75bp?
• It is listed as the starting position of **read #5**. Then it seems to me that read #4 and read #5 are duplicates. Therefore, one should be removed. However, these two reads do have different labels and different quality scores, therefore, they are not duplicates. How to identify duplicate reads?
Below are my two important questions:
• Now, if I want to find the mates of all reads, can I use each read’s starting position and its mate’s starting position as the matching field to merge the SAM/BAM file with itself? If so, either read #4 or read #5 will be merged into read #1 and read #3, depending on whether the merging algorithm keeps the first or the last occurrence of the same record. Is this correct? I think I could do this with a small Perl or Python script. Or is there a samtools command to do this quickly?
• Since read #1 and read #4 are mates, after I merge them together, can I assume that the DNA sequences on the merged line are on the same chromosome copy, i.e., that they are in phase?
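For reference, the kind of small script I have in mind could be sketched as below. Note that this sketch matches on the read name (column 1) and the first/last-in-pair bits of the flag (column 2) rather than on start positions, on the assumption that both mates carry the same read name as the SAM specification describes. That would avoid the ambiguity when several reads start at the same position. The function name `pair_mates` and the toy records are made up for illustration and are not taken from my file.

```python
from collections import defaultdict


def pair_mates(sam_lines):
    """Return {read_name: (first_mate_fields, last_mate_fields)} for complete pairs.

    Works on plain-text SAM lines. Header lines, unpaired reads, and
    secondary/supplementary alignments are skipped.
    """
    by_name = defaultdict(dict)
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        qname, flag = fields[0], int(fields[1])
        if not flag & 0x1:                # read is not part of a pair
            continue
        if flag & 0x100 or flag & 0x800:  # secondary or supplementary alignment
            continue
        if flag & 0x40:                   # first segment in the template
            by_name[qname]["first"] = fields
        elif flag & 0x80:                 # last segment in the template
            by_name[qname]["last"] = fields
    return {
        name: (mates["first"], mates["last"])
        for name, mates in by_name.items()
        if "first" in mates and "last" in mates
    }


# Hypothetical records: two pairs whose reads start at identical positions.
records = [
    "readA\t99\tchr19\t100\t60\t10M\t=\t150\t60\tACGTACGTAC\tFFFFFFFFFF",
    "readB\t99\tchr19\t100\t60\t10M\t=\t150\t60\tACGTACGTAC\tFFFFFFFFFF",
    "readA\t147\tchr19\t150\t60\t10M\t=\t100\t-60\tTTGCATTGCA\tFFFFFFFFFF",
    "readB\t147\tchr19\t150\t60\t10M\t=\t100\t-60\tTTGCATTGCA\tFFFFFFFFFF",
]

pairs = pair_mates(records)
for name, (first, last) in sorted(pairs.items()):
    print(name, first[3], last[3])
```

In this toy input a position-based merge could not tell readA from readB, while the name-based grouping keeps the two pairs separate.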
Anyone’s feedback on my above questions would be deeply appreciated!
Thank you very much & best regards,