Hi GATK Users,

Happy Thanksgiving!
Our staff will be observing the holiday and will be unavailable from 22nd to 25th November. This will cause a delay in reaching out to you and answering your questions immediately. Rest assured we will get back to it on Monday November 26th. We are grateful for your support and patience.
Have a great holiday everyone!!!

Regards
GATK Staff

PE mates are lost while downsampling with -dfrac?

Hi GATK team and users,

I am using PrintReads with -dfrac option to simulate different depths of coverage. The original data contains WGS, PE reads (from GATK's Bundle bam, PrintReads with -L 20, -dfrac 0.18). I'm using gatk-3.7.

I think that the PE mates are lost while downsampling (first observed at IGV with 'view as pairs'):

samtools flagstat still see 96.64% of "properly paired" reads but I guess that it is because the flags are inherited from the original bam reads.

samtools flagstat ./NA12878/CEUTrio.HiSeq.WGS.b37.NA12878.L20.dfrac0.18.bam:

##  9278360 + 0 in total (QC-passed reads + QC-failed reads)
## ...
##  8966551 + 0 properly paired (96.64% : N/A)
##  9023705 + 0 with itself and mate mapped
## ...

82% of the name of the reads are unique (and not duplicated as expected for PE data).

samtools view ./NA12878/CEUTrio.HiSeq.WGS.b37.NA12878.L20.dfrac0.18.bam | awk '{print $1}' | sort -n | uniq -c | awk '{print $1}' | sort -n | uniq -c

## 7612964 1
##   832698 2

Is there a way to downsample a bam file keeping the paired reads to simulate that I have got less data but still properly paired?

Thanks a lot for any help/discussion,
EsterQ

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @EsterQ
    Hi EsterQ,

    The reason the mates may be missing is because they may have mapped somewhere away from chromosome 20. For example, when you subset your BAM file to only chromosome 20, some mates may have mapped to different chromosomes and will not be present in your subset file.

    The unpaired mates will be ignored by downstream tools. I don't think there is a way for GATK tools to keep the mates if they are not in the interval you request.

    -Sheila

  • douymdouym Member

    @Sheila , I'm afraid not. I also did a test and found most of the paired reads are missed using GATK Printreads -dfrac.

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @EsterQ and @douym,

    Picard DownsampleSam will retain mates when downsampling.

  • Thank you @shlee, I'll try with Picard DownsampleSam although it is not clear to me how the selected percentage is retained through 'probablitiy'. I'll let you know how it performed.

    Thank you @Sheila for the reply but as @douym confirmed (thanks a lot), the PE which were missed were also 'properly paired' at the original bam, all of them mapped at chr20 and with a correct insert size. I'm afraid that there is a bug or -dfrac is not aware about keep both or discard both mate-pairs. Can you please confirm?

    BR,
    Ester

  • Hi Sheila,

    I wanted to simulate sequencing data at different coverages before markdups. If it is not going to be fixed, I'll use another tool.

    Thanks a lot anyway,
    Ester

Sign In or Register to comment.