To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at

PE mates are lost while downsampling with -dfrac?

Hi GATK team and users,

I am using PrintReads with -dfrac option to simulate different depths of coverage. The original data contains WGS, PE reads (from GATK's Bundle bam, PrintReads with -L 20, -dfrac 0.18). I'm using gatk-3.7.

I think that the PE mates are lost while downsampling (first observed at IGV with 'view as pairs'):

samtools flagstat still see 96.64% of "properly paired" reads but I guess that it is because the flags are inherited from the original bam reads.

samtools flagstat ./NA12878/CEUTrio.HiSeq.WGS.b37.NA12878.L20.dfrac0.18.bam:

##  9278360 + 0 in total (QC-passed reads + QC-failed reads)
## ...
##  8966551 + 0 properly paired (96.64% : N/A)
##  9023705 + 0 with itself and mate mapped
## ...

82% of the name of the reads are unique (and not duplicated as expected for PE data).

samtools view ./NA12878/CEUTrio.HiSeq.WGS.b37.NA12878.L20.dfrac0.18.bam | awk '{print $1}' | sort -n | uniq -c | awk '{print $1}' | sort -n | uniq -c

## 7612964 1
##   832698 2

Is there a way to downsample a bam file keeping the paired reads to simulate that I have got less data but still properly paired?

Thanks a lot for any help/discussion,

Best Answer


  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    Hi EsterQ,

    The reason the mates may be missing is because they may have mapped somewhere away from chromosome 20. For example, when you subset your BAM file to only chromosome 20, some mates may have mapped to different chromosomes and will not be present in your subset file.

    The unpaired mates will be ignored by downstream tools. I don't think there is a way for GATK tools to keep the mates if they are not in the interval you request.


  • douymdouym Member

    @Sheila , I'm afraid not. I also did a test and found most of the paired reads are missed using GATK Printreads -dfrac.

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @EsterQ and @douym,

    Picard DownsampleSam will retain mates when downsampling.

  • Thank you @shlee, I'll try with Picard DownsampleSam although it is not clear to me how the selected percentage is retained through 'probablitiy'. I'll let you know how it performed.

    Thank you @Sheila for the reply but as @douym confirmed (thanks a lot), the PE which were missed were also 'properly paired' at the original bam, all of them mapped at chr20 and with a correct insert size. I'm afraid that there is a bug or -dfrac is not aware about keep both or discard both mate-pairs. Can you please confirm?


  • Hi Sheila,

    I wanted to simulate sequencing data at different coverages before markdups. If it is not going to be fixed, I'll use another tool.

    Thanks a lot anyway,

Sign In or Register to comment.