File size is greatly reduced in the MarkIlluminaAdapters step

nithanitha india Member


I processed raw reads (FASTQ) using two approaches:

1. Merging all forward reads and all reverse reads, and using the merged files as input for the subsequent steps
2. Without merging, using each individual raw FASTQ file as input for each step

During the MarkIlluminaAdapters step, I noticed that the output file size was smaller for the 2nd way. The sizes are as follows:

1. Raw FASTQ files: 80Gb
2. MarkIlluminaAdapters output: **1st way (merged) 215Gb; 2nd way 179Gb**

In the later steps, however, the sizes were approximately the same for both approaches, or larger for the 2nd way: BWA-MEM alignment (1st way (merged) 258Gb; 2nd way 263Gb), BAM conversion (1st way 60Gb; 2nd way 80Gb), and MarkDuplicates (1st way 59Gb; 2nd way 60Gb).
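To put the numbers side by side, here is a minimal sketch that computes how much larger or smaller the 2nd-way output is relative to the 1st-way output at each step. The sizes are the ones quoted above; the names `sizes_gb` and `relative_change` are only illustrative and not part of any tool.

```python
# Output sizes in Gb as reported in this post: (1st way / merged, 2nd way / per-file).
sizes_gb = {
    "MarkIlluminaAdapters": (215, 179),
    "BWA-MEM alignment": (258, 263),
    "BAM conversion": (60, 80),
    "MarkDuplicates": (59, 60),
}


def relative_change(first_way, second_way):
    """Percent change of the 2nd-way size relative to the 1st-way size."""
    return (second_way - first_way) / first_way * 100.0


# Negative values mean the 2nd-way output is smaller than the 1st-way output.
changes = {step: relative_change(a, b) for step, (a, b) in sizes_gb.items()}

for step, pct in changes.items():
    print(f"{step:22s} {pct:+6.1f}%")
```

This only restates the reported sizes as percentages; it does not explain where the difference comes from.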

In addition, I ran an alignment quality check on both BAM files using samtools flagstat (for the 2nd way, the per-file outputs were merged into a single file for the check). Both approaches showed 99.65% mapped, but the duplicate counts differed considerably:

1st way duplicates: 7197218 + 0 duplicates; 2nd way duplicates: 208749 + 0 duplicates
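For comparison, a small sketch that parses the two flagstat duplicate lines quoted above and computes their ratio. In flagstat output the two numbers are the QC-passed and QC-failed read counts; the helper name `parse_flagstat_count` is hypothetical and the lines are hard-coded from this post rather than read from a file.

```python
def parse_flagstat_count(line):
    """Return (qc_passed, qc_failed) from a flagstat line such as '123 + 0 duplicates'."""
    fields = line.split()
    # Expected layout: <qc_passed> '+' <qc_failed> <description...>
    if len(fields) < 3 or fields[1] != "+":
        raise ValueError(f"not a flagstat count line: {line!r}")
    return int(fields[0]), int(fields[2])


# Duplicate lines exactly as reported in this post.
first_way_line = "7197218 + 0 duplicates"
second_way_line = "208749 + 0 duplicates"

first_passed, first_failed = parse_flagstat_count(first_way_line)
second_passed, second_failed = parse_flagstat_count(second_way_line)

# How many times more duplicates were flagged in the 1st way than in the 2nd way.
ratio = first_passed / second_passed
print(f"1st way: {first_passed}, 2nd way: {second_passed}, ratio: {ratio:.1f}x")
```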

Could you please explain why such a large reduction in data size is seen in the MarkIlluminaAdapters step, and why the duplicate counts in the alignment quality check differ for the merged files?


