File size is largely reduced in MarkIlluminaAdapters step

nithanitha, India (Member)


I processed my raw reads (FASTQ) using two approaches:

1. Merged: all forward reads and all reverse reads were merged, and the merged files were used as input for the subsequent steps.
2. Unmerged: each individual raw FASTQ file was used as input for each step.

While running the MarkIlluminaAdapters step, I noticed that the output was smaller for the 2nd way. The sizes are as follows:

1. Raw FASTQ file size: 80Gb
2. MarkIlluminaAdapters output size: **1st way (merged) 215Gb; 2nd way 179Gb**

In the later steps, however, the sizes were about the same for both approaches at BWA-MEM alignment and MarkDuplicates, and the 2nd way was the larger of the two:

1. BWA-MEM alignment: 1st way (merged) 258Gb; 2nd way 263Gb
2. BAM conversion: 1st way 60Gb; 2nd way 80Gb
3. MarkDuplicates: 1st way 59Gb; 2nd way 60Gb
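To put the sizes side by side, here is a small sketch that computes how much larger or smaller the 2nd way is relative to the 1st way at each step. The numbers are the ones reported above; the step labels and the `relative_change` helper are just names chosen for this example.

```python
# Sizes in Gb as reported above: (1st way / merged, 2nd way / unmerged).
sizes_gb = {
    "MarkIlluminaAdapters": (215, 179),
    "BWA-MEM alignment": (258, 263),
    "BAM conversion": (60, 80),
    "MarkDuplicates": (59, 60),
}


def relative_change(first, second):
    """Percent change of the 2nd-way size relative to the 1st-way size."""
    return (second - first) / first * 100.0


# Negative values mean the 2nd way is smaller; positive values mean it is larger.
changes = {step: relative_change(a, b) for step, (a, b) in sizes_gb.items()}

for step, pct in changes.items():
    print(f"{step}: {pct:+.1f}%")
```

Only the MarkIlluminaAdapters step comes out negative here; at every later step the 2nd way is the same size or larger.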

Another thing: I ran an alignment quality check on both BAM files with samtools flagstat (for the 2nd way, the per-file outputs were merged into a single file for the check). Both showed 99.65% mapped, but the number of reads flagged as duplicates differed considerably between the two:

1st way (merged): 7197218 + 0 duplicates
2nd way: 208749 + 0 duplicates
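For reference, the two duplicate counts can be pulled out of the flagstat lines and compared directly. This is only a sketch: the two input strings are the lines quoted above, and `duplicate_count` is a hypothetical helper written for this example, not part of samtools.

```python
import re

# The "duplicates" lines as reported by samtools flagstat above.
flagstat_lines = {
    "1st way (merged)": "7197218 + 0 duplicates",
    "2nd way": "208749 + 0 duplicates",
}

# flagstat prints "<QC-passed> + <QC-failed> <category>" on each line.
DUPLICATES_RE = re.compile(r"^(\d+) \+ (\d+) duplicates")


def duplicate_count(line):
    """Return QC-passed plus QC-failed duplicate reads from one flagstat line."""
    match = DUPLICATES_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a flagstat duplicates line: {line!r}")
    return int(match.group(1)) + int(match.group(2))


counts = {label: duplicate_count(line) for label, line in flagstat_lines.items()}
ratio = counts["1st way (merged)"] / counts["2nd way"]

print(counts)
print(f"1st way / 2nd way duplicate ratio: {ratio:.1f}")
```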

Could you please explain why there is such a large reduction in file size at the MarkIlluminaAdapters step, and why the duplicate counts from the alignment quality check differ between the merged and unmerged files?

