
Does the best practice pipeline for pre-data processing output a merged BAM file?

minimax Member
edited May 2019 in Ask the GATK team

I'm using the GATK4 best practice pipeline for pre-data processing. I have several questions and would like to confirm:

  1. The final output is a merged BAM file, right? All the input per-sample BAM files are merged during MarkDuplicates and are processed together from then on.

  2. It seems that I can use SplitSam on the final output to extract the BAM files for each sample. Is this correct?

  3. In the accompanying .json file, the sample_name is NA12878. Are there any special reasons for choosing this name? Or can it be any name?

Also, following up on the first question: why are all the input BAM files merged? Are there any special reasons for doing so? For the dataset I'm analyzing, each BAM is about 15GB and I have 200 such files, so if they are merged together the result will be 3000GB = 3TB. Although I'm using a server, it will still be very difficult to process such a huge file (if a file that large is supported at all)! Besides, once they are merged, we can no longer process them in parallel, so the computing time will be longer.

Thank you very much for your help!

Best Answer


  • minimax Member

    @SChaluvadi , thanks for your reply!

  • jackyhuang Taiwan, R.O.C. Member
    > @SChaluvadi said:
    > @minimax
    > 1. Output of MarkDuplicates: Yes, the output is a single file even if there are multiple inputs. MarkDuplicates can be used to merge a single sample's multiple BAM files across multiple lanes, which is why the output is merged: a single sample's reads are marked for duplicates and merged into a single file. Duplicate reads are defined as originating from a single fragment of DNA, so you want to look across an entire sample. You can choose to run the inputs separately if you do not want to merge.
    > 2. SplitSam: Yes you should be able to create BAM files per sample from your input.
    > 3. It is labeled NA12878 because the sample was derived from the cell line standard. You can name your sample whatever you would like but are required to update the accompanying .json file with the name of your sample.
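
Point 1 of the quoted reply can be sketched as follows: group the lane-level BAM files by sample, then build one MarkDuplicates command per sample so that merging happens only within a sample. This is a minimal sketch, not the pipeline's actual code. The `<sample>_<lane>.bam` naming scheme, the helper names `group_bams_by_sample` and `markduplicates_command`, and the `dedup` output directory are all assumptions made for illustration; the `-I`, `-O`, and `-M` arguments are the input, output, and metrics-file options of MarkDuplicates.

```python
from collections import defaultdict
from pathlib import Path


def group_bams_by_sample(bam_paths):
    """Group lane-level BAM paths by sample name.

    Assumes a hypothetical naming scheme '<sample>_<lane>.bam';
    adapt the parsing to however your files are actually named.
    """
    groups = defaultdict(list)
    for path in bam_paths:
        sample = Path(path).stem.split("_")[0]
        groups[sample].append(path)
    return dict(groups)


def markduplicates_command(sample, bams, out_dir="dedup"):
    """Build one MarkDuplicates command (as an argument list) for one sample.

    All BAMs of the sample are passed as repeated -I arguments, so the
    single output file contains only that sample's reads.
    """
    cmd = ["gatk", "MarkDuplicates"]
    for bam in sorted(bams):
        cmd += ["-I", bam]
    cmd += [
        "-O", f"{out_dir}/{sample}.dedup.bam",
        "-M", f"{out_dir}/{sample}.metrics.txt",
    ]
    return cmd


if __name__ == "__main__":
    # Hypothetical input files: two lanes for sampleA, one for sampleB.
    lanes = [
        "bams/sampleA_L001.bam",
        "bams/sampleA_L002.bam",
        "bams/sampleB_L001.bam",
    ]
    for sample, bams in sorted(group_bams_by_sample(lanes).items()):
        # Print the command instead of running it; each one is independent.
        print(" ".join(markduplicates_command(sample, bams)))
```

Because each command touches only one sample's files, the per-sample jobs are independent and could be run in parallel, and no single output needs to hold all 200 samples from the original question.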

    Does the third point mean that a BAM file for NA12878 with the correct answer is available for download?
    If so, I couldn't find anywhere to download it.
  • bhanuGandham Cambridge MA Member, Administrator, Broadie, Moderator

    Hi,

    The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal/erroneous results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.

    Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.

    We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.

    For context, see this [announcement](https://software.broadinstitute.org/gatk/blog?id=24419 "announcement") and check out our [support policy](https://gatkforums.broadinstitute.org/gatk/discussion/24417/what-types-of-questions-will-the-gatk-frontline-team-answer/p1?new=1 "support policy").
