Does the Best Practices pipeline for data pre-processing output a merged BAM file?

minimax
edited May 2019

I'm using the GATK4 Best Practices pipeline for data pre-processing. I have several questions that I would like to confirm:

  1. The final output is a merged BAM file, right? All the input per-sample BAM files are merged during MarkDuplicates and are processed together from then on.

  2. It seems that I can use SplitSam on the final output to extract the BAM files for each sample. Is this correct?

  3. In the accompanying .json file, the sample_name is NA12878. Are there any special reasons for choosing this name? Or can it be any name?

Following up on the first question: why are all the input BAM files merged? Is there a particular reason for doing so? In the dataset I'm analyzing, each BAM file is about 15 GB and I have 200 of them, so merged together they would come to 3000 GB = 3 TB. Even on a server, a file that large would be very difficult to process (if it is supported at all). Besides, once the files are merged the work can no longer be parallelized, so the computation will take longer.

Thank you very much for your help!

  • minimax Member

    @SChaluvadi , thanks for your reply!

  jackyhuang Taiwan,R.O.C
    > @SChaluvadi said:
    > @minimax
    > 1. Output of MarkDuplicates: Yes, the output is a single file even if there are multiple inputs. MarkDuplicates can be used to merge a single sample's BAM files from multiple lanes, which is why the output is merged: a single sample's reads are duplicate-marked and merged into a single file. Duplicate reads are defined as originating from a single fragment of DNA, so you want to look across an entire sample. You can choose to run the inputs separately if you do not want to merge.
    > 2. SplitSam: Yes, you should be able to create BAM files per sample from your input.
    > 3. It is labeled NA12878 because the sample was derived from the cell line standard. You can name your sample whatever you would like, but you are required to update the accompanying .json file with the name of your sample.
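
To illustrate the first point of the quoted reply (merging happens within one sample, and samples can be run separately), here is a minimal Python sketch that groups BAM files by sample and builds one MarkDuplicates command per sample. The file-naming convention `<sample>.<lane>.bam`, the `dedup` output directory, and both helper function names are assumptions made for this example and are not part of the pipeline. The `-I`, `-O`, and `-M` arguments are the standard input, output, and metrics options of MarkDuplicates.

```python
from collections import defaultdict
from pathlib import Path


def group_bams_by_sample(bam_paths):
    """Group BAM paths by sample name.

    Assumes a hypothetical naming convention: <sample>.<lane>.bam
    """
    groups = defaultdict(list)
    for path in bam_paths:
        sample = Path(path).name.split(".")[0]
        groups[sample].append(str(path))
    return dict(groups)


def markduplicates_commands(groups, out_dir="dedup"):
    """Build one MarkDuplicates command (as an argument list) per sample.

    All lanes of a sample are passed as repeated -I inputs, so duplicates
    are marked across the whole sample and the result is one BAM per sample.
    """
    commands = []
    for sample, bams in sorted(groups.items()):
        cmd = ["gatk", "MarkDuplicates"]
        for bam in sorted(bams):
            cmd += ["-I", bam]
        cmd += [
            "-O", f"{out_dir}/{sample}.dedup.bam",
            "-M", f"{out_dir}/{sample}.metrics.txt",
        ]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    bams = ["sampleA.lane1.bam", "sampleA.lane2.bam", "sampleB.lane1.bam"]
    for cmd in markduplicates_commands(group_bams_by_sample(bams)):
        print(" ".join(cmd))
```

Because each command touches only one sample's files, the commands are independent of one another and could be submitted as separate jobs, so 200 samples would not need to be combined into a single 3 TB file.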

    Does the third point mean that a BAM file for NA12878 with the correct answer can be downloaded?
    I couldn't find anywhere to download such a file.
