
At what point should I merge read group BAM files belonging to the same sample into a single file?

Sheila, Broad Institute (Member, Broadie, Moderator)

It is fairly common to have multiple read groups for a sample, either from sequencing multiple libraries or from spreading a library across multiple lanes. It seems this causes a lot of confusion, and people often tell us they're not sure how to organize the data for the pre-processing steps or how to feed the data into HaplotypeCaller.

There are several options for organizing the processing. We have a fairly detailed FAQ article that describes our preferred workflow for pre-processing data from multiplexed sequencing and multi-library designs. In this article we describe, at a simpler level, the two main options, which differ in how you want to provide the analysis-ready BAM files to the variant caller.

To produce a combined per-sample bam file to feed to HaplotypeCaller (most common)

The simplest thing to do is to input all the bam files that belong to that sample at a single step: either the MarkDuplicates step, the Indel Realignment step or the BQSR step. The choice depends mostly on how deep the coverage is. High depth means a lot of data to process at the same time, which slows down Indel Realignment, because Indel Realignment ignores all read group information and simply processes all reads together. BQSR doesn't suffer from that problem because it processes read groups separately. In either case, when you input all of the sample's bam files together, the bam that gets written out with the processed data will include all the libraries / read groups in one handy per-sample file.
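As a rough sketch of this first option at the MarkDuplicates step, the helper below assembles a Picard MarkDuplicates command that takes every read group bam of one sample as input and writes a single per-sample bam. The helper name `build_markdup_cmd`, the `picard.jar` path and all file names are placeholders, not taken from the post. The script only prints the command so it can be inspected before it is run.

```shell
# Hypothetical helper: print a Picard MarkDuplicates command that combines
# several read group bams of ONE sample into a single per-sample bam.
# Usage: build_markdup_cmd <sample_name> <rg1.bam> [<rg2.bam> ...]
build_markdup_cmd() {
  sample="$1"; shift
  cmd="java -jar picard.jar MarkDuplicates"
  for bam in "$@"; do
    cmd="$cmd I=$bam"        # one I= argument per read group bam
  done
  # Output and metrics names are assumptions for illustration only
  printf '%s\n' "$cmd O=$sample.dedup.bam M=$sample.dup_metrics.txt"
}

# Example with placeholder file names (prints the command, does not run it)
build_markdup_cmd sampleA sampleA_lane1.bam sampleA_lane2.bam
```

The same pattern, with several inputs and one output, applies if you prefer to combine the files at the Indel Realignment or BQSR step instead.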

Note: We do not require the PU field in the RG; however, if it is present, BQSR will use the PU field in preference to all other fields.

To produce a separate bam file for each read group (less common)

Another option is to keep all the bam files separate until variant calling, and then input them to HaplotypeCaller together. You can do this by simply running Indel Realignment and BQSR on each of the bams separately, then passing all of the bams to HaplotypeCaller at once. This works even if you want to run HaplotypeCaller in GVCF mode, which can only be done on a single sample at a time. As long as the SM tags are identical, HaplotypeCaller will recognize that it's a single-sample run. This is because the GATK engine merges the data before presenting it to the HaplotypeCaller tool, so HaplotypeCaller neither knows nor cares whether the data came from many files or one file.
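A minimal sketch of this second option follows, assuming the GATK3-style command line used elsewhere in this thread. `count_sm_tags` and `build_hc_cmd` are hypothetical helper names, and the file names are placeholders. The first helper counts the distinct SM values in SAM header text read from standard input, so you can confirm that all read group bams carry the same sample name. The second prints a HaplotypeCaller command with one `-I` per bam.

```shell
# Hypothetical helper: count distinct SM tags among the @RG header lines
# read from standard input. A result of 1 means all read groups share a sample.
count_sm_tags() {
  awk -F'\t' '/^@RG/ { for (i = 2; i <= NF; i++) if ($i ~ /^SM:/) seen[$i] = 1 }
              END   { n = 0; for (k in seen) n++; print n }'
}

# Hypothetical helper: print a HaplotypeCaller command (GVCF mode) that takes
# several bams belonging to ONE sample.
# Usage: build_hc_cmd <output.g.vcf> <rg1.bam> [<rg2.bam> ...]
build_hc_cmd() {
  out="$1"; shift
  cmd="java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta"
  for bam in "$@"; do
    cmd="$cmd -I $bam"       # one -I argument per read group bam
  done
  printf '%s\n' "$cmd --emitRefConfidence GVCF -o $out"
}

# Example with placeholder file names (prints the command, does not run it)
build_hc_cmd sampleA.g.vcf sampleA_rg1.bam sampleA_rg2.bam
```

To check real files, you could feed the headers in with something like `for b in sampleA_rg1.bam sampleA_rg2.bam; do samtools view -H "$b"; done | count_sm_tags`, which assumes samtools is installed.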

Note: If you input many bam files into IndelRealigner, the default output is one bam file. However, you can output one bam file for each input bam file by using -nWayOut.

Post edited by Geraldine_VdAuwera


  • everestial007, Greensboro (Member)

    Thanks @Sheila: this is very helpful.

  • ibseq, United Kingdom (Member)

    I'm quite confused about how to feed HC several bam files as input.

    I have a bam file for each of my different samples: can I merge all the bams into one file, or is there another way? Do I need to run this command for each of my over 100 samples?
    java -jar GenomeAnalysisTK.jar \
    -R reference.fasta \
    -T HaplotypeCaller \
    -I sample1.bam \
    --emitRefConfidence GVCF \
    [--dbsnp dbSNP.vcf] \
    [-L targets.interval_list] \
    -o output.raw.snps.indels.g.vcf


  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    @ibseq This command must be run individually per-sample (no merging). Then afterward you run GenotypeGVCFs on all the GVCFs together.
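For many samples, the per-sample step followed by the joint step could be scripted roughly as below. This is only a sketch: `print_gvcf_workflow` is a hypothetical helper, the output name `joint.vcf` and the sample file names are placeholders, and it assumes one bam per sample with the same GATK3-style arguments as the command quoted above. It prints the commands rather than running them.

```shell
# Hypothetical sketch: print one HaplotypeCaller command per sample bam,
# then one GenotypeGVCFs command that takes all resulting GVCFs together.
# Usage: print_gvcf_workflow <sample1.bam> [<sample2.bam> ...]
print_gvcf_workflow() {
  joint="java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta"
  for bam in "$@"; do
    gvcf="${bam%.bam}.g.vcf"   # sample1.bam -> sample1.g.vcf
    printf '%s\n' "java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I $bam --emitRefConfidence GVCF -o $gvcf"
    joint="$joint --variant $gvcf"
  done
  printf '%s\n' "$joint -o joint.vcf"
}

# Example with placeholder file names (prints the commands, does not run them)
print_gvcf_workflow sample1.bam sample2.bam
```

Each HaplotypeCaller line is independent of the others, so the per-sample commands can be run in parallel before the final GenotypeGVCFs call.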
