
# How should I pre-process data from multiplexed sequencing and multi-library designs?

edited May 2016 in FAQs

Our Best Practices pre-processing documentation assumes a simple experimental design in which you have one set of input sequence files (forward/reverse or interleaved FASTQ, or unmapped uBAM) per sample, and you run each step of the pre-processing workflow separately for each sample, resulting in one BAM file per sample at the end of this phase.

However, if you are generating multiple libraries for each sample, and/or multiplexing samples within and/or across sequencing lanes, the data must be de-multiplexed before pre-processing, typically resulting in multiple sets of FASTQ files per sample all of which should have distinct read group IDs (RGID).

At that point there are several different valid strategies for implementing the pre-processing workflow. Here at the Broad Institute, we run the initial steps of the pre-processing workflow (mapping, sorting and marking duplicates) separately on each individual read group. Then we merge the data to produce a single BAM file for each sample (aggregation); this is done by re-running Mark Duplicates, this time on all read group BAM files for a sample at the same time. Then we run Indel Realignment and Base Recalibration on the aggregated per-sample BAM files. See the worked-out example below and this presentation for more details.

Note that there are many possible ways to achieve a similar result; here we present the way we think gives the best combination of efficiency and quality. This assumes that you are dealing with one or more samples, and each of them was sequenced on one or more lanes.

### Example

Let's say we have this example data (assuming interleaved FASTQs containing both forward and reverse reads) for two sample libraries, sampleA and sampleB, which were each sequenced on two lanes, lane1 and lane2:

• sampleA_lane1.fq
• sampleA_lane2.fq
• sampleB_lane1.fq
• sampleB_lane2.fq

These will each be identified as separate read groups A1, A2, B1 and B2. If we had multiple libraries per sample, we would further distinguish them (e.g. sampleA_lib1_lane1.fq leading to read group A11, sampleA_lib2_lane1.fq leading to read group A21, and so on).

#### 1. Run initial steps per-readgroup once

Assuming that you received one FASTQ file per sample library, per lane of sequence data (which amounts to a read group), run each file through mapping and sorting. During the mapping step you assign read group information, which will be very important in the next steps so be sure to do it correctly. See the read groups dictionary entry for guidance.
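As an illustration, mapping one read group's interleaved FASTQ with BWA-MEM while assigning the read group information might look like the sketch below. The file names, the RG values, and the flowcell/lane identifiers are hypothetical placeholders; substitute your own.

```shell
# Hypothetical names throughout; adapt to your own naming scheme.
# -R attaches the @RG header line; the \t must reach bwa literally,
# hence the single quotes. -p tells bwa the FASTQ is interleaved.
RG='@RG\tID:A1\tSM:sampleA\tLB:libA\tPU:FLOWCELL1.LANE1\tPL:ILLUMINA'

bwa mem -p -R "$RG" reference.fasta sampleA_lane1.fq \
  | samtools sort -o sampleA_rgA1.bam -
```

Piping straight into samtools sort covers both the mapping and sorting steps in one pass, which avoids writing an intermediate SAM file.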

The example data becomes:

• sampleA_rgA1.bam
• sampleA_rgA2.bam
• sampleB_rgB1.bam
• sampleB_rgB2.bam

At this point we mark duplicates in each read group BAM file (dedup), which allows us to estimate the complexity of the corresponding library of origin as a quality control step. This step is optional.

The example data becomes:

• sampleA_rgA1.dedup.bam
• sampleA_rgA2.dedup.bam
• sampleB_rgB1.dedup.bam
• sampleB_rgB2.dedup.bam

Technically this first run of marking duplicates is not necessary because we will run it again per-sample, and that per-sample marking would be enough to achieve the desired result. To reiterate, we only do this round of marking duplicates for QC purposes.
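The per-read-group QC dedup described above could be sketched as a single-input Picard MarkDuplicates run (the jar path and file names are assumptions):

```shell
# One read group BAM in, one deduplicated BAM out; the metrics file
# carries the library complexity estimate used as the QC readout.
java -jar picard.jar MarkDuplicates \
  I=sampleA_rgA1.bam \
  O=sampleA_rgA1.dedup.bam \
  M=sampleA_rgA1.dedup_metrics.txt
```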

#### 2. Merge read groups and mark duplicates per sample (aggregation + dedup)

Once you have pre-processed each read group individually, you merge read groups belonging to the same sample into a single BAM file. You can do this as a standalone step, but for the sake of efficiency we combine this with the per-readgroup duplicate marking step (it's simply a matter of passing the multiple inputs to MarkDuplicates in a single command).

The example data becomes:

• sampleA.merged.dedup.bam
• sampleB.merged.dedup.bam

To be clear, this is the round of marking duplicates that matters. It eliminates PCR duplicates (arising from library preparation) across all lanes in addition to optical duplicates (which are by definition only per-lane).
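The merge-plus-dedup step described above amounts to one MarkDuplicates invocation with several inputs; a sketch with assumed file names:

```shell
# Passing several read group BAMs at once produces a single merged,
# deduplicated per-sample BAM; duplicates are marked across all lanes.
java -jar picard.jar MarkDuplicates \
  I=sampleA_rgA1.dedup.bam \
  I=sampleA_rgA2.dedup.bam \
  O=sampleA.merged.dedup.bam \
  M=sampleA.merged.dedup_metrics.txt
```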

#### 3. Remaining per-sample pre-processing

Then you run indel realignment (optional) and base recalibration (BQSR).

The example data becomes:

• sampleA.merged.dedup.(realn).recal.bam
• sampleB.merged.dedup.(realn).recal.bam

Realigning around indels per-sample leads to consistent alignments across all lanes within a sample. This step is only necessary if you will be using a locus-based variant caller like MuTect 1 or UnifiedGenotyper (for legacy reasons). If you will be using HaplotypeCaller or MuTect2, you do not need to perform indel realignment.

Base recalibration will be applied per-read group if you assigned appropriate read group information in your data. BaseRecalibrator distinguishes read groups by RGID, or RGPU if it is available (PU takes precedence over ID). This will identify separate read groups (distinguishing both lanes and libraries) as such even if they are in the same BAM file, and it will always process them separately -- as long as the read groups are identified correctly of course. There would be no sense in trying to recalibrate across lanes, since the purpose of this processing step is to compensate for the errors made by the machine during sequencing, and the lane is the base unit of the sequencing machine (assuming the equipment is Illumina HiSeq or similar technology).
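For GATK 3.x (the version current at the time of this thread), the per-sample BQSR pass could be sketched as the two commands below. The reference, known-sites file, and BAM names are placeholders.

```shell
# Step 1: model systematic base-quality errors, per read group,
# using known variant sites to mask true variation.
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
  -R reference.fasta \
  -knownSites dbsnp.vcf \
  -I sampleA.merged.dedup.bam \
  -o sampleA.recal.table

# Step 2: write a recalibrated BAM using the model from step 1.
java -jar GenomeAnalysisTK.jar -T PrintReads \
  -R reference.fasta \
  -I sampleA.merged.dedup.bam \
  --BQSR sampleA.recal.table \
  -o sampleA.merged.dedup.recal.bam
```

Even though the input is one per-sample BAM, BaseRecalibrator builds a separate covariate model for each read group it finds, which is why correct RG assignment matters.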

People often ask also if it's worth the trouble to try realigning across all samples in a cohort. The answer is almost always no, unless you have very shallow coverage. The problem is that while it would be lovely to ensure consistent alignments around indels across all samples, the computational cost gets too ridiculous too fast. That being said, for contrastive calling projects -- such as cancer tumor/normals -- we do recommend realigning both the tumor and the normal together in general to avoid slight alignment differences between the two tissue types.

Geraldine Van der Auwera, PhD


Questions and comments up to August 2014 have been moved to an archival thread here:

Geraldine Van der Auwera, PhD

• USAMember Posts: 16

Hi,
I have aligned paired-end FASTQ files (SRR901606_1.fastq and SRR901606_2.fastq) using BOWTIE2, SAMTOOLS and BCFTOOLS. Now I have eg606.sam, eg606.bam, eg606.sorted.bam, eg606.raw.bcf. Which of these should be my starting point to improve the quality of my data? Is there any sample command line for GATK for running "each pre-processing step individually: map & dedup -> realign -> recal"? How can I do this?

• Member Posts: 6

Hi,
I am trying to identify the best GATK pipeline to use with my multiplexed whole exome data.

In this topic, you say that it is better to run per-lane: map & dedup -> realign -> recal and then after merging lanes, a second pass of per-sample: dedup -> realign (but no 2nd pass of recal step, because "by definition there is no sense in trying to recalibrate across lanes").
Okay... it makes sense.
Per-lane: map & dedup, merge lanes (aggregation), then per-sample: dedup (2nd pass) -> realign -> recal.

Hmm our docs may be slightly out of sync, apologies for the confusion. The two ways are essentially equivalent; in the past we recommended doing realignment twice but not anymore. We'll update this document to reflect that and be in sync with the slides.

Geraldine Van der Auwera, PhD

Edited the doc to reflect how pre-processing is implemented in production at Broad.

Geraldine Van der Auwera, PhD

• Member Posts: 6

Thank you for your quick response Geraldine !

• USAMember Posts: 95

Thank you for updating/editing this post. Thank you for emphasizing why base recalibration is applied within a lane, not across all lanes of a sample. Also, thanks for mentioning that PU takes precedence over ID in read groups for base recalibration.

Previously you mentioned here that it is better to re-align (RTC + IR) per lane. Doing per-lane cleanup increases the probability of doing re-alignment in high-depth regions, as the amount of data will be less compared to the data in the all-lanes-merged-sample bam file. If per-lane cleanup is not done, then those high-depth regions will be skipped by GATK.

I see that the new guideline (Lane Level: Map + Sort + Dedup; All-Lanes-Merged-Sample Level: Dedup + Realign + Recalibrate) was released/updated after the GATK v3.5 release. Does that mean that with the previous GATK v3.4 it is better to re-align at lane level, as it cannot re-align high-depth regions in the all-lanes-merged-sample bam file, whereas the current GATK v3.5 has the capability to re-align high-depth regions in the all-lanes-merged-sample bam file?

My pipeline currently uses GATK v3.4, and it follows previous Best Practices guideline as follows (Lane Level : Map + Sort + Dedup + Realign + Recalibrate ; All-Lanes-Merged-Sample Level : Sort + Dedup + Realign).

I have 4 lanes of data per sample (from a NextSeq500 instrument) and our exome fastq.gz files can be as large as ~20 million reads per paired-end file. The Realign + Recalibrate steps at lane level are time consuming, and skipping these 2 steps at lane level, and including the base recalibration step at merged-sample level as per the new recommendation, decreases my overall processing time.

If I switch to GATK v3.5, would indels be properly realigned in high-depth regions, even though I skip lane-level realignment?

Thanks,
mglclinical

@mglclinical These recent edits don't reflect any significant change of operation in the tools, just a shift in pipeline implementation philosophy. While indel realignment used to be very important when we were calling variants with UnifiedGenotyper, the HaplotypeCaller's ability to perform its own realignment has reduced the importance of the pre-processing realignment. It is still performed because it can produce benefits for base recalibration, but now we're satisfied with a best-effort pass per-sample, as opposed to the more thorough pass that can be achieved per-lane.

So as long as you're calling variants with HaplotypeCaller, you can follow the new recommendations with either version without concern.

Geraldine Van der Auwera, PhD

• USAMember Posts: 95

@Geraldine_VdAuwera thanks for the quick clarification

• New York, NYMember Posts: 9
edited March 2016

Maybe I'm confused, but with the new shift in philosophy, there seems to be minimal value in aligning, sorting and de-duping the lane-specific fastq's now. Given the current philosophy, it does not appear that there would be much difference between any of the following approaches:
* GATK best practices: per-lane align, per-lane sort, per-lane de-dup, merged, re-sort, re-dedup, re-align, ....
* per-lane align, merge, sort, de-dup, re-align, ....
* concat fastq's, align, merge, sort, de-dup, re-align, ....

The article above states that the per-lane sorting reduces the complexity of the subsequent analysis, but unlike the previously suggested practice (which was aimed at reducing lane-specific artifacts and improving alignment accuracy), there seems to be little value in the additional computation.

Am I missing something? If not, would it make sense to modify the article to state that if you are using the UnifiedGenotyper, you should follow the old instructions; otherwise, if you are using the HaplotypeCaller, you can follow the updated instructions, or just follow one of the simpler pipelines that I just listed?

@Geraldine_VdAuwera said:
@mglclinical These recent edits don't reflect any significant change of operation in the tools, just a shift in pipeline implementation philosophy. While indel realignment used to be very important when we were calling variants with UnifiedGenotyper, the HaplotypeCaller's ability to perform its own realignment has reduced the importance of the pre-processing realignment. It is still performed because it can produce benefits for base recalibration, but now we're satisfied with a best-effort pass per-sample, as opposed to the more thorough pass that can be achieved per-lane.

So as long as you're calling variants with HaplotypeCaller, you can follow the new recommendations with either version without concern.

#### Issue · Github March 2016 by Sheila

Issue #750, closed by vdauwera.

Hi @pwaltman, sorry for the late reply. You summarize our recommendations above as "per-lane align, per-lane sort, per-lane de-dup, merged, re-sort, re-dedup, re-align, ...." but to be clear in your terms it would be "per-lane align, per-lane sort, per-lane dedup, merge, (no re-sort), re-dedup (actually this step can be used to do the merge too), local realign (to be clear it's not a repeat of the original align), ...". This produces the same core advantages as the previous version but in a more compact and efficient workflow. You're correct that results should be largely equivalent to the two other workflows you propose if I'm reading them correctly (I assume you're not proposing to combine all the fastqs before genome-aligning, since you would lose the ability to distinguish read groups); the main difference is the double dedup, which we do for QC reasons (to get per-lane library complexity metrics).

Geraldine Van der Auwera, PhD

• PolandMember Posts: 15

Thanks for this post. What should I do for a sample that has a number of libraries, where each library has a number of lanes? Should the libraries be processed separately per sample, so that one sample ends up with a number of bam files, one for each library?

@medhat
Hi,

-Sheila

• United StatesMember Posts: 26

Thanks for the detailed explanation. I wonder if it is okay to add RG information and account for the different lanes after marking duplicates for each sample. Our samples are indeed multiplexed, but at the current stage I need to test multiple variant callers, so I slightly prefer to just have one set of .dedup.bam files for each sample. Thank you.

Sure, that's fine. All GATK tools that read BAM files can take multiple BAMs as input. As long as the RG info is correct the data will be handled appropriately internally.

Geraldine Van der Auwera, PhD

• United StatesMember Posts: 26

Hi Geraldine, thank you so much for the quick reply. Just to confirm, for each .dedup.bam file for each sample, if I can correctly label the @RG lines (one line per lane) at the start of the bam file, then this .dedup.bam should fit into GATK - is that correct please? Thanks a lot.

@mscjulia
Hi,

Yes, you are correct

-Sheila

• UKMember Posts: 8

Hello,

Thanks for this post, it is very helpful. I am working with whole genome resequencing data with several sequencing lanes for each sample.

I have mapped the fastq files to the reference genome and now I have several bam files (one per lane) for each individual that I would like to process with GATK. I should have roughly 10x coverage per individual and will thus use ANGSD for SNP discovery.

If I understand properly I should get rid of the duplicates twice, before and after merging the bam files of one individual? And then run the indel-realignment and the base recalibration step on the merged bam file for an individual (the lane information should be kept by adding the read group info using picard tools before merging the files)?

I guess it is better to run the indel-realignment step as I'll use ANGSD for SNP calling and not a haplotype caller, right?

I am also wondering whether I should use GATK or samtools to remove the duplicates.

Thank you,

Marie

Hi @mariel,

I'm not familiar with ANGSD. What is your reason for using ANGSD rather than the GATK workflow?

Let me restrict my answer to the question of whether you should run MarkDuplicates twice. This post and thread already explains that running MarkDuplicates twice is only relevant if your sample was multiplexed and you care about duplicate metrics, i.e. a library complexity estimate that discounts optical duplicates.

This site has extensive documentation on GATK workflows that use GATK and Picard tools. Please let us know if you have questions pertaining to these tools.

• UKMember Posts: 8

Hi shlee,

I was thinking of using ANGSD for SNP discovery as I have low-coverage data (around 8-10x).

I would like to check whether the read group information I'm going to include in the bam files with Picard tools is fine (I intend to use MarkDuplicates, base recalibration, indel realignment).
I have 1 library per sample, and all my 20 samples have been run on 6 sequencing lanes (all 20 samples are multiplexed on each lane).

Thus RGID would be: flow_cell/lane number
FCHVMHNCCXX_L0
FCH37WYALXX_L1
FCH37WYALXX_L2
I am not sure whether the sample ID should be included here.

RGLB would be the library number, e.g. wHAXPI030603-63

RGPU would be flow_cell/lane number/sample_name
e.g. FCHVMHNCCXX_L0_IR23
FCH37WYALXX_L1_IR23
FCH37WYALXX_L2_IR23

RGSM would be the sample name, e.g IR23.

Do you think the above information is fine?

Thanks for the help,
Best wishes,

Marie

Hi Marie ( @mariel ),

Assuming each of your 120 files provides >100M bases in sequence, your RG field designations look great. Given your data is low coverage, you should make sure each RGID group has enough reads for BaseRecalibrator to be effective. To quote from this document, you need "well in excess of 100M bases from a next-generation DNA sequencer per read group. 1B bases yields significantly better results."

For whoever else reads this thread, let's rehash the RG field designations.

• You end up with 20 x 6 = 120 files that represent 20 samples sequenced over 6 flowcell lanes.
• You feed the 6 files belonging to a sample to MarkDuplicates, which then outputs a single sample file. Each of the input files has a unique RGID due to the flow_cell.lane_number, and matching SM and LB fields. The resulting single output file has 6 RGIDs.
• You feed the single-sample BAM to indel realignment (RealignerTargetCreator/IndelRealigner). A single realigned file is output.
• You feed the single-sample realigned BAM to BQSR where RGPU designations take precedence. Each RGID group in the BAM has a unique RGPU, so effectively you are again using the 6 unique RGID groups to recalibrate per lane artifacts.
• You go on to call variants with a caller that only considers the RGSM (sample) field.

Make sure to run indel realignment before base recalibration. Also, in case you are unaware, we removed indel realignment recently from the Best Practice Workflows that use assembly-based variant calling, e.g. HaplotypeCaller & MuTect2. Workflows that use pileup-based callers, e.g. UnifiedGenotyper, still require indel realignment. So, depending on what type of caller ANGSD is, and the quality and type of data you have, keep this omission in mind.

• UKMember Posts: 8

Hi @shlee,

Thank you very much, this is very helpful!
I have around 6B bases per read group, so I should be able to run BaseRecalibrator.

Marie

• ItalyMember Posts: 28

Hi all,
These posts are very helpful and complete, but I still have some doubts, and I hope it is not a bother to ask for help.
We have data sequenced with Illumina NextSeq, we loaded a single library that is a pool of 12 samples.
The flow cell has 4 lanes so at the end of de-multiplexing we have 4 paired fastq files per sample.
Reading the documentation I understand that the correct workflow is:

1) Per-lane alignment (I use BWA)
2) Per-lane sort (PICARD)
3) Per-lane read group assignment (AddOrReplaceReadGroups, PICARD)
4) MarkDuplicates (PICARD), specifying all 4 bam files of a single sample, obtaining a single merged.dedup bam file
5) BaseQualityScoreRecalibrator (GATK) on the merged.dedup file
6) HaplotypeCaller (GATK) on the bam files

My doubt regards the point 3 specifically the values of RGID, RGLB, RGPU, RGSM that have to be specified in AddOrReplaceReadGroup tool.
As far as I understand I have to use a different RGID, RGSM and RGLB in the per-lane bams; the same RGPU for all the 4 bam files.
As RGID I am using samplename_lanenumber, the same value is assigned in RGSM field. The RGLB is the number of the lane while the RGPU is the flowcell number.

For example for the sampleA I will have:
1) Raw data
sampleA_L001_R1.fastq.gz sampleA_L001_R2.fastq.gz
sampleA_L002_R1.fastq.gz sampleA_L002_R2.fastq.gz
sampleA_L003_R1.fastq.gz sampleA_L003_R2.fastq.gz
sampleA_L004_R1.fastq.gz sampleA_L004_R2.fastq.gz
sampleA_L001.bam with RGID=sampleA_L001 RGLB=L001 RGPU=HGH5YBGX2 RGSM=sampleA_L001
sampleA_L002.bam with RGID=sampleA_L002 RGLB=L002 RGPU=HGH5YBGX2 RGSM=sampleA_L002
sampleA_L003.bam with RGID=sampleA_L003 RGLB=L003 RGPU=HGH5YBGX2 RGSM=sampleA_L003
sampleA_L004.bam with RGID=sampleA_L004 RGLB=L004 RGPU=HGH5YBGX2 RGSM=sampleA_L004
3) After MarkDuplicates
sampleA.merge.dedup.bam
4) After BQSR
sampleA.merge.dedup.bqsr.bam

Is that correct?

Thank you so much for the help and the kind support.

Stefania

@merella.stefania
Hi Stefania,

Have you looked at this article? I think that will clarify things.

-Sheila

• ItalyMember Posts: 28

hi @Sheila
Thanks for the suggestion, I read the article and I changed the RG of my bam files. The workflow used is:

1) Per-lane alignment (I use BWA)
2) Per-lane sort (PICARD)
3) Per-lane read group assignment (AddOrReplaceReadGroups, PICARD)
4) MarkDuplicates (PICARD), specifying all 4 bam files of a single sample, obtaining a single merged.dedup bam file
5) BaseQualityScoreRecalibrator (GATK) on the merged.dedup file
6) HaplotypeCaller (GATK) on the bam files

And the RG assigned are:

sampleA_L001.bam with RGID=sampleA_L001 RGLB=sample_A RGPU=HGH5YBGX2 RGSM=sample_A
sampleA_L002.bam with RGID=sampleA_L002 RGLB=sample_A RGPU=HGH5YBGX2 RGSM=sample_A
sampleA_L003.bam with RGID=sampleA_L003 RGLB=sample_A RGPU=HGH5YBGX2 RGSM=sample_A
sampleA_L004.bam with RGID=sampleA_L004 RGLB=sample_A RGPU=HGH5YBGX2 RGSM=sample_A

It should be correct now.

Thanks,

Stefania

@merella.stefania
Hi Stefania,

Looks good I am happy the article clarified things.

-Sheila

• AustraliaMember Posts: 29
edited March 16

Hi, I looked through this topic and didn't find information about how to pre-process multiplexed flow cell sequencing data. I have 8 fastq files from two flow cells (flowcell 1 and flowcell 2); each flow cell has two sample libraries (sample A and sample B), and each sample is sequenced on two lanes (lane 1 and lane 2). Then the read group IDs for the 8 files are:
ID:Flowcell1.sampleA.lane1
ID:Flowcell1.sampleA.lane2
ID:Flowcell1.sampleB.lane1
ID:Flowcell1.sampleB.lane2
ID:Flowcell2.sampleA.lane1
ID:Flowcell2.sampleA.lane2
ID:Flowcell2.sampleB.lane1
ID:Flowcell2.sampleB.lane2.

I know how to pre-process multiplexed sequencing in one flow cell. But I do not know the Best Practice for multiple flow cell sequencing. Do I need to merge files from different flow cells (say, Flowcell1and2.sampleA.lane1, Flowcell1and2.sampleA.lane2, Flowcell1and2.sampleB.lane1, Flowcell1and2.sampleB.lane2) before pre-processing, and then follow the Best Practice for multiplexed sequencing? Thanks for your time!

Hi @jingmeng,

I think this article https://software.broadinstitute.org/gatk/guide/article?id=6472 will be helpful. When we speak of sample libraries, we refer to Person A's library prep X. The same Person A could give a separate sample on another day for a different library prep Y.

For merging different read group files for the same library, you can feed the different files into MarkDuplicates all at once and the tool will output a single merged alignment records file.

• AustraliaMember Posts: 29

@shlee said:
Hi @jingmeng,

I think this article https://software.broadinstitute.org/gatk/guide/article?id=6472 will be helpful. When we speak of sample libraries, we refer to Person A's library prep X. The same Person A could give a separate sample on another day for a different library prep Y.

For merging different read group files for the same library, you can feed the different files into MarkDuplicates all at once and the tool will output a single merged alignment records file.

Thanks shlee. So read group ID:Flowcell1.sampleA.lane1 and read group ID:Flowcell2.sampleA.lane1 are different read groups; they need to be aligned separately, and then be merged and marked for duplicates. Is my understanding correct?

Also, I am not sure about the meaning of "when sequencing pools of samples, use a pool name instead of an individual sample name". For example, for an individual (NA12878) with three samples (sample A, B and C), is the SM NA12878 for all read groups, or are the SM values sample A, B and C? Thanks for your time!

edited March 17

Hi @jingmeng ,

BWA is agnostic in many ways and aligns each read independently. I suggest you align each of your read groups separately so that you can then easily add the read group information to each file after the alignment.

GATK tools are read group aware. So once you provide MarkDuplicates all of your per sample read groups, it will merge them but it also takes care to differentially process the different library preparations and read groups.

Does your library consist of a pool of samples? Or does each library correspond to one sample? Each sample is a person whose genotypes our workflows ultimately call.

• AustraliaMember Posts: 29

@shlee said:
Hi @jingmeng ,

BWA is agnostic in many ways and aligns each read independently. I suggest you align each of your read groups separately so that you can then easily add the read group information to each file after the alignment.

GATK tools are read group aware. So once you provide MarkDuplicates all of your per sample read groups, it will merge them but it also takes care to differentially process the different library preparations and read groups.

Does your library consist of a pool of samples? Or does each library correspond to one sample? Each sample is a person whose genotypes our workflows ultimately call.

Thanks shlee. I have a high-coverage sequencing dataset from an individual (NA12878) and am going to call the genotypes. This dataset consists of 3 samples (sample A, B and C) from the same person, NA12878. Each sample corresponds to a different library. My question is how to name the three samples in the pre-processing step: is the SM NA12878 for all read groups, or are the SM values sample A, B and C? Thanks for your time!

Hi @jingmeng,

The way you mean sample (experimental) and the way our workflows define sample (an individual) are different. Given we seem to be on different pages with this definition, one thing to clarify is what you mean by different libraries. Did your A, B and C get processed in different tubes? Or were they differentially barcoded and processed simultaneously? When loading onto the sequencer, did each come from a different tube (different libraries; biological replicates) or the same tube (same library; technical replicates)?

Library designations are important to estimating library complexity, which MarkDuplicates does for you while also marking duplicates or which EstimateLibraryComplexity does for you as a standalone tool. Depending on your interest in library complexity, and I discuss some of its implications in Tutorial#6747, then you may be more or less interested in library designations.

To answer your question above, let me describe two scenarios where we assume different libraries corresponding to biological replicates. I hope these provide the insight you need.

Typical use case where goal is to genotype an individual with as much evidence as possible
Given you say you have three different library preps for the same individual's DNA, then you want to be sure to represent this as different LB groups. E.g.:

LB:A SM:NA12878
LB:B SM:NA12878
LB:C SM:NA12878


You want each library to contribute as evidence in genotyping the same sample, so set your sample name identically across the libraries. The sample name is what ultimately gets a genotype column in the VCF.

Atypical scenario where goal is to compare different experimental conditions in genotyping
If you are using the different library preps to compare experimental conditions and the genotyping they enable, then you will want to set the sample names differently. This way you can process them jointly and obtain a genotype call column for each in the genotyped VCF.

• AustraliaMember Posts: 29

Thanks @shlee for your detailed explanation. My dataset corresponds to the first typical use case. Now I know how to label these files. Thanks again for your help.

You're welcome!

• AustraliaMember Posts: 29

@shlee said:
You're welcome!

Hi shlee, I have another question. After the recalibration step, my bam file is getting much bigger (from ~100G to ~240G). I know from this forum that it is because base recalibration adds insertion and deletion tags. I am wondering if there are some options in GATK BQSR that do not change the BAM file size dramatically and do not affect downstream analysis. Thanks for your time!

Hi @jingmeng,

I'm not familiar with the details of BQSR so let's ask @Sheila to jump in here.

In the meanwhile, here is the ApplyBQSR command from our most recent production WDL. Notice that they are using GATK4:

java -XX:+PrintFlagsFinal -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps \
-XX:+PrintGCDetails -Xloggc:gc_log.log -Dsamjdk.use_async_io=false \
-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx3000m \
-jar /usr/gitc/GATK4.jar \
ApplyBQSR \
--createOutputBamMD5 \
--addOutputSAMProgramRecord \
-R ${ref_fasta} \
-I ${input_bam} \
--useOriginalQualities \
-O ${output_bam_basename}.bam \
-bqsr ${recalibration_report} \
-SQQ 10 -SQQ 20 -SQQ 30 \
-L ${sep=" -L " sequence_group_interval}

I believe the OQ (original base quality score) tag is what takes up a lot of space in that it is hard to compress. Are you retaining these?
I'm not familiar with the insertion and deletion tags so here again perhaps @Sheila can enlighten us.

If you can provide us with your related commands, then we can see if there is some parameter that we could optimize.

• AustraliaMember Posts: 29

@shlee said:
Hi @jingmeng,

I'm not familiar with the details of BQSR so let's ask @Sheila to jump in here.

In the meanwhile, here is the ApplyBQSR command from our most recent production WDL. Notice that they are using GATK4:

java -XX:+PrintFlagsFinal -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps \
-XX:+PrintGCDetails -Xloggc:gc_log.log -Dsamjdk.use_async_io=false \
-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx3000m \
-jar /usr/gitc/GATK4.jar \
ApplyBQSR \
--createOutputBamMD5 \
--addOutputSAMProgramRecord \
-R ${ref_fasta} \
-I ${input_bam} \
--useOriginalQualities \
-O ${output_bam_basename}.bam \
-bqsr ${recalibration_report} \
-SQQ 10 -SQQ 20 -SQQ 30 \
-L ${sep=" -L " sequence_group_interval}

I believe the OQ (original base quality score) tag is what takes up a lot of space in that it is hard to compress. Are you retaining these?
I'm not familiar with the insertion and deletion tags so here again perhaps @Sheila can enlighten us.

If you can provide us with your related commands, then we can see if there is some parameter that we could optimize.

Thanks Shlee. Here are my commands:

java -jar ~/GATK-3.7/GenomeAnalysisTK.jar -T BaseRecalibrator -R ~/hg38.fa -knownSites ~/All_20170403.vcf -I ~/original.bam -o ~/recal.table

java -jar ~/GATK-3.7/GenomeAnalysisTK.jar -T PrintReads -R ~/hg38.fa -I ~/original.bam --BQSR ~/recal.table -o ~/recalibrated.bam

I guess I retained the 'OQ' tag. If so, how can I remove it? From the BaseRecalibrator documentation, I did not find information about it. As it takes a while to do the recalibration, is there some way to refine the recalibrated bam file from the above command? Thanks very much for your time!

Under Optional Flags you see some BQSR-related engine options:

--disable_indel_quals (-DIQ), default false: Disable printing of base insertion and deletion tags (with -BQSR)
--emit_original_quals (-EOQ), default false: Emit the OQ tag with the original base qualities (with -BQSR)

You can use these engine options within your PrintReads --BQSR command to remove the indel quals. It appears that emitting the OQ tag is disabled by default.
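Concretely, adding that engine flag to the PrintReads command posted earlier in this thread might look like the sketch below (same paths as in that command):

```shell
# -DIQ suppresses the BI/BD insertion/deletion quality tags,
# which keeps the recalibrated BAM considerably smaller.
java -jar ~/GATK-3.7/GenomeAnalysisTK.jar -T PrintReads \
  -R ~/hg38.fa \
  -I ~/original.bam \
  --BQSR ~/recal.table \
  -DIQ \
  -o ~/recalibrated.bam
```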

Consider for the future also using GATK4's ApplyBQSR tool in isolation instead of PrintReads. I believe there are some optimizations that allow the tool to run faster. I know GATK4 is in alpha but the tool itself is being used in our Genomics Platform production runs so we can gather that it is stable.

I've asked @Sheila to add her two cents here so wait for that before making any decisions.