In the example of LB and PI (section 9) should the PI not be PI:200, PI:400, PI:200, PI:400 vs 200,200,400,400? LB flags the library, and you have the 2 libraries sequenced twice, once at 200bp inserts, once at 400bp. Surely this example defines it as being sequenced twice at 200 and 400bp respectively?
We do not require value for the CN, DS, DT, PG, PI, or PU fields
Does base recalibration look at the RG ID field to determine sequencing lanes, or does it look at the fastq headers for each read? How important is the format of the ID field for GATK (as long as it's unique per bam file)? It it doesn't contain all of the fastq header information (flowcell, lane etc), does that affect the performance of the gatk algorithms?
It is important to put as much information as possible in the RG field, because that is indeed what the tool uses to determine covariates for the analysis. The more information there is in the RG, the more accurate the analysis, for base recalibration at least. Other tools are not so reliant on that information, however.
In regard to the section: "What's the meaning of the standard read group fields?"
We often have multiplexed samples with the same flowcell and lane number. Should different samples (with different SM values) have the same the same read group IDs (in different BAM files), or should we include the barcode in that too? We wonder if GATK quality score recalibration might somehow use the information that the samples were sequenced in the same flowcell and lane. Is there anything wrong with setting multiple multiplexed samples with the same library?
I'm not sure I understand exactly what is your question, but I can tell you that Base Recalibration should be done purely per lane. The tool is read group aware so you can have information originating from several lanes for each sample in a single bam file; each lane's worth will be recalibrated separately as long as they are distinguished by different read groups.
Does that clarify things?
Is it correct to say that different samples, sequenced together on the same lane, will have exactly the same read group (RG), and be told apart from the sample name (SM)?
...I promised in another topic that was my last question of the year... I guess I am not good at keeping promises...
Tssk tssk... That's okay, Max, you can ask questions until tomorrow at 4 pm EST. After that, pfft, we're gone
The read group string overall contains ID, lane info, sample info etc, so strictly speaking they're not the same. But yes, it's okay for different samples to have the same read group ID. BQSR will correctly process data by ID, and the rest of the GATK tools will process data by sample name (SM).
EDIT: this was a bad answer (sorry!) and has been corrected below.
This is a very useful thread. However, the information about how to correct the contigs could use some clarifying. I went through the process of figuring out how to fix my bam files that are in the consortium standard to get them to work with the UCSC hg19 convention on my own a year or so ago. But now I can't remember how I got it. SortSam and ReorderSam do not seem to be doing it for me.
@furgason5, that's because the advice is realign them, not to correct the names. There are slight differences between the assemblies. To my knowledge, there are none in the autosomes or sex chromosomes, but I know the mito reference differs and would expect the unplaced chroms to differ as well.
Actually, the correct answer that avoids realignment is to use liftover. I know that there's a LiftoverVariants walker that works on VCFs, I don't know of anything for bams.
After all the warnings, if you still want to just modify the names in the bam, it's reasonably difficult. I don't think there are any tools that will do this for you, but what you would do is to write it out as a sam, filter each read to update the contig names (in both fields - one for the current alignment, one for the mate), update the @SQ dictionary in the header, and then convert back to bam. And if you care about any of the optional read tags, you'll need to update those as well (I think BWA puts alternate alignments in one of those tags occasionally).
On balance, I usually find it easy to just realign
If a RG id is unique within a bam file but not unique across bam files, it sounds like the RG id will need to be updated to a unique id across all bam files to be analyzed. It that the case? We have bam files from a sequencing center that has simply sequentially assigned RG ID to 0, 0.1 or 0.2 within each bam file. What is the best approach to deal with this? Is there a way to automatically generate unique RG ids across these bam files or is an update of the RG id required in each bam file?
Different tools use different RG tags to aggregate data, so for those that don't use the ID (but instead use SM, for example) you may get away with non-unique RG IDs. HOWEVER this is generally a bad idea and may cause you terrible headaches down the road, so I would strongly suggest using unique IDs. This should be a data management policy decision. You can update the RGs as you get the data from the sequencing center (or better yet, yell at them and demand that they assign sensible RGs); see the recommendations for generating unique IDs given in the table above.
Hi @Geraldine_VdAuwera ,
In your answers above to @trickytank and @mstagliamonte , you mentioned in the case of multiplexing several samples in one lane, all read groups should have the same @RG ID. I thought this wasn't allowed. Let me show an example.
I have two samples and I'm sequencing them in the same lane(flowcell=A, lane=1). If I understood your answers, you are saying my read groups should look like:
@RG ID:A.1 SM:SAMPLE01 LB:SAMPLE01 @RG ID:A.1 SM:SAMPLE02 LB:SAMPLE02
I don't think you can have multiple read groups with the same @RG ID. If I understood the SAM spec, I thought I should have something like this:
@RG ID:SAMPLE01 SM:SAMPLE01 LB:SAMPLE01 PU:A.1 @RG ID:SAMPLE02 SM:SAMPLE02 LB:SAMPLE02 PU:A.1
Now I have @RG ID that are unique, but BSQR won't use the same error model for both read groups. I wonder why BSQR doesn't use @RG PU instead? Picard already requires that tag.
Correct. It is NEVER okay to have identical read group IDs. That was a mistake on our end - thanks for catching it!
My understanding/experience has been this. PU should always be the most unique identifier. Given that PU is flowcell,lane.barcode for multiplex samples then if using SamToFastq.jar in Picard and specifying OUTPUT_PER_RG, the PU tag takes precedence over the ID tag when outputting the prefix name(s) of the fastq file(s). So if you have multiplex samples in a lane, but have a bam file for each platform unit, then individually they can have the same ID tag. However, if you are to merge them all back into a single bam file then the ID tag needs to be modified to avoid collisions, I believe MergeSamFiles.jar in Picard does this for you (or maybe a walker in GATK or both, I haven't had a need to do that in years). So given that you have two separate bam files like so;
@RG ID:flowcellA.lane1 SM:sample1 LB:sample1 PU:flowcellA.lane1.barcodeX
@RG ID:flowcellA.lane1 SM:sample2 LB:sample2 PU:flowcellA.lane1.barcodeY
and then merged them together then, the RG lines should be;
@RG ID:flowcellA.lane1.1 SM:sample1 LB:sample1 PU:flowcellA.lane1.barcodeX
@RG ID:flowcellA.lane1.2 SM:sample2 LB:sample2 PU:flowcellA.lane1.barcodyY
I have a question about read groups: I've managed to give all my data (n=24 bams) the same RGID so obviously that needs to be fixed before I can trust anything.
However, given that I was running BQSR on each individual bam file in isolation, does the RGID matter for this step? Will re-running BQSR on each individual bam with the corrected RGID make any difference (or could I save the compute time and replace the read groups on my previously BQSR-recalibrated bams)?
(And on further thought: Of my 24 samples, 10 were multiplexed into one lane. So I presume these should have the same RGID. But do I still run BQSR on each of these 10 multiplexed samples individually? Or on all 10 at once (is that even possible?!))
For the files where 1 bam file contains the data from 1 sample run on 1 lane, you can just replace the rgid and proceed with your analysis.
For the 10 multiplexed samples, it is recommended to run BaseRecalibrator on them together. (Yes, it is possible!) You can either input a list of the bam files with each file name on a separate line and name it with a .list at the end. You can also use -I 10 times, with each file after 1 -I.
Recalibrating all the data from one lane together is better because it empowers the modeling process, but it may not make a big difference if the data has deep coverage and good quality sequence. It is up to you to decide whether to rerun or not, depending on how confident you are in the quality of the data.
If you need individual files for each of the 10 samples, you can use PrintReads after BQSR. Please read about Print Reads here: http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_readutils_PrintReads.html
I hope this helps.
Dear @Sheila, thanks very much for that info! I'll do as you suggest (the multiplexed samples are low coverage so I will rerun the BQSR on them collectively).
Please also note that typical usage is to use PrintReads after BQSR in any case, whether you want one group file or you want individual files. The difference is that in the former case, you pass in all the files as input to PrintReads in one go, while if you want separate files, you run PrintReads separately on each file in turn.
I do not think this is stated in the PrintReads document.
Hi @Sheila, yes, the PrintReads step was already in my pipeline - the document is not particularly clear about multi and single sample input vs output, but it seems to be working fine for me. Thanks for the clarification!
I am happy it is working for you. Thanks for pointing out that the document is not particularly clear. We will try to clear it up soon.
This is indeed a very useful thread. However, I am still a bit confused regarding what is the best way to proceed in terms of the read group ID assignment for different samples sequenced in the same lane.
I have 6 lanes of data. In each lane, 12 samples were multiplexed and sequenced together. I have therefore 12 different bam files per lane (72 in total).
Initially, I was assigning the same @RG ID for the samples sequenced in the same lane, but after reading this thread I understand that the @RG ID must be unique and therefore all samples should have different @RG ID. Is this correct?
However, if all samples have different @RG ID now, how will BSQR now which ones came from the same lane to use the same error model? (this goes on the following of @Carlos question).
Many thanks for the help!
As far as I know BSQR doesn't have a way to tell which samples came from the same lane in a multiplexing run. We have commercial support for GATK from Appistry and that was their answer to this question. I also asked here, at biostars and directly to Appistry about why GATK doesn't use PU to identify samples from the same lane. No answer so far as why that isn't the case.
Quick note, looking through my communications with Appistry I did get an answer about why not use PU. At the moment this tag is not required. I still think GATK should use the information from the tag if if present.
Hi Carlos et al.,
To be honest, this is a very low priority point for us because our in-house sequencing platform has an established system for processing and managing samples identifiers that works with how BQSR is set up. So even though PU could potentially be useful/relevant (I have not thought this all the way through myself), making the necessary changes to use it if present could lead to complications that we don't have the resources to tackle at this time.
We're also hesitant to advise people on how to proceed exactly with read group assignment because it depends so much on how the sequencing experiment is configured (are samples multiplexed on same lanes and/or spread across lanes, how do you want to split up / merge data etc, what technologies are involved). So we prefer to communicate clearly (or so we hope) on what basis BQSR treats a set of data as a unit (the RG ID), and the idea that recalibration should be done per unit of data that is subject to the same machine biases. With that, people should be able to decide how to process their data appropriately.
That's fair. My follow up question would be if is possible for you to share how Broad manage samples identifiers in a same lane multiplexed experiment. I don't see a way to do it and at the same time make sure recalibration is done per unit of data subjected to the same machine biases.
I have 4 recalibrated bam files produced from realined bam files (Note: I have lost my realined.bam files)
I have lost my realined.bam files
Right now I need to split those by chromosomes
I have tried the following code
java -Xmx4g -jar PATH/GenomeAnalysisTK.jar -T PrintReads -R PATH/reference -I rcal.bam01 -I rcal.bam02 -I rcal.bam03 -I rcal.bam04 -L chr25 -o PATH/chr25.bam -nct 8 -S LENIENT
It seems like working for a long time but no file PATH/chr25.bam is generating in the output directory
Thanks in advance
Why are you using -S LENIENT? This could be covering up something that is wrong with your file. Try running without it, and see if that works.
Sorry for the delayed response, I've been on the road for a workshop.
I've checked what we do internally; actually the Broad production pipeline demultiplexes samples from each lane to individual bam files, so they are recalibrated individually. So this simplifies the processing since each sample per lane gets its own read group. The downside is that you have to be sure to have enough data per sample for recalibration to work properly, which limits the number of samples you can multiplex on a single lane.
The last answer dated in July 2014 is clear regarding unique RGs for multiplexed samples. However, could you please offer some guidance on how much data is enough per sample for recalibration to work properly? We have high-coverage targeted (~800Kb) sequencing data.
Thank you very much!
The key metric is not how much you have per sample, it's how much you have per read group (as defined after demultiplexing). So the target size doesn't really matter; it comes down to how many libraries are multiplexed per lane, and therefore how much data is generated for each of them. I don't have hard numbers, but I believe in production there can be dozens of libraries multiplexed onto a single lane (on Illumina HiSeq).
I have been reading the previous comments about assigning RG to bam files, but still I am not sure what I'm doing is correct.
I am working with RAD-seq data from a tree species. I have 5 libraries, and each library was sequenced on a different lane of the same flowcell. In each lane, 95 multiplexed individuals were sequenced.
After aligning my reads to the reference genome, I have individual bam files for each of my samples (5x95=475 bam files). I added the RG with Picard Tools AddOrReplaceReadGroups, but I'm not sure I did it in the correct way, in order to use then IndelRealigner, BaseRecalibrator and HaplotypeCaller.
I haven't understood yet if the RGID must be unique for each sample or if it has to be the same for all the samples run on the same lane (and only RGSM must be different).
For example, for the sample named F008 belonging to library1 and coming from lane436, I gave the following values:
and for the sample F009 sequenced on the same lane:
while for the sample F033 belonging to library2 and sequenced on lane437 of the same machine:
Is this correct? Or should I give the same RGID to F008 and F009 if I want the BaseRecalibrator to be aware they were sequenced on the same lane?
Besides the samples in these 5 libraries I mentioned above, I also have samples belonging to library6 sequenced on a different instrument (therefore I changed flowcellA to flowcellB) for example on lane3. Is it correct to assign these RG values to the bam file of e.g. sample IT91?
I would like to be sure the BaseRecalibrator is able to distinguish samples sequenced on the same lane, and I have not understood yet if this is determined by RGID or RGPU.
Is BaseRecalibrator the only tool needing this information? Do IndelRealigner, DepthOfCoverage, UnifiedGenotyper and HaplotypeCaller take only RGSM into account?
Sorry for this long post, but I'm really confused about this!
Thank you very much.
You are doing it correctly. Have a look at this article for more information: https://www.broadinstitute.org/gatk/guide/article?id=1317
All of the GATK tools take the RGID into account, however, it is particularly important for the pre-processing tools.
thank you very much for your answer. It's actually the page you indicated that confused me in the first place! :-)
So, if I ran the RealignerTargetCreator and IndelRealigner with wrong information in the RGPU and RGLB fields, but correct information in the RGID field, do I have to re-run it again or RGPU and RGLB are not taken into account during the indel realignment?
Then, I read here gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibration-bqsr that "A critical determinant of the quality of the recalibration is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group. 1B bases yields significantly better results."
So, is it necessary to have 100M bases for each RG defined by RGID (in my case each RGID corresponds to a different sample) or for each RG defined by RGPU (in my case each RGPU corresponds to a different lane)?
And what do you mean with "bin"?
In the same page I also read this: "BQSR constructs a separate model for each read group". Also in this case, is the read group defined by RGID or by RGPU? If it's defined by RGID, in my case it will build a separate model for each individual, also if batches of 95 individuals (see my previous post) were sequenced on the same lane, so they should follow the same model. If it's defined by RGPU, then it will construct a model for each lane, and this is what I would expect.
I thought the article I recommended was quite clear! Thanks for letting me know it's not. I will put in a ticket to have the RGID explained better.
I just posted a new article that is not for the exact same scenario as yours, but I think it will answer most of your questions: http://gatkforums.broadinstitute.org/discussion/6057/i-have-multiple-read-groups-for-1-sample-how-should-i-pre-process-them
Let me know if the article is helpful.
For your question about binning, it turns out the base quality scores are not represented as individual numbers but as ranges of numbers. These ranges are the "bins". It is to help reduce file size. For example, if you have individual scores for each base, the bam file size will get very large, but with bins, the files size can be kept smaller.
and thank you again for your answer.
Yes, the article you suggested was helpful.
However, there is still something not really clear to me: when running BaseRecalibrator, should I give as input all my bam files together (also if coming from different lanes) or should I run it separately for each lane (RGPU)? I mean, if I have 95 bam files for each lane and I have 5 lanes (so 5 different RGPUs), should I run it 5 times or should I run all the 475 bam files together? I would expect I can run them all together since BSQR recognizes different RGPUs.
If I have two bam files for the same sample that was sequenced on two different lanes, these files will have different RGID, different RGLB (because they come from two different libraries), different RGPU, but the same RGSM and RGPL. After putting them together at the IndelRealigner or BQSR steps as you wrote here gatkforums.broadinstitute.org/discussion/6057/i-have-multiple-read-groups-for-1-sample-how-should-i-pre-process-them, should I give them the same RGID, but still different RGPU and RGLB? Or should they still have also different RGIDs?
Could you be a bit more clear about the binning? Do you mean that not every bp has its own quality score, but they are grouped with other bp? I didn't understand this.
Last thing (at least for now!). I think there is a small mistake in the article you suggested: you wrote "We do not require the PU field in the RGID", but PU and ID are two different fields of the RG, PU is not a part of ID. Maybe it's just a detail, but when trying to understand this, also small things can create confusion :-)
Yes, you can run all your bam files together because BQSR recognizes RGID.
After Indel Realignment and BQSR, you can leave the RGID alone. All tools in the Variant Calling step recognize the SM tag.
For your binning question, I think this thread will explain more than I can say here: http://gatkforums.broadinstitute.org/discussion/4594/beware-of-using-binned-quality-scores-with-some-gatk-procedures
Thanks for the suggestion. I will change RGID to RG
Please let us know if this new dictionary entry clarifies your main question about ID versus PU fields. Also, if we've answered your questions.
It explains that the @RG's ID (or PU) field must be unique for each sample as this field defines a covariate for BaseRecalibrator.
We've created this entry as per @Sheila's request.