Questions about BQSR (2014-2016)

SystemSystem Administrator admin
This discussion was created from comments split from: Base Quality Score Recalibration (BQSR).

Comments

  • CanesBoy2015CanesBoy2015 HIHGMember

    Hi,

    I have read a few different forum threads about BQSR. I saw a few times it was mentioned that you can safely run BQSR on the chromosome level if there is a lot of data. I am assuming a 30X WGS sample produced on a HiSeq X machine would produce enough data to do this safely? Would you recommend to run BQSR on all chromosome together? I just want to run as many commands in parallel as possible.

    Thanks for all your help!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @CanesBoy2015‌

    Hi,

    Yes, 30X WGS is enough data to run BQSR safely.

    You can run BQSR by chromosome. It's not what we recommend, but it is technically feasible. Please do make sure that everything looks okay in the data once you are finished.

    -Sheila

  • CanesBoy2015CanesBoy2015 HIHGMember

    Hi Sheila,

    So you will run BQSR on a 30x WGS without splitting the BAM file? Is there any safe way to run this step in parallel that you would recommend?

    Thanks for the help

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @CanesBoy2015‌

    Hi,

    You can use queue for this. Please read about queue here: http://www.broadinstitute.org/gatk/guide/topic?name=queue

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Questions older than Aug 21 have been archived in this thread.

  • FChjatonnetFChjatonnet RENNES University HospitalMember

    Hi,
    this is my first try with GATK. After aligning with bwa on ucsc_hg19 reference and de-dup with Picard, I have re-aligned a bam file and started the base re-calibration process before going to SNP calling (as indicated in best practices). I've run everything with default options, and generated the bqsr report file with RStudio and a downloaded BQSR.R script (to circumvent a Rscript PATH issue).
    I provided the BQSR.R script with the recal.table from the first BaseRecalibrator passage and the csv file from AnalyzeCovariates run after a second passage, as indicated in the "how to recalibrate base quality" tutorial but my feeling is that quality values are worse after recalibration. First I thought it was due to downsampling, but after running the whole process on the entire bam file, I got similar results (see attached file). I'm using the (I think) latest GATX version, v3.3-0-g37228af.
    Any help on how to interpret my recalibration plots?
    Thanks a lot,
    Fabrice.

  • KatieKatie United StatesMember ✭✭

    Hi, If I am working with a non-model organism without a database of verified SNPs, is BaseRecalibration impossible? Should I still perform indel realignment with RealignerTargetCreateor and IndelRealigner?

    Thank you!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @Katie,

    You can bootstrap a set of variants to use for recalibration. Have a look at the troubleshooting sction oBQSR documentation.

  • AJWillseyAJWillsey San Francisco, CAMember

    Hi All,

    I am currently running whole-exome data from 350 trios (unaffected parents + affected child) through the best practices pipeline (GATK V3.2-2), starting from fastq files, with the eventual goal of identifying de novo mutations.

    I have healthy control data from another sequencing project (~650 trios), that I would like to use as a control here for comparing burden of de novo mutations. However, for these samples I only have bam files that have undergone BQSR (with V2.X GATK), and the original quality scores were not saved. I am reverting these back to fastq files in order to align them to the same reference as the other cohort, etc.

    Obviously I would like the control data to be as well-matched as possible, so in an ideal world, I would have raw quality scores for both cohorts, and after alignment would conduct BQSR analogously on each before proceeding to variant calling. Given that this is not possible, I am wondering if it is better to use the existing recalibrated base quality scores for the control data as is, or whether I should re-recalibrate these? Is it a bad idea to recalibrate already recalibrated base quality scores and/or is there any advantage to doing this?

    Many thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @AJWillsey ,

    I would recommend running your controls through BQSR again. Successive runs should not have any negative effect, and since there have been a couple of substantial improvements in the BQSR tools compared to early 2.x versions, your data may benefit.

  • AJWillseyAJWillsey San Francisco, CAMember

    Hi @Geraldine_VdAuwera
    Thank you so much for the quick reply, this is really helpful. I'm amazed you are able to keep up so well with all of the questions people ask on here. Keep up the great work!

  • agopalagopal CanadaMember

    Question -- for the BaseRecalibrator walker, the documentation says having a list of known variant sites is optional, but with the walker, it's apparently mandatory.

    I was wondering if there was a recommended practice to get this information when it's unavailable? For instance, when I process raw reads, I usually do:
    mapping with bowtie
    duplicate removal
    indel realignment

    Would it make sense for me to get a preliminary list of high quality-score variants (e.g., QUAL > 100) and use these as "known" variant sites? Or do I need ALL known variant sites first?

    Any help would be much appreciated

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @agopal, yes, you're on the right track. Have a look at the presentation slides from the recent workshop posted on the GATK blog. In the "Non-human" presentation you will find recommendations for bootstrapping a set of known variants.

    See https://www.dropbox.com/sh/uw2cor3eix3w47e/AADCAkp7l51Wz9uCIjUboyU2a?dl=0

  • RhewterRhewter BrazilMember

    Hello @Geraldine_VdAuwera or other person who can help me...

    Working with a species of plant that does not have a set of known SNPs and read here in the forum about bootstrap variants calls. I wonder what that is. You get the bam file recalibrated and using variables identified to recalibrate it again? Or is using the new variants identified to recalibrate the original bam?

    Thank you very much for spending time to read my question.
    Sincerely,
    Rhewter.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Rhewter
    Hi Rhewter,

    Have a look at this article under I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.: http://gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibration-bqsr

    -Other person who can help you :smile:

  • RhewterRhewter BrazilMember
    edited May 2015

    Thank you so much @Sheila :smiley: :wink: !!

    If someone seeks answer here:
    "you should use the unrecalibrated bam file."

    Rhewter.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @Sheila @Geraldine_VdAuwera I have heard a few people at different occasions saying it's not necessary to run BQSR for Illumina HiSeq X Ten high coverage data. I thought I would hear it from the horse's mouth myself. Is it best practice to run BQSR in all cases? I'm sorry if this is a stupid question. I don't have much experience with BQSR and bam pre-processing in general. I am perfectly happy with you telling me, that I should evaluate it myself. Thank you for your time.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @tommycarstensen Not a stupid question at all. I'm going to give you one of my infamous "yes and no" answers. Yes, it is true that with higher-quality, higher-coverage data (as one might expect from the beauty that is the HiSeq X Ten), BQSR produces diminishing returns. No, I wouldn't recommend skipping BQSR despite the previous statement.

    My favorite analogy is that BQSR is like fire insurance. Most of the time you don't need it and it feels like a waste of money. But when your house burns down (which you don't know is going to happen in advance -- unless you're committing insurance fraud, and that's another matter entirely) it can really save your bacon. Because it mutates into a fire department and rebuilding crew all at once. And that's where my analogy truly breaks down, but you get the point :)

    We're looking into ways to make it faster and economize on storage space, but for now BQSR is still definitely best-practice.

  • jellisjellis AustraliaMember

    Hi,

    I've recently been given some low coverage (6-8X) WGS BAMs to analyse. They have been processed with a GATK based workflow, but after alignment each BAM was split by chromosomes and each chromosome was process separately. This means some of the read groups going through BQSR have fewer than 100M bases. My understanding from the above is that this is not what you would recommend, but possible to do. My question is how much impact on quality is this likely to have (that is, having fewer than 100M reads) and how significant is the improvement with 1 B bases? If I decide to re-run BQSR can I simply run it on the BAMs I have?

    I've also noticed that they have been run through IndelRealigner before BQSR. I've always performed these steps the other way around with the rational that bases will potentially move during realignment, therefore, you can't tell if a bases is truly different from reference until it's in its final alignment. Can anyone comment on the correct order to run these modules, or does it not matter?

    Thanks,
    Jonathan

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @jellis
    Hi Jonathan,

    BQSR will be able to build a better model with more bases, so we do not recommend running on less than 100M bases. Can you tell me a little more about how the reads were sequenced? If you have whole genomes, I suspect you have a lot of reads per lane. BQSR is supposed to be run per-lane. So, you may be able to combine some of your data to get better results, even if they are not from the same sample. Again, it just matters that they are from the same lane.

    Indel Realignment should be performed before Base Recalibration because BQSR depends on the reads being in the correct place, as it considers any mismatch from the reference to be an error. This article may help: http://gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibration-bqsr

    -Sheila

  • jellisjellis AustraliaMember

    @Sheila, thanks for your response. Can you clarify how BQSR determines which data come from the same lane? I know this is a question that has been asked before, but the answers don't seem to be that clear. I've read previous posts by you stating that the read group ID is used to determine lane (sorry can't find the link for that one), but here http://gatkforums.broadinstitute.org/discussion/4586/read-group-pu-field-used-by-baserecalibrator you state that the PU field will take precedence. There are lots of references all over the forum that the GATK doesn't require the PU field. Is it used in preference to the ID tag or not? If it's not, they I don't fully understand how to set the ID tags up for the data I have. The samples have been multiplexed (at least four samples per lane). Each sample will need it's own read group (with distinguishing SM tags), and each of these will need to have a unique ID tag. How will the GATK then know they are from the same lane?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    As those documents state, PU is used over RGID if it is present, but we generally assume that PU is not present, and in that case RGID is used. I'm not sure how I can state that more clearly.

    Note that when we say that BQSR analyzes the data per lane, we mean per lane per sample. So in your case, you'd typically first demultiplex the data into separate per-sample files. Within that, you may have data from different lanes or libraries (distinguished by RGID and LB tags, and having the same SM tag). BQSR will distinguish them accordingly. Does that make more sense?

  • redvexredvex AustralasiaMember

    Can you explain how BQSR will handle binned base qualities, for example from HiSeq4000 data? Will it be possible to merge a HiSeq2xxx BAM with a HiSeq4000 BAM? My Illumina FAS suggested performing base recalibration prior to merging in this case, but he doesn't have direct experience with this. What workflow would you recommend?

    Thanks

  • flehartyfleharty Member, Broadie, Dev ✭✭

    BQSR constructs a separate model for each read group, so as long as your read groups are well defined, there should be no problem. You can perform recalibration either prior, or after the merge.

  • fishyufishyu cambridgeMember
    edited August 2015

    BQSR generated different re-calibrate tables with different versions (I tested V3.1 vs V3.3).
    Everything else is same here (input bam file, DBSNP file, reference file and parameters), except GATK version. Is there anything changed in BQSR implementation between these two versions?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @fishyu
    Hi,

    There may have been some changes that I can't think of off the top of my head. However, for the best results, we recommend using the latest version of GATK.

    -Sheila

  • MAPKMAPK Member
    edited September 2015

    I am new to GATK. Can someone please explain me what base quality score is? what does base quality score of 20 mean? I also need to know about the filter. What does PASS and VQSR mean in Filter column? Thank you

  • achalneupaneachalneupane Member
    edited September 2015

    @Sheila The filter pass link is not very clear, could you please provide something more in detail?

  • ylzylz hkMember

    Hi,
    Previously, I used v3.4 to do BQSR with default settings, after print reads, the BAM size became much larger, I know this case is normal as discussed in some post here.
    However, when I updated to v3.5, and in the printread task included one more option --disable_indel_quals, I found the new BAM size does not increase a lot, in some case it even decrease, for example, before BQSR the BAM is 8.1G, after that it became 7.9G.
    I would like to know is there problem?
    Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    This is the expected effect of disabling the indel qualities.

  • everestial007everestial007 GreensboroMember
    edited December 2015

    @Geraldine_VdAuwera said:
    This is the expected effect of disabling the indel qualities.

    I also had the same case but thought it was related with the amount of information that is in the BAM files. When you disable the indel qualities you reduce the amount of information in the BAM file which would result in reduced size of the BAM file. But, it would not affect anyother information or values in the BAM file. Is that the case?

    Thank you,

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    It's true that disabling indel quals doesn't affect anything else, but there are a few other reasons why file size fluctuates when going through BQSR. When you don't have indel quals inflating the size of the file, the rest is more noticeable in comparison.

  • noushin6noushin6 Baltimore, MDMember

    Hi @Geraldine_VdAuwera. I am new at working with ion torrent sequencing data and will very much appreciate any pointers.I was wondering if BQSR is valid for bam files generated on ion torrent platform. In general, are there any steps in best practices guideline that are not valid / recommended for ion torrent data. Many thanks.

    Issue · Github
    by Sheila

    Issue Number
    514
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • noushin6noushin6 Baltimore, MDMember

    Can I also ask if BQSR is applicable to targeted sequencing bam files (ion PGM) with about 1M reads spread across 2-4 read groups?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @noushin6,

    We know that indel realignment won't work properly on Ion Torrent data; if I recall correctly the tool actually has a hard-coded check that will make it quit with an error if it encounters Ion Torrent data. I'm not sure about BQSR, I think that should be fine.

    More generally though we don't support running on Ion Torrent data because we know it has some error modes that GATK is not able to model correctly. I can't comment on Ion PGM as we have never had any such data to experiment with, to my knowledge at least.

    Ion has their own caller which does purport to model the idiosyncrasies of that data type correctly. I can't comment on its accuracy as we've never done any comparisons. I can only recommend trying both (or other callers as well) and seeing what performs best in your hands. Others in the community may have more useful recommendations, besides.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Glad to hear you like the workshop videos!

    Unfortunately the BQSR tables are not expected to be consistent from one run to the next. The problems it addresses are not just machine-specific, they can also be lane-specific or library-specific. So it's not possible to be lazy. Believe me, we would be all over that if it was possible!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Please do go and school your friend, for they are sorely mistaken! It would be quite useless to run a method that amputates our ability to discover new things. Then we wouldn't be doing variant discovery, we'd just be re-genotyping!

    Your intuition is correct -- the effect of novel variants on the recal tables is tiny, largely because their number (relative to the number of bases in a genome or exome) is tiny. And qualities will only be adjusted based on trends across all reads, so any individual site may not see any adjustments (regardless of whether there is a variant there) unless it is affected by some bias -- in which case the adjustment is a good thing. As you say sometimes the adjustment is an increase in qualities. It's not all about reducing scores. So finally, you're correct that real novel variants have no less chance of getting called before or after BQSR. The major effect of BQSR is to decrease false positives and increase reliability of calls overall.

    Thank you for being an advocate of good methodology and understanding!

  • francismfrancism NigeriaMember

    Hi Geraldine, currently practicing with GATK sample data frag_1.fastq and frag_1.fastq with reference genome Hgenome14RK.fasta. presently running the analysis but having problem with BaseRecalibrator. /Elvis_tutorial$ java -jar /home/francis/GenomeAnalysisTK.jar -T BaseRecalibrator -R /home/francis/Elvis_tutorial/Hgenome14RK.fasta -I /home/francis/Elvis_tutorial/H14_realigned_reads.bam -knownSites /home/francis/Elvis_tutorial/known.vcf -o /home/francis/Elvis_tutorial/H14_recal_data.grp. how do i get known.vcf file because that is the errors that am getting from "Couldn't read file /home/francis/Elvis_tutorial/known.vcf because file '/home/francis/Elvis_tutorial/known.vcf' does not exist"?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @francism
    Hi,

    I suspect the known.vcf is not in the directory you specified. Can you make sure the known.vcf file is indeed in the directory?

    -Sheila

  • armarkesarmarkes LisbonMember

    Hi,
    I am sending you a file with an example of my graphics after BQSR.
    Can you please tell me if it is normal that after re-calibration my quality scores become lower? Is there any reason that can explain this?

    Thanks,
    Ana Marques

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @armarkes I'm sorry, I can't see your file. Please try to attach it again.

    In general, it is possible for the quality scores to be lower after BQSR if the program found that the machine systematically overestimated the quality of the calls.

    But in some cases, if you see a big drop in all quality scores, it can also mean that the set of known sites is not adequate. We mostly see that in studies of non-human/non-model organisms. What kind of data are you working with?

  • armarkesarmarkes LisbonMember

    Hi,

    I am sending again my graphics after BQSR. I hope you can see it.
    We did NGS of a specific panel of genes using human samples.

    Thanks,
    Ana Marques

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @armarkes
    Hi Ana,

    How many bases approximately are in your dataset? BQSR is not suitable for datasets with less than 100M bases. Have a look at this thread for more information.

    If that is not applicable, as Geraldine mentioned, it is possible the sequencer overestimated the original reported quality scores.

    -Sheila

  • armarkesarmarkes LisbonMember

    @Sheila I didn´t know that issue.
    We sequenced only a small target panel (3,509bp) in 96 samples and we used a MiSeq flowcell that produced a total of 250M bases, which gives a mean of 2M bases per readgroup (sample). Each sample has a mean of 300 reads per site (+- 75000bp).

    I performed BaseRecalibrator tool independently in each sample/readgroup (only 2M bases). Maybe I should create only one recal.table with all my samples (250M bases), however the problem is that this tool will analyze each sample/readgroup independently, so the result will be the same, right? Should I give the same @RG to all my samples since they are from the same library and sequenced in the same single-lane flowcell?

    Can I use BQSR in my data? Do you have an alternative for this?

    Thanks

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @armarkes
    Hi Ana,

    As long as the samples have all been processed in the same lane, you can run BQSR on the samples together. BQSR should be done per-lane. Have a look here for help on assigning proper read groups.

    -Sheila

  • armarkesarmarkes LisbonMember

    @Sheila
    All my samples were processed in the same single-lane flowcell. But they represent different DNA samples.
    To distinguish these samples we used barcodes (but this information is not present in the FASTQ files).
    My read name is like this: @Instrument:RunID:FlowCellID:Lane:Tile:X:Y ReadNum:FilterFlag:0:SampleNumber

    I am not sure about what @RG ID should I give to my samples. I found the read groups tutorial very confuse.

    If I look to this example, I conclude that I should give my flowcell name to the @RG ID, and in this case all my BAM files from all samples will have the same readgroup ID:
    @RG ID:FLOWCELL1.LANE1 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
    @RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
    @RG ID:FLOWCELL1.LANE3 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400
    @RG ID:FLOWCELL1.LANE4 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400
    @RG ID:FLOWCELL1.LANE5 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200
    @RG ID:FLOWCELL1.LANE6 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200

    I read in the comments, that the @RG ID is only important to BQSR. Is that right? Will I be able to distinguish my samples from @RG SM after this step?

    Thanks

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @armarkes
    Hi Ana,

    Sorry, it seems I mislead you. Your dataset is just too small for BQSR. 3,509 bases is just not enough data for BQSR to build robust model. You will have to skip BQSR.

    -Sheila

    P.S. Your read groups look fine. You are correct BQSR takes into account the ID, but if PU is present, that takes precedence.

    -Sheila

  • armarkesarmarkes LisbonMember

    @Sheila
    Thank for your help. I wonder if there is other GATK tool that can detect bases resulting from sequencing errors or do you think it is not necessary to recalibrate bases in my case?
    It is ok to run the other tools in my small database?
    Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Unfortunately there's not other way of recalibrating base qualities, but this is not a blocking problem. It just means that your data will be a bit more vulnerable to errors on that front, yet it is still usable. Since you have such a small target region, you will be able to evaluate variant calls individually to some extent, which compensates for the inability to correct for some types of error. So now you can proceed to calling variants on your data. Note that you also won't be able to apply variant recalibration (VQSR filtering) but that's also ok since you should have a small enough number of variants to be able to examine them each more closely.

  • jphekmanjphekman Broad InstituteMember

    Hi. I am calling variants on fox RNAseq reads which have been aligned to dog. I am hoping to perform BQSR on this data set using previously called fox variants which were generated using fox genomic reads aligned to dog. However, my PI is concerned about the fact that this recalibration data will contain both dog vs fox and fox vs fox variants, and that the dog vs fox variants will be much
    more numerous (around 80-90% of the variants in our experience). Should we be concerned about using this data set for recalibration?

    Thanks!
    Jessica

    Issue · Github
    by Sheila

    Issue Number
    919
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @jphekman,

    No, this makes perfect sense. What you're trying to do here is mask out as many as possible of the sites where your fox reads will legitimately mismatch the dog genome (meaning where you expect to potentially see real variation in the fox relative to the dog) to avoid counting those base calls as errors. Since your known variants were generated using the same fox vs dog setup, this seems absolutely appropriate to me.

  • jphekmanjphekman Broad InstituteMember
    edited May 2016

    Thanks Geraldine! This was exactly the kind of enthusiastic language I needed from your response to show my PI :)

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    Hah, you're welcome!
  • buddejbuddej St. LouisMember

    Is it advisable to specify an interval file with -L when running BQSR on WES data? We are only calling SNPs within the capture region for our exome data, but there are certainly reads outside that region, due to a variety of factors (as I'm sure you're aware).

    However, the reads outside the exome region should be subject to the same sequencing machine errors as those inside. So would we end up with a better model by including these regions, simply because we're feed more reads into the BQSR model?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    Yes @buddej, that is what we recommend. There should be a minority of off target sequence, so you won't lose anything substantial in terms of amount of data for building the error model, and in fact you'll avoid incorporating reads that may have more mismatches than on target reads.

    When you apply the recalibration though you don't restrict to intervals.
  • cdrurycdrury United StatesMember

    Hi all,

    I've got some experience with other pipelines but am new to GATK. My files are prepped, but I have 1152 individuals (of a coral, Acropora cervicornis) from 6 flowcells/lanes and am unsure how many of them to start with. I'd like to run HaplotypeCaller on some (? all?) of these samples and then use that data for the bqsr and proceed from there. Can anyone share any advice on this? I only really need help with the organization component, and I'm not sure if I should be running so many samples, if they need to be distributed across read groups, or if I need to run them all. From what I can tell it seems best to maintain my .bam files per sample, but I've also seen that I need to merge based on read group at some point.

    Any help would be greatly appreciated! Thanks!

    -Ford

    Issue · Github
    by Sheila

    Issue Number
    1098
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @cdrury
    Hi Ford,

    What do you mean by your "files are prepped"? Can you tell us about your end goal?

    We mostly work with humans at the Broad, so we don't have much experience with non-human data. That said, you are correct you probably don't need to run HaplotypeCaller on all 1152 of your samples then do the bootstrapping process. You may try running HaplotypeCaller on a subset of your samples (perhaps try 20-30) then run BQSR. In the next round, you can try with a different subset of 20-30 individuals, and so on, until you see convergence.

    When you say "merge based on read group at some point", I am assuming you mean aggregating bams sequenced on the same flowcell that belong to the same sample. We do that at the Mark Duplicates step.

    -Sheila

  • cdrurycdrury United StatesMember

    @Sheila

    Hi Sheila, Thanks for the info. So by 'files are prepped' I just meant I've got the trimmed bam files with marked duplicates ready to go.

    As for merge based on read groups I was trying to figure out if I should be running haplotype caller on bams containing all the data from a read group, but it seems running each sample individually and then combining GVCFs should work.

    On another note, if I use a subset of samples for BQSR, I assume I should distribute them across read groups, correct? From what I understand detecting the systematic differences between flow cells should require that information, so I think that seems like the way to go.

    Thank you!

  • cdrurycdrury United StatesMember

    Hi @Sheila,

    I have around 100M bases per sample but in some cases, half of that. I think the best way to run BSQR is to merge my bam files by read group and run BSQR on entire read groups at a time, generating the recalibration report from there on files that are well over 1B bases.

    My question is, do I need to include data from multiple read groups in these merged .bam files so that the systematic differences between read groups (due to machine) can be calculated? I'm basically trying to figure out how to organize the files for BSQR. I've run it on some individual files and it works fine (as far as I can see, it functions at least).

    Thanks!

  • mscjuliamscjulia United StatesMember

    Hi Sheila,

    For the -knownSites options, if I'm only interested in SNVs, is it good, bad or does-not-matter if I don't include the knownSites for indels from this step on please? Thanks a lot.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @cdrury
    Hi Ford,

    I think you had some confusion about defining the read groups. But, it looks like you have everything taken care of in this thread.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @mscjulia
    Hi,

    We do recommend using the indel sites as well. Have a look at this document for more information/recommendations.

    -Sheila

  • mscjuliamscjulia United StatesMember
  • hayworthhayworth ChinaMember

    Hi,

    I have a RRBS datasets including 80 samples, can I use GATK to call snp from RRBS data?

    Thanks.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @hayworth
    Hi,

    Have a look at this thread.

    -Sheila

  • DragoniteDragonite DurhamMember

    Hi everyone!
    I will try to do bootstrapping as I am working on newly sequenced organisms.
    The explanation here seems already clear I guess, that I should reach kind of the last stage of the analysis by skipping BQSR first, and then I should use my filtered variants as the vcf input for BQSR and re-do the rest of the analysis(?) The question here is, how crucial is this step, I mean, if I had a known set of variants, I'd definitely do it, but bootstrapping will probably not be as efficient. And how can I know when I reach convergence?

    In case that may help me, Geraldine has a post from April 2015 that goes as:
    "In the "Non-human" presentation you will find recommendations for bootstrapping a set of known variants"
    and she shares a dropbox link, which gives 404. Is it possible to share that file again maybe?
    I also checked out BQSR presentations from some workshops, but couldn't find any details.

    Many Thanks
    Fatih

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Dragonite
    Hi Fatih,

    You can view the presentation here. Also, have a look at this thread.

    -Sheila

  • everestial007everestial007 GreensboroMember

    @Dragonite
    I had the same issue when following GATK best practices. I did my BQSR by calling the variants on my samples first and then went back to do BQSR - then rest of the best practices.
    But, for individuals working with emerging models, I think the deal breaker is your sample size and the quality of the your "per base sequence quality" in your fastq file, see this http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

    The standard minimum threshold for sequence quality at each base is 20, but with my sequence I had all the bases above quality 20 (actually 30) at all the positions. So, I realized that BQSR isn't making much difference since the whole purpose of BQSR is to recalibrate this sequence quality scores. And, I was gaining as much valuable variants without doing this extra step. So, check your fastqc first, if it looks good I would rather emphasize on VQSR using gvcf approach on cohorts.

    @Sheila @Geraldine_VdAuwera
    I think this particular question comes a lot and I understand the dilemma of using vs. not using this method (BQSR) since is is very tempting (honest opinion) but resource consuming. But, I think looking at the "per base sequence quality" prior making the judgement is a good idea. Am I missing something else that is more important on BQSR?

    Thanks,

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @everestial007 FastQC is a great tool but it only tells you what the scores are, not whether they are accurate. So you can't know just from that whether BQSR will help your data or not.

    We have not yet found a good way to predict upfront whether a given dataset will benefit from recalibration, so we run it systematically on all our data. We consider that it's like fire insurance: most of the time it's not needed but when it is you're really happy you got it.

    The tool has been rewritten in the upcoming GATK4 version to be quite a bit faster, so in future that point of pain will be eased to some extent.

  • everestial007everestial007 GreensboroMember

    Thanks much @Geraldine_VdAuwera .
    I will think if I should reconsider doing BQSR on my pipeline. I have been doing my analyses for about more than a year but keep coming back and adding parameters in the analyses. Lol

  • ShawnWilliamsShawnWilliams Wuhan City. Hubei Province. ChinaMember

    Hi,
    I see this blog said that bqsr only correct the base score quality of mismatch bases(https://rstudio-pubs-static.s3.amazonaws.com/64456_4778547202f24f32b0edc325e96b061a.html),but I think all the bases need quality score recalibration, am I right?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @ShawnWilliams
    Hi,

    Yes, BQSR corrects all the base quality scores.

    BaseRecalibrator uses the mismatched sites (that are not known variation sites) as errors to build the model. Then, the PrintReads step applies a correction (based on the model) to all the bases.

    -Sheila

  • ShawnWilliamsShawnWilliams Wuhan City. Hubei Province. ChinaMember

    Thanks so much @Sheila

    Now I have another question:
    After BaseRecalibrator we can get a report table which contains many EmpiricalQualities, I used to think that we just need to use these EmpQS in PrintReads, but I find that it recalculate EmpQS in PrintReads process, and the EmpQS is a little bit different with those in report table. I want to know why it has to recalculate the EmpQS rather than just use EmpQS in report table?

  • oskarvoskarv BergenMember

    It seems like BaseRecalibrator and PrintReads run with scatter gather parallelization compared to plain single threaded execution is giving me different output results, is that to be expected?

  • ShawnWilliamsShawnWilliams Wuhan City. Hubei Province. ChinaMember

    Hi, @Sheila @Geraldine_VdAuwera
    this link (http://gatkforums.broadinstitute.org/wdl/discussion/1326/per-base-alignment-qualities-baq-in-the-gatk )says that BAQs are no longer used in GATK. But I find bqsr still use baq to calculate fractional error by default in gatk's code. Why we need baq?Is it used for find false positive snp caused by true indel? and it's a little bit confused for me to understand the baq Algrithm, especially the hmm_glocal function, where can I get some information about baq algrithm?
    Thanks.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    @ShawnWilliams,

    What version of GATK are you using? Can you share with us how the tool is telling you that it is using BAQ? If you are using an older version of GATK, I highly recommend you update to the latest version (v3.7). Unfortunately, we cannot help you with questions related to features that are deprecated.

  • ShawnWilliamsShawnWilliams Wuhan City. Hubei Province. ChinaMember

    @shlee,
    I use v3.6, I read GATK‘s code on github(https://github.com/broadgsa/gatk-protected/blob/master/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/bqsr/BaseRecalibrator.java), in BaseRecalibrator.java, calculateBAQArray function, BAQ.CalcutationMode is RECALCULATE rather than OFF.
    And I run gatk.jar(which is 3.6 version download from GATK official website, ) to analysis my data, I use Intel Vtune CPU Profiler to analysis the CPU Time, and I find hmm_glocal function() is the most time cost function.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    @ShawnWilliams We are moving away from that code; in the GATK4 codebase, which is the future of GATK, we have removed BAQ entirely.

    Beyond that, I'm sorry but your question is outside the scope of support we provide.
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @oskarv Sorry your question got lost amid the other conversation on this thread. Yes, we expect to get marginal differences in output when parallelizing execution of BaseRecalibrator. It shouldn't cause major differences, however. Are you seeing very different results?

  • OK, I've done my bets to dig through the forum to no avail: I am getting the following error, and I can't figure out a way to fix it. Any advice will be greatly appreciated.

    I am running the command: java -jar GenomeAnalysisTK.jar -T PrintReads -R Patel_Trinity_L_RNA_scaffolded.fasta -I nondorm1_marked_duplicates.bam -BQSR recalibration_report.grp -o nondorm1_marked_duplicates_recal.bam

    and getting the following:

    INFO 19:35:01,201 HelpFormatter - --------------------------------------------------------------------------------------
    INFO 19:35:01,204 HelpFormatter - The Genome Analysis Toolkit (GATK) vnightly-2017-11-22-1, Compiled 2017/11/22 00:01:18
    INFO 19:35:01,204 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
    INFO 19:35:01,205 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
    INFO 19:35:01,205 HelpFormatter - [Wed Nov 22 19:35:01 CST 2017] Executing on Linux 3.13.0-117-generic amd64
    INFO 19:35:01,205 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_144-b01
    INFO 19:35:01,210 HelpFormatter - Program Args: -T PrintReads -R Patel_Trinity_L_RNA_scaffolded.fasta -I nondorm1_marked_duplicates.bam -BQSR recalibration_report.grp -o nondorm1_marked_duplicates_recal.bam
    INFO 19:35:01,213 HelpFormatter - Executing as [email protected] on Linux 3.13.0-117-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_144-b01.
    INFO 19:35:01,214 HelpFormatter - Date/Time: 2017/11/22 19:35:01
    INFO 19:35:01,214 HelpFormatter - --------------------------------------------------------------------------------------
    INFO 19:35:01,215 HelpFormatter - --------------------------------------------------------------------------------------
    INFO 19:35:01,286 NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/dave/GenomeAnalysisTK.jar!/com/intel/gkl/native/libgkl_compression.so
    INFO 19:35:01,305 GenomeAnalysisEngine - Deflater: IntelDeflater
    INFO 19:35:01,306 GenomeAnalysisEngine - Inflater: IntelInflater

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version nightly-2017-11-22-1):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions https://software.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: The BQSR recalibration file, /home/dave/recalibration_report.grp, does not exist

    What am I doing wrong?

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    It cannot find the recalibration_report.grp file.

    Can you check your paths?

Sign In or Register to comment.