
Questions about known variant resources

Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin
This discussion was created from comments split from: What should I use as known variants/sites for running tool X?.

Comments

  • When you do this "looping round until convergence" you use the VCF file from the previous loop as the -knownSites for the current loop. Do you do the same for the BAM file, or use the same one all the time?

  • rpoplin Member ✭✭✭

    Technically, the official answer is that we don't know, because this is very experimental. I think the best thing would be to use the original BAM file each time. Cheers,

  • lberna Member

    I tried both the original BAM file (realigned) and the new BAM generated in each iteration, and the result did not change. After 4 loops the VCF file converged.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Good to know, thanks for sharing your observations, @lberna.

  • mlcunha Member

    The Mills_and_1000G_gold_standard.indels.b37.sites.vcf file is not present in the bundle of the current version. Does that mean we should use Mills_and_1000G_gold_standard.indels.b37.vcf instead of Mills_and_1000G_gold_standard.indels.b37.sites.vcf? And why are there files with the .out suffix, like Mills_and_1000G_gold_standard.indels.b37.vcf.out? Congrats on the good work!

  • ebanks Broad Institute Member, Broadie, Dev ✭✭✭✭

    Mills_and_1000G_gold_standard.indels.b37.vcf is a sites-only file...
    We were previously storing 2 copies of the same file, and now we only store one.
    You can ignore the .out files (I'll remove them).

  • Kurt Member ✭✭✭

    There is conflicting information on the suggested value of the prior for the dbSNP/known track used in VQSR. On this page (and a few others) the value is Q6. On the page linked below (which is very well written, by the way!), the suggested value is Q8. http://gatkforums.broadinstitute.org/discussion/39/variant-quality-score-recalibration

  • rpoplin Member ✭✭✭

    Well, I wouldn't say it is conflicting information, but I appreciate that it is confusing to see two different command lines. The most up-to-date Best Practices recommendations are found on this page, while the other page you reference is a VQSR tutorial with fixed input data and output results.

    I hope that helps,
    Cheers,

  • Kurt Member ✭✭✭

    Fair enough. In the end, all that matters (at least to me) is what the most up-to-date suggested parameter settings are. If this is where you go for them (FAQs), then that works for me ;) Thanks!

  • @lberna said:
    I tried both the original BAM file (realigned) and the new BAM generated in each iteration, and the result did not change. After 4 loops the VCF file converged.

    Hi lberna. Can you give me some details of the loop of programs you used for this?

  • silentio Member

    @Geraldine_VdAuwera I found something contradictory regarding VariantRecalibrator for indels.
    In this article, the resource for known indels is recommended as
    "-resource:mills,VCF,known=false,training=true,truth=true,prior=12.0 gold.standard.indel.b37.vcf ";
    However, in another post, it is
    "-resource:mills,known=true,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \".
    So should "known" be set to "true" or "false"?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    The value of "known" does not affect the filtering process; its only effect is that any indels in your samples that overlap with this set will be annotated as "known" rather than "novel". Sometimes we want to use a different set of indels as the "known" database, but generally the Mills resource is set to "known"="true".

  • feng_b Member
    edited May 2013

    What about ESP variants? Can I use those as resources? If yes, how should I set the parameters for VariantRecalibrator?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi @feng_b,

    It is up to you to decide what resources are appropriate for your data. We can only advise on what we have found to work well with our human data. We encourage you to experiment with other types of resources that you may have at your disposal. If you find something interesting, please share your findings with the community, either by posting them in the forum or linking to your paper when you publish.

  • Homa Member

    For the BAM file, what do you mean by the 'new bam generated in each iteration'? I have one original BAM file, and I use that for SNP calling, which gives me the VCF files, but the BAM file itself doesn't change.
    Thanks in advance for your help.

    @lberna said:
    I tried both the original BAM file (realigned) and the new BAM generated in each iteration, and the result did not change. After 4 loops the VCF file converged.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    @Homa, I believe @lberna was talking about the files generated by iterative steps of recalibration to produce an adequate set of known sites for base recalibration. This is a very specific protocol that is not typically necessary if you already have a set of known sites available (as for humans).

  • TechnicalVault Cambridge, UK Member ✭✭✭

    Am I correct in assuming the current (last edited March 25) VQSR settings in this document are superseded by http://gatkforums.broadinstitute.org/discussion/1259/what-vqsr-training-sets-arguments-should-i-use-for-my-specific-project (last edited May 31)?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Yes, that's correct.

  • thedamian Member
    edited August 2013

    If I would like to use both "known" files, Mills_and_1000G_gold_standard.indels.hg19.vcf and 1000G_phase1.indels.hg19.vcf, what should the command look like?
    java -jar GenomeAnalysisTK.jar -T IndelRealigner \
        -known Mills_and_1000G_gold_standard.indels.hg19.vcf \
        -known 1000G_phase1.indels.hg19.vcf \
        -R ...
    Should I use the "known" option twice?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Yes, you use the "known" option twice.

  • george Member

    The "Summary table" seems to be missing a column "1KG SNPs", which should be assigned to VariantRecalibrator, to reflect the fact that 1000G_phase1.snps.high_confidence.* is also part of the suggested inputs.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Fair point @george, I'll schedule an update of that doc. Thanks for pointing this out!

  • OliverDrechsel Europe Member

    Hi,

    The Mills data set is used quite frequently and is a very good resource, but which publication should be cited as the reference for it? I can't seem to find this on your pages.
    "An initial map of insertion and deletion (INDEL) variation in the human genome" (Genome Res. 2006. 16: 1182-1190) looks like it, but it could also be the newer version
    "Natural genetic variation caused by small insertions and deletions in the human genome" (Genome Res. 2011. 21: 830-839).

    Thanks for your help,
    Oliver

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hmm, we don't seem to have a preferred reference paper for this dataset. Either of those you propose would be fine, I think -- or if in doubt, use both.

  • kamilo889 Member

    Hi, looking in the bundle, the latest version of dbSNP is 138, while NCBI now has 141. I've downloaded the 141 VCF file, but when I run GATK it is in a different order than my reference (because of ChrM). Can I reorder my VCF to match the reference with BEDtools or another tool (the .dict and .bai are ordered according to the reference found in the bundle), or should I just use the 138 version? Thanks!

    Camilo

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    @kamilo889, you can do either but I would recommend just using the 138 version.
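
    If you do decide to reorder the newer VCF instead, one option (a sketch only; assuming Picard is available, and the file names are placeholders) is Picard's SortVcf, which sorts a VCF according to a sequence dictionary:
    java -jar picard.jar SortVcf \
        I=dbsnp_141.vcf \
        O=dbsnp_141.sorted.vcf \
        SEQUENCE_DICTIONARY=reference.dict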

  • everestial007 Greensboro Member ✭✭

    @lberna said:
    I tried both the original BAM file (realigned) and the new BAM generated in each iteration, and the result did not change. After 4 loops the VCF file converged.

    I am trying to create a known-sites resource for my model organism too (A. lyrata). I will report back on how many loops it takes for my VCF to converge.

  • everestial007 Greensboro Member ✭✭
    edited September 2015

    My situation:
    I don't have a known-sites file (*.vcf) for the genome of the organism I am working with.
    I used bwa mem to align the genomic reseq data, coordinate-sorted the SAM files (to BAM), marked duplicates, then prepared the realignment targets (intervals.list file) and realigned the reads (realigned.bam files).
    Since I don't have known VCF sites for this genome, I first used HaplotypeCaller to call variants at the previously prepared intervals (intervals.list) (as suggested at https://www.broadinstitute.org/gatk/events/slides/1506/GATKwr8-B-3-Base_recalibration.pdf and http://gatkforums.broadinstitute.org/discussion/1247/what-should-i-use-as-known-variants-sites-for-running-tool-x)
    and applied stringent filters.
    The command is:
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R lyrata_genome.fa -I realigned_readsMA605.bam --genotyping_mode DISCOVERY -L MA605REaligner_targets.list -stand_emit_conf 30 -stand_call_conf 30 -o raw_variantsMA605.vcf

    Is anything wrong with the command? I am mainly concerned about whether I have supplied the appropriate targets (MA605REaligner_targets.list) and am not messing up the alignments.

    BQSR bootstrapping workflow:
    So, now I use this raw_variantsMA605.vcf to recalibrate my realigned.bam files at the identified intervals, right? Until convergence?
    How do I know if my VCF files have converged?

  • everestial007 Greensboro Member ✭✭
    edited October 2015

    It may not be a very good question, but I am a little confused about the bootstrapping process used to prepare -knownSites during BQSR for my realigned BAM.
    I have followed the GATK Best Practices:
    step 1: call variants on realigned.bam using HaplotypeCaller
    step 2: apply stringent filters to the variants (SNPs and indels) to prepare a knownSNPs.vcf and knownIndels.vcf
    step 3: apply BQSR to prepare recal_data01.table using the -knownSites
    step 4: apply a 2nd-pass BQSR to prepare recal_data02.table
    step 5: analyze covariates on recal_data01.table vs. recal_data02.table using plots (a sketch of steps 3-5 appears at the end of this comment). At this stage we want to see convergence, meaning we want the data to look like the example at https://www.broadinstitute.org/gatk/guide/article?id=44
    step 6: if I see convergence, apply the recalibration to my realigned.bam files to prepare a new BAM (recalib_realigned.bam). At this step, is it better to retain the original base qualities using --emit_original_quals, or just leave it?

    Step 7: if my comparison plots are not good enough, I proceed with bootstrapping. Does that mean I have to run PrintReads to prepare a new BAM (recalib_realigned.bam), call variants on the recalib_realigned.bam to prepare a new set of *.vcf files (stringently filtered), and then use those variants as knownSites for a new round of BQSR, continuing until I see convergence? My worry is that this process will most likely mess up the BAM files (alignment and/or quality).

    Or should I hold off on PrintReads and just apply BQSR with recal_data02.table to get recal_data03.table and analyze covariates again? If the plot comparison looks better, then apply PrintReads; if not, apply BQSR again with recal_data03.table to obtain recal_data04.table, and keep going. But what am I gaining by running BQSR on the same recalibration table again and again?

    Thanks much in advance !!
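
    A minimal sketch of steps 3-5 (GATK 3-style syntax; file names are placeholders, and the stringently filtered calls from step 2 stand in for the known sites):
    # First pass: build the recalibration table from the bootstrapped known sites
    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
        -R reference.fasta -I realigned.bam \
        -knownSites knownSNPs.vcf -knownSites knownIndels.vcf \
        -o recal_data01.table
    # Second pass: recompute covariates after applying the first table
    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
        -R reference.fasta -I realigned.bam \
        -knownSites knownSNPs.vcf -knownSites knownIndels.vcf \
        -BQSR recal_data01.table \
        -o recal_data02.table
    # Compare before/after tables to judge convergence
    java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates \
        -R reference.fasta \
        -before recal_data01.table -after recal_data02.table \
        -plots recal_plots.pdf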

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    In the successive rounds of bootstrapping, you run BQSR again on the original bam file using the variants you obtained at the previous round. It's the known variants that are different each time.
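
    In other words, one bootstrap round might look roughly like this (a sketch only, with GATK 3-style syntax, placeholder file names, and an arbitrary example filter expression; adapt the filters to your data):
    # Call variants, applying the current round's recalibration on the fly
    # (omit -BQSR on the very first round)
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
        -R reference.fasta -I realigned.bam \
        -BQSR recal_roundN.table \
        -o raw_roundN.vcf
    # Hard-filter the calls and keep only the passing variants
    java -jar GenomeAnalysisTK.jar -T VariantFiltration \
        -R reference.fasta -V raw_roundN.vcf \
        --filterExpression "QD < 2.0 || FS > 60.0" --filterName "bootstrap_filter" \
        -o flagged_roundN.vcf
    java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta -V flagged_roundN.vcf \
        --excludeFiltered -o known_roundN.vcf
    # Recalibrate the ORIGINAL bam again with the new known-sites set
    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
        -R reference.fasta -I realigned.bam \
        -knownSites known_roundN.vcf \
        -o recal_roundN+1.table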

  • everestial007 Greensboro Member ✭✭

    Thanks Geraldine.
    But regarding the use of --emit_original_quals after the full bootstrapping process: do you think it's better to retain the original qualities or not? How does it affect downstream variant analysis? I have searched for this but haven't found clear information on it.

    thanks again !

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    We don't retain the original qualities in our production pipeline. There is no effect on downstream analysis. It's only useful if you want to be able to revert back to the unprocessed state of the data.
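
    For completeness, the final write-out step might look like this (a sketch with GATK 3-style syntax and placeholder names; add --emit_original_quals only if you want the OQ tags kept so you can revert later):
    java -jar GenomeAnalysisTK.jar -T PrintReads \
        -R reference.fasta -I realigned.bam \
        -BQSR recal_data01.table \
        --emit_original_quals \
        -o recalib_realigned.bam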

  • everestial007 Greensboro Member ✭✭
    edited October 2015

    Hi Geraldine,
    Thanks for the information.

    Now that I have run BQSR, I have some questions about using the target.interval file (prepared earlier with RealignerTargetCreator) with BQSR while bootstrapping. I have read several forum threads on BQSR but am still somewhat confused about the appropriate use of -L with BQSR.

    This article mentions the use of the -L flag, specifically for WEx data: https://www.broadinstitute.org/gatk/events/slides/1506/GATKwr8-B-3-Base_recalibration.pdf
    First question: should I use the -L target.interval flag to build the recal table only from the target sites if I am using whole-genome reseq data?
    What about in the case of RNAseq data?
    I have gone through several examples that people have posted for their runs, and it does not seem to be the case.

    Next question: do we also use -L target.interval when writing the final recalibrated values (on the realigned.bam) at the end (with PrintReads)? In other words: are we trying to recalibrate the quality scores only at the variant sites, or at all sites (both variant and non-variant) in the BAM file? I am not confident which it is.

    Fortunately, I have found that using HC with BQSR (for bootstrapping) is able to fix the covariation in a single round, which has saved a lot of time during the bootstrapping process, but it still takes a long time.

    Thanks for all the help people on this forum have given!

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi @everestial007

    I think you're confusing the realignment targets (produced by RealignerTargetCreator and meant to be used only by IndelRealigner), and general processing target intervals such as exome capture targets, which you can provide to any tool with -L and are usually tied to the exome capture kit used to prep the samples for sequencing (or equivalent for other experimental types).

    The goal of BQSR is to recalibrate all of the reads, not just a subset, so while you can provide exome or equivalent intervals to BaseRecalibrator (if appropriate for your data type) you should generally not restrict the analysis further than that. Note that if you use intervals in the PrintReads step you will exclude data outside those intervals from being written to the output file.
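
    To make that concrete, a sketch for exome data (placeholder file names; the capture targets come from the kit, not from RealignerTargetCreator):
    # Capture targets may be passed to BaseRecalibrator for exome data
    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
        -R reference.fasta -I realigned.bam \
        -L exome_capture_targets.interval_list \
        -knownSites known_sites.vcf \
        -o recal_data.table
    # No -L here, so that all reads are written to the output BAM
    java -jar GenomeAnalysisTK.jar -T PrintReads \
        -R reference.fasta -I realigned.bam \
        -BQSR recal_data.table \
        -o recalibrated.bam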

  • everestial007 Greensboro Member ✭✭

    Thanks @Geraldine_VdAuwera. It's much clearer now.
    I also want to confirm that when we run PrintReads we apply the values from the first recalibration_data.table, not from the post_recalibration_data.table, as suggested in this link: https://www.broadinstitute.org/gatk/guide/article?id=2801

    Another question is about merging BAM files: from your experience, or the way analyses are done at the Broad Institute, at what stage is it best to merge the aligned BAM files: 1) right after BWA alignment, 2) after indel realignment, or 3) after BQSR recalibration is complete? I understand that we should merge the sequence data while retaining a unique @RG for each sample.

    As suggested in this link: https://www.broadinstitute.org/gatk/guide/bp_step.php?p=2
    for genome data it is best to call variants on each individual sample and then perform joint genotyping across samples. My hunch is to keep the BAM files separate for each individual sample (until after BQSR recalibration) and then proceed to variant calling with both merged and unmerged samples.

    Thank you !!!

  • Sheila Broad Institute Member, Broadie, Moderator admin

    @everestial007
    Hi,

    You are correct about calling variants on each individual sample and then performing joint genotyping. Have a look at this article for more information: http://gatkforums.broadinstitute.org/discussion/6057/i-have-multiple-read-groups-for-1-sample-how-should-i-pre-process-them#latest
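
    As a sketch of that workflow (GATK 3-style GVCF syntax; sample names and file names are placeholders):
    # Per-sample calling in GVCF mode
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
        -R reference.fasta -I sample1.recalibrated.bam \
        --emitRefConfidence GVCF \
        -o sample1.g.vcf
    # Repeat for each sample, then joint-genotype across all samples
    java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
        -R reference.fasta \
        -V sample1.g.vcf -V sample2.g.vcf \
        -o joint_genotyped.vcf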

    -Sheila

  • aschoenr Member
    edited April 2018

    Hi,
    I'm not sure if this is answered elsewhere, but I'm running the preprocessing pipeline on some human data and have now reached the BQSR step. My question is: which known-variant files does GATK currently recommend for this stage? I downloaded the hg38 resource bundle and there seem to be a few in there that I could use. Specifically, which of the following are recommended:

    • 1000G_omni2.5.hg38.vcf.gz
    • 1000G_phase1.snps.high_confidence.hg38.vcf.gz
    • Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz
    • dbsnp_138.hg38.vcf.gz
    • dbsnp_144.hg38.vcf.gz
    • dbsnp_146.hg38.vcf.gz
    • hapmap_3.3_grch38_pop_stratified_af.vcf.gz
    • hapmap_3.3.hg38.vcf.gz
    • Mills_and_1000G_gold_standard.indels.hg38.vcf.gz

    An article from 2012 (that was also updated last week, https://software.broadinstitute.org/gatk/documentation/article.php?id=1247) recommends the following:

    • The most recent dbSNP release (build ID > 132)
    • Mills_and_1000G_gold_standard.indels.b37.vcf
    • 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)

    Based on the vcfs I have in my current bundle, that would translate to:

    • dbsnp_146.hg38.vcf.gz
    • 1000G_phase1.snps.high_confidence.hg38.vcf.gz
    • Mills_and_1000G_gold_standard.indels.hg38.vcf.gz

    So I guess my question ultimately is: are any of the following worth including as well, or should I just stick with the 3 in my last list?

    • 1000G_omni2.5.hg38.vcf.gz
    • Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz
    • hapmap_3.3_grch38_pop_stratified_af.vcf.gz
    • hapmap_3.3.hg38.vcf.gz

    Thanks!

  • Sheila Broad Institute Member, Broadie, Moderator admin
    edited April 2018

    @aschoenr
    Hi,

    Based on the vcfs I have in my current bundle, that would translate to:

    dbsnp_146.hg38.vcf.gz
    1000G_phase1.snps.high_confidence.hg38.vcf.gz
    Mills_and_1000G_gold_standard.indels.hg38.vcf.gz

    That is correct. I think the team did some testing, and found those to produce the best results. Adding the other known sites may not help much.
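
    As a sketch, the BQSR step with those three resources might look like this (shown with GATK 3-style syntax to match the rest of this thread; in GATK4 the equivalent argument is --known-sites, and the input/output names here are placeholders):
    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
        -R Homo_sapiens_assembly38.fasta \
        -I sample.dedup.bam \
        -knownSites dbsnp_146.hg38.vcf.gz \
        -knownSites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
        -knownSites 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
        -o recal_data.table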

    -Sheila
