Questions about filtering variants manually

Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin
This discussion was created from comments split from: How can I filter my callset if I cannot use VQSR / recalibrate variants ?.

Comments

  • pdexheimer · Member ✭✭✭✭

    Thanks for this document, Geraldine - I hadn't appreciated before how tentative these recommendations are.

    we do want to hear from you about how we can best help you help yourself.

    Ask and ye shall receive! I don't have a very principled way to evaluate the quality of a callset. If there are enough variants, Ti/Tv and dbSNP % are very useful metrics (though hardly the complete picture). But I find that the data I have to hard filter is usually too small for even these numbers to be meaningful, and they never apply well to indels. Most of the time, I end up essentially just saying "Good Luck" to the investigator, which isn't especially helpful. Any advice you might be able to provide on evaluation would be very appreciated.

    Regarding the annotations, I would say that more information is helpful. For a long time (and I think this is still true), the only information about the annotations has been a one sentence description, barely more than is listed in the VCF. For things like DP and MQ this is fine, but some of the complex ones (I'm thinking particularly of InbreedingCoeff, HaplotypeScore, and FS here) would benefit from a much more detailed description. Personally, I like to see both the "less math" and "more math" descriptions available - maybe treat the "more math" track as Supplementary Methods. That way I can read the overview and get a good idea of what's going on, but if I ever want/need to really understand it I can dig into the math. The downside, of course, is that it doubles the documentation burden, but those annotations are complex enough that I think they would warrant it.

    Another note that would be helpful (which I think you alluded to) would be some reasoning for why the starting points here include some annotations but not others. For instance, why is InbreedingCoeff used for Indels but not SNVs? The explanation for DP is exactly the kind of thing I'm thinking of.

    Finally, I still encounter people (both in real life and online) who want to use what I think of as the "Sanger filters": they set a minimum depth, usually require a certain number of reads supporting the alternate allele, and that's it. It might be nice to have an explanation of why those kinds of filters are suboptimal that I can point them toward.
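
    For concreteness, here is a minimal sketch of the kind of depth-only filter I mean (file names hypothetical); note that it says nothing about mapping quality, strand bias, or the rank-sum annotations:

    # A "Sanger-style" filter: depth only, no use of the other annotations
    $ java -jar GenomeAnalysisTK.jar -T VariantFiltration -R reference.fasta \
    -V raw_calls.vcf \
    --filterExpression "DP < 10" --filterName "min_depth" \
    -o sanger_style_filtered.vcf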

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Oy, talk about being careful what you wish for ;-)

    Most of the time, I end up essentially just saying "Good Luck" to the investigator, which isn't especially helpful.

    We've mostly been limited to that as well in the past, but now we're hoping to ramp up our efforts to provide guidance. The hard part can be in identifying what to tackle first, so thanks for your thoughtful comments.

    Improving the explanations of what the annotations represent (the "less math" route) is probably the lowest-hanging fruit, while "more math" will probably consist of "go look at the code" until we get around to documenting it properly. Explaining why we recommend or discourage particular filters basically means writing down the experience accumulated by several people over years of analysis; it will take some time, but we see the value in this, so we will get there.

    Seriously though, thanks for your feedback, @pdexheimer. When people comment like you did and say, for example, "this part would be really helpful", that is immensely useful for defining priorities, but also for getting buy-in from team members to sit down and formulate this stuff explicitly.

  • pdexheimer · Member ✭✭✭✭

    "more math" will probably consist of "go look at the code"

    Indeed, that's been my documentation for some time - one of the great benefits of open source. And the annotations are generally simple enough that you only need to look at one or two files to really grok them. But doing that led me to two very different experiences:

    1. A wonderful "Oh, I get it!" moment with InbreedingCoeff. Once you see the math (and understand it), it's a very elegant metric.
    2. A morning of confusion with HaplotypeScore. A forum discussion with ebanks cleared it up (in fact, most of the "less math" for that metric is probably already in the forum), but there were some subtleties in the code that I just didn't catch in my reading.

    And there are certainly users of GATK who are just not comfortable or experienced reading code. I understand the need to prioritize; I would just encourage you to consider that not all metrics are equal, and there may be value in spending more effort on some of the more complex ideas. (Preaching to the choir, I know; you do that already. Just trying to bump up the priority a bit.)

  • AdminTim · London · Member

    Hi Folks,

    Are the recommended values suitable for generating a bootstrapped 'truth' set for use in GATK, or should more stringent cutoffs be used for that? Could you come up with some guidelines for that, please? I suppose that when filtering for this purpose we don't want to include the 'twilight zone' variants that we might want in our final results. What metrics can be used to judge whether variants are 'gold standard'?

    Thanks

    Tim

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Hi @AdminTim,

    To bootstrap a truth set you probably want to be more stringent on the filtering. Our gold standards are typically variants that have been validated using orthogonal approaches such as arrays. We're just now starting a project to develop guidelines for this kind of thing; it will probably take a month before we have them worked out and ready to share publicly.

  • AdminTim · London · Member

    Hi Geraldine - sorry, for some reason I've only just seen your reply. Thanks for the info - I'll keep an eye open for those guidelines.

    For the time being, I thought I could do the following for my 48-sample, 100x-coverage exome data:

    • Apply the above cutoffs to filter variations.
    • Include only variations where all samples are PASSes (i.e. not in an ambiguous/edge region where some samples were filtered).
    • Include only variations that occur in more than one sample.

    Is this reasonable for use in an iterative bootstrapping process?
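
    For concreteness, a sketch of what I have in mind for the last two steps (file names hypothetical); I realize AC counts alleles rather than samples, so a single homozygous-variant sample would also pass AC > 1:

    # Keep only sites that passed the hard filters (site-level FILTER field)
    # and that carry more than one copy of the alternate allele across samples
    $ java -jar GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta \
    -V filtered_48samples.vcf \
    --excludeFiltered \
    -select "AC > 1" \
    -o bootstrap_truth.vcf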

    Thanks

    Tim

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Hi Tim,

    It's a good start. Ideally you'll want to have a look at a bunch of your variants in context in IGV and check that the calls look reasonable. Obviously you can't do all of them, but it's worth doing as many as you can spare the time for, especially if you have no existing callset to compare them against.

  • ryanabashbash · Oak Ridge National Laboratory · Member

    Hi all,

    I'm trying to bootstrap a truth set for an organism lacking an existing variant database, and I am looking for a sanity check. Here's the outline of what I've done; please let me know if you have any suggestions:

    I have WGS (i.e. not exome capture) for 2 different samples from an organism with a ~800Mbp genome.

    1. Alignment: BWA

    2. Mark Duplicates: Picard

    3. Generate realignment targets: GATK RealignerTargetCreator

    4. Realign around indels: GATK IndelRealigner

    5. Assess depth of coverage to set the depth thresholds for step 7: GATK DepthOfCoverage


    Median coverage for samples 1 and 2 is 28 and 26, respectively, with 82% and 80% of bases above 15x.
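
    Roughly, step 5 looks like this (file names hypothetical); the per-sample summary output gives the coverage distribution used to pick the DP thresholds:

    $ java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R reference.fasta \
    -I sample1.realigned.bam \
    -o sample1.coverage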

    6. Initial variant discovery on both samples simultaneously: GATK HaplotypeCaller

    7. Hard filtration of variants to produce "initial truth set": GATK VariantFiltration

    Using the following filters:

    "DP < 10 || DP > 100" 
    "QD < 20.0"
    "MQ < 50.0"
    "FS > 10.0"
    "HaplotypeScore > 8.0"
    "MQRankSum < -12.5"
    "ReadPosRankSum < -3.0"
    "AC != 2 || AC != 4" # These samples are supposed to be fairly inbred so I'm considering heterozygous loci suspect.
    

    This reduces the initial ~3.7 million called variants across the two samples to 1.5 million variants (counted using GATK VariantEval). These 1.5 million variants were considered the "initial truth set" (and that value seems to be a reasonable ballpark value for a "pared down" variant set for the organism given reports by others).
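
    For reference, the expressions above were applied along the lines of the following sketch (file names hypothetical), each with its own filter name so the failing criterion is visible in the FILTER field:

    $ java -jar GenomeAnalysisTK.jar -T VariantFiltration -R reference.fasta \
    -V initial_calls.vcf \
    --filterExpression "DP < 10 || DP > 100" --filterName "depth" \
    --filterExpression "QD < 20.0" --filterName "qd" \
    --filterExpression "MQ < 50.0" --filterName "mq" \
    --filterExpression "FS > 10.0" --filterName "fs" \
    --filterExpression "HaplotypeScore > 8.0" --filterName "hs" \
    --filterExpression "MQRankSum < -12.5" --filterName "mqrs" \
    --filterExpression "ReadPosRankSum < -3.0" --filterName "rprs" \
    --filterExpression "AC != 2 || AC != 4" --filterName "ac" \
    -o initial_truth_set.vcf
    # NB: the AC expression is corrected in a later post (as written, it is always true)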

    8. Base recalibration of initial BAMs using the "initial truth set": GATK BaseRecalibrator

    9. Generate Recalibrated BAMs with new quality scores: GATK PrintReads

    10. Base recalibration of Recalibrated BAMs using the "initial truth set": GATK BaseRecalibrator

    11. Compare the recalibration for both BAMs: GATK AnalyzeCovariates

    12. Perform variant discovery on both samples simultaneously using their Recalibrated BAMs: GATK HaplotypeCaller

    13. Repeat steps 7 through 12 until plots from AnalyzeCovariates converge.

    14. On convergence, perform steps 12 and 7 a final time to generate a "final truth set".

    15. Use the "final truth set" as a training and truth set for VQSR, assigning a prior of 6 (~75%) to the "final truth set": GATK VariantRecalibrator
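
    For step 15, I have in mind something along these lines (file and annotation names hypothetical):

    $ java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R reference.fasta \
    -input recalibrated_calls.vcf \
    -resource:bootstrap,known=false,training=true,truth=true,prior=6.0 final_truth_set.vcf \
    -an QD -an MQ -an FS -an MQRankSum -an ReadPosRankSum \
    -mode SNP \
    -recalFile recalibrate_SNP.recal \
    -tranchesFile recalibrate_SNP.tranches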

    Please let me know if you have any suggestions for improvement.

    Thank you,

    Ryan McCormick

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Hi Ryan,

    That all sounds perfectly reasonable, I don't really have anything to add. Out of curiosity, what size final truth set do you end up with? Do you see a big difference compared to the initial truth set?

  • ryanabashbash · Oak Ridge National Laboratory · Member

    Hi Geraldine,

    Thanks for the response and feedback. I'll report back when I have a final truth set for the comparison (I'm currently iterating to obtain convergence).

    Thanks,

    Ryan

  • AdminTim · London · Member

    Hi Folks,

    When using HaplotypeCaller I get mixed indel/SNP variant entries in the VCF, which is great, but what's the best way to apply hard filtering to these? Since HC uses a single method for SNPs and indels, is it possible to apply a single filter that will work well for both?

    Thanks

    Tim

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Hi @AdminTim,

    No, as indicated in the Best Practices documentation as well as the article above, we have found that SNPs and Indels have different error modes, so it's better to use different filters on them.

  • AdminTim · London · Member

    Hi Geraldine,

    OK, thanks, but HC produces SNPs and indels on the same line, like this:

    chr10 712670 . C CG,T 36963.32 . AC=79,4;AF=0.823,0.042;AN=96;BaseQRankSum=1.656;ClippingRankSum=0.692;DP=851;FS=0.000;InbreedingCoeff=0.4204;MLEAC=79,4;MLEAF=0.823,0.042;MQ=56.43;MQ0=0;MQRankSum=-0.272;QD=26.11;ReadPosRankSum=-0.778 GT:AD:DP:GQ:PL 0/1:8,7,0:15:99:193,0,418,238,433,813 1/2:0,7,8:15:99:1042,229,313,855,0,1039 1/1:0,14,0:14:45:439,45,0,479,45,647 1/1:0,25,0:25:78:838,78,0,912,78,1440 ......

    These are selected by passing the MIXED selectType to SelectVariants, as in the sketch below. How should I apply filtering to them?
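
    For reference, the selection step (file names hypothetical):

    $ java -jar GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta \
    -V all_calls.vcf \
    -selectType MIXED \
    -o mixed_records.vcf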

    Thanks

    Tim

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Ah, fair question. I'm not sure, let me check with the team.

  • ryanabashbash · Oak Ridge National Laboratory · Member
    edited December 2013

    Out of curiosity, what size final truth set do you end up with? Do you see a big difference compared to the initial truth set?

    The size of the truth set changed from ~1.551 million variants (~1.383M SNPs and ~0.166M indels) in the initial set to ~1.532 million variants (~1.385M SNPs and ~0.146M indels) in the final set after 4 rounds of recalibration; the number of SNPs increased by ~2,000 (~1% increase) and the number of indels decreased by about 20,000 (~12% decrease). Between the 3rd and 4th (final) iterations, the number of variants called was very similar (~0.01% difference), and the AnalyzeCovariates plots were very similar. Please let me know if any of that sounds out of the ordinary.

    Lastly, the AC filtering statement in my previous post doesn't logically work (it always evaluates to true, so every site gets filtered). I ended up using "AC < 2 || AC > 4 || AC == 3", and I think "AC != 2 && AC != 4" would work as well.
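
    Spelling out the logic of the original expression:

    # "AC != 2 || AC != 4" is a tautology:
    #   AC == 2   ->  false || true   ->  true  (site filtered)
    #   AC == 4   ->  true  || false  ->  true  (site filtered)
    #   otherwise ->  true  || true   ->  true  (site filtered)
    # So every site is marked as filtered, which is not what I wanted.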

    Thanks,

    Ryan

  • Kurt · Member ✭✭✭

    Howdy,

    I know that this is the section on how to filter when you are SOL (you can't do VQSR), but I would like to comment on the use of QD as a filter for indels in this situation (as a testimonial or whatever). I am in a situation where I have to do single-sample calling AND am dealing with a small capture region (essentially a gene panel using Agilent as the capture product on human samples). When doing a quick check of the data, I noticed that quite a few of the filtered indels had been filtered by QD < 2.0 (in fact, that was the only parameter from the above cutoffs that failed when the indel calls did not pass). Since QD is normalized by event length, it would seem that quite a few long indels will end up failing this threshold. I believe all of the ones that I have come across have been confirmed via Sanger sequencing (hopefully I can have this confirmed after the new year, but I am reasonably sure that is the case from what I have heard). They generally have QUAL ~4000, depth ~300, MQ ~60, event lengths of 9 or greater, and are hets with a fair balance between alt and ref, etc.
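
    To make the arithmetic concrete, using the approximate numbers above and assuming QD is additionally divided by event length as described:

    # QUAL ~4000 / informative depth ~300   ->  QD ~13.3          (would pass QD < 2.0)
    # divided again by event length of 9+   ->  ~13.3 / 9 ~= 1.5  (fails QD < 2.0)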

    I did have one indel that failed QD and was definitely not confirmed by Sanger, but that particular one had a QUAL of about 90 and did not have a very good balance between alt and ref; the overall depth was ~300 and MQ ~60.

    This was a GATK 2.7-4 workflow using HC as the caller, bwa-mem as the aligner, and MiSeq 2x100 reads.

    In any case, I think in my situation I will end up taking out QD < 2.0 (or supplementing it with a conditional argument).

    Kurt

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Hi @Kurt,

    This is a very interesting observation, and we'd like to follow up on our end to evaluate whether we're doing the right thing with QD, or if we need to make the method smarter. Would you mind sending us a couple of snippets of good indels that fail the QD filter?

  • Kurt · Member ✭✭✭

    I'll forward your request to the people whose data it is :)

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Ah, fair enough :) Please assure them that we would treat any sensitive data with the utmost care for confidentiality.

  • Kurt · Member ✭✭✭

    @Geraldine_VdAuwera

    and ummm, yeah.

    "...Rare mutations, as most of these larger indels are, can be considered as identifiable information, so if our data is transferred outside of Hopkins to the Broad we may need her to sign a confidentiality agreement (data will not be shared, discarded after use, etc.)."

    so would your group be willing to do so?

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    If necessary, yes we would; but we're first going to check if we can generate some test data ourselves before initiating a paperwork-laden process, to make things easier on everyone. Eric is going to look into this during the vacation period, and we'll get back to you if we still need the Hopkins data. Thanks for playing middle man :)

  • Kurt · Member ✭✭✭

    Sounds good, Geraldine. You're welcome ;) and I'm sorry for giving Eric something to do during vacation!

  • Kurt · Member ✭✭✭

    Also @Geraldine_VdAuwera

    This was the command line for the HC.

    $JAVA_1_7/java -jar $GATK_DIR/GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R $REF_GENOME \
    --input_file $CORE_PATH/$OUTPATH/BAM/$SM_TAG".bam" \
    --dbsnp /isilon/sequencing/GATK_resource_bundle/2.2/b37/dbsnp_137.b37.vcf \
    -L $BED_DIR/$BAIT_BED \
    -stand_emit_conf 0 \
    -stand_call_conf 0 \
    -A VariantType \
    -A DepthPerSampleHC \
    -A ClippingRankSumTest \
    -A MappingQualityRankSumTest \
    -A ReadPosRankSumTest \
    -A FisherStrand \
    -A GCContent \
    -A AlleleBalanceBySample \
    -A AlleleBalance \
    -A QualByDepth \
    --emitRefConfidence GVCF \
    -dt NONE \
    -et NO_ET \
    -K $KEY \
    -o $CORE_PATH/$OUTPATH/temp/$SM_TAG".HC.ERC.raw.OnBait.vcf"

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Thanks Kurt! Don't worry too much about Eric -- if it wasn't this he'd find something else to work on :)

  • ebanks · Broad Institute · Member, Broadie, Dev ✭✭✭✭

    Any excuse not to have to take care of my kids.

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Yeah, right, he's such a terrible dad that he has a pic of his kids as his profile picture ;-)

    You're not fooling anyone, @ebanks.

  • Kurt · Member ✭✭✭

    Happy New Year, @ebanks and @Geraldine_VdAuwera! Just checking in to see whether you've had time to confirm that you're seeing the same thing I was seeing with larger indels and QD?

    Best Regards,

    Kurt

  • ebanks · Broad Institute · Member, Broadie, Dev ✭✭✭✭

    Hi Kurt,

    I am actively working on it now.

    (I am a bit behind because those aforementioned kids broke my laptop at the beginning of vacation. No joke)

  • Kurt · Member ✭✭✭

    Apparently they decided that you were going to take care of them after all ;)

  • ebanks · Broad Institute · Member, Broadie, Dev ✭✭✭✭

    Hi Kurt, just wanted to let you know that the fix is going in over the next day or two. Thanks again for finding this big problem!

  • Kurt · Member ✭✭✭

    Hi Eric, good to hear and you're welcome! Glad to have contributed!

  • Kurt · Member ✭✭✭

    Oh, when deployed, will this be found in a nightly build or do you plan on doing a stable release?

  • ebanks · Broad Institute · Member, Broadie, Dev ✭✭✭✭

    Nightly build until the 3.0 release.

  • Kurt · Member ✭✭✭

    Ok, good to know :)

  • tkoyama · The University of Tokyo · Member
    edited July 2015

    Hello. I am trying to extract real variants from a partial genome sequence (approx. 1 Mb) of a non-model species.
    I mapped whole-genome shotgun reads obtained from a HiSeq 2000 to the partial genome. Following the Best Practices workflow, I repeated variant calling and base recalibration several times, as follows:

    # First Variant Call
    $ java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta \
    -I sample.realigned.bam -o sample.raw.snps.indels.vcf
    # VariantAnnotator
    $ java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R reference.fasta \
    -I sample.realigned.bam -V sample.raw.snps.indels.vcf -o sample.raw.snps.indels.VA.vcf \
    -A BaseQualityRankSumTest -A ChromosomeCounts -A Coverage \
    -A FisherStrand -A HaplotypeScore -A MappingQualityRankSumTest \
    -A QualByDepth -A RMSMappingQuality -A ReadPosRankSumTest \
    -A SpanningDeletions -A TandemRepeatAnnotator -A StrandOddsRatio \
    -A InbreedingCoeff
    # SelectVariants
    ## SNPs
    $ java -jar GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta \
    -V sample.raw.snps.indels.VA.vcf -selectType SNP -o sample.raw.snps.VA.vcf
    ## INDELs
    $ java -jar GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta \
    -V sample.raw.snps.indels.VA.vcf -selectType INDEL -o sample.raw.indels.VA.vcf
    # VariantFiltration
    ## SNPs
    $ java -jar GenomeAnalysisTK.jar -T VariantFiltration -R reference.fasta -V sample.raw.snps.VA.vcf \
    --filterExpression "DP < 55 || DP > 1100 || QD < 10.0 || FS > 45.0 || MQ < 45.0 || MQRankSum < -7.0 || ReadPosRankSum < -7.0 || SOR > 4.0" \
    --filterName "snp_filter" -o sample.filtered.snps.VA.vcf
    ## INDELs
    $ java -jar GenomeAnalysisTK.jar -T VariantFiltration -R reference.fasta -V sample.raw.indels.VA.vcf \
    --filterExpression "DP < 55 || DP > 1100 || QD < 4.5 || FS > 50.0 || ReadPosRankSum < -1.82 || InbreedingCoeff < -0.467 || FS > 50 || SOR > 4.2" \
    --filterName "indel_filter" -o sample.filtered.indels.VA.vcf
    # CombineVariants
    $ java -jar GenomeAnalysisTK.jar -T CombineVariants -R reference.fasta \
    -V sample.filtered.snps.VA.vcf -V sample.filtered.indels.VA.vcf -o sample.filtered.snps.indels.VA.vcf \
    --genotypemergeoption UNIQUIFY
    # 1st BaseRecalibration
    $ java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R reference.fasta \
    -I sample.realigned.bam -knownSites sample.filtered.snps.indels.VA.vcf -o bqsr1.table
    # 2nd Variant Call
    $ java -jar GenomeAnalysisTK.jar -T HaplotypeCaller  -R reference.fasta \
    -I sample.realigned.bam -BQSR bqsr1.table -o sample.bqsr1.snps.indels.vcf
    # 2nd BaseRecalibration
    $ java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R reference.fasta \
    -I sample.realigned.bam -knownSites sample.bqsr1.snps.indels.vcf -BQSR bqsr1.table
    

    Variant calling and base recalibration were each repeated 5 times.
    I found that the number of variants dramatically decreased:

    # The number of reported variants in 1st variant call without hard-filter
    $ grep -v '#' sample.raw.snps.indels.VA.vcf | wc -l
    16056
    # The number of variants after hard-filter
    $ grep 'PASS' sample.filtered.snps.indels.VA.vcf | wc -l
    5057
    # The number of variants after 2nd variant call with 1st round bqsr table
    $ grep -v '#' sample.bqsr1.snps.indels.vcf | wc -l
    2041
    # The number of variants after 3rd variant call with 2nd round bqsr table
    $ grep -v '#' sample.bqsr2.snps.indels.vcf | wc -l
    10
    # The number of variants after 4th variant call with 3rd round bqsr table
    $ grep -v '#' sample.bqsr3.snps.indels.vcf | wc -l
    8
    # The number of variants after 5th variant call with 4th round bqsr table
    $ grep -v '#' sample.bqsr4.snps.indels.vcf | wc -l
    13
    

    The base quality scores converge at the 3rd round of BQSR, so it would make sense to use the 3rd-round bqsr table for PrintReads. However, the number of variants is far too small after the 2nd round of BQSR.
    Is anything wrong with my protocol? Any suggestions would be appreciated.

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @tkoyama
    Hi,

    I think the best thing to try first is to relax your hard filtering criteria. How did you choose the criteria?

    Also, can you post the AnalyzeCovariates plots you get from the first round of recalibration?

    Thanks,
    Sheila

  • tkoyama · The University of Tokyo · Member

    Hi Sheila. Thank you for your comment.
    I basically chose the hard-filter parameters according to this and this, but tuned them by assessing some variants in IGV.
    I will attach the 1st, 2nd and 3rd BQSR plots. These plots were made as follows:

    # bqsr1.plots.pdf
    ## 1st BaseRecalibration with hard-filtered vcf
    $ java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R reference.fasta \
    -I sample.realigned.bam -knownSites sample.filtered.snps.indels.VA.vcf \
    -o bqsr1.table
    ## 1st BaseRecalibration with -BQSR bqsr1.table
    $ java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R reference.fasta \
    -I sample.realigned.bam -knownSites sample.filtered.snps.indels.VA.vcf -BQSR bqsr1.table \
    -o bqsr1.after.table
    ## AnalyzeCovariates between bqsr1.table and bqsr1.after.table
    $ java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R reference.fasta \
    -before bqsr1.table -after bqsr1.after.table -plots bqsr1.plots.pdf
    
    # Actually I performed the 2nd variant call here and made sample.bqsr1.snps.indels.vcf as stated in the previous post.
    
    # bqsr2.plots.pdf
    ## 2nd BaseRecalibration with sample.bqsr1.snps.indels.vcf and -BQSR bqsr1.table
    $ java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R reference.fasta \
    -I sample.realigned.bam -knownSites sample.bqsr1.snps.indels.vcf -BQSR bqsr1.table \
    -o bqsr2.table
    ## 2nd BaseRecalibration with -BQSR bqsr2.table
    $ java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R reference.fasta \
    -I sample.realigned.bam -knownSites sample.bqsr1.snps.indels.vcf -BQSR bqsr2.table \
    -o bqsr2.after.table
    ## AnalyzeCovariates between bqsr2.table and bqsr2.after.table
    $ java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R reference.fasta \
    -before bqsr2.table -after bqsr2.after.table -plots bqsr2.plots.pdf
    
    # The 3rd variant call was performed here.
    
    # bqsr3.plots.pdf
    ## 3rd BaseRecalibration with sample.bqsr2.snps.indels.vcf and -BQSR bqsr2.table
    $ java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R reference.fasta \
    -I sample.realigned.bam -knownSites sample.bqsr2.snps.indels.vcf -BQSR bqsr2.table \
    -o bqsr3.table
    ## 3rd BaseRecalibration with -BQSR bqsr3.table
    $ java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R reference.fasta \
    -I sample.realigned.bam -knownSites sample.bqsr2.snps.indels.vcf -BQSR bqsr3.table \
    -o bqsr3.after.table
    ## AnalyzeCovariates between bqsr3.table and bqsr3.after.table
    $ java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R reference.fasta \
    -before bqsr3.table -after bqsr3.after.table -plots bqsr3.plots.pdf
    

    Sorry for the complicated explanation. I look forward to hearing from you.

    best,
    tkoyama

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @tkoyama
    Hi Tkoyama,

    Can you tell me some more about your project? What organism are you working with?
    The quality scores seem quite low overall. Have you done any QC on your original data?

    Would it be possible to share your data with me, so I can do some experimentation with bootstrapping myself? I am developing some documentation on this and would like to test if the usual recommendations are widely applicable or if there are some caveats for some types of data, organisms, or levels of quality.

    Thanks,
    Sheila

  • tkoyama · The University of Tokyo · Member

    Hi Sheila,

    Thank you for your reply. I got my data from BGI. BGI gave me cleaned data, so I have not done any QC before mapping.
    I can share my data and sample information if we can communicate privately (I mean outside the GATK forum).
    Sorry about this, but my data have not been published yet, so it is difficult to share them here. Of course, we can discuss here what was wrong with my protocol if we can identify the cause.
    If we can communicate outside the GATK forum, how can I contact you?

    best,
    tkoyama

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @tkoyama
    Hi Tkoyama,

    Yes, of course I understand it is unpublished information. You can upload to our FTP server. Instructions are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks,
    Sheila

  • tkoyama · The University of Tokyo · Member

    Hi Sheila,

    I cannot connect to the FTP server. I posted a question about it; could you check?

    Thanks,
    tkoyama

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    The FTP server is on the fritz, sorry. We have IT working on fixing it.

  • tkoyama · The University of Tokyo · Member

    Hi Sheila,

    I uploaded my data, named 'tkoyama_to_shiela.tar.gz'.
    If you have any problems, please let me know.
    Thank you for your kind help!

    tkoyama

    Issue · Github (filed by Sheila): Issue #157, state: closed, closed by vdauwera
  • Sumudu · Sri Lanka · Member

    Hi,

    This is regarding the command line below from tkoyama. Is it correct to use "-BQSR bqsr2.table" when performing BaseRecalibration again?

    3rd BaseRecalibration with sample.bqsr2.snps.indels.vcf and -BQSR bqsr2.table

    $ java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R reference.fasta \
    -I sample.realigned.bam -knownSites sample.bqsr2.snps.indels.vcf -BQSR bqsr2.table \
    -o bqsr3.table

    Thanks.

  • roselaw27 · China · Member

    Hi GATK team,

    It might sound like a silly question, but the plot I got from comparing the tables from two passes of BQSR has fewer graphs than what tkoyama got; specifically, it is missing the "base insertion" and "base deletion" panels. Is this something I should worry about? I also wonder if there is a way to get those graphs; I think I might be missing some parameters in the previous steps.

    I am using GATK4: HaplotypeCaller --> VariantFiltration --> BQSR --> ApplyBQSR --> BQSR2 --> AnalyzeCovariates
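
    For reference, a minimal sketch of that chain in GATK4 syntax (file names hypothetical):

    $ gatk BaseRecalibrator -R reference.fasta -I input.bam \
      --known-sites filtered_calls.vcf -O before.table
    $ gatk ApplyBQSR -R reference.fasta -I input.bam \
      --bqsr-recal-file before.table -O recal.bam
    $ gatk BaseRecalibrator -R reference.fasta -I recal.bam \
      --known-sites filtered_calls.vcf -O after.table
    $ gatk AnalyzeCovariates -before before.table -after after.table \
      -plots covariate_plots.pdf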

    Thank you very much.
    Rose
