BaseRecalibrator

jlrfloresjlrflores Posts: 11Member
edited October 2012 in Ask the GATK team

This method is described to be the "First pass of the base quality score recalibration". What is the second pass? It is not mentioned anywhere, or am I looking in the wrong place? In v1.2 there were two steps, is there only one step now for bqsr? Confused, Juan

Post edited by Geraldine_VdAuwera on

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,681Administrator, GATK Developer admin

    Hi Juan,

    Th BQSR process has been reorganized; now the second step is simply done with PrintReads as detailed here: http://gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibrator

    Sorry for the confusion, we'll try to make it clearer in the documentation.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member
    edited October 2012

    Hi,

    I used to run the base quality score recalibration from the old version, and now use the new version (GATK 2.0), which seems having quite a bit change, not just the commands used, but also the plotting. Here is what I used:

    first I did command for BaseRecalibrator:

    java -Xmx4g -jar  /Path/GenomeAnalysisTK-2.1-8-g5efb575/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -R /Path/hg19.fa -knownSites /Path/bundle-1.5/hg19/dbsnp_135.hg19.vcf -I /Path/mySample.bam -o /Path/mySample_CovarTable_Recal.grp
    

    then command for PrintReads:

    java -Xmx4g -jar  /Path/GenomeAnalysisTK-2.1-8-g5efb575/bin/GenomeAnalysisTK.jar -T PrintReads -R /Path/hg19.fa -I /Path/mySample.bam -BQSR /Path/mySample_CovarTable_Recal.grp -o /Path/mySample_recalBam.bam
    

    All the commands went through fine, and I got the recalibrated bam file and one single corresponding pdf file for the plottings. In old version of GATK, I used CountCovariates, TableRecalibration, and AnalyzeCovariates to get plots for data before and after recalibration. Now I am only getting a single pdf file, which seem only got plots of original data, but did not have plots for the data after recalibration.

    What did I miss in the commands? How do I get the plots for data after recalibration or within the same plots of old data? I saw the plots has legend labels with Recalibration and ORIGINAL, but it seems only original data was plotted but not the recalibrated new data.

    Thanks for your help!

    Mike

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,681Administrator, GATK Developer admin
  • mikemike Posts: 103Member

    Thanks so much for the info, Geraldine! Mike

  • mikemike Posts: 103Member

    Hi, Geraldine:

    Just want to make sure. based on the link you just gave me:http://gatkforums.broadinstitute.org/discussion/1539/baserecalibrator-plots

    it mentioned: Now run the BaseRecalibrator again using the original bam and passing in the recalibration table you just generated (-BQSR recal_data.grp). This additional argument tells the GATK-engine to actually perform the recalibration. The output will be a pdf showing both the pre- and post- recalibration accuracy on the same plots....

    In other word, I shall use the following command: java -Xmx4g -jar /Path/GenomeAnalysisTK-2.1-8-g5efb575/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -R /Path/hg19.fa -knownSites /Path/bundle-1.5/hg19/dbsnp_135.hg19.vcf -I /Path/mySample.bam -BQSR /Path/mySample_CovarTable_Recal.grp

    it will create a single pdf file with the plot of both data before and after recalibration, right? Since previously I already got a pdf file with just plot of old data before recal. and teh above command will plot the recal data on top of the old pdf plots, right?

    Thanks again

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,681Administrator, GATK Developer admin

    Yup, that should do the trick.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Hi, Geraldine:

    In order to get the plots for data after recalibration or within the same plots of old data, I did try the command based on our above discussion as below: command I: java -Xmx4g -jar /path/GenomeAnalysisTK-2.1-8-g5efb575/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -R /path/hg19.fa -knownSites /path/bundle-1.5/hg19/dbsnp_135.hg19.vcf -I /path/mySample.bam -BQSR /path/mySample_CovarTable_Recal.grp

    However, I got the error as below:

    ERROR
    ERROR MESSAGE: Argument with name '--out' (-o) is missing.
    ERROR ---------------------------------------------------

    However, before I run the above command, I already run command as below with -o option: command II java -Xmx4g -jar /path/GenomeAnalysisTK-2.1-8-g5efb575/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -R /path/hg19_chrM_1st.fa -knownSites /Path/bundle-1.5/hg19/dbsnp_135.hg19.vcf -I /path/mySample.bam -o /Path/mySmaple_CovarTable_Recal.grp

    This command II used -o /Path/mySmaple_CovarTable_Recal.grp to create the .grp file for the command I. If command I needs the -o, as the error message mentioned, what shall I use for -o ? like the command below?

    java -Xmx4g -jar /path/GenomeAnalysisTK-2.1-8-g5efb575/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -R /path/hg19.fa -knownSites /path/bundle-1.5/hg19/dbsnp_135.hg19.vcf -I /path/mySample.bam -BQSR /path/mySample_CovarTable_Recal.grp -o /Path/mySmaple_CovarTable_Recal.grp

    so grp file /Path/mySmaple_CovarTable_Recal.grp will be in both -o and -BQSR option, what did I miss here?

    by the way, I did use PrintReads as below to create the recalibrated bam file, which works fine: java -Xmx4g -jar /path/GenomeAnalysisTK-2.1-8-g5efb575/bin/GenomeAnalysisTK.jar -T PrintReads -R /path/hg19.fa -I /path/mySample_realignedBam.bam -BQSR /path/mySample_CovarTable_Recal.grp -o /path/mySmaple_realign_recalBam.bam

    Thanks a lot for your help!

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,681Administrator, GATK Developer admin

    Hi mike, you need to give a different name for the -o .grp file in the second command. In that second run, the recalibrator creates a new output table.  The first time you run it the table represents the original data.  The second time you run it (with the original table as input to -BQSR) it represents the recalibrated data.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member
    edited October 2012

    Dear Geraldine:

    Thanks so much for the great advice and clarification!!! if I understand correctly this time based on your comments, here are what supposed to be (apprarently, the documentation did not state that clear):

    first I need to use the following command (command A) to create grp file for the original data (not yet being recalibrated).

    java -Xmx4g -jar /path/GenomeAnalysisTK-2.1-8-g5efb575/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -R /path/hg19_chrM_1st.fa -knownSites /Path/bundle-1.5/hg19/dbsnp_135.hg19.vcf -I /path/mySample.bam -o /Path/mySample_CovarTable_Orig.grp
    

    then I need to use the grp file (mySample_CovarTable_Orig.grp) for the original data created from the above command to do the recalibration using the following command (command B; this command will also plot data of both old and after recalibrated)

    java -Xmx4g -jar /path/GenomeAnalysisTK-2.1-8-g5efb575/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -R /path/hg19.fa -knownSites /path/bundle-1.5/hg19/dbsnp_135.hg19.vcf -I /path/mySample.bam -BQSR /Path/mySample_CovarTable_Orig.grp -o /Path/mySample_CovarTable_Recal.grp
    

    Then I can use the newly created grp table file for rercalibrated data(/Path/mySample_CovarTable_Recal.grp) to use the following command (command C) to create the recalibrated bam file

    java -Xmx4g -jar /path/GenomeAnalysisTK-2.1-8-g5efb575/bin/GenomeAnalysisTK.jar -T PrintReads -R /path/hg19.fa -I /path/mySample.bam -BQSR /path/mySample_CovarTable_Recal.grp -o /path/mySample_recalBam.bam
    

    I originally only used the grp table right after the first BaseRecalibrator command to do the PrintReads command, it seems creating a bam file not being recalibrated (I previously did not do the command B).

    Please confirm with me on the above commands and hope my understanding finally got right.

    Thanks again for your great help!

    Mike

    Post edited by Geraldine_VdAuwera on
  • mikemike Posts: 103Member

    Hi, Dear Geraldine:

    Thank you for the great advice, the commnads work great now and I finally saw the plots with both Original data and recalibrated data.

    Just two follow-up questions:

    1. As I mentioned in my last post, I now used -BQSR /path/mySample_CovarTable_Recal.grp in command C to create the recalibrated bam file. But before that, due to my previous misunderstanding, I instead used -BQSR /path/mySample_CovarTable_Orig.grp in command C. I wonder if I do that, what I really get, since GATK did not give any error for that, does that mean I create the original bam file without recalibration?

    2. Is there any new documents to briefly explain what are plotted in the pdf file created in my command B as above? I saw the plots at the following URL:

    http://www.broadinstitute.org/gatk/guide/topic?name=methods-and-workflows

    but they are from old version (I did use old version as well). For the plots from new version, any a bit detailed descriptions available to interpretate the plots? for example, I understand that original usually overestimate the quality scores ( I can see that in top panel of page 2). However, page 2 of the pdf file, the middle panel shows quality score accuracy and middel panel of page 3 shows Mean Quality Score for base substitution, (also insertion, deletion etc) respectively for cycle covariate, I am not quite following what exactly are these base substitution, base insertion and deletion meant here? We are talking about reads of the bam file, right? But these seem talking about the SNPs. Indels relative to the reference bases?

    Thansk a lot for your help!

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,681Administrator, GATK Developer admin

    Hi Mike,

    Sorry to get back to you so late, things have been busy over here and your question slipped through the cracks.

    To clarify the base recalibration workflow:

    For basic usage, i.e. to just recalibrate your bam file, all you need to do is run BaseRecalibrator once, then use the .grp from that run to Print Reads, and that will yield your recalibrated dataset. The second BaseRecalibrator run is just used to generate the before/after plots; it is not required for the actual recalibration.

    So in your case, your Command C should use the .grp file generated by the first run of the recalibrator.

    I realize this is a little confusing; we'll try to improve the documentation to make that clearer. We're also going to add more explanations about how to interpret the plots, but we have a few higher-priority todo items to deal with first, so I can't promise you an ETA on those doc upgrades.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member
    edited October 2012

    Hi, Dear Geraldine:

    Thanks a lot for the clarification and getting back to me, which is very helpful. Otherwise, my data looks got screwed since I did use the mySample_CovarTable_Recal.grp (from second run of BaseRecalibrator) to generate recalibrated bam file. It is kind of confusing though since the documentation did not explicitly mention that particularly "your Command C should use the .grp file generated by the first run of the recalibrator". So I have to re-run my data for command C with using the .grp come off the first run of the BaseRecalibrator run because of that. Then that brings up my curiosity about the following: when I run the command B as I copied and pasted below from above discussion:

    java -Xmx4g -jar /path/GenomeAnalysisTK-2.1-8-g5efb575/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -R /path/hg19.fa -knownSites /path/bundle-1.5/hg19/dbsnp_135.hg19.vcf -I /path/mySample.bam -BQSR /Path/mySample_CovarTable_Orig.grp -o /Path/mySample_CovarTable_Recal.grp
    

    Based on your new comments, the output .grp file: mySample_CovarTable_Recal.grp is not needed for the command C to use PrintReads to generate recalibrated bam file, and so my question is; if it is used for the plotting (since I remember I did try command B without -o /Path/mySample_CovarTable_Recal.grp option, GATK did complain about the missing -o. Does this mean the created mySample_CovarTable_Recal.grp from 2nd run only has a role in plotting, whereas mySample_CovarTable_Orig.grp from the first run in fact already contains the info to create recalibrated bam file as you just mentioned.

    Thanks again for your great help! really appreciated your time for the supporting!!! I am getting much more knowledgeable with your clarification. Just try to get a bit clearer on how the output and input were used in these commands.

    Best

    Mike

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,681Administrator, GATK Developer admin

    Hi Mike,

    That's correct, the mySample_CovarTable_Orig.grp from your first run contains the info for recalibration, and the mySample_CovarTable_Recal.grp file from your second run is only used for plotting.

    I realize this is confusing for many people, so I will try to make a picture of the process to show which file is used for what. I will add that to the documentation soon.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Hi, Dear Geraldine:

    Thanks a lot again for your very patient efforts to clarify things to me, which is highly appreciated!

    Thanks again and best

    Mike

  • mlcunhamlcunha Posts: 8Member

    Hi! I am running the latest version of BaseRecalibrator and in the release notes we can read that: "Added new arguments to separately enumerate the .csv and .pdf output files.". I have seen in the documentation that we can use the argument --plot_pdf_file to get the pdf output file but the argument for getting the .csv output file is not described. I tried to run adding the -k flag but I got the error message "Argument with name 'k' isn't defined". Thanks for any help!

  • ebanksebanks Posts: 684GATK Developer mod

    Sorry for the confusion. I've updated the release notes.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • lukehurlukehur Posts: 1Member
    edited January 2013

    Dear All,

    I have tried the below

    java -jar $GATK/GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -nct 12 \
    -I /data/recal/SAMPLE_realign.bam \
    -R $REFERENCE/human_g1k_v37.fasta \
    -knownSites $REFERENCE/dbsnp_137.b37.vcf \
    -L 1:1-20000 \
    -o /data/recal/SAMPLE_recal.grp \
    -intermediate /data/recal/SAMPLE_recal.csv \
    -plots /data/recal/SAMPLE_recal.pdf
    

    It works fine except that pdf file was not produced. Please, help!

    Luke

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,681Administrator, GATK Developer admin

    Hi Luke, this has been covered in other threads on this forum. Please search and read them to find the solutions to your problem.

    Geraldine Van der Auwera, PhD

  • smk_84smk_84 Posts: 59Member

    Every time I run baserecalibrator with -plots option it throws the following error

    MESSAGE: Argument with name 'plots' isn't defined.

    From the above post I believe it should work fine.

    java -Xmx4g -jar /u1/tools/public/GenomeAnalysisTK/GenomeAnalysisTK.jar -T BaseRecalibrator -I SRR064560_realigned.bam -R ../../Soybean_ref_genome.fasta -run_without_dbsnp_potentially_ruining_quality -BQSR recal_data.grp -o new_recal.grp -plots comp.pdf

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,681Administrator, GATK Developer admin

    Hi there, what version of gatk are you using? If it is an older one the argument may not apply; if so you should upgrade.

    Geraldine Van der Auwera, PhD

  • smk_84smk_84 Posts: 59Member
    edited February 2013

    Oops you are right it works fine with the older version. I have run steps upto indelrealigner with the older version v2.0-38. I hope it would not be a problem if I run the rest of the steps with newer version. Kindly let me know if I am wrong.

    Thanks

    Post edited by smk_84 on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,681Administrator, GATK Developer admin

    That's ok, you shouldn't have any problems.

    Geraldine Van der Auwera, PhD

  • LaviniaLavinia Posts: 37Member

    Hi, this thread is very helpful, thanks. Could the parameter of -BQSR be added to the PrintReads documentation? (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_readutils_PrintReads.html). I thought -BQSR was needed but as I didn't see it on the documentation page, I needed to do alot more googling to confirm. Thanks.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,681Administrator, GATK Developer admin

    I added a note, it will appear in the docs in next minor version release.

    Geraldine Van der Auwera, PhD

  • InnsbruckerInnsbrucker Posts: 2Member
    edited April 2013

    Sorry! i fixed the Error, it was only declaration Error!

    BIG SORRY

    Post edited by Innsbrucker on
  • mikemike Posts: 103Member

    Hi,

    I come back to this thread that I had asked a few questions before for the current version 2.6-4.

    a minor and quick question: in the new page of BQSR, the example gives the following option at URL: http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_BaseRecalibrator.html

    -o recal_data.table

    But I used to use -o recal_data.grp

    the file extension shall not matter,right? grp just extension or can be any extension like .table?

    Also, based on our previous discussion above: here is recap: For basic usage, i.e. to just recalibrate your bam file, all you need to do is run BaseRecalibrator once, then use the .grp from that run to Print Reads, and that will yield your recalibrated dataset. The second BaseRecalibrator run is just used to generate the before/after plots; it is not required for the actual recalibration.

    So the option -BQSR is still used to just replot the data before/after recalibration as well as PrintRead, right?

    It was mentioned "I realize this is confusing for many people, so I will try to make a picture of the process to show which file is used for what. I will add that to the documentation soon." I seem could not find the clarification for those.

    Just want to make sure. Thanks for your help!

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,681Administrator, GATK Developer admin

    Hi @mike,

    That's correct, the extension doesn't matter. We've changed it to .table in the newer examples because it's more explicit than .grp (which stood for gatkreport) but it doesn't affect how the file is recognized or handled in any way.

    For basic usage, yes, just do BaseRecalibrator once, then use PrintReads with -BQSR recal.table (or whatever your recalibration file is called) to generate the recalibrated bam file. The -BQSR option tells the engine to do on-the-fly recalibration on the input data using the recalibration table.

    The workflow for plotting has changed slightly, though I haven't had time to update all the relevant docs yet. It still requires doing a second pass with BaseRecalibrator, but now the plots are generated by a separate tool called AnalyzeCovariates. See the tech doc page for details.

    I am currently working on updating the docs for the entire Best Practices pipeline of tools. Once that's done it will include pictures of the workflow for each tool.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Hi, Geraldine:

    Thanks a lot for your advice and info.

    However, you just mentioned that: It still requires doing a second pass with BaseRecalibrator, but now the plots are generated by a separate tool called AnalyzeCovariates. See the tech doc page for details.....

    When I went to that technical page, I am getting more confused, since -T AnalyzeCovariates is from old version of GATK at this step as I recalled, and for the new version after 2.0 just use BaseRecalibrator to plot, here is what I used to use:

    1. generate recalibration table: java -Xmx4g -jar /Path/GenomeAnalysisTK-2.1-13-g1706365/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -R /Path/hg19.fa -knownSites /Path/dbsnp_135.hg19.vcf -knownSites Mills_and_1000G_gold_standard.indels.hg19.vcf -knownSites 1000G_phase1.indels.hg19.vcf -I /Path/MyBam.bam -o /Path/myBam_CovarTable_Orig.grp

    2. Plot the data before and after recalibration java -Xmx4g -jar /Path/GenomeAnalysisTK-2.1-13-g1706365/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -R /Path/hg19.fa -knownSites /Path/dbsnp_135.hg19.vcf -knownSites Mills_and_1000G_gold_standard.indels.hg19.vcf -knownSites 1000G_phase1.indels.hg19.vcf -I /Path/MyBam.bam -BQSR /Path/myBam_CovarTable_Orig.grp -o /Path/myBam_CovarTable_Recal.grp

    3. Generate the recalibrated bam file. java -Xmx4g -jar /Path/GenomeAnalysisTK-2.1-13-g1706365/bin/GenomeAnalysisTK.jar -T PrintReads -R /Path/hg19.fa -I /Path/MyBam.bam -BQSR /Path/myBam_CovarTable_Orig.grp -o /Path/MyBam_recalBam.bam

    My question is: for verison 2.6-4, are above command still working or I have to use AnalyzeCovariates?

    Thanks again!

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,681Administrator, GATK Developer admin

    I realize it's confusing, but AnalyzeCovariates is a reimplementation of the old tool, brought back to specifically handle the plotting. This makes it possible to plot various comparisons of recalibration runs more easily. Your command lines to run the recalibration are still applicable, yes. It's just that BaseRecalibrator itself will no longer output plots.

    I'm attaching a draft presentation that details all the steps; hopefully that will help.

    pdf
    pdf
    BQSR-protocol_draft.pdf
    1M

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Hi, Geraldine:

    Thanks a lot for the info and advice, which is very helpful! Appreciated it very much!

    Best

    Mike

  • mikemike Posts: 103Member

    HI, Geraldine:

    I did run into some odd issues here, not sure if it is caused by our original BWA mapping or by the BaseRecalibrator of GATK. We have 24 samples, 22 of them went through phase I without any issues, but two samples stuck at BaseRecalibrator step for a very similar error message:

    the command: java -Xmx4g -jar /PATH/gatk/2.6-4/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -R /PATH/hg19.fa -knownSites /Path/dbsnp_137.hg19.vcf -knownSites /Path/Mills_and_1000G_gold_standard.indels.hg19.vcf -knownSites /Path/1000G_phase1.indels.hg19.vcf -I /Path/mybam.bam -o /Path/mybam_CovarTable_Orig.grp

    one sample has error message as below:

    ERROR MESSAGE: START (101) > (100) STOP -- this should never happen, please check read: HWI-ST1190R:85:D261HACXX:2:2204:3458:81916 2/2 101b aligned read. (CIGAR: 94M4I3M3D)

    I pull out the reads from bam file of our original and after realignment, they are the same (I was worrying that realigner of GATK may have changed the CIGAR string, but it did not), the reads (pair) are as below: HWI-ST1190R:85:D261HACXX:2:2204:3458:81916 81 chr6 133147917 37 101M = 133147921 -97 ATCTTATTCTCCTACTAAGACATTCTAATTTGAAAATTAGGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCAGAGGCGGGCGGAT DDDDDCA?DCDEEDDEDEDCDCCCDDDCCDCEDEEDDDBDDBBCDFFHHGJJJJJJJJHJJJJJIIJIHHIHHHGCJJJJJIGJIIIJHGHHHFFFFFCCC X0:i:1 X1:i:0 MD:Z:1C0T0C97 RG:Z:D261HACXX_KK24_F-FXP0452L XG:i:0 AM:i:23 NM:i:3 SM:i:37 XM:i:3 XO:i:0 MQ:i:23 XT:A:U

    HWI-ST1190R:85:D261HACXX:2:2204:3458:81916 161 chr6 133147921 23 94M4I3M3D = 133147917 97 TATTCTCCTACTAAGACATTCTAATTTGAAAATTAGGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCAGAGGCGGGCGGGGAGAT CCCFFFFFGGHHHJJJIJJJJJJJIJIIJJIJIJJIIJIIJJJJJJIHEHBBDFCEDEBDDDDEEEDDDDDDDDDDDDDDDDDDDDDDDDDB######### X0:i:1 X1:i:1 XA:Z:chr6,+133147921,94M4I3M2D,5; MD:Z:97^CAC0 RG:Z:D261HACXX_KK24_F-FXP0452L XG:i:1 AM:i:23 NM:i:7 SM:i:23 XM:i:4 XO:i:1 MQ:i:37 XT:A:U

    Another sample has similar error message as below:

    ERROR MESSAGE: START (101) > (100) STOP -- this should never happen, please check read: HWI-ST1190R:84:D24TEACXX:3:1204:5040:69275 1/2 101b aligned read. (CIGAR: 95M4I2M3D)

    Any idea? Do you think it is our BWA alignment issue or some issue within BaseRecalibrator causing that ? Anybody saw this type of error before?

    Thanks a lot!

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,681Administrator, GATK Developer admin

    Hi Mike,

    We've seen issues before with reads whose CIGAR ends in deletions, but I thought that was fixed in 2.6. You can try running with the latest nightly build in case the fix I'm thinking of was put in there after the recent release. If that doesn't work you can add -rf BadCigar to filter out those reads. Assuming you don't have too many of these it shouldn't affect your analysis. But it's good practice to have a look at the filtering summary that gets printed at the end of the run, to make sure that you're not losing too much data to this.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Hi, Geraldine:

    Thanks a lot for the info and advice., I will give a try on your advice,

    Best

    Mike

  • mikemike Posts: 103Member

    Hi, Geraldine:

    Just FYI, the -rf BadCigar option works with 244 reads (0.00% of total) failing BadCigarFilter, so not so bad.

    but not the nightly build at version 2.6-4-gd481978 still failed with the same error message.

    Thanks,

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,681Administrator, GATK Developer admin

    Thanks for reporting back with your status, Mike. Sounds like you have few bad reads so you can go ahead with that solution. I'll look into why we're not handling the case properly by default.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Hi, Geraldine:

    I did run into another relevant issue downstream: when I used Unified genotyper to call variants, I got error like below:

    ERROR MESSAGE: SAM/BAM file SAMFileReader{/is2/projects/kraemer_ttd/scratch/variants/BWA-aln_GATK-newdb/phase_I/Sample_5.BWA-aln.sorted.RG.sorted.MKDUP_realign_recalBam.bam} is malformed: read starts with deletion. Cigar: 1D3M2I96M. Although the SAM spec technically permits such reads, this is often indicative of malformed files. If you are sure you want to use this file, re-run your analysis with the

    extra option: -rf BadCigar

    I can re-run Unified genotyper with the extra option: -rf BadCigar, just want to check back with you: is this error normal? and run with that option would have no issues with variant calling? So are downstream analysis such as VQSR step, do I need this option again?

    Thanks,

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,681Administrator, GATK Developer admin

    This can occur for various reasons. As long as the filtering summary shows that the number of reads filtered out is small (as in your previous comment) it is not a cause for concern, and will not affect the quality of your results. Once you're past the variant calling step and working with the resulting vcf, you no longer need this option for the remaining operations, such as VQSR. You may need to use it again if you do something that involves using the bam file, for example if you use VariantAnnotator to add annotations based on the read data, but that's all.

    If you want to generate a bam file that does not include the bad reads, you can use PrintReads with the -rf BadCigar filter. Then you can use that file without the filter.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Hi, Geraldine:

    Thanks a lot for the info. Appreciated very much!

    Mike

  • mikemike Posts: 103Member

    Hi, Geraldine:

    Kind of strange: I can post my comments and questions in this thread but I can not post my questions and comments in another HyplotypeCaller discussion page. Not sure why? a web server issue

    http://gatkforums.broadinstitute.org/discussion/1203/haplotype-caller

    Thanks and best

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,681Administrator, GATK Developer admin

    Hi Mike,

    Your questions fell into the spam filter, I'm restoring them now. We've been fighting off a big wave of spam so the filter settings are on more stringent settings than usual. Hopefully now that I've verified you as not-spam you won't have any more problems.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Great! Thanks a lot for the info, Geraldine! I do have some questions on Hyplotypercallers with links, I noticed that including links could fall into the filters. Shall I post my questions with links in a attachment using attach a file button here? Hope you can recover my questions on Hyplotypecaller.

    Thanks again, Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,681Administrator, GATK Developer admin

    I deleted some of the duplicate comments to clean things up a bit, but answered all the unique questions I think.

    I will check the spam filter settings; it is true that including links makes it more likely that something will be flagged as spam, but you should be able to include links to our own documents. Thanks for your patience while we iron out these technical difficulties.

    Geraldine Van der Auwera, PhD

  • modi2020modi2020 Posts: 15Member

    Hi Geraldine, I work with horses and there are no gold standard SNPs or indels that I can use for the BaseRecalibrator step. I now have the reads processed up to this step after following the GATK best practices recommendations. Since I dont have the gold standards I am thinking to skip this step. Would this affect the quality of SNPs/indels that GATK will call? I am thinking to use a more stringent quality score to consider a SNP or indel real down the pipeline but I am not sure! Is there possibly something else I can do to be more confident in the variant calls that I obtain?

    Thank you so much

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,681Administrator, GATK Developer admin

    We don't recommend skipping base recalibration as this can significantly affect the quality of your results. The solution to your problem is to bootstrap your own set of high-confidence variants. This process is explained in our documentation, please have a look at it and let me know if you need any additional clarifications.

    Geraldine Van der Auwera, PhD

  • modi2020modi2020 Posts: 15Member

    Hi Geraldine,

    I am starting the bootstrap process and I had a small question in min. can we input the Haplotypecaller vcf output into the BaseRecalibrator using the --knownsites directly ? The reason I am asking is because it will encompass SNPs,INDELS and structural Variants. It seems like we have to use separate --knownsites commands for each type of variants from the following (page 16): http://www.broadinstitute.org/gatk//events/2038/GATKwh0-BP-3-Base_recalibration.pdf

    They used dbsnp.vcf for SNPs and a gold.standard.indels.vcf for indels.

    On the other hand the refference at: (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_BaseRecalibrator.html) Doesn't talk about having to separate them.

    Thank you :-)

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,681Administrator, GATK Developer admin

    Hi there,

    You don't need to separate variant types to input as knowns for BaseRecalibrator. It is jut a coincidence that it seems that way in the example, which is only meant to show that you can provide several different sources of known sites. So in theory you can use the output of HaplotypeCaller directly. However, in practice you may want to filter it first (using manual hard filters) to increase the confidence in your set of known sites.

    Geraldine Van der Auwera, PhD

  • modi2020modi2020 Posts: 15Member

    Thank you Geraldine,

    I will do that ! I am a little confused about the filtering part though. In the GATK Manual 2.4.3 (page 127) its mentioned that "it's generally best to turn down the confidence threshold to allow more borderline calls" They recommend filtering at a quality score of 20. http://www.broadinstitute.org/gatk/pdfdocs/GATK_GuideBook_2.4-3.pdf Do you think I should follow that or that I should be more strict filtering. If I should use the strict filtering, what threshold do you recommend?

    Thank you

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,681Administrator, GATK Developer admin

    @modi2020,

    There are two steps where you can impose some criteria to output only the higher-quality variants.

    1. At the calling step: you can adjust the call and emit confidence thresholds, which effectively filter calls based on the QUAL metric. That is what the article you mention is talking about. You can read more about these thresholds in our documentation. For "normal" calling, it is best to use low thresholds at this step, then be more strict later on when you either do variant recalibration or hard-filtering. If you are doing initial rounds of calling to produce a set of high confidence calls for base recalibration, you can increase the thresholds. Exactly what values to use depends on your data but I would recommend you start with at least 30. You can experiment a little depending on how many variants that yields.

    2. At the hard-filtering step: you can adjust the filtering parameters, which filter calls based on metrics other than QUAL, to be more stringent. Again, what values to use depends on your data, but you can start with the filtering recommendations we provide at the very end of this document: http://www.broadinstitute.org/gatk/guide/article?id=1186

    Just remember that here you are trying to be as specific as possible (not sensitive) so it is better to use more selective parameters than you would normally use in calling variants.

    Geraldine Van der Auwera, PhD

  • modi2020modi2020 Posts: 15Member

    Great! I got it! :-) Thank you so much for your thorough answer Geraldine. Have a great day!

    @Geraldine_VdAuwera said: modi2020,

    There are two steps where you can impose some criteria to output only the higher-quality variants.

    1. At the calling step: you can adjust the call and emit confidence thresholds, which effectively filter calls based on the QUAL metric. That is what the article you mention is talking about. You can read more about these thresholds in our documentation. For "normal" calling, it is best to use low thresholds at this step, then be more strict later on when you either do variant recalibration or hard-filtering. If you are doing initial rounds of calling to produce a set of high confidence calls for base recalibration, you can increase the thresholds. Exactly what values to use depends on your data but I would recommend you start with at least 30. You can experiment a little depending on how many variants that yields.

    2. At the hard-filtering step: you can adjust the filtering parameters, which filter calls based on metrics other than QUAL, to be more stringent. Again, what values to use depends on your data, but you can start with the filtering recommendations we provide at the very end of this document: http://www.broadinstitute.org/gatk/guide/article?id=1186

    Just remember that here you are trying to be as specific as possible (not sensitive) so it is better to use more selective parameters than you would normally use in calling variants.

  • drashu_11drashu_11 Posts: 11Member

    Hi

    I am using GATK ** PrintReads** after base-re-calibration__`````` my command as follows-------

       java -Xmx4g -jar /mypath/GenomeAnalysisTK.jar  -T PrintReads  --platform ILLUMINA  -R /mypath/genome.fasta 
       -I /mypath/S_01_realigned.bam  -BQSR /mypath/S_01_rcal.grp  -o /mypath/S_01_rcal.bam   -S LENIENT -nct 4
    

    It seems like working with progress meter and done for 100% ,but after the job done I could not get my output file

    Any help?

    Thanks in advance

    Ashutosh

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,681Administrator, GATK Developer admin

    Hi @drashu_11,

    Can you post the console output?

    Geraldine Van der Auwera, PhD

  • drashu_11drashu_11 Posts: 11Member

    Hi Geraldine

    I like to mention here my input file is 59 Gb but the out file only 2 kb

    My console output was ended as follows

    INFO  19:11:06,891 ProgressMeter -    ChrX:9512123        5.72e+08  107.0 m       11.0 s     94.8%       112.9 m     5.9 m
    INFO  19:11:38,331 ProgressMeter -   ChrX:29712341        5.76e+08  107.5 m       11.0 s     95.5%       112.5 m     5.0 m
    INFO  19:15:01,854 ProgressMeter -   ChrX:32266594        5.77e+08  110.9 m       11.0 s     95.6%       116.0 m     5.1 m
    INFO  19:15:46,478 ProgressMeter -   ChrX:60829604        5.83e+08  111.6 m       11.0 s     96.7%       115.5 m     3.8 m
    INFO  19:16:16,987 ProgressMeter -   ChrX:82404445        5.88e+08  112.2 m       11.0 s     97.5%       115.0 m     2.9 m
    INFO  19:16:50,350 ProgressMeter -  ChrX:105695745        5.94e+08  112.7 m       11.0 s     98.4%       114.6 m   111.0 s
    INFO  19:17:23,795 ProgressMeter -  ChrX:130351776        5.99e+08  113.3 m       11.0 s     99.3%       114.1 m    47.0 s
    INFO  19:17:59,245 ProgressMeter -  ChrX:148215659        6.03e+08  113.9 m       11.0 s    100.0%       113.9 m     1.0 s
    INFO  19:18:25,233 ReadShardBalancer$1 - Loading BAM index data for next contig
    INFO  19:18:25,266 Walker - [REDUCE RESULT] Traversal result is: org.broadinstitute.sting.gatk.io.stubs.SAMFileWriterStub@5c19f8f0
    INFO  19:18:25,287 ProgressMeter -            done        6.03e+08  114.3 m       11.0 s    100.0%       114.3 m     0.0 s
    INFO  19:18:25,288 ProgressMeter - Total runtime 6857.58 secs, 114.29 min, 1.90 hours
    INFO  19:18:25,372 MicroScheduler - 5553 reads were filtered out during traversal out of 205559 total (2.70%)
    INFO  19:18:25,373 MicroScheduler -   -> 5553 reads (2.70% of total) failing MappingQualityZeroFilter
    INFO  19:18:26,854 GATKRunReport - Uploaded run statistics report to AWS S3
    

    Thank you

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,681Administrator, GATK Developer admin

    I just noticed you have --platform ILLUMINA in your command line. You do realize that this argument excludes reads with this platform tag from the output, right? (as documented here)

    If your data was produced with Illumina, all your reads will be excluded in that way, which would explain the empty file.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.