The current GATK version is 3.3-0

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# BaseRecalibrator

Posts: 11Member
edited October 2012

This method is described to be the "First pass of the base quality score recalibration". What is the second pass? It is not mentioned anywhere, or am I looking in the wrong place? In v1.2 there were two steps, is there only one step now for bqsr? Confused, Juan

Post edited by Geraldine_VdAuwera on
Tagged:

Hi Juan,

Th BQSR process has been reorganized; now the second step is simply done with PrintReads as detailed here: http://gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibrator

Sorry for the confusion, we'll try to make it clearer in the documentation.

Geraldine Van der Auwera, PhD

• Posts: 103Member
edited October 2012

Hi,

I used to run the base quality score recalibration from the old version, and now use the new version (GATK 2.0), which seems having quite a bit change, not just the commands used, but also the plotting. Here is what I used:

first I did command for BaseRecalibrator:

java -Xmx4g -jar  /Path/GenomeAnalysisTK-2.1-8-g5efb575/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -R /Path/hg19.fa -knownSites /Path/bundle-1.5/hg19/dbsnp_135.hg19.vcf -I /Path/mySample.bam -o /Path/mySample_CovarTable_Recal.grp


java -Xmx4g -jar  /Path/GenomeAnalysisTK-2.1-8-g5efb575/bin/GenomeAnalysisTK.jar -T PrintReads -R /Path/hg19.fa -I /Path/mySample.bam -BQSR /Path/mySample_CovarTable_Recal.grp -o /Path/mySample_recalBam.bam


All the commands went through fine, and I got the recalibrated bam file and one single corresponding pdf file for the plottings. In old version of GATK, I used CountCovariates, TableRecalibration, and AnalyzeCovariates to get plots for data before and after recalibration. Now I am only getting a single pdf file, which seem only got plots of original data, but did not have plots for the data after recalibration.

What did I miss in the commands? How do I get the plots for data after recalibration or within the same plots of old data? I saw the plots has legend labels with Recalibration and ORIGINAL, but it seems only original data was plotted but not the recalibrated new data.

Mike

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

• Posts: 103Member

Thanks so much for the info, Geraldine! Mike

• Posts: 103Member

Hi, Geraldine:

Just want to make sure. based on the link you just gave me:http://gatkforums.broadinstitute.org/discussion/1539/baserecalibrator-plots

it mentioned: Now run the BaseRecalibrator again using the original bam and passing in the recalibration table you just generated (-BQSR recal_data.grp). This additional argument tells the GATK-engine to actually perform the recalibration. The output will be a pdf showing both the pre- and post- recalibration accuracy on the same plots....

In other word, I shall use the following command: java -Xmx4g -jar /Path/GenomeAnalysisTK-2.1-8-g5efb575/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -R /Path/hg19.fa -knownSites /Path/bundle-1.5/hg19/dbsnp_135.hg19.vcf -I /Path/mySample.bam -BQSR /Path/mySample_CovarTable_Recal.grp

it will create a single pdf file with the plot of both data before and after recalibration, right? Since previously I already got a pdf file with just plot of old data before recal. and teh above command will plot the recal data on top of the old pdf plots, right?

Thanks again

Mike

Yup, that should do the trick.

Geraldine Van der Auwera, PhD

• Posts: 103Member

Hi, Geraldine:

In order to get the plots for data after recalibration or within the same plots of old data, I did try the command based on our above discussion as below: command I: java -Xmx4g -jar /path/GenomeAnalysisTK-2.1-8-g5efb575/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -R /path/hg19.fa -knownSites /path/bundle-1.5/hg19/dbsnp_135.hg19.vcf -I /path/mySample.bam -BQSR /path/mySample_CovarTable_Recal.grp

However, I got the error as below:

##### ERROR ---------------------------------------------------

However, before I run the above command, I already run command as below with -o option: command II java -Xmx4g -jar /path/GenomeAnalysisTK-2.1-8-g5efb575/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -R /path/hg19_chrM_1st.fa -knownSites /Path/bundle-1.5/hg19/dbsnp_135.hg19.vcf -I /path/mySample.bam -o /Path/mySmaple_CovarTable_Recal.grp

This command II used -o /Path/mySmaple_CovarTable_Recal.grp to create the .grp file for the command I. If command I needs the -o, as the error message mentioned, what shall I use for -o ? like the command below?

java -Xmx4g -jar /path/GenomeAnalysisTK-2.1-8-g5efb575/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -R /path/hg19.fa -knownSites /path/bundle-1.5/hg19/dbsnp_135.hg19.vcf -I /path/mySample.bam -BQSR /path/mySample_CovarTable_Recal.grp -o /Path/mySmaple_CovarTable_Recal.grp

so grp file /Path/mySmaple_CovarTable_Recal.grp will be in both -o and -BQSR option, what did I miss here?

by the way, I did use PrintReads as below to create the recalibrated bam file, which works fine: java -Xmx4g -jar /path/GenomeAnalysisTK-2.1-8-g5efb575/bin/GenomeAnalysisTK.jar -T PrintReads -R /path/hg19.fa -I /path/mySample_realignedBam.bam -BQSR /path/mySample_CovarTable_Recal.grp -o /path/mySmaple_realign_recalBam.bam

Thanks a lot for your help!

Mike

Hi mike, you need to give a different name for the -o .grp file in the second command. In that second run, the recalibrator creates a new output table.  The first time you run it the table represents the original data.  The second time you run it (with the original table as input to -BQSR) it represents the recalibrated data.

Geraldine Van der Auwera, PhD

• Posts: 103Member
edited October 2012

Dear Geraldine:

Thanks so much for the great advice and clarification!!! if I understand correctly this time based on your comments, here are what supposed to be (apprarently, the documentation did not state that clear):

first I need to use the following command (command A) to create grp file for the original data (not yet being recalibrated).

java -Xmx4g -jar /path/GenomeAnalysisTK-2.1-8-g5efb575/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -R /path/hg19_chrM_1st.fa -knownSites /Path/bundle-1.5/hg19/dbsnp_135.hg19.vcf -I /path/mySample.bam -o /Path/mySample_CovarTable_Orig.grp


then I need to use the grp file (mySample_CovarTable_Orig.grp) for the original data created from the above command to do the recalibration using the following command (command B; this command will also plot data of both old and after recalibrated)

java -Xmx4g -jar /path/GenomeAnalysisTK-2.1-8-g5efb575/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -R /path/hg19.fa -knownSites /path/bundle-1.5/hg19/dbsnp_135.hg19.vcf -I /path/mySample.bam -BQSR /Path/mySample_CovarTable_Orig.grp -o /Path/mySample_CovarTable_Recal.grp


Then I can use the newly created grp table file for rercalibrated data(/Path/mySample_CovarTable_Recal.grp) to use the following command (command C) to create the recalibrated bam file

java -Xmx4g -jar /path/GenomeAnalysisTK-2.1-8-g5efb575/bin/GenomeAnalysisTK.jar -T PrintReads -R /path/hg19.fa -I /path/mySample.bam -BQSR /path/mySample_CovarTable_Recal.grp -o /path/mySample_recalBam.bam


I originally only used the grp table right after the first BaseRecalibrator command to do the PrintReads command, it seems creating a bam file not being recalibrated (I previously did not do the command B).

Please confirm with me on the above commands and hope my understanding finally got right.

Thanks again for your great help!

Mike

Post edited by Geraldine_VdAuwera on
• Posts: 103Member

Hi, Dear Geraldine:

Thank you for the great advice, the commnads work great now and I finally saw the plots with both Original data and recalibrated data.

Just two follow-up questions:

1. As I mentioned in my last post, I now used -BQSR /path/mySample_CovarTable_Recal.grp in command C to create the recalibrated bam file. But before that, due to my previous misunderstanding, I instead used -BQSR /path/mySample_CovarTable_Orig.grp in command C. I wonder if I do that, what I really get, since GATK did not give any error for that, does that mean I create the original bam file without recalibration?

2. Is there any new documents to briefly explain what are plotted in the pdf file created in my command B as above? I saw the plots at the following URL:

but they are from old version (I did use old version as well). For the plots from new version, any a bit detailed descriptions available to interpretate the plots? for example, I understand that original usually overestimate the quality scores ( I can see that in top panel of page 2). However, page 2 of the pdf file, the middle panel shows quality score accuracy and middel panel of page 3 shows Mean Quality Score for base substitution, (also insertion, deletion etc) respectively for cycle covariate, I am not quite following what exactly are these base substitution, base insertion and deletion meant here? We are talking about reads of the bam file, right? But these seem talking about the SNPs. Indels relative to the reference bases?

Thansk a lot for your help!

Mike

Hi Mike,

Sorry to get back to you so late, things have been busy over here and your question slipped through the cracks.

To clarify the base recalibration workflow:

For basic usage, i.e. to just recalibrate your bam file, all you need to do is run BaseRecalibrator once, then use the .grp from that run to Print Reads, and that will yield your recalibrated dataset. The second BaseRecalibrator run is just used to generate the before/after plots; it is not required for the actual recalibration.

So in your case, your Command C should use the .grp file generated by the first run of the recalibrator.

I realize this is a little confusing; we'll try to improve the documentation to make that clearer. We're also going to add more explanations about how to interpret the plots, but we have a few higher-priority todo items to deal with first, so I can't promise you an ETA on those doc upgrades.

Geraldine Van der Auwera, PhD

• Posts: 103Member
edited October 2012

Hi, Dear Geraldine:

Thanks a lot for the clarification and getting back to me, which is very helpful. Otherwise, my data looks got screwed since I did use the mySample_CovarTable_Recal.grp (from second run of BaseRecalibrator) to generate recalibrated bam file. It is kind of confusing though since the documentation did not explicitly mention that particularly "your Command C should use the .grp file generated by the first run of the recalibrator". So I have to re-run my data for command C with using the .grp come off the first run of the BaseRecalibrator run because of that. Then that brings up my curiosity about the following: when I run the command B as I copied and pasted below from above discussion:

java -Xmx4g -jar /path/GenomeAnalysisTK-2.1-8-g5efb575/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -R /path/hg19.fa -knownSites /path/bundle-1.5/hg19/dbsnp_135.hg19.vcf -I /path/mySample.bam -BQSR /Path/mySample_CovarTable_Orig.grp -o /Path/mySample_CovarTable_Recal.grp


Based on your new comments, the output .grp file: mySample_CovarTable_Recal.grp is not needed for the command C to use PrintReads to generate recalibrated bam file, and so my question is; if it is used for the plotting (since I remember I did try command B without -o /Path/mySample_CovarTable_Recal.grp option, GATK did complain about the missing -o. Does this mean the created mySample_CovarTable_Recal.grp from 2nd run only has a role in plotting, whereas mySample_CovarTable_Orig.grp from the first run in fact already contains the info to create recalibrated bam file as you just mentioned.

Thanks again for your great help! really appreciated your time for the supporting!!! I am getting much more knowledgeable with your clarification. Just try to get a bit clearer on how the output and input were used in these commands.

Best

Mike

Post edited by Geraldine_VdAuwera on

Hi Mike,

That's correct, the mySample_CovarTable_Orig.grp from your first run contains the info for recalibration, and the mySample_CovarTable_Recal.grp file from your second run is only used for plotting.

I realize this is confusing for many people, so I will try to make a picture of the process to show which file is used for what. I will add that to the documentation soon.

Geraldine Van der Auwera, PhD

• Posts: 103Member

Hi, Dear Geraldine:

Thanks a lot again for your very patient efforts to clarify things to me, which is highly appreciated!

Thanks again and best

Mike

• Posts: 8Member

Hi! I am running the latest version of BaseRecalibrator and in the release notes we can read that: "Added new arguments to separately enumerate the .csv and .pdf output files.". I have seen in the documentation that we can use the argument --plot_pdf_file to get the pdf output file but the argument for getting the .csv output file is not described. I tried to run adding the -k flag but I got the error message "Argument with name 'k' isn't defined". Thanks for any help!

• Posts: 683GATK Developer mod

Sorry for the confusion. I've updated the release notes.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 1Member
edited January 2013

Dear All,

I have tried the below

java -jar GATK/GenomeAnalysisTK.jar \ -T BaseRecalibrator \ -nct 12 \ -I /data/recal/SAMPLE_realign.bam \ -RREFERENCE/human_g1k_v37.fasta \
-knownSites REFERENCE/dbsnp_137.b37.vcf \ -L 1:1-20000 \ -o /data/recal/SAMPLE_recal.grp \ -intermediate /data/recal/SAMPLE_recal.csv \ -plots /data/recal/SAMPLE_recal.pdf  It works fine except that pdf file was not produced. Please, help! Luke Post edited by Geraldine_VdAuwera on • Posts: 6,423Administrator, GATK Developer admin Hi Luke, this has been covered in other threads on this forum. Please search and read them to find the solutions to your problem. Geraldine Van der Auwera, PhD • Posts: 59Member Every time I run baserecalibrator with -plots option it throws the following error MESSAGE: Argument with name 'plots' isn't defined. From the above post I believe it should work fine. java -Xmx4g -jar /u1/tools/public/GenomeAnalysisTK/GenomeAnalysisTK.jar -T BaseRecalibrator -I SRR064560_realigned.bam -R ../../Soybean_ref_genome.fasta -run_without_dbsnp_potentially_ruining_quality -BQSR recal_data.grp -o new_recal.grp -plots comp.pdf • Posts: 6,423Administrator, GATK Developer admin Hi there, what version of gatk are you using? If it is an older one the argument may not apply; if so you should upgrade. Geraldine Van der Auwera, PhD • Posts: 59Member I am using v2.0-38 • Posts: 59Member edited February 2013 Oops you are right it works fine with the older version. I have run steps upto indelrealigner with the older version v2.0-38. I hope it would not be a problem if I run the rest of the steps with newer version. Kindly let me know if I am wrong. Thanks Post edited by smk_84 on • Posts: 6,423Administrator, GATK Developer admin That's ok, you shouldn't have any problems. Geraldine Van der Auwera, PhD • Posts: 37Member Hi, this thread is very helpful, thanks. Could the parameter of -BQSR be added to the PrintReads documentation? (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_readutils_PrintReads.html). I thought -BQSR was needed but as I didn't see it on the documentation page, I needed to do alot more googling to confirm. Thanks. • Posts: 6,423Administrator, GATK Developer admin I added a note, it will appear in the docs in next minor version release. Geraldine Van der Auwera, PhD • Posts: 2Member edited April 2013 Sorry! i fixed the Error, it was only declaration Error! BIG SORRY Post edited by Innsbrucker on • Posts: 103Member Hi, I come back to this thread that I had asked a few questions before for the current version 2.6-4. a minor and quick question: in the new page of BQSR, the example gives the following option at URL: http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_BaseRecalibrator.html -o recal_data.table But I used to use -o recal_data.grp the file extension shall not matter,right? grp just extension or can be any extension like .table? Also, based on our previous discussion above: here is recap: For basic usage, i.e. to just recalibrate your bam file, all you need to do is run BaseRecalibrator once, then use the .grp from that run to Print Reads, and that will yield your recalibrated dataset. The second BaseRecalibrator run is just used to generate the before/after plots; it is not required for the actual recalibration. So the option -BQSR is still used to just replot the data before/after recalibration as well as PrintRead, right? It was mentioned "I realize this is confusing for many people, so I will try to make a picture of the process to show which file is used for what. I will add that to the documentation soon." I seem could not find the clarification for those. Just want to make sure. Thanks for your help! Mike • Posts: 6,423Administrator, GATK Developer admin Hi @mike, That's correct, the extension doesn't matter. We've changed it to .table in the newer examples because it's more explicit than .grp (which stood for gatkreport) but it doesn't affect how the file is recognized or handled in any way. For basic usage, yes, just do BaseRecalibrator once, then use PrintReads with -BQSR recal.table (or whatever your recalibration file is called) to generate the recalibrated bam file. The -BQSR option tells the engine to do on-the-fly recalibration on the input data using the recalibration table. The workflow for plotting has changed slightly, though I haven't had time to update all the relevant docs yet. It still requires doing a second pass with BaseRecalibrator, but now the plots are generated by a separate tool called AnalyzeCovariates. See the tech doc page for details. I am currently working on updating the docs for the entire Best Practices pipeline of tools. Once that's done it will include pictures of the workflow for each tool. Geraldine Van der Auwera, PhD • Posts: 103Member Hi, Geraldine: Thanks a lot for your advice and info. However, you just mentioned that: It still requires doing a second pass with BaseRecalibrator, but now the plots are generated by a separate tool called AnalyzeCovariates. See the tech doc page for details..... When I went to that technical page, I am getting more confused, since -T AnalyzeCovariates is from old version of GATK at this step as I recalled, and for the new version after 2.0 just use BaseRecalibrator to plot, here is what I used to use: 1. generate recalibration table: java -Xmx4g -jar /Path/GenomeAnalysisTK-2.1-13-g1706365/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -R /Path/hg19.fa -knownSites /Path/dbsnp_135.hg19.vcf -knownSites Mills_and_1000G_gold_standard.indels.hg19.vcf -knownSites 1000G_phase1.indels.hg19.vcf -I /Path/MyBam.bam -o /Path/myBam_CovarTable_Orig.grp 2. Plot the data before and after recalibration java -Xmx4g -jar /Path/GenomeAnalysisTK-2.1-13-g1706365/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -R /Path/hg19.fa -knownSites /Path/dbsnp_135.hg19.vcf -knownSites Mills_and_1000G_gold_standard.indels.hg19.vcf -knownSites 1000G_phase1.indels.hg19.vcf -I /Path/MyBam.bam -BQSR /Path/myBam_CovarTable_Orig.grp -o /Path/myBam_CovarTable_Recal.grp 3. Generate the recalibrated bam file. java -Xmx4g -jar /Path/GenomeAnalysisTK-2.1-13-g1706365/bin/GenomeAnalysisTK.jar -T PrintReads -R /Path/hg19.fa -I /Path/MyBam.bam -BQSR /Path/myBam_CovarTable_Orig.grp -o /Path/MyBam_recalBam.bam My question is: for verison 2.6-4, are above command still working or I have to use AnalyzeCovariates? Thanks again! Mike • Posts: 6,423Administrator, GATK Developer admin I realize it's confusing, but AnalyzeCovariates is a reimplementation of the old tool, brought back to specifically handle the plotting. This makes it possible to plot various comparisons of recalibration runs more easily. Your command lines to run the recalibration are still applicable, yes. It's just that BaseRecalibrator itself will no longer output plots. I'm attaching a draft presentation that details all the steps; hopefully that will help. Geraldine Van der Auwera, PhD • Posts: 103Member Hi, Geraldine: Thanks a lot for the info and advice, which is very helpful! Appreciated it very much! Best Mike • Posts: 103Member HI, Geraldine: I did run into some odd issues here, not sure if it is caused by our original BWA mapping or by the BaseRecalibrator of GATK. We have 24 samples, 22 of them went through phase I without any issues, but two samples stuck at BaseRecalibrator step for a very similar error message: the command: java -Xmx4g -jar /PATH/gatk/2.6-4/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -R /PATH/hg19.fa -knownSites /Path/dbsnp_137.hg19.vcf -knownSites /Path/Mills_and_1000G_gold_standard.indels.hg19.vcf -knownSites /Path/1000G_phase1.indels.hg19.vcf -I /Path/mybam.bam -o /Path/mybam_CovarTable_Orig.grp one sample has error message as below: ##### ERROR MESSAGE: START (101) > (100) STOP -- this should never happen, please check read: HWI-ST1190R:85:D261HACXX:2:2204:3458:81916 2/2 101b aligned read. (CIGAR: 94M4I3M3D) I pull out the reads from bam file of our original and after realignment, they are the same (I was worrying that realigner of GATK may have changed the CIGAR string, but it did not), the reads (pair) are as below: HWI-ST1190R:85:D261HACXX:2:2204:3458:81916 81 chr6 133147917 37 101M = 133147921 -97 ATCTTATTCTCCTACTAAGACATTCTAATTTGAAAATTAGGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCAGAGGCGGGCGGAT DDDDDCA?DCDEEDDEDEDCDCCCDDDCCDCEDEEDDDBDDBBCDFFHHGJJJJJJJJHJJJJJIIJIHHIHHHGCJJJJJIGJIIIJHGHHHFFFFFCCC X0:i:1 X1:i:0 MD:Z:1C0T0C97 RG:Z:D261HACXX_KK24_F-FXP0452L XG:i:0 AM:i:23 NM:i:3 SM:i:37 XM:i:3 XO:i:0 MQ:i:23 XT:A:U HWI-ST1190R:85:D261HACXX:2:2204:3458:81916 161 chr6 133147921 23 94M4I3M3D = 133147917 97 TATTCTCCTACTAAGACATTCTAATTTGAAAATTAGGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCAGAGGCGGGCGGGGAGAT CCCFFFFFGGHHHJJJIJJJJJJJIJIIJJIJIJJIIJIIJJJJJJIHEHBBDFCEDEBDDDDEEEDDDDDDDDDDDDDDDDDDDDDDDDDB######### X0:i:1 X1:i:1 XA:Z:chr6,+133147921,94M4I3M2D,5; MD:Z:97^CAC0 RG:Z:D261HACXX_KK24_F-FXP0452L XG:i:1 AM:i:23 NM:i:7 SM:i:23 XM:i:4 XO:i:1 MQ:i:37 XT:A:U Another sample has similar error message as below: ##### ERROR MESSAGE: START (101) > (100) STOP -- this should never happen, please check read: HWI-ST1190R:84:D24TEACXX:3:1204:5040:69275 1/2 101b aligned read. (CIGAR: 95M4I2M3D) Any idea? Do you think it is our BWA alignment issue or some issue within BaseRecalibrator causing that ? Anybody saw this type of error before? Thanks a lot! Mike • Posts: 6,423Administrator, GATK Developer admin Hi Mike, We've seen issues before with reads whose CIGAR ends in deletions, but I thought that was fixed in 2.6. You can try running with the latest nightly build in case the fix I'm thinking of was put in there after the recent release. If that doesn't work you can add -rf BadCigar to filter out those reads. Assuming you don't have too many of these it shouldn't affect your analysis. But it's good practice to have a look at the filtering summary that gets printed at the end of the run, to make sure that you're not losing too much data to this. Geraldine Van der Auwera, PhD • Posts: 103Member Hi, Geraldine: Thanks a lot for the info and advice., I will give a try on your advice, Best Mike • Posts: 103Member Hi, Geraldine: Just FYI, the -rf BadCigar option works with 244 reads (0.00% of total) failing BadCigarFilter, so not so bad. but not the nightly build at version 2.6-4-gd481978 still failed with the same error message. Thanks, Mike • Posts: 6,423Administrator, GATK Developer admin Thanks for reporting back with your status, Mike. Sounds like you have few bad reads so you can go ahead with that solution. I'll look into why we're not handling the case properly by default. Geraldine Van der Auwera, PhD • Posts: 103Member Hi, Geraldine: I did run into another relevant issue downstream: when I used Unified genotyper to call variants, I got error like below: ##### ERROR MESSAGE: SAM/BAM file SAMFileReader{/is2/projects/kraemer_ttd/scratch/variants/BWA-aln_GATK-newdb/phase_I/Sample_5.BWA-aln.sorted.RG.sorted.MKDUP_realign_recalBam.bam} is malformed: read starts with deletion. Cigar: 1D3M2I96M. Although the SAM spec technically permits such reads, this is often indicative of malformed files. If you are sure you want to use this file, re-run your analysis with the extra option: -rf BadCigar I can re-run Unified genotyper with the extra option: -rf BadCigar, just want to check back with you: is this error normal? and run with that option would have no issues with variant calling? So are downstream analysis such as VQSR step, do I need this option again? Thanks, Mike • Posts: 6,423Administrator, GATK Developer admin This can occur for various reasons. As long as the filtering summary shows that the number of reads filtered out is small (as in your previous comment) it is not a cause for concern, and will not affect the quality of your results. Once you're past the variant calling step and working with the resulting vcf, you no longer need this option for the remaining operations, such as VQSR. You may need to use it again if you do something that involves using the bam file, for example if you use VariantAnnotator to add annotations based on the read data, but that's all. If you want to generate a bam file that does not include the bad reads, you can use PrintReads with the -rf BadCigar filter. Then you can use that file without the filter. Geraldine Van der Auwera, PhD • Posts: 103Member Hi, Geraldine: Thanks a lot for the info. Appreciated very much! Mike • Posts: 103Member Hi, Geraldine: Kind of strange: I can post my comments and questions in this thread but I can not post my questions and comments in another HyplotypeCaller discussion page. Not sure why? a web server issue http://gatkforums.broadinstitute.org/discussion/1203/haplotype-caller Thanks and best Mike • Posts: 6,423Administrator, GATK Developer admin Hi Mike, Your questions fell into the spam filter, I'm restoring them now. We've been fighting off a big wave of spam so the filter settings are on more stringent settings than usual. Hopefully now that I've verified you as not-spam you won't have any more problems. Geraldine Van der Auwera, PhD • Posts: 103Member Great! Thanks a lot for the info, Geraldine! I do have some questions on Hyplotypercallers with links, I noticed that including links could fall into the filters. Shall I post my questions with links in a attachment using attach a file button here? Hope you can recover my questions on Hyplotypecaller. Thanks again, Mike • Posts: 6,423Administrator, GATK Developer admin I deleted some of the duplicate comments to clean things up a bit, but answered all the unique questions I think. I will check the spam filter settings; it is true that including links makes it more likely that something will be flagged as spam, but you should be able to include links to our own documents. Thanks for your patience while we iron out these technical difficulties. Geraldine Van der Auwera, PhD • Posts: 15Member Hi Geraldine, I work with horses and there are no gold standard SNPs or indels that I can use for the BaseRecalibrator step. I now have the reads processed up to this step after following the GATK best practices recommendations. Since I dont have the gold standards I am thinking to skip this step. Would this affect the quality of SNPs/indels that GATK will call? I am thinking to use a more stringent quality score to consider a SNP or indel real down the pipeline but I am not sure! Is there possibly something else I can do to be more confident in the variant calls that I obtain? Thank you so much • Posts: 6,423Administrator, GATK Developer admin We don't recommend skipping base recalibration as this can significantly affect the quality of your results. The solution to your problem is to bootstrap your own set of high-confidence variants. This process is explained in our documentation, please have a look at it and let me know if you need any additional clarifications. Geraldine Van der Auwera, PhD • Posts: 15Member Hi Geraldine, I am starting the bootstrap process and I had a small question in min. can we input the Haplotypecaller vcf output into the BaseRecalibrator using the --knownsites directly ? The reason I am asking is because it will encompass SNPs,INDELS and structural Variants. It seems like we have to use separate --knownsites commands for each type of variants from the following (page 16): http://www.broadinstitute.org/gatk//events/2038/GATKwh0-BP-3-Base_recalibration.pdf They used dbsnp.vcf for SNPs and a gold.standard.indels.vcf for indels. On the other hand the refference at: (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_BaseRecalibrator.html) Doesn't talk about having to separate them. Thank you :-) • Posts: 6,423Administrator, GATK Developer admin Hi there, You don't need to separate variant types to input as knowns for BaseRecalibrator. It is jut a coincidence that it seems that way in the example, which is only meant to show that you can provide several different sources of known sites. So in theory you can use the output of HaplotypeCaller directly. However, in practice you may want to filter it first (using manual hard filters) to increase the confidence in your set of known sites. Geraldine Van der Auwera, PhD • Posts: 15Member Thank you Geraldine, I will do that ! I am a little confused about the filtering part though. In the GATK Manual 2.4.3 (page 127) its mentioned that "it's generally best to turn down the confidence threshold to allow more borderline calls" They recommend filtering at a quality score of 20. http://www.broadinstitute.org/gatk/pdfdocs/GATK_GuideBook_2.4-3.pdf Do you think I should follow that or that I should be more strict filtering. If I should use the strict filtering, what threshold do you recommend? Thank you • Posts: 6,423Administrator, GATK Developer admin There are two steps where you can impose some criteria to output only the higher-quality variants. 1. At the calling step: you can adjust the call and emit confidence thresholds, which effectively filter calls based on the QUAL metric. That is what the article you mention is talking about. You can read more about these thresholds in our documentation. For "normal" calling, it is best to use low thresholds at this step, then be more strict later on when you either do variant recalibration or hard-filtering. If you are doing initial rounds of calling to produce a set of high confidence calls for base recalibration, you can increase the thresholds. Exactly what values to use depends on your data but I would recommend you start with at least 30. You can experiment a little depending on how many variants that yields. 2. At the hard-filtering step: you can adjust the filtering parameters, which filter calls based on metrics other than QUAL, to be more stringent. Again, what values to use depends on your data, but you can start with the filtering recommendations we provide at the very end of this document: http://www.broadinstitute.org/gatk/guide/article?id=1186 Just remember that here you are trying to be as specific as possible (not sensitive) so it is better to use more selective parameters than you would normally use in calling variants. Geraldine Van der Auwera, PhD • Posts: 15Member Great! I got it! :-) Thank you so much for your thorough answer Geraldine. Have a great day! @Geraldine_VdAuwera said: modi2020, There are two steps where you can impose some criteria to output only the higher-quality variants. 1. At the calling step: you can adjust the call and emit confidence thresholds, which effectively filter calls based on the QUAL metric. That is what the article you mention is talking about. You can read more about these thresholds in our documentation. For "normal" calling, it is best to use low thresholds at this step, then be more strict later on when you either do variant recalibration or hard-filtering. If you are doing initial rounds of calling to produce a set of high confidence calls for base recalibration, you can increase the thresholds. Exactly what values to use depends on your data but I would recommend you start with at least 30. You can experiment a little depending on how many variants that yields. 2. At the hard-filtering step: you can adjust the filtering parameters, which filter calls based on metrics other than QUAL, to be more stringent. Again, what values to use depends on your data, but you can start with the filtering recommendations we provide at the very end of this document: http://www.broadinstitute.org/gatk/guide/article?id=1186 Just remember that here you are trying to be as specific as possible (not sensitive) so it is better to use more selective parameters than you would normally use in calling variants. • Posts: 11Member Hi I am using GATK ** PrintReads** after base-re-calibration__ my command as follows-------  java -Xmx4g -jar /mypath/GenomeAnalysisTK.jar -T PrintReads --platform ILLUMINA -R /mypath/genome.fasta -I /mypath/S_01_realigned.bam -BQSR /mypath/S_01_rcal.grp -o /mypath/S_01_rcal.bam -S LENIENT -nct 4  It seems like working with progress meter and done for 100% ,but after the job done I could not get my output file Any help? Thanks in advance Ashutosh • Posts: 6,423Administrator, GATK Developer admin Hi @drashu_11, Can you post the console output? Geraldine Van der Auwera, PhD • Posts: 11Member Hi Geraldine I like to mention here my input file is 59 Gb but the out file only 2 kb My console output was ended as follows INFO 19:11:06,891 ProgressMeter - ChrX:9512123 5.72e+08 107.0 m 11.0 s 94.8% 112.9 m 5.9 m INFO 19:11:38,331 ProgressMeter - ChrX:29712341 5.76e+08 107.5 m 11.0 s 95.5% 112.5 m 5.0 m INFO 19:15:01,854 ProgressMeter - ChrX:32266594 5.77e+08 110.9 m 11.0 s 95.6% 116.0 m 5.1 m INFO 19:15:46,478 ProgressMeter - ChrX:60829604 5.83e+08 111.6 m 11.0 s 96.7% 115.5 m 3.8 m INFO 19:16:16,987 ProgressMeter - ChrX:82404445 5.88e+08 112.2 m 11.0 s 97.5% 115.0 m 2.9 m INFO 19:16:50,350 ProgressMeter - ChrX:105695745 5.94e+08 112.7 m 11.0 s 98.4% 114.6 m 111.0 s INFO 19:17:23,795 ProgressMeter - ChrX:130351776 5.99e+08 113.3 m 11.0 s 99.3% 114.1 m 47.0 s INFO 19:17:59,245 ProgressMeter - ChrX:148215659 6.03e+08 113.9 m 11.0 s 100.0% 113.9 m 1.0 s INFO 19:18:25,233 ReadShardBalancer1 - Loading BAM index data for next contig
INFO  19:18:25,266 Walker - [REDUCE RESULT] Traversal result is: org.broadinstitute.sting.gatk.io.stubs.SAMFileWriterStub@5c19f8f0
INFO  19:18:25,287 ProgressMeter -            done        6.03e+08  114.3 m       11.0 s    100.0%       114.3 m     0.0 s
INFO  19:18:25,288 ProgressMeter - Total runtime 6857.58 secs, 114.29 min, 1.90 hours
INFO  19:18:25,372 MicroScheduler - 5553 reads were filtered out during traversal out of 205559 total (2.70%)
INFO  19:18:25,373 MicroScheduler -   -> 5553 reads (2.70% of total) failing MappingQualityZeroFilter
INFO  19:18:26,854 GATKRunReport - Uploaded run statistics report to AWS S3


Thank you

I just noticed you have --platform ILLUMINA in your command line. You do realize that this argument excludes reads with this platform tag from the output, right? (as documented here)