
# Base Quality Score Recalibration (BQSR)

edited September 2013

Detailed information about command line options for BaseRecalibrator can be found here.

## Introduction

The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.

New with the release of the full version of GATK 2.0 is the ability to recalibrate not only the well-known base quality scores but also base insertion and base deletion quality scores. These are per-base quantities which estimate the probability that the next base in the read was mis-incorporated or mis-deleted (due to slippage, for example). We've found that these new quality scores are very valuable in indel calling algorithms; in particular, these new probabilities fit very naturally as the gap penalties in an HMM-based indel calling algorithm. We suspect there are many other fantastic uses for these data.

This process is accomplished by analyzing the covariation among several features of a base. For example:

• Reported quality score
• The position within the read
• The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine

These covariates are then applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.

For example, pre-recalibration, a file could contain only bases reported as Q25, which seems good. However, it may be that these bases actually mismatch the reference at a rate of 1 in 100, so their true quality is Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machines, base mismatches with the reference occur more frequently at the ends of reads than at the beginning. Also, mismatches are strongly associated with sequencing context: the dinucleotide AC is often of much lower quality than TG. The recalibration tool will not only correct the average Q inaccuracy (shifting from Q25 to Q20) but also identify subsets of high-quality bases by separating low-quality AC bases at the ends of reads from high-quality TG bases at the starts of reads. See below for examples of pre- and post-recalibration values.
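The Q25-to-Q20 example above is just the standard Phred scale at work. A minimal sketch of the conversion (not GATK code, just the definition Q = -10·log10(p)):

```python
import math

def qual_to_error_prob(q):
    """Phred scale: Q = -10*log10(p), so p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_qual(p):
    return -10 * math.log10(p)

# A reported Q25 base claims roughly a 1-in-316 error rate...
print(round(qual_to_error_prob(25), 5))       # 0.00316
# ...but if such bases actually mismatch the reference 1 time in 100,
# their empirical quality is only Q20:
print(error_prob_to_qual(1 / 100))            # 20.0
```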

The system was designed so that users can easily add new covariates to the calculations. Users wishing to add their own covariate can simply look at QualityScoreCovariate.java for an example of how to implement the required interface. Each covariate is a Java class which implements the org.broadinstitute.sting.gatk.walkers.recalibration.Covariate interface. Specifically, the class needs to have a getValue method defined which looks at the read and associated sequence context and pulls out the desired information, such as machine cycle.

## Running the tools

### BaseRecalibrator


This GATK processing step walks over all of the reads in my_reads.bam and tabulates data about the following features of the bases:

• assigned quality score
• machine cycle producing this base
• current base + previous base (dinucleotide)

For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to dbSNP. After running over all reads, BaseRecalibrator produces a file called my_reads.recal_data.grp, which contains the data needed to recalibrate reads. The format of this GATK report is described below.
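The counting pass can be pictured as a big dictionary keyed by covariate bins. This is a hypothetical minimal sketch, not GATK's implementation; the read tuples and reference string are invented for illustration:

```python
from collections import defaultdict

def tabulate(reads, reference, known_sites):
    """Bin each aligned base by (read group, reported qual, cycle,
    dinucleotide); count observations and reference mismatches per bin,
    skipping sites in the known-variation mask (e.g. dbSNP)."""
    table = defaultdict(lambda: [0, 0])  # bin -> [observations, mismatches]
    for rg, start, seq, quals in reads:
        for cycle, (base, q) in enumerate(zip(seq, quals)):
            pos = start + cycle
            if pos in known_sites:       # known polymorphic site: skip
                continue
            dinuc = (seq[cycle - 1] if cycle > 0 else 'N') + base
            key = (rg, q, cycle, dinuc)
            table[key][0] += 1
            if base != reference[pos]:
                table[key][1] += 1
    return table

# toy data: one read, one mismatch at its last base
reads = [("rg1", 0, "ACGT", [30, 30, 30, 30])]
tab = tabulate(reads, reference="ACGA", known_sites=set())
print(tab[("rg1", 30, 3, "GT")])  # [1, 1]: one observation, one mismatch
```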

### Creating a recalibrated BAM

To create a recalibrated BAM you can use GATK's PrintReads with the engine's on-the-fly recalibration capability. Here is a typical command line to do so:

java -jar GenomeAnalysisTK.jar \
-R reference.fasta \
-I input.bam \
-BQSR recalibration_report.grp \
-o output.bam

After computing the covariates from the initial BAM file, we walk through the BAM file a second time and rewrite the quality scores (in the QUAL field) using the data in the recalibration_report.grp file. This step uses the recalibration table produced by BaseRecalibrator to recalibrate the quality scores in input.bam and writes out a new BAM file, output.bam, with recalibrated QUAL field values.

Effectively the new quality score is:

• the sum of the global difference between reported quality scores and the empirical quality
• plus the quality bin specific shift
• plus the cycle x qual and dinucleotide x qual effects
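The sum of shifts above can be sketched as follows. This is a hedged illustration of how the table deltas combine; the numeric shifts are invented for the example, not taken from a real report:

```python
def recalibrate(reported_q, qual_shift, cycle_delta, context_delta, global_shift):
    """New Q = reported Q
             + global (read-group-wide) reported-vs-empirical difference
             + shift specific to this reported-quality bin
             + cycle and dinucleotide effects for this base."""
    return reported_q + global_shift + qual_shift + cycle_delta + context_delta

# e.g. a reported Q25 base in a lane that over-reports by ~2, in a qual
# bin that runs 1 low, at a poor cycle and in a poor dinucleotide context:
print(recalibrate(25, qual_shift=-1.0, cycle_delta=-0.5,
                  context_delta=-1.5, global_shift=-2.0))  # 20.0
```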

Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as SNP calling. In addition, by accounting for quality changes by cycle and sequence context, we can identify truly high-quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.

### Miscellaneous information

• The recalibration system is read-group aware. It separates the covariate data by read group in the recalibration_report.grp file (using @RG tags) and PrintReads will apply this data for each read group in the file. We routinely process BAM files with multiple read groups. Please note that the memory requirements scale linearly with the number of read groups in the file, so that files with many read groups could require a significant amount of RAM to store all of the covariate data.
• A critical determinant of the quality of the recalibration is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases per read group from a next-generation DNA sequencer; 1B bases yields significantly better results.
• Unless your database of variation is so poor and/or variation so common in your organism that most of your mismatches are real SNPs, you should always perform recalibration on your BAM file. For humans, with dbSNP and now 1000 Genomes available, almost all of the mismatches - even in cancer - will be errors, and an accurate error model (essential for downstream analysis) can be ascertained.
• The recalibrator applies a Yates correction for low-occupancy bins. Rather than inferring the true Q score from # mismatches / # bases, we infer it from (# mismatches + 1) / (# bases + 2). This deals very nicely with overfitting: the correction has only a minor impact on data sets with billions of bases, but it is critical for avoiding overconfidence in rare bins in sparse data.
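The correction can be sketched directly from the formula above (a minimal illustration, not GATK code):

```python
import math

def empirical_quality(mismatches, bases):
    """Smoothed empirical quality: the raw rate mismatches/bases is
    replaced by (mismatches + 1) / (bases + 2), so empty or tiny bins
    get a finite, conservative Q instead of e.g. Q = infinity."""
    p = (mismatches + 1) / (bases + 2)
    return -10 * math.log10(p)

# 0 mismatches in 10 bases: the raw rate is 0 (infinite Q), but the
# smoothed estimate is a cautious ~Q10.8:
print(round(empirical_quality(0, 10), 1))   # 10.8
# With billions of bases the +1/+2 correction is negligible:
print(round(empirical_quality(0, 10**9), 1))
```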

## Example pre and post recalibration results

• Recalibration of a lane sequenced at the Broad by an Illumina GA-II in February 2010
• There is a significant improvement in the accuracy of the base quality scores after applying the GATK recalibration procedure

## The output of the BaseRecalibrator

• A Recalibration report containing all the recalibration information for the data

Note that BaseRecalibrator no longer produces plots; this is now done by the AnalyzeCovariates tool.

### The Recalibration Report

The recalibration report is a [GATKReport](http://gatk.vanillaforums.com/discussion/1244/what-is-a-gatkreport) and not only contains the main result of the analysis, but it is also used as an input to all subsequent analyses on the data. The recalibration report contains the following 5 tables:

• Arguments Table -- a table with all the arguments and their values
• Quantization Table
• ReadGroup Table
• Quality Score Table
• Covariates Table

#### Arguments Table

This table contains all the arguments used to run BQSRv2 for this dataset. It is important that the on-the-fly recalibration step use the same parameters that were used in the recalibration step (context sizes, covariates, ...).

Example Arguments table:

#:GATKTable:true:1:17::;
#:GATKTable:Arguments:Recalibration argument collection values used in this run
Argument                    Value
covariate                   null
default_platform            null
deletions_context_size      6
force_platform              null
insertions_context_size     6
...

#### Quantization Table

The GATK offers native support for quantizing base qualities. The GATK quantization procedure uses a statistical approach to determine the best binning system, i.e. the one that minimizes the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSRv2, a table with the base counts for each base quality is generated, along with a 'default' quantization table. This table is a required parameter for any other GATK tool if you want to quantize your quality scores.

The default behavior (currently) is to use no quantization when performing on-the-fly recalibration. You can override this with the engine argument -qq: use -qq 0 to disable quantization, or -qq N to recalculate the quantization bins on the fly using N bins. Note that quantization is completely experimental at this time and we do not recommend using it unless you are a very advanced user.
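To make the idea of "minimizing the error introduced by amalgamating qualities" concrete, here is a hedged sketch: greedily merge adjacent quality levels into N bins, always taking the merge with the smallest count-weighted distortion. GATK's actual QualQuantizer uses its own penalty based on estimated error rates; this toy version only illustrates the concept, using the counts from the example table below:

```python
def quantize(counts, n_bins):
    """counts: {quality: number of bases observed at that quality}."""
    bins = [[q] for q in sorted(counts)]  # start: one bin per quality

    def penalty(bin_quals):
        # representative = count-weighted mean quality of the bin;
        # penalty = total count-weighted distance to that representative
        total = sum(counts[q] for q in bin_quals)
        rep = sum(q * counts[q] for q in bin_quals) / total
        return sum(abs(q - rep) * counts[q] for q in bin_quals)

    while len(bins) > n_bins:
        # merge the adjacent pair whose union has the smallest penalty
        i = min(range(len(bins) - 1),
                key=lambda i: penalty(bins[i] + bins[i + 1]))
        bins[i:i + 2] = [bins[i] + bins[i + 1]]
    return bins

counts = {0: 252, 1: 15972, 2: 553525, 3: 2190142, 4: 5369681, 9: 83645762}
print(len(quantize(counts, 3)))  # 3
```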

Example Quantization table:

#:GATKTable:true:2:94:::;
#:GATKTable:Quantized:Quality quantization map
QualityScore  Count        QuantizedScore
0                     252               0
1                   15972               1
2                  553525               2
3                 2190142               9
4                 5369681               9
9                83645762               9
...

#### ReadGroup Table

This table contains the empirical quality scores for each read group, for mismatches, insertions, and deletions. This is no different from the table used in the old table recalibration walker.

#:GATKTable:false:6:18:%s:%s:%.4f:%.4f:%d:%d:;
#:GATKTable:RecalTable0:
ReadGroup  EventType  EmpiricalQuality  EstimatedQReported  Observations  Errors
SRR032768  D                   40.7476             45.0000    2642683174    222475
SRR032766  D                   40.9072             45.0000    2630282426    213441
SRR032764  D                   40.5931             45.0000    2919572148    254687
SRR032769  D                   40.7448             45.0000    2850110574    240094
SRR032767  D                   40.6820             45.0000    2820040026    241020
SRR032765  D                   40.9034             45.0000    2441035052    198258
SRR032766  M                   23.2573             23.7733    2630282426  12424434
SRR032768  M                   23.0281             23.5366    2642683174  13159514
SRR032769  M                   23.2608             23.6920    2850110574  13451898
SRR032764  M                   23.2302             23.6039    2919572148  13877177
SRR032765  M                   23.0271             23.5527    2441035052  12158144
SRR032767  M                   23.1195             23.5852    2820040026  13750197
SRR032766  I                   41.7198             45.0000    2630282426    177017
SRR032768  I                   41.5682             45.0000    2642683174    184172
SRR032769  I                   41.5828             45.0000    2850110574    197959
SRR032764  I                   41.2958             45.0000    2919572148    216637
SRR032765  I                   41.5546             45.0000    2441035052    170651
SRR032767  I                   41.5192             45.0000    2820040026    198762
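The EmpiricalQuality column in this table can be reproduced from the Observations and Errors columns using the smoothed estimate described earlier, Q = -10·log10((errors + 1) / (observations + 2)). Checking the SRR032768 deletion row:

```python
import math

# SRR032768, EventType D: 2642683174 observations, 222475 errors
obs, err = 2642683174, 222475
q = -10 * math.log10((err + 1) / (obs + 2))
print(round(q, 4))  # ~40.7476, matching the EmpiricalQuality column
```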

#### Quality Score Table

This table contains the empirical quality scores for each read group and original quality score, for mismatches, insertions, and deletions. This is no different from the table used in the old table recalibration walker.

#:GATKTable:false:6:274:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable1:
ReadGroup  QualityScore  EventType  EmpiricalQuality  Observations  Errors
SRR032767            49  M                   33.7794          9549        3
SRR032769            49  M                   36.9975          5008        0
SRR032764            49  M                   39.2490          8411        0
SRR032766            18  M                   17.7397      16330200   274803
SRR032768            18  M                   17.7922      17707920   294405
SRR032764            45  I                   41.2958    2919572148   216637
SRR032765             6  M                    6.0600       3401801   842765
SRR032769            45  I                   41.5828    2850110574   197959
SRR032764             6  M                    6.0751       4220451  1041946
SRR032767            45  I                   41.5192    2820040026   198762
SRR032769             6  M                    6.3481       5045533  1169748
SRR032768            16  M                   15.7681      12427549   329283
SRR032766            16  M                   15.8173      11799056   309110
SRR032764            16  M                   15.9033      13017244   334343
SRR032769            16  M                   15.8042      13817386   363078
...

#### Covariates Table

This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry in this table, stratified by read group and original quality score.

#:GATKTable:false:8:1003738:%s:%s:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable2:
ReadGroup  QualityScore  CovariateValue  CovariateName  EventType  EmpiricalQuality  Observations  Errors
SRR032767            16  TACGGA          Context        M                   14.2139           817      30
SRR032766            16  AACGGA          Context        M                   14.9938          1420      44
SRR032765            16  TACGGA          Context        M                   15.5145           711      19
SRR032768            16  AACGGA          Context        M                   15.0133          1585      49
SRR032764            16  TACGGA          Context        M                   14.5393           710      24
SRR032766            16  GACGGA          Context        M                   17.9746          1379      21
SRR032768            45  CACCTC          Context        I                   40.7907        575849      47
SRR032764            45  TACCTC          Context        I                   43.8286        507088      20
SRR032769            45  TACGGC          Context        D                   38.7536         37525       4
SRR032768            45  GACCTC          Context        I                   46.0724        445275      10
SRR032766            45  CACCTC          Context        I                   41.0696        575664      44
SRR032769            45  TACCTC          Context        I                   43.4821        490491      21
SRR032766            45  CACGGC          Context        D                   45.1471         65424       1
SRR032768            45  GACGGC          Context        D                   45.3980         34657       0
SRR032767            45  TACGGC          Context        D                   42.7663         37814       1
SRR032767            16  AACGGA          Context        M                   15.9371          1647      41
SRR032764            16  GACGGA          Context        M                   18.2642          1273      18
SRR032769            16  CACGGA          Context        M                   13.0801          1442      70
SRR032765            16  GACGGA          Context        M                   15.9934          1271      31
...
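The context covariate values above are k-mers ending at the base in question. A minimal sketch of extracting such a context (an assumption for illustration: the current base is taken as the final base of the k-mer, as the 6-mers in the table suggest; bases too close to the read start have no full context and would be skipped or padded in practice):

```python
def context(read, i, size=6):
    """Return the size-mer ending at position i of the read, or None
    if there are not enough preceding bases."""
    start = i - (size - 1)
    if start < 0:
        return None          # not enough preceding bases for a full context
    return read[start:i + 1]

read = "GGTACGGACT"
print(context(read, 7))  # 'TACGGA' (6-mer ending at index 7)
print(context(read, 2))  # None (only 2 preceding bases)
```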

## Troubleshooting

The memory requirements of the recalibrator will vary based on the type of JVM running the application and the number of read groups in the input bam file.

If the application reports 'java.lang.OutOfMemoryError: Java heap space', increase the max heap size provided to the JVM by adding '-Xmx????m' to the jvm_args variable in RecalQual.py, where '????' is the maximum available memory, in megabytes, on the processing computer.

I've tried recalibrating my data using a downloaded file, such as NA12878 on 454, and applying the table to any of the chromosome BAM files always fails due to hitting my memory limit. I've tried giving it as much as 15GB but that still isn't enough.

All of our big merged files for 454 are running with -Xmx16000m arguments to the JVM -- it's enough to process all of the files. 32GB might make the 454 runs a lot faster though.

I have a recalibration file calculated over the entire genome (such as for the 1000 genomes trio) but I split my file into pieces (such as by chromosome). Can the recalibration tables safely be applied to the per chromosome BAM files?

Yes they can. The original tables needed to be calculated over the whole genome but they can be applied to each piece of the data set independently.

I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.

The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately, without this information the data becomes almost completely unusable, since the quality of the bases will be inferred to be much lower than it actually is as a result of the reference-mismatching SNP sites.

However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works:

• First do an initial round of SNP calling on your original, unrecalibrated data.
• Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator.
• Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence.
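The bootstrapping loop above can be sketched as follows. This is a hypothetical driver, not a GATK tool: call_snps, recalibrate and high_confidence are placeholder stand-ins for the real steps (a variant caller, BaseRecalibrator + PrintReads, and a confidence filter); only the control flow is meant to be illustrative.

```python
def bootstrap_known_sites(bam, call_snps, recalibrate, high_confidence,
                          max_rounds=5):
    """Iterate call -> filter -> recalibrate until the set of
    high-confidence 'known' sites stops changing (convergence)."""
    known = set()            # start with no known-variation database
    for _ in range(max_rounds):
        calls = call_snps(recalibrate(bam, known))
        new_known = high_confidence(calls)
        if new_known == known:          # converged
            break
        known = new_known
    return known

# Toy stand-ins so the loop is runnable: "calling" always recovers the
# same three sites, so the loop converges after one refinement round.
truth = {10, 20, 30}
result = bootstrap_known_sites(
    bam="reads.bam",                       # hypothetical input name
    call_snps=lambda data: set(truth),
    recalibrate=lambda bam, known: bam,
    high_confidence=lambda calls: set(calls),
)
print(result == truth)  # True
```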

### Downsampling to reduce run time

For users concerned about run time please note this small analysis below showing the approximate number of reads per read group that are required to achieve a given level of recalibration performance. The analysis was performed with 51 base pair Illumina reads on pilot data from the 1000 Genomes Project. Downsampling can be achieved by specifying a genome interval using the -L option. For users concerned only with recalibration accuracy please disregard this plot and continue to use all available data when generating the recalibration table.

Post edited by Geraldine_VdAuwera

• Posts: 58Member

Hi,

Could anyone provide mathematical notes on what exactly BaseRecalibrator is doing?

Have a look at our archive of presentations:

https://www.dropbox.com/sh/e31kvbg5v63s51t/6GdimgsKss

Geraldine Van der Auwera, PhD

• Posts: 16Member

Hi,

I've generated the plots during recalibration, but I'm only getting plots for the original data (not pairs of plots like above). How can I get the plots for the recalibrated scores in order to see what's going on with my data?

This has been addressed previously on this forum, please do a search (box in the top right corner) and you will find an explanation.

Geraldine Van der Auwera, PhD

• Posts: 19Member

About the BaseRecalibrator performance. I am using GATKLite-2.3.4 and in a "best-practices"-like pipeline the base recalibration is the most time consuming task.

The base recalibration (BaseRecalibrator + PrintReads) took about 5 hours for an exome sample, while local realignment took 1 hour and 15 minutes and variant discovery a little more than 1 hour. I am using multi-threading whenever possible.

I've read that there were some issues with BaseRecalibrator multi-threading and performance in previous releases, but when running BaseRecalibrator in a single thread performance does not seem to increase. Is there any alternative to increase this performance? Maybe disabling some functionality or preparing the data somehow?

Thanks,

Pablo.

Hi Pablo,

Have you tried using scatter-gather (e.g. using Queue)? There was a recent forum discussion on the topic, you may find it interesting to look up.

Geraldine Van der Auwera, PhD

• Posts: 11Member

Hello, Having trouble with the PrintReads aspect of the GATK 2+ recalibration scheme. I am using v2.3-9-ge5ebf34 with reducereads on illumina generated exomes. The BaseRecalibrator & PrintReads have been complaining about the cycle covariate length being too long

##### ERROR MESSAGE: The maximum allowed value for the cycle is 500, but a larger cycle was detected in read C47. Please use the --maximum_cycle_value argument to increase this value (at the expense of requiring more memory to run)

So I implemented a filtering step to clip the cycles > 500, as below. BaseRecalibrator now runs without errors; however, PrintReads still complains about the maximum cycle length. Additionally, it suggests I use the --maximum_cycle_value argument, which neither works on the command line nor appears in the documentation for PrintReads.

Is this an error or omission in PrintReads? I may try HardClipping the bam file though this is not supported.

thoughts?

java -Xmx8g -jar /home/jpriest/priest_apps/GATKv2.3.9/GenomeAnalysisTK.jar \
-T ClipReads \
-I ./CHD001_novomask_sorted_realigned.rr.bam \
-o ./CHD001_novomask.clip.sort.realign.rr.bam \
-R /home/jpriest/reference_genomes/hg19/ucsc.hg19.fasta \
-CT 500-10000 \
-CR SOFTCLIP_BASES \
-os testfilter.txt

java -Xmx8g -jar /home/jpriest/priest_apps/GATKv2.3.9/GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-I ./CHD001_novomask.clip.sort.realign.rr.bam \
-R /home/jpriest/reference_genomes/hg19/ucsc.hg19.fasta \
-knownSites /home/jpriest/reference_genomes/hg19/dbsnp_135.hg19.vcf \
-knownSites /home/jpriest/reference_genomes/hg19/Mills_and_1000G_gold_standard.indels.hg19.vcf \
-o CHD001.recal_data.grp \
-nct 4 \
-filterMBQ

java -jar /home/jpriest/priest_apps/GATKv2.3.9/GenomeAnalysisTK.jar \
-T PrintReads \
-R /home/jpriest/reference_genomes/hg19/ucsc.hg19.fasta \
-I ./CHD001_novomask.clip.sort.realign.rr.bam \
-BQSR CHD001.recal_data.grp \
-o ./CHD001_novomask.final.rr.bam

• Posts: 11Member

update: using the -CR HARDCLIP_BASES flag with ClipReads seems to solve the problem, though this is a little unnerving as it is not directly supported

• Posts: 683GATK Developer mod

Hi @jamesrpriest, I just re-read your original post and it dawned on me what you are doing: you are trying to run BQSR on a reduced BAM file. This is very, very bad - and does not at all follow our best practices recommendations for a good workflow. BQSR absolutely must be done before BAM reduction (which should be the very last step before variant calling) because reduction squashes all of the base qualities into a condensed form.

That being said, it is true that reads that are too long will fail in the PrintReads step so I will add a patch for that case (to be available in the upcoming 2.4 release).

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 11Member

Can your input be multiple BAM files?

@hmk123, are you asking if BQSR accepts multiple bam files as input? If so the answer is yes.

Geraldine Van der Auwera, PhD

• Posts: 11Member

Great thank you for the update

• Posts: 12Member

(# mismatches + 1) / (# bases + 2) --> why +1 & +2?

• Posts: 122GATK Developer mod

@pmint said: (# mismatches + 1) / (# bases + 2) --> why +1 & +2?

We use this correction factor to smooth out low occupancy bins in the calculations. It is saying that each bin starts with one error observation and one true observation.

• Posts: 12Member

Oh! I understand. Thank you!!

• Posts: 1

I was wondering where I can find the QualityScoreCovariate.java source code. I was browsing through the source code on github but difficult to find as there are so many folders and classes, so would you be able to point me to the full path? Thanks!

It is located in package org.broadinstitute.sting.utils.recalibration.covariates

Geraldine Van der Auwera, PhD

• Posts: 6Member

Is Base Quality Score Recalibration (BQSR) supported for Ion Torrent bams? It is stated that it is supported for: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc. and I was not sure if "etc." included Ion Torrent. The error I receive is "##### ERROR MESSAGE: The platform (PL) associated with read group org.broadinstitute.sting.utils.sam.GATKSAMReadGroupRecord@91b is not a recognized platform. Implemented options are e.g. illumina, 454, and solid" so obviously the algorithm is checking for a keyword for PL.

Thanks.

Hi @Mutagenic, sorry about the vague error message. Ion data should work; we'll check and update the list of compatible platforms.

Geraldine Van der Auwera, PhD

• Posts: 2Member

How much memory does GATK PrintReads need?

@murphy, it can depend on what you're doing with it, but you should start with 2 Gb and increase it if the program complains that it's not enough.

Geraldine Van der Auwera, PhD

• Posts: 21Member

I want to set the quality scores in a SAM file to 20, as mentioned here -> http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3443046/. What program in GATK should I use to do this? I could not find it in the paper. Please get back as soon as you can. Thank you so much.

• Posts: 21Member

Thank you so very much.

• Posts: 21Member

The link you just gave me reassigns read mapping qualities. I want to re-assign the base qualities (I believe BQSR recalibrates base qualities, not read mapping qualities). Could you please let me know of a tool to reassign base quality?

Oh, I didn't realize you wanted to change base qualities, sorry. I don't know of any tool that does that. Maybe in Samtools...

Geraldine Van der Auwera, PhD

• Posts: 21Member

But, the paper I asked you about was published by Broad Institute, it says - "Quality scores for the bases were generated by using the GATK’s base quality score recalibration pipeline, starting with a default original base quality of Q20 for every base in the read. " This means that base qual scores in the BAM files were changed to a default value of 20 for every base. I just wanted to know of the tool you used to do that. Mauricio O Carneiro is the first author in the paper. - http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3443046/

Hmm, I'm not familiar with the methodology used in that paper. I'll ask @Carneiro to clarify.

Geraldine Van der Auwera, PhD

• Posts: 21Member

great thank you so much. i'd appreciate that.

That was before PacBio had quality scores in their BAM files. Now that they do, you can safely use their base qualities as priors for BQSR, as you do with Illumina reads.

• Posts: 21Member

Dear Carneiro, thank you so much for replying. OK, just to confirm, you are saying that if quality scores are absent from PacBio's BAM files, we need to set them to 20; there was a time when PacBio's BAM files did not have quality scores, and that is when you set them to a default value of 20, and I do not need to do this anymore? I just need this confirmation as I have to answer this to my leads :) Also, while waiting for your reply, I used the following command to change the qual scores to 20. Do you think I should get rid of the options ddq/idq/mdq?

java -jar new_gatk/GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-I test/A2.bam \
-R test/reference.fasta \
-knownSites test/knownSites.vcf \
-ddq 20 -idq 20 -mdq 20 \
--maximum_cycle_value 10000 \
-o test/A2_recal_data.grp

Hi Ashu,

That's right, what Mauricio is saying is that he had to artificially add quality scores to the PacBio bams because at the time PacBio didn't output quality scores. So if you have to work with older bam files, you need to add quality scores, but if you're working with more recent files that already have quality scores, you don't need to modify them.

Geraldine Van der Auwera, PhD

• Posts: 21Member

Awesome, thanks a lot. :) I don't get any email updates when you reply to my comments. I don't know if that is an issue or if it is supposed to work that way. Just letting you guys know, I have to search this page explicitly to read these comments regularly.

Well, it's not really a technical issue -- there is an option you can turn on to receive notifications, but the forum software doesn't let us turn it on by default, so people have to turn it on themselves. I forget exactly where (and am on a smartphone right now so I'm limited) but the setting is in your user profile somewhere. If you can't find it, remind me and I will look it up for you.

Geraldine Van der Auwera, PhD

• Posts: 21Member

found it... thanks again :)

• Posts: 1Member

Dear all, did you find that the values in the "Errors" column contain a decimal fraction in the recalibration report generated by the new version of GATK (2.4-7), as shown in the picture? But why? I think it should be an integer. Looking forward to your reply, many thanks!

Hi there,

That value is a count on the number of errors that are in that bin, but it is now a probabilistic count based on the likelihood of actually being at that location so it is a decimal value instead of an integer.

Geraldine Van der Auwera, PhD

• Posts: 36Member

Hello there, I have been studying a non-human organism (chicken). Recently, a problem has been confusing me for almost half a month: when I use BaseRecalibrator to generate a grp file, it always complains there is something wrong with my input file. I downloaded the reference genome from UCSC and the SNP files from NCBI; since all SNP files are provided per chromosome, I merged them (with the Linux split command) into a dbSNP.vcf file, which I used as the input file. However, GATK complained "The provided VCF file is malformed at approximately line number 626572: The VCF specification does not allow for whitespace in the INFO field." When I used a perl script to solve this problem and reran, it gave another complaint, "Input vcf and reference have incompatible contigs: Relative ordering of overlapping contigs differs, which is unsafe" or "The provided VCF file is malformed at approximately line number 15: 4554 is not a valid start position in the VCF format" and so on. Now my work has completely stopped and I cannot move on. Has anyone encountered similar problems? Any help is appreciated!

Hi @Jack,

I would recommend that you start again from scratch and try merging the original VCF files again, but this time instead of using linux commands, you should use a tool specialized for merging this type of file, such as the GATK's CombineVariants tool, or VCFtools.

Geraldine Van der Auwera, PhD

• Posts: 36Member

Thanks for your quick response, Geraldine. I have now fixed it by combining those vcf files using GATK CombineVariants, and I can move on!

• Posts: 16Member

We are now trying Queue.jar, for which we wrote a script to do scatter-gather with BQSR. However, whenever we run this script, we end up not getting the pdf file and get the error: org.broadinstitute.sting.queue.QException: Unable to find gather inputs. We have checked the directory stated by the command and couldn't find the intermediate pdf file. However, in the scala script, we did include the plot_pdf argument:

trait GATKCommon extends CommandLineGATK {
  this.memoryLimit = pipeline.memory
  this.intervals = pipeline.intervals
  this.excludeIntervals = pipeline.excludeIntervals
  this.reference_sequence = pipeline.referenceFasta
}

def script() {
  val baseRecalibrate = new BaseRecalibrator with GATKCommon
  baseRecalibrate.input_file :+= inputFile
  baseRecalibrate.knownSites = realignTarget.known
  baseRecalibrate.out = swapExt(indelRealigner.out, "bam", "bqsr.grp")
  baseRecalibrate.plot_pdf_file = swapExt(indelRealigner.out, "bam", "bqsr.pdf")
}

Is it that if we do scatter gather with the BQSR we will not get the pdf? If so, is there anyway we can quickly generate the pdf file? If not, is there any problem with our command?

edited May 2013

There's nothing obviously wrong with your code, and plots should run with scatter/gather (as far as I know -- this may have changed). But I have a few questions first:

1. What version of the GATK are you using?
2. What version of R are you using?
3. Can you show us the output of any scattered BQSR job here to see if the plot was generated?

Unfortunately there is no quick way to generate the plots other than through the program.

Post edited by Carneiro on
• Posts: 16Member

First, I am using the latest GATK and R 2.15. When I re-ran the script with the additional line

this.S_=(ValidationStringency.LENIENT)

in the trait, I was able to pull out more information. The runtime error was as follows:

org.broadinstitute.sting.queue.QException: Unable to find gather inputs: /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_01_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_02_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_03_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_04_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_05_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_06_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_07_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_08_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_09_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf
at org.broadinstitute.sting.queue.function.scattergather.GatherFunction$class.waitForGatherParts(GatherFunction.scala:70) at org.broadinstitute.sting.queue.function.scattergather.SimpleTextGatherFunction.waitForGatherParts(SimpleTextGatherFunction.scala:36) at org.broadinstitute.sting.queue.function.scattergather.SimpleTextGatherFunction.run(SimpleTextGatherFunction.scala:38) at org.broadinstitute.sting.queue.engine.InProcessRunner.start(InProcessRunner.scala:53) at org.broadinstitute.sting.queue.engine.FunctionEdge.start(FunctionEdge.scala:84) at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:434) at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:156) at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:171) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)

Typing ls */*.pdf within the .queue/scatterGather/Pipeline-5-sg folder, I found only one of the pdf files (in temp_10_of_10).

The .out file within those 9 folders were relatively long but I noticed the following error:

WARN 11:35:23,960 RScriptExecutor - RScript exited with 1. Run with -l DEBUG for more info.

Because of that, I also enabled DEBUG and looked into the RScript errors:

gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.

gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.

Attaching package: 'gdata'

The following object(s) are masked from 'package:stats':

nobs

The following object(s) are masked from 'package:utils':

object.size

Attaching package: 'gplots'

The following object(s) are masked from 'package:stats':

lowess

Attaching package: 'reshape'

The following object(s) are masked from 'package:plyr':

rename, round_any

Error in colnames<-(*tmp*, value = c("ReadGroup", "EventType", "EmpiricalQuality",  :
'names' attribute [6] must be the same length as the vector [1]
Calls: source ... finishTable -> .gsa.assignGATKTableToEnvironment -> colnames<-
Execution halted

It seems like there is some error with the RScript? I am not sure why such an error occurs, considering that I am using Queue directly without manipulating the data...

edited May 2013

The error is before the Rscript.

Unable to find gather inputs

Something happened in your Queue run such that your scattered jobs did not produce the right outputs, or your filesystem may have failed there. It looks like a network failure prevented the scattered files from being reached by the gather function. Without all the files, the Rscript will fail.

Can you check what happened in the .queue scatter directories and see what output files are there? Maybe you can figure out what happened on your side. If not, just rerun with -startFromScratch and hopefully this won't happen again.

Post edited by Carneiro on
• Posts: 16Member

Thank you Carneiro, I will try and see if that is the case. I am not sure why only one folder contains the pdf file, though. Maybe -startFromScratch will help.

• Posts: 16Member

I have just finished the -startFromScratch run, yet I still got the following error:

SimpleTextGatherFunction: List(/nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_01_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_02_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_03_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_04_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_05_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_06_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_07_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_08_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_09_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_10_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf) > List(/nfs/home/sam/Test_Zone/ScalaTest/105NT.chrY.dedup.indelRe.bqsr.pdf)
org.broadinstitute.sting.queue.QException: Unable to find gather inputs: /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_01_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_02_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_03_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_04_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_05_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_06_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_07_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_08_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf, /nfs/home/sam/Test_Zone/ScalaTest/.queue/scatterGather/Pipeline-5-sg/temp_09_of_10/105NT.chrY.dedup.indelRe.bqsr.pdf
at org.broadinstitute.sting.queue.function.scattergather.GatherFunction$class.waitForGatherParts(GatherFunction.scala:70) at org.broadinstitute.sting.queue.function.scattergather.SimpleTextGatherFunction.waitForGatherParts(SimpleTextGatherFunction.scala:36) at org.broadinstitute.sting.queue.function.scattergather.SimpleTextGatherFunction.run(SimpleTextGatherFunction.scala:38) at org.broadinstitute.sting.queue.engine.InProcessRunner.start(InProcessRunner.scala:53) at org.broadinstitute.sting.queue.engine.FunctionEdge.start(FunctionEdge.scala:84) at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:434) at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:156) at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:171) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)

Again, I only got a pdf in temp_10_of_10; however, I have checked that PDF and it is empty. To make sure it is a real problem, I re-ran the programme 3 times and still got the same error message. So I am not sure if that is the case.

This looks like a bug, let me take a look.

• Posts: 16Member

Thank you Carneiro

It is a bug; we are fixing it internally and the fix will be in the nightly build in the next few days. In the meantime, I recommend you don't use the -plots option when scatter/gathering.

• Posts: 16Member

Dear Carneiro,

Thank you very much, it is nice to know that it is being solved. Thank you for your help!

Sam

• Posts: 5Member

Does GATK support Ion Torrent data for the base quality recalibration process? (I know that it was asked here before, but I wonder if anything has changed since then.) I get this error message: ERROR MESSAGE: The platform (ION_TORRENT) associated with read group org.broadinstitute.sting.utils.sam.GATKSAMReadGroupRecord@241b09 is not a recognized platform. Implemented options are e.g. illumina, 454, and solid

Thanks :-)

Hey @galmoran, can you tell me what version of GATK you are using? That looks like an older version of the platform lineup. We added Ion Torrent to the allowable platforms a little while ago IIRC.

Geraldine Van der Auwera, PhD

• Posts: 3Member

Hello, I was wondering if BQSR uses depth as an indication of an erroneous base. Does it look at every single read, or only at a consensus sequence with degenerate bases? Thanks

Hi Yannick,

BQSR looks at all reads, but it does so independently for each read, using its CIGAR information (i.e. it doesn't look at pileups). So depth is not among the covariates used for recalibration.

Geraldine Van der Auwera, PhD

• Posts: 36Member

Hi there. I'm a little confused about BQSR: does this step rely on a large dbSNP dataset? My dbSNP is not very large; will the results still be better if I recalibrate with my small dbSNP set than if I skip recalibrating the realn.bam file altogether?

Yes, it is better to do BQSR with a small database rather than not do it at all.

Geraldine Van der Auwera, PhD

• Posts: 12Member

Hi there. I'm wondering if you can help with an issue I've been stumbling over with BQSR. I'm working with a non-model species; we have done exon capture and sequencing on about 600 individuals. As part of our SNP-calling pipeline, we are planning to use BQSR, but we need to build a new dbSNP. To do this, we used BWA-MEM to align, Picard to mark duplicates, and samtools mpileup to call SNPs. The mpileup output vcf files are very large, and I was wondering how I can reformat the vcf file to omit the genotypes and just keep the first 9 columns that are necessary to use a vcf file as dbSNP in BQSR? I've done some testing but I keep getting error messages like this one:

The provided VCF file is malformed at approximately line number 1036: there are 1 genotypes while the header requires that 9 genotypes be present for all records at Scaffold_1:6720

I feel I must be doing something obviously wrong, but I can't seem to find any documentation telling me how to fix it. Thanks for all your hard work on this useful tool!

Hi @yeaman,

Simplest way to do that is to feed your variants to CombineVariants with the -minimalVCF argument. It will produce a sites-only file without any genotypes.

Geraldine Van der Auwera, PhD

edited September 2013

Oh wait, just realized -minimalVCF is the wrong argument; it will strip out some info but not the genotypes. What you want is --sites_only. It's not currently documented anywhere, but you can use it with any GATK tool that writes a VCF.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

• Posts: 12Member

Thanks for the quick reply! I tried running CombineVariants with -sitesOnly but got an error message saying:

"Argument with name 'sitesOnly' isn't defined"

I tried upgrading to the newest version but still got the same message. I also tried -minimalVCF but as you mentioned, that won't do it. I forgot to mention above that I had previously just used AWK to strip the genotypes off, leaving the first 9 columns. Any thoughts?

• Posts: 12Member

Just for reference, this is the command I was using:

java -Xmx32G -XX:MaxPermSize=16g -XX:PermSize=16g -jar /data/programs/GenomeAnalysisTK-2.7-2/GenomeAnalysisTK.jar -T CombineVariants -R /data/seqcap/spruce/pseudo.sequences.all3_reduced_genome.fa --sitesOnly --variant var.flt.vcf.1 --variant var.flt.vcf.2 -o test_combine_variants.vcf --assumeIdenticalSamples -genotypeMergeOptions UNSORTED

Yeah, sorry, that was my fault; I misremembered the name (that's the problem with undocumented arguments). The actual argument name is --sites_only. This is useful if you want to create a set of knowns with one of our callers and have GATK output a sites-only file directly, rather than stripping it in a second step.

If you awk the samples info out, remember to edit the header accordingly, otherwise the parser will expect additional fields and flip out (which might be what you're seeing already).

Geraldine Van der Auwera, PhD

• Posts: 12Member

Thanks! That worked like a charm and seems to be running fine with the BQSR tools. I fiddled around with awk and the headers for a bit but had trouble getting a format that would work (an assortment of other errors). But that's unimportant now that CombineVariants is working.

• Posts: 12Member

For future reference, I found a quicker way to use awk to chop off the genotypes and create a vcf-formatted file that BQSR will accept. I printed out the first 8 columns (#CHROM POS ID REF ALT QUAL FILTER INFO), excluding the FORMAT column (rather than the 6 or 9 I had tried previously), and removed the '##FORMAT' lines from the header. Using grep and cat, I then combined these together and added a header back on. If you use GNU parallel with multiple threads for the awk chopping, this can be quite fast. It doesn't seem to matter that the order of the contents of the INFO column differs between the mpileup output and the format output by CombineVariants.
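That recipe can be sketched as a small shell pipeline. The input here is a tiny hypothetical mpileup-style VCF, created inline so the example is self-contained; substitute your own file names.

```shell
# A tiny hypothetical mpileup-style VCF with one FORMAT column and one sample.
printf '##fileformat=VCFv4.1\n' > var.flt.vcf
printf '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n' >> var.flt.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n' >> var.flt.vcf
printf 'Scaffold_1\t6720\t.\tA\tG\t50\tPASS\tDP=10\tGT\t0/1\n' >> var.flt.vcf

# Drop the ##FORMAT header lines, rewrite the #CHROM line without the
# FORMAT/sample columns, and keep only the first 8 columns of each record.
{
  grep '^##' var.flt.vcf | grep -v '^##FORMAT'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  grep -v '^#' var.flt.vcf | awk 'BEGIN{FS=OFS="\t"} {print $1,$2,$3,$4,$5,$6,$7,$8}'
} > sites_only.vcf
```

Rewriting the #CHROM line (rather than just cutting columns) is what keeps the header consistent with the 8-column records, which is the part the parser complains about otherwise.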

Good to know, thanks for reporting your solution, @yeaman.

Geraldine Van der Auwera, PhD

• Posts: 10Member
edited October 2013

Hello, I am running BQSR and I'm getting the following error message:

##### ERROR MESSAGE: SAM/BAM file SAMFileReader{/Volumes/Passport/IA_084/IA084C.realigned.fixed.dedup.bam} is malformed: BAM file has a read with mismatching number of bases and base qualities. Offender: HWI-ST1122:264:C2LCWACXX:1:2113:17084:65824 [101 bases] [0 quals]

So I found the place in the bam using grep and it looks like this:

HWI-ST1122:264:C2LCWACXX:1:2113:17084:65824 99  1   13405   22  101M    =   13495   186 CCTCCACCACCCCGAGATCACATTTCTCACTGCCTTTTGTCTGCCCAGTTTCACCAGAAGTAGGCCTCTTCCTGACAGGCAGCTGCACCACTGCCTGCCGC   *   PG:Z:MarkDuplicates RG:Z:IA084C NM:i:1  MQ:i:22 AS:i:97 XS:i:97
HWI-ST1122:264:C2LCWACXX:1:2113:17084:65824 147 1   13495   22  5S96M   =   13405   -186    CTCCTCTGCCTGGCGATGTGCCCGTCCTTTGCTCTGACCGCTGGAGACAGTGTTTGTCATTGGCATGGTCTGCAGGGATCCTGCTACAAAGGTGAAACCCA   "

"

"!

!
PG:Z:MarkDuplicates RG:Z:IA084C NM:i:6  MQ:i:22 AS:i:66 XS:i:66

I am guessing that the large space with the ", "! and ! is not supposed to be there. How can I remove this from the bam file? This is not the only exome I'm having trouble with at this step - we just got 8 exome sequences back from a new company and half of them are giving me strange errors like this. Thanks for your help.

Post edited by Geraldine_VdAuwera on
• Posts: 10Member

Also, I tried adding "-filterMBQ" to the command and got a different error:

##### ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: -5 at org.broadinstitute.sting.utils.baq.BAQ.calcEpsilon(BAQ.java:158) at org.broadinstitute.sting.utils.baq.BAQ.hmm_glocal(BAQ.java:246) at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:542) at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:595) at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:530) at org.broadinstitute.sting.utils.baq.BAQ.baqRead(BAQ.java:663) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateBAQArray(BaseRecalibrator.java:428) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:243) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$MapReduceJob.run(NanoScheduler.java:468) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:680)

• Posts: 6,412Administrator, GATK Developer admin

If you got the exome bams like that straight from the company, you should contact their customer support and let them know you have this problem. They should not be giving you malformed bams. If you did some kind of processing before getting this error, you should check if the originals were okay.
Geraldine Van der Auwera, PhD

• Posts: 4Member

For large datasets, it is often reasonable to split the base quality recalibration process across chromosomes. Given that I have one bam file and one grp file per chromosome, is it possible to merge the reads with the PrintReads command, e.g.:

java -jar GenomeAnalysisTK.jar \
  -T PrintReads \
  -I input_1.bam -I input_2.bam \
  -BQSR rep_1.grp -BQSR rep_1.grp \
  -o output.bam

• Posts: 10Member

Hi, Thanks for your help. I only got the raw fastq files from the company - could these errors be something in the raw data?

• Posts: 10Member

I have been using GATK 2.3. So I downloaded the latest version (2.7) and updated java. I'm still getting the error message: Exception in thread "main" java.lang.UnsupportedClassVersionError: org/broadinstitute/sting/gatk/CommandLineGATK : Unsupported major.minor version 51.0

• Posts: 6,412Administrator, GATK Developer admin

Hi @julian,

No, you have to run PrintReads individually per chromosome bam because there is no way to pass multiple grp files. But the good news is that in the next step, if you're doing ReduceReads, you can just pass in all the chromosomes as a list, and GATK will produce a single merged and reduced output bam.

Geraldine Van der Auwera, PhD

• Posts: 6,412Administrator, GATK Developer admin

@acj8, check the FASTQ records and see if the same reads are okay or not in there. Maybe it's the aligner that screwed them up -- what program (and what version) did you use?

Re: the java issue, if you're on a Mac, you have to install the java 7 JDK from Oracle, not just the regular JRE.

Geraldine Van der Auwera, PhD

• Posts: 10Member
edited October 2013

Hi again, I had not installed the correct Java, but doing so did not make a difference.
Here are the two fastq reads:

@HWI-ST1122:264:C2LCWACXX:1:2113:17084:65824 1:N:0:AAACAT
CCTCCACCACCCCGAGATCACATTTCTCACTGCCTTTTGTCTGCCCAGTTTCACCAGAAGTAGGCCTCTTCCTGACAGGCAGCTGCACCACTGCCTGCCGC
+
??@AAABADADFDE?@@GGE><?9EHBH<F<FD@9:?D<0?9DDD@DG8BCHCGG;@C)..=CCECCHCH@?B7?;;A?;=??##################
@HWI-ST1122:264:C2LCWACXX:1:2113:17084:65824 2:N:0:AAACAT
TGGGTTTCACCTTTGTAGCAGGATCCCTGCAGACCATGCCAATGACAAACACTGTCTCCAGCGGTCAGAGCAAAGGACGGGCACATCGCCAGGCAGAGGAG
+
:==+@+22?22<+++22,@A7++3A7=<=+1**1?A#################################################################

And here is the original bam:

HWI-ST1122:264:C2LCWACXX:1:2113:17084:65824 99 1 13405 22 101M = 13495 186 CCTCCACCACCCCGAGATCACATTTCTCACTGCCTTTTGTCTGCCCAGTTTCACCAGAAGTAGGCCTCTTCCTGACAGGCAGCTGCACCACTGCCTGCCGC ??@AAABADADFDE?@@GGE><?9EHBH<F<FD@9:?D<0?9DDD@DG8BCHCGG;@C)..=CCECCHCH@?B7?;;A?;=??################## NM:i:1 AS:i:97 XS:i:97 RG:Z:IA084C
HWI-ST1122:264:C2LCWACXX:1:2113:17084:65824 147 1 13495 22 5S96M = 13405 -186 CTCCTCTGCCTGGCGATGTGCCCGTCCTTTGCTCTGACCGCTGGAGACAGTGTTTGTCATTGGCATGGTCTGCAGGGATCCTGCTACAAAGGTGAAACCCA #################################################################A?1**1+=<=7A3++7A@,22+++<22?22+@+==: NM:i:6 AS:i:66 XS:i:66 RG:Z:IA084C

I am using bwa-mem to do the alignment and then making the bam in samtools. Does any of this look strange to you? I'm currently pulling the ID out of the bam files created at each step (e.g. indel realignment (GATK), mark duplicates (Picard)).

Allison

Post edited by Geraldine_VdAuwera on

• Posts: 10Member
edited October 2013

update: it is happening during the indel realignment - this is the entry from the realigned bam:

MDF869:samtools-0.1.18 Allison ./samtools view /Volumes/Passport/IA_084/IA084C.realigned.bam |grep HWI-ST1122:264:C2LCWACXX:1:2113:17084:65824
HWI-ST1122:264:C2LCWACXX:1:2113:17084:65824 99  1   13405   22  101M    =   13495   186 CCTCCACCACCCCGAGATCACATTTCTCACTGCCTTTTGTCTGCCCAGTTTCACCAGAAGTAGGCCTCTTCCTGACAGGCAGCTGCACCACTGCCTGCCGC   *   RG:Z:IA084C NM:i:1  MQ:i:22 AS:i:97 XS:i:97
HWI-ST1122:264:C2LCWACXX:1:2113:17084:65824 147 1   13495   22  5S96M   =   13405   -186    CTCCTCTGCCTGGCGATGTGCCCGTCCTTTGCTCTGACCGCTGGAGACAGTGTTTGTCATTGGCATGGTCTGCAGGGATCCTGCTACAAAGGTGAAACCCA   "

"

"!

!
RG:Z:IA084C NM:i:6  MQ:i:22 AS:i:66 XS:i:66

Post edited by Geraldine_VdAuwera on

Hi @ajc8, that looks like the realigner is messing up the read. Could you make a couple of BAM snippets (before/after realignment) with the read (plus about 100 bp of padding before/after it), and upload it to our FTP so we can debug locally? Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

Geraldine Van der Auwera, PhD

• Posts: 10Member

Yes, I will do that now. I have been troubleshooting with one of the files since my last post. I normally fix mates and mark duplicates in Picard after realigning around indels, but I switched the order - first the Picard steps and then indel realignment. After doing so, I was able to use -filterMBQ (I was getting an error before), which removed 5% of the reads. Then, running BQSR still gives me an error, but not a malformed bam error; instead it says there is an array index out of bounds error.

Hmm. Well, we typically recommend doing MarkDuplicates then Indel Realignment, so you're good there. Note that Indel Realigner should fix mates for you automatically, so you don't need to run FixMate separately. For the BQSR error, what's your command line and exact error message?

Geraldine Van der Auwera, PhD

• Posts: 10Member

Hi, well, I ran everything through again and got the malformed bam error message again.

Here is the command line: java -Xmx8g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -nct 4 -I /Volumes/Thunderbolt/IA_084/IA084C.realigned.bam -R /Volumes/Thunderbolt/ref_genome/human_g1k_v37.fasta -knownSites /Volumes/Thunderbolt/2.3/b37/1000G_phase1.indels.b37.vcf -knownSites /Volumes/Thunderbolt/2.3/b37/Mills_and_1000G_gold_standard.indels.b37.vcf -knownSites /Volumes/Thunderbolt/2.3/b37/dbsnp_137.b37.vcf -o /Volumes/Thunderbolt/IA_084/IA084C.recal_data.grp

And the error message is:

SAM/BAM file SAMFileReader{/Volumes/Thunderbolt/IA_084/IA084C.realigned.bam} is malformed: BAM file has a read with mismatching number of bases and base qualities. Offender: HWI-ST1122:264:C2LCWACXX:1:2113:17084:65824 [101 bases] [0 quals]

If I add -filterMBQ to the command, I get this error message:

##### ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: -5

Oh well that's still the same problem -- that bad read is causing different issues depending on which tool you run, because they use different parts of the read info. You have to find out exactly what is causing this bad read to occur before you can continue with your work. That's why I asked for the before/after snippets.

Geraldine Van der Auwera, PhD

• Posts: 10Member

I made the snippet from the pre-aligned bam no problem. When I try to do it with the aligned bam, I get the same malformed read error message. Is it okay if I just send you the prealigned snippet? If not, what should I do?

Here is the command and error message:

java -Xmx8g -jar GenomeAnalysisTK.jar -T PrintReads -R /Volumes/Thunderbolt/ref_genome/human_g1k_v37.fasta -I /Volumes/Thunderbolt/IA_084/IA084C.realigned.bam -o /Volumes/Thunderbolt/IA_084/IA084C.aligned_snippet.bam -L 1:13300-13600

ERROR MESSAGE: SAM/BAM file SAMFileReader{/Volumes/Thunderbolt/IA_084/IA084C.realigned.bam} is malformed: BAM file has a read with mismatching number of bases and base qualities. Offender: HWI-ST1122:264:C2LCWACXX:1:2113:17084:65824 [101 bases] [0 quals]

Sure, that's fine. Let me know when you have uploaded it.

Geraldine Van der Auwera, PhD

• Posts: 10Member

I just uploaded it. The filename is ajc8_GATK_help. I also unintentionally uploaded it to a file on the site - Breakendbug. Can you delete it for me? Sorry about that. Thanks again for all your help.

• Posts: 33Member

Is there any difference between:

1. Running BQSR on all samples simultaneously, generating a single .grp file, then running PrintReads for each sample, generating multiple recalibrated bams.
2. Running BQSR on all samples simultaneously, generating a single .grp file, then running PrintReads for all samples simultaneously, generating a single recalibrated bam.

Thanks,

MC

Hi @chongm,

No, there is no functional difference in terms of how the data will be processed, since recalibration is done at the read group level. However, be aware that there is some computational cost to running all the samples through simultaneously. We typically process lane-level or sample-level bams individually for performance reasons. Also, keep in mind that if you want to compress your files using ReduceReads, you need to process them individually (you cannot do multisample compression).

Geraldine Van der Auwera, PhD

Hi @ajc8, sorry to get back to you so late -- your bug slipped off my radar somehow. So, I've looked at your files and ran through everything, but I can't reproduce your bug -- at least not with the current version of GATK, which is 2.7. I realize reading your logs that you were using 2.3, so it's very possible that the bug was fixed by subsequent updates. But of course if you run into this again with the latest GATK, please do let me know. Good luck with your work!

Geraldine Van der Auwera, PhD

• Posts: 7Member

My apologies for posting an error in this forum, but given the fact that "this (error) should never happen," I decided I needed to post it. If Mauricio, or any other member of the Broad/GATK team, can provide insight as to why one of my BAMs consistently gets hung up at the BQSR step of the GATK "Best Practices" workflow, I would be greatly appreciative. I am processing additional BAMs using the same pipeline (all of the others pass the BQSR step fine) and have previously used GATK for local realignment of this problematic BAM file. Is it possible this bug has been fixed in a later version of GATK? If I update to a more recent version of GATK, will I need to re-run all my samples using the same version? I look forward to hearing back from you.

Thanks!

##### ERROR ------------------------------------------------------------------------------------------

Hi @mpvivero,

I'm happy to inform you (on Mauricio's behalf) that this is a known bug that has been fixed in subsequent versions of GATK.

Our recommendation is to use the same version of GATK for all processing steps, because that is the safest way to avoid version-change incompatibilities. So ideally you should reprocess all your samples with the newer version when you upgrade, yes. Sorry for the inconvenience!

Geraldine Van der Auwera, PhD

• Posts: 7Member

• Posts: 12Member

Hello all,

Can someone explain exactly how covariation works, in simple terms? I looked at the presentations in the dropbox folder from the link at the beginning of the comments, but I still don't understand. How exactly are context, cycle, read quality, and read group used to recalibrate the qualities? What is the dependent variable? And how does one determine that the rate of mismatch is higher than expected? If we determine that a covariate (context, e.g.) has a high positive covariation with the dependent variable (which is quality?), then how does that help us recalibrate the quality? Any help is highly appreciated! Thanks!

• Nik.

Hi Nik,

Have a look at the video of the presentation on BQSR from our latest workshop here; hopefully it will help.

Geraldine Van der Auwera, PhD

• Posts: 14Member

Hello,

I am trying to analyse a rat genome (I am still a newbie). I have downloaded the Ensembl reference from iGenomes and, more recently, the .vcf file from Ensembl. The latest available Ensembl .vcf file is version 4.2, which it seems is not supported, so I tried the previous version (4.1) (I also got rid of some non-standard annotations used in the .vcf file), but I am getting the following error:

##### ERROR MESSAGE: Input files /Volumes/Data2/genomes/Rn5/Rattus_norvegicus/Ensembl/Rnor_5.0/Annotation/Rattus_norvegicus_mod2.vcf and reference have incompatible contigs: Relative ordering of overlapping contigs differs, which is unsafe.

##### ERROR reference contigs = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 1, 20, 2, 3, 4, 5, 6, 7, 8, 9, MT, X]

Any ideas or suggestions about why this might be happening or how to solve it? Thanks in advance

Hi @ghlopez,

As the error message states, the order of contigs is not the same in your data and in your reference. You need to re-order the data to match the reference. Have a look at this document for more details: http://www.broadinstitute.org/gatk/guide/article?id=1204
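For illustration, here is one rough way to re-sort a VCF body into the reference's contig order, keyed off the .fai index. File names and contents are hypothetical mini-files so the sketch is self-contained; in practice a dedicated tool (e.g. Picard's SortVcf with a sequence dictionary) is the safer route.

```shell
# Hypothetical mini reference index (.fai) in karyotypic order, and a VCF
# whose records are in a different (lexicographic) contig order.
printf 'chr1\t1000\t6\t60\t61\nchr2\t1000\t1030\t60\t61\nchr10\t1000\t2054\t60\t61\n' > ref.fa.fai
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > variants.vcf
printf 'chr1\t10\t.\tA\tG\t50\tPASS\t.\nchr10\t10\t.\tA\tG\t50\tPASS\t.\nchr2\t10\t.\tC\tT\t50\tPASS\t.\n' >> variants.vcf

# Rank each contig by its position in the .fai, prefix each record with
# that rank, sort by rank then position, and strip the rank again.
grep '^#' variants.vcf > sorted.vcf
awk 'NR==FNR {rank[$1]=NR; next} !/^#/ {print rank[$1] "\t" $0}' ref.fa.fai variants.vcf \
  | sort -k1,1n -k3,3n | cut -f2- >> sorted.vcf
```

After this, sorted.vcf lists records in the same contig order as the reference, which is what the "Relative ordering of overlapping contigs differs" check requires.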

Geraldine Van der Auwera, PhD

• Posts: 12Member

Hello,

I'm working on an exome capture project in a non-model species where we have multiplexed 12 individuals/lane of Illumina HiSeq. We have been running BQSR to give a recalibration table for each individual, but we thought perhaps it would be better to pool across all individuals in a lane before running BQSR, so that we'd have more data for the error profile. Do you have any recommendations about the best practices here?

Thanks! Sam

Hi @yeaman,

I think in production we separate multiplexed individuals from each lane as well, but in theory there is nothing that prevents you from doing recalibration on them together in order to have more data for the model. It depends how much data you're getting per individual per lane. May be worth comparing the recalibration results you get in both cases.

Geraldine Van der Auwera, PhD

• Posts: 12Member

Thanks for the help! If we get around to trying the pooling, I'll report back.

• UKPosts: 2Member

Hi Geraldine,

Could you explain how the cycle number is referenced within the BQSR GATK report?

Best wishes, Matthew

edited May 6

Hi @mlyon‌, if I recall correctly it should be referenced as a number in the CovariateValue column for lines where CovariateName is Cycle.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

• SpainPosts: 10Member

Hi,

I have realized that my recal_realigned.bam file (280 M) is much smaller than the previous realigned.bam file (6.5 G). I don't think that is normal, right? However, I am not getting any error message. Any hints on where the error might be?

Thanks