It looks like you're new here. If you want to get involved, click one of these buttons!
rpoplin
Posts: 92GSA Official Member mod
Detailed information about command line options for BaseRecalibrator can be found here.
The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.
New with the release of the full version of GATK 2.0 is the ability to recalibrate not only the well-known base quality scores but also base insertion and base deletion quality scores. These are per-base quantities which estimate the probability that the next base in the read was mis-incorporated or mis-deleted (due to slippage, for example). We've found that these new quality scores are very valuable in indel calling algorithms. In particular these new probabilities fit very naturally as the gap penalties in an HMM-based indel calling algorithms. We suspect there are many other fantastic uses for these data.
This process is accomplished by analyzing the covariation among several features of a base. For example:
These covariates are then subsequently applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.
For example, pre-calibration a file could contain only reported Q25 bases, which seems good. However, it may be that these bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machine, base mismatches with the reference occur at the end of the reads more frequently than at the beginning. Also, mismatches are strongly associated with sequencing context, in that the dinucleotide AC is often much lower quality than TG. The recalibration tool will not only correct the average Q inaccuracy (shifting from Q25 to Q20) but identify subsets of high-quality bases by separating the low-quality end of read bases AC bases from the high-quality TG bases at the start of the read. See below for examples of pre and post corrected values.
The system was designed for users to be able to easily add new covariates to the calculations. For users wishing to add their own covariate simply look at QualityScoreCovariate.java for an idea of how to implement the required interface. Each covariate is a Java class which implements the org.broadinstitute.sting.gatk.walkers.recalibration.Covariate interface. Specifically, the class needs to have a getValue method defined which looks at the read and associated sequence context and pulls out the desired information such as machine cycle.
Detailed information about command line options for BaseRecalibrator can be found here.
This GATK processing step walks over all of the reads in my_reads.bam and tabulates data about the following features of the bases:
For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to dbSNP. After running over all reads, BaseRecalibrator produces a file called my_reads.recal_data.grp, which contains the data needed to recalibrate reads. The format of this GATK report is described below.
To create a recalibrated BAM you can use GATK's PrintReads with the engine on-the-fly recalibration capability. Here is a typical command line to do so:
java -jar GenomeAnalysisTK.jar \ -T PrintReads \ -R reference.fasta \ -I input.bam \ -BQSR recalibration_report.grp \ -o output.bam
After computing covariates in the initial BAM File, we then walk through the BAM file again and rewrite the quality scores (in the QUAL field) using the data in the recalibration_report.grp file, into a new BAM file.
This step uses the recalibration table data in recalibration_report.grp produced by BaseRecalibration to recalibrate the quality scores in input.bam, and writing out a new BAM file output.bam with recalibrated QUAL field values.
Effectively the new quality score is:
Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as SNP calling. In additional, by accounting for quality changes by cycle and sequence context, we can identify truly high quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.
The recalibration report is a [GATKReport](http://gatk.vanillaforums.com/discussion/1244/what-is-a-gatkreport) and not only contains the main result of the analysis, but it is also used as an input to all subsequent analyses on the data. The recalibration report contains the following 5 tables:
This is the table that contains all the arguments used to run BQSRv2 for this dataset. This is important for the on-the-fly recalibration step to use the same parameters used in the recalibration step (context sizes, covariates, ...).
Example Arguments table:
#:GATKTable:true:1:17::; #:GATKTable:Arguments:Recalibration argument collection values used in this run Argument Value covariate null default_platform null deletions_context_size 6 force_platform null insertions_context_size 6 ...
The GATK offers native support to quantize base qualities. The GATK quantization procedure uses a statistical approach to determine the best binning system that minimizes the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSRv2, a table with the base counts for each base quality is generated and a 'default' quantization table is generated. This table is a required parameter for any other tool in the GATK if you want to quantize your quality scores.
The default behavior (currently) is to use no quantization when performing on-the-fly recalibration. You can override this by using the engine argument -qq. With -qq 0 you don't quantize qualities, or -qq N you recalculate the quantization bins using N bins on the fly. Note that quantization is completely experimental now and we do not recommend using it unless you are a super advanced user.
Example Arguments table:
#:GATKTable:true:2:94:::; #:GATKTable:Quantized:Quality quantization map QualityScore Count QuantizedScore 0 252 0 1 15972 1 2 553525 2 3 2190142 9 4 5369681 9 9 83645762 9 ...
This table contains the empirical quality scores for each read group, for mismatches insertions and deletions. This is not different from the table used in the old table recalibration walker.
#:GATKTable:false:6:18:%s:%s:%.4f:%.4f:%d:%d:; #:GATKTable:RecalTable0: ReadGroup EventType EmpiricalQuality EstimatedQReported Observations Errors SRR032768 D 40.7476 45.0000 2642683174 222475 SRR032766 D 40.9072 45.0000 2630282426 213441 SRR032764 D 40.5931 45.0000 2919572148 254687 SRR032769 D 40.7448 45.0000 2850110574 240094 SRR032767 D 40.6820 45.0000 2820040026 241020 SRR032765 D 40.9034 45.0000 2441035052 198258 SRR032766 M 23.2573 23.7733 2630282426 12424434 SRR032768 M 23.0281 23.5366 2642683174 13159514 SRR032769 M 23.2608 23.6920 2850110574 13451898 SRR032764 M 23.2302 23.6039 2919572148 13877177 SRR032765 M 23.0271 23.5527 2441035052 12158144 SRR032767 M 23.1195 23.5852 2820040026 13750197 SRR032766 I 41.7198 45.0000 2630282426 177017 SRR032768 I 41.5682 45.0000 2642683174 184172 SRR032769 I 41.5828 45.0000 2850110574 197959 SRR032764 I 41.2958 45.0000 2919572148 216637 SRR032765 I 41.5546 45.0000 2441035052 170651 SRR032767 I 41.5192 45.0000 2820040026 198762
This table contains the empirical quality scores for each read group and original quality score, for mismatches insertions and deletions. This is not different from the table used in the old table recalibration walker.
#:GATKTable:false:6:274:%s:%s:%s:%.4f:%d:%d:; #:GATKTable:RecalTable1: ReadGroup QualityScore EventType EmpiricalQuality Observations Errors SRR032767 49 M 33.7794 9549 3 SRR032769 49 M 36.9975 5008 0 SRR032764 49 M 39.2490 8411 0 SRR032766 18 M 17.7397 16330200 274803 SRR032768 18 M 17.7922 17707920 294405 SRR032764 45 I 41.2958 2919572148 216637 SRR032765 6 M 6.0600 3401801 842765 SRR032769 45 I 41.5828 2850110574 197959 SRR032764 6 M 6.0751 4220451 1041946 SRR032767 45 I 41.5192 2820040026 198762 SRR032769 6 M 6.3481 5045533 1169748 SRR032768 16 M 15.7681 12427549 329283 SRR032766 16 M 15.8173 11799056 309110 SRR032764 16 M 15.9033 13017244 334343 SRR032769 16 M 15.8042 13817386 363078 ...
This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry on this table stratified by read group and original quality score.
#:GATKTable:false:8:1003738:%s:%s:%s:%s:%s:%.4f:%d:%d:; #:GATKTable:RecalTable2: ReadGroup QualityScore CovariateValue CovariateName EventType EmpiricalQuality Observations Errors SRR032767 16 TACGGA Context M 14.2139 817 30 SRR032766 16 AACGGA Context M 14.9938 1420 44 SRR032765 16 TACGGA Context M 15.5145 711 19 SRR032768 16 AACGGA Context M 15.0133 1585 49 SRR032764 16 TACGGA Context M 14.5393 710 24 SRR032766 16 GACGGA Context M 17.9746 1379 21 SRR032768 45 CACCTC Context I 40.7907 575849 47 SRR032764 45 TACCTC Context I 43.8286 507088 20 SRR032769 45 TACGGC Context D 38.7536 37525 4 SRR032768 45 GACCTC Context I 46.0724 445275 10 SRR032766 45 CACCTC Context I 41.0696 575664 44 SRR032769 45 TACCTC Context I 43.4821 490491 21 SRR032766 45 CACGGC Context D 45.1471 65424 1 SRR032768 45 GACGGC Context D 45.3980 34657 0 SRR032767 45 TACGGC Context D 42.7663 37814 1 SRR032767 16 AACGGA Context M 15.9371 1647 41 SRR032764 16 GACGGA Context M 18.2642 1273 18 SRR032769 16 CACGGA Context M 13.0801 1442 70 SRR032765 16 GACGGA Context M 15.9934 1271 31 ...
The memory requirements of the recalibrator will vary based on the type of JVM running the application and the number of read groups in the input bam file.
If the application reports 'java.lang.OutOfMemoryError: Java heap space', increase the max heap size provided to the JVM by adding ' -Xmx????m' to the jvm_args variable in RecalQual.py, where '????' is the maximum available memory on the processing computer.
I've tried recalibrating my data using a downloaded file, such as NA12878 on 454, and apply the table to any of the chromosome BAM files always fails due to hitting my memory limit. I've tried giving it as much as 15GB but that still isn't enough.
All of our big merged files for 454 are running with -Xmx16000m arguments to the JVM -- it's enough to process all of the files. 32GB might make the 454 runs a lot faster though.
I have a recalibration file calculated over the entire genome (such as for the 1000 genomes trio) but I split my file into pieces (such as by chromosome). Can the recalibration tables safely be applied to the per chromosome BAM files?
Yes they can. The original tables needed to be calculated over the whole genome but they can be applied to each piece of the data set independently.
I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.
The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately without this information the data becomes almost completely unusable since the quality of the bases will be inferred to be much much lower than it actually is as a result of the reference-mismatching SNP sites.
However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works:
For users concerned about run time please note this small analysis below showing the approximate number of reads per read group that are required to achieve a given level of recalibration performance. The analysis was performed with 51 base pair Illumina reads on pilot data from the 1000 Genomes Project. Downsampling can be achieved by specifying a genome interval using the -L option. For users concerned only with recalibration accuracy please disregard this plot and continue to use all available data when generating the recalibration table.
Comments
Hi,
Could anyone provide a mathematical notes on what exactly BaseRecalibrator is doing?
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Have a look at our archive of presentations:
https://www.dropbox.com/sh/e31kvbg5v63s51t/6GdimgsKss
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi,
I've generated the plots during recalibration, but I'm only getting plots for the original data (not pairs of plots like above). How can I get the plots for the recalibrated scores in order to see what's going on with my data?
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •This has been addressed previously on this forum, please do a search (box in the top right corner) and you will find an explanation.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •About the BaseRecalibrator performance. I am using GATKLite-2.3.4 and in a "best-practices"-like pipeline the base recalibration is the most time consuming task.
The base recalibration (BaseRecalibrator + PrintReads) took about 5 hours for an exome sample, while local realignment took 1 hour end 15 minutes and variant discovery little more than 1 hour. I am using multi-threading whenever it is possible.
I've read that there were some issues with BaseRecalibrator multi-threading and performance in previous releases, but when running BaseRecalibrator in a single thread performance does not seem to increase. Is there any alternative to increase this performance? Maybe disabling some functionality or preparing the data somehow?
Thanks,
Pablo.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Pablo,
Have you tried using scatter-gather (e.g. using Queue)? There was a recent forum discussion on the topic, you may find it interesting to look up.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hello, Having trouble with the PrintReads aspect of the GATK 2+ recalibration scheme. I am using v2.3-9-ge5ebf34 with reducereads on illumina generated exomes. The BaseRecalibrator & PrintReads have been complaining about the cycle covariate length being too long
ERROR MESSAGE: The maximum allowed value for the cycle is 500, but a larger cycle was detected in read C47. Please use the --maximum_cycle_value argument to increase this value (at the expense of requiring more memory to run)
So I implemented a filtering step to clip the cycles > 500 as below. BaseRecalibrator now runs without errors, however PrintReads still complains maximum cycle length. Additionally it suggests I use the --maximum_cycle_value argument, which neither works in the command line nor appears in the documentation for PrintReads.
Is this an error or omission in PrintReads? I may try HardClipping the bam file though this is not supported.
thoughts?
java -Xmx8g -jar /home/jpriest/priest_apps/GATKv2.3.9/GenomeAnalysisTK.jar \ -T ClipReads \ -I ./CHD001_novomask_sorted_realigned.rr.bam \ -o ./CHD001_novomask.clip.sort.realign.rr.bam \ -R /home/jpriest/reference_genomes/hg19/ucsc.hg19.fasta \ -CT 500-10000 \ -CR SOFTCLIP_BASES \ -os testfilter.txt
java -Xmx8g -jar /home/jpriest/priest_apps/GATKv2.3.9/GenomeAnalysisTK.jar \ -T BaseRecalibrator \ -I ./CHD001_novomask.clip.sort.realign.rr.bam \ -R /home/jpriest/reference_genomes/hg19/ucsc.hg19.fasta \ -knownSites /home/jpriest/reference_genomes/hg19/dbsnp_135.hg19.vcf \ -knownSites /home/jpriest/reference_genomes/hg19/Mills_and_1000G_gold_standard.indels.hg19.vcf \ -o CHD001.recal_data.grp \ -nct 4 \ -filterMBQ
java -jar /home/jpriest/priest_apps/GATKv2.3.9/GenomeAnalysisTK.jar \ -T PrintReads \ -R /home/jpriest/reference_genomes/hg19/ucsc.hg19.fasta \ -I ./CHD001_novomask.clip.sort.realign.rr.bam \ -BQSR CHD001.recal_data.grp \ -o ./CHD001_novomask.final.rr.bam \
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •update: using the -CR HARDCLIP_BASES \ flag with ClipReads seems to solve the problem, though this is a little unnerving as it is not directly supported
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi @jamesrpriest, I just re-read your original post and it dawned on me what you are doing: you are trying to run BQSR on a reduced BAM file. This is very, very bad - and does not at all follow our best practices recommendations for a good workflow. BQSR absolutely must be done before BAM reduction (which should be the very last step before variant calling) because reduction squashes all of the base qualities into a condensed form.
That being said, it is true that reads that are too long will fail in the PrintReads step so I will add a patch for that case (to be available in the upcoming 2.4 release).
Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Can your input be multiple BAM files?
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •@hmk123, are you asking if BQSR accepts multiple bam files as input? If so the answer is yes.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Great thank you for the update
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •(# mismatches + 1) / (# bases + 2) --> why +1 & +2?
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •We use this correction factor to smooth out low occupancy bins in the calculations. It is saying that each bin starts with one error observation and one true observation.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Oh! I understand. Thank you!!
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •I was wondering where I can find the QualityScoreCovariate.java source code. I was browsing through the source code on github but difficult to find as there are so many folders and classes, so would you be able to point me to the full path? Thanks!
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •It is located in package org.broadinstitute.sting.utils.recalibration.covariates
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Is Base Quality Score Recalibration (BQSR) supported for Ion Torrent bams? It is stated that it is supported for: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc. and I was not sure if "etc." included Ion Torrent. The error I receive is "##### ERROR MESSAGE: The platform (PL) associated with read group org.broadinstitute.sting.utils.sam.GATKSAMReadGroupRecord@91b is not a recognized platform. Implemented options are e.g. illumina, 454, and solid" so obviously the algorithm is checking for a keyword for PL.
Thanks.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi @Mutagenic, sorry about the vague error message. Ion data should work; we'll check and update the list of compatible platforms.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •how many memorys need for GATK PrintReads ?
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •@murphy, it can depend on what you're doing with it, but you should start with 2 Gb and increase it if the program complains that it's not enough.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •I want to set the quality scores in a SAM file to 20 as mentioned here -> http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3443046/. What program in GATK should I use to do this, I could not find that in the paper. Please get back as soon as you could. Thank you so much.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Have a look at this read filter:
http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_filters_ReassignMappingQualityFilter.html
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Thank you so very much.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •The link you just gave me - reassigns Read Mapping Qualities. I want to re-assign the Base Qualities - (I beleive the BSQR recalibrates base qualities and not the read mapping qualities). Could you please let me know of a tool to reassign base quality?
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Oh, I didn't realize you wanted to change base qualities, sorry. I don't know of any tool that does that. Maybe in Samtools...
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •But, the paper I asked you about was published by Broad Institute, it says - "Quality scores for the bases were generated by using the GATK’s base quality score recalibration pipeline, starting with a default original base quality of Q20 for every base in the read. " This means that base qual scores in the BAM files were changed to a default value of 20 for every base. I just wanted to know of the tool you used to do that. Mauricio O Carneiro is the first author in the paper. - http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3443046/
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hmm, I'm not familiar with the methodology used in that paper. I'll ask @Carneiro to clarify.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •great thank you so much. i'd appreciate that.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •That was before PacBio had quality scores in their BAM files. Now that they do, you can safely use their Base Qualities as priors for BQSR as you do with illumina reads.
Mauricio Carneiro, PhD http://www.broadinstitute.org/~carneiro/
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Dear Carneiro, Thank you so much for replying. Ok, Just to confirm, you are saying that in case/If quality scores are absent in the pacbio's bam files, we need to update them to 20; but there was a time when pacbio's BAM files did not have quality scores?? and that is when you updated them to a default value of 20?? And I do not need to do this anymore. I just need this confirmation as I have to answer this to my leads :) Also, while waiting for your reply, I used the following command to change the qual scores to 20: Do you think I should get rid of the options ddq/idq/mdq?? java -jar new_gatk/GenomeAnalysisTK.jar -T BaseRecalibrator -I test/A2.bam -R test/reference.fasta -knownSites test/knownSites.vcf -ddq 20 -idq 20 -mdq 20 --maximum_cycle_value 10000 -o test/A2_recal_data.grp
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Ashu,
That's right, what Mauricio is saying is that he had to artificially add quality scores to the PacBio bams because at the time PacBio didn't output quality scores. So if you have to work with older bam files, you need to add quality scores, but if you're working with more recent files that already have quality scores, you don't need to modify them.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •awesome thanks a lot. :) I dont get any email updates when you reply to my comments. I dont know if that is an issue or is it supposed to work that way? Just letting you guys know, I have to search this page explicitly to read these comments regularly.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Well, it's not really a technical issue -- there is an option you can turn on to receive notifications, but the forum software doesn't let us turn it on by default, so people have to turn it on themselves. I forget exactly where (and am on a smartphone right now so I'm limited) but the setting is in your user profile somewhere. If you can't find it, remind me and I will look it up for you.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •found it... thanks again :)
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Dear all, did you find that the value in "Errors" column contained the decimal fraction, in the recalibration report generated by the the new version GATK (2.4-7), as shown in the picture . But, why? I think it should be a integer? Looking forward to your reply, many thanks!
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi there,
That value is a count on the number of errors that are in that bin, but it is now a probabilistic count based on the likelihood of actually being at that location so it is a decimal value instead of an integer.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hello there, I have been studying on a non-human organism (chicken). Recently, I have a problems confusing me for almost half a month: when I use the BaseRecalibrator to generate a grp file, it always complains there is something wrong with my input file. I downloaded the reference genome from UCSC and the snp files from NCBI, since all snp files are provided according to chromosomes, I merge them (linux split command) into a dbSNP.vcf file which I used as the input file. However, the GATK complained "The provided VCF file is malformed at approximately line number 626572: The VCF specification does not allow for whitespace in the INFO field." well, when I used a perl script to solve this problem and rerun, it again gives another complaints "Input vcf and reference have incompatible contigs: Relative ordering of overlapping contigs differs, which is unsafe" or "The provided VCF file is malformed at approximately line number 15: 4554 is not a valid start position in the VCF format" and so on. Now my work completely stop. I can not move on. Does anyone have encountered similar problems? any help is appreciated!
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi @Jack,
I would recommend that you start again from scratch and try merging the original VCF files again, but this time instead of using linux commands, you should use a tool specialized for merging this type of file, such as the GATK's CombineVariants tool, or VCFtools.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Thanks for your quick response,Geraldine. Now I have fixed it by combing those vcf files using GATK CombineVariants, I can move on now!
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •We are now trying with the Queue.jar where we wrote a script to do scatter gather with the BQSR. However, whenever we run this script, we always end up not getting the pdf file and got the error: org.broadinstitute.sting.queue.QException: Unable to find gather inputs We have checked the director stated by the command and we couldn't find the intermediate file with the pdf. However, in the scala script, we did included the plot_pdf argument:
Is it that if we do scatter gather with the BQSR we will not get the pdf? If so, is there anyway we can quickly generate the pdf file? If not, is there any problem with our command?
Thank you for your help
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •There's nothing obviously wrong about your code, and plots should run with scatter/gather (as far as I know -- this may have changed). But I have a few questions first:
Unfortunately there is no quick way to generate the plots other than through the program.
Mauricio Carneiro, PhD http://www.broadinstitute.org/~carneiro/
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •First, I use the latest GATK and R 2.15. When I tried to re-run the script with additional
this.S_=(ValidationStringency.LENIENT)in the trait and were able to pull out more information. The run time error code was as follow:
Typing ls /.pdf within the ./queue/scatterGather/Pipeline-5-sg folder I found only one of the pdf file (temp_10_of_10)
The .out file within those 9 folders were relatively long but I noticed the following error:
WARN 11:35:23,960 RScriptExecutor - RScript exited with 1. Run with -l DEBUG for more info.As a result of that, I also enable DEBUG and look into the RScript errors:
It seems like there is some error with the RScript? I am not sure why such error occurs considering that I am using the Queue directly without manipulating the data....
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •the error is before the Rscript.
something happened in your Queue run that your scattered jobs did not produce the right outputs or your filesystem may have failed there? It seems like a network failure that prevented the scattered files to be reached by the gather function. Without all the files, the Rscript will fail.
Can you check what happened in the .queue scatter directories, see what output files are there? Maybe you can figure out what happened on your side. If not, just rerun with -startFromScratch and hopefully this won't happen again.
Mauricio Carneiro, PhD http://www.broadinstitute.org/~carneiro/
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Thank you Carneiro, I will try and see if that is the case. I am not sure why only one folder contain the pdf file though. Maybe startFromScratch will help.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •I have just finished the startFromScratch yet I still got the following error:
Again, I only got a pdf on temp_10_of_10, however, I have checked that PDF and it is empty. To make sure it is a real problem, I have re-run the programme 3 times and I still got the same error message. So I am not sure if that is the
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •this looks like a bug, let me take a look.
Mauricio Carneiro, PhD http://www.broadinstitute.org/~carneiro/
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •