Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

Realignment with high base qualities

dklevebringdklevebring Posts: 53Member
edited March 2013 in Ask the team

Hi,

I am currently working with a project where we have sequenced a library of approximately 70 bps insert sizes using 2x100 paired-end seq. While this can seem unnecessary, it can improve base qualities a lot.

I have used SeqPrep (https://github.com/jstjohn/SeqPrep) which strips adaptors and merges reads that overlap, in our case the entire read most of the times. This also boosts the base qualities, if a base was sequenced twice, the quality improves quite a bit. This way, base qualities can stretch up to 70 and over (probability of error 0.0001 x 0.0001 if both reads had Q40 at that base, it merged qual = 80). No funny business there. :)

However, this does not seem to play nicely with GATK. The realignment crashes (see below) saying the the base quals must be erroneous. In my case, but they are correct. Can I force GATK to work with these BQs? (--validation_strictness LENIENT didn't help as you can see below :)

cheers Daniel Klevebring

   INFO  13:04:07,408 HelpFormatter - -------------------------------------------------------------------------------- 
   INFO  13:04:07,411 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.4-7-g5e89f01, Compiled 2013/03/06 01:01:28 
   INFO  13:04:07,411 HelpFormatter - Copyright (c) 2010 The Broad Institute 
   INFO  13:04:07,411 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
   INFO  13:04:07,416 HelpFormatter - Program Args: -T RealignerTargetCreator -I /scratch/3041404/P394_102.prmdup.bam -R /bubo/proj/b2010040/private/GoldenPath/hg19/GATK_resource_bundle/human_g1k_v37_clean.fasta -o /scratch/3041404/P394_102.realn.intervals --intervals /bubo/proj/b2010040/private/GoldenPath/NG_design/1000G_REF_picard_custom_design_target_regions_HG19.bed.interval_list --validation_strictness LENIENT 
   INFO  13:04:07,416 HelpFormatter - Date/Time: 2013/03/13 13:04:07 
   INFO  13:04:07,416 HelpFormatter - -------------------------------------------------------------------------------- 
   INFO  13:04:07,416 HelpFormatter - -------------------------------------------------------------------------------- 
   INFO  13:04:08,461 GenomeAnalysisEngine - Strictness is LENIENT 
   INFO  13:04:08,632 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
   INFO  13:04:08,640 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
   INFO  13:04:08,655 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
   INFO  13:04:09,782 IntervalUtils - Processing 39772003 bp from intervals 
   INFO  13:04:10,001 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files 
   INFO  13:04:10,262 GenomeAnalysisEngine - Done creating shard strategy 
   INFO  13:04:10,262 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
   INFO  13:04:10,263 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
   INFO  13:04:18,482 GATKRunReport - Uploaded run statistics report to AWS S3 
   ##### ERROR ------------------------------------------------------------------------------------------
   ##### ERROR A USER ERROR has occurred (version 2.4-7-g5e89f01): 
   ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
   ##### ERROR Please do not post this error to the GATK forum
   ##### ERROR
   ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
   ##### ERROR Visit our website and forum for extensive documentation and answers to 
   ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
   ##### ERROR
   ##### ERROR MESSAGE: SAM/BAM file SAMFileReader{/scratch/3041404/P394_102.prmdup.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 70; please see the GATK --help documentation for options related to this error
   ##### ERROR ------------------------------------------------------------------------------------------
Post edited by dklevebring on

Best Answers

  • goblin_wizardgoblin_wizard Posts: 2
    Answer ✓

    Dear all, does not the polymerase error rate of 0.0001 seem high? Looking at the classical paper by Vogelstein and colleagues "Detection and quantification of mutations in the plasma of patients with colorectal tumors", where they could estimate the polymerase error rate after 30 cycles of PCR through Beaming, the maximum error rate was 1.95*10^-5 after 30 cycles. Nowadays the polymerases have lower error rate and commonly, less than 30 cycles are used in total for e.g. a capture experiment (we use 9 cycles pre and 9 cycles of PCR post capture). Wouldn't a Q50 be a reasonable BQ after merging?

Answers

  • dklevebringdklevebring Posts: 53Member

    Ummmm... never mind.

    From http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_CommandLineGATK.html :

    -allowPotentiallyMisencodedQuals

    Yeah, I withdraw the question. Sorry about that :)

    cheers Daniel

  • dklevebringdklevebring Posts: 53Member

    Ok, back again.

    The BaseRecalibrator also fails. Again, an important point here is that my qualities are high, and I reckon they should be (see my orginal post about SeqPrep and merging...)

    I was curious to see if BR would work, and the answer was nope. Any opinions? Is there

    1) A parameter I haven't found (I've been looking this time :)? 2) A workaround of some kind?

    Any input would be great.

    Thanks again., Daniel

    INFO 13:20:53,831 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:20:53,833 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.4-7-g5e89f01, Compiled 2013/03/06 01:01:28 INFO 13:20:53,833 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 13:20:53,833 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 13:20:53,838 HelpFormatter - Program Args: -T BaseRecalibrator -I /scratch/3041404/P394_102.realn.bam -R /bubo/proj/b2010040/private/GoldenPath/hg19/GATK_resource_bundle/human_g1k_v37_clean.fasta -knownSites /bubo/proj/b2010040/private/GoldenPath/hg19/GATK_resource_bundle/dbsnp_137.b37.vcf -o /bubo/proj/b2010040/private/GoldenPath/nobackup/cbc_exome/autoseq/P394_102/P394_102.merged.prmdup.realn.report.grp --validation_strictness LENIENT -allowPotentiallyMisencodedQuals INFO 13:20:53,838 HelpFormatter - Date/Time: 2013/03/13 13:20:53 INFO 13:20:53,838 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:20:53,838 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:20:53,869 ArgumentTypeDescriptor - Dynamically determined type of /bubo/proj/b2010040/private/GoldenPath/hg19/GATK_resource_bundle/dbsnp_137.b37.vcf to be VCF INFO 13:20:54,746 GenomeAnalysisEngine - Strictness is LENIENT INFO 13:20:54,818 GenomeAnalysisEngine - Downsampling Settings: No downsampling INFO 13:20:54,825 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 13:20:54,840 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 INFO 13:20:54,852 RMDTrackBuilder - Creating Tribble index in memory for file /bubo/proj/b2010040/private/GoldenPath/hg19/GATK_resource_bundle/dbsnp_137.b37.vcf INFO 13:26:12,843 RMDTrackBuilder - Writing Tribble index to disk for file /bubo/proj/b2010040/private/GoldenPath/hg19/GATK_resource_bundle/dbsnp_137.b37.vcf.idx INFO 13:26:58,997 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files INFO 13:26:59,002 GenomeAnalysisEngine - Done creating shard strategy INFO 13:26:59,002 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 13:26:59,002 ProgressMeter - Location processed.reads runtime per.1M.reads completed total.runtime remaining INFO 13:26:59,033 BaseRecalibrator - The covariates being used here:
    INFO 13:26:59,033 BaseRecalibrator - ReadGroupCovariate INFO 13:26:59,033 BaseRecalibrator - QualityScoreCovariate INFO 13:26:59,033 BaseRecalibrator - ContextCovariate INFO 13:26:59,033 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3 INFO 13:26:59,033 BaseRecalibrator - CycleCovariate INFO 13:26:59,052 ReadShardBalancer$1 - Loading BAM index data for next contig INFO 13:26:59,053 ReadShardBalancer$1 - Done loading BAM index data for next contig INFO 13:27:27,187 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.4-7-g5e89f01):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: SAM/BAM file SAMFileReader{/scratch/3041404/P394_102.realn.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score (71) with BAQ correction factor of 4; please see the GATK --help documentation for options related to this error
    ERROR ------------------------------------------------------------------------------------------
  • BerntPoppBerntPopp Posts: 47Member

    I think you should try to use unmerged reads first and follow the workflow in the GATK guidlines. Recalibration and realignment of merged reads is not supported to my knowledge (there is a post on recalibration of reduced reads here from ebanks: gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibration-bqsr) After doing the pipeline you can reduce the reads and filesize with the inbuild GATK walker: broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_compression_reducereads_ReduceReads.html

  • dklevebringdklevebring Posts: 53Member

    @ebanks thanks for reply on the math behind seqprep quals, I do see the issue. So your recommendation would be to either use seqprep and cap at 50 (1/10000 is likely an overestimation of the PCR error rate, but fair enough) or use Picards MergeBamAlignment in the resulting BAM file? Surely there must be some advantage in reading the fragment twice?

    @berntpopp The overall point here is not to reduce the file size by merging, but get a true sequencing depth, and take advantage of the fact that each sequence was read twice in the process.

  • BerntPoppBerntPopp Posts: 47Member

    Quite contrary: seeing the same reads many times is indicative for a PCR duplicate which does not give you more information (as it is polyclonal) but makes analysis harder. Only different reads at the same position realy help the analysis, thats why marking duplicates (with samtools or picard) is recommended.

  • dklevebringdklevebring Posts: 53Member

    @berntpopp I think that's two different things. The duplicates are removed in a separate process, to reduce the risk of taking PCR errors seriously (as variants). If I see the same base twice with Q20 in the two ends I do see a reason to increase it's quality. If the bases differ, set them to N (very likely one of them is a seq error).

    @ebanks point is very important though. Capping is necessary not to overtrust PCR errors, and it seems reasonable to set it around the known error rates. Engineered Taq pols have an error rate ≈ 1 in 10M and then one has to account for each cycle of the PCR, so capping at 50 (1 in 10k) seems like a good tradeoff.

    http://www.kapabiosystems.com/products/name/kapa-hifi-pcr-kits http://www.kapabiosystems.com/public/files/images/products/next-generation/kapahifi-hotstart/image1.png

  • delangeldelangel Posts: 71GSA Member mod

    @berntpopp - you're right, but in this particular case MarkDuplicates will not eliminate duplicates coming from overlapping pairs coming from the same fragment, as these pairs will be marked with opposite directions (one read forward, one read reverse) and MarkDuplicates is direction-aware. @dklevebring - the GATK will do the right thing with overlapping fragments when calling variation, BUT some of the annotations may be misleading and many of these (e.g. DP) are not overlapping fragment-aware and will count overlapping reads twice when computing depth and other statistics. We'll correct this in a future GATK version. The most useful thing that SeqPrep can potentially do for you is not the merging (which as has been pointed out is not really correct and may not help), but in removing adaptor sequence. If your insert is 70 bp long and reads are 100 bp, you will definitely read into the adaptor at the end of the read and you may have trouble aligning reads.

  • dklevebringdklevebring Posts: 53Member

    @delangel so what you're saying is that the bst way to go right now is merging with a cap at 40? Do you agree that two bases at Q20 is ok to increase to 40? Or any two quals where the base is identical but cap at 40.

  • dklevebringdklevebring Posts: 53Member
    edited March 2013

    I have to say that I agree with the goblin wizard on this (@goblin_wizard).

    @ebanks I always pictured the base quals to reflect the chance that the base was present in the molecule in question (i.e. is correctly sequenced), rather than being present in the sample sequenced. The sequencer itself, which assigns the base quals, has no idea about the sample the user put on it. Wouldn't it then make sense for the variant caller not to trust positions with a single read supprting the variants, independent of its qual? Please straighten my thoughs out if they are completely off here... :)

    cheers Daniel

    Post edited by dklevebring on
Sign In or Register to comment.