The current GATK version is 3.5-0

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# Realignment with high base qualities

Posts: 79Member
edited March 2013

Hi,

I am currently working with a project where we have sequenced a library of approximately 70 bps insert sizes using 2x100 paired-end seq. While this can seem unnecessary, it can improve base qualities a lot.

I have used SeqPrep (https://github.com/jstjohn/SeqPrep) which strips adaptors and merges reads that overlap, in our case the entire read most of the times. This also boosts the base qualities, if a base was sequenced twice, the quality improves quite a bit. This way, base qualities can stretch up to 70 and over (probability of error 0.0001 x 0.0001 if both reads had Q40 at that base, it merged qual = 80). No funny business there.

However, this does not seem to play nicely with GATK. The realignment crashes (see below) saying the the base quals must be erroneous. In my case, but they are correct. Can I force GATK to work with these BQs? (--validation_strictness LENIENT didn't help as you can see below

cheers
Daniel Klevebring

   INFO  13:04:07,408 HelpFormatter - --------------------------------------------------------------------------------
INFO  13:04:07,411 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.4-7-g5e89f01, Compiled 2013/03/06 01:01:28
INFO  13:04:07,411 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  13:04:07,416 HelpFormatter - Program Args: -T RealignerTargetCreator -I /scratch/3041404/P394_102.prmdup.bam -R /bubo/proj/b2010040/private/GoldenPath/hg19/GATK_resource_bundle/human_g1k_v37_clean.fasta -o /scratch/3041404/P394_102.realn.intervals --intervals /bubo/proj/b2010040/private/GoldenPath/NG_design/1000G_REF_picard_custom_design_target_regions_HG19.bed.interval_list --validation_strictness LENIENT
INFO  13:04:07,416 HelpFormatter - Date/Time: 2013/03/13 13:04:07
INFO  13:04:07,416 HelpFormatter - --------------------------------------------------------------------------------
INFO  13:04:07,416 HelpFormatter - --------------------------------------------------------------------------------
INFO  13:04:08,461 GenomeAnalysisEngine - Strictness is LENIENT
INFO  13:04:08,632 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  13:04:08,640 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 13:04:08,655 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO  13:04:09,782 IntervalUtils - Processing 39772003 bp from intervals
INFO  13:04:10,001 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files
INFO  13:04:10,262 GenomeAnalysisEngine - Done creating shard strategy
INFO  13:04:10,262 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  13:04:10,263 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining
INFO  13:04:18,482 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.4-7-g5e89f01):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR
##### ERROR MESSAGE: SAM/BAM file SAMFileReader{/scratch/3041404/P394_102.prmdup.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 70; please see the GATK --help documentation for options related to this error
##### ERROR ------------------------------------------------------------------------------------------

Post edited by dklevebring on
Tagged:

• Posts: 2Member

Dear all, does not the polymerase error rate of 0.0001 seem high? Looking at the classical paper by Vogelstein and colleagues "Detection and quantification of mutations in the plasma of patients with colorectal tumors", where they could estimate the polymerase error rate after 30 cycles of PCR through Beaming, the maximum error rate was 1.95*10^-5 after 30 cycles. Nowadays the polymerases have lower error rate and commonly, less than 30 cycles are used in total for e.g. a capture experiment (we use 9 cycles pre and 9 cycles of PCR post capture). Wouldn't a Q50 be a reasonable BQ after merging?

• Posts: 79Member

Ummmm... never mind.

-allowPotentiallyMisencodedQuals

Yeah, I withdraw the question. Sorry about that

cheers
Daniel

• Posts: 79Member

Ok, back again.

The BaseRecalibrator also fails. Again, an important point here is that my qualities are high, and I reckon they should be (see my orginal post about SeqPrep and merging...)

I was curious to see if BR would work, and the answer was nope. Any opinions? Is there

1) A parameter I haven't found (I've been looking this time ?
2) A workaround of some kind?

Any input would be great.

Thanks again.,
Daniel

INFO 13:20:53,831 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:20:53,833 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.4-7-g5e89f01, Compiled 2013/03/06 01:01:28
INFO 13:20:53,833 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 13:20:53,838 HelpFormatter - Program Args: -T BaseRecalibrator -I /scratch/3041404/P394_102.realn.bam -R /bubo/proj/b2010040/private/GoldenPath/hg19/GATK_resource_bundle/human_g1k_v37_clean.fasta -knownSites /bubo/proj/b2010040/private/GoldenPath/hg19/GATK_resource_bundle/dbsnp_137.b37.vcf -o /bubo/proj/b2010040/private/GoldenPath/nobackup/cbc_exome/autoseq/P394_102/P394_102.merged.prmdup.realn.report.grp --validation_strictness LENIENT -allowPotentiallyMisencodedQuals
INFO 13:20:53,838 HelpFormatter - Date/Time: 2013/03/13 13:20:53
INFO 13:20:53,838 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:20:53,838 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:20:53,869 ArgumentTypeDescriptor - Dynamically determined type of /bubo/proj/b2010040/private/GoldenPath/hg19/GATK_resource_bundle/dbsnp_137.b37.vcf to be VCF
INFO 13:20:54,746 GenomeAnalysisEngine - Strictness is LENIENT
INFO 13:20:54,818 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO 13:20:54,825 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 13:20:54,840 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO 13:20:54,852 RMDTrackBuilder - Creating Tribble index in memory for file /bubo/proj/b2010040/private/GoldenPath/hg19/GATK_resource_bundle/dbsnp_137.b37.vcf
INFO 13:26:12,843 RMDTrackBuilder - Writing Tribble index to disk for file /bubo/proj/b2010040/private/GoldenPath/hg19/GATK_resource_bundle/dbsnp_137.b37.vcf.idx
INFO 13:26:58,997 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files
INFO 13:26:59,002 GenomeAnalysisEngine - Done creating shard strategy
INFO 13:26:59,002 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 13:26:59,033 BaseRecalibrator - The covariates being used here:
INFO 13:26:59,033 BaseRecalibrator - QualityScoreCovariate
INFO 13:26:59,033 BaseRecalibrator - ContextCovariate
INFO 13:26:59,033 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3
INFO 13:26:59,033 BaseRecalibrator - CycleCovariate
INFO 13:26:59,052 ReadShardBalancer$1 - Loading BAM index data for next contig INFO 13:26:59,053 ReadShardBalancer$1 - Done loading BAM index data for next contig
INFO 13:27:27,187 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR ------------------------------------------------------------------------------------------
• Posts: 49Member

I think you should try to use unmerged reads first and follow the workflow in the GATK guidlines.
Recalibration and realignment of merged reads is not supported to my knowledge (there is a post on recalibration of reduced reads here from ebanks: gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibration-bqsr)
After doing the pipeline you can reduce the reads and filesize with the inbuild GATK walker:

• Posts: 79Member

@ebanks thanks for reply on the math behind seqprep quals, I do see the issue. So your recommendation would be to either use seqprep and cap at 50 (1/10000 is likely an overestimation of the PCR error rate, but fair enough) or use Picards MergeBamAlignment in the resulting BAM file? Surely there must be some advantage in reading the fragment twice?

@berntpopp The overall point here is not to reduce the file size by merging, but get a true sequencing depth, and take advantage of the fact that each sequence was read twice in the process.

• Posts: 49Member

Quite contrary: seeing the same reads many times is indicative for a PCR duplicate which does not give you more information (as it is polyclonal) but makes analysis harder. Only different reads at the same position realy help the analysis, thats why marking duplicates (with samtools or picard) is recommended.

• Posts: 79Member

@berntpopp I think that's two different things. The duplicates are removed in a separate process, to reduce the risk of taking PCR errors seriously (as variants). If I see the same base twice with Q20 in the two ends I do see a reason to increase it's quality. If the bases differ, set them to N (very likely one of them is a seq error).

@ebanks point is very important though. Capping is necessary not to overtrust PCR errors, and it seems reasonable to set it around the known error rates. Engineered Taq pols have an error rate ≈ 1 in 10M and then one has to account for each cycle of the PCR, so capping at 50 (1 in 10k) seems like a good tradeoff.

• Posts: 71Dev mod

@berntpopp - you're right, but in this particular case MarkDuplicates will not eliminate duplicates coming from overlapping pairs coming from the same fragment, as these pairs will be marked with opposite directions (one read forward, one read reverse) and MarkDuplicates is direction-aware.
@dklevebring - the GATK will do the right thing with overlapping fragments when calling variation, BUT some of the annotations may be misleading and many of these (e.g. DP) are not overlapping fragment-aware and will count overlapping reads twice when computing depth and other statistics. We'll correct this in a future GATK version.
The most useful thing that SeqPrep can potentially do for you is not the merging (which as has been pointed out is not really correct and may not help), but in removing adaptor sequence. If your insert is 70 bp long and reads are 100 bp, you will definitely read into the adaptor at the end of the read and you may have trouble aligning reads.

• Posts: 79Member

@delangel so what you're saying is that the bst way to go right now is merging with a cap at 40? Do you agree that two bases at Q20 is ok to increase to 40? Or any two quals where the base is identical but cap at 40.

• Posts: 2Member

Dear all, does not the polymerase error rate of 0.0001 seem high? Looking at the classical paper by Vogelstein and colleagues "Detection and quantification of mutations in the plasma of patients with colorectal tumors", where they could estimate the polymerase error rate after 30 cycles of PCR through Beaming, the maximum error rate was 1.95*10^-5 after 30 cycles. Nowadays the polymerases have lower error rate and commonly, less than 30 cycles are used in total for e.g. a capture experiment (we use 9 cycles pre and 9 cycles of PCR post capture). Wouldn't a Q50 be a reasonable BQ after merging?

• Posts: 79Member
edited March 2013

I have to say that I agree with the goblin wizard on this (@goblin_wizard).

@ebanks I always pictured the base quals to reflect the chance that the base was present in the molecule in question (i.e. is correctly sequenced), rather than being present in the sample sequenced. The sequencer itself, which assigns the base quals, has no idea about the sample the user put on it. Wouldn't it then make sense for the variant caller not to trust positions with a single read supprting the variants, independent of its qual? Please straighten my thoughs out if they are completely off here...

cheers
Daniel

Post edited by dklevebring on
• Posts: 26Member

Warming up this old thread more than two years later. > @delangel said:

the GATK will do the right thing with overlapping fragments when calling variation, BUT some of the annotations may be misleading and many of these (e.g. DP) are not overlapping fragment-aware and will count overlapping reads twice when computing depth and other statistics. We'll correct this in a future GATK version.
The most useful thing that SeqPrep can potentially do for you is not the merging (which as has been pointed out is not really correct and may not help), but in removing adaptor sequence. If your insert is 70 bp long and reads are 100 bp, you will definitely read into the adaptor at the end of the read and you may have trouble aligning reads.

Did you fix the statistics for overlapping PE reads by now? What is your current recommendation for overlapping PE reads? Trim the adapters obviously, but then do the alignment (e.g. with BWA) without overlapping reads merged and then input this bam to the classic MD, IR, BQSR?

Thanks