After BQSR, base quality dramatically decrease without a set of known variants

YafeiYafei OkinawaMember

Hi all,

I am following best practice with my case, the command looks like below:

        bwa mem -R '@RG\tID:foo\tSM:bar\tLB:library1' -t 24 Aten.fa Aten_R1.fastq Aten_R2.fastq > Aten.sam
        samtools view -bS Aten.sam -o Aten.bam
        samtools fixmate -O bam Aten.sam Aten_fixmate.bam
        samtools sort [email protected] 24 -O bam -o Aten_sorted.bam -T /tmp/Aten_temp Aten_fixmate.bam

        java -jar picard-tools-2.1.0/picard.jar MarkDuplicates INPUT=Aten_sorted.bam OUTPUT=Aten_DM_sorted.bam METRICS_FILE=Aten.bam.metrics
        samtools index Aten_DM_sorted.bam
        java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R Aten.fa -I Aten_DM_sorted.bam -o Aten_realignment_targets.list
        java -jar GenomeAnalysisTK.jar -T IndelRealigner -R Aten.fa -I Aten_DM_sorted.bam -targetIntervals Aten_realignment_targets.list -o Aten_realigned_reads.bam
        java -jar picard-tools-2.1.0/picard.jar AddOrReplaceReadGroups I=Aten_realigned_reads.bam O=Aten_group_realigned_reads.bam RGID=4 RGLB=library1 RGPL=illumina RGPU=unit5 RGSM=bar.Aten
        samtools index Aten_group_realigned_reads.bam

        java -jar GenomeAnalysisTK.jar -R Aten.fa -T UnifiedGenotyper -I Aten_realigned_reads.bam -o Aten_gatk.raw.vcf -stand_call_conf 30.0 -stand_emit_conf 10 -nct 24 -glm BOTH -rf BadCigar
        samtools mpileup -ugf Aten.fa Aten_realigned_reads.bam | bcftools call -vmO v -o Aten.samtools.raw.vcf
        java -Xmx4g -jar GenomeAnalysisTK.jar -R Aten.fa -T SelectVariants --variant Aten_gatk.raw.vcf --concordance Aten.samtools.raw.vcf -o Aten.concordance.raw.vcf -nt 24
        java -Xmx4g -jar GenomeAnalysisTK.jar -R Aten.fa -T VariantFiltration --filterExpression " QD < 20.0 || ReadPosRankSum < -8.0 || FS > 10.0" --filterName LowQualFilter --missingValuesInExpressionsShouldEvaluateAsFailing --variant Aten.concordance.raw.vcf --logging_level ERROR -o Aten.concordance.flt.vcf
        grep -v "Filter" Aten.concordance.flt.vcf > Aten.confidence.raw.vcf

        java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R Aten.fa -I Aten_group_realigned_reads.bam -knownSites Aten.confidence.raw.vcf -o Aten_recal_data.table -nct 24
        java -jar GenomeAnalysisTK.jar -T PrintReads -R Aten.fa -I Aten_group_realigned_reads.bam -BQSR Aten_recal_data.table -o Aten_recal_reads.bam -nct 24

I checked the base quality for Aten_group_realigned_reads.bam and Aten_recal_reads.bam, and then I found the all of base quality of Aten_recal_reads.bam are below 21, but 80% base quality of Aten_group_realigned_reads.bam are higher than 30. I am not sure what's wrong in my case.

Do you mind giving some suggestions?

BTW, I also found that some said that if you had a high coverage data, you do not need to do BQSR. Do you agree it?

Thanks a lot,


Best Answer


  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    Hi Yafei,

    Can you please post the AnalyzeCovariates plots?


  • YafeiYafei OkinawaMember

    Hi Sheila,
    I put it in attachment, please check it. Importantly, this pdf file are based on Adif analysis (other sample), not Aten in above command, but I am sure I used same condition to all of my sample. Also, you could see the base quality are dramatically decrease in final table.

    Thanks, ^^


  • YafeiYafei OkinawaMember

    Hi Geraldine,

    Thank you so much.
    Do you think I should use Aten.concordance.flt.vcf to trianning BQSR, or I still do some filter with ' QD < 2.0 || ReadPosRankSum < -8.0 || FS > 10.0'. which is better?
    BTW, do you think the step that I did select variance, which both detected by Gatk and samtools, was good choice or not?

    Thanks ^^


  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
    edited March 2016

    Hi Yafei,

    You can try both with and without comparing the results of GATK and Samtools to see which gives you better results.

    As for filtering the VCF, we have some basic hard filtering recommendations here.


  • YafeiYafei OkinawaMember
    edited March 2016

    Hi all,

    I tried to use kinds of known sites built by different condition to do do BASR, but the result looks similar. I put the results in attachment, you could see the base quality dropped a lot. Also, I put the command I used below:

    For Aech_gatk_recalibration_plots.pdf: [Using filtered SNP data generated by GATK]

        java -Xmx4g -jar GenomeAnalysisTK.jar -R Aten.fa -T VariantFiltration --filterExpression " QD < 2.0 || ReadPosRankSum < -8.0 || FS > 10.0" --filterName LowQualFilter --missingValuesInExpressionsShouldEvaluateAsFailing --variant Asub_gatk.raw.vcf --logging_level ERROR -o Asub_gatk.raw.flt.vcf
        grep -v "Filter" Asub_gatk.raw.flt.vcf > Asub_gatk.confidence.raw.vcf
        java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R Aten.fa -I Asub_group_realigned_reads.bam -knownSites Asub_gatk.confidence.raw.vcf -o Asub_gatk_recal_data.table -nct 24
        java -jar GenomeAnalysisTK.jar -T PrintReads -R Aten.fa -I Asub_group_realigned_reads.bam -BQSR Asub_gatk_recal_data.table -o Asub_gatk_recal_reads.bam -nct 24

    For Aech_gatk_raw_recalibration_plots.pdf: [Using SNP data generated by GATK, no filtering]

        java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R Aten.fa -I Agem_group_realigned_reads.bam -knownSites  Agem_gatk.raw.vcf -o Agem_gatk_raw_recal_data.table -nct 24
        java -jar GenomeAnalysisTK.jar -T PrintReads -R Aten.fa -I Agem_group_realigned_reads.bam -BQSR Agem_gatk_raw_recal_data.table -o Agem_gatk_raw_recal_reads.bam -nct 24

    For Aech_sam_recalibration_plots.pdf: [Using SNP data generated by Samtools]

        java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R Aten.fa -I Asub_group_realigned_reads.bam -knownSites  Asub.samtools.raw.vcf -o Asub_sam_recal_data.table -nct 24
        java -jar GenomeAnalysisTK.jar -T PrintReads -R Aten.fa -I Asub_group_realigned_reads.bam -BQSR Asub_sam_recal_data.table -o Asub_sam_recal_reads.bam -nct 24

    For Aech_recalibration_plots.pdf: [Using the consensus SNP between GATK and Samtools as above describe]

        java -Xmx4g -jar GenomeAnalysisTK.jar -R Aten.fa -T VariantFiltration --filterExpression " QD < 2.0 || ReadPosRankSum < -8.0 || FS > 10.0" --filterName LowQualFilter --missingValuesInExpressionsShouldEvaluateAsFailing --variant Aech.concordance.raw.vcf --logging_level ERROR -o Aech.concordance.flt.vcf
        grep -v "Filter" Aech.concordance.flt.vcf > Aech.confidence.raw.vcf
        java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R Aten.fa -I Aech_group_realigned_reads.bam -knownSites Aech.confidence.raw.vcf -o Aech_recal_data.table -nct 24
        java -jar GenomeAnalysisTK.jar -T PrintReads -R Aten.fa -I Aech_group_realigned_reads.bam -BQSR Aech_recal_data.table -o Aech_recal_reads.bam -nct 24

    It seems the Known sites I built did not work for BQSR in my case, could you give some help on how could I improve the BQSR step?


    Issue · Github
    by Sheila

    Issue Number
    Last Updated
    Closed By
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Well, there are a few possible explanations. One is that the known sites are not adequate/extensive enough; but the set without filtering should be sufficient. Another is that the quality of your sequence is really that low. Yet another is that your samples are too divergent relative to the reference (I assume you are working with a non-human organism). One thing you can check is the quality of the alignments -- if the reads don't align well due to phylogenetic divergence, that can cause the appearance of low quality sequence. In that case, ideally you want to use a better reference; if that's not available then you'll have to skip BQSR and hope there are no major systematic biases in your data..

  • YafeiYafei OkinawaMember

    Thanks so much ^^

Sign In or Register to comment.