
Deletion tends to have lower indel quality than insertion

liuxingliang (Singapore; Member)

Hi Team,

I am analyzing some small-target, PCR-based deep sequencing data. The target region is 264,067 bp, and the median per-interval average coverage is around 5000x. I did indel realignment and base quality recalibration in preprocessing. After BQSR, I found that for almost all reads, the BI (insertion) quality (around 55) is higher than the BD (deletion) quality (around 45).

In the subsequent variant calling step, we found many more deletions than insertions.

Is it normal behavior for the data to have higher BI quality than BD quality?


Answers

  • Sheila (Broad Institute; Member, Broadie admin)

    @liuxingliang
    Hi,

    I don't think there is any need to be worried. Have a look at this thread for more information.

    -Sheila

  • liuxingliang (Singapore; Member)

    @Sheila

    Hello Sheila,

    I have checked the thread you suggested. It seems their problem is not exactly the same as mine. My problem is that the INSERTION quality (around 90-33) is consistently higher than the DELETION quality (around 80-33) in my data. I am not sure whether this is caused by my data or is simply the default behavior of GATK BQSR.

    What worries me more is that we got a strikingly higher number of deletions than insertions in our variant calling results. I am not sure whether this is caused by the lower deletion quality in my data.

    bless~
    Xingliang
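
    As an aside, the "90-33" and "80-33" notation above reads as ASCII-encoded quality strings: assuming the usual Phred+33 encoding of BI/BD tag characters, the stored character's ASCII code minus 33 gives the Phred score. A purely illustrative sketch:

    ```shell
    # Illustration only: decoding a BI/BD tag character to a Phred score,
    # assuming the usual Phred+33 ASCII encoding of quality strings.
    ascii_of() { printf '%d' "'$1"; }     # ASCII code of a character

    bi_qual=$(( $(ascii_of 'Z') - 33 ))   # 'Z' is ASCII 90 -> Q57
    bd_qual=$(( $(ascii_of 'P') - 33 ))   # 'P' is ASCII 80 -> Q47
    echo "BI=Q$bi_qual BD=Q$bd_qual"
    ```

    So an insertion quality character around ASCII 90 corresponds to roughly Q57, and a deletion quality character around ASCII 80 to roughly Q47, consistent with the gap described above.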

  • Sheila (Broad Institute; Member, Broadie admin)

    @liuxingliang
    Hi Xingliang,

    Can you tell us what kind of data you are working with? Are you following the Best Practices? Can you post the AnalyzeCovariates plots?

    Thanks,
    Sheila

  • liuxingliang (Singapore; Member)

    Hi Sheila,

    Thank you for your response.

    My data is PCR-based small-target (around 260,000 bp) deep sequencing (around 5000x) data, as I mentioned in my question. I am not sure what other information I should provide. They are DNA sequences from an Illumina sequencer (probably a MiSeq, but I am not sure).

    For the Best Practices, I think I have followed them:
    1. BWA mapping (mem, 0.7.5a)

    Per-lane data preprocessing (GATK 3.5):
    2. Sort (Picard SortSam 1.138), remove secondary reads (samtools 1.3), AddOrReplaceReadGroups; we DID NOT mark duplicates since the data is PCR-based.
    3. RealignerTargetCreator (-L, -dt NONE, -known 1000G_phase1.indels.b37.vcf, -known Mills_and_1000G_gold_standard.indels.b37.vcf), IndelRealigner (without -L, -maxReads 1000000)
    4. BaseRecalibrator (-L, -knownSites 1000G_phase1.indels.b37.vcf, -knownSites Mills_and_1000G_gold_standard.indels.b37.vcf, -knownSites dbsnp144), PrintReads (-baq RECALCULATE), samtools (1.3) calmd -Abr

    Merged-lane data preprocessing:
    5. RealignerTargetCreator (-L, -dt NONE, -known 1000G_phase1.indels.b37.vcf, -known Mills_and_1000G_gold_standard.indels.b37.vcf), IndelRealigner (without -L, -maxReads 1000000)
    6. BaseRecalibrator (-L, -knownSites 1000G_phase1.indels.b37.vcf, -knownSites Mills_and_1000G_gold_standard.indels.b37.vcf, -knownSites dbsnp144), PrintReads (-baq RECALCULATE), samtools (1.3) calmd -Abr

    For the AnalyzeCovariates plots:

    They are attached. I randomly picked two samples (AHC333 and AHC338); each was sequenced on 4 lanes (L001, L002, L003, L004). I attached all four lane plots of AHC333 so you can check the inter-lane case, and one lane plot of AHC338 for the inter-sample case. I also attached the merged-lane plot of AHC333.

    The plots were generated from the BaseRecalibrator output. (In my view, in the merged data of AHC333 there is a significant difference between the mean insertion quality and the mean deletion quality.)

    Thank you very much for your help.

    bless~
    Xingliang
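
    For concreteness, per-lane steps 3 and 4 above might look like the commands below. This is a sketch only: the file names (lane1.bam, targets.interval_list, and so on) are placeholders introduced for illustration, not from this thread, and the flags mirror the GATK 3.5 options listed in the steps. The script just prints the commands; swap printf for eval to actually run them.

    ```shell
    # Sketch of per-lane steps 3-4 (GATK 3.5 syntax). Placeholder inputs:
    ref=human_g1k_v37.fasta
    known1=1000G_phase1.indels.b37.vcf
    known2=Mills_and_1000G_gold_standard.indels.b37.vcf

    # Step 3: local realignment around known indels
    rtc="java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
      -R $ref -I lane1.bam -L targets.interval_list -dt NONE \
      -known $known1 -known $known2 -o lane1.intervals"
    realign="java -jar GenomeAnalysisTK.jar -T IndelRealigner \
      -R $ref -I lane1.bam -targetIntervals lane1.intervals \
      -maxReads 1000000 -o lane1.realigned.bam"

    # Step 4: BQSR, then write the recalibrated reads (the poster also adds
    # -baq RECALCULATE here; see the discussion of BAQ further down)
    bqsr="java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
      -R $ref -I lane1.realigned.bam -L targets.interval_list \
      -knownSites $known1 -knownSites $known2 -knownSites dbsnp_144.vcf \
      -o lane1.recal.table"
    print_reads="java -jar GenomeAnalysisTK.jar -T PrintReads \
      -R $ref -I lane1.realigned.bam -BQSR lane1.recal.table \
      -baq RECALCULATE -o lane1.recal.bam"

    # Print the commands; replace printf with eval to execute them for real
    printf '%s\n\n' "$rtc" "$realign" "$bqsr" "$print_reads"
    ```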

    Issue · GitHub: #763, opened by Sheila, closed by vdauwera
  • liuxingliang (Singapore; Member)

    @Sheila

    Sorry, I forgot to tag you; please see my reply above.

    bless~
    Xingliang

  • Sheila (Broad Institute; Member, Broadie admin)

    @liuxingliang
    Hi Xingliang,

    I am just checking with the team. We will get back to you soon :smile:

    -Sheila

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi @liuxingliang , can you tell me why you are using -baq RECALCULATE at the PrintReads step?

    Also, can you please post the exact command you used to run AnalyzeCovariates?

  • liuxingliang (Singapore; Member)

    Hi @Sheila , Thank you very much for your help!!

    Hi @Geraldine_VdAuwera ,

    // BAQ
    I use -baq RECALCULATE in the PrintReads step because I want my output BAM to carry the BAQ tag, which I then use to cap my base qualities with samtools calmd (1.3, -Abr). The reason I involved BAQ is that the indels called from our previous small-target deep sequencing data were very noisy (a large number of indels per sample), and I was wondering whether capping the base qualities with BAQ would help (it didn't). I got the idea of adding BAQ to the BAM from the article "Per-base alignment qualities (BAQ) in the GATK", which gives "A highly efficient BAQ extended pipeline":

    ```
    RealignerTargetCreator
    IndelRealigner  // don't bother with BAQ here, since we will calculate it in table recalibrator

    BaseRecalibrator
    PrintReads (with --BQSR input) -baq RECALCULATE  // now the reads will have a BAQ tag added. Slows the tool down some
    ```
    // AnalyzeCovariates command (recal_data.table is the output of BaseRecalibrator)
    java -jar GenomeAnalysisTK.jar \
        -T AnalyzeCovariates \
        -R human_g1k_v37.fasta \
        -BQSR recal_data.table \
        -plots BQSR.pdf

    bless~
    Xingliang

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Oh, the article about BAQ is very old. I just recently moved it to the archive to avoid this kind of misunderstanding... It's definitely not part of our Best Practices anymore, sorry about that.

    Your AnalyzeCovariates command is incomplete; you're missing the "after" recalibration component (which relies on a second run of BaseRecalibrator). You need that to be able to evaluate what recalibration was applied to the data.
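
    Spelled out, the complete before/after evaluation involves a second BaseRecalibrator pass (with -BQSR pointing at the first table) followed by AnalyzeCovariates given both tables. Below is a sketch in GATK 3 syntax with placeholder file names (input.bam, recal_data.table, and so on); the script just prints the commands rather than running them.

    ```shell
    # Sketch of the complete before/after AnalyzeCovariates run (GATK 3).
    # File names (input.bam, recal_data.table, ...) are placeholders.
    ref=human_g1k_v37.fasta

    # Second BaseRecalibrator pass: apply the first table via -BQSR to
    # model the post-recalibration error profile
    post="java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
      -R $ref -I input.bam \
      -knownSites 1000G_phase1.indels.b37.vcf \
      -knownSites Mills_and_1000G_gold_standard.indels.b37.vcf \
      -BQSR recal_data.table -o post_recal_data.table"

    # AnalyzeCovariates with both tables, to plot before vs. after
    ac="java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates \
      -R $ref -before recal_data.table -after post_recal_data.table \
      -plots recalibration_plots.pdf"

    # Print the commands; replace printf with eval to execute them for real
    printf '%s\n\n' "$post" "$ac"
    ```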

  • liuxingliang (Singapore; Member)

    @Geraldine_VdAuwera

    I see. I had originally assumed that the trends of insertion quality and deletion quality should be the same. Thank you very much.

    bless~
    Xingliang
