Resequencing using PacBio + using GATK HaplotypeCaller to call variants

eoberortnereoberortner Walnut CreekMember

Dear all,

We use the PacBio Blasr aligner for the alignment of our sequence constructs to the reference sequences.
Then, we use GATK HaplotypeCaller (default parameters) to call the variants (SNPs, Indels).

We're experiencing often the case that the HaplotypeCaller does not call variants although nicely highlighted by IGV (see screenshot attached).
Please, could anybody provide some helping hand?

Answers

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @eoberortner

    Just to be clear are you talking about the purple bars in the plot? If that is what you are talking about then it looks like those maybe technical artifacts as opposed to variants. HaplotypeCaller calls variants if you see that variation in multiple reads.

    Also please keep in mind, PacBio Blasr aligner is known to show false positive insertions.

    Please let me know if there is more information I could provide to you. I would be happy to direct you to some documents or presentations that would explain this in more detail.

    Regards
    Bhanu

  • eoberortnereoberortner Walnut CreekMember

    Hi Bhanu,

    thanks for your response!

    1) I was talking about the vertical lightgreen bars in the "BBTools_alignment" lane. You can see the white bars going through all alignments. In the IGV view, this clearly looks like an indel. I was just wondering if this comes from the IGV displays or if there is a real indel? If there's a real indel, then why does the HaplotypeCaller not call it.

    2) We do not use Blasr for the alignments. We use BBTools' (https://jgi.doe.gov/data-and-tools/bbtools/) PacBio aligner. If you or anybody else have good recommendations regarding what aligner to use for PacBio Consensus Sequences (CCS), then please let me know!

    Thanks,
    Ernst

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @eoberortner

    1) we can take a look at this for you. Would you please send us bam and bam.out for that region/chromosome. You can find steps to send data here: https://software.broadinstitute.org/gatk/documentation/article.php?id=1894

    2) We do not use this sort of data so unfortunately we have no recommendation for CCS aligner.

    Regards
    Bhanu

  • williamrowellwilliamrowell Menlo Park, CAMember

    Hello,

    I've been working on this subject (CCS resequencing and variant calling with GATK4 HC) at PacBio recently, and while I don't yet have final recommendations, I do have a system that seems to work pretty well so far.

    1) I've shifted over completely to minimap2 for these analyses. It's much faster than blasr, and still yields good alignments. Align CCS using minimap2 with the following parameters:

    minimap2 -a \
        -k 19 \
        -O 5,56 \
        -E 4,1 \
        -B 5 \
        -z 400,50 \
        -r 2k \
        --eqx \
        --secondary=no \
        -R "@RG\tID:${SAMPLE}_${MOVIE}\tSM:${SAMPLE}" \
        "${REF}" "${FASTQ}"
    

    2) Call variants with GATK HaplotypeCaller using the appropriate --pcr-indel-model. Because PacBio error mode is similar to PCR error, I typically use CONSERVATIVE or AGGRESSIVE for unamplified data, and AGGRESSIVE or HOSTILE for amplicons or target capture data. This is a little hacky, but it does reduce the number of spurious indel calls, and I'm not sure we have a better solution in GATK4 HC given the parameters available to us. I also use --minimum-mapping-quality 60 if that is available to me (i.e., if I'm focusing on a region that is not difficult to map).

    3) Filter variants using hard thresholds. I've tried VSQR on our data against a benchmark, and it just wasn't beating common sense thresholds. I can only really suggest starting points here, as this could be more specific to your dataset.

    • Split the variants into three groups: SNPs, 1bp indels + mixed variants where the indel is 1bp, >2bp indels + mixed variants where the indel is >2bp
    • For each group, evaluate the generic hard-filtering recommendations and decide which apply. For CCS alignments, FS and SOR don't apply as the CCS consensus is calculated using both strands. If you already set the minimum mapping quality to 60, then MQ and MQRankSum won't add any additional information. Otherwise, I currently stick with the recommended values except for QD. In large target capture panels or WGS, I've noticed that the QD distribution for SNPs and >2bp indels have the expected distribution shape. 1bp indels, however, have a large dominant peak on the far left. When filtering the "1bp indels + mixed " group from above, I increase the QD threshold to something closer to 8. I recommend looking at your QD distribution to see if this applies to your data. If you're working with amplicons, the QD distribution plots are likely useless, and you're better off setting a higher QUAL threshold. In addition, if you're working with PCR amplified data, you might want to mask homopolymer stretches.

    4) Finally, if you're suspicious of a call, you can add --bam-output "${GATKBAM}" to your HaplotypeCaller command and inspect the realigned/reassembled output. It's likely that minimap2, while doing a great job with the global alignment, has produced a local alignment that favors SNPs over indels, or vice versa. HaplotypeCaller local realignment fixes this.

    I'm curious, is this shotgun sequencing, target capture, or amplicons?

    Hope this helps,

    Billy

  • williamrowellwilliamrowell Menlo Park, CAMember

    Looking back at your IGV screenshot, it looks like the variants you see in your BBTools alignment weren't real variants and went away during the GATK HaplotypeCaller realignment/reassembly. I've seen this before in repetitive or low complexity sequence, where the same variant can be aligned as

    1) an insertion followed by a deletion, or
    2) one or more SNPs

    It would be helpful for diagnosis if you zoomed in on the region of interest and showed only around 50 bases from your alignments as well as the reference sequence.

  • alphahmedalphahmed JAPANMember

    For some reason I tried using HaplotypeCaller to genotype PacBio bam file, which was aligned using the recommended parameters gratefully provided by @williamrowell , but I only get the headers of vcf file with no single variant output.

    @williamrowell ! Could you please provide the full command line you usually use for HaplotpyeCaller on PacBio data?

  • williamrowellwilliamrowell Menlo Park, CAMember

    @alphahmed Are you using CCS reads? I've only had luck with GATK HC on CCS reads. I use:

    gatk --java-options -Xmx8G HaplotypeCaller \
            --native-pair-hmm-threads 16 \
            --reference "${REF}" \
            --input "${BAM}" \
            --output "${INTERVAL}.GATKHaplotypeCaller.vcf" \
            --pcr-indel-model AGGRESSIVE \
            --intervals "${INTERVAL}" \
            --minimum-mapping-quality 60 \
            --read-filter MappingQualityReadFilter \
            --read-filter NotSecondaryAlignmentReadFilter \
            --read-filter NotSupplementaryAlignmentReadFilter
    
  • alphahmedalphahmed JAPANMember

    Thank you @williamrowell !
    You are right, I am not using CCS reads. I guess I should not rely on HaplotypeCaller for subreads output then.

  • eoberortnereoberortner Walnut CreekMember

    @williamrowell thank you very much for all your help!
    We are using amplicons and all your suggestions improve our variant calling so far. Please find attached two screenshots (general IGV view and zoom in) of a very common scenario that happens (homopolymer stretches). I'm still struggling figuring out how to overcome this problem and and would appreciate your help.
    1. you mentioned to mask homopolymer stretches. how can this be done?
    2. when filtering variants using GATK's VariantFiltration, is it possible to set the % of reads that contain a variant as a threshold. For example, if more than 30% of the reads show a variant, then it's likely that it's a real variant

    Thank you very much for your help!

  • williamrowellwilliamrowell Menlo Park, CAMember

    For question 1, use the --mask, --mask-name, and --mask-extension options and provide a bed file. I generate a bed file of homopolymer stretches using a python script (there are many executions of this script out there, but let me know if you need help finding one) and pass this to the --mask option with --mask-name to provide a convenient filter name. I typically extend the mask by 1 base in either direction using --mask-extension, because errors in homopolymer runs will sometimes manifest as variants in the first or last position.
    For question 2, the rest of the GATK community is likely more helpful than me. I think you want a JEXL filter expression to check the ratio of AD to DP for each sample.. I've toyed with this in the past, but it gets very complicated.
    You can get in touch with me through your PacBio BFX FAS if you want to talk through any of this. I'll update this forum post with anything we decide that will be helpful to the community.

Sign In or Register to comment.