Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

HaplotypeCaller reassembles reads so that no variants are called in a specific region

danilovkiridanilovkiri Moscow, RussiaMember
edited October 2018 in Ask the GATK team

Hi.

gatk-4.0.7.0
java openjdk version "1.8.0_181"

I've encountered a problem while running HaplotypeCaller in -ERC BP_RESOLUTION --genotyping-mode DISCOVERY --output-mode EMIT_ALL_CONFIDENT_SITES mode on a single-sample paired-end WGS data. The problem is that HC reassembles reads so that a particular region of interest contains no supporting reads in the corresponding bamout file while the original BAM file does contain them. The region of interest is the CYP2D6 gene on chr22.

Commands

After trimming the reads I align them to GRCh38 with Bowtie2:

bowtie2 --threads 16 --trim5 0 --trim3 0 --phred33 --no-mixed --no-discordant --no-unal --local -x /hg38/bowtie2/hg38 -q -1 /trimmomatic_output/sample_R1.paired.fastq.gz -2 /trimmomatic_output/sample_R2.paired.fastq.gz -t -S /bowtie2_output/sample_raw.sam --met-file /path_to_met_file

This is followed by sorting the resulting SAM file and its conversion to BAM, running Picard's MarkDuplicates on the resulting BAM and further filtering using samtools with flags -q 10 -f 0x2 -F 0x4 -F 0x100 -F 0x400 -F 0x800. BQSR is applied next followed by calling on separate chromosomes/ROIs:

/home/ubuntu/Tools/gatk-4.0.7.0/gatk  HaplotypeCaller --reference /hg38.fa --input /bqsr_output/sample.bqsr.bam --genotyping-mode DISCOVERY --output-mode EMIT_ALL_CONFIDENT_SITES --read-filter NotSecondaryAlignmentReadFilter --read-filter NotDuplicateReadFilter --read-filter MappingQualityAvailableReadFilter  --read-filter NotSupplementaryAlignmentReadFilter --smith-waterman FASTEST_AVAILABLE --min-base-quality-score 10 --annotation-group StandardAnnotation --annotation-group StandardHCAnnotation --output /gatk_gvcf_output/sample.diploid.raw.g.vcf.gz --emit-ref-confidence BP_RESOLUTION --sample-ploidy 2 --native-pair-hmm-threads 16 -L chr22

The resulting files are combined with CombineGVCFs followed by genotype calling using GenotypeGVCFs (gatk 3.8-1-0-gf15c1c3ef). This old version is used due to the fact that --includeNonVariantSites option is not implemented in up-to-date package (as far as I know), while I need to get the reference calls for further analysis.

Problems

The sample output from final VCF file from GenotypeGVCFs for all positions in a region of interest looks like this (zero AD, DP, which seems reasonable as bamout contains no supporting reads):

chr22   42130000    .   G   .   .   .   .   GT:AD:DP:RGQ    ./.:0,0:0:0

The IGV screenshot of the original BAM file after samtools filtering and bamout BAM file from HC is attached below.

I have digged into some of the old threads for similar problems (link, link, link, link) and found no appropriate solution. I tried to run HC as above but with adding --dont-trim-active-regions True --min-dangling-branch-length 1 --min-pruning 1 --disable-optimizations True --allow-non-unique-kmers-in-ref parameters, which helped in some of the above-mentioned cases. It resulted in complete absence of supporting reads in the bamout (the upper empty panel is for bamout output, the lower is for original BAM) as in the previous case:

Finally I tried running HC with --dont-trim-active-regions True --min-dangling-branch-length 1 --min-pruning 1 --disable-optimizations True --allow-non-unique-kmers-in-ref and --read-filter AllowAllReadsReadFilter --disable-tool-default-read-filters True. The IGV screenshot and its zoomed region are shown below:

As you can see, there are some regions with supporting reads in the bamout, but no calls are made within those regions.

chr22   42128819        .       C       .       .       .       .       GT:AD:DP:RGQ    ./.:0,0:0:0
chr22   42128813        .       G       .       .       .       .       GT:AD:DP:RGQ    ./.:0,0:0:0

I don't understand why is that happening as samtools view -q 10 has already filtered reads with MAPQ below 10 (which is a default value for MappingQualityReadFilter). It seems like HC filters everything out based on some criteria, but even given some supporting reads in the bamout like in the example above HC doesn't make any calls in the regions with those reads.
Any help would be greatly appreciated.

Post edited by danilovkiri on

Answers

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @danilovkiri,

    Unfortunately, we don't support questions that use bowtie2. Our Best Practice Workflow for germline variant calling uses BWA MEM. Here are some general questions to consider towards deciphering what is going on.

    • Do the read depths for the clean bam vs the bamout make sense or are you losing depth along the way? Towards this, you can toggle IGV settings, e.g. to include/exclude supplementary, secondary, duplicate reads.
    • Try using GATK4 HaplotypeCaller
    • Try disabling the MateOnSameContigOrNoMappedMateReadFilter. This is the GATK4 nomenclature so be sure to look up the equivalent for GATK3.

    Good luck.

  • danilovkiridanilovkiri Moscow, RussiaMember

    Thanks, @shlee

    Actually, I'm already using GATK4 HaplotypeCaller for variant calling, it's explicitly mentioned in the code blocks above and in the text. I'm using only GenotypeGVCFs from GATK3.8 due to the above-mentioned reasons.

    I didn't quite get your question about the read depth. The read depth is certainly becoming lower while filtering the original BAM file. The problem is that the bamout file has NO reads at all for the region, while the filtered BAM file does. Playing with parameters can result in some reads being present in the bamout file (IGV screenshot #3), but calls are still not made within the regions covered by those reads.

    Do you imply that I should run the alignment with bwa mem and only then you will consider this issue? The problem seems to be related not to the aligner but to the calling process. You can think of the data as an artificial BAM and then any bowtie/bwa/whatever related issues disappear.

  • danilovkiridanilovkiri Moscow, RussiaMember

    Dear @Geraldine_VdAuwera,

    I'm not quite sure if it is polite but I would be grateful if you could elaborate on this question when you have time.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Sorry @danilovkiri, but again, we don't support questions that use aligners besides BWA or STAR. It makes the scope of what the GATK support team has to cover too broad and we are a small team. We certainly wouldn't want to mislead you with answers that could be wrong because we did not completely understand the implications of aligning with some unfamiliar aligner. I hope you can solve your issue.

Sign In or Register to comment.