This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!
HaplotypeCaller reassembles reads so that no variants are called in a specific region
java openjdk version "1.8.0_181"
I've encountered a problem while running HaplotypeCaller in
-ERC BP_RESOLUTION --genotyping-mode DISCOVERY --output-mode EMIT_ALL_CONFIDENT_SITES mode on a single-sample paired-end WGS data. The problem is that HC reassembles reads so that a particular region of interest contains no supporting reads in the corresponding
bamout file while the original BAM file does contain them. The region of interest is the CYP2D6 gene on chr22.
After trimming the reads I align them to GRCh38 with Bowtie2:
bowtie2 --threads 16 --trim5 0 --trim3 0 --phred33 --no-mixed --no-discordant --no-unal --local -x /hg38/bowtie2/hg38 -q -1 /trimmomatic_output/sample_R1.paired.fastq.gz -2 /trimmomatic_output/sample_R2.paired.fastq.gz -t -S /bowtie2_output/sample_raw.sam --met-file /path_to_met_file
This is followed by sorting the resulting SAM file and its conversion to BAM, running Picard's MarkDuplicates on the resulting BAM and further filtering using samtools with flags
-q 10 -f 0x2 -F 0x4 -F 0x100 -F 0x400 -F 0x800. BQSR is applied next followed by calling on separate chromosomes/ROIs:
/home/ubuntu/Tools/gatk-188.8.131.52/gatk HaplotypeCaller --reference /hg38.fa --input /bqsr_output/sample.bqsr.bam --genotyping-mode DISCOVERY --output-mode EMIT_ALL_CONFIDENT_SITES --read-filter NotSecondaryAlignmentReadFilter --read-filter NotDuplicateReadFilter --read-filter MappingQualityAvailableReadFilter --read-filter NotSupplementaryAlignmentReadFilter --smith-waterman FASTEST_AVAILABLE --min-base-quality-score 10 --annotation-group StandardAnnotation --annotation-group StandardHCAnnotation --output /gatk_gvcf_output/sample.diploid.raw.g.vcf.gz --emit-ref-confidence BP_RESOLUTION --sample-ploidy 2 --native-pair-hmm-threads 16 -L chr22
The resulting files are combined with
CombineGVCFs followed by genotype calling using
GenotypeGVCFs (gatk 3.8-1-0-gf15c1c3ef). This old version is used due to the fact that
--includeNonVariantSites option is not implemented in up-to-date package (as far as I know), while I need to get the reference calls for further analysis.
The sample output from final VCF file from
GenotypeGVCFs for all positions in a region of interest looks like this (zero AD, DP, which seems reasonable as bamout contains no supporting reads):
chr22 42130000 . G . . . . GT:AD:DP:RGQ ./.:0,0:0:0
The IGV screenshot of the original BAM file after
samtools filtering and bamout BAM file from HC is attached below.
I have digged into some of the old threads for similar problems (link, link, link, link) and found no appropriate solution. I tried to run HC as above but with adding
--dont-trim-active-regions True --min-dangling-branch-length 1 --min-pruning 1 --disable-optimizations True --allow-non-unique-kmers-in-ref parameters, which helped in some of the above-mentioned cases. It resulted in complete absence of supporting reads in the bamout (the upper empty panel is for bamout output, the lower is for original BAM) as in the previous case:
Finally I tried running HC with
--dont-trim-active-regions True --min-dangling-branch-length 1 --min-pruning 1 --disable-optimizations True --allow-non-unique-kmers-in-ref and
--read-filter AllowAllReadsReadFilter --disable-tool-default-read-filters True. The IGV screenshot and its zoomed region are shown below:
As you can see, there are some regions with supporting reads in the bamout, but no calls are made within those regions.
chr22 42128819 . C . . . . GT:AD:DP:RGQ ./.:0,0:0:0 chr22 42128813 . G . . . . GT:AD:DP:RGQ ./.:0,0:0:0
I don't understand why is that happening as
samtools view -q 10 has already filtered reads with MAPQ below 10 (which is a default value for
MappingQualityReadFilter). It seems like HC filters everything out based on some criteria, but even given some supporting reads in the bamout like in the example above HC doesn't make any calls in the regions with those reads.
Any help would be greatly appreciated.