The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

#### ☞ Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ` ) each to make a code block as demonstrated here.

GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

# HaplotypeCaller doesn't call true variants which are located on the outside of duplicated reads

Member

Hi,

I was running the haplotypeCaller for many samples, but some variants (validated as true positives by using other techniques) within these samples are not called by the haplotypeCaller. I saw in the bam files that most of these variants are located on the outside of duplicated reads (around 200 reads). Most of my data consists of duplicated reads. First I thought that the duplicated reads were filtered out by the read filters which are automatically applied (like duplicateReadFilter), but when I checked it this was not the case. I was wondering why my true variants are not called by the HaplotypeCaller and if there is an option to resolve this problem?

Thank you!

Tagged:

edited November 2013

Hi there,

You can try specifying a bam output file to examine the results of the local assembly produced by HaplotypeCaller. This may help you understand what the program thinks is going on in there.

• Member

Hi Geraldine,

yes so I think we figured out why the variants are not reported. We have reads that are all exactly the same and the variant is located at the very end of the reads. According to the documentation it seems that GATK therefore thinks it is a likely artifact and does not report it. However in our specific data we typically have a couple of 100 times coverage and we know that the quality of our reads is not deteriorated at the end. So my question is: is there an option that tells GATK to not take into consideration the position within a read for calling a variant. Let me know if we need to send you any additional information our data from our side to make it more clear.

Kind regards,

Christian Gilissen

Hi Christian,

I understand -- unfortunately there is no way to modify the behaviour re: handling end-of-read evidence from command line. The typical approach to dealing with variants that are on the edge of an interval of interest (for exome or small target experiments) is to make sure to pad the capture targets before sequencing to ensure that there will be overlapping coverage.

You can always try lowering the confidence thresholds for calling, or try to call the site in Genotype Given Alleles mode. Are the base and mapping quality scores good at that site?

• Member

Hi Geraldine,

So our reads are indeed very specific, which I think causes the problem. I've attached a picture from IGV (condensed reads view) to hopefully make it more clear.
So any target (i.e. exon) is covered by multiple different "reads" (in this case 4, you can see 4 blocks). Reads overlap each other only a few bases. Reads within the block are identical. Now as you can see if a variant is called at a read's end, even though we have really a lot of coverage, GATK is not calling it because the variant is always at the read's end.
We tried lowering the "stand_call_conf" and the "stand_emit_conf to 5", but that did not help. I think you already answered my question, but just to be sure: can GATK be used to call these variants in this kind of data?

• Member, Dev

This looks like Haloplex data, which has always made me nervous (I really don't like the low complexity of reads that it produces). Just to be clear, you're NOT Marking Duplicates with this data, correct? And have you checked the position of your variant relative to the regions you're calling on?

• Member

Hi,

so these are not haloplex data The sequencing was actually done on an illumina hiseq. No we're not marking duplicates, and we checked that all the default filters were not removing too many reads or anything by applying a single filter and then looking at the resulting bam file. None of the default filters seems to remove a significant amount of reads.
We're not really specifying in advance what regions to call on. Is that required? We're running GATK (version 2.7-4-g6f46d11) with the following options:

-T HaplotypeCaller
-R /data/references/hg19/ref_hg19.fasta
-I bam_file.bam
--dbsnp /data/references/hg19/dbsnp_137.hg19.vcf
-stand_call_conf 30.0
-stand_emit_conf 30.0
-o variants.vcf
-nct 8

I just want to correct some things stated earlier in this thread. First, the Haplotype Caller definitely ignores reads that are marked as duplicates. Second, the Haplotype Caller is just as likely to call variants at the ends of reads as it is for those in the middle of reads. The only constraint is that it needs to see one or more reads actually span the variant (so that it can build a connected assembly graph). In general your data type is really not ideal but it's not inherently problematic either. Maybe you could post an example of a SNP that isn't being called?

• Member

Here are two examples.

The first two pictures represent a SNP which isn't called by the Haplotype Caller. In the two pictures the bam file above is the original bam file which has the variant and the second bam file is the output retrieved from the bamout option with ALL_POSSIBLE_HAPLOTYPES as bamWriterType. You can see that the variants isn't a candidate variant anymore, because the reads are filtered out.

The last two pictures represent a SNP which is called by the Haplotype Caller. Again above the original bam file and the second bam file is retrieved from the bamout option. This SNP is known in the dbSNP database and you can see that it is called properly.

I don't think the reads are being filtered out. Rather I think it's simply that this data type really just doesn't work well in the world of assembly. Having short, non-overlapping reads means that you won't be able to build a connected graph without ambiguous cycles. I would recommend that you not use the HC if you are forced to use this constrained data type.

• Member

Ok, thanks, we'll try out some other variant callers then

Try the UnifiedGenotyper, it doesn't have the HC's requirements since it doesn't do any assembly.

• Member

Hi Geraldine,
Unified Genotyper seems to work better for this type of data Thanks!

• Member

Hi,

I have a related question. So I didn't open a new thread.
If I call variants with HC I get calls for candidates and in the bam output the candidates are visible, but I couldn't see the same amount of reads in the bam output.
So I used the following command for calling:

java -Xmx12G -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fasta -I rmdup_realigned_recal.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --dbsnp human_9606_00-All.vcf -o snp_indel.raw.vcf -stand_call_conf 50.0 -stand_emit_conf 10.0 -L exome_targetedregions.bed;

and further to create the bam output:

java -Xmx12G -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fasta -I rmdup_realigned_recal.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --dbsnp human_9606_00-All.vcf -o snp_indel.raw.vcf -bamout hc.bam -stand_call_conf 50.0 -stand_emit_conf 10.0 -L some_candidate_regions.bed;

And for example I get a call of 10,4 or another 2,2
but in the bamoutput I see 12,6 or 3,3 respectively.

First I thought some reads are filtered out but AD is described as "AD is the unfiltered allele depth".
If there is a internal filtering in HC, how can I identify the reads which are filtered out?
Or did I missunderstand something?

thank you.