Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

HaplotypeCaller doesn't call true variants which are located on the outside of duplicated reads

tuxidotuxido Posts: 7Member

Hi,

I was running the haplotypeCaller for many samples, but some variants (validated as true positives by using other techniques) within these samples are not called by the haplotypeCaller. I saw in the bam files that most of these variants are located on the outside of duplicated reads (around 200 reads). Most of my data consists of duplicated reads. First I thought that the duplicated reads were filtered out by the read filters which are automatically applied (like duplicateReadFilter), but when I checked it this was not the case. I was wondering why my true variants are not called by the HaplotypeCaller and if there is an option to resolve this problem?

Thank you!

Tagged:

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin
    edited November 2013

    Hi there,

    You can try specifying a bam output file to examine the results of the local assembly produced by HaplotypeCaller. This may help you understand what the program thinks is going on in there.

    http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_haplotypecaller_HaplotypeCaller.html#--bamOutput

    Post edited by Geraldine_VdAuwera on

    Geraldine Van der Auwera, PhD

  • tuxidotuxido Posts: 7Member

    Hi Geraldine,

    yes so I think we figured out why the variants are not reported. We have reads that are all exactly the same and the variant is located at the very end of the reads. According to the documentation it seems that GATK therefore thinks it is a likely artifact and does not report it. However in our specific data we typically have a couple of 100 times coverage and we know that the quality of our reads is not deteriorated at the end. So my question is: is there an option that tells GATK to not take into consideration the position within a read for calling a variant. Let me know if we need to send you any additional information our data from our side to make it more clear.

    Kind regards,

    Christian Gilissen

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin

    Hi Christian,

    I understand -- unfortunately there is no way to modify the behaviour re: handling end-of-read evidence from command line. The typical approach to dealing with variants that are on the edge of an interval of interest (for exome or small target experiments) is to make sure to pad the capture targets before sequencing to ensure that there will be overlapping coverage.

    You can always try lowering the confidence thresholds for calling, or try to call the site in Genotype Given Alleles mode. Are the base and mapping quality scores good at that site?

    Geraldine Van der Auwera, PhD

  • tuxidotuxido Posts: 7Member

    Hi Geraldine,

    So our reads are indeed very specific, which I think causes the problem. I've attached a picture from IGV (condensed reads view) to hopefully make it more clear. So any target (i.e. exon) is covered by multiple different "reads" (in this case 4, you can see 4 blocks). Reads overlap each other only a few bases. Reads within the block are identical. Now as you can see if a variant is called at a read's end, even though we have really a lot of coverage, GATK is not calling it because the variant is always at the read's end. We tried lowering the "stand_call_conf" and the "stand_emit_conf to 5", but that did not help. I think you already answered my question, but just to be sure: can GATK be used to call these variants in this kind of data?

    IGV_reads.png
    1152 x 741 - 21K
  • pdexheimerpdexheimer Posts: 299Member, GSA Collaborator ✭✭✭

    This looks like Haloplex data, which has always made me nervous (I really don't like the low complexity of reads that it produces). Just to be clear, you're NOT Marking Duplicates with this data, correct? And have you checked the position of your variant relative to the regions you're calling on?

  • tuxidotuxido Posts: 7Member

    Hi,

    so these are not haloplex data :) The sequencing was actually done on an illumina hiseq. No we're not marking duplicates, and we checked that all the default filters were not removing too many reads or anything by applying a single filter and then looking at the resulting bam file. None of the default filters seems to remove a significant amount of reads. We're not really specifying in advance what regions to call on. Is that required? We're running GATK (version 2.7-4-g6f46d11) with the following options:

    -T HaplotypeCaller -R /data/references/hg19/ref_hg19.fasta -I bam_file.bam --dbsnp /data/references/hg19/dbsnp_137.hg19.vcf -stand_call_conf 30.0 -stand_emit_conf 30.0 -o variants.vcf -rf BadCigar -nct 8

  • ebanksebanks Posts: 671GSA Member mod

    I just want to correct some things stated earlier in this thread. First, the Haplotype Caller definitely ignores reads that are marked as duplicates. Second, the Haplotype Caller is just as likely to call variants at the ends of reads as it is for those in the middle of reads. The only constraint is that it needs to see one or more reads actually span the variant (so that it can build a connected assembly graph). In general your data type is really not ideal but it's not inherently problematic either. Maybe you could post an example of a SNP that isn't being called?

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • tuxidotuxido Posts: 7Member

    Here are two examples.

    The first two pictures represent a SNP which isn't called by the Haplotype Caller. In the two pictures the bam file above is the original bam file which has the variant and the second bam file is the output retrieved from the bamout option with ALL_POSSIBLE_HAPLOTYPES as bamWriterType. You can see that the variants isn't a candidate variant anymore, because the reads are filtered out.

    The last two pictures represent a SNP which is called by the Haplotype Caller. Again above the original bam file and the second bam file is retrieved from the bamout option. This SNP is known in the dbSNP database and you can see that it is called properly.

    igv_snapshot1.png
    1680 x 879 - 30K
    igv_snapshot2.png
    1680 x 879 - 28K
    igv_snapshot4.png
    1680 x 879 - 30K
    igv_snapshot5.png
    1680 x 879 - 30K
  • ebanksebanks Posts: 671GSA Member mod

    I don't think the reads are being filtered out. Rather I think it's simply that this data type really just doesn't work well in the world of assembly. Having short, non-overlapping reads means that you won't be able to build a connected graph without ambiguous cycles. I would recommend that you not use the HC if you are forced to use this constrained data type.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • tuxidotuxido Posts: 7Member

    Ok, thanks, we'll try out some other variant callers then :(

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin

    Try the UnifiedGenotyper, it doesn't have the HC's requirements since it doesn't do any assembly.

    Geraldine Van der Auwera, PhD

  • tuxidotuxido Posts: 7Member

    Hi Geraldine, Unified Genotyper seems to work better for this type of data :) Thanks!

Sign In or Register to comment.