The current GATK version is 3.6-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Powered by Vanilla. Made with Bootstrap.

HaplotypeCaller doesn't call true variants which are located on the outside of duplicated reads

tuxidotuxido Member Posts: 7

Hi,

I was running the haplotypeCaller for many samples, but some variants (validated as true positives by using other techniques) within these samples are not called by the haplotypeCaller. I saw in the bam files that most of these variants are located on the outside of duplicated reads (around 200 reads). Most of my data consists of duplicated reads. First I thought that the duplicated reads were filtered out by the read filters which are automatically applied (like duplicateReadFilter), but when I checked it this was not the case. I was wondering why my true variants are not called by the HaplotypeCaller and if there is an option to resolve this problem?

Thank you!

Tagged:

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,712 admin
    edited November 2013

    Hi there,

    You can try specifying a bam output file to examine the results of the local assembly produced by HaplotypeCaller. This may help you understand what the program thinks is going on in there.

    http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_haplotypecaller_HaplotypeCaller.html#--bamOutput

    Geraldine Van der Auwera, PhD

  • tuxidotuxido Member Posts: 7

    Hi Geraldine,

    yes so I think we figured out why the variants are not reported. We have reads that are all exactly the same and the variant is located at the very end of the reads. According to the documentation it seems that GATK therefore thinks it is a likely artifact and does not report it. However in our specific data we typically have a couple of 100 times coverage and we know that the quality of our reads is not deteriorated at the end. So my question is: is there an option that tells GATK to not take into consideration the position within a read for calling a variant. Let me know if we need to send you any additional information our data from our side to make it more clear.

    Kind regards,

    Christian Gilissen

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,712 admin

    Hi Christian,

    I understand -- unfortunately there is no way to modify the behaviour re: handling end-of-read evidence from command line. The typical approach to dealing with variants that are on the edge of an interval of interest (for exome or small target experiments) is to make sure to pad the capture targets before sequencing to ensure that there will be overlapping coverage.

    You can always try lowering the confidence thresholds for calling, or try to call the site in Genotype Given Alleles mode. Are the base and mapping quality scores good at that site?

    Geraldine Van der Auwera, PhD

  • tuxidotuxido Member Posts: 7

    Hi Geraldine,

    So our reads are indeed very specific, which I think causes the problem. I've attached a picture from IGV (condensed reads view) to hopefully make it more clear.
    So any target (i.e. exon) is covered by multiple different "reads" (in this case 4, you can see 4 blocks). Reads overlap each other only a few bases. Reads within the block are identical. Now as you can see if a variant is called at a read's end, even though we have really a lot of coverage, GATK is not calling it because the variant is always at the read's end.
    We tried lowering the "stand_call_conf" and the "stand_emit_conf to 5", but that did not help. I think you already answered my question, but just to be sure: can GATK be used to call these variants in this kind of data?

    IGV_reads.png
    1152 x 741 - 21K
  • pdexheimerpdexheimer Member, Dev Posts: 543 ✭✭✭✭

    This looks like Haloplex data, which has always made me nervous (I really don't like the low complexity of reads that it produces). Just to be clear, you're NOT Marking Duplicates with this data, correct? And have you checked the position of your variant relative to the regions you're calling on?

  • tuxidotuxido Member Posts: 7

    Hi,

    so these are not haloplex data :) The sequencing was actually done on an illumina hiseq. No we're not marking duplicates, and we checked that all the default filters were not removing too many reads or anything by applying a single filter and then looking at the resulting bam file. None of the default filters seems to remove a significant amount of reads.
    We're not really specifying in advance what regions to call on. Is that required? We're running GATK (version 2.7-4-g6f46d11) with the following options:

    -T HaplotypeCaller
    -R /data/references/hg19/ref_hg19.fasta
    -I bam_file.bam
    --dbsnp /data/references/hg19/dbsnp_137.hg19.vcf
    -stand_call_conf 30.0
    -stand_emit_conf 30.0
    -o variants.vcf
    -rf BadCigar
    -nct 8

  • ebanksebanks Broad InstituteMember, Administrator, Broadie, Moderator, Dev Posts: 698 admin

    I just want to correct some things stated earlier in this thread. First, the Haplotype Caller definitely ignores reads that are marked as duplicates. Second, the Haplotype Caller is just as likely to call variants at the ends of reads as it is for those in the middle of reads. The only constraint is that it needs to see one or more reads actually span the variant (so that it can build a connected assembly graph). In general your data type is really not ideal but it's not inherently problematic either. Maybe you could post an example of a SNP that isn't being called?

    Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

  • tuxidotuxido Member Posts: 7

    Here are two examples.

    The first two pictures represent a SNP which isn't called by the Haplotype Caller. In the two pictures the bam file above is the original bam file which has the variant and the second bam file is the output retrieved from the bamout option with ALL_POSSIBLE_HAPLOTYPES as bamWriterType. You can see that the variants isn't a candidate variant anymore, because the reads are filtered out.

    The last two pictures represent a SNP which is called by the Haplotype Caller. Again above the original bam file and the second bam file is retrieved from the bamout option. This SNP is known in the dbSNP database and you can see that it is called properly.

    igv_snapshot1.png
    1680 x 879 - 30K
    igv_snapshot2.png
    1680 x 879 - 28K
    igv_snapshot4.png
    1680 x 879 - 30K
    igv_snapshot5.png
    1680 x 879 - 30K
  • ebanksebanks Broad InstituteMember, Administrator, Broadie, Moderator, Dev Posts: 698 admin

    I don't think the reads are being filtered out. Rather I think it's simply that this data type really just doesn't work well in the world of assembly. Having short, non-overlapping reads means that you won't be able to build a connected graph without ambiguous cycles. I would recommend that you not use the HC if you are forced to use this constrained data type.

    Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

  • tuxidotuxido Member Posts: 7

    Ok, thanks, we'll try out some other variant callers then :(

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,712 admin

    Try the UnifiedGenotyper, it doesn't have the HC's requirements since it doesn't do any assembly.

    Geraldine Van der Auwera, PhD

  • tuxidotuxido Member Posts: 7

    Hi Geraldine,
    Unified Genotyper seems to work better for this type of data :) Thanks!

  • airtimeairtime Member Posts: 28

    Hi,

    I have a related question. So I didn't open a new thread.
    If I call variants with HC I get calls for candidates and in the bam output the candidates are visible, but I couldn't see the same amount of reads in the bam output.
    So I used the following command for calling:

    java -Xmx12G -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fasta -I rmdup_realigned_recal.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --dbsnp human_9606_00-All.vcf -o snp_indel.raw.vcf -stand_call_conf 50.0 -stand_emit_conf 10.0 -L exome_targetedregions.bed;

    and further to create the bam output:

    java -Xmx12G -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fasta -I rmdup_realigned_recal.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --dbsnp human_9606_00-All.vcf -o snp_indel.raw.vcf -bamout hc.bam -stand_call_conf 50.0 -stand_emit_conf 10.0 -L some_candidate_regions.bed;

    And for example I get a call of 10,4 or another 2,2
    but in the bamoutput I see 12,6 or 3,3 respectively.

    First I thought some reads are filtered out but AD is described as "AD is the unfiltered allele depth".
    If there is a internal filtering in HC, how can I identify the reads which are filtered out?
    Or did I missunderstand something?

    thank you.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 4,291 admin
    edited October 2015

    @airtime
    Hi,

    I suspect some of the extra reads are the Artificial Haplotypes created by HaplotypeCaller. Can you try coloring the reads in IGV by sample? (Right click -> Color alignments by -> sample)

    -Sheila

Sign In or Register to comment.