False Homozygous Variant call within a repeat?

drmjc, Garvan Institute of Medical Research, Member
edited August 2015 in Ask the GATK team

Hi,
I'm hoping you can help resolve the behaviour of HaplotypeCaller with respect to a certain position.

Here's the IGV screenshot, with these filters: MQ>30, filter secondaries and dups. The DP is 14-18 across this deletion.
[image: IGV screenshot of the region]

HC called a TATA deletion in this proband, with this gvcf call:
5 67597220 rs71655141 GTATA G,<NON_REF> 95.14 . DB;DP=12;MLEAC=2,0;MLEAF=1.00,0.00;MQ=57.93;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,3,0:3:10:132,10,0,132,10,132:0,0,1,2

It's calling this as GT=1/1 with AD=0,3.
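
To make the fields in that record easier to read, here is a minimal Python sketch that splits the sample column and maps each PL value to its genotype. It assumes the standard VCF ordering for diploid genotype likelihoods (genotype j/k with j <= k sits at index k*(k+1)/2 + j); the function names `parse_sample` and `genotype_pls` are made up for illustration and are not part of GATK.

```python
# The gVCF record quoted in the post, as whitespace-separated columns.
RECORD = (
    "5 67597220 rs71655141 GTATA G,<NON_REF> 95.14 . "
    "DB;DP=12;MLEAC=2,0;MLEAF=1.00,0.00;MQ=57.93;MQ0=0 "
    "GT:AD:DP:GQ:PL:SB "
    "1/1:0,3,0:3:10:132,10,0,132,10,132:0,0,1,2"
)


def parse_sample(record):
    """Return (alleles, sample_fields) for a single-sample VCF record."""
    cols = record.split()
    alleles = [cols[3]] + cols[4].split(",")          # REF followed by ALTs
    sample = dict(zip(cols[8].split(":"), cols[9].split(":")))
    return alleles, sample


def genotype_pls(alleles, pl_field):
    """Map each diploid genotype 'j/k' to its PL using the VCF index rule."""
    pls = [int(x) for x in pl_field.split(",")]
    out = {}
    for k in range(len(alleles)):
        for j in range(k + 1):
            out[f"{j}/{k}"] = pls[k * (k + 1) // 2 + j]
    return out


alleles, sample = parse_sample(RECORD)
pls = genotype_pls(alleles, sample["PL"])
ad = dict(zip(alleles, (int(x) for x in sample["AD"].split(","))))
best = min(pls, key=pls.get)                          # genotype with PL == 0

print("allele depths:", ad)
print("genotype PLs :", pls)
print("best genotype:", best, "| FORMAT DP:", sample["DP"], "| GQ:", sample["GQ"])
```

With only 3 reads counted in AD, all supporting the deletion, 1/1 gets PL=0 and the next-best genotypes (0/1 and 1/2) are only 10 Phred units behind, which matches the reported GQ of 10.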

This is likely all noise, and it is a tough region of the genome to make a call in, but I'm curious why the depth is 3 and how HC handles the multiple overlapping deletions, i.e. why it makes only the delTATA call.

I'm using GATK 3.3, and following best practices.

cheers,
Mark

Answers

  • Sheila, Broad Institute, Member, Broadie, Moderator admin

    @drmjc
    Hi Mark,

    Can you post the bamout file of the region? I am guessing the screenshot in your post is of the original bam file. https://www.broadinstitute.org/gatk/guide/article?id=5484

    Thanks,
    Sheila

  • drmjc, Garvan Institute of Medical Research, Member

    Thanks for the advice Sheila,

    I've set IGV to: reads coloured by RG, MQ>20, filter dups, and filter secondaries:
    [image: IGV view of the bamout]

    HC sure has done a nice job cleaning up this region. There appears to be 23x coverage (6x ref / 17x deleted), and the delTATA call is now more apparent.
    The GVCF still has GT=1/1, DP=3, AD=0,3, so I'm still not sure how that call was made. What do you think is going on?

    Here's how I ran GATK:
    java -jar $JAR -T HaplotypeCaller \
        -R human_g1k_v37_decoy.fasta \
        -I SW.PIK3R1.bam \
        -o SW.PIK3R1.g.vcf.gz \
        -L $pos \
        -bamout SW.PIK3R1.bamout.bam \
        -ERC GVCF \
        -variant_index_type LINEAR -variant_index_parameter 128000 \
        -pairHMM VECTOR_LOGLESS_CACHING
    From the output:

    INFO  13:54:30,409 MicroScheduler - 65 reads were filtered out during the traversal out of approximately 256 total reads (25.39%) 
    INFO  13:54:30,409 MicroScheduler -   -> 52 reads (20.31% of total) failing DuplicateReadFilter 
    INFO  13:54:30,410 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter 
    INFO  13:54:30,410 MicroScheduler -   -> 13 reads (5.08% of total) failing HCMappingQualityFilter 
    INFO  13:54:30,410 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter 
    INFO  13:54:30,410 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter 
    INFO  13:54:30,411 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter 
    INFO  13:54:30,411 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter 
    

    And an excerpt from the g.vcf:

    5   67597213    .   T   <NON_REF>   .   .   END=67597218    GT:DP:GQ:MIN_DP:PL  0/0:24:9:23:0,9,135
    5   67597219    .   T   <NON_REF>   .   .   END=67597219    GT:DP:GQ:MIN_DP:PL  0/0:23:3:23:0,3,45
    5   67597220    .   GTATA   G,<NON_REF> 95.14   .   DP=12;MLEAC=2,0;MLEAF=1.00,0.00;MQ=57.93;MQ0=0  GT:AD:DP:GQ:PL:SB   1/1:0,3,0:3:10:132,10,0,132,10,132:0,0,1,2
    5   67597225    .   T   <NON_REF>   .   .   END=67597228    GT:DP:GQ:MIN_DP:PL  0/0:21:0:21:0,0,326
    5   67597229    .   T   <NON_REF>   .   .   END=67597229    GT:DP:GQ:MIN_DP:PL  0/0:20:22:20:0,22,588
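
The mismatch being asked about (depth in the low 20s in the surrounding reference blocks, but DP=3 at the variant site) can be laid out side by side with a short Python sketch. It only reads the five records quoted above; the `summarize` helper is a hypothetical name, and treating a lone `<NON_REF>` ALT as a reference block is an assumption that holds for this excerpt.

```python
# The five g.vcf records from the excerpt, whitespace-normalised.
EXCERPT = """\
5 67597213 . T <NON_REF> . . END=67597218 GT:DP:GQ:MIN_DP:PL 0/0:24:9:23:0,9,135
5 67597219 . T <NON_REF> . . END=67597219 GT:DP:GQ:MIN_DP:PL 0/0:23:3:23:0,3,45
5 67597220 . GTATA G,<NON_REF> 95.14 . DP=12;MLEAC=2,0;MLEAF=1.00,0.00;MQ=57.93;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,3,0:3:10:132,10,0,132,10,132:0,0,1,2
5 67597225 . T <NON_REF> . . END=67597228 GT:DP:GQ:MIN_DP:PL 0/0:21:0:21:0,0,326
5 67597229 . T <NON_REF> . . END=67597229 GT:DP:GQ:MIN_DP:PL 0/0:20:22:20:0,22,588
"""


def summarize(text):
    """Return (start, end, kind, GT, FORMAT DP) for each gVCF record."""
    rows = []
    for line in text.strip().splitlines():
        _chrom, pos, _id, ref, alt, _qual, _flt, info, fmt, smp = line.split()
        fields = dict(zip(fmt.split(":"), smp.split(":")))
        info_map = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
        start = int(pos)
        is_block = alt == "<NON_REF>"          # assumption: lone <NON_REF> = ref block
        end = int(info_map["END"]) if is_block else start + len(ref) - 1
        kind = "ref_block" if is_block else "variant"
        rows.append((start, end, kind, fields["GT"], int(fields["DP"])))
    return rows


rows = summarize(EXCERPT)
for start, end, kind, gt, dp in rows:
    print(f"{start}-{end}  {kind:9s}  GT={gt}  DP={dp}")
```

The reference blocks on either side report DP between 20 and 24, while the variant record spanning 67597220-67597224 reports a sample-level DP of 3 (and an INFO DP of 12), which is the discrepancy in question.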
    

    cheers,
    Mark

  • Sheila, Broad Institute, Member, Broadie, Moderator admin

    @drmjc
    Hi Mark,

    Thanks. Notice there are two different colors of reads in the bamout. One color (I suspect the red) represents the artificial haplotypes produced by HaplotypeCaller. The other color (I suspect the blue) represents the actual reads from your data. You can read more about haplotype construction here: https://www.broadinstitute.org/gatk/guide/article?id=4146

    Can you check the mapping qualities and base qualities (bases before and after the deletion) of the actual sample reads? I think there may be some reads that do not have good qualities and that is why they are getting filtered.

    Thanks,
    Sheila

  • drmjc, Garvan Institute of Medical Research, Member

    Indeed, the red reads are from the ArtificialHaplotypes. There are 12 different haplotypes at this site, representing many of the possible combinations of the 4 variants. All have MQ=60. The BD quals around the delTATA variant are about H-J, which, if I've got it right, are 39-41.
    What choices does HC make when there are so many haplotypes?
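
The letter-to-score conversion in this post can be checked with a few lines of Python. This is a sketch that assumes the usual Phred+33 ASCII encoding used for quality strings in SAM/BAM text; the helper names are illustrative only.

```python
def phred33(chars):
    """Decode Phred+33 quality characters into integer quality scores."""
    return [ord(c) - 33 for c in chars]


def error_probability(q):
    """Convert a Phred-scaled quality into an error probability."""
    return 10 ** (-q / 10)


scores = phred33("HIJ")
for ch, q in zip("HIJ", scores):
    print(f"{ch} -> Q{q} (error probability {error_probability(q):.1e})")
```

Under that encoding H, I and J decode to 39, 40 and 41, so the reading of the quality letters given in the post is right.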

    cheers,
    Mark
