Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office on November 11th and 13th 2019, due to the U.S. holiday(Veteran's day) and due to a team event(Nov 13th). We will return to monitoring the GATK forum on November 12th and 14th respectively. Thank you for your patience.

HaplotypeCaller lost 80% variants in a sample

Hi,

We have several close-related samples each contain 3 million variants relative to the reference genome (both UG and HC give this results for most samples), while HC only give less than 1 million for a few samples. I searched the forum and did some test, and failed to figure out why this happen.

Here goes some detailed descriptions:

I tested two samples, those two look very similar, and the mapping results both seem proper with good mapping quality (>20)

HC gives proper varaint results and output bam file for sample1:

image

However, the sample2 gives the similar mapping results, but HC did not give any variants, the output bam file contain no sequence, after adding "--disableOptimizations --forceActive" options, the output bam file contain one sequence without any variants

image

The strangest thing is when I tried joint-calling of two samples, those variants called in sample1 also get disappeared, and the output bam file is sample as the sample2

image

Those reads were generated from Hiseq4000 platform, most base qualities are also high (Illumina seems changed the highest quality value to 42, I'm not sure whether this could affect, however other samples seem proper ...).

Thank you very much in advance!

Wang Long

Tagged:

Issue · Github
by Sheila

Issue Number
1103
State
closed
Last Updated
Assignee
Array
Closed By
chandrans

Best Answer

Answers

  • Wang LongWang Long ChinaMember

    After some further test, I found this problem was triggered only due to a single read in sample2:

    K00137:24:H2TVJBBXX:1:2222:14306:43404 83 Pp01 32030 40 52M1I97M = 31851 -328 ATTACAGATTAAGGATTCGAATGGATACAAAAGATCTGCTGAAACATGAATGAAAAAAATGAACCGATTCAACAATACATGTCTGTACTGTGAGCTAGGGTTAGCTGTAAAAATGGCTAGAAGTTGCCAATTTGATTGGAAAAAGTATTA KKKKFKKFKKFAAAKKKFFKFKKKKKKKKKKFF<AKKKKKKKKKFKKKKKKFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFFFAA MC:Z:150M MD:Z:6T3G26A3G0G21A16T5C35G25 PG:Z:MarkDuplicates RG:Z:G1-12 NM:i:10 MQ:i:45 AS:i:97 XS:i:106

    Remove of this single read gives proper variant results.

    When runing in BP_RESOLUTION mode, this read seem to cause a genotyping failure in variant site, which gives results like

    Pp01 32040 . G . . . GT:AD:DP:GQ:PL 0/0:0,14:14:0:0,0,0

    After removing this read, the results changed to

    Pp01 32040 . G A, 556.77 . DP=13;MLEAC=2,0;MLEAF=1.00,0.00;MQ=51.29;MQ0=0 GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:0,13,0:13:39:0|1:32036_T_G:585,39,0,585,39,585:0,0,7,6

    This seems rather weird to me, as I could not figure out why this read differs from others and could lead to the whole failure of genotyping. Here are same example of other reads:

    K00137:24:H2TVJBBXX:1:2124:18265:3637 99 Pp01 32002 53 80M1I69M = 32175 323 TTTGATTGGAAAAAGTATGAGAATTCAAATTACAGATTAAGGATTCGAATGGATACAAAAGATCTGCTGAAACATGAATGAAAAAAATGAACCGATTCAACAATACATGTCTGTACTGTGAGCTAGGGTTAGCTGTAAAAATGGCTAGAA AAFFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKFK7FFAFFFKKKKKKF<FFKKFKK XA:Z:Pp08,+6832,150M,12;Pp06,-15884,20S130M,9;Pp05,-18463500,20S47M1D83M,9; MC:Z:150M MD:Z:34T3G26A3G0G21A16T5C33 PG:Z:MarkDuplicates RG:Z:G1-12 NM:i:9 MQ:i:60 AS:i:102 XS:i:94
    K00137:24:H2TVJBBXX:1:1106:26740:16734 99 Pp01 32015 55 67M1I82M = 32211 332 AGTATGAGAATTCAAATTACAGATTAAGGATTCGAATGGATACAAAAGATCTGCTGAAACATGAATGAAAAAAATGAACCGATTCAACAATACATGTCTGTACTGTGAGCTAGGGTTAGCTGTAAAAATGGCTAGAAGTGGCCAATTTGA AAAFFFKKKKKKKKKKKKKFKKKKFKKKFKKKKKKKKKKKKKKFKKKKKKKKFFKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFFKKKKKKKKKKKKKKKKFKFKKKKKKAKKKFAFK XA:Z:Pp08,+5899,72M1D78M,11; MC:Z:14S136M MD:Z:21T3G26A3G0G21A16T5C46 PG:Z:MarkDuplicates RG:Z:G1-12 NM:i:9 MQ:i:60 AS:i:102 XS:i:93
    K00137:24:H2TVJBBXX:1:2203:27440:47060 99 Pp01 32032 60 50M1I99M = 32198 316 TACAGATTAAGGATTCGAATGGATACAAAAGATCTGCTGAAACATGAATGAAAAAAATGAACCGATTCAACAATACATGTCTGTACTGTGAGCTAGGGTTAGCTGTAAAAATGGCTAGAAGTGGCCAATTTGATTGGAAAAAGTATTAGA AAFFFKKKFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKAKKKKKKKKKKKKKKKKFKKAFKKKKKKKKKKKKKFKFKKKKFFKKKKKKFKFAFKKKKKKKKKFKKAFKKFK<A77AKKKFKKKFFKKFKKKKAKFKKF MC:Z:150M MD:Z:4T3G26A3G0G21A16T5C63 PG:Z:MarkDuplicates RG:Z:G1-12 NM:i:9 MQ:i:60 AS:i:103 XS:i:70
    K00137:24:H2TVJBBXX:1:1107:17737:12744 83 Pp01 32039 60 43M1I106M = 31866 -322 TAAGGATTCGAATGGATACAAAAGATCTGCTGAAACATGAATGAAAAAAATGAACCGATTCAACAATACATGTCTGTACTGTGAGCTAGGGTTAGCTGTAAAAATGGCTAGAAGTGGCCAATTTGATTGGAAAAAGTATTAGAATTCAAA AKKFKKFFKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFFFAA MC:Z:150M MD:Z:1G26A3G0G21A16T5C70 PG:Z:MarkDuplicates RG:Z:G1-12 NM:i:8 MQ:i:60 AS:i:110 XS:i:71
    K00137:24:H2TVJBBXX:1:2122:23604:38868 99 Pp01 32045 60 37M1I112M = 32215 312 TTCGAATGGATACAAAAGATCTGCTGAAACATGAATGAAAAAAATGAACCGATTCAACAATACATGTCTGTACTGTGAGCTAGGGTTAGCTGTAAAAATGGCTAGAAGTGGCCAATTTGATTGGAAAAAGTATTAGAATTCAAATTAGAG AAFFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKFFKKKKKKKKF MC:Z:8S142M MD:Z:22A3G0G21A16T5C76 PG:Z:MarkDuplicates RG:Z:G1-12 NM:i:7 MQ:i:60 AS:i:112 XS:i:70

    Wish anyone can help.

  • SheilaSheila Broad InstituteMember, Broadie admin

    @Wang Long
    Hi Wang Long,

    This is very odd. A single read should really not cause such a failure. Have you tried running Picard's Validate Sam File on the bam file?

    Can you please tell me which version of GATK you are using and the exact command line you used to get this strange result?

    Thanks,
    Sheila

  • Wang LongWang Long ChinaMember

    Here is the command line I use:

    java -jar GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -R Ppersica_298_v2.0.fa \
    -T HaplotypeCaller -nct 1 -stand_call_conf 30.0 -stand_emit_conf 0.0 \
    -I test2.bam -L Pp01:32036-32117 -o test2.vcf -bamout test2_out.bam -ERC BP_RESOLUTION

    I used GATK version3.3, while 3.2 and 3.4 seem to give the same results.

    Picard's ValidateSamFile did not complain much except "Mate not found for paired read" as I only extract reads within the target region.

    java -jar picard-tools-1.128/picard.jar ValidateSamFile INPUT=test2.bam OUTPUT=test2.validate.txt

    ERROR: Read name K00137:24:H2TVJBBXX:1:2206:26842:14625, Mate not found for paired read
    ERROR: Read name K00137:24:H2TVJBBXX:1:2206:26771:14642, Mate not found for paired read
    ERROR: Read name K00137:24:H2TVJBBXX:1:1106:26740:16734, Mate not found for paired read
    ERROR: Read name K00137:24:H2TVJBBXX:1:1227:6379:12199, Mate not found for paired read
    ERROR: Read name K00137:24:H2TVJBBXX:1:1106:23482:27124, Mate not found for paired read
    ERROR: Read name K00137:24:H2TVJBBXX:1:2228:4095:8770, Mate not found for paired read
    ERROR: Read name K00137:24:H2TVJBBXX:1:2124:18265:3637, Mate not found for paired read
    ERROR: Read name K00137:24:H2TVJBBXX:1:2222:14306:43404, Mate not found for paired read
    ERROR: Read name K00137:24:H2TVJBBXX:1:2114:4786:16541, Mate not found for paired read
    ERROR: Read name K00137:24:H2TVJBBXX:1:1107:17737:12744, Mate not found for paired read
    ERROR: Read name K00137:24:H2TVJBBXX:1:2227:11901:30517, Mate not found for paired read
    ERROR: Read name K00137:24:H2TVJBBXX:1:2215:32455:27757, Mate not found for paired read
    ERROR: Read name K00137:24:H2TVJBBXX:1:2207:21787:30693, Mate not found for paired read
    ERROR: Read name K00137:24:H2TVJBBXX:1:1119:18793:48783, Mate not found for paired read
    ERROR: Read name K00137:24:H2TVJBBXX:1:2209:11952:44107, Mate not found for paired read
    ERROR: Read name K00137:24:H2TVJBBXX:1:2203:27440:47060, Mate not found for paired read
    ERROR: Read name K00137:24:H2TVJBBXX:1:1114:28780:8735, Mate not found for paired read
    ERROR: Read name K00137:24:H2TVJBBXX:1:1220:12672:18387, Mate not found for paired read
    ERROR: Read name K00137:24:H2TVJBBXX:1:2108:29785:36635, Mate not found for paired read
    ERROR: Read name K00137:24:H2TVJBBXX:1:2122:23604:38868, Mate not found for paired read

    Runing ValidateSamFile over whole bam file also gives no error:

    No errors found

    GATK says only 2 reads were removed due to duplicates

    INFO 19:30:03,359 MicroScheduler - 2 reads were filtered out during the traversal out of approximately 22 total reads (9.09%)
    INFO 19:30:03,359 MicroScheduler - -> 2 reads (9.09% of total) failing DuplicateReadFilter
    INFO 19:30:03,359 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
    INFO 19:30:03,360 MicroScheduler - -> 0 reads (0.00% of total) failing HCMappingQualityFilter
    INFO 19:30:03,360 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
    INFO 19:30:03,360 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
    INFO 19:30:03,360 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
    INFO 19:30:03,361 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

    Thanks,
    Wang Long

  • SheilaSheila Broad InstituteMember, Broadie admin

    @Wang Long
    Hi Wang Long,

    Thanks. Can you try running Picard's FixMateInformation on the bam? Also, can you explain step by step what you did before running Haplotype Caller? Did you follow GATK Best Practices?

    Thanks,
    Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi there, can you clarify what you meant when you said "Here are same example of other reads"? Do you mean they also cause problems in their regions, or do you mean that those are the other reads in that area, and they are ok for calling variants?

  • Wang LongWang Long ChinaMember

    @Sheila
    The original bam file is generated through following steps as described in GATK BP

    step1: Align reads with BWA-MEM

    bwa mem -t 4 -M -R "@RG\tID:G1-12\tLB:G1-12\tPL:Illumina\tPU:G1-12\tSM:G1-12" Ppersica_298_v2.0.fa \
    G1-12_1.fq.gz G1-12_2.fq.gz > G1-12.Ppersica_298_v2.bwa.sam

    step2: sort bam file

    java -jar picard-tools-1.114/SortSam.jar SORT_ORDER=coordinate VALIDATION_STRINGENCY=LENIENT \
    INPUT=G1-12.Ppersica_298_v2.bwa.sam \
    OUTPUT=G1-12.Ppersica_298_v2.bwa.sort.bam

    step3: mark duplicates

    java -jar picard-tools-1.114/MarkDuplicates.jar MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 VALIDATION_STRINGENCY=LENIENT \
    INPUT=G1-12.Ppersica_298_v2.bwa.sort.bam \
    OUTPUT=G1-12.Ppersica_298_v2.bwa.sort.dedup.bam \
    METRICS_FILE=G1-12.Ppersica_298_v2.bwa.sort.dedup.metrics

    samtools index G1-12.Ppersica_298_v2.bwa.sort.dedup.bam

    step4: realignment

    java -jar GenomeAnalysisTK-3.2-0/GenomeAnalysisTK.jar -R Ppersica_298_v2.0.fa \
    -T RealignerTargetCreator -nt 2 -fixMisencodedQuals \
    -o G1-12.Ppersica_298_v2.bwa.sort.dedup.realn.intervals \
    -I G1-12.Ppersica_298_v2.bwa.sort.dedup.bam

    java -jar GenomeAnalysisTK-3.2-0/GenomeAnalysisTK.jar -R Ppersica_298_v2.0.fa \
    -T IndelRealigner -fixMisencodedQuals \
    -targetIntervals G1-12.Ppersica_298_v2.bwa.sort.dedup.realn.intervals \
    -o G1-12.Ppersica_298_v2.bwa.sort.dedup.realn.bam \
    -I G1-12.Ppersica_298_v2.bwa.sort.dedup.bam

    step5: variants calling

    java -jar GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -R Ppersica_298_v2.0.fa \
    -T HaplotypeCaller --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 \
    -o G1-12.Ppersica_298_v2.bwa.sort.dedup.realn.hc.gvcf \
    -I G1-12.Ppersica_298_v2.bwa.sort.dedup.realn.bam

    I have 14 samples, all of which sampling from a single peach tree, so they should be very similar, all 14 samples were processed through the exactly same pipeline as shown above. After genotyping all 14 samples (GenotypeGVCFs), I found 5 samples have a very high missing rate (about 80%), but UG could give normal results for all the samples, so I try to figure out why this happen.

    After some manually inspection of the variants, I chose one region "Pp01:32036-32117", in this region some samples were normal called, but some samples are total missing (marked as "./."), so I extracted the mapping results in this region from two samples:

    samtools view -q20 -h G1-1.Ppersica_298_v2.bwa.sort.dedup.realn.bam Pp01:32036-32117 > test1.sam
    samtools view -Sb test1.sam > test1.bam
    samtools index test1.bam

    samtools view -q20 -h G1-12.Ppersica_298_v2.bwa.sort.dedup.realn.bam Pp01:32036-32117 > test2.sam
    samtools view -Sb test2.sam > test2.bam
    samtools index test2.bam

    As I only extracted a narrow region, only one pair of the reads were present in the bam files used for test, so Picard's FixMateInformation does not help...

    As described before, while using UG, both samples are proper, and give same variant results. However, with HC, test1.bam from sample G1-1 could always give correct results, while test2.bam from sample G1-12 could not.

    After some test I started to think whehter this could be caused by some abnormal mapped reads present in the second sample, so I manually removed some mapping results in file test2.sam, and regenerated the bam file, then do the calling. Until I remove the mapping result of one read "K00137:24:H2TVJBBXX:1:2222:14306:43404", it started to give proper results.

    @Geraldine_VdAuwera

    "other reads" here mean the mapping results of other reads in the same area, and they are ok for calling variants. However, once the single read "K00137:24:H2TVJBBXX:1:2222:14306:43404" was present in this region, HC fail to report those variants.

    Thanks,
    Wang Long

  • SheilaSheila Broad InstituteMember, Broadie admin

    @Wang Long
    Hi Wang Long,

    Can you submit a bug report? Instructions are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks,
    Sheila

  • Wang LongWang Long ChinaMember

    @Sheila
    Hi Sheila,

    Files have been uploaded to the ftp "discussion-5817.tar.gz". All command lines could be found in the file "test_cmd.sh".

    Thanks for you all!

  • Wang LongWang Long ChinaMember
    edited August 2015

    Thanks very much for your answer, it works now!

    However I found the choice of kmer size do have different effects, I tested -kmerSize 10 and 20, which give same results as 30, while -kmersize 15, 25, 40, 45 and 50 all emit reduced calls. This raise some concerns about the choice of the kmer size, could we know how to choose the correct kmer size? What's the default value of kmer size? Do they affect the speed of HC?

    Thanks again!

    Post edited by Wang Long on
  • SheilaSheila Broad InstituteMember, Broadie admin

    @Wang Long
    Hi Wang Long,

    Yes, we have had some issues with kmer size in messy regions. The default values are 10 and 25. Did you say -kmerSize 10 by itself produces the correct calls? Usually if you see a region is messy or some other samples are getting calls, it is a good idea to play around with the kmer size. We have found the default values of 10 and 25 to work well in clean regions.

    -Sheila

  • Wang LongWang Long ChinaMember

    @Sheila

    Yes, use -kmerSize 10 or 20 alone produces the same calls as use 30 in my test. Anyway, I'm glad to have learned more about the kmer size, really appreciate your helps!

Sign In or Register to comment.