US Holiday notice: this Thursday and Friday (Nov 25-26) the forum will be unattended. Normal service will resume Monday Nov 29. Happy Thanksgiving!

HaplotypeCaller sensitivity

aaronchuaaronchu Posts: 9Member

Hi there,

I compared HaplotypeCaller and UnifiedGenotype of the latest version of GATK (2.7-4) on the same NA12878 trio. It seems HC missed some true variants (specifically, 7 out of the 49 experimentally validated de novo SNVs are missed). I guess this might be partly due to HCMappingQualityFilter, because the log says ~9% of the total reads have been filtered by this filter, and it looks there is not such a filter in UnifiedGenotyper. I wonder is there a safe way to relax HCMappingQualityFilter (default 20) to achieve a better sensitivity for HC?

Thanks in advance!

Aaron

Best Answer

  • ebanksebanks Posts: 684GATK Developer mod
    Answer ✓

    Hi Aaron, Thank you very much for supplying all of these files - it was extremely helpful. Your problem is simply that you are running with '-minPruning 1' which isn't right. If you remove that from the command line then you achieve equal SNP sensitivity to the UG (and with much better specificity; I can see several SNP calls from the UG only that are clearly indel artifacts). We will figure out what to do in the future to make it harder for users to add the wrong command-line arguments!

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

Answers

  • ebanksebanks Posts: 684GATK Developer mod

    Hi Aaron, Thanks for sharing this. In our hands the HC consistently outperforms the UG both in terms of sensitivity and specificity, so results like this are surprising. We are always eager to improve both tools - but the HC in particular - so we would like to ask you to help us if you are able. Is there any chance you'd be willing to make a test bam file for us with the regions (+/- 300 bp) around your SNPs (all 49 if possible, but the 7 missing ones are the most important) so that we can dive into it ourselves and figure out if any of our parameters are too restrictive? We would not share this data outside the GATK group.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • aaronchuaaronchu Posts: 9Member

    Thanks Eric! I attached the test bam files here. There are 3 bams within the NA12878 trio each containing alignments within +/- 300 bp regions centered at each of the 49 validated de novo SNV site. The list of 49 sites is also attached. The 7 sites missed by HC are all called by UG. Thanks in advance for looking into this!

    command for HC:

    java -Xmx20g -jar GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -nct 12 -T HaplotypeCaller -R hs37d5.fa -I na12878.bam.list --dbsnp gatk/gatk_bundle_2.5_b37/dbsnp_137.b37.vcf -minPruning 1 -maxAltAlleles 100 -stand_call_conf 30.0 -stand_emit_conf 10.0 -o na12878.hc.raw.vcf

    command for UG:

    java -Xmx20g -jar GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -nt 12 -T UnifiedGenotyper -R hs37d5.fa -I na12878.bam.list --dbsnp gatk/gatk_bundle_2.5_b37/dbsnp_137.b37.vcf -maxAltAlleles 50 -stand_call_conf 30.0 -stand_emit_conf 10.0 -glm BOTH -o na12878.ug.raw.vcf

    zip
    zip
    na12878.zip
    11M
  • OprahOprah Posts: 24Member

    Is UG more sensitive when calling 400 samples?

    I know HC isn't suitable for that many samples, but I tried it anyway on a small region:

    5:150097786-150097997
    5:150099204-150099407
    5:150102404-150102607
    5:150110141-150110354

    Using default parameters, HC (v2.7-4 and 2.8-1) reports 1 snp and no indels;
    1390 reads were filtered out during the traversal out of approximately 589272 total reads (0.24%)
    (all 1390 reads failed the HCMappingQualityFilter)

    Using default parameters, UG reports the same snp, plus 4 more (2 singletons, 2 with maf in the range of 1 - 5%)

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,682Administrator, GATK Developer admin

    Hi @Oprah,

    Calling many samples together does increase sensitivity, yes. Note that the HCMappingQualityFilter threshold is Q20, so you might want to check the base call qualities throughout your data -- if it's just in this region it's not too bad, but if a large proportion of your base calls have that kind of low quality, it's not a very good sign.

    Geraldine Van der Auwera, PhD

  • OprahOprah Posts: 24Member

    Hmm, I should've phrased my question this way: for calling rare SNVs, is HC as sensitive as UG, regardless of sample size up to a reasonable number e.g. 400 ?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,682Administrator, GATK Developer admin

    Oh, I see; regardless of cohort size, we expect UG and HC to be equally sensitive for SNPs in most regions, with HC performing better in complex regions, near indels etc. However HC will struggle to process large numbers of samples (above ~100) unless some accommodations are made to speed up runtime (e.g. tweak minPruning etc), but those tweaks will typically reduce sensitivity on low-frequency variants. So for a large cohort, you may get better results with UG as long as you're only interested in SNPs and not indels.

    We're currently working on a new workflow that will enable single-sample calling with HaplotypeCaller leading to joint interpretation, to bypass the performance issue. Should be available in 6 to 8 weeks.

    Geraldine Van der Auwera, PhD

  • OprahOprah Posts: 24Member

    For my samples, # reads filtered out during traversal = # reads failing HCMappingQualityFilter, so I used IGV to sort the reads by mapping quality, then checked the mapping quality of several reads at the top, middle, and bottom of the IGV window, at a singleton reported by UG but not by HC. The lowest m.q. is 29. I re-ran HC v2.8-1 twice, using a bam list of 150 samples (subset of the original 400), and then with 75 samples (subset of 150), with a bare-bones command line (java -Xmx ... -jar ... -T HaplotypeCaller or UnifiedGenotyper -R ... -I ... -L ... -o ... -l INFO)

    HC reported the singleton with 75 but not with 150 nor 400 samples.
    And regardless of the number of samples, HC does not report a snp that UG does. The maf varies from 4.7 to 6%, depending on whether you look at UG's output using 75, 150, or 400 samples. There is one sample (present in all subsets) which is homozygous variant i.e. 1/1. The lowest m.q. is 37

    Is there a bug in HC, or maybe I'm doing something wrong, or ... ?

    p.s. HC reports no indels, as I wrote last month.
    p.p.s. UG reports the following for the sample with m.q. 37
    GT:AD:DP:GQ:PL
    1/1:0,52:52:99:1838,147,0

    UG reports the following for the singleton
    0/1:84,47:132:99:1210,0,1806
    while HC reports this
    0/1:71,43:114:99:1180,0,1865

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,682Administrator, GATK Developer admin

    Hi @Oprah,

    Not sure what's going on with your singleton; we may need to look at the files directly to figure it out. Would you be willing to share a snippet of your data so we can test this locally? Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

    That said I'm a little worried that you're not seeing any indels at all with HC, it might indicate some kind of data processing issue upstream of calling. Have you tried calling indels with HC on the original data? Can you remind me what type of data/organism/sequencing tech it is from?

    Geraldine Van der Auwera, PhD

  • OprahOprah Posts: 24Member

    Thanks for looking into this.
    HC reported indels in other regions, just not in this tiny region of chr5, so there's no cause for concern.
    I narrowed down the problem and uploaded haplotypeCaller_sensitivity.tar.gz containing various text files and snippets of 78 bam files.

    HC fails to report snp1 (the singleton, at 150102474) but not if you reduce the number of samples by deleting the 1st, 2nd, 3rd, or 78th (bottom) row of bam.list. The sample with genotype 0/1 is on row 67. HC may report snp1 using other subsets; I didn't test them all.

    HC reports snp2 (maf ~5%, at 150097883) using all samples, or certain subsets. In particular, you can delete the 1st, 2nd or 3rd row of bam.list, but not the 78th (bottom) row. The samples with variants are on row 2, 22, 36, 42, 47, 59, and 60, so you would expect that deleting row 78 should not affect HC's output, but it does.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,682Administrator, GATK Developer admin

    Thanks for the test files, @Oprah. We'll have a look at these asap and I'll let you know in this thread when we have an idea of what's going on.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,682Administrator, GATK Developer admin

    Hi @Oprah,

    I just ran a few tests and in my hands (no special arguments, just the cmd line you specified), HC finds your snp1 variant even when some of the other samples you indicated are removed from the bam list. I haven't tried all combinations (just deleting the first two bams in the bam list), but can you confirm that it's supposed to reproduce the issue (i.e. not call the SNP) when bams/3395.bam and bams/3396.bam are removed from the cohort?

    This is what I get:

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  3400    3401    3402    3403    3404    3405
    3406    3407    3408    3410    3411    3418    3419    3420    3421    3422    3423    3426    3434    3435    3436    3437    3438    3439    3440    3441    3450    3451    3452    671     672     673     674     675     676
     677     678     679     680     681     682     683     684     685     686     687     688     689     690     691     692     693     694     695     696     697     698     699     700     702     703     704     705     706
     707     708     709     710     711     712     713     714     715     716     717     718
    

    5 150102474 . G A 1138.70 . AC=1;AF=6.579e-03;AN=152;BaseQRankSum=3.218;ClippingRankSum=1.270;DP=8318;FS=6.501;InbreedingCoeff=-0.0066;MLEAC=1;MLEAF=6.579e-03;MQ=57.61;MQ0=0;MQRankSum=3.962;QD=9.99;ReadPosRankSum=-0.909 GT:AD:DP:GQ:PL 0/0:123,0:123:99:0,369,4096 0/0:115,0:115:99:0,344,3824 0/0:107,0:107:99:0,322,3710 0/0:88,0:88:99:0,265,3243 0/0:106,0:106:99:0,319,3946 0/0:102,0:102:99:0,307,3771 0/0:103,0:103:99:0,310,3893 0/0:130,0:130:99:0,391,4914 0/0:96,0:96:99:0,289,3633 0/0:124,0:124:99:0,372,4107 0/0:135,0:135:99:0,405,4502 0/0:151,0:151:99:0,454,4929 0/0:145,0:145:99:0,435,4654 0/0:125,0:125:99:0,374,4031 0/0:97,0:97:99:0,290,3066 0/0:123,0:123:99:0,370,3897 0/0:137,0:137:99:0,410,4454 0/0:151,0:151:99:0,454,5047 0/0:65,0:65:99:0,196,2437 0/0:54,0:54:99:0,162,1920 0/0:54,0:54:99:0,162,1982 0/0:67,0:67:99:0,202,2553 0/0:68,0:68:99:0,205,2565 0/0:50,0:50:99:0,150,1802 0/0:62,0:62:99:0,187,2320 0/0:130,0:130:99:0,390,4776 0/0:98,0:98:99:0,294,3318 0/0:149,0:149:99:0,448,4979 0/0:100,0:100:99:0,306,3367 0/0:115,0:115:99:0,345,3559 0/0:31,0:31:93:0,93,955 0/0:40,0:40:99:0,120,1230 0/0:105,0:105:99:0,315,3356 0/0:124,0:124:99:0,372,3940 0/0:112,0:112:99:0,336,3659 0/0:72,0:72:99:0,216,2421 0/0:76,0:76:99:0,228,2387 0/0:124,0:124:99:0,372,3445 0/0:95,0:95:99:0,285,2936 0/0:143,0:143:99:0,429,4406 0/0:123,0:123:99:0,369,3961 0/0:122,0:122:99:0,365,3726 0/0:86,0:86:99:0,257,2542 0/0:102,0:102:99:0,306,2823 0/0:111,0:111:99:0,333,3430 0/0:121,1:122:99:0,326,3837 0/0:106,0:106:99:0,319,3609 0/0:131,0:131:99:0,396,4050 0/0:137,0:137:99:0,411,4271 0/0:132,0:132:99:0,402,4093 0/0:120,0:120:99:0,363,3978 0/0:99,0:99:99:0,295,2884 0/0:115,0:115:99:0,345,3455 0/0:94,0:94:99:0,284,2951 0/0:142,0:142:99:0,426,4428 0/0:137,1:138:99:0,404,4234 0/0:133,0:133:99:0,402,4118 0/0:117,0:117:99:0,357,3807 0/0:140,0:140:99:0,420,4553 0/0:101,0:101:99:0,302,3200 0/0:89,0:89:99:0,270,2622 0/0:110,0:110:99:0,330,3615 0/0:103,0:103:99:0,309,3281 0/0:87,0:87:99:0,261,2821 0/0:134,0:134:99:0,402,4506 0/1:71,43:114:99:1180,0,1865 0/0:133,0:133:99:0,399,4346 0/0:133,0:133:99:0,398,4180 0/0:139,0:139:99:0,417,4173 0/0:138,0:138:99:0,414,4465 0/0:97,0:97:99:0,291,3157 0/0:100,0:100:99:0,300,3142 0/0:136,0:136:99:0,409,4003 0/0:122,0:122:99:0,366,4081 0/0:115,0:115:99:0,345,3853 0/0:96,0:96:99:0,288,3229

    Geraldine Van der Auwera, PhD

  • OprahOprah Posts: 24Member

    I should've renamed bam.list to bam78.list, and uploaded another file named bam77.list (containing the first 77 rows of bam78.list). Then I could've written:

    Given bam77.list, HC reports snp1 but not snp2.
    Given bam78.list, HC reports snp2 but not snp1.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,682Administrator, GATK Developer admin

    Oh, that clarifies things immensely, thank you. I missed the double negative in the first statement about snp1 in your earlier post. Alright then, I'll re-test in light of this information.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,682Administrator, GATK Developer admin

    I was able to reproduce your issue this time, so I'm passing it on to the devs for in-depth debugging. I'll let you know what they find.

    Geraldine Van der Auwera, PhD

  • OprahOprah Posts: 24Member

    It's been over a month; I'm guessing this is on hold until GATK version 3 is released?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,682Administrator, GATK Developer admin

    Yes I'm afraid the devs haven't been able to find time to deal with this. Sorry for the inconvenience!

    Geraldine Van der Auwera, PhD

  • OprahOprah Posts: 24Member

    Good news: GenotypeGVCFs in version 3.0 solves the issue.
    Way to go devs !

  • ebanksebanks Posts: 684GATK Developer mod

    Great! We had hoped this was the case (most of the issues we have seen with the HaplotypeCaller are fixed with the new pipeline), but we do know that we still have some more things to fix. Thanks for the feedback, keep it coming, and we hope to continue improving things for you each release!

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

Sign In or Register to comment.