To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

Variant calls being missed - how to improve variant detection using Sanger sequencing data?

mjmaurermjmaurer Berkeley, CAMember

@Geraldine_VdAuwera and @Sheila - Please help!

I've read through many of your posts/responses regarding HaplotypeCaller not calling variants, and tried many of the suggestions you've made to others, but I'm still missing variants. My situation is a little different (I'm trying to identify variants from Sanger sequence reads) but I'm hoping you might have additional ideas or can see something I've overlooked. I hope I haven't given you too much information below, but I've seen it mentioned that too much info is better than not enough.

A while back, I generated a variant call set from Illumina Next Gen Sequencing data using UnifiedGenotyper (circa v2.7.4), identifying ~46,000 discordant variants between the genomes of two haploid strains of S. cerevisiae. Our subsequent experiments included Sanger sequencing ~95 kb of DNA across 17 different loci in these two strains. I don't think any of the SNP calls were false positives, and there were very, very few were false negatives.

Since then, we've constructed many strains by swapping variants at these loci between these two strains of yeast. To check if a strain was constructed successfully, we PCR the loci of interest, and Sanger sequencing the PCR product. I'm trying to use GATK (version 3.4-46) HaplotypeCaller (preferably, or alternatively UnifiedGenotyper) in a variant detection pipeline to confirm a properly constructed strain. I convert the .ab1 files to fastqs using EMBOSS seqret, map the Sanger reads using bwa mem ($ bwa mem -R $RG $refFasta $i > ${outDir}/samFiles/${fileBaseName}.sam), merge the sam files for each individual, and then perform the variant calling separately for each individual. I do not dedup (I actually intentionally leave out the -M flag in bwa), nor do I realign around indels (I plan to in the future, but there aren't any indels in any of the regions we are currently looking at), or do any BQSR in this pipeline. Also, when I do the genotyping after HaplotypCaller, I don't do joint genotyping, each sample (individual) gets genotyped individually.

In general, this pipeline does identify many variants from the Sanger reads, but I'm still missing many variant calls that I can clearly see in the Sanger reads. Using a test set of 36 individuals, I examined the variant calls made from 364 Sanger reads that cover a total of 63 known variant sites across three ~5kb loci (40 SNPs in locus 08a-s02, 9 SNPs in locus 10a-s01, 14 SNPs in locus 12c-s02). Below are some example calls to HaplotypeCaller and UnifiedGenotyper, as well as a brief summary statement of general performance using the given command. I've also included some screenshots from IGV showing the alignments (original bam files and bamOut files) and SNP calls from the different commands.

Ideally, I'd like to use the HaplotypeCaller since not only can it give me a variant call with a confidence value, but it can also give me a reference call with a confidence value. And furthermore, I'd like to stay in DISCOVERY mode as opposed to Genotype Given Alleles, that way I can also assess whether any experimental manipulations we've performed might have possibly introduced new mutations.

Again, I'm hoping someone can advice me on how to make adjustments to reduce the number of missed calls.

Call 1:
The first call to HaplotypeCaller I'm showing produced the least amount of variant calls at sites where I've checked the Sanger reads.

java -Xmx4g -jar $gatkJar \
    -R $refFasta \
    -T HaplotypeCaller \
    -I $inBam \
    -ploidy 1 \
    -nct 1 \
    -bamout ${inBam%.bam}_hapcallRealigned.bam \
    -forceActive \
    -disableOptimizations \
    -dontTrimActiveRegions \
    --genotyping_mode DISCOVERY \
    --emitRefConfidence BP_RESOLUTION \
    --interval_padding 500 \
    --intervals $outDir/tmp.intervals.bed \
    --min_base_quality_score 5 \
    --standard_min_confidence_threshold_for_calling 0 \
    --standard_min_confidence_threshold_for_emitting 0 \
    -A VariantType \
    -A SampleList \
    -A AlleleBalance \
    -A BaseCounts \
    -A AlleleBalanceBySample \
    -o $outDir/vcfFiles/${fileBaseName}_hc_bp_raw.g.vcf

Call 2:
I tried a number of different -kmerSize values [(-kmerSize 10 -kmerSize 25), (-kmerSize 9), (-kmerSize 10), (-kmerSize 12), (-kmerSize 19), (-kmerSize 12 -kmerSize 19), (maybe some others). I seemed to have the best luck when using -kmerSize 12 only; I picked up a few more SNPs (where I expected them), and only lost one SNP call as compared Call 1.

java -Xmx4g -jar $gatkJar \
    -R $refFasta \
    -T HaplotypeCaller \
    -I $inBam \
    -ploidy 1 \
    -nct 1 \
    -bamout ${inBam%.bam}_kmer_hapcallRealigned.bam \
    -forceActive \
    -disableOptimizations \
    -dontTrimActiveRegions \
    --genotyping_mode DISCOVERY \
    --emitRefConfidence BP_RESOLUTION \
    --interval_padding 500 \
    --intervals $outDir/tmp.intervals.bed \
    --min_base_quality_score 5 \
    --standard_min_confidence_threshold_for_calling 0 \
    --standard_min_confidence_threshold_for_emitting 0 \
    -kmerSize 12 \
    -A VariantType \
    -A SampleList \
    -A AlleleBalance \
    -A BaseCounts \
    -A AlleleBalanceBySample \
    -o $outDir/vcfFiles/${fileBaseName}_hc_bp_kmer_raw.g.vcf

Call 3:
I tried adjusting --minPruning 1 and --minDanglingBranchLength 1, which helped more than playing with kmerSize. I picked up many more SNPs compared to both Call 1 and Call 2 (but not necessarily the same SNPs I gained in Call 2).

java -Xmx4g -jar $gatkJar \
    -R $refFasta \
    -T HaplotypeCaller \
    -I $inBam \
    -ploidy 1 \
    -nct 1 \
    -bamout ${inBam%.bam}_adv_hapcallRealigned.bam \
    -forceActive \
    -disableOptimizations \
    -dontTrimActiveRegions \
    --genotyping_mode DISCOVERY \
    --emitRefConfidence BP_RESOLUTION \
    --interval_padding 500 \
    --intervals $outDir/tmp.intervals.bed \
    --min_base_quality_score 5 \
    --standard_min_confidence_threshold_for_calling 0 \
    --standard_min_confidence_threshold_for_emitting 0 \
    --minPruning 1 \
    --minDanglingBranchLength 1 \
    -A VariantType \
    -A SampleList \
    -A AlleleBalance \
    -A BaseCounts \
    -A AlleleBalanceBySample \
    -o $outDir/vcfFiles/${fileBaseName}_hc_bp_adv_raw.g.vcf

Call 4:
I then tried adding both --minPruning 1 --minDanglingBranchLength 1 and -kmerSize 12 all at once, and I threw in a --min_mapping_quality_score 5. I maybe did slightly better... than in Calls 1-4. I did actually lose 1 SNP compared to Calls 1-4, but I got most of the additional SNPs I got from using Call 3, as well as some of the SNPs I got from using Call 2.

java -Xmx4g -jar $gatkJar \
    -R $refFasta \
    -T HaplotypeCaller \
    -I $inBam \
    -ploidy 1 \
    -nct 1 \
    -bamout ${inBam%.bam}_hailMary_raw.bam \
    -forceActive \
    -disableOptimizations \
    -dontTrimActiveRegions \
    --genotyping_mode DISCOVERY \
    --emitRefConfidence BP_RESOLUTION \
    --interval_padding 500 \
    --intervals $outDir/tmp.intervals.bed \
    --min_base_quality_score 5 \
    --min_mapping_quality_score 10 \
    --standard_min_confidence_threshold_for_calling 0 \
    --standard_min_confidence_threshold_for_emitting 0 \
    --minPruning 1 \
    --minDanglingBranchLength 1 \
    -kmerSize 12 \
    -A VariantType \
    -A SampleList \
    -A AlleleBalance \
    -A BaseCounts \
    -A AlleleBalanceBySample \
    -o $outDir/vcfFiles/${fileBaseName}_hailMary_raw.g.vcf

Call 5:
As I mentioned above, I've experience better performance (or at least I've done a better job executing) with UnifiedGenotyper. I actually get the most SNPs called at the known SNP sites, in individuals where manual examination confirms a SNP.

java -Xmx4g -jar $gatkJar \
    -R $refFasta \
    -T UnifiedGenotyper \
    -I $inBam \
    -ploidy 1 \
    --output_mode EMIT_ALL_SITES \
    -glm BOTH \
    -dt NONE -dcov 0 \
    -nt 4 \
    -nct 1 \
    --intervals $outDir/tmp.intervals.bed \
    --interval_padding 500 \
    --min_base_quality_score 5 \
    --standard_min_confidence_threshold_for_calling 0 \
    --standard_min_confidence_threshold_for_emitting 0 \
    -minIndelCnt 1 \
    -A VariantType \
    -A SampleList \
    -A AlleleBalance \
    -A BaseCounts \
    -A AlleleBalanceBySample \
    -o $outDir/vcfFiles/${fileBaseName}_ug_emitAll_raw.vcf

I hope you're still with me :)

None of the above commands are calling all of the SNPs that I (maybe naively) would expect them to. "Examples 1-3" in the first attached screenshot are three individuals with reads (two reads each) showing the alternate allele. The map quality scores for each read are 60, and the base quality scores at this position for individual #11 are 36 and 38, and for the other individuals, the base quality scores are between 48-61. The reads are very clean upstream of this position, the next upstream SNP is about ~80bp away, and the downstream SNP at the position marked for "Examples 4-6" is ~160 bp away. Commands 1 and 2 do not elicit a SNP call for Examples 1-6, Command 3 get the calls at both positions for individual 10, Command 4 for gets the both calls for individuals 10 and the upstream SNP for individual 11. Command 5 (UnifiedGenotyper) gets the alt allele called in all 3 individuals at the upstream position, and the alt allele called for individuals 10 and 12 at the downstream position. Note that in individual 11, there is only one read covering the downstream variant position, where UnifiedGenotyper missed the call.

Here is the vcf output for those two positions from each command. Note that there are more samples in the per-sample breakdown for the FORMAT tags. The last three groups of FORMAT tags correspond to the three individuals I've shown in the screenshots.

Command 1 output

Examples 1-3    649036  .   G   .   .   .   AN=11;DP=22;VariantType=NO_VARIATION;set=ReferenceInAll GT:AD:DP:RGQ    .   .   .   .   .   .   .   .   .   .   .:0:0:0 0:0:2:0 0:2:2:89    0:0:2:0 0:2:2:84    0:0:2:0 0:2:2:89    0:0:2:0 0:2:2:89    0:0:2:0 0:0:2:0 0:0:2:0
Examples 4-6    649160  .   C   .   .   .   AN=11;DP=21;VariantType=NO_VARIATION;set=ReferenceInAll GT:AD:DP:RGQ    .   .   .   .   .   .   .   .   .   .   .:0:0:0 0:0:2:0 0:2:2:89    0:0:2:0 0:2:2:0 0:0:2:0 0:2:2:71    0:0:2:0 0:2:2:44    0:0:2:0 0:0:1:0 0:0:2:0

Command 2 output

Examples 1-3    649036  .   G   A   26.02   .   ABHom=1.00;AC=6;AF=0.545;AN=11;DP=18;MLEAC=1;MLEAF=1.00;MQ=60.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-002,qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010,qHZT-12c-s02_r2657_p4096_dJ-011;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-002_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-011_merged_sorted_hc_bp_kmer_raw.vcf  GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .:0:0:.:.:0 1:0,2:.:56:56,0 0:2:2:.:.:89    1:0,1:.:45:45,0 0:2:2:.:.:84    1:0,1:.:45:45,0 0:2:2:.:.:89    1:0,1:.:45:45,0 0:2:2:.:.:89    1:0,1:.:45:45,0 1:0,2:.:88:88,0 0:0:2:.:.:0
Examples 4-6    649160  .   C   A   13.22   .   AC=3;AF=0.273;AN=11;DP=18;MLEAC=1;MLEAF=1.00;MQ=60.00;OND=1.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_hc_bp_kmer_raw.vcf   GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .   .   .   .       .   .   .   .   .   .   .   .:0:0:.:.:0 0:0:2:.:.:0 0:2:2:.:.:89    1:0,1:.:43:43,0 0:2:2:.:.:0 0:0:2:.:.:0 0:2:2:.:.:71    1:0,0,1:.:37:37,0   0:2:2:.:.:44    1:0,1:.:34:34,0 0:0:1:.:.:0 0:0:2:.:.:0

Command 3 output

Examples 1-3    649036  .   G   A   36.01   .   ABHom=1.00;AC=3;AF=0.273;AN=11;DP=20;MLEAC=1;MLEAF=1.00;MQ=60.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-002,qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-002_merged_sorted_hc_bp_adv_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hc_bp_adv_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_hc_bp_adv_raw.vcf    GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .:0:0:.:.:0 1:0,2:.:66:66,0 0:2:2:.:.:89    1:0,1:.:45:45,0 0:2:2:.:.:84    1:0,1:.:45:45,0 0:2:2:.:.:89    0:0:2:.:.:0 0:2:2:.:.:89    0:0:2:.:.:0 0:0:2:.:.:0 0:0:2:.:.:0
Examples 4-6    649160  .   C   A   13.22   .   ABHom=1.00;AC=1;AF=0.091;AN=11;DP=20;MLEAC=1;MLEAF=1.00;MQ=60.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-004;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hc_bp_adv_raw.vcf    GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .:0:0:.:.:0 0:0:2:.:.:0 0:2:2:.:.:89    1:0,1:.:43:43,0 0:2:2:.:.:0 0:0:2:.:.:0 0:2:2:.:.:71    0:0:2:.:.:0 0:2:2:.:.:44    0:0:2:.:.:0 0:0:1:.:.:0 0:0:2:.:.:0

Command 4 output

Examples 1-3    649036  .   G   A   26.02   .   ABHom=1.00;AC=6;AF=0.545;AN=11;DP=18;MLEAC=1;MLEAF=1.00;MQ=60.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-002,qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010,qHZT-12c-s02_r2657_p4096_dJ-011;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-002_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-011_merged_sorted_hailMary_raw.vcf  GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .   .:0:0:.:.:0 1:0,2:.:56:56,0 0:2:2:.:.:89    1:0,1:.:45:45,0 0:2:2:.:.:84    1:0,1:.:45:45,0 0:2:2:.:.:89    1:0,1:.:45:45,0 0:2:2:.:.:89    1:0,1:.:45:45,0 1:0,2:.:88:88,0 0:0:2:.:.:0
Examples 4-6    649160  .   C   A   13.22   .   AC=3;AF=0.273;AN=11;DP=18;MLEAC=1;MLEAF=1.00;MQ=60.00;OND=1.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_hailMary_raw.vcf GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .:0:0:.:.:0 0:0:2:.:.:0 0:2:2:.:.:89    1:0,1:.:43:43,0 0:2:2:.:.:0 0:0:2:.:.:0 0:2:2:.:.:71    1:0,0,1:.:37:37,0   0:2:2:.:.:44    1:0,1:.:34:34,0 0:0:1:.:.:0 0:0:2:.:.:0

Command 5 output

Examples 1-3    649036  .   G   A   26.02   .   ABHom=1.00;AC=7;AF=0.636;AN=11;DP=22;Dels=0.00;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=60.00;MQ0=0;SOR=2.303;Samples=qHZT-12c-s02_r2657_p4096_dJ-002,qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010,qHZT-12c-s02_r2657_p4096_dJ-011,qHZT-12c-s02_r2657_p4096_dJ-012;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-002_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-011_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-012_merged_sorted_ug_emitAll_raw.vcf  GT:AD:DP:GQ:PL  ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./../.  ./. ./. ./. ./. ./. ./. 1:0,2:2:56:56,0 0:.:2   1:0,2:2:99:117,0    0:.:2   1:0,2:2:99:122,0    0:.:2   1:0,2:2:67:67,0 0:.:2   1:0,2:2:99:110,0    1:0,2:2:84:84,0 1:0,2:2:99:127,0
Examples 4-6    649160  .   C   A   46  .   ABHom=1.00;AC=5;AF=0.455;AN=11;DP=21;Dels=0.00;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=60.00;MQ0=0;Samples=qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010,qHZT-12c-s02_r2657_p4096_dJ-012;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-012_merged_sorted_ug_emitAll_raw.vcf  GT:AD:DP:GQ:PL  ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./../.  ./. ./. ./. ./. ./. ./. 0:.:2   0:.:2   1:0,2:2:76:76,0 0:.:2   1:0,2:2:70:70,0 0:.:2   1:0,1:2:37:37,0 0:.:2   1:0,2:2:60:60,0 0:.:1   1:0,2:2:75:75,0

There are many more examples of missed SNP calls. When using the HaplotypeCaller, I'm missing ~23% of the SNP calls. So...what can I do to tweak my variant detection pipeline so that I don't miss so many SNP calls?

As I mentioned, I'm currently getting better results with the UnifiedGenotyper walker. I'm only missing about 2% of all Alt SNP calls. Also, about half of that 2% are improperly being genotyped as Ref by Command #5. It appears to me that most of the variant calls I'm missing using the UnifiedGenotyper are at positions where I only have a single Sanger read covering the base, and the base quality score starts to fall below 25 (such as in individual #11 in the first attached screen shot, base quality score was 20). Attached is a second IGV screenshot of a different locus where I've also missed SNP calls using Command 5 (Examples 7-9). I've also included the read details for those positions, as well as the VCF file output from Command 5. I have seen at least one instance where I had two Sanger reads reporting an alternate allele, however, UG did not call the variant. In that case though, the base quality scores in both reads were very low (8); mapping quality was 60 for both reads.

Does anyone have any suggestions as to how I might alter any of the parameters to reduce (hopefully eliminate) the missed SNP calls. I think I would accept false positives over false negatives in this case. Or does anyone have any other idea as to what my problem might be?

Thanks so much!
Matt Maurer

Command 5 output for second screen shot file:

VCF output
The samples shown in the second attached screen shot correspond to the 11th and 12th groupings in the per-sample breakdown of the FORMAT tags.

Examples 7-8    163422  .   G   C   173 .   ABHom=1.00;AC=2;AF=0.182;AN=11;DP=14;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=1;MLEAF=1.00;MQ=60.00;MQ0=0;SOR=1.609;Samples=qHZT-08a-s02_r2657_p4094_dJ-002,qHZT-08a-s02_r2657_p4094_dJ-008;VariantType=SNP;set=qHZT-08a-s02_r2657_p4094_dJ-002_merged_sorted_ug_emitAll_raw.vcf-qHZT-08a-s02_r2657_p4094_dJ-008_merged_sorted_ug_emitAll_raw.vcf GT:AD:DP:GQ:PL  0:.:1   1:0,4:4:99:203,0    0:.:1   0:.:1   0:.:1   0:.:1   0:.:1   1:0,1:1:54:54,0 0:.:1   ./. 0:.:1   0:.:1   ./. ./. ./. ./. ./. ./. ./. ./. ./../.  ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./.
Example 9   163476  .   A   G   173 .   ABHom=1.00;AC=2;AF=0.167;AN=12;DP=15;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=1;MLEAF=1.00;MQ0=0;SOR=1.609;Samples=qHZT-08a-s02_r2657_p4094_dJ-002,qHZT-08a-s02_r2657_p4094_dJ-008;VariantType=SNP;set=qHZT-08a-s02_r2657_p4094_dJ-002_merged_sorted_ug_emitAll_raw.vcf-qHZT-08a-s02_r2657_p4094_dJ-008_merged_sorted_ug_emitAll_raw.vcf  GT:AD:DP:GQ:PL  0:.:1   1:0,4:4:99:203,0    0:.:1   0:.:1   0:.:1   0:.:1   0:.:1   1:0,1:1:57:57,0 0:.:1   0:.:1   0:.:1   0:.:1   .   .   .   .   .   .   .   .   .   .   .

Also, why are the GT's sometimes "./." as they are for site163422, and sometimes "." as they are for site 163476?

Read Details:

Example#7

Read name = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11
Sample = qHZT-08a-s02_r2657_p4094_dJ-011
Read group = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11
----------------------
Location = 163,422
Alignment start = 163,293 (+)
Cigar = 34S833M1D72M1I50M1I9M
Mapped = yes
Mapping quality = 60
Secondary = no
Supplementary = no
Duplicate = no
Failed QC = no
----------------------
Base = C
Base phred quality = 23
----------------------
RG = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11
NM = 20
AS = 858
XS = 0
-------------------

Example 8

Read name = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11
Sample = qHZT-08a-s02_r2657_p4094_dJ-011
Read group = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11
----------------------
Location = 163,476
Alignment start = 163,293 (+)
Cigar = 34S833M1D72M1I50M1I9M
Mapped = yes
Mapping quality = 60
Secondary = no
Supplementary = no
Duplicate = no
Failed QC = no
----------------------
Base = G
Base phred quality = 15
----------------------
RG = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11
NM = 20
AS = 858
XS = 0
-------------------

Example #9

Read name = qHZT-08a-s02_L_r2657_p4094_dJ-012_pcrP1_oMM575_2014-11-27_A12
Sample = qHZT-08a-s02_r2657_p4094_dJ-012
Read group = qHZT-08a-s02_L_r2657_p4094_dJ-012_pcrP1_oMM575_2014-11-27_A12
----------------------
Location = 163,422
Alignment start = 163,329 (+)
Cigar = 67S16M1D181M1D634M1D9M1I8M1I62M1D17M4S
Mapped = yes
Mapping quality = 60
Secondary = no
Supplementary = no
Duplicate = no
Failed QC = no
----------------------
Base = C
Base phred quality = 18
----------------------
RG = qHZT-08a-s02_L_r2657_p4094_dJ-012_pcrP1_oMM575_2014-11-27_A12
NM = 87
AS = 480
XS = 0
-------------------

Comments

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi Matt,

    Thanks for re-posting here. I've deleted your comment post from the other thread.

    Your main problem here is that you're hitting the low-coverage limitations of the algorithms. These tools are designed for high-throughput sequencing data, so they expect to see many short(ish) reads rather than a couple of long ones. The probability models are tuned accordingly, so when they see just one or two observations, they tend to consider that inconclusive (hence the many no-calls) unless the observation quality is particularly good.

    In HC there is the additional constraint at the reassembly step of how many reads must support a variant allele for it to be retained as a node in the graph. That is why you are seeing the most improvements with the min pruning and dangling branch tweaks in HC, as those lower the bar for the amount of data that needs to be seen per node in the reassembly graph.

    At the moment there is no good workaround beyond what you have already tried; I'm afraid your use case is simply too far outside the scope of what the tools are designed to support. There has been some talk internally of pushing toward improved support of very low coverage use cases, but this isn't even on the development roadmap yet so don't hold your breath. In the meantime you may benefit from using a caller that is less sophisticated, ironically.

    Regarding the . vs ./. no-calls, I suspect this may be a bug in how the no-call representation is emitted to the VCF in the non-diploid case. No-calls can be emitted from two different places in the code, iirc, and what you observe suggests to me that in one of those functions, we're emitting a hard-coded ./. instead of composing the field based on the specified ploidy. I could be wrong but this seems like the most plausible explanation to me. We'll check but we may need a test file to find and fix it.

  • mjmaurermjmaurer Berkeley, CAMember

    Hi Geralidine,

    Thanks for the quick response. I saw your post and I have a few other things I want to ask, but I haven't finished looking into everything first. I totally understand my situation is too unique to redirect your development, but I'll keep my fingers crossed for any advancement in low-coverage calling.

    I've wanted to get back to you regarding the no-calls since you mentioned that there might be a bug causing a mix of "." and "./." in the output. Here's what I think I've observed regarding this:

    First of all, I don't see it when I use HaplotypeCaller in my pipeline, only when I use UnifiedGenotyper. I perform per-sample variant detection (regardless of whether I'm using HaplotypeCaller or UnifiedGenotyper), but not for the same chromosomal region for each individual (ex. for individual 1, I use -L intervals1.bed, but for individual 2, I use -L intervals2.bed). Then when I merge my VCFs, I get a no-call in individual 1 at every site within the range(s) covered by intervals2.bed, and visa versa for individual 2 within intervals1.bed. In the output I posted, there are three intervals that I'm calling variants in, and for each region, I looked at 12 individuals. That produced 36 vcf files, which I then merge into a single vcf file. So at each site, there should be a no-call for 24 of the 36 per-sample breakdowns in the format tags. As I already mentioned, I only get "./." when using UnifiedGenotyper. Now, if I got a GT call for 12 out of 12 individuals where I actually have sequence info, than the per-sample GT tags for the 24 other individuals in the merged vcf file will all be ".". However, if I got a no-call in one (or more) of the 12 individuals where I actually have sequence info, I see that I have a "./." in that sample's vcf file, and when I merge the vcf files, any sample without a GT call gets a "./.".

    Was my explanation clear, and if so, is this info helpful to you?

    Matt

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Ah, thanks for the additional details, Matt. It does sound like the non-diploidy is not being handled properly at the merging step. I assume you're using CombineVariants? And what version of GATK is this?

    By the way, I have to point out that the approach you're taking, of calling variants per sample on different intervals, then merging for analysis, is really not what we recommend for best results.

  • mjmaurermjmaurer Berkeley, CAMember

    Hi Geraldine,

    Yes, I'm using CombineVariants. The version of GATK is:
    The Genome Analysis Toolkit (GATK) v3.4-46-gbc02625, Compiled 2015/07/09 17:38:12

    You mentioned my approach isn't ideal. Is the problem simply that I'm doing the variant calling on each sample individual (particularly with UnifiedGenotyper), or is it the combination of per sample variant detection on different intervals and merging the calls?

    Thanks again,
    Matt

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Both, actually, but the first takes priority. I would recommend reading the Best Practices and associated documentation on why we push for joint calling.

  • mjmaurermjmaurer Berkeley, CAMember

    Hi Geraldine,

    It is true, I'm not as familiar with the current Best Practices guidelines as I once was; most of my use of GATK took place between versions 2.2 and 2.7-4. The reason I'm not using multisample SNP calling with UnifiedGenotyper in this case is because my samples don't conform with assumptions (I think are) made about the population structure during multisample variant detection. Isn't there an assumption about Hardy-Weinberg equilibrium? We are manipulating the genomes of yeast, and sequencing them. If our manipulations are successful, we hope to see 100% of the variants at the sites we are querying in 100% of the individuals we analyze, relative to our reference. If our manipulations are hardly successful, we might only see one individual with a few of our variants of interest. So wouldn't this be a problem for multisample variant detection?

    I'm aware your current Best Practices guidelines call for per-sample variant detection, followed by joint-genotyping. When I was using HaplotyperCaller in my pipeline I've described in this thread, I did not do joint-genotyping because (a) I presumed there might be similar violations of model assumptions in my particular data set, and (b) I saw the comments about missing singletons, which I could definitely have, and would definitely not want to miss.

    A while back, I looked into (and maybe even tried) MuTech on WGS samples from yeast individuals isolated from non-mating populations (growing mitotically) under strong selective pressure, and in some cases mutagenized. I don't remember, but I either read something about MuTech that wasn't going to work for me (does it work on haploids? that might have been the problem), or I might of even tried it and didn't have much luck. [Ultimately, I used per-sample variant detection with UnifiedGenotyper for those data sets. Was that not ideal?] I thought about trying MuTech again for this project, but I believe it doesn't give you a probability that a Ref-called site is a true Ref call, right? During all of my previous variant detection projects, I'd always wished HaplotypeCaller would work on haploids...I'm so glad it does now :)

    Any corrections/thoughts about what I've said above would be greatly appreciated!

    Thanks :)

  • mjmaurermjmaurer Berkeley, CAMember

    BTW, is there a way to be alerted when you or someone else replies to my posts?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Mostly the assumption that is made in the joint genotyping is that if we see variation at the same site in more than one sample, it's more likely to be real. We do calculate some HWE-related annotations for later filtering and analysis purposes, but iirc they are not used for making judgment calls at the calling/ genotyping levels. You're right that singletons are at a slight disadvantage, but if you have decent coverage you should have enough evidence to call them. Those that get missed are mostly in low-covered, difficult regions. I would still recommend using the joint protocol in your case.

    MuTect is designed primarily to cope with low allelic fractions, so it has been sometimes used "off-label" for pooled experiments where you're sequencing a mixed population and you don't have a good estimate of the member count/overall ploidy. But it does make a bunch of assumptions for internal filtering purposes that are derived from cancer models, which can be a problem if you're working with a radically different biology.

    HC is indeed your only option at this time if you want a set ploidy other than 2 and the ability to emit reference confidence.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Forgot to say, see this FAQ if you'd like to turn on forum notifications: https://www.broadinstitute.org/gatk/guide/article?id=27

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @mjmaurer
    Hi Matt,

    You can change your notification preferences in your profile. To the above right of your "About" section, there is an icon of a person's shadow you can click on. Click on that then edit profile then Notification Preferences. You can check which notifications you would like to be emailed to you.

    -Sheila

Sign In or Register to comment.