GATK licensing moves to direct-through-Broad model -- read about it on the GATK blog

Bug with multi-sample UnifiedGenotyper

sicottesicotte Posts: 3Member
edited February 2013 in Ask the GATK team

I am running into a bug with an older version of GATK. Calls for 1bp insertions sometimes have bad genotypes, e.g. I see 11 reference reads and 0 insertions and the genotype is 1/1

GT:AD:DP:DP4:GQ:PL
1/1:11,0:10:7,4,0,0:18.03:160,18,0

In one recent project, 72 out of 4069 1bp insertions are of this pattern.

When I look in IGV at the realigned BAM's, I see that the allele counts are OK (did that on multiple variants and samples).. but that the genotype call is not. (note the likelyhoods match the genotype call).

Can someone run the following command on a recent multi-sample vcf? to see if the bug still exists?

The command prints 1bp insertions, limits to indels not near runs of single nucleotides (may be false positives), and looks for any homozygous calls.. finally the perl looks for the pattern above

gawk -F "\t" '$0 ~ /^#.*/{next}(length($5)==(length($4)+1)){print $0}'  group_1.variants.filter.vcf | grep 'HRun=0' | gawk -F "\t" '{for(i=10;i<=NF;i++) {if($i ~ /1\/1/) {print $i}}}' | perl -ane 'if($_ =~ /1\/1\:([0-9])+\,0\:/){if($1>5) {print $_;}}'

I get something like

1/1:16,0:10:8,8,0,0:23.91:274,24,0
1/1:28,0:24:16,12,0,0:72.24:1033,72,0
1/1:17,0:14:6,11,0,0:42.14:598,42,0
1/1:18,0:18:9,9,0,0:54.15:776,54,0
1/1:19,0:16:10,9,0,0:48.16:713,48,0
1/1:9,0:2:7,2,0,0:6.02:100,6,0
1/1:18,0:18:8,10,0,0:38.82:133,39,0
1/1:28,0:28:12,16,0,0:39.54:117,40,0
...

Sorry, this is part of a large workflow and updating to a more GATK version is a large endeavour that I only want to undertake if the bug is fixed.

These were the main parameters of the calling.

genotype_likelihoods_model=BOTH 
p_nonref_model=EXACT 
genotyping_mode=DISCOVERY  
output_mode=EMIT_VARIANTS_ONLY 

Here is the header information

##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[/..chr5.cleaned.bam] read_buffer_size=null phone_home=NO_ET gatk_key=/projects/bsi/bictools/apps/alignment/GenomeAnalysisTK/1.7/Hossain.Asif_mayo.edu.key read_filter=[] intervals=[chr5.target.bed] excludeIntervals=null i
nterval_set_rule=UNION interval_merging=ALL reference_sequence=/data2/bsi/reference/sequence/human/ncbi/37.1/allchr.fa nonDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=null d
ownsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false BQSR=null quantize_quals=-1 defaultBaseQualities=-1 validation_strictness=SILENT unsafe=null num_threads=
4 num_cpu_threads=null num_io_threads=null num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false logging_leve
l=INFO log_to_file=null help=false genotype_likelihoods_model=BOTH p_nonref_model=EXACT heterozygosity=0.0010 pcr_error_rate=1.0E-4 genotyping_mode=DISCOVERY output_mode=EMIT_VARIANTS_ONLY standard_min_confide
nce_threshold_for_calling=30.0 standard_min_confidence_threshold_for_emitting=30.0 noSLOD=false annotateNDA=false alleles=(RodBinding name= source=UNBOUND) min_base_quality_score=17 max_deletion_fraction=0.05
max_alternate_alleles=5 min_indel_count_for_genotyping=5 min_indel_fraction_per_sample=0.25 indel_heterozygosity=1.25E-4 indelGapContinuationPenalty=10 indelGapOpenPenalty=45 indelHaplotypeSize=80 noBandedInde
l=false indelDebug=false ignoreSNPAlleles=false dbsnp=(RodBinding name= source=UNBOUND) comp=[] out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEADER=org.broadinstitute.sting.gatk.io.stubs.VCFWrit
erStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub debug_file=null metrics_file=null annotation=[] excludeAnnotation=[] filter_mismatching_base_and_quals=false"
##VariantAnnotator="analysis_type=VariantAnnotator input_file=[chr5.cleaned.bam] read_buffer_size=null phone_home=NO_ET gatk_key=/projects/bsi/bictools/apps/alignment/GenomeAnalysisTK/1.7/Hossain.Asif_mayo.edu.key read_filter=[] intervals=[variants.chr5.raw.vcf] excludeIntervals=
null interval_set_rule=UNION interval_merging=ALL reference_sequence=/data2/bsi/reference/sequence/human/ncbi/37.1/allchr.fa nonDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=
null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false BQSR=null quantize_quals=-1 defaultBaseQualities=-1 validation_strictness=SILENT unsafe=null num_t
hreads=1 num_cpu_threads=null num_io_threads=null num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false loggi
ng_level=INFO log_to_file=null help=false variant=(RodBinding name=variant source=variants.chr5.raw.vcf) snpEffFile=(RodBinding name= source=UNBOUND) dbsnp=(RodBinding name=dbsnp source=/data2/bsi/reference/annotation/dbSNP/hg19/dbsnp_135.hg19.
vcf.gz) comp=[] resource=[] out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEADER=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VCFWriterSt
ub annotation=[QualByDepth, MappingQualityRankSumTest, ReadPosRankSumTest, HaplotypeScore, DepthOfCoverage, MappingQualityZero, DepthPerAlleleBySample, RMSMappingQuality, FisherStrand, ForwardReverseAlleleCoun
ts] excludeAnnotation=[] group=[] expression=[] useAllAnnotations=false list=false alwaysAppendDbsnpId=false MendelViolationGenotypeQualityThreshold=0.0 requireStrictAlleleMatch=false filter_mismatching_base_a
nd_quals=false"
Post edited by Geraldine_VdAuwera on

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,641Administrator, GATK Developer admin

    It is extremely likely that this bug has been fixed. Feel free to test the latest version on a problematic region of your bam file.

    Geraldine Van der Auwera, PhD

  • sicottesicotte Posts: 3Member

    Given the recent post about wanting to extensively test the 2.4 release, I think that running the very simple 1-liner I gave is a quick way to test for a bug. I did not see any post about it this type of bug, so it may not be fixed.
    From my side, I'd have a lot of code to change to upgrade to the new commands and options in 2.3 . Can you please run the command on a VCF you have already generated, it'll take one minute tops.

    I did a little more digging and this affects both SNV's and INDELS (0.4% of all homozygous alternate alleles are affected). I just happened to be looking at indels.

    Hugues

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,641Administrator, GATK Developer admin

    Hugues, I understand your point of view but we really can't take requests for specific tests. We have an established workflow and plenty of work on our plate trying to get release 2.4 ready to ship. If you show that the bug persists in the latest version, we will be happy to fix it of course.

    Geraldine Van der Auwera, PhD

  • jfarrelljfarrell Posts: 50Member

    I just ran the above script on a multi-sample vcf (14 whole genome samples at 30-50x coverage per sample) generated by GATK unified genotyper 2.39. No lines were flagged. It looks like the bug was fixed.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,641Administrator, GATK Developer admin

    Thanks, @jfarrell. We appreciate the assist.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.