CombineVariants unwanted merge behaviour at identical positions
Question: Is it possible to have CV merge like bcftools does it?
I get this warning when running UG in GGA mode using an -alleles VCF generated with CV:
WARN 10:17:21,394 GenotypingGivenAllelesUtils - Multiple valid VCF records detected in the alleles input file at site 20:106089, only considering the first record
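A quick way to see which sites would trigger this warning is to look for positions that occur in more than one record of the alleles file. The sketch below is only an illustration: the function name `duplicated_sites` is hypothetical, it is not part of GATK, and it assumes records are already sorted into whitespace-separated text lines.

```python
from collections import Counter

def duplicated_sites(vcf_lines):
    """Return sorted (chrom, pos) pairs that occur in more than one data record."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos = line.split()[:2]
        counts[(chrom, int(pos))] += 1
    return sorted(site for site, n in counts.items() if n > 1)

# The two records that CV emits at the same position in this thread.
records = [
    "#CHROM POS ID REF ALT",
    "20 106089 . C A",
    "20 106089 . CA C",
]
print(duplicated_sites(records))  # [('20', 106089)]
```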
I made this call with HC from 10 samples:
20 106089 . CA C
And this call with UG from 10 other samples:
20 106089 . C A
CV merges like this:
20 106089 . C A
20 106089 . CA C
bcftools merges like this:
20 106089 . CA AA,C
The UG recall from the CV-generated -alleles VCF is incomplete:
20 106089 . C A
The UG recall from the bcftools-generated -alleles VCF is complete:
20 106089 . CA AA,C
Is it possible to have CV merge like bcftools does it?
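For reference, the bcftools-style result in this question can be reproduced by rewriting every record at the site against the longest REF: each ALT gets the bases its own REF is missing appended to it. The sketch below only illustrates that idea; it is not the actual bcftools or CombineVariants code, the name `merge_site` is hypothetical, and it assumes plain base-string alleles (no symbolic alleles such as `<DEL>` or `*`).

```python
def merge_site(records):
    """Merge (ref, alts) pairs at one position onto the longest REF.

    Each ALT is padded with the bases its own REF lacks relative to the
    longest REF, so all alleles describe the same reference span.
    """
    merged_ref = max((ref for ref, _ in records), key=len)
    merged_alts = []
    for ref, alts in records:
        if not merged_ref.startswith(ref):
            raise ValueError(f"REF {ref!r} is not a prefix of {merged_ref!r}")
        suffix = merged_ref[len(ref):]
        for alt in alts:
            padded = alt + suffix
            # Skip alleles that collapse onto REF and avoid duplicates.
            if padded != merged_ref and padded not in merged_alts:
                merged_alts.append(padded)
    return merged_ref, merged_alts

# UG SNP (C -> A) and HC deletion (CA -> C) at 20:106089.
ref, alts = merge_site([("C", ["A"]), ("CA", ["C"])])
print(ref, ",".join(alts))  # CA AA,C
```

ALT order here simply follows input order; a real merger may order alleles differently.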
I'm really not sure. It's not a use case that UG was designed for (with UG we kept SNPs and indels separate until post-analysis), so I would recommend being cautious with it.
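If you go the route of keeping SNPs and indels separate, the records first need to be sorted into the two classes. A minimal sketch of a length-based rule follows; it is a simplification and not GATK's own variant typing, the function name `variant_class` is hypothetical, and it assumes each record has at least one plain base-string ALT.

```python
def variant_class(ref, alts):
    """Classify a record as 'SNP', 'MNP', 'INDEL', or 'MIXED' from allele lengths."""
    kinds = set()
    for alt in alts:
        if len(ref) == 1 and len(alt) == 1:
            kinds.add("SNP")
        elif len(ref) != len(alt):
            kinds.add("INDEL")
        else:
            kinds.add("MNP")  # same length, longer than one base
    # A single consistent class is returned as-is; anything else is mixed.
    return kinds.pop() if len(kinds) == 1 else "MIXED"

print(variant_class("C", ["A"]))         # SNP
print(variant_class("CA", ["C"]))        # INDEL
print(variant_class("CA", ["AA", "C"]))  # MIXED
```

Under this rule the two original calls fall cleanly into separate SNP and indel sets, while the merged CA to AA,C record is mixed and would belong to neither.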
I checked the genotypes and UG seems to handle merged MNPs and indels just fine; see below. But I will do some additional testing. Or I might just take the safe path and do the recalling separately for SNPs and indels as suggested. The reason I have UG and HC calls in the first place is because I have low and high coverage data for different cohorts. I want to create a merged dataset.
Despite --interval_padding 100 helping to recall more sites with HC in GGA mode, as per the previous recommendation, some sites still fail to be called with HC in GGA mode. Hence I opted for UG.
UG calls on samples 1-10:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 535 545 546 550 554 564 567 574 575 578
20 106089 . C A 16.19 . AC=2;AF=0.125;AN=16;BaseQRankSum=-0.854;DP=37;Dels=0.00;FS=0.000;HaplotypeScore=1.5282;MLEAC=2;MLEAF=0.125;MQ=58.74;MQ0=0;MQRankSum=-0.560;QD=2.70;ReadPosRankSum=-1.797;SOR=0.935;VariantType=SNP GT:AD:DP:GQ:PL 0/0:3,0:3:6:0,6,76 0/0:4,2:6:9:0,9,115 0/1:3,1:4:24:24,0,80 0/0:6,0:6:12:0,12,130 0/1:1,1:2:29:30,0,29 ./. 0/0:7,0:7:15:0,15,188 0/0:3,1:4:6:0,6,74 ./. 0/0:5,0:5:12:0,12,142
HC calls on samples 11-20:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 585 590 622 625 628 640 655 668 687 693
20 106089 . CA C 47.95 . AC=5;AF=0.250;AN=20;BaseQRankSum=0.925;DP=36;FS=1.850;InbreedingCoeff=0.0646;MLEAC=5;MLEAF=0.250;MQ=59.48;MQ0=0;MQRankSum=0.175;QD=3.00;ReadPosRankSum=-1.725;SOR=0.387 GT:AD:GQ:PL 0/0:2,0:6:0,6,49 0/0:2,0:6:0,6,49 0/0:3,0:12:0,12,130 0/0:5,0:15:0,15,122 0/0:2,0:6:0,6,46 0/1:2,1:14:14,0,39 0/1:2,1:15:15,0,38 0/0:4,0:12:0,12,93 0/1:3,1:12:12,0,46 1/1:0,3:9:67,9,0
UG GGA recalls on samples 1-20:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 535 545 546 550 554 564 567 574 575 578 585 590 622 625 628 640 655 668 687 693
20 106089 . CA AA,C 110.56 . AC=0,8;AF=0.00,0.222;AN=36;DP=81;FS=0.000;InbreedingCoeff=0.5076;MLEAC=0,6;MLEAF=0.00,0.167;MQ=58.56;MQ0=0;QD=3.45;SOR=0.859;VariantType=MULTIALLELIC_MIXED GT:AD:DP:GQ:PL:SB 0/0:0,0,0:3:0:0,0,0,6,6,52:0,0,0,0 0/2:0,0,1:6:0:5,5,5,0,0,109:0,0,1,0 0/2:0,0,1:4:0:12,12,12,0,0,47:0,0,1,0 0/0:0,0,0:6:0:0,0,0,17,17,123:0,0,0,0 0/0:0,0,0:2:0:0,0,0,3,3,10:0,0,0,0 ./. 0/0:0,0,0:7:0:0,0,0,9,9,60:0,0,0,0 0/2:0,0,1:4:0:12,12,12,0,0,61:0,0,0,1 ./. 0/0:0,0,1:5:0:0,0,0,4,4,30:0,0,0,1 0/0:0,0,0:3:0:0,0,0,6,6,49:0,0,0,0 0/0:0,0,0:3:0:0,0,0,9,9,76:0,0,0,0 0/0:0,0,1:4:0:0,0,0,1,1,22:0,0,1,0 0/0:0,0,0:7:0:0,0,0,18,18,149:0,0,0,0 0/0:0,0,0:4:0:0,0,0,11,11,76:0,0,0,0 0/2:0,0,1:5:0:9,9,9,0,0,65:0,0,0,1 0/2:0,0,1:4:0:12,12,12,0,0,60:0,0,0,1 0/0:0,0,0:5:0:0,0,0,15,15,116:0,0,0,0 0/2:0,0,1:6:0:12,12,12,0,0,47:0,0,0,1 2/2:0,0,3:3:9:67,67,67,9,9,0:0,0,3,0
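One way to sanity-check the recalled record is to recompute AC and AN directly from the GT fields and compare them with the INFO column. The sketch below does this for the 20 genotypes of the UG GGA recall; the function name `allele_counts` is hypothetical, and it assumes diploid-style GT strings separated by `/` or `|`.

```python
import re

def allele_counts(genotypes, n_alt):
    """Return (AC list, AN) computed from GT strings such as '0/1' or './.'."""
    ac = [0] * n_alt
    an = 0
    for gt in genotypes:
        for allele in re.split(r"[/|]", gt):
            if allele == ".":
                continue  # missing allele: counts toward neither AC nor AN
            an += 1
            idx = int(allele)
            if idx > 0:
                ac[idx - 1] += 1  # ALT alleles are numbered from 1
    return ac, an

# GT fields copied from the UG GGA recall record for samples 1-20.
gts = ("0/0 0/2 0/2 0/0 0/0 ./. 0/0 0/2 ./. 0/0 "
       "0/0 0/0 0/0 0/0 0/0 0/2 0/2 0/0 0/2 2/2").split()
ac, an = allele_counts(gts, n_alt=2)
print(ac, an)  # [0, 8] 36
```

The result agrees with AC=0,8 and AN=36 in the record: all non-reference genotypes carry the deletion allele (C), and none carry the AA allele.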
This thread is related to the following threads on GGA:
http://gatkforums.broadinstitute.org/discussion/5249/overcalling-deletion-in-unifiedgenotyper-genotype-given-alleles-mode
http://gatkforums.broadinstitute.org/discussion/5018/ug-call-combined-snp-indel-sites-in-gga-mode
http://gatkforums.broadinstitute.org/discussion/4936/not-all-sites-emitted-with-genotype-given-alleles
http://gatkforums.broadinstitute.org/discussion/4024/genotype-and-validate-or-haplotype-caller-gga-what-am-i-doing-wrong
P.S. I might gatecrash your Cambridge party this week despite not being invited. The course was already fully booked when you announced it. I don't have a time machine!