The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

#### ☞ Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks (  ) each to make a code block as demonstrated here.

GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

# Heterozygous position called as Homozygous

Member
edited June 2013

Dear developers,
I have looked into the forum for similar questions but I couldn't find any. I have several cases in which I get homozygous calls in positions with ~50% of reads calling the mutation (or less), please find here an example of a position validated by Sanger (as het) in which I have a high coverage (~400 reads in total) here is the results with UnifiedGenotyper:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  L1


can you help me in this case? I am really puzzled.
I have run it with v2.2, 2.4 and 2.5 and I always had the same genotype call (the excerpt here is from the 2.5, I have downloaded it just to check it was not caused by a bug already fixed). It's not a downsampling issue since I have high coverage samples (HaloPlex) and used higher dcov than the default.

Post edited by Geraldine_VdAuwera on
Tagged:

Hi there,

Can you please post the command line you used for calling, and if possible a screenshot of the pileup at that position?

• Member

Hi Geraldine,
please find attached the screenshot. The majority of reads spanning the variation are from the same amplicon, but you can see at the top also a few reads from a different amplicon and also nearly half of them carry the mutation. Some of the reads at the bottom have been trimmed, probably for low quality of the ends (we trim to remove adapter sequence and Q<20 for HaloPlex data).
Here is the command:
gatk=/opt/software/ngs/lib/GenomeAnalysisTK-2.5-2-gf57256b reffa=/home/ngsworkspace/references/ucsc.hg19.fa dbvcf=/home/ngsworkspace/references/gatk/current/dbsnp_137.hg19.vcf recalbam=/home/users/ngs/exomeAnalysis/Project_Exome4/Aligned/S136.bam rawsnps=L1_snp.vcf snpmets=L1_snp.metrics java -Xmx47g -jar ${gatk}/GenomeAnalysisTK.jar -T UnifiedGenotyper -I$recalbam \ -R $reffa -o$rawsnps -metrics $snpmets -glm SNP -ploidy 2 -D$dbvcf -mbq 30 \ -stand_call_conf 50 -L chr4:998,049-998,172 -nct 16 -dcov 5000 

Hi there - so, this is a hypothesis of what can be happening (I could be wrong though). Your VCF record has an extremely high value in your BaseQRankSum - this means that reads carrying the ALT allele have a significantly higher quality score than the reads carrying REF at that position (I can't offer a plausible explanation of why this might be the case, as normally we see the opposite). This might be an artifact of your upstream processing but if you can confirm this at least it should point you to where the problem might be.

• Member

Hi delangel,
thank you, you pointed at the right direction: I have rerun the analysis on the realigned bam (before Base Quality Score Recalibration), to check if the base quality unbalance could have been introduced after that step, and the result is an heterozygous call:
chr4 998101 . C T 6853.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.273;DP=411;Dels=0.00;FS=0.000;HaplotypeScore=11.6256;MLEAC=1;MLEAF=0.500;MQ=59.94;MQ0=0;MQRankSum=-1.416;QD=16.68;ReadPosRankSum=-0.859 GT:AD:DP:GQ:PL 0/1:203,208:411:99:6882,0,6484 `
We will extract the qualities of the bases in that position to check their values before and after BQSR, just to check how big the difference is. Could it be that, being the target quite small (it is ~ 500 kb, including ~300 genes) and the reads generated by amplicons (so they are inherently not independent) the BQSR may not be advisable if some basic assumptions in the BQSR model are violated?
Please let me know what you think or if you have any suggestion in changing some parameter that may apply in this kind of data.

In any case, I will be happy to collaborate in the generation of specific guidelines for targeted and the new whole exome HaloPlex sequencing data, since we are working quite a lot with them in the last few months and there are a series of caveats to take into account that may be helpful for the community.

Best,

Margherita

Hmm, that's an interesting observation. We'd definitely be interested in hearing about your experiences. I'm sure other users would welcome having recommendations on how to deal with the idiosyncrasies of this platform. Thanks!