Heterozygous position called as Homozygous

atreeneedsaforestatreeneedsaforest Posts: 4Member
edited June 2013 in Ask the GATK team

Dear developers, I have looked into the forum for similar questions but I couldn't find any. I have several cases in which I get homozygous calls in positions with ~50% of reads calling the mutation (or less), please find here an example of a position validated by Sanger (as het) in which I have a high coverage (~400 reads in total) here is the results with UnifiedGenotyper:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  L1
chr4    998101  .       C       T       7296.77 .       AC=2;AF=1.00;AN=2;BaseQRankSum=16.344;DP=411;Dels=0.00;FS=0.000;HaplotypeScore=11.8258;MLEAC=2;MLEAF=1.00;MQ=59.94;MQ0=0;MQRankSum=-1.436;QD=17.75;ReadPosRankSum=-0.062   GT:AD:DP:GQ:PL  1/1:203,208:411:99:7325,581,0

can you help me in this case? I am really puzzled.
I have run it with v2.2, 2.4 and 2.5 and I always had the same genotype call (the excerpt here is from the 2.5, I have downloaded it just to check it was not caused by a bug already fixed). It's not a downsampling issue since I have high coverage samples (HaloPlex) and used higher dcov than the default.

Post edited by Geraldine_VdAuwera on

Comments

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,467Administrator, GATK Developer admin

    Hi there,

    Can you please post the command line you used for calling, and if possible a screenshot of the pileup at that position?

    Geraldine Van der Auwera, PhD

  • atreeneedsaforestatreeneedsaforest Posts: 4Member

    Hi Geraldine, please find attached the screenshot. The majority of reads spanning the variation are from the same amplicon, but you can see at the top also a few reads from a different amplicon and also nearly half of them carry the mutation. Some of the reads at the bottom have been trimmed, probably for low quality of the ends (we trim to remove adapter sequence and Q<20 for HaloPlex data). Here is the command:

    gatk=/opt/software/ngs/lib/GenomeAnalysisTK-2.5-2-gf57256b
    reffa=/home/ngsworkspace/references/ucsc.hg19.fa
    dbvcf=/home/ngsworkspace/references/gatk/current/dbsnp_137.hg19.vcf
    recalbam=/home/users/ngs/exomeAnalysis/Project_Exome4/Aligned/S136.bam
    rawsnps=L1_snp.vcf
    snpmets=L1_snp.metrics
    java -Xmx47g -jar ${gatk}/GenomeAnalysisTK.jar -T UnifiedGenotyper -I $recalbam \
        -R $reffa -o $rawsnps -metrics $snpmets -glm SNP -ploidy 2 -D $dbvcf -mbq 30 \
        -stand_call_conf 50 -L chr4:998,049-998,172 -nct 16 -dcov 5000
    L1_screenshot.png
    1141 x 558 - 14K
  • delangeldelangel Posts: 71GATK Developer mod

    Hi there - so, this is a hypothesis of what can be happening (I could be wrong though). Your VCF record has an extremely high value in your BaseQRankSum - this means that reads carrying the ALT allele have a significantly higher quality score than the reads carrying REF at that position (I can't offer a plausible explanation of why this might be the case, as normally we see the opposite). This might be an artifact of your upstream processing but if you can confirm this at least it should point you to where the problem might be.

  • atreeneedsaforestatreeneedsaforest Posts: 4Member

    Hi delangel, thank you, you pointed at the right direction: I have rerun the analysis on the realigned bam (before Base Quality Score Recalibration), to check if the base quality unbalance could have been introduced after that step, and the result is an heterozygous call: chr4 998101 . C T 6853.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.273;DP=411;Dels=0.00;FS=0.000;HaplotypeScore=11.6256;MLEAC=1;MLEAF=0.500;MQ=59.94;MQ0=0;MQRankSum=-1.416;QD=16.68;ReadPosRankSum=-0.859 GT:AD:DP:GQ:PL 0/1:203,208:411:99:6882,0,6484 We will extract the qualities of the bases in that position to check their values before and after BQSR, just to check how big the difference is. Could it be that, being the target quite small (it is ~ 500 kb, including ~300 genes) and the reads generated by amplicons (so they are inherently not independent) the BQSR may not be advisable if some basic assumptions in the BQSR model are violated? Please let me know what you think or if you have any suggestion in changing some parameter that may apply in this kind of data.

    In any case, I will be happy to collaborate in the generation of specific guidelines for targeted and the new whole exome HaloPlex sequencing data, since we are working quite a lot with them in the last few months and there are a series of caveats to take into account that may be helpful for the community.

    Best,

    Margherita

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,467Administrator, GATK Developer admin

    Hmm, that's an interesting observation. We'd definitely be interested in hearing about your experiences. I'm sure other users would welcome having recommendations on how to deal with the idiosyncrasies of this platform. Thanks!

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.