Periodicity in variant calling quality - is this normal?

After applying the standard RNA-Seq pipeline (with STAR, etc) I called varients with the command:

java -jar GenomeAnalysisTK.jar
    -T HaplotypeCaller
    -R chromosome.fa
    -I ./final.bam
    --variant_index_type LINEAR
    --variant_index_parameter 128000
    --emitRefConfidence GVCF -o ./final.gvcf

On the resultant gVCF file, I ran a little python script to see the distribution of calling quality across the different called genotypes:

  • x-axis is quality score rounded to the nearest integer
  • y-axis is the number of variants at that quality score


As you can see, its mostly heterozygous variants, which is what I expect since this data comes from highly inbred mice.
What i didn't expect however is the periodicity. Is that normal?
Now I presumably I need to filter these variants on some number of quality score, and from this I really dont know where. 0? 50? 75?

Code to generate this data:

#!/usr/bin/env python2.7
import collections
with open('/home/john/overnight/outputs/ctrl_all_FVB.gvcf', 'rb') as f:
    data = {}
    for line in f:
        if line[0] == '#': continue
        line = line.split('\t')
        if line[5] == '.': continue
        gt = line[9][:3]
        try: data[gt][int(float(line[5]))] += 1
        except KeyError: data[gt] = collections.defaultdict(int)
for gt,qualities in data.items():
    print '\n',gt
    for qual,count in sorted(qualities.items()):
        print qual,count

Best Answer


  • LonginottoLonginotto FreiburgMember

    Thank you Geraldine! That's very reassuring :)
    No, hehehe, we can leave the devs alone - so long as its an understood phenomena then im happy :)

Sign In or Register to comment.