Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

BaseCounts annotation in vcf

SophiaSophia Posts: 33Member
edited September 2012 in Ask the team

Dear all,

I came across something that confuses me concerning the BaseCounts annotation in the vcf output after running the following commands for UnifiedGenotyper and VariantAnnotator:

java -Xmx6g -jar GenomeAnalysisTK-1.6.596/GenomeAnalysisTKLite.jar -T UnifiedGenotyper -R genome.fa -I my.bam -o my.vcf -glm BOTH java -Xmx6g -jar GenomeAnalysisTK-1.6.596/GenomeAnalysisTKLite.jar -T VariantAnnotator -R genome.fa -I my.bam -o my_annotated.vcf --variant my.vcf -A DepthPerAlleleBySample -A BaseCounts -A AlleleBalanceBySample -A AlleleBalance -A DepthOfCoverage -A SampleList

One of the positions in the final vcf file looks like this: chrM 1407 . C A 1246.01 . ABHet=0.639;AC=1;AF=0.500;AN=2;BaseCounts=360,637,2,1;BaseQRankSum=-4.587;DP=1000;DS;Dels=0.00;FS=50.421;HaplotypeScore=53.4969;MLEAC=1;MLEAF=0.500;MQ=35.00;MQ0=0;MQRankSum=-0.127;OND=3.000e-03;QD=4.98;ReadPosRankSum=-6.950;SB=-6.519e-03;Samples=F321 GT:AB:AD:DP:GQ:PL 0/1:0.640:637,360:250:99:1276,0,6082

The BaseCounts annotation is supposed to give the number of times each base was called, so at this position, I would have 360 A's, 637 C's (which is the reference), 2 G's and 1 T.

However, looking at the pileup of the input bam file at this position, the vast majority of bases is reference C, however, I fail to see this substantial number of A's (I see maybe 20), while I definitely see more than one T, e.g.

Maybe there is something I missunderstood but should not the BaseCounts (more or less) reflect what I see in the bam file?

Many thanks in advance for your comments. cheers, Sophia

Post edited by Sophia on
Tagged:

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin

    Hi Sophia, can you tell us how you're viewing the pileup? Are you sure you are seeing all reads, including any marked as duplicates?

    Geraldine Van der Auwera, PhD

  • SophiaSophia Posts: 33Member
    edited September 2012

    In this case, no duplicates have been removed or marked in the bam file. The pileup is run for the position of interest only, like:

    samtools mpileup -r chrM:1407-1407 -f genome.fa my.bam

    Post edited by Sophia on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin

    How many reads are you seeing in total in the samtools pileup? What you're seeing with the GATK tools is downsampled, which is why you have exactly 1000 reads. It's possible that if you have a lot more depth than that, after downsampling the proportions can change a bit as the downsampling is seeded randomly.

    Geraldine Van der Auwera, PhD

  • SophiaSophia Posts: 33Member

    I was also thinking of that, and, in fact I have a very high depth at this position; however, since I can see the bases for all reads in the pileup, I was able to check by eye that there are not that many A's. I am attaching a file showing the mpileup at the position.

  • SophiaSophia Posts: 33Member

    here goes the file...

    gz
    gz
    chrM_1407.mpileup.gz
    4K
  • ebanksebanks Posts: 671GSA Member mod

    Hi Sophia,

    Can you please pull out these reads (using -T PrintReads -L chrM:1407) and upload them to our server so I can take a look? Thanks

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • SophiaSophia Posts: 33Member
    edited September 2012

    I uploaded the bam file and its index to your ftp server:

    D12Y4ACXX_1_11_1.fastq_process_MQ35_1407.bam

    D12Y4ACXX_1_11_1.fastq_process_MQ35_1407.bai

    We have very high depth in this experiment, and I see that I have more than 500000 reads covering this position. I also realized that samtools mpileup by default downsamples to a depth of 8000. I can read more bases at a given position when changing this parameter to a higher value. Still, even allowing for a depth of 100000 at this position, I only see very sparse A's, and so I keep wondering how in the GATK downsampling (to only 1000!), so many of them seem to accumulate.

    Thanks a lot for looking into this!

    Post edited by Sophia on
  • ebanksebanks Posts: 671GSA Member mod

    Hi Sophia,

    The version of the GATK that you are running definitely was imperfect in the way it handled downsampling for certain conditions (which yours seems to fit). This should be better in the latest version, so you might want to consider updating. In any case, given that your pileup has 2226 As in it, I don't see that there's an actual bug here.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • SophiaSophia Posts: 33Member

    Thanks Eric. I reanalysed my data with the latest version, GenomeAnalysisTKLite-2.1-8-gbb7f038, and it basically gives me the same base counts at the same positions. I also spent more time with samtools mpileup, and it looks like for the downsampled list of bases at each position (the default 8000 or anything smaller than the total number of reads at the position) reads are not chosen quite at random, so unless you mpileup for ALL the reads covering the position, you might not get a representative fraction of alternative alleles. This was the main source of my confusion, but now I see through it. So no further worries here, I will stick to GATK for this analysis.

Sign In or Register to comment.