The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Get notifications!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

Did we ask for a bug report?

Then follow instructions in Article#1894.

Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block as demonstrated here.

Jump to another community
Picard 2.9.4 is now available. Download and read release notes here.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

BaseCounts annotation in vcf

SophiaSophia Member
edited September 2012 in Ask the GATK team

Dear all,

I came across something that confuses me concerning the BaseCounts annotation in the vcf output after running the following commands for UnifiedGenotyper and VariantAnnotator:

java -Xmx6g -jar GenomeAnalysisTK-1.6.596/GenomeAnalysisTKLite.jar -T UnifiedGenotyper -R genome.fa -I my.bam -o my.vcf -glm BOTH
java -Xmx6g -jar GenomeAnalysisTK-1.6.596/GenomeAnalysisTKLite.jar -T VariantAnnotator -R genome.fa -I my.bam -o my_annotated.vcf --variant my.vcf -A DepthPerAlleleBySample -A BaseCounts -A AlleleBalanceBySample -A AlleleBalance -A DepthOfCoverage -A SampleList

One of the positions in the final vcf file looks like this:
chrM 1407 . C A 1246.01 . ABHet=0.639;AC=1;AF=0.500;AN=2;BaseCounts=360,637,2,1;BaseQRankSum=-4.587;DP=1000;DS;Dels=0.00;FS=50.421;HaplotypeScore=53.4969;MLEAC=1;MLEAF=0.500;MQ=35.00;MQ0=0;MQRankSum=-0.127;OND=3.000e-03;QD=4.98;ReadPosRankSum=-6.950;SB=-6.519e-03;Samples=F321 GT:AB:AD:DP:GQ:PL 0/1:0.640:637,360:250:99:1276,0,6082

The BaseCounts annotation is supposed to give the number of times each base was called, so at this position, I would have 360 A's, 637 C's (which is the reference), 2 G's and 1 T.

However, looking at the pileup of the input bam file at this position, the vast majority of bases is reference C, however, I fail to see this substantial number of A's (I see maybe 20), while I definitely see more than one T, e.g.

Maybe there is something I missunderstood but should not the BaseCounts (more or less) reflect what I see in the bam file?

Many thanks in advance for your comments.



  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi Sophia, can you tell us how you're viewing the pileup? Are you sure you are seeing all reads, including any marked as duplicates?

  • SophiaSophia Member
    edited September 2012

    In this case, no duplicates have been removed or marked in the bam file.
    The pileup is run for the position of interest only, like:

    samtools mpileup -r chrM:1407-1407 -f genome.fa my.bam

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    How many reads are you seeing in total in the samtools pileup? What you're seeing with the GATK tools is downsampled, which is why you have exactly 1000 reads. It's possible that if you have a lot more depth than that, after downsampling the proportions can change a bit as the downsampling is seeded randomly.

  • I was also thinking of that, and, in fact I have a very high depth at this position; however, since I can see the bases for all reads in the pileup, I was able to check by eye that there are not that many A's. I am attaching a file showing the mpileup at the position.

  • here goes the file...

  • ebanksebanks Broad InstituteMember, Broadie, Dev

    Hi Sophia,

    Can you please pull out these reads (using -T PrintReads -L chrM:1407) and upload them to our server so I can take a look? Thanks

  • SophiaSophia Member
    edited September 2012

    I uploaded the bam file and its index to your ftp server:



    We have very high depth in this experiment, and I see that I have more than 500000 reads covering this position.
    I also realized that samtools mpileup by default downsamples to a depth of 8000. I can read more bases at a given position when changing this parameter to a higher value. Still, even allowing for a depth of 100000 at this position, I only see very sparse A's, and so I keep wondering how in the GATK downsampling (to only 1000!), so many of them seem to accumulate.

    Thanks a lot for looking into this!

  • ebanksebanks Broad InstituteMember, Broadie, Dev

    Hi Sophia,

    The version of the GATK that you are running definitely was imperfect in the way it handled downsampling for certain conditions (which yours seems to fit). This should be better in the latest version, so you might want to consider updating. In any case, given that your pileup has 2226 As in it, I don't see that there's an actual bug here.

  • Thanks Eric.
    I reanalysed my data with the latest version, GenomeAnalysisTKLite-2.1-8-gbb7f038, and it basically gives me the same base counts at the same positions. I also spent more time with samtools mpileup, and it looks like for the downsampled list of bases at each position (the default 8000 or anything smaller than the total number of reads at the position) reads are not chosen quite at random, so unless you mpileup for ALL the reads covering the position, you might not get a representative fraction of alternative alleles. This was the main source of my confusion, but now I see through it.
    So no further worries here, I will stick to GATK for this analysis.

Sign In or Register to comment.