The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

#### ☞ Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Formatting tip!

Surround blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ` ) each to make a code block.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

# BaseCounts annotation in vcf

Member Posts: 50
edited September 2012

Dear all,

I came across something that confuses me concerning the BaseCounts annotation in the vcf output after running the following commands for UnifiedGenotyper and VariantAnnotator:

java -Xmx6g -jar GenomeAnalysisTK-1.6.596/GenomeAnalysisTKLite.jar -T UnifiedGenotyper -R genome.fa -I my.bam -o my.vcf -glm BOTH
java -Xmx6g -jar GenomeAnalysisTK-1.6.596/GenomeAnalysisTKLite.jar -T VariantAnnotator -R genome.fa -I my.bam -o my_annotated.vcf --variant my.vcf -A DepthPerAlleleBySample -A BaseCounts -A AlleleBalanceBySample -A AlleleBalance -A DepthOfCoverage -A SampleList

One of the positions in the final vcf file looks like this:

The BaseCounts annotation is supposed to give the number of times each base was called, so at this position, I would have 360 A's, 637 C's (which is the reference), 2 G's and 1 T.

However, looking at the pileup of the input bam file at this position, the vast majority of bases is reference C, however, I fail to see this substantial number of A's (I see maybe 20), while I definitely see more than one T, e.g.

Maybe there is something I missunderstood but should not the BaseCounts (more or less) reflect what I see in the bam file?

cheers,
Sophia

Tagged:

Hi Sophia, can you tell us how you're viewing the pileup? Are you sure you are seeing all reads, including any marked as duplicates?

Geraldine Van der Auwera, PhD

• Member Posts: 50
edited September 2012

In this case, no duplicates have been removed or marked in the bam file.
The pileup is run for the position of interest only, like:

samtools mpileup -r chrM:1407-1407 -f genome.fa my.bam

How many reads are you seeing in total in the samtools pileup? What you're seeing with the GATK tools is downsampled, which is why you have exactly 1000 reads. It's possible that if you have a lot more depth than that, after downsampling the proportions can change a bit as the downsampling is seeded randomly.

Geraldine Van der Auwera, PhD

• Member Posts: 50

I was also thinking of that, and, in fact I have a very high depth at this position; however, since I can see the bases for all reads in the pileup, I was able to check by eye that there are not that many A's. I am attaching a file showing the mpileup at the position.

• Member Posts: 50

here goes the file...

Hi Sophia,

Can you please pull out these reads (using -T PrintReads -L chrM:1407) and upload them to our server so I can take a look? Thanks

Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

• Member Posts: 50
edited September 2012

I uploaded the bam file and its index to your ftp server:

D12Y4ACXX_1_11_1.fastq_process_MQ35_1407.bam

D12Y4ACXX_1_11_1.fastq_process_MQ35_1407.bai

We have very high depth in this experiment, and I see that I have more than 500000 reads covering this position.
I also realized that samtools mpileup by default downsamples to a depth of 8000. I can read more bases at a given position when changing this parameter to a higher value. Still, even allowing for a depth of 100000 at this position, I only see very sparse A's, and so I keep wondering how in the GATK downsampling (to only 1000!), so many of them seem to accumulate.

Thanks a lot for looking into this!

Hi Sophia,

The version of the GATK that you are running definitely was imperfect in the way it handled downsampling for certain conditions (which yours seems to fit). This should be better in the latest version, so you might want to consider updating. In any case, given that your pileup has 2226 As in it, I don't see that there's an actual bug here.

Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

• Member Posts: 50

Thanks Eric.
I reanalysed my data with the latest version, GenomeAnalysisTKLite-2.1-8-gbb7f038, and it basically gives me the same base counts at the same positions. I also spent more time with samtools mpileup, and it looks like for the downsampled list of bases at each position (the default 8000 or anything smaller than the total number of reads at the position) reads are not chosen quite at random, so unless you mpileup for ALL the reads covering the position, you might not get a representative fraction of alternative alleles. This was the main source of my confusion, but now I see through it.
So no further worries here, I will stick to GATK for this analysis.