The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

#### ☞ Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ` ) each to make a code block as demonstrated here.

GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

# BaseCounts annotation in vcf

Member
edited September 2012

Dear all,

I came across something that confuses me concerning the BaseCounts annotation in the vcf output after running the following commands for UnifiedGenotyper and VariantAnnotator:

java -Xmx6g -jar GenomeAnalysisTK-1.6.596/GenomeAnalysisTKLite.jar -T UnifiedGenotyper -R genome.fa -I my.bam -o my.vcf -glm BOTH
java -Xmx6g -jar GenomeAnalysisTK-1.6.596/GenomeAnalysisTKLite.jar -T VariantAnnotator -R genome.fa -I my.bam -o my_annotated.vcf --variant my.vcf -A DepthPerAlleleBySample -A BaseCounts -A AlleleBalanceBySample -A AlleleBalance -A DepthOfCoverage -A SampleList

One of the positions in the final vcf file looks like this:

The BaseCounts annotation is supposed to give the number of times each base was called, so at this position, I would have 360 A's, 637 C's (which is the reference), 2 G's and 1 T.

However, looking at the pileup of the input bam file at this position, the vast majority of bases is reference C, however, I fail to see this substantial number of A's (I see maybe 20), while I definitely see more than one T, e.g.

Maybe there is something I missunderstood but should not the BaseCounts (more or less) reflect what I see in the bam file?

cheers,
Sophia

Tagged:

Hi Sophia, can you tell us how you're viewing the pileup? Are you sure you are seeing all reads, including any marked as duplicates?

• Member
edited September 2012

In this case, no duplicates have been removed or marked in the bam file.
The pileup is run for the position of interest only, like:

samtools mpileup -r chrM:1407-1407 -f genome.fa my.bam

How many reads are you seeing in total in the samtools pileup? What you're seeing with the GATK tools is downsampled, which is why you have exactly 1000 reads. It's possible that if you have a lot more depth than that, after downsampling the proportions can change a bit as the downsampling is seeded randomly.

• Member

I was also thinking of that, and, in fact I have a very high depth at this position; however, since I can see the bases for all reads in the pileup, I was able to check by eye that there are not that many A's. I am attaching a file showing the mpileup at the position.

• Member

here goes the file...

Hi Sophia,

Can you please pull out these reads (using -T PrintReads -L chrM:1407) and upload them to our server so I can take a look? Thanks

• Member
edited September 2012

I uploaded the bam file and its index to your ftp server:

D12Y4ACXX_1_11_1.fastq_process_MQ35_1407.bam

D12Y4ACXX_1_11_1.fastq_process_MQ35_1407.bai

We have very high depth in this experiment, and I see that I have more than 500000 reads covering this position.
I also realized that samtools mpileup by default downsamples to a depth of 8000. I can read more bases at a given position when changing this parameter to a higher value. Still, even allowing for a depth of 100000 at this position, I only see very sparse A's, and so I keep wondering how in the GATK downsampling (to only 1000!), so many of them seem to accumulate.

Thanks a lot for looking into this!