# How exactly does downsample_to_coverage work with UnifiedGenotyper?

New York, Member, Posts: 54
edited October 2012

I haven't been using GATK for long, but I assumed that the downsample_to_coverage feature would never be a cause for concern. I just ran UnifiedGenotyper with -dcov set to 500, 5,000, and 50,000 on the same single-sample BAM file. One would expect the results to be similar. However, 500 yielded 26 variants, 5,000 yielded 13, and 50,000 yielded just 1. The depth of that one variant was about 1,300 at the 50,000 cutoff. Why are the results so different?

Most of the other variants in the biggest set were cut off at a depth of 500, so some reads were filtered out. A few of them are at relatively low frequency, but most are at 25% or higher. If they were appearing by chance, they should not be at such high frequencies.

In addition, some variants have a depth below 500, so they should not be affected by the cutoff. Why do those show up with the low cutoff but not with the higher one?

I am using GATK 2.1-8. I am looking at a single gene only, so that is why there are so few variants and such high coverage.
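
One way to pin down which sites differ between the runs is to compare the call sets by position and allele. The following is a minimal sketch, assuming each run wrote its own VCF. The function names are hypothetical and the records are made up purely to illustrate the comparison; they are not taken from the actual runs.

```python
def site_keys(vcf_lines):
    """Return the set of (chrom, pos, ref, alt) keys found in VCF text lines."""
    keys = set()
    for line in vcf_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.split()
        chrom, pos, _id, ref, alt = fields[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys


def compare_call_sets(low_lines, high_lines):
    """Split sites into those unique to each run and those shared by both."""
    low, high = site_keys(low_lines), site_keys(high_lines)
    return {
        "only_low": sorted(low - high),
        "only_high": sorted(high - low),
        "shared": sorted(low & high),
    }


# Made-up records standing in for two runs with different -dcov values.
low_dcov = [
    "##fileformat=VCFv4.1",
    "chrX\t100\t.\tC\tG\t50.0\t.\tDP=500",
    "chrX\t200\t.\tA\tT\t40.0\t.\tDP=480",
]
high_dcov = [
    "##fileformat=VCFv4.1",
    "chrX\t100\t.\tC\tG\t90.0\t.\tDP=1300",
]

result = compare_call_sets(low_dcov, high_dcov)
print(result["only_low"])
```

In practice the two lists would come from reading the per-run VCF files line by line; the sites in `only_low` are the ones worth inspecting first.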

• New York, Member, Posts: 54
edited October 2012

This is targeted sequencing, so we are only amplifying a few kb of the human genome. Alignment is to hg19. Average coverage over the region of interest is ~13,000x, with all bases above 1,000x according to GATK DepthOfCoverage. This was run on the Illumina MiSeq.

This is the command (in case there is something else I may be overlooking):

GATK -T UnifiedGenotyper -L path/intervals.bed -R path/hg19.fasta \
-dcov NNN -nt 6 -glm BOTH -stand_call_conf 30 -stand_emit_conf 10 \
-I sample.bam -o sample.vcf
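
To keep the three runs strictly comparable, the same command can be generated once per -dcov value. This is a minimal sketch, assuming `GATK` is a wrapper or alias on the PATH as in the command above. The helper name and the per-run output file naming are hypothetical. It only builds and prints the argument lists and does not run anything.

```python
import shlex


def build_ug_command(dcov, bam="sample.bam", out_prefix="sample"):
    """Build the UnifiedGenotyper argument list for one -dcov value.

    Flags are copied from the command in the post; only -dcov and the
    output name (a hypothetical naming scheme) vary between runs.
    """
    return [
        "GATK", "-T", "UnifiedGenotyper",
        "-L", "path/intervals.bed",
        "-R", "path/hg19.fasta",
        "-dcov", str(dcov),
        "-nt", "6",
        "-glm", "BOTH",
        "-stand_call_conf", "30",
        "-stand_emit_conf", "10",
        "-I", bam,
        "-o", f"{out_prefix}.dcov{dcov}.vcf",
    ]


# One command per cutoff tested in the post.
commands = [build_ug_command(d) for d in (500, 5000, 50000)]
for cmd in commands:
    print(shlex.join(cmd))
```

Writing each run to its own output file avoids overwriting sample.vcf between runs, which makes the call sets easy to compare afterwards.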

These are the variants called with dcov at 5,000 (I am skipping 500 to save space):

chrX 76777866 . C G 31908.01 . AC=2;AF=1.00;AN=2;BaseQRankSum=2.345;DP=995;DS;Dels=0.00;FS=11.871;HaplotypeScore=172.0982;MLEAC=2;MLEAF=1.00;MQ=34.17;MQ0=0;MQRankSum=-0.511;QD=32.07;ReadPosRankSum=-4.069;SB=-7.181e+03 GT:AD:DP:GQ:PL 1/1:23,966:995:99:31908,2093,0
chrX 76940057 . A T 1367.01 . AC=1;AF=0.500;AN=2;BaseQRankSum=-26.723;DP=5000;DS;Dels=0.00;FS=21.495;HaplotypeScore=205.1271;MLEAC=1;MLEAF=0.500;MQ=53.67;MQ0=0;MQRankSum=2.130;QD=0.27;ReadPosRankSum=-33.511;SB=-6.519e-03 GT:AD:DP:GQ:PL 0/1:4510,464:4999:99:1397,0,32767
chrX 76777866 . C G 32767.01 . AC=2;AF=1.00;AN=2;BaseQRankSum=2.164;DP=1340;Dels=0.00;FS=15.774;HaplotypeScore=383.2877;MLEAC=2;MLEAF=1.00;MQ=35.09;MQ0=0;MQRankSum=1.261;QD=24.45;ReadPosRankSum=-4.060;SB=-1.030e+04 GT:AD:DP:GQ:PL 1/1:25,1302:1340:99:32767,2990,0
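
The alternate-allele fraction at each site can be computed directly from the AD values in these records, alongside a check for the DS flag in the INFO column (which, as far as I understand, GATK sets on sites where reads were downsampled). This is a minimal sketch, assuming single-sample records with one ALT allele as shown. The function name is hypothetical, and the strings below are abbreviated copies of the records above with INFO trimmed to a few fields.

```python
def summarize_record(line):
    """Extract site, ALT allele fraction (from AD), and DS flag from one VCF record."""
    fields = line.split()
    chrom, pos, info = fields[0], int(fields[1]), fields[7]
    # Pair FORMAT keys (GT:AD:DP:GQ:PL) with the sample's values.
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    ref_depth, alt_depth = (int(x) for x in sample["AD"].split(","))
    # INFO flags such as DS have no "=value" part.
    info_keys = {item.split("=")[0] for item in info.split(";")}
    return {
        "site": (chrom, pos),
        "alt_fraction": alt_depth / (ref_depth + alt_depth),
        "downsampled": "DS" in info_keys,
    }


# Abbreviated copies of the three records above.
records = [
    "chrX 76777866 . C G 31908.01 . DP=995;DS;QD=32.07 "
    "GT:AD:DP:GQ:PL 1/1:23,966:995:99:31908,2093,0",
    "chrX 76940057 . A T 1367.01 . DP=5000;DS;QD=0.27 "
    "GT:AD:DP:GQ:PL 0/1:4510,464:4999:99:1397,0,32767",
    "chrX 76777866 . C G 32767.01 . DP=1340;QD=24.45 "
    "GT:AD:DP:GQ:PL 1/1:25,1302:1340:99:32767,2990,0",
]

summaries = [summarize_record(r) for r in records]
for s in summaries:
    print(s["site"], f"{s['alt_fraction']:.3f}", s["downsampled"])
```

Listing the fraction next to the DS flag makes it easier to see whether the sites that disappear at higher cutoffs are the low-fraction ones or the downsampled ones.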