Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Why I obtained a g.vcf with wrong variant DPs, and too few variants according to coverage?
I extracted Exome regions from public bam files to apply the same pipeline that I did for my samples and merge them. It is weird that looking in the g.vcf files, I have very few "variants", and those have very few reads (these are high coverage samples). Here there are two examples: As you can see this regions has a coverage of 22-40, How can I have a DP=1??
1 146639579 . C . . END=146639580 GT:DP:GQ:MIN_DP:PL 0/0:38:99:37:0,99,1485
1 146639581 . C . . END=146639581 GT:DP:GQ:MIN_DP:PL 0/0:38:96:38:0,96,1440
1 146639582 . C . . END=146639587 GT:DP:GQ:MIN_DP:PL 0/0:40:99:38:0,99,1485
1 146639588 rs55797044 A G, 0.01 . DB;DP=1;MLEAC=0,0;MLEAF=0.00,0.00;MQ=254.00 GT:AD:DP:GQ:PL:SB 0/0:1,0,0:1:3:0,3,10,3,10,10:0,1,0,0
1 146639589 . T . . END=146639612 GT:DP:GQ:MIN_DP:PL 0/0:40:99:37:0,105,1575
1 146643468 . G . . END=146643471 GT:DP:GQ:MIN_DP:PL 0/0:23:54:22:0,54,810
1 146643472 . G . . END=146643472 GT:DP:GQ:MIN_DP:PL 0/0:22:57:22:0,57,855
1 1263129 . C . . END=1263129 GT:DP:GQ:MIN_DP:PL 0/0:26:59:26:0,59,461
1 1263130 . T . . END=1263143 GT:DP:GQ:MIN_DP:PL 0/0:24:60:22:0,60,900
1 1263144 rs307350 G A, 0.14 . DB;DP=1;MLEAC=1,0;MLEAF=0.500,0.00;MQ=254.00 GT:AD:DP:GQ:PL:SB 1/1:0,1,0:1:3:10,3,0,10,3,10:0,0,1,0
1 1263145 . G . . END=1263163 GT:DP:GQ:MIN_DP:PL 0/0:25:63:23:0,63,945
1 1263164 . A . . END=1263164 GT:DP:GQ:MIN_DP:PL 0/0:22:49:22:0,49,378
When I compare the lines of this g.vcf file with what I obtained with extracting the same regions in individuals from other public bam files (the 1000GP) of (10x), I got:
1000GP bam (10X): Total reads: 13924960, wc raw.g.vcf = 20819861, grep 0/1 raw.g.vcf = 57,362
Public bam (39X): Total reads: 43168049, wc raw.g.vcf = 8136676, grep 0/1 raw.g.vcf = 9
So, why having high coverage bam files, I end up with so few variants?
Could it be the result of some error I had in the IndelRealignment step (GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T IndelRealigner) ?
ERROR MESSAGE: Exception when processing alignment for BAM index HS2000-1017_290:2:2106:4486:29552 2/2 100b unmapped read.
The only way to solve that error (according to ValidateSamFile there was no errors) was to remove those unmapped reads:
samtools view -b -F 4 file.bam > mapped.bam
BaseRecalibrator Round 2
Thank you very much,