Service notice: Several of our team members are on vacation so service will be slow through at least July 13th, possibly longer depending on how much backlog accumulates during that time. This means that for a while it may take us more time than usual to answer your questions. Thank you for your patience.

Why I obtained a g.vcf with wrong variant DPs, and too few variants according to coverage?

I extracted Exome regions from public bam files to apply the same pipeline that I did for my samples and merge them. It is weird that looking in the g.vcf files, I have very few "variants", and those have very few reads (these are high coverage samples). Here there are two examples: As you can see this regions has a coverage of 22-40, How can I have a DP=1??

1 146639579 . C . . END=146639580 GT:DP:GQ:MIN_DP:PL 0/0:38:99:37:0,99,1485
1 146639581 . C . . END=146639581 GT:DP:GQ:MIN_DP:PL 0/0:38:96:38:0,96,1440
1 146639582 . C . . END=146639587 GT:DP:GQ:MIN_DP:PL 0/0:40:99:38:0,99,1485
1 146639588 rs55797044 A G, 0.01 . DB;DP=1;MLEAC=0,0;MLEAF=0.00,0.00;MQ=254.00 GT:AD:DP:GQ:PL:SB 0/0:1,0,0:1:3:0,3,10,3,10,10:0,1,0,0
1 146639589 . T . . END=146639612 GT:DP:GQ:MIN_DP:PL 0/0:40:99:37:0,105,1575
1 146643468 . G . . END=146643471 GT:DP:GQ:MIN_DP:PL 0/0:23:54:22:0,54,810
1 146643472 . G . . END=146643472 GT:DP:GQ:MIN_DP:PL 0/0:22:57:22:0,57,855

1 1263129 . C . . END=1263129 GT:DP:GQ:MIN_DP:PL 0/0:26:59:26:0,59,461
1 1263130 . T . . END=1263143 GT:DP:GQ:MIN_DP:PL 0/0:24:60:22:0,60,900
1 1263144 rs307350 G A, 0.14 . DB;DP=1;MLEAC=1,0;MLEAF=0.500,0.00;MQ=254.00 GT:AD:DP:GQ:PL:SB 1/1:0,1,0:1:3:10,3,0,10,3,10:0,0,1,0
1 1263145 . G . . END=1263163 GT:DP:GQ:MIN_DP:PL 0/0:25:63:23:0,63,945
1 1263164 . A . . END=1263164 GT:DP:GQ:MIN_DP:PL 0/0:22:49:22:0,49,378

When I compare the lines of this g.vcf file with what I obtained with extracting the same regions in individuals from other public bam files (the 1000GP) of (10x), I got:

1000GP bam (10X): Total reads: 13924960, wc raw.g.vcf = 20819861, grep 0/1 raw.g.vcf = 57,362
Public bam (39X): Total reads: 43168049, wc raw.g.vcf = 8136676, grep 0/1 raw.g.vcf = 9

So, why having high coverage bam files, I end up with so few variants?

Could it be the result of some error I had in the IndelRealignment step (GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T IndelRealigner) ?

ERROR MESSAGE: Exception when processing alignment for BAM index HS2000-1017_290:2:2106:4486:29552 2/2 100b unmapped read.

The only way to solve that error (according to ValidateSamFile there was no errors) was to remove those unmapped reads:
samtools view -b -F 4 file.bam > mapped.bam
index again
BaseRecalibrator step1
BaseRecalibrator Round 2
Print Reads

Thank you very much,



  • SheilaSheila Broad InstituteMember, Broadie, Moderator
    edited March 5

    Hi Magda,

    To confirm, you were able to run the Indel Realignment tools after removing the offending reads?

    Can you try with the latest version of GATK? GATK4 would be best, but newest GATK3 is fine too. There may have been fixes that get rid of this issue. You can just run on the interval you are concerned about.


Sign In or Register to comment.