Why I obtained a g.vcf with wrong variant DPs, and too few variants according to coverage?

Hi,
I extracted Exome regions from public bam files to apply the same pipeline that I did for my samples and merge them. It is weird that looking in the g.vcf files, I have very few "variants", and those have very few reads (these are high coverage samples). Here there are two examples: As you can see this regions has a coverage of 22-40, How can I have a DP=1??

1 146639579 . C . . END=146639580 GT:DP:GQ:MIN_DP:PL 0/0:38:99:37:0,99,1485
1 146639581 . C . . END=146639581 GT:DP:GQ:MIN_DP:PL 0/0:38:96:38:0,96,1440
1 146639582 . C . . END=146639587 GT:DP:GQ:MIN_DP:PL 0/0:40:99:38:0,99,1485
1 146639588 rs55797044 A G, 0.01 . DB;DP=1;MLEAC=0,0;MLEAF=0.00,0.00;MQ=254.00 GT:AD:DP:GQ:PL:SB 0/0:1,0,0:1:3:0,3,10,3,10,10:0,1,0,0
1 146639589 . T . . END=146639612 GT:DP:GQ:MIN_DP:PL 0/0:40:99:37:0,105,1575
1 146643468 . G . . END=146643471 GT:DP:GQ:MIN_DP:PL 0/0:23:54:22:0,54,810
1 146643472 . G . . END=146643472 GT:DP:GQ:MIN_DP:PL 0/0:22:57:22:0,57,855

1 1263129 . C . . END=1263129 GT:DP:GQ:MIN_DP:PL 0/0:26:59:26:0,59,461
1 1263130 . T . . END=1263143 GT:DP:GQ:MIN_DP:PL 0/0:24:60:22:0,60,900
1 1263144 rs307350 G A, 0.14 . DB;DP=1;MLEAC=1,0;MLEAF=0.500,0.00;MQ=254.00 GT:AD:DP:GQ:PL:SB 1/1:0,1,0:1:3:10,3,0,10,3,10:0,0,1,0
1 1263145 . G . . END=1263163 GT:DP:GQ:MIN_DP:PL 0/0:25:63:23:0,63,945
1 1263164 . A . . END=1263164 GT:DP:GQ:MIN_DP:PL 0/0:22:49:22:0,49,378

When I compare the lines of this g.vcf file with what I obtained with extracting the same regions in individuals from other public bam files (the 1000GP) of (10x), I got:

1000GP bam (10X): Total reads: 13924960, wc raw.g.vcf = 20819861, grep 0/1 raw.g.vcf = 57,362
Public bam (39X): Total reads: 43168049, wc raw.g.vcf = 8136676, grep 0/1 raw.g.vcf = 9

So, why having high coverage bam files, I end up with so few variants?

Could it be the result of some error I had in the IndelRealignment step (GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T IndelRealigner) ?

ERROR MESSAGE: Exception when processing alignment for BAM index HS2000-1017_290:2:2106:4486:29552 2/2 100b unmapped read.

The only way to solve that error (according to ValidateSamFile there was no errors) was to remove those unmapped reads:
samtools view -b -F 4 file.bam > mapped.bam
index again
IndelRealingment
BaseRecalibrator step1
BaseRecalibrator Round 2
Print Reads
HaplotypeCaller

Thank you very much,

magda

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
    edited March 5

    @Magda
    Hi Magda,

    To confirm, you were able to run the Indel Realignment tools after removing the offending reads?

    Can you try with the latest version of GATK? GATK4 would be best, but newest GATK3 is fine too. There may have been fixes that get rid of this issue. You can just run on the interval you are concerned about.

    -Sheila

Sign In or Register to comment.