Downsampling is not random?

xushaw Member
edited December 2012 in Ask the GATK team

Hi,
I got something strange with GATK downsampling: the reads do not seem to be chosen randomly for one SNP. I was running UnifiedGenotyper with -dcov 25 -dt by_sample -ndrs to downsample from a coverage of several thousand to 25 reads.
Without downsampling, the coverage is 3626, with 1838 As and 1788 Gs, almost equal (result below).

20 43047293 . A G 32767.01 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=6.593;DP=3626;Dels=0.00;FS=5.570;HaplotypeScore=34.1333;MLEAC=1;MLEAF=0.500;MQ=59.78;MQ0=0;MQRankSum=-0.054;QD=9.04;ReadPosRankSum=0.293;SB=-2.037e+04 GT:AD:DP:GQ:PL 0/1:1838,1788:3626:99:32767,0,32767

But after downsampling there are clearly more As than Gs. I repeated the run several times and there is always a bias towards A, as you can see in the results below.
I have around 80 SNPs in this sample. All the others seem fine; only this one shows the bias. The only unusual thing about this SNP is that it is located near the end of our capture bait, so it has twice as many negative-strand reads as positive-strand reads. Could this be a factor that influences the random selection of reads?

20 43047293 . A G 311.01 . AC=1;AF=0.500;AN=2;BaseCounts=15,0,10,0;BaseQRankSum=1.193;DP=25;DS;Dels=0.00;FS=18.119;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=59.25;MQ0=0;MQRankSum=0.416;QD=12.44;ReadPosRankSum=0.471;SB=-4.401e+01 GT:AD:DP:GQ:PL 0/1:15,10:25:99:341,0,520

20 43047293 . A G 61.01 . AC=1;AF=0.500;AN=2;BaseCounts=21,0,4,0;BaseQRankSum=1.297;DP=25;DS;Dels=0.00;FS=2.609;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=59.25;MQ0=0;MQRankSum=0.778;QD=2.44;ReadPosRankSum=0.111;SB=-2.901e+01 GT:AD:DP:GQ:PL 0/1:21,4:25:91:91,0,779

20 43047293 . A G 222.01 . AC=1;AF=0.500;AN=2;BaseCounts=17,0,8,0;BaseQRankSum=1.252;DP=25;DS;Dels=0.00;FS=24.380;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=59.25;MQ0=0;MQRankSum=-0.495;QD=8.88;ReadPosRankSum=0.786;SB=-3.201e+01 GT:AD:DP:GQ:PL 0/1:17,8:25:99:252,0,599

20 43047293 . A G 103.01 . AC=1;AF=0.500;AN=2;BaseCounts=20,0,5,0;BaseQRankSum=1.393;DP=25;DS;Dels=0.00;FS=8.751;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=59.25;MQ0=0;MQRankSum=0.645;QD=4.12;ReadPosRankSum=-0.238;SB=-6.519e-03 GT:AD:DP:GQ:PL 0/1:20,5:25:99:133,0,726

20 43047293 . A G 145.01 . AC=1;AF=0.500;AN=2;BaseCounts=19,0,6,0;BaseQRankSum=2.195;DP=25;DS;Dels=0.00;FS=15.055;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=59.25;MQ0=0;MQRankSum=0.032;QD=5.80;ReadPosRankSum=-0.604;SB=-6.200e+01 GT:AD:DP:GQ:PL 0/1:19,6:25:99:175,0,680

20 43047293 . A G 228.01 . AC=1;AF=0.500;AN=2;BaseCounts=17,0,8,0;BaseQRankSum=1.252;DP=25;DS;Dels=0.00;FS=27.681;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=59.25;MQ0=0;MQRankSum=0.903;QD=9.12;ReadPosRankSum=-0.845;SB=-6.519e-03 GT:AD:DP:GQ:PL 0/1:17,8:25:99:258,0,605

20 43047293 . A G 268.01 . AC=1;AF=0.500;AN=2;BaseCounts=16,0,9,0;BaseQRankSum=0.538;DP=25;DS;Dels=0.00;FS=24.428;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=59.25;MQ0=0;MQRankSum=1.047;QD=10.72;ReadPosRankSum=1.104;SB=-3.007e+00 GT:AD:DP:GQ:PL 0/1:16,9:25:99:298,0,582

20 43047293 . A G 144.01 . AC=1;AF=0.500;AN=2;BaseCounts=19,0,6,0;BaseQRankSum=0.732;DP=25;DS;Dels=0.00;FS=9.046;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=59.25;MQ0=0;MQRankSum=1.559;QD=5.76;ReadPosRankSum=-0.541;SB=-6.801e+01 GT:AD:DP:GQ:PL 0/1:19,6:25:99:174,0,683

20 43047293 . A G 103.01 . AC=1;AF=0.500;AN=2;BaseCounts=20,0,5,0;BaseQRankSum=0.917;DP=25;DS;Dels=0.00;FS=16.298;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-1.121;QD=4.12;ReadPosRankSum=-0.170;SB=-6.519e-03 GT:AD:DP:GQ:PL 0/1:20,5:25:99:133,0,727

20 43047293 . A G 61.01 . AC=1;AF=0.500;AN=2;BaseCounts=21,0,4,0;BaseQRankSum=1.297;DP=25;DS;Dels=0.00;FS=9.734;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=59.25;MQ0=0;MQRankSum=0.852;QD=2.44;ReadPosRankSum=-0.334;SB=-2.001e+01 GT:AD:DP:GQ:PL 0/1:21,4:25:91:91,0,768

20 43047293 . A G 61.01 . AC=1;AF=0.500;AN=2;BaseCounts=21,0,4,0;BaseQRankSum=1.668;DP=25;DS;Dels=0.00;FS=2.609;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=59.25;MQ0=0;MQRankSum=0.259;QD=2.44;ReadPosRankSum=0.334;SB=-2.901e+01 GT:AD:DP:GQ:PL 0/1:21,4:25:91:91,0,764

20 43047293 . A G 187.01 . AC=1;AF=0.500;AN=2;BaseCounts=18,0,7,0;BaseQRankSum=1.483;DP=25;DS;Dels=0.00;FS=21.673;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=59.25;MQ0=0;MQRankSum=-0.635;QD=7.48;ReadPosRankSum=0.212;SB=-6.519e-03 GT:AD:DP:GQ:PL 0/1:18,7:25:99:217,0,644
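For readers who want to quantify the skew, here is a minimal Python sketch (an editorial illustration, not part of the original post). It pulls the AD field out of the twelve downsampled genotype columns above and compares the mean reference-allele depth with what the full-depth 1838:1788 split would predict. The helper name `ref_alt_depths` is hypothetical, and the sketch assumes every record uses the FORMAT string GT:AD:DP:GQ:PL, as the records above do.

```python
# Sample columns copied from the twelve downsampled records in the post.
# Assumption: FORMAT is GT:AD:DP:GQ:PL in every record, as shown above.
FORMAT = "GT:AD:DP:GQ:PL"
SAMPLES = [
    "0/1:15,10:25:99:341,0,520",
    "0/1:21,4:25:91:91,0,779",
    "0/1:17,8:25:99:252,0,599",
    "0/1:20,5:25:99:133,0,726",
    "0/1:19,6:25:99:175,0,680",
    "0/1:17,8:25:99:258,0,605",
    "0/1:16,9:25:99:298,0,582",
    "0/1:19,6:25:99:174,0,683",
    "0/1:20,5:25:99:133,0,727",
    "0/1:21,4:25:91:91,0,768",
    "0/1:21,4:25:91:91,0,764",
    "0/1:18,7:25:99:217,0,644",
]


def ref_alt_depths(fmt, sample):
    """Return (ref_depth, alt_depth) from the AD field of one sample column."""
    fields = dict(zip(fmt.split(":"), sample.split(":")))
    ref, alt = (int(x) for x in fields["AD"].split(","))
    return ref, alt


depths = [ref_alt_depths(FORMAT, s) for s in SAMPLES]
ref_counts = [ref for ref, _ in depths]
mean_ref = sum(ref_counts) / len(ref_counts)

# Expected A count among 25 reads if the draws mirrored the
# full-depth split of 1838 A reads to 1788 G reads.
expected_ref = 25 * 1838 / (1838 + 1788)

print(f"observed A counts: {ref_counts}")
print(f"mean observed A count: {mean_ref:.2f}")
print(f"expected A count under unbiased sampling: {expected_ref:.2f}")
```

Across the twelve runs the mean A depth is roughly 18.7 of 25 reads, against roughly 12.7 expected from the full-depth ratio, and no run falls below 15.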

Answers

  • droazen Cambridge, MA Member, Broadie, Dev

    Going from several thousand reads to only 25 reads, I'd expect even a completely unbiased sampling to occasionally produce a subset that's not perfectly representative. Having said that, the existing GATK downsampler is known to be biased/flawed in several ways. We're working on a new downsampling implementation, which you can try out for yourself in more recent versions of the GATK via the --enable_experimental_downsampling option. Since this implementation is still experimental, we cannot vouch for the quality of its results, though I'd be interested to hear whether you see any improvement using it.

    David
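To put a number on how often an unbiased sampler would produce an unrepresentative subset, the following sketch (an editorial illustration, not GATK code) models downsampling as drawing 25 reads without replacement from the 3626 reads in the question, 1838 of which carry A. That hypergeometric model and the function name `hypergeom_tail` are assumptions made for illustration; they say nothing about how the GATK downsampler is actually implemented.

```python
from math import comb


def hypergeom_tail(population, successes, draws, k_min):
    """P(X >= k_min) when drawing `draws` items without replacement from a
    population of size `population` that contains `successes` marked items."""
    total = comb(population, draws)
    k_max = min(draws, successes)
    hits = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(k_min, k_max + 1)
    )
    return hits / total


# Numbers from the question: 3626 reads in total, 1838 of them A, 25 kept.
N_READS, N_A, N_KEPT = 3626, 1838, 25

p_at_least_15 = hypergeom_tail(N_READS, N_A, N_KEPT, 15)
p_at_least_21 = hypergeom_tail(N_READS, N_A, N_KEPT, 21)

# Assumption: the twelve reported runs are independent draws.
p_all_twelve = p_at_least_15 ** 12

print(f"P(A >= 15 in one run)      = {p_at_least_15:.4f}")
print(f"P(A >= 21 in one run)      = {p_at_least_21:.6f}")
print(f"P(A >= 15 in all 12 runs)  = {p_all_twelve:.3e}")
```

Under this model a single run with 15 or more A reads is unremarkable, but twelve such runs in a row, several with 21 A reads, would be very unlikely from an unbiased sampler, which fits the acknowledgement that the existing downsampler is biased.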

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    FYI the new downsampler is now activated by default (starting in version 2.3). See release notes for details.
