Structural Variation identification using DepthOfCoverage query

slubbe, London (Posts: 10; Member)
edited October 2012 in Ask the GATK team

Hi GATK Team

You are doing an amazing job, keep it up!

I apologise in advance if this question has come up and I've not found it within the forum, but I am quite new to all of this and would like to ask you a few questions regarding identifying structural variation from exome resequencing data:

I am trying to find the best way to identify potential structural variants from a single BAM file. One approach suggested to me was to look for sites with DP values below 5 (from UnifiedGenotyper), although I understand there are inherent confounders in doing so. I therefore also ran the same BAM file through the DepthOfCoverage (DoC) tool to pick out regions of interest with zero coverage. However, when I overlaid the two result sets and mapped their coordinates to the human genome, the overlap between the low-DP sites and the zero-coverage DoC regions was extremely small (<5%). Why could this be? Surely there should be more overlap? Are the two tools measuring different things, or have I done something wrong without realising it?

I tried to consult the documentation for DepthOfCoverage to make sense of this, but it seems to be unavailable on the website (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_coverage_DepthOfCoverage.html). Please could you advise?

Below are the command lines I've been using:

java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -omitBaseOutput -omitLocusTable -R referencefilename.fa -I samplefilename.bam -L regionsofinterest.txt -o outputfile.coverage

java -jar GenomeAnalysisTK.jar -R referencefilename.fa -T UnifiedGenotyper -I samplefilename.bam --dbsnp dbsnpreferencefile.vcf --genotype_likelihoods_model SNP -o outputfilename.vcf --output_mode EMIT_ALL_SITES -stand_call_conf 50.0 -stand_emit_conf 0.0 -dcov 200 -L regionsofinterest.bed

Thank you in advance for your help; it is much appreciated.

SL
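
The comparison described in this question can be illustrated with a minimal sketch. It assumes the low-DP sites have already been pulled out of the VCF as (chromosome, position) pairs and the zero-coverage regions out of the DoC output as (chromosome, start, end) triples, with 1-based inclusive coordinates and no overlapping intervals. The function names and the toy data are hypothetical; nothing here is part of GATK.

```python
from bisect import bisect_right
from collections import defaultdict


def index_intervals(intervals):
    """Group (chrom, start, end) intervals by chromosome, sorted by start.

    Assumes 1-based inclusive coordinates and that intervals on the same
    chromosome do not overlap each other.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    for spans in by_chrom.values():
        spans.sort()
    return by_chrom


def site_in_intervals(index, chrom, pos):
    """Return True if position pos lies inside any indexed interval."""
    spans = index.get(chrom, [])
    # Locate the last interval whose start is <= pos, then check its end.
    i = bisect_right(spans, (pos, float("inf"))) - 1
    return i >= 0 and spans[i][0] <= pos <= spans[i][1]


def overlap_fraction(low_dp_sites, zero_cov_intervals):
    """Fraction of low-DP sites that fall inside zero-coverage intervals."""
    if not low_dp_sites:
        return 0.0
    index = index_intervals(zero_cov_intervals)
    hits = sum(site_in_intervals(index, c, p) for c, p in low_dp_sites)
    return hits / len(low_dp_sites)


# Toy data (hypothetical): four low-DP sites, two zero-coverage regions.
sites = [("1", 100), ("1", 150), ("1", 205), ("2", 50)]
regions = [("1", 200, 300), ("2", 10, 40)]
print(overlap_fraction(sites, regions))  # prints 0.25
```

One point worth keeping in mind when reading such a number: an interval whose total depth is zero contains no covered base at all, so, provided both tools count the same reads, a site with a DP of 1 to 4 cannot fall inside it. A "DP < 5" site list is therefore expected to be much larger than the set of sites inside zero-coverage intervals.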

Answers

  • Geraldine_VdAuwera (Posts: 6,804; Administrator, GATK Developer)

    Ah, compliments will get you very far with us :)

    Thanks for pointing out the missing doc on DepthOfCoverage -- the tech docs are generated automatically when we release a new version, so I'll need to dig into the system to find out why this one article was skipped. In the meantime, your best bet is to look at the comments in the source code for info on the different modes and arguments of the tool. You'll find it at this link. I apologize for the inconvenience.

    To address the differences you're observing: a number of things could explain them, as the DoC annotator, which generates the DP field, and the DoC walker are quite different tools. For example, the DP field is sensitive to downsampling. There is also the question of whether you are measuring filtered or unfiltered depth, and whether you are measuring absolute depth per position or averaging over intervals. Incidentally, we also have a tool called CoverageBySample, which is a much simpler, more straightforward coverage-counting tool: less powerful, but also less tricky to use. It would also be easy for you to customize it to measure exactly what you want.

    I hope this helps! Good luck.

    Geraldine Van der Auwera, PhD

  • slubbe, London (Posts: 10; Member)

    Dear Geraldine,

    Thank you so much for your quick response, and even more for the advice. I am using the DP values that are not in the INFO column (from UnifiedGenotyper), as I am after sample-specific DP values, and comparing them to regions with a total and average depth of zero (from DoC). Would they not be the same? If not, which depth values do you suggest I use when looking for potential structural variants?

    Thanks again for your help,
    SL
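
The follow-up distinguishes the site-level DP in the INFO column from the sample-specific DP. In the VCF format, the sample-specific value sits in the per-sample genotype columns, under the key listed in the FORMAT column. The sketch below shows one way to read both values from a single VCF data line so they can be compared side by side. It is an illustration only: the function name, the sample name and the toy line are hypothetical, and it assumes a tab-delimited data line with the standard eight fixed columns followed by FORMAT and the sample columns.

```python
def parse_depths(vcf_line, sample_names):
    """Return (INFO DP, {sample: FORMAT DP}) for one VCF data line.

    Missing values (absent key or '.') are returned as None.
    """
    fields = vcf_line.rstrip("\n").split("\t")

    # Column 8 (index 7) is INFO: semicolon-separated KEY=VALUE items or flags.
    info = dict(
        item.split("=", 1) if "=" in item else (item, True)
        for item in fields[7].split(";")
    )
    info_dp = int(info["DP"]) if "DP" in info else None

    # Column 9 (index 8) is FORMAT: colon-separated keys that describe
    # the layout of every sample column that follows.
    keys = fields[8].split(":")
    sample_dp = {}
    for name, column in zip(sample_names, fields[9:]):
        values = dict(zip(keys, column.split(":")))
        raw = values.get("DP", ".")
        sample_dp[name] = None if raw == "." else int(raw)
    return info_dp, sample_dp


# Toy line (hypothetical values): site-level DP of 7, sample-level DP of 4.
line = "1\t12345\t.\tA\t.\t30\t.\tDP=7;MQ=60\tGT:DP\t0/0:4"
print(parse_depths(line, ["sample1"]))  # prints (7, {'sample1': 4})
```

Printing the two values next to the per-position depth from a coverage tool, for a handful of sites, is a quick way to see whether the numbers being compared were produced under the same filtering and downsampling settings.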
