
DepthOfCoverage, QualifyMissingIntervals and DiagnoseTargets?

Biocyberman Member
edited May 2015 in Ask the GATK team

I am using GATK for clinical whole-exome sequencing. I often have to answer questions like these when evaluating the quality of a sequencing run:

  1. How well are my genes of interest covered?
  2. Which exons are not well covered?
  3. Which intervals are not well covered?

I've tried several tools that try to address these questions (e.g. bcbio-nextgen, chanjo). But now I have a feeling that the tools mentioned in the title can more or less answer my questions, apart from the nice chanjo feature that allows storing and querying statistics across samples.

My question about these tools: when should I use which?
From their names and the descriptions of their command-line arguments, I can't answer that clearly. I am inclined to try all three in this order: DepthOfCoverage -> QualifyMissingIntervals -> DiagnoseTargets.

So again, when to use which?

Thanks
Vang


Answers

  • Blue Member

    Hi @Biocyberman

    I just happened to see your question, and thought I could help.

    Try the Diagnostics and Quality Control Tools and the Variant Evaluation and Manipulation Tools.

    These are for BAM files:
    GenomeAnalysisTK -T BaseCoverageDistribution \
    -L chr2L -L chr2R -L chr3L -L chr3R -L chrX -L chr4 -L chrM \
    -R ~/reference_sequences/dmel/v6.0/dm6.fa \
    -I mysample.bam -o mysample.BasCovDis.txt

    GenomeAnalysisTK -T CallableLoci \
    -L chr2L -L chr2R -L chr3L -L chr3R -L chrX -L chr4 -L chrM \
    -R ~/reference_sequences/dmel/v6.0/dm6.fa \
    -I mysample.bam -o mysample.CallableLoci.txt -summary mysample.CallableLocSummary.txt

    GenomeAnalysisTK -T UnifiedGenotyper \
    -L chr2L -L chr2R -L chr3L -L chr3R -L chrX -L chr4 -L chrM \
    -R ~/reference_sequences/dmel/v6.0/dm6.fa \
    -glm SNP \
    -I mysample.bam -o mysample.SNP.UniGenotyper.vcf

    These are for VCF files. The second writes a table that is easy to analyze in R, particularly with dplyr and ggplot2.
    GenomeAnalysisTK -T VariantEval \
    -L chr2L -L chr2R -L chr3L -L chr3R -L chrX -L chr4 -L chrM \
    -R ~/reference_sequences/dmel/v6.0/dm6.fa \
    -eval:set1 mysample.SNP.UniGenotyper.vcf -o mysample.SNP.VariantEval.txt

    GenomeAnalysisTK -T VariantsToTable \
    -L chr2L -L chr2R -L chr3L -L chr3R -L chrX -L chr4 -L chrM \
    -R ~/reference_sequences/dmel/v6.0/dm6.fa \
    --variant mysample.SNP.UniGenotyper.vcf \
    -F CHROM -F POS -F ID -F REF -F ALT -F QUAL -F DP -GF GT \
    -o mysample.SNP.variantTable.txt

  • Sheila Broad Institute Member, Broadie, Moderator admin

    @Biocyberman
    Hi Vang,

    You can use DepthOfCoverage for your first task. It takes a RefSeq file for input. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_coverage_DepthOfCoverage.php#--calculateCoverageOverGenes

    You can use DiagnoseTargets for the remaining tasks. Use -L to pass in the exons and intervals you are interested in. The --missing_intervals argument will give you a list of intervals that do not pass your filters. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_diagnostics_diagnosetargets_DiagnoseTargets.php#--missing_intervals

    -Sheila

  • Biocyberman Member

    @Blue: Thanks for your answer. I was busy with other things and only managed to log in now. I will try BaseCoverageDistribution with the -L option to see what the results tell me. However, I don't see the reason to use CallableLoci and the variant-related tools below it. Could you say something about them?

    @Sheila It looks clear enough. I will try your suggestion as well.

  • Blue Member
    edited May 2015

    @Biocyberman

    Note that the -L option is for defining intervals.

    CallableLoci generates two files, one detailed (and extensive), one a summary. Both describe numerically "Which interval are not well covered", which I am assuming is useful to you.

    I have provided some example output.

    The extensive output looks like:
    chr2L 0 4555 NO_COVERAGE
    chr2L 4555 4685 LOW_COVERAGE
    chr2L 4685 4686 NO_COVERAGE
    chr2L 4686 4817 LOW_COVERAGE
    chr2L 4817 47678 CALLABLE
    chr2L 47678 47679 LOW_COVERAGE
    chr2L 47679 47694 CALLABLE
    chr2L 47694 47695 LOW_COVERAGE
    chr2L 47695 47741 CALLABLE
    chr2L 47741 47743 LOW_COVERAGE
    chr2L 47743 47765 CALLABLE
    chr2L 47765 47843 LOW_COVERAGE

    The summary looks like:
    state nBases
    REF_N 232752
    CALLABLE 124509960
    NO_COVERAGE 6856792
    LOW_COVERAGE 2225692
    EXCESSIVE_COVERAGE 0
    POOR_MAPPING_QUALITY 74936
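
The two CallableLoci files shown above can be post-processed directly in the shell. The following is only a minimal sketch: it assumes the per-interval file has four whitespace-separated columns (contig, start, end, state) and that the summary file has a `state nBases` header followed by one state per line, as in the example output. It also assumes half-open coordinates, which the abutting intervals in the example suggest, so interval length is taken as end minus start. The function names `poorly_covered` and `callable_fraction` are hypothetical helpers, not part of GATK.

```shell
# Hypothetical helpers for CallableLoci output; not part of GATK.

# Print every interval whose state is not CALLABLE, followed by its length.
# Assumption: columns are contig, start, end, state, and length = end - start.
poorly_covered() {
  LC_ALL=C awk '$4 != "CALLABLE" { print $1, $2, $3, $4, $3 - $2 }' "$1"
}

# Print the fraction of all counted bases that are in the CALLABLE state.
# Assumption: the first line is the "state nBases" header and is skipped.
callable_fraction() {
  LC_ALL=C awk 'NR > 1 { total += $2; if ($1 == "CALLABLE") callable = $2 }
       END { if (total > 0) printf "%.4f\n", callable / total }' "$1"
}
```

With the file names from the commands earlier in the thread, `poorly_covered mysample.CallableLoci.txt` would list the intervals behind the question "which intervals are not well covered", and `callable_fraction mysample.CallableLocSummary.txt` would give a single per-sample number. For the summary shown above, that fraction works out to roughly 0.93.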
