Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Deletion QC plots

Will_GilksWill_Gilks University of Sussex, UKMember ✭✭

Hi @bhandsaker

I'm inspecting my results from the Genomestrip/2.0 deletion pipeline: 244 of them reported on one 23Mb chromosome in 221 flies, at 20-40X coverage.

Attached is a density distribution plot for all of the quantitative metric in the final vcf.

I'm considering excluding variants that are in the tails of the distributions for coherence, depth p-value, maybe for the pairs measurements, and maybe outer left and right.

What do you reckon?

I'm also going to have quick look at a few of the pile-ups using Intergrated Genomics Viewer.

Best regards,

William Gilks

Best Answers

Answers

  • Will_GilksWill_Gilks University of Sussex, UKMember ✭✭

    @bhandsaker ... maybe the attached pdf less blurry than the previous png ...?

  • Will_GilksWill_Gilks University of Sussex, UKMember ✭✭

    @bhandsaker

    Looking at the most common deletion passing filters, using IGV, it's pretty convincing, with greater distance between pairs (see attachment). I looked at a 10K deletion that failed filters, and is only present in one individual, and that was pretty unconvincing.

    I've a minor problem with PlotGenotypingResults, because I deleted the intermediate files P000* due to running discovery+genotyping on the each major chromosome separately but using the same preparation data. Basically I have to re-do this and maintain the intermediate files.

    I've attached some more summary graphs of the QC data, using independent metrics. I was thinking of using PCA or a mixed-model to filter variants - though I guess it would be quicker just to put simple thresholds on the most influential factors.

  • Will_GilksWill_Gilks University of Sussex, UKMember ✭✭

    Hi @bhandsaker

    I've attached pdf and png of various deletion metrics across the genome. They've been filtered as:
    GSCOHERENCE > -100, GSDEPTHPVALUE < 1e-06.

    Not shown is a 300Kb event on chr3R. There's still a lot of events detected around the centromeres, and in the chromatin-rich X-chromosome, which might be artefacts of the difficult sequence type.

    I'm try to run plotGenotypingResults to get a closer look at the events, using:

    SV_TMPDIR=./tmpdir
    SV_DIR=/cm/shared/apps/svtoolkit/2.0.1602/
        which java > /dev/null || exit 1
        which Rscript > /dev/null || exit 1
        which samtools > /dev/null || exit 1
            mx="-Xmx4g"
            classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar"
            java -cp ${classpath} ${mx} -jar ${SV_DIR}/lib/SVToolkit.jar
    
    java -Xmx4g -cp ${classpath} ${mx} org.broadinstitute.sv.apps.PlotGenotypingResults \
        -site chr2L.DEL19 \
        -O chr2L.DEL19.pdf \
        -runDirectory chr2L_deletions
    

    Returns:
    Warning: No data found for site chr2L.DEL19

    The chr2L_deletions directory has all the metrics data files, SR info, P00* and .P00* files, as well as discovery and genotyped vcf. Do the metadata from preprocessing or the SV pipeline reports have to be in the plotting runDir?

    Any advice?

    Cheers,

    Will

  • bhandsakerbhandsaker Member, Broadie, Moderator admin

    PlotGenotypingResults will look first for a file partition.genotypes.map.dat in the run directory. This file is used to map the site(s) to the correct partition (Pnnnn) files. If there is no partition file, you can also use -auxFilePrefix to direct the plotting code to the right set of auxilliary output files (e.g. -auxFilePrefix ${runDir}/P0001). Grep through these files (tab delimited text files), e.g. P0001.gmm.dat, to make sure the site is there.

  • Will_GilksWill_Gilks University of Sussex, UKMember ✭✭

    @bhandsaker Partition file seems to me in place. I'm nearly there but aren't there meant to be two other graphs for this output? I've tried --pretty TRUE, -pretty true etc. Is the final vcf required at all here, as I customised the variant ids in it.

  • bhandsakerbhandsaker Member, Broadie, Moderator admin

    Final VCF is not currently required (although I would love to re-implement this code to use only the final VCF and not the auxilliary files).
    I don't think there should be other output - what other output were you expecting?

    -pretty true just produces the same plot but formatted and scaled differently, better for presenting good looking results, less useful for assessing and diagnosing problems.

  • Will_GilksWill_Gilks University of Sussex, UKMember ✭✭

    @bhandsaker Thanks for the tips. Actually, I was anticipating three plots, with the two others showing results from different methods, and read depth in individuals across the event as shown in http://www.broadinstitute.org/software/genomestrip/sites/default/files/gs_plots_0.png Are these still in development? Maybe we could put together an R package for genomestrip post-analysis.

  • bhandsakerbhandsaker Member, Broadie, Moderator admin

    I'm curious about the example site you plotted. On the one hand, it has characteristics that make it look like a good site: The read depth is clustering and there is good cluster separation. The read pair support correlates very well with the samples that have copy number 1 instead of 2. On the other hand, a couple of things are worrisome. Most obviously, the site does not appear to be in Hardy-Weinberg equilibrium. Is there some reason one wouldn't expect HWE at this site?

  • Will_GilksWill_Gilks University of Sussex, UKMember ✭✭

    @bhandsaker My sample is unusual in that all individuals have a mother from the reference assembly strain, and different fathers from an wild, outbred population. DEL2224 is genotyped as heterozygous in the maternal (reference) line. It seems that this site is monomorphic in the outbred population, pending phasing. Note that the sample type means that HWE departure isn't an indicator of genotyping error, although knowing the in-house maternal reference line sequence helps identify Mendelian errors.

  • bhandsakerbhandsaker Member, Broadie, Moderator admin

    Thanks for the explanation - makes sense now!

Sign In or Register to comment.