UnifiedGenotyper -dcov

Hi all!
I'm trying to call variants on multiple high-coverage samples at once with UnifiedGenotyper, so I set the -dcov parameter to a few thousand reads per position. With that setting, however, GATK takes far too long. As far as I know, -dcov is a downsampling option, but I don't know whether that downsampling is random or whether it simply keeps the first reads at a given position. Would the -dfrac option be the right way to randomly downsample multiple highly covered samples?
Thank you very much for your help!

Answers

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi there,

    Have a look at the documentation for these two options here: http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_CommandLineGATK.html

    Downsampling to coverage is usually what you want to use to deal with localized patches of excessive coverage. Fractional downsampling will downsample everywhere including where coverage is not excessive. Notice that downsampling is done per sample, so typically you would set -dcov much lower than that; 250 (the default setting) is more than enough for the UG to work its magic.
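
    As a rough illustration of the difference (a toy Python sketch with made-up depths, not GATK code): downsampling to coverage only touches positions whose depth exceeds the cap, while fractional downsampling thins every position.

        coverage = {"chr1:100": 40, "chr1:200": 5000, "chr1:300": 60}  # made-up depths
        dcov, dfrac = 250, 0.05

        # -dcov-style: cap only the positions that exceed the target coverage
        capped = {pos: min(depth, dcov) for pos, depth in coverage.items()}
        # -dfrac-style: keep (on average) the same fraction of reads everywhere
        thinned = {pos: round(depth * dfrac) for pos, depth in coverage.items()}

        print(capped)   # {'chr1:100': 40, 'chr1:200': 250, 'chr1:300': 60}
        print(thinned)  # {'chr1:100': 2, 'chr1:200': 250, 'chr1:300': 3}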

  • bioSG Member

    Hi all / Geraldine,
    thank you very much for your answer.
    I would like to obtain an unbiased set of reads, so according to the manual, -dfrac is the option to set in order to keep an unbiased percentage of reads, isn't it? The question now is: for a given -dfrac value, does GATK take as many reads as it needs in a single pass, or does it re-sample several times until it reaches the requested percentage?
    Because of the enrichment system, I have regions covered at 10000x, and the variant-calling results with -dcov 10000 differ from those obtained with the default coverage per position (250).
    Any advice will be appreciated.
    Regards,

  • droazen Cambridge, MA Member, Broadie, Dev ✭✭

    The value you give -dfrac represents the chance that each individual read will survive the downsampling process. So, -dfrac 0.6 means that each read has a 60% chance of being retained. Since there is randomness involved, the number of reads kept will not typically exactly match what you requested, but with a sufficiently large number of reads it will be very close.
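
    To make the "independent coin flip per read" behaviour concrete, here is a minimal Python sketch of the same idea (not GATK's code; the read list and function name are invented for illustration):

        import random

        def downsample_fraction(reads, dfrac, seed=None):
            # Keep each read independently with probability dfrac.
            rng = random.Random(seed)
            return [r for r in reads if rng.random() < dfrac]

        reads = ["read_%d" % i for i in range(10000)]
        kept = downsample_fraction(reads, dfrac=0.6, seed=42)
        # Close to 6000, but usually not exactly 6000, because each read
        # is an independent draw.
        print(len(kept))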

    If you want an unbiased sampling, -dfrac is definitely the way to go. The -dcov option tries to maintain an even representation of reads from every alignment start position, and so is not unbiased. For example, suppose that the UG is about to examine the locus at chr1:54, and for coverage at this position you have 20 reads starting at chr1:50, 20 more reads starting at chr1:52, and another set of 20 reads starting at position chr1:54. If your -dcov value were 30 in this situation, the downsampler would randomly eliminate 10 reads from each alignment start position to bring the total height of the pileup at chr1:54 down to 30 while keeping the same number of reads from each start position.
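
    And here is a matching toy sketch of the -dcov behaviour described above (again, not GATK's implementation, just the idea of keeping an even share of reads from each alignment start position):

        import random

        def downsample_to_coverage(reads_by_start, dcov, seed=None):
            # Keep roughly dcov reads in total, split evenly across the
            # alignment start positions in the pileup.
            rng = random.Random(seed)
            per_start = dcov // len(reads_by_start)
            kept = {}
            for start, reads in reads_by_start.items():
                reads = list(reads)
                rng.shuffle(reads)
                kept[start] = reads[:per_start]
            return kept

        # 20 reads starting at each of chr1:50, chr1:52 and chr1:54, as in the example
        pileup = {50: ["r50_%d" % i for i in range(20)],
                  52: ["r52_%d" % i for i in range(20)],
                  54: ["r54_%d" % i for i in range(20)]}
        kept = downsample_to_coverage(pileup, dcov=30, seed=1)
        print({start: len(reads) for start, reads in kept.items()})  # {50: 10, 52: 10, 54: 10}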

    Hope this helps,
    David
