Unified genotyper with target region vs without target region

ersgupta, Bangalore, India, Member

I am running some exome samples with UnifiedGenotyper (UG). I tried this in two ways:
1. Run UG, then apply an exome region filter.
2. Run UG with the exome regions specified.

Is there any difference between these approaches if I am only concerned with variants in the exome regions?
I have run both, and on comparison I see that some values differ for the same locus, e.g. PL, MQRankSum, BaseQRankSum. The differences are not large, but does UG behave slightly differently when making calls with target regions vs. without?

PS: I ran UG with multiple samples.
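A comparison like the one described above can be sketched in a few lines of Python. This is only an illustration: the two VCF snippets and all their values are made up, the function names are hypothetical, and only INFO-field annotations are compared (PL lives in the per-sample columns and is not handled here).

```python
def parse_info(vcf_text):
    """Map (chrom, pos, ref, alt) to a dict of INFO key/value strings."""
    records = {}
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        chrom, pos, _id, ref, alt = fields[:5]
        info = {}
        for item in fields[7].split(";"):
            key, _, value = item.partition("=")
            info[key] = value
        records[(chrom, int(pos), ref, alt)] = info
    return records


def diff_annotations(vcf_a, vcf_b, keys=("MQRankSum", "BaseQRankSum")):
    """Return the annotations that differ at sites present in both VCFs."""
    a, b = parse_info(vcf_a), parse_info(vcf_b)
    diffs = {}
    for site in sorted(a.keys() & b.keys()):
        changed = {
            k: (a[site].get(k), b[site].get(k))
            for k in keys
            if a[site].get(k) != b[site].get(k)
        }
        if changed:
            diffs[site] = changed
    return diffs


def make_vcf(rows):
    """Build a minimal VCF text from rows of column values (illustrative only)."""
    header = ["##fileformat=VCFv4.1",
              "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"]
    return "\n".join(header + ["\t".join(row) for row in rows])


# Made-up records standing in for the two runs.
whole_genome_run = make_vcf([
    ["1", "1000", ".", "A", "G", "50", ".", "DP=40;MQRankSum=0.50;BaseQRankSum=-1.20"],
    ["1", "2000", ".", "C", "T", "60", ".", "DP=35;MQRankSum=0.10;BaseQRankSum=0.30"],
    ["1", "900000", ".", "G", "A", "70", ".", "DP=30;MQRankSum=0.00;BaseQRankSum=0.00"],
])
exome_run = make_vcf([
    ["1", "1000", ".", "A", "G", "50", ".", "DP=40;MQRankSum=0.48;BaseQRankSum=-1.20"],
    ["1", "2000", ".", "C", "T", "60", ".", "DP=35;MQRankSum=0.10;BaseQRankSum=0.30"],
])

diffs = diff_annotations(whole_genome_run, exome_run)
for site, changed in diffs.items():
    print(site, changed)
```

Sites that appear in only one of the two files (such as calls outside the target regions) are ignored, so the output lists only loci where both runs made a call but an annotation differs.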

Answers

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)
    edited June 2014 Accepted Answer

    @ersgupta‌

    Hi,

    Theoretically, there is no difference between the two approaches. Your first approach, running UG and then applying the exome filter, will increase runtime, but the end result should be the same as running UG with the exome regions specified.

    There is one small exception. When running UG, GATK applies downsampling to deal with areas where there is too much sequence coverage (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_CommandLineGATK.html#--downsample_to_coverage). To make the downsampling random, UG picks reads using a built-in algorithm that takes the interval lengths into account. When running on the whole genome, the interval lengths differ from those in an exome run, so the reads that are left out may differ between the two runs. This is usually not a problem, but you will see slightly different annotation values, because different reads were used in each run.

    Please also note that in borderline cases, different downsampling may result in a different variant call. A borderline case is a site where there is not quite enough information to determine whether it is variant: if the downsampling happens to keep some variant-supporting reads and leave out some non-variant-supporting reads, a variant is called.

    I hope this makes sense.

    -Sheila
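The effect described in the answer above can be illustrated with a toy simulation. This is a minimal sketch, not GATK's actual downsampling algorithm: the pileup, the target coverage of 250, and the function names are all assumptions made for illustration, and the `seed` argument simply stands in for whatever differs between the whole-genome and exome runs.

```python
import random


def downsample(reads, target_coverage, seed):
    """Keep at most target_coverage reads, chosen uniformly at random."""
    if len(reads) <= target_coverage:
        return list(reads)  # coverage is already low enough, keep everything
    return random.Random(seed).sample(reads, target_coverage)


def alt_count(reads):
    """Count reads supporting the alternate allele."""
    return sum(1 for base in reads if base == "ALT")


# Hypothetical high-coverage site: 1000 reads, 80 of which support the variant.
pileup = ["ALT"] * 80 + ["REF"] * 920

# Two runs over the same site that end up keeping different reads.
run_a = downsample(pileup, 250, seed=1)
run_b = downsample(pileup, 250, seed=2)

print("alt reads kept in run A:", alt_count(run_a))
print("alt reads kept in run B:", alt_count(run_b))
```

Both runs keep 250 reads, but not necessarily the same 250, so the number of variant-supporting reads (and anything computed from the kept reads) can shift slightly. At a site with only marginal evidence, that shift could be enough to change the call.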

  • ersgupta, Bangalore, India, Member

    @Sheila‌
    Thanks for the detailed explanation. It all makes sense. Just one clarification: in my case, is running on the exome regions the better option, since the interval lengths will best suit those regions?

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)
    Accepted Answer

    @ersgupta‌

    It really does not matter. The downsampling is random; the interval length is only a value that the algorithm uses to determine which reads will be left out.

    You are free to work with whichever you please.

    -Sheila

  • ersgupta, Bangalore, India, Member

    @Sheila‌
    Ok. That answers my queries. Thanks. :)
