
PhaseByTransmission help!

Greetings!
I am working on an exome sequencing project in family trios. I followed the Best Practices to analyze the sequences. I ran the PhaseByTransmission walker with the --MendelianViolationsFile flag, hoping that the generated Mendelian violations file would give me information about possible de novo variants. However, I do not understand what the columns of the output file mean. Could you please help me understand the file's nomenclature?

I also tried to extract the de novo variants using an in-house script, but I am not sure about the nomenclature of the phased .vcf file. I would like to know the difference between ./., 1/0, 0/1 and 1/1 among the three family members. For example, I obtained 0/1 ./. ./. (for child, mother, father), but I do not know whether a row like this represents a de novo variant and should be extracted.

Thank you for your time
OMR

Answers

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Hi Oscar,

    I don't have an example of a mendelian violations file on hand, but if you post the header and a few lines from the one you have I should be able to help you interpret it.

    Regarding the notation of genotypes in your VCF file, have a look at this document explaining the GATK's VCF output. If you want more details, I would recommend looking up the VCF specification. It is a very helpful document.

    Note that phased genotypes are written with "|" instead of "/" as separator, so the genotypes you posted were apparently not successfully phased.
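To make the genotype notation concrete, here is a minimal Python sketch of how a script might read GT strings for a trio. It is not part of PhaseByTransmission; the function names `parse_gt` and `is_de_novo_candidate` are hypothetical. It assumes the usual VCF conventions: `|` separates phased alleles, `/` separates unphased alleles, and `.` marks a missing call. A site is flagged only when the child carries an allele seen in neither parent, and a trio with any no-call cannot be judged at all.

```python
def parse_gt(gt):
    """Split a VCF GT string into (alleles, phased); '.' alleles become None."""
    phased = "|" in gt
    sep = "|" if phased else "/"
    alleles = [None if a == "." else int(a) for a in gt.split(sep)]
    return alleles, phased


def is_de_novo_candidate(child_gt, mother_gt, father_gt):
    """True if the child has an allele absent from both parents.

    Returns None when any genotype contains a no-call, because the
    trio cannot be evaluated in that case.
    """
    child, _ = parse_gt(child_gt)
    mother, _ = parse_gt(mother_gt)
    father, _ = parse_gt(father_gt)
    if None in child + mother + father:
        return None
    parental = set(mother) | set(father)
    return any(allele not in parental for allele in child)


print(parse_gt("0|1"))                              # ([0, 1], True)
print(is_de_novo_candidate("0/1", "0/0", "0/0"))    # True
print(is_de_novo_candidate("0/1", "./.", "./."))    # None
```

Under this sketch, a row such as 0/1 ./. ./. is neither confirmed nor excluded as de novo, since both parental genotypes are missing.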

  • Thank you for your answer, Geraldine!

    I am posting the header and some lines so that you can kindly give me a hand. Regarding the phased .vcf file, what would a de novo mutation look like in terms of the child, mother and father genotypes? Would it be something like "[0-1]|[0-1] ./. ./."? I am a little confused about that, so I would appreciate any help you can offer. I will also take a look at the documents you suggested.

    Thank you again,

    OMR

  • pdexheimer (Member ✭✭✭✭)

    It's been a while since I've messed with PBT, but I'm pretty sure TP is a measure of the quality of the phasing call (maybe Phred-scaled?). I think it comes straight from the VCF, and so is defined in that header

  • Thank you both for your answers!

    Geraldine, why would you think my depth is -1 for certain genotypes? Is it some kind of bug? When I checked the vcf generated, the depth is not -1 for any of the family members on those positions.

    Thanks!

    OMR

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    I don't know -- I'll check the code tomorrow when I'm in the office. Maybe the -1 value has a special meaning.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Well, I did find that TP stands for Transmission Probability, but I couldn't find any reason why depth would be -1 for any case. That seems like it might be a bug. Are you seeing this at all sites you phased or only a subset?

  • Hi Geraldine,

    Sorry for the late response. It seems to affect only a small subset. When I used awk to filter the lines with that value, the result was a much smaller data set.

    I wonder whether that might be because I first used version 3.9 and then phased with version 4.9.

    Thank you!

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Hmm, the version shouldn't have much impact. Can you post the vcf lines corresponding to some of the calls that have the DP=-1 symptom?

  • Avi (Member)

    Hi All,
    Just a follow-up on the TP field. As I understand it, TP is the Transmission Probability on the Phred scale, so 10^(-TP/10) should give me a probability in the (0, 1] range. Does a value close to 1 indicate a highly confident call, or does a value close to 0?

    I'm trying to call de novo mutations in trios using the GATK PhaseByTransmission tool, and understanding this would help me prioritize calls in the Mendelian violations file!
    Cheers,
    Avinash

  • Laurent (Member, Broadie ✭✭)

    Dear all,

    TP is indeed a Phred-scaled probability, so 10 => p(error) = 1/10, 20 => p(error) = 1/100, etc. The greater the TP, the more confident the call. From experience, I have to warn you that while the ordering of TPs within a dataset is meaningful (i.e. the greater the better), TP is not well calibrated for de novo mutations and tends to overestimate their confidence.
    TP = -1 is actually an overflow (I thought I corrected this bug, but if it is appearing with the latest version, I'm happy to look into it).

    I hope this helps and I'm happy to answer more questions!
    Cheers,
    Laurent
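The Phred relationship described above can be written out as a short Python sketch. The helper name `tp_to_error_prob` is hypothetical, and treating a negative TP as unusable is an assumption based on the remark that TP = -1 is an overflow rather than a real score.

```python
def tp_to_error_prob(tp):
    """Convert a Phred-scaled TP value to an error probability.

    p(error) = 10 ** (-TP / 10). Negative values (such as the -1
    overflow mentioned in this thread) are treated as invalid and
    return None; that handling is an assumption of this sketch.
    """
    if tp < 0:
        return None
    return 10 ** (-tp / 10)


for tp in (10, 20, 30, -1):
    print(tp, tp_to_error_prob(tp))
```

So a higher TP maps to a smaller error probability, and TP = 0 maps to p(error) = 1.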

  • sehrrot (Member)

    Hi all

    Thanks for the helpful comments. I am wondering what the best criteria are for selecting de novo variants. I have tried to find them in the literature but couldn't, as PBT is a new function.

    I am now working on a trio exome data set and use the following filtering criteria: TP>20 and DP(mother, father, child)>20. What is the exact meaning of AC (allele count)? Is it the allele count of the child?

    If anyone works on this, please share/comment on your suggestion.

    Cheers
    J

  • Laurent (Member, Broadie ✭✭)

    Hi sehrrot,

    I am sorry for the delayed response... I hadn't seen your post. Can you give me a little more information about the data you're looking at? (platform, coverage, target)

    In my experience, I usually use TP>=30 for a specific set; TP>=20 will obviously be more sensitive. Note that DP and AC are used to compute the genotype likelihoods (PLs) and therefore also TP. That said, looking for de novo variants is always a difficult endeavor, and it can help to look at these fields too. Regarding the meaning of AC, it is the total allele count in genotypes, for each ALT allele, in the same order as listed. Note that for many fields you can find a description in the VCF format specification:
    http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

    I hope this helps. Also, I'd be very interested in hearing more about your experience using PhaseByTransmission, whether it was useful to find de novo variants, if you have suggestions for improvements, etc.

    Thanks!
    Laurent
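The thresholds discussed above (TP>=30 for a specific set, TP>=20 for a more sensitive one, plus a minimum depth in every trio member) could be applied along these lines. This is only a sketch: the record layout (a dict with `tp` and per-member `dp`) and the name `passes_filter` are hypothetical, the sites are made-up illustration data, and parsing the actual Mendelian violations file or VCF is left out.

```python
def passes_filter(record, min_tp=30, min_dp=20):
    """Keep a trio call only if TP and every member's depth clear the thresholds."""
    if record["tp"] < min_tp:
        return False
    return all(depth >= min_dp for depth in record["dp"].values())


# Made-up example records, not real data.
calls = [
    {"site": "siteA", "tp": 35, "dp": {"child": 40, "mother": 25, "father": 30}},
    {"site": "siteB", "tp": 22, "dp": {"child": 40, "mother": 25, "father": 30}},
    {"site": "siteC", "tp": 50, "dp": {"child": 40, "mother": 12, "father": 30}},
]

specific = [c["site"] for c in calls if passes_filter(c)]
sensitive = [c["site"] for c in calls if passes_filter(c, min_tp=20)]
print(specific)   # ['siteA']
print(sensitive)  # ['siteA', 'siteB']
```

Lowering `min_tp` from 30 to 20 admits siteB, while siteC fails at either setting because the mother's depth is below 20.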

  • sehrrot (Member)

    Hi Laurent

    Thanks for your response. Yes, I've used TP>=20 to select de novo mutations (DNMs) from the output. I also restricted the selection to genotypes with DP>=20 in mother, father and child. I then checked them by manual inspection in IGV, which took a lot of time. They look fine so far, but I still need to do validation sequencing.

    However, I haven't found any DNMs on the sex chromosomes. Even though I supplied pedigree information, PhaseByTransmission did not properly find DNMs on the sex chromosomes, and I still don't know why. Has anyone had a similar experience?

    J

  • Laurent (Member, Broadie ✭✭)

    Hi sehrrot,

    Great to hear that the mutations look good in IGV; I hope they'll validate too. Regarding sex chromosomes, the current version of PBT treats all chromosomes as autosomal, so be careful! We are currently working on a sex-chromosome-aware version, but we are not sure when it will be ready and released.

    Cheers,
    Laurent
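Since the tool treats every chromosome as autosomal, one cautious option is to set sex-chromosome calls aside and review them separately. The Python sketch below shows that split. The contig names in `SEX_CONTIGS` are an assumption that depends on the reference build, and the record layout and function name are hypothetical.

```python
# Assumed contig names; adjust to match the reference in use.
SEX_CONTIGS = {"X", "Y", "chrX", "chrY"}


def split_sex_contigs(records):
    """Separate records on sex chromosomes from all other records.

    Everything not named in SEX_CONTIGS (autosomes, but also any
    mitochondrial or unplaced contigs) ends up in the first list.
    """
    other, sex_linked = [], []
    for rec in records:
        if rec["chrom"] in SEX_CONTIGS:
            sex_linked.append(rec)
        else:
            other.append(rec)
    return other, sex_linked


records = [{"chrom": "1"}, {"chrom": "X"}, {"chrom": "chrY"}, {"chrom": "22"}]
other, sex_linked = split_sex_contigs(records)
print(len(other), len(sex_linked))  # 2 2
```

Calls in the sex-linked list would then need their own inheritance checks, which this sketch does not attempt.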

  • sehrrot (Member)

    Hi Laurent

    Thanks for your comment about sex chromosomes. I suspected as much, so I wrote my own script to filter and find de novo variants on them, but I couldn't find any (this might be due to the low number of samples).

    I am looking forward to an updated version.
    Thanks
    J
