PhaseByTransmission help!

Greetings!
I am working on an exome sequencing project in family trios. I have followed the Best Practices to analyze the sequences. I ran the PhaseByTransmission walker with the --MendelianViolationsFile flag, hoping that the generated Mendelian violations file would give me information about possible de novo variants. However, I did not understand what the columns of the output file mean. Could you please help me understand the file's nomenclature?

I also tried to extract the de novo variants using an in-house script, but I am not sure about the nomenclature of the phased .vcf file. I would like to know the difference between ./., 1/0, 0/1 and 1/1 among the three family members. For example, I obtained 0/1 ./. ./. (for child, mother, father), but I do not know whether a row like this represents a de novo variant and should be extracted.

Thank you for your time.
OMR

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi Oscar,

    I don't have an example of a mendelian violations file on hand, but if you post the header and a few lines from the one you have I should be able to help you interpret it.

    Regarding the notation of genotypes in your VCF file, have a look at this document explaining the GATK's VCF output. If you want more details, I would recommend looking up the VCF specification. It is a very helpful document.

    Note that phased genotypes are written with "|" instead of "/" as separator, so the genotypes you posted were apparently not successfully phased.
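
To make the notation concrete, here is a minimal Python sketch (not GATK code) that splits a GT string into alleles and reports whether it is phased. It relies on the VCF convention that "." marks a missing allele. The helper names `parse_gt` and `is_de_novo_candidate` are hypothetical, and the candidate rule (child carries an ALT allele, both parents are called homozygous reference) is a simplifying assumption for illustration, not what PhaseByTransmission computes.

```python
from typing import List, Optional, Tuple


def parse_gt(gt: str) -> Tuple[List[Optional[int]], bool]:
    """Split a VCF GT string into allele indices and a phased flag.

    '|' separates phased alleles, '/' unphased ones; '.' is a missing allele.
    """
    phased = "|" in gt
    sep = "|" if phased else "/"
    alleles = [None if a == "." else int(a) for a in gt.split(sep)]
    return alleles, phased


def is_de_novo_candidate(child: str, mother: str, father: str) -> bool:
    """Naive rule: child has an ALT allele, both parents are called hom-ref."""
    child_alleles, _ = parse_gt(child)
    if None in child_alleles or not any(a > 0 for a in child_alleles):
        return False
    for parent in (mother, father):
        alleles, _ = parse_gt(parent)
        # A no-call parent (./.) cannot rule out inheritance.
        if None in alleles or any(a != 0 for a in alleles):
            return False
    return True


print(parse_gt("0|1"))                             # ([0, 1], True)
print(is_de_novo_candidate("0/1", "0/0", "0/0"))   # True
print(is_de_novo_candidate("0/1", "./.", "./."))   # False
```

Under this rule a row such as 0/1 ./. ./. is not reported as a candidate, because a no-call in either parent leaves the origin of the child's ALT allele undetermined.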

  • Thank you for your answer, Geraldine!

    I am posting the header and a few lines so that you can kindly give me a hand. Regarding the phased .vcf file, what would a de novo mutation look like in terms of the child, father and mother genotypes? Would it be something like "[0-1]|[0-1] ./. ./."? I am a little confused about that, so I would appreciate any help you can offer. I will also take a look at the documents you suggested.

    Thank you again,

    OMR

  • It's been a while since I've messed with PBT, but I'm pretty sure TP is a measure of the quality of the phasing call (maybe Phred-scaled?). I think it comes straight from the VCF, and so is defined in that header.

  • Thank you both for your answers!

    Geraldine, why would you think my depth is -1 for certain genotypes? Is it some kind of bug? When I checked the vcf generated, the depth is not -1 for any of the family members on those positions.

    Thanks!

    OMR

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    I don't know -- I'll check the code tomorrow when I'm in the office. Maybe the -1 value has a special meaning.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Well, I did find that TP stands for Transmission Probability, but I couldn't find any reason why depth would be -1 for any case. That seems like it might be a bug. Are you seeing this at all sites you phased or only a subset?

  • Hi Geraldine,

    Sorry for the late response. It seems to affect only a small subset. When I used awk to filter the lines with that probability, the result was a smaller data set.

    I wonder if that might be because I first used version 3.9 and then phased with 4.9.

    Thank you!

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hmm, the version shouldn't have much impact. Can you post the vcf lines corresponding to some of the calls that have the DP=-1 symptom?

  • Avi (Member)

    Hi All,
    Just a follow-up on the TP field. As I understand it, TP is the Transmission Probability on the Phred scale, so 10^(-TP/10) should give me a probability in the (0, 1] range. Does a value close to 1 mean a highly confident call, or does a value close to 0?

    I'm trying to call de novo mutations in trios using the GATK PhaseByTransmission tool, and understanding this would help me prioritize calls in the Mendelian violations file!
    Cheers,
    Avinash

  • Laurent (Member, Broadie)

    Dear all,

    TP is indeed a Phred-scaled probability, so 10 => p(error) = 1/10, 20 => p(error) = 1/100, etc. So the greater the TP, the more confident the call. From experience, I have to warn you that while the ordering of TPs within a dataset is meaningful (i.e. the greater the better), TP is not well calibrated when it comes to de novo mutations and tends to overestimate their confidence.
    TP = -1 is actually an overflow (I thought I corrected this bug, but if it is appearing with the latest version, I'm happy to look into it).
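
For anyone who wants the arithmetic spelled out, here is a small Python sketch of the Phred conversion described above. The function name `phred_to_error_prob` is hypothetical, and rejecting negative values is an assumption made here so that an overflowed TP of -1 is not silently turned into a probability.

```python
def phred_to_error_prob(tp: float) -> float:
    """Convert a Phred-scaled value to an error probability: p = 10^(-TP/10)."""
    if tp < 0:
        # Assumption: a negative TP (e.g. the -1 overflow) is not a valid score.
        raise ValueError(f"negative Phred value: {tp}")
    return 10 ** (-tp / 10)


for tp in (10, 20, 30):
    p_error = phred_to_error_prob(tp)
    print(f"TP={tp}: p(error)={p_error:.4f}, confidence={1 - p_error:.4f}")
```

So the quantity 10^(-TP/10) is the probability that the call is wrong: values near 0 (high TP) are the confident ones, and a value of 1 (TP = 0) carries no confidence at all.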

    I hope this helps and I'm happy to answer more questions!
    Cheers,
    Laurent

  • Hi all

    Thanks for the helpful comments. I am wondering what the best criteria are for selecting de novo variants. I tried to find this in the literature but couldn't, as PBT is a new tool.

    I am now working on a trio exome data set and use the following filtering criteria: TP>20 and DP (mother, father, child)>20. What is the exact meaning of AC (allele count)? Is it the allele count of the child?

    If anyone is working on this, please share your suggestions.

    Cheers
    J

  • Laurent (Member, Broadie)

    Hi sehrrot,

    I am sorry for the delayed response... I hadn't seen your post. Can you give me a little more information about the data you're looking at? (platform, coverage, target)

    From my experience, I usually use TP>=30 for a specific set; TP>=20 will obviously be more sensitive. Note that DP and AC are used to compute the genotype likelihoods (PLs) and therefore also TP. That said, looking for de novo variants is always a difficult endeavor, and it can help to look at these fields too. Regarding the meaning of AC, it is the total allele count in genotypes, for each ALT allele, in the same order as listed. Note that for many fields you can find their description in the VCF format description:
    http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
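
As a rough illustration of the thresholds discussed in this thread, the Python sketch below keeps a call only when TP and the depth of all three trio members meet the cutoffs. The record layout (a dict with "TP" and per-sample "DP") and the function name `passes_filter` are hypothetical, and the records are made-up values, not real calls; parsing the actual VCF or violations file is left out.

```python
from typing import Dict, List

TRIO = ("child", "mother", "father")


def passes_filter(record: Dict, min_tp: int = 30, min_dp: int = 20) -> bool:
    """Keep a call only if TP and every trio member's depth meet the cutoffs."""
    if record["TP"] < min_tp:
        return False
    return all(record["DP"][member] >= min_dp for member in TRIO)


# Made-up records for illustration only; these are not real calls.
records: List[Dict] = [
    {"site": "example_1", "TP": 35, "DP": {"child": 40, "mother": 28, "father": 31}},
    {"site": "example_2", "TP": 22, "DP": {"child": 45, "mother": 30, "father": 25}},
    {"site": "example_3", "TP": 50, "DP": {"child": 12, "mother": 33, "father": 29}},
]

specific = [r["site"] for r in records if passes_filter(r)]
sensitive = [r["site"] for r in records if passes_filter(r, min_tp=20)]
print(specific)   # ['example_1']
print(sensitive)  # ['example_1', 'example_2']
```

Lowering `min_tp` from 30 to 20 trades specificity for sensitivity, as noted above; the depth cutoff is applied independently of TP.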

    I hope this helps. Also, I'd be very interested in hearing more about your experience using PhaseByTransmission, whether it was useful to find de novo variants, if you have suggestions for improvements, etc.

    Thanks!
    Laurent

  • Hi Laurent

    Thanks for your response. Yes, I've used TP>=20 for choosing de novo mutations (DNMs) from the output. I also tried to choose genotypes that have DP>=20 in mother, father and child. Then I checked them by manual inspection (IGV), which took a lot of time. They look fine so far, but I still need to do validation sequencing.

    However, I haven't found any DNMs on the sex chromosomes. Even though I supplied pedigree information, PhaseByTransmission did not properly find DNMs on sex chromosomes. I still have no idea why. Has anyone had a similar experience?

    J

  • Laurent (Member, Broadie)

    Hi sehrrot,

    Great to hear that the mutations look good on IGV; I hope they'll validate too. Regarding sex chromosomes, the current version of PBT treats all chromosomes as autosomal, so be careful! We are currently working on a sex-chromosome-aware version, but we are not sure when it will be ready and released.
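
One way to act on that caution might be to set calls on sex chromosomes aside for separate review instead of trusting their autosomal-model scores. The Python sketch below does only that partitioning. The contig names in `SEX_CONTIGS`, the "chrom" key and the function name `split_by_chromosome_type` are assumptions for illustration and depend on the reference used.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed contig names; check the reference used for your own data.
SEX_CONTIGS = {"X", "Y", "chrX", "chrY"}


def split_by_chromosome_type(
    records: Iterable[Dict],
) -> Tuple[List[Dict], List[Dict]]:
    """Partition records into (autosomal, sex_linked) by contig name."""
    autosomal: List[Dict] = []
    sex_linked: List[Dict] = []
    for record in records:
        target = sex_linked if record["chrom"] in SEX_CONTIGS else autosomal
        target.append(record)
    return autosomal, sex_linked


# Made-up records for illustration only.
calls = [{"chrom": "1"}, {"chrom": "X"}, {"chrom": "chrY"}, {"chrom": "22"}]
autosomal, sex_linked = split_by_chromosome_type(calls)
print(len(autosomal), len(sex_linked))  # 2 2
```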

    Cheers,
    Laurent

  • Hi Laurent

    Thanks for your comment about sex chromosomes. I suspected as much. So I wrote a script to find de novo variants on them manually, but I couldn't find any (this might be due to the low number of samples).

    I am looking forward to an updated version.
    Thanks
    J