Finding de novo variants from trio

joonomics
edited July 2012

Hi all

Is there any option in GATK to find de novo variants from a trio?


(+) nice and fancy new website! good :D


  Geraldine_VdAuwera

    Hi there,

    Sorry it took a while to get back to you. I recommend you read the method post on Pedigree Analysis using the GATK, as well as the tech documentation for the PhaseByTransmission walker.

    I hope this helps!

  • 1/ The upper panel is the output from “violation”. It looks like it includes every non-Mendelian transmission in the violation table, including sites with missing genotypes (a genotype missing in one of the parents was counted as a violation too).
    2/ The lower panel shows the results after excluding the missing sites. The Mother_AD, Father_AD and child_AD columns all contain some weird-looking symbols, and I am not sure what they mean. How should I read this output?
    3/ I have hundreds of rows in the lower panel. Of course, not all of them will be de novo mutations. How do I filter the real de novo mutations from these hundreds of candidates?
    4/ Is there any good documentation of how this violation file is generated?

    1 880639 2 YRI 10 TC/TC 62 [[I@5fbc31d] 0,187,2817 TC/TC 57 [[I@53c6a7fc] 0,171,1860 T/T [98] [I@173ebc5c 204,90,0
    1 955597 3 YRI null ./. -1 [null] . T|T 2 [[I@3d66c39] 59,6,0 G|T [9] [I@725b1426 39,0,194
    1 1291159 4 YRI null ./. -1 [null] . G|G 1 [[I@4f7eae6c] 28,3,0 G|G [1] [I@5ed5d3a 41,3,0
    1 1848734 4 YRI null ./. -1 [null] . G|G 5 [[I@72b6507e] 181,15,0 G|G [10] [I@1f5ebb08 325,27,0

    1 880639 2 YRI 10 TC/TC 62 [[I@5fbc31d] 0,187,2817 TC/TC 57 [[I@53c6a7fc] 0,171,1860 T/T [98] [I@173ebc5c 204,90,0
    1 1878553 2 YRI 19 G/G 79 [[I@6decb6a] 239,99,0 GC/GC 98 [[I@cfc9fac] 0,295,3819 GC/GC [153] [I@1d56dbdd 0,460,5945
    1 6170572 1 YRI 127 C/C 71 [[I@1ccd2bfc] 0,214,2431 C/C 81 [[I@6202bc29] 0,244,2783 C/T [118] [I@218f5a04 753,0,2438
    1 8000208 2 YRI 36 CTTA/CTTA 40 [[I@12eea1e7] 0,120,2516 CTTA/CTTA 40 [[I@70c74e66] 0,120,2516 C/C [62] [I@2cec4462 415,134,0
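
    On question 2/: the `[I@5fbc31d`-style tokens look like Java's default string form of an `int[]` (the type tag `[I`, then `@` and a hex hash code), which would mean the AD arrays were printed as object references rather than as their contents. A minimal sketch for flagging such tokens in a row, assuming the rows are plain whitespace-separated text as pasted above (the helper name is made up for illustration):

```python
import re

# Java prints an int[] that was not passed through Arrays.toString()
# as "[I@" followed by a hexadecimal hash code.
JAVA_INT_ARRAY_REF = re.compile(r"\[I@[0-9a-f]+")


def find_array_refs(row):
    """Return every Java int[] object-reference token found in one table row."""
    return JAVA_INT_ARRAY_REF.findall(row)


# First row of the table above, copied verbatim.
row = ("1 880639 2 YRI 10 TC/TC 62 [[I@5fbc31d] 0,187,2817 "
       "TC/TC 57 [[I@53c6a7fc] 0,171,1860 T/T [98] [I@173ebc5c 204,90,0")
print(find_array_refs(row))
```

    The tokens carry no allele-depth information, so the AD values cannot be recovered from this table; they would have to be read from the VCF itself.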

  ebanks

    Thanks for the report. I'm about to fix this and it will be available in the upcoming 2.3 release.

  Max

    Hi all,

    I hope it's okay to use the same thread instead of making a new one.
    Basically my problem is exactly the same: I want to call de novo variants from trio data. So I checked the documentation about Pedigree Analysis using the GATK and PhaseByTransmission, and I THINK I know what to do. However, I prefer to check twice before wasting lots of time and disk space, since I've never worked with GATK before ;)

    So my plan would be to use the UnifiedGenotyper on each trio to get a single multisample VCF file for each of my trios. Afterwards, I would use the PhaseByTransmission walker to recalibrate my multisample VCFs. For both steps I would provide the PED file I created, which uses the same sample IDs as the RG headers. Then, I can start filtering the variants and searching for de novos.
    Is this correct, and/or are there any other recommendations for de novo trio calling?

    Thanks !
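
    For reference, a PED file for one trio has three rows of six columns each: family ID, individual ID, paternal ID, maternal ID, sex, and phenotype. The sketch below builds those rows; the sample IDs are hypothetical placeholders and would need to match the sample names in the read-group headers, and the function name is made up for illustration.

```python
def trio_ped_lines(family_id, father_id, mother_id, child_id, child_sex=0):
    """Build the three PED records for one trio.

    Columns: family ID, individual ID, paternal ID, maternal ID,
    sex (1 = male, 2 = female, 0 = unknown), phenotype (-9 = missing).
    A parent ID of 0 marks a founder with no parents in the file.
    """
    rows = [
        (family_id, father_id, "0", "0", 1, -9),
        (family_id, mother_id, "0", "0", 2, -9),
        (family_id, child_id, father_id, mother_id, child_sex, -9),
    ]
    return ["\t".join(str(field) for field in row) for row in rows]


# Hypothetical IDs, for illustration only.
for line in trio_ped_lines("FAM1", "father01", "mother01", "child01", child_sex=2):
    print(line)
```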

  Geraldine_VdAuwera

    No problem, Max -- as long as it's on the same topic it's okay.

    Your plan sounds good, although you may also want to look into using the ReadBackedPhasing walker. Make sure you have a look at the presentation on "Genotype phasing and refinement" from our last workshop here:

  Max

    Thanks a lot for your feedback !
    I will take a look at the workshop slides and the ReadBackedPhasing walker.

  • Hi there

    I have a few trios for which I ran the following :

    • per-trio joint genotyping from HaplotypeCaller gvcf files (gatk 3.4)
    • PhaseByTransmission, ReadBackedPhasing and SelectVariants with --mendelianViolation flag

    If I look at one of the trios, I see the following :

    • PhaseByTransmission log says: "INFO 17:00:49,977 PhaseByTransmission - Number of remaining single mendelian violations in trios: 81"
    • mendelian violations vcf has 418 variants (of which 81 have the TP annotation, the rest don't).

    I am wondering:

    • why some variants in mendelian violations have the TP annotation, and some don't
    • does the TP annotation indicate a likely-to-be-real denovo (matching the same number as reported by PhaseByTransmission run)? What are the other ones?
    • I am actually interested in all denovo-like mutations - especially the false ones, to get an idea of false positives. Are there any built in filtering flags, or would looking at all the mendelian violations give me the full set of denovos (real and not)?

    Many thanks

  Sheila

    Hi Vicky,

    Can you check the VCF header to find out what the TP annotation means? I think that will answer your questions :smile:


  • Hi @Sheila

    The vcf header says
    "Phred score of the genotype combination and phase given that the genotypes are correct"
    and elsewhere in the forum, somebody indicated that TP stands for Transmission Probability. But I still don't understand why it is sometimes missing. Is there anywhere else I can look?
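
    As background on the header text: a Phred-scaled score Q conventionally maps to a probability via P = 10^(-Q/10). The sketch below applies only that generic conversion; whether TP should be read exactly this way is an assumption, since the header only calls it a Phred score.

```python
def phred_to_probability(q):
    """Convert a Phred-scaled score Q to a probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Higher Phred scores correspond to smaller probabilities.
for q in (10, 20, 30):
    print(q, phred_to_probability(q))
```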


  Sheila

    Hi Vicky,

    I just tried running PhaseByTransmission, and all my sites have the TP annotation. I'm not sure why it is missing for some of your sites. Can you tell me the exact commands you ran?


  vsvinti
edited April 2016

    Hi @Sheila

    I am running PhaseByTransmission, ReadBackedPhasing and selecting Mendelian violations with gatk 3.4-46-gbc02625 using the following commands:

    java -jar GenomeAnalysisTK.jar -T PhaseByTransmission -l INFO -R $ngs_reference_seq.fasta -ped $pedfile -V $trio_name.raw.vcf.gz -o $ 
    # log file prints : "Number of remaining single mendelian violations in trios: 81"
    java -jar GenomeAnalysisTK.jar -T ReadBackedPhasing -l INFO -R $ngs_reference_seq.fasta -ped $pedfile -V $ -L $ -o $ $bam_args
    java -jar GenomeAnalysisTK.jar -T SelectVariants -l INFO -R $ngs_reference_seq.fasta -ped $pedfile -V $ --mendelianViolation -o $trio_name.mendelian.violations.vcf.gz 

    The * contains 109764 sites, of which 2566 lack the TP annotation.
    The *mendelian.violations.vcf.gz contains 418 sites, of which 337 lack the TP annotation (81 with TP, matching number from log file).

    I understand from the forums that it is the latter vcf file I need to look at, to determine denovo variants. As mentioned, I care about all denovo-like sites (not just real ones), but need to identify an annotation that scores how 'real' it might be (I thought TP serves this purpose).

    There might be some fundamental concept that I am not understanding, or some internal site filtering (quality, depth etc) used before calculating TP that excludes some of the sites. Please let me know how/where I can investigate further.
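
    The with/without-TP counts quoted above can be reproduced with a few lines of plain Python. This is a minimal sketch that assumes a standard tab-separated VCF in which TP, when present, is listed in the FORMAT column (column 9); the function names and the example records are made up for illustration.

```python
import gzip


def count_tp(vcf_lines):
    """Return (records with TP in FORMAT, records without it)."""
    with_tp = without_tp = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # sites-only record, no FORMAT column
        if "TP" in fields[8].split(":"):
            with_tp += 1
        else:
            without_tp += 1
    return with_tp, without_tp


def count_tp_in_file(path):
    """Same count for a gzip-compressed VCF on disk."""
    with gzip.open(path, "rt") as handle:
        return count_tp(handle)


# Two synthetic records, for illustration only.
example = [
    "##fileformat=VCFv4.1\n",
    "1\t100\t.\tG\tA\t50\t.\tAC=1\tGT:AD:TP\t0/1:5,5:11\n",
    "1\t200\t.\tC\tT\t50\t.\tAC=1\tGT:AD\t0/1:5,5\n",
]
print(count_tp(example))
```

    Listing the positions of the records that lack TP, rather than only counting them, would be a natural next step when comparing them against the PhaseByTransmission input.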


  Sheila

    Hi Vicky,

    Sorry for the delay. I am checking with the team on this. I will get back to you soon.


  Sheila

    Hi Vicky,

    Can you please post before and after records from PhaseByTransmission that are in your Mendelian Violations file? Please include records that have and don't have the TP annotation.


  Geraldine_VdAuwera

    Hi @vsvinti, we actually have some newer tools to identify de novo variants. Have a look at the genotype refinement pipeline here. If you have trios I think you'll find these newer tools produce results that are easier to interpret.

  vsvinti
edited April 2016

    Here are two records from the phase by transmission output, one with and one without the TP annotation:

    1   3352763 rs369442219 CTT C,CT    2692.90 .   AC=1,5;AF=0.167,0.833;AN=6;BaseQRankSum=-3.150e-01;ClippingRankSum=-1.103e+00;DB;DP=82;FS=0.000;MLEAC=1,5;MLEAF=0.167,0.833;MQ=59.96;MQRankSum=-6.300e-01;QD=30.65;ReadPosRankSum=1.10;SOR=0.392    GT:AD:DP:GQ:PGT:PID:PL  1/2:0,6,16:22:91:.:.:955,448,382,163,0,91   2/2:1,0,21:22:51:1|1:3352763_CT_C:920,923,946,51,74,0   2/2:0,0,17:17:60:1|1:3352763_CT_C:844,844,844,60,60,0
    1   12891160    rs200573978 G   A   168.13  .   AC=1;AF=0.167;AN=6;BaseQRankSum=-1.835e+00;ClippingRankSum=-3.640e-01;DB;DP=275;FS=79.801;MLEAC=1;MLEAF=0.167;MQ=46.13;MQRankSum=-2.831e+00;QD=1.79;ReadPosRankSum=-6.300e-02;SOR=4.504 GT:AD:DP:GQ:PGT:PID:PL:TP   0/1:83,11:94:99:0|1:12891160_G_A:199,0,3482:11  0/0:95,5:100:91:0|1:12891160_G_A:0,91,4229:11   0/0:76,3:79:99:0|1:12891160_G_A:0,109,3352:11

    I applied genotype refinement to my entire cohort, though I did not use the denovo annotations in that context.

    1. My understanding was that I should be doing per-trio joint genotyping separately for denovos. Is that no longer necessary? Does the new denovo method return similar results to the one above?

    2. PossibleDeNovo notes indicate that it works better on pre-genotype-refined data, so (depending on 1.) I would either have to put my trio-based calls through it, or extract these from overall cohort geno refinement. The notes also mention PhaseByTransmission (point 3 under caveats). However, the new refinement-denovos path does not include this step - am I right?

    3. I see that low and high confidence denovos are determined based on posterior probabilities: is that PP (and not JP)? If so, could I just look at this annotation as an indicator?

    4. Lastly, how does this compare with MVLikelihoodRatio? Is this integrated in any of the above, or different, or do they give similar results?

    Sorry for all the questions. Basically, I would like to obtain a list of sites, with some score as to whether they are denovos. I suppose any of the above would do (as long as I use the appropriate annotation), though now that I know about them, I am curious as to how they are different.
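
    For a first pass at such a list, a genotype-only Mendelian check can be written directly against the GT fields. This is a minimal sketch, not GATK's implementation: it ignores genotype likelihoods and quality, and it takes the child, mother and father genotypes as explicit arguments because the sample order in the two records quoted above is not stated in the thread.

```python
import re
from itertools import product


def alleles(gt):
    """Split a VCF GT string such as '0/1' or '1|2' into its allele indices."""
    return re.split(r"[/|]", gt)


def is_mendelian_violation(child, mother, father):
    """True if no pairing of one maternal and one paternal allele gives the child genotype.

    Returns None if any genotype is missing or not diploid.
    """
    c, m, f = alleles(child), alleles(mother), alleles(father)
    if any(len(g) != 2 or "." in g for g in (c, m, f)):
        return None
    return not any(sorted(pair) == sorted(c) for pair in product(m, f))


# A heterozygous child with two homozygous-reference parents is a de novo candidate.
print(is_mendelian_violation("0/1", "0/0", "0/0"))
# The same child genotype is consistent if the mother carries the allele.
print(is_mendelian_violation("0/1", "0/1", "0/0"))
```

    A check like this only flags candidate sites; ranking them still needs a per-site score such as the annotations discussed above.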


  vsvinti

    Any further thoughts on my last post, @Sheila @Geraldine_VdAuwera ?

