Extracting de novo mutations from a multi-sample VCF

Max (Posts: 27, Member)
edited May 2013 in Ask the GATK team

Hi all,

I'm currently trying to extract de novo mutations from my multi-sample vcf files (trios). I've already read the VCF file specification documentation but wanted to check if I got this right. So I would call a de novo mutation candidate in the following cases:

1. Child has the genotype 0|1, 1|0 or 1|1 and both parents have 0|0

2. Child has the genotype 1|0 or 1|1 and the mother has 0|0

3. Child has the genotype 0|1 or 1|1 and the father has 0|0

Is this correct? And are there any other cases indicating a de novo mutation that I have missed so far?

Thanks!
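
The three rules above can be sketched in Python as follows. This only illustrates the rules as written in the post, under the assumption (implied by rules 2 and 3) that a phased child genotype lists the maternal allele first and the paternal allele second, with `0` for the reference allele and `1` for the alternate allele. The function name `is_de_novo_candidate` is made up for this example and is not part of GATK.

```python
def is_de_novo_candidate(child, mother, father):
    """Apply the three rules from the post to phased genotype strings.

    Assumes the child genotype is written maternal|paternal
    (an assumption implied by rules 2 and 3, not checked here).
    """
    hom_ref = "0|0"
    # Rule 1: child carries an alternate allele, both parents are hom-ref.
    rule1 = (child in {"0|1", "1|0", "1|1"}
             and mother == hom_ref and father == hom_ref)
    # Rule 2: the maternal slot of the child is alternate, mother is hom-ref.
    rule2 = child in {"1|0", "1|1"} and mother == hom_ref
    # Rule 3: the paternal slot of the child is alternate, father is hom-ref.
    rule3 = child in {"0|1", "1|1"} and father == hom_ref
    return rule1 or rule2 or rule3
```

For example, `is_de_novo_candidate("1|0", "0|0", "0|1")` is flagged by rule 2, while `is_de_novo_candidate("0|1", "0|0", "0|1")` is not flagged, because the alternate allele in the paternal slot can be explained by the father.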

Answers

  • Max (Posts: 27, Member)

    All right, thanks a lot, Geraldine!

    Are there any recommendations on how to handle genotypes that are left unphased? I saw that some of the entries in my VCF could not be phased successfully.

  • flescai (Posts: 53, Member ✭✭)

    Is there any flag or de novo probability annotated in the phased VCF file that would allow extracting the variants with grep or SelectVariants? I have several trios, and it's not straightforward to check the genotypes in the multi-sample VCF file. Thanks!

  • Geraldine_VdAuwera (Posts: 6,822, Administrator, GATK Developer)

    Hi @flescai,

    I'm not sure I understand what you're asking -- are you trying to select specifically the sites that have been phased, excluding unphased ones? Or something else?

    Geraldine Van der Auwera, PhD

  • flescai (Posts: 53, Member ✭✭)
    edited June 2013

    Apologies @Geraldine_VdAuwera, I would like to specifically select de novo variants, i.e. new variants not transmitted by the parents. You could do this with a script, using conditions like those listed at the beginning of this thread, but I was wondering if there is an annotation (like TD ~ value, or denovo=true) that already flags them. I read that PhaseByTransmission models de novo variants, and I thought it would somehow highlight which of them are likely to be de novo, without us having to go back and check the parents' genotypes "manually".

    If they are categorised or given a probability, I can then select them with SelectVariants or something else, instead of extracting them with the conditions listed at the top (if I have many trios it becomes trickier to compare them in the VCF when the walker is somehow modelling it anyway).

    cheers,
    F

  • Geraldine_VdAuwera (Posts: 6,822, Administrator, GATK Developer)

    Ah, I see, thanks for clarifying. Currently that information can be output to the optional MendelianViolationsFile. If that is not satisfactory, would you say that you would prefer to have a "de novo" annotation in the VCF file itself, to make it more explicit?

    Geraldine Van der Auwera, PhD

  • flescai (Posts: 53, Member ✭✭)

    Wow, I hadn't output that file before.

    It is hugely useful! Of course, it needs a bit of scripting to extract the de novo ones (I still need to compare the mother and father genotypes, when they both exist). If I understood correctly, a low TP doesn't necessarily mean a variant is de novo; it could be due to a missing parent or unphased genotypes. I think I can work with it for now, but an explicit annotation in the VCF would obviously be useful in future.

    I also need to report a formatting problem in that file, which may be a small bug: not all lines have the same number of columns. For example, if I want to check the genotypes:

    $ cut -f 6,10,14 mendelianViolations.txt | more
    MOTHER_GT   FATHER_GT   CHILD_GT
    G|G ./. G|G
    ./. G|A G|G
    G|G ./. G|G
    G|G ./. G|G
    ./. G|G G|G
    ./. G|G G|G
    G|G ./. G|G
    G|G ./. G|G
    G|G ./. G|G
    ./. G|G A|G
    ./. G|G G|G
    G|G:238:197,51:71,9,0   .   95,12,0
    G|G:153:119,42:109,12,0 .   108,12,0
    .   G|G G|G
    G|G:238:197,50:119,12,0 .   226,27,0
    G|G:210:178,41:24,3,0   .   184,21,0
    G|G:238:137,112:237,30,0    .   172,21,0
    ./. G|G G|G
    G|G:238:146,102:261,30,0    .   71,9,0
    G|G:135:89,52:85,9,0    .   70,9,0
    G|G ./. G|G
    .   G|G G|G
    G|G:161:139,30:30,3,0   .   47,6,0
    

    It becomes slightly complicated to compare the genotypes in this file. I am using v2.5-2-gf57256b.

    Thanks!

  • Geraldine_VdAuwera (Posts: 6,822, Administrator, GATK Developer)

    Happy to help. FYI there is also a --mendelianViolations flag for SelectVariants, now that I think of it, which should output the actual sites. With the caveat that they might not all really be de novo as you say.

    Re the table you posted, if I'm not missing something all the lines have 3 tab-delimited columns. It's just that the aggregated data in some columns have to be teased apart further...

    Geraldine Van der Auwera, PhD

  • flescai (Posts: 53, Member ✭✭)

    Hi @Geraldine_VdAuwera, I believe it shows 3 tab-delimited columns because I selected only three :-) But in the lines where the first genotype column is aggregated with the allele depths and probabilities, the other columns are shifted and you no longer see the genotypes. Here's the full output of that file without cutting the columns, if that helps:

    CHROM   POS AC  FAMILY  TP  MOTHER_GT   MOTHER_DP   MOTHER_AD   MOTHER_PL   FATHER_GT   FATHER_DP   FATHER_AD   FATHER_PL   CHILD_GT    CHILD_DP    CHILD_AD    CHILD_PL
    1   69270   254 FAM001  null    ./. -1  .   .   G|G -1  0,3 76,9,0  G|G -1  0,1 27,3,0
    1   69270   254 FAM002  4   G|G -1  0,5 132,15,0    ./. -1  .   .   G|G -1  0,1 25,3,0
    1   69270   254 FAM014  null    ./. -1  .   .   G|G -1  0,1 25,3,0  G|G -1  0,2 51,6,0
    1   69270   254 FAM016  null    ./. -1  .   .   G|G -1  0,2 49,6,0  G|G -1  0,2 49,6,0
    1   69270   254 FAM020  4   G|G -1  0,2 48,6,0  ./. -1  .   .   G|G -1  0,2 50,6,0
    1   69270   254 FAM085  3   G|G -1  0,3 79,9,0  ./. -1  .   .   G|G -1  0,1 27,3,0
    1   69270   254 FAM086  3   G|G -1  0,1 27,3,0  ./. -1  .   .   G|G -1  0,3 75,9,0
    1   69270   254 FAM091  null    ./. -1  .   .   G|G -1  0,1 27,3,0  G|G -1  0,2 50,6,0
    1   69270   254 FAM093  null    ./. -1  .   .   G|G -1  0,2 53,6,0  G|G -1  0,3 72,9,0
    1   69270   254 FAM096  null    ./. -1  .   .   G|G -1  0,3 71,9,0  G|G -1  0,1 20,3,0
    1   69270   254 FAM101  2   G|G:-1:0,1:25,3,0   .   .   .   .   G|G -1  0,1 23,3,0
    1   69270   254 FAM104  5   G|G:-1:0,2:52,6,0   .   .   .   .   G|G -1  0,3 78,9,0
    1   69270   254 FAM106  5   .   .   .   .   G|G -1  0,3 75,9,0  G|G -1  0,2 52,6,0
    1   69270   254 FAM107  5   G|G:-1:0,3:82,9,0   .   .   .   .   G|G -1  0,2 51,6,0
    1   69270   254 FAM112  3   G|G:-1:0,1:27,3,0   .   .   .   .   G|G -1  0,3 75,9,0
    1   69270   254 FAM118  null    ./. -1  .   .   G|G -1  0,2 43,6,0  G|G -1  0,1 18,3,0
    1   69428   17  FAM015  17  T|T -1  33,0    0,100,746   ./. -1  .   .   T|G -1  0,1 20,3,0
    1   69511   421 FAM002  127 G|G -1  0,71    1608,217,0  ./. -1  .   .   G|G -1  0,54    1199,160,0
    1   69511   421 FAM006  null    ./. -1  .   .   G|G -1  0,33    815,100,0   G|G -1  0,34    777,99,0
    1   69511   421 FAM009  68  G|G -1  0,22    566,70,0    ./. -1  .   .   G|G -1  0,25    627,73,0
    1   69511   421 FAM010  90  G|G -1  0,31    656,91,0    ./. -1  .   .   G|G -1  0,39    869,117,0
    1   69511   421 FAM011  null    ./. -1  .   .   G|G -1  0,32    685,95,0    G|G -1  0,30    642,88,0
    1   69511   421 FAM012  63  G|G -1  0,53    1296,161,0  ./. -1  .   .   G|G -1  0,21    500,63,0
    1   69511   421 FAM015  4   G|G -1  0,29    646,87,0    ./. -1  .   .   G|G -1  .   20,3,0
    1   69511   421 FAM019  null    ./. -1  .   .   G|G -1  0,38    842,110,0   G|G -1  0,41    1015,127,0
    1   69511   421 FAM083  null    ./. -1  .   .   G|G -1  0,22    481,63,0    A|G -1  1,0 0,3,20
    1   69511   421 FAM085  117 G|G -1  0,39    862,118,0   ./. -1  .   .   G|G -1  2,52    1282,156,0
    1   69511   421 FAM086  127 G|G -1  1,126   2869,365,0  ./. -1  .   .   G|G -1  0,96    2178,288,0
    1   69511   421 FAM091  null    ./. -1  .   .   G|G -1  1,47    1067,139,0  G|G -1  0,22    493,67,0
    1   69511   421 FAM092  null    ./. -1  .   .   G|G -1  0,38    914,113,0   G|G -1  0,32    715,97,0
    1   69511   421 FAM101  92  G|G:-1:0,44:1020,129,0  .   .   .   .   G|G -1  0,32    713,93,0
    1   69511   421 FAM102  78  G|G:-1:0,29:732,93,0    .   .   .   .   G|G -1  0,27    614,79,0
    1   69511   421 FAM104  75  G|G:-1:0,25:610,76,0    .   .   .   .   G|G -1  0,26    638,83,0
    1   69511   421 FAM105  97  .   .   .   .   G|G -1  0,33    807,98,0    G|G -1  1,41    922,120,0
    1   69511   421 FAM106  80  .   .   .   .   G|G -1  0,28    609,81,0    G|G -1  0,40    895,118,0
    1   69511   421 FAM107  105 G|G:-1:1,36:809,105,0   .   .   .   .   G|G -1  0,74    1675,218,0
    1   69511   421 FAM110  91  .   .   .   .   G|G -1  0,31    725,92,0    G|G -1  1,36    829,110,0
    1   69511   421 FAM112  127 G|G:-1:0,49:1143,145,0  .   .   .   .   G|G -1  2,64    1499,192,0
    1   69511   421 FAM114  127 G|G:-1:0,97:2304,294,0  .   .   .   .   G|G -1  0,77    1673,227,0
    

    There are 17 columns in normal lines and 14 columns in those where three fields (DP, AD and PL) have been aggregated into the first genotype column.
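
As a possible workaround until the output is fixed, the short rows can be expanded back to 17 columns before comparing genotypes. The sketch below assumes the file is tab-delimited with the 17-column header shown above, and that a short row is short only because one genotype field was written as `GT:DP:AD:PL` joined by colons, as in the FAM101 rows. The function name `expand_collapsed_row` is hypothetical and not part of GATK.

```python
EXPECTED_COLUMNS = 17  # as in the header line shown above


def expand_collapsed_row(line):
    """Return the fields of one data row, expanded back to 17 columns.

    Assumption: a short row is short only because one genotype field was
    written as GT:DP:AD:PL joined by colons, as in the FAM101 rows above.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) == EXPECTED_COLUMNS:
        return fields
    expanded = []
    for index, field in enumerate(fields):
        # Genotype blocks start at index 5; split colon-joined ones into 4.
        if index >= 5 and field.count(":") == 3:
            expanded.extend(field.split(":"))
        else:
            expanded.append(field)
    if len(expanded) != EXPECTED_COLUMNS:
        raise ValueError("unexpected column layout: %r" % line)
    return expanded
```

After expansion, the mother, father and child genotypes sit at indices 5, 9 and 13 of the returned list, which corresponds to the columns selected with `cut -f 6,10,14` earlier in the thread.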

  • Geraldine_VdAuwera (Posts: 6,822, Administrator, GATK Developer)

    Hah, that makes a lot more sense! I don't work with MV files myself, but I thought the content was a little... minimal...

    I see the problem now. And that is why it is always better to post unedited lines to the forum :)

    This must be a bug when the output is written out. I can't imagine it would be a desired behavior to condense some of the annotations and not others. I'll ask the tool's author, @Laurent, to take a look at this.

    Geraldine Van der Auwera, PhD

  • ManojK (Pune, Posts: 1, Member)

    Can the three rules from the original post also be applied to unphased genotypes?

  • Geraldine_VdAuwera (Posts: 6,822, Administrator, GATK Developer)

    Hi @ManojK,

    Yes, the same rules apply to unphased genotypes.

    Geraldine Van der Auwera, PhD
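
For unphased calls such as `0/1`, the allele order no longer says which parent an allele came from, so one way to apply the same idea is to ask whether any assignment of the child's two alleles to the two parents is possible. The sketch below is an interpretation under that assumption, not GATK's own logic. The helper names `parse_alleles` and `is_mendelian_violation` are hypothetical, and missing genotypes (`./.`) are treated as uninformative.

```python
def parse_alleles(genotype):
    """Split a genotype such as '0/1' or '0|1' into its alleles."""
    return genotype.replace("|", "/").split("/")


def is_mendelian_violation(child, mother, father):
    """True if the child's alleles cannot be explained by one allele
    from each parent, ignoring phase. Returns False when any genotype
    is missing, because nothing can be concluded from it.
    """
    c, m, f = (parse_alleles(g) for g in (child, mother, father))
    if "." in c + m + f or len(c) != 2:
        return False
    a, b = c
    # Try both ways of assigning the child's alleles to the parents.
    consistent = (a in m and b in f) or (b in m and a in f)
    return not consistent
```

With this check, a `0/1` child of two `0/0` parents is reported as a violation, whereas a `0/1` child with a `0/0` mother and a `0/1` father is not. As noted earlier in the thread, a violation is only a de novo candidate and may also reflect a genotyping error.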
