NON_REF in gvcf

I see something like this in my gVCF file:
chr1 17697 . G C, 72.77 . BaseQRankSum=0.322;ClippingRankSum=-1.517;DP=8;MLEAC=1,0;MLEAF=0.500,0.00;MQ=40.00;MQ0=0;MQRankSum=0.322;ReadPosRankSum=0.956 GT:AD:DP:GQ:PL:SB 0/1:5,3,0:8:99:101,0,178,116,187,303:0,5,0,3
Apparently NON_REF was treated as an allele and was assigned a read depth (0 in this case) and genotype likelihoods in combination with the G and C alleles.
It is very confusing!

Answers

  • ruan Member

    chr1 17697 . G C,<> 72.77 . BaseQRankSum=0.322;ClippingRankSum=-1.517;DP=8;MLEAC=1,0;MLEAF=0.500,0.00;MQ=40.00;MQ0=0;MQRankSum=0.322;ReadPosRankSum=0.956 GT:AD:DP:GQ:PL:SB 0/1:5,3,0:8:99:101,0,178,116,187,303:0,5,0,3

  • ruan Member

    It seems anything inside <> was deleted by the post system

  • Sheila Broad Institute Member, Broadie, Moderator admin

    @ruan

    Hi.

    Anything within <> is interpreted as HTML by your browser, so you need to use the code notation (either backticks or add four spaces at the front of the line). See markdown syntax for more detail. You can also use the formatting button with the C in the commenting box. This will display as <NON_REF>.

    As for NON_REF itself, this is a symbolic allele. Please refer to the gvcf documentation here: http://www.broadinstitute.org/gatk/guide/article?id=4017

    -Sheila
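
To make the first record concrete, here is a minimal Python sketch that pairs each AD value with its allele and each PL value with its genotype. It assumes the ALT field is `C,<NON_REF>` (as explained above) and uses the genotype ordering defined by the VCF specification for diploid PL values; the helper name `genotype_order` is made up for this illustration.

```python
def genotype_order(n_alleles):
    """List diploid genotypes (j, k) with j <= k in VCF PL order:
    0/0, 0/1, 1/1, 0/2, 1/2, 2/2, ..."""
    return [(j, k) for k in range(n_alleles) for j in range(k + 1)]


# Values copied from the record in the question (ALT assumed to be C,<NON_REF>).
alleles = ["G", "C", "<NON_REF>"]
ad = [5, 3, 0]
pl = [101, 0, 178, 116, 187, 303]

# One AD entry per allele, REF first.
depth_by_allele = dict(zip(alleles, ad))

# One PL entry per unordered genotype.
pl_by_genotype = {
    f"{alleles[j]}/{alleles[k]}": value
    for (j, k), value in zip(genotype_order(len(alleles)), pl)
}

print(depth_by_allele)
print(pl_by_genotype)
```

Read this way, the trailing 0 in AD is the depth for `<NON_REF>`, and the last three PL values (116, 187, 303) are the genotypes that involve `<NON_REF>`. The PL of 0 sits on G/C, which agrees with the 0/1 call.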

  • miked Member

    Hello,

    I would like to understand how HC stores genotype information in the gVCF.

    For a single-sample gVCF:

    
    3       128672437       rs146586501     T       C,TCC,TCCCTCCCCCTCC,<NON_REF>   0       .       DB;DP=4;MLEAC=0,0,0,1;MLEAF=0.00,0.00,0.00,0.500;MQ=40.78;MQ0=0        GT:AD:DP:GQ:PL:SB       0/4:1,0,0,0,0:1:3:38,29,368,3,117,88,48,83,38,125,0,58,32,32,26:0,0,0,0
    
    After running through CombineGVCFs and GenotypeGVCFs the genotype for the same sample becomes:
    
    
    3       128672437       .       T       C       120.59    ........ 0/1:1,0:1:9:9,0,339
    
    For the gVCF example above, what does a genotype of 0/4 mean (one allele is T, the other allele is <NON_REF>)?
    
    I've read http://www.broadinstitute.org/gatk/guide/article?id=4017 and I'm still confused when I encounter the above example. Can you clarify:
    
    "<NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way."
    
    Thanks for the help.
    
  • Sheila Broad Institute Member, Broadie, Moderator admin

    @miked

    Hi,

    The 4 represents the "Non-Ref" allele. There are 3 alternate alleles listed and the 4th is the Non-ref allele.

    The Non-ref allele represents all other alleles seen at the position that are not one of the alternate alleles listed or the reference allele.

    Basically, the indel site can have a lot of different alleles, too many to list. The 0/4 shows that the most likely genotype is reference/unlisted allele.

    -Sheila
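
As a small illustration of how the GT indices map onto alleles, the sketch below resolves `0/4` against the REF and ALT columns of the record quoted above. The function name `decode_gt` is hypothetical; index 0 is REF and indices 1 and up follow the ALT list in order.

```python
def decode_gt(ref, alt, gt):
    """Translate a GT string such as '0/4' into the allele strings it refers to."""
    alleles = [ref] + alt.split(",")
    separator = "|" if "|" in gt else "/"
    # A '.' index is a no-call and is passed through unchanged.
    return [alleles[int(i)] if i != "." else "." for i in gt.split(separator)]


called = decode_gt("T", "C,TCC,TCCCTCCCCCTCC,<NON_REF>", "0/4")
print(called)  # ['T', '<NON_REF>']
```

So 0/4 reads as one copy of the reference T and one copy of "some allele other than those listed".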

  • miked Member

    Thanks for the response.

    In the HC gVCF the genotype call was 0/4, and after GenotypeGVCFs the call became 0/1.

    Any reason why that is? I understand the genotypes are collapsed after CombineGVCFs and the joint-calling process kicks in during GenotypeGVCFs. I'm asking why the call was not 0/1 in the original first pass over the BAM during the HC step.

    Thanks.

  • Sheila Broad Institute Member, Broadie, Moderator admin

    @miked

    Hi,

    In the original HC pass, the sample had a highest likelihood genotype of 0/4, but when taking into account all of the samples, the highest likelihood genotype is 0/1.

    This is the effect of taking into account more data. With only one sample, you may not have enough evidence to make the correct call, but as you add data from other samples, you gain more evidence supporting your call.

    In this case, the added data shifted the evidence towards a genotype of 0/1.
    -Sheila
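
The numbers posted in this thread are also consistent with a simple arithmetic reading, shown below as a hedged Python sketch. It is an observation about the quoted values, not a description of GATK internals: if only the alleles T and C are kept at the site (which alleles survive is decided from the whole cohort), the matching entries of the single-sample PL array can be picked out and rescaled so that the smallest is 0. The helper names `pl_index` and `subset_pl` are made up for this example.

```python
def pl_index(j, k):
    """Position of the unordered diploid genotype j/k in a PL array (VCF ordering)."""
    j, k = sorted((j, k))
    return k * (k + 1) // 2 + j


def subset_pl(pl, keep):
    """Keep only genotypes built from the allele indices in `keep`,
    then shift the values so the most likely genotype has PL 0."""
    picked = [
        pl[pl_index(keep[j], keep[k])]
        for k in range(len(keep))
        for j in range(k + 1)
    ]
    best = min(picked)
    return [value - best for value in picked]


# PL values from the single-sample gVCF record (5 alleles, 15 genotypes).
gvcf_pl = [38, 29, 368, 3, 117, 88, 48, 83, 38, 125, 0, 58, 32, 32, 26]

# In the gVCF, the PL of 0 sits on genotype 0/4 (T/<NON_REF>).
print(gvcf_pl[pl_index(0, 4)])  # 0

# Keeping only T (index 0) and C (index 1):
joint_pl = subset_pl(gvcf_pl, keep=[0, 1])
print(joint_pl)  # [9, 0, 339]
```

The result `9,0,339` matches the PL field of the record quoted after GenotypeGVCFs, where 0/1 is the most likely of the remaining genotypes.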

  • aakumar Member

    Dear support team,

    Adding to the thread regarding the "NON_REF" allele:

    In our individual sample gVCFs we found "NON_REF" information present. For example:
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample 1
    chr1 14313 . T <NON_REF> . . END=14906 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
    chr1 14907 . A G,<NON_REF> 140.90 . DP=5;MLEAC=2,0;MLEAF=1.00,0.00;MQ=37.54;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,5,0:5:15:169,15,0,169,15,169:0,0,2,3

    But when doing joint genotype calling using 'GenotypeGVCFs' with approximately 200 samples, in the single gVCF file obtained the "NON_REF" information is absent:

    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample 1...200

    chr1 14354 . C A 356.22 . AC=8;AF=0.077;AN=104;BaseQRankSum=-1.380e+00;ClippingRankSum=-1.930e-01;DP=428;FS=2.567;GQ_MEAN=17.00;GQ_STDDEV=32.22;InbreedingCoeff=-0.1588;MLEAC=13;MLEAF=0.125;MQ=24.34;MQ0=0;MQRankSum=0.887;NCC=129;QD=3.33;ReadPosRankSum=-5.400e-02;SOR=0.373 GT:AD:DP:GQ:PL ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:1,0:1:3:0,3,35 ./.:0,0:0 0/0:2,0:2:0:0,0,73 1/1:0,2:2:6:59,6,0 0/0:19,0:19:31:0,31,630 0/1:4,2:6:34:34,0,89 0/1:4,2:6:37:37,0,92 ./.:0,0:0 0/0:1,0:1:3:0,3,38 0/0:1,0:1:0:0,0,35 0/0:1,0:1:0:0,0,27 0/0:1,0:1:3:0,3,36 0/0:1,0:1:0:0,0,34 0/0:18,0:18:54:0,54,641 ./.:0,0:0 0/0:17,0:17:51:0,51,577 ./.:0,0:0

    The <NON_REF> information is important for us for post-genotyping analysis. Are we missing a specific parameter in the GenotypeGVCFs command, or is there an alternative way to extract that information?

    Thank you for the help

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    The output of GenotypeGVCFs is not a GVCF, so the non-ref allele is not included on purpose.

    In the next version, it will be possible to have it emit reference confidence for non variant sites with the -allSites argument. This is already available in the latest nightly build if you'd like to try it out.

  • carina Copenhagen Member
    edited November 2016

    Hi,
    Thank you for the always great and clear answers :)
    Regarding NON_REF, I have the following question that I haven't been able to find an answer to:
    I have used HaplotypeCaller with -ERC BP_RESOLUTION. I have noticed that some of the output lines look like this:

    chr14   92801203    .   G   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0:0,535:535:0:0,0,0
    

    As you can see the genotype for the reference allele (0/0) has been reported, even though none has been observed.
    Is there a way for me to correct for this?

    The same goes for the next line, where only very few reads supporting the reference allele have been observed:

    chr15   28365618    .   A   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0:8,1029:1037:0:0,0,0
    
  • Sheila Broad Institute Member, Broadie, Moderator admin

    @carina
    Hi,

    Notice the PLs at those sites are all 0. That means the tool is not confident in any of the genotypes. The default in this case is 0/0. However, once you run GenotypeGVCFs, you will get a ./. (no-call) which tells you there was no confidence in the genotype at the site.

    I suspect those sites are in a messy region. You can have a look at the BAM file and bamout file to check.

    -Sheila
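
A quick way to flag such sites is to look at the gap between the two smallest PL values. The sketch below assumes GQ equals that gap, capped at 99, which matches every record quoted in this thread; the function names are made up for this example.

```python
def genotype_quality(pl, cap=99):
    """Gap between the two smallest PL values, capped at `cap` (assumed GQ definition)."""
    ordered = sorted(pl)
    return min(ordered[1] - ordered[0], cap)


def is_uninformative(pl):
    """True when no genotype is preferred over the runner-up."""
    return genotype_quality(pl) == 0


# PL fields copied from records quoted in this thread.
print(genotype_quality([0, 0, 0]))                        # 0  (chr14:92801203)
print(genotype_quality([101, 0, 178, 116, 187, 303]))     # 99 (chr1:17697)
print(genotype_quality([169, 15, 0, 169, 15, 169]))       # 15 (chr1:14907)
print(is_uninformative([0, 0, 0]))                        # True
```

With all PLs at 0, the 0/0 in the GT field carries no support of its own, so filtering on GQ or on the PL gap is a safer check than reading GT alone.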
