GenotypeConcordance

genegene Member
edited November 2012 in Ask the GATK team

Hi,

I have the following problem:

I am evaluating genotype concordance using:

-T VariantEval --evalModule GenotypeConcordance -comp ref.vcf -eval sample1.observed.vcf

If I use a reference genotype file with multiple samples in it where one of the genotype columns is NA1234 (the sample in question), then the sensitivity for all SNP types (HOM_REF,HET,HOM_VAR) decreases drastically. This is because the GATK gets confused when there is more than one sample in the reference file. I know this because if I use a reference genotype file (ref.vcf) with only a single hapmap sample (NA1234) everything works fine and sensitivity is good. So this is not a detection problem is a problem when SNPs are being compared against the reference.

I tried passing the sample name using the --sample parameter for -T VariantEval, but this does not work either (sensitivity is still way off).

In previous versions of the GATK this was done automatically where genotypes where compared based on the sample name within the detection vcf file (sample1.observed.vcf ) vs the ref.vcf file without having to specify the sample name explicitly.

How can I avoid this problem? I want to have a master reference genotype file with multiple samples that I can use for different samples.

I am using GATK version v1.6

Thank you,
Gene

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi Gene,

    Does the problem persist if you run with the latest version (2.2)?

  • genegene Member
    edited November 2012

    I tried version 2.2 and this seems to fix the problem. Thanks for your input.

    I have a new question:
    I am trying to compare two different SNP sets against the same hapmap reference.
    I use VariantEval with the same identical genotype reference (--evalModule GenotypeConcordance -comp $HAPMAP_REF) and enrichment intervals (-L $INTS) for both comparisons.

    When reading the VariantEval Output, the first comparison produces:

    n_true_HET_called_HET                      10250
    n_true_HET_called_HOM_REF                     61
    n_true_HET_called_HOM_VAR                     22
    n_true_HET_called_MIXED                        0
    n_true_HET_called_NO_CALL                   1034
    n_true_HET_called_UNAVAILABLE                  0
    Total HETs:
    11367 = 10250+61+22+1034
    

    The second comparison produces:

    n_true_HET_called_HET                    10264
    n_true_HET_called_HOM_REF                   50
    n_true_HET_called_HOM_VAR                   23
    n_true_HET_called_MIXED                      0
    n_true_HET_called_NO_CALL                 2624
    n_true_HET_called_UNAVAILABLE                0
    Total HETs:
    12961 = 10264+50+23+2624
    

    If the intervals -L and the reference -comp are identical in both comparisons, how is it possible that the number of SNPs in a specific category (HETs in this case) differ? My understanding of this is that for a specific set of intervals -L and a specific genotype reference -comp the number of SNPs being considered should ALWAYS be the same no matter which SNP set -eval is being evaluated.
    What I am missing here? Could you please clarify the meaning of the _NO_CALL field? (this seems to be the culprit of my problem)

    Thank you so much,
    Gene

    Post edited by Geraldine_VdAuwera on
  • ebanksebanks Broad InstituteMember, Broadie, Dev

    Anything in the eval or comp track will trigger, not just the comp.
    NO_CALL means that there is no data available in the file. So your results do make sense.

  • genegene Member
    edited November 2012

    Could you please elaborate a little bit more? What do you mean when you say that anything can trigger?

    Another data point: If I do the intersection of my intervals -L with the genotype reference -comp I obtain 10911 heterozygous SNPs.

    I always thought that the sum of

    HET_called_HET + HET_called_HOM_REF + HET_called_HOM_VAR + HET_called_NO_CALL 
    

    is the total number of HET SNPs in the genotype reference -comp.

    So how can I understand the relationship between the resulting 11367 HETs in the 1st comparison, the 12961 HETs in the second comparison, and the 10911 HETs in the genotype reference that are in the reference?

    In other words, I thought that a good way of computing sensitivity was by evaluating the ratio:

    SNPs_detected = HET_called_HET + HET_called_HOM_REF + HET_called_HOM_VAR
    
    SNPs_detected / (SNPs_detected + _NO_CALLs)
    

    would you say this definition is correct?

    Thanks again,
    Gene

    Post edited by Geraldine_VdAuwera on
  • ebanksebanks Broad InstituteMember, Broadie, Dev

    Can you please narrow down this discrepancy to one position that you believe to be incorrect? In other words, please use a progressively smaller -L to narrow down a place where you believe the numbers are wrong. Then we can look at the various records together to see whether there's a problem or not. Thanks.

  • Hi Eric,

    I followed your advice and repeated the analyis above:
    " I am comparing two different SNP sets against the same hapmap reference. I use VariantEval with the same identical genotype reference (--evalModule GenotypeConcordance -comp $HAPMAP_REF) and enrichment intervals (-L $INTS) for both comparisons."

    This time I am evaluating results for a single chromosome only (chr21).

    The break down of SNP categories in the given intervals for chr21 according to the reference that I am using is (hapmap_3.3 reference):
    HET s 120
    HOM_REFs 338
    HOM_VARs 78
    NO_CALLs 1

    I am going to focus here only on HETs but the same is true for the other SNP categories.

    ---------------------------------------------------------------------------------------------------------------------------

    The first set of SNPs produces the following results after running VariantEval (version 2.2-3):
    n_true_HET_called_HET 114
    n_true_HET_called_HOM_REF 0
    n_true_HET_called_HOM_VAR 0
    n_true_HET_called_MIXED 0
    n_true_HET_called_NO_CALL 56
    n_true_HET_called_UNAVAILABLE 0
    Total HETs:
    114+56 = 170

    The second set of SNPs produces the following results after running VariantEval (version 2.2-3):
    n_true_HET_called_HET 114
    n_true_HET_called_HOM_REF 0
    n_true_HET_called_HOM_VAR 0
    n_true_HET_called_MIXED 0
    n_true_HET_called_NO_CALL 24
    n_true_HET_called_UNAVAILABLE 0
    Total HETs:
    114+24 = 138

    ---------------------------------------------------------------------------------------------------------------------------

    I decided to repeat the same analysis this time using a version of the GATK for which this type of analysis worked fine before (version 1.0.4):
    The first set produces the following results:
    total_true_het 120 <=== Total HETs
    %_het/het 95
    n_het/no-call 6
    n_het/ref 0
    n_het/het 114
    n_het/hom 0

    The second set produces the followig results:
    total_true_het 120 <==== Total HETs
    %_het/het 99
    n_het/no-call 1
    n_het/ref 0
    n_het/het 119
    n_het/hom 0

    As you can see, for the older version of the GATK, the number of total HETs remains constant (most importantly is consistent with the actual SNP counts in hapmap_3.3) and you can see an improvement in sensitivity from the first (95%) to the second set (99%). For the newer version, this comparison does not make sense if the total number of HETs in the genotype reference is seen by VariantEval as variable.

    This suggests to me that for newer versions (> 1.6) of the GATK either there is a bug on the VariantEval tool or the meaning of a NO_CALL (no-call) has changed.

    I would really appreciate if you could comment on this issue and let me know if this is really a bug or I am misinterpreting something.

    Thank you very much,
    Gene

  • ebanksebanks Broad InstituteMember, Broadie, Dev

    Hi Gene,

    You are absolutely correct: the GenotypeConcordance module isn't handling no-calls correctly. In the short term, you can fix this by subsetting your truth ("comp") VCF to just the sample in your eval VCF. I will file a bug report, but it truthfully may take some time before we get a chance to fix this properly.
    Thanks again for letting us know!

  • Hi Eric,

    I am glad it was not my imagination and that in a small way I can contribute to the amazing job you guys at the GSA are doing.
    It would be great if you could shoot me a message whenever this gets resolved.

    Regards,
    Gene

  • May I ask if this bug has been fixed in the newer version of GATK?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    GenotypeConcordance has been rewritten as a separate walker (instead of being a VariantEval module) and should no longer suffer from the bug. This new tool will be released in version 2.4 later this week.

  • silentiosilentio Member
    edited May 2013

    @Geraldine_VdAuwera said:
    GenotypeConcordance has been rewritten as a separate walker (instead of being a VariantEval module) and should no longer suffer from the bug. This new tool will be released in version 2.4 later this week.

    Hi, I'm using the latest version of GATK, and using the GenotypeConcordance as a seperate walker.

    java -Xmx10g -jar genoConcor.sh
    -R $ref_dir/hg19.fasta
    -T GenotypeConcordance
    --comp Pool_05_5.vcf
    --eval Pool_N_4.vcf
    --moltenize
    -o 05_5vsN_4.out

    However, the output are strange. All zero values. For example:

    GATKReport.v1.1:5
    GATKTable:4:19:%s:%s:%s:%.3f:;
    GATKTable:GenotypeConcordance_CompProportions:Per-sample concordance tables: proportions of genotypes called in comp
    Sample Eval_Genotype Comp_Genotype Proportion
    ALL HET HET 0.000
    ALL HET HET 0.000
    ALL HET HET 0.000
    ALL HOM_REF HOM_REF 0.000
    ALL HOM_REF HOM_REF 0.000
    ALL HOM_REF HOM_REF 0.000
    ALL HOM_VAR HOM_VAR 0.000
    ALL HOM_VAR HOM_VAR 0.000
    ALL HOM_VAR HOM_VAR 0.000
    ALL MIXED MIXED 0.000
    ALL MIXED MIXED 0.000
    ALL MIXED MIXED 0.000
    ALL Mismatching_Alleles Mismatching_Alleles 0.000
    ALL NO_CALL NO_CALL 0.000
    ALL NO_CALL NO_CALL 0.000
    ALL NO_CALL NO_CALL 0.000
    ALL UNAVAILABLE UNAVAILABLE 0.000
    ALL UNAVAILABLE UNAVAILABLE 0.000
    ALL UNAVAILABLE UNAVAILABLE 0.000

    There are some outputs for :GATKTable:SiteConcordance_Summary:Site-level summary statistics

    ALLELES_MATCH EVAL_SUPERSET_TRUTH EVAL_SUBSET_TRUTH ALLELES_DO_NOT_MATCH EVAL_ONLY TRUTH_ONLY
    36885 16 78 965 19783 1590

    Anyone know how to solve this problem? Thanks!

  • CarneiroCarneiro Charlestown, MAMember

    can you check the the 2 vcf files have the same samples and actually have calls in it ?

  • @Carneiro said:
    can you check the the 2 vcf files have the same samples and actually have calls in it ?

    The two vcfs files do have calls in it. Each vcf gets one sample (one column of genotype) in them, and the samples are different.

  • CarneiroCarneiro Charlestown, MAMember

    if the samples are different, there will never be a match. You're getting the expected output. If you want to compute the genotype concordance, all vcfs must contain the same sample(s) so they can be compared. Otherwise you'll be comparing apples with oranges.

  • I am wondering how GenotypeConcordance checks for overlapping sites: presumably it looks at chr, pos and alleles, and not just snp IDs?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Strictly speaking, overlapping sites are just sites with the same chr and pos. Snp IDs are not useful, and alleles are looked at in the context of the genotype concordance evaluation in the second phase of the process.

  • robertbrobertb torontoMember

    Hi,

    I sequenced the same sample on two different machines and compred the results to the NIAB truthset.
    I'm getting no values for one of the samples in the concordance tables.

    I don't know what to make of this. I mean even if the samples had been different you'd still get calls at the same site. Can anyone suggest what's wrong. Thanks.

    :GATKReport.v1.1:5

    :GATKTable:20:1:%s:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:;

    :GATKTable:GenotypeConcordance_CompProportions:Per-sample concordance tables: proportions of genotypes called in comp

    Sample NO_CALL_HOM_REF NO_CALL_HET NO_CALL_HOM_VAR HOM_REF_HOM_REF HOM_REF_HET HOM_REF_HOM_VAR HET_HOM_REF HET_HET HET_HOM_VAR HOM_VAR_HOM_REF HOM_VAR_HET HOM_VAR_HOM_VAR UNAVAILABLE_HOM_REF UNAVAILABLE_HET UNAVAILABLE_HOM_VAR MIXED_HOM_REF MIXED_HET MIXED_HOM_VAR Mismatching_Alleles
    ALL 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

    :GATKTable:38:1:%s:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:;

    :GATKTable:GenotypeConcordance_Counts:Per-sample concordance tables: comparison counts

    Sample NO_CALL_NO_CALL NO_CALL_HOM_REF NO_CALL_HET NO_CALL_HOM_VAR NO_CALL_UNAVAILABLE NO_CALL_MIXED HOM_REF_NO_CALL HOM_REF_HOM_REF HOM_REF_HET HOM_REF_HOM_VAR HOM_REF_UNAVAILABLE HOM_REF_MIXED HET_NO_CALL HET_HOM_REF HET_HET HET_HOM_VAR HET_UNAVAILABLE HET_MIXED HOM_VAR_NO_CALL HOM_VAR_HOM_REF HOM_VAR_HET HOM_VAR_HOM_VAR HOM_VAR_UNAVAILABLE HOM_VAR_MIXED UNAVAILABLE_NO_CALL UNAVAILABLE_HOM_REF UNAVAILABLE_HET UNAVAILABLE_HOM_VAR UNAVAILABLE_UNAVAILABLE UNAVAILABLE_MIXED MIXED_NO_CALL MIXED_HOM_REF MIXED_HET MIXED_HOM_VAR MIXED_UNAVAILABLE MIXED_MIXED Mismatching_Alleles
    ALL 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

    :GATKTable:20:1:%s:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:%.3f:;

    :GATKTable:GenotypeConcordance_EvalProportions:Per-sample concordance tables: proportions of genotypes called in eval

    Sample HOM_REF_NO_CALL HOM_REF_HOM_REF HOM_REF_HET HOM_REF_HOM_VAR HOM_REF_UNAVAILABLE HOM_REF_MIXED HET_NO_CALL HET_HOM_REF HET_HET HET_HOM_VAR HET_UNAVAILABLE HET_MIXED HOM_VAR_NO_CALL HOM_VAR_HOM_REF HOM_VAR_HET HOM_VAR_HOM_VAR HOM_VAR_UNAVAILABLE HOM_VAR_MIXED Mismatching_Alleles
    ALL 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

    :GATKTable:4:1:%s:%.3f:%.3f:%.3f:;

    :GATKTable:GenotypeConcordance_Summary:Per-sample summary statistics: NRS, NRD, and OGC

    Sample Non-Reference Sensitivity Non-Reference Discrepancy Overall_Genotype_Concordance
    ALL 0.000 1.000 1.000

    :GATKTable:6:1:%d:%d:%d:%d:%d:%d:;

    :GATKTable:SiteConcordance_Summary:Site-level summary statistics

    ALLELES_MATCH EVAL_SUPERSET_TRUTH EVAL_SUBSET_TRUTH ALLELES_DO_NOT_MATCH EVAL_ONLY TRUTH_ONLY
    3052 2 11 13 450 704

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @robertb
    Hi,

    Do you get the same result when you compare the two sample runs against each other? I am wondering if the name of the sample is affecting anything here.

    Thanks,
    Sheila

    P.S. You may consider using the Picard version of GenotypeConcordance, as that may be better than the GATK version.

  • robertbrobertb torontoMember

    it was the samples names, thanks.

Sign In or Register to comment.