Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

GenotypeConcordance

genegene Posts: 17Member
edited November 2012 in Ask the GATK team

Hi,

I have the following problem:

I am evaluating genotype concordance using:

-T VariantEval --evalModule GenotypeConcordance -comp ref.vcf -eval sample1.observed.vcf

If I use a reference genotype file with multiple samples in it where one of the genotype columns is NA1234 (the sample in question), then the sensitivity for all SNP types (HOM_REF,HET,HOM_VAR) decreases drastically. This is because the GATK gets confused when there is more than one sample in the reference file. I know this because if I use a reference genotype file (ref.vcf) with only a single hapmap sample (NA1234) everything works fine and sensitivity is good. So this is not a detection problem is a problem when SNPs are being compared against the reference.

I tried passing the sample name using the --sample parameter for -T VariantEval, but this does not work either (sensitivity is still way off).

In previous versions of the GATK this was done automatically where genotypes where compared based on the sample name within the detection vcf file (sample1.observed.vcf ) vs the ref.vcf file without having to specify the sample name explicitly.

How can I avoid this problem? I want to have a master reference genotype file with multiple samples that I can use for different samples.

I am using GATK version v1.6

Thank you, Gene

Post edited by Geraldine_VdAuwera on

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,418Administrator, GATK Developer admin

    Hi Gene,

    Does the problem persist if you run with the latest version (2.2)?

    Geraldine Van der Auwera, PhD

  • genegene Posts: 17Member
    edited November 2012

    I tried version 2.2 and this seems to fix the problem. Thanks for your input.

    I have a new question: I am trying to compare two different SNP sets against the same hapmap reference. I use VariantEval with the same identical genotype reference (--evalModule GenotypeConcordance -comp $HAPMAP_REF) and enrichment intervals (-L $INTS) for both comparisons.

    When reading the VariantEval Output, the first comparison produces:

    n_true_HET_called_HET                      10250
    n_true_HET_called_HOM_REF                     61
    n_true_HET_called_HOM_VAR                     22
    n_true_HET_called_MIXED                        0
    n_true_HET_called_NO_CALL                   1034
    n_true_HET_called_UNAVAILABLE                  0
    Total HETs:
    11367 = 10250+61+22+1034
    

    The second comparison produces:

    n_true_HET_called_HET                    10264
    n_true_HET_called_HOM_REF                   50
    n_true_HET_called_HOM_VAR                   23
    n_true_HET_called_MIXED                      0
    n_true_HET_called_NO_CALL                 2624
    n_true_HET_called_UNAVAILABLE                0
    Total HETs:
    12961 = 10264+50+23+2624
    

    If the intervals -L and the reference -comp are identical in both comparisons, how is it possible that the number of SNPs in a specific category (HETs in this case) differ? My understanding of this is that for a specific set of intervals -L and a specific genotype reference -comp the number of SNPs being considered should ALWAYS be the same no matter which SNP set -eval is being evaluated. What I am missing here? Could you please clarify the meaning of the _NO_CALL field? (this seems to be the culprit of my problem)

    Thank you so much, Gene

    Post edited by Geraldine_VdAuwera on
  • ebanksebanks Posts: 683GATK Developer mod

    Anything in the eval or comp track will trigger, not just the comp. NO_CALL means that there is no data available in the file. So your results do make sense.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • genegene Posts: 17Member
    edited November 2012

    Could you please elaborate a little bit more? What do you mean when you say that anything can trigger?

    Another data point: If I do the intersection of my intervals -L with the genotype reference -comp I obtain 10911 heterozygous SNPs.

    I always thought that the sum of

    HET_called_HET + HET_called_HOM_REF + HET_called_HOM_VAR + HET_called_NO_CALL 
    

    is the total number of HET SNPs in the genotype reference -comp.

    So how can I understand the relationship between the resulting 11367 HETs in the 1st comparison, the 12961 HETs in the second comparison, and the 10911 HETs in the genotype reference that are in the reference?

    In other words, I thought that a good way of computing sensitivity was by evaluating the ratio:

    SNPs_detected = HET_called_HET + HET_called_HOM_REF + HET_called_HOM_VAR
    
    SNPs_detected / (SNPs_detected + _NO_CALLs)
    

    would you say this definition is correct?

    Thanks again, Gene

    Post edited by Geraldine_VdAuwera on
  • ebanksebanks Posts: 683GATK Developer mod

    Can you please narrow down this discrepancy to one position that you believe to be incorrect? In other words, please use a progressively smaller -L to narrow down a place where you believe the numbers are wrong. Then we can look at the various records together to see whether there's a problem or not. Thanks.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • genegene Posts: 17Member

    Hi Eric,

    I followed your advice and repeated the analyis above: " I am comparing two different SNP sets against the same hapmap reference. I use VariantEval with the same identical genotype reference (--evalModule GenotypeConcordance -comp $HAPMAP_REF) and enrichment intervals (-L $INTS) for both comparisons."

    This time I am evaluating results for a single chromosome only (chr21).

    The break down of SNP categories in the given intervals for chr21 according to the reference that I am using is (hapmap_3.3 reference): HET s 120 HOM_REFs 338 HOM_VARs 78 NO_CALLs 1

    I am going to focus here only on HETs but the same is true for the other SNP categories.

    ---------------------------------------------------------------------------------------------------------------------------

    The first set of SNPs produces the following results after running VariantEval (version 2.2-3): n_true_HET_called_HET 114 n_true_HET_called_HOM_REF 0 n_true_HET_called_HOM_VAR 0 n_true_HET_called_MIXED 0 n_true_HET_called_NO_CALL 56 n_true_HET_called_UNAVAILABLE 0 Total HETs: 114+56 = 170

    The second set of SNPs produces the following results after running VariantEval (version 2.2-3): n_true_HET_called_HET 114 n_true_HET_called_HOM_REF 0 n_true_HET_called_HOM_VAR 0 n_true_HET_called_MIXED 0 n_true_HET_called_NO_CALL 24 n_true_HET_called_UNAVAILABLE 0 Total HETs: 114+24 = 138

    ---------------------------------------------------------------------------------------------------------------------------

    I decided to repeat the same analysis this time using a version of the GATK for which this type of analysis worked fine before (version 1.0.4): The first set produces the following results: total_true_het 120 <=== Total HETs %_het/het 95 n_het/no-call 6 n_het/ref 0 n_het/het 114 n_het/hom 0

    The second set produces the followig results: total_true_het 120 <==== Total HETs %_het/het 99 n_het/no-call 1 n_het/ref 0 n_het/het 119 n_het/hom 0

    As you can see, for the older version of the GATK, the number of total HETs remains constant (most importantly is consistent with the actual SNP counts in hapmap_3.3) and you can see an improvement in sensitivity from the first (95%) to the second set (99%). For the newer version, this comparison does not make sense if the total number of HETs in the genotype reference is seen by VariantEval as variable.

    This suggests to me that for newer versions (> 1.6) of the GATK either there is a bug on the VariantEval tool or the meaning of a NO_CALL (no-call) has changed.

    I would really appreciate if you could comment on this issue and let me know if this is really a bug or I am misinterpreting something.

    Thank you very much, Gene

  • ebanksebanks Posts: 683GATK Developer mod

    Hi Gene,

    You are absolutely correct: the GenotypeConcordance module isn't handling no-calls correctly. In the short term, you can fix this by subsetting your truth ("comp") VCF to just the sample in your eval VCF. I will file a bug report, but it truthfully may take some time before we get a chance to fix this properly. Thanks again for letting us know!

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • genegene Posts: 17Member

    Hi Eric,

    I am glad it was not my imagination and that in a small way I can contribute to the amazing job you guys at the GSA are doing. It would be great if you could shoot me a message whenever this gets resolved.

    Regards, Gene

  • chaolongchaolong Posts: 1

    May I ask if this bug has been fixed in the newer version of GATK?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,418Administrator, GATK Developer admin

    GenotypeConcordance has been rewritten as a separate walker (instead of being a VariantEval module) and should no longer suffer from the bug. This new tool will be released in version 2.4 later this week.

    Geraldine Van der Auwera, PhD

  • silentiosilentio Posts: 5Member
    edited May 2013

    @Geraldine_VdAuwera said: GenotypeConcordance has been rewritten as a separate walker (instead of being a VariantEval module) and should no longer suffer from the bug. This new tool will be released in version 2.4 later this week.

    Hi, I'm using the latest version of GATK, and using the GenotypeConcordance as a seperate walker.

    java -Xmx10g -jar genoConcor.sh -R $ref_dir/hg19.fasta -T GenotypeConcordance --comp Pool_05_5.vcf --eval Pool_N_4.vcf --moltenize -o 05_5vsN_4.out

    However, the output are strange. All zero values. For example:

    GATKReport.v1.1:5 GATKTable:4:19:%s:%s:%s:%.3f:; GATKTable:GenotypeConcordance_CompProportions:Per-sample concordance tables: proportions of genotypes called in comp Sample Eval_Genotype Comp_Genotype Proportion ALL HET HET 0.000 ALL HET HET 0.000 ALL HET HET 0.000 ALL HOM_REF HOM_REF 0.000 ALL HOM_REF HOM_REF 0.000 ALL HOM_REF HOM_REF 0.000 ALL HOM_VAR HOM_VAR 0.000 ALL HOM_VAR HOM_VAR 0.000 ALL HOM_VAR HOM_VAR 0.000 ALL MIXED MIXED 0.000 ALL MIXED MIXED 0.000 ALL MIXED MIXED 0.000 ALL Mismatching_Alleles Mismatching_Alleles 0.000 ALL NO_CALL NO_CALL 0.000 ALL NO_CALL NO_CALL 0.000 ALL NO_CALL NO_CALL 0.000 ALL UNAVAILABLE UNAVAILABLE 0.000 ALL UNAVAILABLE UNAVAILABLE 0.000 ALL UNAVAILABLE UNAVAILABLE 0.000

    There are some outputs for :GATKTable:SiteConcordance_Summary:Site-level summary statistics

    ALLELES_MATCH EVAL_SUPERSET_TRUTH EVAL_SUBSET_TRUTH ALLELES_DO_NOT_MATCH EVAL_ONLY TRUTH_ONLY 36885 16 78 965 19783 1590

    Anyone know how to solve this problem? Thanks!

    Post edited by silentio on
  • CarneiroCarneiro Posts: 274Administrator, GATK Developer admin

    can you check the the 2 vcf files have the same samples and actually have calls in it ?

  • silentiosilentio Posts: 5Member

    @Carneiro said: can you check the the 2 vcf files have the same samples and actually have calls in it ?

    The two vcfs files do have calls in it. Each vcf gets one sample (one column of genotype) in them, and the samples are different.

  • CarneiroCarneiro Posts: 274Administrator, GATK Developer admin

    if the samples are different, there will never be a match. You're getting the expected output. If you want to compute the genotype concordance, all vcfs must contain the same sample(s) so they can be compared. Otherwise you'll be comparing apples with oranges.

Sign In or Register to comment.