The current GATK version is 3.6-0

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# GenotypeConcordance

Member Posts: 17
edited November 2012

Hi,

I have the following problem:

I am evaluating genotype concordance using:

-T VariantEval --evalModule GenotypeConcordance -comp ref.vcf -eval sample1.observed.vcf


If I use a reference genotype file with multiple samples in it where one of the genotype columns is NA1234 (the sample in question), then the sensitivity for all SNP types (HOM_REF,HET,HOM_VAR) decreases drastically. This is because the GATK gets confused when there is more than one sample in the reference file. I know this because if I use a reference genotype file (ref.vcf) with only a single hapmap sample (NA1234) everything works fine and sensitivity is good. So this is not a detection problem is a problem when SNPs are being compared against the reference.

I tried passing the sample name using the --sample parameter for -T VariantEval, but this does not work either (sensitivity is still way off).

In previous versions of the GATK this was done automatically where genotypes where compared based on the sample name within the detection vcf file (sample1.observed.vcf ) vs the ref.vcf file without having to specify the sample name explicitly.

How can I avoid this problem? I want to have a master reference genotype file with multiple samples that I can use for different samples.

I am using GATK version v1.6

Thank you,
Gene

Post edited by Geraldine_VdAuwera on
Tagged:

Hi Gene,

Does the problem persist if you run with the latest version (2.2)?

Geraldine Van der Auwera, PhD

• Member Posts: 17
edited November 2012

I tried version 2.2 and this seems to fix the problem. Thanks for your input.

I have a new question:
I am trying to compare two different SNP sets against the same hapmap reference.
I use VariantEval with the same identical genotype reference (--evalModule GenotypeConcordance -comp $HAPMAP_REF) and enrichment intervals (-L$INTS) for both comparisons.

When reading the VariantEval Output, the first comparison produces:

n_true_HET_called_HET                      10250
n_true_HET_called_HOM_REF                     61
n_true_HET_called_HOM_VAR                     22
n_true_HET_called_MIXED                        0
n_true_HET_called_NO_CALL                   1034
n_true_HET_called_UNAVAILABLE                  0
Total HETs:
11367 = 10250+61+22+1034


The second comparison produces:

n_true_HET_called_HET                    10264
n_true_HET_called_HOM_REF                   50
n_true_HET_called_HOM_VAR                   23
n_true_HET_called_MIXED                      0
n_true_HET_called_NO_CALL                 2624
n_true_HET_called_UNAVAILABLE                0
Total HETs:
12961 = 10264+50+23+2624


If the intervals -L and the reference -comp are identical in both comparisons, how is it possible that the number of SNPs in a specific category (HETs in this case) differ? My understanding of this is that for a specific set of intervals -L and a specific genotype reference -comp the number of SNPs being considered should ALWAYS be the same no matter which SNP set -eval is being evaluated.
What I am missing here? Could you please clarify the meaning of the _NO_CALL field? (this seems to be the culprit of my problem)

Thank you so much,
Gene

Post edited by Geraldine_VdAuwera on

Anything in the eval or comp track will trigger, not just the comp.
NO_CALL means that there is no data available in the file. So your results do make sense.

Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

• Member Posts: 17
edited November 2012

Could you please elaborate a little bit more? What do you mean when you say that anything can trigger?

Another data point: If I do the intersection of my intervals -L with the genotype reference -comp I obtain 10911 heterozygous SNPs.

I always thought that the sum of

HET_called_HET + HET_called_HOM_REF + HET_called_HOM_VAR + HET_called_NO_CALL


is the total number of HET SNPs in the genotype reference -comp.

So how can I understand the relationship between the resulting 11367 HETs in the 1st comparison, the 12961 HETs in the second comparison, and the 10911 HETs in the genotype reference that are in the reference?

In other words, I thought that a good way of computing sensitivity was by evaluating the ratio:

SNPs_detected = HET_called_HET + HET_called_HOM_REF + HET_called_HOM_VAR

SNPs_detected / (SNPs_detected + _NO_CALLs)


would you say this definition is correct?

Thanks again,
Gene

Post edited by Geraldine_VdAuwera on

Can you please narrow down this discrepancy to one position that you believe to be incorrect? In other words, please use a progressively smaller -L to narrow down a place where you believe the numbers are wrong. Then we can look at the various records together to see whether there's a problem or not. Thanks.

Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

• Member Posts: 17

Hi Eric,

" I am comparing two different SNP sets against the same hapmap reference. I use VariantEval with the same identical genotype reference (--evalModule GenotypeConcordance -comp $HAPMAP_REF) and enrichment intervals (-L$INTS) for both comparisons."

This time I am evaluating results for a single chromosome only (chr21).

The break down of SNP categories in the given intervals for chr21 according to the reference that I am using is (hapmap_3.3 reference):
HET s 120
HOM_REFs 338
HOM_VARs 78
NO_CALLs 1

I am going to focus here only on HETs but the same is true for the other SNP categories.

# ---------------------------------------------------------------------------------------------------------------------------

The first set of SNPs produces the following results after running VariantEval (version 2.2-3):
n_true_HET_called_HET 114
n_true_HET_called_HOM_REF 0
n_true_HET_called_HOM_VAR 0
n_true_HET_called_MIXED 0
n_true_HET_called_NO_CALL 56
n_true_HET_called_UNAVAILABLE 0
Total HETs:
114+56 = 170

The second set of SNPs produces the following results after running VariantEval (version 2.2-3):
n_true_HET_called_HET 114
n_true_HET_called_HOM_REF 0
n_true_HET_called_HOM_VAR 0
n_true_HET_called_MIXED 0
n_true_HET_called_NO_CALL 24
n_true_HET_called_UNAVAILABLE 0
Total HETs:
114+24 = 138

# ---------------------------------------------------------------------------------------------------------------------------

I decided to repeat the same analysis this time using a version of the GATK for which this type of analysis worked fine before (version 1.0.4):
The first set produces the following results:
total_true_het 120 <=== Total HETs
%_het/het 95
n_het/no-call 6
n_het/ref 0
n_het/het 114
n_het/hom 0

The second set produces the followig results:
total_true_het 120 <==== Total HETs
%_het/het 99
n_het/no-call 1
n_het/ref 0
n_het/het 119
n_het/hom 0

As you can see, for the older version of the GATK, the number of total HETs remains constant (most importantly is consistent with the actual SNP counts in hapmap_3.3) and you can see an improvement in sensitivity from the first (95%) to the second set (99%). For the newer version, this comparison does not make sense if the total number of HETs in the genotype reference is seen by VariantEval as variable.

This suggests to me that for newer versions (> 1.6) of the GATK either there is a bug on the VariantEval tool or the meaning of a NO_CALL (no-call) has changed.

I would really appreciate if you could comment on this issue and let me know if this is really a bug or I am misinterpreting something.

Thank you very much,
Gene

Hi Gene,

You are absolutely correct: the GenotypeConcordance module isn't handling no-calls correctly. In the short term, you can fix this by subsetting your truth ("comp") VCF to just the sample in your eval VCF. I will file a bug report, but it truthfully may take some time before we get a chance to fix this properly.
Thanks again for letting us know!

Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

• Member Posts: 17

Hi Eric,

I am glad it was not my imagination and that in a small way I can contribute to the amazing job you guys at the GSA are doing.
It would be great if you could shoot me a message whenever this gets resolved.

Regards,
Gene

• Posts: 1

May I ask if this bug has been fixed in the newer version of GATK?

GenotypeConcordance has been rewritten as a separate walker (instead of being a VariantEval module) and should no longer suffer from the bug. This new tool will be released in version 2.4 later this week.

Geraldine Van der Auwera, PhD

• Member Posts: 6
edited May 2013

@Geraldine_VdAuwera said:
GenotypeConcordance has been rewritten as a separate walker (instead of being a VariantEval module) and should no longer suffer from the bug. This new tool will be released in version 2.4 later this week.

Hi, I'm using the latest version of GATK, and using the GenotypeConcordance as a seperate walker.

java -Xmx10g -jar genoConcor.sh
-R \$ref_dir/hg19.fasta
-T GenotypeConcordance
--comp Pool_05_5.vcf
--eval Pool_N_4.vcf
--moltenize
-o 05_5vsN_4.out

However, the output are strange. All zero values. For example:

GATKReport.v1.1:5
GATKTable:4:19:%s:%s:%s:%.3f:;
GATKTable:GenotypeConcordance_CompProportions:Per-sample concordance tables: proportions of genotypes called in comp
Sample Eval_Genotype Comp_Genotype Proportion
ALL HET HET 0.000
ALL HET HET 0.000
ALL HET HET 0.000
ALL HOM_REF HOM_REF 0.000
ALL HOM_REF HOM_REF 0.000
ALL HOM_REF HOM_REF 0.000
ALL HOM_VAR HOM_VAR 0.000
ALL HOM_VAR HOM_VAR 0.000
ALL HOM_VAR HOM_VAR 0.000
ALL MIXED MIXED 0.000
ALL MIXED MIXED 0.000
ALL MIXED MIXED 0.000
ALL Mismatching_Alleles Mismatching_Alleles 0.000
ALL NO_CALL NO_CALL 0.000
ALL NO_CALL NO_CALL 0.000
ALL NO_CALL NO_CALL 0.000
ALL UNAVAILABLE UNAVAILABLE 0.000
ALL UNAVAILABLE UNAVAILABLE 0.000
ALL UNAVAILABLE UNAVAILABLE 0.000

There are some outputs for :GATKTable:SiteConcordance_Summary:Site-level summary statistics

ALLELES_MATCH EVAL_SUPERSET_TRUTH EVAL_SUBSET_TRUTH ALLELES_DO_NOT_MATCH EVAL_ONLY TRUTH_ONLY
36885 16 78 965 19783 1590

Anyone know how to solve this problem? Thanks!

Post edited by silentio on

can you check the the 2 vcf files have the same samples and actually have calls in it ?

• Member Posts: 6

@Carneiro said:
can you check the the 2 vcf files have the same samples and actually have calls in it ?

The two vcfs files do have calls in it. Each vcf gets one sample (one column of genotype) in them, and the samples are different.

if the samples are different, there will never be a match. You're getting the expected output. If you want to compute the genotype concordance, all vcfs must contain the same sample(s) so they can be compared. Otherwise you'll be comparing apples with oranges.

• Member Posts: 74

I am wondering how GenotypeConcordance checks for overlapping sites: presumably it looks at chr, pos and alleles, and not just snp IDs?