The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

#### ☞ Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Did we ask for a bug report?

Then follow instructions in Article#1894.

#### ☞ Formatting tip!

Surround blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks (  ) each to make a code block.
Powered by Vanilla. Made with Bootstrap.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.
Register now for the upcoming GATK Best Practices workshop, Feb 20-22 in Leuven, Belgium. Open to all comers! More info and signup at http://bit.ly/2i4mGxz

# Possible bug in CombineVariants

Member Posts: 36 ✭✭
edited February 2014

I believe that I may have found an issue with the CombineVariants tool of GATK that manifests itself when there is a repeated ID in a given VCF. For us, the reason to have repeated IDs in a VCF file is to detect inconsistencies in our sample by calling variants on 2 different DNA samples and then checking the concordance. Our current process is:

1) Generate a VCF containing unique IDs (using GATK CallVariants)
2) Replace the VCF header with potentially non-unique IDs (using tabix -r)
3) Merge a single VCF to uniqify the IDs (using GATK CombineVariants)

It seems that the genotypes in the merged VCF are off by one column. I've attached 3 files that demonstrate the issue: "combined" which is the result of step 1, "combined.renamed", which is the output of step 2, and "combined.renamed.merged", which is the output of step 3.

The relevant lines are as follows:

combined:

HG00421@123910725 HG00422 HG00422@123910706 HG00423@123910701 NA12801 NA12802
0/0:300           0/0:127 0/0:292           0/0:290           0/0:127 0/0:127
0/0:299           0/0:127 0/0:299           0/0:293           0/0:127 0/0:127


combined.renamed:

HG00421 HG00422 HG00422 HG00423 NA12801 NA12802
0/0:300 0/0:127 0/0:292 0/0:290 0/0:127 0/0:127
0/0:299 0/0:127 0/0:299 0/0:293 0/0:127 0/0:127


combined.renamed.merged:

HG00421 HG00422 HG00423 NA12801 NA12802
0/0:300 0/0:127 0/0:292 0/0:290 0/0:127
0/0:299 0/0:127 0/0:299 0/0:293 0/0:127
`

Using the depth argument here, we can see that in the merged dataset, NA12801 has depths 290,293 whereas in the original and renamed datasets the depths were 127,127. The 290,293 depths correspond to HG00423, which is the column before.

I have confirmed this behavior in both GATK 2.7-4 and 2.8-1. If there's any more information that you need, please let me know, and I would be happy to provide it. Also, if you might know where this issue arises, I would be happy to try to provide a patch.

Thanks,

John Wallace

Post edited by Geraldine_VdAuwera on
Tagged:

## Answers

• Broad InstituteMember, Administrator, Broadie, Moderator, Dev Posts: 701 admin

Or just rename them with a suffix: Foo.1 Foo.2 etc.

Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

• Member Posts: 36 ✭✭

I think throwing an error would alert users to the issue, but it would be nice if it was supported. Ideally, there would be some kind of "smart" merging of the 2 genotypes, but simply taking the first one seen would be just as valid.

As for extracting the duplicates to separate files, that is somewhat difficult in our pipeline because there are anywhere from 1-6+ copies of the same person (same person, different DNA run), so we don't know a priori how many files to create. However, we have amended our scripts to optionally uniqify the samples before renaming, which avoids this issue.

I have not seen this issue arise with unique sample IDs, so I think that the problem was between the chair and keyboard. Thanks so much for the help; you can file this as a feature request.

• Administrator, Dev Posts: 11,139 admin

John, I can see how it would be useful to you and it would be ok because you clearly know what you're doing, but in general it would be unsafe and could potentially cause severe problems down the road for naive users. Building in this option in a way that would be both smart and safe for the average user would simply require too much work, at a time when we really can't spare the resources. So I don't see this feature happening in the foreseeable future, sorry. Good luck with your work!

Geraldine Van der Auwera, PhD

Sign In or Register to comment.