CombineGVCFs with duplicate sample ID?

rbatorsky (Cambridge, MA) Member

I am running the joint calling workflow on a large batch of samples, and a handful of them were sequenced twice, using two different capture kits. For these, the sample ID is the same in both GVCFs. I am looking for a CombineGVCFs option, like -genotypeMergeOption UNIQUIFY, that will make the sample names unique. I see that if two GVCFs with the same ID are given to CombineGVCFs, the ID appears only once in the header of the resulting combined GVCF, and if the ID is present in two different combined GVCFs that are given to GenotypeGVCFs, the ID appears only once in the output. What is the recommended practice here? I would like to avoid rerunning my pipeline to make the names unique in the single-sample GVCFs.

Answers

  • shlee (Cambridge) Member, Broadie ✭✭✭✭✭

    Hi @rbatorsky,

    At the BAM level, before calling, you could have run Picard AddOrReplaceReadGroups to replace your sample names. At the GVCF level, I believe you will have to manually edit the column header line (the one starting with #CHROM) that names your sample. Some downstream tools may complain about inconsistencies between this sample column header and the rest of the VCF header, so for the sake of consistency you might want to change every place the sample name appears in the GVCF header.
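The manual header edit described above could be scripted roughly as follows. This is only a minimal sketch, not a GATK-provided tool: the function name `rename_sample`, the flag `rename_in_meta`, and the sample names `SAMPLE1` / `SAMPLE1_kitB` are hypothetical, and it assumes an uncompressed, tab-delimited GVCF read line by line.

```python
import io


def rename_sample(lines, old_name, new_name, rename_in_meta=False):
    """Yield VCF/GVCF lines with one sample renamed in the header.

    Hypothetical helper for illustration only.
    """
    for line in lines:
        if line.startswith("##"):
            # Meta-information lines. A blind text replace can also hit
            # file paths or command lines, so it is off by default.
            if rename_in_meta:
                line = line.replace(old_name, new_name)
            yield line
        elif line.startswith("#CHROM"):
            # Column header line: the first 9 columns are fixed
            # (CHROM through FORMAT); sample names start at column 10.
            newline = "\n" if line.endswith("\n") else ""
            fields = line.rstrip("\r\n").split("\t")
            fields[9:] = [new_name if f == old_name else f for f in fields[9:]]
            yield "\t".join(fields) + newline
        else:
            # Variant and reference-block records are passed through unchanged.
            yield line


# Tiny made-up GVCF used only to demonstrate the call.
example = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\n"
    "20\t10000\t.\tT\t<NON_REF>\t.\t.\tEND=10010\tGT:DP:GQ\t0/0:30:90\n"
)
renamed = "".join(rename_sample(io.StringIO(example), "SAMPLE1", "SAMPLE1_kitB"))
print(renamed)
```

For a real file you would read from and write to file handles instead of the in-memory string. If the GVCF is compressed and indexed, it would need to be recompressed and re-indexed after the edit, since an index built for the original file should not be assumed to match the rewritten one.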

  • jfarrell Member ✭✭

    I also ran into this issue, with some unexpected duplicates. So what does the CombineGVCFs tool generate for duplicate samples? Does it select the first or the last sample GVCF? Or does it somehow combine both GVCFs to produce one column?

  • Sheila (Broad Institute) Member, Broadie ✭✭✭✭✭

    @jfarrell
    Hi,

    I believe it combines the information from all samples into one.

    -Sheila