GenotypeGVCFs gives fewer samples than input

rohitmande, San Diego, CA (Member)

Hi everyone,

I ran GenotypeGVCFs with the following command:
java -Xmx$64g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R --variant 08_22_2016_murat5.list --dbsnp dbsnp_144.hg38.vcf.gz -o murat5_08_22_2016_raw.vcf -log murat5_08_22_2016_raw.log -L MedExome_hg38_capture_targets.bed -nt 1 --max_alternate_alleles 6

The list I pass to --variant contains the paths to 397 GVCFs. When I run vcftools --vcf murat5_08_22_2016_raw.vcf, I get the following output:

VCFtools - v0.1.12b
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
--vcf murat5_08_22_2016_raw.vcf

After filtering, kept 380 out of 380 Individuals
After filtering, kept 397293 out of a possible 397293 Sites
Run Time = 95.00 seconds

Is there any reason why 17 samples would be dropped?

Thank you very much.

Answers

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @rohitmande
    Hi,

    I suspect that some of your GVCFs share the same sample name, which would account for the missing samples. GenotypeGVCFs merges GVCFs that have the same sample name into a single sample in its output.

    -Sheila

  • rohitmande, San Diego, CA (Member)

    Hi Sheila,

    I looked at the input list of GVCFs and could not find any duplicates. I also ran the command cat 08_22_2016.list | sort | uniq -d, and it did not return any results.

    Issue · Github, by Sheila: Issue Number 1224, State closed, Closed By vdauwera
  • rohitmande, San Diego, CA (Member)

    Hi Sheila,

    At the suggestion of another thread on this forum, I combined our 397 GVCFs into two batches of 200 and 197 and ran GenotypeGVCFs on the two combined GVCFs. VCFtools still gives the output:

    VCFtools - v0.1.12b
    (C) Adam Auton and Anthony Marcketta 2009

    Parameters as interpreted:
    --vcf murat5_08_31_2016.vcf

    After filtering, kept 380 out of 380 Individuals
    After filtering, kept 396692 out of a possible 396692 Sites
    Run Time = 8.00 seconds

    We confirmed that all of the input samples are distinct. Is there any reason why 17 samples are missing?
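
Note that running sort | uniq -d on the list only detects duplicate file paths. Sheila's suggestion concerns the sample names stored in each GVCF header, and two files with different paths can still carry the same sample name. A minimal Python sketch of that check follows, assuming each line of the list is a path to a plain-text or gzip/bgzip-compressed GVCF. The function names read_sample_names and find_shared_sample_names are made up for this illustration and are not part of GATK or VCFtools.

```python
import gzip
from collections import defaultdict


def read_sample_names(path):
    """Return the sample names from the #CHROM header line of a VCF/GVCF."""
    # Assumption: compressed files end in ".gz"; bgzip output is readable by gzip.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # The first nine columns are fixed VCF fields;
                # sample names start at the tenth column.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                # Reached the records without seeing a #CHROM line.
                break
    return []


def find_shared_sample_names(list_path):
    """Map each sample name found in more than one file to those files."""
    files_by_sample = defaultdict(list)
    with open(list_path) as handle:
        for raw in handle:
            gvcf_path = raw.strip()
            if not gvcf_path:
                continue  # skip blank lines in the list
            for name in read_sample_names(gvcf_path):
                files_by_sample[name].append(gvcf_path)
    return {
        name: paths
        for name, paths in files_by_sample.items()
        if len(paths) > 1
    }
```

Calling find_shared_sample_names("08_22_2016_murat5.list") would return a dictionary keyed by every sample name that appears in more than one GVCF, with the paths of the files that carry it. An empty dictionary would mean the header sample names really are distinct, and the cause of the missing samples would have to be sought elsewhere.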
