To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

CombineGVCFs - All alleles are missing

astrandastrand New YorkMember
edited January 2015 in Ask the GATK team

Hi,

I am combining gcvf files into single gvcf files by chromosome, using CombineGVCFs, in order to run GenotypeGVCFs. When I checked the first gvcf file generated by CombineGVCFs, I noticed that at each position, all the alleles were missing.

For example, at position 16050036, this is what comes up in the final gvcf file:
22 16050036 . A C,<NON_REF> . . BaseQRankSum=-7.360e-01;ClippingRankSum=-7.360e-01;DP=4;MQ=27.00;MQ0=0;MQRankSum=-7.360e-01;ReadPosRankSum=0.736 GT:AD:DP:MIN_DP:PL:SB ./.:1,2,0:3:.:55,0,23,58,29,86:1,0,2,0 ./.:.:1:1:0,0,0,0,0,0 ./.:.:0:0:0,0,0,0,0,0

But if we just take one of the precursor gvcf files (one individual), we clearly see the genotype at that site:
22 16050036 . A C,<NON_REF> 26.80 . BaseQRankSum=-0.736;ClippingRankSum=-0.736;DP=3;MLEAC=1,0;MLEAF=0.500,0.00;MQ=27.00;MQ0=0;MQRankSum=-0.736;ReadPosRankSum=0.736 GT:AD:DP:GQ:PL:SB 0/1:1,2,0:3:23:55,0,23,58,29,86:1,0,2,0

The command I'm using to generate these files is:

java -Xmx1g -jar GenomeAnalysisTK.jar -T CombineGVCFs -R hs37d5.fa -V vcfs.chr${numchr}.new.list -o mergeGvcf_${numchr}.vcf -L ${numchr}
where numchr is a variable previously defined (indicating the chromosome number).

It seems that all the information is being taken into account except the actual genotypes. How do I solve this problem?

Thanks,
Alva

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    You mean that in the combined GVCF, the genotypes are missing, right? The alleles seem to be there.

    Which version are you using?

  • astrandastrand New YorkMember

    Yes, that's what I meant. I am using the latest version (3.3).

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    We've seen a few cases recently where GenotypeGVCFs assigns no-calls (./.) instead of a proper genotype when coverage is extremely low. I think that's what's happening here. It may take some time before we can develop a fix for this as it's fairly low priority (those sites aren't really usable anyway). If all your sites are like this you should really check your data QC. But if it's just the first sites in your files it's not necessarily unexpected -- the coverage tends to be very low at the beginning of contigs (in WGS) or capture targets (in WEx).

  • astrandastrand New YorkMember

    Hi Geraldine,

    What do you mean by "data QC"? The example I used in my post is a test run that I did for 3 individuals on 1 chromosome. I was able to generate a perfectly normal final vcf file using GenotypeGVCFs. However, because I have a lot more files for which I have to run this analysis (and for which, ideally, I should use CombineGVCFs), I decided to test CombineGVCFs on those 3 individuals. That's when I saw the missing genotypes. They are missing from every single site that gets called. Like I said, using GenotypeGVCFs produces perfectly acceptable output, but not CombineGVCFs, so not sure what is going on there.

    Thanks anyways for the information,
    Alva

  • astrandastrand New YorkMember

    OK, thanks Sheila. Where does the information regarding the genotypes get stored for GenotypeGVCFs to use? I'll run the Combined files through GenotypeGVCFs and let you know if I experience any problems.

    Alva

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @astrand
    Hi Alva,

    The information is in the other fields present in the CombineGVCFs file. GenotypeGVCFs uses the AD and PL fields for all the samples to determine a proper final genotype.

    -Sheila

Sign In or Register to comment.