Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!

Too few indels for VQSR in Joint analysis of 75 exomes


I am trying to execute the pipeline for joint analysis of 75 exomes. I guess it is a reasonable number to use for VQSR but I have the well known problem with the too few INDELs.
After the GenotypeGVCFs step I obtain: 82183 SNPs and 109 INDELs.
Is it possible that I get these numbers from the joint analysis of 75 exomes? Or there could be an error somewhere?

HaplotypeCaller command for 1 among the 75

java -Xmx64g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -I UD_NA001_baserecal_precalread_mrdupgrp.bam -R ucsc.hg19.fa -nct 10 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --intervals clinical_exome_cod.bed -A Coverage -A FisherStrand -A BaseQualityRankSumTest -A HaplotypeScore -A MappingQualityRankSumTest -A MappingQualityZero -A QualByDepth -A RMSMappingQuality -A ReadPosRankSumTest -A SpanningDeletions -o UD_NA001_P_baserecal_precalread_mrdupgrp_varcall.g.vcf

GenotypeGVCFs command for all the vcfs together:

java -Xmx25g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -V UD_NA001_baserecal_precalread_mrdupgrp_varcall.g.vcf -V UD_NA001_baserecal_precalread_mrdupgrp_varcall.g.vcf -R ucsc.hg19.fa -A Coverage -A FisherStrand -A BaseQualityRankSumTest -A HaplotypeScore -A InbreedingCoeff -A MappingQualityRankSumTest -A MappingQualityZero -A QualByDepth -A RMSMappingQuality -A ReadPosRankSumTest -o UD_JOINT_genot.g.vcf



Best Answer


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Those numbers seem extremely low for an exome. Are you sure this is exome, not a targeted gene panel? Have you checked how much territory is covered in "clinical_exome_cod.bed"?

  • VergiliusVergilius ItalyMember

    Dear Geraldine,
    sorry for the late reply, I had to verify few things in the code. Anyway this still looks weird.
    I measured the number of bases of my target file and it is 54 million bases. And I use it as it is.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    What kind of data is it? I mean how was it sequenced, what's the coverage and so on?
  • VergiliusVergilius ItalyMember

    These samples have been sequenced with Illumina NextSeq 500. Mostly they are famlies (a proband with parents) with 200x coverage for the proband and 100x for parents. Read length is 150bp.
    Let me know if you need more information

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    Ok, that all seems reasonable. Did you do any QC metrics collection? Duplication rate, alignment and so on?
  • VergiliusVergilius ItalyMember

    I obtained the following numbers using flagstats for the BAM files (after alignment with BWA and after duplicate removal with samblaster)

    Average number of reads in all samples: 100469266
    Average removed duplicate reads: 9541987 (9%)
    Average properly paired aligned: 89049755,2 (89%)

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    That seems reasonable. Maybe your base qualities are really low, which would inhibit calling. You should look at that as well.

  • VergiliusVergilius ItalyMember

    Ok. I use to look at the plots in FastQC and they look good in almost all the fastq being the Phred score mostly between 20-40. Is there in GATK another valuable method for base score evaluation?

  • VergiliusVergilius ItalyMember

    Another question I have is the following. The joint analysis of tens of samples together is a trial. Whilst I am running these analyses in TRIOs using GenotypeGVCf and I am getting almost the same number of variants 70-80k.
    What would be instead the expected number of variants to get for analyses of single samples, trios and tens of samples?

    Issue · Github
    by Sheila

    Issue Number
    Last Updated
    Closed By
  • VergiliusVergilius ItalyMember

    For the TRIOs the bam files contain the SM value that indicates the exact sample and the VCF files contain all the sample names. While for the Joint analysis I made there was one sample missing.

    I re-ran the variant calling for that sample and now the genotypeGVCfs output is in accord with the samples number but still it has 80k variants.

    Anyway, I was now reading you post at (http://gatkforums.broadinstitute.org/gatk/discussion/6472/read-groups) and found that the @RG information into the bam files are not precise as you meant there since I did not include the lane numbers.

    Since you told me that 80k variants are too less also for trios of exomes I will first concentrate on this.


Sign In or Register to comment.