
Questions about GenotypeGVCFs output

NH2A Member
edited January 17 in Ask the GATK team

Hi,

I generated 12 .g.vcf files with HaplotypeCaller in GVCF mode and then a vcf with GenotypeGVCFs. What's the easiest way to split this vcf per sample ? Should I apply hard filtering first and then split the vcf per sample ? (These VCFs come from normal exomes, and I would like to use them afterwards with Mutect2.)

Moreover, lines of the vcf file are annotated differently : BaseQRankSum, ClippingRankSum, ExcessHet, MQRankSum, ReadPosRankSum appear at some lines but not all of them (I get the same when I create a vcf with HaplotypeCaller without GVCF mode). Do you know why ? Is it possible to change this ? Will it be a problem at the hard filtering step when I handle variants with SelectVariants and VariantFiltration ?

Thanks a lot!


Answers

  • NH2A Member

    Hi, sorry I have another question about the output :
    Why is the "INFO" field identical for all the samples ? Shouldn't some values, like the "StrandOddsRatio", be specific to each sample ?

  • Sheila Broad Institute Member, Broadie, Moderator

    @NH2A
    Hi,

    What's the easiest way to split this vcf per sample ?

    You can use SelectVariants.
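
    As a rough illustration of the per-sample split, the Python sketch below only builds one SelectVariants command line per sample; it does not run GATK. The file names, sample names and the `select_variants_cmd` helper are hypothetical, and the flags assume GATK4-style argument names (`-sn`, `--exclude-non-variants`, `--remove-unused-alternates`), so check them against the tool docs for your version.

```python
import shlex


def select_variants_cmd(vcf, reference, sample, out_dir="."):
    """Build (but do not run) a SelectVariants command for a single sample.

    Assumes GATK4-style argument names; all paths are placeholders.
    """
    return [
        "gatk", "SelectVariants",
        "-R", reference,
        "-V", vcf,
        "-sn", sample,                 # keep only this sample's genotype column
        "--exclude-non-variants",      # drop sites that are not variant in the selected sample
        "--remove-unused-alternates",  # drop ALT alleles the selected sample does not carry
        "-O", f"{out_dir}/{sample}.vcf.gz",
    ]


if __name__ == "__main__":
    # Hypothetical sample names; in practice they come from the VCF header.
    for name in ["sample01", "sample02"]:
        cmd = select_variants_cmd("cohort.filtered.vcf.gz", "reference.fasta", name)
        print(shlex.join(cmd))
```

    Each printed line can be pasted into a shell or passed to `subprocess.run` once the paths point at real files.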

    Should I apply hard filtering first and then split the vcf per sample ?

    You can hard filter first, then split the VCF. Have a look at this article. But why do you want to split the VCF into per-sample VCFs for use in Mutect2?

    BaseQRankSum, ClippingRankSum, ExcessHet, MQRankSum, ReadPosRankSum appear at some lines but not all of them

    Some annotations require a mix of ref and alt reads to be calculated. Have a look at the annotations section in the tool docs for more information on specific annotations.

    Why is the "INFO" field identical for all the samples ? Shouldn't some values, like the "StrandOddsRatio", be specific to each sample ?

    The INFO annotations are site-level annotations, whereas the FORMAT annotations are sample-level annotations. The INFO annotations take all samples into account and tend to reflect an overall average for the site. We recommend filtering on INFO annotations so that you keep the sites that are most likely to have a variant. If you want to filter at the sample level, have a look at the Genotype Refinement Workflow.
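
    To make the INFO versus FORMAT distinction concrete, here is a small Python sketch that parses one VCF data line. The record, the sample names and the `parse_record` helper are made up for illustration; the point is only that the INFO column occurs once per site, while each sample column carries its own values for the FORMAT keys.

```python
def parse_record(line, sample_names):
    """Split one VCF data line into site-level INFO and per-sample FORMAT values."""
    fields = line.rstrip("\n").split("\t")

    # Column 8 (INFO): one set of key=value pairs shared by the whole site.
    info = {}
    for item in fields[7].split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True  # flag-style entries have no "="

    # Column 9 (FORMAT) names the keys; columns 10+ hold one value set per sample.
    format_keys = fields[8].split(":")
    samples = {
        name: dict(zip(format_keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }
    return info, samples


if __name__ == "__main__":
    # Made-up two-sample record, not taken from any real call set.
    record = (
        "chr1\t1000\t.\tA\tG\t500.0\t.\tAC=1;AN=4;QD=12.5;SOR=0.69"
        "\tGT:AD:DP:GQ\t0/1:10,12:22:99\t0/0:25,0:25:60"
    )
    info, samples = parse_record(record, ["sampleA", "sampleB"])
    print("site-level SOR:", info["SOR"])
    for name, values in samples.items():
        print(name, values["GT"], values["AD"])
```

    In this toy record there is a single SOR value for the site, but the two samples have different GT and AD values.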

    -Sheila

  • NH2A Member

    Hi Sheila, thanks for your answers,

    About the INFO field :
    While using SelectVariants, do I have to specify that I want to filter on the INFO field and not the FORMAT field ?
    If I filter on the INFO field, can I use the same thresholds as when we filter on the FORMAT field ? (for SNPs "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0", for INDELs "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0").

  • Sheila Broad Institute Member, Broadie, Moderator

    @NH2A
    Hi,

    I think this article on JEXL will help.
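
    For orientation, the annotations named in the thresholds quoted above (QD, FS, MQ, MQRankSum, ReadPosRankSum) are site-level annotations found in the INFO field. The Python sketch below mimics the SNP threshold logic on a parsed INFO dictionary. It is not GATK's JEXL evaluation: the `failed_filters` helper and the example values are hypothetical, and skipping an annotation that is absent from a record is an assumption of this sketch, so see the VariantFiltration docs for how the tool itself treats missing values.

```python
import operator

# SNP thresholds as quoted in the question: a record fails if any clause is true.
SNP_FILTERS = [
    ("QD", operator.lt, 2.0),
    ("FS", operator.gt, 60.0),
    ("MQ", operator.lt, 40.0),
    ("MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum", operator.lt, -8.0),
]


def failed_filters(info, filters):
    """Return the annotation names whose threshold clause is true for this site.

    `info` maps INFO keys to string values. Annotations missing from the
    record are skipped (an assumption of this sketch, not GATK behaviour).
    """
    failed = []
    for key, compare, threshold in filters:
        if key not in info:
            continue  # e.g. rank-sum annotations are not present at every site
        if compare(float(info[key]), threshold):
            failed.append(key)
    return failed


if __name__ == "__main__":
    # Made-up INFO values; the rank-sum annotations are absent on purpose.
    site = {"QD": "1.5", "FS": "10.0", "MQ": "60.0"}
    print(failed_filters(site, SNP_FILTERS))
```

    Evaluating each clause separately also shows which annotation caused a site to fail, which is harder to see when everything is combined in one `||` expression.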

    -Sheila
