How many "-an" arguments should be used in VariantRecalibrator?

boymin2020, New York (Member)
edited September 2016 in Ask the GATK team

I ran into a problem when using the VariantRecalibrator tool in GATK.
The following error often came up when I used several of the common "-an" arguments:
"##### ERROR MESSAGE: Bad input: Found annotations with zero variance. They must be excluded before proceeding."
When I reduced the number of "-an" arguments, a different error came up:

ERROR MESSAGE: No data found.

I am confused about the "-an" argument. Which of the following annotations should be used?

-an QD
-an MQ
-an DP
-an FS
-an InbreedingCoeff
-an MQRankSum
-an ReadPosRankSum

Answers

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @boymin2020
    Hi,

    We have an article here that tells you which annotations we recommend using. However, the error you are getting is unexpected. Can you tell us how you generated the input VCF and what is in it? Also, which version of GATK are you using?

    Thanks,
    Sheila

  • boymin2020, New York (Member)

    Hi Sheila,
    I modified my script according to the article you recommended. Since my data are WES data, I used the recommended "-an" arguments shown below.
    -an QD
    -an FS
    -an SOR
    -an ReadPosRankSum
    -an MQRankSum
    -an InbreedingCoeff
    However, I still got the following error:
    "Bad input: Found annotations with zero variance. They must be excluded before proceeding."
    @Geraldine_VdAuwera said the analysis should be performed on whole genome, so I am concatenating the 22 chromosome gvcf files.

    1. We generated the VCF files with HaplotypeCaller.
    2. The VCF files contain human genomic variants; the number of samples is very large.
    3. The GATK version is 3.6.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)
    @boymin2020, you can't run VQSR on the GVCFs; you need to do joint genotyping on them first with GenotypeGVCFs.
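
One way to tell the two file types apart before running VQSR is to look for the symbolic <NON_REF> allele that HaplotypeCaller writes into the ALT column of GVCF records. The function below is a minimal sketch and not part of GATK: the name looks_like_gvcf is hypothetical, and it assumes an uncompressed, tab-delimited VCF.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a GATK tool): guess whether a file is a GVCF.
# Assumption: GVCF records list the symbolic <NON_REF> allele in ALT (column 5);
# a joint-genotyped VCF from GenotypeGVCFs should not.
looks_like_gvcf() {
    local vcf="$1"
    # Skip header lines (they may still declare NON_REF), then stop at the
    # first data record whose ALT field contains <NON_REF>.
    if awk -F'\t' '!/^#/ && $5 ~ /<NON_REF>/ { found = 1; exit }
                   END { exit !found }' "$vcf"; then
        echo "gvcf"
    else
        echo "vcf"
    fi
}
```

For example, `looks_like_gvcf input.vcf` (input.vcf being a placeholder name) prints "gvcf" if any record still carries <NON_REF>, which would suggest the file has not been through GenotypeGVCFs.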
  • boymin2020, New York (Member)
    edited October 2016

    Thank you @Geraldine_VdAuwera. I did do joint genotyping with GenotypeGVCFs. Assuming GenotypeGVCFs converts the format from GVCF to VCF by default, I should have the right VCF files now. The command is shown below:

    java -Xmx24G \
    -jar $jar \
    -V:VCF $list \
    -T GenotypeGVCFs \
    -D $DBSNP \
    -R $ref \
    -L $bed \
    $infofields \
    --standard_min_confidence_threshold_for_calling 30 \
    --standard_min_confidence_threshold_for_emitting 30 \
    -o ${outFil}

    # infofields="-A AlleleBalance -A BaseQualityRankSumTest -A Coverage -A HomopolymerRun -A MappingQualityRankSumTest -A MappingQualityZero -A QualByDepth -A RMSMappingQuality -A SpanningDeletions -A FisherStrand -A InbreedingCoeff"
    # There are many subjects, so I put their files in a list, which is passed as -V:VCF $list

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    OK, it sounds like you have the right files then. I would recommend being more precise when you describe what you are working with. This is what you told us previously:

    @Geraldine_VdAuwera said the analysis should be performed on whole genome, so I am concatenating the 22 chromosome gvcf files.

    1. We generated the VCF files with HaplotypeCaller.
    2. The VCF files contain human genomic variants; the number of samples is very large.
    3. The GATK version is 3.6.

    Which is not very accurate since it leaves out the very important GenotypeGVCFs step. We can help you better and faster if we have the correct information in hand from the start.

    The right way to decide which annotations to use or not, when you have this problem, is to look at the log output of VariantRecalibrator, which provides statistical information useful for this purpose. If you post yours I can show you how to interpret it.

  • boymin2020, New York (Member)

    Thanks for your detailed reply.
    I am working on WES data from 1000+ subjects. My workflow is as follows:
    Step 1: Split each chromosome into pieces of equal size (2,000,000 bp). For example,
    chromosome 1 was cut into 13 pieces (2,000,000 bp per piece; the last one is shorter), so the whole genome was cut into 50 pieces.
    Step 2: Run joint genotyping on each piece, which also converts the file format from GVCF to VCF.
    java -Xmx24G \
    -jar $jar \
    -V:VCF $list \
    -T GenotypeGVCFs \
    -D $DBSNP \
    -R $ref \
    -L $bed \
    $infofields \
    --standard_min_confidence_threshold_for_calling 30 \
    --standard_min_confidence_threshold_for_emitting 30 \
    -o ${outFil}
    Step 3: Concatenate the pieces per chromosome with CatVariants, which produces 22 merged VCF files.
    java -cp $jar org.broadinstitute.gatk.tools.CatVariants \
    -R $ref \
    -assumeSorted \
    -V:VCF $in \
    -log $log \
    -out $out
    Step 4: Recalibrate each of the 22 VCF files with VariantRecalibrator.
    java -Xmx24G \
    -jar $jar \
    -T VariantRecalibrator \
    -R $ref \
    -nt 4 \
    -L $target \
    -mode SNP \
    -an QD -an FS -an SOR -an ReadPosRankSum -an MQRankSum -an InbreedingCoeff \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 $hapmap \
    -resource:omni,known=false,training=true,truth=false,prior=14.0 $omni \
    -resource:1000G,known=false,training=true,truth=false,prior=10.1 $G1000 \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 $dbsnp \
    -input $in \
    -recalFile $out/${chr}.SNPs.recal \
    -tranchesFile $out/${chr}.SNPs.tranches \
    -rscriptFile $out/${chr}.SNPs.plots.R

    This produced an error for chromosome 22 only; please see the attached file.
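
Step 1 above amounts to tiling each chromosome with fixed-size windows. The sketch below shows that arithmetic; make_intervals is a hypothetical helper, and the chromosome 1 length used in the usage note (249,250,621 bp, as in GRCh37/hg19) is an assumption, since the thread does not say which reference build was used.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print chrom:start-stop windows of a fixed size.
# Arguments: chromosome name, chromosome length in bp, window size in bp.
make_intervals() {
    local chrom="$1" length="$2" size="$3"
    awk -v c="$chrom" -v len="$length" -v size="$size" 'BEGIN {
        for (start = 1; start <= len; start += size) {
            stop = start + size - 1
            if (stop > len) stop = len   # the last window is shorter
            printf "%s:%d-%d\n", c, start, stop
        }
    }'
}
```

Under the assumed length, `make_intervals 1 249250621 2000000` emits 125 windows, whereas `make_intervals 1 249250621 20000000` emits 13, so the piece counts in the post may correspond to a 20,000,000 bp window rather than 2,000,000 bp.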

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @boymin2020
    Hi,

    Does the issue still occur when you run on the entire exome, not just chromosome 22?

    Thanks,
    Sheila

  • boymin2020, New York (Member)

    Hi Sheila,

    Thanks for asking.
    I have already figured it out: the error occurred because all of the ReadPosRankSum and MQRankSum values are 0.
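
A diagnosis like this can be checked directly from the INFO column before choosing "-an" arguments. The sketch below is a hypothetical helper (annotation_stats is not a GATK tool) that summarizes each requested annotation over all records of an uncompressed VCF. It assumes single-valued numeric annotations. VariantRecalibrator may compute its statistics on a subset of sites (for example, those overlapping the training resources), so this whole-file summary is only a rough proxy.

```shell
#!/usr/bin/env bash
# Hypothetical helper: per-annotation count, mean and (population) variance
# taken from the INFO column (column 8) of an uncompressed VCF.
# Usage: annotation_stats FILE.vcf ANNOTATION [ANNOTATION ...]
annotation_stats() {
    local vcf="$1"; shift
    awk -F'\t' -v wanted="$*" '
        BEGIN { n_keys = split(wanted, keys, " ") }
        /^#/ { next }                              # skip header lines
        {
            n_fields = split($8, info, ";")
            for (i = 1; i <= n_fields; i++) {
                eq = index(info[i], "=")
                if (eq == 0) continue              # flag without a value, e.g. DB
                key = substr(info[i], 1, eq - 1)
                val = substr(info[i], eq + 1)
                for (k = 1; k <= n_keys; k++) {
                    if (keys[k] == key) {
                        n[key]++; sum[key] += val; sumsq[key] += val * val
                    }
                }
            }
        }
        END {
            for (k = 1; k <= n_keys; k++) {
                key = keys[k]
                if (n[key] == 0) { printf "%s\tn=0\tmissing\n", key; continue }
                mean = sum[key] / n[key]
                var = sumsq[key] / n[key] - mean * mean
                if (var < 0) var = 0               # guard against rounding error
                printf "%s\tn=%d\tmean=%.4f\tvar=%.4f\n", key, n[key], mean, var
            }
        }
    ' "$vcf"
}
```

For instance, `annotation_stats chr22.vcf QD FS SOR ReadPosRankSum MQRankSum InbreedingCoeff` (chr22.vcf being a placeholder name) prints one line per annotation. A line reporting var=0.0000 or "missing" marks an annotation that is a likely candidate to drop from the "-an" list for that file.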
