Error in running variant recalibration

tanu06, Canada (Member)
edited March 6 in Ask the GATK team

Hello,
I am using GATK variant recalibration. It works fine on SNPs but throws an error on the indel file. The error and my sample file are as follows:
ERROR MESSAGE: Your input file has a malformed header: The FORMAT field was provided but there is no genotype/sample data

Input file:

##contig=<ID=17.5307,length=909>
##contig=<ID=17.5308,length=865>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Q11     Q11D1   Q11D2   Q11D4   Q11D5
1.1     1773    .       TTTTGAAATATTTAGATAA     T       407.08  .       AC=1;AF=0.167;AN=6;BaseQRankSum=-1.076e+00;ClippingRankSum=0.00;DP=104;ExcessHet=3.0103;FS=12.041;MLEAC=1;MLEAF=0.167;MQ=56.82;MQRankSum=1.75;QD=25.44;ReadPosRankSum=1.21;SOR=1.402    GT:AD:DP:GQ:PL  0/0:55,0:55:99:0,108,1620       0:11,0:11:99:0,253      0:13,0:13:99:0,357      0:9,0:9:99:0,204        1:3,13:16:99:450,0
1.1     1792    .       CTTTAAAAGAAAATACTGGACAATTTTTTGATTTGAATTGGTTTTGAAATATGAATATATTGTATAATATGAGATTAAGGTAAATTATTGAAATTCAATATATATGACATTCTTATTCTTTTTTCTGGGTTTTTTGATGATT  C       407.08  .       AC=1;AF=0.167;AN=6;BaseQRankSum=-6.730e-01;ClippingRankSum=1.35;DP=99;ExcessHet=3.0103;FS=12.041;MLEAC=1;MLEAF=0.167;MQ=56.82;MQRankSum=1.35;QD=25.44;ReadPosRankSum=2.56;SOR=1.402     GT:AD:DP:GQ:PL  0/0:50,0:50:55:0,55,1227        0:11,0:11:99:0,253      0:13,0:13:99:0,357      0:9,0:9:99:0,204        1:3,13:16:99:450,0

Please suggest.

Thanks
Tanushree


Answers

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    @tanu06 Did you do any processing on the file that might have messed with the spacing between fields, e.g. replaced tabs with spaces? That could cause parsing issues.

  • dovab (Member)

    Hi @Geraldine_VdAuwera, I work with @tanu06. We did try changing the spacing, but we continue to get the error message.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    I see. Can you clarify how the file was produced? Please list all operations that were involved, no matter how minor.

  • dovab (Member)
    edited March 6

    java -jar /usr/local/src/GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta --variant A_raw_variants.g.vcf.gz --variant B_raw_variants.g.vcf.gz --variant C_raw_variants.g.vcf.gz --variant D_raw_variants.g.vcf.gz --variant E_raw_variants.g.vcf.gz -o A_E.g.vcf

    grep -v INDEL A_E.g.vcf > A_E_indels.vcf

    grep -v '#' A_E_indels.vcf | sort | less --chop-long-lines

    java -jar /usr/local/src/GenomeAnalysisTK.jar -T SelectVariants -R training.fasta -V A_E.g.vcf -selectType INDEL -o A_E_indels.vcf

    java -Xmx4g -jar /usr/local/src/GenomeAnalysisTK.jar -T VariantRecalibrator -R reference.fasta -input A_E_indels.vcf -recalFile A_E_indel.recal -tranchesFile A_E_indel.tranches -resource:Drone,known=true,training=true,truth=true,prior=12.0 INDEL_TRAINING.vcf -an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 -nt 4

    This last step is the one that is failing. I originally had the same error with SelectVariants, but grep helped with that. The same thing doesn't help with VariantRecalibrator, though. Again, when I followed the exact same procedure for SNPs, it worked.

    Thank you,
    Dova

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @dovab
    Hi Dova,

    What happens if you run VariantRecalibrator with --mode SNP and --mode INDEL separately on the output of GenotypeGVCFs without doing any grepping or SelectVariants?

    Thanks,
    Sheila
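
One way to test the tab-versus-space question raised earlier in the thread is to count tab-separated columns directly. The sketch below is not a GATK tool and is not confirmed as the cause of this error. `check_vcf_columns` is a hypothetical helper name, and it assumes an uncompressed VCF with a single `#CHROM` header line. It reports every record whose tab-separated column count differs from the header, which is what happens when a tab between FORMAT and the first sample has been turned into a space.

```shell
# Hypothetical helper (not part of GATK): list VCF records whose number of
# tab-separated columns differs from the #CHROM header line.
# Assumes an uncompressed VCF passed as the first argument.
check_vcf_columns() {
    awk -F'\t' '
        /^##/     { next }                  # skip meta-information lines
        /^#CHROM/ { expected = NF; next }   # remember the header column count
        expected && NF != expected {
            printf "line %d: %d columns, expected %d\n", NR, NF, expected
            bad++
        }
        END {
            if (!expected) {
                print "no #CHROM header line found"
                exit 2
            }
            exit (bad > 0)                  # exit status 1 if any record is off
        }
    ' "$1"
}
```

Running `check_vcf_columns A_E_indels.vcf` prints nothing and returns 0 when every record matches the header. Any printed line points at a record worth inspecting, for example with `sed -n '<line>p' A_E_indels.vcf | cat -A` on GNU systems, where tabs show up as `^I`.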
