We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

excessive alleles in the output vcf

tennis_musictennis_music Member
edited February 2013 in Ask the GATK team

Hello,

I follow the best practice v4 for my exome seq project (<30 sample). I found that there are many excessive alleles in almost every sample. Here is an example for one individual:

chr2 133033833 rs199778267 A C 61.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-1.920;DB;DP=8;Dels=0.00;FS=3.332;HaplotypeScore=4.9989;MLEAC=1;MLEAF=0.500;MQ=70.00;MQ0=0;MQRankSum=0.322;QD=7.72;ReadPosRankSum=-1.221;VQSLOD=13.38;culprit=QD GT:AD:DP:GQ:PL 0/1:5,3:8:90:90,0,178

chr2 133033833 . A AACC 355.75 PASS AC=2;AF=1.00;AN=2;DP=8;FS=0.000;HaplotypeScore=153.8094;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;QD=44.47;RPA=2,3;RU=ACC;STR;VQSLOD=13.38;culprit=QD GT:AD:DP:GQ:PL 1/1:0,8:8:24:393,24,0

We see that the second variant is homozygous and the first variant is heterozygous. But this can't be true since they are on the same location. Either the first is a fake one or the second is heterozygous. Below is the script I used. Is there anything wrong in it?

many thanks

----------------------------- begin of script -----------------------------

FixMateInformation.jar \  
    INPUT=sample.sam.gz \  
    OUTPUT=sample.mate.bam \  
    TMP_DIR=/scratch/local \  
    SO=coordinate \  
    VERBOSITY=ERROR \  
    QUIET=true \  
    CREATE_INDEX=true \  
    VALIDATION_STRINGENCY=SILENT  

MarkDuplicates.jar \  
    INPUT=sample.mate.bam \  
    OUTPUT=sample.mate.dup.bam \  
    M=sample.mate.bam.duplicate \  
    VERBOSITY=ERROR \  
    QUIET=true \  
    REMOVE_DUPLICATES=true \  
    ASSUME_SORTED=true \  
    TMP_DIR=/scratch/local \  
    VALIDATION_STRINGENCY=SILENT  

SortSam.jar \  
    INPUT=sample.mate.dup.bam \  
    OUTPUT=sample.mate.dup.sort.bam \  
    CREATE_INDEX=true \  
    SO=coordinate \  
    COMPRESSION_LEVEL=5 \  
    MAX_RECORDS_IN_RAM=5000000 \  
    VERBOSITY=ERROR \  
    QUIET=true \  
    TMP_DIR=/scratch/local \  
    VALIDATION_STRINGENCY=SILENT  

GenomeAnalysisTK.jar -T IndelRealigner \  
    -R hg19.fasta \  
    -targetIntervals hg19.indel.intervals \  
    -I sample.mate.dup.sort.bam \  
    -o sample.mate.dup.sort.realign.bam  

BuildBamIndex.jar \  
    INPUT=sample.mate.dup.sort.realign.bam \  
    VERBOSITY=ERROR \  
    QUIET=true \  
    VALIDATION_STRINGENCY=SILENT  

GenomeAnalysisTK.jar -T BaseRecalibrator \  
    -R hg19.fasta \  
    -knownSites hg19.dbsnp.vcf \  
    -I sample.mate.dup.sort.realign.bam \  
    -o sample.mate.dup.sort.realign.grp  

GenomeAnalysisTK.jar -T PrintReads \  
    -R hg19.fasta \  
    -I sample.mate.dup.sort.realign.bam \  
    -BQSR sample.mate.dup.sort.realign.grp \  
    -o sample.mate.dup.sort.realign.recal.bam  

SortSam.jar \  
    INPUT=sample.mate.dup.sort.realign.recal.bam \  
    OUTPUT=sample.mate.dup.sort.realign.recal.sort.bam \  
    CREATE_INDEX=true \  
    SO=coordinate \  
    COMPRESSION_LEVEL=5 \  
    MAX_RECORDS_IN_RAM=5000000 \  
    VERBOSITY=ERROR \  
    QUIET=true \  
    TMP_DIR=/scratch/local \  
    VALIDATION_STRINGENCY=SILENT  

GenomeAnalysisTK.jar -T UnifiedGenotyper \  
    -R hg19.fasta \  
    -glm BOTH \  
    -I sample.mate.dup.sort.realign.recal.sort.bam \  
    -o sample.vcf --dbsnp hg19.dbsnp.vcf  

GenomeAnalysisTK.jar -T VariantRecalibrator \  
    -R hg19.fasta \  
    -mode BOTH \  
    -input sample.vcf \  
    -recalFile sample.recal \  
    -tranchesFile sample.tranches \  
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hg19.hapmap_3.3.vcf \  
    -resource:omni,known=false,training=true,truth=false,prior=12.0 hg19.1000G_omni2.5.vcf \  
    -resource:dbsnp,known=true,training=false,truth=false,prior=8.0 hg19.dbsnp.vcf \  
    -an QD -an DP -an ReadPosRankSum -an MQRankSum -an FS -an MQ \  
    --maxGaussians 4  

GenomeAnalysisTK.jar -T ApplyRecalibration \  
    -R hg19.fasta \  
    -mode BOTH \  
    -input sample.vcf \  
    -o sample.tmp.vcf \  
    -recalFile sample.recal \  
    -tranchesFile sample.tranches \  
    --ts_filter_level 99.0  

GenomeAnalysisTK.jar -T SelectVariants \  
    -R hg19.fasta \  
    --variant sample.tmp.vcf \  
    -o sample.recal.vcf \  
    -select "VQSLOD>4.0"  
Post edited by Geraldine_VdAuwera on

Answers

  • jfarrelljfarrell Member ✭✭

    This link has some details here on the whole exome variant recalibration settings for SNPs and INDELS that should help....

    http://gatkforums.broadinstitute.org/discussion/1259/what-vqsr-training-sets-arguments-should-i-use-for-my-specific-project

    Two different models for SNPs and INDELS are now needed for the variant recalibration step. Your pipeline did one. The documentation also indicates more samples may be needed for the recalibration of indels.

    Also why is this variant being called to start with? rs199778267 is intergenic and not in an exon. For exome sequencing, the unified genotyper can be limited to the exome targets with the argument -L target_intervals.bed.

    Also was the unified genotyper being run on all samples at once or one at a time? From the script, it looks like it was one at a time. Better results are obtained by running UC on all 30 samples.

Sign In or Register to comment.