Hi GATK Users,

Happy Thanksgiving!
Our staff will be observing the holiday and will be unavailable from 22nd to 25th November. This will cause a delay in reaching out to you and answering your questions immediately. Rest assured we will get back to it on Monday November 26th. We are grateful for your support and patience.
Have a great holiday everyone!!!

Regards
GATK Staff

Does Base Recalibrator require every contig to contain a known SNP or Indel?

ianwianw Bangor, UKMember

Hi all,

Apologies if this seems like a duplication of previous questions but I have been unable to find an answer to my problem.

In the absence of any existing high quality SNPs upon which to base VQSR and BQSR, I have called and filtered SNPs using Haplotype Caller and am now looking to feed my HQ SNPs into BQSR. When running BaseRecalibrator, however, I receive a message saying that the contigs in my vcf file of known sites are incompatible with the contigs in my reference file. In the vcf file's header section, the contigs are the exact same length and in the exact same order as in the reference fasta file. There are 556 contigs lacking any HQ SNPs, so these contigs' names do not appear in the content of the vcf file. I assume that this is what is causing the program to fail.

I wondered if this was a bug or if it is intentional? I'm not sure how sorting the vcf, as is recommended, will help in a situation like this if not all contigs are represented in the body of the vcf?

Kind regards,

Ian

Best Answers

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @ianw
    Hi Ian,

    Can you tell us the exact commands you ran to get the final VCF of HQ SNPs? If you used the entire reference in your command, all the contigs from the reference should be present in the VCF header.

    -Sheila

  • ianwianw Bangor, UKMember
    edited June 2016

    Hi @Sheila,

    As far as I can tell, every single contig from the reference is present and correct in the header of the VCF. That's why I'm assuming it's because not every one of those contigs actually contains a HQ variant. The commands I've used up until this stage for SNPs are as follows:

    I originally created intervals for the reference genome, with the contigs ordered by size.

    After running GenotypeGCVFs, I ran SelectVariants for each interval of the genome for SNPs and indels (not shown but identical) separately, for example:

     bsub -G cichlid -e select_e_%J -o select_o_%J -R'select[mem>28000] rusage[mem=28000]' -M28000 -n 16 -q normal /nfs/users/nfs_m/mm21/programs/jre1.7.0_71/bin/java -Xmx28000m -jar /software/vertres/bin-external/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T SelectVariants -R /lustre/scratch113/projects/cichlid/Mzebra_UMD1_assembly/UMD1_mzebra_nuclear_and_mtDNA.fa --variant 190516MassokoGWAS.${i}_allSites.vcf.gz -o ${i}_raw_snps.vcf.gz -selectType SNP;
    

    Then, for each interval, I identified parameters by which to identify HQ variants, and filtered accordingly using VariantFiltration:

     bsub -G cichlid -e varfilt_e_%J -o varfilt_o_%J -R'select[mem>28000] rusage[mem=28000]' -M28000 -n 32 -q long /nfs/users/nfs_m/mm21/programs/jre1.7.0_71/bin/java -Xmx28000m -jar /software/vertres/bin-external/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T VariantFiltration -R /lustre/scratch113/projects/cichlid/Mzebra_UMD1_assembly/UMD1_mzebra_nuclear_and_mtDNA.fa --variant ${i}_raw_snps.vcf.gz --filterName StrandBias --filterExpression 'FS > 10.0' --filterName MappingQuality --filterExpression 'MQ < 55.0' --filterName QualByDepth --filterExpression 'QD < 5.0' --filterName MQRankSum --filterExpression 'MQRankSum < 2.5 || MQRankSum >-2.5' --filterName ReadPosRankSum --filterExpression 'ReadPosRankSum > 2.5 || ReadPosRankSum < -2.5' --mask ${i}_raw_indels.vcf.gz --maskName InDel --clusterWindowSize 10 --logging_level ERROR -o ${i}_filtered_snps.vcf.gz;
    
     bsub -G cichlid -e varfilt_e_%J -o varfilt_o_%J -R'select[mem>28000] rusage[mem=28000]' -M28000 -n 32 -q long /nfs/users/nfs_m/mm21/programs/jre1.7.0_71/bin/java -Xmx28000m -jar /software/vertres/bin-external/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T VariantFiltration -R /lustre/scratch113/projects/cichlid/Mzebra_UMD1_assembly/UMD1_mzebra_nuclear_and_mtDNA.fa --variant ${i}_raw_indels.vcf.gz --filterName StrandBias --filterExpression 'FS > 10.0' --filterName MappingQuality --filterExpression 'MQ < 55.0' --filterName QualByDepth --filterExpression 'QD < 5.0' --filterName MQRankSum --filterExpression 'MQRankSum > 2.5 || MQRankSum < -2.5' --filterName ReadPosRankSum --filterExpression 'ReadPosRankSum > 2.5 || ReadPosRankSum < -2.5' --clusterWindowSize 10 --logging_level ERROR -o ${i}_filtered_indels.vcf.gz;
    

    Once for the SNPs and once for the indels (not shown but identical), I then concatenated the intervals before pulling out only those lines that either form the header of the VCF or contain a HQ variant:

     bsub -G cichlid -o concat_o_%J -e concat_e_%J -R'select[mem>8000] rusage[mem=8000]' -M8000 -n 32 -q normal /nfs/users/nfs_m/mm21/programs/jre1.7.0_71/bin/java -XX:+UseSerialGC -Xmx8000m -cp /software/vertres/bin-external/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants -R /lustre/scratch113/projects/cichlid/Mzebra_UMD1_assembly/UMD1_mzebra_nuclear_and_mtDNA.fa -V UMD_1_4_filtered_snps.vcf.gz -V UMD_5_10_filtered_snps.vcf.gz ... -V UMD_551_END_filtered_snps.vcf.gz -out all_filtered_snps.vcf -assumeSorted
    
     cat all_filtered_snps.vcf | grep -E '#|PASS' > all_passed_filtered_snps.vcf
    

    It is the following command that then complains about incompatible contigs:

     bsub -G cichlid -e baserecal_e_%J -o baserecal_o_%J -R'select[mem>16000] rusage[mem=16000]' -M16000 -n 16 -q normal /nfs/users/nfs_m/mm21/programs/jre1.7.0_71/bin/java -Xmx16000m -jar /software/vertres/bin-external/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T BaseRecalibrator -R /lustre/scratch113/projects/cichlid/Mzebra_UMD1_assembly/UMD1_mzebra_nuclear_and_mtDNA.fa -I ${i}_realigned.bam -knownSites all_passed_filtered_snps.vcf -knownSites all_passed_filtered_indels.vcf -o ${i}_recal_data.table;
    

    Any ideas will be very welcome, even if I've just made a silly mistake!! :wink:

    Thanks,

    Ian

  • ianwianw Bangor, UKMember

    Hi @Geraldine_VdAuwera,

    Your first comment is good to know, thanks - that removes one concern :)

    Regarding your second point, yes, some pointers on what I could change may be helpful, please. Unfortunately, I'd split the genome into intervals because of storage issues on the server I'm using. Also, apart from pulling out the desired lines from the VCF, I haven't used any non-GATK commands, although I suspect there was an option I missed to just output variants that 'PASS' the filters..?

    Many thanks,

    Ian

  • ianwianw Bangor, UKMember

    Hi @Sheila,

    I had one vcf per interval until the VariantFiltration step, after which I concatenated them using CatVariants. I used the article you provided when writing out my commands, but I didn't use SelectVariants afterwards to exclude filtered sites. I'm trying a second run of SelectVariants (so that my pipeline goes 'SelectVariants' -> 'VariantFiltration' -> 'SelectVariants') now to see if it was the step I performed outside of GATK that caused the issues. I'll let you know how I get on.

    Thanks,

    Ian

  • ianwianw Bangor, UKMember

    Hi @Sheila and @Geraldine_VdAuwera,

    I passed the VariantFiltration output back to SelectVariants so that it could pull out only those sites that passed the filters. The output of SelectVariants has now been accepted as the input for BaseRecalibration, which is running quite happily as I type. Thanks for the help!

    Ian

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @ianw
    Hi Ian,

    Great news! :smile:

    -Sheila

Sign In or Register to comment.