BaseRecalibrator: Lexicographically sorted human genome sequence detected in knownSites

Hello,

I've tried everything I can think of, but I still get an error when I run:

java -jar /data/GATK/GenomeAnalysisTK.jar -T BaseRecalibrator -R hg19.fasta -I reordered.bam -knownSites hg19.dbsnp.sorted.vcf -o recalibration_report.grp

ERROR MESSAGE: Lexicographically sorted human genome sequence detected in knownSites. Please see https://software.broadinstitute.org/gatk/documentation/article?id=1328 for more information. Error details: knownSites contigs = [chr1, chr10, chr11, chr11_gl000202_random, chr12, chr13, chr14, chr15, chr16, chr17, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18, chr18_gl000207_random, chr19, chr19_gl000208_random, chr19_gl000209_random, chr1_gl000191_random, chr1_gl000192_random, chr2, chr20, chr21, chr21_gl000210_random, chr22, chr3, chr4, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr5, chr6, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7, chr7_gl000195_random, chr8, chr8_gl000196_random, chr8_gl000197_random, chr9, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chrM, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249, chrX, chrY]

ERROR ------------------------------------------------------------------------------------------

I made the bam from a fastq, using ucsc.hg19.fasta as the reference. I made the dictionary file, sorted and indexed the bam, and ran MarkDuplicates and AddOrReplaceReadGroups. Next, I used RealignerTargetCreator followed by IndelRealigner. This all worked without errors.

I downloaded the latest version of dbSNP150
ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/00-All.vcf.gz
and followed these steps to prepare the file:

  1. gunzip 00-All.vcf.gz

  2. awk '/^#/ {print $0}' 00-All.vcf > head.txt

  3. sed -i 's/chrMT/chrM/g' head.txt

  4. awk '/^#/ {next}{print $0}' 00-All.vcf | sed 's/^/chr/' > 1.vcf

  5. sed -i 's/chrMT/chrM/g' 1.vcf

  6. cat head.txt 1.vcf > hg19.dbsnp.vcf

  7. IGVTools/igvtools index hg19.dbsnp.vcf

  8. awk '/^#/ {next}{print $1}' hg19.dbsnp.vcf | sort |uniq
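
The header/body split and the two sed calls in the steps above can probably be collapsed into a single pass. The following is only a sketch, assuming the input is an uncompressed, tab-separated VCF whose mitochondrial contig is named MT; the function name add_chr_prefix is made up for illustration. It rewrites only the CHROM column, so other fields that happen to contain the text chrMT are left alone, and it does not touch any contig lines in the header.

```shell
# Hypothetical helper: prefix the CHROM column with "chr" and rename MT to M.
# Assumes an uncompressed, tab-separated VCF given as the first argument.
add_chr_prefix() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { print; next }     # header lines pass through unchanged
         { $1 = ($1 == "MT") ? "chrM" : "chr" $1; print }' "$1"
}

# Example (file names taken from the steps above):
#   add_chr_prefix 00-All.vcf > hg19.dbsnp.vcf
```

The records keep whatever contig order the input had, so sorting against the reference may still be needed afterwards.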

Next I ran BaseRecalibrator:

java -jar /data/GATK/GenomeAnalysisTK.jar -T BaseRecalibrator -R hg19.fasta -I initial.bam -knownSites hg19.dbsnp.vcf -o recalibration_report.grp

When I got an error message about the contigs not being ordered the same way, I ran Picard ReorderSam on the initial.bam file and SortVcf on the hg19.dbsnp.vcf file.

After that, I ran BaseRecalibrator again:

java -jar /data/GATK/GenomeAnalysisTK.jar -T BaseRecalibrator -R hg19.fasta -I reordered.bam -knownSites hg19.dbsnp.sorted.vcf -o recalibration_report.grp

Lexicographically sorted human genome sequence detected in knownSites.

I'm not sure what the problem is. Could someone please suggest a fix?

Thanks,

Lena

Comments

  • Here are the first few lines from the fasta file:
    grep '>' hg19.fasta | more

    chrM
    chr1
    chr2
    chr3
    chr4
    chr5
    chr6
    chr7
    chr8
    chr9
    chr10
    chr11
    chr12
    chr13
    chr14
    chr15
    chr16
    chr17
    chr18
    chr19
    chr20
    chr21
    chr22
    chrX

    --More--
    and here is the ordered bam file:
    samtools view reordered.bam | cut -f 3 | grep chr | uniq -c | more
    1699 chrM
    1056500 chr1
    77084 chr2
    314511 chr3
    36983 chr4
    62576 chr5
    30587 chr6
    1052909 chr7
    235395 chr8
    161193 chr9
    48468 chr10
    52081 chr11
    200495 chr12
    310 chr13
    25201 chr14
    19100 chr15
    7364 chr16
    492468 chr17
    93 chr18
    580901 chr19
    1665 chr20
    186184 chr21
    159828 chr22
    --More--

    Please advise.

    Thanks,

    Lena

  • and here is the sorted dbsnp vcf:

    contig=<ID=chrM,length=16571>
    contig=<ID=chr1,length=249250621>
    contig=<ID=chr2,length=243199373>
    contig=<ID=chr3,length=198022430>
    contig=<ID=chr4,length=191154276>
    contig=<ID=chr5,length=180915260>
    contig=<ID=chr6,length=171115067>
    contig=<ID=chr7,length=159138663>
    contig=<ID=chr8,length=146364022>
    contig=<ID=chr9,length=141213431>
    contig=<ID=chr10,length=135534747>
    contig=<ID=chr11,length=135006516>
    contig=<ID=chr12,length=133851895>
    contig=<ID=chr13,length=115169878>
    contig=<ID=chr14,length=107349540>
    contig=<ID=chr15,length=102531392>
    contig=<ID=chr16,length=90354753>
    contig=<ID=chr17,length=81195210>
    contig=<ID=chr18,length=78077248>
    contig=<ID=chr19,length=59128983>
    contig=<ID=chr20,length=63025520>
    contig=<ID=chr21,length=48129895>
    contig=<ID=chr22,length=51304566>
    contig=<ID=chrX,length=155270560>
    contig=<ID=chrY,length=59373566>
    contig=<ID=chr1_gl000191_random,length=106433>
    contig=<ID=chr1_gl000192_random,length=547496>
    contig=<ID=chr4_ctg9_hap1,length=590426>
    contig=<ID=chr4_gl000193_random,length=189789>
    contig=<ID=chr4_gl000194_random,length=191469>
    contig=<ID=chr6_apd_hap1,length=4622290>
    contig=<ID=chr6_cox_hap2,length=4795371>
    contig=<ID=chr6_dbb_hap3,length=4610396>
    contig=<ID=chr6_mann_hap4,length=4683263>
    contig=<ID=chr6_mcf_hap5,length=4833398>
    contig=<ID=chr6_qbl_hap6,length=4611984>
    contig=<ID=chr6_ssto_hap7,length=4928567>
    contig=<ID=chr7_gl000195_random,length=182896>
    contig=<ID=chr8_gl000196_random,length=38914>
    contig=<ID=chr8_gl000197_random,length=37175>
    contig=<ID=chr9_gl000198_random,length=90085>
    contig=<ID=chr9_gl000199_random,length=169874>
    contig=<ID=chr9_gl000200_random,length=187035>
    contig=<ID=chr9_gl000201_random,length=36148>
    contig=<ID=chr11_gl000202_random,length=40103>
    contig=<ID=chr17_ctg5_hap1,length=1680828>
    contig=<ID=chr17_gl000203_random,length=37498>
    contig=<ID=chr17_gl000204_random,length=81310>
    contig=<ID=chr17_gl000205_random,length=174588>
    contig=<ID=chr17_gl000206_random,length=41001>
    contig=<ID=chr18_gl000207_random,length=4262>
    contig=<ID=chr19_gl000208_random,length=92689>
    contig=<ID=chr19_gl000209_random,length=159169>
    contig=<ID=chr21_gl000210_random,length=27682>
    contig=<ID=chrUn_gl000211,length=166566>
    contig=<ID=chrUn_gl000212,length=186858>
    contig=<ID=chrUn_gl000213,length=164239>
    contig=<ID=chrUn_gl000214,length=137718>
    contig=<ID=chrUn_gl000215,length=172545>
    contig=<ID=chrUn_gl000216,length=172294>
    contig=<ID=chrUn_gl000217,length=172149>
    contig=<ID=chrUn_gl000218,length=161147>
    contig=<ID=chrUn_gl000219,length=179198>
    contig=<ID=chrUn_gl000220,length=161802>
    contig=<ID=chrUn_gl000221,length=155397>
    contig=<ID=chrUn_gl000222,length=186861>
    contig=<ID=chrUn_gl000223,length=180455>
    contig=<ID=chrUn_gl000224,length=179693>
    contig=<ID=chrUn_gl000225,length=211173>
    contig=<ID=chrUn_gl000226,length=15008>
    contig=<ID=chrUn_gl000227,length=128374>
    contig=<ID=chrUn_gl000228,length=129120>
    contig=<ID=chrUn_gl000229,length=19913>
    contig=<ID=chrUn_gl000230,length=43691>
    contig=<ID=chrUn_gl000231,length=27386>
    contig=<ID=chrUn_gl000232,length=40652>
    contig=<ID=chrUn_gl000233,length=45941>
    contig=<ID=chrUn_gl000234,length=40531>
    contig=<ID=chrUn_gl000235,length=34474>
    contig=<ID=chrUn_gl000236,length=41934>
    contig=<ID=chrUn_gl000237,length=45867>
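
The orderings pasted above could also be compared mechanically rather than by eye. This is a rough sketch, assuming an uncompressed, tab-separated VCF and a reference index in .fai format (for example one written by samtools faidx) whose first column lists the contigs in reference order; the function names are hypothetical. It looks only at the records, not at the header or at any index file sitting next to the VCF.

```shell
# Contigs in order of first appearance among the VCF records.
vcf_contig_order() {
    awk -F '\t' '!/^#/ && !seen[$1]++ { print $1 }' "$1"
}

# Succeeds if the VCF contigs follow the same relative order as the .fai
# file; reports the first unknown or out-of-order contig and fails otherwise.
check_contig_order() {
    vcf=$1; fai=$2
    vcf_contig_order "$vcf" | awk -F '\t' '
        NR == FNR { rank[$1] = FNR; next }   # first input: the .fai file
        !($1 in rank) { print "unknown contig: " $1; bad = 1; exit }
        rank[$1] < last { print "out of order: " $1; bad = 1; exit }
        { last = rank[$1] }
        END { exit bad + 0 }' "$fai" -
}

# Example (file names from this thread; the .fai name is an assumption):
#   check_contig_order hg19.dbsnp.sorted.vcf hg19.fasta.fai && echo "order matches"
```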

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @yelekley
    Hi Lena,

    Can you try deleting the VCF index and running the tool again? There was a bug in SortVcf that may be causing this issue.

    Thanks
    Sheila

    Issue · Github, by Sheila: Issue Number 2174, State closed, Closed By vdauwera

  • Yes, it worked. I wish it was mentioned in the documentation for BaseRecalibrator. Thanks

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    @yelekley The issue didn't have anything to do with BaseRecalibrator; it was SortVcf failing to generate a new index. This has been fixed, so if you use the latest version of Picard to do the sorting, it should work normally now.
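
The workaround suggested earlier in the thread (deleting the VCF index so that it gets rebuilt) can be sketched as follows. This assumes the index sits next to the VCF and is named <vcf>.idx; the helper name drop_vcf_index is hypothetical, and whether a fresh index is then generated depends on the tool that reads the VCF.

```shell
# Hypothetical helper: delete the index next to a VCF so that the tool
# reading the VCF has to build a fresh one. Assumes the index is <vcf>.idx.
drop_vcf_index() {
    idx="$1.idx"
    if [ -e "$idx" ]; then
        rm -- "$idx"
        echo "removed $idx"
    else
        echo "no index found for $1"
    fi
}

# Example (file name from this thread):
#   drop_vcf_index hg19.dbsnp.sorted.vcf
```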

  • TamaraP (Member)

    Hi,
    I was looking at this post to see if it can guide me in the right direction. I generated GVCF files using BAM files. I would like to analyze this individual for homozygous regions in order to decide whether to prioritize homozygous or compound heterozygous variants. I tried using the GVCF I generated with GATK on Homozygosity Mapper but it gives an error message because the chromosomes are not in numerical order. I believe they are in size order as chromosome 19 shows up after 20. Can you suggest a way to sort the GVCF file correctly? And how do you recommend analyzing for runs of homozygosity in this individual?
    Thanks!

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @TamaraP
    Hi,

    You will need to run GenotypeGVCFs on your GVCF to produce a final VCF. We do not recommend using a GVCF in the final analysis.

    For changing the ordering of the VCF, I think you can use SortVcf.

    "And how do you recommend analyzing for runs of homozygosity in this individual?"

    I am not sure what your end goal is. Can you tell us more about what you hope to accomplish?

    Thanks,
    Sheila
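
For illustration only, this is roughly what sorting records by a given contig list involves. It is not a replacement for SortVcf, just a sketch under these assumptions: an uncompressed, tab-separated VCF, and a .fai-style file whose first column lists the contigs in the desired order. The function name sort_vcf_by_fai is hypothetical, and records on contigs missing from the list are dropped.

```shell
# Hypothetical helper: print the VCF header unchanged, then the records
# ordered by contig rank in the .fai-style file and by position within
# each contig. Records on contigs absent from that file are dropped.
sort_vcf_by_fai() {
    vcf=$1; fai=$2
    tab=$(printf '\t')
    grep '^#' "$vcf"
    awk -F '\t' -v OFS='\t' '
        NR == FNR { rank[$1] = FNR; next }   # first input: the contig list
        /^#/ { next }
        ($1 in rank) { print rank[$1], $0 }' "$fai" "$vcf" |
        sort -t "$tab" -k1,1n -k3,3n |       # rank, then POS (now column 3)
        cut -f2-                             # strip the helper rank column
}

# Example (hypothetical file names):
#   sort_vcf_by_fai sample.vcf contig_order.fai > sample.sorted.vcf
```

Any index built for the unsorted file would no longer match the output and would need to be regenerated.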
