We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Joint genotyping vcf files

I am trying to run joint genotyping on some vcf files sent by our collaborators. My steps so far:

  1. The vcf files from our collaborators were created using the reference genome “Homo_sapiens.GRCh38.dna.primary_assembly_sorted.fa” and the GATK HaplotypeCaller (GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.5-0) - NOTE: A 'sorted' version of the reference genome was used.

  2. I downloaded “Homo_sapiens.GRCh38.dna.primary_assembly.fa” from ensembl and created a dictionary “Homo_sapiens.GRCh38.dna.primary_assembly.dict” using picard:

java -jar picard.jar CreateSequenceDictionary R=~/hg38/Homo_sapiens.GRCh38.dna.primary_assembly.fa O=~/hg38/Homo_sapiens.GRCh38.dna.primary_assembly.dict

  1. I try to sort the vcf file using the reference genome dictionary with:

java -jar picard.jar SortVcf I=~/data/S12.vcf O=~/data/S12_sorted.vcf SEQUENCE_DICTIONARY=~/hg38/Homo_sapiens.GRCh38.dna.primary_assembly.dict

I get the following error:

[Fri Jul 01 10:40:21 EDT 2016] picard.vcf.SortVcf INPUT=[/mnt/jk/data/S12.vcf] OUTPUT=/mnt/Jk/data/S12_sorted.vcf SEQUENCE_DICTIONARY=/mnt/jk/hg38/Homo_sapiens.GRCh38.dna.primary_assembly.dict VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_ IN_RAM=500000 CREATE_INDEX=true CREATE_MD5_FILE=false

[Fri Jul 01 10:40:22 EDT 2016] picard.vcf.SortVcf done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=2058354688
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalArgumentException: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=MT,length=16569,dict_index=0,assembly=null) was found when SAMSe quenceRecord(name=1,length=248956422,dict_index=0,assembly=null) was expected.
at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:112)
at picard.vcf.SortVcf.doWork(SortVcf.java:81)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
Caused by: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=MT,length=16569,dict_index=0,assembly=null) was found when SAMSequenceRecord(name=1,length=248956422,dict_index=0,as sembly=null) was expected.
at htsjdk.samtools.SAMSequenceDictionary.assertSameDictionary(SAMSequenceDictionary.java:142)
at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:110)
... 4 more

Are these steps correct? What am I doing wrong?

thanks!

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭

    @bsmith030465
    Hi,

    What version of Picard are you using? If you are not using the latest version, can you try using the latest version?

    Thanks,
    Sheila

  • bsmith030465bsmith030465 caMember

    I downloaded the latest version (picard-2.5.0), remade the dictionary and reran the picard SortVcf tool. I get the same error though...

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    Make sure you delete the old index as it will be out of date now.
  • bsmith030465bsmith030465 caMember

    I deleted the old .dict and .fai files before remaking the dictionary and re-running SortVcf.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    No I mean the idx of your vcf.
  • bsmith030465bsmith030465 caMember

    Hmm....I only have the vcf and bam file for each file (sent by our collaborators). I don't have an idx file for the vcf files that were sent. Is that what you mean?

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭

    @bsmith030465
    Ah, yes maybe that is the issue. You need a VCF index file. GATK creates an index file if you run any tool. You can run ValidateVariants as a quick way to generate the VCF index :smile:

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Oh my bad, I misread and thought you were having difficulty using the VCF after sorting it. That'll teach me to read too quickly.

    Can you confirm you regenerated both the .fai and the .dict of your reference, and post what command lines you used?

  • bsmith030465bsmith030465 caMember

    Sheila, Geraldine,

    Thanks for your responses.

    The vcf files I have were created using omo_sapiens.GRCh38.dna.primary_assembly_sorted.fa as reference sequence (as per vcf file annotation).

    1) If I try to validate the snps using command:

    Validate variants

    {JAVA}/java -jar ${GATK}/GenomeAnalysisTK.jar -T ValidateVariants -R ${GATK}/hg38bundle/Homo_sapiens.GRCh38.dna.primary_assembly.fa -V ${DATA}/SL54.vcf --dbsnp ${GATK}/hg38bundle/dbsnp_144.hg38.vcf

    I get error:

    ERROR MESSAGE: Input files /mnt/jk/data/SL54.vcf and reference have incompatible contigs. Please see https://www.broadinstitute.org/gatk/guide/article?id=63for more information. Error details: The contig order in /mnt/jk/data/SL54.vcf and reference is not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..

    2) I make my dict and fai files for my reference sequence using commands:

    FNAME=Homo_sapiens.GRCh38.dna.primary_assembly
    ${JAVA}/java -jar ${PICARD}/picard.jar CreateSequenceDictionary R=${GATK}/hg38bundle/${FNAME}.fa O=${GATK}/hg38bundle/${FNAME}.dict

    $ samtools faidx ${GATK}/hg38bundle/${FNAME}.fa

    3) I try to sort vcf files based on my reference sequence

    FNAME=Homo_sapiens.GRCh38.dna.primary_assembly
    ${JAVA}/java -jar ${PICARD}/picard.jar SortVcf I=${DATA}/SL54.vcf O=${DATA}SL54_sorted.vcf SEQUENCE_DICTIONARY=${GATK}/hg38bundle/${FNAME}.dict

    and I get the error in my original post.

    Should I be doing something differently?

    thanks for all your help!

  • bsmith030465bsmith030465 caMember

    Also, how can I 'sort' my reference sequence (as was used in the vcf files that I have '..GRCh38.dna.primary_assembly_sorted.fa')

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭

    @bsmith030465
    Hi,

    So, you created the VCF using the exact same reference you ran ValidateVariants with? How did you create the VCF? Did you use GATK tools? If so, can you post the exact command you ran?

    -Sheila

  • bsmith030465bsmith030465 caMember

    oops...this wasn't the answer...but the question is solved (corrupted ref sequence file)..

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭

    @bsmith030465
    Hi,

    Glad to hear it!

    -Sheila

Sign In or Register to comment.