Reference genome and VCF file

Dear Team,

I am trying to analyse a non-human, multi-chromosome genome. I have also selected a VCF file to use as a source of known variant sites. Does the reference genome need the chromosomes to be in the same order as the VCF file? Or does it need to be alphanumerical?
Also, I think I need to modify the headers in the reference file so that they match the names used in the VCF for the same chromosomes. Can you please confirm this? Sorry for the noob questions.

Regards,
Max

Answers

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Hi Max,

    Yes, the reference and VCF should be sorted the same way, in coordinate order. The contig names and ordering in your VCF must exactly match those in the reference you are using.

    If the names of the contigs are not the same, this suggests the data were aligned to different versions of your organism's reference. It may not be enough to just change the contig names; there may be differences in the reference sequences themselves. I would recommend you look up exactly what reference version was used, and if necessary, lift over the known variants or realign your sequence data to the other reference.
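A quick way to see whether the two files agree is to list the contigs of each and compare them. The following is only a minimal sketch, not part of GATK: the function names (`fasta_contigs`, `vcf_contigs`, `compare_contigs`) and the sample data are hypothetical. It assumes the contig name is the first whitespace-delimited token of each FASTA header, and it reads the VCF order from the data records rather than from any `##contig` header lines.

```python
from typing import Dict, Iterable, List


def fasta_contigs(lines: Iterable[str]) -> List[str]:
    """Contig names in order of appearance (first token after '>')."""
    return [
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    ]


def vcf_contigs(lines: Iterable[str]) -> List[str]:
    """Contig names in order of first appearance among the VCF records."""
    seen: Dict[str, None] = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, the column header, blank lines
        chrom = line.split("\t", 1)[0]
        seen.setdefault(chrom, None)  # dicts keep insertion order
    return list(seen)


def compare_contigs(ref: List[str], vcf: List[str]) -> Dict[str, object]:
    """Report name mismatches and whether shared contigs share an order."""
    ref_set, vcf_set = set(ref), set(vcf)
    shared_in_ref_order = [c for c in ref if c in vcf_set]
    shared_in_vcf_order = [c for c in vcf if c in ref_set]
    return {
        "only_in_reference": [c for c in ref if c not in vcf_set],
        "only_in_vcf": [c for c in vcf if c not in ref_set],
        "same_relative_order": shared_in_ref_order == shared_in_vcf_order,
    }


# Made-up example data; with real files, pass open file handles instead.
ref_lines = [">chr1 example", "ACGT", ">chr10", "ACGT", ">chr2", "ACGT"]
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t5\t.\tA\tG\t.\t.\t.",
    "chr2\t7\t.\tC\tT\t.\t.\t.",
    "chr10\t3\t.\tG\tA\t.\t.\t.",
]
report = compare_contigs(fasta_contigs(ref_lines), vcf_contigs(vcf_lines))
print(report)
```

Matching names in this report say nothing about whether the underlying sequences are the same, so it does not replace checking which reference version the VCF was called against.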

  • mstagliamonte, USA (Member)

    Hi, Geraldine,

    Thank you. I contacted the authors of the VCF file to make sure I am using the right reference genome version. What do you mean when you say that the reference and VCF need to be sorted in coordinate order? Within the individual chromosomes, the variants are reported in coordinate order. As long as the chromosomes are in the same order in the two files, does the specific order in which they are listed matter to GATK? In my VCF I have chr1, chr10, chr11, chr12... chr2, chr3 and so on.
    I read a caveat in the FAQ regarding the human genome and chromosome order, but I am not sure it applies to my genome as well.

    Also, I would like to manually add the mitochondrial genome to the reference file. I am not currently interested in analysing it, and it is absent from the VCF file. I would only use it to give the mitochondrial reads somewhere to map, rather than having the mapping algorithm try to align them to one of the chromosomes. Would this cause any problems?

    Thanks again for your help
    Max

  • mstagliamonte, USA (Member)

    Thanks, that's great!
    Looking forward to continuing with my analysis :)

    Regards,
    Max
