extract individual chr from human_g1k_v37.fasta

blueskypyblueskypy Member ✭✭
edited October 2013 in Ask the GATK team

I used the following command:
samtools faidx human_g1k_v37.fasta 22 > chr.22.fasta

However, the file size is different from the size of the chr22 at NCBI:
ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr22.fa.gz

chr22.fa:     52037577
chr.22.fasta: 52159696

which one should I use for variant calling on chr22? Thanks a lot!

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    The NCBI one might be a slightly different version, or there may be extra information left over in the header of the file you generated. Your file is probably fine, but if you're worried, do a file diff to figure out exactly what are the differences.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Although, why are you not just using the full reference and using -L chr22 to restrict your analysis? Are you trying to generate a minimal test dataset? Keep in mind that if you try to use a BAM file with data for other contigs with this short reference, you will get a sequence dictionary mismatch error.

  • blueskypyblueskypy Member ✭✭
    edited October 2013

    wc -m gives the same difference, i.e. ~ 120K. But here might be the answer:
    For each file, the (#.of.line - 2) * line.length + length.of.the.last.line gives same result, 51304566

    @Geraldine_VdAuwera said:
    The NCBI one might be a slightly different version, or there may be extra information left over in the header of the file you generated. Your file is probably fine, but if you're worried, do a file diff to figure out exactly what are the differences.

  • blueskypyblueskypy Member ✭✭
    edited October 2013

    Thanks Geraldine! Maybe that's the way I should go. I just did not want to add -L chr22 to every step of my pipeline; I though I could just change the refGenome.

    @Geraldine_VdAuwera said:
    Although, why are you not just using the full reference and using -L chr22 to restrict your analysis? Are you trying to generate a minimal test dataset? Keep in mind that if you try to use a BAM file with data for other contigs with this short reference, you will get a sequence dictionary mismatch error.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    I see -- the problem is that by changing the ref genome, you're sort of painting yourself into a corner for later. It's fine if you're sure that you'll only be using resources that are limited to chr22, but that means also modifying the dbsnp files and everything else. Simpler to just use -L, in my opinion.

  • blueskypyblueskypy Member ✭✭
    edited October 2013

    Do I have to change the dbsnp file if the ref genome is chr22? From bwa to HaplotypeCaller, I used dbsnp_137.b37.vcf in BaseRecalibrator, ReduceReads, and HaplotypeCaller. There is no complaint. But I did get error in VQSR as posted in another thread.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    It depends on how each tool accesses data. Generally it's just simpler to not chop things up...

Sign In or Register to comment.