Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

What is correct way to generate the input vcf and vcf.idx file for GATK4

cbaocbao Member, Broadie ✭✭

Hi,

We want to call SNVs by M2 for our WGS data. But, we do not have gnomad.vcf for WGS. So, we downloaded vcf files for each chr from gnomad website. Could you please let me know the "correct" way to generate both vcf and index file for GATK4 pipeline. Any suggestions are welcome! Thank you so much!

Does bcftools and tabix work?

Best,
Chunyang

Tagged:

Best Answers

  • cbaocbao ✭✭
    Accepted Answer

    I fixed the issue by changing header of my vcf. My pipeline is running without issue!

  • cbaocbao ✭✭
    Accepted Answer

    Sure, @Sheila,

    My VCF header is not in a "correct" format because there is no length field in the line start with "##contig". So, I just use awk to add contig lenght to my VCF header. Moreover, I think it also works if I use "-R ref.fasta" in a GATK cmd. Thanks a lot!

    contig_length=cat Homo_sapiens_assembly19.fasta.fai | \
    grep ^[0-9X] | awk '{printf("##contig=<ID=%s,length=%d>\n",$1,$2);}'`

    cat my.vcf | \
    grep '#' | grep -v '^##contig=' | \
    awk -v c="${contig_length}" '/^#CHROM/ {printf c"\n"} {print}' > new.vcf
    `

    More detials: https://www.biostars.org/p/198660/

    Thanks,
    Chunyang

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @cbao
    Hi Chunyang,

    Have a look at this thread. There is a tool called IndexFeatureFile in GATK4 as well.

    -Sheila

  • cbaocbao Member, Broadie ✭✭

    Thanks, @Sheila.

    I tried to use this tool. But, I got the following ERROR when I used SelectVariants. Do you have any suggestions? Thanks a lot!

    ERROR:
    htsjdk.tribble.TribbleException: Contig 1 does not have a length field.

    My cmd:
    gatk --java-options "-Xmx${command_mem}g" IndexFeatureFile -F ${input_vcf} gatk --java-options "-Xmx${command_mem}g" SelectVariants -V ${input_vcf} -O ${output_name}.vcf.gz

    Best,
    Chunyang

  • cbaocbao Member, Broadie ✭✭

    Hi @Sheila,

    I guess my vcf header is not in a correct format. Like:

    contig=<ID=1>

    contig=<ID=2>

    contig=<ID=3>

    contig=<ID=4>

    contig=<ID=5>

    contig=<ID=6>

    contig=<ID=7>

    contig=<ID=8>

    Do you have a sense how to fix it? Thanks!

    Chunyang

  • cbaocbao Member, Broadie ✭✭
    Accepted Answer

    I fixed the issue by changing header of my vcf. My pipeline is running without issue!

  • cbaocbao Member, Broadie ✭✭
    Accepted Answer

    Sure, @Sheila,

    My VCF header is not in a "correct" format because there is no length field in the line start with "##contig". So, I just use awk to add contig lenght to my VCF header. Moreover, I think it also works if I use "-R ref.fasta" in a GATK cmd. Thanks a lot!

    contig_length=cat Homo_sapiens_assembly19.fasta.fai | \
    grep ^[0-9X] | awk '{printf("##contig=<ID=%s,length=%d>\n",$1,$2);}'`

    cat my.vcf | \
    grep '#' | grep -v '^##contig=' | \
    awk -v c="${contig_length}" '/^#CHROM/ {printf c"\n"} {print}' > new.vcf
    `

    More detials: https://www.biostars.org/p/198660/

    Thanks,
    Chunyang

Sign In or Register to comment.