Attention:
The frontline support team will be unavailable to answer questions on April 15th and 17th 2019. We will be back soon after. Thank you for your patience and we apologize for any inconvenience!

What is correct way to generate the input vcf and vcf.idx file for GATK4

cbaocbao Member, Broadie ✭✭

Hi,

We want to call SNVs by M2 for our WGS data. But, we do not have gnomad.vcf for WGS. So, we downloaded vcf files for each chr from gnomad website. Could you please let me know the "correct" way to generate both vcf and index file for GATK4 pipeline. Any suggestions are welcome! Thank you so much!

Does bcftools and tabix work?

Best,
Chunyang

Tagged:

Best Answers

  • cbaocbao ✭✭
    Accepted Answer

    I fixed the issue by changing header of my vcf. My pipeline is running without issue!

  • cbaocbao ✭✭
    Accepted Answer

    Sure, @Sheila,

    My VCF header is not in a "correct" format because there is no length field in the line start with "##contig". So, I just use awk to add contig lenght to my VCF header. Moreover, I think it also works if I use "-R ref.fasta" in a GATK cmd. Thanks a lot!

    contig_length=cat Homo_sapiens_assembly19.fasta.fai | \
    grep ^[0-9X] | awk '{printf("##contig=<ID=%s,length=%d>\n",$1,$2);}'`

    cat my.vcf | \
    grep '#' | grep -v '^##contig=' | \
    awk -v c="${contig_length}" '/^#CHROM/ {printf c"\n"} {print}' > new.vcf
    `

    More detials: https://www.biostars.org/p/198660/

    Thanks,
    Chunyang

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @cbao
    Hi Chunyang,

    Have a look at this thread. There is a tool called IndexFeatureFile in GATK4 as well.

    -Sheila

  • cbaocbao Member, Broadie ✭✭

    Thanks, @Sheila.

    I tried to use this tool. But, I got the following ERROR when I used SelectVariants. Do you have any suggestions? Thanks a lot!

    ERROR:
    htsjdk.tribble.TribbleException: Contig 1 does not have a length field.

    My cmd:
    gatk --java-options "-Xmx${command_mem}g" IndexFeatureFile -F ${input_vcf} gatk --java-options "-Xmx${command_mem}g" SelectVariants -V ${input_vcf} -O ${output_name}.vcf.gz

    Best,
    Chunyang

  • cbaocbao Member, Broadie ✭✭

    Hi @Sheila,

    I guess my vcf header is not in a correct format. Like:

    contig=<ID=1>

    contig=<ID=2>

    contig=<ID=3>

    contig=<ID=4>

    contig=<ID=5>

    contig=<ID=6>

    contig=<ID=7>

    contig=<ID=8>

    Do you have a sense how to fix it? Thanks!

    Chunyang

  • cbaocbao Member, Broadie ✭✭
    Accepted Answer

    I fixed the issue by changing header of my vcf. My pipeline is running without issue!

  • cbaocbao Member, Broadie ✭✭
    Accepted Answer

    Sure, @Sheila,

    My VCF header is not in a "correct" format because there is no length field in the line start with "##contig". So, I just use awk to add contig lenght to my VCF header. Moreover, I think it also works if I use "-R ref.fasta" in a GATK cmd. Thanks a lot!

    contig_length=cat Homo_sapiens_assembly19.fasta.fai | \
    grep ^[0-9X] | awk '{printf("##contig=<ID=%s,length=%d>\n",$1,$2);}'`

    cat my.vcf | \
    grep '#' | grep -v '^##contig=' | \
    awk -v c="${contig_length}" '/^#CHROM/ {printf c"\n"} {print}' > new.vcf
    `

    More detials: https://www.biostars.org/p/198660/

    Thanks,
    Chunyang

Sign In or Register to comment.